The world is run by [people] who form [entities] that perform [actions] based on [decisions] which affect other [people] in various [places].
In essence, when I’m taking in a piece of news, I’m trying to fill in these brackets. My goal is to crystallize a picture of the relationships among different people and distill a sense of the motivations behind the decisions that affect the world we live in: past, present, and future.
Right now I use my brain to filter out this type of semantic metadata, but I can only remember so many names and associations. I’d much rather have a computer separate the who-what-where and other relational chunks from a news story, then organize them in a neat way so I can easily recall who I’ve been reading about and what things they’ve done.
The reason for doing this is to make it easy to publish meaningful discoveries to investigative encyclopedia hubs like Sourcewatch, Little Sis, Crocodyl, and Muckety (and of course Wikipedia), and thus bring the world closer to knowledge Nirvana.
(By the way, any good ideas out there on how to reconcile the data from all these great research hubs into one place to avoid redundancy?)
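One modest starting point for that reconciliation question is matching records by a normalized name key, so that the same person listed on two hubs collapses into one entry. The sketch below assumes each hub can hand over simple (hub, name) pairs; the sample records and the function names `normalize_name` and `reconcile` are made up for illustration, and real matching would need far more care (nicknames, misspellings, two different people with the same name).

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Reduce a name to a crude matching key: lowercase, no accents,
    no punctuation, and no single-letter initials."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = [t for t in text.split() if len(t) > 1]  # drop initials like "L"
    return " ".join(tokens)


def reconcile(records):
    """Group (hub, name) pairs that appear to describe the same person.

    Returns {normalized name: sorted list of hubs that mention it}.
    """
    merged = defaultdict(set)
    for hub, name in records:
        merged[normalize_name(name)].add(hub)
    return {key: sorted(hubs) for key, hubs in merged.items()}


# Hypothetical sample records, not real exports from these sites.
sample = [
    ("Sourcewatch", "James L. Connaughton"),
    ("Little Sis", "James Connaughton"),
    ("Sourcewatch", "John D. Graham"),
]
print(reconcile(sample))
```

Anything that lands under the same key is only a candidate match; a human would still need to confirm it before merging entries.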
Developers have been thinking a lot about “natural language processing” and how to go beyond syntax and analyze semantic relationships. There are scads of projects underway. However, the challenges run similar to other noble efforts on the web, where we find overlapping projects that don’t play nice with each other due to individual political interests that result in frustration for the average user.
So sticking to the #JCarn topic, my plea to developers is this:
Help me become a supercharged research wizard that can pull people / places / actions / etc from any article on the web and integrate them into my “personal encyclopedia” … and do it in a way that enables sharing and collaboration with fellow knowledge junkies.
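To make the “personal encyclopedia” idea a bit more concrete, here is a minimal sketch of the bracket-filling as a data structure. Everything in it (the `Fact` fields, the `PersonalEncyclopedia` class, its method names) is a hypothetical illustration rather than an existing tool; it only shows how who-what-where chunks could be stored and recalled by person.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """One filled-in set of brackets pulled from a news story."""
    person: str
    entity: str
    action: str
    source: str                  # where the fact was read
    place: Optional[str] = None  # not every story names a place


class PersonalEncyclopedia:
    """Keeps facts indexed by person so they can be recalled later."""

    def __init__(self):
        self._by_person = defaultdict(list)

    def add(self, fact):
        self._by_person[fact.person].append(fact)

    def recall(self, person):
        """Return everything recorded about one person (empty list if none)."""
        return list(self._by_person.get(person, []))

    def people(self):
        """Return the names of everyone read about so far, alphabetically."""
        return sorted(self._by_person)


encyclopedia = PersonalEncyclopedia()
encyclopedia.add(Fact(
    person="James L. Connaughton",
    entity="Constellation Energy",
    action="lobbies for",
    source="Sourcewatch",
))
for fact in encyclopedia.recall("James L. Connaughton"):
    print(fact.person, fact.action, fact.entity, f"(via {fact.source})")
```

Sharing and collaboration would then come down to agreeing on a common shape for these records, which is the hard part the rest of this post circles around.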
Thankfully, some researchers at The University of Queensland in Australia did a lot of legwork analyzing the semantic application landscape. The invaluable report (pdf) they published in Feb 2011 is relatively painless and a great place to get started.
In there you’ll find a breakdown of literally dozens of services, APIs, etc. which are evaluated on metrics like Open Source, Open Standards, Interoperability, Scalability, Usability, etc.
So, here’s my proposed workflow:
- Patrol the web and grab the good stuff using Zotero — a nifty Firefox plugin from George Mason University. It’s free, open source, and not compromised by commercial interests. It pulls stuff right off the web and stores a local copy that is taggable and searchable.
- Collaborate, annotate, and share libraries with other Zotero researchers who are also passionate about digging up answers on who’s really running this crazy world. (They already have thousands of groups working together on various research projects.)
- Analyze the source documents through services like OpenCalais, Zemanta, Alchemy API, OpenAmplify, Meaningtool, etc. that pull out names of people and organizations and things they’ve done that the public should know about.
- Publish our findings to encyclopedia hubs like Sourcewatch, Little Sis, Crocodyl, Muckety, Wikipedia, and others.
- Generate narratives and graphical representations for other journalists and the general public to pick up on.
For example, there’s a very useful article on Sourcewatch that lists people who walk through the “Government-Industry Revolving Door,” e.g. folks like James L. Connaughton, who worked as a lobbyist helping big polluters like General Electric and ARCO avoid responsibility for cleaning up toxic Superfund sites. He then headed up pollution-policy development in the Bush administration, where he fought to weaken standards for getting arsenic out of drinking water, stalled efforts to move forward on global warming, and pressured the Environmental Protection Agency to soften its language on the asbestos in the air after 9/11 that poisoned rescue workers. He has since left his post as the wolf guarding the public henhouse and now lobbies for Constellation Energy.
Information on folks like Mr. Connaughton, John D. Graham, J. Steven Griles, etc. is first dug up by investigators like Jim Hightower (who publish things like this article in Utne Reader about government conflicts of interest) and then has to be manually processed by people who gradually code it into encyclopedias like Sourcewatch.
How can we pull from thousands of investigative articles and streamline the contribution process to these encyclopedias? Furthermore, once they’re organized nicely in the encyclopedias, how can we pull out awesome visualizations like Muckety that assemble the big BIG picture interactively so we can grasp it?
I’ve pounded my head figuring out how to do this in a manageable fashion, and am still coming up a bit short.
I see that I can export my Zotero library as an RDF file (the preferred format for semantic apps, as far as I know), so the next step is to figure out how to run all the documents through the APIs mentioned above and pull together a map of names, organizations, and the activities they’ve been involved with (especially those of corruption and skullduggery). Assuming the RDF is compatible, I’d then have to figure out how to feed the factoids into the encyclopedias and avoid errors.
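As a rough sketch of that next step, the Python below reads a small RDF/XML document, pulls out each item’s title and description, and builds a map of people to the organizations mentioned alongside them. Several things here are assumptions: the sample RDF is a simplified stand-in using Dublin Core `dc:title` and `dc:description` (a real Zotero export has more structure and may differ), and `extract_entities` is a toy lookup against hand-written lists standing in for a call to a service like OpenCalais or Alchemy API, whose actual interfaces are not shown.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Simplified, hand-written sample; not actual Zotero output.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/article-1">
    <dc:title>Revolving door, part one</dc:title>
    <dc:description>James L. Connaughton lobbied for General Electric and ARCO.</dc:description>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/article-2">
    <dc:title>Revolving door, part two</dc:title>
    <dc:description>James L. Connaughton now lobbies for Constellation Energy.</dc:description>
  </rdf:Description>
</rdf:RDF>
"""

# Toy gazetteer standing in for a real entity-extraction service.
KNOWN_PEOPLE = ["James L. Connaughton"]
KNOWN_ORGS = ["General Electric", "ARCO", "Constellation Energy"]


def load_items(rdf_text):
    """Return a list of (title, description) pairs from the RDF/XML text."""
    root = ET.fromstring(rdf_text)
    items = []
    for node in root.findall(f"{{{RDF_NS}}}Description"):
        title = node.findtext(f"{{{DC_NS}}}title", default="")
        body = node.findtext(f"{{{DC_NS}}}description", default="")
        items.append((title, body))
    return items


def extract_entities(text):
    """Placeholder extractor: plain substring lookup against the lists above."""
    people = [p for p in KNOWN_PEOPLE if p in text]
    orgs = [o for o in KNOWN_ORGS if o in text]
    return people, orgs


def build_map(items):
    """Map each person to the organizations mentioned in the same item."""
    links = defaultdict(set)
    for title, body in items:
        people, orgs = extract_entities(f"{title} {body}")
        for person in people:
            links[person].update(orgs)
    return {person: sorted(orgs) for person, orgs in links.items()}


print(build_map(load_items(SAMPLE_RDF)))
```

Co-mention in the same article is a weak signal, of course; it says two names appeared together, not what the relationship is, so anything fed into an encyclopedia would still need a human check.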
Other questions and challenges:
- Is Zotero the right tool?
- One developer noted that Zotero is a walled garden due to the API not being accessible by applications other than Zotero.
That article was written in 2010. Is that still the case? Will this be resolved, and if not, does that stop this effort dead in its tracks?
Update: Adam Smith left a comment below stating that this post is untrue and that the Zotero API and code are AGPL licensed. Christopher Warner then chimed in with this discussion thread to defend his position that the API is still insufficient. See the comments below for the full discussion, including a word from Zotero project manager Sean Takats.
- The Criminal Intent Project used Zotero as part of their semantic analysis of the 127 million words of the Old Bailey Trials. Further exploration is required to see if their workflow is transferable to the type of endeavor I’m proposing.
- Also, there are other collaborative research tools and repositories like Diigo, Mendeley, Academia.edu, etc. Would these serve better? And if not, can they integrate with Zotero group collections?
- Politics.
- OpenCalais has one of the leading semantic APIs out there (e.g. their engine finds relationships in the ever-awesome DocumentCloud library). However, they are owned by Thomson Reuters, which sued Zotero’s makers at George Mason University over claims that they stole intellectual property from its non-open-source EndNote product. Zotero is ultimately a better product and better for the research community because it’s open source, and although the lawsuit was dropped, I’m not sure how warm Thomson Reuters would be to having a fully integrated semantic solution with researchers who use Zotero.
- There’s plenty of politics surrounding the notion of making the semantic web truly open. I can’t go into more detail other than to point out that there are many commercial enterprises trying to be leaders in this space, which may or may not corrupt the integrity of knowledge for everyone.
- Social Media
- This workflow doesn’t analyze breaking news in Twitter / Facebook / etc. which is where people publish first. Sadly, not everybody blogs. However, there is a neat study from the University of Baltimore that shows how to pull and analyze names, organizations, and locations from tweets using Crowdflower and Amazon Mechanical Turk.
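When several crowd workers label the same word in a tweet, their answers have to be combined somehow. The sketch below uses a simple majority vote. This is an assumption for illustration only: the post doesn’t say how the study combined answers, and the tags (`PER`, `ORG`, `LOC`) and function names are hypothetical.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the most common label if more than min_agreement of the
    workers chose it; otherwise return None (no consensus)."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


def aggregate(annotations):
    """Turn {token: [label from each worker]} into {token: agreed label or None}."""
    return {token: majority_label(votes) for token, votes in annotations.items()}


# Made-up worker answers for two tokens from an imaginary tweet.
crowd_answers = {
    "Baltimore": ["LOC", "LOC", "ORG"],
    "Zotero": ["ORG", "PER"],
}
print(aggregate(crowd_answers))
```

Tokens that come back as `None` are the ones worth sending to another round of workers or to a human editor.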
FURTHER RESOURCES
This challenge is not going away; the prospect of connecting knowledge is just too delicious to ignore. Here are some resources to stay involved.
- The Meta Meta project on Google Groups, started by 2012 Knight-Mozilla fellow Dan Schultz at the Berlin Hackathon.
- This is where I got the uber-useful report from the University of Queensland.
- Group activity may be a bit quiet since Dan is now spending a year with the Boston Globe courtesy of the Knight-Mozilla partnership to build Truth Goggles with FactCheck.org. However, there are others in the group that I’m sure would be eager to chew on brilliant suggestions.
- SemanticWeb.org — a clearinghouse about this stuff
- Open Annotation Collaboration — A collaboration between University of Illinois at Urbana-Champaign, Los Alamos National Laboratory, University of Maryland, George Mason University and the University of Queensland that aims to develop a common annotation model to support interoperability across clients, servers, disciplines and platforms.
- Linked Data: the set of standards and best practices spearheaded by the inventor of the World Wide Web himself, Tim Berners-Lee.
***BONUS hot tip*** if you’re on WordPress you can use the Simple Tag plugin, or Tagaroo, which accesses semantic APIs to scan your post and suggest tags for you. Very convenient!
This post was written as part of the December 2011 Carnival of Journalism, hosted this month by Martin Belam of the Guardian Developer Blog.