There has, obviously, been a lot of coverage of last night and this morning’s Wikileaks revelations. 250,000 classified documents is a big haul, and whatever the content of the cables themselves a leak that size is always going to make headlines.
The effort journalists at the Guardian, New York Times, and Der Spiegel put into sifting through and analysing the data also deserves a mention – but however good the resulting stories are, does this release mark a step backwards for data journalism?
One of the best things about data journalism in the last few years is how easy it has become to open that data up to other people. It lets us regular punters look through the data and decide what we think is important, create our own visualisations, and maybe even pick out something the pros didn’t spot.
That’s why people find things like the Guardian Datablog so useful. We still rely on journalists to provide us with context and do much of the heavy lifting, but having the data there means they’re no longer the gatekeepers they used to be.
This works very well with smaller datasets, as the Datablog shows every day. But with something as vast as the Wikileaks dump it becomes impossible – it took the media organisations two months to wade through it, after all, so it’s pretty unlikely any individual will be able to do much with it.
In itself that’s not unprecedented. The MPs’ expenses crowdsourcing project undertaken by the Guardian is looking a bit forlorn, with under half the documents reviewed 18 months on. But with that you could pick your individual MP and rifle through their receipts, which is probably all most of us wanted to do anyway, rather than search for all electricity bills submitted or something.
The cables have the disadvantage of being text-based and not directly related to specific incidents, unlike much of the war logs. So they lend themselves much less to any sort of statistical analysis beyond very, very dull stuff about how many cables were sent from one place to another.
But where a text-based data dump does have a definite advantage is in the scope for searching, and the Guardian’s not done a great job of that either. The interface looks nice, but the ‘search keyword’ function is very disappointing – it’s not a search box, just a drop-down menu of people, places, and subjects you might be interested in. Want something else? You’re on your own.
In terms of the analysis, then, this is quite possibly a new high point for data journalism. But as for opening that data up to the public, there’s still a way to go.