Tip: How to use Google Docs as a scraper tool

While looking to scrape some data from Wikipedia recently, I was inspired to write a blog post on how easy the process is when using ready-made tools in Google Docs.

Wikipedia may not be the most impenetrable of fortresses for information, but it is the home of many useful stats – such as the list of top points scorers in the Magners League, which I was looking for.

What’s slightly more advantageous for data journalists is that statistics on Wikipedia are set out in a uniform manner, so they are normally relatively easy to get at.

What many people don’t realise is that Google spreadsheets can quite easily be used as a data scraper. By using the Google spreadsheet function =ImportHTML("url", "table", N) you can scrape a table from an HTML web page into a Google Doc and have it clean and ready to be used for any data journalism you wish.

So for example during my data scraping adventure I went to the wikipedia page for The 2010-11 Magners League Season and found the list of eight top point scorers towards the bottom.

I then went to my Google Docs account, opened up a new spreadsheet, and in the first cell entered =ImportHTML (by the time you get to =Impor… the autocomplete should have offered you the correct formula).

To get the table I wanted I entered =ImportHTML("http://en.wikipedia.org/wiki/2010-11_Magners_League", "table", 146). Within the brackets, the URL of the target web page and the word "table" need to be in double quotes or it won't work, so be doubly sure to check them before you proceed; if any errors occur, always check there first.

As with this example, getting the table you really want may take some playing around, so be patient. If there are only a few tables on the Wikipedia page you should be able to figure it out: the first table will be =ImportHTML("WikipediaPage", "table", 1), the second will be =ImportHTML("WikipediaPage", "table", 2), and so on. As my example shows, however, Wikipedia seems to consider most non-standardised text to be a table. So if the table you want is below a lot of other information, be prepared to make educated guesses until you hit the right number.
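The same table-by-index idea can be sketched outside Google Docs too. Here is a minimal, hypothetical Python version using only the standard library (it ignores complications like nested tables, and the player rows are made-up sample data, not the real Magners League figures): it collects every table on a page so you can index into them the way ImportHTML's N argument does.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows, so tables
    can be picked by index, like ImportHTML's third argument."""
    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>; each is a list of rows
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

# Illustrative page with two tables (sample data, not real scores)
html = """
<table><tr><th>Player</th><th>Points</th></tr>
       <tr><td>A. Flyhalf</td><td>252</td></tr></table>
<table><tr><td>Some other table</td></tr></table>
"""
parser = TableExtractor()
parser.feed(html)
print(len(parser.tables))    # 2
print(parser.tables[0][1])   # ['A. Flyhalf', '252']
```

Trying the wrong index here just gives you the wrong list, which mirrors the guess-the-number game in the spreadsheet.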

As if by magic the table appears and you have clean data readily available to use for whatever you please. Google's sharing options also let you publish the data as a web page, a PDF or a CSV if you wish to re-filter the info through some other data tools we've discussed (such as BatchGeo). A quick trip to Many Eyes meant I was able to produce this little visualisation:

Behold! The Magners League top point scorers in all their glory


…and one pretty pathetic data visualisation win

Following my complete failure to produce the map of UKUncut actions I had spent hours working towards, I have decided to temporarily soothe my bitterness with something a lot simpler.

Below is a simple graph of UKUncut actions over time. A couple of pivot tables, a lot of data entry, and about three minutes on Many Eyes produced this:

However, even though this is a pretty paltry contribution considering my grand ambitions, it does actually have some use. The graph shows that the highest number of actions on any one day (this of course does not measure the number of people who turned out, only the individual actions planned) occurred on December 18. This is right towards the end of the most active period of student protest this country has seen for decades, and just days before the violent scenes in Parliament Square as the tuition fees bill was passed.
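The pivot-table step behind this finding can be sketched in a few lines of Python. This is a minimal, hypothetical version: the dates below are made-up sample entries standing in for the real list of roughly 350 actions.

```python
from collections import Counter

# Hypothetical sample of action dates (the real spreadsheet had ~350 rows)
actions = [
    "2010-12-04",
    "2010-12-18", "2010-12-18", "2010-12-18",
    "2011-01-30",
]

# Count actions per day, then pull out the busiest day
per_day = Counter(actions)
busiest_day, count = per_day.most_common(1)[0]
print(busiest_day, count)   # 2010-12-18 3
```

The same tally, plotted over time, is exactly what the graph shows.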

This snippet of information may not be surprising, but it is something I would simply never have noticed if I hadn't bothered putting it in a graph. Even something as simple as a basic graph can turn mind-boggling data into another piece of information that might just end up being useful to a journalist.

I think I’ll have another go at that map….


One ambitious data visualisation fail…

I thought it might be worthwhile to actually make an interesting visualisation of a current news story.

So first I picked my data. UKuncut have obviously been in the news a lot, so I thought it would be interesting to take a look at how frequently UKuncut protests were held in different parts of the country.

Taking a look at the UKUncut website I found that it lists all its previous actions. Excellent, this should be easy.

Not so. The format they were posted in made it pretty much impossible to cut and paste the data in any way within my skills, so I began the laborious task of entering them all (almost 350 of them) into an Excel spreadsheet.

Two hours later I realised I had made a stupid mistake. Though I had entered a location for each action, no computer program would be able to just read "Aberdeen" and translate that into a marker on a map. So I went back and added approximate postcodes for each and every action location. This took even longer, as I had to look each location up on Google and guess an approximate postcode. So another three hours later, I was ready to have a go at using my data tool of choice: Many Eyes.
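The underlying problem is a lookup: turning a place name into coordinates. A minimal, hypothetical sketch of the idea, instead of hand-entering postcodes (the place names, coordinates and the locate helper below are illustrative assumptions, not part of my actual workflow; a real gazetteer would come from a geocoding service or an open dataset):

```python
# Tiny hand-built gazetteer: place name -> (latitude, longitude).
# Entries here are approximate and purely illustrative.
GAZETTEER = {
    "Aberdeen": (57.1497, -2.0943),
    "Brighton": (50.8225, -0.1372),
}

def locate(place):
    """Return (lat, lon) for a place name, or None if it's unknown."""
    return GAZETTEER.get(place.strip().title())

print(locate("aberdeen"))   # (57.1497, -2.0943)
print(locate("Narnia"))     # None
```

With a complete lookup table, each of the 350 rows could be translated to a map marker automatically rather than guessed one postcode at a time.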

This is a lovely, simple piece of software for creating a range of different visualisations of data, some simple, some not so. I went straight for a map of the UK. After a fair amount of fiddling around with formats I was ready to go: two columns, one a list of postcodes, the other showing how many times an action had taken place there. Just put it into the software and…


It had failed to recognise nearly every postcode, despite my best efforts to put them into the right format.

I will crack this over the next few days, but it is just another example of how data journalism can be frustrating and time-consuming.


How to visualise (data about) an orgasm

Whilst I know I may well be in danger of sounding like a high school maths teacher, constantly trying to make my subject sound fun when to so many it patently isn't, I thought I might as well give it one last go.

In my day-to-day trawling of blogs (honestly, it was blogs, nothing top shelf) I came across this rather informative visualisation of statistical and scientific information about the human orgasm.

While it may not help us find any trends or patterns in the data that can help us uncover new stories, it is very well put together and on a subject dear to all of our hearts.


The wonders of the human orgasm


GCSE and NVQ Data: have a go yourself

With deadlines for most City University students looming, it seems only fitting that today the Department for Education (DfE) released the figures for last year's end of Key Stage 4 exams, i.e. GCSEs and NVQs.

The DfE have rather helpfully highlighted some of the most crucial data trends. The data set itself is rather large, but it has been broken down into pretty simple categories and is easily manageable.

At a time when exams are at the forefront of most students' minds, it seems the perfect data set for trying out all the data skills covered in our blog.

Play around with the information and let us at the Data Day know if you find anything interesting.


Data journalism – not as sexy as Julian Assange's hair

Wikileaks made data journalism seem sexy. Not only was the content of the leaked documents exciting – who wouldn't want to write about war and diplomacy? – but the manner in which it was obtained made it all seem very 007. Clandestine meetings in cafes, encrypted USB sticks with keys scribbled on hotel napkins, even car chases (no matter how imaginary).

But after a little time trying to do data journalism you rapidly realise that it is far more likely to revolve around large spreadsheets full of baffling data about prescriptions, drink prices or the weekly pay packet of care home workers.

Even PDFs about duck house claims are more exciting than the norm. So what does data journalism really do for the working journalist, other than bore him to death?

Well for a start, it gives you a get out of jail free card when you have to find a story, any story. Unlike the bulk of news journalism, data stories are lying around waiting to be found. There’s so much data out there that no one will ever have uncovered every angle. You should always be able to find something to keep the editor off your back, even if you can only produce it after hours of grueling pivot tables.

But perhaps more importantly, data journalism gives journalists another way to justify their continued employment. The advent of the internet has turned everyone into a journalist. Report on any event not hidden away from the public and you can bet that someone else has got there first, and there's even a good chance they've got as good a story as you. But not many people have the skills, or the inclination, to sift through a huge pile of information and unearth the juicy nugget of a story that makes it worth publishing.

So it may not be as sexy as Julian Assange's glistening silver mane, but data journalism may one day save you from unemployment. And there's not much less sexy than unemployment.


More on McCandless

Having credited Information is Beautiful author David McCandless with helping to make data visualisation popular, I have now stumbled across this interview with him on the Visualising Data blog.

The impression you get from McCandless is encouraging for any slightly feckless computer game nerd (he was a university dropout before becoming UK Doom Champion). The way he tells it, he basically stumbled upon the idea of expressing data in beautiful pictures when trying to make sense of various different strands of thought on evolution and creationism.

So, to keep track of all these different camps, I drew a visual map, and tried to sum up, distill, each camp into the minimum words.

In the end, I had this pretty interesting diagram. I looked at it and thought: “Hmmmm, I don’t really need to write the article now. It’s done the job of delivering the information.”

Then I thought: “Maybe I could do this to loads of subjects? Instead of writing an article, this diagram could be the article?”


But while he comes across as lucky to a certain degree, he also makes it clear there was a huge amount of work involved, both in the design work to present data in an appealing way and in the journalism that underpins it (including some good old-fashioned leaking).

And going back to my last post based on McCandless, about computer-generated visualisations sometimes lacking something, he reveals that he never uses any software tools other than Illustrator to produce final designs.


Other than using (presumably) Illustrator, are there any other software tools that you use to help create the final designs?


Are there any particularly intricate or advanced features of this software that you could share?

Not really, sorry.

But perhaps his penultimate point is the one that should stick the most:

Data needs humans to be interesting.

