While looking to scrape some data from wikipedia recently I was inspired to do a blog post on how easy this process is when using ready made tools on Google Docs.
Wikipedia may not be the most impenetrable of fortresses for information but it is the home for many useful stats and info – such as the list of top points scorers in the Magners League which I was looking for.
What’s slightly more advantageous for data journlists is that statistics on Wikipedia are set out in a uniform manner and so are normally relatively easy to get at.
What many people don’t realise is that Google spreadsheets can be quite easily used as a data scraper. By using the Google spreadsheet function =importHTML(“Website”,”table”,N) you can scrape a table from an HMTL web page into a Google Doc and have it clean and ready to be used for any data journalism you wish.
So for example during my data scraping adventure I went to the wikipedia page for The 2010-11 Magners League Season and found the list of eight top point scorers towards the bottom.
I then went onto my Google Docs account and opened up a new spreadsheet and in the first cell entered =ImportHTML (by the time you get to =Impor…. the autocomplete should have offered you the correct formula).
To get the table I wanted I entered =ImportHTML(“http://en.wikipedia.org/wiki/2010-11_Magners_League”,”Table”,146). Within the brackets the target web page and the instruction Table need to be in double quotes or it wont work so be doubly sure to check before you proceed and if any errors occur, always check there first.
As with this example to get the table you really want may take some playing around so be patient. If there are only a few tables on the wikipedia page you should be able to figure it out – the first table will be =Import(“WikipediaPage”,”Table”,1) while the second will be =Import(“WikipediaPage”,”Table”,2) and so on. As you may be able to see with my example however is that Wikipedia seems to consider most non standardised text to be a table. So if the table you want is below a lot of other information be prepared to make educated guesses at the right number the table may be until you get the right number.
As if by magic the table appears and you have readily available clean data to use for whatever you please. Googles sharing options also let you publish the data as a web page, a PDF or a CSV if you wish to then re-filter the info through some other data tools we’ve discussed (such as BatchGeo). A quick trip to many eyes meant I was able to produce this little visualisation: