Unfortunately for journalists, not all data on the internet is available in easily downloadable formats. Before the advent of public data releases, screen scraping was the most common method of pulling information into a usable format. A screen scraper is a program that navigates through a website and attempts to strip out the required data, for example from a page of search results. Although it can be unreliable and complicated, scraping can be useful when nothing else is available.
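To make the idea concrete, here is a minimal sketch of what a hand-written scraper does, using only Python's standard library. The HTML fragment, the band names and the `TableScraper` and `scrape_table` names are all made up for illustration. A real scraper would first download the page and would have to cope with much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, standing in for HTML downloaded from a website.
SAMPLE_HTML = """
<table>
  <tr><th>Band</th><th>Hometown</th></tr>
  <tr><td>Example Band</td><td>Leeds</td></tr>
  <tr><td>Another Act</td><td>Austin</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Trim surrounding whitespace once the whole row has been read.
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
for record in records:
    print(record)
```

The sketch depends entirely on the page keeping its current layout. If the site renames a column or moves the table, the scraper quietly returns different data, which is the unreliability mentioned above.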
But how does one screen scrape? Technical users would write a quick program themselves, but how do ordinary users scrape? Enter Dapper. Previously a small startup, it was purchased by Yahoo in October last year and has found its niche as a display advertising company. Thankfully, they have kept the Open Dapper screen scraper free. You type in the URL of your site and follow the instructions. It’s rather easy to use: the site guides you through selecting a sample set of data and asks which format you would like the data produced in.
I first encountered Dapper when I used it for a music ‘knowledge base’ project. Dapper came in useful for pulling data from the Allmusic Guide. Allmusic do not provide comprehensive RSS or XML feeds of their reviews, and I wanted to include this information in the project. Dapper successfully pulled the hometown, years active and members for each band, with searches created on the fly.
Do be aware that if you are using screen scraping for a commercial project, you may be pulling copyrighted information. It is always worth checking with the owner before using it. That caveat aside, scraping is a useful tool for analysing and downloading data without having to resort to copying and pasting.