Sunday, October 26, 2008

Mashup Screencast 2: The Scraping

This is one sequel you didn't have to wait long for: the second part of the mashup screencast trilogy, Web Scraping, is out. In this episode, Jonathan uses a real-world example to teach you the basics of screen scraping.

For the uninitiated, scraping lets you extract information from web pages and make it available in a machine-consumable form. It's a technique most people want to learn as soon as they start using the WSO2 Mashup Server, because it potentially allows you to use the entire web as your data source.

Spoilers: In the screencast, Jonathan teaches you to write a scraper configuration that retrieves the contents of a web page and creates a sanitized XML document from it. He then uses Firebug, a Firefox add-on, to view the structure of the web page and work out how to extract the specific data element he's after from the XML. By the end of the screencast you'll be ready to go out and scrape a few pages yourself!
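
If you'd like a feel for the idea before watching, here is a rough sketch of the same two steps (sanitize the page into well-formed XML, then pull out one element) in plain Python. To be clear, this is not the scraper configuration from the screencast, and it does not use the Mashup Server at all. The sample page, the `HtmlToXml` class, and the `sanitize` and `extract` helpers are all made up for illustration, and a real scraper would fetch the page over HTTP rather than read it from a string.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Builds a well-formed element tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):  # text that follows a child element
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def sanitize(html):
    """Step 1: turn raw HTML into a clean XML tree."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.root


def extract(root, path):
    """Step 2: pull the text of one element out of the tree."""
    element = root.find(path)
    if element is None or element.text is None:
        return None
    return element.text.strip()


# A made-up page with the usual problems: unclosed <p>, <img> and <br>.
SAMPLE_PAGE = """<html><body>
<div id="header"><img src="logo.png"><p>Welcome
<div id="forecast"><span class="temp">21 C</span><br>Sunny</div>
</body></html>"""

xml = sanitize(SAMPLE_PAGE)
temperature = extract(xml, ".//div[@id='forecast']/span[@class='temp']")
print(temperature)
```

The path passed to `extract` is the kind of thing you would work out by inspecting the page structure in a tool like Firebug: find the element you want, note its tag and attributes, and write a path that leads to it.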

If you've got good bandwidth, you'll appreciate the hi-res version, but if you don't mind YouTube quality, click below.


As before, watch this space for the next installment.
