|
HowToCreateaDataRssScraper
How to create a DataRSS feed from HTML pages.
Introduction

It's not always possible to create a fresh DataRSS feed from scratch, or to embed RDFa in your HTML documents, so in this tutorial we look at how to 'scrape' the data from existing HTML pages to create a DataRSS feed.

Writing your own scraping application

If you are using Yahoo!'s SearchMonkey, then this translation from HTML to DataRSS can be carried out using Yahoo! developer tools. If not, then you will need to set up a process that:
The XSLT document described in the section Specifying the transformation provides an illustration of how to write a transformation for step 2.

Using SearchMonkey

Creating a new application

If you are going to use the SearchMonkey tools, then navigate to the SearchMonkey page at http://developer.yahoo.com/searchmonkey/. Click on the button labelled "Build an app", and you'll be asked to log in if you aren't already logged in. If you don't have a Yahoo! ID, then you'll need to create one, but that only takes a minute or two. Once you have registered and logged in, you'll see the SearchMonkey dashboard:
From the dashboard you can create a number of different types of application, but the one that we are interested in is a custom data service; click Create a new data service and you'll see a screen like this:
Enter a name and description for your feed, agree to the terms of service, and click Next Step. For example, if we were creating a feed that scrapes Foreign and Commonwealth Office vacancies, then we might choose the name "FCO Vacancies (Scrape)", and the description "A feed of vacancies at the FCO."

Indicating where to find your data

The next step is to tell SearchMonkey where to find your information, which involves a URL pattern, rather than a specific list of URLs. To illustrate how to work out what your URL pattern is, we'll use the FCO again. The FCO has a number of vacancies listed in a common folder:

http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/Admin-Assistant
http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/A2
http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/003-UN
http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/secretary-antarctic
http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/ico-cos

Each of these jobs has a common part of the URL, which is:

http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/

The 'trigger' URL pattern is created by placing a wildcard on the end of this base URL, as follows:

http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/*

Add the trigger URL that is appropriate for your information, and click on Autofind URLs. This instructs the tool to try to find some URLs for you to test with, but if no suitable matches are found you can enter some into the list below. We'll enter all five of the FCO jobs listed above so that it gives us as much data as possible to test with:
Note that alongside each URL are the words "URL OK" in green, which indicates that the service was able to access the page correctly. Press Next Step, and you'll be presented with a form which will allow you to indicate how to convert your HTML page into DataRSS:
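Before moving on from the trigger URL, it can be useful to check offline which pages a trailing-wildcard pattern would select. The following Python sketch does this with shell-style matching from the standard library. Treating the wildcard as "any run of characters" is an assumption that only approximates how SearchMonkey interprets it, and the function name matches_trigger is made up for illustration.

```python
from fnmatch import fnmatchcase

# Trigger pattern from the FCO example: the shared base URL plus a trailing "*".
TRIGGER = "http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/*"


def matches_trigger(url: str, pattern: str = TRIGGER) -> bool:
    """Return True if url fits the pattern, with "*" standing for any characters.

    Assumption: shell-style matching only approximates SearchMonkey's wildcard.
    """
    return fnmatchcase(url, pattern)


if __name__ == "__main__":
    candidates = [
        "http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/Admin-Assistant",
        "http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/vacancies/ico-cos",
        # Parent folder: outside the pattern, so it should not trigger the service.
        "http://www.fco.gov.uk/en/about-the-fco/working-for-us/careers/",
    ]
    for url in candidates:
        print(matches_trigger(url), url)
```

Under this approximation every vacancy page matches, while the parent careers folder does not, which is the behaviour the trigger pattern is meant to give.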
Specifying the transformation

If you are going to run your own transformation service then you'll need to ensure that your HTML page is converted to XHTML before it can be translated with XSLT. However, SearchMonkey will already have done that for us, so at this point we simply need to add the XSLT stylesheet itself. Paste the following text into the large text entry field, overwriting the entire existing contents:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <adjunctcontainer>
      <adjunct id="smid:{$smid}" version="1.0" name="vacancy">
        <item rel="rel:Employer">
          <meta property="vcard:fn">FCO</meta>
          <item rel="vcard:url" resource="http://www.fco.gov.uk/" />
          <item rel="rel:Thumbnail" resource="http://www.fco.gov.uk/sitepack/layouts/xm_supplied/files/images/fco_mobile_logo.gif">
            <meta property="media:width">127</meta>
            <meta property="media:height">45</meta>
          </item>
        </item>
        <item rel="rel:Listing" xmlns:v="http://code.google.com/p/argot-hub/wiki/ArgotVacancy#">
          <item rel="dc:source" resource="{$CURRURL}" />
          <meta property="dc:title"><xsl:value-of select="//div[@id='topbar']/h1" /></meta>
          <meta property="dc:description">
            <xsl:value-of select="//div[@class='moreinfo']/h2[contains(., 'The Role') or contains(., 'The role')]/following-sibling::p" />
          </meta>
          <xsl:variable name="location" select="//div[@class='moreinfo']/h2[contains(., 'Location')]/following-sibling::p" />
          <xsl:if test="$location">
            <meta property="geo:location">
              <xsl:value-of select="$location" />
            </meta>
          </xsl:if>
          <meta property="job:salaryType">annual</meta>
          <xsl:variable name="salary" select="
            translate(
              substring-before(
                substring-after(
                  //div[@class='moreinfo']/h2[contains(., 'Salary')]/following-sibling::p,
                  'Starting salary of £'
                ),
                ' '
              ),
              ',',
              ''
            )"
          />
          <xsl:if test="$salary">
            <meta property="job:salaryFrom" datatype="currency:GBP">
              <xsl:value-of select="$salary" />
            </meta>
            <meta property="job:salaryTo" datatype="currency:GBP">
              <xsl:value-of select="$salary" />
            </meta>
          </xsl:if>
          <meta property="job:hireType">full-time</meta>
          <xsl:variable name="closingDate" select="
            concat(
              translate(
                substring(
                  substring-after(
                    //div[@class='moreinfo']/h2[contains(., 'How to Apply')]/following-sibling::p[2],
                    'Closing date:'
                  ),
                  2
                ),
                '.',
                ''
              ),
              translate(
                substring-after(
                  //div[@class='moreinfo']/h2[contains(., 'How to apply') or contains(., 'Closing Date') or contains(., 'Closing date')]/following-sibling::p,
                  'by '
                ),
                '.',
                ''
              )
            )"
          />
          <xsl:if test="$closingDate">
            <meta property="v:closingDate">
              <xsl:value-of select="$closingDate" />
            </meta>
          </xsl:if>
        </item>
      </adjunct>
    </adjunctcontainer>
  </xsl:template>
</xsl:stylesheet>

We're now ready to test our transformation.

Testing the transformation

As you work on your XSLT transformation you'll want to test that it creates the desired output. The SearchMonkey tool helps you to do this by using the pages from the test URLs provided earlier, and translating them using your current XSLT. To request that SearchMonkey runs your transformation against the pages, click on Save & Refresh. The viewer at the bottom should change to look something like this:
This data viewer is incredibly useful when trying to get your XSLT correct, and has the following features:
The result of applying our XSLT to the first FCO vacancy is laid out like this:
As you can see, some of the information has been provided by the transformation, such as the FCO's logo, and other parts have been provided by scraping the HTML.

Saving the transformation

Once you have finished testing the XSLT against your HTML pages you can click Next Step and your new data service will be saved. Your data service can now be used to provide data to a SearchMonkey presentation application. If you'd like to create such an application then see How To Create a SearchMonkey Presentation Application. |
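If you run your own transformation service rather than SearchMonkey, you may not have an equivalent of the data viewer, so it can help to check the stylesheet's string handling on sample text. The Python sketch below mirrors the salary expression from the stylesheet (substring-after, then substring-before, then translate to drop commas). The helper names and the sample sentences are illustrative assumptions, not text taken from the FCO pages.

```python
def substring_after(text: str, marker: str) -> str:
    """Mimic XPath substring-after(): text after the first marker, else ''."""
    if marker == "":
        return text  # XPath returns the whole string for an empty marker
    _, found, tail = text.partition(marker)
    return tail if found else ""


def substring_before(text: str, marker: str) -> str:
    """Mimic XPath substring-before(): text before the first marker, else ''."""
    if marker == "":
        return ""  # XPath returns the empty string for an empty marker
    head, found, _ = text.partition(marker)
    return head if found else ""


def extract_salary(paragraph: str) -> str:
    """Follow the stylesheet's salary expression step by step."""
    after_label = substring_after(paragraph, "Starting salary of £")
    amount = substring_before(after_label, " ")
    # Equivalent of translate(amount, ',', ''): remove thousands separators.
    return amount.replace(",", "")


if __name__ == "__main__":
    # Hypothetical sample paragraphs, not copied from a real vacancy page.
    for sample in (
        "Starting salary of £18,500 per annum.",
        "Starting salary of £18,500",
        "Salary negotiable",
    ):
        print(repr(extract_salary(sample)), "<-", sample)
```

One thing this makes visible: because substring-before returns an empty string when the delimiter is absent, a paragraph that ends directly after the figure, with no following space, produces an empty salary. In the stylesheet that means the salary metadata is simply omitted by the xsl:if test.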
