WebScraping
Introduction
XQuery can be used effectively as an HTML/XML web-scraping engine (see #1 for details).

Example
Here is an example that uses XQuery to extract the following text messages from this HTML page:
Extract me please!
Me too ;-(
The following query extracts the messages and returns them as a table:
<table border="1">
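  <tr bgcolor="lightgreen">
    <td>order</td><td>message</td>
  </tr>
  {
    (: Load the wiki page; the HTML parser turns it into an XML tree. :)
    let $page := fn:doc("http://code.google.com/p/xbird/wiki/WebScraping")
    (: Walk the pre blocks inside the wiki content, tracking each one's position. :)
    for $code at $pos in $page/html/body/div[@id='maincol']/div[@id='wikicontent']/pre
    (: Keep only the first two blocks, which hold the messages. :)
    where $pos le 2
    return
      <tr>
        <td>{ $pos }</td>
        <td>{ fn:data($code) }</td>
      </tr>
  }
</table>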
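
For readers more familiar with Java than XQuery, the same extraction logic can be sketched with the JDK's built-in XPath support. This is not XBird's API, only an illustration of what the query does. The class name `ScrapeSketch`, the method `extractRows`, and the inline `SAMPLE_PAGE` are hypothetical, and the sample is already well-formed XHTML, so no HTML-cleaning parser is involved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ScrapeSketch {
    // Hypothetical stand-in for the wiki page: well-formed XHTML with the
    // same element layout that the query navigates.
    static final String SAMPLE_PAGE =
        "<html><body><div id='maincol'><div id='wikicontent'>"
        + "<pre>Extract me please!</pre>"
        + "<pre>Me too ;-(</pre>"
        + "<pre>a third block that is skipped</pre>"
        + "</div></div></body></html>";

    // Returns one <tr> row per <pre> block, keeping at most `limit` blocks
    // (the counterpart of "where $pos le 2" in the query).
    static List<String> extractRows(String xhtml, int limit) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        NodeList pres = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
            "/html/body/div[@id='maincol']/div[@id='wikicontent']/pre",
            doc, XPathConstants.NODESET);

        List<String> rows = new ArrayList<>();
        for (int i = 0; i < pres.getLength() && i < limit; i++) {
            int pos = i + 1; // XQuery positions are 1-based
            String text = pres.item(i).getTextContent()
                .replace("&", "&amp;")
                .replace("<", "&lt;"); // minimal escaping for the output markup
            rows.add("<tr><td>" + pos + "</td><td>" + text + "</td></tr>");
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("<table border=\"1\">");
        System.out.println("<tr bgcolor=\"lightgreen\"><td>order</td><td>message</td></tr>");
        for (String row : extractRows(SAMPLE_PAGE, 2)) {
            System.out.println(row);
        }
        System.out.println("</table>");
    }
}
```

The XQuery version expresses the same steps in one FLWOR expression, and it can read a real-world HTML page directly because XBird parses the page into an XML tree first.
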

Current limitation
XBird uses NekoHTML or TagSoup as the HTML parser in the DocumentTableModel class. These parsers do not execute JavaScript, so content generated by scripts is not visible to queries. The HTML parser is configurable through the xbird.util.xml.HTMLSAXParser property in xbird.properties. A JavaScript-aware HTML parser such as Cobra: Java HTML Parser could help.

List of future information