WebScraping  
HTML web page scraping
Phase-Implementation, Featured
Updated Feb 4, 2010 by yuin...@gmail.com

Introduction

XQuery can be used effectively as an HTML/XML web-scraping engine (see #1 for details).

Example

This example uses XQuery to extract the following text messages from this HTML page.

Extract me please!
Me too ;-(

The following query extracts the two messages and renders them as the table shown after it.

 <table border="1">
   <tr bgcolor="lightgreen">
     <td>order</td><td>message</td>
   </tr>
 {
   let $page := fn:doc("http://code.google.com/p/xbird/wiki/WebScraping")
   for $code at $pos in $page/html/body/div[@id='maincol']/div[@id='wikicontent']/pre
   where $pos le 2
   return
     <tr>
       <td>{ $pos }</td>
       <td>{ fn:data($code) }</td>
     </tr>
 }
 </table>

order  message
1      Extract me please!
2      Me too ;-(
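
The path-and-position logic of the query can also be tried outside XBird. The following is only a minimal sketch using the JDK's built-in XPath 1.0 API, not XBird's own API. It assumes the input is already well-formed XHTML (a real page would first need an HTML parser such as NekoHTML or TagSoup), and the class name ScrapeSketch, the method firstMessages, and the inline PAGE string are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ScrapeSketch {

    // Hypothetical stand-in for the wiki page: well-formed XHTML with the same
    // div/pre nesting that the XQuery path expression walks.
    static final String PAGE =
        "<html><body><div id='maincol'><div id='wikicontent'>"
        + "<pre>Extract me please!</pre>"
        + "<pre>Me too ;-(</pre>"
        + "<pre>not extracted</pre>"
        + "</div></div></body></html>";

    // Returns the text of the first `limit` <pre> elements under the
    // wikicontent div, mirroring the "where $pos le 2" filter of the query.
    static List<String> firstMessages(String xhtml, int limit)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String path = "/html/body/div[@id='maincol']/div[@id='wikicontent']"
            + "/pre[position() <= " + limit + "]";
        NodeList nodes = (NodeList) xpath.evaluate(
            path, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            messages.add(nodes.item(i).getTextContent());
        }
        return messages;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Print each message with its 1-based position, like the table rows.
        List<String> messages = firstMessages(PAGE, 2);
        for (int i = 0; i < messages.size(); i++) {
            System.out.println((i + 1) + " " + messages.get(i));
        }
    }
}
```

The positional predicate plays the role of the "at $pos ... where $pos le 2" clause. The two are equivalent here only because all pre elements share one parent.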

Current limitation

XBird uses NekoHTML or TagSoup as the HTML parser in the DocumentTableModel class. These parsers do not execute JavaScript, so content that a page generates with scripts is not available to queries.

The HTML parser is configurable through the xbird.util.xml.HTMLSAXParser property in xbird.properties.

A JavaScript-aware HTML parser such as Cobra (a Java HTML parser) could help here.

List of further information

