forklabs-javaxpcom

The forklabs-javaxpcom open source project provides the skeleton to build a web crawler using Mozilla and SWT.

Introduction

The Crawler is designed to painlessly navigate to web pages, wait for them to load, and give access to the underlying document using JavaXPCOM. The source code includes an example Crawler that comes to this page and reads the label of each topmost menu item (Project Home, Downloads, ...). From this example it is simple to build a more complex Crawler that can follow links, fill in forms and collect even more data. Finally, one can create a Crawler from an existing Display (and thus control the event thread), or one can let the Crawler create its own SWT daemon thread.

Getting Started

Using this library to make an automated crawler is quite easy. Making a more interactive crawler, or leveraging other facets of the library, is much more difficult. Long ago, the Laboratorio de Algoritmia Distribuida y Redes published XULRunner with Java: JavaXPCOM Tutorial (local copy) to help with such an endeavour. Please note that the guide has been reproduced as is and might contain outdated information. However, its essence should still be there to help you with your own projects.

DOM Interaction

DOM Navigation

To help navigate and quickly find nodes of interest in the document, the crawler offers a selection mechanism based on node properties (i.e. node name, node attribute name, node attribute value). This mechanism is accessed through the Selector API and the selector() method on the crawler itself.

DOM Inspection

This package includes a utility class, aimed at developers, whose purpose is to inspect DOM nodes. It prints to standard output the main qualities of a node, including but not limited to its type, its names, its attributes and its children.

DOM Conversion

Another utility class allows easy conversion by hiding the interface-querying calls, making the code much more readable.
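
To make the idea behind DOM Navigation concrete, here is a minimal sketch of selecting nodes by name, attribute name and attribute value. It is not the library's Selector API: it uses the JDK's standard org.w3c.dom interfaces rather than JavaXPCOM, and the class name NodeSelectionSketch and its select and parse methods are hypothetical names chosen for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration: NOT the forklabs-javaxpcom Selector API.
public class NodeSelectionSketch {

    /**
     * Collects, in document order, every element under root that matches
     * the given criteria. A null criterion means "do not filter on this".
     */
    public static List<Element> select(Node root, String nodeName,
                                       String attrName, String attrValue) {
        List<Element> matches = new ArrayList<>();
        collect(root, nodeName, attrName, attrValue, matches);
        return matches;
    }

    private static void collect(Node node, String nodeName, String attrName,
                                String attrValue, List<Element> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element) node;
            boolean nameOk = nodeName == null
                    || nodeName.equalsIgnoreCase(element.getNodeName());
            boolean attrOk = attrName == null
                    || (element.hasAttribute(attrName)
                        && (attrValue == null
                            || attrValue.equals(element.getAttribute(attrName))));
            if (nameOk && attrOk) {
                out.add(element);
            }
        }
        // Depth-first walk over the children, whatever the node type.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), nodeName, attrName, attrValue, out);
        }
    }

    /** Parses a small well-formed XML snippet; used only to feed the demo. */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse snippet", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div>"
                + "<a class=\"menu\">Project Home</a>"
                + "<a class=\"menu\">Downloads</a>"
                + "<a class=\"other\">Skip</a>"
                + "</div>");
        // Print the label of every <a> element whose class is "menu".
        for (Element e : select(doc, "a", "class", "menu")) {
            System.out.println(e.getTextContent());
        }
    }
}
```

Run as is, the demo prints the two menu labels and skips the third link. In a real crawler the same walk would be applied to the document exposed through JavaXPCOM, where node types and accessors differ from the JDK interfaces used here.
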
Download

A distribution is available for download. It contains the source, the library and the documentation, as well as a Windows version of SWT and a Windows version of XULRunner.

Source

The source code can be browsed through the Mercurial web interface.

Javadoc

The Javadoc API can also be browsed through the Mercurial web interface.

Dependencies

The crawler depends on two external tools. The first is SWT, a GUI library from Eclipse, and the second is XULRunner from Mozilla. The correct versions for your operating system are required. The 32-bit Windows version of each of these tools is included in the Mercurial repository. When installing XULRunner, make sure its path does not contain any non-ASCII characters.

The crawler also depends on the forklabs-baselib library, a sibling project containing core language features. That library is included in the distribution and has a compile entry in the POM.

Compatibility

It is recommended to run your own crawlers on a 32-bit JVM, as compatibility problems were encountered while trying to run a crawler on a 64-bit Windows machine. Digging down, it was found that XULRunner does not have any official 64-bit version (at least for Windows). As a result, the 64-bit version of SWT for Windows is missing the glue library between SWT and XULRunner. To fix it, one could hack the SWT source code to compile the missing native library for 64-bit and then use the unofficial 64-bit version of XULRunner. But for the moment I do not have the resources for that.
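
The two constraints above (a 32-bit JVM and an ASCII-only XULRunner path) can be checked before starting a crawler. The following is a minimal sketch and not part of the library: the class CrawlerEnvironmentCheck and its methods are hypothetical, the sun.arch.data.model system property is specific to HotSpot-style JVMs and may be absent, and the fallback on os.arch is only a heuristic.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical pre-flight check; not part of forklabs-javaxpcom.
public class CrawlerEnvironmentCheck {

    /** Returns true when every character of the path is plain ASCII. */
    public static boolean isAsciiPath(String path) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(path);
    }

    /**
     * Best-effort guess at the JVM bitness.
     * dataModel is the value of "sun.arch.data.model" (may be null),
     * osArch is the value of "os.arch" (used only as a fallback).
     */
    public static boolean is64BitJvm(String dataModel, String osArch) {
        if ("64".equals(dataModel)) {
            return true;
        }
        if ("32".equals(dataModel)) {
            return false;
        }
        return osArch != null && osArch.contains("64");
    }

    public static void main(String[] args) {
        String dataModel = System.getProperty("sun.arch.data.model");
        String osArch = System.getProperty("os.arch");
        if (is64BitJvm(dataModel, osArch)) {
            System.err.println("Warning: this looks like a 64-bit JVM; "
                    + "a 32-bit JVM is recommended for the crawler.");
        }

        // The XULRunner location is passed as the first argument (assumption).
        if (args.length > 0 && !isAsciiPath(args[0])) {
            System.err.println("Warning: the XULRunner path contains "
                    + "non-ASCII characters: " + args[0]);
        }
    }
}
```

Passing the XULRunner installation directory as the first command-line argument is an arbitrary choice for this sketch; the check could just as well read the path from wherever your own crawler is configured.
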
Stack Traces

This stack trace comes from using a 32-bit version of SWT on a 64-bit JVM:

    Exception in thread "SWT-Thread" java.lang.UnsatisfiedLinkError: Cannot load 32-bit SWT libraries on 64-bit JVM
        at org.eclipse.swt.internal.Library.loadLibrary(Unknown Source)
        at org.eclipse.swt.internal.Library.loadLibrary(Unknown Source)
        at org.eclipse.swt.internal.C.<clinit>(Unknown Source)
        at org.eclipse.swt.widgets.Display.<clinit>(Unknown Source)

This stack trace comes from using a 64-bit version of SWT on a 32-bit JVM:

    Exception in thread "SWT-Thread" java.lang.UnsatisfiedLinkError: Cannot load 64-bit SWT libraries on 32-bit JVM
        at org.eclipse.swt.internal.Library.loadLibrary(Unknown Source)
        at org.eclipse.swt.internal.Library.loadLibrary(Unknown Source)
        at org.eclipse.swt.internal.C.<clinit>(Unknown Source)
        at org.eclipse.swt.widgets.Display.<clinit>(Unknown Source)

This stack trace comes from running a 64-bit version of SWT on a 64-bit JVM:

    Caused by: org.eclipse.swt.SWTError: No more handles (java.lang.UnsatisfiedLinkError: Could not load SWT library. Reasons:
        no swt-xulrunner-win32-3650 in java.library.path
        no swt-xulrunner-win32 in java.library.path
        Can't load library: C:\Users\Forklabs\AppData\Local\Temp\swtlib-64\swt-xulrunner-win32-3650.dll
        Can't load library: C:\Users\Forklabs\AppData\Local\Temp\swtlib-64\swt-xulrunner-win32.dll
    )
        at org.eclipse.swt.SWT.error(Unknown Source)
        at org.eclipse.swt.SWT.error(Unknown Source)
        at org.eclipse.swt.browser.Mozilla.initXULRunner(Unknown Source)
        at org.eclipse.swt.browser.Mozilla.create(Unknown Source)
        at org.eclipse.swt.browser.Browser.<init>(Unknown Source)

Acknowledgement

I wish to thank the Laboratorio de Algoritmia Distribuida y Redes for its help in kickstarting me with SWT, Mozilla and JavaXPCOM, as well as for allowing me to publish some of their code and their documentation.