DeploymentView  
Updated Dec 21, 2010 by federic...@gmail.com

Deployment View

The Deployment View depicts a static view of the run-time configuration: the processing nodes and the software components that run on those nodes. The deployment scenario described in this section assumes three server nodes, but in a minimal scenario the system can be deployed on a single server node.

Descriptions of the three nodes and their software dependencies are presented in Table 1 and the individual components hosted on these nodes are described in Table 2.

Table 1. The three nodes of the GBIF metadata harvesting and catalogue system, and their respective requirements.

Name & Description | Requirements
Harvester Server – hosts the harvester application. The harvester stores the harvested XML files in the repository. | JVM 1.6x; HTTP access to the URLs of registered metadata repositories; read/write access to the file system repository.
Servlet Container – Application – hosts the main catalogue application components, including search functionality, request and response handling, the web interface, and the OAI-PMH Service (both described in the Use Case View). | JVM 1.6x; Servlet 2.5 compliant container; HTTP access to the Solr index server.
Servlet Container – Index – hosts the Solr 1.4 indexing services. | JVM 1.6x; Servlet 2.5 compliant container; read access to the file system repository.

Figure 9. The file system repository structure that is created by the GBIF harvesting component.

Table 2. The main software components of the GBIF metadata harvesting and catalogue system.

Component | Description
File System Repository | A reachable file server. If the directory is empty, the first time the harvester runs it creates the structure depicted in Figure 9 inside the directory.
metacatalogue.war | A Java web application that includes i) the OAI-PMH Service, an implementation of the OAI-PMH repository protocol, and ii) the search functionality (both explained in more detail in the Use Case View section). This web application has no dependencies on libraries other than those already included in the war file.
metacatalogharvester.jar | A standalone Java application responsible for harvesting metadata from the list of OAI-PMH servers and from the GBIF portal. It can be executed in one of two modes: “run once” and “scheduled thread pool server”. The “run once” mode is intended for use when the operating system (or any other task scheduler service) is responsible for running the harvester periodically. The scheduled thread pool server is a configurable Java application that runs the harvester at fixed time intervals.
solr.war | The standard Solr 1.4 web application; it can be downloaded from the main Solr distribution repository.
gbif-solr.jar | A small Java library that contains utility classes used by the data import handlers; specifically, the Dublin Core import handler requires the class “ListDateFormatTransformer” included in this jar archive. This file must be copied to the Solr “lib” folder.
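Figure 9 itself is not reproduced here, but the repository layout can be inferred from the paths used later in this document (the oai_dc baseDir, the outputdirectory and portaloutputdirectory settings, and the per-serverId subdirectories). The following sketch recreates that inferred layout in a temporary directory; the serverId 1111 is the example value from serverList.xml.

```shell
# Sketch of the file system repository layout the harvester creates (inferred
# from the baseDir, outputdirectory and serverId settings in this document).
# A temporary directory stands in for /opt/metacatalog/data.
repo=$(mktemp -d)
mkdir -p "$repo/oai_dc/1111"   # one subdirectory per harvested OAI-PMH serverId
mkdir -p "$repo/eml"           # EML documents harvested from the GBIF portal
find "$repo" -mindepth 1 -type d | sort
```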

1. Installation and Configuration

This section describes the main steps to follow in order to install the different components of the GBIF metadata harvesting and catalogue system.

1.1. Installing Apache Maven
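The original page leaves this subsection empty. A typical Maven installation at the time (an assumption, not part of the source) consisted of downloading a binary distribution from http://maven.apache.org/, unpacking it, and putting its bin directory on the PATH:

```shell
# Assumed installation steps (the version number 2.2.1 is illustrative only):
#   1. Download a binary distribution from http://maven.apache.org/
#   2. Unpack it, e.g.: tar xzf apache-maven-2.2.1-bin.tar.gz -C /opt
#   3. Put its bin/ directory on the PATH:
export PATH=/opt/apache-maven-2.2.1/bin:$PATH
#   4. Verify the installation with: mvn --version
```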

1.2. Installing Solr with Apache Tomcat

  • Download Apache Solr 1.4 from http://lucene.apache.org/solr/
  • Download Apache Tomcat 6 from http://tomcat.apache.org/
  • To install Solr with Apache Tomcat, follow the guide at http://wiki.apache.org/solr/SolrTomcat
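The SolrTomcat guide referenced above describes, among other options, deploying solr.war through a Tomcat context fragment that points at the Solr home directory via a JNDI environment entry. The sketch below follows that approach; the paths are illustrative and a temporary directory stands in for the real CATALINA_HOME.

```shell
# Create a Tomcat context fragment for Solr (illustrative paths; the temporary
# directory stands in for the real Tomcat installation, CATALINA_HOME).
CATALINA_HOME=$(mktemp -d)
SOLR_HOME=/opt/solr            # assumed location of the Solr home (conf/, lib/, data/)
mkdir -p "$CATALINA_HOME/conf/Catalina/localhost"
cat > "$CATALINA_HOME/conf/Catalina/localhost/solr.xml" <<EOF
<Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="$SOLR_HOME" override="true"/>
</Context>
EOF
cat "$CATALINA_HOME/conf/Catalina/localhost/solr.xml"
```

On startup, Tomcat picks up the fragment and deploys the war under the /solr context path, with the solr/home entry telling Solr where its configuration lives.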

2. Installing the metacatalogue harvester

  • Check out the source code from the svn repository: https://gbif-metadata.googlecode.com/svn/trunk/metacatalog; this creates a local metacatalog directory.
  • Enter the metacatalog directory and build and package the application using the Maven command:
    mvn -Dmaven.test.skip=true package
  • This command creates the file metacatalog-0.0.1-SNAPSHOT.jar and the directory libs in the Maven target directory; both must be copied to the harvester installation directory.
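The copy step in the last bullet can be sketched as follows. The harvester installation directory is not named in the source, so temporary directories stand in for both the Maven target directory and the installation directory, which makes the sketch runnable anywhere.

```shell
# Copy the packaged jar and the libs directory from Maven's target/ directory
# into the harvester installation directory (both locations simulated here).
target=$(mktemp -d)       # stands in for metacatalog/target
install_dir=$(mktemp -d)  # stands in for the harvester installation directory
mkdir -p "$target/libs"
touch "$target/metacatalog-0.0.1-SNAPSHOT.jar" "$target/libs/example-dependency.jar"
cp "$target/metacatalog-0.0.1-SNAPSHOT.jar" "$install_dir/"
cp -r "$target/libs" "$install_dir/"
ls "$install_dir"
```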

2.1. Configuring Solr to index the harvested files

  • Stop the Solr servlet container.
  • From the metacatalog project directory, copy the files dc-import, eml-import, schema.xml and solrconfig.xml into the directory SOLR_HOME/conf (these files are located in the solr.config directory/package).
  • Modify the “baseDir” attribute in the dc-import and eml-import files; it must point to the directory where the oai_dc and eml files are stored in the file system:

    <entity name="dcdataset" rootEntity="false" dataSource="null"
            processor="FileListEntityProcessor" fileName="^.*\.xml$"
            recursive="true" baseDir="/opt/metacatalog/data/oai_dc">
  • Copy the file gbif-solr.jar from the target directory to the SOLR_HOME/lib directory (this file was generated when the project was compiled with the Maven package phase in the previous section).
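The copy-and-edit steps above can be sketched as shell commands. The solr.config directory name and the baseDir value are taken from this section; temporary directories stand in for the project directory and SOLR_HOME so the sketch runs anywhere.

```shell
# Copy an import definition into SOLR_HOME/conf and point its "baseDir" at the
# oai_dc directory of the file system repository (stand-in directories).
project=$(mktemp -d)     # stands in for the metacatalog project directory
SOLR_HOME=$(mktemp -d)   # stands in for the real SOLR_HOME
mkdir -p "$project/solr.config" "$SOLR_HOME/conf" "$SOLR_HOME/lib"
printf '<entity name="dcdataset" baseDir="/change/me">\n' > "$project/solr.config/dc-import"
cp "$project/solr.config/dc-import" "$SOLR_HOME/conf/"
sed -i 's|baseDir="[^"]*"|baseDir="/opt/metacatalog/data/oai_dc"|' "$SOLR_HOME/conf/dc-import"
grep baseDir "$SOLR_HOME/conf/dc-import"
```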

2.2. Running the harvester and Solr import handlers

The metadata harvesting process can be set up to run once (and repeated manually as required) or as a scheduled task (using the scheduled thread pool server).

  • To run the “run once” harvester, use the following command:

    java -jar metacatalog-0.0.1-SNAPSHOT.jar runOnce

  • To run the “scheduled thread pool server”, use the following command:

    java -jar metacatalog-0.0.1-SNAPSHOT.jar

  • The “scheduled thread pool server” uses two configuration files:
  • harvester.properties: located in the directory/package org.gbif.metacatalog.harvester. This file contains the following parameters:

    #number of threads in the pool
    threadpoolsize=4
    #frequency in seconds that the GBIF portal will be harvested
    portalharvesterpollingfrequency=360000
    #File system repository directory
    outputdirectory=/opt/metacatalog/data/
    #File system directory repository for the GBIF portal
    portaloutputdirectory=/opt/metacatalog/data/eml/
  • serverList.xml: located in the directory/package org.gbif.metacatalog.harvester. This file contains the list of OAI-PMH servers that will be harvested. The structure of each element in the server list is:

    <?xml version="1.0" encoding="UTF-8"?>
    <serverlist>
      <server>
        <organisationId>1</organisationId>
        <organisation>Secretariat..</organisation>
        <serverId>1111</serverId>
        <country>Burkina Faso</country>
        <serverUrl>http://gbif.spconedd.org/oai2.php</serverUrl>
        <metadataPrefix>oai_dc</metadataPrefix>
        <pollingFrequency>360</pollingFrequency>
        <lastDatePolling></lastDatePolling>
      </server>
      …
    </serverlist>
  • organisationId: unique identifier for the organisation contact (note: this is not used in the current implementation).
  • serverId: unique identifier for the source; this identifier is used to create the directory where all documents obtained from this source are stored (see Figure 6).
  • country: the country of the organisation hosting the server.
  • serverUrl: OAI-PMH service URL.
  • metadataPrefix: metadata prefix requested from the server via the OAI-PMH protocol; the harvest is repeated for each required metadata format the server makes available.
  • pollingFrequency: frequency in seconds at which the server will be harvested; the thread assigned to the server is executed every pollingFrequency seconds.
  • lastDatePolling: used by the harvester to store the date on which the source was last harvested; on the next harvest this value is used to fetch only the data changed since the last run.
  • Note: it is highly recommended to use the operating system's scheduled task services; for example, the “run once” harvester can be used for incremental harvesting with a crontab entry such as the following (which runs daily at midnight):

    0 0 * * * java -jar metacatalog-0.0.1-SNAPSHOT.jar runOnce
  • The harvester invokes the Solr import handlers after each run. If you wish to invoke the handlers manually, you can use the Solr administration web interface or any other tool able to invoke HTTP URLs (curl, for example). The administration web interface can be opened in a regular web browser; for example, the URL http://solrserver:8080/solr/admin/dataimport.jsp?handler=/emlimport displays a page that lets the user execute several handler commands: full-import, debug, delta-import, commit, and running the handler in verbose mode (see the Solr documentation for more information).
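As a sketch, the manual-invocation URL mentioned above can be built and fired from the command line. The host name solrserver, port 8080, and the /emlimport handler name are the document's example values and must be adapted to the actual deployment; the curl call is left as a comment because it requires a running Solr instance.

```shell
# Build the data import handler URL from the document's example values
# ("solrserver", port 8080, handler /emlimport) -- adjust for your deployment.
SOLR_BASE="http://solrserver:8080/solr"
IMPORT_URL="$SOLR_BASE/admin/dataimport.jsp?handler=/emlimport"
echo "$IMPORT_URL"
# To invoke it non-interactively against a running Solr:
#   curl "$IMPORT_URL"
```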