DwCArchive  
The Darwin Core Archive format is a simple and extensible schema for sharing biodiversity data
Updated Nov 7, 2011 by wixner@gmail.com

This page has been published but is subject to revision.

The GBIF Global Names Architecture Format: Darwin Core Checklist Archives

GBIF has developed a data exchange format for publishing annotated checklist data based on the ratified Darwin Core terms and the Darwin Core text guidelines. This approach provides a solution that is simple and extensible. It conforms to a table-based, "spreadsheet-style" format that is comfortable and familiar to biologists. It uses plain text files, but it is tied to processes that support consistency and stability.

The GBIF GNA format consists of a set of files in which one (or more) files represent the 'core' taxonomic data, with a single row representing a single taxon reference. The DarwinCore Taxon class provides the majority of concepts supported in the format; these allow taxonomic and nomenclatural semantics and syntax (classification, taxonomic and nomenclatural synonymy, status, etc.) to be expressed.

Other files represent "extensions" to this core table and allow additional data elements to be linked to a taxon in the core table with a many to one relationship. The overall topology of one or more of these extensions to the core table is referred to as a "star schema" and provides a compromise between an overly simple flat-file representation of data and more complex multi-related files. In addition to these files, an additional descriptor file serves as a key to the other files. Collectively, these files can be further zipped into a single compressed archive file for portability. This compressed file is known as a Darwin Core Archive (DwCA) file.

As an example, we have taken the German Standard List and converted it into a Darwin Core Archive.

A more detailed description of the Darwin Core Archive Components is listed below.

For supporting information see the TDWG DwC text guidelines.

For general information about Darwin Core Archives and extensions please read the introduction on the GBIF Communications site.

Core taxon file (taxa.txt)

One or more (typically one) files serve as the core data file. The contents of this file are tabular, with a single line representing a row and a single row corresponding to a distinct taxon in the source. In DarwinCore terms, the more precise concept is a "taxon name usage", as a line may represent a taxon reference that is no longer considered to be the accepted name for a taxon but is instead considered a synonym according to the source.

A minimal requirement is that each row contains a unique taxonID. Typically this ID is derived from identifiers in the source checklist dataset that is being published. Additional elements, described in the core taxon description, are used to define taxonomic and nomenclatural information such as the referenced scientific name, higher taxonomy information, taxonomic status, etc. See the list of supported GNA core taxon terms.
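The uniqueness requirement can be checked before publishing. The following is a minimal sketch, assuming a tab-delimited file without quoting whose first column holds the taxonID; the sample rows and the function name `duplicate_taxon_ids` are hypothetical and not part of the format.

```python
import csv
import io
from collections import Counter

# Hypothetical two-column sample; a real taxa.txt would carry more columns.
SAMPLE_TAXA = "1\tAbies alba\n2\tAcer campestre\n2\tAcer platanoides\n"


def duplicate_taxon_ids(text, id_index=0):
    """Return the taxonIDs that occur more than once in tab-delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row[id_index] for row in reader if row)
    return sorted(taxon_id for taxon_id, n in counts.items() if n > 1)


print(duplicate_taxon_ids(SAMPLE_TAXA))
```

An empty result means every row carries a distinct taxonID.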

This file can be formatted and visualised as a typical spreadsheet. It is intended to be readable and manageable by someone familiar with using spreadsheets.

No other files are needed, but additional optional information can be shared by using one or more extension data files.

Extension file(s)

Extension files are also simple text files that can be visualised as a spreadsheet. They are tied to the core taxon file through a copy of the taxonID used in the core taxon file, repeated once for each row in the extension file in a manner similar to foreign keys in a relational database. An extension file may include Darwin Core terms as well as terms defined through other means.

The use of extension files allows checklist information to be represented in a one-to-many relationship between the core taxon file and the extension. For example, a taxon may be known by multiple vernacular names in different languages. An extension describing vernacular names could contain multiple rows that each describe one vernacular name and that use the taxonID from the core taxon file to reference the same taxon in the core file.

The use of a core taxon file and one or more extensions that link to the core file is referred to as a star schema.
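The one-to-many link between core and extension can be pictured as a simple grouping by taxonID. The sketch below is illustrative only: it assumes tab-delimited text with the taxonID in column 0 of both files, and the sample rows and the function names `read_rows` and `attach_extension` are made up for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: column 0 of each file is the taxonID.
TAXA = "t1\tBellis perennis\nt2\tQuercus robur\n"
VERNACULAR = "t1\tGänseblümchen\tde\nt1\tdaisy\ten\nt2\tStieleiche\tde\n"


def read_rows(text):
    """Parse tab-delimited text into a list of rows, skipping empty lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def attach_extension(core_text, ext_text):
    """Group extension rows under the core row that shares their taxonID."""
    by_taxon = defaultdict(list)
    for row in read_rows(ext_text):
        by_taxon[row[0]].append(row[1:])
    return {
        row[0]: {"core": row[1:], "extension": by_taxon.get(row[0], [])}
        for row in read_rows(core_text)
    }


for taxon_id, record in attach_extension(TAXA, VERNACULAR).items():
    print(taxon_id, record["core"], record["extension"])
```

Each core row ends up with zero or more extension rows, which is the "star" shape with the core file at the centre.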

Archive Descriptor (meta.xml)

The DwC Archive format relies on a special file, called meta.xml, that is used as a map to describe the core taxon file and any extensions. Each field or column is identified and described so that the star schema can be interpreted. The meta.xml descriptor of the GermanSL example from above looks like this:

<archive xmlns="http://rs.tdwg.org/dwc/text/" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://rs.tdwg.org/dwc/text/   http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd">

  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy='' ignoreHeaderLines="0" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files>
      <location>taxa.txt</location>
    </files>
    <id index="0" />
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/taxonomicStatus"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsageID"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsage"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/nameAccordingTo"/>
    <field default="ICBN" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
  </core>

  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy='' ignoreHeaderLines="0" rowType="http://rs.gbif.org/terms/1.0/Distribution">
    <files>
      <location>distribution.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceStatus"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/locationID"/>
    <field default="DE" term="http://rs.tdwg.org/dwc/terms/country"/>
  </extension>

  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy='' ignoreHeaderLines="0" rowType="http://rs.gbif.org/terms/1.0/VernacularName">
    <files>
      <location>vernacular.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
    <field index="2" term="http://purl.org/dc/terms/language"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/locality"/>
  </extension>
</archive>
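To illustrate how software might read such a descriptor, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It uses a shortened copy of the core section shown above; the function name `describe_core` and the shape of its return value are assumptions made for this example, not part of any GBIF library.

```python
import xml.etree.ElementTree as ET

# Namespace of the descriptor, taken from the xmlns attribute of <archive>.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Shortened copy of the descriptor above (core only, three fields).
META = r"""<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" ignoreHeaderLines="0" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxa.txt</location></files>
    <id index="0" />
    <field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/taxonomicStatus"/>
    <field default="ICBN" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
  </core>
</archive>"""


def describe_core(meta_xml):
    """Return (file location, id column index, [(term name, index, default)])."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwc:core", NS)
    location = core.find("dwc:files/dwc:location", NS).text
    id_index = int(core.find("dwc:id", NS).get("index"))
    fields = [
        # Keep only the last path segment of the term URI for readability.
        (f.get("term").rsplit("/", 1)[-1], f.get("index"), f.get("default"))
        for f in core.findall("dwc:field", NS)
    ]
    return location, id_index, fields


location, id_index, fields = describe_core(META)
print(location, id_index)
for name, index, default in fields:
    print(name, index, default)
```

Fields without an index attribute (here nomenclaturalCode) come back with `None` as index and only a default value.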

Default mappings

Some terms in a dataset may always be the same. For example in a list of plants, all records share the same Kingdom ("Plantae"). The GNA format allows these data to be published as repeated data in the core data record itself or the value can be declared just once.

You can do this by using a default attribute for a field in the meta.xml. For example, if all your records are in the kingdom Plantae, you can do this:

<field default="Plantae" term="http://rs.tdwg.org/dwc/terms/kingdom"/>

As the word "default" suggests, a field with a default value can also be mapped to a column at the same time. In this case the default value is only applied when the mapped column contains no data. For example:

<field index="3" default="Plantae" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
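The fallback rule can be sketched in a few lines. This is an illustration rather than the behaviour of any particular reader: the function name `field_value` is hypothetical, and treating an empty string or a missing column as "no data" is an assumption.

```python
def field_value(row, index=None, default=None):
    """Resolve one mapped field for a row given as a list of column strings.

    index   -- column position from the field's index attribute, or None
    default -- value of the field's default attribute, or None
    """
    if index is not None and index < len(row):
        value = row[index]
    else:
        value = ""
    # Assumption: an empty string counts as "no data".
    return value if value != "" else default


row_with_kingdom = ["1", "Amanita muscaria", "accepted", "Fungi"]
row_without_kingdom = ["2", "Bellis perennis", "accepted", ""]

print(field_value(row_with_kingdom, index=3, default="Plantae"))
print(field_value(row_without_kingdom, index=3, default="Plantae"))
print(field_value(row_without_kingdom, default="Plantae"))
```

The first call keeps the value found in the column; the other two fall back to the declared default.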

Variables in static mappings

An advanced use of static mappings is to include variables in the default value. Any column, as well as the core record id, can be referenced by putting the column index, or "id" for the core record id, in curly brackets. For example, {id} stands for the record id and {4} for the value of the 4th column.

For example, this can be used in checklists to declare the link to the individual species page without the need to include the link in the data files:

<!-- a record id based link to the species page: -->
<field default="http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&amp;search_value={id}" term="http://purl.org/dc/terms/identifier"/>

<!-- a scientific name (column #2) based link to a species page: -->
<field default="http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=Scientific_Name&amp;search_value={2}" term="http://purl.org/dc/terms/identifier"/>
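A reader could expand such variables with a small substitution step. The sketch below is one possible reading, not a reference implementation: it assumes that a number in curly brackets is the same column index used in the field mappings (counted from 0, like the index attribute), it does no URL encoding, and the function name `expand_default` and the sample row are hypothetical.

```python
import re

# Matches {id} or a column number such as {2}.
_VARIABLE = re.compile(r"\{(id|\d+)\}")


def expand_default(template, row, id_index=0):
    """Replace {id} and {n} in a default value with values from the row."""

    def replace(match):
        name = match.group(1)
        if name == "id":
            return row[id_index]
        # Assumption: n is the column index as used in <field index="n">.
        return row[int(name)]

    return _VARIABLE.sub(replace, template)


# Hypothetical row: id in column 0, scientific name in column 2.
row = ["12345", "", "Bellis perennis"]
base = "http://www.itis.gov/servlet/SingleRpt/SingleRpt"

print(expand_default(base + "?search_topic=TSN&search_value={id}", row))
print(expand_default(base + "?search_topic=Scientific_Name&search_value={2}", row))
```

Note that the ampersand appears unescaped here because these are Python strings; inside meta.xml it has to be written as `&amp;`.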

Validation

The meta.xml file needs to comply with the corresponding XML schema. If you hand-code a meta.xml or write your own software, please test the XML and make sure it validates!

The easiest way is to use our online service to validate against the latest dwca schema: http://tools.gbif.org/dwca-validator/
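Before submitting a file to a validator, a quick local well-formedness check can catch basic mistakes such as an unescaped ampersand in a URL default. The sketch below uses only Python's standard library and does not validate against the schema; the function name `is_well_formed` and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML. This is not schema validation."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# An ampersand inside an attribute value must be written as &amp;
GOOD = '<field default="http://example.org/?a=1&amp;b={id}" term="http://purl.org/dc/terms/identifier"/>'
BAD = '<field default="http://example.org/?a=1&b={id}" term="http://purl.org/dc/terms/identifier"/>'

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

A file that passes this check can still violate the schema, so the schema validation step is still needed.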

Dataset metadata

An additional file describing the dataset as a whole, not pictured in the diagram above, may also be included in a Darwin Core archive, or optionally referenced in the meta file by a URI. This file provides information about the entire published checklist, such as its title, authors, web and publication information, etc.

This information is stored as an XML file, either in the Ecological Metadata Language (EML) or as simple Dublin Core (DC), and is referenced via the metadata attribute of the archive element, e.g. <archive metadata="eml.xml">.
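Looking up that reference takes one attribute read. A minimal sketch with Python's standard library follows; the function name `metadata_location` is hypothetical.

```python
import xml.etree.ElementTree as ET


def metadata_location(meta_xml):
    """Return the value of the archive element's metadata attribute, or None."""
    return ET.fromstring(meta_xml).get("metadata")


WITH_METADATA = '<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml"/>'
WITHOUT_METADATA = '<archive xmlns="http://rs.tdwg.org/dwc/text/"/>'

print(metadata_location(WITH_METADATA))
print(metadata_location(WITHOUT_METADATA))
```

The returned value may be a file name inside the archive or a URI, so a reader has to handle both cases.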

GBIF has defined a subset of EML known as the GBIF metadata profile. Please see http://rs.gbif.org/schema/eml/ for an XML schema or an example file covering all possible options in this profile.

Alternatively, you can use Dublin Core for the basic dataset metadata. An example Dublin Core file is available through the dwc archive reader project.

Comment by pmurray....@gmail.com, Jul 20, 2011

The schema file at http://darwincore.googlecode.com/svn/trunk/text/tdwg_dwc_text.xsd does not seem to be valid. It is missing a declaration of the "arch" namespace prefix, specifically:

xmlns:arch='http://rs.tdwg.org/dwc/text/'

should be included in the outermost xs:schema element.

