CoL_Comparison
Comparison of Catalogue of Life standard dataset to GNA DwC format
This page has been published but is subject to revision.
Mapping the GBIF Global Names Architecture data standards to the Species 2000 Standard Dataset Version 3.2 (December 2004)

Species 2000 has defined 10 field groups as the standard set of data for each species (or infraspecies) contained within the Catalogue of Life Annual Checklist. These field groups are defined in a document on the Catalogue of Life website.

The Global Biodiversity Information Facility (GBIF), the Encyclopedia of Life (EoL), and other initiatives with an interest in utilising taxonomic data have sought to develop common infrastructure for discovering, publishing, and exchanging taxonomic data. This common infrastructure is called the Global Names Architecture (GNA). One aspect of this development has been the creation of common data exchange formats that are relatively simple to define, understand, and utilise. GBIF has focused on expanding the scope of the Darwin Core vocabulary to include taxonomic data properties, and on extending the core taxon class contained in Darwin Core with extensions that are defined according to the Darwin Core text guidelines. We refer to this set of terms and extensions as the GBIF GNA format.

This document reviews the ten field groups and reconciles them to the GBIF GNA format that is currently documented at http://spreadsheets.google.com/pub?key=r4I1G8E7mDIgY_kt9Rxyc8A&output=html

Each field group is referenced

Accepted Scientific Name

The focus of this field group in the Catalogue of Life is on the designation of species and infraspecies. The two main components of the group are the parsed components of the taxon name and the designation of the taxonomic status of the name. An example record is provided as:
This field group is treated entirely within the core Taxon table of the GNA specification. The scientific name elements of species and infraspecies can be published in one of several ways: as an unparsed, partially parsed, or completely parsed name, using the following terms:
dwc:taxonomicStatus

There is no direct transformation of the NameStatus term. Instead it is divided into two terms within the GNA standard: dwc:taxonomicStatus and dwc:nomenclaturalStatus. The best fit for the Name Status element is the dwc:taxonomicStatus term. A recommended vocabulary for use with this term is in draft form on the GBIF vocabulary server at http://vocabularies.gbif.org/en/vocabularies/taxonomic_status. At the time of writing, the term "provisionally accepted name" is not a member of this vocabulary.

dwc:namePublishedIn

The Reference property in the standard dataset allows for a single bibliographic citation regarding the relevant nomenclatural act that resulted in the name, as a Nomenclatural Reference. This is best accommodated using the dwc:namePublishedIn term.

GNA Reference Extension to Darwin Core Taxon

An accepted name may be tied to one or more references that accept this species in the same taxonomic status and with the same name. This is best achieved in the GNA standard using the GNA Reference Extension. This extension is composed of Dublin Core parsed and unparsed citation terms. The extension allows one or more references to be tied to the accepted scientific name Taxon record.

Core Taxon
Reference Extension
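As a rough illustration of how a core taxon record and its Reference Extension rows relate, the following Python sketch holds both as lists of dictionaries and groups the references by taxon. It is a sketch under stated assumptions, not the GNA specification: the row identifier key "taxonID", the extension key "bibliographicCitation", the status value "accepted", and the placeholder citation strings are all assumptions made for illustration. The species name is the one used in the classification example in this document.

```python
from collections import defaultdict

# Core taxon table: one row per name. "taxonID" is an assumed row identifier;
# the namePublishedIn value is a placeholder, not a real citation.
core_taxa = [
    {
        "taxonID": "t1",
        "scientificName": "Phyllona carnea",
        "taxonomicStatus": "accepted",  # assumed vocabulary value
        "namePublishedIn": "<unparsed nomenclatural reference>",
    },
]

# Reference extension: zero or more rows per core record, linked by taxonID.
# "bibliographicCitation" is an assumed key for an unparsed citation.
reference_rows = [
    {"taxonID": "t1", "bibliographicCitation": "<taxonomic reference 1>"},
    {"taxonID": "t1", "bibliographicCitation": "<taxonomic reference 2>"},
]


def references_by_taxon(rows):
    """Group extension citations under the core record they point to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["taxonID"]].append(row["bibliographicCitation"])
    return dict(grouped)


refs = references_by_taxon(reference_rows)
for taxon in core_taxa:
    linked = refs.get(taxon["taxonID"], [])
    print(taxon["scientificName"], "has", len(linked), "taxonomic references")
```

The single nomenclatural citation stays on the core row, while any number of taxonomic references live in the extension, which matches the one-to-many relationship described above.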
Limitations with GNA standard relative to Accepted Scientific Name

The standard dataset allows a parsed infraspecies name to be formatted with both specific and infraspecific authorship (e.g. "Agalinus paupercula (Gray) Britton var. borealis (Pennell) Deam"). This form is not compliant with the botanical code of nomenclature. The GNA standard allows this form to be published using the scientificName term as a complete, unparsed name string, but the atomised form lacks a term for recording separate authorship for both species and infraspecies.

Limitations with Species 2000 Standard Dataset relative to Accepted Scientific Name

The GNA terms and extensions provide flexibility in expressing and distinguishing both nomenclatural and taxonomic status information regarding a taxonomic reference (both for accepted names and synonyms) via the taxonomicStatus and nomenclaturalStatus terms and accompanying vocabularies.

Synonyms

Synonyms are treated within the Standard Dataset in a similar manner to accepted names, with the exception of a change in the values of the Name Status term. Therefore, the treatment of synonyms is exactly the same as above except for the following differences.

Assignment of "Name Status" to dwc:taxonomicStatus and dwc:nomenclaturalStatus

The Standard Dataset identifies three values for synonyms within the Name Status term: Unambiguous Synonym, Ambiguous Synonym, and Misapplied name. The first two values are unique to the Catalogue of Life and are not terms in the current GNA taxonomicStatus vocabulary. The term "Ambiguous synonym" may refer to either a homonym (a nomenclatural value within dwc:nomenclaturalStatus) or a pro-parte synonym (a taxonomic value within dwc:taxonomicStatus). The GNA standard divides a name status into taxonomic and nomenclatural status to provide more detail and precision. For example, a single status property requires an implicit nomenclatural status of valid/available for all taxonomically accepted names. Implicitly this is true, but in reality it may not be affirmed or known. Likewise, the nomenclatural status of a name may be known whereas its taxonomic status may remain unknown.

Reference to accepted taxon via dwc:acceptedScientificNameID or dwc:acceptedScientificName

The GNA format allows references to accepted taxa and synonyms to be published in a single table where each taxon reference occupies one row. Synonyms are identified by their taxonomic status value and point to the row of the accepted taxon. That reference can be made explicitly by row identifier using dwc:acceptedScientificNameID, or implicitly via the dwc:acceptedScientificName value.

Common Name(s)

The GNA format covers all standard dataset common name terms. In the Standard Dataset a vernacular record is tied to the accepted taxon record by implication. In the GNA format a vernacular name may be linked to any row in the core taxon table (synonym or accepted name).
The vernacular extension drafted by GBIF for use in the GNA contains matching data elements for all these terms and extends the vernacular name record with many additional terms, including regional use of the name and reference to gender, body parts, and life stages. The two standards differ in the manner in which vernacular references are cited, but both allow citation of vernacular sources. The GNA format allows references to be cited via the dc:source property of the Vernacular Extension or via the GNA Reference Extension.
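The single-table treatment of accepted names and synonyms, and the link from a vernacular record to any core row, can be sketched in Python as below. This is a minimal sketch under assumptions: "taxonID" is an assumed row identifier, the status values "accepted" and "synonym" are assumed vocabulary values, the vernacular keys are illustrative, and the names "Aus bus", "Aus cus" and "Xus bus" are invented placeholders rather than real taxa. The implicit, name-based lookup also assumes that scientific names are unique within the table.

```python
# Core taxon table holding accepted names and synonyms side by side.
core = [
    {"taxonID": "t1", "scientificName": "Aus bus",
     "taxonomicStatus": "accepted"},
    # Explicit reference: points to the accepted row by identifier.
    {"taxonID": "t2", "scientificName": "Aus cus",
     "taxonomicStatus": "synonym", "acceptedScientificNameID": "t1"},
    # Implicit reference: points to the accepted row by its name string.
    {"taxonID": "t3", "scientificName": "Xus bus",
     "taxonomicStatus": "synonym", "acceptedScientificName": "Aus bus"},
]

# Vernacular extension row linked to a synonym row, which the GNA format allows.
vernacular = [
    {"taxonID": "t2", "vernacularName": "example common name", "language": "en"},
]


def resolve_accepted(row, rows):
    """Return the accepted row for `row`, or None if the reference dangles."""
    by_id = {r["taxonID"]: r for r in rows}
    by_name = {r["scientificName"]: r for r in rows}
    if "acceptedScientificNameID" in row:
        return by_id.get(row["acceptedScientificNameID"])
    if "acceptedScientificName" in row:
        return by_name.get(row["acceptedScientificName"])
    return row  # no reference given: the row stands for itself


def accepted_name_for_vernacular(vrow, rows):
    """Follow a vernacular row to its core row, then on to the accepted name."""
    by_id = {r["taxonID"]: r for r in rows}
    accepted = resolve_accepted(by_id[vrow["taxonID"]], rows)
    return accepted["scientificName"] if accepted else None


for v in vernacular:
    print(v["vernacularName"], "->", accepted_name_for_vernacular(v, core))
```

Resolving by identifier is the more robust of the two routes, since the name-based route breaks as soon as two rows share a name string.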
Latest Taxonomic Scrutiny

The Standard Dataset uses this group to convey information regarding the most recent review of the data record by a taxonomic expert.
In the GNA Standard this field group can be accommodated in the core taxon record in the following manner.
The information in this field group relates to the source databases (Global Species Datasets), not to taxon records, and therefore corresponds to dataset metadata in GNA terms. Current metadata standards for datasets are based on the Ecological Metadata Language (EML). All of the Standard Dataset Source Database fields are accommodated in the partial EML dataset currently utilised within version 1 of the Integrated Publishing Toolkit (IPT).

Additional Data

This field group is a single free-text value provided as a placeholder for additional information the data publisher wishes to provide. Examples include specimen data, habit or life form, ecology, etc. The ability to define and publish extensions to the Darwin Core core taxon properties provides a more concise solution, both in principle and in the current extensions drafted for use in the Global Names Architecture. Specimen data, for example, is treated in a much more explicit manner with the Types and Specimens extension, which utilises the full suite of Darwin Core terms. Additional extensions that might better serve the future needs of the Catalogue of Life can be drafted and circulated for comment and refinement using the GBIF vocabularies and extensions server.

Family Name and Field Group: Classification Above Family

These groups are treated collectively within the GNA format as they have no logical distinction within the Darwin Core taxon class. Taxonomic hierarchy may be represented in more than one way using Darwin Core: normalised, using explicit or implicit references to a higher taxon, or denormalised, using references to individual higher taxon terms.

KingdomName  PhylumName  ClassName      OrderName  Family      Species
Plantae      Rhodophyta  Rhodophycease  Bangiales  Bangiaceae  Phyllona carnea

To represent this example using a normalised foreign key approach using Darwin Core
To represent this example using a higher taxon reference using Darwin Core
To represent the example using a denormalised form of Darwin Core
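The three representations described above can be sketched in Python using the example classification from this section. This is an illustration only, not the GNA term list: the keys "taxonID", "parentID" and "parentName" are hypothetical names chosen for the sketch, while "taxonRank" and the rank keys of the denormalised row (kingdom, phylum, class, order, family) follow Darwin Core term names. The taxon names are copied verbatim from the example.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "species"]
NAMES = ["Plantae", "Rhodophyta", "Rhodophycease", "Bangiales",
         "Bangiaceae", "Phyllona carnea"]

# 1. Normalised, explicit reference: each row points to its parent row by ID.
by_id = [
    {"taxonID": str(i), "scientificName": name, "taxonRank": rank,
     "parentID": str(i - 1) if i > 1 else None}
    for i, (rank, name) in enumerate(zip(RANKS, NAMES), start=1)
]

# 2. Normalised, implicit reference: each row names its parent as a string.
by_name = [
    {"scientificName": name, "taxonRank": rank,
     "parentName": NAMES[i - 1] if i > 0 else None}
    for i, (rank, name) in enumerate(zip(RANKS, NAMES))
]

# 3. Denormalised: one row carries every higher taxon in its own column.
denormalised = {
    "scientificName": "Phyllona carnea",
    "kingdom": "Plantae",
    "phylum": "Rhodophyta",
    "class": "Rhodophycease",
    "order": "Bangiales",
    "family": "Bangiaceae",
}


def flatten(rows, taxon_id):
    """Walk the parent chain of one row and build its denormalised form."""
    index = {r["taxonID"]: r for r in rows}
    current = index[taxon_id]
    flat = {"scientificName": current["scientificName"]}
    node = index.get(current["parentID"])
    while node is not None:
        flat[node["taxonRank"]] = node["scientificName"]
        node = index.get(node["parentID"])
    return flat


# The species row of the normalised table flattens to the denormalised row.
print(flatten(by_id, "6") == denormalised)
```

The normalised forms store each higher taxon once and let it carry its own properties, whereas the denormalised form is simpler to publish from a flat spreadsheet.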
Distribution

This field group contains two properties, Country and OccurenceStatus, with recommended vocabularies for each. The GBIF GNA extension for Distribution contains both of these elements in addition to several others with similar recommended vocabularies.

Standard Dataset Example
GBIF Distribution GNA Extension
References

In the Standard Dataset, parsed references can be linked to accepted names, synonyms, and common names. The GBIF GNA schema describes a Reference extension to the Darwin Core that extends the core taxon class, so that the extension applies to accepted names and synonyms. Vernacular name sources are cited using this extension via the linked scientific name, or via the dc:source property within the vernacular extension itself.

The Standard Dataset contains a ReferenceType property that can qualify a reference as a Nomenclatural Reference, which refers to the original nomenclatural event; a Taxonomic Reference, which refers to sources that accept the name with the same taxonomic status; or a common name reference, which refers to the use of the common name. The GBIF GNA schema utilises the dwc:namePublishedIn term in the core taxon table to store an unparsed reference corresponding to the Nomenclatural Reference, linking to the primary source of the original name or combination. The Reference extension is used for additional and subsequent taxonomic references to the name that form the bibliography of the materials used in the taxonomic review of the species.
The same information formatted to the GNA Reference extension to the Darwin Core
The dwc:taxonRemarks property can serve the same purpose as ReferenceType in the Standard Dataset.

Summary

The Catalogue of Life Standard Dataset version 3.2 is highly compatible with the GBIF Global Names Architecture data exchange standard. Differences are relatively minor and there is very little loss of fidelity or detail in moving from the Standard Dataset to the GNA format. The extensibility of the GNA format and the simple text-based star schema provide flexibility and simplicity for data publishers.