My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
DataTransformations  
Documenting data transformations
Updated Jul 21, 2009 by dprem...@gmail.com

Introduction

In many cases the soruce data does not easily map to given darwin core terms or defined extensions. In order to support these data sources, one usually has to do some data transformations which can be reduced to a certain number of common patterns or transformation types, that are applied in different scenarios.

Looking at the IPT data flow, transformations have to happen before data is mapped onto the final darwin core. To provide context, a few examples might include:

Use Cases

  • A data source has common names separated into separate columns, one for English, one for French, one for German. To transform the data to the target schema, these three columns would be transformed to three rows, with one column containing the common name and one column containing the language reference.
  • A data source as latitude and longitude separated into a degree column, a minutes column, and a second column. To transform the data to the target schema would require concatenization into latitude and longitude columns
  • A classification is expressed in a series of columns with kingdom, phylum, class, order, etc repeated for every row down to species. To transform the data to the target schema would require converting this denormalized form to a normal “parent-child” adjacency list.

Typical Input Formats

Classifications or taxonomic hierarchy

Classifications place taxa into a unified hierarchical structure. Each taxon has only a single parent and a given taxon can only appear once in a single classification tree. Taxonomic groups may or may not be labeled with taxonomic ranks (“Kingdom,”Phylum,””Class,” etc.) Taxonomic hierarchies are typically represented in a normalized or denormalized form. The objective is to transform source formats into a normalized form.

Normalized Explicit

A normalized taxonomic hierarchy is expressed as an adjacency table where each taxon is expressed as a row and a reference to a parent taxon is made in that table. In this format, taxonomic rank is recorded in a single column. Taxa may be referenced explicitly through numeric identifiers or through inferred by repeated reference to the taxon name string. The example below shows both representations. The first version of the table (first three columns + ParentID) is the target format. The purpose of the tool is to enable the transformation of the other formats detailed in this section to that format. A variation of this format uses a repeated test name for as a reference to the parent. Transforming a fulltext “Parent Name” requires matching it to the parent “Taxon” string and referencing the “ID” of the parent row.

Target Format: Adjacency Table

ID Taxon Rank Parent ID
1 Animalia Kingdom 0
2 Chordata Phylum 1
3 Arthropoda Phylum 1
4 Mammalia Class 2
5 Insecta Class 3
6 Carnivora Order 4
7 Felidae Family 6
8 Panthera tigris species 7
9 Panthera leo species 7
10 Acinonyx jubatus species 7
11 Panthera pardus species 7
12 Hymenoptera Order 3
13 Apidae Family 12
14 Apis mellifera species 13

Normalized Implicit

This format requires access to an external source of Taxonomic Rank sequence information to transform to an adjacency list.

Rank Taxon Status
Kingdom Animalia valid
Phylum Chordata valid
Class Mammalia valid
Order Carnivora valid
Family Felidae valid
Species Panthera tigris valid
Species Panthera leo valid
Species Felis leo invalid
Species Acinonyx jubatus valid
Species Panthera pardus valid
Phylum Arthropoda valid
Class Insecta valid
Order Hymenoptera valid
Family Apidae valid
Species Apis melifera valid

Denormalized Explicit

In a de-normalized hierarchy, each row refers to a distinct member of the classification. The hierarchy itself is represented across multiple columns with the most specific rank representing the terminal taxon. The terminal taxon may be a concatenated species name as in the example here or it may be a fully atomized taxon name (see Denormalized and atomized polynomial names) Rank names are identified via the first row of the table. Higher groups are repeated for each instance that cites a child member. Transformation requires identifying all columns that represent the taxonomic hierarchy and building an adjacency table from the structure that terminates with the terminal taxon in each row. Local integer identifiers will need to be created to build the adjacency list as the higher taxa will have no source identifiers.

Kingdom Phylum Class Order Family Species
Animalia Chordata Mammalia Carnivora Felidae Panthera tigris
Animalia Chordata Mammalia Carnivora Felidae Panthera leo
Animalia Chordata Mammalia Carnivora Felidae Acinonyx jubatus
Animalia Chordata Mammalia Carnivora Felidae Panthera pardus
Animalia Arthropoda Insecta Hymenoptera Apidae Apis melifera

Denormalized Implicit

This format falls between normal and denormalized formats. One row equals one taxon so that all taxa can support additional attributes yet the hierarchy is spread across multiple columns.

Kingdom Phylum Class Order Family Species
Animalia
Chordata
Mammalia
Carnivora
Felidae
Panthera tigris
Panthera leo
Rodentia Muridae Mus musculus
Castoridae Castor canadensis
Arthropoda
Insecta
Hymenoptera
Apidae
Apis melifera

Common names

Common (or vernacular) names are generally treated separately within checklist data as they have no formal structure and carry different associated attributes than scientific names. The different formats described here are related to either cardinality differences between common name and the associated taxon or the common name and associated language.

The target format is a separate file that is related to a master Taxon table via a primary key that identifies a single row in the Taxon table. This identifer is repeated one to many times within the Common Names output where each row is a single taxon name usage in specific language and optional regional Area. Some sources that focus primarily on cataloging common names may require a TaxonOrSynonym table to be derived from the source in order to meet the requirements of the schema.

Denormalized Taxon Names/Normalized Common Names

This format supports additional attributes to qualify a vernacular name such as location, preference, and source citation. Variations on this format may cite the associated scientific name via a parent or accepted name identifier occupying a different row. The tool will have to match the Taxon name in the vernacular list to an associated taxon in a distinct taxon list.

Taxon Vernacular Name Language
Abies alba jedle bělokorá Czech
Accipiter cooperii Cooper’s Hawk English
Accipiter cooperii Rundschwanzsperber German
Acris crepitans Northern Cricket Frog English
Pomatomus saltatrix Tassergal French

De-normalized common names – grouped by language

This format contains a single row per taxon and one or more columns each defined by a different language. Within each column are one or more common names separated by a standard delimiter, generally a comma. Thus, each language column needs to be unpacked and parsed out to a more normal form and taxa matched to a distinct list.

Taxon English German
Abies alba Silver fir, Christmas tree Edeltanne, Weisstanne, Pechbaum
Accipiter cooperii Cooper’s Hawk, Hen hawk, quail hawk, swift hawk Rundschwanzsperber
Pomatomus saltatrix Bluefish, snapper blue, chopper, Blaufisch, Blaubarsch

De-normalized common names – single column for all languages

Probably the least commonly encountered format. This requires a series of parsing to separate the name and country elements.

Taxon Common names
Abies alba Silver fir (eng), Christmas tree (eng), Edeltanne (ger), Weisstanne (ger), Pechbaum (ger)
Accipiter cooperii Cooper’s Hawk (eng), Hen hawk (eng), quail hawk(eng), swift hawk (eng), Rundschwanzsperber(ger)

Synonyms/ References to Accepted Names

Synonyms are a class of taxonomic name record references that are related to another Taxon Name record. This relationship, and the nature of the synonymy relationship are represented in a few different ways. The semantic component is via mapping the primary key of the taxon row to an acceptedParentID. The syntactic components that qualify the synonym type are via the various Status vocabularies.

Accepted Name by Foreign Key Reference

This is the target format as per the schema.

idTaxonRankparentIDacceptedID
1AnimaliaKingdom 1
2ArthropodaPhylum12
3InsectaClass23
4LepidopteraOrder33
5NoctuidaeFamily45
6Noctua pronuba LinnaeusSpecies56
7Noctua atlantica WarranSpecies 6
8Noctua comes HübnerSpecies 6
7Orthopleura plectaSpecies67

Accepted Name by Taxon Name Reference

This format uses the Taxon name as a foreign key to the accepted name instead of a primary key identifier. The tool will match the text names and provide feedback on negative matches.

id Taxon Rank parentID Accepted Name
1 Animalia Kingdom
2 Arthropoda Phylum 1
3 Insecta Class 2
4 Lepidoptera Order 3
5 Noctuidae Family 4
6 Noctua pronuba Linnaeus Species 5
7 Noctua atlantica Warran Species Noctua pronuba Linnaeus
8 Noctua comes Hübner Species Noctua pronuba Linnaeus
7 Orthopleura plecta Species 6

Synonym Reference merged with Higher Taxon Reference

This format combines the acceptedTaxonID with the higherTaxonID into a single column. The context of whether the taxon row references a synonym or a higher taxon is based on the taxonomic status of the name. This format was used in a list of genera output by Fauna Europaea.

Effective transformation would entail iterating through the rows and, based on the interpretation of the taxonomic status value (accepted | synonym) would move the parent taxon ID to a new acceptedTaxonID column.

id Taxon Rank StatusparentID
1AcerentomidaeFamilyAccepted0
2AcerentomonGenusAccepted1
3Acerentomon affine Bagnall 1912speciesAccepted2
4Acerentomon agrorum Womersley 1928 species Synonym3
5Acerentomon metarhinus Womersley 1928 species Synonym3
6Acerentomon oblongum Womersley 1927 species Synonym3
7Acerentomon propinquum Condé 1945 species Synonym3

Transformation Types

Pivots

Normalisation

ID Lookup

Term Lookup

This is already implemented in the current IPT when mapping controlled vocabularies. It would be desirable to allow term translations in any property mapping, regardless whether the term defines a controlled vocabulary or not.

Concatenation

Concatenating multiple columns and fixed strings.

Substring splitting

Splitting column values into substrings

IPT Implementation

  • on file level or rather as a temporary H2 database?
  • H2 would allow to do many of the transformations as simple SQL views, not producing any new files at all!
    1. initially import all files or external SQL sources into H2
    2. define h2 sql views
    3. run final import/dwc-a generation on intermediate h2 db

Sign in to add a comment
Powered by Google Project Hosting