|
DataTransformations
Documenting data transformations
IntroductionIn many cases the soruce data does not easily map to given darwin core terms or defined extensions. In order to support these data sources, one usually has to do some data transformations which can be reduced to a certain number of common patterns or transformation types, that are applied in different scenarios. Looking at the IPT data flow, transformations have to happen before data is mapped onto the final darwin core. To provide context, a few examples might include: Use Cases
Typical Input FormatsClassifications or taxonomic hierarchyClassifications place taxa into a unified hierarchical structure. Each taxon has only a single parent and a given taxon can only appear once in a single classification tree. Taxonomic groups may or may not be labeled with taxonomic ranks (“Kingdom,”Phylum,””Class,” etc.) Taxonomic hierarchies are typically represented in a normalized or denormalized form. The objective is to transform source formats into a normalized form. Normalized ExplicitA normalized taxonomic hierarchy is expressed as an adjacency table where each taxon is expressed as a row and a reference to a parent taxon is made in that table. In this format, taxonomic rank is recorded in a single column. Taxa may be referenced explicitly through numeric identifiers or through inferred by repeated reference to the taxon name string. The example below shows both representations. The first version of the table (first three columns + ParentID) is the target format. The purpose of the tool is to enable the transformation of the other formats detailed in this section to that format. A variation of this format uses a repeated test name for as a reference to the parent. Transforming a fulltext “Parent Name” requires matching it to the parent “Taxon” string and referencing the “ID” of the parent row. Target Format: Adjacency Table
Normalized ImplicitThis format requires access to an external source of Taxonomic Rank sequence information to transform to an adjacency list.
Denormalized ExplicitIn a de-normalized hierarchy, each row refers to a distinct member of the classification. The hierarchy itself is represented across multiple columns with the most specific rank representing the terminal taxon. The terminal taxon may be a concatenated species name as in the example here or it may be a fully atomized taxon name (see Denormalized and atomized polynomial names) Rank names are identified via the first row of the table. Higher groups are repeated for each instance that cites a child member. Transformation requires identifying all columns that represent the taxonomic hierarchy and building an adjacency table from the structure that terminates with the terminal taxon in each row. Local integer identifiers will need to be created to build the adjacency list as the higher taxa will have no source identifiers.
Denormalized ImplicitThis format falls between normal and denormalized formats. One row equals one taxon so that all taxa can support additional attributes yet the hierarchy is spread across multiple columns.
Common namesCommon (or vernacular) names are generally treated separately within checklist data as they have no formal structure and carry different associated attributes than scientific names. The different formats described here are related to either cardinality differences between common name and the associated taxon or the common name and associated language. The target format is a separate file that is related to a master Taxon table via a primary key that identifies a single row in the Taxon table. This identifer is repeated one to many times within the Common Names output where each row is a single taxon name usage in specific language and optional regional Area. Some sources that focus primarily on cataloging common names may require a TaxonOrSynonym table to be derived from the source in order to meet the requirements of the schema. Denormalized Taxon Names/Normalized Common NamesThis format supports additional attributes to qualify a vernacular name such as location, preference, and source citation. Variations on this format may cite the associated scientific name via a parent or accepted name identifier occupying a different row. The tool will have to match the Taxon name in the vernacular list to an associated taxon in a distinct taxon list.
De-normalized common names – grouped by languageThis format contains a single row per taxon and one or more columns each defined by a different language. Within each column are one or more common names separated by a standard delimiter, generally a comma. Thus, each language column needs to be unpacked and parsed out to a more normal form and taxa matched to a distinct list.
De-normalized common names – single column for all languagesProbably the least commonly encountered format. This requires a series of parsing to separate the name and country elements.
Synonyms/ References to Accepted NamesSynonyms are a class of taxonomic name record references that are related to another Taxon Name record. This relationship, and the nature of the synonymy relationship are represented in a few different ways. The semantic component is via mapping the primary key of the taxon row to an acceptedParentID. The syntactic components that qualify the synonym type are via the various Status vocabularies. Accepted Name by Foreign Key ReferenceThis is the target format as per the schema.
Accepted Name by Taxon Name ReferenceThis format uses the Taxon name as a foreign key to the accepted name instead of a primary key identifier. The tool will match the text names and provide feedback on negative matches.
Synonym Reference merged with Higher Taxon ReferenceThis format combines the acceptedTaxonID with the higherTaxonID into a single column. The context of whether the taxon row references a synonym or a higher taxon is based on the taxonomic status of the name. This format was used in a list of genera output by Fauna Europaea. Effective transformation would entail iterating through the rows and, based on the interpretation of the taxonomic status value (accepted | synonym) would move the parent taxon ID to a new acceptedTaxonID column.
Transformation TypesPivots
Normalisation
ID Lookup
Term Lookup
This is already implemented in the current IPT when mapping controlled vocabularies. It would be desirable to allow term translations in any property mapping, regardless whether the term defines a controlled vocabulary or not. Concatenation
Concatenating multiple columns and fixed strings. Substring splittingSplitting column values into substrings IPT Implementation
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||