NameNormalisation  
Processes for normalising names in Occurrence and Checklist Data
Updated Sep 3, 2009 by dprem...@gmail.com

This page is under construction and subject to significant revision.

Normalisation Processes

Taxon names are evaluated and processed at a number of different points in the Nub Building and Checklist Bank import processes.

  1. When inserting name records into Checklist Bank
  2. During name parsing processes where names are split into atomic parts
  3. When binding names in occurrence records to the synthesized "Nub" taxonomy

Normalising means making minor, consistent changes to name strings where the differences between them carry no information and would otherwise lead to unnecessary duplication of name records. Normalisation is critical when indexing multiple, independent sources: it makes it possible to compare them effectively and to identify references to the same name even where the orthography differs.

A simple example is the processing of extra "white space" in a taxon name.

Example | Raw name string | Normalised name string
White space processing (dashes stand for spaces in this example) | Abies- - - - -alba | Abies alba

Currently, three normalising routines are employed:

Basic: for most minor processing of Checklist Bank resources
  • Collapses whitespace to a single space. A genus+species combination with two spaces between the parts is considered identical to the same name with a single space.
  • Catches and fixes encoding problems, such as octal or hexadecimal encodings that inadvertently occur in the name.
  • Removes commas before the publication year in scientific name authorship. The texts "Linnaeus, 1768" and "Linnaeus 1768" are considered identical and are converted to the latter representation.
  • Standardises whitespace around hyphens and commas used in a name.
  • Normalises the use of whitespace around periods (".") and ampersands ("&"), inserting whitespace after the character if none exists.
    • Racosperma spirorbe subsp. solandri (Benth.)Pedley -> Racosperma spirorbe subsp. solandri (Benth.) Pedley
    • Racosperma spirorbe Benth&Pedley -> Racosperma spirorbe Benth & Pedley
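A few of the Basic steps can be sketched with regular expressions. This is a minimal illustration, not the parser's actual code: the function name `normalise_basic` and all patterns are assumptions. Inserting a space after ")" is inferred from the "(Benth.)Pedley" example above. The encoding fixes and the hyphen and comma spacing rules are left out because the page does not spell them out.

```python
import re

def normalise_basic(name: str) -> str:
    """Illustrative approximation of some Basic steps (hypothetical helper)."""
    s = name
    # Drop the comma before a four-digit year: "Linnaeus, 1768" -> "Linnaeus 1768"
    s = re.sub(r"\s*,\s*(?=\d{4}\b)", " ", s)
    # Put exactly one space on each side of an ampersand
    s = re.sub(r"\s*&\s*", " & ", s)
    # Insert a space after "." or ")" when a letter follows directly
    s = re.sub(r"([.)])(?=[^\W\d_])", r"\1 ", s)
    # Collapse runs of whitespace to a single space and trim the ends
    return re.sub(r"\s+", " ", s).strip()
```

Under these assumptions, `normalise_basic("Racosperma spirorbe Benth&Pedley")` gives "Racosperma spirorbe Benth & Pedley".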

Strong: for more complex processing
  • Converts "&", "and", "und", "et.", etc. to a single form
    • Racosperma spirorbe Benth and Pedley -> Racosperma spirorbe Benth & Pedley
  • Replaces other bracket types with "( )". For example, "[", "]", "{" and "}" are transformed to parentheses.
  • Removes enclosing brackets around genera
  • Converts "Author (Year)" to "(Author Year)" following evaluation. A bibliographic "in" citation is dropped, in line with the ICBN:

ICBN §46.2 Note 1. When authorship of a name differs from authorship of the publication in which it was validly published, both are sometimes cited, connected by the word "in". In such a case, "in" and what follows are part of a bibliographic citation and are better omitted unless the place of publication is being cited.

Original | Normalised
Paralamyctes chilensis Gervaisin, in Walckenaer & Gervais (1847) | Paralamyctes chilensis (Gervaisin 1847)
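The Strong steps could look roughly like the sketch below. It is an illustration under stated assumptions: the helper names, the connector list and the regular expressions are hypothetical, and `author_year` expects only the authorship part of a name, already separated from the genus and epithet. Removal of enclosing brackets around genera is not covered.

```python
import re

# Map square brackets and braces to parentheses
_BRACKETS = str.maketrans("[]{}", "()()")
# Connector words named on this page; the real list may be longer (assumption)
_CONNECTORS = re.compile(r"\s+(?:and|und|et\.?)\s+|\s*&\s*")
# "Author (Year)" with an optional ", in <publication authors>" part
_AUTHOR_YEAR = re.compile(
    r"^(?P<author>.+?)(?:,?\s+in\s+.+?)?\s*\((?P<year>\d{4})\)$"
)

def unify_brackets_and_connectors(name: str) -> str:
    """Use parentheses for all brackets and '&' for all author connectors."""
    s = name.translate(_BRACKETS)
    return _CONNECTORS.sub(" & ", s).strip()

def author_year(authorship: str) -> str:
    """Turn 'Author (Year)' into '(Author Year)', dropping any 'in' citation."""
    m = _AUTHOR_YEAR.match(authorship.strip())
    if m is None:
        return authorship  # leave unrecognised forms untouched
    return f"({m.group('author')} {m.group('year')})"
```

With these patterns, `author_year("Gervaisin, in Walckenaer & Gervais (1847)")` gives "(Gervaisin 1847)", matching the table above.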

Dirty: for processing sources that may have more significant problems with orthography
  • Performs case transformations where a genus name is in lower case or a whole name is in UPPERCASE
  • Removes annotations such as "comb. nov", "indet.", "sp.", "spec", etc.
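A rough sketch of the two Dirty steps follows. The function name and the annotation pattern are assumptions, and the pattern covers only the four annotations quoted above. The case handling is deliberately naive: it lowercases a fully upper-case string as a whole, so it is only safe for names without authorship.

```python
import re

# Annotations quoted on this page; a real list would be longer (assumption)
_ANNOTATIONS = re.compile(
    r"\s+(?:comb\.\s*nov\.?|indet\.?|spec\.?|sp\.?)(?=\s|$)",
    re.IGNORECASE,
)

def normalise_dirty(name: str) -> str:
    """Strip simple annotations and repair obvious case problems."""
    s = _ANNOTATIONS.sub("", name).strip()
    if s.isupper():
        # "ABIES ALBA" -> "abies alba"; would also flatten any authorship
        s = s.lower()
    # Capitalise the first letter so the genus starts in upper case
    return s[:1].upper() + s[1:]
```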

Hybrid Normalisation

Hybrid names present special difficulties in parsing and normalisation. First, hybrid markers are written inconsistently, with many sources using a literal "x" instead of the proper "×". Second, whitespace is used inconsistently, with leading or trailing whitespace around the marker optionally included. Lastly, there are different formatting conventions for hybrid names and hybrid formulas.

Currently we normalise to the following three formats:

  • ×Agropogon littoralis
  • Salix ×capreola Andersson
  • Agrostis L. × Polypogon Desf
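One way to reach the three target formats is sketched below. The function name and patterns are assumptions, not the parser's rules. A literal "x" inside a name is treated as a hybrid marker only when whitespace follows it, so that epithets beginning with "x" are left alone. The patterns cover ASCII letters only.

```python
import re

def normalise_hybrid(name: str) -> str:
    """Rewrite hybrid markers into the three formats listed above (sketch)."""
    s = name.strip()
    # Leading marker before a genus: "x Agropogon" -> "×Agropogon"
    s = re.sub(r"^[xX×]\s*(?=[A-Z])", "×", s)
    # Marker before a lower-case epithet: "Salix x capreola" -> "Salix ×capreola"
    s = re.sub(r"\s+(?:[xX]\s+|×\s*)(?=[a-z])", " ×", s)
    # Marker between two names in a formula: one space on each side
    s = re.sub(r"\s+(?:[xX]\s+|×\s*)(?=[A-Z])", " × ", s)
    return s
```

A leading "X" directly followed by a lower-case letter, as in a genus name that starts with "X", is not touched by the first rule.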

Publication Year processing

Some taxonomic data sources provide taxon names where a publication year is cited in square brackets.

Other examples sometimes include two distinct years associated with a single name.

We interpret these bracketed years as the "imprint year", that is, the year the publication was produced in print, which may be later than the accepted year of publication of the name. This can happen when parts of a larger book or compilation were published earlier.

Authorship in brackets

Where a scientific name authorship appears in square brackets ("[Röding]"), it indicates that the author to whom the publication is attributed is not named in the published work.

Example: mollusc genera in Museum Boltenianum, published anonymously but frequently ascribed to Bolten, should now instead be ascribed to Röding since other evidence now indicates that he was the author of those sections.

pers. comm. from Tony Rees

sensu, sec, secundum

Taxon names may carry annotations indicating that a publication uses the name in a sense that differs from the one published by the original author of the name. These include:

  • sensu, sec, secundum
  • auct., auct brit, etc.
  • sensu lato, sensu stricto
  • non, auct. non., etc

Parsing rule: Parsing rules will be refined to identify these formats, and the annotations will be expressed in dwc:taxonRemarks.

original name | scientificName | taxonRemarks
Pilosella blyttiana auct. non (Fr.) F. W. Schultz & Sch. Bip. | Pilosella blyttiana | auct. non (Fr.) F. W. Schultz & Sch. Bip.
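The split shown in the table could be approximated as follows. This is a sketch only: the page says the real parsing rules are still to be refined, and the function name and marker list are assumptions drawn from the annotations listed above.

```python
import re

# Whitespace that is directly followed by one of the listed markers
_SENSE_MARKER = re.compile(
    r"\s+(?=(?:sensu|secundum|sec\.?|auct\.?|non)(?:\s|$))"
)

def split_sense_annotation(name: str) -> tuple[str, str]:
    """Return (scientificName, taxonRemarks), split at the first marker."""
    parts = _SENSE_MARKER.split(name.strip(), maxsplit=1)
    scientific_name = parts[0]
    taxon_remarks = parts[1] if len(parts) > 1 else ""
    return scientific_name, taxon_remarks
```

A name without any marker comes back unchanged, with an empty remarks string.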

Variations of Anon.

Names in indexed sources may include some of the following terms. Dr. Gurcharan Singh provided the following explanation after a Taxacom post.

  1. anon. is used when the author is not known: anonymous
  2. ined. is used when the name is not published: unpublished, ineditus
  3. ined. followed by a year means the name was published in that year, but not validly

Acacia conspicua anon. | we do not know author
Lepraria jolithus (L.) anon. | we do not know author who made combination based on binomial by Linnaeus
Hygrocybe nitida (Berk. & M.A. Curtis) anon. ined. 1916 | we do not know author, it was published in 1916 but not validly
Parmelia revoluta f. minor (anon.) anon | we do not know the name of author who described species originally, nor who made combination

Parsing rule:

  1. Everything after the first anon instance is split into dwc:taxonRemarks
  2. Could possibly also set the nomenclatural status of the lexical group to invalid following a bit more research.
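Rule 1 might be sketched as below. The function name and pattern are assumptions. The sketch reads "after" literally, so the first anon token (with any enclosing parentheses) stays with the name and only the text that follows goes to the remarks. Whether the parser keeps the token with the name is not stated on this page.

```python
import re

# First "anon" token, with optional trailing period and enclosing parentheses
_ANON = re.compile(r"\(?\banon\b\.?\)?", re.IGNORECASE)

def split_anon(name: str) -> tuple[str, str]:
    """Return (name up to and including the first anon, taxonRemarks)."""
    m = _ANON.search(name)
    if m is None:
        return name.strip(), ""
    return name[:m.end()].strip(), name[m.end():].strip()
```

For example, `split_anon("Hygrocybe nitida (Berk. & M.A. Curtis) anon. ined. 1916")` puts "ined. 1916" into the remarks.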

Links to Source

See example file with tests

See parser source file

