Nom5NubBuilding
Building a management classification of all names
This page is under construction and subject to significant revision.
Introduction
This page gives a rough overview of the main processes behind creating a management classification for use within the GBIF data portal indexing processes. The end result is a management classification constructed by merging classifications from multiple sources. This merged classification is then indexed and is therefore searchable.
Data ingestion
Name sources are harvested into the ChecklistBank schema and modeled as a checklist. This is achieved with command line tools developed in Java that parse data archives and populate the ChecklistBank relational data model. The preferred archive format is currently the Darwin Core archive in the ChecklistFormat. TCS support is on the development road map.
Lexical grouping
Upon ingestion of the Darwin Core archives, name strings are lexically grouped (see the lexical_group table). The following name strings would reside in the same lexical group:
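As an illustration of what grouping near-identical name strings can look like, here is a minimal Python sketch based on edit distance. It is not the Java tooling used for ChecklistBank: the function names, the greedy grouping strategy, the threshold of 2 edits, and the input strings are all assumptions made for the example.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def lexical_groups(names, max_distance=2):
    """Greedy grouping (an assumption, not the real algorithm): a name joins
    the first group whose first member is within max_distance edits,
    otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if levenshtein(name.lower(), group[0].lower()) <= max_distance:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical input strings, not taken from any checklist.
names = ["Puma concolor", "Puma concolour", "Puma concolr", "Felis catus"]
groups = lexical_groups(names)
for group in groups:
    print(group)
```

A purely distance-based grouping like this would also merge homonyms, which is why additional criteria and manual curation are needed in practice.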
This grouping is achieved with a combination of algorithms run as Java processes, which make use of the Levenshtein distance algorithm, combined with manual curation using a purpose-built web GUI. Each lexical group is associated with one or more classifications from one or more checklists. Currently name strings are separated into lexical groups using higher classification to avoid problems with homonyms. The intention is to replace this by separating name strings into different lexical groups where authors differ. Hence currently "Aus bus D.Martin" and "Aus bus M.Doring" will be placed in the same lexical group unless their higher classifications differ.
Nomenclatural grouping
Nomenclatural grouping is performed using the homotypic synonymy provided by checklists. Example checklists providing synonym information are:
Creating the management classification (nub)
The management classification (nub) is assembled as a separate checklist. The classification is stored in the name_usage table with parent-child relationships (analogous to taxon_concept in the GBIF data portal schema) with combinations of name usages from all checklists. This is done by:
from which we assemble:
This assembling algorithm makes use of a ranking of the individual classifications (a classification is a hierarchy captured in the name_usage table).
Using the management classification
From this management classification we then build a Lucene index to enable lookups given a scientific name. For each name string in a lexical group we insert a document into the Lucene index. The inserted name string is indexed for searching, and the document includes a serialised version of the classification associated with the lexical group. This classification is the merged classification for this lexical group. All scientific names in the indexed field are uppercased, and entries for the abbreviated monomial are inserted, e.g. a key "P. concolor" will be inserted for "Puma concolor". An entry in the index will resemble:
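The key generation described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dictionary stands in for the Lucene index, the helper names (`index_keys`, `build_index`, `lookup`) are hypothetical, the abbreviated key is assumed to be uppercased like the full name, and the classification dictionary is a simplified stand-in for the serialised merged classification.

```python
def index_keys(scientific_name):
    """Return the uppercased full name plus, for multi-word names,
    an uppercased key with the first word abbreviated to its initial."""
    parts = scientific_name.split()
    keys = [" ".join(parts).upper()]
    if len(parts) >= 2:
        abbreviated = parts[0][0] + ". " + " ".join(parts[1:])
        keys.append(abbreviated.upper())
    return keys


def build_index(groups):
    """groups: iterable of (name_strings, classification) pairs, one per
    lexical group. A dict stands in for the Lucene index (assumption)."""
    index = {}
    for name_strings, classification in groups:
        for name in name_strings:
            for key in index_keys(name):
                entries = index.setdefault(key, [])
                if classification not in entries:
                    entries.append(classification)
    return index


def lookup(index, name):
    """Normalise whitespace and case the same way as at index time."""
    return index.get(" ".join(name.split()).upper(), [])


# Illustrative, simplified classification for one lexical group.
puma = {"kingdom": "Animalia", "family": "Felidae", "genus": "Puma"}
index = build_index([(["Puma concolor"], puma)])
print(index_keys("Puma concolor"))
print(lookup(index, "p. concolor"))
```

Storing a list per key allows an abbreviated key that is shared by several lexical groups to return every candidate classification, leaving disambiguation to the caller.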
The Lucene index includes higher taxa as key values to support lookups for occurrence records where only a monomial for a higher taxon is supplied. The Lucene index is then used in lookups by software such as the GBIF data portal during the harvesting of occurrence records, to help place these records in a hierarchy.
Outcomes