Nom5RawClassification  
Overview of the raw classifications from the occurrence data
Updated Jul 29, 2009 by timrobertson100

Introduction

This provides an overview of the "raw" identifications provided with the specimen and observational occurrence records. Please email trobertson@gbif.org to request further specific statistics.

The identifications attached to the occurrence data were extracted as follows. Note: The dwc:BasisOfRecord was not taken into account, so this includes records of all types (including known fossils). This decision was taken because all records need to be organised, although in practice some records, such as known fossils, might be treated differently.

Resource (res), Kingdom (k), Phylum (p), Class (c), Order (o), Family (f), Genus (g), Scientific name (name), Author (auth)

A good example is given for Mytilus edulis

It is important to note the following:

  • not all fields are filled for all records
  • data quality varies wildly
  • Scientific name is not always a true "species" but might be a duplication of a higher taxon
  • Author may or may not be provided as a separate atom
  • When author is given, it may or may not be included in the Scientific name
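The caveats above can be turned into simple per-record checks. The following is a minimal sketch, not the queries actually used for this page: it assumes each record is a dict keyed by the field abbreviations listed earlier (k, p, c, o, f, g, name, auth), and the function name flag_name and the flag labels are invented for illustration. Unlike the SQL hint for the UPPERCASE count, it requires at least one letter, so a string such as "11.01" is not flagged as uppercase.

```python
def flag_name(record):
    """Return the set of quality flags that apply to one raw record (a dict)."""
    name = (record.get("name") or "").strip()
    auth = (record.get("auth") or "").strip()
    flags = set()
    if not name:
        return flags
    # Open nomenclature: "Genus sp" and "Genus sp." are kept as separate flags.
    if name.endswith(" sp"):
        flags.add("ends_sp")
    if name.endswith(" sp."):
        flags.add("ends_sp_dot")
    # Assumption: a name counts as UPPERCASE only if it contains a letter.
    if name == name.upper() and any(ch.isalpha() for ch in name):
        flags.add("uppercase")
    # The author atom is repeated inside the scientific name.
    if auth and auth in name:
        flags.add("author_in_name")
    # The "scientific name" merely duplicates one of the higher atoms.
    higher = {record.get(rank) for rank in ("k", "p", "c", "o", "f", "g")}
    if name in higher:
        flags.add("equals_higher_atom")
    return flags


# Invented sample records, for illustration only.
samples = [
    {"g": "Mytilus", "name": "Mytilus edulis Linnaeus, 1758", "auth": "Linnaeus, 1758"},
    {"g": "Mytilus", "name": "Mytilus"},
    {"name": "Mytilus sp."},
]
for rec in samples:
    print(rec["name"], sorted(flag_name(rec)))
```

Substring containment is a deliberately crude test for "author contained in the name"; abbreviated authors would need fuzzier matching.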

Details Count
1. Distinct classifications (including an author "atom") and resource 5.5 million
2. Distinct classifications (including an author "atom") 4.75 million
3. Distinct classifications ignoring author atom 4.54 million
4. Distinct k,f,name 3.9 million
5. Distinct author 457,745
6. Number names with variation in author 758,302
7. Number names with the author atom contained in the name 992,540
8. Names ending in " sp" 6,889
9. Names ending in " sp." 89,103
10. Names in UPPERCASE 29,468
11. Name equal to another atom (k,p,c,o,f,g) 79,874 (71,073 are genera)
12. Names covered in Cat. Life 2009 (straight name match only) 1,534,655
13. Distinct k,name 3.43 million
14. Distinct name 3.01 million
15. Names covered in Thomson (straight name match only) 1,014,870
16. Names in Thomson and CoL2009 (canonical, and canonical+author hack) 9,154,011
17. Names covered in Thomson and CoL2009 (straight name match only) 1,953,960
18. Classifications with an author atom 2,524,885
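As a rough illustration of how distinct counts such as rows 4, 6, 13 and 14 could be derived, here is a minimal sketch over in-memory records. It assumes records are dicts keyed by the field abbreviations above; the helper names distinct_count and names_with_author_variation are hypothetical, and reading row 6 as "names seen with more than one distinct author string" is an assumption. Using tuples as keys keeps a missing kingdom as a value of its own rather than collapsing the whole key to null.

```python
from collections import defaultdict


def distinct_count(records, fields):
    """Count distinct combinations of the given fields; None counts as a value."""
    return len({tuple(rec.get(f) for f in fields) for rec in records})


def names_with_author_variation(records):
    """Count names that occur with more than one distinct, non-empty author."""
    authors = defaultdict(set)
    for rec in records:
        if rec.get("name") and rec.get("auth"):
            authors[rec["name"]].add(rec["auth"])
    return sum(1 for seen in authors.values() if len(seen) > 1)


# Invented sample records, for illustration only.
records = [
    {"k": "Animalia", "f": "Mytilidae", "name": "Mytilus edulis", "auth": "Linnaeus, 1758"},
    {"k": "Animalia", "f": "Mytilidae", "name": "Mytilus edulis", "auth": "L."},
    {"k": None, "f": "Mytilidae", "name": "Mytilus edulis", "auth": None},
    {"k": "Animalia", "f": None, "name": "Mytilus edulis", "auth": None},
]

print(distinct_count(records, ("k", "f", "name")))
print(distinct_count(records, ("k", "name")))
print(distinct_count(records, ("name",)))
print(names_with_author_variation(records))
```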

Comments

  • From 1. and 2. we see that there is not much repetition across resources when author atom is used
  • From 6. we see that there is a large amount of variation in authorship; scanning the data suggests this is mostly due to abbreviation
  • 10. Hint: select count(*) from raw_classification where name is not null and name like binary upper(name)=1
  • 12. By extracting the canonical and the "canonical + author" from Catalogue of Life 2009 annual checklist
  • 13. Hint: concat(k,name) returns null for a null k, need to concat_ws(',',k,name)
  • 15. Same approach as 12., applied to the Thomson names
  • 16. Being very hacky, taking canonical and also canonical+author concatenated with a space (only to get some estimates of overlap)
  • 17. taking 12. and 15. and combining the names, and name+authorship, doing a straight name match
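The "straight name match" described in 12., 16. and 17. can be sketched as a plain set lookup. This is only an approximation under stated assumptions: each checklist is taken to be a list of (canonical, author) pairs, the function names build_lookup and covered are hypothetical, and the two sample checklists are invented rather than taken from Catalogue of Life or Thomson.

```python
def build_lookup(checklist):
    """Build the set of strings to match against: canonical, and canonical + ' ' + author."""
    lookup = set()
    for canonical, author in checklist:
        lookup.add(canonical)
        if author:
            # The "hack": canonical and author concatenated with a single space.
            lookup.add(f"{canonical} {author}")
    return lookup


def covered(raw_names, lookup):
    """Return the raw names that match a lookup string exactly."""
    return {name for name in raw_names if name in lookup}


# Invented sample data, for illustration only.
checklist_a = [("Mytilus edulis", "Linnaeus, 1758")]
checklist_b = [("Mytilus galloprovincialis", "Lamarck, 1819")]

raw_names = [
    "Mytilus edulis",
    "Mytilus edulis Linnaeus, 1758",
    "Mytilus edulis L.",
    "Mytilus galloprovincialis",
]

lookup_a = build_lookup(checklist_a)
lookup_b = build_lookup(checklist_b)
combined = lookup_a | lookup_b  # both sources merged, as in 17.

print(len(covered(raw_names, lookup_a)))
print(len(covered(raw_names, combined)))
```

An exact match of this kind misses abbreviated authorship such as "Mytilus edulis L.", which is one reason the coverage figures should be read as estimates.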

Higher Major Ranks

Counts are distinct names

Rank GBIF Raw Cat Life 2009
Kingdom 237 8
Phylum 546 106
Class 1489 245
Order 3806 1068
Family 20231 7246
Total distinct 25198 8652

Of those 25198 raw names, 7263 are covered in Cat Life 2009. There is a lot of noise from errors, bad mappings and suchlike

  • 11.01, (Stereales), ++, A. Hansen
and many apparent spelling errors
  • Annimalia
but there are a LOT of real higher taxa that need to be sourced and historical names which are also valid
  • Ambulocetidae
  • Anazygidae

