Introduction
This page provides an overview of the "raw" identifications supplied with the specimen and observational occurrence records. Please email trobertson@gbif.org to request further specific statistics.
The identifications attached to the occurrence data were extracted as follows. Note: the dwc:BasisOfRecord was not taken into account, so this includes records of all types (including known fossils). This decision was taken because all records need to be organised, although in practice some records, such as known fossils, might be treated differently.
| Resource (res) | Kingdom (k) | Phylum (p) | Class (c) | Order (o) | Family (f) | Genus (g) | Scientific name (name) | Author (auth) |
A good example is given for Mytilus edulis
It is important to note the following:
- not all fields are filled for all records
- data quality varies wildly
- Scientific name is not always a true "species" but might be a duplication of a higher taxon
- Author may or may not be provided as a separate atom
- When author is given, it may or may not be included in the Scientific name
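The caveats above can be checked per record. The following is a minimal sketch, not the method used to produce the counts on this page: the `RawClassification` type and `check_record` function are hypothetical names, the field names simply follow the abbreviations in the table header, and the sample values are illustrative only.

```python
from typing import NamedTuple, Optional


class RawClassification(NamedTuple):
    """One raw record; field names follow the table header abbreviations."""
    res: Optional[str]
    k: Optional[str]
    p: Optional[str]
    c: Optional[str]
    o: Optional[str]
    f: Optional[str]
    g: Optional[str]
    name: Optional[str]
    auth: Optional[str]


def check_record(rec):
    """Report which of the listed caveats apply to a single record."""
    name = (rec.name or "").strip()
    auth = (rec.auth or "").strip()
    higher = {"k": rec.k, "p": rec.p, "c": rec.c, "o": rec.o, "f": rec.f, "g": rec.g}
    return {
        # Fields that are empty or absent for this record.
        "missing_fields": [fld for fld in rec._fields if not getattr(rec, fld)],
        # True when a separate author atom is also repeated inside the name.
        "author_in_name": bool(auth) and auth in name,
        # Ranks whose value is identical to the scientific name.
        "name_equals_higher_atom": [
            rank for rank, value in higher.items()
            if value and name and value.strip() == name
        ],
    }


# Illustrative values only (assumed for this sketch).
example = RawClassification(
    res="example resource", k="Animalia", p=None, c=None, o=None,
    f="Mytilidae", g="Mytilus",
    name="Mytilus edulis Linnaeus, 1758", auth="Linnaeus, 1758",
)
report = check_record(example)
```

A substring test is a deliberately crude way to detect an embedded author; abbreviated or reformatted authorship would not be caught by it.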
| | Details | Count |
| 1. | Distinct classifications (including an author "atom") and resource | 5.5 million |
| 2. | Distinct classifications (including an author "atom") | 4.75 million |
| 3. | Distinct classifications ignoring author atom | 4.54 million |
| 4. | Distinct k,f,name | 3.9 million |
| 5. | Distinct author | 457,745 |
| 6. | Number of names with variation in author | 758,302 |
| 7. | Number of names with the author atom contained in the name | 992,540 |
| 8. | Names ending in " sp" | 6,889 |
| 9. | Names ending in " sp." | 89,103 |
| 10. | Names in UPPERCASE | 29,468 |
| 11. | Name equal to another atom (k,p,c,o,f,g) | 79,874 (71,073 are genera) |
| 12. | Names covered in Cat. Life 2009 (straight name match only) | 1,534,655 |
| 13. | Distinct k,name | 3.43 million |
| 14. | Distinct name | 3.01 million |
| 15. | Names covered in Thomson (straight name match only) | 1,014,870 |
| 16. | Names in Thomson and CoL2009 (canonical, and canonical+author hack) | 9,154,011 |
| 17. | Names covered in Thomson and CoL2009 (straight name match only) | 1,953,960 |
| 18. | Classifications with an author atom | 2,524,885 |
Comments
- From 1. and 2. we see that there is not much repetition across resources when the author atom is used
- From 6. we see that there is a large amount of variation in authorship; scanning the data suggests this is mostly due to abbreviation
- 10. Hint: select count(*) from raw_classification where name is not null and name like binary upper(name)=1
- 12. By extracting the canonical and the "canonical + author" from Catalogue of Life 2009 annual checklist
- 13. Hint: concat(k,name) returns null for a null k, need to concat_ws(',',k,name)
- 15. Same approach as 12.
- 16. A very rough hack: takes the canonical name and also the canonical name + author concatenated with a space (only to get some estimates of overlap)
- 17. Takes the names from 12. and 15., combines the names and the name + authorship, and does a straight name match
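The hints in 10. and 13. and the matching described in 16. and 17. can be sketched outside the database as follows. This is a rough illustration under stated assumptions, not the queries actually run: all function names are hypothetical, `concat_ws` only mimics the NULL-skipping behaviour of the MySQL function of the same name, and rows without a name are assumed to be skipped.

```python
def concat_ws(sep, *parts):
    """Mimic MySQL CONCAT_WS: skip None parts instead of returning None."""
    return sep.join(p for p in parts if p is not None)


def distinct_k_name(rows):
    """Count distinct (kingdom, name) pairs without losing rows where k is None.

    rows: iterable of (k, name) tuples; rows with no name are skipped (assumption).
    """
    return len({concat_ws(",", k, name) for k, name in rows if name is not None})


def is_uppercase(name):
    """Case-sensitive comparison against the upper-cased name, as in hint 10.

    Note that strings without any letters (digits, punctuation) also pass.
    """
    return name is not None and name == name.upper()


def build_lookup(checklist):
    """Build the match keys: canonical, and canonical + ' ' + author.

    checklist: iterable of (canonical, author_or_None) tuples.
    """
    keys = set()
    for canonical, author in checklist:
        keys.add(canonical)
        if author:
            keys.add(f"{canonical} {author}")
    return keys


def covered(raw_names, lookup):
    """Straight (exact string) match of raw names against the lookup keys."""
    return {n for n in raw_names if n in lookup}


# Illustrative values only (assumed for this sketch).
lookup = build_lookup([("Mytilus edulis", "Linnaeus, 1758")])
raw = {"Mytilus edulis", "Mytilus edulis Linnaeus, 1758", "Mytilus edulis L."}
matched = covered(raw, lookup)
```

In this example the abbreviated form "Mytilus edulis L." is not matched, which is the kind of authorship variation noted under 6. and the reason a straight match only gives an estimate of overlap.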
Higher Major Ranks
Counts are of distinct names.
| Rank | GBIF Raw | Cat Life 2009 |
| Kingdom | 237 | 8 |
| Phylum | 546 | 106 |
| Class | 1,489 | 245 |
| Order | 3,806 | 1,068 |
| Family | 20,231 | 7,246 |
| Total distinct | 25,198 | 8,652 |
Of those 25,198 raw names, 7,263 are covered in Cat Life 2009. There is a lot of noise from errors, bad mappings and the like, for example:
- 11.01, (Stereales), ++, A. Hansen
There are also many apparent spelling errors, but there are a LOT of real higher taxa that need to be sourced, as well as historical names which are also valid.