HBaseSchema
The HBase schema
Introduction
This discussion relates to the schema design for the occurrence record table, the core table for most operations in the index. The entities stored are occurrences of a species collection (e.g. a specimen in a museum) or a species observational record (e.g. species x was seen at this location at this time). The metadata captured for each occurrence includes the scientific identification (e.g. the species) along with temporal and location metadata. The metadata terms are likely to closely follow the Darwin Core standard.
The table is populated by crawling thousands of databases that are published on the internet using recognised protocols, much like Google and Yahoo crawl web pages. The incoming data is in the form of XML and CSV files, and is likely to include RDF-XML in the coming months or years.
For the purposes of this discussion, the following simple example of an incoming source record is used. This represents a simple record, but shows one complexity that exists: the inherent many-to-one relationship between the scientific identifications (Puma sp. and Puma concolor) and the specimen. All the identifications should be returned when a "get by key" operation is requested. When generating indexes, the record should become searchable using any identification.
Questions
How to model the multiple scientific identifications associated with the record?
Option 1
Keep the record as a single row and create a custom serialization for the scientific identifications.
HBase stores bytes, meaning that a list of scientific identifications can be serialized in a custom manner as the datatype for a single column in a family.
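To make Option 1 concrete, here is a minimal sketch of one possible custom serialization: a count followed by length-prefixed UTF-8 strings, packed into a single bytes value that could be stored in one cell. The function names and the wire layout are assumptions made for illustration; they are not the format used by the project, and no HBase client calls are involved.

```python
import struct


def serialize_identifications(names):
    """Pack a list of strings into one bytes value.

    Layout (an assumption for this sketch): a 4-byte big-endian count,
    then for each name a 4-byte big-endian length and the UTF-8 bytes.
    """
    parts = [struct.pack(">I", len(names))]
    for name in names:
        encoded = name.encode("utf-8")
        parts.append(struct.pack(">I", len(encoded)))
        parts.append(encoded)
    return b"".join(parts)


def deserialize_identifications(data):
    """Reverse serialize_identifications and return the list of strings."""
    (count,) = struct.unpack_from(">I", data, 0)
    offset = 4
    names = []
    for _ in range(count):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        names.append(data[offset:offset + length].decode("utf-8"))
        offset += length
    return names


# Both identifications of the example specimen travel in a single cell value.
cell_value = serialize_identifications(["Puma sp.", "Puma concolor"])
restored = deserialize_identifications(cell_value)
```

A "get by key" then needs only one row read followed by one deserialization step to return every identification.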
Option 2
Expand the record into 2 rows, identical except for the scientific identification.
A simple unit qualifier could be added to the row key so that looking at a key is enough to know that the records were derived from the same source:
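As one hypothetical illustration of such keys, the sketch below appends a numeric qualifier to a shared source key. The key `record-42`, the `#` separator and the function names are all assumptions for this example, not the project's actual key format.

```python
def expand_row_keys(source_key, identifications, separator="#"):
    """Return one (row_key, identification) pair per identification.

    Every row key shares the source key as a prefix, so rows derived
    from the same source record sort next to each other.
    """
    return [
        (f"{source_key}{separator}{index}", identification)
        for index, identification in enumerate(identifications, start=1)
    ]


def source_of(row_key, separator="#"):
    """Recover the shared source key from an expanded row key."""
    return row_key.rsplit(separator, 1)[0]


rows = expand_row_keys("record-42", ["Puma sp.", "Puma concolor"])
```

With this layout, a "get by key" for the source record would become a prefix scan over the source key instead of a single row read.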
Option 3
Expand each identification into multiple families in the same row.
The families would be named along the lines of rawScientificIdentification1, rawScientificIdentification2 etc., each with a :kingdom, :phylum ... :scientificName etc.
An alternative for option 3 could be to keep everything in the same family and have multiple columns for Kingdom1, Kingdom2 etc. This has not been thought through yet.
Option 4
Use multiple tables for a single record.
Much like a traditional RDBMS, the identifications could be held in a separate table and joined.
Preferred choice
Currently option 1 is the preferred candidate for testing.
"You point out that this would have poor scanning performance because of the need for deserialization, but I don't necessarily agree. That can be quite fast, depending on implementation, and there's a great deal of serialization/deserialization being done behind the scenes to even get the data to you in the first place. Something like protobufs has very efficient and fast serialize/deserialize operations. Java serialization is inefficient in space and can be slow, which is why HBase and Hadoop implement the Writable interface and provide a minimal/efficient/binary serialization. I do think that is by far the best approach here; the serialization/deserialization should be orders of magnitude faster than round-trip network latency."
Jonathan Gray (HBase list)
Tim: There is a test for this exact scenario in the /occurrence-store project (IdentificationsProtosTest). The result is that protobufs can serialise and deserialise a list of identifications (only 1 in the list) at a rate of 130 per millisecond on a MacBook Pro. The serialisation performance is therefore acceptable.
Real Test #1
The following schema is in use as test #1 for harvesting into HBase. It uses a single row per harvested record, in the style of Option 1 above. The column names for the ror are generated dynamically through reflection on the model object (getters and setters). Each record can have many supporting records (ids, images, links and typifications). The count field in each supporting record's column family (cf) tells the deserializer how many of those records to read from that column family.
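The idea of deriving column names from the model object can be sketched as follows. This is only an analogy: the original uses Java reflection over getters and setters, whereas this sketch uses Python dataclass introspection, and the `OccurrenceRecord` class and its field names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class OccurrenceRecord:
    # Hypothetical model; these field names are illustrative only.
    scientific_name: str
    country: str
    latitude: float


def to_columns(record):
    """Map every field of the model to a column name and a bytes value.

    Column names come from introspection, so adding a field to the
    model adds a column without touching this function.
    """
    return {
        field.name: str(getattr(record, field.name)).encode("utf-8")
        for field in fields(record)
    }


columns = to_columns(OccurrenceRecord("Puma concolor", "AR", -34.6))
```

The trade-off of this approach is that the stored column names are tied to the model's field names, so renaming a field changes the schema.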
The column names for the supporting records are hardcoded with a suffix indicating the count of that type of record. For example, for two ids the column family looks like
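As a rough sketch of the count-plus-suffix idea, the code below flattens a list of supporting records into suffixed columns and reads them back using the count. The column names (`type0`, `value0` and so on), the zero-based suffix and the sample values are hypothetical, not the project's actual names.

```python
def flatten_supporting(records):
    """Flatten a list of dicts into one column family's columns.

    Produces a 'count' column plus one '<field><n>' column per field
    of the n-th record, with n starting at 0.
    """
    columns = {"count": str(len(records))}
    for n, record in enumerate(records):
        for field, value in record.items():
            columns[f"{field}{n}"] = value
    return columns


def read_supporting(columns, field_names):
    """Use the 'count' column to decide how many records to rebuild."""
    count = int(columns["count"])
    return [
        {field: columns[f"{field}{n}"] for field in field_names}
        for n in range(count)
    ]


ids = [
    {"type": "catalogue", "value": "A1"},
    {"type": "guid", "value": "B2"},
]
id_columns = flatten_supporting(ids)
round_tripped = read_supporting(id_columns, ["type", "value"])
```

Because the deserializer trusts the count, the count column and the suffixed columns must be written together in the same row update.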