HBaseSchema  
The HBase schema
Updated Jun 10, 2011 by oliver.m...@gmail.com

Introduction

This discussion covers the schema design for the occurrence record table, the core table for most operations in the index. Each stored entity is an occurrence: either a collected specimen (e.g. a specimen in a museum) or an observational record (e.g. species x was seen at this location at this time). The metadata captured for an occurrence includes the scientific identification (e.g. the species) along with temporal and location metadata. The metadata terms are likely to closely follow the Darwin Core standard.

The table is populated by crawling thousands of databases that are published on the internet using recognised protocols, much like Google and Yahoo crawl web pages. The incoming data arrives as XML and CSV files, and is likely to include RDF-XML in the coming months or years.

For the purposes of this discussion, the following simple example of an incoming source record is used. The record is simple, but it shows one inherent complexity: the many-to-one relationship between the scientific identifications (Puma sp. and Puma concolor) and the specimen. All the identifications should be returned when a "get by key" operation is requested. When generating indexes, the record should become searchable using any of its identifications.

Questions

How to model the multiple scientific identifications associated with the record?

Option 1

Keep the record as a single row and create a custom serialization for the scientific identifications

HBase stores bytes, so a list of scientific identifications can be serialized in a custom format and stored as the value of a single family:column cell.

Pros
  • Record remains logically grouped in the same underlying HDFS file
Cons
  • Slight scanning performance drop due to serialization (Jonathan Gray: use protobufs, not a major concern)
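
As a rough illustration of Option 1, the sketch below packs a list of identifications into one byte string that could be stored in a single cell. It is not the project's serializer: the Identification fields, the length-prefixed layout and the function names are assumptions made for this example, and protobufs (as suggested on the HBase list) would replace the hand-rolled format in practice.

```python
import struct
from typing import List, NamedTuple, Tuple


class Identification(NamedTuple):
    # Hypothetical fields; the real model likely follows Darwin Core terms.
    scientific_name: str
    kingdom: str


def _pack_str(value: str) -> bytes:
    """Encode a string as a 2-byte big-endian length followed by UTF-8 bytes."""
    data = value.encode("utf-8")
    return struct.pack(">H", len(data)) + data


def _unpack_str(buf: bytes, offset: int) -> Tuple[str, int]:
    """Read one length-prefixed string; return it with the next offset."""
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    return buf[start:start + length].decode("utf-8"), start + length


def serialize(identifications: List[Identification]) -> bytes:
    """Pack the whole list into one byte string suitable for a single cell."""
    out = [struct.pack(">H", len(identifications))]  # leading record count
    for ident in identifications:
        out.append(_pack_str(ident.scientific_name))
        out.append(_pack_str(ident.kingdom))
    return b"".join(out)


def deserialize(buf: bytes) -> List[Identification]:
    """Rebuild the list by reading the count, then that many records."""
    (count,) = struct.unpack_from(">H", buf, 0)
    offset = 2
    result = []
    for _ in range(count):
        name, offset = _unpack_str(buf, offset)
        kingdom, offset = _unpack_str(buf, offset)
        result.append(Identification(name, kingdom))
    return result
```

A single "get by key" then returns every identification at once, at the cost of one deserialize call per row during scans.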

Option 2

Expand the record into 2 rows identical except for the scientific identification

A simple unit qualifier could be added to the row key, so that the key alone shows which rows were derived from the same source record:

  • 1234:1
  • 1234:2

Pros
  • No serialisation required, so good for scanning
Cons
  • No longer able to do a "get by key". It would require a scan or multiple "get by key" operations to build a single record
  • Counting operations need to understand this, and reduce the rows into a single count
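
The two costs of Option 2 can be sketched with an in-memory dictionary standing in for the table. The helper names and the third row are hypothetical; a real implementation would use an HBase prefix scan rather than filtering a dict.

```python
from typing import Dict, Iterable, List


def source_key(row_key: str) -> str:
    """Strip the ':n' qualifier, e.g. '1234:2' -> '1234'."""
    return row_key.rsplit(":", 1)[0]


def rebuild_record(rows: Dict[str, str], key: str) -> List[str]:
    """Emulate a prefix scan: collect every identification for one source record."""
    prefix = key + ":"
    return [rows[k] for k in sorted(rows) if k.startswith(prefix)]


def count_records(row_keys: Iterable[str]) -> int:
    """Count source records, not rows, by reducing rows to their shared key."""
    return len({source_key(k) for k in row_keys})


# Hypothetical table contents: row key -> scientific identification.
rows = {
    "1234:1": "Puma sp.",
    "1234:2": "Puma concolor",
    "5678:1": "Felis catus",
}
```

Here rebuild_record(rows, "1234") needs both rows to assemble the record, and count_records(rows) has to collapse three rows into two records.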

Option 3

Expand each identification into multiple families in the same row

The families would be named along the lines of rawScientificIdentification1, rawScientificIdentification2 etc, each with a :kingdom, :phylum... :scientificName etc.

Pros
  • Scientific identification remains logically grouped with the row
  • Good for simple scanning and counting
Cons
  • A scan that requires checking all scientific identifications means HBase will need to scan over 2 "files" (confirmed by Jonathan Gray), which is harder work for HBase

An alternative for option 3 could be to keep everything in the same family and have multiple columns for Kingdom1, Kingdom2 etc. This has not been thought through yet.
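
A minimal sketch of the Option 3 layout, with a plain dictionary of "family:qualifier" names standing in for an HBase row. The helper names are assumptions for illustration; only the family naming pattern comes from the text above.

```python
from typing import Dict, List

FAMILY_PREFIX = "rawScientificIdentification"


def to_columns(identifications: List[Dict[str, str]]) -> Dict[str, str]:
    """Spread each identification over its own numbered family."""
    columns = {}
    for index, ident in enumerate(identifications, start=1):
        for qualifier, value in ident.items():
            columns[f"{FAMILY_PREFIX}{index}:{qualifier}"] = value
    return columns


def matches_any(columns: Dict[str, str], qualifier: str, wanted: str) -> bool:
    """True if any identification family holds `wanted` under `qualifier`."""
    suffix = ":" + qualifier
    return any(
        name.startswith(FAMILY_PREFIX) and name.endswith(suffix) and value == wanted
        for name, value in columns.items()
    )


row = to_columns([
    {"scientificName": "Puma sp.", "kingdom": "Animalia"},
    {"scientificName": "Puma concolor", "kingdom": "Animalia"},
])
```

A filter such as matches_any(row, "scientificName", "Puma concolor") has to look in every identification family, which is the multi-file scan cost noted in the cons.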

Option 4

Use multiple tables for a single record

Much like a traditional RDBMS the identifications could be held in a separate table and joined.

Pros
  • Perhaps smaller data volume
Cons
  • Joins are very slow, and since this will be in 90%+ of filters this is not really an option

Preferred choice

Currently option 1 is the preferred candidate for testing.

You point out that this would have poor scanning performance because of the need for deserialization, but I don't necessarily agree. That can be quite fast, depending on implementation, and there's a great deal of serialization/deserialization being done behind the scenes to even get the data to you in the first place.
Something like protobufs has very efficient and fast serialize/deserialize operations. Java serialization is inefficient in space and can be slow, which is why HBase and Hadoop implement the Writable interface and provide a minimal/efficient/binary serialization.
I do think that is by far the best approach here; the serialization/deserialization should be orders of magnitude faster than round-trip network latency.
Jonathan Gray (HBase list)

Tim: There is a test for this exact scenario in the /occurrence-store project (IdentificationsProtosTest). The result is that protobufs can serialise and deserialise a list of identifications (only 1 in the list) at a rate of 130 per millisecond on a MacBook Pro. Serialisation performance is therefore acceptable.
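
For readers who want to reproduce a similar measurement, the sketch below shows one way to compute an operations-per-millisecond figure. It is not the IdentificationsProtosTest: the function names are hypothetical and a JSON round trip stands in for protobufs, so the numbers it produces are not comparable with the result above.

```python
import json
import time
from typing import Callable


def ops_per_millisecond(operation: Callable[[], object], iterations: int = 10_000) -> float:
    """Run `operation` repeatedly and return the average throughput per millisecond."""
    start = time.perf_counter()
    for _ in range(iterations):
        operation()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return iterations / elapsed_ms


def round_trip() -> object:
    # Stand-in for a protobuf serialize + deserialize of a one-element list.
    payload = json.dumps([{"scientificName": "Puma concolor"}]).encode("utf-8")
    return json.loads(payload.decode("utf-8"))
```

Calling ops_per_millisecond(round_trip) gives a figure in the same unit as the one quoted; the value depends on the machine and on the serializer being measured.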

Real Test #1

The following schema is in use as test #1 for harvesting into HBase. It uses a single row per harvested record, in the style of Option 1 above. The column names for the ror are generated dynamically through reflection on the model object (getters and setters). Each record can have many supporting records (ids, images, links and typifications). The count field in each supporting record's column family tells the deserializer how many of those records to read from that column family.

    | meta     | props      | ids              | images              | links              | types
key | is_dirty | ror fields | count, id fields | count, image fields | count, link fields | count, typification fields

The column names for the supporting records are hardcoded, with a numeric suffix giving the position of the record within its type. E.g. for two ids the column family looks like:

column family: ids
count | identifier_1 | identifier_type_1 | identifier_2 | identifier_type_2
2       IEE 242CGN147064
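
The count-plus-suffix convention for the ids column family can be sketched as follows. The function names and the placeholder values in the usage note are hypothetical; only the column naming (count, identifier_N, identifier_type_N) is taken from the schema above.

```python
from typing import Dict, List, Tuple


def flatten_ids(ids: List[Tuple[str, str]]) -> Dict[str, str]:
    """Write (identifier, identifier_type) pairs as suffixed columns plus a count."""
    columns = {"count": str(len(ids))}
    for n, (identifier, identifier_type) in enumerate(ids, start=1):
        columns[f"identifier_{n}"] = identifier
        columns[f"identifier_type_{n}"] = identifier_type
    return columns


def read_ids(columns: Dict[str, str]) -> List[Tuple[str, str]]:
    """Use the count column to decide how many suffixed records to read."""
    count = int(columns.get("count", "0"))
    return [
        (columns[f"identifier_{n}"], columns[f"identifier_type_{n}"])
        for n in range(1, count + 1)
    ]
```

For example, flatten_ids([("id-a", "type-a"), ("id-b", "type-b")]) yields the five columns listed above, and read_ids recovers the two pairs by reading exactly count records.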

