|
|
This walk through was written using a pre-release version of hypertable -- you may see small variations in output from the current release.
CONTENTS
- SETUP
- INTERACTIVE HYPERTABLE & HQL
- CREATE TABLE
- LOAD DATA INFILE
- SELECT
- INSERT & DELETE
- SHOW TABLES & DROP TABLE
- BATCH MODE AND SOURCE
SETUP
For this example you'll need to download the data from http://hypertable.googlecode.com/files/access.tsv.gz and run:
tinweasel:~> gunzip access.tsv.gz tinweasel:~> mv access.tsv ~/hypertable/examples/hql_tutorial/
In this example we'll startup a local hypertable instance, create a table, load some access log data from a file, and do some basic operations using interactive Hypertext Query Language (HQL) commands. Afterwards, we'll demonstrate use of the --batch mode for HQL.
Hypertable is designed to be "always on" but for the purposes of this example we'll assume you're running locally from a fresh install. Also, following the README.txt in the distribution we assume your hypertable directory is ~/hypertable
codeslinger:~> cd ~/hypertable codeslinger:~/hypertable> bin/start-all-servers.sh local codeslinger:~/hypertable> bin/start-all-servers.sh local Successfully started DFSBroker (local) Successfully started Hyperspace Successfully started Hypertable.Master Successfully started Hypertable.RangeServer
Now, fire up an interactive session:
codeslinger:~/hypertable> bin/hypertable Welcome to the HQL command interpreter. Type 'help;' or '\h' for help. Type '\c' to clear the buffer. hypertable>
Start with a simple 'help':
hypertable> help For information about Hypertable, visit http://www.hypertable.org/ Interpreter Commands -------------------- ? (\?) Synonym for `help'. clear (\c) Clear command. exit (\q) Exit hypertable. Same as quit. print (\p) Print current command. quit (\q) Quit hypertable. source <f> (.) Execute all HQL commands in file <f>. system (\!) Execute a system shell command. HQL Statements -------------- CREATE TABLE ....... Creates a table DELETE ............. Deletes all or part of a row from a table DESCRIBE TABLE ..... Displays a table's schema DROP TABLE ......... Removes a table INSERT ............. Inserts data into a table LOAD DATA INFILE ... Loads data from a tab delimited input file into a table SELECT ............. Selects (and display) cells from a table SHOW CREATE TABLE .. Displays CREATE TABLE command used to create table SHOW TABLES ........ Displays the list of tables Statements must be terminated with ';' to execute. For more information on a specific statement, type 'help <statement>', where <statement> is one from the preceeding list.
INTERACTIVE HYPERTABLE & HQL
For this example we're working with a simplified access log that contains four columns: a date field, the referrer's url 'refer-url', an http code 'http-code', and a field we'll call 'rowkey' -- actually a url from one of the hosts we serve. There are 100,000 records in the file.
The file is under examples/hql_tutorial/access.tsv and you'll need to unzip it first if it's still compressed. NOTE: LOAD DATA INFILE will a compressed file option soon.
CREATE TABLE
First, we need to create a table to hold the data from our simplified access log -- we'll call it Pages:
hypertable> create table Pages ( date, "refer-url", "http-code");
Column names that contain a dash or any non-alphanumeric character (e.g. "refer-url") must be enclosed in quotes. Now, issue the SHOW CREATE TABLE command to make sure you got everything right. We didn't have to include the field called 'rowkey' because we'll use that in our LOAD DATA INFILE command later:
hypertable> show create table Pages; CREATE TABLE Pages ( 'refer-url', 'http-code', date, ACCESS GROUP default ( 'refer-url', 'http-code' ) )
And, notice that, by default, a single ACCESS GROUP is created -- see the full documentation for details on that but basically the ACCESS GROUP is an optimization.
Further, we get a minimal schema description via:
hypertable> describe table Pages;
<Schema generation="1">
<AccessGroup name="default">
<ColumnFamily id="1">
<Name>refer-url</Name>
</ColumnFamily>
<ColumnFamily id="2">
<Name>http-code</Name>
</ColumnFamily>
<ColumnFamily id="3">
<Name>date</Name>
</ColumnFamily>
</AccessGroup>
</Schema>
LOAD DATA INFILE
Now, let's load some data using the MySQL-like TAB delimited format (TSV). For that, we assume you have the example data in examples/hql_tutorial/access.tsv. This format includes a single header line that gives column_family_names (no column_qualifiers supported for this format yet). For the example, our data file has header:
# rowkey date refer-url http-code events.mercurynews.com/venues 2008-01-25 15:19:32 events.mercurynews.com/search 200 www.zvents.com/events/auto_complete_for_artist_name 2008-01-25 15:19:32 www.zvents.com/indio-ca/events/show/81296496-coachella 200 calendar.denverpost.com/search 2008-01-25 15:19:32 calendar.denverpost.com/search 200 www.zvents.com/search 2008-01-25 15:19:32 www.zvents.com/search 200
Reminder: you'll need to unzip it first if it's still compressed.
LOAD DATA INFILE (INTO TABLE)
So, now we load it into our Pages table -- notice that we are specifying a column present in the data file ('rowkey') that's not in our schema. We can specify any field name present in the data for the ROW_KEY_COLUMN option.
hypertable> load data infile ROW_KEY_COLUMN=rowkey "examples/hql_tutorial/access.tsv" into table Pages;
Loading 10,872,957 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 6.00 s
Avg value size: 18.68 bytes
Avg key size: 48.70 bytes
Throughput: 1811875.04 bytes/s
Total inserts: 300000
Throughput: 49992.15 inserts/s
Resends: 0
LOAD DATA INFILE (INTO FILE)
Also, make sure you don't skip over an important utility for converting these more standard formatted files into insert files used by hypertable client programs:
hypertable> load data infile ROW_KEY_COLUMN=rowkey "examples/hql_tutorial/access.tsv" into file "test.tsv";
Loading 10,872,957 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 2.41 s
Avg value size: 18.68 bytes
Avg key size: 48.70 bytes
Throughput: 4513692.45 bytes/s
Total inserts: 300000
Throughput: 124539.05 inserts/sAnd, now we have a new file, test.tsv, with the fully specified hypertable insert format which illustrates the fundamental key/value sparse representation of our data (which would be more clear if we were using column_qualifiers)
codeslinger:~/hypertable> head -10 test.tsv rowkey columnkey value events.mercurynews.com/venues date 2008-01-25 15:19:32 events.mercurynews.com/venues refer-url events.mercurynews.com/search events.mercurynews.com/venues http-code 200 www.zvents.com/events/auto_complete_for_artist_name date 2008-01-25 15:19:32 www.zvents.com/events/auto_complete_for_artist_name refer-url www.zvents.com/indio-ca/events/show/81296496-coachella www.zvents.com/events/auto_complete_for_artist_name http-code 200 calendar.denverpost.com/search date 2008-01-25 15:19:32 calendar.denverpost.com/search refer-url calendar.denverpost.com/search calendar.denverpost.com/search http-code 200
SELECT
In the examples that follow I'm using the options spec DISPLAY_TIMESTAMPS at the end of each select but that's entirely optional and you can omit them from most of the examples (except where you want to see them in the output) -- here's a minimal SELECT:
hypertable> select 'http-code' from Pages where ROW='events.getoutaz.com/scottsdale-az/venues/show/455885-scorch-bar'; events.getoutaz.com/scottsdale-az/venues/show/455885-scorch-bar http-code 200 events.getoutaz.com/scottsdale-az/venues/show/455885-scorch-bar http-code 200 events.getoutaz.com/scottsdale-az/venues/show/455885-scorch-bar http-code 200 events.getoutaz.com/scottsdale-az/venues/show/455885-scorch-bar http-code 200
Now, let's use the SELECT command to examine our data a bit -- let's start by looking at all the accesses for a particular page (a Steve Martin event in san francisco) and check the referrer's url:
hypertable> select "refer-url" from Pages where ROW = "www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin" DISPLAY_TIMESTAMPS; 2008-02-02 15:53:02.573496124 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin refer-url www.google.com/search 2008-02-02 15:53:02.573496122 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin refer-url www.google.com/search 2008-02-02 15:53:01.673853169 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin refer-url www.zvents.com/search 2008-02-02 15:53:00.073333232 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin refer-url www.google.com/search 2008-02-02 15:53:00.073333229 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin refer-url www.google.com/search 2008-02-02 15:52:59.673141187 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin refer-url www.google.com/search
Those referrals all came in from google search. Okay, let's check the http code we returned to those visitors:
hypertable> select "http-code" from Pages where ROW = "www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin" DISPLAY_TIMESTAMPS; 2008-02-02 15:53:02.573496123 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:02.573496120 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:01.673853170 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:00.073333231 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:00.073333230 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:52:59.673141186 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200
Looks good -- we made sure that Steve Martin fans got the content they needed. Okay, let's restrict the select using the system time stamp by adding to the WHERE clause:
hypertable> select "http-code" from Pages where ROW = "www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin" && TIMESTAMP >= '2008-02-02 00:00:00' DISPLAY_TIMESTAMPS; 2008-02-02 15:53:02.573496123 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:02.573496120 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:01.673853170 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:00.073333231 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:00.073333230 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:52:59.673141186 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200
Or, restrict to just a particular 2 second period:
hypertable> select "http-code" from Pages where ROW = "www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin" && TIMESTAMP > '2008-02-02 15:53:00' && TIMESTAM P <= '2008-02-02 15:53:02' DISPLAY_TIMESTAMPS; 2008-02-02 15:53:01.673853170 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:00.073333231 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200 2008-02-02 15:53:00.073333230 www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin http-code 200
Of course, we can choose a range of row keys as well, say to look at the referrer's url:
select "refer-url" from Pages where ROW >= "www.zvents.com/events/buy_tickets/80283482" && ROW <= "www.zvents.com/events/buy_tickets/80310000" DISPLAY_TIMESTAMPS; 2008-02-02 15:53:02.573493952 www.zvents.com/events/buy_tickets/80283482 refer-url www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin 2008-02-02 15:53:02.573493949 www.zvents.com/events/buy_tickets/80283482 refer-url www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin 2008-02-02 15:53:00.073330904 www.zvents.com/events/buy_tickets/80283482 refer-url www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin 2008-02-02 15:53:00.073330907 www.zvents.com/events/buy_tickets/80308810 refer-url www.zvents.com/las-vegas-nv/events/show/80308810-the-platters-coasters-and-the-legendary-lead-singers-of-the-temptations
And, as before with a single row key, we can restrict the select to a specific range of system time:
hypertable> select "refer-url" from Pages where ROW >= "www.zvents.com/events/buy_tickets/80283482" && ROW <= "www.zvents.com/events/buy_tickets/80310000" && TIMESTAMP >= '2008-02-02 15:53:00' && TIMESTAMP <= '2008-02-02 15:53:01' DISPLAY_TIMESTAMPS; 2008-02-02 15:53:00.073330904 www.zvents.com/events/buy_tickets/80283482 refer-url www.zvents.com/san-francisco-ca/events/show/80283482-steve-martin 2008-02-02 15:53:00.073330907 www.zvents.com/events/buy_tickets/80308810 refer-url www.zvents.com/las-vegas-nv/events/show/80308810-the-platters-coasters-and-the-legendary-lead-singers-of-the-temptations
INSERT & DELETE
Though it's unlikely to be needed in interactive mode let's do a quick demonstration of INSERT. The key to remember is that hypertable data is versioned by time stamp and that you can either set the time stamp on insert or accept the default based on your current system time. Delete operations also have a time stamp and the delete operates on all data occurring at the time stamp or before.
For example, let's INSERT some data in the past and issue a select command to verify the data:
hypertable> insert into Pages values ('2006-01-01 23:59:59', "www.hypertable.org", "refer-url", "www.zvents.com");
hypertable> select * from Pages where ROW = "www.hypertable.org" DISPLAY_TIMESTAMPS;
2006-01-01 23:59:59.000000000 www.hypertable.org refer-url www.zvents.comNow, let's do a DELETE with the default time stamp and verify that's it deleted using a SELECT:
hypertable> delete * from Pages where ROW = "www.hypertable.org"; delete: row='www.hypertable.org' family=0 ts=0 hypertable> select * from Pages where ROW = "www.hypertable.org";
Now, we'll INSERT some data far into the future (as of this tutorial) and verify that it survives the DELETE:
hypertable> insert into Pages values ('2037-01-01 23:59:59', "www.hypertable.org", "refer-url", "www.zvents.com");
hypertable> delete * from Pages where ROW = "www.hypertable.org";
delete: row='www.hypertable.org' family=0 ts=0
hypertable> select * from Pages where ROW = "www.hypertable.org" DISPLAY_TIMESTAMPS;
2037-01-01 23:59:59.000000000 www.hypertable.org refer-url www.zvents.comNow, let's get rid of it with DELETE time stamped further into the future:
hypertable> delete * from Pages where ROW = "www.hypertable.org" TIMESTAMP '2038-01-01 23:59:59'; delete: row='www.hypertable.org' family=0 ts=2146031999000000001 hypertable> select * from Pages where ROW = "www.hypertable.org";
SHOW TABLES & DROP TABLE
Now, we want to get rid of our table so we'll use DROP TABLE to do that -- first verify the tables we have, then drop, then verify that they are gone after the command:
hypertable> show tables; METADATA PAGES hypertable> drop table Pages; hypertable> show tables; METADATA
The name of the table is case insensitive.
BATCH MODE AND SOURCE
After killing the servers via kill-servers.sh your tables and data are no longer available (for the initial alpha release, even though the data is persisted on disk, as the logic to restart the servers is tied with server failover/recovery, which is not yet complete until the beta release. The replicated commit logs in hypertable is designed prevent data loss in the event of hardware failure). To load your tables and other data back to the system, run the system in batch mode or use SOURCE from an interactive console. First, run kill-servers.sh and start-all-servers.sh:
tinweasel:~/hypertable> bin/kill-servers.sh tinweasel:~/hypertable> bin/start-all-servers.sh local Successfully started DFSBroker (local) Successfully started Hyperspace Successfully started Hypertable.Master Successfully started Hypertable.RangeServer
You can fire up an interactive session and use the source command:
tinweasel:~/hypertable> bin/hypertable
Welcome to the HQL command interpreter.
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
hypertable> source ./examples/hql_tutorial/Pages-create.hql;
Loading 10,872,957 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 2.68 s
Avg value size: 18.68 bytes
Avg key size: 48.70 bytes
Throughput: 4060564.38 bytes/s
Total inserts: 300000
Throughput: 112036.62 inserts/s
Resends: 0
Alternatively, to avoid opening up an interactive session you can use the --batch command -- first, run kill-servers.sh and start-all-servers.sh:
tinweasel:~/hypertable> bin/start-all-servers.sh local
Successfully started DFSBroker (local)
Successfully started Hyperspace
Successfully started Hypertable.Master
Successfully started Hypertable.RangeServer
tinweasel:~/hypertable> bin/hypertable --batch < examples/hql_tutorial/Pages-create.hql
CREATE TABLE Pages (
date,
"refer-url",
"http-code",
ACCESS GROUP default ( date, "refer-url", "http-code" )
);
load data infile ROW_KEY_COLUMN=rowkey "examples/hql_tutorial/access.tsv" into table Pages;
Loading 10,872,957 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 2.71 s
Avg value size: 18.68 bytes
Avg key size: 48.70 bytes
Throughput: 4019384.28 bytes/s
Total inserts: 300000
Throughput: 110900.40 inserts/s
Resends: 0
You can examine the file in examples/hql_tutorial/Pages-create.hql to verify that it's just a batch of HQL commands.
tinweasel:~/hypertable> cat examples/hql_tutorial/Pages-create.hql
CREATE TABLE Pages (
date,
"refer-url",
"http-code",
ACCESS GROUP default ( date, "refer-url", "http-code" )
);
load data infile ROW_KEY_COLUMN=rowkey "examples/hql_tutorial/access.tsv" into table Pages;
Sign in to add a comment

How does it know that the column "date" is the timestamp? What if my data has more than one timestamp formatted columns?
I don't see any mention of data type. Are all columns strings? Even the timestamp?
When you drop table the show tables shows it up again. Wrong copy&paste? ;)
Are there plans for a C++ interface to this? In particular, I will need random access to large quantities of data within Hadoop Mappers / Reducers and having to go through HQL seems slow.
Sorry for the slow response -- the interactive HQL session uses the C++ API so you can do all of this in a client program. See for example ApacheLogLoad and ApacheLogQuery.
Paolo: yes, thanks, that's a copy & paste error.
David: for now everything is a byte strings. Types will be added shortly to support simple functions like SUM.
Using the comment area for questions is awkward. Let's use the hypertable-user mailing-list/discussion group for Q & A.
http://groups.google.com/group/hypertable-user
joseluis: apologies -- the narrative text was wrong and based on an earlier version of this tutorial. I've fixed it now. The date is just another data field in this tutorial (earlier we had it in as an override for the system time stamp).
To override the system time stamp upon data load you use the TIMESTAMP_COLUMN option:
And, date would be required to be in our format: 'YYYY-MM-DD HH:MM:SS'
Okay, all, sorry for the delay -- we didn't have comment notification on previously but we've got it now so we should be more on top of the feedback.