My favorites | Sign in
Project Home Wiki Issues Source
Project Information
Members
Featured
Downloads
Wiki pages
Links

Shard-Query

What is it?

ShardQuery is a PHP class which is intended to make working with a partitioned dataset easier. It offers an easy to use PHP class interface for querying the dataset. Also included is run_query.php, which is an example application for the ShardQuery class.

Shard-Query can be used transparently, and includes an experimental LUA script for MySQL proxy.

Key Features

  • ParallelPipelining - MPP distributed query engines runs fragments of queries in parallel, combining the results at the end. Like map/reduce except it speaks SQL directly.
  • Access many shards in parallel
  • QueryRouting - Sends queries only to the shard containing the requested data.
  • ConditionPushdown - Aggregation, joins and filtering are always performed at the shard level which fully distributes the work
  • Gearman Workers - PHP is not threaded. Gearman (Net_Gearman) is leveraged instead.
  • Parallel loader and pre-splitter can load or preprocess delimited files with ease. Overhead of looking up/creating new shard keys is spread over coprocesses

The following constructs can be leveraged to add parallelism


  • Subqueries in the FROM clause
  • UNION - however, de-duplication is currently single threaded
  • UNION ALL
  • IN
  • BETWEEN (with integer or date operands)

What kind of results can I expect?

Here is a graph of the performance of Shard-Query on a 55GB MySQL database with an increasing number of workers, compared to a single threaded database query as the amount of data examined increases. The exact same MySQL server and settings were used for both.

In the above test, a 24 core server was used, and all of the data, indexes, etc, could fit within the InnoDB buffer pool (in-memory test), so you see no improvement from 24 cores to 32 cores. You would have to add cores (or use more than one storage node) to get increased throughput to 32 (or more) threads.

And here is the blog post explaining that graph: http://www.mysqlperformanceblog.com/2010/11/15/shard-query-adds-parallelism-to-queries/

What kind of queries can I run? (see some limitations below)

You can run just about all queries over your sharded dataset. SELECT queries sent via the ShardQuery->query() method are subject to the limitations below. There is also a method for sending any type of query ShardQuery->broadcast() but no parsing is done on the queries and you are responsible for proper aggregation, etc. The broadcast() method allows you to send DDL or DML to every shard.

If you use aggregate functions

  • All non-aggregated expressions in the SELECT clause must appear in the GROUP BY
  • Aggregate functions must stand alone: sum(c1+c2) is legal, sum(c1)+sum(c2) is not.
  • SUM,COUNT,MIN and MAX are the fastest aggregate operations
  • Other aggregate functions are considerably more expensive and more data must be processed.

Other limitations

  • Only a single partitioning column is supported (when using QueryRouting)
  • You should specify an alias for all expressions for best results
  • select * is not yet supported (might never be). Please list the columns you want.

How it works

  • The query is parsed using http://code.google.com/p/php-sql-parser
  • A modified version of the query is executed on each shard.
  • The queries are executed in parallel using http://gearman.org
  • The results from each shard are combined together
  • A version of the original query is then executed over the combined results
  • All aggregation is done on the slaves (pushed down)
  • Queries with inlists can be made into parallel queries.
  • A callback can be used for QueryRouting. You provide a partition column, and a callback which returns information pointing to the correct shard. The most convenient way to do this is with Shard-Key-Mapper

Consider the query:

select sum(c1) as expr, count(*) as cnt from t1 where the_date in ('2010-05-20','2010-05-21')

Assume that the table t1 is partitioned on the_date and that each in-list value refers to a different partition. The query will be rewritten to be the equivalent of:

select sum(`expr`) as `expr`, sum(`cnt`) as `cnt`
from ( select sum(c1) as `expr`, count(*) as `cnt`
         from t1
        where the_date = '2010-05-20'
       union all
       select sum(c1) as `expr`, count(*) as `cnt`
         from t1
        where the_date = '2010-05-21'
     ) 

In reality, both of those select statements (the ones in the UNION ALL) are sent in parallel using gearman, and a table is used to combine the results. Assuming you have the CPU and disk capacity to scan multiple partitions in parallel, then this query should be faster than scanning the partitions serially. The table into which the rows are inserted is equivalent to the UNION ALL operation.

Example Application

run_query.php is the example application which shows how to use the ShardQuery class.

Use php run_query.php --help to get available command line options.

Powered by Google Project Hosting