My favorites | Sign in
Project Home Source
Project Information
Members

R, the foremost free 'language and environment for statistical computing', includes various functions for querying and manipulating tabulated datasets, but regrettably R for this purpose cannot natively process SQL (even though many people may be familiar with its verbose and declarative syntax). The function sql.select strives to interpret SQL's SELECT-statements to provide an alternative means to extract, to rearrange, and to analyze tabulated datasets within R.

For illustration, here's is a screenshot (see also examples and syntax):

Using SQL to access data.frames may seem comparable to sqldf, but there are differences: Since sql.select does not pass processing on to specialized database routines (and thus cannot gain from inherent efficiency and scalability gains), it will more likely fail to work on more sizable datasets. On the other hand, its self-contained parsing effort allows sql.select to also interpret mingled R syntax (see Why does sqldf("select var(x) from DF") not work?). This promises to be simultaneously versatile and concise.

Features

  • When enclosed by an escape character (by default '`'), R expressions can be evaluated within several contexts (e.g. SELECT `var(x)` FROM DF).
  • Tables can be compounded 'vertically' with UNION or INTERSECT.
  • Tables can be combined 'horizontally' by using t1 INNER|LEFT JOIN t2 ON|USING (), t1 NATURAL JOIN t2, or t1,t2 (Cartesian product).
  • Anything that is interpretable as a data.frame should also be accessible as a table source.

Limitations

  • Comprehended syntax does not adhere to any standard, but rather loosely revolves around an arbitrary subset of idioms that is presumably frequented by users of SQLite or PostgreSQL.
  • Even the most trivial logical or arithmetic operations have to be provided as R expressions, and require wrapping with escape characters (e.g. WHERE `cond1 & cond2` instead of WHERE cond1 AND cond2), because sql.select itself cannot understand any of them.
  • sql.select does not involve any optimization, so it should rapidly choke on bulky datasets.
  • DESC/ASC modifiers cannot be attached to single ORDER BY-expressions (but sequences can be reversed individually by using `-seq` instead of seq).
  • JOIN sources may not be bracketed (only subselect have to be).
  • SELECT * cannot be combined with SELECTing additional columns or in conjunction with GROUP BY.
  • Anonymous subselects are impermissible, naming is mandatory: ...FROM (SELECT * FROM some.table) AS some.identifier....
  • Explicitly naming tables (SELECT `tablename$columnname`) is only applicable after WHERE, when using INNER JOIN t2 ON (`t1$id==t2$id`), or when SELECTing columns without GROUP BY; qualifying table names thus cannot be used after GROUP BY, after HAVING, after ORDER BY, or when SELECTing columns while the query also contains GROUP BY.
  • sql.select does not understand other commands like INSERT, UPDATE, or DELETE; it also cannot define table structures from scratch.
  • Currently it is just an early-stage prototype, so better expect to be surprised.

At the moment there is no such thing as a downloadable/installable package, but credulous users might consider to presume

source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R")

and head to the examples in order to start tinkering now.

Powered by Google Project Hosting