My favorites | Sign in
Project Logo
       
Code license: GNU General Public License v2
Labels: R, SQL
People details
Project owners:
  ggrothendieck

To write it, it took three months; to conceive it – three minutes; to collect the data in it – all my life. F. F. Scott Fitzgerald

sqldf is an R package for performing SQL statements on R data frames, optimized for convenience. sqldf is free software published under the GNU General Public License and can be downloaded from CRAN. To get information on how to cite sqldf in papers, issue the R commands:

library(sqldf)
citation("sqldf")

sqldf consists of a thin layer over the R packages RSQLite and RMySQL. (The code for accessing RSQLite has been tested but the code for accessing RMySQL has only been partly tested.). More information can be found from within R by installing and loading the sqldf package and then entering ?sqldf and, in the devel version also ?read.csv.sql. A number of examples are at the end of this page and more examples are accessible from within R in the examples section of the ?sqldf help page.

As can be seen from these examples, with sqldf the user is freed from having to do the following, all of which are automatically done:

It can be used:

Below on this page are sections on:

News

September 25, 2009. A new version of sqldf is on CRAN. It contains bug fixes and can also handle table names with a dot in the name provided the table name is enclosed in back quotes in the SQL statement.

August 30, 2009. Added Example 4f temporal join to this page.

July 3, 2009. Current effort (not yet in repository) is to get sqldf to work with the H2 data base.

June 16, 2009. Added read.csv2.sql to development version. It is like read.csv.sql except that sep defaults to ";" . See Example 13b at the end of this page.

June 7, 2009. Version 0.1-5 of sqldf is now on CRAN and should propogate to the mirrors shortly. read.csv.sql is new. See Example 13 below.

June 4, 2009. New command read.csv.sql.

May 16, 2009. Example 6g added below.

April 22, 2009. Added example 4e (left join) in the Examples section below. Example 4 section

March 29, 2009. Added example 7c in the Examples section below. Example 7 section

March 25, 2009. Added to FAQ 3 showing how to use group_concat to apply R functions.

March 17, 2009. Added Example 4d, temporal join, in Examples section below.

February 20, 2009. Added Example 12. Combine two files in permanent database.

February 5, 2009. Added to FAQ 2 and created new Example 11 thanks to Michael Rehberg.

January 16, 2009. Added new FAQ section below and incorporated old Heuristic section into it as question 1.

December 10, 2008. sqldf 0.1-4 uploaded to CRAN

November 19, 2008. Minor improvements to this web page.

September 30, 2008. Added example 6f which shows how to work with files that have fixed columns widths (as opposed to the fields being delimited).

June 17, 2008. Added persistent connections to sqldf. It allows one to write this: sqldf(); sqldf(s1); sqldf(s2); sqldf() where s1 and s2 are character strings containing SQL statements. The first and last sqldf statements with no args open and close a connection and the middle two use it implicitly. There are also facilities to explicitly reference the connection so that sqldf and RSQLite calls can be intermixed. See Examples 10a and 10b below -- which are new.

June 16, 2008. Added Example 9 below.

April 18, 2008. Updated section below on the sqldf heuristic.

April 14, 2008. New section on the Heuristic sqldf uses further down on this page.

January 29, 2008. New Example 8 below was added.

November 16, 2007. Added Example 7b below. This shows a query that is similar to 7a but in the context of time series.

October 28, 2007. Added Example 7 below showing a complex query.

October 12, 2007. Added Example 6e showing how to read a random set of rows from a file without reading the entire file into R.

August 29, 2007. Expanded Example 6 below.

August 11, 2007. Changes in the development version of sqldf are that the sql argument, x can now be a vector with one component per sql command. Each will be executed in turn and result of last one returned.

August 7, 2007. Changes in the development version of sqldf are:

July 31, 2007. sqldf 0.1-1 (replacing sqldf 0.1-0) is on CRAN. See NEWS file for changes.

FAQ

1. How does sqldf handle classes and factors?

sqldf uses a heuristic to assign classes and factor levels to returned results. It checks each column name returned against the column names in the input data frames and uses the class (or if its a factor it uses both the class and factor levels) of the column name that it first finds to match exactly. If the column names of all data frames are distinct this normally works and it also normally works if column names of the same name have the same class. If there are multiple columns with the same names but with different classes or if there are multiple column names that are factors but have different factor levels then it will always use the first one so be sure to order your data frames in the statement such that any output corresponds to the data frame that comes first. If that is not feasible then rename the columns prior to running sqldf to make them distinct. Also note that by using the argument method = "raw" one can avoid the automatic class assignment entirely and this may give better results or even be essential in certain cases with duplicate column names. Here is an example of how things can go wrong. In the first call to sqldf there are two b columns and since it searches in order the second b column which should have levels X and Y were given the levels from the first b column mistakenly. We hope to make an improvement in the heuristic that will detect such situations. At any rate, using method = "raw" currently addresses it as shown in the second call to sqldf (since that avoids the heuristic entirely).

> # example thanks to Adrian Dragulescu
> D1 <- data.frame(a = c(1,1,2,2), b = c("x", "y", "x", "y"))
> D2 <- data.frame(a = c(1,2,1,2), b = c("X", "X", "Y", "Y"))
> sqldf("select * from D1, D2 where D1.a = D2.a") # oops!
  a b a b
1 1 x 1 x
2 1 x 1 x
3 1 y 1 y
4 1 y 1 y
5 2 x 2 x
6 2 x 2 x
7 2 y 2 y
8 2 y 2 y
> # it works ok if we use method = "raw"
> sqldf("select * from D1, D2 where D1.a = D2.a", method = "raw")
  a b a b
1 1 x 1 X
2 1 x 1 Y
3 1 y 1 X
4 1 y 1 Y
5 2 x 2 X
6 2 x 2 Y
7 2 y 2 X
8 2 y 2 Y

sqldf knows about Date, POSIXt (POSIXct, POSIXlt) and chron (dates, times) classes but not other date and time classes.

2. Why does sqldf seem to mangle certain variable names?

This happens when you try to use variable with a dot in it (as dots have special meaning to SQL) or if you try to use SQL92 reserved keywords. SQLite/RSQLite replaces dots with underscores and changes keywords words so that they can be used. The keywords in question can be found by entering the following at the R command line:

.SQL92Keywords

Note that using such names can sometimes result in an error message such as:

Error in sqliteExecStatement(con, statement, bind.data) :
 RS-DBI driver: (error in statement: no such column: ...)

which appears to suggest that there is no column but that is because it has a different name than expected. For an example of what happens:

> # based on example by Adrian Dragulescu
> DF <- data.frame(index=1:12, date=rep(c(Sys.Date()-1, Sys.Date()), 6),
+   group=c("A","B","C"), value=round(rnorm(12),2))
>
> library(sqldf)
> sqldf("select * from DF")
  index__1 date__1 group__1 value__1
1         1 14259.0        A    -0.24
2         2 14260.0        B     0.16
3         3 14259.0        C     1.24
4         4 14260.0        A    -1.16
5         5 14259.0        B    -0.19
6         6 14260.0        C     0.65
7         7 14259.0        A    -1.24
8         8 14260.0        B    -0.34
9         9 14259.0        C    -0.27
10       10 14260.0        A    -0.18
11       11 14259.0        B     0.57
12       12 14260.0        C    -0.83
> intersect(names(DF), tolower(.SQL92Keywords))
[1] "index" "date"  "group" "value"
> DF2 <- DF
> # change column names to i, d, g and v
> names(DF2) <- substr(names(DF), 1, 1)
> sqldf("select * from DF2")
    i          d g     v
1   1 2009-01-16 A  0.35
2   2 2009-01-17 B -0.96
3   3 2009-01-16 C  0.76
4   4 2009-01-17 A  0.07
5   5 2009-01-16 B  0.03
6   6 2009-01-17 C  0.19
7   7 2009-01-16 A -2.03
8   8 2009-01-17 B  0.98
9   9 2009-01-16 C -1.21
10 10 2009-01-17 A -0.67
11 11 2009-01-16 B  2.49
12 12 2009-01-17 C -0.63

Note that in the first case above had we wished to refer to value, say, we would have had to refer to it as value1, e.g. sqldf("select value__1 from DF")

3. Why does sqldf("select var(x) from DF") not work?

The SQL statement passed to sqldf must be a valid SQL statement understood by the database. That includes simple functions and aggregate functions. R functions are not supported. For SQLite see the lists of aggregate functions and core functions.

If each group is not too large we can use group_concat to return all group members and then later use apply in R to calculate the function. For example, in the following we summarize the data using sqldf and then apply a function based on var:

> DF <- data.frame(a = 1:8, g = gl(2, 4))
> out <- sqldf("select group_concat(a) groupa from DF group by g")
> out
   groupa
1 1,2,3,4
2 5,6,7,8
> out$var <- apply(out, 1, function(x) var(as.numeric(strsplit(x, ",")[[1]])))
> out
   groupa      var
1 1,2,3,4 1.666667
2 5,6,7,8 1.666667

4. How does sqldf work with "Date" class variables?

"Date" class variables are stored as numbers, i.e. days since the Epoch, in R and when transferred to sqlite get transferred as numbers. Try this:

> library(sqldf)
> DF <- data.frame(a = Sys.Date() + 1:5, b = 1:5)
> DF
          a b
1 2009-07-31 1
2 2009-08-01 2
3 2009-08-02 3
4 2009-08-03 4
5 2009-08-04 5
> Sys.Date() + 2
[1] "2009-08-01"
> s <- sprintf("select * from DF where a >= %d", Sys.Date() + 2)
> s
[1] "select * from DF where a >= 14457"
> sqldf(s)
          a b
1 2009-08-01 2
2 2009-08-02 3
3 2009-08-03 4
4 2009-08-04 5

> # to compare against character string store a as character
> DF2 <- transform(DF, a = as.character(a))
> sqldf("select * from DF2 where a >= '2009-08-01'")
          a b
1 2009-08-01 2
2 2009-08-02 3
3 2009-08-03 4
4 2009-08-04 5

Examples

These examples illustrate both sqldf and SQL, in general. sqldf does not itself directly process SQL statements itself but merely hands them off to RSQLite or RMySQL which in turn hands off to SQLite or MySQL; nevertheless, the examples do show that one is not limited by sqldf in creating queries:

Example 1. Ordering and Limiting

Here is an example of sorting and limiting output from an SQL select statement on the iris data frame that comes with R. Note that although the iris dataset uses the name Sepal.Length the R DBI layer which underlies both RSQLite and RMySQL converts that to Sepal_Length. After installing sqldf in R, just type the first two lines into the R console (without the >):

> library(sqldf)
> sqldf("select * from iris order by Sepal_Length desc limit 3")

  Sepal_Length Sepal_Width Petal_Length Petal_Width   Species
1          7.9         3.8          6.4         2.0 virginica
2          7.7         3.8          6.7         2.2 virginica
3          7.7         2.6          6.9         2.3 virginica

Example 2. Averaging and Grouping

Here is an example which processes an SQL select statement whose functionality is similar to the R aggregate function.

> sqldf("select Species, avg(Sepal_Length) from iris group by Species")

     Species avg(Sepal_Length)
1     setosa             5.006
2 versicolor             5.936
3  virginica             6.588

Example 3. Nested Select

Here is a more complex example. For each Species, find the average Sepal Length among those rows where Sepal Length exceeds the average Sepal Length for that Species. Note the use of a subquery and explicit column naming:

> sqldf("select iris.Species '[Species]', 
+       avg(Sepal_Length) '[Avg of SLs > avg SL]'
+    from iris, 
+         (select Species, avg(Sepal_Length) SLavg 
+         from iris group by Species) SLavg
+    where iris.Species = SLavg.Species 
+       and Sepal_Length > SLavg
+    group by iris.Species")

   [Species] [Avg of SLs > avg SL]
1     setosa              5.313636
2 versicolor              6.375000

3  virginica              7.159091

Example 4. Join

Here is one more example. We define a new data frame, Abbr, join it with iris and perform the aggregation:

> # Example 4a.
> Abbr <- data.frame(Species = levels(iris$Species), 
+    Abbr = c("S", "Ve", "Vi"))
>
> sqldf("select Abbr, avg(Sepal_Length) 
+   from iris natural join Abbr group by Species")

  Abbr avg(Sepal_Length)
1    S             5.006
2   Ve             5.936
3   Vi             6.588

Although the above is probably the shortest way to write it in SQL, using natural join can be a bit dangerous since one must be very sure one knows precisely which column names are common to both tables. For example, had we included the row_names as a column in both tables (by specifying row.names = TRUE to sqldf) the natural join would not work as intended since the row_names columns would participate in the join. An alternate and safer way to write this would be with join and using:

> # Example 4b.
> sqldf("select Abbr, avg(Sepal_Length) 
+   from iris join Abbr using(Species) group by Species")

  Abbr avg(Sepal_Length)
1    S             5.006
2   Ve             5.936
3   Vi             6.588

or with a where clause:

> # Example 4c.
> sqldf("select Abbr, avg(Sepal_Length) from iris, Abbr
+    where iris.Species = Abbr.Species group by iris.Species")

  Abbr avg(Sepal_Length)
1    S             5.006
2   Ve             5.936
3   Vi             6.588

or a temporal join where the goal is, for each Species/station_id pair, to join the records with the closest date/times.

> # Example 4d. Temporal Join
> # see: https://stat.ethz.ch/pipermail/r-help/2009-March/191938.html
>
> library(chron)
> 
> Species.Lines <- "Species,Date_Sampled
+ SpeciesB,2008-06-23 13:55:11
+ SpeciesA,2008-06-23 13:43:11
+ SpeciesC,2008-06-23 13:55:11"
> 
> species <- read.csv(textConnection(Species.Lines), as.is = TRUE)
> species$dt <- as.numeric(as.chron(species$Date))
> 
> Temp.Lines <- "Station_id,Date,Value
+ ANH,2008-06-23 13:00:00,1.96
+ ANH,2008-06-23 14:00:00,2.25
+ BDT,2008-06-23 13:00:00,4.23
+ BDT,2008-06-23 13:15:00,4.11
+ BDT,2008-06-23 13:30:00,4.01
+ BDT,2008-06-23 13:45:00,3.9
+ BDT,2008-06-23 14:00:00,3.82"
> 
> temp <- read.csv(textConnection(Temp.Lines), as.is = TRUE)
> temp$dt <- as.numeric(as.chron(temp$Date))
> 
> out <- sqldf("select s.Species, s.dt, t.Station_id, t.Value__1 
+ from species s, temp t 
+ where abs(s.dt - t.dt) = 
+ (select min(abs(s2.dt - t2.dt)) 
+ from species s2, temp t2
+ where s.Species = s2.Species and t.Station_id = t2.Station_id)")
> out$dt <- chron(out$dt)
> out
   Species                  dt Station_id Value__1
1 SpeciesB (06/23/08 13:55:11)        ANH     2.25
2 SpeciesB (06/23/08 13:55:11)        BDT     3.82
3 SpeciesA (06/23/08 13:43:11)        ANH     2.25
4 SpeciesA (06/23/08 13:43:11)        BDT     3.90
5 SpeciesC (06/23/08 13:55:11)        ANH     2.25
6 SpeciesC (06/23/08 13:55:11)        BDT     3.82

or a left join

> # Example 4e. Left Join
> # https://stat.ethz.ch/pipermail/r-help/2009-April/195882.html
> #
> SNP1x <-
+ structure(list(Animal = c(194073197L, 194073197L, 194073197L, 
+ 194073197L, 194073197L), Marker = structure(1:5, .Label = c("P1001", 
+ "P1002", "P1004", "P1005", "P1006", "P1007"), class = "factor"), 
+     x = c(2L, 1L, 2L, 0L, 2L)), .Names = c("Animal", "Marker", 
+ "x"), row.names = c("3213", "1295", "915", "2833", "1487"), class = "data.frame")
> 
> SNP4 <- 
+ structure(list(Animal = c(194073197L, 194073197L, 194073197L, 
+ 194073197L, 194073197L, 194073197L), Marker = structure(1:6, .Label = c("P1001", 
+ "P1002", "P1004", "P1005", "P1006", "P1007"), class = "factor"), 
+     Y = c(0.021088, 0.021088, 0.021088, 0.021088, 0.021088, 0.021088
+     )), .Names = c("Animal", "Marker", "Y"), class = "data.frame", row.names = c("3213", 
+ "1295", "915", "2833", "1487", "1885"))
>
> SNP1x
        Animal Marker x
3213 194073197  P1001 2
1295 194073197  P1002 1
915  194073197  P1004 2
2833 194073197  P1005 0
1487 194073197  P1006 2
> SNP4
        Animal Marker        Y
3213 194073197  P1001 0.021088
1295 194073197  P1002 0.021088
915  194073197  P1004 0.021088
2833 194073197  P1005 0.021088
1487 194073197  P1006 0.021088
1885 194073197  P1007 0.021088
>
> library(sqldf)
> sqldf("select * from SNP4 left join SNP1x using (Animal, Marker)")
     Animal Marker        Y  x
1 194073197  P1001 0.021088  2
2 194073197  P1002 0.021088  1
3 194073197  P1004 0.021088  2
4 194073197  P1005 0.021088  0
5 194073197  P1006 0.021088  2
6 194073197  P1007 0.021088 NA
> # or if that takes up too much memory 
> # create/use/destroy external database
> sqldf("select * from SNP4 left join SNP1x using (Animal, Marker)", dbname = "test.db")
     Animal Marker        Y  x
1 194073197  P1001 0.021088  2
2 194073197  P1002 0.021088  1
3 194073197  P1004 0.021088  2
4 194073197  P1005 0.021088  0
5 194073197  P1006 0.021088  2
6 194073197  P1007 0.021088 NA
> # Example 4f.  Another temporal join.
> # join DF2 to row in DF for which DF.tt and DF2.tt are closest
> 
> DF <- structure(list(tt = c(3, 6)), .Names = "tt", row.names = c(NA, 
+ -2L), class = "data.frame")
> DF
  tt
1  3
2  6
> 
> DF2 <- structure(list(tt = c(1, 2, 3, 4, 5, 7), d = c(8.3, 10.3, 19, 
+ 16, 15.6, 19.8)), .Names = c("tt", "d"), row.names = c(NA, -6L
+ ), class = "data.frame", reference = "A1.4, p. 270")
> DF2
  tt    d
1  1  8.3
2  2 10.3
3  3 19.0
4  4 16.0
5  5 15.6
6  7 19.8
> 
> out <- sqldf("select * from DF d, DF2 a, DF2 b 
+ where a.row_names = b.row_names - 1 
+ and d.tt > a.tt and d.tt <= b.tt", 
+ row.names = TRUE)
>  
> out$dd <- with(out, ifelse(tt < (tt.1 + tt.2) / 2, d, d.1))
> out
  tt tt.1    d tt.2  d.1   dd
1  3    2 10.3    3 19.0 19.0
2  6    5 15.6    7 19.8 19.8

Example 5. Insert Variables

Here is an example of inserting evaluated variables into a query using gsubfn quasi-perl-style string interpolation. gsubfn is used by sqldf so its already loaded. Note that we must use the fn$ prefix to invoke the interpolation functionality:

> minSL <- 7
> limit <- 3
> fn$sqldf("select * from iris where Sepal_Length > $minSL limit $limit")

  Sepal_Length Sepal_Width Petal_Length Petal_Width   Species
1          7.1         3.0          5.9         2.1 virginica
2          7.6         3.0          6.6         2.1 virginica
3          7.3         2.9          6.3         1.8 virginica

Example 6. File Input

Note that there is a new command read.csv.sql which provides an alternate interface to the the approach discussed in this section. See Example 13 for that.

sqldf will not only look for data frames used in the SQL statement but will also look for R objects of class "file". For such objects it will directly import the associated file into the database without going through R allowing files that are larger than an R workspace to be handled and also providing for potential speed advantages. That is, if f <- file("abc.csv") is a file object and f is used as the table name in the sql statement then the file abc.csv is imported into the database as table f. With SQLite, the actual reading of the file into the database is done in a C routine in RSQLite so the file is transferred directly to the database without going through R. If the sqldf argument dbname is used then it specifies a filename (either existing or created by sqldf if not existing). That filename is used as a database (rather than memory) allowing larger files than physical memory. By using an appropriate where statement or a subset of column names a portion of the table can be retrieved into R even if the file itself is too large for R or for memory.

There are some caveats. The RSQLite dbWriteTable/sqliteImportFile routines that sqldf uses to transfer the file directly to the database are intended for speed thus they are not as flexible as read.table. Also they have slightly different defaults. The default for sep is file.format = list(sep = ","). If the first row of the file has one fewer component than subsequent ones then it assumes that file.format = list(header = TRUE, row.names = TRUE) and otherwise that file.format = list(header = FALSE, row.names = FALSE). .csv file format is only partly supported -- quotes are not regarded as special.

In addition to the examples below there is an example here.

> # Example 6a.
> # test of file connections with sqldf
> 
> # create test .csv file of just 3 records
> write.table(head(iris, 3), "iris3.dat", sep = ",", quote = FALSE)
> 
> # look at contents of iris3.dat
> readLines("iris3.dat")
[1] "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species"
[2] "1,5.1,3.5,1.4,0.2,setosa"                                 
[3] "2,4.9,3,1.4,0.2,setosa"                                   
[4] "3,4.7,3.2,1.3,0.2,setosa"                                 
> 
> # set up file connection
> iris3 <- file("iris3.dat")
> sqldf("select * from iris3 where Sepal_Width > 3")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
>
> # Example 6b.
> # similar but uses disk - useful if file were large
> # According to http://www.sqlite.org/whentouse.html
> # SQLite can handle files up to several dozen gigabytes.
> # (Note in this case readTable and readTableIndex in R.utils
> # package or read.table from the base of R, setting the colClasses 
> # argument to "NULL" for columns you don't want read in, might be
> # alternatives.)
> sqldf("select * from iris3 where Sepal_Width > 3", dbname = tempfile())
 Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa

> # Example 6c.
> # with this format, header=TRUE needs to be specified
> write.table(head(iris, 3), "iris3a.dat", sep = ",", quote = FALSE, 
+  row.names = FALSE)
> iris3a <- file("iris3a.dat")
> sqldf("select * from iris3a", file.format = list(header = TRUE))
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

> # Example 6d.
> # header can alternately be specified as object attribute
> attr(iris3a, "file.format") <- list(header = TRUE)
> sqldf("select * from iris3a")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

> # Example 6e.
> # create a test file with all 150 records from iris
> # and select 4 records at random without reading entire file into R
> write.table(iris, "iris150.dat", sep = ",", quote = FALSE)
> iris150 <- file("iris150.dat")
> sqldf("select * from iris150 order by random(*) limit 4")
  Sepal_Length Sepal_Width Petal_Length Petal_Width   Species
1          4.9         2.5          4.5         1.7 virginica
2          4.8         3.0          1.4         0.1    setosa
3          6.1         2.6          5.6         1.4 virginica
4          7.4         2.8          6.1         1.9 virginica

Example 6f. If our file has fixed width fields rather than delimited then we can still handle it if we parse the lines manually with substr:

# write some test data to "fixed"
# Field 1 has width of 1 column and field 2 has 4 columns
cat("1 8.3
210.3

319.0
416.0
515.6
719.8
", file = "fixed")

# get 3 random records using sqldf
fixed <- file("fixed")
sqldf("select substr(V1, 1, 1) f1, substr(V1, 2, 4) f2 from fixed order by random(*) limit 3")

Example 6g. Defaults.

# If first row has one fewer columns than subsequent rows then 
# header <- row.names <- TRUE is assumed as in example 6a; otherwise,
# header <- row.names <- FALSE is assumed as shown here:

> write.table(head(iris, 3), "iris3nohdr.dat", col.names = FALSE, row.names = FALSE, sep = ",", quote = FALSE)
> readLines("iris3nohdr.dat")
[1] "5.1,3.5,1.4,0.2,setosa" "4.9,3,1.4,0.2,setosa"   "4.7,3.2,1.3,0.2,setosa"
> sqldf("select * from iris3nohdr")
   V1  V2  V3  V4     V5
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa

Example 7. Nested Select

For each species show the two rows with the largest sepal lengths:

> # Example 7a.
> sqldf("select * from iris i 
+   where rowid in 
+    (select rowid from iris where Species = i.Species order by Sepal_Length desc limit 2)
+   order by i.Species, i.Sepal_Length desc")

  Sepal_Length Sepal_Width Petal_Length Petal_Width    Species
1          5.8         4.0          1.2         0.2     setosa
2          5.7         4.4          1.5         0.4     setosa
3          7.0         3.2          4.7         1.4 versicolor
4          6.9         3.1          4.9         1.5 versicolor
5          7.9         3.8          6.4         2.0  virginica
6          7.7         3.8          6.7         2.2  virginica

Here is a similar example. In this one DF represents a time series whose values are in column x and whose times are dates in column tt. The times have gaps -- in fact only every other day is present. The code below displays the first row at or past the 21st of the month for each year/month. First we append year, month and day columns using month.day.year from the chron package and then do the computation using sqldf. (For a version of this using the zoo package rather than sqldf see: https://stat.ethz.ch/pipermail/r-help/2007-November/145925.html).

> # Example 7b.
> #
> library(chron)
> DF <- data.frame(x = 101:200, tt = as.Date("2000-01-01") + seq(0, len = 100, by = 2))
> DF <- cbind(DF, month.day.year(unclass(DF$tt)))
> 
> sqldf("select * from DF d
+   where rowid in 
+    (select rowid from DF 
+       where year__1 = d.year__1 and month__1 = d.month__1 and day__1 >= 21 limit 1)
+    order by tt")
    x         tt month__1 day__1 year__1
1 111 2000-01-21        1     21    2000
2 127 2000-02-22        2     22    2000
3 141 2000-03-21        3     21    2000
4 157 2000-04-22        4     22    2000
5 172 2000-05-22        5     22    2000
6 187 2000-06-21        6     21    2000

Here is another example of a nested select. We select each row of a for which st/en overlaps with some st/en of b.

> # Example 7c.
> #
> a <- read.table(textConnection("st en
+ 1 4
+ 11 14
+ 3 4"), header = TRUE)
> 
> b <- read.table(textConnection("st en
+ 2 5
+ 3 6
+ 30 44"), TRUE)
> 
> sqldf("select * from a where 
+ (select count(*) from b where a.en >= b.st and b.en >= a.st) > 0")
  st en
1  1  4
2  3  4

Example 8. Specifying File Format

When using file() as used as in Example 6 RSQLite reads in the first 50 lines to determine the column classes. What if they all have numbers in them but then later we start to see letters? In that case we will have to override its choice. Here are two ways:

library(sqldf)

# example example 8a - file.format attribute on file.object

numStr <- as.character(1:100)
DF <- data.frame(a = c(numStr, "Hello"))
write.table(DF, file = "~/tmp.csv", quote = FALSE, sep = ",")
ff <- file("~/tmp.csv")

attr(ff, "file.format") <- list(colClasses = c(a = "character"))

tail(sqldf("select * from ff"))


# example 8b - using file.format argument

numStr <- as.character(1:100)
DF <- data.frame(a = c(numStr, "Hello"))
write.table(DF, file = "~/tmp.csv", quote = FALSE, sep = ",")
ff <- file("~/tmp.csv")

tail(sqldf("select * from ff",
 file.format = list(colClasses = c(a = "character"))))

Example 9. Working with Databases

sqldf is usually used to operate on data frames but it can be used to store a table in a database and repeatedly query it in subsequent sqldf statements using the fact that if you specify the database explicitly then it does not delete the database at the end and if you create a table explicitly using create table then it does not delete the table.

# create new empty database called mydb
sqldf("attach 'mydb' as new") 



# create a new table, mytab, in the new database
# Note that sqldf does not delete tables created from create.
sqldf("create table mytab as select * from BOD", dbname = "mydb")

# shows its still there
sqldf("select * from mytab", dbname = "mydb")

Example 10. Persistent Connections

These two examples show the use of persistent connections in the development version of sqldf. This would be used when one has a large database that one wants to store and then make queries from so that one does not have to reload it on each execution of sqldf. (Note that if one just needs a series of sql statements ending in a single query an alternative would be just to use a vector of sql statements in a single sqldf call.)

> # Example 10a.
>
> # create test .csv file of just 3 records (same as example 6)
> write.table(head(iris, 3), "iris3.dat", sep = ",", quote = FALSE)
> # set up file connection
> iris3 <- file("iris3.dat")
> # creates connection so in memory database persists after sqldf call
> sqldf() 
<SQLiteConnection:(7384,62)> 
> 
> # uses connection just created
> sqldf("select * from iris3 where Sepal_Width > 3")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
> # we now have iris3 variable in R workspace and an iris3 table
> # so ensure sqldf uses the one in the main database by writing
> # main.iris3.  (Another possibility here would have been to
> # delete the iris3 variable from the R workspace to avoid the
> # ambiguity -- in that case one could just write iris3 instead
> # of main.iris3.)
> sqldf("select * from main.iris3 where Sepal_Width = 3")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          4.9           3          1.4         0.2  setosa
> 
> # close
> sqldf()
NULL

> # Example 10b.
> #
> # Here is another way to do example 10a.  We use the same iris3,
> # iris3.dat and sqldf development version as above.  
> # We grab connection explicitly, set up the database using sqldf and then 
> # for the second call we call dbGetQuery from RSQLite.  
> # In that case we don't need to qualify iris3 as main.iris3 since
> # RSQLite would not understand R variables anyways so there is no 
> # ambiguity.

> con <- sqldf() 
> 
> # uses connection just created
> sqldf("select * from iris3 where Sepal_Width > 3")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
> dbGetQuery(con, "select * from iris3 where Sepal_Width = 3")
  row_names Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1         2          4.9           3          1.4         0.2  setosa
> 
> # close
> sqldf()
NULL

Example 11. Between and Alternatives

# example thanks to Michael Rehberg
#
# build sample dataframes
seqdf <- data.frame(thetime=seq(100,225,5),thevalue=factor(letters))
boundsdf <- data.frame(thestart=c(110,160,200),theend=c(130,180,220),groupID=c(555,666,777))

# run the query using two inequalities
testquery_1 <- sqldf("select seqdf.thetime, seqdf.thevalue, boundsdf.groupID 
from seqdf left join boundsdf on (seqdf.thetime <= boundsdf.theend) and (seqdf.thetime >= boundsdf.thestart)")

# run the same query using 'between...and' clause
testquery_2 <- sqldf("select seqdf.thetime, seqdf.thevalue, boundsdf.groupID 
from seqdf LEFT JOIN boundsdf ON (seqdf.thetime BETWEEN boundsdf.thestart AND boundsdf.theend)")

Example 12. Combine two files in permanent database

When we issue a series of normal sqldf statements after each one sqldf automatically removes any tables and databases it creates in that statement; however, it does not know about ones that sqlite creates so a database created using attach and the tables created using create table won't be deleted.

Also if sqldf is used without the x= argument (omitting x= denotes the opening of a persistent connection) then objects created in the database including those by sqldf and sqlite are not deleted when the persistent connection is destroyed by the next sqldf statement with no x= argument.

If we have forgetten whether you have a connection open or not we can check either of these:

dbListConnections(SQLite()) # from DBI

getOption("sqldf.connection") # set by sqldf

Here is an example that illustrates part of the above. See the prior examples for more.

> # set up some test data
> write.table(head(iris, 3), "irishead.dat", sep = ",", quote = FALSE)
> write.table(tail(iris, 3), "iristail.dat", sep = ",", quote = FALSE)
> 
> library(sqldf)
> 
> # create new empty database called mydb
> sqldf("attach 'mydb' as new") 
NULL
> 
> irishead <- file("irishead.dat")
> iristail <- file("iristail.dat")
> 
> # read tables into mydb
> sqldf("select count(*) from irishead", dbname = "mydb")
  count(*)
1        3
> sqldf("select count(*) from iristail", dbname = "mydb")
  count(*)
1        3
> 
> # get count of all records from union
> sqldf('select count(*) from (select * from main.irishead 
+ union 
+ select * from main.iristail)', dbname = "mydb")
  count(*)
1        6

Example 13. read.csv.sql

read.csv.sql is an interface to sqldf that works like read.csv in R except that it also provides an sql= argument and not all of the other arguments of read.csv are supported. It uses (1) SQLite's import facility via RSQLite to read the input file into a temporary disk-based SQLite database which is created on the fly. (2) Then it uses the provided SQL statement to read the table so created into R. As the first step imports the data directly into SQLite without going through R it can handle larger files than R itself can handle as long as the SQL statement filters it to a size that R can handle. Here is Example 6c redone using this facility:

# Example 13a.
library(sqldf)

write.table(iris, "iris.csv", sep = ",", quote = FALSE, row.names = FALSE)
iris.csv <- read.csv.sql("iris.csv", 
	sql = "select * from file where Sepal_Length > 5")

The development version of sqldf also has read.csv2.sql which uses semicolon in place of comma:

# Example 13b.

# get read.csv2.sql from devel version of sqldf
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")

# note sep = ";"
write.table(iris, "iris.csv", sep = ";", quote = FALSE, row.names = FALSE)
iris.csv2 <- read.csv2.sql("iris.csv", 
	sql = "select * from file where Sepal_Length > 5")

Bugs

There is a problem involving "Date" class variables and certain other class variables.

Below, the result suggests that the internal numeric representation of the Date data is stored in the database as character data. As a result the maximum is taken alphabetically whereas we want the numerical maximum. Since 1997-01-03 is internally represented by 9864.0 (number of days since 1970-01-01) and since that is the largest alphabetically (though not numerically) we unfortunately get that as the answer:

> library(sqldf)
> test1 <- data.frame(sale_date = as.Date(c("2008-08-01", "2031-01-09",
+ "1990-01-03", "2007-02-03", "1997-01-03", "2004-02-04")))

> as.numeric(test1[[1]])
[1] 14092 22288  7307 13547  9864 12452

> sqldf("select max(sale_date) from test1")
 max(sale_date)
1         9864.0

The problem stems from the fact that RSQLite does not specifically handle Date variables so it necessarily falls back to its default processing which does not act in the desired way here. The RSQLite maintainer has indicated that there is a likelihood that he will address this.









Hosted by Google Code