gsubfn is an R package providing utilities for strings and function arguments. The key command, also called gsubfn, is like gsub but can take a function, formula, list or proto replacement object.
Below on this page are sections on NEWS, OVERVIEW, CITATION, BUGS and EXAMPLES.
NEWS
News can be found in the NEWS file (also available within R in the package itself). Also note that, except for a few, the examples and other information that were previously on this page have been moved to the vignette, available within R via `library(gsubfn); vignette("gsubfn")`.
July 2, 2009. gsubfn 0.5-0 is on CRAN. Main new features are that strapply runs faster and by default strapply uses the tcl regexp engine. See the May 14, 2009 news item below for a performance comparison.
May 14, 2009. A new version of strapply has been committed to the svn repository. The core of it has been rewritten to use tcl via R's tcltk package. It runs several times faster than the previous version of strapply on larger problems. It requires the tcltk package (which is bundled with R on Windows, UNIX and on the full binary version of R on Mac). One implication of this is that strapply now uses Tcl regular expressions. TODO: There are still some incompatibilities with the old strapply that need to be ironed out and proto support still needs to be reimplemented but most examples now run the same as before.
> # CRAN version of strapply
> library(gsubfn)
Loading required package: proto
> system.time(demo("gsubfn-gries"))
user system elapsed
20.14 0.01 20.27
>
> # devel version of strapply
> source("http://gsubfn.googlecode.com/svn/trunk/R/gsubfn.R")
> library(tcltk)
Loading Tcl/Tk interface ... done
> system.time(demo("gsubfn-gries"))
user system elapsed
3.88 0.06 3.95
April 25, 2009. In the EXAMPLES section we added a reference to a new book by Stefan Th. Gries which includes examples of using this package.
Dec 22, 2008. Added mixsort example in the EXAMPLES section below.
Dec 14, 2008. Uploaded gsubfn 0.3-8 to CRAN. Should be available on CRAN shortly. See NEWS file.
Dec 9, 2008. Added detabbing example in the EXAMPLES section below.
Nov 30, 2008. Added the character escaping example to the EXAMPLES section below.
Oct 28, 2008. The devel version now has an improved backref default. Previously it was minus the number of left parens in the regexp. Now it is minus the number of non-escaped left parens in the regexp. To use this version in the interim until it is available on CRAN do this in R:
library(gsubfn)
# overwrite relevant function with devel version of it
source("http://gsubfn.googlecode.com/svn/trunk/R/gsubfn.R")
See this example on r-help.
Oct 18, 2008. gsubfn version 0.3-7 is now on CRAN (and in the svn under the Source tab above on this site). The new version of gsubfn fixes all known bugs, adds list replacement objects and changes the backref= default in the gsubfn and strapply commands. Although the changed default introduces an incompatibility with prior versions, this incompatibility is small because it only affects situations where backreferences are present in the regular expression and backref was not used. Since the previous default for backref was not useful there would be very few, if any, such cases. On the other hand it means that in most cases backref= will not need to be specified as it now takes a more useful default. See announcement.
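The backref default described in the two news items above can be sketched roughly as follows. This assumes a gsubfn release of 0.3-7 or later; the input string and the names `s`, `r_default` and `r_full` are made up for illustration.

```r
library(gsubfn)

s <- "abc 10:20 def 30:40 50"   # illustrative input, not from the package

# The regexp has two non-escaped left parens, so backref is assumed to
# default to -2: the function receives only the two backreferences.
r_default <- gsubfn("([0-9]+):([0-9]+)", ~ as.numeric(x) + as.numeric(y), s)

# With backref = 2 given explicitly, the whole match is passed first,
# followed by the two backreferences.
r_full <- gsubfn("([0-9]+):([0-9]+)",
                 function(m, a, b) paste0("<", m, ">"),
                 s, backref = 2)

print(r_default)
print(r_full)
```

In the second call the helper ignores `a` and `b` and only wraps the full match, which makes it easy to see which argument carries what.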
OVERVIEW
gsubfn is an R package used for string matching, substitution and parsing. It is freely available under the GNU General Public License and is available on CRAN now.
gsubfn. A seemingly small generalization of the R gsub function, namely allowing the replacement string to be a replacement function, formula or proto object, can result in significantly increased power and applicability. The resulting function, gsubfn, is the namesake of this package. In the case of a replacement formula, the formula is interpreted as a function with the right side of the formula representing the body of the function. In the case of a replacement proto object, the object space is used to store persistent data to be communicated from one function invocation to the next, as well as to store the replacement function/method itself.
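As a minimal sketch of the replacement kinds just described, the following applies the same substitution as an ordinary function and as a formula, and then uses a list replacement. The input string and the names `s`, `double_it`, `r_fun`, `r_formula` and `r_list` are illustrative assumptions, not part of the package.

```r
library(gsubfn)

s <- "width 3 height 12"   # illustrative input

# Replacement given as an ordinary function: double every number.
double_it <- function(x) as.numeric(x) * 2
r_fun <- gsubfn("[0-9]+", double_it, s)

# The same replacement as a formula; the right side is the function
# body and the free variable x receives each match.
r_formula <- gsubfn("[0-9]+", ~ as.numeric(x) * 2, s)

# Replacement given as a list: each match is looked up by name.
r_list <- gsubfn("[a-z]+", list(width = "w", height = "h"), s)

print(r_fun)
print(r_list)
```

The function and formula forms should produce the same string; the formula is simply a shorter way of writing the function.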
strapply. Built on top of gsubfn is strapply which is similar to gsubfn except that it returns the output of the function rather than substituting it back into the source string. The argument list is analogous to apply with the string being operated on being the first argument, the regular expression taking the place of the dimension or second argument and an optional function to apply to the match as the third argument. A common use of strapply is to split or extract strings based on content rather than delimiters.
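A short sketch of extracting by content rather than by delimiter, assuming the default `simplify = FALSE` so that the result is a list with one component per input string. The input and the variable names are illustrative.

```r
library(gsubfn)

s <- "a1b22c333"   # illustrative input with no delimiters at all

# Pull out the runs of digits, converting each match to numeric.
nums <- strapply(s, "[0-9]+", as.numeric)[[1]]

# Pull out the runs of lower case letters, leaving them as character.
lets <- strapply(s, "[a-z]+", c)[[1]]

print(nums)
print(lets)
```

Because the pattern describes the pieces that are wanted, no separator needs to be present in the string.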
fn$. The ability to have formula arguments that represent functions can be used not only in the functions of the gsubfn package but also with any R function, without modifying its source. Just preface any R function with fn$ and, subject to certain rules to distinguish which formulas are intended to be functions and which are not, the formula arguments will be translated to functions, e.g.
fn$integrate(~ x^2, 0, 1)
The fn$ prefix will also perform quasi-perl style string interpolation on character arguments (subject to certain rules to determine which are intended to be subject to such translation). e.g.
fn$cat('pi = $pi, exp = `exp(1)`\n')
match.funfn. match.funfn is provided to allow developers to readily build this functionality into their own functions so that even the fn$ prefix need not be used.
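A minimal sketch of how a developer might use match.funfn inside a function so that callers can pass a function, a function name or a formula. The function `sq` and the values passed to it are hypothetical and serve only as an illustration.

```r
library(gsubfn)

# Hypothetical helper: evaluate f at the square of x.
# match.funfn converts f to a function if it is a formula or a name.
sq <- function(f, x) {
  f <- match.funfn(f)
  f(x^2)
}

add_one <- function(x) x + 1

res_function <- sq(add_one, 3)      # an ordinary function
res_name     <- sq("add_one", 3)    # the name of a function
res_formula  <- sq(~ x + 1, 3)      # a formula, no fn$ prefix needed

print(c(res_function, res_name, res_formula))
```

All three calls should give the same value, since each form is resolved to the same function before it is applied.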
CITATION
To get the citation for this package use the R command:
citation("gsubfn")
BUGS
1. strapply may give an error if a variable c is defined in your workspace. To avoid this, remove c, or use ostrapply in place of strapply (but then you won't get the tcl engine), or use the devel version of strapply where it has been fixed.
EXAMPLES
In addition to the examples below, further examples can be found in the help pages, the demos and the vignette. Also, the book Quantitative Corpus Linguistics with R by Stefan Th. Gries has examples of the use of functions from this package in a linguistics setting.
Here are some examples.
library(gsubfn)
library(help = gsubfn) # list help files available
?gsubfn # show a specific help file
vignette("gsubfn") # show vignette
demo(package = "gsubfn") # list demos available
demo("gsubfn-si") # run a specific demo
# gsubfn - for each number in input string reduce the number of digits after decimal to 3
# problem comes from: http://yihui.name/en/2009/08/formatting-decimals-in-texts-with-r/
x <- "CC = 16.5547557654 + 0.0173022117998*PP + 0.216234040485 * PP(-1) + 0.810182697599 * (WP + WG)"
gsubfn("[0-9]+[.][0-9]+", ~ formatC(as.numeric(x), digits = 3, format = "f"), x)
# strapply - return words from a string
# output is: c('the', 'big', 'brown', 'cat')
strapply('the big brown cat', '\\w+', c)[[1]]
# gsubfn with function represented by formula - increment each number
# output is: '35 abc45g7'
gsubfn('[0-9]+', ~ as.numeric(x) + 1, '34 abc44g6')
# an example with a longer function. Escapes each punctuation character
# in s with \\ and each non-punctuation character with [...]
s <- '(ab)'
gsubfn('.',
~ if (any(grep("[[:punct:]]", x))) paste0('\\', x) else paste0('[', x, ']'),
s)
# gsubfn and proto - replace each number with cumulative sum
# The statement incrementing sum could alternatively be written: sum <<- sum + as.numeric(x)
# output is: '34 abc78g84'
p <- proto(pre = function(this) this$sum <- 0,
fun = function(this, x) this$sum <- this$sum + as.numeric(x)
)
gsubfn('[0-9]+', p, '34 abc44g6')
p$sum
# fn$ - specify aggregate function using a formula
# CO2 is a built in data set in R
fn$aggregate(CO2[4:5], CO2[3], ~mean(range(x)))
# convert date ending in 2 digit year to 4 digit year using cutoff of 10 on year
library(gsubfn)
gsubfn('..$', ~ as.numeric(x) + 100*(as.numeric(x) < 10) + 1900, '1-Mar-50')
# running mean of length 3 of Sepal.Length for each Species
attach(iris)
fn$tapply(Sepal.Length, Species, ~ diff(c(NA,NA,0,cumsum(x)),3)/3, simplify=c)
# returns text between pat1 and pat2
# In this example, output is: c('name1', 'name2')
# Note that (?U) turns on ungreedy perl-style matching.
a <- 'something2 ....pat1 name1 pat2 something2....pat1 name2 pat2....'
strapply(a, '(?U)pat1 (.*) pat2', perl = TRUE)[[1]]
### more examples
# detabbing - slightly modified from original by Greg Snow
# https://stat.ethz.ch/pipermail/r-help/2008-December/182086.html
tmp <- strsplit('one\ttwo\nthree\tfour\n12345678\t910\na\tbc\tdef\tghi\n','\n')[[1]]
gsubfn('([^\t]*)\t', ~ sprintf("%s%*s", x, 8-nchar(x)%%8, " "), tmp)
# mixed sort - Input, s, is a character vector. Treating each substring of numerics
# and each substring of non-numerics as a key field, we regard these as records
# of key fields which are sorted and returned.
# Internals: L is a list of character vectors whose ith component is the split up
# numerics and non-numerics in input s[i]. e.g. L[[3]] is c("x", "02", "b") in
# example below since s[3] is "x02b". We arrange this into matrix, L2, so that components
# of L correspond to rows in L2 replacing NAs with "" to give L3. Finally convert
# it to data frame, getting the ordering, ord, and apply that to original
# character vector, s.
# from: https://stat.ethz.ch/pipermail/r-help/2008-December/183209.html
mixsort <- function(s) {
L <- strapply(s, "([0-9]+)|([^0-9]+)", ~ if (nchar(x)) sprintf("%99s", x) else y)
L2 <- t(do.call(cbind, lapply(L, ts)))
L3 <- replace(L2, is.na(L2), "")
ord <- do.call(order, as.data.frame(L3, stringsAsFactors = FALSE))
s[ord]
}
s <- c("x1b", "x1a", "x02b", "x02a", "x02", "y1a1", "y10a2", "y10a10",
"y10a1", "y2", "var10a2", "var2", "y10")
mixsort(s)
# hex to binary
# from: https://stat.ethz.ch/pipermail/r-help/2009-May/198655.html
> library(gsubfn)
> binary.digits <-
+ list("0"= "0000", "1"= "0001", "2"= "0010", "3"= "0011",
+ "4"= "0100", "5"= "0101", "6"= "0110", "7"= "0111",
+ "8"= "1000", "9"= "1001", "A"= "1010", "B"= "1011",
+ "C"= "1100", "D"= "1101", "E"= "1110", "F"= "1111")
>
> gsubfn("[0-9A-F]", binary.digits, "0X1.921FB54442D18P+1")
[1] "0000X0001.1001001000011111101101010100010001000010110100011000P+0001"