Important: This is an old version of this page.
Google Base is a general-purpose repository hosting structured, typed data. Query languages used for other Google services like web search, on the other hand, are designed for searching unstructured textual content. In order to allow Google Base client applications to submit precise queries that constrain the structure of items as well as the value of certain attributes, the Google Base Data API supports a structured attribute-based query language. This language also features full support for traditional free-form text search queries.
This document provides both an overview of the Google Base query language and a detailed description of the various language elements. It is intended for programmers who want to query data from Google Base using the Google Base Data API. The item feeds of the API accept queries in the language presented here.
The document assumes that the reader is familiar with one of Google's existing query languages. Knowledge of BNF (Backus Naur Form) is helpful for understanding the syntax specification of the various language constructs.
The Google Base query language adds linguistic constructs for attribute-centric
search queries in a conservative way, fully reusing established query
composition mechanisms. For instance, a sequence of search terms is interpreted
as a conjunction: only documents that match all the individual
terms also match the full query. The following Google Base query would
yield all data items that contain both terms digital and camera in
an arbitrary order in arbitrary attribute values or attribute names.
digital camera
To restrict the search to only those data items describing digital cameras
whose resolution is at least 3 megapixels, we have to add a constraint
on an imaginary attribute megapixel of type float:
digital camera [megapixel(float) >= 3.0]
The sub-query [megapixel(float) >= 3.0] only matches
data items which define a megapixel attribute of type float with
a value of at least 3. The tuple megapixel(float) is called
an attribute identifier. It specifies both the attribute name and
the attribute type. The remainder of the attribute query, >=
3.0, refers to the constraint on the attribute value.
Since it is tedious to state the full attribute identifier whenever a
user wants to constrain an attribute's value, the proposed query language
is able to infer the type from the value constraint. The following query
is equivalent to the one before; the constraint on the type of attribute megapixel gets
inferred from the value 3.0.
digital camera [megapixel >= 3.0]
Sometimes a user is just interested in the existence of an attribute.
This can easily be queried by omitting a value constraint. For instance,
the following specialization of the previous query would only return
data items which have an associated attribute price of type int.
digital camera [megapixel >= 3.0] [price(int)]
Again, it is possible to omit the type specification. In this case, any attribute with the given name matches the query:
digital camera [megapixel >= 3.0] [price]
Currently, Google Base supports the predefined attribute types text, bool, number, int, float, date, daterange,
and location. The numeric types have a parametric form where
the type parameter refers to a unit of measure. For instance, the value 3.0
px has the parameterized type float px, referring
to floating point values associated with the unit px. Thus,
if we want to search only for digital cameras with a resolution of at
least 3.0 megapixels and a price of at most $500, we could use the following
query:
digital camera [megapixel >= 3.0] [price <= 500.0 USD]
This query restricts the results to cameras priced in US currency
by inferring float USD as the type of attribute price.
For instance, items defining an attribute price(float EUR) would
not match.
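The queries above are plain strings, so a client can assemble them from parts. The following is a minimal sketch of such a helper; the function names attr_constraint and build_query are hypothetical and not part of the Google Base Data API.

```python
def attr_constraint(name, op=None, value=None, atype=None):
    """Render one bracketed attribute constraint as a query-string fragment."""
    # The type designator is optional: 'megapixel' or 'megapixel(float)'.
    ident = f"{name}({atype})" if atype else name
    if op is None:
        # No value constraint: the query only checks that the attribute exists.
        return f"[{ident}]"
    return f"[{ident} {op} {value}]"


def build_query(terms, constraints):
    """Join free-text terms and constraints; a sequence means conjunction."""
    return " ".join(list(terms) + list(constraints))


query = build_query(
    ["digital", "camera"],
    [
        attr_constraint("megapixel", ">=", "3.0"),
        attr_constraint("price", "<=", "500.0 USD"),
    ],
)
print(query)  # digital camera [megapixel >= 3.0] [price <= 500.0 USD]
```

The helper does no escaping or type checking; it only mirrors the surface syntax shown in the examples.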
Unlike Google's web search, the Google Base query language has to support
arbitrary nesting of queries, allowing users to express non-trivial attribute
relationships. Here is an example of a query that returns only data
items matching the terms digital and camera where
either the megapixel value is at least 3.0 or the price
is at most 100 USD:
digital camera ([megapixel >= 3.0] | [price <= 100.0 USD])
The | operator is used to express alternatives (OR), and
parentheses group queries. In addition to conjunctive and disjunctive
queries, there is also support for phrase queries as well as the - operator,
which checks for the absence of a match. Here is a more complex
query illustrating a combination of these features:
[title:"digital camera"] [price <= 100.0 USD] -[label:sold]
This query returns only data items whose mandatory title attribute contains
the phrase digital camera, whose price is at most $100, and
that are not tagged with the sold label. As this example
shows, label restrictions are simply expressed as an attribute
constraint on the standardized attribute label of type text.
In Google Base, the mapping from attribute keys to values is a relation
rather than a function, so a single attribute can have multiple values.
Thus, the following query, which refers to data items that are tagged
with both a sold and a product label, is actually
satisfiable:
[label:sold] [label:product]
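The relation-versus-function point can be illustrated with a small sketch. It models an item's attributes as a list of (name, value) pairs instead of a dictionary, so one name may carry several values. The data model and the function names are assumptions made for illustration, and plain equality stands in for the text matching that the : operator performs.

```python
# Attributes as (name, value) pairs: the same name may appear more than once.
item = [("title", "digital camera"), ("label", "sold"), ("label", "product")]


def has_value(attributes, name, value):
    """True if at least one pair carries this name and this value."""
    return any(n == name and v == value for n, v in attributes)


def matches_all(attributes, constraints):
    """Conjunction: every constraint must be met by some pair."""
    return all(has_value(attributes, n, v) for n, v in constraints)


# Both label constraints hold at once because 'label' has two values.
both = matches_all(item, [("label", "sold"), ("label", "product")])
```

With a dictionary, the second label would overwrite the first and the conjunction could never hold.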
The remainder of this document provides a brief specification of the query syntax, including informal explanations of the semantics.
For specifying the syntax of the Google Base query language, we use a
Backus Naur Form (BNF). Non-terminal symbols are printed in italics.
Terminal symbols are either of the form 'token', or they
are represented by a symbol printed in non-italicised form. The lexical
grammar will explain the micro-syntax of such token classes.
Google's query languages all support at least operators for building
conjunctive (whitespace) and disjunctive (|) combinations
of queries. Furthermore, they support a unary negation operator (-)
for checking that a query does not match within a certain context, such as
the whole document or a special part like the title.
Query        = Query '|' Conjunction
             | Conjunction
Conjunction  = Conjunction Negated
             | Negated
Negated      = '-' Delimited
             | Delimited
The Google Base query language defines a small set of delimited queries which are queries that are either atomic or that are composite but delimited by special tokens. In this section we focus on the queries relevant for text search in general. Search in attribute values is discussed in the next section.
Delimited    = '(' Query ')'
             | '"' PhraseQuery '"'
             | term
             | '*'
             | AttribQuery
The Google Base query language supports arbitrary nesting of queries. Parentheses are used to express query nesting.
Phrase queries are conjunctions of smaller sub-queries. The difference
from regular conjunctions is that the order of the sub-queries matters
and that the matches of the sub-queries have to be consecutive. Phrases are
expressed by enclosing the corresponding sub-queries in quotation marks
("this is a phrase"). Phrase queries cannot be nested, and
it is not possible to refer to attributes within a phrase.
In general, the components of a phrase query have to conform to the following
grammar:
PhraseQuery        = PhraseQuery '|' PhraseConjunction
                   | PhraseConjunction
PhraseConjunction  = PhraseConjunction PhraseNegated
                   | PhraseNegated
PhraseNegated      = '-' PhraseDelimited
                   | PhraseDelimited
PhraseDelimited    = '(' PhraseQuery ')'
                   | term
                   | '*'
The only atomic queries that are useful for text search are terms and
the wildcard token. A term is basically a word delimited by whitespace
or other term delimiters (see the Literal section). The wildcard token * matches
every possible term.
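The text-search part of the grammar can be turned into a small recursive-descent parser. The sketch below is only an illustration under stated assumptions: the left-recursive rules are rewritten as loops, a phrase is kept as one raw string instead of being parsed with the PhraseQuery rules, attribute queries and backslash escapes are not handled, and the tuple shapes of the result are invented here.

```python
import re

# A quoted phrase, a single operator character, or a run of other characters.
TOKEN = re.compile(r'"[^"]*"|[|()*-]|[^\s|()*"-]+')


def parse(text):
    """Parse the text-search subset of the grammar into nested tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def query():
        # Query = Conjunction ('|' Conjunction)*
        alts = [conjunction()]
        while peek() == "|":
            take()
            alts.append(conjunction())
        return alts[0] if len(alts) == 1 else ("or", *alts)

    def conjunction():
        # Conjunction = Negated Negated*
        parts = [negated()]
        while peek() not in (None, "|", ")"):
            parts.append(negated())
        return parts[0] if len(parts) == 1 else ("and", *parts)

    def negated():
        # Negated = '-' Delimited | Delimited
        if peek() == "-":
            take()
            return ("not", delimited())
        return delimited()

    def delimited():
        tok = peek()
        if tok is None or tok in ("|", ")", "-"):
            raise SyntaxError(f"unexpected token: {tok!r}")
        take()
        if tok == "(":
            inner = query()
            if peek() != ")":
                raise SyntaxError("missing ')'")
            take()
            return inner
        if tok.startswith('"'):
            return ("phrase", tok[1:-1])
        if tok == "*":
            return ("any",)
        return ("term", tok)

    tree = query()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token: {tokens[pos]!r}")
    return tree
```

For example, parse('digital camera (cheap | -used)') yields an "and" node whose third child is an "or" node containing a negated term.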
Searching for attributes and attribute values is a novel feature of the Google Base query language. Syntactically, such attribute queries are delimited by brackets:
AttribQuery  = '[' AttribConstraint ']'
An attribute constraint associates an attribute name with a typed attribute value. Thus, the syntax of attribute constraints has to allow us to distinguish names from values, and has to make it possible to distinguish values of different types. It turns out that delimiting the attribute constraint by an explicit delimiter (i.e. brackets) is essential for supporting a natural, unambiguous notation. In particular, because
In addition, the use of bracket delimiters makes it easy to visually distinguish query components used for free-form text search from those expressing structured attribute constraints.
An attribute constraint consists of an attribute identifier and an attribute value constraint. The attribute identifier specifies an attribute with a name and a type. The attribute value constraint is defined in terms of an operator and a value query. This is the syntax of an attribute constraint:
[attribute name(atype) ?? avquery]
A data item matches such a constraint if it defines an attribute with the name attribute
name and the type atype, and that attribute has a value that matches
the constraint specified by the operator ?? and the
value query avquery. Here are examples of valid attribute
queries:
[access rights(text) : private | protected]
Checks that an attribute access rights of type text exists and that its value
matches the query private | protected. Note that
the type designator is optional. In many cases, the query parser
will be able to infer it from the attribute value (which is a
text query in this case). The colon operator is used to express
that an attribute value is an element of the attribute domain
specified by the query avquery.
[copies <= 32]
Checks that an attribute copies of
a numerical type (i.e. either int or float)
exists and that its value is less than or equal to 32. A similar query
could use the colon operator and specify a number range as the corresponding
domain: [copies : 0..32]. This query assumes the number
is not negative.
[start date : 2006-02-08]
Checks that an attribute start date of type daterange exists whose
value is a date/time on February 8, 2006.
If no value constraint is specified in an attribute query, the system
will just query for the existence of the specified attribute. It will
not look at its value. For instance, the query [destination(location)] matches
all data items that have an attribute named destination of
type location, independent of its actual attribute value.
Here is the grammar specifying the various variants of the attribute constraint syntax:
AttribConstraint  = AttribDescriptor AttribOp AttribValue
                  | AttribDescriptor ':' AttribValueQuery
                  | AttribDescriptor
AttribDescriptor  = AttribIdentifier
AttribIdentifier  = AttribName '(' AttribType ')'
                  | AttribName
The most fundamental operator for constraining the value of an
attribute is the matching operator :. It expects a value
query as its right-hand side operand describing a domain of valid values.
A query of the form [name(type): value_query] matches only
if the value of an attribute identified by a name and an optional type
matches the given value query; i.e. the value is contained in the domain
described by this value query. Value queries are either arbitrary text
queries, concrete attribute values, ranges of numbers, or
geographical areas.
For instance, query [rooms: 2..4] matches only if the
value of numeric attribute rooms is contained in the range 2..4. Similarly, query [location: @"Mountain View, CA" + 3mi] matches only if the
location attribute is contained in the geographic area specified by the
center of Mountain View and a radius of 3 miles.
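The range case of the matching operator can be sketched in a few lines. The code below assumes that both bounds of a range such as 2..4 are inclusive, which the text does not state explicitly; the function names are hypothetical.

```python
def parse_range(text):
    """Split 'low..high' into two numbers (int when possible, else float)."""
    low, sep, high = text.partition("..")
    if not sep:
        raise ValueError(f"not a range: {text!r}")

    def number(part):
        part = part.strip()
        try:
            return int(part)
        except ValueError:
            return float(part)

    return number(low), number(high)


def matches_range(value, range_text):
    """True if value lies in the range; inclusive bounds are an assumption."""
    low, high = parse_range(range_text)
    return low <= value <= high
```

Under these assumptions, matches_range(3, "2..4") holds while matches_range(5, "2..4") does not, mirroring the [rooms: 2..4] example.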
All other operators besides : can only be used for relating
values of attributes with given value literals. The equality operator == expresses an exact match between a given value and the
value of an attribute. It is defined for all types of values.
The four other comparison operators specified in the grammar below
cannot be used with arbitrary types. The operators <, <=, >=, and > work for
the numeric types int, float and number as well as date. For comparison
operators, the type checker of the query language will make sure that
the type of the left-hand operand and the right-hand operand are the same.
AttribOp     = '<' | '<=' | '==' | '>=' | '>' | '<<'
AttribValue  = '"' PhraseQuery '"'
             | BoolValue
             | int Unit
             | float Unit
             | date
             | daterange
             | location
BoolValue    = 'true'
             | 'false'
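The typing rule for the four ordering operators described above can be sketched as a simple check. This is an illustration only: types are modeled as strings such as "float USD", and the function name is hypothetical.

```python
ORDERING_OPS = {"<", "<=", ">=", ">"}
ORDERED_TYPES = {"int", "float", "number", "date"}


def comparison_is_valid(op, attr_type, literal_type):
    """Check an ordering comparison between an attribute and a literal."""
    if op not in ORDERING_OPS:
        raise ValueError(f"not an ordering operator: {op!r}")
    # Drop an optional unit, e.g. 'float USD' has the base type 'float'.
    base = attr_type.split()[0]
    # Both operands must have the same type, including the unit.
    return base in ORDERED_TYPES and attr_type == literal_type
```

Because the unit is part of the type, a float USD literal cannot be compared with a float EUR attribute.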
The containment operator << is the only operator
besides : that can be used in combination with attributes of
type daterange. It expects a date literal as right-hand side
operand and checks that this date is included in the date range of the
attribute.
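As a sketch of what the containment check amounts to, the following code models a date range as a pair of UTC datetimes and tests whether a date falls inside it. Treating both ends as inclusive is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def contains(date_range, moment):
    """True if moment lies within the (start, end) pair, ends included."""
    start, end = date_range
    return start <= moment <= end


# February 1973 as a (start, end) pair in UTC.
feb_1973 = (
    datetime(1973, 2, 1, 0, 0, 0, tzinfo=timezone.utc),
    datetime(1973, 2, 28, 23, 59, 59, tzinfo=timezone.utc),
)
inside = contains(feb_1973, datetime(1973, 2, 14, tzinfo=timezone.utc))
```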
The following grammar defines the attribute types that are currently
supported by Google Base. The numeric types int, float and number have
a parametric variant which allows users to associate arbitrary units
with a number. For instance, type number CHF refers to numbers
with unit CHF, i.e. to prices in the currency CHF (Swiss Francs).
AttribType  = 'bool'
            | 'number' Unit
            | 'int' Unit
            | 'float' Unit
            | 'text'
            | 'date'
            | 'daterange'
            | 'location'
Unit        = Name
            | ε
Attribute names can consist of multiple words.
For attributes of type daterange, the field selection
operator # can be used to project the attribute value
to either the start or end date. Field selections are appended directly
to the attribute name.
AttribName  = Name '#' Field
            | Name
Name        = Name term
            | term
Field       = 'start'
            | 'end'
For example, to check that a start date lies in a given range, one could use the following query:
[event date range#start : 2006-05Z]
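Splitting an attribute name from its optional field selection follows directly from the AttribName rule. A minimal sketch, with a hypothetical function name, could look like this:

```python
def split_attrib_name(text):
    """Split 'name#field' into (name, field); field is None when absent."""
    name, sep, field = text.partition("#")
    # Names may consist of several words; normalize the spacing between them.
    name = " ".join(name.split())
    if not name:
        raise ValueError("missing attribute name")
    if not sep:
        return name, None
    field = field.strip()
    if field not in ("start", "end"):
        raise ValueError(f"unknown field: {field!r}")
    return name, field
```

Applied to the example above, split_attrib_name("event date range#start") returns the name event date range and the field start.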
Tokens are parsed using the longest match rule; i.e. the longest possible matching token is chosen if there are ambiguities.
The following delimiters separate tokens in the Google Base query language. It is possible to escape delimiters with a backslash if needed.
|, -, :, =, ", [, ], (, ), *, #, <, >, whitespace
A term is a sequence of characters delimited by one of the special characters
listed above. Delimiters cannot be used as characters in a term. Terms
never start or end with two periods (..).
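A client that wants to search for text containing delimiter characters can escape them with a backslash, as described above. The helper below is a sketch with a hypothetical name; how a literal backslash itself should be written is not covered by this document, so the sketch leaves backslashes untouched.

```python
# The delimiter characters listed above; whitespace is checked separately.
DELIMITERS = set('|-:="[]()*#<>')


def escape_term(text):
    """Prefix every delimiter or whitespace character with a backslash."""
    return "".join(
        "\\" + ch if ch in DELIMITERS or ch.isspace() else ch
        for ch in text
    )
```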
The query language supports both integer and floating point numbers as specified by the following grammar:
int        = uint | '-' uint
float      = ufloat | '-' ufloat
uint       = digits | '0' 'x' hexdigits
ufloat     = digits '.' digits 'e' sign digits | digits '.' digits | digits 'e' sign digits
digits     = digits digit | digit
digit      = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
hexdigits  = hexdigits hexdigit | hexdigit
hexdigit   = digit | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
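The number grammar maps naturally onto two regular expressions. The sketch below makes one assumption: the sign symbol is not defined in the grammar, so it is taken to be an optional '+' or '-'. The function name is hypothetical.

```python
import re

# int = uint | '-' uint, where uint is decimal digits or '0x' hex digits.
INT_RE = re.compile(r"-?(?:0x[0-9a-f]+|[0-9]+)")
# float: digits '.' digits with an optional exponent, or digits with exponent.
FLOAT_RE = re.compile(
    r"-?(?:[0-9]+\.[0-9]+(?:e[+-]?[0-9]+)?|[0-9]+e[+-]?[0-9]+)"
)


def parse_number(text):
    """Return an int or float for a literal that fits the grammar."""
    if INT_RE.fullmatch(text):
        negative = text.startswith("-")
        body = text.lstrip("-")
        value = int(body, 16) if body.startswith("0x") else int(body, 10)
        return -value if negative else value
    if FLOAT_RE.fullmatch(text):
        return float(text)
    raise ValueError(f"not a number literal: {text!r}")
```

Note that, following the grammar, a literal such as .5 is rejected because digits are required on both sides of the decimal point.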
The query language supports both integer and floating point number ranges as specified by the following grammar:
intrange    = int '.' '.' int
floatrange  = float '.' '.' float | float '.' '.' int | int '.' '.' float
Currently, number ranges can only be used in combination with the attribute
value matching operator :.
Dates are specified in the international format defined by ISO 8601.
The standard defines several alternatives; only the following one is
supported: yyyy-mm-ddThh:mm:ssZ (the Z is optional).
Only times in UTC are supported. In addition to the full date format,
the query language also accepts partial date specifications. Such partial
dates define date ranges. For example: 1973-02Z specifies
the date range 1973-02-01T00:00:00Z..1973-02-28T23:59:59Z,
which corresponds to February in 1973.
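The expansion of a partial date into a date range can be sketched as follows. The code handles only year-month and year-month-day inputs; whether other precisions are accepted is not stated here, so they are left out. The function name is hypothetical.

```python
import calendar
import re
from datetime import datetime, timezone

# yyyy-mm or yyyy-mm-dd, with an optional trailing Z.
PARTIAL = re.compile(r"(\d{4})-(\d{2})(?:-(\d{2}))?Z?")


def expand_partial_date(text):
    """Return the (start, end) UTC datetimes covered by a partial date."""
    match = PARTIAL.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported partial date: {text!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if match.group(3) is None:
        # Month precision: cover the first through the last day of the month.
        first_day, last_day = 1, calendar.monthrange(year, month)[1]
    else:
        first_day = last_day = int(match.group(3))
    start = datetime(year, month, first_day, 0, 0, 0, tzinfo=timezone.utc)
    end = datetime(year, month, last_day, 23, 59, 59, tzinfo=timezone.utc)
    return start, end
```

For 1973-02Z this produces the range from 1973-02-01T00:00:00 to 1973-02-28T23:59:59 in UTC, matching the example above.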
The previous example also shows that date ranges are either implicitly
defined using a partial date, or they are given by a date interval where
the dates are separated by two periods (..).
Locations can be specified in two ways: either by using a human-readable
address (text), or by specifying both a latitude and longitude. The textual
form has the following microsyntax: @"1600 Amphitheatre Parkway,
Mountain View, CA, USA"; i.e. @ prefixes a string.
Locations given by a latitude/longitude have to use the syntax defined
by ISO 6709 (omitting altitudes). This standard defines several formats.
The Google Base query language supports the following forms: @+-DD.D...+-DDD.D..., @+-DDMM.M...+-DDDMM.M...,
and @+-DDMMSS.S...+-DDDMMSS.S.... The first form defines
the latitude (first number) and longitude as fractional degree values.
The second form consists of degrees (2/3 digits) and fractional minutes
(2 digits). The third form consists of degrees (2/3 digits), minutes
(2 digits), and fractional seconds (2 digits).
Here are some examples specifying locations via latitude/longitude: @+4852+00220 (Paris), @+48.8577+002.295 (Eiffel
Tower, Paris), @+90+000 (North Pole), @+40.6894-074.0447 (Statue
of Liberty, NYC).
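To make the three latitude/longitude forms concrete, the following sketch converts such a literal into decimal degrees. It is an illustration rather than a full ISO 6709 parser: it assumes that latitude and longitude use the same form, and the function names are hypothetical.

```python
import re

# Sign, integer digits (2/4/6 for latitude, 3/5/7 for longitude), fraction.
COORD = re.compile(
    r"@([+-])(\d{2}|\d{4}|\d{6})(\.\d+)?([+-])(\d{3}|\d{5}|\d{7})(\.\d+)?"
)


def to_degrees(digits, fraction, degree_width):
    """Convert D, DM or DMS digits plus a fraction into decimal degrees."""
    frac = float(fraction) if fraction else 0.0
    degrees = int(digits[:degree_width])
    rest = digits[degree_width:]
    if len(rest) == 0:  # fractional degrees
        return degrees + frac
    if len(rest) == 2:  # degrees and fractional minutes
        return degrees + (int(rest) + frac) / 60
    minutes, seconds = int(rest[:2]), int(rest[2:])
    return degrees + minutes / 60 + (seconds + frac) / 3600


def parse_location(text):
    """Return (latitude, longitude) in decimal degrees."""
    match = COORD.fullmatch(text)
    if match is None:
        raise ValueError(f"not a latitude/longitude literal: {text!r}")
    lat_sign, lat, lat_frac, lon_sign, lon, lon_frac = match.groups()
    if len(lat) - 2 != len(lon) - 3:
        raise ValueError("latitude and longitude use different forms")
    latitude = to_degrees(lat, lat_frac, 2)
    longitude = to_degrees(lon, lon_frac, 3)
    return (
        -latitude if lat_sign == "-" else latitude,
        -longitude if lon_sign == "-" else longitude,
    )
```

For instance, @+4852+00220 is read as 48 degrees 52 minutes north and 2 degrees 20 minutes east.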
Geographic areas can be represented by a location and a radius specified
either in meters, kilometers, or miles. The syntax is defined above.
Here is an example: @+40.75-074.00 + 5mi (NYC + radius of
5 miles).
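Matching a location against such an area comes down to a distance test. The sketch below uses the haversine great-circle formula on a spherical Earth, which is an assumption about how distance could be computed and not a description of what Google Base does. Only the unit spelling mi appears in this document; m and km are assumed spellings.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters
# 'mi' is taken from the example above; 'm' and 'km' are assumed spellings.
UNIT_IN_METERS = {"m": 1.0, "km": 1000.0, "mi": 1609.344}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_area(point, center, radius, unit):
    """True if point lies within radius (in the given unit) of center."""
    return haversine_m(*center, *point) <= radius * UNIT_IN_METERS[unit]


nyc = (40.75, -74.00)
nearby = within_area((40.76, -74.00), nyc, 5, "mi")
```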