Google Base Data API

Query Language

Google Base is a general-purpose repository hosting structured, typed data. Query languages used for other Google services like web search, on the other hand, are designed for searching unstructured textual content. In order to allow Google Base client applications to submit precise queries that constrain the structure of items as well as the value of certain attributes, the Google Base data API supports a structured attribute-based query language. This language also features full support for traditional free-form text search queries.

This document provides both an overview of the Google Base query language and a detailed description of the various language elements. It is intended for programmers who want to query data from Google Base using the Google Base data API. The item feeds of the API accept queries in the language presented here.

The document assumes that the reader is familiar with one of Google's existing query languages. Knowledge of BNF (Backus Naur Form) is helpful for understanding the syntax specification of the various language constructs.

Contents

  1. Overview
  2. Query Composition
  3. Text Search
  4. Attribute Search
  5. Lexical Syntax

Overview

Composing queries

The Google Base query language adds linguistic constructs for attribute-centric search queries in a conservative way, fully reusing established query composition mechanisms. For instance, a sequence of search terms is interpreted as a conjunction: only documents that match all the individual terms also match the full query. The following Google Base query would yield all data items that contain both terms digital and camera in an arbitrary order in arbitrary attribute values or attribute names.

digital camera

Expressing constraints on attributes

To restrict the search to only those data items describing digital cameras whose resolution is at least 3 megapixels, we have to add a constraint on an imaginary attribute megapixel of type float:

digital camera [megapixel(float) >= 3.0]

The sub-query [megapixel(float) >= 3.0] only matches data items which define a megapixel attribute of type float with a value of at least 3. The tuple megapixel(float) is called an attribute identifier. It specifies both the attribute name and the attribute type. The remainder of the attribute query, >= 3.0, is the constraint on the attribute value.

Since it is tedious to state the full attribute identifier whenever a user wants to constrain an attribute's value, the proposed query language is able to infer the type from the value constraint. The following query is equivalent to the one before; the constraint on the type of attribute megapixel gets inferred from the value 3.0.

digital camera [megapixel >= 3.0]

Sometimes a user is just interested in the existence of an attribute. This can easily be queried by omitting a value constraint. For instance, the following specialization of the previous query would only return data items which have an associated attribute price of type int.

digital camera [megapixel >= 3.0] [price(int)]

Again, it is possible to omit the type specification. In this case, any attribute with the given name matches the query:

digital camera [megapixel >= 3.0] [price]

Currently, Google Base supports the predefined attribute types text, bool, number, int, float, date, daterange, and location. The numeric types have a parametric form where the type parameter refers to a unit of measure. For instance, the value 3.0 px has the parameterized type float px, referring to floating point values associated with the unit px. Thus, if we want to search only for digital cameras with a resolution of at least 3.0 megapixels and a price of at most $500, we could use the following query:

digital camera [megapixel >= 3.0] [price <= 500.0 USD]

This query restricts the results to cameras priced in US dollars, because float USD is inferred as the type of attribute price. For instance, items defining an attribute price(float EUR) would not match.
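A client that assembles such queries programmatically might use small helpers like the ones below. This is only a sketch: the function names attr and query are hypothetical and are not part of the Google Base data API. They simply produce query strings in the notation shown above.

```python
def attr(name, op=None, value=None, atype=None):
    """Build one bracketed attribute query such as [price <= 500.0 USD]."""
    ident = f"{name}({atype})" if atype else name
    if op is None:
        # No value constraint: the query only tests that the attribute exists.
        return f"[{ident}]"
    return f"[{ident} {op} {value}]"


def query(*parts):
    """Join sub-queries with whitespace, which the language reads as a conjunction."""
    return " ".join(parts)


q = query("digital", "camera",
          attr("megapixel", ">=", "3.0"),
          attr("price", "<=", "500.0 USD"))
print(q)  # digital camera [megapixel >= 3.0] [price <= 500.0 USD]
```

Leaving out the operator yields an existence test, for example attr("price", atype="int") produces [price(int)].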

Nesting queries

Unlike Google's web search, the Google Base query language has to support arbitrary nesting of queries, allowing users to express non-trivial attribute relationships. Here is an example of a query that returns only data items matching the terms digital and camera where either the megapixel value is at least 3.0 or the price is at most 100 USD:

digital camera ([megapixel >= 3.0] | [price <= 100.0 USD])

The | operator is used to express alternatives (OR), and parentheses group queries. In addition to conjunctive and disjunctive queries, there is also support for phrase queries as well as the - operator, which checks for the absence of a match. Here is a more complex query illustrating a combination of these features:

[title:"digital camera"] [price <= 100.0 USD] -[label:sold]

This query returns only data items whose mandatory title attribute contains the phrase digital camera, whose price is at most $100, and that are not tagged with the sold label. As this example shows, label restrictions are simply expressed as an attribute constraint on the standardized attribute label of type text. In Google Base, the mapping from attribute keys to values is a relation rather than a function, so a single attribute can have multiple values. Thus, the following query, which refers to data items that are tagged with both a sold and a product label, is actually satisfiable:

[label:sold] [label:product]
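The effect of treating attributes as a relation can be illustrated with a toy in-memory model. The representation of an item as a list of (name, value) pairs and the helper has_label are assumptions made for this sketch, not the way Google Base stores items.

```python
# Hypothetical item: attributes are (name, value) pairs, so the same
# attribute name may occur more than once.
item = [("title", "digital camera"), ("label", "sold"), ("label", "product")]


def has_label(item, wanted):
    """True if any 'label' pair of the item carries the wanted value."""
    return any(name == "label" and value == wanted for name, value in item)


# [label:sold] [label:product] is a conjunction of two independent constraints,
# each of which may be satisfied by a different value of the same attribute.
matches = has_label(item, "sold") and has_label(item, "product")
print(matches)  # True
```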

Specification

The remainder of this document provides a brief specification of the query syntax, including informal explanations of the semantics.

For specifying the syntax of the Google Base query language, we use a Backus Naur Form (BNF). Non-terminal symbols are printed in italics. Terminal symbols are either of the form 'token', or they are represented by a symbol printed in non-italicised form. The lexical grammar will explain the micro-syntax of such token classes.

Query Composition

Conjunctions, Disjunctions, and Negations

Google's query languages all support at least operators for building conjunctive (whitespace-separated) and disjunctive (|) combinations of queries. Furthermore, they support a unary negation operator (-) for checking that a query does not match within a certain context, such as the whole document or a special part like the title.

Query = Query  '|'  Conjunction
  | Conjunction
Conjunction = Conjunction  Negated
  | Negated
Negated = '-'  Delimited
  | Delimited
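The grammar above is left-recursive, so a hand-written parser would typically turn the recursion into loops. The following sketch does that for a reduced language that only knows plain terms, parentheses, |, and -; phrases, the wildcard, and attribute queries are left out. The tuple-based tree and the evaluation against a set of words are illustrative assumptions, not a description of how Google Base evaluates queries.

```python
import re


def tokenize(text):
    """Split into '|', '-', '(', ')' and runs of other non-space characters."""
    return re.findall(r"[|()\-]|[^\s|()\-]+", text)


def parse_query(tokens, pos=0):
    """Query = Conjunction ('|' Conjunction)*"""
    node, pos = parse_conjunction(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "|":
        right, pos = parse_conjunction(tokens, pos + 1)
        node = ("or", node, right)
    return node, pos


def parse_conjunction(tokens, pos):
    """Conjunction = Negated Negated*  (juxtaposition means AND)"""
    node, pos = parse_negated(tokens, pos)
    while pos < len(tokens) and tokens[pos] not in ("|", ")"):
        right, pos = parse_negated(tokens, pos)
        node = ("and", node, right)
    return node, pos


def parse_negated(tokens, pos):
    """Negated = '-' Delimited | Delimited"""
    if pos < len(tokens) and tokens[pos] == "-":
        node, pos = parse_delimited(tokens, pos + 1)
        return ("not", node), pos
    return parse_delimited(tokens, pos)


def parse_delimited(tokens, pos):
    """Delimited = '(' Query ')' | term  (reduced form for this sketch)"""
    if pos >= len(tokens) or tokens[pos] in ("|", ")", "-"):
        raise ValueError("expected a term or '('")
    if tokens[pos] == "(":
        node, pos = parse_query(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing ')'")
        return node, pos + 1
    return ("term", tokens[pos]), pos + 1


def matches(node, words):
    """Evaluate a parsed query against a set of words."""
    if node[0] == "term":
        return node[1] in words
    if node[0] == "not":
        return not matches(node[1], words)
    if node[0] == "and":
        return matches(node[1], words) and matches(node[2], words)
    return matches(node[1], words) or matches(node[2], words)


tree, _ = parse_query(tokenize("digital camera (sony | canon) -used"))
print(matches(tree, {"digital", "camera", "canon"}))  # True
```

As in the grammar, | binds less tightly than juxtaposition, so a b | c parses as (a AND b) OR c.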

Text Search

The Google Base query language defines a small set of delimited queries: queries that are either atomic, or composite but delimited by special tokens. This section focuses on the queries relevant for text search in general. Search in attribute values is discussed in the next section.

Delimited = '('  Query  ')'
  | '"'  PhraseQuery  '"'
  | term
  | '*'
  | AttribQuery

Query nesting

The Google Base query language supports arbitrary nesting of queries. Parentheses are used to express query nesting.

Phrase queries

Phrase queries are conjunctions of smaller sub-queries. Unlike regular conjunctions, the order of the sub-queries matters and the matches of the sub-queries have to be consecutive. Phrases are expressed by enclosing the corresponding sub-queries in quotation marks ("this is a phrase"). Phrase queries cannot be nested, and it is not possible to refer to attributes within a phrase. In general, the components of a phrase query have to conform to the following grammar:

PhraseQuery = PhraseQuery  '|'  PhraseConjunction
  | PhraseConjunction
PhraseConjunction = PhraseConjunction  PhraseNegated
  | PhraseNegated
PhraseNegated = '-'  PhraseDelimited
  | PhraseDelimited
PhraseDelimited = '('  PhraseQuery  ')'
  | term
  | '*'
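The difference between a phrase and an ordinary conjunction can be sketched for the simplest case, a phrase made only of plain terms and the wildcard. The function phrase_matches and the word-list representation of the searched text are assumptions for illustration; alternatives, negation, and grouping inside phrases are not handled here.

```python
def phrase_matches(phrase_terms, words):
    """True if the terms occur in `words` consecutively and in order.

    A '*' entry in the phrase matches any single word.
    """
    n = len(phrase_terms)
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if all(p == "*" or p == w for p, w in zip(phrase_terms, window)):
            return True
    return False


text = "a new digital camera bag".split()
print(phrase_matches(["digital", "camera"], text))  # True
print(phrase_matches(["camera", "digital"], text))  # False: order matters
```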

Atomic queries

The only atomic queries that are useful for text search are terms and the wildcard token. A term is basically a word delimited by whitespace or other term delimiters (see the Lexical Syntax section). The wildcard token * matches every possible term.

Attribute Search

Attribute queries

Searching for attributes and attribute values is a novel feature of the Google Base query language. Syntactically, such attribute queries are delimited by brackets:

AttribQuery = '['  AttribConstraint  ']'

An attribute constraint associates an attribute name with a typed attribute value. Thus, the syntax of attribute constraints has to allow us to distinguish names from values, and has to make it possible to distinguish values of different types. It turns out that delimiting the attribute constraint by an explicit delimiter (i.e. brackets) is essential for supporting a natural, unambiguous notation, in particular because:

  • Attribute names can consist of multiple words,
  • Attribute values can consist of a sequence of tokens,
  • Attribute types and value constraints are optional.

In addition, the bracket delimiters make it easy to visually distinguish query components used for free-form text search from those expressing structured attribute constraints.

Attribute constraints

An attribute constraint consists of an attribute identifier and an attribute value constraint. The attribute identifier specifies an attribute with a name and a type. The attribute value constraint is defined in terms of an operator and a value query. This is the syntax of an attribute constraint:

[attribute name(atype) ?? avquery]

A data item matches such a constraint if it defines an attribute attribute name of type atype which has a value that matches the constraint specified by the operator ?? and the value query avquery. Here are examples for valid attribute queries:

[access rights(text) : private | protected]
This query asserts that an attribute access rights of type text exists and that its value matches the query private | protected. Note that the type designator is optional. In many cases, the query parser will be able to infer it from the attribute value (which is a text query in this case). The colon operator expresses that an attribute value is an element of the attribute domain specified by the query avquery.
[copies <= 32]
This query asserts that an attribute copies of a numerical type (i.e. either int or float) exists and that its value is less than or equal to 32. A similar query could use the colon operator and specify a number range as the corresponding domain: [copies : 0..32]. That variant assumes the number is not negative.
[start date : 2006-02-08]
This query matches all data items that have an attribute named start date of type daterange whose value is a date/time on February 8, 2006.

If no value constraint is specified in an attribute query, the system will just query for the existence of the specified attribute. It will not look at its value. For instance, the query [destination(location)] matches all data items that have an attribute named destination of type location, independent of its actual attribute value.

Here is the grammar specifying the various variants of the attribute constraint syntax:

AttribConstraint = AttribDescriptor  AttribOp  AttribValue
  | AttribDescriptor  ':'  AttribValueQuery
  | AttribDescriptor
AttribDescriptor = AttribIdentifier
AttribIdentifier = AttribName  '('  AttribType  ')'
  | AttribName

Operators

The most fundamental operator for constraining the value of an attribute is the matching operator :. It expects a value query as its right-hand side operand describing a domain of valid values. A query of the form [name(type): value_query] matches only if the value of an attribute identified by a name and an optional type matches the given value query; i.e. the value is contained in the domain described by this value query. Value queries are either arbitrary text queries, concrete attribute values, ranges of numbers, or geographical areas.

AttribValueQuery = TextQuery
  | AttribValue
  | intrange
  | floatrange
  | location  '+'  int  RadiusUnit
TextQuery = TextQuery  '|'  TextConjunction
  | TextConjunction
TextConjunction = TextConjunction  TextNegated
  | TextNegated
TextNegated = '-'  TextDelimited
  | TextDelimited
TextDelimited = '('  TextQuery  ')'
  | '"'  PhraseQuery  '"'
  | term
  | '*'
RadiusUnit = 'm'
  | 'mi'
  | 'km'

For instance, query [rooms: 2..4] matches only if the value of numeric attribute rooms is contained in the range 2..4. Similarly, query [location: @"Mountain View, CA" + 3mi] matches only if the location attribute is contained in the geographic area specified by the center of Mountain View and a radius of 3 miles.

All other operators besides : can only be used for relating values of attributes with given value literals. The equality operator == expresses an exact match between a given value and the value of an attribute. It is defined for all types of values.

The four other comparison operators specified in the grammar below cannot be used with arbitrary types. The operators <, <=, >=, and > work for the numeric types int, float and number as well as date. For comparison operators, the type checker of the query language will make sure that the types of the left-hand operand and the right-hand operand are the same.

AttribOp = '<'   |   '<='   |   '=='   |   '>='   |   '>'   |   '<<'
AttribValue = '"'  PhraseQuery  '"'
  | BoolValue
  | int  Unit
  | float  Unit
  | date
  | daterange
  | location
BoolValue = 'true'
  | 'false'

The containment operator << is the only operator besides : that can be used in combination with attributes of type daterange. It expects a date literal as right-hand side operand and checks that this date is included in the date range of the attribute.
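The containment check performed by << can be pictured as follows. This sketch represents a daterange as a (start, end) pair of datetime values and assumes that both ends are inclusive; neither the representation nor the boundary behavior is stated by this document.

```python
from datetime import datetime, timezone


def date_in_range(date_range, moment):
    """Evaluate `attribute << moment` for a daterange given as (start, end)."""
    start, end = date_range
    # Assumption: both ends of the range are inclusive.
    return start <= moment <= end


utc = timezone.utc
event = (datetime(2006, 5, 1, tzinfo=utc), datetime(2006, 5, 7, tzinfo=utc))
print(date_in_range(event, datetime(2006, 5, 3, tzinfo=utc)))  # True
print(date_in_range(event, datetime(2006, 6, 1, tzinfo=utc)))  # False
```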

Attribute types

The following grammar defines the attribute types that are currently supported by Google Base. The numeric types int, float and number have a parametric variant which allows users to associate arbitrary units with a number. For instance, type number CHF refers to numbers with unit CHF, i.e. to prices in the currency CHF (Swiss Francs).

AttribType = 'bool'
  | 'number'  Unit
  | 'int'  Unit
  | 'float'  Unit
  | 'text'
  | 'date'
  | 'daterange'
  | 'location'
Unit = Name
  | ε

Attribute names

Attribute names can consist of multiple words. For attributes of type daterange, the field selection operator # can be used to project the attribute value to either the start or end date. Field selections are appended directly to the attribute name.

AttribName = Name  '#'  Field
  | Name
Name = Name  term
  | term
Field = 'start'
  | 'end'

For example, to check that the start date of a date range falls within a given range, one could use the following query:

[event date range#start : 2006-05Z]

Lexical Syntax

Tokens are parsed using the longest match rule; i.e. the longest possible matching token is chosen if there are ambiguities.
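For the operator tokens, the longest match rule means that <= is read as one token rather than as < followed by =. A minimal sketch of this, with the operator list taken from the AttribOp grammar plus the colon, and with the function name next_operator chosen for illustration:

```python
# Operator tokens from the AttribOp grammar, plus the matching operator ':'.
OPERATORS = ["<<", "<=", ">=", "==", "<", ">", ":"]


def next_operator(text, pos=0):
    """Return the longest operator that starts at text[pos], or None."""
    candidates = [op for op in OPERATORS if text.startswith(op, pos)]
    return max(candidates, key=len) if candidates else None


print(next_operator("<= 32"))    # <=
print(next_operator("<< 2006"))  # <<
print(next_operator("< 32"))     # <
```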

Delimiters

The following delimiters separate tokens in the Google Base query language. It is possible to escape delimiters with a backslash if needed.

|, -, :, =, ", [, ], (, ), *, #, <, >, whitespace.
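A client that wants to search for text containing one of these characters could escape it before building the query. The helper below is a sketch under assumptions: the name escape_term is hypothetical, and since this document does not say how a literal backslash is treated, backslashes in the input are passed through unchanged.

```python
# The delimiter characters listed above (whitespace is handled separately).
DELIMITERS = set('|-:="[]()*#<>')


def escape_term(text):
    """Prefix every delimiter or whitespace character with a backslash."""
    return "".join(
        "\\" + ch if ch in DELIMITERS or ch.isspace() else ch
        for ch in text
    )


print(escape_term("wi-fi"))  # wi\-fi
```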

Terms

A term is a sequence of characters delimited by one of the special characters listed above. Delimiters cannot be used as characters in a term. Terms never start or end with two periods (..).

Numbers

The query language supports both integer and floating point numbers as specified by the following grammar:

int = uint  |  '-' uint
float = ufloat  |  '-' ufloat
uint = digits  |  '0' 'x' hexdigits
ufloat = digits '.' digits 'e' sign digits  |  digits '.' digits  |  digits 'e' sign digits
digits = digits digit  |  digit
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
hexdigits = hexdigits hexdigit  |  hexdigit
hexdigit = digit | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
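The number grammar translates fairly directly into regular expressions. In the sketch below, the non-terminal sign is not defined by the grammar above, so it is assumed to be an optional + or -; the names INT_RE, FLOAT_RE, and classify are illustrative only.

```python
import re

# uint = digits | '0' 'x' hexdigits  (hex digits are lowercase in the grammar)
UINT = r"(?:0x[0-9a-f]+|[0-9]+)"
# ufloat = digits '.' digits 'e' sign digits | digits '.' digits | digits 'e' sign digits
# Assumption: 'sign' is an optional '+' or '-'.
UFLOAT = r"(?:[0-9]+\.[0-9]+(?:e[+-]?[0-9]+)?|[0-9]+e[+-]?[0-9]+)"

INT_RE = re.compile(rf"-?{UINT}")
FLOAT_RE = re.compile(rf"-?{UFLOAT}")


def classify(literal):
    """Return 'int', 'float', or None for a candidate number literal."""
    if INT_RE.fullmatch(literal):
        return "int"
    if FLOAT_RE.fullmatch(literal):
        return "float"
    return None


print(classify("32"))    # int
print(classify("3.0"))   # float
print(classify("3."))    # None: digits are required after the period
```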

Number ranges

The query language supports both integer and floating point number ranges as specified by the following grammar:

intrange = int  '.'  '.'  int
floatrange = float  '.'  '.'  float  |  float  '.'  '.'  int  |  int  '.'  '.'  float

Currently, number ranges can only be used in combination with the attribute value matching operator :.
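A range literal can be split at the two periods and classified by the kinds of its endpoints. The following sketch makes two assumptions that this document does not state: both bounds are treated as inclusive, and Python's float() is used for the floating point endpoints, which accepts some spellings the grammar above does not.

```python
def parse_number(text):
    """Parse an int (decimal or 0x hex) or float literal, with optional '-'."""
    negative = text.startswith("-")
    body = text[1:] if negative else text
    if body.startswith("0x"):
        value = int(body[2:], 16)
    elif body.isdigit():
        value = int(body)
    else:
        value = float(body)  # laxer than the grammar; raises ValueError if malformed
    return -value if negative else value


def parse_range(text):
    """Return (kind, low, high), where kind is 'intrange' or 'floatrange'."""
    low_text, sep, high_text = text.partition("..")
    if not sep:
        raise ValueError("a range needs two periods between its bounds")
    low, high = parse_number(low_text), parse_number(high_text)
    both_int = isinstance(low, int) and isinstance(high, int)
    return ("intrange" if both_int else "floatrange", low, high)


def in_range(value, range_text):
    """Check value against a range literal; bounds assumed inclusive."""
    _, low, high = parse_range(range_text)
    return low <= value <= high


print(parse_range("2..4"))   # ('intrange', 2, 4)
print(in_range(3, "2..4"))   # True
```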

Dates

Dates are specified in the international format defined by ISO 8601. The standard defines several alternatives; only the following one is supported: yyyy-mm-ddThh:mm:ssZ (the Z is optional). Only times in UTC are supported. In addition to the full date format, the query language also accepts partial date specifications. Such partial dates define date ranges. For example: 1973-02Z specifies the date range 1973-02-01T00:00:00Z..1973-02-28T23:59:59Z, which corresponds to February in 1973.
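The expansion of a partial date into a date range can be reproduced with the standard library. This sketch assumes that the accepted partial forms are yyyy, yyyy-mm, and yyyy-mm-dd, each with an optional trailing Z; only the yyyy-mm form is confirmed by the example above. The function name expand_partial_date is hypothetical.

```python
import calendar
from datetime import datetime, timezone


def expand_partial_date(text):
    """Return the (start, end) UTC datetimes covered by a partial date."""
    body = text[:-1] if text.endswith("Z") else text
    parts = [int(p) for p in body.split("-")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected yyyy, yyyy-mm or yyyy-mm-dd")
    year = parts[0]
    # A missing month means the whole year, a missing day the whole month.
    start_month, end_month = (parts[1], parts[1]) if len(parts) >= 2 else (1, 12)
    if len(parts) == 3:
        start_day = end_day = parts[2]
    else:
        start_day = 1
        end_day = calendar.monthrange(year, end_month)[1]  # last day of month
    start = datetime(year, start_month, start_day, 0, 0, 0, tzinfo=timezone.utc)
    end = datetime(year, end_month, end_day, 23, 59, 59, tzinfo=timezone.utc)
    return start, end


start, end = expand_partial_date("1973-02Z")
fmt = "%Y-%m-%dT%H:%M:%SZ"
print(start.strftime(fmt) + ".." + end.strftime(fmt))
# 1973-02-01T00:00:00Z..1973-02-28T23:59:59Z
```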

The previous example also shows that date ranges are either implicitly defined using a partial date, or they are given by a date interval where the dates are separated by two periods (..).

Locations

Locations can be specified in two ways: either by using a human-readable address (text), or by specifying both a latitude and longitude. The textual form has the following microsyntax: @"1600 Amphitheatre Parkway, Mountain View, CA, USA"; i.e. @ prefixes a string.

Locations given by a latitude/longitude have to use the syntax defined by ISO 6709 (omitting altitudes). This standard defines several formats. The Google Base query language supports the following forms: @+-DD.D...+-DDD.D..., @+-DDMM.M...+-DDDMM.M..., and @+-DDMMSS.S...+-DDDMMSS.S.... The first form defines the latitude (first number) and longitude as fractional degree values. The second form consists of degrees (2/3 digits) and fractional minutes (2 digits). The third form consists of degrees (2/3 digits), minutes (2 digits), and fractional seconds (2 digits).

Here are some examples specifying locations via latitude/longitude: @+4852+00220 (Paris), @+48.8577+002.295 (Eiffel Tower, Paris), @+90+000 (North Pole), @+40.6894-074.0447 (Statue of Liberty, NYC).
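The three latitude/longitude forms can be told apart by the number of digits before the decimal point: 2, 4, or 6 for the latitude and 3, 5, or 7 for the longitude. The following sketch converts any of them to decimal degrees. The names parse_location and to_degrees are illustrative, and no validation of value ranges (for example latitude beyond 90 degrees) is attempted.

```python
import re

# '@', then a signed latitude and a signed longitude, each optionally fractional.
COORD_RE = re.compile(r"@([+-]\d+(?:\.\d+)?)([+-]\d+(?:\.\d+)?)")


def to_degrees(component, degree_digits):
    """Convert one signed component to decimal degrees.

    degree_digits is 2 for latitude and 3 for longitude.
    """
    sign = -1.0 if component[0] == "-" else 1.0
    digits = component[1:]
    whole = digits.split(".")[0]
    extra = len(whole) - degree_digits  # 0: degrees, 2: + minutes, 4: + seconds
    if extra not in (0, 2, 4):
        raise ValueError("unsupported coordinate form: " + component)
    if extra == 0:
        value = float(digits)
    else:
        degrees = int(digits[:degree_digits])
        rest = digits[degree_digits:]
        if extra == 2:
            value = degrees + float(rest) / 60
        else:
            value = degrees + int(rest[:2]) / 60 + float(rest[2:]) / 3600
    return sign * value


def parse_location(text):
    """Return (latitude, longitude) in decimal degrees."""
    m = COORD_RE.fullmatch(text)
    if m is None:
        raise ValueError("not a latitude/longitude location: " + text)
    return to_degrees(m.group(1), 2), to_degrees(m.group(2), 3)


print(parse_location("@+48.8577+002.295"))   # (48.8577, 2.295)
print(parse_location("@+40.6894-074.0447"))  # (40.6894, -74.0447)
```

For the degrees-and-minutes example @+4852+00220, the same function yields 48 + 52/60 degrees of latitude and 2 + 20/60 degrees of longitude.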

Geographic areas can be represented by a location and a radius specified either in meters, kilometers, or miles. The syntax is defined above. Here is an example: @+40.75-074.00 + 5mi (NYC + radius of 5 miles).
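Whether a point lies inside such an area can be approximated with a great-circle distance. This document does not say how Google Base computes distances, so the haversine formula, the mean Earth radius, and the function names below are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation
UNIT_IN_METERS = {"m": 1.0, "km": 1000.0, "mi": 1609.344}


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within(center, radius, unit, point):
    """True if point lies within radius (in the given unit) of center."""
    return distance_m(*center, *point) <= radius * UNIT_IN_METERS[unit]


nyc = (40.75, -74.00)
statue_of_liberty = (40.6894, -74.0447)
print(within(nyc, 5, "mi", statue_of_liberty))  # True
print(within(nyc, 3, "mi", statue_of_liberty))  # False
```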
