WebDataModel  
A description of the Web Data Model, a lowest-common-denominator model for data on the web
Phase-Design
Updated Jul 29, 2010 by bill.bra...@gmail.com

Introduction

Web sites implicitly expose a common data model to clients, but exactly what does that data model consist of? This document attempts to describe the lowest-common-denominator for most websites; that model is what the in-memory database in Itemscript attempts to emulate.

Obviously, real sites differ in various ways from the idealized version described here. The most common difference is that access to a resource may involve a request with a query-string even when what is happening is not technically a "query". For the purposes of this description, we're going to ignore that.

What is a URL?

We're going to assume that the URLs we're talking about are of the typical generic form described in RFC 2396. We will assume that any relative URL has been converted to absolute form first. As far as we're concerned, URLs have only four components: the "server" (identified as scheme + host + port), the path, the query string, and the fragment.
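As a minimal sketch of that four-way split, the following uses the standard java.net.URI class. The class name UrlParts and its field names are hypothetical, chosen only for this illustration; they are not part of Itemscript.

```java
import java.net.URI;

/** Hypothetical helper: splits an absolute URL into the four components used in this document. */
final class UrlParts {
    final String server;   // scheme + host + optional port, e.g. "ftp://foo.com:2323"
    final String path;     // raw path; "/" when the URL has none
    final String query;    // raw query string, or null when absent
    final String fragment; // raw fragment, or null when absent

    private UrlParts(String server, String path, String query, String fragment) {
        this.server = server;
        this.path = path;
        this.query = query;
        this.fragment = fragment;
    }

    static UrlParts parse(String url) {
        URI uri = URI.create(url);
        if (!uri.isAbsolute() || uri.getHost() == null) {
            throw new IllegalArgumentException("expected an absolute URL with a host: " + url);
        }
        // getPort() returns -1 when the URL does not name a port explicitly.
        String port = uri.getPort() >= 0 ? ":" + uri.getPort() : "";
        String server = uri.getScheme() + "://" + uri.getHost() + port;
        String rawPath = uri.getRawPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
        return new UrlParts(server, path, uri.getRawQuery(), uri.getRawFragment());
    }
}
```

The components are kept in raw (still percent-encoded) form here, so that decoding can be applied separately to the path and the query string.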

What is a server?

A server is some reachable service that contains resources and makes them available to clients. It may support certain types of queries on resources, or queries that answer other questions. It may allow specific resources to be changed in whole, or it may accept certain other kinds of changes that don't specifically affect a single resource, or that affect only a sub-part of one.

For the purposes of this discussion, a "server" is identified by everything up to and including the server name and port number in a URL - the scheme, hostname, and port number. For instance, http://example.com, or ftp://foo.com:2323. The internal details of the "server" part can be ignored.

What is a resource?

A resource is a bunch of data with a content-type, encoding, and other metadata associated with it that helps the client interpret it. It might be a file, a web page, something generated by a database, and so on; what is important is that to the client, it is always identified by the same path.

We're going to pretend that all resources are JSON; either they actually are JSON, or they will be converted to JSON by one of the following methods depending on content-type:

  • xml - Converted to a JSON representation (to be determined).
  • text - Converted to a JSON object, with a field "contentType" with the content-type and a field "content" containing a JSON array, with each line (minus the line ending) from the original file being an entry in the array.
  • binary/other - Converted to a JSON object, with a field "contentType" with the content-type, and a field "content" containing a base-64 encoded version of the content in a JSON string.
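The text and binary conversions listed above can be sketched as follows, using plain Java maps and lists to stand in for JSON objects and arrays. The class and method names (ResourceWrapper, wrapText, wrapBinary) are hypothetical, and the XML case is left out because its representation is still to be determined.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the text and binary conversions; not Itemscript code. */
final class ResourceWrapper {
    /** Wraps text as { "contentType": ..., "content": [line, line, ...] }. */
    static Map<String, Object> wrapText(String contentType, String text) {
        Map<String, Object> wrapped = new LinkedHashMap<>();
        wrapped.put("contentType", contentType);
        // Split on CRLF, LF, or CR so that line endings are dropped from each entry.
        wrapped.put("content", Arrays.asList(text.split("\r\n|\n|\r")));
        return wrapped;
    }

    /** Wraps bytes as { "contentType": ..., "content": "<base-64 string>" }. */
    static Map<String, Object> wrapBinary(String contentType, byte[] bytes) {
        Map<String, Object> wrapped = new LinkedHashMap<>();
        wrapped.put("contentType", contentType);
        wrapped.put("content", Base64.getEncoder().encodeToString(bytes));
        return wrapped;
    }
}
```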

For the purposes of this discussion, a "resource" is identified by the path section of a URL - everything between the hostname/port-number and the end of the path (which may be marked by a query-string or fragment). For instance, /some/path/ or /images/foo.gif.

We're not going to distinguish between "files" and "directories" initially, and therefore the presence or absence of a trailing slash on the path is irrelevant.

All paths start with a slash and are therefore rooted at the base of the resources a server offers; the rest of the path is divided by slashes and URL-decoded to the names of sub-resources. The model corresponds roughly to a List<String> in Java, or symbolically in JSON, the path string /some/path/foo.gif would correspond to:

[
    "/",
    "some",
    "path",
    "foo.gif"
]

The initial "/" entry reminds us that all navigation starts from the root in an absolute URL. All paths have at least that "/" entry.
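A minimal sketch of that decoding in Java might look like the following. The class name PathKeys is hypothetical, and ignoring empty segments (so that trailing or doubled slashes make no difference) is an assumption of this sketch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns a raw URL path into the key list described above. */
final class PathKeys {
    static List<String> split(String rawPath) {
        if (!rawPath.startsWith("/")) {
            throw new IllegalArgumentException("path must start with '/': " + rawPath);
        }
        List<String> keys = new ArrayList<>();
        keys.add("/"); // every path starts at the root
        for (String segment : rawPath.split("/")) {
            if (segment.isEmpty()) {
                continue; // leading slash, trailing slash, or doubled slash
            }
            keys.add(decode(segment));
        }
        return keys;
    }

    private static String decode(String segment) {
        try {
            // URLDecoder treats '+' as a space (form encoding); protect literal '+' in paths.
            return URLDecoder.decode(segment.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Splitting before decoding means that an encoded slash (%2F) inside a segment stays part of that segment's name rather than creating an extra level.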

What is a query?

A query is something asked of (or about) a given resource, indicated by the presence of a query-string in the URL after the path. A query does not necessarily return a given resource, but may return information about resources.

For the purposes of this discussion, a "query" is identified by the query-string section of a URL: everything between the end of the path and either the end of the URL or the beginning of the fragment section. The query is decoded to an unordered set of keys and values, in which each key may have one or more string values. Keys with no value are assumed to have a value of an empty string. Roughly, this is a Map<String, List<String>> in Java. For instance, in the URL http://example.com/abc?key=value&multiKey=foo&multiKey=bar&emptyKey the query looks a bit like this in JSON:

{
    "key" : [
        "value"
    ],
    "multiKey" : [
        "foo",
        "bar"
    ],
    "emptyKey" : [
        ""
    ]
}
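Decoding a query string into that Map<String, List<String>> shape can be sketched like this. The class name QueryStrings is hypothetical; the sketch follows the rule stated above that a key with no value gets an empty string as its value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: decodes a raw query string into keys with one or more values. */
final class QueryStrings {
    static Map<String, List<String>> parse(String rawQuery) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return result;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String key = decode(eq < 0 ? pair : pair.substring(0, eq));
            // A key with no "=value" part is given the empty string as its value.
            String value = eq < 0 ? "" : decode(pair.substring(eq + 1));
            result.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return result;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

A LinkedHashMap is used only to make output predictable; the model itself treats the keys as unordered.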

What is the model for organization of resources?

So how are resources organized on the server, at least as far as clients can see?

What is exposed to clients looks a lot like a filesystem, but has some important differences. Like a filesystem, it is a rooted, tree-structured model where nodes in the tree can have sub-nodes; unlike a filesystem, which draws a distinction between nodes with values (files) and nodes with sub-nodes (directories), that distinction is blurred on websites.

So, for instance, it is common to serve one resource at http://example.com/foo and another at http://example.com/foo/bar.html without a real distinction being made that the former is a "directory" and the latter is a "file". Technically speaking, the former might be a directory and the resource served might be "index.html", but to the client this is not apparent; nor is it really apparent that bar.html represents a file and that a resource named /foo/bar.html/xyz might not also be available.

You can imagine this as being represented by the following JSON structure (and this is in fact the format used by dump operations on the Itemscript database):

{
    "value" : "",
    "subItems" : {
        "foo" : {
            "value" : "",
            "subItems" : {
                "bar.html" : {
                    "value" : "<html>etc etc</html>",
                    "subItems" : {}
                }
            }
        }
    }
}

Whether this is actually how it is implemented is not really the point; this is how it appears to clients. Not all nodes accept sub-nodes (for instance, in reality bar.html is unlikely to accept them) but that's really a restriction of the more general model, rather than a general rule in itself. For the purposes of this discussion, we'll assume that we are simply talking about a hierarchical tree of generic nodes that can have both values and sub-nodes, and that we navigate our way down this tree from the root using the series of keys given in the path.
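The generic node model can be sketched in a few lines of Java. The class name Node and its methods are hypothetical; the field names value and subItems simply mirror the dump format shown above, and the key list passed to find is assumed to start with the "/" root entry.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical generic node: any node may hold both a value and sub-nodes. */
final class Node {
    Object value = "";
    final Map<String, Node> subItems = new LinkedHashMap<>();

    /** Returns the sub-node stored under the key, creating it if necessary. */
    Node child(String key) {
        return subItems.computeIfAbsent(key, k -> new Node());
    }

    /** Follows a key list such as ["/", "foo", "bar.html"]; returns null if a key is missing. */
    Node find(List<String> keys) {
        Node current = this;
        // Skip the leading "/" entry: it stands for this (root) node.
        for (String key : keys.subList(1, keys.size())) {
            current = current.subItems.get(key);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

For example, root.child("foo").child("bar.html").value = "<html>etc etc</html>" builds the structure above, after which root.find(Arrays.asList("/", "foo", "bar.html")) returns the node holding the page.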

What can we do?

Obviously most web servers don't allow just any user to replace any resource they feel like. The most common operation is not putting a new resource under a given path, but sending some additional data and letting the server figure out a path to put it under. REST-style sites correspond most closely to the "put a whole resource" model, as do sites like Wikipedia that let you edit a resource and allow creation of hierarchically-organized sub-resources below any other resource.

For the purposes of this discussion, we'll assume that permissions are irrelevant and that the user is allowed to perform any operation.

What operations are available?

We're going to restrict the discussion to some common, simple operations. Except for a basic GET operation, all of these would be optional. All of them map to one of the following HTTP methods:

  • GET
  • PUT
  • DELETE
  • POST

You can think of this section as an attempt to reconcile the differences between HTTP/REST and a programming-language API with a Java Map-style interface that has only three main methods, with all querying performed through the use of special URLs supplied to the get operation.

Some of these are very basic and supported by almost all web servers; others are implemented in various ways on various systems, but we're going to describe a common set of operations here and how they apply to an idealized server. A couple of things are non-standard, but perhaps ought to be! This will have a parallel discussion of the operations provided by Itemscript and how they correspond to certain HTTP operations. (Specifically, when Itemscript sees a mem: URL it does things on the in-memory database that correspond to certain things it does when it sees an http: URL.)

In some ways this is similar to something like WebDAV; the idea here is to create something simpler, with many fewer features, designed to specifically support data resources. The intention is also to be able to interact with all existing web servers, regardless of their support for all of the described operations, by treating them as if they supported this basic data model. The same interface can be mapped to a local filesystem quite accurately, and other data sources can be represented in a very similar way, with the main differences being exactly which resources are allowed to contain sub-resources.

HTTP GET

  • 1.1 - Get a particular resource.
  • 1.2 - Get part of a particular resource.
  • 1.3 - Get the results of a query about a resource.
  • 1.4 - Get part of the results of a query about a resource.
  • 1.5 - Get the results of a query not about a specific resource.
  • 1.6 - Get part of the results of a query not about a specific resource.

HTTP PUT

  • 2.1 - Put a new whole value for a resource, whether that resource existed before or not.
  • 2.2 - Add a new sub-resource to an existing resource under a generated UUID.

HTTP POST

  • 3.1 - Change part of the value of a resource.
  • 3.2 - Remove part of the value of a resource.
  • 3.3 - Perform some other state-changing operation on the server.

HTTP DELETE

  • 4.1 - Remove a resource.

These are mapped to the following combinations of methods and types of URL supplied to the three Map-style methods in Itemscript that are used to query, retrieve, or change data - get, put, and remove.

get

  • 1.1 - URL with no query string or fragment.
  • 1.2 - URL with no query string, with a fragment.
  • 1.3 - URL with a query string with a special key and no fragment.
  • 1.4 - URL with a query string with a special key and a fragment.
  • 1.5 - URL with a query string without a special key and no fragment.
  • 1.6 - URL with a query string without a special key, with a fragment.

put

  • 2.1 - URL with no query string or fragment.
  • 2.2 - URL with a query string with the special "uuid" key and no fragment.
  • 3.1 - URL with no query string, with a fragment.
  • 3.3 - URL with a query string without the special "uuid" key and no fragment.

remove

  • 4.1 - URL with no query string or fragment.
  • 3.2 - URL with no query string, with a fragment.
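The mapping above can be sketched as a small classifier that takes a Map-style method name and a URL and returns the operation number. This is only an illustration: the class name OperationClassifier is hypothetical, the get special keys are the ones this document proposes for Itemscript, and a fragment is read as "get part of", as in 1.2 and 1.4. Combinations that the lists above do not cover are rejected.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical classifier for the method-and-URL combinations listed above. */
final class OperationClassifier {
    private static final Set<String> GET_KEYS = new HashSet<>(
            Arrays.asList("countItems", "keys", "pagedKeys", "pagedItems", "dump"));
    private static final Set<String> PUT_KEYS = Collections.singleton("uuid");

    static String classify(String method, String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery();
        boolean hasQuery = query != null;
        boolean hasFragment = uri.getRawFragment() != null;
        switch (method) {
            case "get":
                if (!hasQuery) return hasFragment ? "1.2" : "1.1";
                if (hasSpecialKey(query, GET_KEYS)) return hasFragment ? "1.4" : "1.3";
                return hasFragment ? "1.6" : "1.5";
            case "put":
                if (!hasQuery) return hasFragment ? "3.1" : "2.1";
                if (hasFragment) throw new IllegalArgumentException("no mapping for put with query and fragment");
                return hasSpecialKey(query, PUT_KEYS) ? "2.2" : "3.3";
            case "remove":
                if (hasQuery) throw new IllegalArgumentException("no mapping for remove with a query");
                return hasFragment ? "3.2" : "4.1";
            default:
                throw new IllegalArgumentException("unknown method: " + method);
        }
    }

    /** True if any key in the raw query string is one of the special keys. */
    private static boolean hasSpecialKey(String rawQuery, Set<String> specialKeys) {
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (specialKeys.contains(eq < 0 ? pair : pair.substring(0, eq))) {
                return true;
            }
        }
        return false;
    }
}
```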

What do these operations do?

For each of the following examples, we'll assume an initial data model state looking like this, starting at the root value:

{
    "value" : "",
    "subItems" : {
        "abc" : {
            "value" : "xyz",
            "subItems" : {}
        },
        "def" : {
            "value" : {
                "a" : "b"
            },
            "subItems" : {}
        }
    }
}

This might be in-memory or on a remote server; we'll show examples of both.

1.1 - Get a particular resource

Itemscript mem: database call:

String value = system.getString("/abc"); // returns "xyz"

HTTP equivalent:

GET http://example.com/abc

1.2 - Get part of a particular resource

Itemscript mem: database call:

String value = system.getString("/def#a"); // returns "b"

HTTP equivalent:

GET http://example.com/def
[ locally navigate to #a value ]

1.3 & 1.4 - Get the results of a query about a particular resource

The "special keys" I am suggesting (and that are supported in Itemscript) are these:

  • "countItems" - Count the sub-resources of a given resource.
  • "keys" - Give the keys of the sub-resources of a given resource as a list.
  • "pagedKeys" - Give just some of the keys of a given resource as a list.
  • "pagedItems" - Give both the keys and the contents of some of the sub-resources of a given resource as a list.
  • "dump" - Dump the given resource and all of its sub-resources as an object.

Itemscript mem: database call:

int count = system.getInt("/?countItems#count"); // returns 2

HTTP equivalent:

GET http://example.com/?countItems

1.5 & 1.6 - Get the results of a query not about a particular resource

All other queries that do not change state fall into this category.

There is no specific mem: database Itemscript call since the database doesn't support any other queries at present.

HTTP equivalent:

GET http://example.com/?someQuery=someValue

2.1 - Put a new whole value for a resource, whether that resource existed before or not.

Itemscript mem: database call:

    system.put("/ghi", "a new value");

HTTP equivalent:

PUT http://example.com/ghi
with content of "a new value"

The expected state of the data model after the call:

{
    "value" : "",
    "subItems" : {
        "abc" : {
            "value" : "xyz",
            "subItems" : {}
        },
        "def" : {
            "value" : {
                "a" : "b"
            },
            "subItems" : {}
        },
        "ghi" : {
            "value" : "a new value",
            "subItems" : {}
        }
    }
}

2.2 - Add a new sub-resource to an existing resource under a generated UUID.

Itemscript mem: database call:

    system.put("/abc?uuid", "a new value");

HTTP equivalent:

Itemscript translates a request like:

    system.put("http://example.com/abc?uuid", "a new value");

First, it creates a UUID and appends it to the existing path. Then it does:

PUT http://example.com/abc/550e8400-e29b-41d4-a716-446655440000

As you can see, we can translate this into an HTTP PUT to a specific resource name by generating the UUID for that resource.
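A minimal sketch of that rewrite step, assuming the URL ends with the bare "?uuid" query. The class name UuidPaths is hypothetical, and java.util.UUID.randomUUID() is used only for illustration; how Itemscript actually generates its UUIDs is not specified here.

```java
import java.util.UUID;

/** Hypothetical helper: replaces a trailing "?uuid" with a generated sub-resource name. */
final class UuidPaths {
    static String resolve(String url) {
        String marker = "?uuid";
        if (!url.endsWith(marker)) {
            return url; // nothing to rewrite
        }
        String base = url.substring(0, url.length() - marker.length());
        // Avoid a doubled slash when the path already ends with one.
        String separator = base.endsWith("/") ? "" : "/";
        return base + separator + UUID.randomUUID();
    }
}
```

The resulting URL can then be used for an ordinary PUT of a whole resource, which is operation 2.1.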

The expected state of the data model after the call:

{
    "value" : "",
    "subItems" : {
        "abc" : {
            "value" : "xyz",
            "subItems" : {
                "550e8400-e29b-41d4-a716-446655440000" : {
                    "value" : "a new value",
                    "subItems" : {}
                }
            }
        },
        "def" : {
            "value" : {
                "a" : "b"
            },
            "subItems" : {}
        }
    }
}

3.1 - Change part of the value for a resource

Itemscript mem: database call:

    system.put("/def#c", "d");

HTTP equivalent:

No standard equivalent exists, but many non-standard ones do. Itemscript does not yet implement this, but will provide a standard version for when it encounters a call like:

    system.put("http://example.com/def#c", "d");

The fragment, having no meaning in a POST URL, must be translated to some sort of field in the POST request. So what Itemscript will generate is a request like this:

POST http://example.com/def
with content of:
    {
        "action" : ["changeValue"],
        "fragment" : ["#c"],
        "value" : ["d"]
    }

The expected state of the data model after the call:

{
    "value" : "",
    "subItems" : {
        "abc" : {
            "value" : "xyz",
            "subItems" : {}
        },
        "def" : {
            "value" : {
                "a" : "b",
                "c" : "d"
            },
            "subItems" : {}
        }
    }
}

3.2 - Remove part of the value of a resource.

Itemscript mem: database call:

    system.remove("/def#a");

HTTP equivalent:

No standard equivalent exists. Itemscript does not yet implement this, but will provide a standard version for when it encounters a call like:

    system.remove("http://example.com/def#a");

Again, the fragment must be translated to part of a POST request like so:

POST http://example.com/def
with content of:
{
    "action" : ["removeValue"],
    "fragment" : ["#a"]
}

The expected state of the data model after the call:

{
    "value" : "",
    "subItems" : {
        "abc" : {
            "value" : "xyz",
            "subItems" : {}
        },
        "def" : {
            "value" : {},
            "subItems" : {}
        }
    }
}

3.3 - Perform some other state-changing operation on the server.

Itemscript mem: database call:

There is no real equivalent, as the in-memory database only supports the changing of resources or parts of resources.

HTTP equivalent:

Any POST request falls into this category.

Itemscript does not yet implement this, but will provide something to ease the creation of POST requests, by turning calls of the form:

    system.put("http://example.com?query=string

into HTTP calls like:

POST http://example.com
with content of "query=string"

This may seem a little backwards, but it's there to shoehorn POST requests into the three-call API.

4.1 - Remove a resource.

Itemscript mem: database call:

    system.remove("/abc");

HTTP equivalent:

DELETE http://example.com/abc

The expected state of the data model after the call:

{
    "value" : "",
    "subItems" : {
        "def" : {
            "value" : {
                "a" : "b"
            },
            "subItems" : {}
        }
    }
}
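The mem: calls walked through above (1.1, 1.2, 2.1, 3.1, 3.2 and 4.1) can be tied together in one small in-memory sketch. This is not the Itemscript implementation: the class name MemSketch and its internals are hypothetical, query strings are not handled, and a fragment is assumed to name a single key inside an object value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory store with Map-style get, put, and remove. */
final class MemSketch {
    private static final class Item {
        Object value = "";
        final Map<String, Item> subItems = new LinkedHashMap<>();
    }

    private final Item root = new Item();

    Object get(String url) {
        Item item = walk(keys(url), false);
        if (item == null) return null;
        String fragment = fragment(url);
        if (fragment == null) return item.value;                 // 1.1: whole resource
        return item.value instanceof Map
                ? ((Map<?, ?>) item.value).get(fragment) : null; // 1.2: part of a resource
    }

    @SuppressWarnings("unchecked")
    void put(String url, Object value) {
        Item item = walk(keys(url), true);
        String fragment = fragment(url);
        if (fragment == null) {                                   // 2.1: replace whole value
            item.value = value;
            return;
        }
        if (!(item.value instanceof Map)) {
            item.value = new LinkedHashMap<String, Object>();
        }
        ((Map<String, Object>) item.value).put(fragment, value);  // 3.1: change part
    }

    void remove(String url) {
        List<String> keys = keys(url);
        String fragment = fragment(url);
        if (fragment != null) {                                   // 3.2: remove part
            Item item = walk(keys, false);
            if (item != null && item.value instanceof Map) {
                ((Map<?, ?>) item.value).remove(fragment);
            }
            return;
        }
        if (keys.isEmpty()) return; // the root itself is never removed in this sketch
        Item parent = walk(keys.subList(0, keys.size() - 1), false);
        if (parent != null) {
            parent.subItems.remove(keys.get(keys.size() - 1));    // 4.1: remove resource
        }
    }

    private static String fragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? null : url.substring(hash + 1);
    }

    /** Path keys below the root, ignoring the fragment and any empty segments. */
    private static List<String> keys(String url) {
        int hash = url.indexOf('#');
        String path = hash < 0 ? url : url.substring(0, hash);
        List<String> keys = new ArrayList<>();
        for (String key : path.split("/")) {
            if (!key.isEmpty()) keys.add(key);
        }
        return keys;
    }

    private Item walk(List<String> keys, boolean create) {
        Item current = root;
        for (String key : keys) {
            Item next = current.subItems.get(key);
            if (next == null) {
                if (!create) return null;
                next = new Item();
                current.subItems.put(key, next);
            }
            current = next;
        }
        return current;
    }
}
```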

Itemscript is a registered trademark of Data Base Architects, Inc. The Itemscript specification, the Itemstore API specification and the JAM template language specification are open source works published under the new BSD license.

