DSPDataPersistenceOnMongoDB  

Description of the persistence model for the DSP Messages using a Document-Oriented System

Introduction

This page describes the persistence model using mongoDB, a document-oriented system that combines the best features of key-value pair (KVP) stores and relational database systems (RDBS). Although this document covers a new architecture chosen during the technology research, it references the first version of DSPDataPersistence as needed to avoid repeating information. Last but not least, this document is guided by the comments regarding revisions r584 and r585.

Persisting Generic Sensor Data into a Datastore

Sensor networks are commonly used in the scientific community as tools to monitor the state of the environment. For example, NASA SensorWeb uses sensors to collect data from specific volcanoes around the world that have given properties, informing researchers about volcanic activity and behavior.

No matter how data is collected, sensors collaborating in a network produce meaningful data for the users based on environmental conditions. In this way, samples can be collected during a week, a month, or any specific period of time in order to analyze the state of the environment at different moments.

When it comes to data representation, however, each sensor's architect or engineer has typically come up with his or her own metadata to describe the sensor's properties. For example, a car's temperature and water's salinity and temperature are different properties applied to the car and the water, respectively. Sensors may therefore expose dozens or even hundreds of different properties, which semantically identify data for different objects. Additionally, time is an important variable that must be recorded when collecting data from sensor networks; the field of Time Series studies ways to track these points in time.

Considering that sensors can join a dynamic sensor network at any time, the chosen data model must be able to accommodate completely different lists of properties. The model can closely resemble the relational model and yet describe properties simply as key-value pairs. Furthermore, scalability plays an important role when choosing a persistence technology, especially in the area of sensor networks, where the number of sensors can grow at any time.

Choosing a Data Model for Sensor Networks

What is a good design for modeling data in such a dynamic environment? Based on the properties described in the previous section, it is an important time to find alternatives to the Relational Model, which has been the default choice in many types of projects for persisting data. The article "Is The Relational Database Doomed?" details the difficulties the Relational Data Model has keeping up with the constant schema changes faced in dynamic environments such as dynamic web sites; its author argues that an RDBS is more suitable for environments whose types are well known and don't change constantly. For this reason, the article presents the Key-Value Pair (KVP) model, a simple hash-like structure that describes each entity of a system as a set of key-value pairs.

Taking into account that sensor data consists only of properties and their related values, I considered the KVP model an option to address the dynamic nature of sensor networks. However, the blog entry "What is the right data model?" describes not only the KVP model, but also the Tabular and the Document-Oriented Models. The Tabular Model is described as an infinite table that holds an unbounded number of objects, as implemented by Google's Bigtable, while the Document-Oriented Model sits between the Relational and the Tabular Models: it can relate entities through their properties and yet uses a KVP notation.

In general, I wanted to find a persistent storage system that can easily accept new data types without re-engineering the current schemas. However, such approaches come with trade-offs. The KVP data model implies repetition of data, so the disk space used is larger than usual. Although this can be a drawback, better and faster data retrieval algorithms such as MapReduce have been shown to pay off in the end. Furthermore, since disk space is considered a commodity, scaling data storage with more machines in a grid or cluster is cheaper than buying very expensive servers. Therefore, I was looking for a system that provides System Replication or Database Partitioning, also referred to as Database Sharding.

mongoDB Data Store

Given that the Document-Oriented Model is a good candidate to persist sensors' properties, and given the list of technologies enumerated in the previous article DSPDataPersistence, the open-source project mongoDB was chosen for evaluation in our case study, the netBEAMS DSP Platform.

mongoDB stores data in collections of documents encoded in BSON, a binary representation of the JSON data format, and supports dynamic queries and indexing. As stated on its web site, mongoDB "bridges the gap between key/value stores (which are fast and highly scalable) and traditional RDBMS systems (which are deep in functionality)".

  • mongoDB supports the notion of KVP;
  • mongoDB is written in C++ and is therefore available on any major platform; it also offers a broad range of API drivers written in different languages such as Java, Python, Perl and Ruby;
  • mongoDB is open-source, with good community support and availability through mailing lists, freenode IRC channel, and commercial support through 10gen company;
  • mongoDB supports distributed-system properties such as Master-Slave replication, and features like Database Shards with auto-sharding based on shard keys.

The rest of this documentation describes the experiments conducted with mongoDB, which provides a data-centric persistence layer for NetBEAMS.

Experiment: Saving 1 Million Objects into mongoDB

Considering how much dynamic sensor networks can grow in size, I designed an experiment that generates 1 million random Sonde Data Type instances to be saved in a mongoDB server. The Sonde Data Type is a POJO representation of the YSI Data Acquisition. The goal of the experiment is to transfer the transient instances into the mongoDB server, using a single host or a sharded distributed list of servers, and to verify whether the way scientists describe the data influences the performance of the database system. In summary, the experiments will analyze the following:

  • Operational evaluation:
    • Single Server X Sharded Cluster Server
      • Full Key names X Shorter Key names
  • Questions asked:
    • File system size consumed by the produced data;
    • Data size reported by the database system;
    • How fast the system can count the number of documents;
    • How fast the system can count a set of documents from a result of a query;
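
The comparison of full versus shorter key names matters because BSON stores the name of every key inside every document, as a null-terminated string. The following Java sketch estimates only that key-name overhead. The full names are a subset of the keys used by the persistence code on this page; the short names are hypothetical abbreviations chosen for illustration, and all other BSON overhead is ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class KeyNameOverhead {

    /** Bytes taken by the key names of one document: UTF-8 bytes plus one terminator byte per key. */
    static long keyBytesPerDocument(List<String> keys) {
        long total = 0;
        for (String key : keys) {
            total += key.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> fullNames = Arrays.asList("temperature", "salinity", "turbidity");
        // Hypothetical abbreviations, not taken from the NetBEAMS sources.
        List<String> shortNames = Arrays.asList("t", "s", "tb");
        long documents = 1000000L;
        long saved = (keyBytesPerDocument(fullNames) - keyBytesPerDocument(shortNames)) * documents;
        System.out.println("Approximate bytes saved on key names: " + saved);
    }
}
```

Because the overhead is paid once per document, even a few bytes per key add up over 1 million documents, which is why the experiment measures both variants.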

As for the use cases evaluation, the following experiments will be performed:

  • Use Cases Evaluation
    • Find the temperature set for a given position between a start and end dates;
    • Count the number of data collected by a type of sensor in the month of Dec 2000;

Case Study Requirements

Our case study is based on NetBEAMS, a collaboration project between the department of Computer Science at San Francisco State University and the Romberg Tiburon Center (RTC) that assists the San Francisco Bay Environmental Assessment and Monitoring Station (SF-BEAMS) project. The RTC focuses its research on complex marine and estuarine environments and uses environmental sensors for its research. Its sensor network is located offshore of the RTC pier (SEE LIVE! cam).

Among the devices in use, this research used the YSI 6600 EDS V2 sondes to develop the case study. A picture of a YSI sonde can be seen below:

In May 2009, the infrastructure setup for the YSI sondes was defined as follows:

  • 5 YSI sondes in operation at the RTC site Pier in Tiburon Island, San Francisco Bay;
  • Each YSI sonde produces 52 Bytes of data each time it measures the site conditions;
  • The read frequency is 1, 6 or 15 minutes, depending on the software configuration.

In order to extract the measurement data, the YSI sonde provides an RS-232 serial connection that can be connected to a computer. The following snapshot is an example of the 52 bytes of data (13x4 Bytes) transferred from the YSI data stream:

"21.20    193    179 5588.40   0.09   0.084   0.059  7.98   -79.6   99.5   8.83     0.4     8.7"

The estimated size of the raw data, for the number of YSI sondes reported above, is as follows:

  • At the rate of 1 minute:

Number of YSI   1 Hour     1 Day        1 Week       1 Month     1 Year
1               3.04 KB    73.125 KB    511.875 KB   1.99 MB     23.99 MB
5               15.23 KB   365.625 KB   2.5 MB       9.997 MB    119.97 MB

  • At the rate of 6 minutes:

Number of YSI   1 Hour     1 Day        1 Week       1 Month     1 Year
1               0.5 KB     12.18 KB     85.3125 KB   341.25 KB   3.99 MB
5               2.54 KB    60.93 KB     426.56 KB    1.67 MB     19.99 MB

  • At the rate of 15 minutes:

Number of YSI   1 Hour     1 Day        1 Week       1 Month     1 Year
1               0.2 KB     4.875 KB     34.125 KB    136.5 KB    1.6 MB
5               1.0 KB     24.375 KB    170.625 KB   682.5 KB    7.99 MB
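
The figures in these tables follow from multiplying the 52 bytes per reading by the number of readings in each period. The Java sketch below reproduces that arithmetic as a minimal illustration; the class and method names are hypothetical, and 1 KB is taken as 1024 bytes, which is the convention that matches the daily values in the tables.

```java
class SondeDataVolume {

    /** 13 values of 4 bytes each, as stated for the YSI data stream. */
    static final int BYTES_PER_SAMPLE = 52;

    /** Raw bytes produced by the given number of sondes over the given number of hours. */
    static long rawBytes(int sondes, int intervalMinutes, int hours) {
        long samplesPerSonde = (hours * 60L) / intervalMinutes;
        return samplesPerSonde * BYTES_PER_SAMPLE * sondes;
    }

    /** Converts bytes to kilobytes using 1 KB = 1024 bytes. */
    static double toKB(long bytes) {
        return bytes / 1024.0;
    }

    public static void main(String[] args) {
        int[] intervals = {1, 6, 15};
        for (int interval : intervals) {
            double perDay = toKB(rawBytes(1, interval, 24));
            System.out.printf("1 sonde, every %d min, 1 day: %.3f KB%n", interval, perDay);
        }
    }
}
```

The monthly and yearly columns appear to assume a month of 4 weeks and a year of 12 such months, since 4 x 511.875 KB is about 1.99 MB.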

The following section describes the NetBEAMS infrastructure, developed to program and interrogate the SF-BEAMS sensor network without requiring human intervention.

NetBEAMS Infrastructure

The NetBEAMS infrastructure is set on top of the existing one from the SF-BEAMS. The following image summarizes this joint infrastructure:

  • YSI 6600EDS Sonde: the YSI is responsible for sampling the data from the environment, and it is connected to the Gumstix, the NetBEAMS Gateway Embedded System;
  • Gumstix console-vx: a COTS ARM-based board that provides a computer-on-module environment for the development of small embedded systems. This is the main environment where data extraction and remote processing are accomplished. It runs a cross-compiled version of Linux Kernel 2.6, which manages the data transmission over a 3G Cellular Data Connection;
  • Huawei E220 USB Modem: it is used to transfer the data from the Gumstix to the RTC Labs Data Center.

This documentation focuses on the development of a Software Platform for the NetBEAMS Gateway Embedded System. The architecture of the system can be summarized in the following picture.

The main components of such system can be summarized as follows:

  • Operating System: it uses a cross-compiled Gentoo Linux, which supports a wide variety of development tools, including the Java Virtual Machine (JVM), the underlying platform system;
  • Java Virtual Machine: it uses JamVM, a compact (~200 KB) implementation of the Java Virtual Machine specification, version 2;
  • OSGi Framework: it uses the Knopflerfish implementation of the OSGi 4.1 specification. More details in the following sections;
  • DSP Platform and other Bundles: The plug-and-play DSP Components are based on the OSGi bundles capabilities, which can reuse services registered in the OSGi framework.

OSGi - The Foundation for the Data Sensor Platform

Since each NetBEAMS component is managed by an OSGi component and its infrastructure, this section describes the basic functionality of the OSGi platform.

The OSGi platform was conceived to support modularity in resource-limited environments such as mobile devices and vehicles, but it was first widely deployed in Eclipse, the Java-based Integrated Development Environment (IDE) that supports different programming languages, because of its loosely-coupled architecture and easy-to-use API.

In general, the OSGi Platform can run on top of any Operating System that provides a Java Virtual Machine (JVM), with the set of OSGi bundles published to the platform, as shown in the following image.

The OSGi Platform provides 3 basic layers:

  • Module Layer: this layer is responsible for managing the OSGi bundles that are provided into the OSGi Platform, providing the necessary "wiring" of the components. At this layer, the OSGi Bundles can import or export packages in the level of a Java Class provided by the OSGi Platform;
  • Service Layer: this layer is responsible for the interoperability between 2 or more bundles, granting access to services that were registered by bundles;
  • Execution Layer: this layer executes the bundles and changes their life-cycle state.

The interoperability of OSGi follows the simple Producer-Consumer paradigm of a service model, as shown in the first picture below. The Producer of the service registers it with the Service Broker, while the Consumer uses the Service Lookup to find and reuse the service. In this way, during its life-cycle, an OSGi bundle can first publish its service to the OSGi Platform, where other OSGi bundles can reuse it, as shown in the second picture below.

An existing Java application can be packaged as an OSGi bundle by providing descriptors following the Java Archive (JAR) specification. In general, an OSGi bundle must provide a specification that describes the module to be published into the OSGi Platform, as shown in the next diagram.

The main properties of the OSGi MANIFEST.MF artifact can be summarized as follows:

  • Imported-Packages (the Import-Package header): the Java Packages needed by this OSGi Bundle. These Java Packages must be exported by another OSGi bundle;
  • Exported-Packages (the Export-Package header): the Java Packages that are provided by the OSGi Bundle to the OSGi Platform. Other OSGi bundles can reuse the classes and services in them;
  • Activator (the Bundle-Activator header): the name of the OSGi Activator class, which is responsible for managing the bundle;
  • Classpath (the Bundle-ClassPath header): the list of Java JARs needed to run the bundle.
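
As a minimal illustration of these properties, the following Java sketch builds a manifest in memory with the standard `java.util.jar.Manifest` class and prints it. The header names (Import-Package, Export-Package, Bundle-Activator, Bundle-ClassPath) are the standard OSGi ones; every value, including the `org.example` package and class names, is hypothetical and not taken from the NetBEAMS sources.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

class BundleManifestSketch {

    /** Builds a manifest carrying the four properties discussed above, with made-up values. */
    static Manifest buildManifest() {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be present, otherwise nothing is written out.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.putValue("Bundle-SymbolicName", "org.example.dsp.sample");
        main.putValue("Import-Package", "org.osgi.framework");
        main.putValue("Export-Package", "org.example.dsp.sample.api");
        main.putValue("Bundle-Activator", "org.example.dsp.sample.SampleActivator");
        main.putValue("Bundle-ClassPath", ".,lib/helper.jar");
        return manifest;
    }

    /** Renders the manifest in the MANIFEST.MF text format. */
    static String render(Manifest manifest) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        manifest.write(out);
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        System.out.print(render(buildManifest()));
    }
}
```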

Once the OSGi bundle is installed into the OSGi Platform, it is managed by the OSGi Execution layer, which changes the bundle state according to a set of specifications. The following diagram shows the UML State Diagram of an OSGi Bundle life-cycle:
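
The life-cycle can also be read as a small state machine. The Java sketch below encodes the six standard OSGi bundle states and a simplified set of allowed transitions; it is an illustration only, it omits error paths such as a failing activator, and it does not use the OSGi API itself.

```java
import java.util.EnumSet;
import java.util.Set;

class BundleLifecycleSketch {

    enum State { INSTALLED, RESOLVED, STARTING, ACTIVE, STOPPING, UNINSTALLED }

    /** Simplified transition table: the states reachable in one step from the given state. */
    static Set<State> nextStates(State current) {
        switch (current) {
            case INSTALLED:
                return EnumSet.of(State.RESOLVED, State.UNINSTALLED);
            case RESOLVED:
                // A resolved bundle can be started, updated back to INSTALLED, or uninstalled.
                return EnumSet.of(State.STARTING, State.INSTALLED, State.UNINSTALLED);
            case STARTING:
                return EnumSet.of(State.ACTIVE);
            case ACTIVE:
                return EnumSet.of(State.STOPPING);
            case STOPPING:
                return EnumSet.of(State.RESOLVED);
            default:
                // UNINSTALLED is terminal.
                return EnumSet.noneOf(State.class);
        }
    }

    static boolean canTransition(State from, State to) {
        return nextStates(from).contains(to);
    }

    public static void main(String[] args) {
        for (State state : State.values()) {
            System.out.println(state + " -> " + nextStates(state));
        }
    }
}
```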

In summary, the OSGi Platform is the foundation of the system, which is assembled from bundles as building blocks.

NetBEAMS and the Data Sensor Platform (DSP)

This experiment targets the transport of the data described above to a database system, here called persistence storage system, by using the NetBEAMS DSP Platform. The DSP Platform is built on top of OSGi, taking advantage of the modular capabilities of its plug-and-play OSGi bundle infrastructure. In this way, the DSP Platform and each of its DSP Components are extensions of OSGi bundles. The following diagram depicts this relationship:

       DSP Component  <:|------ DSP Component Activator ------|> OSGi Bundle

As a DSP Component Activator takes advantage of the basic specifications of the OSGi bundle infrastructure, it inherits the life-cycle stages and properties. The following image summarizes the life cycle of a DSP Component Activator, which is responsible for initializing and terminating the DSP Component.

Whenever the DSP Component Activator is in the initialized state, the DSP Component can be initialized using the initial configuration parameters provided by whoever configures the DSP. Similarly, when the DSP Component Activator is terminated, the DSP Component must be stopped. In this way, when the DSP Component is initialized, it starts all the needed resources and remains active in the system for as long as its function is required. The only responsibility of the DSP Component is to implement the "contract": being a Data Producer (DP) or a Data Consumer (DC). The "contract" is defined by the following interface methods:

  • sendMessage(): if the component is a DP, it can send data by wrapping up the data in Message unit to be sent to the DSP;
  • deliverMessage(): if the component is a DC, it can receive data by unwrapping the data contained in a Message unit that was sent to the DSP.
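
The contract above can be sketched in Java as two small interfaces. Only the method names sendMessage() and deliverMessage() come from this page; the parameter lists, the `Message` and `Dsp` types, and the two sample components are assumptions made for illustration and do not reproduce the actual NetBEAMS interfaces.

```java
import java.util.ArrayList;
import java.util.List;

class DspContractSketch {

    /** Hypothetical stand-in for a DSP Message: a content type plus a payload. */
    static final class Message {
        final String contentType;
        final Object body;

        Message(String contentType, Object body) {
            this.contentType = contentType;
            this.body = body;
        }
    }

    /** Hypothetical view of the DSP that producers hand their messages to. */
    interface Dsp {
        void route(Message message);
    }

    /** Contract of a Data Producer (DP): wrap data in a Message and send it to the DSP. */
    interface DataProducer {
        void sendMessage(Object data);
    }

    /** Contract of a Data Consumer (DC): receive a Message and unwrap its data. */
    interface DataConsumer {
        void deliverMessage(Message message);
    }

    static final class SondeProducer implements DataProducer {
        private final Dsp dsp;

        SondeProducer(Dsp dsp) {
            this.dsp = dsp;
        }

        @Override
        public void sendMessage(Object data) {
            dsp.route(new Message("org.netbeams.dsp.ysi", data));
        }
    }

    static final class CollectingConsumer implements DataConsumer {
        final List<Object> received = new ArrayList<Object>();

        @Override
        public void deliverMessage(Message message) {
            received.add(message.body);
        }
    }

    public static void main(String[] args) {
        CollectingConsumer consumer = new CollectingConsumer();
        // The "DSP" here simply hands every message to the single consumer.
        SondeProducer producer = new SondeProducer(consumer::deliverMessage);
        producer.sendMessage("21.20");
        System.out.println(consumer.received);
    }
}
```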

The DSP Platform routes each message, based on the information it carries, to the DSP Components that need to receive it. Details are given in the Data Delivery section.

Data Representation

As the YSI sonde documentation describes each of these values and the data format, the data stream is mapped into a Java POJO called SondeDataType, which is marshalled into an XML instance of the XML Schema "Abstract Message Content". Note that, in addition to the regular data from the sensor, it contains properties about time. As a result, more data has been added to the initial 130 bytes of data, as shown in the picture below:

In the current implementation, the DSP Framework is responsible for wrapping each collected sample into the body of a DSP Message for transmission. Other information regarding the producer and consumer DSP components is added to the header of the DSP Message. The following UML Class Diagram shows the participating classes from the DSP Messages packages.

As highlighted in the diagram, there are several types of DSP Messages used for different purposes. For example, any measurement data must be wrapped up in a Measurement Message, while a Query Message is used to exchange messages among the components for the purpose of management. In this way, the main DSP Messages can be summarized as follows:

  • Measurement Message: used to transport any sensor collected data;
  • Query Message: used to query a DSP component about its configuration properties;
  • Update Message: used to update a DSP component's configuration properties;
  • Acknowledgement Message: used for the transport communication protocol. More details in the Remote Data delivery section.

Whenever a DSP component is ready to transmit messages, it wraps up the set of DSP Messages into an instance of a DSP Messages Container, which contains information about the collection of messages being transmitted with its own identification. In this fashion, the DSP Messages Container is the main communication unit between 2 different DSP Components.

Data Delivery

In general, when a DSP Component finishes preparing the DSP Messages Container, it contacts the DSP Broker to send the DSP Message. At this point, the DSP Broker, with the assistance of the DSP Matcher, acquires a list of possible DSP Components that are expected to consume the DSP Message in the current DSP. In this way, the DSP Matcher can be seen as a function that takes a DSP Message as an input and returns a list of DSP consumers.

DSP Components Consumers ( DSP Message ) := Verify the DSP Message's Header + Verify the matcher rules (which contains the list of consumers) 
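
Read as code, the matcher is a function from a message header to a set of consumers. The Java sketch below shows one way such a function could look; the `Rule` type and its matching on content type alone are assumptions for illustration, since the actual DSP Matcher rule format is not described on this page.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DspMatcherSketch {

    /** Hypothetical matching rule: messages with this content type go to this consumer. */
    static final class Rule {
        final String contentType;
        final String consumer;

        Rule(String contentType, String consumer) {
            this.contentType = contentType;
            this.consumer = consumer;
        }
    }

    /** Returns the unique consumers whose rules match the content type found in the message header. */
    static Set<String> consumersFor(String headerContentType, List<Rule> rules) {
        Set<String> consumers = new LinkedHashSet<String>();
        for (Rule rule : rules) {
            if (rule.contentType.equals(headerContentType)) {
                consumers.add(rule.consumer); // the set drops duplicate consumers
            }
        }
        return consumers;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("org.netbeams.dsp.ysi", "org.netbeams.dsp.wiretransport.client"),
                new Rule("org.netbeams.dsp.ysi", "org.netbeams.dsp.wiretransport.client"),
                new Rule("org.example.other", "org.example.other.consumer"));
        System.out.println(consumersFor("org.netbeams.dsp.ysi", rules));
    }
}
```

Returning a set rather than a list reflects the requirement that each selected DSP Component receives the message only once.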

The selection is done by evaluating the matching rules against the properties of the DSP Message header. After collecting all the matching rules, the DSP Broker selects a set of unique DSP Components, each of which receives a copy of the DSP Message object in one of two ways:

  • In-memory local message delivery: if the receiving DSP Component is located in the current local host, a deep copy of the instance of the DSP message is delivered;
    • IP Addresses are resolved by using the Ethernet card in the device: localhost, 127.0.0.1, or the same IP Address for Producer and Receiver all resolve to the local device.
  • Serialized remote message delivery: if the receiving DSP Component is located in a foreign/remote host, the message is serialized in a format defined by the transport protocol chosen by the DSP Component responsible for the transport. The following section describes the existing DSP Data Transport component.
    • IP Address from the Producer and Consumer are different, and are not resolved to be in the same host.
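
The choice between the two delivery modes can be sketched as a simple comparison of addresses. The Java code below is a simplification under stated assumptions: it compares address strings only, whereas the text describes resolution through the device's network interface, and treating the literal `LOCAL` as a local address is an assumption based on the XML example later on this page.

```java
class DeliveryModeSketch {

    enum Mode { LOCAL_IN_MEMORY, REMOTE_SERIALIZED }

    /** Chooses the delivery mode from the producer and consumer addresses (string comparison only). */
    static Mode choose(String producerAddress, String consumerAddress) {
        String consumer = consumerAddress.trim();
        boolean local = consumer.equalsIgnoreCase("localhost")
                || consumer.equals("127.0.0.1")
                || consumer.equalsIgnoreCase("LOCAL") // assumption, see the lead-in
                || consumer.equals(producerAddress.trim());
        return local ? Mode.LOCAL_IN_MEMORY : Mode.REMOTE_SERIALIZED;
    }

    public static void main(String[] args) {
        System.out.println(choose("192.168.0.103", "127.0.0.1"));
        System.out.println(choose("192.168.0.103", "192.168.0.106"));
    }
}
```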

Remote Data Transport and Delivery

The DSP Platform performs data transport through specialized DSP Components that are capable of marshalling POJOs into XML and unmarshalling XML back into POJOs. In order to transport the DSP Messages, a pair of symmetric DSP Components was developed that uses the HTTP protocol to carry the serialized version of the DSP Messages.

  • DSP Wire Transport Client: responsible for marshalling a DSP Messages Container into XML, and making an HTTP POST Request to the service provided by the Server component;
  • DSP Wire Transport Server: this component exposes a Web Server providing an HTTP POST service, which is used as a gateway to receive marshalled DSP Messages transmitted in XML. Upon receiving the XML instance, the component is responsible for unmarshalling the DSP Messages Container and its set of DSP Messages into POJO instances.

The following image shows the XML Schema of the DSP Messages Container and the DSP Message. The DSP Messages Container is the main unit of communication between 2 instances of remote DSP components, and is composed of at least one DSP Message. Each DSP Message carries the specific information about its producer and consumer in the header, and an instance of any payload in the body. Note that both the container and the message have attributes regarding a point in time, as part of the time series definition for the collected data.

Here's an example of the transmission of the YSI data shown in the previous section. The Messages Container contains an instance of a Measurement Message marshalled at the host 192.168.0.103 to be transmitted to the host 192.168.0.106 using the DSP Wire Transport Client component. The Header of the DSP Message contains all the information about its producer and potential consumer, while the body of the message contains an instance of the Sonde Data Container, which encloses an instance of the Sonde Data Type.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<MessagesContainer uudi="24929c29-60ee-4d17-af08-64d9446277ef"
        creationTime="2009-03-06T15:17:18-0800" destinationHost="192.168.0.106">
        <MeasureMessage ContentType="org.netbeams.dsp.ysi"
                messageID="435a61f6-370f-458d-aeb7-6e92270a79cb">
                <Header>
                        <CreationTime>1236381438480</CreationTime>
                        <Producer>
                                <ComponentType>
                                        org.netbeams.dsp.platform.management.component.ComponentManager
                                </ComponentType>
                                <ComponentLocator>
                                        <ComponentNodeId>1234</ComponentNodeId>
                                        <NodeAddress>192.168.0.103</NodeAddress>
                                </ComponentLocator>
                        </Producer>
                        <Consumer>
                                <ComponentType>org.netbeams.dsp.wiretransport.client
                                </ComponentType>
                                <ComponentLocator>
                                        <NodeAddress>LOCAL</NodeAddress>
                                </ComponentLocator>
                        </Consumer>
                </Header>
                <Body>
                        <SondeDataContainer>
                                <soundeData date="15:17:18" time="03-06-2009">  
                                        <Temp>21.20</Temp>
                                        <SpCond>193</SpCond>
                                        <Cond>179</Cond>
                                        <Resist>5588.40</Resist>
                                        <Sal>0.09</Sal>
                                        <Press>0.084</Press>
                                        <Depth>0.059</Depth>
                                        <pH>7.98</pH>
                                        <phmV>-79.6</phmV>
                                        <ODOSat>99.5</ODOSat>
                                        <ODOConc>8.83</ODOConc>
                                        <Turbid>0.4</Turbid>
                                        <Battery>8.7</Battery>
                                </soundeData>
                        </SondeDataContainer>
                </Body>
        </MeasureMessage>
</MessagesContainer>

When the counterpart DSP Wire Transport Server receives the Messages Container instance, it unmarshalls the DSP Messages back to a POJO, and sends it to the DSP Broker to make its normal delivery. As previously explained, the DSP Broker may decide to deliver the DSP Message or drop it, depending on the DSP Matcher rules specification.

DSP Data Persistence Component

  • How the data is enclosed into a PersistentMessageUnit

The output from the execution of the netBEAMS bundle is shown on DSPDataPersistence. The process of transforming it into mongoDB-ready data is described in the next sections.

The netBEAMS to mongoDB conversion is implemented in the DSPMongoCRUDService class, which uses the mongoDB Java driver. See the artifact http://code.google.com/p/netbeams/source/browse/branches/marcello/persistence/versions/v2/apps/osgi-bundles/dsp/DSPDataPersistence/src/org/netbeams/dsp/persistence/controller/DSPMongoCRUDService.java for details.

The following code snippet is the method that inserts the message content from the DSP message of the PersistentMessageUnit into the mongoDB. mongoDB drivers use the BasicDBObject instance to set key and values. The keys and values are created and then saved into the mongoDB.

    // Imports needed by this excerpt (the NetBEAMS types come from the DSP packages):
    // import java.net.UnknownHostException;
    // import java.util.Date;
    // import com.mongodb.BasicDBObject;
    // import com.mongodb.DBCollection;
    // import com.mongodb.MongoException;

    /**
     * Inserts the DSP Message Content into the mongoDB as it is extracted and converted from the given 
     * PersistentMessageUnit.
     * @param tranMsg is the PersistentMessageUnit containing information about the sensor location and the message.
     * @throws UnknownHostException
     * @throws MongoException
     */
    public static void insertPersistentUnitMessageContents(PersistentMessageUnit tranMsg) throws UnknownHostException,
            MongoException {
        DBCollection netbeamsDbCollection = getPersistenceStorage(tranMsg);
        MessageContent messageContent = tranMsg.getDspMessage().getBody().getAny();
        System.out.println("Starting mongodb transaction at " + DATE_FORMATTER.format(new Date()));
        // Keep all the inserts of this message unit on the same driver connection.
        getNetbeamMongoDb().requestStart();
        try {
            if (messageContent instanceof SondeDataContainer) {
                SondeDataContainer sondeContainer = (SondeDataContainer) messageContent;
                for (SondeDataType sondeData : sondeContainer.getSondeData()) {
                    // The sensor readings, keyed by full property names and stored as strings
                    BasicDBObject docValue = new BasicDBObject();
                    docValue.put("temperature", "" + sondeData.getTemp().floatValue());
                    docValue.put("sp_condition", "" + sondeData.getSpCond().floatValue());
                    docValue.put("condition", "" + sondeData.getCond().floatValue());
                    docValue.put("resistence", "" + sondeData.getResist().floatValue());
                    docValue.put("salinity", "" + sondeData.getSal().floatValue());
                    docValue.put("pressure", "" + sondeData.getPress().floatValue());
                    docValue.put("depth", "" + sondeData.getDepth().floatValue());
                    docValue.put("ph", "" + sondeData.getPH().floatValue());
                    docValue.put("pH_mv", "" + sondeData.getPhmV().floatValue());
                    docValue.put("odo_sat", "" + sondeData.getODOSat().floatValue());
                    docValue.put("odo_condition", "" + sondeData.getODOConc().floatValue());
                    docValue.put("turbidity", "" + sondeData.getTurbid().floatValue());
                    docValue.put("battery", "" + sondeData.getBattery().floatValue());

                    // The key segment identifies the sensor, the message and the transaction
                    BasicDBObject docKey = buildKeySegment(tranMsg);
                    // extract the fact time from the message, adding to the key
                    docKey.put("fact_time", sondeData.getDateTime().getTimeInMillis());
                    docKey.put("data", docValue);
                    // insert the final document into the collection
                    netbeamsDbCollection.insert(docKey);
                }
            }
        } finally {
            // Always end the request, even if an insert fails.
            getNetbeamMongoDb().requestDone();
        }
    }
  • Describe the translator from DSP Message to mongoDB format

Setting up the environment

Taking into account the mongoDB architecture and the properties of a DSP Message (see section "Acquiring the properties of a DSP Message Content" at DSPDataPersistence), here are the conventions followed on revision r585:

  • The database instance is called "netbeams";
  • The database "netbeams" may contain different collections, categorized by the Sensor Content Type, that is, depending on how the DSP Component was described;
  • The identification of the sensor is taken from the PersistentMessageUnit's DSP Message;

The following is the list of properties that composes the Key of a document:

  • sensor_ip_address: it's extracted from the DSP Message Producer and identifies which sensor generated the sampling;
  • message_id: it's extracted from each of the messages contained in the DSP Message Container;
  • transaction_time: it's extracted from the DSP message container creation time and it is used to identify when the transaction occurred (see notes on Temporal Databases);
  • fact_time: it's extracted from the SondeDataContainer's date and time, and identifies the time in which the collected data occurred (see notes on Temporal Databases);

The definition of the Value of a document is as follows:

  • data: this key defines the values of the document, and will have every different property of the sensor.
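
To make the resulting document shape concrete, the Java sketch below assembles the key properties and the nested "data" value with plain `java.util` maps instead of the mongoDB driver. The field names are the ones listed above; the values are illustrative, borrowed from the XML example earlier on this page, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SondeDocumentSketch {

    /** Builds a document with the four key properties followed by the nested "data" value. */
    static Map<String, Object> buildDocument(String sensorIp, String messageId,
            long transactionTime, long factTime, Map<String, Object> readings) {
        Map<String, Object> data = new LinkedHashMap<String, Object>(readings);
        Map<String, Object> document = new LinkedHashMap<String, Object>();
        document.put("sensor_ip_address", sensorIp);
        document.put("message_id", messageId);
        document.put("transaction_time", transactionTime);
        document.put("fact_time", factTime);
        document.put("data", data);
        return document;
    }

    public static void main(String[] args) {
        Map<String, Object> readings = new LinkedHashMap<String, Object>();
        readings.put("temperature", "21.2");
        readings.put("salinity", "0.09");
        // Illustrative values only; both timestamps reuse the creation time of the XML example.
        Map<String, Object> document = buildDocument("192.168.0.103",
                "435a61f6-370f-458d-aeb7-6e92270a79cb", 1236381438480L, 1236381438480L, readings);
        System.out.println(document);
    }
}
```

Because the sensor properties live under a single "data" key, a sensor with a different list of properties produces a document with the same key segment and a different nested map.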

Some remarks about how items are created in the collections:

  • In case a message container contains multiple readings, each item will be counted individually;

The following steps describe how to run the experiment shell script located at http://code.google.com/p/netbeams/source/browse/branches/marcello/persistence/versions/v2/persistence/run-persistence-experiment

  1. Update your working copy with revision r585 from the branch /branches/marcello/persistence;
  2. Run the task setup-mongodb from the automated ANT script available at apps/osgi-bundles/dsp/DSPDataPersistence/build.xml. Also, make sure to add the directory mongodb/bin to the PATH if you want to execute any mongoDB related shell script from any directory in your system;
  3. Run the experiment file under the NETBEAMS/persistence/ directory, passing as a parameter the number of random items you want the simulation to create. Additionally, you can pipe the output to a file using "tee".
./run-persistence-experiment 500 | tee running-500.log

Experiment

The execution of the command-line script will launch the mongoDB, remove old files, generate the given number of elements and insert them into the database, and will display the results, giving the shell access to the current database.

The goals are as follows:

  1. Generate a random number of SondeDataType with random values;
  2. Transform the random data into the format used by mongoDB, to be inserted into the database;
  3. After the transaction has been completed, the property of durability of database systems has to stand true: the data must be saved in files;
  4. Query of the persisted data can be provided via API or via shell;
  5. The ability to export the data to different formats, including CSV files, is a must for interoperability with spreadsheets, etc.

Main Experiment output

The following is the snapshot of the file http://code.google.com/p/netbeams/source/browse/branches/marcello/persistence/versions/v2/persistence/logs/experiment-1000000-main-20090912-202055.log

########### Netbeams to MongoDB Experiment 20090912-202055.log ############# 

* 1. Cleaning any existing MongoDB data at 'data'

total 12K
drwxr-xr-x 3 marcello marcello 4.0K 2009-09-12 19:50 .
drwxrwxrwx 6 marcello marcello 4.0K 2009-09-12 19:48 ..
drwxr-xr-x 6 marcello marcello 4.0K 2009-09-12 20:06 .svn

* 2. Starting MongoDB Server... NetBEAMS data will be saved at dir 'data'


* 3. Ready to run Java experiment with 1000000 samples

Sat Sep 12 20:20:55 Mongo DB : starting : pid = 5831 port = 27017 dbpath = data master = 0 slave = 0  32-bit 

** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
**       see http://blog.mongodb.org/post/137788967/32-bit-limitations for more

Sat Sep 12 20:20:55 db version v1.0.0, pdfile version 4.4
Sat Sep 12 20:20:55 git version: afe21e02c11f9a923ab1c95edf6fdd95b9a4a51e
Sat Sep 12 20:20:55 sys info: Linux domU-12-31-39-01-70-B4 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
Sat Sep 12 20:20:55 waiting for connections on port 27017

These first 3 steps set up the environment for a new execution of the experiment.

  1. Delete any existing database file in the directory NETBEAMS/persistence/data;
  2. Start the mongoDB server, referencing the directory NETBEAMS/persistence/data;
  3. Start the Java application that creates 1 million objects and inserts them into the database.

The Java execution started right after the log snippet above, and the application took about a minute and a half (88 seconds) to generate 1 million POJOs with random values. At this point, the conversion of the objects into mongoDB objects and the insertion of each of them is about to take place.

Experiment started at 09/12/2009 20:20:56:992
Starting to generate 1000000 sonde samples at 09/12/2009 20:20:56:993
Finished Generating 1000000 sonde samples on 88.368 seconds (88368 milliseconds) at 09/12/2009 20:22:25:361 consuming ~6848Kb
Started saving netbeams samples as mongodb objects at 09/12/2009 20:22:48:25
Starting mongodb transaction at 09/12/2009 20:22:59:513

At this point, the transaction with mongoDB has been opened and the database is locked; all of this is done through the Java driver provided by the mongoDB project. As the server log shows, the allocation of new file system space starts, as does the creation of the index for the new collection. From now on, everything is based on the database netbeams and the collection SondeDataContainer, as described in the beginning of this section.

Sat Sep 12 20:20:55 web admin interface listening on port 28017
Sat Sep 12 20:22:59 connection accepted from 127.0.0.1:19841 #1
Sat Sep 12 20:22:59 allocating new datafile data/netbeams.ns, filling with zeroes...
Sat Sep 12 20:22:59 done allocating datafile data/netbeams.ns, size: 16777216, took 0.018 secs
Sat Sep 12 20:22:59 allocating new datafile data/netbeams.0, filling with zeroes...
Sat Sep 12 20:23:00 done allocating datafile data/netbeams.0, size: 67108864, took 0.954 secs
Sat Sep 12 20:23:00 building new index on { _id: ObjId(000000000000000000000000) } for netbeams.SondeDataContainer...done for 0 records
Sat Sep 12 20:22:59 insert netbeams.SondeDataContainer 979ms
Sat Sep 12 20:23:00 insert netbeams.SondeDataContainer 0ms
Sat Sep 12 20:23:00 insert netbeams.SondeDataContainer 0ms
Sat Sep 12 20:23:00 insert netbeams.SondeDataContainer 0ms
Sat Sep 12 20:23:00 insert netbeams.SondeDataContainer 0ms
Sat Sep 12 20:23:00 insert netbeams.SondeDataContainer 0ms
...
...

The insertion into the database completes after almost 3 minutes (176 seconds). Memory consumption went beyond the 1.5 GB mark, although the log does not report it correctly yet (it shows ~11 Kb). The Java driver closes the connection with the database automatically after the Java program exits.

Sat Sep 12 20:25:45 insert netbeams.SondeDataContainer 0ms
Sat Sep 12 20:25:45 insert netbeams.SondeDataContainer 0ms
Sat Sep 12 20:25:45 insert netbeams.SondeDataContainer 0ms
Finished saving netbeams samples to mongodb objects in 176.093 seconds (176093 milliseconds) at 09/12/2009 20:25:44:118 consuming ~11.0Kb
Experiment finished saving 1000000 sonde samples on MongoDB on 306.883 seconds (306883 milliseconds) at 09/12/2009 20:26:03:876 consuming ~11 Kb
Sat Sep 12 20:26:21 end connection 127.0.0.1:19841
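
The timings in the log lines above can be turned into rough throughput figures. The sketch below only redoes that arithmetic with the numbers printed in the log; the constant and function names are made up for the illustration.

```python
def throughput(items, seconds):
    """Items processed per second."""
    return items / seconds


SAMPLES = 1_000_000
GENERATION_SECONDS = 88.368   # "Finished Generating ..." log line
INSERT_SECONDS = 176.093      # "Finished saving ..." log line
TOTAL_SECONDS = 306.883       # "Experiment finished ..." log line

generation_rate = throughput(SAMPLES, GENERATION_SECONDS)
insert_rate = throughput(SAMPLES, INSERT_SECONDS)
# Time the experiment spent outside the two measured phases.
other_seconds = TOTAL_SECONDS - GENERATION_SECONDS - INSERT_SECONDS

print(f"generation: {generation_rate:,.0f} samples/s")
print(f"insertion:  {insert_rate:,.0f} documents/s")
print(f"outside both phases: {other_seconds:.3f} s")
```

On this 32-bit machine the insertion phase therefore ran at roughly 5,700 documents per second, about half the rate at which the random samples were generated.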

The experiment shell outputs the directories where the logs are created, showing the list of files created and their sizes (please disregard the ".svn" entries, which are not part of the experiment).

* 4. Experiments Results on the following logs:
- Mongo DB Server Output: logs/experiment-1000000-mongodb-server-status-20090912-202055.log
- NetBEAMS to MongoDB data transfer output: logs/experiment-1000000-netbeams-to-mongodb-20090912-202055.log

* 5. MongoDB data dir 'data' size after experiments...

4.0K	data/.svn/tmp/props
4.0K	data/.svn/tmp/text-base
4.0K	data/.svn/tmp/prop-base
16K	data/.svn/tmp
4.0K	data/.svn/props
4.0K	data/.svn/text-base
4.0K	data/.svn/prop-base
44K	data/.svn
1.5G	data
total 1.5G
drwxr-xr-x 3 marcello marcello 4.0K 2009-09-12 20:25 .
drwxrwxrwx 6 marcello marcello 4.0K 2009-09-12 19:48 ..
-rwxr-xr-x 1 marcello marcello    5 2009-09-12 20:20 mongod.lock
-rw------- 1 marcello marcello  64M 2009-09-12 20:25 netbeams.0
-rw------- 1 marcello marcello 128M 2009-09-12 20:23 netbeams.1
-rw------- 1 marcello marcello 256M 2009-09-12 20:25 netbeams.2
-rw------- 1 marcello marcello 512M 2009-09-12 20:25 netbeams.3
-rw------- 1 marcello marcello 512M 2009-09-12 20:25 netbeams.4
-rw------- 1 marcello marcello  16M 2009-09-12 20:25 netbeams.ns
drwxr-xr-x 6 marcello marcello 4.0K 2009-09-12 20:06 .svn
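
The listing suggests that the numbered datafiles double in size, starting at 64 MB, until they reach 512 MB. The sketch below checks that reading against the sizes shown and adds them up. The 512 MB cap is inferred from this listing only, and the helper name is hypothetical.

```python
# Sizes in MB as shown by the directory listing above.
DATAFILES_MB = {
    "netbeams.ns": 16,
    "netbeams.0": 64,
    "netbeams.1": 128,
    "netbeams.2": 256,
    "netbeams.3": 512,
    "netbeams.4": 512,
}


def doubling_sizes(first_mb, cap_mb, count):
    """Sizes that double from first_mb and stay at cap_mb once it is reached."""
    sizes, size = [], first_mb
    for _ in range(count):
        sizes.append(size)
        size = min(size * 2, cap_mb)
    return sizes


total_mb = sum(DATAFILES_MB.values())
numbered = [DATAFILES_MB[f"netbeams.{i}"] for i in range(5)]

print(total_mb, "MB in total, about", round(total_mb / 1024, 2), "GB")
print("matches doubling pattern:", numbered == doubling_sizes(64, 512, 5))
```

The total of 1488 MB is consistent with the 1.5G reported for the data directory.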

The mongoDB client is then started so that the newly created data can be checked. Instructions are printed as well.

* 6. Running the MongDB after the experiments...

 - The database name is 'netbeams'. The collection name is 'ysi'
 - Type 'use netbeams' to change to that database.
 - Type 'show collections' to show all the collections in the current database
 - Type 'db.ysi.*' to issue a command to the collection 'ysi'
 - Ex: 'db.ysi.count()' = returns the number of elements on the collection 'ysi'
 -     'db.ysi.findOne()' = returns the first element of the collection 'ysi'
 -     'db.ysi.find().limit(3)' = returns the first 3 elements of the collection 'ysi'
 -     'db.ysi.find( {sensor_ip_address:192.168.0.79} ).count())' = returns the number of elements of the collection ysi with the given sensor's ip address.
 -     'db.ysi.find({data.ph:1.45})' = returns all the elements that has the property 'data.ph' equals to '1.45'

mongoDB Server Output

See the previous section.

The database can be started simply as follows:

mongod --dbpath NETBEAMS/persistence/data

mongoDB Client Output

The mongoDB client can be started by using the following command. Make sure you have started the mongoDB server before executing the mongoDB client.

mongo netbeams | tee output_number_date.log

Here, the interactive mongo client shell lets users inspect and navigate a given database and its collections. This first section shows the connection of the mongo client to the database netbeams. It also highlights the query for the available collections. During the experiment, the SondeDataContainer collection was created, corresponding to the DSP Messages type for the YSI Sonde.

The mongoDB shell reference can be found at http://www.mongodb.org/display/DOCS/dbshell+Reference

MongoDB shell version: 1.1.0-
url: netbeams
connecting to: netbeams
type "help" for help
> show collections
SondeDataContainer
system.indexes

The first data integrity check concerns the number of elements created. Here, the count() function on the collection returned 1000000.

> 
> db.SondeDataContainer.count()
1000000

The first element of the collection can be retrieved using the findOne() function. It returns a document instance in JSON notation.

> db.SondeDataContainer.findOne()
{"_id" :  ObjectId( "d36f4007b7e7ac4a03c60000")  , "sensor_ip_address" : "192.168.0.136" , "message_id" : "7b6624d6-0ca1-4cba-a343-f166e88da73b" , 
"transaction_time" : 1252845473412 , "fact_time" : 1252845346000 , "data" : {"temperature" : "45.01" , "sp_condition" : "37.6" 
, "condition" : "145.8" , "resistence" : "159.77" , "salinitude" : "0.0" , "pressure" : "0.391" , "depth" : "0.46" , "ph" : "5.64" , 
"pH_mv" : "-62.1" , "odo_sat" : "89.7" , "odo_condition" : "59.34" , "turbidity" : "0.0" , "battery" : "9.4"}}

Queries based on attributes use the "dot" notation to navigate through the JSON documents. Additionally, functions can be chained on the result of other functions. The next example counts the number of documents whose key "data.ph" equals "5.64". (Note: because of a bug, this revision stores the values as strings.)

> db.SondeDataContainer.find({"data.ph":"5.64"}).count()
1226
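
To make the dot notation concrete, the following sketch resolves a dotted key against nested documents held in memory and counts the matches. It only imitates the semantics of the query above and does not talk to mongoDB; the helper names and the three sample documents are made up for the illustration.

```python
def get_path(document, dotted_key):
    """Follow a dotted key such as "data.ph" through nested dicts; None if absent."""
    current = document
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def count_matching(documents, dotted_key, expected):
    """Count the documents whose value at dotted_key equals expected."""
    return sum(1 for doc in documents if get_path(doc, dotted_key) == expected)


docs = [
    {"sensor_ip_address": "192.168.0.136", "data": {"ph": "5.64", "depth": "0.46"}},
    {"sensor_ip_address": "192.168.0.136", "data": {"ph": "5.64", "depth": "2.485"}},
    {"sensor_ip_address": "192.168.0.136", "data": {"ph": "7.5", "depth": "2.477"}},
]

print(count_matching(docs, "data.ph", "5.64"))  # prints 2
print(count_matching(docs, "data.ph", 5.64))    # prints 0
```

The second call shows the practical effect of storing the values as strings: a query with the number 5.64 finds nothing, so queries against this revision must quote the value.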

The following example is the output of the first 3 documents from the same previous query using the limit() function.

> db.SondeDataContainer.find({"data.ph":"5.64"}).limit(3)
{"_id" :  ObjectId( "d36f4007b7e7ac4a03c60000")  , "sensor_ip_address" : "192.168.0.136" , "message_id" : "7b6624d6-0ca1-4cba-a343-f166e88da73b" 
, "transaction_time" : 1252845473412 , "fact_time" : 1252845346000 , "data" : {"temperature" : "45.01" , "sp_condition" : "37.6" , 
"condition" : "145.8" , "resistence" : "159.77" , "salinitude" : "0.0" , "pressure" : "0.391" , "depth" : "0.46" , "ph" : "5.64" , 
"pH_mv" : "-62.1" , "odo_sat" : "89.7" , "odo_condition" : "59.34" , "turbidity" : "0.0" , "battery" : "9.4"}}
{"_id" :  ObjectId( "d36f4007b7e7ac4a1fc80000")  , "sensor_ip_address" : "192.168.0.136" , "message_id" : "7b6624d6-0ca1-4cba-a343-f166e88da73b" , 
"transaction_time" : 1252845473412 , "fact_time" : 1252845346000 , "data" : {"temperature" : "46.71" , "sp_condition" : "60.8" , 
"condition" : "160.6" , "resistence" : "1399.4" , "salinitude" : "0.01" , "pressure" : "1.057" , "depth" : "2.485" , "ph" : "5.64" , 
"pH_mv" : "-16.3" , "odo_sat" : "58.8" , "odo_condition" : "19.29" , "turbidity" : "0.2" , "battery" : "9.2"}}
{"_id" :  ObjectId( "d36f4007b8e7ac4a1ec90000")  , "sensor_ip_address" : "192.168.0.136" , "message_id" : "7b6624d6-0ca1-4cba-a343-f166e88da73b" , 
"transaction_time" : 1252845473412 , "fact_time" : 1252845346000 , "data" : {"temperature" : "69.99" , "sp_condition" : "39.0" , 
"condition" : "115.7" , "resistence" : "3490.92" , "salinitude" : "0.05" , "pressure" : "0.537" , "depth" : "0.544" , "ph" : "5.64" , 
"pH_mv" : "-73.8" , "odo_sat" : "81.4" , "odo_condition" : "2.44" , "turbidity" : "0.0" , "battery" : "3.1"}}
> 

Other logs are located at http://code.google.com/p/netbeams/source/browse/#svn/branches/marcello/persistence/versions/v2/persistence/logs

Exporting the data into Spreadsheet format (CSV)

mongoDB has a command-line export facility called mongoexport. It can export the data in JSON or CSV format. You may also write your own export tool in languages such as Java, PHP, Python, Perl, or Ruby, among others. A list of the existing drivers in different languages is provided at http://www.mongodb.org/display/DOCS/Drivers. The following command exports the data in CSV (read the help output of the command for details).

mongoexport -d netbeams -c SondeDataContainer --dbpath ./data/ --csv -f "_id,sensor_ip_address,transaction_time,fact_time,data.temperature,data.sp_condition,data.condition,data.resistence,data.salinitude,data.pressure,data.depth,data.ph,data.pH_mv,data.odo_sat,data.odo_condition,data.turbidity,data.battery" -o sonde-data-exported.csv
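
For readers who prefer to write their own export tool, the flattening implied by the --csv and -f options can be sketched as follows. This is not how mongoexport is implemented; it is a small in-memory imitation in which each dotted field name selects one value from a nested document. The function names and the sample document are assumptions made for the example.

```python
import csv
import io


def get_path(document, dotted_key):
    """Follow a dotted key through nested dicts; None if any part is absent."""
    current = document
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def export_csv(documents, fields, out):
    """Write a header line, then one row per document with columns in `fields` order."""
    out.write(",".join(fields) + "\n")
    # Quote strings but leave numbers bare, as in the exported sample on this page.
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for doc in documents:
        row = []
        for field in fields:
            value = get_path(doc, field)
            row.append("" if value is None else value)
        writer.writerow(row)


SAMPLE_DOCS = [
    {"sensor_ip_address": "192.168.0.136", "fact_time": 1252845377000,
     "data": {"temperature": "86.64", "ph": "6.49"}},
]

buffer = io.StringIO()
export_csv(SAMPLE_DOCS, ["sensor_ip_address", "fact_time", "data.ph"], buffer)
print(buffer.getvalue(), end="")
```

Passing the field list explicitly keeps the column order under the caller's control, which is the same guarantee the -f option is expected to give.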

The result of the export can be downloaded at http://netbeams.googlecode.com/files/experiment-1000000-data-exported-20090913-053538.csv.tar.gz. The first items of the list are shown below. Note that the columns are printed in the order provided in the export command. This behavior was fixed after I reported a bug, as described at http://groups.google.com/group/mongodb-user/browse_thread/thread/d7f1685d006ae4c7.

_id,sensor_ip_address,transaction_time,fact_time,data.temperature,data.sp_condition,data.condition,data.resistence,data.salinitude,data.pressure,
data.depth,data.ph,data.pH_mv,data.odo_sat,data.odo_condition,data.turbidity,data.battery
"d36f400700e8ac4a00070600","192.168.0.136",1252845473412,1252845377000,"86.64","164.8","59.7","4594.6","0.06","0.09","2.32","6.49","-79.0",
"69.6","18.29","0.2","9.8"
"d36f400700e8ac4a00080600","192.168.0.136",1252845473412,1252845377000,"32.79","175.6","135.0","5346.77","0.07","1.289","2.477","7.5","-48.7",
"8.6","41.8","0.1","9.4"
"d36f400700e8ac4a00090600","192.168.0.136",1252845473412,1252845377000,"93.43","78.7","86.0","2467.38","0.01","1.384","0.287","0.47","-90.9",
"63.2","2.67","0.2","5.1"
"d36f400700e8ac4a000a0600","192.168.0.136",1252845473412,1252845377000,"72.17","179.3","7.3","2614.64","0.01","0.352","2.412","0.11","-85.5",
"90.6","59.33","0.2","7.8"
"d36f400700e8ac4a000b0600","192.168.0.136",1252845473412,1252845377000,"76.31","168.1","39.7","413.49","0.08","0.45","2.81","7.87","-8.2",
"19.5","54.78","0.0","3.4"

Data Access through Java API and REST Web Services

mongoDB offers drivers in different languages to access the data, as well as a REST Web Services interface on the server.

The following HTTP GET Request method returns the first 5 documents in the collection:

http://127.0.0.1:28017/netbeams/SondeDataContainer/?limit=-5

GET /netbeams/SondeDataContainer/?limit=-5 HTTP/1.1
Host: 127.0.0.1:28017
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko/2009033100 Ubuntu/9.04 (jaunty) Firefox/3.0.8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

The body of the HTTP response is in JSON format:

HTTP/1.0 200 OK
x-action: 
x-ns: netbeams.SondeDataContainer
Content-Type: text/plain;charset=utf-8

{
  "offset" : 0,
  "rows": [
    { "_id" : "156f4007e4c3b74a36ed3100", "sensor_ip_address" : "192.168.0.117", "message_id" : "08b02c08-9290-4517-9a28-c6ee7e16509a", 
"transaction_time" : 1253557219486, "fact_time" : 1253557217000, "data" : { "temperature" : "31.44", "sp_condition" : "99.8", "condition" : "53.5", 
"resistence" : "1157.08", "salinity" : "0.0", "pressure" : "1.066", "depth" : "0.161", "ph" : "1.08", "pH_mv" : "-82.0", "odo_sat" : "40.3", 
"odo_condition" : "56.85", "turbidity" : "0.2", "battery" : "8.2" } } ,
    { "_id" : "156f4007e5c3b74a37ed3100", "sensor_ip_address" : "192.168.0.117", "message_id" : "08b02c08-9290-4517-9a28-c6ee7e16509a", 
"transaction_time" : 1253557219486, "fact_time" : 1253557217000, "data" : { "temperature" : "37.83", "sp_condition" : "176.3", "condition" : "2.6", 
"resistence" : "1324.97", "salinity" : "0.01", "pressure" : "1.36", "depth" : "1.564", "ph" : "0.12", "pH_mv" : "-23.5", "odo_sat" : "104.5", 
"odo_condition" : "19.44", "turbidity" : "0.1", "battery" : "5.0" } } ,
    { "_id" : "156f4007e5c3b74a38ed3100", "sensor_ip_address" : "192.168.0.117", "message_id" : "08b02c08-9290-4517-9a28-c6ee7e16509a", 
"transaction_time" : 1253557219486, "fact_time" : 1253557217000, "data" : { "temperature" : "74.3", "sp_condition" : "84.0", "condition" : "104.7", 
"resistence" : "4089.13", "salinity" : "0.01", "pressure" : "1.222", "depth" : "2.788", "ph" : "6.56", "pH_mv" : "-78.1", "odo_sat" : "40.0", 
"odo_condition" : "6.02", "turbidity" : "0.3", "battery" : "3.2" } } ,
    { "_id" : "156f4007e5c3b74a39ed3100", "sensor_ip_address" : "192.168.0.117", "message_id" : "08b02c08-9290-4517-9a28-c6ee7e16509a", 
"transaction_time" : 1253557219486, "fact_time" : 1253557217000, "data" : { "temperature" : "87.79", "sp_condition" : "91.8", "condition" : "162.4", 
"resistence" : "3226.59", "salinity" : "0.02", "pressure" : "1.325", "depth" : "0.698", "ph" : "1.19", "pH_mv" : "-83.5", "odo_sat" : "4.8", 
"odo_condition" : "39.87", "turbidity" : "0.3", "battery" : "9.3" } } ,
    { "_id" : "156f4007e5c3b74a3aed3100", "sensor_ip_address" : "192.168.0.117", "message_id" : "08b02c08-9290-4517-9a28-c6ee7e16509a", 
"transaction_time" : 1253557219486, "fact_time" : 1253557217000, "data" : { "temperature" : "42.48", "sp_condition" : "170.4", "condition" : "0.5", 
"resistence" : "1710.97", "salinity" : "0.07", "pressure" : "1.532", "depth" : "1.354", "ph" : "5.46", "pH_mv" : "-24.9", "odo_sat" : "106.7", 
"odo_condition" : "28.61", "turbidity" : "0.0", "battery" : "0.5" } }
  ],

  "total_rows" : 5 ,
  "query" : {} ,
  "millis" : 0
}
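
A client of this interface would typically parse the JSON body and, given that this revision stores measurements as strings, convert them to numbers before use. The sketch below does that on a trimmed copy of the response above (one row, three measurements). The function name is hypothetical, and no HTTP request is made.

```python
import json


def parse_rows(body):
    """Parse a response body and convert every value under "data" to float."""
    payload = json.loads(body)
    rows = []
    for row in payload["rows"]:
        converted = dict(row)
        converted["data"] = {key: float(value) for key, value in row["data"].items()}
        rows.append(converted)
    return rows


# Trimmed copy of the response shown above, kept short for the example.
SAMPLE_BODY = """
{
  "offset": 0,
  "rows": [
    {"_id": "156f4007e4c3b74a36ed3100", "sensor_ip_address": "192.168.0.117",
     "fact_time": 1253557217000,
     "data": {"temperature": "31.44", "ph": "1.08", "pH_mv": "-82.0"}}
  ],
  "total_rows": 1,
  "query": {},
  "millis": 0
}
"""

rows = parse_rows(SAMPLE_BODY)
print(rows[0]["sensor_ip_address"], rows[0]["data"])
```

Once the type bug is fixed on the server side, the float conversion becomes a no-op for numeric values and can be dropped.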

Data visualisation tools for mongoDB are slowly being developed by open-source developers. The next picture shows the database "netbeams" and the collection "SondeDataContainer" being rendered by futon4mongodb, one of the open-source tools developed to visualise mongoDB data.

By clicking on the collection name, the list of all the "documents" is displayed. Note that 1 million documents are shown in the counter. The ID is displayed as the main key, while the value column shows the list of keys. This should be changed in the next releases of futon4mongodb.

To view a single document, just click on one of the documents. The keys and values are displayed.

Data format used by Biologists

Note that this format can be easily translated to the OPeNDAP format used by the RTC's sensor network. An example of such data can be seen by accessing the RTC's website link http://sfbeams.sfsu.edu:8080/opendap/sfbeams/data_ctd/rtc_ctd2-floating/archive/2008-RTCCTDM2_qc_DIST/2008-RTCCTDM2_qc_DIST.dat.ascii? using the ASCII representation.

Dataset: 2008-RTCCTDM2_qc_DIST.dat
CTD_DIST_CSV.Month, CTD_DIST_CSV.Day, CTD_DIST_CSV.Year, CTD_DIST_CSV.Hour, CTD_DIST_CSV.Min, CTD_DIST_CSV.Sec, CTD_DIST_CSV.Water_Temp, 
CTD_DIST_CSV.Cond, CTD_DIST_CSV.Pres, CTD_DIST_CSV.Skufa1, CTD_DIST_CSV.Skufa2, CTD_DIST_CSV.Xmis, CTD_DIST_CSV.PAR, CTD_DIST_CSV.Sal, 
CTD_DIST_CSV.Sigma, CTD_DIST_CSV.InstSN
1, 1, 2008, 0, 0, 31, 9.4281, 2.79835, 0.727, 1.6628, 0.4951, 7.0798, 0.8331, 25.2725, 19.4095, 4195
1, 1, 2008, 0, 6, 31, 9.4053, 2.79205, 0.726, 1.5797, 0.4723, 7.3472, 0.5699, 25.226, 19.3765, 4195
1, 1, 2008, 0, 12, 31, 9.3983, 2.79188, 0.725, 1.5886, 0.4672, 7.3773, 0.4411, 25.2291, 19.38, 4195
1, 1, 2008, 0, 18, 31, 9.3865, 2.79317, 0.726, 1.5865, 0.4639, 7.4817, 0.3513, 25.2503, 19.3981, 4195
1, 1, 2008, 0, 24, 31, 9.3812, 2.79453, 0.726, 1.5806, 0.4591, 7.5355, 0.3327, 25.2676, 19.4124, 4195
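
As a rough sketch of such a translation, the function below renders sonde documents in a layout that resembles the ASCII listing above: a "Dataset:" line, a header of prefixed column names, and one comma-separated row per sample with the time split into month, day, year, hour, minute and second. The dataset name, the sequence prefix and the interpretation of fact_time as UTC milliseconds are assumptions. The output is not claimed to be valid OPeNDAP.

```python
from datetime import datetime, timezone

TIME_COLUMNS = ["Month", "Day", "Year", "Hour", "Min", "Sec"]


def to_ascii_listing(documents, dataset, sequence, measurement_keys):
    """Render documents as a header plus comma-separated rows, one per sample."""
    columns = [f"{sequence}.{name}" for name in TIME_COLUMNS + measurement_keys]
    lines = [f"Dataset: {dataset}", ", ".join(columns)]
    for doc in documents:
        # fact_time is assumed to be milliseconds since the epoch, in UTC.
        moment = datetime.fromtimestamp(doc["fact_time"] / 1000, tz=timezone.utc)
        values = [moment.month, moment.day, moment.year,
                  moment.hour, moment.minute, moment.second]
        values += [doc["data"][key] for key in measurement_keys]
        lines.append(", ".join(str(value) for value in values))
    return "\n".join(lines)


sample_docs = [
    {"fact_time": 1252845346000, "data": {"temperature": "45.01", "ph": "5.64"}},
]
print(to_ascii_listing(sample_docs, "sonde-example.dat", "SONDE", ["temperature", "ph"]))
```

Only the time handling needs real work; the measurement columns map one to one onto the keys under "data".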

Experiment Analysis

  • Memory, Performance, How it scales, etc.
  • Data Access through different mechanisms
    • Interactive Shell: mongo shell;
    • Exporting the data: mongoexport shell to JSON or CSV;
    • Through the API using Java, Python, Perl, etc;
    • Through the REST Web Services
  • File system: the complete mongoDB data directory with 1 million documents can be downloaded at http://netbeams.googlecode.com/files/netbeams-mongodb-1000000-data.tar.gz.
