This project contains code from the NERC-sponsored Technology Proof of Concept project MashMyData. This project will demonstrate the intercomparison of environmental datasets on the web and will run from 1st February 2010 to 30th June 2011.

This project is a partnership between the Reading e-Science Centre and the Centre for Environmental Data Archival.

Background and overview

The ability to combine and compare diverse datasets is critical to furthering our understanding of the Earth system. Environmental scientists use numerous sources of data, including in situ measurements, remotely-sensed information and the results of numerical simulations. However, the integration of such datasets to generate new scientific knowledge can be difficult, largely due to inherent technical complexities. As a result, many valuable environmental datasets are underused.

It is therefore extremely important to develop information technology that will allow scientists to quickly and easily compare diverse environmental data without unnecessarily spending time on low-level tasks associated with converting between data formats, naming conventions, access routines and other such technical details. Our proposed proof-of-concept work will be an important and innovative step to lay foundations for this capability and demonstrate its scientific benefits.

The common factor linking all environmental data is its geospatial nature: all data is geographically referenced to the globe. Geographic Information Systems (GIS) are designed specifically for enabling the intercomparison of diverse geospatial datasets, but there are deep-seated technical difficulties inherent in applying GIS to many kinds of environmental science data. GIS tools have evolved to focus on largely-static land surface features, and hence are oriented around two-dimensional horizontal maps. By contrast, environmental scientists are often concerned with the rapid evolution in time of three-dimensional phenomena such as storms, ocean eddies or algal blooms. Nevertheless, recently-developed open GIS standards, such as those developed and promoted by the Open Geospatial Consortium (OGC), can be used with great potential scientific benefit. For example, the Reading e-Science Centre’s “Godiva2” website is based upon these standards and enables the interactive visual exploration of large datasets from numerical models, helping to drive model improvements. Users of Godiva2 frequently comment that they would very much like to be able to view a greater range of datasets, including in situ observations and satellite swaths. Furthermore, they wish to be able to upload their own data and to perform simple calculations in order to investigate and quantify the differences between datasets. Such a system has never before been developed; however, recent advances by the project team and others mean that it is now possible to realize this vision.

The concept of “mashing up” information – i.e. using the Web as an easy-to-use platform to combine and overlay different sources of geospatial data quickly and easily – has recently become very popular in a number of fields including social science (MapTube). Interactive web mapping systems such as Google Maps and OpenLayers are used to display a variety of data in an intuitive and visually-compelling manner. We propose to extend this idea to the environmental sciences by developing a proof-of-concept web portal (“MashMyData”) that will use the latest Web GIS technologies to allow scientists to visualize and intercompare datasets without the need to understand the low-level technical details of the data’s format or physical location. MashMyData will demonstrate the following major new capabilities:

  • Scientists will be able to simultaneously visualize data from many sources, including their own uploaded data, data shared by their colleagues and third-party datasets (both public and restricted-access).
  • Scientists will be able to perform simple quantitative comparison calculations, such as calculating the misfit between a numerical model and a set of observations over a user-selected region of space and time.

Technical approach

Figure 1 illustrates the large-scale architecture of the system. The user interacts with the system through a web portal. The web portal communicates with a web server, which mediates between the user and the data sources. Data are held both locally to the web server and remotely on the Internet, with remote datasets accessed through web service protocols. The users of the two test cases will be able to tailor the web portal to their needs by uploading their own data.

Figure 1: Overall architecture of MashMyData. The system provides users with secure access to data from various sources, all of which can be visualized and analysed through a web browser. The user does not need to know anything about the format or location of the data sources. Padlocks denote secure data feeds that require user authorization.

This project will require us to solve a number of challenging technical problems. These are key problems in the environmental informatics community and their solutions will be very widely applicable in future projects.

Harmonizing data sources

Data will reside in different formats in different physical locations, accessed via different web service protocols. Furthermore, users will upload data in a variety of formats. It is very important to avoid writing specific data processing and visualization code for each individual dataset. The various datasets must therefore be exposed to the rest of the system in a consistent fashion. We shall develop a Java implementation of the data model defined by the Climate Science Modelling Language (CSML), which applies international standards to describe a very large proportion of environmental science data. The key is that CSML uses a small number of “feature types” to model a large number of datasets. (Feature types are based on the data’s geometry and include grids, vertical profiles, timeseries, trajectories and points.) All visualization and analysis routines then operate upon these feature types, without knowledge of how or where the underlying data are stored. As a result, a single generic routine that calculates the root-mean-square misfit between a set of observed vertical profiles and a numerical model can be applied both to the comparison of Argo float data with a model of the ocean and to the comparison of radiosonde data with a model of the atmosphere. We shall focus on harmonizing the particular data sources that are used in the test cases; however, this project will create a framework in which new datasets can be added to the system in future with minimum effort.
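The idea of a generic routine operating on feature types can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the Java planned for the project, and it is not the CSML data model: the class ProfileFeature, the function rms_misfit, the two toy model functions and all numerical values are hypothetical names and figures invented for illustration. The model is assumed to be any function that returns a value for a given longitude, latitude and vertical level.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Iterable, Sequence


@dataclass(frozen=True)
class ProfileFeature:
    """Hypothetical stand-in for a vertical-profile feature type."""
    lon: float
    lat: float
    levels: Sequence[float]  # vertical coordinate (depth or height)
    values: Sequence[float]  # observed value at each level


# Assumption: a model is any callable (lon, lat, level) -> value.
ModelSampler = Callable[[float, float, float], float]


def rms_misfit(profiles: Iterable[ProfileFeature], model: ModelSampler) -> float:
    """Root-mean-square difference between model and observations."""
    squared_sum = 0.0
    count = 0
    for profile in profiles:
        for level, observed in zip(profile.levels, profile.values):
            difference = model(profile.lon, profile.lat, level) - observed
            squared_sum += difference * difference
            count += 1
    if count == 0:
        raise ValueError("no observations to compare")
    return sqrt(squared_sum / count)


# Toy ocean case: one Argo-like profile against a made-up temperature model.
argo_profiles = [ProfileFeature(-30.0, 45.0, [10.0, 100.0], [15.0, 11.0])]


def toy_ocean_model(lon: float, lat: float, depth: float) -> float:
    return 16.0 - 0.04 * depth


# Toy atmosphere case: one radiosonde-like profile, same routine.
sonde_profiles = [ProfileFeature(0.0, 51.5, [1000.0], [280.0])]


def toy_atmosphere_model(lon: float, lat: float, height: float) -> float:
    return 288.0 - 0.0065 * height


ocean_misfit = rms_misfit(argo_profiles, toy_ocean_model)
atmosphere_misfit = rms_misfit(sonde_profiles, toy_atmosphere_model)
```

The same rms_misfit function serves both cases because it depends only on the shape of the profile feature, not on where the observations came from or how they were stored.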

Visualizing diverse data

In order to visualize data in the web portal, data must be made available in “web-friendly” visualization formats. Previous experience indicates that the most suitable and interoperable technologies are the Web Map Service for gridded data (such as numerical model output) and KML for non-gridded data (such as observations). Both of these technologies are OGC standards with which the project team has considerable experience. The use of standards ensures that visual representations of the data could be viewed in different visualization systems in future. As Figure 1 indicates, both “raw” data and data resulting from analysis will be visualized in this way.

Accessing secure data

The web portal server will access and process data on behalf of the logged-on user. This requires that the user be able to delegate his or her authority to the web portal server. We will be able to exploit the considerable work done by the NERC DataGrid (NDG) team at CEDA on secure data access services. The NDG team has developed and deployed solutions which have the functionality required here, and are in the process of extending these to interact with authentication and authorisation paradigms from 1) the U.S. Earth System Grid, and 2) the UK Shibboleth identity providers. This will ensure future compatibility with many secure data systems.

Performing calculations remotely

In order to ensure future scalability, and to avoid large data transfers where possible, this project will demonstrate the processing of data on remote compute servers that are close to the data stores. We shall employ the OGC Web Processing Service (WPS) as the interface to the remote compute servers. There is much current community interest in the use of WPS for this purpose, although the technology has rarely been employed in the environmental sciences. This work will build upon previous CEDA experience with the Defra-sponsored UK Climate Impacts Programme.
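As a rough illustration of what a call to a remote compute server might look like, the sketch below builds a key-value-pair Execute request in the style of WPS 1.0.0, where the inputs travel in a single DataInputs parameter as semicolon-separated name=value pairs. This is an assumption about the request style, not a description of the project's actual services: the endpoint URL, the process identifier rms_misfit and the input names are all hypothetical, and a real server may expect different inputs or an XML request body instead.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wps_execute_url(endpoint: str, process_id: str, inputs: dict) -> str:
    """Build a WPS 1.0.0-style key-value-pair Execute URL (sketch only)."""
    # Inputs are joined as name=value pairs separated by semicolons.
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    query = urlencode({
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": process_id,
        "DataInputs": data_inputs,  # urlencode percent-encodes '=' and ';'
    })
    return f"{endpoint}?{query}"


# Hypothetical endpoint, process name and input names.
example_url = wps_execute_url(
    "https://example.org/wps",
    "rms_misfit",
    {"model": "ocean_model", "observations": "argo", "start": "2010-02-01"},
)

# Decode the query again to show what the server would receive.
decoded = parse_qs(urlsplit(example_url).query)
```

Only the small request and the computed result cross the network in this pattern; the large model and observation datasets stay next to the compute server.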
