
opmv - OPMVGuide.wiki


The Open Provenance Model (OPM) is a community-driven model for representing provenance information, designed to facilitate interoperability between provenance systems (1). An existing OWL serialization of OPM is available at (2). However, a list of known issues has been identified (3). The Open Provenance Model Vocabulary (OPMV) aims to provide a new serialization of OPM and to support as many features of OPM as possible. We try to take advantage of RDF technologies in implementing the OPM specification (1) and to reuse existing RDF vocabularies wherever possible. We also expect that this new representation of OPM will allow users to perform reasoning over provenance graphs.

At this development stage, the Open Provenance Model Vocabulary is not intended to be a profile of OPM. Although OPMV includes the three core concepts and five core properties defined by OPM, it does not yet implement every structure specified in OPM. OPMV also introduces some terms and properties that are not specified in OPM, which we expect to be mapped to OPM in a way similar to how DC terms are mapped to OPM (4). A complete mapping of OPMV as an OPM profile is future work for the next development stage.

This document is aimed at practitioners of data publishing who want to publish their data responsibly. It provides concrete examples to explain how to use OPMV, particularly how to specialise this ontology for application-specific needs and how to use it together with other RDF vocabularies and technologies (such as Named Graphs). The examples are contributed by the data.gov.uk team. They cover not only provenance requirements arising from typical practices of creating linked data on the Web, such as data transformation, replication, or query rewriting, but also general cases of data access on the Web (such as data downloading) and cases related to legislation.

All examples in this document are written in the Turtle RDF syntax. Throughout this document, the following namespaces are used:

```
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix time: <http://www.w3.org/2006/time#> .

@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix opm: <http://openprovenance.org/ontology#> .

@prefix common: <http://purl.org/net/opmv/types/common#> .
@prefix xslt: <http://purl.org/net/opmv/types/xslt#> .
@prefix sparql: <http://purl.org/net/opmv/types/sparql#> .
@prefix gate: <http://purl.org/net/opmv/types/gate#> .

@prefix eg: <http://example.org.uk/> .
```

Table of contents

  • Overview of the OPMV Vocabulary
    • Implementing the basics of OPM
    • Implementation of OPM roles
    • Implementation of OPM account
    • Implementation of Time
  • Describe the creation of a dataset
    • Example: Data transformation by querying database
    • Example: Data creation on-the-fly by the service
    • Example: Data creation by a data downloading process
    • Example: Data creation by data transformation with XSLT
  • Describe the previous version of a dataset
    • Example: Updates to the RDF graph
  • Describe time information
    • Example: Time related to data
    • Example: Time related to data transformation
  • Customize OPMV

Overview of the OPMV Vocabulary

The Open Provenance Model Vocabulary is currently implemented as an OWL-DL ontology and is available at its namespace http://purl.org/net/opmv/ns#. The vocabulary is partitioned into the core OPMV vocabulary and typed modules, which provide less frequently used terms and a broad range of specializations of the core terms. At the moment the following modules are implemented:

* The common module, under the namespace http://purl.org/net/opmv/types/common#.
* The xslt module, under the namespace http://purl.org/net/opmv/types/xslt#.
* The sparql module, under the namespace http://purl.org/net/opmv/types/sparql#.

Implementing the basics of OPM

The three top-level OPM entities are implemented in OPMV as the classes opmv:Artifact, opmv:Process and opmv:Agent, and the five top-level properties are implemented as object properties:

* opmv:wasDerivedFrom, dom(opmv:wasDerivedFrom) = opmv:Artifact and range(opmv:wasDerivedFrom) = opmv:Artifact;
* opmv:used, dom(opmv:used) = opmv:Process and range(opmv:used) = opmv:Artifact;
* opmv:wasGeneratedBy, dom(opmv:wasGeneratedBy) = opmv:Artifact and range(opmv:wasGeneratedBy) = opmv:Process;
* opmv:wasControlledBy, dom(opmv:wasControlledBy) = opmv:Process and range(opmv:wasControlledBy) = opmv:Agent;
* opmv:wasTriggeredBy, dom(opmv:wasTriggeredBy) = opmv:Process and range(opmv:wasTriggeredBy) = opmv:Process.

These terms can be used to express some basic provenance information about data creation and transformation.
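As a minimal sketch of how the five core properties fit together, the Turtle below describes a report derived from a spreadsheet. All eg: resource names are hypothetical and chosen only for illustration; they are not part of OPMV.

```turtle
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix eg:   <http://example.org.uk/> .

# Two artifacts: an input and the output derived from it (hypothetical names).
eg:spreadsheet a opmv:Artifact .

eg:report a opmv:Artifact ;
    opmv:wasDerivedFrom eg:spreadsheet ;
    opmv:wasGeneratedBy eg:conversion .

# The process that used the input, was controlled by an agent,
# and was triggered by an earlier process.
eg:conversion a opmv:Process ;
    opmv:used eg:spreadsheet ;
    opmv:wasControlledBy eg:alice ;
    opmv:wasTriggeredBy eg:upload .

eg:upload a opmv:Process .
eg:alice a opmv:Agent .
```

Each statement respects the domain and range of the property it uses: artifacts are generated by processes, processes use artifacts, and agents control processes.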

Implementation of OPM roles

A role is defined in OPM to "designate an artifact's or agent's function in a process" (1). This structure can be used to refine the provenance information expressed using the basic terms.

According to OPM, roles should be scoped to the process that they are related to. Each agent or artifact could play a specific role in a given process. Therefore, we could define sub-properties for each of opmv:used, opmv:wasGeneratedBy and opmv:wasControlledBy.

For example, we could differentiate the role played by the different artifacts used by a process: one artifact being data while the other being the creation guideline. For this, we could refine the property of opmv:used in order to express different types of relationships between an opmv:Artifact and opmv:Process. An example of this has been implemented in the common typed module, which defines common:usedData and common:usedScript as two sub-properties of opmv:used.
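The following sketch illustrates this use of roles. It relies on the two sub-properties named above, common:usedData and common:usedScript; the eg: resources are hypothetical.

```turtle
@prefix opmv:   <http://purl.org/net/opmv/ns#> .
@prefix common: <http://purl.org/net/opmv/types/common#> .
@prefix eg:     <http://example.org.uk/> .

# A process that uses two artifacts in different roles (hypothetical names).
eg:creation a opmv:Process ;
    common:usedData   eg:rawData ;     # the artifact playing the "data" role
    common:usedScript eg:guideline .   # the artifact guiding the creation

eg:rawData   a opmv:Artifact .
eg:guideline a opmv:Artifact .
```

Because both properties are sub-properties of opmv:used, an RDFS-aware consumer can still conclude that eg:creation used both artifacts, even if it does not know the common module.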

Implementation of OPM account

The provenance information about an artifact could be expressed at different levels of abstraction or from different viewpoints (1). The OPM specification introduces the concept of an "account" to "represent a description at some level of detail as provided by one or more observers". Accounts should be uniquely identified, and two accounts are equal if and only if they have the same identifier.

OPMV does not yet provide specific terms to implement accounts. We plan to use Named Graphs to represent such information. A separate named graph can be created for the provenance descriptions provided by each observer. Provenance descriptions at different levels of abstraction could either be extracted by queries over the RDF provenance data (using, for example, SPARQL) or be defined in different named graphs.

The example "Updates to the RDF graph" shows how we can create a separate named graph for descriptions about a school name, in order to provide additional provenance descriptions about the name of the school.

Implementation of Time

OPM provides a very refined time model. It differentiates between instantaneous occurrences and those that are not. It recognizes four instantaneous occurrences: the creation and use of artifacts, and the starting and ending of processes. The time information about each occurrence is expected to be observed. Given that time is observed, time accuracy is limited by the granularity of the clock and the granularity of the observer’s activities. While the occurrence of an event or artifact might be instantaneous, the observation of its occurrence happens in a time interval. Hence, each instantaneous occurrence must happen at a time t between two observation times. This means that each relationship between OPM concepts (Artifact, Agent and Process) must be associated with two timestamps, to convey an observation interval.

In OPMV, we do not yet represent such a refined model of time information. Instead, we reuse the Time Ontology (http://www.w3.org/TR/owl-time/) to define properties such as opmv:wasGeneratedAt, which expresses the creation time of an artifact, and opmv:wasPerformedAt with its two sub-properties opmv:wasStartedAt and opmv:wasEndedAt, which describe time information about a process.

Examples of using time-related properties can be found in the examples "Time related to data" and "Time related to data transformation".

Describe the creation of a dataset

Example: Data transformation by querying database

The use case is that Edubase starts publishing its data straight from its database, one page per school. The RDF generated for a school is generated on demand from the database by some .NET code (say) that formats the result of a SQL query on the database as RDF/XML. That's one account of the provenance of the graph that gets published. The provenance of the RDF graph about each school can be described as follows.

```
eg:school1 rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> ;
    rdf:type opmv:Artifact, prv:DataItem ;
    opmv:wasDerivedFrom _:queryResult ;
    opmv:wasGeneratedBy [
        rdf:type opmv:Process ;
        opmv:used _:queryResult ;
        opmv:wasPerformedBy _:netcode ;   # sub-property of opmv:wasControlledBy
        opmv:wasControlledBy <http://www.jenitennison.com/#me>
    ] .

_:queryResult rdf:type opmv:Artifact ;   # or an opvt:SQLQueryResult?
    opmv:wasGeneratedBy [
        rdf:type opmv:Process ;          # or an opvt:SQLQueryExecution?
        opmv:used <http://example.edu/edubase> ;
        opmv:used _:query
    ] .

_:netcode rdf:type opmv:Agent ;
    rdfs:label ".NET code that formats the result of a SQL query on the database as RDF/XML" .

# TODO: opmvTypes:SQLDatabase should be a class from the OPMV Types module
<http://example.edu/edubase> rdf:type opmv:Artifact, opmvTypes:SQLDatabase ;
    rdfs:label "Edubase: the database about schools and education." .

_:query rdf:type opmv:Artifact, prvTypes:SQLQuery ;
    rdfs:comment "select * from schools where *" .
```

The above RDF example tells us that the RDF graph about a school (eg:school1) is derived from a SQL query result (_:queryResult) and that the graph is generated by some .NET code (_:netcode) that transforms the SQL query result into RDF/XML. The SQL query result is generated by a SQL query execution process, which used a SQL query (_:query) and accessed the Edubase database (http://example.edu/edubase).

This example provides provenance of the RDF graph that contains statements about each school. In addition, data publishers could also provide more coarse-grained provenance information about the whole RDF dataset that publishes the Edubase database in RDF.

The example is expressed using top-level terms from OPMV as much as possible. We extended OPMV with an object property opmv:wasPerformedBy, which is a sub-property of opmv:wasControlledBy. Both properties have the same domain and range definitions.

To express the use case more precisely, we could create sub-types of top OPMV concepts in the OPMV Types module. This is planned to be implemented in a SQL type module.

Example: Data creation on-the-fly by the service

This example should now be implemented using the common module.

Example: Data creation by a data downloading process

This example should now be implemented using the common module.
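Pending a version based on the common module, a download can be sketched with core OPMV terms alone. The sketch below is an assumption about how such a description might look, not the guide's own example; all eg: names are hypothetical.

```turtle
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix eg:   <http://example.org.uk/> .

# The local copy is an artifact derived from the remote file.
eg:localCopy a opmv:Artifact ;
    opmv:wasDerivedFrom eg:remoteFile ;
    opmv:wasGeneratedBy eg:download .

# The download itself is a process controlled by some agent.
eg:download a opmv:Process ;
    rdfs:label "Download of the remote file" ;
    opmv:used eg:remoteFile ;
    opmv:wasControlledBy eg:crawler .

eg:remoteFile a opmv:Artifact .
eg:crawler    a opmv:Agent .
```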

Example: Data creation by data transformation with XSLT

This example should now be implemented using the xslt module.
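The terms of the xslt module are not listed in this guide, so the sketch below avoids them. It assumes only the core terms and the common:usedData and common:usedScript properties mentioned earlier; the eg: resources are hypothetical.

```turtle
@prefix opmv:   <http://purl.org/net/opmv/ns#> .
@prefix common: <http://purl.org/net/opmv/types/common#> .
@prefix eg:     <http://example.org.uk/> .

# The RDF output is derived from the XML input.
eg:schoolsRdf a opmv:Artifact ;
    opmv:wasDerivedFrom eg:schoolsXml ;
    opmv:wasGeneratedBy eg:transformation .

# The transformation uses the XML as data and the stylesheet as script.
eg:transformation a opmv:Process ;
    common:usedData   eg:schoolsXml ;
    common:usedScript eg:stylesheet ;
    opmv:wasControlledBy eg:xsltProcessor .

eg:schoolsXml    a opmv:Artifact .
eg:stylesheet    a opmv:Artifact .
eg:xsltProcessor a opmv:Agent .
```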

Describe the previous version of a dataset

Example: Updates to the RDF graph

Following on from the Edubase example, suppose the 'name' field of the school is changed, which changes the rdfs:label of the school in the RDF. The provenance of the new graph would state that it is derived from the previous version of the RDF (available at a dated URI) and that it was generated by the removal of the existing rdfs:label triple and the addition of a new rdfs:label triple.

We create four named graphs:

* ex:G1 and ex:G2 to represent respectively the older and newer versions of the RDF graph providing information about the school (eg:school1);
* ex:labelG1 and ex:labelG2 to represent respectively the name label of the school in the different versions of the RDF graph.

To express how a new version of an RDF graph (ex:G2) is created by deleting a triple from an older version of the RDF graph (ex:G1) and adding a triple to the newer version of the graph:

* we express that the new label replaces the old label, using the property dcterms:replaces;
* we express that the new graph is derived from the older graph, using the property opmv:wasDerivedFrom;
* we create an instance of opmv:Process to describe that the older graph is used as the input to produce a newer graph as the result.

The complete example is shown below, expressed using TriG, a serialization format of Named Graphs based on Turtle that is extended with graph naming.

We create named graphs to represent the different versions of the RDF graph about eg:school1.
```
ex:G1 {
    eg:school1 rdfs:label "school a" ;
        foaf:homepage <http://example.org/page/school1> .
}

ex:G2 {
    eg:school1 rdfs:label "school B" ;
        foaf:homepage <http://example.org/page/school1> .
}
```

We define a named graph for each version of the school name, and we declare each of these named graphs to be part of the corresponding whole RDF graph using dcterms:isPartOf.
```
ex:labelG1 {
    eg:school1 rdfs:label "school a" .
}

ex:labelG2 {
    eg:school1 rdfs:label "school B" .
}

ex:labelG1 dcterms:isPartOf ex:G1 ;
    opmv:wasDerivedFrom ex:G1 .
ex:labelG2 dcterms:isPartOf ex:G2 ;
    opmv:wasDerivedFrom ex:G2 .
```

We represent the fact that the school name has been updated by a process that takes the older named graphs about the school as input and produces the newer versions of the named graphs describing the school and the school name.
```
ex:G2 dcterms:created "2010-03-24"^^xsd:date ;
    opmv:wasDerivedFrom ex:G1 ;
    opmv:wasGeneratedBy ex:dc .

ex:dc rdf:type opmv:Process ;
    opmv:used ex:labelG1 ;
    opmv:used ex:G1 .

ex:labelG2 opmv:wasGeneratedBy ex:dc ;
    dcterms:replaces ex:labelG1 ;
    owl:differentFrom ex:labelG1 ;
    opmv:wasDerivedFrom ex:labelG1 .
```

Describe time information

Example: Time related to data

The property opmv:wasGeneratedAt, with a domain of opmv:Artifact and a range of time:Instant, can be used to express the creation time of an artifact, as the example below shows.

_:recordVersion a opmv:Artifact ;
    opmv:wasGeneratedAt [
        a time:Instant ;
        time:inXSDDateTime "{CREATION_TIME}"^^xsd:dateTime
    ] .

It is good practice to reuse existing vocabularies, such as Dublin Core, to express simple time-related information about an artifact. The above example could instead be expressed using the dcterms:created property from Dublin Core:

_:recordVersion a opmv:Artifact ;
    dcterms:created "{CREATION_TIME}"^^xsd:dateTime .

Mapping OPMV to DC is still ongoing work and the mapping result will be provided in a separate document.
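To make the placeholder concrete, the sketch below states the creation time of one artifact in both styles side by side. The resource name eg:recordV2 and the timestamp are made up for illustration.

```turtle
@prefix opmv:    <http://purl.org/net/opmv/ns#> .
@prefix time:    <http://www.w3.org/2006/time#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix eg:      <http://example.org.uk/> .

eg:recordV2 a opmv:Artifact ;
    # OPMV style: the creation time is a time:Instant resource.
    opmv:wasGeneratedAt [
        a time:Instant ;
        time:inXSDDateTime "2010-03-24T09:30:00Z"^^xsd:dateTime
    ] ;
    # Dublin Core style: the creation time is a plain typed literal.
    dcterms:created "2010-03-24T09:30:00Z"^^xsd:dateTime .
```

The OPMV form costs an extra node but leaves room to attach further statements to the instant, while the Dublin Core form is shorter and more widely understood.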

Example: Time related to data transformation

To express the time at which a process was performed we provide three properties:

* opmv:wasPerformedAt, which has an rdfs:domain of opmv:Process and an rdfs:range of time:TemporalEntity, meaning that its value can be either a time:Instant or a time:Interval;
* opmv:wasStartedAt and opmv:wasEndedAt, which have opmv:Process as their domain and time:Instant as their range.

For example, to express that a process took place during a time interval we can have the following:
```
<http://reference.data.gov.uk/doc/day/2010-03-23> rdf:type opmv:Artifact ;
    opmv:wasGeneratedBy [
        a opmv:Process ;
        opmv:wasPerformedBy _:A0 ;
        opmv:wasPerformedAt _:interval   # _:interval is a placeholder for the interval resource
    ] .

_:interval a time:Interval .
```

Equivalently we can have:

<http://reference.data.gov.uk/doc/day/2010-03-23> rdf:type opmv:Artifact ;
    opmv:wasGeneratedBy [
        a opmv:Process ;
        opmv:wasPerformedBy _:A0 ;
        opmv:wasStartedAt [
            a time:Instant ;
            time:inXSDDateTime "{PROCESS START TIME}"^^xsd:dateTime
        ] ;
        opmv:wasEndedAt [
            a time:Instant ;
            time:inXSDDateTime "{PROCESS END TIME}"^^xsd:dateTime
        ]
    ] .

The mapping between these two forms is defined by an owl:propertyChainAxiom, which defines opmv:wasStartedAt as a shortcut for the chain of opmv:wasPerformedAt followed by time:hasBeginning (which has an rdfs:domain of time:TemporalEntity and an rdfs:range of time:Instant), as shown below:
```
<http://purl.org/net/opmv/ns#wasStartedAt> rdf:type owl:ObjectProperty ;
    rdfs:domain <http://purl.org/net/opmv/ns#Process> ;
    rdfs:range <http://www.w3.org/2006/time#Instant> ;
    owl:propertyChainAxiom (
        <http://purl.org/net/opmv/ns#wasPerformedAt>
        <http://www.w3.org/2006/time#hasBeginning>
    ) .
```

Similarly we define how to map opmv:wasEndedAt to opmv:wasPerformedAt and time:hasEnd using a similar owl:propertyChainAxiom.
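By analogy with the opmv:wasStartedAt definition, the corresponding axiom might be written as follows. This is a sketch inferred from the description above, not a copy of the published ontology, which may differ in detail.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .

# opmv:wasEndedAt as a shortcut for: wasPerformedAt followed by time:hasEnd.
opmv:wasEndedAt rdf:type owl:ObjectProperty ;
    rdfs:domain opmv:Process ;
    rdfs:range time:Instant ;
    owl:propertyChainAxiom (
        opmv:wasPerformedAt
        time:hasEnd
    ) .
```

With such an axiom, an OWL 2 reasoner can derive the end instant of a process from the interval given by opmv:wasPerformedAt, provided the interval itself carries a time:hasEnd statement.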

Customize OPMV

Another goal of this guide is to show users how to create more specific concepts and properties based on the core OPMV terms, in order to express their specific data transformations more precisely, such as transformation with XSLT, conversion between different data formats, query rewriting, etc. We propose three different ways of customizing the core OPMV.

User-defined specialization

We encourage users to create their own sub-types of OPMV classes and properties using RDFS and OWL constructs (such as rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentProperty). Users could keep their extensions as a local copy. Applying RDFS/OWL reasoning to the instance data would require the support of RDFS/OWL reasoners. Approaches for carrying out the reasoning and for accessing such instance data are beyond the scope of this document. In the following we give some examples to show how to define sub-types of the OPMV core vocabulary under a local namespace.
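A minimal sketch of such a local extension is given below. The class and property names under eg: (CsvToRdfConversion, CsvFile, usedMapping and the instances) are hypothetical and serve only to show the pattern.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix eg:   <http://example.org.uk/> .

# Local sub-classes of the core OPMV classes.
eg:CsvToRdfConversion a owl:Class ;
    rdfs:subClassOf opmv:Process .

eg:CsvFile a owl:Class ;
    rdfs:subClassOf opmv:Artifact .

# A local sub-property of opmv:used, scoped to the new process type.
eg:usedMapping a owl:ObjectProperty ;
    rdfs:subPropertyOf opmv:used ;
    rdfs:domain eg:CsvToRdfConversion ;
    rdfs:range opmv:Artifact .

# Instance data expressed with the local terms.
eg:run1 a eg:CsvToRdfConversion ;
    opmv:used eg:schoolsCsv ;
    eg:usedMapping eg:mappingFile .

eg:schoolsCsv a eg:CsvFile .
eg:mappingFile a opmv:Artifact .
```

An RDFS reasoner applied to this data would conclude that eg:run1 is an opmv:Process and that it opmv:used eg:mappingFile, so consumers that know only the core vocabulary can still interpret the description.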

Reuse and map to existing vocabulary

Users can reuse existing classes and properties from other ontologies together with OPMV. Such concepts are not included in OPMV, but they can be used directly alongside OPMV terms to express the provenance information that users need. For example, the class doap:Version from the DOAP vocabulary can be used together with OPMV to express information about the specific version of the software used in a process (opmv:Process) that creates or accesses an artifact (opmv:Artifact).
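One way this combination might look is sketched below. Modelling the software release as an artifact used by the process is an assumption made here for illustration (it could also be modelled as a controlling agent), and the eg: names and the revision string are hypothetical.

```turtle
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix eg:   <http://example.org.uk/> .

# A specific release of some conversion software, typed with both vocabularies.
eg:converterRelease a doap:Version, opmv:Artifact ;
    rdfs:label "Release of the conversion tool" ;
    doap:revision "1.0" .

# The process records which release it used.
eg:conversion a opmv:Process ;
    opmv:used eg:converterRelease, eg:inputData .

eg:inputData a opmv:Artifact .

eg:outputData a opmv:Artifact ;
    opmv:wasGeneratedBy eg:conversion .
```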

A special case of this is to create a mapping between an existing provenance-related vocabulary (such as the Provenance Vocabulary, http://purl.org/net/provenance/ns#) and OPMV, so that the community could exchange RDF data expressed using these different vocabularies. In the following we give some examples to show how we can map some of the core concepts and properties from the Provenance Vocabulary to OPMV. Creating a mapping between OPMV and the OPM OWL serialization is part of the next development stage.
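As an illustration of what one such mapping statement could look like, the sketch below relates prv:DataItem, which appears alongside opmv:Artifact in the Edubase example, to the OPMV core. The choice of rdfs:subClassOf is an assumption for illustration only and does not represent an agreed mapping.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prv:  <http://purl.org/net/provenance/ns#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix eg:   <http://example.org.uk/> .

# Hypothetical mapping axiom: every prv:DataItem is also an opmv:Artifact.
prv:DataItem rdfs:subClassOf opmv:Artifact .

# Data published with the Provenance Vocabulary only.
eg:school1 a prv:DataItem .
# With the axiom above, an RDFS reasoner would also infer:
#   eg:school1 a opmv:Artifact .
```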

The OPMV Types Module

We provide the OPMV Types module, which defines a set of more specialized concepts and properties based on the core OPMV, to cover the range of terms needed for describing different types of data transformation processes. In this way, we avoid overloading or over-specializing the OPMV core vocabulary. This is similar to the modularization approach taken by the SIOC (Semantically-Interlinked Online Communities) project (http://rdfs.org/sioc/spec/). The OPMV Types module is available at http://purl.org/net/opmv/types#, and it includes the following sub-classes of the OPMV core vocabulary: TODO.

There will still be use cases whose requirements cannot be supported by any existing terms in the Types module, by any existing vocabularies, or by simply extending existing classes and properties in the OPMV core vocabulary. In such circumstances, we encourage users of OPMV to get in touch with us so that we can seek the best solution for their needs.