Google Code Archive - Long-term storage for Google Code Project Hosting.

Posted on Jan 14, 2009 by Happy Wombat

We are aware of the following issues with the statistics mechanism as it stands today:

1. In SCOVO, scovo:Items are grouped into scovo:Datasets, and there seems to be an implicit assumption that all items in such a dataset share the same dimensions. As described here, we attach items directly to a void:Dataset, which leads to mixing of items of different dimensionality. On the other hand, the correct SCOVO modelling would lead to awkwardly complex notation for simple statistics.
2. We encourage the use of classes and properties in places where SCOVO requires an instance of scovo:Dimension. This breaks the symmetry of the SCOVO model. SCOVO would require us to create a scovo:Dimension for each class or property. This would be quite verbose.
3. Because of the issues above, SPARQLing for statistics can be awkward. It will often require a verbose check to make sure that an item has only certain dimensions and no others.
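
The "only certain dimensions and no others" check from the last issue can be sketched outside SPARQL. The Python below is a minimal illustration under stated assumptions: the item names, the values, and the dictionary layout are invented for this example and are not part of SCOVO or voiD.

```python
# Hypothetical in-memory stand-in for statistical items attached to one
# dataset. Item names and values are invented for illustration only.
ITEMS = {
    "item1": {"value": 120, "dimensions": {"foaf:Person"}},
    "item2": {"value": 45, "dimensions": {"foaf:Person", "foaf:mbox"}},
    "item3": {"value": 300, "dimensions": {"owl:sameAs"}},
}


def having_dimensions(items, wanted):
    """Naive match: items that carry at least the wanted dimensions."""
    wanted = set(wanted)
    return {name for name, item in items.items() if wanted <= item["dimensions"]}


def having_exactly(items, wanted):
    """Strict match: items whose dimensions are the wanted ones and no others."""
    wanted = set(wanted)
    return {name for name, item in items.items() if item["dimensions"] == wanted}


naive = having_dimensions(ITEMS, {"foaf:Person"})
strict = having_exactly(ITEMS, {"foaf:Person"})
print(sorted(naive))   # item1 and item2: the two-dimensional item slips in
print(sorted(strict))  # item1 only
```

The naive match also returns the item that has foaf:mbox as a second dimension, which is why a query against mixed-dimensionality items needs the stricter, more verbose form.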

Two possible approaches for fixing these issues:

1. Adapt SCOVO to better suit our needs, e.g. by making it less verbose (especially around the definition of dimensions) and easier to query (e.g. scovo:numberOfDimensions on the dataset, a scovo:domainObject property for connecting dimensions to the domain, removing the subclassing of dcterms:Event). Downside: it will still be verbose.
2. Create a new mechanism based on simple properties like void:numberOfTriples, together with a powerful mechanism for specifying void:subsets. Downside: how to do attribution of statistics with this is completely unclear.

We decided not to take action on those issues until after the first release of the Guide.

Comment #1

Posted on Aug 3, 2009 by Massive Ox

Concerning #2: when specifying properties/classes directly without instances of scovo:Dimension, I suggest defining sub-properties of scovo:dimension to give the associated dimensions distinct roles. This matters because the same URI may appear as a data point in different dimensions.

Comment #2

Posted on Aug 10, 2009 by Happy Wombat

Thinking more about this, I think that SCOVO has some serious problems that make it a poor fit for voiD purposes. The problems are:

1. If there's a dimension with n possible values, SCOVO requires the definition of one subclass of scovo:Dimension, and n instances of that class. Now let's assume that we already know URIs for the real-world objects corresponding to the n possible values (e.g. airports, or RDFS classes). In that case, a better design would be to require the definition of one subproperty of scovo:dimension, and directly use those instances with it. We avoid the creation of n new resources.
2. SCOVO groups the scovo:StatItems belonging to a table by relating them to a scovo:Dataset. This is not a good design if one wants to cherry-pick just one or a few values of a table, as happens in voiD (e.g. I want to quote only the number of foaf:Person instances or owl:sameAs links in my dataset). A better design would be this: Instead of creating a scovo:Dataset, one creates a subclass of scovo:StatItem, such as void:ResourcesPerClassStatItem. Then one tags the individual StatItem as belonging to a certain table by giving it that type.

Taken together, this would imply a design where a scovo:Dataset is a bundle of one scovo:StatItem subclass and x scovo:dimension subproperties (for an x-dimensional table).

I think that such a design would be a better fit for voiD. My proposal is to come up with such a design, position it as a general alternative or successor to SCOVO, and use it in voiD.

Comment #3

Posted on Sep 16, 2009 by Happy Wombat

From discussions with Michael: For pragmatic reasons, it's probably best to decouple voiD from SCOVO, define our own voiD-specific modelling for statistics, and define a (non-normative, perhaps) mapping between that modelling and SCOVO, perhaps via SPARQL CONSTRUCT queries.

Comment #4

Posted on Oct 14, 2009 by Happy Wombat

Additional requirement reported by kasei: Total number of distinct properties in a dataset. Total number of classes would also make sense I think.
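
As a rough illustration of what these two totals would count, here is a minimal Python sketch over an in-memory list of triples. The sample data is invented, and treating "classes" as "classes that have at least one rdf:type instance in the data" is an assumption of this sketch, not something the requirement states.

```python
RDF_TYPE = "rdf:type"

# Invented sample triples (subject, predicate, object) as prefixed-name strings.
TRIPLES = [
    (":alice", RDF_TYPE, "foaf:Person"),
    (":alice", "foaf:name", '"Alice"'),
    (":alice", "foaf:knows", ":bob"),
    (":bob", RDF_TYPE, "foaf:Person"),
    (":bob", "foaf:name", '"Bob"'),
    (":doc", RDF_TYPE, "foaf:Document"),
]


def distinct_properties(triples):
    """Every predicate that occurs in at least one triple."""
    return {p for _, p, _ in triples}


def distinct_classes(triples):
    """Every class that has at least one rdf:type instance in the data."""
    return {o for _, p, o in triples if p == RDF_TYPE}


print(len(distinct_properties(TRIPLES)))  # 3: rdf:type, foaf:name, foaf:knows
print(len(distinct_classes(TRIPLES)))     # 2: foaf:Person, foaf:Document
```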

Comment #5

Posted on Oct 31, 2009 by Swift Monkey

From the HCLS list and F2F meeting, I have heard a lot of demand for describing statistics about their data. However, their needs seem a bit different. For example, they would like to be able to describe a list of differentially expressed genes that are associated with values such as P-values, fold-change, etc. They also want to be able to include, as part of the metadata about the data, the type of statistical test (e.g., ANOVA) and the array platform employed (e.g., Affymetrix U133A). In a room of about 30 people with a bio-related background, nearly 20 expressed an urgent need for such things. What do we think?

Comment #6

Posted on Nov 1, 2009 by Happy Wombat

Jun, I didn't understand a single word of what you said ;-)

It seems they are asking either for statistics that are part of the domain (biomedical statistics), or statistics that are part of provenance information (statistical tests that were performed as part of the data creation process). Correct?

Comment #7

Posted on Nov 1, 2009 by Helpful Camel

Hi Richard,

I think I didn't give enough context in my description. If you are interested, please see the thread: http://lists.w3.org/Archives/Public/public-semweb-lifesci/2009Oct/0085.html

However, your understanding is correct.

Firstly, they want to be able to describe the number of differentially expressed genes.

Then, they would like to be able to attach some metadata to these genes, which is very much domain-specific and provenance-like.

Without knowing the statistics model you are going to propose, I think this use case raises one requirement, i.e., the new model should allow users of voiD to create their own dimensions (such as gene expressions) when describing statistics of their data. What do you think?

Comment #8

Posted on Nov 1, 2009 by Happy Wombat

Not sure. My hunch is that statistics about the dataset (number of triples, number of instances of certain classes etc) are best kept separate from domain-specific statistics (number of expressed genes). These two kinds of statistics are going to be generated by different tools, consumed by different tools, and possibly different people will care about them.

I hope that for voiD we can have a voiD-specific, simple, and non-extensible statistics module. I would prefer that over an extensible one that's more complicated for the voiD use cases and where I don't know how well it serves the needs of the HCLS community because I'm unfamiliar with their use cases and domain.

I'm afraid that also taking into account HCLS use cases could even further delay the creation of the new statistics module.

Comment #9

Posted on Nov 1, 2009 by Helpful Camel

re #8,

I agree that the HCLS-related statistics might be generated and consumed by different tools than those of the general LOD community. However, I am not sure about making the statistics module non-extensible. This means that either we gather all the requirements up front, or we have to keep maintaining the module as new requirements come up. I might be too nervous. The requirements might be clear and thoroughly gathered from the beginning :)

Comment #10

Posted on Nov 2, 2009 by Happy Wombat

Jun, that's no different than in the rest of voiD: When new requirements come up, then we have to add new things to voiD.

If we add an extension point, like scovo:dimension or void:Feature, then voiD users can add new things for themselves more easily, for their own local use, but these things will not be interoperable between different voiD users.

Maybe it's better to talk about this when a concrete proposal is on the table…

Comment #11

Posted on Nov 2, 2009 by Swift Monkey

I am going to talk about voiD and its progress at the HCLS F2F tomorrow, remotely. Can I tell the community that we are open to taking use cases, but are not promising immediate solutions for the moment?

Comment #12

Posted on Nov 2, 2009 by Happy Wombat

My only comment is that I think of voiD as something that should be domain-neutral, hence I want to be able to express statistics about the RDF graph (triple counts and instance counts and the like), but I would like to think that statistics about the domain (gene expressions and statistical methods for scientific analysis) are out of scope. Terms for that are needed too of course, but I don't think that voiD should be the place for them.

I'm happy about all use cases that are related to discovering datasets, and gluing datasets together.

Comment #13

Posted on Jan 14, 2010 by Happy Wombat

Description of a proposal for a new statistics module:

http://groups.google.com/group/void-rdfs-internals/browse_thread/thread/cb9d1978158a0326

Comment #14

Posted on Jan 18, 2010 by Happy Monkey

(No comment was entered for this change.)

Comment #15

Posted on Jan 18, 2010 by Happy Monkey

(No comment was entered for this change.)

Comment #16

Posted on Jan 18, 2010 by Happy Monkey

(No comment was entered for this change.)

Comment #17

Posted on Jan 18, 2010 by Helpful Camel

Re comment 13.

I like this new proposal very much. It makes a lot of things much easier and simpler to say, and also makes it easier to explain to people. I can see myself using it soon.

I tried to run an example query over this new pattern, e.g. how many foaf:Person instances are in the dataset :DS. To answer this, I can write a SPARQL query as:

SELECT ?number WHERE { ?ds void:class foaf:Person ; void:instances ?number . }

I think it could also be possible to write a query as:

SELECT ?number WHERE { ?ds void:class foaf:Person ; void:triples ?number . }

As we can see here, users can use either void:instances or void:triples. Does this make writing SPARQL queries against the statistics information unpredictable? Do we need to recommend to users which properties to use for different types of /subset/?

Another query I could run would be how many triples matching a given pattern there are in the dataset, for example foaf:mbox triples about foaf:Person instances.

Based on the example given by Richard, I can write the SPARQL query of:

SELECT ?number WHERE { ?ds void:class foaf:Person ; void:propertyBasedSubset ?ds2 . ?ds2 void:property foaf:mbox ; void:triples ?number . }
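
To make the difference between the instance count, the triple count, and the nested property-based count concrete, here is a minimal Python sketch that computes all three from raw triples. The sample data is invented, and the reading that a class-based subset contains every triple whose subject is an instance of the class is an assumption of this sketch, not a quote from the proposal.

```python
RDF_TYPE = "rdf:type"

# Invented sample triples (subject, predicate, object) as prefixed-name strings.
TRIPLES = [
    (":alice", RDF_TYPE, "foaf:Person"),
    (":alice", "foaf:name", '"Alice"'),
    (":alice", "foaf:mbox", "<mailto:alice@example.org>"),
    (":bob", RDF_TYPE, "foaf:Person"),
    (":bob", "foaf:mbox", "<mailto:bob@example.org>"),
    (":bob", "foaf:mbox", "<mailto:bob@example.com>"),
    (":doc", RDF_TYPE, "foaf:Document"),
    (":doc", "foaf:maker", ":alice"),
]


def instances_of(triples, cls):
    """Subjects that are typed with the given class."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == cls}


def class_subset(triples, cls):
    """Assumed semantics: all triples whose subject is an instance of cls."""
    members = instances_of(triples, cls)
    return [t for t in triples if t[0] in members]


def property_subset(triples, prop):
    """All triples that use the given predicate."""
    return [t for t in triples if t[1] == prop]


persons = class_subset(TRIPLES, "foaf:Person")
person_instances = len(instances_of(TRIPLES, "foaf:Person"))
person_triples = len(persons)
mbox_triples = len(property_subset(persons, "foaf:mbox"))

print(person_instances)  # 2 people
print(person_triples)    # 6 triples describing them
print(mbox_triples)      # 3 foaf:mbox triples within that subset
```

Under this reading the two properties answer different questions about the same subset, so the two queries above would return different numbers rather than being interchangeable.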

I like the idea of having void:class and void:property, and the different types of sub-properties of void:subset. However, I am still wondering whether these different types of properties are actually making the model more complex to grasp and making the queries more complex to write. It's just a thought.

I would also like to suggest that, when updating the guide, we explain that we have made the use of SCOVO much simpler in voiD 2.0, and that using SCOVO correctly with voiD would be much more verbose. This might help existing users to understand why we changed the statistics module and how great it is!

Comment #18

Posted on Jun 9, 2010 by Happy Wombat

Checked in a new Section 3: http://code.google.com/p/void-impl/source/detail?r=103

The names of the properties have changed again, now it's void:classPartition and void:propertyPartition.
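
As a hedged illustration of how the renamed properties might be used, the Python sketch below renders per-property and per-class triple counts as Turtle-like text. Only the names void:classPartition and void:propertyPartition come from this comment; pairing them with void:class, void:property and void:triples, the :DS name, the sample data, and the counting semantics are assumptions of this sketch, so check Section 3 for the actual modelling.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Invented sample triples (subject, predicate, object) as prefixed-name strings.
TRIPLES = [
    (":alice", RDF_TYPE, "foaf:Person"),
    (":alice", "foaf:mbox", "<mailto:alice@example.org>"),
    (":bob", RDF_TYPE, "foaf:Person"),
    (":bob", "foaf:mbox", "<mailto:bob@example.org>"),
    (":doc", RDF_TYPE, "foaf:Document"),
]


def describe(dataset, triples):
    """Render total, per-property and per-class triple counts for a dataset."""
    lines = [f"{dataset} a void:Dataset", f"    void:triples {len(triples)}"]

    # One property partition per predicate, counting the triples that use it.
    for prop, n in sorted(Counter(p for _, p, _ in triples).items()):
        lines.append(
            f"    void:propertyPartition [ void:property {prop} ; void:triples {n} ]"
        )

    # One class partition per class, counting triples about its instances
    # (assumed semantics).
    members = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            members[o].add(s)
    for cls, subjects in sorted(members.items()):
        n = sum(1 for s, _, _ in triples if s in subjects)
        lines.append(
            f"    void:classPartition [ void:class {cls} ; void:triples {n} ]"
        )

    return " ;\n".join(lines) + " .\n"


description = describe(":DS", TRIPLES)
print(description)
```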

Comment #19

Posted on Jul 5, 2010 by Grumpy Cat

I am fine with the current sect. 3. Shall we also add examples of void:distinctSubjects, void:distinctObjects and void:documents?

Comment #20

Posted on Sep 8, 2010 by Happy Monkey

As per 2010-09-08 telecon this has been resolved.

void-impl - issue #18
