Updated Jan 17, 2009 by amokrane.belloui
Labels: Phase-Implementation, Module-Semsim
ConfigureSimilarityMeasures  
Configuring similarity measures

Similarity definitions with Spring

You can use the provided Spring namespace to define your measures easily. Your configuration file should look something like this:

<?xml version="1.0" encoding="UTF-8" ?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:ss="http://www.gusto.com/schema/semsim"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd
           http://www.gusto.com/schema/semsim http://www.gusto.com/schema/semsim.xsd">

  <!-- Define a similarity measure, here a random value -->
  <ss:randomValue id="randomValueSim" />
	
</beans>

Measures

You can define as many measures as you want and combine them using composed, compound and property similarities.

Identity

Returns 1.0 if the two resources or values are identical and 0.0 otherwise. You can also declare a number of stop words, to which the identity measure does not apply.

<ss:identity id="identity">
  <ss:stopwords>
    <ss:stopword>other</ss:stopword>
    <ss:stopword>sample</ss:stopword>
  </ss:stopwords>
</ss:identity>
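
The behaviour described for the identity measure can be sketched in a few lines of Python. This is only an illustration, not the library's code: the function name is hypothetical, and the choice to return 0.0 whenever one side is a stop word is an assumption, since the page only says that identity "doesn't apply" to stop words.

```python
def identity_similarity(a, b, stopwords=frozenset()):
    """Return 1.0 when a and b are equal, 0.0 otherwise.

    Assumption: a value listed as a stop word never matches, not even itself.
    """
    if a in stopwords or b in stopwords:
        return 0.0
    return 1.0 if a == b else 0.0


# Stop words taken from the ss:identity example above.
STOPWORDS = frozenset({"other", "sample"})
print(identity_similarity("drama", "drama", STOPWORDS))
print(identity_similarity("other", "other", STOPWORDS))
```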

Intervals

<ss:interval id="interval">
  <ss:entries>
    <ss:intervalEntry from="0" to="2" sim="1" />
    <ss:intervalEntry from="2" to="5" sim="0.82" />
    <ss:intervalEntry from="5" to="9" sim="0.6" />
    <ss:intervalEntry from="9" to="15" sim="0.4" />
    <ss:intervalEntry from="15" to="20" sim="0.2" />
  </ss:entries>
</ss:interval>
	
<ss:interval id="runtimeInterval">
  <ss:entries>
    <ss:intervalEntry from="0" to="10" sim="1" />
    <ss:intervalEntry from="10" to="25" sim="0.81" />
    <ss:intervalEntry from="25" to="40" sim="0.6" />
  </ss:entries>
</ss:interval>

<ss:dateInterval id="releaseDateInterval" unit="year">
  <ss:entries>
    <ss:intervalEntry from="0" to="2" sim="0.76" />
    <ss:intervalEntry from="2" to="5" sim="0.5" />
    <ss:intervalEntry from="5" to="15" sim="0.35" />
    <ss:intervalEntry from="15" to="30" sim="0.2" />
  </ss:entries>
</ss:dateInterval>
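
The page does not spell out how interval entries are matched. One plausible reading, sketched below in Python, is that the measure takes the absolute difference between the two values and returns the sim of the entry whose range contains it. The half-open ranges [from, to), the 0.0 fallback and the function name are all assumptions made for this illustration. A dateInterval would presumably do the same on a difference expressed in the configured unit.

```python
def interval_similarity(x, y, entries):
    """Look up the similarity of two numbers from a list of interval entries.

    entries: (lower, upper, sim) tuples mirroring ss:intervalEntry.
    Assumptions: the lookup key is abs(x - y), ranges are half-open
    [lower, upper), and a distance outside every range scores 0.0.
    """
    distance = abs(x - y)
    for lower, upper, sim in entries:
        if lower <= distance < upper:
            return sim
    return 0.0


# Entries copied from the "interval" bean above.
AGE_ENTRIES = [(0, 2, 1.0), (2, 5, 0.82), (5, 9, 0.6), (9, 15, 0.4), (15, 20, 0.2)]
print(interval_similarity(30, 33, AGE_ENTRIES))
```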

String

<ss:jaroWinkler id="jarowinklers" />
<ss:wordnet id="wns" firstWordOnly="false" 
            wordnetConfig="${wordnet.config}" 
            infocontent="${wordnet.infocontent}" 
            mapping="${wordnet.mapping}" />

You can set the three WordNet parameters directly in the bean definition, or reference them through placeholders as above and define them in a properties file:

wordnet.config=config/wordnet/wordnet.xml
wordnet.infocontent=file:config/wordnet/ic-bnc-resnik-add1.dat
wordnet.mapping=file:config/wordnet/domain_independent.txt
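
For readers unfamiliar with the jaroWinkler measure declared above, here is a minimal Python sketch of the standard Jaro-Winkler formula. It is not the implementation used by the module. The prefix scale of 0.1 and the prefix cap of 4 are the usual textbook defaults, assumed here rather than taken from this page.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```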

Domain Specific

Measures that are specific to one type of data, such as ZIP codes.

zipCode computes the similarity between two ZIP codes. It is currently designed for 5-digit codes: level1 is the similarity when all 5 digits are the same, level2 when 4 digits are the same, and so on.

<ss:zipCode id="zip" level1="1.0" level2="0.71" level3="0.61" level4="0.47" level5="0.21" />
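
The level lookup can be sketched as follows in Python. The page does not say which digits are compared, so this sketch assumes that "the same digits" means a shared leading prefix, which is how ZIP codes usually encode geographic proximity. The function name and the 0.0 result for codes with no common first digit are also assumptions.

```python
def zip_code_similarity(zip1, zip2, levels=(1.0, 0.71, 0.61, 0.47, 0.21)):
    """Score two 5-character codes by the length of their shared prefix.

    levels[0] is level1 (5 shared digits) ... levels[4] is level5 (1 shared digit).
    Assumption: digits are compared from the left, and no shared digit means 0.0.
    """
    if len(zip1) != 5 or len(zip2) != 5:
        raise ValueError("this sketch only handles 5-character codes")
    shared = 0
    for a, b in zip(zip1, zip2):
        if a != b:
            break
        shared += 1
    if shared == 0:
        return 0.0
    return levels[5 - shared]


# Default levels are the ones from the ss:zipCode example above.
print(zip_code_similarity("75011", "75012"))
```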

Matrix

In the first example, the matrix is defined inline.

<ss:matrix id="mpaaMatrix2" prefix="http://www.ini-cerist.dz/movie-lens.owl#">
  <ss:entries>
    <ss:matrixEntry val1="PG" val2="PG-13" sim="0.7" />
    <ss:matrixEntry val1="R" val2="NC-17" sim="0.8" />
    <ss:matrixEntry val1="R" val2="PG-13" sim="0.3" />
  </ss:entries>
</ss:matrix>

If the matrix has too many entries, it is better to move them to an external file. Here it is a classpath resource named matrix-mpaa.properties containing:

PG***PG-13=0.7
R***NC-17=0.81
R***PG-13=0.3

Notice that you can choose the separator between the two dimensions with the fileSeparator attribute. Here we have chosen ***.

<ss:matrix id="mpaaMatrix" file="classpath:config/movielens/matrix-mpaa.properties" fileSeparator="***" prefix="">
  <ss:stopwords>
    <ss:stopword>other</ss:stopword>
  </ss:stopwords>
</ss:matrix>

Stop words can also be defined, as in the example above.
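
To make the file format concrete, here is a hedged Python sketch that parses lines such as PG***PG-13=0.7 and looks up a pair of values. The function names are hypothetical. Treating the matrix as symmetric, scoring identical values as 1.0 and unknown pairs as 0.0 are assumptions, since the page does not state them.

```python
def load_matrix(lines, separator="***", prefix=""):
    """Parse 'val1<separator>val2=sim' lines into a lookup table."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        pair, _, sim = line.rpartition("=")
        val1, _, val2 = pair.partition(separator)
        # frozenset keys make the lookup independent of argument order (assumed).
        key = frozenset((prefix + val1.strip(), prefix + val2.strip()))
        table[key] = float(sim)
    return table


def matrix_similarity(table, a, b, stopwords=frozenset()):
    """Assumptions: stop words score 0.0, equal values 1.0, unknown pairs 0.0."""
    if a in stopwords or b in stopwords:
        return 0.0
    if a == b:
        return 1.0
    return table.get(frozenset((a, b)), 0.0)


# Content of matrix-mpaa.properties as shown above.
MPAA = load_matrix(["PG***PG-13=0.7", "R***NC-17=0.81", "R***PG-13=0.3"])
print(matrix_similarity(MPAA, "PG-13", "PG"))
```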

Sets

jaccardBinary can be defined on values (first example) or on resources (second example).

<ss:jaccardBinary id="vjss" type="VALUE">
  <ss:stopwords>
    <ss:stopword>other</ss:stopword>
    <ss:stopword>misc</ss:stopword>
  </ss:stopwords>
</ss:jaccardBinary>

<ss:jaccardBinary id="rjss" type="RESOURCE">
  <ss:stopwords>
    <ss:stopword>other</ss:stopword>
    <ss:stopword>misc</ss:stopword>
  </ss:stopwords>
</ss:jaccardBinary>
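
As a reminder of what the Jaccard coefficient computes, here is a short Python sketch: the size of the intersection divided by the size of the union. Dropping stop words before the comparison and returning 0.0 for two empty sets are assumptions of this illustration, as is the function name.

```python
def jaccard_binary(set1, set2, stopwords=frozenset()):
    """Jaccard coefficient |A & B| / |A | B| after removing stop words."""
    a = set(set1) - set(stopwords)
    b = set(set2) - set(stopwords)
    union = a | b
    if not union:
        return 0.0  # assumption: nothing left to compare
    return len(a & b) / len(union)


# Stop words taken from the ss:jaccardBinary examples above.
STOPWORDS = frozenset({"other", "misc"})
print(jaccard_binary({"war", "love", "other"}, {"war", "spy"}, STOPWORDS))
```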

The similarity measure used by ressemblance is set via the similarity attribute. The first example uses the language similarity and the second uses rvs. Both must be defined elsewhere in the configuration.

<ss:ressemblance id="languagesRess" type="RESOURCE" similarity="language">
  <ss:stopwords>
    <ss:stopword>other</ss:stopword>
    <ss:stopword>misc</ss:stopword>
  </ss:stopwords>
</ss:ressemblance>

<ss:ressemblance id="rRESSEBLANCEss" type="RESOURCE" similarity="rvs" />
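
The page does not say how ressemblance combines the inner similarity over the two sets. The Python sketch below shows one common way to generalise set overlap with a graded element similarity: each element is paired with its best match in the other set and the scores are averaged. This aggregation is an assumption chosen for illustration and may differ from what the module does. The function name is hypothetical.

```python
def ressemblance(set1, set2, similarity, stopwords=frozenset()):
    """Average best-match score between two collections (assumed aggregation).

    similarity: any function (x, y) -> float in [0, 1], for example an
    edge counting or identity measure.
    """
    a = [x for x in set1 if x not in stopwords]
    b = [y for y in set2 if y not in stopwords]
    if not a or not b:
        return 0.0
    best_a = sum(max(similarity(x, y) for y in b) for x in a)
    best_b = sum(max(similarity(x, y) for x in a) for y in b)
    return (best_a + best_b) / (len(a) + len(b))


def exact(x, y):
    """Stand-in element similarity used only for this example."""
    return 1.0 if x == y else 0.0


print(ressemblance(["fr", "en"], ["fr"], exact))
```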

Edge Counting

Edge counting derives the similarity of two resources from their positions in a hierarchy of terms. The implemented method is 'Wu & Palmer'.

An edgeCounting measure is described by its maximal depth (8 here), its parent properties (the properties used to navigate to the parent element), and its stop resources (ids of special resources that are not treated as regular resources and so are excluded from the process).

<ss:edgeCounting id="language" depth="8">
  <ss:parents>
    <ss:parent>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</ss:parent>
    <ss:parent>http://www.w3.org/2000/01/rdf-schema#subClassOf</ss:parent>
  </ss:parents>
  <ss:stops>
    <ss:stop>http://www.w3.org/2000/01/rdf-schema#Resource</ss:stop>
    <ss:stop>http://www.w3.org/2002/07/owl#Class</ss:stop>
    <ss:stop>http://www.w3.org/2000/01/rdf-schema#Class</ss:stop>
  </ss:stops>
</ss:edgeCounting>
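
The Wu & Palmer score is 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest ancestor shared by both resources. Below is a minimal Python sketch on a single-parent hierarchy. The parent map, the toy language hierarchy and the function names are made up for the example, and using the depth setting as a cap on upward traversal is an assumption about what that attribute means.

```python
def path_to_root(node, parent_of, stops=frozenset(), max_depth=8):
    """Return [node, parent, grandparent, ...], stopping at stop resources."""
    path = [node]
    while len(path) < max_depth:  # assumed role of depth: cap the climb
        parent = parent_of.get(path[-1])
        if parent is None or parent in stops:
            break
        path.append(parent)
    return path


def wu_palmer(a, b, parent_of, stops=frozenset(), max_depth=8):
    """2 * depth(lcs) / (depth(a) + depth(b)), with the root at depth 1."""
    path_a = path_to_root(a, parent_of, stops, max_depth)
    path_b = path_to_root(b, parent_of, stops, max_depth)
    ancestors_b = set(path_b)
    # First node on a's path that is also on b's path: the deepest common one.
    lcs = next((n for n in path_a if n in ancestors_b), None)
    if lcs is None:
        return 0.0
    depth_lcs = len(path_a) - path_a.index(lcs)
    return 2.0 * depth_lcs / (len(path_a) + len(path_b))


# Hypothetical hierarchy: child -> parent.
PARENTS = {
    "French": "Romance", "Spanish": "Romance", "Romance": "Language",
    "English": "Germanic", "Germanic": "Language",
}
print(wu_palmer("French", "Spanish", PARENTS))
```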

You can simplify the configuration by using a stereotype. If you define your measure with the SEMANTIC stereotype, the properties are injected automatically.

<ss:edgeCounting id="languageBis" depth="8" stereotype="SEMANTIC">
  <!-- No need to define the properties -->
  <!-- You can specify extra properties, in addition to the stereotype ones -->
</ss:edgeCounting>

Composition

Each param consists of a property, a type (Value|Resource|List), a weight and the similarity to apply to the property value. The examples below use the type values VALUE, SET and RESOURCE.

<ss:composed id="movieSim">
  <ss:composedParam type="VALUE" weight="7" similarity="jarowinklers" property="hasTitle" />
  <ss:composedParam type="VALUE" weight="4" similarity="jarowinklers" property="hasAlternativeTitle" />
		
  <ss:composedParam type="VALUE" weight="2" similarity="jarowinklers" property="hasTagline" />
  <ss:composedParam type="VALUE" weight="1" similarity="jarowinklers" property="hasPlotOutline" />
		
  <ss:composedParam type="SET" weight="3" similarity="vjss" property="hasKeyWords" />
		
  <ss:composedParam type="VALUE" weight="3" similarity="releaseDateInterval" property="hasReleaseDate" />
  <ss:composedParam type="VALUE" weight="2" similarity="runtimeInterval" property="hasRuntime" />
		
  <ss:composedParam type="SET" weight="5" similarity="languagesRess" property="hasLanguage" />
		
  <ss:composedParam type="SET" weight="1" similarity="rjss" property="hasColor" />
  <ss:composedParam type="SET" weight="9" similarity="rjss" property="hasGenre" />
		
  <ss:composedParam type="RESOURCE" weight="3" similarity="mpaaMatrix" property="hasMPAA" />
  <ss:composedParam type="RESOURCE" weight="3" similarity="identity" property="hasCompany" />
  <ss:composedParam type="RESOURCE" weight="1" similarity="identity" property="hasAspectRation" />
</ss:composed>
<ss:composed id="movieGenreSim">
  <ss:composedParam type="SET" weight="2" similarity="rjss" property="hasGenre" />
</ss:composed>
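
How the weights are combined is not stated on this page. A natural reading is a weighted average of the per-property similarities, sketched here in Python. Normalising by the sum of the weights and skipping properties that are missing on either object are assumptions of this sketch, and the function and variable names are hypothetical.

```python
def composed_similarity(obj1, obj2, params):
    """Weighted average of per-property similarities (assumed combination).

    params: (property_name, weight, similarity_function) tuples,
    mirroring the attributes of ss:composedParam.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for prop, weight, similarity in params:
        if prop not in obj1 or prop not in obj2:
            continue  # assumption: ignore properties missing on either side
        weighted_sum += weight * similarity(obj1[prop], obj2[prop])
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


def exact(x, y):
    """Stand-in for a configured measure such as identity."""
    return 1.0 if x == y else 0.0


MOVIE_PARAMS = [("hasTitle", 7, exact), ("hasMPAA", 3, exact)]
alien = {"hasTitle": "Alien", "hasMPAA": "R"}
aliens = {"hasTitle": "Aliens", "hasMPAA": "R"}
print(composed_similarity(alien, aliens, MOVIE_PARAMS))
```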

Property

A property similarity is defined by its type (Value|Resource|List) and the similarity that is applied to the property value.

In this example, the similarity is based on a single property, hasTitle, to which the jarowinklers similarity is applied.

<ss:property id="movie2Sim" type="VALUE" similarity="jarowinklers" property="hasTitle" />

Compound

A compound similarity that will be applied to an object of type User.

<ss:compound id="zipcountrySim" similarity="zip">
  <ss:property name="country" property="hasCountry" />
  <ss:property name="zip" property="hasZipCode" />
</ss:compound>

A composed similarity that integrates several properties plus the compound one defined above.

<ss:composed id="userSim">
  <ss:composedParam type="RESOURCE" weight="4" similarity="identity" property="hasOccupation" />
  <ss:composedParam type="VALUE" weight="1" similarity="identity" property="hasSex" />
  <ss:composedParam type="VALUE" weight="3" similarity="interval" property="hasAge" />
  <ss:composedParam type="VALUE" weight="3" similarity="zip" property="hasZipCode" />
  <!-- Integrating the compound similarity -->
  <ss:composedParam type="RESOURCE" similarity="zipcountrySim" />
</ss:composed>

Notice that composedParam allows to define

