de.tuebingen.uni.sfs.germanet.relatedness
Class Relatedness

java.lang.Object
  extended by de.tuebingen.uni.sfs.germanet.relatedness.Relatedness

public class Relatedness
extends java.lang.Object

Implements some of the more well-known relatedness measures for GermaNet API version 8.0.
Where paths are involved (all methods except Lesk's and Hirst and St-Onge's), the methods all count edges, i.e. identity = distance 0, parent = 1, sister nodes = 2.
They also all return -1 if the input words belong to different categories, as no useful relatedness measure can be computed in that case. The reason is that GermaNet keeps nouns, verbs and adjectives in different subtrees of the hypernym hierarchy; although these subtrees are connected by a common root node, paths between different categories are overly long and distort the relatedness results.
In the following short summary of the methods, lcs = least common subsumer (LCS) of synsets s1 and s2, dist = distance between two synsets, and IC = information content.

path:
rel(s1,s2) = (max_dist-dist(s1,s2))/max_dist
wuAndPalmer:
rel(s1,s2) = (2*depth(lcs)) / (dist(s1,lcs)+dist(s2,lcs)+2*depth(lcs))
leacockAndChodorow:
rel(s1,s2) = -log(dist(s1,s2)/(2*max_depth))
resnik:
rel(s1,s2) = -log(p(lcs)) = IC(lcs)
lin:
rel(s1,s2) = 2*IC(lcs) / (IC(s1) + IC(s2))
jiangAndConrath:
rel(s1,s2) = max_dist - (IC(s1) + IC(s2) − 2*IC(lcs))
hirstAndStOnge:
rel(s1,s2) = 15
for strong relations
rel(s1,s2) = C-pathLength-k*directionChanges
for medium-strong relations
lesk:
rel(s1,s2) = sum(word_overlap(s1,s2)), extended by related synsets

The javadoc for each method includes the hypothetical minimum and maximum values for that method, which may or may not ever be reached in practice. Values in javadoc are taken from GermaNet API version 7.0 and XML version 6.0 and may not apply to later versions.


Field Summary
static de.tuebingen.uni.sfs.germanet.api.GermaNet gnet
           
static int MAX_DEPTH
           
static int MAX_SHORTEST_PATH
           
static java.lang.String ROOT
           
 
Constructor Summary
Relatedness(de.tuebingen.uni.sfs.germanet.api.GermaNet germanet)
          Constructor taking a GermaNet instance as input.
Relatedness(java.lang.String germanetDirectory)
          Constructor taking the path to a GermaNet-XML directory as input; instantiates a GermaNet object with (germanetDirectory, true), i.e. case insensitive.
 
Method Summary
 RelatednessResult hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2)
          Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms".
 RelatednessResult hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, double c, double k)
          Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms".
 RelatednessResult jiangAndConrath(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)
          Relatedness according to Jiang and Conrath 1997: "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy".
 RelatednessResult leacockAndChodorow(de.tuebingen.uni.sfs.germanet.api.Synset a, de.tuebingen.uni.sfs.germanet.api.Synset b)
          Relatedness according to Leacock and Chodorow 1998: "Combining Local Context and WordNet Similarity for Word Sense Identification".
 RelatednessResult lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Extended Lesk relatedness (original Lesk 1986: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone") using lexical fields (or 'pseudo-glosses'), loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness".
 RelatednessResult lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet, org.tartarus.snowball.SnowballStemmer stemmer, int size, int limit, boolean oneOrthForm, boolean hypernymsOnly, boolean includeGermanetGloss, boolean includeWiktionaryGlosses)
          Extended Lesk relatedness (original Lesk 1986: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone") using lexical fields (or 'pseudo-glosses'), loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness".
 RelatednessResult lin(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)
          Relatedness according to Lin 1998: "An Information-Theoretic Definition of Similarity".
 RelatednessResult path(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2)
          A very simple relatedness measure.
 RelatednessResult resnik(de.tuebingen.uni.sfs.germanet.api.Synset c1, de.tuebingen.uni.sfs.germanet.api.Synset c2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)
          Relatedness according to Resnik 1995: "Using Information Content to Evaluate Semantic Similarity in a Taxonomy".
 void runList(java.lang.String inFile, java.lang.String outFile, java.lang.String separator, java.lang.String encoding, java.lang.String methodName, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet, java.util.HashMap<java.lang.String,java.lang.Long> frequencies, boolean normalized, java.lang.String cat)
          Reads a list of word pairs and writes their relatedness values to the output file.
 void sortList(java.lang.String inFile, java.lang.String separator, java.lang.String encoding, int index)
          Sorts an input csv file by the numeric value in the indicated column; intended for sorting word lists by the relatedness of the word pairs.
 RelatednessResult wuAndPalmer(de.tuebingen.uni.sfs.germanet.api.Synset c1, de.tuebingen.uni.sfs.germanet.api.Synset c2)
          Relatedness/Similarity according to Wu and Palmer, 1994: "Verb Semantics and Lexical Selection"
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

gnet

public static de.tuebingen.uni.sfs.germanet.api.GermaNet gnet

MAX_DEPTH

public static int MAX_DEPTH

MAX_SHORTEST_PATH

public static int MAX_SHORTEST_PATH

ROOT

public static java.lang.String ROOT
Constructor Detail

Relatedness

public Relatedness(de.tuebingen.uni.sfs.germanet.api.GermaNet germanet)
Constructor taking a GermaNet instance as input.

Parameters:
germanet - an instance of the GermaNet class.

Relatedness

public Relatedness(java.lang.String germanetDirectory)
Constructor taking the path to a GermaNet-XML directory as input; instantiates a GermaNet object with (germanetDirectory, true), i.e. case insensitive.

Parameters:
germanetDirectory - path to the GermaNet-XML directory
Method Detail

path

public RelatednessResult path(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                              de.tuebingen.uni.sfs.germanet.api.Synset s2)
A very simple relatedness measure.
Calculates relatedness as a function of the distance between two nodes and the longest possible 'shortest path' between any two nodes:
rel(s1,s2) = (MAX_SHORTEST_PATH - distance(s1,s2)) / MAX_SHORTEST_PATH .
Can only compare words of same class; returns -1 otherwise.

Parameters:
s1 - first synset to be compared
s2 - second synset to be compared
Returns:
a value 0 <= x <= 1, where 1 is identity and 0 is unrelated.
Min = 0/40 = 0 (distance 40 = currently max. 'shortest' path)
Max = 40/40 = 1 (identity)
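
The formula above can be sketched in a few lines of plain Java. This is an illustration only, not the library's implementation: the class name PathMeasureSketch and the method pathRelatedness are hypothetical, and the value 40 for the longest 'shortest path' is taken from the Min/Max example above, so it may differ for other GermaNet versions.

```java
// Minimal sketch of the path formula; not the library's implementation.
class PathMeasureSketch {
    // Assumed value, taken from the Min/Max example in this Javadoc.
    static final int MAX_SHORTEST_PATH = 40;

    /**
     * rel = (maxShortestPath - distance) / maxShortestPath,
     * where distance is the number of edges between the two synsets.
     */
    static double pathRelatedness(int distance, int maxShortestPath) {
        if (distance < 0 || distance > maxShortestPath) {
            throw new IllegalArgumentException("distance out of range: " + distance);
        }
        return (maxShortestPath - distance) / (double) maxShortestPath;
    }

    public static void main(String[] args) {
        System.out.println(pathRelatedness(0, MAX_SHORTEST_PATH));  // identity
        System.out.println(pathRelatedness(2, MAX_SHORTEST_PATH));  // sister nodes
        System.out.println(pathRelatedness(40, MAX_SHORTEST_PATH)); // maximally distant
    }
}
```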

wuAndPalmer

public RelatednessResult wuAndPalmer(de.tuebingen.uni.sfs.germanet.api.Synset c1,
                                     de.tuebingen.uni.sfs.germanet.api.Synset c2)
Relatedness/Similarity according to Wu and Palmer, 1994: "Verb Semantics and Lexical Selection"

ConSim(C1, C2) = (2*N3) / (N1+N2+2*N3)

C1, C2: two synsets
C3: their least common subsumer/'superconcept' (LCS)
N1 = path length C1,C3
N2 = path length C2,C3
N3 = depth of C3

Can only compare words of same class; returns -1 otherwise.

Parameters:
c1 - first synset to be compared
c2 - second synset to be compared
Returns:
a value 0 <= x <= 1, where 1 is identity and 0 is unrelated.
Min = 2*0/(20+20+2*0) = 0 (leaf nodes with LCS=root)
Max = 2*20/(0+0+2*20) = 1 (for leaf node identity)
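
As a rough illustration of ConSim, the following sketch computes the value from the three counts N1, N2 and N3 directly. The class and method names are hypothetical, and the handling of the 0/0 case (root compared with itself) is an assumption rather than documented behaviour.

```java
// Minimal sketch of the Wu and Palmer formula; not the library's implementation.
class WuPalmerSketch {
    /**
     * @param n1 path length from C1 to the least common subsumer C3
     * @param n2 path length from C2 to the least common subsumer C3
     * @param n3 depth of C3 (edges from the root)
     */
    static double conSim(int n1, int n2, int n3) {
        int denominator = n1 + n2 + 2 * n3;
        if (denominator == 0) {
            // Assumption: root compared with itself is treated as identity.
            return 1.0;
        }
        return (2.0 * n3) / denominator;
    }
}
```

For example, two sister nodes whose common parent sits at depth 3 give conSim(1, 1, 3) = 6/8 = 0.75.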

leacockAndChodorow

public RelatednessResult leacockAndChodorow(de.tuebingen.uni.sfs.germanet.api.Synset a,
                                            de.tuebingen.uni.sfs.germanet.api.Synset b)
Relatedness according to Leacock and Chodorow 1998: "Combining Local Context and WordNet Similarity for Word Sense Identification".

rel(a,b) = max [-log(N_p/2D)]
max is only relevant if no unique root node exists, thus here: rel(a,b) = -log(N_p/2D)

N_p = path length from a to b
D = max depth of taxonomy (for xml 6.0: maxDepth = 20, edge counting)
In this implementation, 1 is added to numerator and denominator to avoid -log(0) = infinity (for identity).

Can only compare words of same class; returns -1 otherwise.

Parameters:
a - first synset to be compared
b - second synset to be compared
Returns:
a value 0 <= x <= 3.71, where larger values indicate greater relatedness.
Min = -log((40+1)/(2*20+1)) = 0 (for maximally distant leaf nodes)
Max = -log((0+1)/(2*20+1)) =~ 3.71 (for leaf node identity)
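
The smoothed variant described above (1 added to numerator and denominator) can be sketched as follows. The names are hypothetical; the natural logarithm is assumed because it reproduces the maximum of about 3.71 quoted above for maxDepth = 20.

```java
// Minimal sketch of the smoothed Leacock and Chodorow formula;
// not the library's implementation.
class LeacockChodorowSketch {
    /**
     * rel = -ln((pathLength + 1) / (2 * maxDepth + 1)).
     * Adding 1 on both sides avoids -ln(0) for identical synsets.
     */
    static double relatedness(int pathLength, int maxDepth) {
        if (pathLength < 0 || maxDepth <= 0) {
            throw new IllegalArgumentException("pathLength must be >= 0 and maxDepth > 0");
        }
        return -Math.log((pathLength + 1.0) / (2.0 * maxDepth + 1.0));
    }

    public static void main(String[] args) {
        System.out.println(relatedness(0, 20));  // identity, ln(41)
        System.out.println(relatedness(40, 20)); // maximally distant leaf nodes
    }
}
```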

resnik

public RelatednessResult resnik(de.tuebingen.uni.sfs.germanet.api.Synset c1,
                                de.tuebingen.uni.sfs.germanet.api.Synset c2,
                                java.util.HashMap<java.lang.String,java.lang.Long> freqs)
Relatedness according to Resnik 1995: "Using Information Content to Evaluate Semantic Similarity in a Taxonomy".

rel(c1,c2) = max(c in S(c1,c2)) [-log(p(c))] = IC(c)

where S(c1,c2) is the set of concepts that subsume both c1 and c2.
-> in short: max(lcs)[-log(p(lcs))] = -log(freq(lcs)/rootFreq) .
If there are several LCS (least common subsumers), take the 'most informative' one.

Note that with Resnik's measure, it is possible for a synset to be 'more related' to a different synset with a larger IC (information content) than to itself.
As this measure uses the LCS, it must be counted as somewhat path-based and as such, it also must not be used on synsets of different categories. Returns -1 in that case.

Parameters:
c1 - first concept (synset) to be compared
c2 - second synset to be compared
freqs - HashMap holding the frequencies of all synsets
Returns:
a value 0 <= x < 18.75, where larger values indicate greater relatedness.
Min = -log(1) = -0.0 = -log(freq(root)/freq(root)) = ic(root)
Max = -log(1/freq(root)) =~ 18.748 = ic(least frequent, i.e. most informative)
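
To make the information content computation concrete, here is a small sketch that derives IC from a frequency map and returns the IC of a given LCS. It is not the library's code: the class and method names, the use of plain String keys as synset identifiers, and the toy frequencies are all assumptions made for illustration. The natural logarithm is assumed as well.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of IC and the Resnik value; not the library's implementation.
class ResnikSketch {
    /** IC(s) = -ln(freq(s) / freq(root)). */
    static double informationContent(long freq, long rootFreq) {
        if (freq <= 0 || freq > rootFreq) {
            throw new IllegalArgumentException("need 0 < freq <= rootFreq");
        }
        return -Math.log((double) freq / (double) rootFreq);
    }

    /** Resnik value = IC of the (most informative) least common subsumer. */
    static double resnik(String lcsId, String rootId, Map<String, Long> freqs) {
        Long lcsFreq = freqs.get(lcsId);
        Long rootFreq = freqs.get(rootId);
        if (lcsFreq == null || rootFreq == null) {
            throw new IllegalArgumentException("unknown synset id");
        }
        return informationContent(lcsFreq, rootFreq);
    }

    public static void main(String[] args) {
        // Hypothetical keys and counts, for illustration only.
        Map<String, Long> freqs = new HashMap<>();
        freqs.put("root", 1000L);
        freqs.put("animal", 100L);
        System.out.println(resnik("animal", "root", freqs)); // ln(10)
        System.out.println(resnik("root", "root", freqs));   // IC of the root
    }
}
```

Because the result depends only on the LCS, two rare synsets sharing a rare LCS can score higher than a frequent synset compared with itself, which is the effect noted above.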

lin

public RelatednessResult lin(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                             de.tuebingen.uni.sfs.germanet.api.Synset s2,
                             java.util.HashMap<java.lang.String,java.lang.Long> freqs)
Relatedness according to Lin 1998: "An Information-Theoretic Definition of Similarity".

rel(x1,x2) = 2*log P(C0)/(log P(C1) + log P(C2))

where x1 and x2 are members of the classes C1 and C2 and C0 is the most specific class that subsumes both C1 and C2.
Since -log(p(s)) = ic(s) and the negative signs cancel out,
rel(s1,s2) = 2*ic(lcs)/(ic(s1) + ic(s2))

As this measure uses the LCS, it must be counted as somewhat path-based and as such, it also must not be used on synsets of different categories. Returns -1 in that case.

Parameters:
s1 - first synset to be compared
s2 - second synset to be compared
freqs - HashMap holding the frequencies of all synsets
Returns:
a value 0 <= x <= 1, where 1 is identity and 0 is unrelated.
Min = 0 if lcs is root (ic(root) = 0)
Max = 1 (identity; otherwise, ic(lcs) will be less specific = smaller)
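
Given precomputed IC values, the simplified formula rel(s1,s2) = 2*ic(lcs)/(ic(s1) + ic(s2)) might be sketched like this. The names are hypothetical, and returning 1 when both IC values are 0 (both synsets are the root) is an assumption, not documented behaviour.

```java
// Minimal sketch of the Lin formula on precomputed IC values;
// not the library's implementation.
class LinSketch {
    static double lin(double icLcs, double icS1, double icS2) {
        double denominator = icS1 + icS2;
        if (denominator == 0.0) {
            // Assumption: both synsets are the root, treated as identity.
            return 1.0;
        }
        return 2.0 * icLcs / denominator;
    }
}
```

With ic(lcs) = 1 and ic(s1) = ic(s2) = 2, the result is 2*1/(2+2) = 0.5.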

jiangAndConrath

public RelatednessResult jiangAndConrath(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                                         de.tuebingen.uni.sfs.germanet.api.Synset s2,
                                         java.util.HashMap<java.lang.String,java.lang.Long> freqs)
Relatedness according to Jiang and Conrath 1997: "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy".

rel(c1,c2) = max_dist - (IC(c1)+IC(c2)−2*IC(lcs))

where c1 and c2 are synsets.
The distance measure presented in the paper is turned into a relatedness measure simply by subtracting it from the maximum possible 'distance' (2*max_IC; see Statistics.getMaxJcnValue).

As this measure uses the LCS, it must be counted as somewhat path-based and as such, it also must not be used on synsets of different categories. Returns -1 in that case.

Parameters:
s1 - first synset to be compared
s2 - second synset to be compared
freqs - HashMap holding the frequencies of all synsets
Returns:
a value 0 <= x <= 37.51, where 37.51 is identity and 0 is unrelated.
Min = 0 = max_dist - (2*maxIC - 2*0.0) (for maximally specific, maximally distant leaf nodes)
Max = max_dist - (0+0-2*0.0) =~ 37.51 (identity)
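
The conversion from the Jiang and Conrath distance to a relatedness value can be sketched as below, assuming max_dist = 2*max_IC as stated above. The class and method names are hypothetical, and the IC values are expected to be computed elsewhere.

```java
// Minimal sketch of the Jiang and Conrath conversion; not the library's implementation.
class JiangConrathSketch {
    /**
     * @param icLcs IC of the least common subsumer
     * @param icS1  IC of the first synset
     * @param icS2  IC of the second synset
     * @param maxIc largest IC value in the taxonomy
     */
    static double relatedness(double icLcs, double icS1, double icS2, double maxIc) {
        double maxDist = 2.0 * maxIc;                  // largest possible distance
        double distance = icS1 + icS2 - 2.0 * icLcs;   // distance as defined in the paper
        return maxDist - distance;
    }
}
```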

hirstAndStOnge

public RelatednessResult hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                                        de.tuebingen.uni.sfs.germanet.api.Synset s2)
Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms".

rel(s1,s2) = 15
for strong Relations
rel(s1,s2) = C-pathLength-k*directionChanges
for medium-strong Relations
where by default, C=10 and k=1.

'Strong relation' = synset identity, direct horizontal link or one word (orthForm of one Synset, presumably a compound) containing the other.
'medium-strong relation' = a path with length 5 or less between the two synsets, using all types of relations, but only according to specified patterns (call an upwards relation 'u', downwards 'd', horizontal 'h', then the allowed paths are u+, u+d+, u+h+, u+h+d+, d+, d+h+, h+d+, h+). For more information see the paper by Hirst and St-Onge.

May be used on Synsets of different categories.

Parameters:
s1 - first synset to be compared
s2 - second synset to be compared
Returns:
a value 0 <= x <= 15, where 15 is strongly related and 0 is unrelated.
Min = 0 for no relation
Max = 15 for strong Relation

hirstAndStOnge

public RelatednessResult hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                                        de.tuebingen.uni.sfs.germanet.api.Synset s2,
                                        double c,
                                        double k)
Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms".

rel(s1,s2) = 15
for strong Relations
rel(s1,s2) = C-pathLength-k*directionChanges
for medium-strong Relations
where pathLength is between 0 and 5, directionChanges is between 0 and 2, and C and k are variables such that k,c>=0 and (2*k+5)

'Strong relation' = synset identity, direct horizontal link or one word (orthForm of one Synset, presumably a compound) containing the other.
'medium-strong relation' = a path with length 5 or less between the two synsets, using all types of relations, but only according to specified patterns (call an upwards relation 'u', downwards 'd', horizontal 'h', then the allowed paths are u+, u+d+, u+h+, u+h+d+, d+, d+h+, h+d+, h+, h+u+). For more information see the paper by Hirst and St-Onge.
Note: As GermaNet relations always go both ways, this method only looks for paths in one direction. Thus, the reverse of d+h+, namely h+u+, has been added although the original paper does not allow for it. The reverses of all other paths are already included in the 'allowed' list.

May be used on Synsets of different categories.

Parameters:
s1 - first synset to be compared
s2 - second synset to be compared
c - maximum value for medium strength relations
k - variable to scale number of direction changes
Returns:
a value 0 <= x <= 15, where 15 is strongly related and 0 is unrelated.
Min = 0 for no relation
Max = 15 for strong Relation
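
The medium-strong score can be illustrated with a sketch that takes a path encoded as a string of 'u', 'd' and 'h' steps, checks it against the allowed patterns listed above, and applies C - pathLength - k*directionChanges. Everything here is hypothetical scaffolding: the string encoding, the class and method names, and the clamping of negative results to 0 are assumptions, and finding the path in GermaNet is not shown.

```java
import java.util.regex.Pattern;

// Minimal sketch of the medium-strong score; not the library's implementation.
class HirstStOngeSketch {
    static final double STRONG = 15.0;

    // Allowed patterns from the description above, including the added h+u+.
    private static final Pattern ALLOWED =
            Pattern.compile("u+|u+d+|u+h+|u+h+d+|d+|d+h+|h+d+|h+|h+u+");

    /** Counts positions where the step type differs from the previous step. */
    static int directionChanges(String path) {
        int changes = 0;
        for (int i = 1; i < path.length(); i++) {
            if (path.charAt(i) != path.charAt(i - 1)) {
                changes++;
            }
        }
        return changes;
    }

    /** Returns 0 for paths that are empty, longer than 5, or not allowed. */
    static double mediumStrong(String path, double c, double k) {
        if (path.isEmpty() || path.length() > 5 || !ALLOWED.matcher(path).matches()) {
            return 0.0;
        }
        // Assumption: negative scores are clamped to 0 (unrelated).
        return Math.max(0.0, c - path.length() - k * directionChanges(path));
    }

    public static void main(String[] args) {
        System.out.println(mediumStrong("uud", 10, 1)); // 10 - 3 - 1
        System.out.println(mediumStrong("du", 10, 1));  // not an allowed pattern
    }
}
```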

lesk

public RelatednessResult lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                              de.tuebingen.uni.sfs.germanet.api.Synset s2,
                              de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Extended Lesk relatedness (original Lesk 1986: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone") using lexical fields (or 'pseudo-glosses'), loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness".
This method returns the value computed with the default values using
- all orthForms of each synset
- size = 4 (path length for including related synsets)
- limit = 2 (distance from root inside which synsets are excluded (abstract))
- only using hypernyms (as opposed to using all available relations except hyponyms)
- not including existing GermaNet glosses in lexical field
- using snowball stemmer for German
May be used on Synsets of different categories.

Parameters:
s1 - first synset to be compared
s2 - second synset to be compared
Returns:
a value x >= 0, where 0 is no overlap and greater values indicate greater relatedness.
Min = 0
Max = n.a.

lesk

public RelatednessResult lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                              de.tuebingen.uni.sfs.germanet.api.Synset s2,
                              de.tuebingen.uni.sfs.germanet.api.GermaNet gnet,
                              org.tartarus.snowball.SnowballStemmer stemmer,
                              int size,
                              int limit,
                              boolean oneOrthForm,
                              boolean hypernymsOnly,
                              boolean includeGermanetGloss,
                              boolean includeWiktionaryGlosses)
Extended Lesk relatedness (original Lesk 1986: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone") using lexical fields (or 'pseudo-glosses'), loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness".
May be used on Synsets of different categories.

Parameters:
s1 - first synset to be compared
s2 - second synset to be compared
oneOrthForm - if set to true, only one orthForm of each synset will be used; if false, all forms will be included in the lexical field
size - path length: how many related synsets will be included; if size=0 and includeGloss=true, then Lesk is applied in its original definition, i.e., it compares glosses to compute the similarity
limit - distance from root: how many synset layers will be excluded for being too abstract
hypernymsOnly - if true, use only hypernymy relation; otherwise, use all types of relations except hyponymy
includeGermanetGloss - if true, GermaNet's own glosses will be included in the lexical field where they exist
includeWiktionaryGlosses - if true, optionally loaded Wiktionary glosses will be included in the lexical field where they exist
Returns:
a value x >= 0, where 0 is no overlap and greater values indicate greater relatedness.
Min = 0
Max = n.a.
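
The overlap idea behind Lesk can be sketched on two already-built lexical fields. This is a deliberately simplified illustration: it counts distinct shared words, whereas the library's exact counting, stemming and field construction are not specified here. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of a word-overlap count; not the library's implementation.
class LeskOverlapSketch {
    /** Counts the distinct words that occur in both lexical fields. */
    static int overlap(List<String> field1, List<String> field2) {
        Set<String> shared = new HashSet<>(field1);
        shared.retainAll(new HashSet<>(field2));
        return shared.size();
    }

    public static void main(String[] args) {
        // Hypothetical lexical fields, assumed to be stemmed already.
        List<String> hund = Arrays.asList("hund", "tier", "lebewesen");
        List<String> katze = Arrays.asList("katze", "tier", "lebewesen");
        System.out.println(overlap(hund, katze)); // shares "tier" and "lebewesen"
    }
}
```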

runList

public void runList(java.lang.String inFile,
                    java.lang.String outFile,
                    java.lang.String separator,
                    java.lang.String encoding,
                    java.lang.String methodName,
                    de.tuebingen.uni.sfs.germanet.api.GermaNet gnet,
                    java.util.HashMap<java.lang.String,java.lang.Long> frequencies,
                    boolean normalized,
                    java.lang.String cat)
Reads a list of word pairs and writes their relatedness values to the output file. For lesk and hirstAndStOnge, the default methods will be used. Where more than one synset is found for a word, all combinations are tried and the average of the relatedness values for all pairs of synsets is used.

Parameters:
inFile - csv file
outFile - file to print results to
separator - the char(s) used to separate words in the input file; will also be used on output file
encoding - String indicating the in- and output encoding (UTF8, Cp1252, ISO8859_1, ...)
methodName - name of relatedness measure to use
gnet - an instance of GermaNet
frequencies - for methods resnik, lin, jiangAndConrath; set to null for all others.
normalized - false: original values are used; true: all values mapped to (0..4)
cat - All word categories occurring on the list (n, nv, v, va, nva... n=noun, v=verb, a=adjective)
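
The averaging over all synset combinations described for runList can be sketched generically. The helper below is hypothetical and does not touch files or GermaNet: it takes two lists of senses and any pairwise measure, and returns the mean over all pairs. How the library treats words without synsets or pairs that yield -1 is not covered.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Minimal sketch of averaging a measure over all sense pairs;
// not the library's implementation.
class AverageOverSensesSketch {
    static <S> double average(List<S> senses1, List<S> senses2,
                              ToDoubleBiFunction<S, S> measure) {
        if (senses1.isEmpty() || senses2.isEmpty()) {
            throw new IllegalArgumentException("each word needs at least one sense");
        }
        double sum = 0.0;
        for (S a : senses1) {
            for (S b : senses2) {
                sum += measure.applyAsDouble(a, b);
            }
        }
        return sum / (senses1.size() * senses2.size());
    }
}
```

A word with two senses compared against a word with one sense yields two pair values, and the result is their mean.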

sortList

public void sortList(java.lang.String inFile,
                     java.lang.String separator,
                     java.lang.String encoding,
                     int index)
Sorts an input csv file by the numeric value in the indicated column; intended for sorting word lists by the relatedness of the word pairs. Deletes the line "Word1 [separator] Word2 ...", i.e. the header line, if present. Does not allow for double entries (same pair of words twice).

Parameters:
inFile - File to be sorted
separator - the char(s) used to separate words in the input file; will also be used on output file
encoding - String indicating the in- and output encoding (UTF8, Cp1252, ISO8859_1, ...)
index - position of the numeric value in each line (0,1,2,...)