Relatedness

de.tuebingen.uni.sfs.germanet.relatedness
Class Relatedness

java.lang.Object
  de.tuebingen.uni.sfs.germanet.relatedness.Relatedness

public class Relatedness
extends java.lang.Object

Implements some of the more well-known relatedness measures for GermaNet API version 8.0.
Where paths are involved (all methods but Lesk's and Hirst and St-Onge's), the methods all do edge counting, i.e. identity = distance 0, parent = 1, sister nodes = 2.
They also all return -1 if the input words have different categories, as no useful relatedness measure can be computed in that case. The reason is that GermaNet keeps nouns, verbs and adjectives in different subtrees of the hypernym hierarchy. The subtrees are connected by a common root node, but paths between different categories are overly long and distort the relatedness results.
In the following short summary of the methods, LCS= least common subsumer of synsets s1 and s2, dist = distance between two synsets.

path:
rel(s1,s2) = (max_dist-dist(s1,s2))/max_dist
wuAndPalmer:
rel(s1,s2) = (2*depth(lcs)) / (dist(s1,lcs)+dist(s2,lcs)+2*depth(lcs))
leacockAndChodorow:
rel(s1,s2) = -log(dist(s1,s2)/(2*max_depth))
resnik:
rel(s1,s2) = -log(p(lcs)) = IC(lcs)
lin:
rel(s1,s2) = 2*IC(lcs) / (IC(s1) + IC(s2))
jiangAndConrath:
rel(s1,s2) = max_dist - (IC(s1) + IC(s2) - 2*IC(lcs))
hirstAndStOnge:
rel(s1,s2) = 15 for strong relations
rel(s1,s2) = C-pathLength-k*directionChanges for medium-strong relations
lesk:
rel(s1,s2) = sum(word_overlap(s1,s2)), extended by related synsets

The javadoc for each method includes the hypothetical minimum and maximum values for that method, which may or may not ever be reached in practice. Values in javadoc are taken from GermaNet API version 7.0 and XML version 6.0 and may not apply to later versions.

Field Summary
`static de.tuebingen.uni.sfs.germanet.api.GermaNet`	`gnet`
`static int`	`MAX_DEPTH`
`static int`	`MAX_SHORTEST_PATH`
`static java.lang.String`	`ROOT`

Constructor Summary
`Relatedness(de.tuebingen.uni.sfs.germanet.api.GermaNet germanet)` Constructor taking a GermaNet instance as input.
`Relatedness(java.lang.String germanetDirectory)` Constructor taking the path to a GermaNet-XML directory as input; instantiates a GermaNet object with (germanetDirectory, true), i.e. case insensitive.

Method Summary
`RelatednessResult`	`hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2)` Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms".
`RelatednessResult`	`hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, double c, double k)` Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms".
`RelatednessResult`	`jiangAndConrath(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)` Relatedness according to Jiang and Conrath 1997: "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy".
`RelatednessResult`	`leacockAndChodorow(de.tuebingen.uni.sfs.germanet.api.Synset a, de.tuebingen.uni.sfs.germanet.api.Synset b)` Relatedness according to Leacock and Chodorow 1998: "Combining Local Context and WordNet Similarity for Word Sense Identification".
`RelatednessResult`	`lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)` Extended Lesk relatedness (original Lesk 1986: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone") using lexical fields (or 'pseudo-glosses'), loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness".
`RelatednessResult`	`lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet, org.tartarus.snowball.SnowballStemmer stemmer, int size, int limit, boolean oneOrthForm, boolean hypernymsOnly, boolean includeGermanetGloss, boolean includeWiktionaryGlosses)` Extended Lesk relatedness (original Lesk 1986: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone") using lexical fields (or 'pseudo-glosses'), loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness".
`RelatednessResult`	`lin(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)` Relatedness according to Lin 1998: "An Information-Theoretic Definition of Similarity".
`RelatednessResult`	`path(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2)` A very simple relatedness measure.
`RelatednessResult`	`resnik(de.tuebingen.uni.sfs.germanet.api.Synset c1, de.tuebingen.uni.sfs.germanet.api.Synset c2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)` Relatedness according to Resnik 1995: "Using Information Content to Evaluate Semantic Similarity in a Taxonomy".
`void`	`runList(java.lang.String inFile, java.lang.String outFile, java.lang.String separator, java.lang.String encoding, java.lang.String methodName, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet, java.util.HashMap<java.lang.String,java.lang.Long> frequencies, boolean normalized, java.lang.String cat)` Reads a list of word pairs and writes their relatedness values to the output file.
`void`	`sortList(java.lang.String inFile, java.lang.String separator, java.lang.String encoding, int index)` Sorts an input csv file by the numeric value in the indicated column; intended for sorting word lists by the relatedness of the word pairs.
`RelatednessResult`	`wuAndPalmer(de.tuebingen.uni.sfs.germanet.api.Synset c1, de.tuebingen.uni.sfs.germanet.api.Synset c2)` Relatedness/Similarity according to Wu and Palmer, 1994: "Verb Semantics and Lexical Selection"

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

gnet

public static de.tuebingen.uni.sfs.germanet.api.GermaNet gnet

MAX_DEPTH

public static int MAX_DEPTH

MAX_SHORTEST_PATH

public static int MAX_SHORTEST_PATH

ROOT

public static java.lang.String ROOT

Constructor Detail

Relatedness

public Relatedness(de.tuebingen.uni.sfs.germanet.api.GermaNet germanet)

Constructor taking a GermaNet instance as input.

Parameters: germanet - an instance of the GermaNet class.

Relatedness

public Relatedness(java.lang.String germanetDirectory)

Constructor taking the path to a GermaNet-XML directory as input; instantiates a GermaNet object with (germanetDirectory, true), i.e. case insensitive.

Parameters: germanetDirectory - path to the GermaNet-XML directory.

Method Detail

path

public RelatednessResult path(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                              de.tuebingen.uni.sfs.germanet.api.Synset s2)

A very simple relatedness measure.
Calculates relatedness as a function of the distance between two nodes and the longest possible 'shortest path' between any two nodes:
rel(s1,s2) = (MAX_SHORTEST_PATH - distance(s1,s2)) / MAX_SHORTEST_PATH.
Can only compare words of same class; returns -1 otherwise.

Parameters: s1 - first synset to be compared; s2 - second synset to be compared
Returns: a value 0 <= x <= 1, where 1 is identity and 0 is unrelated.
Min = 0/40 = 0 (distance 40 = currently max. 'shortest' path)
Max = 40/40 = 1 (identity)
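The formula can be checked by hand with a minimal sketch. The class and method names below are hypothetical and not part of the GermaNet API. The distance is passed in directly rather than computed from GermaNet, and the constant 40 is taken from the Min/Max values above.

```java
final class PathSketch {
    // Hypothetical stand-in for MAX_SHORTEST_PATH; 40 is the value quoted above.
    static final int MAX_SHORTEST_PATH = 40;

    // rel = (MAX_SHORTEST_PATH - distance) / MAX_SHORTEST_PATH, with edge counting.
    static double path(int distance) {
        if (distance < 0 || distance > MAX_SHORTEST_PATH) {
            throw new IllegalArgumentException("distance out of range: " + distance);
        }
        return (MAX_SHORTEST_PATH - distance) / (double) MAX_SHORTEST_PATH;
    }

    public static void main(String[] args) {
        System.out.println(path(0));  // identity
        System.out.println(path(2));  // sister nodes
        System.out.println(path(40)); // maximally distant synsets
    }
}
```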

wuAndPalmer

public RelatednessResult wuAndPalmer(de.tuebingen.uni.sfs.germanet.api.Synset c1,
                                     de.tuebingen.uni.sfs.germanet.api.Synset c2)

Relatedness/Similarity according to Wu and Palmer, 1994: "Verb Semantics and Lexical Selection"

ConSim(C1, C2) = (2*N3) / (N1+N2+2*N3)

C1, C2: two synsets
C3: their least common subsumer/'superconcept' (LCS)
N1 = path length C1,C3
N2 = path length C2,C3
N3 = depth of C3

Can only compare words of same class; returns -1 otherwise.

Parameters: c1 - first synset to be compared; c2 - second synset to be compared
Returns: a value 0 <= x <= 1, where 1 is identity and 0 is unrelated.
Min = 2*0/(20+20+2*0) = 0 (leaf nodes with LCS=root)
Max = 2*20/(0+0+2*20) = 1 (for leaf node identity)
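As an illustration of the ConSim formula, the following sketch takes N1, N2 and N3 as plain integers. The names are hypothetical, and the handling of the case where both synsets are the root (denominator 0) is an assumption of this sketch, not documented behaviour.

```java
final class WuPalmerSketch {
    // n1, n2: path lengths from each synset to the LCS; n3: depth of the LCS (root = 0).
    static double wuAndPalmer(int n1, int n2, int n3) {
        int denominator = n1 + n2 + 2 * n3;
        if (denominator == 0) {
            // Both synsets are the root itself; treated as identity here (assumption).
            return 1.0;
        }
        return (2.0 * n3) / denominator;
    }

    public static void main(String[] args) {
        System.out.println(wuAndPalmer(20, 20, 0)); // leaf nodes whose LCS is the root
        System.out.println(wuAndPalmer(0, 0, 20));  // identical leaf nodes
        System.out.println(wuAndPalmer(1, 1, 3));   // sister nodes below an LCS at depth 3
    }
}
```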

leacockAndChodorow

public RelatednessResult leacockAndChodorow(de.tuebingen.uni.sfs.germanet.api.Synset a,
                                            de.tuebingen.uni.sfs.germanet.api.Synset b)

Relatedness according to Leacock and Chodorow 1998: "Combining Local Context and WordNet Similarity for Word Sense Identification".

rel(a,b) = max [-log(N_p/(2*D))]
max is only relevant if no unique root node exists, thus here: rel(a,b) = -log(N_p/(2*D))

N_p = path length from a to b
D = max depth of taxonomy (for xml 6.0: maxDepth = 20, edge counting)
In this implementation, 1 is added to numerator and denominator to avoid -log(0) = infinity (for identity).

Can only compare words of same class; returns -1 otherwise.

Parameters: a - first synset to be compared; b - second synset to be compared
Returns: a value 0 <= x <= 3.71 (approximately), where larger values indicate greater relatedness.
Min = -log((40+1)/(2*20+1)) = 0 (for maximally distant leaf nodes)
Max = -log((0+1)/(2*20+1)) =~ 3.71 (for leaf node identity)
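The smoothed formula and its bounds can be reproduced with a short sketch. It assumes the natural logarithm (which matches the quoted maximum of about 3.71) and a maximum depth of 20; the class and method names are hypothetical.

```java
final class LeacockChodorowSketch {
    // Hypothetical stand-in for MAX_DEPTH; 20 is the value quoted above for XML 6.0.
    static final int MAX_DEPTH = 20;

    // -log((N_p + 1) / (2*D + 1)); the +1 terms avoid -log(0) for identical synsets.
    static double leacockAndChodorow(int pathLength) {
        return -Math.log((pathLength + 1.0) / (2.0 * MAX_DEPTH + 1.0));
    }

    public static void main(String[] args) {
        System.out.println(leacockAndChodorow(40)); // maximally distant leaf nodes
        System.out.println(leacockAndChodorow(0));  // identity
    }
}
```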

resnik

public RelatednessResult resnik(de.tuebingen.uni.sfs.germanet.api.Synset c1,
                                de.tuebingen.uni.sfs.germanet.api.Synset c2,
                                java.util.HashMap<java.lang.String,java.lang.Long> freqs)

Relatedness according to Resnik 1995: "Using Information Content to Evaluate Semantic Similarity in a Taxonomy".

rel(c1,c2) = max(c in S(c1,c2)) [-log(p(c))] = IC(c)

where S(c1,c2) is the set of concepts that subsume both c1 and c2.
In short: max(lcs)[-log(p(lcs))], where p(lcs) = freq(lcs)/rootFreq.
If there are several LCS (least common subsumers), take the 'most informative' one.

Note that with Resnik's measure, it is possible for a synset to be 'more related' to a different synset with a larger IC (information content) than to itself.
As this measure uses the LCS, it must be counted as somewhat path-based and as such, it also must not be used on synsets of different categories. Returns -1 in that case.

Parameters: c1 - first concept (synset) to be compared; c2 - second synset to be compared; freqs - HashMap holding the frequencies of all synsets
Returns: a value 0 <= x < 18.75, where larger values indicate greater relatedness.
Min = -log(freq(root)/freq(root)) = -log(1) = 0 = ic(root)
Max = -log(1/freq(root)) =~ 18.748 = ic(least frequent, i.e. most informative)
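A minimal sketch of the information content computation follows. It assumes the natural logarithm and takes the frequencies of the candidate subsumers directly instead of looking them up in the freqs map; all names are hypothetical.

```java
final class ResnikSketch {
    // IC(c) = -log(freq(c) / freq(root))
    static double ic(long freq, long rootFreq) {
        return -Math.log((double) freq / rootFreq);
    }

    // With several least common subsumers, the most informative one (largest IC) wins.
    static double resnik(long[] lcsFreqs, long rootFreq) {
        double best = 0.0;
        for (long freq : lcsFreqs) {
            best = Math.max(best, ic(freq, rootFreq));
        }
        return best;
    }

    public static void main(String[] args) {
        // Two candidate subsumers with frequencies 100 and 10, root frequency 1000.
        System.out.println(resnik(new long[] {100, 10}, 1000));
    }
}
```

Because the result depends only on the subsumer, two distinct but rare synsets with a rare common subsumer can score higher than a frequent synset compared with itself.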

lin

public RelatednessResult lin(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                             de.tuebingen.uni.sfs.germanet.api.Synset s2,
                             java.util.HashMap<java.lang.String,java.lang.Long> freqs)

Relatedness according to Lin 1998: "An Information-Theoretic Definition of Similarity".

rel(x1,x2) = 2*log P(C0)/(log P(C1) + log P(C2))

where x1 and x2 are members of the classes C1 and C2 and C0 is the most specific class that subsumes both C1 and C2.
Since -log(p(s)) = ic(s) and the negative signs cancel out,
rel(s1,s2) = 2*ic(lcs)/(ic(s1) + ic(s2))

As this measure uses the LCS, it must be counted as somewhat path-based and as such, it also must not be used on synsets of different categories. Returns -1 in that case.

Parameters: s1 - first synset to be compared; s2 - second synset to be compared; freqs - HashMap holding the frequencies of all synsets
Returns: a value 0 <= x <= 1, where 1 is identity and 0 is unrelated.
Min = 0 if lcs is root (ic(root) = 0)
Max = 1 (identity; otherwise, ic(lcs) will be less specific = smaller)
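The ratio can be sketched on precomputed information content values. The names are hypothetical, and returning 1 when both synsets have IC 0 (both are the root) is an assumption of this sketch.

```java
final class LinSketch {
    // rel = 2*IC(lcs) / (IC(s1) + IC(s2))
    static double lin(double icS1, double icS2, double icLcs) {
        double denominator = icS1 + icS2;
        if (denominator == 0.0) {
            // Both synsets carry no information (root); treated as identity here (assumption).
            return 1.0;
        }
        return 2.0 * icLcs / denominator;
    }

    public static void main(String[] args) {
        System.out.println(lin(4.0, 6.0, 2.0)); // partially related
        System.out.println(lin(3.0, 3.0, 3.0)); // identity: the LCS is the synset itself
        System.out.println(lin(4.0, 6.0, 0.0)); // LCS is the root
    }
}
```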

jiangAndConrath

public RelatednessResult jiangAndConrath(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                                         de.tuebingen.uni.sfs.germanet.api.Synset s2,
                                         java.util.HashMap<java.lang.String,java.lang.Long> freqs)

Relatedness according to Jiang and Conrath 1997: "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy".

rel(c1,c2) = max_dist - (IC(c1)+IC(c2)−2*IC(lcs))

where c1 and c2 are synsets.
The distance measure presented in the paper is turned into a relatedness measure simply by subtracting it from the maximum possible 'distance' (2*max_IC; see Statistics.getMaxJcnValue).

As this measure uses the LCS, it must be counted as somewhat path-based and as such, it also must not be used on synsets of different categories. Returns -1 in that case.

Parameters: s1 - first synset to be compared; s2 - second synset to be compared; freqs - HashMap holding the frequencies of all synsets
Returns: a value 0 <= x <= 37.51, where 37.51 is identity and 0 is unrelated.
Min = max_dist - (2*maxIC - 2*0.0) = 0 (for maximally specific, maximally distant leaf nodes)
Max = max_dist - (0+0-2*0.0) =~ 37.51 (identity)
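The conversion from distance to relatedness can be sketched as follows. The maximum IC is passed in as a parameter here, whereas the class obtains the maximum from its statistics; all names are hypothetical.

```java
final class JiangConrathSketch {
    // rel = max_dist - (IC(s1) + IC(s2) - 2*IC(lcs)), with max_dist = 2*max_IC
    static double jiangAndConrath(double icS1, double icS2, double icLcs, double maxIc) {
        double maxDist = 2.0 * maxIc;
        double distance = icS1 + icS2 - 2.0 * icLcs;
        return maxDist - distance;
    }

    public static void main(String[] args) {
        double maxIc = 18.75; // illustrative value only
        System.out.println(jiangAndConrath(5.0, 5.0, 5.0, maxIc));     // identity
        System.out.println(jiangAndConrath(18.75, 18.75, 0.0, maxIc)); // maximally distant
    }
}
```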

hirstAndStOnge

public RelatednessResult hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                                        de.tuebingen.uni.sfs.germanet.api.Synset s2)

Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms".

rel(s1,s2) = 15 for strong relations
rel(s1,s2) = C-pathLength-k*directionChanges for medium-strong relations
where by default, C=10 and k=1.

May be used on Synsets of different categories.

Parameters: s1 - first synset to be compared; s2 - second synset to be compared
Returns: a value 0 <= x <= 15, where 15 is strongly related and 0 is unrelated.
Min = 0 for no relation
Max = 15 for strong Relation

hirstAndStOnge

public RelatednessResult hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                                        de.tuebingen.uni.sfs.germanet.api.Synset s2,
                                        double c,
                                        double k)

Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms".

rel(s1,s2) = 15 for strong relations
rel(s1,s2) = C-pathLength-k*directionChanges for medium-strong relations
where pathLength is between 0 and 5, directionChanges between 0 and 2 and C and k are variables such that k,c>=0 and (2*k+5)

'Strong relation' = synset identity, direct horizontal link or one word (orthForm of one Synset, presumably a compound) containing the other.
'medium-strong relation' = a path with length 5 or less between the two synsets, using all types of relations, but only according to specified patterns (call an upwards relation 'u', downwards 'd', horizontal 'h', then the allowed paths are u+, u+d+, u+h+, u+h+d+, d+, d+h+, h+d+, h+, h+u+). For more information see the paper by Hirst and St-Onge.
Note: As GermaNet relations always go both ways, this method only looks for paths in one direction. Thus, the reverse of d+h+, namely h+u+, has been added although the original paper does not allow for it. The reverses of all other paths are already included in the 'allowed' list.

May be used on Synsets of different categories.

Parameters: s1 - first synset to be compared; s2 - second synset to be compared; c - maximum value for medium strength relations; k - variable to scale number of direction changes
Returns: a value 0 <= x <= 15, where 15 is strongly related and 0 is unrelated.
Min = 0 for no relation
Max = 15 for strong Relation
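The medium-strong case can be sketched on a path encoded as a string of 'u', 'd' and 'h' characters. The encoding, the class name and the method name are assumptions made for illustration. The sketch covers only the pattern check and the scoring formula, not the search for a path in GermaNet or the detection of strong relations.

```java
import java.util.regex.Pattern;

final class HirstStOngeSketch {
    // The allowed path shapes listed above, with the added reverse pattern h+u+.
    private static final Pattern ALLOWED =
            Pattern.compile("u+|u+d+|u+h+|u+h+d+|d+|d+h+|h+d+|h+|h+u+");

    // Returns C - pathLength - k*directionChanges for an allowed path of length 1..5, else 0.
    static double mediumStrong(String path, double c, double k) {
        if (path.length() > 5 || !ALLOWED.matcher(path).matches()) {
            return 0.0; // too long, empty, or not an allowed pattern
        }
        int directionChanges = 0;
        for (int i = 1; i < path.length(); i++) {
            if (path.charAt(i) != path.charAt(i - 1)) {
                directionChanges++;
            }
        }
        return c - path.length() - k * directionChanges;
    }

    public static void main(String[] args) {
        System.out.println(mediumStrong("uud", 10, 1)); // two steps up, one step down
        System.out.println(mediumStrong("du", 10, 1));  // down-then-up is not allowed
    }
}
```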

lesk

public RelatednessResult lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                              de.tuebingen.uni.sfs.germanet.api.Synset s2,
                              de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)

Extended Lesk relatedness (original Lesk 1986: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone") using lexical fields (or 'pseudo-glosses'), loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness".
This method returns the value computed with the following defaults:
- all orthForms of each synset
- size = 4 (path length for including related synsets)
- limit = 2 (distance from root inside which synsets are excluded as too abstract)
- only using hypernyms (as opposed to using all available relations except hyponyms)
- not including existing GermaNet glosses in lexical field
- using snowball stemmer for German
May be used on Synsets of different categories.

Parameters: s1 - first synset to be compared; s2 - second synset to be compared
Returns: a value x >= 0, where 0 is no overlap and greater values indicate greater relatedness.
Min = 0
Max = n.a.

lesk

public RelatednessResult lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1,
                              de.tuebingen.uni.sfs.germanet.api.Synset s2,
                              de.tuebingen.uni.sfs.germanet.api.GermaNet gnet,
                              org.tartarus.snowball.SnowballStemmer stemmer,
                              int size,
                              int limit,
                              boolean oneOrthForm,
                              boolean hypernymsOnly,
                              boolean includeGermanetGloss,
                              boolean includeWiktionaryGlosses)

Parameters:
s1 - first synset to be compared
s2 - second synset to be compared
size - path length: how many related synsets will be included; if size=0 and includeGloss=true, then Lesk is applied in its original definition, i.e., it compares glosses to compute the similarity
limit - distance from root: how many synset layers will be excluded for being too abstract
oneOrthForm - if set to true, only one orthForm of each synset will be used; if false, all forms will be included in the lexical field
hypernymsOnly - if true, use only hypernymy relation; otherwise, use all types of relations except hyponymy
includeGermanetGloss - if true, GermaNet's own glosses will be included in the lexical field where they exist
includeWiktionaryGlosses - if true, optionally loaded Wiktionary glosses will be included in the lexical field where they exist
Returns: a value x >= 0, where 0 is no overlap and greater values indicate greater relatedness.
Min = 0
Max = n.a.
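The overlap idea can be illustrated with a deliberately simplified sketch: it counts the distinct words shared by two lexical fields that have already been collected. How the class weights or stems words is not specified here, so the counting rule, the class name and the method name are assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LeskSketch {
    // Counts the distinct words that occur in both lexical fields (no stemming applied).
    static int overlap(List<String> fieldA, List<String> fieldB) {
        Set<String> shared = new HashSet<>(fieldA);
        shared.retainAll(new HashSet<>(fieldB));
        return shared.size();
    }

    public static void main(String[] args) {
        List<String> hund = java.util.Arrays.asList("Hund", "Tier", "Lebewesen");
        List<String> katze = java.util.Arrays.asList("Katze", "Tier", "Lebewesen");
        System.out.println(overlap(hund, katze));
    }
}
```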

runList

public void runList(java.lang.String inFile,
                    java.lang.String outFile,
                    java.lang.String separator,
                    java.lang.String encoding,
                    java.lang.String methodName,
                    de.tuebingen.uni.sfs.germanet.api.GermaNet gnet,
                    java.util.HashMap<java.lang.String,java.lang.Long> frequencies,
                    boolean normalized,
                    java.lang.String cat)

Reads a list of word pairs and writes their relatedness values to the output file. For lesk and hirstAndStOnge, the default methods will be used. Where more than one synset is found for a word, all combinations are tried and the average of relatedness values for all pairs of synsets is used.

Parameters:
inFile - csv file
outFile - file to print results to
separator - the char(s) used to separate words in the input file; will also be used on output file
encoding - String indicating the in- and output encoding (UTF8, Cp1252, ISO8859_1, ...)
methodName - name of relatedness measure to use
gnet - an instance of GermaNet
frequencies - synset frequencies for methods resnik, lin, jiangAndConrath; set to null for all others
normalized - if true, all values are mapped to (0..4); if false, original values are used
cat - all word categories occurring on the list (n, nv, v, va, nva... n=noun, v=verb, a=adjective)
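The averaging over all synset combinations might look like the following sketch. It is generic over the synset type so that it runs without GermaNet; the class name, method name and the use of ToDoubleBiFunction are assumptions, and any special treatment of -1 results is left out.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class AverageSketch {
    // Averages measure(x, y) over every combination of a synset of word 1 and a synset of word 2.
    static <S> double averageRelatedness(List<S> synsets1, List<S> synsets2,
                                         ToDoubleBiFunction<S, S> measure) {
        if (synsets1.isEmpty() || synsets2.isEmpty()) {
            throw new IllegalArgumentException("each word needs at least one synset");
        }
        double sum = 0.0;
        for (S x : synsets1) {
            for (S y : synsets2) {
                sum += measure.applyAsDouble(x, y);
            }
        }
        return sum / (synsets1.size() * synsets2.size());
    }
}
```

For example, a word with two synsets compared with a word with one synset yields the mean of two relatedness values.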

sortList

public void sortList(java.lang.String inFile,
                     java.lang.String separator,
                     java.lang.String encoding,
                     int index)

Sorts an input csv file by the numeric value in the indicated column; intended for sorting word lists by the relatedness of the word pairs. Deletes the header line ("Word1 [separator] Word2 ..."), if present. Does not allow for double entries (same pair of words twice).

Parameters: inFile - file to be sorted; separator - the char(s) used to separate words in the input file; will also be used on output file; encoding - String indicating the in- and output encoding (UTF8, Cp1252, ISO8859_1, ...); index - position of the numeric value in each line (0,1,2,...)
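The sorting step alone can be sketched on lines already read into memory. Ascending order is an assumption, as are the class and method names; file handling, header removal and duplicate checks are omitted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

final class SortListSketch {
    // Returns a copy of the lines sorted by the numeric value in column 'index' (0-based).
    static List<String> sortByColumn(List<String> lines, String separator, int index) {
        // Quote the separator so that characters such as '|' or '.' are taken literally.
        String splitRegex = Pattern.quote(separator);
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingDouble(
                (String line) -> Double.parseDouble(line.split(splitRegex)[index].trim())));
        return sorted;
    }
}
```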
