java.lang.Object
  de.tuebingen.uni.sfs.germanet.relatedness.Relatedness
public class Relatedness
Implements some of the more well-known relatedness measures for GermaNet
API version 8.0.
Where paths are involved (all methods except Lesk's and Hirst and St-Onge's),
the methods all use edge counting,
i.e. identity = distance 0, parent = 1, sister nodes = 2.
They also all return -1 if the input words have different categories, as no
useful relatedness measure can be computed in that case
(reason: GermaNet keeps nouns, verbs and adjectives in different subtrees of
the hypernym hierarchy, though connected by a common root node; paths between
different categories are overly long and distort relatedness results).
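To illustrate the edge-counting convention described above, the following is a minimal sketch on a toy hypernym tree. The class name, the node names and the single-parent map are assumptions made for this example; they are not GermaNet data or part of the GermaNet API, and real synsets may have more than one hypernym.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EdgeCountSketch {
    // Toy hypernym tree: child -> parent (hypothetical nodes, not GermaNet data).
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("animal", "root");
        PARENT.put("dog", "animal");
        PARENT.put("cat", "animal");
    }

    // Nodes on the way from the given node up to the root, node itself first.
    static List<String> pathToRoot(String node) {
        List<String> path = new ArrayList<>();
        for (String n = node; n != null; n = PARENT.get(n)) {
            path.add(n);
        }
        return path;
    }

    // Number of edges between a and b via their lowest shared ancestor.
    static int distance(String a, String b) {
        List<String> up = pathToRoot(a);
        List<String> other = pathToRoot(b);
        for (int i = 0; i < up.size(); i++) {
            int j = other.indexOf(up.get(i));
            if (j >= 0) {
                return i + j; // edges from a to the ancestor plus edges from b
            }
        }
        return -1; // no shared ancestor in this toy tree
    }
}
```

With this tree, a node compared with itself has distance 0, a node and its parent have distance 1, and two sister nodes have distance 2.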
In the following short summary of the methods, LCS = least common subsumer of
synsets s1 and s2, dist = distance between two synsets.
path:
rel(s1,s2) = (max_dist-dist(s1,s2))/max_dist
wuAndPalmer:
rel(s1,s2) = (2*depth(lcs)) / (dist(s1,lcs)+dist(s2,lcs)+2*depth(lcs))
leacockAndChodorow:
rel(s1,s2) = -log(dist(s1,s2)/(2*max_depth))
resnik:
rel(s1,s2) = -log(p(lcs)) = IC(lcs)
lin:
rel(s1,s2) = 2*IC(lcs) / (IC(s1) + IC(s2))
jiangAndConrath:
rel(s1,s2) = max_dist - (IC(s1) + IC(s2) - 2*IC(lcs))
hirstAndStOnge:
rel(s1,s2) = 15 for strong relations
rel(s1,s2) = C-pathLength-k*directionChanges for medium-strong relations
lesk:
rel(s1,s2) = sum(word_overlap(s1,s2)), extended by related synsets
The javadoc for each method includes the hypothetical minimum and maximum values for that method, which may or may not ever be reached in practice. Values in javadoc are taken from GermaNet API version 7.0 and XML version 6.0 and may not apply to later versions.
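The three information-content (IC) measures in the summary above (resnik, lin, jiangAndConrath) can be sketched as plain arithmetic on precomputed IC values. This is an illustrative sketch only: the class and method names are hypothetical, the LCS lookup and the category check are omitted, and IC is taken as -log(freq/rootFreq), the form given in the resnik description.

```java
class IcMeasuresSketch {
    // Information content of a synset from its frequency and the root frequency.
    static double ic(long freq, long rootFreq) {
        return -Math.log((double) freq / rootFreq);
    }

    // resnik: the IC of the least common subsumer.
    static double resnik(double icLcs) {
        return icLcs;
    }

    // lin: 2*IC(lcs) / (IC(s1) + IC(s2)); yields NaN if both ICs are 0.
    static double lin(double icS1, double icS2, double icLcs) {
        return 2 * icLcs / (icS1 + icS2);
    }

    // jiangAndConrath: the distance IC(s1) + IC(s2) - 2*IC(lcs) subtracted from maxDist.
    static double jiangAndConrath(double icS1, double icS2, double icLcs, double maxDist) {
        return maxDist - (icS1 + icS2 - 2 * icLcs);
    }
}
```

For a synset compared with itself (s1 = s2 = LCS), this sketch gives lin = 1 and jiangAndConrath = maxDist, while resnik simply returns the synset's own IC.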
Field Summary | |
---|---|
static de.tuebingen.uni.sfs.germanet.api.GermaNet |
gnet
|
static int |
MAX_DEPTH
|
static int |
MAX_SHORTEST_PATH
|
static java.lang.String |
ROOT
|
Constructor Summary | |
---|---|
Relatedness(de.tuebingen.uni.sfs.germanet.api.GermaNet germanet)
Constructor taking a GermaNet instance as input. |
|
Relatedness(java.lang.String germanetDirectory)
Constructor taking the path to a GermaNet-XML directory as input; instantiates a GermaNet object with (germanetDirectory, true), i.e. |
Method Summary | |
---|---|
RelatednessResult |
hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1,
de.tuebingen.uni.sfs.germanet.api.Synset s2)
Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms". |
RelatednessResult |
hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1,
de.tuebingen.uni.sfs.germanet.api.Synset s2,
double c,
double k)
Relatedness according to Hirst and St-Onge 1998: "Lexical chains as representations of context for the detection and correction of malapropisms". |
RelatednessResult |
jiangAndConrath(de.tuebingen.uni.sfs.germanet.api.Synset s1,
de.tuebingen.uni.sfs.germanet.api.Synset s2,
java.util.HashMap<java.lang.String,java.lang.Long> freqs)
Relatedness according to Jiang and Conrath 1997: "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy" |
RelatednessResult |
leacockAndChodorow(de.tuebingen.uni.sfs.germanet.api.Synset a,
de.tuebingen.uni.sfs.germanet.api.Synset b)
Relatedness according to Leacock and Chodorow 1998: "Combining Local Context and WordNet Similarity for Word Sense Identification". |
RelatednessResult |
lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1,
de.tuebingen.uni.sfs.germanet.api.Synset s2,
de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Extended Lesk relatedness (original Lesk 1987: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone.") using a lexical field (or 'pseudo-glosses', loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness"). |
RelatednessResult |
lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1,
de.tuebingen.uni.sfs.germanet.api.Synset s2,
de.tuebingen.uni.sfs.germanet.api.GermaNet gnet,
org.tartarus.snowball.SnowballStemmer stemmer,
int size,
int limit,
boolean oneOrthForm,
boolean hypernymsOnly,
boolean includeGermanetGloss,
boolean includeWiktionaryGlosses)
Extended Lesk relatedness (original Lesk 1987: "Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone.") using a lexical field (or 'pseudo-glosses', loosely following Gurevych 2005: "Using the Structure of a Conceptual Network in Computing Semantic Relatedness"). |
RelatednessResult |
lin(de.tuebingen.uni.sfs.germanet.api.Synset s1,
de.tuebingen.uni.sfs.germanet.api.Synset s2,
java.util.HashMap<java.lang.String,java.lang.Long> freqs)
Relatedness according to Lin 1998: "An Information-Theoretic Definition of Similarity" |
RelatednessResult |
path(de.tuebingen.uni.sfs.germanet.api.Synset s1,
de.tuebingen.uni.sfs.germanet.api.Synset s2)
A very simple relatedness measure. |
RelatednessResult |
resnik(de.tuebingen.uni.sfs.germanet.api.Synset c1,
de.tuebingen.uni.sfs.germanet.api.Synset c2,
java.util.HashMap<java.lang.String,java.lang.Long> freqs)
Relatedness according to Resnik 1995: "Using Information Content to Evaluate Semantic Similarity in a Taxonomy". |
void |
runList(java.lang.String inFile,
java.lang.String outFile,
java.lang.String separator,
java.lang.String encoding,
java.lang.String methodName,
de.tuebingen.uni.sfs.germanet.api.GermaNet gnet,
java.util.HashMap<java.lang.String,java.lang.Long> frequencies,
boolean normalized,
java.lang.String cat)
Reads a list of word pairs and writes their relatedness values to the output file. |
void |
sortList(java.lang.String inFile,
java.lang.String separator,
java.lang.String encoding,
int index)
Sorts an input csv file by the numeric value in the indicated column; intended for sorting word lists by the relatedness of the word pairs. |
RelatednessResult |
wuAndPalmer(de.tuebingen.uni.sfs.germanet.api.Synset c1,
de.tuebingen.uni.sfs.germanet.api.Synset c2)
Relatedness/Similarity according to Wu and Palmer, 1994: "Verb Semantics and Lexical Selection" |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static de.tuebingen.uni.sfs.germanet.api.GermaNet gnet
public static int MAX_DEPTH
public static int MAX_SHORTEST_PATH
public static java.lang.String ROOT
Constructor Detail |
---|
public Relatedness(de.tuebingen.uni.sfs.germanet.api.GermaNet germanet)
germanet
- an instance of the GermaNet class.
public Relatedness(java.lang.String germanetDirectory)
germanetDirectory
- the path to a GermaNet-XML directory
Method Detail |
---|
public RelatednessResult path(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2)
s1
- first synset to be compared
s2
- second synset to be compared
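As a small illustration of the path formula from the class summary, rel(s1,s2) = (max_dist-dist(s1,s2))/max_dist, here is a sketch that takes the two distances as plain integers. The class name and the parameter names are hypothetical; how the actual method obtains dist and max_dist is not shown here.

```java
class PathMeasureSketch {
    // (maxDist - dist) / maxDist, computed in floating point.
    static double path(int dist, int maxDist) {
        return (double) (maxDist - dist) / maxDist;
    }
}
```

Under this formula, identical synsets (dist = 0) score 1 and synsets at the maximum distance score 0.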
public RelatednessResult wuAndPalmer(de.tuebingen.uni.sfs.germanet.api.Synset c1, de.tuebingen.uni.sfs.germanet.api.Synset c2)
ConSim(C1, C2) = (2*N3) / (N1+N2+2*N3)
C1, C2: two synsets
C3: their least common subsumer/'superconcept' (LCS)
N1 = path length C1,C3
N2 = path length C2,C3
N3 = depth of C3
Can only compare words of same class; returns -1 otherwise.
c1
- first synset to be compared
c2
- second synset to be compared
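The ConSim formula above can be written out as follows. This is a sketch with hypothetical names that takes N1, N2 and N3 as already computed integers; finding the least common subsumer is left out.

```java
class WuPalmerSketch {
    // ConSim(C1, C2) = (2*N3) / (N1 + N2 + 2*N3)
    // n1, n2: path lengths from C1 and C2 to their LCS; n3: depth of the LCS.
    static double conSim(int n1, int n2, int n3) {
        return (2.0 * n3) / (n1 + n2 + 2.0 * n3);
    }
}
```

Note that in this sketch all three inputs being 0 (both synsets are the root) gives 0/0, i.e. NaN; how the actual method treats that case is not stated in this documentation.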
public RelatednessResult leacockAndChodorow(de.tuebingen.uni.sfs.germanet.api.Synset a, de.tuebingen.uni.sfs.germanet.api.Synset b)
rel(a,b) = max [-log(N_p/2D)]
max is only relevant if no unique root node exists, thus here:
rel(a,b) = -log(N_p/2D)
N_p = path length from a to b
D = max depth of taxonomy (for xml 6.0: maxDepth = 20, edge counting)
In this implementation, 1 is added to numerator and denominator to avoid
-log(0) = infinity (for identity).
Can only compare words of same class; returns -1 otherwise.
a
- first synset to be compared
b
- second synset to be compared
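The effect of the added 1 can be seen in a short sketch. The smoothed variant below follows one literal reading of the description, -log((N_p+1)/(2D+1)); this reading, the class name and the method names are assumptions, not taken from the implementation.

```java
class LeacockChodorowSketch {
    // Unsmoothed form: -log(N_p / 2D); infinite for identical synsets (N_p = 0).
    static double raw(int pathLength, int maxDepth) {
        return -Math.log(pathLength / (2.0 * maxDepth));
    }

    // Assumed smoothed form: 1 added to numerator and denominator.
    static double smoothed(int pathLength, int maxDepth) {
        return -Math.log((pathLength + 1) / (2.0 * maxDepth + 1));
    }
}
```

With D = 20, the unsmoothed value for identity is infinite, whereas the smoothed value is log(41), which then serves as the finite upper bound of the sketch.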
public RelatednessResult resnik(de.tuebingen.uni.sfs.germanet.api.Synset c1, de.tuebingen.uni.sfs.germanet.api.Synset c2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)
rel(c1,c2) = max(c in S(c1,c2)) [-log(p(c))] = IC(c)
where S(c1,c2) is the set of concepts that subsume both c1 and c2.
-> in short: max(lcs)[-log(p(lcs))] = -log(freq(lcs)/rootFreq).
If there are several LCS (least common subsumers), take the
'most informative' one.
Note that with Resnik's measure, it is possible for a synset to be 'more
related' to a different synset with a larger IC (information content)
than to itself.
As this measure uses the LCS, it must be counted as somewhat path-based
and as such, it also must not be used on synsets of different categories.
Returns -1 in that case.
c1
- first concept (synset) to be compared
c2
- second synset to be compared
freqs
- HashMap
public RelatednessResult lin(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)
rel(x1,x2) = 2*log P(C0)/(log P(C1) + log P(C2))
where x1 and x2 are members of the classes C1 and C2 and C0 is the most
specific class that subsumes both C1 and C2.
Since -log(p(s)) = ic(s)
and the negative signs cancel out,
rel(s1,s2) = 2*ic(lcs)/(ic(s1) + ic(s2))
As this measure uses the LCS, it must be counted as somewhat path-based
and as such, it also must not be used on synsets of different categories.
Returns -1 in that case.
s1
- first synset to be compared
s2
- second synset to be compared
freqs
- HashMap
public RelatednessResult jiangAndConrath(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, java.util.HashMap<java.lang.String,java.lang.Long> freqs)
rel(c1,c2) = max_dist - (IC(c1)+IC(c2)-2*IC(lcs))
where c1 and c2 are synsets.
The distance measure presented in the paper is turned into a relatedness
measure simply by subtracting it from the maximum possible 'distance'
(2*max_IC; see Statistics.getMaxJcnValue).
As this measure uses the LCS, it must be counted as somewhat path-based
and as such, it also must not be used on synsets of different categories.
Returns -1 in that case.
s1
- first synset to be compared
s2
- second synset to be compared
freqs
- HashMap
public RelatednessResult hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2)
rel(s1,s2) = 15 for strong relations
rel(s1,s2) = C-pathLength-k*directionChanges for medium-strong relations
where by default, C=10 and k=1.
'Strong relation' = synset identity, direct horizontal link or one word
(orthForm of one Synset, presumably a compound) containing the other.
'medium-strong relation' = a path with length 5 or less between the two
synsets, using all types of relations, but only according to specified
patterns (call an upwards relation 'u', downwards 'd', horizontal 'h',
then the allowed paths are u+, u+d+, u+h+, u+h+d+, d+, d+h+, h+d+, h+).
For more information see the paper by Hirst and St-Onge.
May be used on Synsets of different categories.
s1
- first synset to be compared
s2
- second synset to be compared
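The scoring rule with the default constants C=10 and k=1 can be sketched as below. The class and method names are hypothetical, and the search for a path between the synsets is not part of this sketch; only the arithmetic from the description is shown.

```java
class HsoScoreSketch {
    // Fixed score for strong relations, as stated in the description.
    static final double STRONG = 15.0;

    // Medium-strong score: C - pathLength - k * directionChanges.
    static double mediumStrong(int pathLength, int directionChanges, double c, double k) {
        return c - pathLength - k * directionChanges;
    }

    // Same rule with the documented defaults C = 10 and k = 1.
    static double mediumStrong(int pathLength, int directionChanges) {
        return mediumStrong(pathLength, directionChanges, 10.0, 1.0);
    }
}
```

For example, a path of length 2 with one direction change scores 10 - 2 - 1 = 7 under the defaults, and the longest allowed path (length 5, two direction changes) scores 3.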
public RelatednessResult hirstAndStOnge(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, double c, double k)
'Strong relation' = synset identity, direct horizontal link or one word
(orthForm of one Synset, presumably a compound) containing the other.
May be used on Synsets of different categories.
rel(s1,s2) = 15 for strong relations
rel(s1,s2) = C-pathLength-k*directionChanges for medium-strong relations
where pathLength is between 0 and 5, directionChanges between 0 and 2
and C and k are variables such that k,c>=0 and (2*k+5)
'medium-strong relation' = a path with length 5 or less between the two
synsets, using all types of relations, but only according to specified
patterns (call an upwards relation 'u', downwards 'd', horizontal 'h',
then the allowed paths are u+, u+d+, u+h+, u+h+d+, d+, d+h+, h+d+, h+,
h+u+).
For more information see the paper by Hirst and St-Onge.
Note: As GermaNet relations always go both ways, this method only looks
for paths in one direction.
Thus, the reverse of d+h+, namely h+u+, has been
added although the original paper does not allow for it. The reverses of
all other paths are already included in the 'allowed' list.
s1
- first synset to be compared
s2
- second synset to be compared
c
- maximum value for medium strength relations
k
- variable to scale number of direction changes
Min = 0 for no relation
Max = 15 for strong relation
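The allowed path patterns listed above (including the added h+u+) lend themselves to a regular-expression check. The following is a sketch under the assumption that a path is encoded as a string over 'u', 'd' and 'h', one character per relation; this encoding and the names used are illustrative and not taken from the implementation.

```java
import java.util.regex.Pattern;

class HsoPathPatternSketch {
    // The allowed patterns from the description, with h+u+ included.
    static final Pattern ALLOWED =
            Pattern.compile("u+|u+d+|u+h+|u+h+d+|d+|d+h+|h+d+|h+|h+u+");

    // A path qualifies if it has at most 5 relations and matches an allowed pattern.
    static boolean isAllowed(String path) {
        return path.length() <= 5 && ALLOWED.matcher(path).matches();
    }

    // Counts the positions where the relation type differs from the previous one.
    static int directionChanges(String path) {
        int changes = 0;
        for (int i = 1; i < path.length(); i++) {
            if (path.charAt(i) != path.charAt(i - 1)) {
                changes++;
            }
        }
        return changes;
    }
}
```

Since every allowed pattern has at most three segments, the number of direction changes for an allowed path is at most 2, which matches the range given in the description.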
public RelatednessResult lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
s1
- first synset to be compared
s2
- second synset to be compared
public RelatednessResult lesk(de.tuebingen.uni.sfs.germanet.api.Synset s1, de.tuebingen.uni.sfs.germanet.api.Synset s2, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet, org.tartarus.snowball.SnowballStemmer stemmer, int size, int limit, boolean oneOrthForm, boolean hypernymsOnly, boolean includeGermanetGloss, boolean includeWiktionaryGlosses)
s1
- first synset to be compared
s2
- second synset to be compared
oneOrthForm
- if set to true, only one orthForm of each synset will be used;
if false, all forms will be included in the lexical field
size
- path length: how many related synsets will be included; if
size=0 and includeGloss=true, then Lesk is applied in its original
definition, i.e., it compares glosses to compute the similarity
limit
- distance from root: how many synset layers will be excluded
for being too abstract
hypernymsOnly
- if true, use only hypernymy relation; otherwise, use
all types of relations except hyponymy
includeGermanetGloss
- if true, GermaNet's own glosses will
be included in the lexical field where they exist
includeWiktionaryGlosses
- if true, optionally loaded Wiktionary glosses
will be included in the lexical field where they exist
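The core idea of comparing two lexical fields by word overlap can be sketched as follows. This is a simplified illustration that counts distinct shared words between two already assembled word lists; the class name, the counting scheme and the absence of stemming are assumptions, and the actual method may weight or collect overlaps differently.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LeskOverlapSketch {
    // Number of distinct words that occur in both lexical fields.
    static int overlap(List<String> field1, List<String> field2) {
        Set<String> shared = new HashSet<>(field1);
        shared.retainAll(new HashSet<>(field2));
        return shared.size();
    }
}
```

In this sketch, two fields that share no word score 0, and repeated words within a field are counted only once.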
public void runList(java.lang.String inFile, java.lang.String outFile, java.lang.String separator, java.lang.String encoding, java.lang.String methodName, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet, java.util.HashMap<java.lang.String,java.lang.Long> frequencies, boolean normalized, java.lang.String cat)
inFile
- csv file
outFile
- file to print results to
separator
- the char(s) used to separate words in the input file;
will also be used on output file
encoding
- String indicating the in- and output encoding
(UTF8, Cp1252, ISO8859_1, ...)
methodName
- name of relatedness measure to use
gnet
- an instance of GermaNet
frequencies
- for methods resnik, lin, jiangAndConrath; set to
null for all others.
normalized
- false: original values are used;
true: all values mapped to (0..4)
cat
- All word categories occurring on the list (n, nv, v, va, nva...;
n=noun, v=verb, a=adjective)
public void sortList(java.lang.String inFile, java.lang.String separator, java.lang.String encoding, int index)
inFile
- File to be sorted
separator
- the char(s) used to separate words in the input file;
will also be used on output file
encoding
- String indicating the in- and output encoding
(UTF8, Cp1252, ISO8859_1, ...)
index
- position of the numeric value in each line (0,1,2,...)
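Sorting lines by the numeric value in a given column might look like the sketch below. It works on an in-memory list instead of a file, sorts in ascending order, and treats the separator as a literal string; these choices and all names are assumptions for illustration, since the sort direction and file handling of the actual method are not documented here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

class SortByColumnSketch {
    // Returns a copy of the lines, ordered by the number found at the given column index.
    static List<String> sortByColumn(List<String> lines, String separator, int index) {
        // Quote the separator so characters such as '|' or '.' are matched literally.
        String splitRegex = Pattern.quote(separator);
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingDouble(
                (String line) -> Double.parseDouble(line.split(splitRegex)[index].trim())));
        return sorted;
    }
}
```

The column index is zero-based, matching the (0,1,2,...) convention of the index parameter.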