de.tuebingen.uni.sfs.germanet.relatedness
Class Statistics

java.lang.Object
  extended by de.tuebingen.uni.sfs.germanet.relatedness.Statistics

public class Statistics
extends java.lang.Object

Calculates some values used in the Relatedness class for the current GermaNet version (GN 8.0).


Constructor Summary
Statistics()
           
 
Method Summary
static double correlationBetweenTwoLists(java.lang.String file1, java.lang.String file2, int index, java.lang.String encoding, java.lang.String separator, double min, double max, boolean includeUnknown)
          Calculates Pearson's correlation between values from two files with relatedness values for the same word pairs; order does not matter.
static int getLeskMax(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet, boolean oneSense, int size, int limit, boolean hypernymsOnly, boolean includeGloss)
          NO LONGER IN USE.
static int getMaxDepth(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Calculates the maximum depth by finding all the leaves and comparing their distance to the root (edge counting).
static int getMaxGlossLength(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Retrieves the maximum number of words in any GermaNet gloss (currently 33).
static int getMaxHypernyms(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Retrieves the maximum number of hypernyms of any GermaNet Synset (currently 6).
static int getMaxHyponyms(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Retrieves the maximum number of hyponyms of any GermaNet Synset (current value not stated).
static double getMaxJcnValue(java.util.HashMap<java.lang.String,java.lang.Long> frequencies)
          Finds the maximum possible 'distance' (sum of information content values) used in the Jiang & Conrath relatedness measure. This is the combined IC (information content) of the two leaf nodes with the highest IC, with the root as their LCS (least common subsumer):
max_IC + max_IC - 2*0.0 = 2*max_IC
Assuming that a leaf has the assigned default minimal frequency of 1, max_IC = -log(1/rootFreq), which is approx. 37.51 for the current version and frequency files.
static int getMaxLeskValue(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          NO LONGER IN USE.
static int getMaxOrthForms(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Retrieves the maximum number of orthForms of any GermaNet Synset (currently 18).
static int getMaxRelsNoHyponyms(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Retrieves the maximum number of relations of any GermaNet Synset, excluding hyponymy (currently 65).
static int getMaxShortestPath(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Returns the length of the shortest path between the two Synsets with the largest distance.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Statistics

public Statistics()
Method Detail

getMaxDepth

public static int getMaxDepth(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Calculates the maximum depth by finding all the leaves and comparing their distance to the root (edge counting). The maximum depth is 20 for the current version.

Parameters:
gnet - Instance of GermaNet.
Returns:
the maximum depth of the hierarchy
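
The edge-counting idea can be illustrated on a toy hierarchy. This is only a sketch: it does not use the GermaNet API, and the class name MaxDepthSketch, the map-based hierarchy, and the choice of the shortest path to the root as "distance" are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a hypernym/hyponym hierarchy; not part of the GermaNet API.
final class MaxDepthSketch {

    /** Returns the largest edge count between the root and any leaf (shortest path per leaf). */
    static int maxDepth(Map<String, List<String>> hyponyms, String root) {
        // Breadth-first search assigns every reachable node its distance to the root in edges.
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            for (String child : hyponyms.getOrDefault(node, List.of())) {
                if (depth.putIfAbsent(child, depth.get(node) + 1) == null) {
                    queue.add(child);
                }
            }
        }
        // Compare only the leaves, i.e. nodes without hyponyms.
        int max = 0;
        for (Map.Entry<String, Integer> entry : depth.entrySet()) {
            boolean isLeaf = hyponyms.getOrDefault(entry.getKey(), List.of()).isEmpty();
            if (isLeaf) {
                max = Math.max(max, entry.getValue());
            }
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, List<String>> toy = Map.of(
                "root", List.of("a", "b"),
                "a", List.of("c"),
                "c", List.of("d"));
        System.out.println(maxDepth(toy, "root")); // prints 3 (root -> a -> c -> d)
    }
}
```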

getMaxShortestPath

public static int getMaxShortestPath(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Returns the length of the shortest path between the two Synsets with the largest distance (currently 39 for version 7.0).

Parameters:
gnet - an instance of GermaNet
Returns:
the longest "shortest path" that exists between any two synsets
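
One way to picture the longest "shortest path" is to run a breadth-first search from every node of an undirected graph and keep the largest distance found. The sketch below does this on a toy graph. It is not the library's implementation, and the class MaxShortestPathSketch and its helper addEdge are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy graph; edges stand in for hypernym/hyponym links read in both directions.
final class MaxShortestPathSketch {

    /** Adds an undirected edge between a and b. */
    static void addEdge(Map<String, Set<String>> graph, String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Returns the largest shortest-path length (in edges) between any two connected nodes. */
    static int longestShortestPath(Map<String, Set<String>> graph) {
        int best = 0;
        for (String start : graph.keySet()) {
            // Breadth-first search yields shortest distances from 'start'.
            Map<String, Integer> dist = new HashMap<>();
            Deque<String> queue = new ArrayDeque<>();
            dist.put(start, 0);
            queue.add(start);
            while (!queue.isEmpty()) {
                String node = queue.remove();
                int d = dist.get(node);
                best = Math.max(best, d);
                for (String next : graph.getOrDefault(node, Set.of())) {
                    if (dist.putIfAbsent(next, d + 1) == null) {
                        queue.add(next);
                    }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> toy = new HashMap<>();
        addEdge(toy, "root", "a");
        addEdge(toy, "root", "b");
        addEdge(toy, "a", "c");
        addEdge(toy, "c", "d");
        addEdge(toy, "b", "e");
        System.out.println(longestShortestPath(toy)); // prints 5 (d - c - a - root - b - e)
    }
}
```

Running one search per node is quadratic in the worst case, which is acceptable for a one-off statistic but may be slow on a full-size network.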

getMaxJcnValue

public static double getMaxJcnValue(java.util.HashMap<java.lang.String,java.lang.Long> frequencies)
Finds the maximum possible 'distance' (sum of information content values) used in the Jiang & Conrath relatedness measure. This is the combined IC (information content) of the two leaf nodes with the highest IC, with the root as their LCS (least common subsumer):
max_IC + max_IC - 2*0.0 = 2*max_IC
Assuming that a leaf has the assigned default minimal frequency of 1, max_IC = -log(1/rootFreq), which is approx. 37.51 for the current version and frequency files.

Parameters:
frequencies - HashMap holding the frequencies of all synsets
Returns:
the maximum value ("distance") possible for jcn (2*maxIC)
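
The formula above can be sketched in a few lines. Two points are assumptions, since the documentation does not state them: the root frequency is taken to be the largest value in the map, and the logarithm is taken to be the natural logarithm. The class name MaxJcnSketch is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative only; not the library's implementation.
final class MaxJcnSketch {

    /** Returns 2 * max_IC, where max_IC = -log(1 / rootFreq). */
    static double maxJcnValue(Map<String, Long> frequencies) {
        if (frequencies.isEmpty()) {
            throw new IllegalArgumentException("frequency map must not be empty");
        }
        // Assumption: the root has the highest frequency of all synsets.
        long rootFreq = Collections.max(frequencies.values());
        // Assumption: natural logarithm; a leaf is given the minimal frequency 1.
        double maxIc = -Math.log(1.0 / rootFreq);
        // The root's own IC is 0.0, so the distance is max_IC + max_IC - 2 * 0.0.
        return maxIc + maxIc - 2 * 0.0;
    }

    public static void main(String[] args) {
        Map<String, Long> toy = new HashMap<>();
        toy.put("root", 1000L);
        toy.put("a", 600L);
        toy.put("leaf", 1L);
        System.out.println(maxJcnValue(toy)); // 2 * ln(1000)
    }
}
```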

getMaxRelsNoHyponyms

public static int getMaxRelsNoHyponyms(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Retrieves the maximum number of relations of any GermaNet Synset, excluding hyponymy (currently 65).

Parameters:
gnet - instance of GermaNet
Returns:
maximum number of relations contained for any Synset

getMaxHypernyms

public static int getMaxHypernyms(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Retrieves the maximum number of hypernyms of any GermaNet Synset (currently 6).

Parameters:
gnet - instance of GermaNet
Returns:
maximum number of hypernyms of any Synset

getMaxHyponyms

public static int getMaxHyponyms(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Retrieves the maximum number of hyponyms of any GermaNet Synset (current value not stated). Not currently used by any of the Relatedness measures.

Parameters:
gnet - instance of GermaNet
Returns:
maximum number of hyponyms of any Synset

getMaxOrthForms

public static int getMaxOrthForms(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Retrieves the maximum number of orthForms of any GermaNet Synset (currently 18).

Parameters:
gnet - instance of GermaNet
Returns:
maximum number of orthForms of any Synset

getMaxGlossLength

public static int getMaxGlossLength(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Retrieves the maximum number of words in any GermaNet gloss (currently 33).

Parameters:
gnet - instance of GermaNet
Returns:
maximum number of words in any GermaNet gloss/paraphrase

getMaxLeskValue

public static int getMaxLeskValue(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
NO LONGER IN USE. Calculates the maximum value theoretically possible for this Lesk implementation, with oneSense = false, size = maxDepth, limit = 0, hypernymsOnly = false, includeGloss = true (currently 70686). In practice, values will stay far below this maximum, because it counts every single word as a match, which will not happen. As such, this value is impractical. Use method getLeskMax() below.

Parameters:
gnet - an instance of GermaNet
Returns:
(max. orth forms + max. gloss length) * (max. rels + this synset) * max. depth
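
The return formula can be checked against the values quoted on this page (18 orth forms, 33 gloss words, 65 relations). The sketch below is plain arithmetic, and the class name LeskMaxSketch is hypothetical. One observation, which the documentation does not state: the quoted value 70686 is reproduced with a depth factor of 21 (for example 20 edges plus one), not with 20.

```java
// Worked arithmetic for the documented formula; not the library's implementation.
final class LeskMaxSketch {

    /** (max. orth forms + max. gloss length) * (max. rels + this synset) * depth factor */
    static int leskMax(int maxOrthForms, int maxGlossLength, int maxRels, int depthFactor) {
        int wordsPerSynset = maxOrthForms + maxGlossLength; // 18 + 33 = 51
        int synsetsPerLevel = maxRels + 1;                  // 65 relations + the synset itself = 66
        return wordsPerSynset * synsetsPerLevel * depthFactor;
    }

    public static void main(String[] args) {
        System.out.println(leskMax(18, 33, 65, 21)); // prints 70686
        System.out.println(leskMax(18, 33, 65, 20)); // prints 67320
    }
}
```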

getLeskMax

public static int getLeskMax(de.tuebingen.uni.sfs.germanet.api.GermaNet gnet,
                             boolean oneSense,
                             int size,
                             int limit,
                             boolean hypernymsOnly,
                             boolean includeGloss)
NO LONGER IN USE. Calculates the maximum value actually possible for the given settings for this Lesk implementation, with oneSense = true/false, size = [0,maxDepth], limit = [0,maxDepth], hypernymsOnly = true/false, includeGloss = true/false. In practice, values will still stay far below this maximum, because it counts every single word as a match, which will not happen.

Parameters:
gnet - an instance of GermaNet
Returns:
(max. orth forms + max. gloss length) * (max. rels + this synset) * max. depth

correlationBetweenTwoLists

public static double correlationBetweenTwoLists(java.lang.String file1,
                                                java.lang.String file2,
                                                int index,
                                                java.lang.String encoding,
                                                java.lang.String separator,
                                                double min,
                                                double max,
                                                boolean includeUnknown)
Calculates Pearson's correlation between values from two files with relatedness values for the same word pairs; order does not matter. Both files must have the same format, though: both words must precede the value, and anything else that precedes the value in one file must also be present in the other.

Parameters:
file1 - word pairs with relatedness values from one measure
file2 - word pairs with relatedness values from another measure
index - position of the value in the csv file (0,1,2...). Must come after the words; must be the same for both files.
encoding - Encoding of both files.
separator - the char(s) used to separate words in the input files
min - Smallest possible value in the distribution (e.g. 0).
max - Largest possible value in the distribution (e.g. 4).
includeUnknown - if true, pairs including one or two words with -1 values (unknown to GermaNet) are included in the calculation; if false, the correlation is calculated only from the pairs of known words. WARNING: as implemented, false also excludes entries where the method failed due to different categories. These two cases still need to be distinguished.
Returns:
the Pearson correlation between values in the two files
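
The core calculation can be sketched on two in-memory lists. This is an illustration only: the file parsing and the roles of index, encoding, separator, min and max are left out, and the class name PearsonSketch and the way unknown pairs are filtered are assumptions based on the parameter description above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; not the library's implementation.
final class PearsonSketch {

    // Assumption taken from the parameter description: -1 marks a value unknown to GermaNet.
    static final double UNKNOWN = -1.0;

    /** Pearson's r over paired values; pairs containing UNKNOWN are dropped unless includeUnknown is true. */
    static double pearson(double[] x, double[] y, boolean includeUnknown) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("both lists must have the same length");
        }
        List<double[]> pairs = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            boolean unknown = x[i] == UNKNOWN || y[i] == UNKNOWN;
            if (includeUnknown || !unknown) {
                pairs.add(new double[] {x[i], y[i]});
            }
        }
        int n = pairs.size();
        if (n < 2) {
            throw new IllegalArgumentException("need at least two pairs");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (double[] p : pairs) {
            meanX += p[0] / n;
            meanY += p[1] / n;
        }
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (double[] p : pairs) {
            double dx = p[0] - meanX;
            double dy = p[1] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // Yields NaN if either list is constant (zero variance).
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, -1, 4};
        double[] b = {2, 4, 5, 8};
        System.out.println(pearson(a, b, false)); // pair (-1, 5) is dropped before the calculation
        System.out.println(pearson(a, b, true));  // all four pairs are used
    }
}
```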