de.tuebingen.uni.sfs.germanet.relatedness
Class Frequency

java.lang.Object
  extended by de.tuebingen.uni.sfs.germanet.relatedness.Frequency

public class Frequency
extends java.lang.Object

This class deals with frequency lists and information content.

Richardson and Smeaton 1995: "Many polysemous words and multi-worded synsets will have an exaggerated information content value".

Synset frequencies are calculated here from word frequencies and are therefore subject to this problem: a single occurrence of a word like "Land" is counted towards 6 different synsets (property, earth, province, state, ...). Keeping a separate list for each POS mitigates this slightly, but the problem remains inherent.

Note: to avoid 0 frequencies, a default value of 1 is assigned to each synset. A word listed with frequency 1 will thus end up in a synset with a frequency of at least 2.
Note 2: GermaNet has multiple inheritance. In this implementation, each synset adds its frequency value exactly once to each of its (transitive) hypernyms, no matter whether that hypernym can be reached on more than 1 path.
As such, the root node will hold exactly the total of assigned (not cumulative) frequencies in the entire tree. As a side effect, though, a synset's cumulative frequency can be lower than the sum of the frequencies of its direct hyponyms.
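
The propagation rule in Note 2 can be illustrated with a minimal sketch. It is not the library's implementation: the class name, the method name and the map-based graph representation are assumptions made for this example only. Each synset walks up to its transitive hypernyms and uses a visited set so that a hypernym reachable on several paths is credited only once.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; names and data structures are not part of the GermaNet API.
class CumulativeFrequencySketch {

    /**
     * Adds each synset's own frequency once to itself and once to every
     * distinct transitive hypernym, regardless of the number of paths.
     *
     * @param own       assigned (non-cumulative) frequency per synset ID
     * @param hypernyms direct hypernym IDs per synset ID
     */
    static Map<String, Long> cumulate(Map<String, Long> own,
                                      Map<String, List<String>> hypernyms) {
        Map<String, Long> cumulative = new HashMap<>();
        for (Map.Entry<String, Long> entry : own.entrySet()) {
            Set<String> visited = new HashSet<>();
            Deque<String> stack = new ArrayDeque<>();
            stack.push(entry.getKey());
            while (!stack.isEmpty()) {
                String node = stack.pop();
                if (!visited.add(node)) {
                    continue; // already credited via another path
                }
                cumulative.merge(node, entry.getValue(), Long::sum);
                for (String parent : hypernyms.getOrDefault(node, List.of())) {
                    stack.push(parent);
                }
            }
        }
        return cumulative;
    }

    public static void main(String[] args) {
        // "c" has two hypernyms ("a" and "b"), which both lead to "root".
        Map<String, Long> own = Map.of("root", 1L, "a", 2L, "b", 3L, "c", 4L);
        Map<String, List<String>> hypernyms = Map.of(
                "a", List.of("root"), "b", List.of("root"), "c", List.of("a", "b"));
        System.out.println(cumulate(own, hypernyms));
    }
}
```

In this toy graph the root ends up with 10, the total of all assigned frequencies, while its direct hyponyms "a" and "b" hold 6 and 7. Their sum (13) exceeds the root's value because "c" is counted under both of them but only once at the root.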


Constructor Summary
Frequency()
           
 
Method Summary
static void assignFrequencies(java.lang.String inputDirectory, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
          Creates a file "frequencies.csv" with format "ID \t frequency" in the given input directory with the cumulative frequencies as described by Resnik, 1995.
static void assignFrequencies(java.lang.String inputDirectory, java.lang.String gnDirectory)
          Creates a file "frequencies" with format "ID \t frequency" in the given input directory with the cumulative frequencies as described by Resnik, 1995.
static int cleanList(java.io.File inputFile)
          Cleans an input list; from inputFileName(.suffix), creates new file inputFileName_clean(.suffix).
static void cleanLists(java.lang.String inputDir)
          Runs cleanList on a whole directory.
static java.util.HashMap<java.lang.String,java.lang.Long> loadFreq(java.lang.String freqFile)
          Loads the frequency file into a HashMap.
static java.util.HashMap<java.lang.String,java.lang.Double> loadIC(java.lang.String freqFile)
          Loads the frequency file and writes ICs (information content values) into a HashMap.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Frequency

public Frequency()

Method Detail

assignFrequencies

public static void assignFrequencies(java.lang.String inputDirectory,
                                     de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
Creates a file "frequencies.csv" with format "ID \t frequency" in the given input directory with the cumulative frequencies as described by Resnik, 1995.

This method expects as input a directory with one or more frequency lists containing lines with the format "frequency\tword".
The file names must include "nn" or "noun" for nouns, "vv" or "verb" for verbs, "aj" or "adj" for adjectives (and "av" or "adv" for adverbs); or "all", if POS suffixes are already attached to each entry.
Also, file names must end in "clean" (before the file extension such as .txt, e.g. Tuepp-ADJ_clean.txt) so that cleanLists() and assignFrequencies() can be used on the same directory without the 'old' files being included in the latter method.
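
The naming convention above could be checked along the following lines. This is only a sketch: the class and method names are hypothetical, and both the case-insensitive matching and the order in which the POS markers are tested are assumptions, since the documentation does not specify them.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper; not part of the Frequency class.
class FrequencyFileNameSketch {

    /** True if the name, minus an optional extension, ends in "clean". */
    static boolean isCleaned(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        String stem = dot > 0 ? name.substring(0, dot) : name;
        return stem.endsWith("clean");
    }

    /** Guesses the POS marker from the file name, or empty if none is found. */
    static Optional<String> posOf(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        // Assumed order: "adv" is tested before "verb" because "adverb" contains "verb".
        if (name.contains("aj") || name.contains("adj")) return Optional.of("adj");
        if (name.contains("av") || name.contains("adv")) return Optional.of("adv");
        if (name.contains("nn") || name.contains("noun")) return Optional.of("noun");
        if (name.contains("vv") || name.contains("verb")) return Optional.of("verb");
        if (name.contains("all")) return Optional.of("all");
        return Optional.empty();
    }

    public static void main(String[] args) {
        String example = "Tuepp-ADJ_clean.txt";
        System.out.println(isCleaned(example) + " " + posOf(example));
    }
}
```

Plain substring matching is fragile (a name such as "small_clean.txt" would be read as "all"), so distinctive file names are advisable.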

Parameters:
inputDirectory - a directory containing one or more frequency lists
gnet - an instance of the class GermaNet

assignFrequencies

public static void assignFrequencies(java.lang.String inputDirectory,
                                     java.lang.String gnDirectory)
Creates a file "frequencies" with format "ID \t frequency" in the given input directory with the cumulative frequencies as described by Resnik, 1995.

This method expects as input a directory with one or more frequency lists containing lines with the format "frequency\tword".
The file names must include "nn" or "noun" for nouns, "vv" or "verb" for verbs, "aj" or "adj" for adjectives (and "av" or "adv" for adverbs); or "all", if POS suffixes are already attached to each entry.
Also, file names must end in "clean" (a regular file extension like .txt may still follow it), so that cleanLists() and assignFrequencies() can be used on the same directory without the 'old' files being included in the latter method.

Parameters:
inputDirectory - a directory containing one or more frequency lists
gnDirectory - path to your GermaNet xml directory

cleanList

public static int cleanList(java.io.File inputFile)
Cleans an input list; from inputFileName(.suffix), creates new file inputFileName_clean(.suffix).

- removes #'s (separable verb, adjs etc, as in "an#weisen"),
- removes question marks as in "erste??",
- splits variant1|variant2 into separate entries (as in "ein#fallen|ein#fällen"),
- distributes frequency equally over variants; special case: "Alter|Alte|Altes": do minimal stemming to "alt" if ADJ, keep the masculine and feminine forms (not the neuter) if NOUN.
Works with lower case (assignFrequencies lower-cases everything too).
Sorts alphabetically.
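
The general cleaning rules listed above can be sketched for a single list entry as follows. This is an illustration under assumptions, not the actual cleanList code: the class and method names are hypothetical, the frequency share is kept as a double (the documentation does not say how uneven divisions are handled), and the "Alter|Alte|Altes" special case and the final sorting are left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of the cleaning rules for one "frequency \t word" entry.
class CleanEntrySketch {

    /**
     * Splits the word on '|', strips '#' and '?', lower-cases each variant
     * and gives every variant an equal share of the frequency.
     */
    static Map<String, Double> cleanEntry(long frequency, String word) {
        String[] variants = word.split("\\|");
        double share = (double) frequency / variants.length;
        Map<String, Double> cleaned = new LinkedHashMap<>();
        for (String variant : variants) {
            String key = variant.replace("#", "").replace("?", "")
                    .toLowerCase(Locale.ROOT);
            if (!key.isEmpty()) {
                // Variants that collapse to the same string have their shares summed.
                cleaned.merge(key, share, Double::sum);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(cleanEntry(10, "an#weisen|erste??"));
    }
}
```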

This method likely needs to be allotted adequate Java heap space (run with -Xms512m -Xmx512m), depending on the size of the list.

Parameters:
inputFile - The file to be cleaned. Format: "frequency \t word \n"
Returns:
0 if successful, 1 if an error occurred

cleanLists

public static void cleanLists(java.lang.String inputDir)
Runs cleanList on a whole directory.

The directory should contain only frequency lists. It may contain already cleaned lists, but their names should end in '_clean' (before the file extension, e.g. 'adj-list_clean.txt').
This method needs to be allotted adequate Java heap space (run with at least -Xms512m -Xmx512m, depending on list size).

Parameters:
inputDir - The directory holding lists to be cleaned.

loadFreq

public static java.util.HashMap<java.lang.String,java.lang.Long> loadFreq(java.lang.String freqFile)
Loads the frequency file into a HashMap.
Keys: IDs as String, values: frequencies.

Parameters:
freqFile - File holding synset frequencies, format: "ID\tfrequency\n"
Returns:
a HashMap holding IDs as String key and frequency as Long value
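
For orientation, parsing the "ID\tfrequency" format into such a map might look like the sketch below. It is not the method's actual source: the class and method names are hypothetical, and silently skipping blank or malformed lines is an assumption.

```java
import java.util.HashMap;
import java.util.List;

// Hypothetical illustration of reading "ID\tfrequency" lines.
class LoadFreqSketch {

    /** Parses lines of the form "ID\tfrequency" into a map from ID to frequency. */
    static HashMap<String, Long> parse(List<String> lines) {
        HashMap<String, Long> frequencies = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                continue; // assumption: skip blank or malformed lines
            }
            frequencies.put(fields[0].trim(), Long.parseLong(fields[1].trim()));
        }
        return frequencies;
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("1\t10", "", "2\t3")));
    }
}
```

The lines of a real file could be obtained with java.nio.file.Files.readAllLines before being passed to this method.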

loadIC

public static java.util.HashMap<java.lang.String,java.lang.Double> loadIC(java.lang.String freqFile)
Loads the frequency file and writes ICs (information content values) into a HashMap.
Keys: IDs as String, values: ICs (= -log(freq/freq(root))).

Calls loadFreq internally, so it is a bit slow.
Note that the root has IC -0.0.
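
Because the root is reported with IC -0.0, the values are presumably negative logarithms of the relative frequency freq/freq(root), as in Resnik's definition. The sketch below works under that assumption. The class and method names are hypothetical, the natural logarithm is an assumed base, and the root frequency is taken to be the largest value in the map (the root holds the total of all assigned frequencies).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of turning cumulative frequencies into IC values.
class InformationContentSketch {

    /** Computes IC = -ln(freq / freq(root)) for every synset ID. */
    static HashMap<String, Double> toIC(Map<String, Long> frequencies) {
        HashMap<String, Double> ic = new HashMap<>();
        if (frequencies.isEmpty()) {
            return ic;
        }
        // Assumption: the root carries the maximum cumulative frequency.
        double rootFrequency = Collections.max(frequencies.values());
        for (Map.Entry<String, Long> entry : frequencies.entrySet()) {
            ic.put(entry.getKey(), -Math.log(entry.getValue() / rootFrequency));
        }
        return ic;
    }

    public static void main(String[] args) {
        System.out.println(toIC(Map.of("root", 10L, "a", 5L)));
    }
}
```

For the root the ratio is 1, Math.log(1.0) is 0.0, and negating it yields -0.0 in Java floating-point arithmetic, which matches the note above.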

Parameters:
freqFile - File holding synset frequencies, format: "ID\tfrequency\n"
Returns:
a HashMap holding IDs as String key and IC as Double value.