java.lang.Object
  de.tuebingen.uni.sfs.germanet.relatedness.Frequency
public class Frequency
This class deals with frequency lists and information content.
Richardson and Smeaton 1995: "Many polysemous words and multi-worded synsets
will have an exaggerated information content value".
The calculation of synset frequencies here is based on word frequencies and
is therefore subject to this problem: one occurrence of a word like "Land" is
counted towards 6 different synsets (property, earth, province, state, ...).
Keeping separate lists for each POS mitigates this slightly, but the problem
remains inherent all the same.
Note: to avoid 0 frequencies, a default value of 1 is assigned to each synset.
A word listed with frequency 1 will thus end up in a synset with a
frequency of at least 2.
Note 2: GermaNet has multiple inheritance. In this implementation, each synset
adds its frequency value exactly once to each of its (transitive) hypernyms,
no matter whether that hypernym can be reached on more than 1 path.
As such, the root node will hold exactly the total of assigned (not
cumulative) frequencies in the entire tree. As a side effect, though,
a synset's cumulative frequency can be lower than the sum of the
frequencies of its direct hyponyms.
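The propagation rule in Note 2 can be illustrated with a minimal sketch. It is not the actual implementation: the class `CumulativeFrequencySketch`, the method `cumulate`, and the plain string IDs are hypothetical stand-ins for GermaNet synsets. Each synset's transitive hypernyms are collected into a set, so a hypernym reachable on several paths still receives the frequency only once.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and data structures are assumptions, not the Frequency API.
class CumulativeFrequencySketch {

    /**
     * @param own       assigned (non-cumulative) frequency per synset ID
     * @param hypernyms direct hypernyms per synset ID (missing key = no hypernyms)
     * @return cumulative frequency per synset ID
     */
    static Map<String, Long> cumulate(Map<String, Long> own,
                                      Map<String, List<String>> hypernyms) {
        // Every synset starts out with its own assigned frequency.
        Map<String, Long> cumulative = new HashMap<>(own);
        for (Map.Entry<String, Long> entry : own.entrySet()) {
            // Collect transitive hypernyms in a set: an ancestor reachable
            // on more than one path is still recorded only once.
            Set<String> ancestors = new HashSet<>();
            Deque<String> todo = new ArrayDeque<>(
                    hypernyms.getOrDefault(entry.getKey(), List.of()));
            while (!todo.isEmpty()) {
                String current = todo.pop();
                if (ancestors.add(current)) {
                    todo.addAll(hypernyms.getOrDefault(current, List.of()));
                }
            }
            for (String ancestor : ancestors) {
                cumulative.merge(ancestor, entry.getValue(), Long::sum);
            }
        }
        return cumulative;
    }

    public static void main(String[] args) {
        // Diamond: "d" reaches "root" via both "b" and "c".
        Map<String, Long> own = Map.of("root", 1L, "b", 1L, "c", 1L, "d", 5L);
        Map<String, List<String>> hypernyms = Map.of(
                "b", List.of("root"),
                "c", List.of("root"),
                "d", List.of("b", "c"));
        Map<String, Long> cumulative = cumulate(own, hypernyms);
        System.out.println(cumulative.get("root")); // 8 = 1 + 1 + 1 + 5
        System.out.println(cumulative.get("b"));    // 6
        System.out.println(cumulative.get("c"));    // 6
    }
}
```

In this diamond the root ends up with 8, the total of all assigned frequencies, while its direct hyponyms "b" and "c" sum to 12. This is the side effect described above.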
Constructor Summary |
---|
Frequency() |
Method Summary | |
---|---|
static void | assignFrequencies(java.lang.String inputDirectory, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet): Creates a file "frequencies.csv" with format "ID \t frequency" in the given input directory with the cumulative frequencies as described by Resnik, 1995. |
static void | assignFrequencies(java.lang.String inputDirectory, java.lang.String gnDirectory): Creates a file "frequencies" with format "ID \t frequency" in the given input directory with the cumulative frequencies as described by Resnik, 1995. |
static int | cleanList(java.io.File inputFile): Cleans an input list; from inputFileName(.suffix), creates new file inputFileName_clean(.suffix). |
static void | cleanLists(java.lang.String inputDir): Runs cleanList on a whole directory. |
static java.util.HashMap<java.lang.String,java.lang.Long> | loadFreq(java.lang.String freqFile): Loads the frequency file into a HashMap. |
static java.util.HashMap<java.lang.String,java.lang.Double> | loadIC(java.lang.String freqFile): Loads the frequency file and writes ICs (information content values) into a HashMap. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public Frequency()
Method Detail |
---|
public static void assignFrequencies(java.lang.String inputDirectory, de.tuebingen.uni.sfs.germanet.api.GermaNet gnet)
This method expects as input a directory with one or more frequency lists
containing lines with the format "frequency\tword".
The file names must include "nn" or "noun" for nouns, "vv" or "verb" for
verbs, "aj" or "adj" for adjectives (and "av" or "adv" for adverbs); or
"all", if POS suffixes are already attached to each entry.
Also, file names must end in "clean" before the file extension (.txt etc.,
e.g. Tuepp-ADJ_clean.txt), so that cleanList() and
assignFrequencies() can be used on the same directory without the uncleaned
source files being included in the latter method.
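As a rough illustration of these naming rules, the following sketch decides which part of speech a file name implies. The class `FrequencyListNameSketch`, the method `posOf`, the returned labels, and the order of the checks are all assumptions; the documentation does not say how the real method resolves names that match several markers.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper illustrating the naming rules; not part of the Frequency API.
class FrequencyListNameSketch {

    /** Returns the POS label implied by the file name, or empty if the file would be ignored. */
    static Optional<String> posOf(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        // Strip a regular file extension such as ".txt", if present.
        int dot = name.lastIndexOf('.');
        String stem = dot > 0 ? name.substring(0, dot) : name;
        if (!stem.endsWith("clean")) {
            return Optional.empty(); // not a cleaned list
        }
        // Assumed check order; "adv" is tested before "verb" because
        // a name like "adverbs" contains both markers.
        if (stem.contains("all")) return Optional.of("all");
        if (stem.contains("nn") || stem.contains("noun")) return Optional.of("noun");
        if (stem.contains("aj") || stem.contains("adj")) return Optional.of("adj");
        if (stem.contains("av") || stem.contains("adv")) return Optional.of("adv");
        if (stem.contains("vv") || stem.contains("verb")) return Optional.of("verb");
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(posOf("Tuepp-ADJ_clean.txt")); // Optional[adj]
        System.out.println(posOf("Tuepp-ADJ.txt"));       // Optional.empty
    }
}
```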
inputDirectory
- a directory containing one or more frequency lists
gnet
- an instance of the class GermaNet
public static void assignFrequencies(java.lang.String inputDirectory, java.lang.String gnDirectory)
This method expects as input a directory with one or more frequency lists
containing lines with the format "frequency\tword".
The file names must include "nn" or "noun" for nouns, "vv" or "verb" for
verbs, "aj" or "adj" for adjectives (and "av" or "adv" for adverbs); or
"all", if POS suffixes are already attached to each entry.
Also, file names must end in "clean" (a regular file extension like .txt may
still follow it), so that cleanList() and
assignFrequencies() can be used on the same directory without the uncleaned
source files being included in the latter method.
inputDirectory
- a directory containing one or more frequency lists
gnDirectory
- path to your GermaNet xml directory
public static int cleanList(java.io.File inputFile)
- removes #'s (separable verb, adjs etc, as in "an#weisen"),
- removes question marks as in "erste??",
- splits variant1|variant2 into separate entries
(as in "ein#fallen|ein#fällen"),
- distributes frequency equally over variants;
special case: "Alter|Alte|Altes": do minimal stemming to "alt" if ADJ,
keep masculine and feminine forms (not neuter) if NOUN.
Works with lower case (assignFrequencies lower-cases everything too).
Sorts alphabetically.
This method likely needs to be allotted adequate Java heap space
(run with -Xms512m -Xmx512m), depending on the size of the list.
inputFile
- The file to be cleaned. Format: "frequency \t word \n"
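The general cleaning steps for a single line can be sketched as follows. This is not the actual cleanList code: `CleanEntrySketch` and `cleanLine` are hypothetical names, the "Alter|Alte|Altes" special case and the sorting are left out, and frequencies are kept as doubles because the documentation does not say how uneven splits are rounded.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the per-line cleaning steps; not the actual cleanList code.
class CleanEntrySketch {

    /** Cleans one line of the form "frequency\tword" and returns word -> frequency share. */
    static Map<String, Double> cleanLine(String line) {
        String[] fields = line.split("\t", 2);
        double frequency = Double.parseDouble(fields[0].trim());
        // Split "variant1|variant2" into separate entries.
        String[] variants = fields[1].trim().split("\\|");
        Map<String, Double> cleaned = new LinkedHashMap<>();
        for (String variant : variants) {
            // Remove separable-prefix markers (#) and question marks, then lower-case.
            String word = variant.replace("#", "").replace("?", "")
                    .toLowerCase(Locale.GERMAN);
            // Distribute the frequency equally; identical variants are summed.
            cleaned.merge(word, frequency / variants.length, Double::sum);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(cleanLine("12\tan#weisen")); // {anweisen=12.0}
        System.out.println(cleanLine("3\terste??"));    // {erste=3.0}
        // Two variants, each receiving half of the frequency (5.0).
        System.out.println(cleanLine("10\tein#fallen|ein#f\u00e4llen"));
    }
}
```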
public static void cleanLists(java.lang.String inputDir)
The directory should contain only frequency lists. It may contain already
cleaned lists, but they should have a name ending in '_clean' (before the file
extension, e.g. 'adj-list_clean.txt').
This method needs to be allotted adequate Java heap space
(run with at least -Xms512m -Xmx512m, depending on list size).
inputDir
- The directory holding lists to be cleaned.
public static java.util.HashMap<java.lang.String,java.lang.Long> loadFreq(java.lang.String freqFile)
freqFile
- File holding synset frequencies, format: "ID\tfrequency\n"
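A minimal sketch of reading the "ID\tfrequency" format into a map is shown below. The class `LoadFreqSketch` and its methods are hypothetical, and the UTF-8 encoding and the strict two-column check are assumptions rather than documented behaviour of loadFreq.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;

// Hypothetical sketch; the real loadFreq may handle encoding and errors differently.
class LoadFreqSketch {

    /** Parses lines of the form "ID\tfrequency" into a map from synset ID to frequency. */
    static HashMap<String, Long> parse(List<String> lines) {
        HashMap<String, Long> frequencies = new HashMap<>();
        for (String line : lines) {
            if (line.isBlank()) {
                continue; // tolerate empty lines, e.g. at the end of the file
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                throw new IllegalArgumentException("Expected ID<TAB>frequency: " + line);
            }
            frequencies.put(fields[0].trim(), Long.parseLong(fields[1].trim()));
        }
        return frequencies;
    }

    /** Reads the file (assumed to be UTF-8) and parses it. */
    static HashMap<String, Long> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```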
public static java.util.HashMap<java.lang.String,java.lang.Double> loadIC(java.lang.String freqFile)
Uses loadFreq and is therefore a bit slow.
Note that the Root has IC -0.0.
freqFile
- File holding synset frequencies, format: "ID\tfrequency\n"
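The root's IC of -0.0 follows from the usual definition of information content as the negative logarithm of a synset's relative frequency: the root has relative frequency 1, log(1) is 0.0, and negating it yields -0.0 in Java floating-point arithmetic. The sketch below assumes the natural logarithm and takes the root ID as an explicit parameter; both are assumptions, as are the names `InformationContentSketch` and `toIC`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; log base and root lookup are assumptions, not documented behaviour.
class InformationContentSketch {

    /** Converts cumulative frequencies to IC = -log(frequency / rootFrequency). */
    static HashMap<String, Double> toIC(Map<String, Long> frequencies, String rootId) {
        double rootFrequency = frequencies.get(rootId);
        HashMap<String, Double> ic = new HashMap<>();
        for (Map.Entry<String, Long> entry : frequencies.entrySet()) {
            // Relative frequency is 1 for the root, so its IC is -log(1) = -0.0.
            ic.put(entry.getKey(), -Math.log(entry.getValue() / rootFrequency));
        }
        return ic;
    }

    public static void main(String[] args) {
        Map<String, Long> frequencies = Map.of("root", 8L, "leaf", 2L);
        Map<String, Double> ic = toIC(frequencies, "root");
        System.out.println(ic.get("root")); // -0.0
        System.out.println(ic.get("leaf")); // -log(0.25), roughly 1.386
    }
}
```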