The ELWIS project: General Information

Project Objectives and Issues

ELWIS is an acronym for the German Korpusgestützte Entwicklung lexikalischer Wissensbasen (= corpus-based development of lexical knowledge bases). The project is funded by the Ministry of Science and Research of Baden-Württemberg.

The project has been based at the Department of Linguistics at the University of Tübingen, Germany, since the beginning of 1992. Research in the ELWIS framework deals with crucial issues related to a theoretically sound description of lexical units for NLP systems, using corpus-based methods and tools. We distinguish two interrelated areas:

Acquisition:
Exploitation of existing lexical knowledge sources (text corpora, machine-readable dictionaries, computer lexicons) for the efficient construction of lexical knowledge bases.
Representation:
Adequate, theoretically sound descriptions of lexical knowledge for different types of applications in computational linguistics.
Both areas have been (and still are) investigated in several national and international projects, but German has only rarely been the object of such studies. The ELWIS project not only evaluates the possibilities of transferring methods developed for other languages to German, but also develops and tests new methods and approaches to description.

The linguistic phenomena under investigation are inflectional and derivational morphology, word formation, syntactic and semantic valency, collocations, and the morphosyntactic properties of idioms.

The ELWIS Scenario

The project's different activities yield a prototypical instance of the reusability scenario shown in Figure 1.

Figure 1: ELWIS scenario


The lexical database is the central component of the scenario: on the one hand, it is the target of the research experiments carried out in the field of lexical acquisition; on the other hand, the lexical knowledge stored in the database can be automatically compiled into lexical knowledge bases with alternative knowledge representation formalisms. The lexical database itself is realised with a relational database management system.

Specific methods and tools are used to extract lexical information from machine-readable resources such as machine-readable dictionaries, existing computer lexicons and machine-readable text corpora, and to store the information in the database. In most cases these resources need additional treatment: German machine-readable dictionaries are available as typesetting files only and have to be converted into a more appropriate format. The quality and quantity of information obtainable through automatic analysis of text corpora improve considerably if the texts are linguistically annotated (POS, syntactic and semantic tagging). This requires the availability of annotation tools for the language under investigation; the tools in turn can make use of the lexical information stored in the database.

The information stored in the lexical database is finally integrated into lexical knowledge bases by means of compiler programmes. This two-stage approach was chosen because direct construction of intelligent knowledge bases from text corpora and dictionary resources has been shown not to be feasible. While knowledge representation systems such as DATR or TFS have never really been tested on large amounts of data, relational database management systems have long been proven to administer large amounts of data safely. In addition, the two-tiered architecture makes it possible to test the theoretical generalisations expressed in the knowledge bases against large amounts of data for adequacy of description. This allows efficient testing of competing theories within a single formalism or across different ones.
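To make the compilation step concrete, here is a minimal sketch in Python, assuming a hypothetical relational table lemma with columns citation and infl_class; the table layout and the generated DATR node names are invented for illustration and do not reproduce the actual ELWIS schema or compiler:

    import sqlite3

    # Sketch: compile rows of a (hypothetical) lemma table into DATR-style
    # node definitions in which each lemma inherits from the node for its
    # inflectional class.
    def compile_to_datr(db_path):
        con = sqlite3.connect(db_path)
        for citation, infl_class in con.execute(
                "SELECT citation, infl_class FROM lemma"):
            print(f"{citation.capitalize()}: <> == {infl_class}.")
        con.close()

A real compiler would of course also emit the feature paths (e.g. gender, case endings, valency frames) stored with each lemma.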

Results and work in progress

Collection of lexical and linguistic resources

One major part of the project is the accumulation of lexical and linguistic resources for German. The following resources are currently available:
Text corpora
in machine-readable form were initially not generally available in sufficient quantities. In addition to various texts assembled from a variety of sources (e.g., the Mannheimer Korpus I from the Institut für deutsche Sprache at Mannheim, literary texts from the Oxford Text Archive, and one year each of the newspapers Frankfurter Rundschau and Donaukurier from the European Corpus Initiative), all messages from all German newsgroups accessible on the Internet are archived within the project. This type of text is well suited to the construction of large text corpora because it is machine readable from the outset; moreover, the texts cover a rich range of topics, and most of them are unedited written language, a type of language that is frequently used but rarely studied. At the same time, the texts are highly idiosyncratic with respect to markup and orthography and thus pose a challenge to automatic procedures of text preparation and normalization.
Machine readable dictionaries,
i.e., dictionaries made for humans that are available in machine-readable format. The Duden publishing house provided us with typesetting files containing parts of the Duden Stilwörterbuch and the Duden Bedeutungswörterbuch.

Other sources available within the project are machine-readable versions of Gerhard Augst's Morpheminventar A-Z and of Verben in Feldern. Valenzwörterbuch zur Syntax und Semantik deutscher Verben.

Computer lexicons,
i.e., descriptions of lexical units stored in computer files or databases, in most cases designed as lexical components of NLP systems. ELWIS investigates the possibilities and limitations of exploiting two existing computer lexicons for German, namely the Saarbrücker deutsches Analysewörterbuch (SADAW) and the German part of the lexical data from the Centre for Lexical Information (CELEX).

Tool Development

Another major work package within the project is the acquisition and adaptation of available software tools, as well as the development of the project's own software tools required for ELWIS's different fields of study:
Likely
is a robust probabilistic part-of-speech tagger for German. It was developed by combining well-tested techniques for English with training data for German made available by the Institut für deutsche Sprache. The output of the tagger, POS-tagged German texts, will eventually be used as the basis for the development of lemmatization and noun phrase extraction tools for large amounts of German text.
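As an illustration of the kind of well-tested technique involved, the following Python sketch decodes the most probable tag sequence with a bigram hidden Markov model and the Viterbi algorithm; the probability tables are assumed to have been estimated from a tagged training corpus, and smoothing is reduced to a small floor value:

    import math

    # Sketch of bigram HMM tagging with Viterbi decoding. `trans` maps tag
    # bigrams and `emit` maps (tag, word) pairs to probabilities estimated
    # from tagged training data; unseen events get a small floor value.
    def viterbi(words, tags, trans, emit):
        def t(a, b): return math.log(trans.get((a, b), 1e-12))
        def e(tag, w): return math.log(emit.get((tag, w), 1e-12))
        scores = [{tag: t("<s>", tag) + e(tag, words[0]) for tag in tags}]
        back = []
        for w in words[1:]:
            col, ptr = {}, {}
            for tag in tags:
                best = max(tags, key=lambda p: scores[-1][p] + t(p, tag))
                col[tag] = scores[-1][best] + t(best, tag) + e(tag, w)
                ptr[tag] = best
            scores.append(col)
            back.append(ptr)
        # Follow the back-pointers from the best final tag.
        path = [max(tags, key=lambda tag: scores[-1][tag])]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))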
Lexparse
is a flexible dictionary entry parser developed for the conversion of dictionaries stored in typesetting files. The parser consults a user-supplied dictionary grammar specifying the hierarchical structure of the dictionary entries by means of context-free rewrite rules. The grammar formalism provides schemes for expressing (multiple) optional rules, sets of terminals, and different homonym and sense counters. Switches and configuration options allow adaptation to user-specific needs; e.g., the display of the generated parse trees is user-configurable and supports SGML and LaTeX output, as well as hierarchical attribute-value structures. Lexparse was used to parse the Duden dictionaries and is currently being applied to other dictionaries.
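The following sketch gives the flavour of such a grammar-driven parser; the rule format and category names are invented for illustration and do not reproduce Lexparse's actual formalism:

    # A dictionary entry's hierarchical structure declared as context-free
    # rewrite rules, with a tiny recursive-descent parser building a tree
    # from a tokenised entry. Illustrative only.
    GRAMMAR = {
        "entry": ["headword", "pos", "sense"],
        "sense": ["counter", "definition"],
    }

    def parse(cat, tokens, pos=0):
        if cat not in GRAMMAR:               # terminal: match the token's tag
            tag, text = tokens[pos]
            if tag != cat:
                raise ValueError(f"expected {cat}, got {tag} at {pos}")
            return (cat, text), pos + 1
        children = []
        for sub in GRAMMAR[cat]:             # non-terminal: expand the rule
            tree, pos = parse(sub, tokens, pos)
            children.append(tree)
        return (cat, children), pos

    tokens = [("headword", "Haus"), ("pos", "n."),
              ("counter", "1."), ("definition", "building lived in")]
    tree, _ = parse("entry", tokens)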
Insyst,
INserter SYSTem, is a system for the automatic insertion (i.e., classification) of lexical items under appropriate nodes in hierarchies as used in modern formalisms for lexical information such as the DATR formalism. It allows (a) efficient testing of generalizations when designing a lexical hierarchy, and (b) efficient transfer of large numbers of lexical items from flat data structures into a finished lexical hierarchy when using it to build a large lexicon. INSYST was developed and used to test competing DATR theories for German noun inflection within the ELWIS project.
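A minimal sketch of the insertion idea, with an invented feature-based hierarchy standing in for a DATR theory: an item is placed under the most specific node whose (inherited) features it satisfies:

    # Invented hierarchy for German noun inflection classes; node and
    # feature names are illustrative only.
    HIERARCHY = {
        "Noun":      {"features": {}, "parent": None},
        "Noun_en":   {"features": {"plural": "-en"}, "parent": "Noun"},
        "Noun_en_s": {"features": {"plural": "-en", "genitive": "-s"},
                      "parent": "Noun_en"},
    }

    def inherited(node):
        """Collect a node's features together with those of its ancestors."""
        feats = {}
        while node is not None:
            feats = {**HIERARCHY[node]["features"], **feats}
            node = HIERARCHY[node]["parent"]
        return feats

    def insert(item_feats):
        """Return the most specific node whose features the item matches."""
        matching = [n for n in HIERARCHY
                    if all(item_feats.get(k) == v
                           for k, v in inherited(n).items())]
        return max(matching, key=lambda n: len(inherited(n)))

    print(insert({"plural": "-en", "genitive": "-s"}))  # -> Noun_en_s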
n-gram
facilitates the extraction of collocations from texts by means of statistical calculations: the frequencies of word clusters are compared with the frequencies of the individual words in the cluster; clusters whose joint frequency is high relative to the frequencies of their parts receive high mutual information values and are automatically selected as candidate collocations.
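A minimal sketch of the measure in Python for word pairs; tokenisation, the frequency threshold and the cut-off value are simplified assumptions:

    import math
    from collections import Counter

    # Score adjacent word pairs by pointwise mutual information:
    # log2( P(w1,w2) / (P(w1) * P(w2)) ). High-scoring, sufficiently
    # frequent pairs are proposed as collocation candidates.
    def collocation_candidates(tokens, min_freq=3, threshold=3.0):
        words = Counter(tokens)
        pairs = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        candidates = []
        for (w1, w2), f in pairs.items():
            if f < min_freq:
                continue
            mi = math.log2((f / n) / ((words[w1] / n) * (words[w2] / n)))
            if mi >= threshold:
                candidates.append((w1, w2, mi))
        return sorted(candidates, key=lambda c: -c[2])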
Lexpand
is a LEXical EXPANDer programme that generates full forms from the project's lemmatized core database. The database contains a complete description of the morphologically relevant features for each lemma. Lexpand takes a lemma and its features as input and generates all inflected forms of the lemma. Underlying the programme is a thorough rule system of inflectional morphology for German.
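The following sketch illustrates the expansion step for a single invented noun class; the real rule system of German inflectional morphology is, of course, far more comprehensive:

    # Paradigm table mapping (number, case) features to suffixes; the class
    # name and table contents are illustrative only.
    PARADIGMS = {
        "noun_en": {  # e.g. Frau -> Frauen in the plural
            ("sg", "nom"): "", ("sg", "gen"): "",
            ("sg", "dat"): "", ("sg", "acc"): "",
            ("pl", "nom"): "en", ("pl", "gen"): "en",
            ("pl", "dat"): "en", ("pl", "acc"): "en",
        },
    }

    def expand(lemma, paradigm):
        """Generate (features, full form) pairs for a lemma."""
        return [((num, case), lemma + suffix)
                for (num, case), suffix in PARADIGMS[paradigm].items()]

    for feats, form in expand("Frau", "noun_en"):
        print(feats, form)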
Apart from the above-mentioned tools developed within the project, ELWIS has assembled a large variety of natural language processing tools from external sources.

Research

Research is concerned with the following issues:
Lexical database design:
For the development of the lexical database, a conceptual schema (i.e., an entity-relationship model) was developed using appropriate methods of semantic data modeling. This conceptual schema was then translated into a relational schema and implemented. It contains a theoretically sound description of the inflectional morphology and valency patterns of German, and can easily be adapted for descriptions of lexical semantics.
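As an illustration of the translation from an entity-relationship model into relations, the following sketch renders two hypothetical entities as tables; the table and column names are invented, and the actual ELWIS schema is considerably richer:

    import sqlite3

    # Sketch: an ER model with a lemma entity and a dependent valency
    # pattern entity, rendered as relations. Illustrative only.
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE lemma (
        lemma_id   INTEGER PRIMARY KEY,
        citation   TEXT NOT NULL,
        pos        TEXT NOT NULL,
        infl_class TEXT            -- key into the inflectional rule system
    );
    CREATE TABLE valency_pattern (
        pattern_id INTEGER PRIMARY KEY,
        lemma_id   INTEGER REFERENCES lemma(lemma_id),
        frame      TEXT NOT NULL   -- e.g. 'NP_nom NP_acc'
    );
    """)
    con.execute("INSERT INTO lemma VALUES (1, 'geben', 'verb', 'strong')")
    con.execute("INSERT INTO valency_pattern "
                "VALUES (1, 1, 'NP_nom NP_dat NP_acc')")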
Methodology for the development of dictionary grammars:
Based on meta-lexicographical findings on the structure of dictionaries, a method was explicated for the development of grammars for the dictionary entry parser Lexparse. The method was employed in the development of dictionary grammars for the two Duden dictionaries.
Tag Sets for German Corpus Tagging:
For the part-of-speech tagger developed in the project, a revised tag set of 42 syntactic categories for German was developed. This tag set is based on an earlier set developed in the 1970s at the University of Saarbrücken and was revised on a strictly distributional basis. It will serve as a starting point for noun phrase extraction and other syntactic tools.
Methods for the automatic extraction of idioms and collocations:
Automatic extraction of idioms and collocations was the target of a series of experiments: it was shown that corpus-based statistical methods which have been tested for English cannot simply be applied to German data; a number of additional problems must be solved first. A combination of corpus- and dictionary-based methods seems more appropriate for the extraction of this type of lexical entry.