DEREKO (Deutsches Referenzkorpus) is a joint project of the Institut für
deutsche Sprache (IDS) in Mannheim, the Seminar für Sprachwissenschaft (SfS)
in Tübingen, and the Institut für Maschinelle Sprachverarbeitung (IMS) in
Stuttgart. The project was funded by the Ministry of Science, Research and the
Arts of the State of Baden-Württemberg for three years, starting in 1999.
The project was set up to improve the infrastructure for text-based
linguistic research and development by building a very large, automatically
annotated German text corpus, together with the corresponding tools for
corpus annotation and exploitation. This involved three tasks:
- Corpus Acquisition
- Corpus Annotation
- Corpus Exploitation
The task of corpus acquisition consisted of marketing activities and
contract negotiations in order to convince publishing houses and individuals
to grant research licenses for their texts. (Responsibility: IDS)
Corpus annotation involved several steps of 'text enrichment'. First, the
meta-information of a text (author, date of publication, etc.) was encoded in
normalized markup. Second, the text was segmented, i.e. its surface structure
was detected and marked up, including paragraphs, sentences, and word forms.
Finally, to make the texts more valuable for researchers interested in a wide
range of linguistic phenomena, a partial syntactic analysis was carried out in
addition to POS tagging and lemmatisation. All of this additional information
was added via a customised markup scheme. (Responsibilities: IDS for the markup
of meta-information and for sentence and paragraph segmentation; SfS for
linguistic annotation, with some input on the lexical level from the IMS)
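The segmentation step can be illustrated with a minimal Python sketch. It is
not the project's actual pipeline or markup scheme: the element names (text,
p, s, w), the attribute names, and the regular-expression splitting rules are
assumptions chosen for illustration only, and a real sentence splitter would
need to handle abbreviations, dates, and similar cases.

```python
import re
import xml.etree.ElementTree as ET


def segment_to_xml(raw_text, author, date):
    """Wrap raw text in illustrative markup: paragraphs, sentences, word forms.

    The element and attribute names are hypothetical, not DEREKO's scheme.
    """
    root = ET.Element("text", {"author": author, "date": date})
    # Paragraphs: separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", raw_text.strip()):
        if not para.strip():
            continue
        p = ET.SubElement(root, "p")
        # Sentences: naive split after ., ! or ? followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            if not sent:
                continue
            s = ET.SubElement(p, "s")
            # Word forms: runs of word characters, or single punctuation marks.
            for token in re.findall(r"\w+|[^\w\s]", sent):
                w = ET.SubElement(s, "w")
                w.text = token
    return root


sample = "Er will die Wand streichen. Sie lacht.\n\nDas war alles."
doc = segment_to_xml(sample, author="N.N.", date="1999")
print(ET.tostring(doc, encoding="unicode"))
```

POS tags and lemmas could be attached to each w element as further
attributes in the same way, once a tagger has supplied them.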
In order to make use of a linguistically annotated text corpus, powerful
specialized tools for corpus exploitation are needed. The basic tool is a
query engine ('TIGERSearch') that can access structural text annotation
efficiently. On this basis, query collections can be built that help to answer
the questions lexicographers and linguists may have, for example in which
contexts the word "streichen" (paint; erase) occurs and how often it occurs in
each of these contexts. (Responsibility: IMS)
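The kind of question mentioned above can be sketched in a few lines of Python.
This is not TIGERSearch or its query language; it only shows, on a
hypothetical toy corpus with hand-assigned lemmas, how lemma annotation lets
one find all inflected forms of "streichen" and count the lemmas that occur
near them. The function name context_counts and the window parameter are
assumptions made for this example.

```python
from collections import Counter


def context_counts(sentences, lemma, window=2):
    """Count lemmas occurring within `window` tokens of each hit for `lemma`."""
    counts = Counter()
    for sent in sentences:
        lemmas = [lem for _form, lem in sent]
        for i, lem in enumerate(lemmas):
            if lem != lemma:
                continue
            lo, hi = max(0, i - window), min(len(lemmas), i + window + 1)
            # Every lemma in the window except the hit itself counts as context.
            counts.update(lemmas[lo:i] + lemmas[i + 1:hi])
    return counts


# Hypothetical toy corpus: each sentence is a list of (word form, lemma) pairs.
corpus = [
    [("Er", "er"), ("streicht", "streichen"), ("die", "die"), ("Wand", "Wand")],
    [("Sie", "sie"), ("strich", "streichen"), ("die", "die"), ("Zeile", "Zeile")],
]
hits = context_counts(corpus, "streichen")
print(hits.most_common())
```

Searching by lemma rather than by surface form is what makes both
"streicht" and "strich" count as hits here.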