DEREKO - The Project

DEREKO is a joint effort of
Acquisition	Annotation	Exploitation
IDS Mannheim	SfS Tübingen	IMS Stuttgart


	The Project

	Acquisition and Document Annotation
		Please see IDS

	Linguistic Annotation
		Introduction
		Documentation
		Sample
		Contact

	Corpus Exploitation
		Introduction
		Query Collection
		Documentation
		Sample
		Contact

The Project

DEREKO (Deutsches Referenzkorpus) was a joint project of the Institut für deutsche Sprache (IDS) in Mannheim, the Seminar für Sprachwissenschaft (SfS) in Tübingen, and the Institut für Maschinelle Sprachverarbeitung (IMS) in Stuttgart. It was funded for three years, starting in 1999, by the Ministry of Science, Research and the Arts of the State of Baden-Württemberg.

The project was set up to improve the infrastructure for text-based linguistic research and development by building a very large, automatically annotated German text corpus together with the corresponding tools for corpus annotation and exploitation. This raised the following issues:

Corpus Acquisition
Corpus Annotation
Corpus Exploitation

The task of corpus acquisition consisted of marketing activities and contract negotiations to convince publishing houses and individuals to grant research licenses for their texts. (Responsibility: IDS)

Corpus annotation involved several steps of 'text enrichment'. The meta-information of a text (author, date of publication, etc.) had to be encoded in normalized markup. The text had to be segmented, i.e. its surface structure had to be detected and marked up, including paragraphs, sentences, and word forms. Furthermore, to make the texts more valuable for researchers interested in a wide range of linguistic phenomena, a partial syntactic analysis was carried out in addition to POS tagging and lemmatisation, and all of this additional information was added via a customised markup scheme. (Responsibilities: IDS for the markup of meta-information and sentence and paragraph segmentation, SfS for linguistic annotation, with some input on the lexical level from the IMS)
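
The segmentation step described above can be illustrated with a minimal sketch. It is not the DEREKO markup scheme or the IDS tooling: the element names `text`, `p`, `s`, and `w` are hypothetical, and the sentence splitter is a naive rule (break after `.`, `!`, or `?`) that a real corpus pipeline would replace with something more robust.

```python
import re
import xml.etree.ElementTree as ET

def segment(text):
    """Wrap paragraphs, sentences, and word forms in hypothetical <p>, <s>, <w> elements."""
    root = ET.Element("text")
    # Paragraphs are assumed to be separated by blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        p = ET.SubElement(root, "p")
        # Naive sentence split: break at whitespace that follows ., ! or ?
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            if not sent:
                continue
            s = ET.SubElement(p, "s")
            # Word forms are runs of word characters; each punctuation mark is its own token.
            for tok in re.findall(r"\w+|[^\w\s]", sent):
                w = ET.SubElement(s, "w")
                w.text = tok
    return root

doc = segment("Er streicht die Wand. Sie ist rot.\n\nDas war alles.")
print(ET.tostring(doc, encoding="unicode"))
```

Later annotation layers such as POS tags and lemmas could then be attached to the `w` elements as attributes, which is one common way of layering linguistic annotation on top of segmentation markup.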

To make use of a linguistically annotated text corpus, powerful specialized tools for corpus exploitation are needed. The basic tool is a query engine ('TIGERSearch'), which can access structural text annotation efficiently. On this basis, query collections can be built that help to answer the questions lexicographers and linguists may have, for example in which contexts the word "streichen" (paint; erase) occurs and how often it occurs in each of these contexts. (Responsibility: IMS)
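
To make the "streichen" example concrete, the following sketch counts the nouns that occur shortly after a given lemma. It is plain Python over a tiny invented sample, not TIGERSearch query syntax: the data layout (word form, POS tag, lemma), the sample sentences, the function name `noun_contexts`, and the window size are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical annotated sentences: (word form, POS tag, lemma) per token.
corpus = [
    [("Er", "PPER", "er"), ("streicht", "VVFIN", "streichen"),
     ("die", "ART", "die"), ("Wand", "NN", "Wand")],
    [("Sie", "PPER", "sie"), ("strich", "VVFIN", "streichen"),
     ("den", "ART", "die"), ("Satz", "NN", "Satz")],
    [("Wir", "PPER", "wir"), ("streichen", "VVFIN", "streichen"),
     ("die", "ART", "die"), ("Wand", "NN", "Wand")],
]

def noun_contexts(sentences, lemma, window=3):
    """Count noun lemmas found within `window` tokens to the right of `lemma`."""
    counts = Counter()
    for sent in sentences:
        for i, (_, _, lem) in enumerate(sent):
            if lem != lemma:
                continue
            # Look only at the next `window` tokens after the match.
            for _, pos, ctx in sent[i + 1 : i + 1 + window]:
                if pos == "NN":
                    counts[ctx] += 1
    return counts

contexts = noun_contexts(corpus, "streichen")
print(contexts.most_common())
```

Matching on the lemma rather than the word form is what lets "streicht", "strich", and "streichen" count as the same verb, which is why lemmatisation matters for this kind of query.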

Please see the menu on the left-hand side for more details on corpus acquisition, annotation, and exploitation.

Please contact Tylman Ule for more information. Site last modified Sun Sep 26 2004.