On the Representativeness of Syntactic Structures
Sonja Müller-Landmann

Institut für Deutsche Sprache, Mannheim

smueller@ids-mannheim.de

Like any other linguistic application, parsing suffers from ambiguities at all levels of language, both explicit and implicit. To rank competing analyses by their probability, I propose a new model of corpus-based parse pruning (CPP): a corpus of electronically available real-life utterances provides the necessary structure-statistical information on word order and attachments. The CPP model uses this probabilistic information on phrase patterns (i.e. flattened constituent structures of partial analyses) as a backbone for dependency syntax while preserving the full coverage of the grammar. This pruning approach yields high-quality parsing results and outperforms alternative models, even when the corpora used are very small.
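The idea of ranking ambiguous analyses by corpus statistics on phrase patterns can be illustrated with a minimal sketch. It is not the CPP model itself: the scoring function (a product of add-one smoothed relative frequencies), the function names, and the toy patterns are all assumptions made for illustration.

```python
from collections import Counter

def train_pattern_counts(corpus_patterns):
    """Count how often each flattened phrase pattern occurs in the training corpus."""
    return Counter(corpus_patterns)

def score_analysis(patterns, counts, total, vocab_size):
    """Assumed score: product of add-one smoothed relative pattern frequencies."""
    score = 1.0
    for pattern in patterns:
        score *= (counts[pattern] + 1) / (total + vocab_size)
    return score

def prune(analyses, counts, keep=1):
    """Keep the `keep` highest-scoring analyses; each analysis is a list of patterns."""
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen patterns
    ranked = sorted(
        analyses,
        key=lambda a: score_analysis(a, counts, total, vocab_size),
        reverse=True,
    )
    return ranked[:keep]

# Hypothetical toy training data: each tuple is one flattened phrase pattern.
training = [
    ("NP", "det", "adj", "noun"),
    ("NP", "det", "noun"),
    ("NP", "det", "noun"),
    ("VP", "verb", "NP"),
    ("VP", "verb", "NP", "PP"),
]

# Two competing analyses of one sentence with a PP attachment ambiguity.
candidates = [
    [("VP", "verb", "NP", "PP")],                         # PP attached to the verb
    [("VP", "verb", "NP"), ("NP", "det", "noun", "PP")],  # PP attached to the noun
]

best = prune(candidates, train_pattern_counts(training))
```

In this toy setting the verb attachment wins because its pattern was seen in training, whereas the noun attachment relies on an unseen pattern. A product score also penalizes analyses that consist of more patterns, which a real model would have to account for.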

The composition of the training corpus and the annotation scheme employed are thus responsible for the representativeness of the resulting statistics. A series of experiments on corpus quantity and quality aims to identify the optimal design, the necessary size, and the most useful degree of exhaustiveness (with regard to the information content of the annotation) for describing the syntax most reliably. A manually compiled, non-balanced, stratified sample of German texts (8 domains, 831 sentences) is used to determine which data structures are most likely to deliver the appropriate syntactic information for parse disambiguation from the least amount of input data. Using the German version of Slot Grammar (McCord 1989, 1991), the training corpus is parsed to provide several databases of phrase patterns of different syntactic depths, annotated with varying degrees of (under-)specification.

The syntactic analysis of the corpus yields the following basic facts about the phrase structures in focus (NP and VP). First, concerning within-text variance, the number of tokens per phrase type grows with the number of sentences; authors thus tend to reuse similar structures. Second, concerning between-text variance, noun phrases and verb phrases behave very differently. Noun phrases show no discernible pattern in the occurrence or absence of structures: texts that seem to have more in common do not behave alike, and texts that seem very different in genre or domain do not behave differently. Verb phrases, in turn, behave rather regularly. Between texts, they vary mostly in the overall number of verb phrases, not in which types occur.

We thus get more structures mainly from processing more data, not from processing different domains: the repertoire of structures converges as more authors are involved and as the corpus grows.

Two series of experiments examine the effect that corpus size and/or annotation scheme have on corpus reliability. They apply the CPP model to a stratified test corpus (125 sentences) that is distinct from the training corpus.

It turns out that the number of correct attachments is affected mainly by the number of authors contributing to the training corpus, while the disambiguation power and the determination of the analysis are affected mainly by its size and the average number of words per sentence. Furthermore, a structure database with a less specified annotation performs better than set-ups employing structures with more specific information content. In many cases, increasing the specificity of the structure patterns yields an inefficient Large Number of Rare Events distribution: there are too many types with too few tokens, referring to too disparate data.
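The sparsity effect described here can be made concrete with a small sketch that compares type and token counts for the same patterns at two levels of specification. The label format (`category:feature`), the helper names, and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter

def sparsity(patterns):
    """Return type count, token count, and the share of types seen only once."""
    counts = Counter(patterns)
    types = len(counts)
    tokens = sum(counts.values())
    hapax = sum(1 for c in counts.values() if c == 1)
    return {"types": types, "tokens": tokens, "hapax_share": hapax / types}

def underspecify(pattern):
    """Drop the (hypothetical) feature part after ':' from every label."""
    return tuple(label.split(":")[0] for label in pattern)

# Hypothetical noun phrase patterns annotated with case features.
detailed = [
    ("det:nom", "noun:nom"),
    ("det:acc", "noun:acc"),
    ("det:dat", "noun:dat"),
    ("det:nom", "adj:nom", "noun:nom"),
    ("det:acc", "adj:acc", "noun:acc"),
]
coarse = [underspecify(p) for p in detailed]

fine_stats = sparsity(detailed)
coarse_stats = sparsity(coarse)
```

With the detailed labels, every pattern in this toy sample is a type of its own, so no frequency estimate rests on more than one token. After underspecification the same five tokens fall into two types, which gives each type more evidence.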

However, two questions remain open: whether extending the training corpus changes the general picture and, above all, to what extent the optimal shape of the structure database's annotation is determined by the application it is designed for.

Consequently, for multi-purpose treebanks, I propose an annotation scheme with several levels of specification.

