Institut für Deutsche Sprache, Mannheim
smueller@ids-mannheim.de
Like any other linguistic application, parsing suffers from ambiguity at all levels of language, explicit and implicit alike. In order to rank several ambiguous analyses by probability, I propose a new model of corpus-based parse pruning (CPP): A corpus of electronically available real-life utterances provides the necessary structure-statistical information on word order and attachments. The CPP model utilizes this probabilistic information on phrase patterns (i.e. flattened constituent structures of partial analyses) as a backbone for dependency syntax, preserving the full coverage of the grammar. This pruning approach yields high-quality parsing results and outperforms alternative models, even when using corpora of a very limited size.
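The abstract leaves the scoring procedure implicit; as a minimal sketch of the pruning idea, a candidate analysis might be scored by the smoothed relative frequencies of its flattened phrase patterns (all names and the scoring formula below are illustrative assumptions, not the model's actual definition):

```python
from collections import Counter
from typing import List

# Hypothetical database: frequencies of flattened phrase patterns
# (e.g. "NP -> DET ADJ N") extracted from the parsed training corpus.
pattern_counts: Counter = Counter()

def train(corpus_patterns: List[str]) -> None:
    """Collect frequency counts for phrase patterns seen in training."""
    pattern_counts.update(corpus_patterns)

def score(analysis: List[str]) -> float:
    """Score one candidate analysis by the smoothed relative frequency
    of the phrase patterns it contains; unseen patterns receive a small
    floor, so the grammar's full coverage is preserved."""
    total = sum(pattern_counts.values()) + 1
    s = 1.0
    for pattern in analysis:
        s *= (pattern_counts.get(pattern, 0) + 0.5) / total
    return s

def prune(candidates: List[List[str]]) -> List[str]:
    """Keep only the most probable of several ambiguous analyses."""
    return max(candidates, key=score)
```

The additive smoothing here merely stands in for whatever treatment of unseen patterns the full model uses.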
The composition of the training corpus and the annotation scheme employed are thus responsible for the representativeness of the resulting statistics. A series of experiments on corpus quantity and quality aims at determining the optimal design, the necessary size, and the most useful degree of exhaustiveness (as regards the information content of the annotation) for describing the syntax most reliably: A manually compiled, non-balanced, stratificational sample of German texts (8 domains, 831 sentences) is used to determine which data structures are most likely to deliver the syntactic information needed for parse disambiguation from the least amount of input data. Using the German version of Slot Grammar (McCord 1989, 1991), the training corpus is parsed to provide several databases of phrase patterns of different syntactic depths, annotated in varying degrees of (under-)specification.
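To make the varying degrees of specification concrete, the following sketch flattens one partial analysis into phrase patterns at three illustrative levels (the tree encoding and the level names are assumptions, not the paper's actual annotation scheme):

```python
from typing import List, Tuple

# A phrase is (category, daughters); each daughter is (word, pos, features).
Phrase = Tuple[str, List[Tuple[str, str, str]]]

def flatten(phrase: Phrase, level: str = "pos") -> str:
    """Flatten a partial constituent analysis into a phrase pattern."""
    cat, daughters = phrase
    if level == "pos":          # least specified: categories only
        parts = [pos for _, pos, _ in daughters]
    elif level == "features":   # add morphosyntactic features
        parts = [f"{pos}.{feats}" for _, pos, feats in daughters]
    else:                       # most specified: lexicalized
        parts = [f"{word}/{pos}" for word, pos, _ in daughters]
    return f"{cat} -> " + " ".join(parts)

np = ("NP", [("der", "DET", "nom.sg.m"),
             ("alte", "ADJ", "nom.sg.m"),
             ("Mann", "N", "nom.sg.m")])
print(flatten(np, "pos"))       # NP -> DET ADJ N
print(flatten(np, "features"))  # NP -> DET.nom.sg.m ADJ.nom.sg.m N.nom.sg.m
print(flatten(np, "lexical"))   # NP -> der/DET alte/ADJ Mann/N
```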
The syntactic analysis of the corpus yields the following basic facts about the phrase structures in focus (NP and VP): First, concerning within-text variance, the number of tokens per phrase type grows with the number of sentences; authors thus tend to reuse similar structures. Second, noun phrases and verb phrases behave very differently with respect to between-text variance. Noun phrases show no discernible patterns in the occurrence or absence of structures: texts that appear to have much in common do not behave more similarly than texts that appear very different in genre or domain. Verb phrases, in turn, behave rather regularly: between texts, they mostly vary in the overall number of verb phrases, not in the occurrence of the respective types. We thus obtain more structures mainly by processing more data, not by processing different domains: the repertoire of structures converges as more authors are involved and the corpus grows.
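This convergence can be made visible with a simple type-growth curve; the sketch below assumes the corpus is available as one list of phrase patterns per sentence, an illustrative format:

```python
from typing import List

def type_growth(sentences: List[List[str]]) -> List[int]:
    """Cumulative number of distinct phrase-pattern types after each
    sentence; a flattening curve indicates a converging repertoire."""
    seen: set = set()
    growth = []
    for patterns in sentences:
        seen.update(patterns)
        growth.append(len(seen))
    return growth
```

If the curve flattens while sentences keep coming, new data is contributing tokens of known types rather than new types.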
Two series of experiments on the effect of corpus size and annotation scheme on corpus reliability are performed with the CPP model on a stratificational test corpus (125 sentences) distinct from the training corpus.
It turns out that the number of correct attachments is affected mainly by the number of authors contributing to the training corpus, while the disambiguation power and the determination of the analysis are affected mainly by its size and the average number of words per sentence. Furthermore, a structure database with a less specified annotation performs better than set-ups employing structures with more specific information content: Increasing the specificity of the structure patterns yields an inefficient Large Number of Rare Events (LNRE) distribution in many cases: there are too many types with too few tokens referring to too disparate data.
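A quick diagnostic for such an LNRE situation is the share of hapax legomena (types observed exactly once) in the pattern database; the function and the reading of its output below are illustrative, not the paper's evaluation procedure:

```python
from collections import Counter
from typing import List

def hapax_ratio(patterns: List[str]) -> float:
    """Share of pattern types that occur exactly once in the database."""
    counts = Counter(patterns)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# A ratio close to 1.0 indicates that increasing annotation specificity
# has fragmented the data into types too rare for stable statistics.
```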
However, two questions remain open: whether extending the training corpus changes the general picture and, above all, to what extent the prospective application for which the annotation of the structure database is designed determines its optimal shape.
Consequently, regarding multi-purpose treebanks, I propose an annotation scheme with various levels of specification.
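One possible realization of such a multi-level scheme is an entry that carries all degrees of specification side by side, letting each application choose the level whose statistics are dense enough for its purpose; the field names below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PatternEntry:
    category: str      # phrase type, e.g. "NP"
    pos_pattern: str   # least specified: "DET ADJ N"
    feat_pattern: str  # with morphosyntactic features
    lex_pattern: str   # fully lexicalized
    count: int         # corpus frequency of this pattern
```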