Multi-level annotation for spoken-language corpora
Philippe BLACHE
Daniel HIRST
LPL - Université de Provence
June 16, 2000
The analysis of the interactions between different levels of linguistic
analysis (prosody, syntax, semantics, pragmatics etc) requires the use of
corpora. There are however very few existing corpora with this type of
multi-level annotation. There are a number of serious obstacles to the
development of such resources including the choice of what information to
represent, the form that information should take and the way in which the
information can be exploited once it exists. In this paper we propose an
approach based on the use of annotation graphs adapted to the treatment of
spoken language data.
- What information to represent?
Classical annotation systems are not well adapted to this type of corpus,
since they cannot take into account several simultaneous levels of
representation. In particular, even when it is possible to introduce the
notion of "point of view", it is not possible to represent the embedding
of different levels. The fundamental difficulty lies in the fact that the
basic building-blocks of prosody and syntax are not superposable:
syllables and accent groups may well involve a different parsing than
morphemes and words, for example. It may also be useful to annotate
information such as tonal representations of intonation patterns, which
is not necessarily linked to a specific phoneme or even syllable but
which is more usefully treated as an autonomous level. The syntactic
analysis of spontaneous speech phenomena such as repetitions, false
starts, hesitations and corrections requires a non-hierarchical form of
representation. Finally, in the area of both prosody and syntax
(and no doubt elsewhere) it would be useful to be able to maintain
simultaneous ambiguous representations rather than imposing a single
interpretation.
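The non-superposability of segmentations can be made concrete: two spans can coexist in a single tree only if they nest or are disjoint. The Python sketch below (with hypothetical word and accent-group boundaries, not data from any corpus) tests this condition:

```python
# Spans are (start, end) pairs over a common sequence of time points.
# Two spans "cross" when each contains exactly one endpoint of the other,
# so neither can dominate the other in a single hierarchical tree.

def crosses(a, b):
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def compatible(tier1, tier2):
    """True if the two tiers could coexist in one hierarchy."""
    return not any(crosses(a, b) for a in tier1 for b in tier2)

# Hypothetical example: word spans vs. accent-group spans.
words = [(0, 2), (2, 5), (5, 7)]
accent_groups = [(0, 3), (3, 7)]   # boundary at 3 splits the word (2, 5)

print(compatible(words, words))          # True
print(compatible(words, accent_groups))  # False: (2, 5) and (0, 3) cross
```

Since neither tier refines the other, a single constituent tree cannot carry both; this is what motivates keeping each level on its own arcs.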
- What type of annotation?
Annotation graphs augmented with a system of typing provide an
interesting solution to these problems. In this approach, the acoustic
signal (when available) provides the fundamental baseline for reference. A
task-specific set of pointers to instants in the signal forms the set of
nodes, with no specific limits or linguistic constraints. The different
annotation labels are borne by different types of arcs, each type
corresponding to a given level of linguistic analysis.
The annotation of a sequence would thus be formed from the definition of a
set of nodes followed by a set of arcs which (unlike the nodes) carry the
linguistic information. The following examples illustrate a few
possibilities.
<arc level="pros">                            <!-- prosodic arc -->
  <begin id="node1"/>                         <!-- origin node -->
  <label type="phon" name="i"/>               <!-- phoneme /i/ -->
  <end id="node2"/>                           <!-- target node -->
</arc>
<arc level="pros">
  <begin id="node1"/>
  <label type="syl" name="its"/>              <!-- syllable arc /its/ -->
  <end id="node4"/>
</arc>
<arc level="synt">                            <!-- syntactic arc -->
  <begin id="node1"/>
  <label type="word" name="it" cat="pro"/>    <!-- label (pronoun "it") -->
  <end id="node3"/>
</arc>
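As a sketch of how such typed arcs might be stored and manipulated in practice, the Python fragment below (an illustrative data structure, not part of the proposal itself) rebuilds the three example arcs over a shared ordered set of nodes:

```python
# A minimal typed annotation graph: nodes are ordered pointers to
# instants in the signal; the arcs carry all the linguistic information.

class AnnotationGraph:
    def __init__(self, nodes):
        self.nodes = list(nodes)   # ordered node identifiers
        self.arcs = []

    def add_arc(self, level, begin, end, **label):
        """level: linguistic level (e.g. 'pros', 'synt'); label: typed content."""
        self.arcs.append({"level": level, "begin": begin,
                          "end": end, "label": label})

    def arcs_at_level(self, level):
        return [a for a in self.arcs if a["level"] == level]

g = AnnotationGraph(["node1", "node2", "node3", "node4"])
g.add_arc("pros", "node1", "node2", type="phon", name="i")              # phoneme /i/
g.add_arc("pros", "node1", "node4", type="syl", name="its")             # syllable /its/
g.add_arc("synt", "node1", "node3", type="word", name="it", cat="pro")  # pronoun "it"

print(len(g.arcs_at_level("pros")))   # 2
```

Note that the nodes themselves carry no linguistic content: the prosodic and syntactic arcs simply share the same timeline, so no level is forced to respect the segmentation of another.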
- Using multi-level annotations
This technique makes it possible to use a language like XML for the
representation of several different levels of information applied to the
same basic data, and hence to formulate queries referring simultaneously
to different levels of annotation. Rather than develop a specific query
language for this task, we propose adapting a generic query language,
SgmlQL, to the type of multi-level representation proposed here.
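A cross-level query of this kind could ask, for instance, "which syllables overlap a pronoun?". The Python sketch below illustrates only the intended semantics of such a query over the example arcs; it is not SgmlQL syntax:

```python
# Cross-level query over the example annotation: the order of the nodes
# gives each arc a span, and span overlap links prosodic arcs to
# syntactic arcs over the same stretch of signal.

nodes = ["node1", "node2", "node3", "node4"]
pos = {n: i for i, n in enumerate(nodes)}

arcs = [
    {"level": "pros", "begin": "node1", "end": "node2",
     "type": "phon", "name": "i"},
    {"level": "pros", "begin": "node1", "end": "node4",
     "type": "syl", "name": "its"},
    {"level": "synt", "begin": "node1", "end": "node3",
     "type": "word", "name": "it", "cat": "pro"},
]

def span(arc):
    return pos[arc["begin"]], pos[arc["end"]]

def overlaps(a, b):
    (s1, e1), (s2, e2) = span(a), span(b)
    return s1 < e2 and s2 < e1

# "Which syllables overlap a pronoun?"
hits = [a["name"] for a in arcs if a["type"] == "syl"
        and any(overlaps(a, b) for b in arcs if b.get("cat") == "pro")]
print(hits)   # ['its']
```

The point of a generic query language is precisely to express such joins between levels declaratively rather than by hand-written traversal code.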