[ page: 1 2 3 4 5 6 7 ]

3. Annotation means the addition of linguistic information to the corpus in a systematic and consistent way. Alternative terms are encoding and markup. Linguistic annotation of corpora is carried out by placing a tag alongside each word in the corpus to indicate relevant information, so tagging is yet another term for annotation.

There are different types of linguistic annotation of corpora: morpholexical, syntactic, semantic, pragmatic, etc.

Morpholexical annotation here is a cover term for three types of information on a word form: (1) which word class or part of speech it belongs to, (2) which lexical word or lexeme it is a form of, and (3) how it is morphemically decomposed.

Syntactic annotation shows the grammatical function of each constituent of a sentence, whereby the structure of the sentence is made explicit. Morpholexical and syntactic annotations, which we are going to discuss in a bit more detail later on, can be grouped together under the superordinate term of grammatical annotation.

Semantic or word-sense annotation is extensively used in the fields of machine translation and information retrieval, and has also become instrumental in lexicography in recent years. Similarly, pragmatic annotation is indispensable for machine translation and natural language understanding systems. As my primary concern in this paper is grammatical annotation, I will not discuss semantic and pragmatic annotation any further.

Technically, morpholexical annotation consists of part-of-speech tagging and lemmatization. Part-of-speech tagging provides each word-form with information on the word class (part of speech) which it belongs to. Lemmatization means the grouping of different word-forms or inflected forms that belong to the same lexeme or lemma, and generally involves the morphological analysis or morphemic decomposition of each word-form.
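The two steps can be illustrated with a minimal sketch. It assumes a tiny hand-made English lexicon (`TOY_LEXICON`, invented for illustration only) that stores exactly one analysis per word-form; the function names `annotate` and `group_by_lemma` are likewise hypothetical and do not correspond to any existing tagger or lemmatizer.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word-form -> (lemma, part of speech, morphemic decomposition).
# A real analyzer derives this information from rules and a large lexicon.
TOY_LEXICON = {
    "walks": ("walk", "V", "walk+s"),
    "walked": ("walk", "V", "walk+ed"),
    "walking": ("walk", "V", "walk+ing"),
    "book": ("book", "N", "book+0"),
    "books": ("book", "N", "book+s"),
}


def annotate(tokens):
    """Pair each token with its analysis, or with None if the form is unknown."""
    return [(tok, TOY_LEXICON.get(tok.lower())) for tok in tokens]


def group_by_lemma(annotated):
    """Lemmatization as grouping: collect the word-forms found for each (lemma, POS)."""
    groups = defaultdict(set)
    for form, analysis in annotated:
        if analysis is not None:
            lemma, pos, _segmentation = analysis
            groups[(lemma, pos)].add(form)
    return dict(groups)


if __name__ == "__main__":
    tagged = annotate(["walks", "walked", "walking", "books"])
    for form, analysis in tagged:
        print(form, analysis)
    print(group_by_lemma(tagged))
```

The sketch deliberately ignores ambiguity: each form has a single analysis here, whereas an actual analyzer may return several competing analyses for one word-form.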

Computer programs for part-of-speech tagging and lemmatization are usually called part-of-speech taggers and lemmatizers, respectively. In practice, the two processes go hand in hand and cannot be separated from each other, so in most cases one and the same program carries out both part-of-speech tagging and lemmatization. I will use the term morpholexical to express the inseparable nature of the two processes. Following the usage in Estonian computer linguistics, computer programs used in morpholexical tagging are referred to as morphological analyzers in this paper.

The only morphological analyzer of Estonian that I know of is ESTMORF (cf. Kaalep 1999). It was developed by the computer linguistics unit at the University of Tartu and is available from Filosoft. The morphological analyzer of Finnish I am familiar with is FINTWOL, a program based on the two-level model of morphology developed in the 1980s by Kimmo Koskenniemi of the University of Helsinki and distributed by Lingsoft. Both programs perform context-free word-by-word morphological analysis (see Tables 1 and 2). The capacity and performance of ESTMORF will be discussed later.

Table 1. Sample ESTMORF output

lood
  lood+0//_S_ sg n, //
  lood+d//_S_ pl n, //
  loog+d//_S_ pl n, //
  lugu+d//_S_ pl, n, //
  loo+d//_V_ d, //

Table 2. Sample FINTWOL output

"<toista>"
  "toistaa"  V PRES ACT NEG
  "toistaa"  V PRES ACT NEG
  "toistaa"  V IMPV ACT SG2
  "toistaa"  V IMPV ACT NEG SG
  "toinen"  Q PRON PTV SG
  "toinen"  ORD NUM PTV SG
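To make the consequence of context-free analysis concrete, the following sketch reads the ESTMORF lines of Table 1 into structured records. The line format is inferred from Table 1 alone, so the regular expression and the field names (`base`, `ending`, `pos`, `features`) are assumptions for illustration, not a description of the official ESTMORF output specification.

```python
import re
from typing import NamedTuple


class Analysis(NamedTuple):
    base: str        # part before "+" (assumed to be the stem or base form)
    ending: str      # part after "+" ("0" is assumed to mark a zero ending)
    pos: str         # word-class code between underscores, e.g. S or V
    features: tuple  # remaining grammatical codes, e.g. ("pl", "n")


# Assumed shape of one analysis line: base+ending//_POS_ features, //
LINE_RE = re.compile(
    r"^\s*(?P<base>[^+\s]+)\+(?P<ending>[^/]+)//_(?P<pos>[A-Z])_\s*(?P<feats>[^/]*)//\s*$"
)

SAMPLE = """\
lood
  lood+0//_S_ sg n, //
  lood+d//_S_ pl n, //
  loog+d//_S_ pl n, //
  lugu+d//_S_ pl, n, //
  loo+d//_V_ d, //
"""


def parse_line(line):
    """Return an Analysis for an analysis line, or None for any other line."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    feats = tuple(f for f in re.split(r"[,\s]+", match.group("feats")) if f)
    return Analysis(match.group("base"), match.group("ending"), match.group("pos"), feats)


def parse_analyses(text):
    """Collect every analysis found in a block of analyzer output."""
    return [a for a in map(parse_line, text.splitlines()) if a is not None]


if __name__ == "__main__":
    for analysis in parse_analyses(SAMPLE):
        print(analysis)
```

Run on the sample, the sketch yields five competing analyses for the single word-form lood. Choosing among them requires the sentence context, which a word-by-word analyzer does not consult.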

Syntactic annotation or parsing means the addition of information on the grammatical function of each word in a corpus so that the syntactic structure of each sentence is systematically described. Computer programs for syntactic parsing are called parsers.
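One common way to store such information is to record, for each word, the word that governs it and a label for its grammatical function. The sketch below assumes this dependency-style representation; the `Token` class, the example sentence, and the labels `det`, `subj` and `main` are hypothetical and are not taken from the tag set of any particular parser.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int     # 1-based position of the word in the sentence
    form: str      # the word-form as it appears in the text
    head: int      # index of the governing word; 0 marks the root of the sentence
    function: str  # grammatical function label (hypothetical tag set)


# Hand-annotated example sentence: "The dog barks"
SENTENCE = [
    Token(1, "The", 2, "det"),
    Token(2, "dog", 3, "subj"),
    Token(3, "barks", 0, "main"),
]


def root(sentence):
    """Return the single token that has no governing word."""
    roots = [tok for tok in sentence if tok.head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root token")
    return roots[0]


def dependents(sentence, head_index):
    """Return the forms of all words directly governed by the given word."""
    return [tok.form for tok in sentence if tok.head == head_index]


if __name__ == "__main__":
    main_word = root(SENTENCE)
    print(main_word.form, dependents(SENTENCE, main_word.index))
```

With every word linked to its head, the structure of the whole sentence can be recovered by following the links upward to the root.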

Parsers have been developed for many languages including Finnish and Estonian. Among them is the Functional Dependency Grammar (FDG) Parser of Finnish developed by Conexor in Finland [4], which was installed on our department's Linux server in June this year. As I have used the program for only a short time, I have little to say about it at the moment. I have not yet had the opportunity to get acquainted with the Constraint Grammar for Estonian (ESTCG), a parser of Estonian which is being developed at the University of Tartu.

Automatic syntactic tagging is much trickier than automatic morpholexical tagging, and even for languages like English a considerable part of parsing is still done manually.

[4] See Conexor FDG. Käyttäjän manuaali.
Last updated: May 3, 2003 — © 2000-2003 by Kazuto Matsumura