
6. The user interface I have developed in Perl for the Corpus of Estonian Literary Language needs to be improved as well as expanded.

In the first place, what I call "morpholexically annotated versions" of the University of Tartu's Estonian corpus files are not annotated versions in the strict sense of the term. They are the raw output of ESTMORF, so most of the words in them carry several alternative analyses, and these morpholexical annotations still need disambiguation.

Take the word või in the third sentence of Table 6, for example. Since ESTMORF performs only context-free morphological analysis, it offers five interpretations of this word as equally possible: (1) adverb, (2) conjunction, (3) the genitive singular of the noun või, (4) the nominative singular of the noun või, or (5) the second person singular imperative or the negation form of the verb võima. A human speaker of Estonian, on the other hand, understands each word in its actual context, so she has no difficulty grasping that here it is neither a noun nor a verb, and sees no homonymy.
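The ambiguity just described can be pictured as a list of candidate readings attached to a single word form. The following is a minimal sketch in Python, used here for illustration only. The record layout, the label strings and the function names are hypothetical and do not reproduce the actual ESTMORF output format. It stores the five readings of või and applies a toy filter that discards the noun and verb readings, standing in for the contextual knowledge a human reader brings to the sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str
    pos: str       # hypothetical part-of-speech label
    features: str  # hypothetical free-text feature description


# The five context-free readings of "või" listed in the text.
VOI_READINGS = [
    Reading("või", "adverb", ""),
    Reading("või", "conjunction", ""),
    Reading("või", "noun", "genitive singular"),
    Reading("või", "noun", "nominative singular"),
    Reading("võima", "verb", "2nd person singular imperative or negation form"),
]


def is_ambiguous(readings):
    """A word form is ambiguous when it has more than one reading."""
    return len(readings) > 1


def drop_pos(readings, excluded):
    """Keep only the readings whose part of speech is not excluded."""
    return [r for r in readings if r.pos not in excluded]


remaining = drop_pos(VOI_READINGS, {"noun", "verb"})
print(len(VOI_READINGS), "readings before filtering,", len(remaining), "after")
```

In this sketch two readings remain after filtering, so a real disambiguator would still have to choose between them on the basis of context.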

The ESTMORF staff have developed a computer program for the automatic disambiguation of ESTMORF output [5]. It remains to be seen to what extent the output of this program requires manual correction.

Secondly, because the morpholexical annotation depends entirely on ESTMORF, it inherits the morphological assumptions on which ESTMORF itself rests: the morphological information ESTMORF uses derives from Ülle Viks' model of Estonian inflectional morphology (Viks 1992).

There are certain details of ESTMORF output that one may find unsatisfactory. For example, ESTMORF treats the plural personal pronouns meie, teie and nemad as the plural forms of the singular personal pronouns mina, sina and tema, respectively. So meil, for example, is labeled as the adessive plural of mina. Personally, I would prefer the traditional analysis, according to which there are six lexically distinct personal pronouns and meil is the adessive form of the first person plural pronoun meie.
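One way to obtain the traditional analysis would be to post-process the analyzer output. The following is a minimal Python sketch, assuming each analysis has already been parsed into a (lemma, number, case) triple. That triple layout and the name `relemmatize` are hypothetical and are not part of ESTMORF.

```python
# Singular pronoun lemma -> plural pronoun lemma, following the pairing
# mina/meie, sina/teie, tema/nemad given in the text.
PLURAL_PRONOUN = {"mina": "meie", "sina": "teie", "tema": "nemad"}


def relemmatize(lemma, number, case):
    """Re-label a plural form of mina/sina/tema as a form of meie/teie/nemad.

    All other analyses are returned unchanged.
    """
    if number == "plural" and lemma in PLURAL_PRONOUN:
        return (PLURAL_PRONOUN[lemma], number, case)
    return (lemma, number, case)


# "meil", analyzed as the adessive plural of "mina", becomes a form of "meie".
print(relemmatize("mina", "plural", "adessive"))
```

Whether the number feature should be kept once the lemma is itself plural is a design choice that this sketch leaves open.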

Thirdly, the capacity of Perl-based corpus-linguistic tools should be expanded so that they can deal with texts in minor Baltic Finnic languages as well [6].

The most challenging obstacle for corpus-based linguistics in minor Baltic Finnic languages is the fact that the textual data available for these languages are mostly phonetic transcriptions of speech in the Finno-Ugric Phonetic Alphabet (FUPA). It is well known that FUPA resists full computerization at the moment.

It remains to be seen whether the future incorporation of the FUPA symbols into Unicode will make Baltic Finnic corpus creation less painstaking. For the time being, the only resort Baltic Finnic corpus linguists have seems to be ad hoc and clumsy ASCII transliterations of FUPA transcription.
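To give an idea of what such an ASCII transliteration involves, here is a minimal Python sketch of one possible scheme. It is an assumption made for illustration, not a scheme used in any existing corpus. The text is decomposed into base letters and combining diacritics, and every non-ASCII code point is written as an escape of the form {U+XXXX}. It can only handle symbols that already have a Unicode encoding, and it assumes the input contains no literal escapes of that form.

```python
import re
import unicodedata


def to_ascii(text):
    """Decompose to NFD, then write each non-ASCII code point as {U+XXXX}."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch if ord(ch) < 128 else "{U+%04X}" % ord(ch)
        for ch in decomposed
    )


def from_ascii(text):
    """Invert to_ascii, then recompose the result to NFC."""
    restored = re.sub(
        r"\{U\+([0-9A-F]{4,6})\}",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
    return unicodedata.normalize("NFC", restored)


encoded = to_ascii("v\u00f5i")  # the word "või"
print(encoded)
print(from_ascii(encoded) == "v\u00f5i")
```

The output is readable only with effort, which illustrates why transliterations of this kind are clumsy to work with even when they lose no information.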

[5] Filosoft has developed a disambiguation program called TAHMM for ESTMORF output. See Morfoloogiline ühestaja.
[6] With colleagues in Tartu and Tallinn, I have started a corpus project called Corpus of Baltic Finnic Languages. The languages to be included in the corpus are Karelian, Vepsian, Izhorian, Votian (cf. Tables 4 and 5), Livonian, and the Setu and Võru dialects of Estonian.
Last updated: May 3, 2003 — © 2000-2003 by Kazuto Matsumura