To be published in the proceedings of the IXth International Congress of Finno-Ugrists, Tartu, August 7-13, 2000

ESTMORF and Perl as
Corpus-Linguistic Tools for Estonian

Kazuto Matsumura
Tokyo

1. In the present paper I will use the term corpus to refer to any collection of language texts in a computer-readable format used in linguistic research. Here the term text refers to either a written text or a transcription of speech. Common usage in corpus-based linguistics, however, prefers a more precise definition of corpus, one that lays emphasis on the systematic nature of corpus design. For example (my italics):

"any systematic collection of speech or writing in a language or variety of a language" (Matthews 1997: 78)
"a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language" [1]

Note the use of expressions such as "systematic" and "selected and ordered according to explicit linguistic criteria" in the citations above.

The reason for my preference for a less precise definition of corpus here is practical. In the first place, most computer tools developed for corpus-linguistic purposes work equally well with any computer-readable electronic text, regardless of whether it conforms to the most demanding requirements of a corpus or has been put together in a haphazard way.

Secondly, if the more precise definition were to be adopted, then the number of languages for which a corpus is available for linguistic research would be reduced to only a few. For most languages, a corpus in the strict sense is at present an ideal, not a standard.

A corpus then may simply consist of plain text or sequences of orthographic words and punctuation. Such a corpus is called unannotated. Unannotated corpora, however, are rarely in the raw state of plain text. Rather, they are, so to speak, "minimally annotated" in the sense that the boundaries of units like paragraphs and sentences as well as specialized roles of headings, citations, etc., are indicated for written texts, and the beginnings and ends of individual utterances for transcriptions of spoken discourse.
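
As an illustration of what "minimally annotated" can mean in practice, the following is a minimal sketch in Python of how plain text might be given paragraph and sentence boundaries. The function name minimally_annotate, the tag names <p> and <s>, and the sentence-splitting rule are all assumptions made for this example; they are not taken from any particular corpus or tool, and the rule is deliberately naive (it would, for instance, split after abbreviations).

```python
import re

def minimally_annotate(plain_text):
    """Mark paragraph and sentence boundaries in plain text.

    Assumptions for this sketch: paragraphs are separated by blank lines,
    and a sentence ends at '.', '!' or '?' followed by whitespace.
    The <p> and <s> tag names are illustrative only.
    """
    # Split on blank lines and discard empty chunks.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", plain_text) if p.strip()]
    lines = []
    for para in paragraphs:
        # Naive sentence splitter: break after sentence-final punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", para)
        lines.append("<p>")
        lines.extend(f"<s>{s}</s>" for s in sentences)
        lines.append("</p>")
    return "\n".join(lines)

# A short Estonian sample consisting of two paragraphs.
sample = "Tere! Kuidas läheb?\n\nMa elan Tartus."
annotated = minimally_annotate(sample)
print(annotated)
```

The output contains nothing beyond the orthographic words and punctuation of the input plus the boundary markers, which is the sense in which such a corpus is still unannotated.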

[1] John Sinclair cited by Aston and Burnard (1998), p.4.
Last updated: May 3, 2003 — © 2000-2003 by Kazuto Matsumura