[ page: 1 2 3 4 5 6 7 ]

2. The creation of unannotated corpora is the initial stage in corpus development. This initial stage has already been reached in both Finland and Estonia, at least as far as their national languages are concerned. Finland's Kielipankki project boasts a total of over 20 million words of written Finnish text as of February 2000 [2]. The Corpus of Estonian Literary Language (CELL) at the University of Tartu contains a total of c. 4.8 million words of Estonian text from between the 1890s and the 1990s [3].

Though plain text corpora of the minor Baltic Finnic languages have yet to be created, corpus development in Finnish and Estonian is now in its second stage: the creation of annotated corpora, together with the development of corpus-linguistic tools with which linguists can make full use of the corpora in their research.

[2] See Kielipankin asiakkaan opas.
[3] My calculation. The official statement about the size of the baaskorpus or Core Corpus (texts from the 1980s – 1 million words) appears to be correct. The sizes of the other subcorpora: 1890s – 348,000 words, 1930s – 369,000, 1950s – 298,000, 1960s – 333,000, 1970s – 425,000, 1990s – 1,970,000. The 1990s corpus contains some texts from the 1980s. It is unknown whether there is any overlap between the 1990s corpus and the Core Corpus.
Last updated: May 3, 2003 — © 2000-2003 by Kazuto Matsumura