
4. Computer corpora would be of little practical use without programs that retrieve information from them for linguistic purposes. Such programs are generally known as corpus-linguistic tools. The most important of these is the concordance program, or concordancer, which presents search results in the form of a concordance.

The term concordance most commonly refers to a specific type of concordance called a KWIC concordance, where KWIC stands for Key Word In Context (see Table 3, as well as 6 and 7).

In a KWIC concordance, each occurrence of a particular feature or combination of features found in a corpus is listed together with a certain amount of context, that is, the text immediately preceding and following it. The principal search feature is highlighted in the center. Concordance lines are expected to be sorted in a language-specific alphabetical order: for example, ä and ö are treated as variants of a and o in the German alphabet, but they are independent letters coming after z and å in the Finnish alphabet, and z follows s and š but precedes t in the Estonian alphabet.
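The language-specific ordering described above can be illustrated with a minimal Python sketch. It is not taken from any existing concordancer: the alphabet strings and the function names (`make_sort_key`, `german_key`) are assumptions made for illustration, and real collation rules are more detailed than this.

```python
# Simplified alphabets, written out by hand for illustration only.
# Assumes precomposed (NFC) characters in the input.
FINNISH = "abcdefghijklmnopqrstuvwxyzåäö"
ESTONIAN = "abcdefghijklmnopqrsšzžtuvwõäöüxy"

def make_sort_key(alphabet):
    """Build a sort key that ranks each letter by its position in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word):
        # Characters outside the alphabet sort after all known letters.
        return [rank.get(ch, len(alphabet)) for ch in word.lower()]

    return key

finnish_key = make_sort_key(FINNISH)
estonian_key = make_sort_key(ESTONIAN)

def german_key(word):
    """Treat ä and ö as variants of a and o, as in the German example."""
    return word.lower().translate(str.maketrans("äö", "ao"))

words = ["zoo", "äiti", "talo", "sauna", "åbo"]
print(sorted(words, key=finnish_key))   # z before å before ä
print(sorted(words, key=estonian_key))  # z between s and t
print(sorted(["öl", "ofen", "zahl", "ärger", "apfel"], key=german_key))
```

The same word list comes out in a different order under each key, which is why a concordancer that is meant to be language-independent needs the sort order to be configurable rather than built in.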

Table 3. Sample KWIC concordance created by MyConc: Votian pajata-

                 nütt miE i * pajatan risittämizessä.
      vot, miä baba ol'assa * pajatan täm¨mää konstia.
    oomniis menin i mamalõõ * pajatan unta.
         nüd miE kat¨tiikaa * pajatan vaissi.
  i katti saap tolkkua, kui * pajatan vaissi.
                        miä * pajatan võõrass juttuu.
                    vot miä * pajatan õmaa eloo.
               i miä tällee * pajatan õm¨maa goor'aa.
                            * pajatap naapurilõõ siällä.
   – a miE juttõõn: kui siE * pajatat poigaakaa?
                            * pajatattii , etti ko tahto koiraa, i
                        ain * pajatattii , etti on kuumaa auta.
                            * pajatattii , jott se karu õli õllut
                   sis kõik * pajatattii , kõõs leevät pulmaD.
                   siiz ain * pajatattii , nagrattii, što jäid ell
                ühtä meessä * pajatattii , što emä ajõ tätä võttam
                            * pajatattii , što kabrios kreepostiz
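
A layout like that of Table 3 can be approximated in a few lines of Python. The sketch below is only an illustration and does not show how MyConc works: the function name `kwic`, its parameters, and the default sort key are assumptions. It right-aligns the left context, marks the keyword with an asterisk, and sorts the lines by the keyword and the text that follows it.

```python
import re

def kwic(lines, pattern, width=30, marker="* ", sort_key=str.lower):
    """Return KWIC lines for every match of the regex `pattern` in `lines`.

    `width` limits both the left context and the keyword-plus-right-context.
    `sort_key` is applied to the keyword and its right context; the default
    is a plain code-point comparison, not a language-specific ordering.
    """
    regex = re.compile(pattern)
    hits = []
    for line in lines:
        for m in regex.finditer(line):
            left = line[:m.start()][-width:]   # text preceding the match
            right = line[m.start():][:width]   # match and following text
            hits.append((left, right))
    hits.sort(key=lambda hit: sort_key(hit[1]))
    # Right-align the left context so that all keywords line up in a column.
    return [f"{left:>{width}}{marker}{right}" for left, right in hits]

sample = [
    "vot miä pajatan õmaa eloo.",
    "miä pajatan võõrass juttuu.",
    "pajatap naapurilõõ siällä.",
]
result = kwic(sample, r"\bpajata\w*", width=12)
for row in result:
    print(row)
```

Passing a different `sort_key` is where a language-specific alphabetical order would be plugged in.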

Along with concordancing, various types of frequency counts play an important role in corpus-based linguistic research. It is usually taken for granted that concordance programs carry out frequency counts automatically.
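
As a rough illustration of the simplest kind of frequency count, the following Python sketch tallies word forms in a list of text lines. The function name `word_frequencies` and the tokenization rule (runs of letters and digits, lowercased) are assumptions; a real tool would need a tokenizer suited to the transcription used in the corpus.

```python
import re
from collections import Counter

def word_frequencies(lines):
    """Count lowercased word forms; a word is any run of letters or digits."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"\w+", line.lower()))
    return counts

sample = [
    "miä pajatan võõrass juttuu.",
    "vot miä pajatan õmaa eloo.",
]
freq = word_frequencies(sample)
for word, n in freq.most_common(3):
    print(n, word)
```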

Concordancers may be corpus-specific or corpus-independent. An example of a corpus-specific program is SARA, a user interface specifically designed for on-line use of the British National Corpus (BNC).

Since corpus-specific programs are necessarily language-specific, they are perhaps a luxury for the minor Baltic Finnic languages, many of which are on the verge of extinction. A more realistic option for them would be a concordance program that is corpus-independent and more or less language-independent.

For a general Baltic Finnic concordance program to be implemented, there would first have to be a standardized system of Baltic Finnic grammatical annotation. Given the non-trivial differences between the output of ESTMORF (cf. Table 1) and that of FINTWOL (cf. Table 2), the prospect of a general Baltic Finnic concordancer being developed in the near future is rather bleak. Computational linguists in Finland and Estonia are not yet as interested in the minor Baltic Finnic languages as they are in their own mother tongues or in major European languages such as English.


Last updated: May 3, 2003 — © 2000-2003 by Kazuto Matsumura