ELTE BTK Magyar Nyelvtudományi és Finnugor Intézet

Fejes László - Novák Attila

Fejes, László - Novák, Attila (Budapest)

A very tight context: identification of meaning, word class and morphological category based on immediate neighbours

In our presentation, we introduce an automatic morphological disambiguation system for Ob-Ugric languages that attempts to identify the actual word class and morphosyntactic features of morphologically ambiguous word forms. It is based on a well-known method that has not, however, been applied to these languages until now.

In the past few years we have developed several morphological analyzers for different Uralic languages (for both standard and dialect forms). The output of a morphological analyzer is the list of all possible morphological analyses of the input word: each analysis identifies the stem and the grammatical categories the word form expresses. The process of selecting the analysis appropriate to the given context is called disambiguation. Disambiguation can be performed manually, automatically, or by combining the two approaches: candidate analyses are first ranked automatically, and the top-ranked candidate is then checked and, if needed, manually overridden. Automatic disambiguation can be based on a syntactic model, on statistics, or on both. For the automatic part we use a purely statistical model: it requires the least work to implement, and its error rate is generally low provided that the amount of training data is sufficient. Once this hybrid (automatic plus manual) disambiguation system has been implemented, its automatic part can be improved incrementally by adding each manually revised disambiguated text to the training corpus.
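The rank-then-override workflow described above can be sketched roughly as follows. This is only a minimal illustration under our own assumptions, not the system itself: the function names, the placeholder word form "formX" and its tags are invented, and candidates are ranked by plain frequency in hand-checked text rather than by a context-sensitive model.

```python
from collections import Counter

def rank_candidates(word, candidates, counts):
    """Order the analyzer's candidates for `word`, most frequent in training first."""
    return sorted(candidates, key=lambda a: counts[(word, a)], reverse=True)

# Invented toy data: "formX" and its tags are placeholders, not real Mansi or Khanty.
counts = Counter([
    ("formX", "N.NOM"),
    ("formX", "N.NOM"),
    ("formX", "V.IMP.2SG"),
])

# The analyzer (not shown) is assumed to return every possible analysis.
analyses = ["V.IMP.2SG", "N.NOM", "ADV"]
ranked = rank_candidates("formX", analyses, counts)
print(ranked)

# A human reviser overrides the top candidate; the revised text is then
# added to the training counts, which changes later rankings.
counts.update([("formX", "ADV")] * 3)
reranked = rank_candidates("formX", analyses, counts)
print(reranked)
```

The second ranking differs from the first only because the manually revised material was fed back into the counts, which is the incremental improvement loop in miniature.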

Almost all syntactic theories agree on one point: sentences have a multi-level, hierarchical semantic-grammatical structure. However, sentences can be realized only as flat structures, that is, as a series of word forms. Most syntactic rules are mapping rules between the hierarchical and the flat structure. The constituents of the sentence can follow each other in different ways, and most words and word forms can occur in different types of constituents (phrases). In practice this means that which word form precedes or follows a given word form appears to be random. However, word order has limited entropy: the possible neighbours of a word form differ greatly in how probable they are.
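The notion of limited entropy can be made concrete with a small calculation. The sketch below is our own illustration on an invented token sequence (the function name and the tokens "a", "b", "c" are assumptions): for each token it computes the entropy, in bits, of the distribution of its right-hand neighbour and compares it with the maximum that would be reached if every token type were an equally likely neighbour.

```python
import math
from collections import Counter, defaultdict

def successor_entropy(tokens):
    """Return {token: entropy in bits of the distribution of its right neighbour}."""
    followers = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        followers[left][right] += 1
    result = {}
    for left, counts in followers.items():
        total = sum(counts.values())
        result[left] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return result

# Invented sequence; tokens could equally be word forms or morphological tags.
tokens = ["a", "b", "a", "b", "a", "c"]
entropies = successor_entropy(tokens)

# Upper bound: all token types equally likely as the next item.
max_entropy = math.log2(len(set(tokens)))

for token, h in sorted(entropies.items()):
    print(f"{token}: {h:.3f} bits (maximum {max_entropy:.3f})")
```

A value well below the maximum means that the neighbour is far from random, and it is this regularity that a statistical disambiguator can exploit.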

The tool on which we based our automatic disambiguator is a statistical part-of-speech tagger that uses a trigram model (taking into account the two words preceding the current word) and identifies only the most probable morphological tag for each word in the input. We had to extend it so that it also identifies the stem and the appropriate gloss (sense). In our experiment, we took relatively small manually disambiguated corpora (about 5000 words each) of Mansi and Khanty texts. Part of the corpus for each language was used as training data, and the rest as test data for evaluating the performance of the system.
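To show how a trigram model uses the two preceding positions, here is a deliberately simplified sketch. It is not the tagger we used. It rests on several assumptions of ours: the context consists of the tags already chosen for the two preceding words, scores use add-one smoothing, decoding is greedy from left to right rather than a full search over tag sequences, and the words "x", "y", "z" and their tags are invented.

```python
from collections import Counter

START = "<s>"  # padding tag for sentence-initial positions

def train(sentences):
    """sentences: list of [(word, tag), ...]; returns trigram and emission counts."""
    trigrams, emissions = Counter(), Counter()
    for sentence in sentences:
        prev2, prev1 = START, START
        for word, tag in sentence:
            trigrams[(prev2, prev1, tag)] += 1
            emissions[(word, tag)] += 1
            prev2, prev1 = prev1, tag
    return trigrams, emissions

def tag_greedy(words, candidates, trigrams, emissions):
    """Choose, left to right, the candidate tag with the highest smoothed score."""
    prev2, prev1, output = START, START, []
    for word in words:
        best = max(
            candidates[word],
            key=lambda t: (trigrams[(prev2, prev1, t)] + 1)
            * (emissions[(word, t)] + 1),
        )
        output.append(best)
        prev2, prev1 = prev1, best
    return output

def accuracy(predicted, gold):
    """Share of positions where the predicted tag equals the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Invented training data: "y" is ambiguous between N and V.
training = [
    [("x", "N"), ("y", "V")],
    [("z", "ADJ"), ("y", "N")],
]
# Candidate tags per word form, as a morphological analyzer would supply them.
candidates = {"x": ["N"], "y": ["N", "V"], "z": ["ADJ"]}

trigrams, emissions = train(training)
after_x = tag_greedy(["x", "y"], candidates, trigrams, emissions)
after_z = tag_greedy(["z", "y"], candidates, trigrams, emissions)
print(after_x, after_z)
print(accuracy(after_x, ["N", "V"]))
```

In this toy setting the same ambiguous form receives different tags depending only on its left-hand context. The accuracy function stands for the kind of measurement one would take on held-out test data, not on the training sentences as is done here for brevity.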

In conclusion, we will present data on the performance of the statistical method, suggest some ways to improve it, and present some less obvious observations about word order in Khanty and Mansi.
