Resource-Light Acquisition of Inflectional Paradigms

Radoslav Klíč (1) and Jirka Hana (2)

(1) Geneea Analytics, Velkopřevorské nám. 1, 118 00 Praha 1, radoslav.klic@gmail.com
(2) MFF UK, Malostranské nám. 25, 118 00 Praha 1, jirka.hana@gmail.com

Abstract: This paper presents resource-light acquisition of morphological paradigms and a lexicon for fusional languages. It builds upon Paramor [10], an unsupervised system, by extending it: (1) to accept a small seed of manually provided word inflections with marked morpheme boundaries; (2) to handle basic allomorphic changes, acquiring the rules from the seed and/or from previously acquired paradigms. The algorithm has been tested on Czech and Slovene tagged corpora and has shown an increased F-measure in comparison with the Paramor baseline.

1 Introduction

Morphological analysis is used in many computer applications ranging from web search to machine translation. As Hajič [6] shows, for languages with high inflection, a morphological analyzer is an essential part of a successful tagger.

Modern morphological analysers based on supervised machine learning and/or hand-written rules achieve very high accuracy. However, the standard way to create them for a particular language requires a substantial amount of time, money and linguistic expertise. For example, the Czech analyzer by [7] uses a manually created lexicon with 300,000+ entries. As a result, most of the world's languages and dialects have no realistic prospect of morphological analyzers created in this way.

Various techniques have been suggested to overcome this problem, including unsupervised methods acquiring morphological information from an unannotated corpus. While completely unsupervised systems are scientifically interesting, shedding light on areas such as child language acquisition or general learnability, for many practical applications their precision is still too low. They also completely ignore linguistic knowledge accumulated over several millennia, often failing to discover rules that can be found in basic grammar books.

Lightly supervised systems aim to improve upon the accuracy of unsupervised systems by using a limited amount of resources. One such system for fusional languages is described in this paper.

Using a reference grammar, it is relatively easy to provide information about inflectional endings, possibly organized into paradigms. In some languages, an analyzer built on such information would have an acceptable accuracy (e.g., in English, most words ending in ed are past/passive verbs, and most words ending in est are superlative adjectives). However, in many languages, the number of homonymous endings is simply too high for such a system to be useful. For example, the ending a has about 19 different meanings in Czech [4].

Thus our goal is to discover inflectional paradigms, each with a list of words declining according to it; in other words, we discover a list of paradigms and a lexicon. We do not attempt to assign morphological categories to any of the forms. For example, given an English corpus, the program should discover that talk, talks, talking, talked are forms of the same word, and that work, push, pull, miss, ... decline according to the same pattern. However, it will not label talked as a past tense, and not even as a verb.

This kind of shallow morphological analysis has applications in information retrieval (IR), for example in search engines. For most queries, users are not interested only in the particular word forms they entered but also in their inflected forms. In highly inflectional languages, such as Czech, dealing with morphology in IR is a necessity. Moreover, the result can also be used as a basis for a standard morphological analyzer after labeling endings with morphological tags and adding information about closed-class/irregular words.

As the basis of our system, we chose Paramor [10], an algorithm for unsupervised induction of inflectional paradigms and morphemic segmentation. We extended it to handle basic phonological/graphemic alternations and to accept seeding paradigm-lexicon information.

The rest of this paper is organized as follows: First, we discuss related work on unsupervised and semi-supervised learning. Then follows a section about the baseline Paramor model. After that, we motivate and describe our extension to it. Finally, we report the results of experiments on Czech and Slovene.

2 Previous Work

Perhaps the best known unsupervised morphological analysers are Goldsmith's Linguistica [5] and the Morfessor [1, 2, 3] family of algorithms.

Goldsmith uses a minimum description length (MDL; [14]) approach to find the morphology model which allows the most compact corpus representation. His Linguistica software returns a set of signatures which roughly correspond to paradigms.

Unlike Linguistica, Morfessor splits words into morphemes in a hierarchical fashion. This makes it more suitable for agglutinative languages, such as Finnish or Turkish, with a large number of morphemes per word. A probabilistic model is used to tag each morph as a prefix, suffix or stem. Kohonen et al. [9] improve the results of Morfessor by providing a small set (1000+ for English, 100+ for Finnish) of correctly segmented words. While the precision slightly drops, the recall is significantly improved for both languages. Tepper and Xia [17] use handwritten rewrite rules to improve Morfessor's performance by recognising allomorphic variations.

The approaches by Yarowsky and Wicentowski [18] and Schone and Jurafsky [15] aim at combining different information sources (e.g., corpus frequencies, edit distance similarity, or context similarity) to obtain a better analysis, especially for irregular inflection.

A system requiring significantly more human supervision is presented by Oflazer et al. [13]. This system takes a manually entered paradigm specification as input and generates a finite-state analyser. The user is then presented with words in the corpus which are not accepted by the analyser, but are close to an accepted form. The user may then adjust the specification, and the analyser is iteratively improved.

Feldman and Hana [8, 4] build a system which relies on a manually specified list of paradigms, basic phonology and closed-class words, and uses a raw corpus to automatically acquire a lexicon. For each form, all hypothetical lexical entries consistent with the information about the endings are created. Then competing entries are compared and only those supported by the highest number of forms are retained. Most of the remaining entries are still non-existent; however, in the majority of cases, they license the same inflections as the correct entries, differing only in rare inflections.

3 Paramor

Our approach builds upon Paramor [10, 11, 12], another unsupervised approach to the discovery of inflectional paradigms.

Due to data sparsity, not all inflections of a word are found in a corpus. Therefore Paramor does not attempt to reconstruct full paradigms, but instead works with partial paradigms, called schemes. A scheme contains a set of c(andidate)-suffixes and a set of c(andidate)-stems inflecting according to this scheme. The corpus must contain the concatenation of every c-stem with every c-suffix in the same scheme. Thus, a scheme is uniquely defined by its c-suffix set. Several schemes might correspond to a single morphological paradigm, because different stems belonging to the paradigm occur in the corpus with different sets of inflections.

The algorithm to acquire schemes has several steps:

1. Initialization: It first considers all possible segmentations of forms into candidate stems and endings.

2. Bottom-up search: It builds schemes by adding endings that share a large number of associated stems.

3. Scheme clustering: Similar schemes (as measured by cosine similarity) are merged.

4. Pruning: Schemes proposing frequent morpheme boundaries not consistent with boundaries proposed by a character entropy measure are discarded.

Paramor works with types, not tokens. Thus it does not use any information about the frequency or context of forms. Below, we describe some of the steps in more detail.
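To make the initialization step concrete, the following sketch (our illustration, not Paramor's released code; all names are ours) splits every corpus type in all possible ways and collects the single-suffix schemes from which the bottom-up search starts:

```python
from collections import defaultdict

def initial_stem_suffix_map(types, min_stem_len=1):
    """Initialization: split every corpus type in all possible ways into a
    candidate stem and a candidate ending, and record which c-stems each
    c-suffix was split from.  The empty ending is written '0', as in the
    paper.  Each (c-suffix, c-stem set) entry is a single-suffix scheme;
    since a scheme is uniquely defined by its c-suffix set, this map is all
    the bottom-up search needs to start from."""
    stems_of_suffix = defaultdict(set)
    for w in types:
        for i in range(min_stem_len, len(w) + 1):
            stem, suffix = w[:i], w[i:] or "0"
            stems_of_suffix[suffix].add(stem)
    return stems_of_suffix

# initial_stem_suffix_map({"talk", "talks", "talked"}) contains, among others,
# 's' -> {'talk'}, 'ed' -> {'talk'}, '0' -> {'talk', 'talks', 'talked'}.
```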
3.1 Bottom-up Search

In this phase, Paramor performs a bottom-up search of the scheme lattice. It starts with schemes containing exactly one c-suffix. For each of them, Paramor ascends the lattice, adding one c-suffix at a time until a stopping criterion is met. The c-suffix selected for adding is the one with the biggest c-stem ratio. (Adding a c-suffix to a scheme reduces the number of stems, and the suffix reducing it the least is selected. The c-stem ratio is the ratio between the number of stems in the candidate higher-level scheme and in the current scheme.) When the highest possible c-stem ratio falls under 0.25, the search stops. It is possible to reach the same scheme from multiple searches. For example, a search starting from the scheme (-s) can continue by adding (-ing) and end by adding (-ed), thus creating the scheme (-s, -ing, -ed). Another search starting from (-ed) can continue by adding (-s) and then (-ing), creating a redundant scheme. Such duplicates are discarded.
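The greedy ascent can be sketched as follows (our reconstruction from the description above; the 0.25 threshold is Paramor's, while the function and parameter names are ours):

```python
def grow_scheme(seed_suffix, stems_of_suffix, threshold=0.25):
    """Greedy ascent of the scheme lattice from a single-suffix scheme.
    At every step the c-suffix whose addition keeps the largest fraction of
    the current c-stems (the c-stem ratio) is added; the search stops once
    even the best ratio falls under the threshold (0.25 in Paramor)."""
    suffixes = {seed_suffix}
    stems = set(stems_of_suffix[seed_suffix])
    while True:
        best_suffix, best_stems, best_ratio = None, set(), 0.0
        for cand, cand_stems in stems_of_suffix.items():
            if cand in suffixes:
                continue
            shared = stems & cand_stems
            ratio = len(shared) / len(stems) if stems else 0.0
            if ratio > best_ratio:
                best_suffix, best_stems, best_ratio = cand, shared, ratio
        if best_suffix is None or best_ratio < threshold:
            # A scheme is identified by its c-suffix set, so identical schemes
            # reached from different seed suffixes count as duplicates.
            return frozenset(suffixes), stems
        suffixes.add(best_suffix)
        stems = best_stems
```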
3.2 Scheme Clustering

The resulting schemes are then subjected to agglomerative bottom-up clustering to group together schemes which partially cover the same linguistic paradigm. For example, if the first phase generated the schemes (-s, -ing) and (-ing, -ed), the clustering phase should put them in the same scheme cluster. To determine the proximity of two scheme clusters, the sets of words generated by the clusters are compared by cosine similarity: proximity(X, Y) = |X ∩ Y| / √(|X|·|Y|). A scheme cluster generates the set of words which is the union of the sets generated by the schemes it contains (not a Cartesian product of all stems and suffixes throughout the schemes). In order to be merged, clusters must satisfy some conditions, e.g. for any two suffixes in the cluster, there must be a stem in the cluster which can combine with both of them.
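In code, the proximity measure might look as follows (a minimal sketch; we assume a cluster is given as a list of (c-suffix set, c-stem set) pairs, and the merge preconditions mentioned above are not shown):

```python
from math import sqrt

def scheme_words(suffixes, stems):
    """Words licensed by a single scheme: every c-stem concatenated with
    every c-suffix ('0' stands for the empty ending)."""
    return {stem + (suf if suf != "0" else "")
            for stem in stems for suf in suffixes}

def cluster_words(cluster):
    """A scheme cluster generates the union of its schemes' word sets, not a
    Cartesian product of all stems and suffixes across the schemes."""
    return set().union(*(scheme_words(sufs, stems) for sufs, stems in cluster))

def proximity(cluster_x, cluster_y):
    """Cosine similarity of the generated word sets: |X ∩ Y| / sqrt(|X|·|Y|)."""
    x, y = cluster_words(cluster_x), cluster_words(cluster_y)
    return len(x & y) / sqrt(len(x) * len(y))
```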
3.3 Pruning

After the clustering phase, there are still too many clusters remaining and pruning is necessary. In the first pruning step, clusters which generate only a small number of words are discarded. Then clusters modelling morpheme boundaries inconsistent with letter entropy are dropped.

4 Our Approach

4.1 Overview

We have modified the individual steps in Paramor's pipeline (1) to use a manually provided seed of inflected words divided into stems and suffixes, and (2) to take into account basic allomorphy of stems. Figure 1 shows the phases of Paramor with dashed boxes representing our alterations.

[Figure 1: Altered Paramor's pipeline (our alterations are in dashed boxes and outside the Paramor box). The corpus feeds Paramor's stem-suffix map, bottom-up search and scheme clustering, producing word clusters; the manual seed additionally drives the induction of stem-change rules and deep stems, supplies more starting schemes for the bottom-up search, and keeps some clusters from being discarded.]

In the bottom-up search phase and the scheme cluster filtering phase, we use manually provided examples of valid suffixes and their grouping into sub-paradigms to steer Paramor towards creating more adequate schemes and scheme clusters. The data may also contain allomorphic stems, which we use to induce simple stem rewrite rules. Using these rules, some of the allomorphic stems in the corpus can be discovered and used to find more complete schemes.

Note that the Paramor algorithm is based on several heuristics with many parameters whose values were set experimentally. We used the same settings. Moreover, when we applied similar heuristics in our modifications, we used analogical parameter values.

4.2 Scheme Seeding

The manual seed contains a simple list of inflected words with a marked morpheme boundary. A simple example in English would be:

talk+0, talk+s, talk+ed, talk+ing
stop+0, stop+s, stopp+ed, stopp+ing
chat+0, chat+s, chatt+ed, chatt+ing

This can be written in an abbreviated form as:

talk, stop/stopp, chat/chatt + 0, s / ed, ing

The data are used to enhance Paramor's accuracy in discovering the correct schemes and scheme clusters in the following way (a sketch follows this list):

1. In the bottom-up search, Paramor starts with single-suffix schemes. We added a 2-suffix scheme to the starting scheme set for every suffix pair from the manual data belonging to the same inflection. Note that we cannot simply add a scheme containing all the suffixes of the whole paradigm, as many of the forms will not be present in the corpus.

2. Scheme clusters containing suffixes similar to some of the manually entered suffix sets are protected from the second phase of the cluster pruning. More precisely, a cluster is protected if at least half of its schemes share at least two suffixes with a particular manual suffix set.
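The following sketch illustrates both uses of the seed. It is our reading of the description above: we take "suffix pair belonging to the same inflection" to mean any pair of endings occurring in one seed lexeme's paradigm, and all names are ours; the authors' implementation may group the pairs differently.

```python
from itertools import combinations

# One seed entry per lexeme: (stem, ending) pairs with the boundary marked by
# hand, i.e. the expansion of "talk, stop/stopp, chat/chatt + 0, s / ed, ing".
seed = [
    [("talk", "0"), ("talk", "s"), ("talk", "ed"), ("talk", "ing")],
    [("stop", "0"), ("stop", "s"), ("stopp", "ed"), ("stopp", "ing")],
    [("chat", "0"), ("chat", "s"), ("chatt", "ed"), ("chatt", "ing")],
]

def extra_starting_schemes(seed):
    """Use 1: a 2-suffix starting scheme for every pair of suffixes occurring
    in the same seed paradigm (pairs rather than whole paradigms, because many
    full paradigms are not attested in the corpus)."""
    pairs = set()
    for lexeme in seed:
        suffixes = sorted({suffix for _stem, suffix in lexeme})
        pairs.update(frozenset(p) for p in combinations(suffixes, 2))
    return pairs   # for the seed above: {0,s}, {0,ed}, {s,ing}, {ed,ing}, ...

def is_protected(cluster_schemes, manual_suffix_sets):
    """Use 2: a scheme cluster is protected from the second pruning phase if
    at least half of its schemes share >= 2 suffixes with some manual set."""
    matching = sum(1 for scheme in cluster_schemes
                   if any(len(scheme & manual) >= 2
                          for manual in manual_suffix_sets))
    return 2 * matching >= len(cluster_schemes)
```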
4.3 Allomorphy

Many morphemes have several contextually dependent realizations, so-called allomorphs, due to phonological/graphemic changes or irregularities. For example, consider the declension of the Czech word matka 'mother' in Table 1. It exhibits a stem-final consonant change (palatalisation of k to c) triggered by the dative and locative singular ending, and epenthesis (insertion of -e-) in the bare-stem genitive plural.

Case   Singular   Plural
nom    matk+a     matk+y
gen    matk+y     matek+0
dat    matc+e     matk+ám
acc    matk+u     matk+y
voc    matk+o     matk+y
loc    matc+e     matk+ách
inst   matk+ou    matk+ami

Table 1: Declension of the word matka "mother". The stem alternates between matk, matc and matek.

Paramor ignores allomorphy completely (and so do Linguistica and Morfessor). There are at least two reasons to handle allomorphy. First, linguistically, it makes more sense to analyze winning as win+ing than as winn+ing or win+ning. For many applications, such as information retrieval, it is helpful to know that two morphs are variants of the same morpheme. Second, ignoring allomorphy makes the data appear more complicated and noisier than they actually are. Thus, the process of learning morpheme boundaries or paradigms is harder and less successful.

The latter problem might manifest itself in Paramor's bottom-up search phase: a linguistically correct suffix triggering a stem change might be discarded, because Paramor does not consider stem allomorphs to be variants of the same stem, and the c-stem ratio may drop significantly. Furthermore, incorrect c-suffixes may be selected. For example, suppose there are 5 English verbs in the corpus: talk, hop, stop, knit, chat, together with their -s forms (talks, hops, stops, knits, chats) and -ing forms (talking, hopping, stopping, knitting, chatting). Let's assume we already have a scheme {0, s} with 5 stems. Unfortunately, a simple ing suffix (without stem-final consonant doubling) combines with only one of the 5 stems, therefore adding ing to the scheme would decrease the number of its stems to 1, leaving only talk in the scheme.
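A toy computation for this example shows why the surface-only view rejects ing (the 0.25 threshold is Paramor's stopping criterion from Section 3.1; the data are just the five verbs above):

```python
# Surface stems attested with each ending in the toy corpus above:
stems_of_suffix = {
    "0":   {"talk", "hop", "stop", "knit", "chat"},
    "s":   {"talk", "hop", "stop", "knit", "chat"},
    "ing": {"talk"},   # talking; hopping, stopping, ... have doubled stems
}

current_stems = stems_of_suffix["0"] & stems_of_suffix["s"]   # scheme (0, s)
ratio = len(current_stems & stems_of_suffix["ing"]) / len(current_stems)
print(ratio)   # 0.2 < 0.25, so 'ing' is never added to the scheme (0, s)
```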
However, for most languages the full specification of rules constraining allomorphy is not available, or at least is not precise enough. Therefore, we automatically induce a limited number of simple rules from the seed examples and/or from the scheme clusters obtained from a previous run of the algorithm. Such rules both over- and undergenerate, but nevertheless they do improve the accuracy of the whole system. For languages where formally specified allomorphic rules are available, they can be used directly along the lines of Tepper and Xia [17, 16]. For now, we consider only stem-final changes, namely vowel epenthesis (e.g., matk-a – matek-0) and alternation of the final consonant (e.g., matk-a – matc-e). The extension to other processes such as root vowel change (e.g., English foot – feet) is quite straightforward, but we leave it for future work.

Stem change rule induction and application. Formally, the process can be described as follows. From every pair of stem allomorphs in the manual input, sδ1 and sδ2, where s is their longest common initial substring (should δ1 or δ2 be empty, one final character is removed from s and prepended to both δ1 and δ2), with suffix sets f1, f2, we generate a rule *δ1 → *δ2 / (f1, f2) and also the reverse rule *δ2 → *δ1 / (f2, f1). The notation *δ1 → *δ2 / (f1, f2) means "transform a stem xδ1 into xδ2 if the following conditions hold":

1. xδ2 is a c-stem present in the corpus.

2. The c-suffix set f1x (from the corpus) of the c-stem xδ1 contains at least one of the suffixes from f1 and contains no suffix from f2.

3. The c-suffix set f2x of the c-stem xδ2 contains at least one of the suffixes from f2 and contains no suffix from f1.

Induced rules are applied after the initialisation phase. So-called deep stems are generated from the c-stems; a deep stem is defined as a set of surface stems. To obtain the deep stem for a c-stem t, the operation of expansion is applied. Expansion works as a breadth-first search using a queue initialised with t and keeping track of the set D of already generated variants. While the queue is not empty, the first member is removed and its variants are found by application of all the rules. (The result of applying a rule is non-empty only if the rule is applicable and its right-hand side is present in the corpus.) Variants which have not been generated so far are added to the back of the queue and to D. When the queue is emptied, D becomes the deep stem associated with t and all other members of D. The bottom-up search and all the following phases of the Paramor algorithm then use the deep stems instead of the surface ones. A sketch of both operations follows.
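This is our reconstruction of the rule application and the expansion, not the authors' code; the empty ending is written '0' in the suffix sets as before, and all names are ours.

```python
from collections import deque, namedtuple

# *delta1 -> *delta2 / (f1, f2): rewrite a stem ending in delta1 into one
# ending in delta2, conditioned on the suffix sets f1 and f2 (see above).
Rule = namedtuple("Rule", "delta1 delta2 f1 f2")

def apply_rule(rule, stem, suffixes_of_stem):
    """Return the rewritten stem, or None if the rule is not applicable.
    suffixes_of_stem maps every c-stem in the corpus to its c-suffix set."""
    if not stem.endswith(rule.delta1):
        return None
    variant = stem[:len(stem) - len(rule.delta1)] + rule.delta2
    if stem not in suffixes_of_stem or variant not in suffixes_of_stem:
        return None                                     # condition 1
    f1x, f2x = suffixes_of_stem[stem], suffixes_of_stem[variant]
    if not (f1x & rule.f1) or (f1x & rule.f2):          # condition 2
        return None
    if not (f2x & rule.f2) or (f2x & rule.f1):          # condition 3
        return None
    return variant

def expand(stem, rules, suffixes_of_stem):
    """Breadth-first expansion: the deep stem of `stem` is the set of all
    surface variants reachable by repeated application of the rules."""
    deep, queue = {stem}, deque([stem])
    while queue:
        current = queue.popleft()
        for rule in rules:
            variant = apply_rule(rule, current, suffixes_of_stem)
            if variant is not None and variant not in deep:
                deep.add(variant)
                queue.append(variant)
    return frozenset(deep)

# E.g. the seed pair matk-a ~ matc-e yields Rule('k', 'c', {'a', ...}, {'e', ...})
# and its reverse; expand('matk', ...) can then return {'matk', 'matc', 'matek'}.
```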
Stem change rule induction from scheme clusters. In addition to deriving allomorphic rules from the manual seed, we also use a heuristic for detecting stem allomorphy in the scheme clusters obtained from a previous run of the algorithm. Stem allomorphy increases the sparsity problem and might prevent Paramor from finding some paradigms. However, if the stem changes are systematic and frequent, Paramor does create the appropriate scheme clusters; it merely considers the changing part of the stem to be part of the suffix.

As an example, consider again the declension of the Czech word matka "mother" in Table 1. Paramor's scheme cluster with the suffixes ce, ek, ka, kami, kou, ku, ky, kách, kám has correctly discovered 9 of the paradigm's 10 suffixes (all except the vocative singular, which is rarely used), but fused together with parts of the stem. The presence of such a scheme cluster in the result is a hint that there may be a c/k alternation and epenthesis in the language.

The first phase of the algorithm for deciding whether a scheme cluster with a c-suffix set f is interesting in this respect is the following (a code sketch is given below):

1. If f contains a c-suffix without a consonant, return false.

2. Let cc be the count of unique initial consonants found in the c-suffixes in f.

3. If cc > 2, return false. (The morpheme boundary was probably incorrectly shifted to the left.)

4. If cc = 1 and f doesn't contain any c-suffix starting with a vowel, return false. (No final consonant change, no epenthesis.)

5. Return true.

If a scheme cluster passes this test, each of its stems' subparadigms is examined. The subparadigm for a stem s consists of s and fs – all the c-suffixes from f with which s forms a word in the corpus. For example, let's have a stem s = mat with fs = {ce, ek, ka, ku, ky}. Now, the morpheme boundary is shifted so that it is immediately to the right of the first consonant of the original c-suffixes. In our example, we get 3 stem variants: matk + a, u, y; matc + e; matek + 0. To reduce falsely detected phonological changes, we check whether each stem variant's suffix set contains at least one of the c-suffixes that Paramor has already discovered in other scheme clusters. If the condition holds, rules with the same syntax as the manual data are created, for example matk / matc / matek + a, u, y / e / 0. All generated rules are gathered in a file and can be used in the same way as the manual seed, or just for the induction of phonological rules.
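The test itself can be sketched as follows (our reconstruction; the vowel inventory is a rough assumption, and the handling of the empty ending is left out):

```python
VOWELS = set("aeiouyáéíóúůýěäô")   # rough Czech/Slovene vowel inventory (assumed)

def has_consonant(c_suffix):
    return any(ch not in VOWELS for ch in c_suffix)

def is_interesting(f):
    """First-phase test: does the c-suffix set f of a scheme cluster hint at a
    stem-final consonant change or epenthesis?"""
    if not all(has_consonant(suf) for suf in f):
        return False        # 1. some c-suffix contains no consonant
    cc = len({suf[0] for suf in f if suf[0] not in VOWELS})   # 2.
    if cc > 2:
        return False        # 3. boundary probably shifted too far to the left
    if cc == 1 and not any(suf[0] in VOWELS for suf in f):
        return False        # 4. no final-consonant change and no epenthesis
    return True             # 5.

# The cluster from the matka example passes the test (initial consonants c, k),
# so each stem's subparadigm is then re-split after the first consonant,
# e.g. mat+ce -> matc+e, mat+ek -> matek+0.
print(is_interesting({"ce", "ek", "ka", "kami", "kou", "ku", "ky", "kách", "kám"}))  # True
```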
5 Experiments and Results

We tested our approach on Czech and Slovene lemmatised corpora. For Czech, we used two differently sized subsets of the PDT 1 corpus. The first, marked as cz1, contains 11k types belonging to 6k lemmas. The second, cz2, has 27k types and 13k lemmas and is a superset of cz1. The purpose of having two Czech corpora was to observe the effect of data size on the performance of the algorithm. The Slovene corpus si is a subset of the jos100k corpus V2.0 (http://nl.ijs.si/jos/jos100k-en.html) with 27k types and 15.5k lemmas.

The manual seed consisted of inflections of 18 lemmas for Czech and inflections of 9 lemmas for Slovene. In both cases, examples of nouns, adjectives and verbs were provided. They were obtained from a basic grammar overview. For Czech, we also added information about the only two inflectional prefixes (the negative prefix ne and the superlative prefix nej). The decision which prefixes to consider inflectional and which not is to a certain degree arbitrary (e.g., it can be argued that ne is a clitic and not a prefix), therefore it makes sense to provide such information manually. (Prefixes were implemented by a special form of the stem transformation rules introduced in Section 4.3 which create deep stems consisting of a stem with and without the given prefix.)

5.1 Evaluation Method

We evaluated the experiments only on types at least 6 characters long, which Paramor uses for learning. That means 8.5k types and 4,500 lemmas for cz1, 21k types and 10k lemmas for cz2, and 21k types and 12k lemmas for si.

Since the corpora we used do not have morpheme boundaries marked, we could not use the same evaluation method as the authors of Paramor and Morfessor – measuring the precision and recall of placing morpheme boundaries. On the other hand, the corpora are lemmatised and we can evaluate whether the types grouped into paradigms by the algorithm correspond to the sets of types belonging to the same lemma.

We use the following terminology in this section: a word group is a set of words returned by our system; a word paradigm is a set of words from the corpus sharing the same lemma. Both word groups and word paradigms are divisions of the corpus into disjoint sets of words. An autoseed is a seed generated by the heuristic described in Section 4.3.

Since Paramor only produces schemes and scheme clusters, we need an additional step to obtain word groups. We generated the word groups by bottom-up clustering of words using the paradigm distance, which is designed to group together words generated by similar sets of scheme clusters. To compute the paradigm distance of two words w1, w2, we find the set of all scheme clusters which generate w1 and compute its cosine similarity to the analogical set for w2. (We also have to check whether w1 and w2 have the same stem; so, in fact, we are comparing sets of pairs ⟨scheme cluster, c-stem⟩, to make sure only words sharing c-stems are grouped together.) In the simplest case, two forms of a lemma will be generated by just one scheme cluster and therefore get distance 1. For a more complicated example, let's take two Czech words: otrávení "poisoned masc. anim. nom. pl." and otrávený "poisoned masc. anim. nom. sg.". The first one was generated by scheme clusters 33 and 41, both with otráv as a stem. The second word was generated by scheme cluster 41 with otráv as a stem and by scheme cluster 45 with otráven as a stem. That means that only scheme cluster 41 generates both words, and their paradigm distance is 1/√(2·2) = 0.5.

Precision and recall of the word groups can be computed in the following way. To compute precision, start with p = 0; for each word group, find a word paradigm with the largest intersection and add the intersection size to p; then precision = p / total number of words. For recall, start with r = 0; for each word paradigm, find a word group with the largest intersection and add the intersection size to r; then recall = r / total number of words. F1 is the standard balanced F-score.
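Both measures are straightforward to implement. The sketch below assumes word groups and word paradigms are given as lists of disjoint sets of types; the additional c-stem check for the paradigm distance (mentioned above) is omitted, and the names are ours.

```python
from math import sqrt

def paradigm_distance(w1, w2, clusters_of_word):
    """Cosine similarity of the sets of scheme clusters generating the two
    words (1 = generated by exactly the same clusters)."""
    c1, c2 = clusters_of_word[w1], clusters_of_word[w2]
    return len(c1 & c2) / sqrt(len(c1) * len(c2))

def precision_recall_f1(word_groups, word_paradigms):
    """word_groups: disjoint word sets output by the system;
    word_paradigms: corpus words grouped by lemma.  Both cover the same types."""
    total = sum(len(g) for g in word_groups)
    p = sum(max(len(g & q) for q in word_paradigms) for g in word_groups)
    r = sum(max(len(q & g) for g in word_groups) for q in word_paradigms)
    precision, recall = p / total, r / total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example from the text: otrávení is generated by clusters {33, 41} and
# otrávený by {41, 45}, so their paradigm distance is 1 / sqrt(2 * 2) = 0.5.
```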
5.2 Results

The results of the experiments are presented in Tables 2–4. We used the following experiment settings:

1. no seed – the baseline; Paramor was run without any seeding.

2. man. seed – the manual seed was used.

3. autoseed – the autoseed was used for the induction of the stem change rules.

4. both seeds – Paramor was run with the manual seed; stem change rules were induced from both the manual seed and the autoseed.

5. seed + pref. – the manual seed was used together with additional rules for the two Czech inflectional prefixes; otherwise the same as 2.

6. both seeds + pref. – the manual seed was used together with additional rules for the two Czech inflectional prefixes; otherwise the same as 4.

Experiment            Precision   Recall   F1
no seed               97.87       84.61    90.76
man. seed             97.96       87.52    92.44
autoseed              98.19       84.58    90.88
both seeds            97.96       87.52    92.44
seed + pref.          97.84       89.40    93.43
both seeds + pref.    97.84       89.40    93.43

Table 2: Results for the cz1 corpus.

Experiment            Precision   Recall   F1
no seed               97.36       87.02    91.90
man. seed             97.04       89.30    93.01
autoseed              97.30       87.72    92.26
both seeds            96.78       89.30    92.89
seed + pref.          96.68       92.35    94.46
both seeds + pref.    96.31       92.49    94.36

Table 3: Results for the cz2 corpus.

Experiment            Precision   Recall   F1
no seed               95.70       93.00    94.33
man. seed             95.62       94.44    95.02
autoseed              95.69       93.13    94.40
both seeds            95.56       94.76    95.16

Table 4: Results for the si corpus.

As can be seen from the results, the extra manual information indeed does help the accuracy of clustering words belonging to the same paradigms. What is not shown by the numbers is that more of the morpheme boundaries make linguistic sense, because basic stem allomorphy is accounted for.

6 Conclusion

We have shown that providing very little easily obtainable information can improve the results of a purely unsupervised system. In the near future, we are planning to model a wider range of allomorphic alternations, try larger (but still easy to obtain) seeds and, finally, test the results on more languages.

References

[1] Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, Vol. 6, MPL '02, 21–30, Stroudsburg, PA, USA, 2002, Association for Computational Linguistics

[2] Creutz, M., Lagus, K.: Inducing the morphological lexicon of a natural language from unannotated text. In: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05), 106–113, Espoo, Finland, 2005

[3] Creutz, M., Lagus, K.: Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. 4 (3) (February 2007), 1–34

[4] Feldman, A., Hana, J.: A resource-light approach to morpho-syntactic tagging. Rodopi, Amsterdam/New York, NY, 2010

[5] Goldsmith, J. A.: Unsupervised learning of the morphology of a natural language. Computational Linguistics 27 (2) (2001), 153–198

[6] Hajič, J.: Morphological tagging: data vs. dictionaries. In: Proceedings of the ANLP-NAACL Conference, 94–101, Seattle, Washington, USA, 2000

[7] Hajič, J.: Disambiguation of rich inflection: computational morphology of Czech. Karolinum, Charles University Press, Praha, 2004

[8] Hana, J., Feldman, A., Brew, C.: A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In: Lin, D., Wu, D. (eds.), Proceedings of EMNLP 2004, 222–229, Barcelona, Spain, July 2004, Association for Computational Linguistics

[9] Kohonen, O., Virpioja, S., Lagus, K.: Semi-supervised learning of concatenative morphology. In: Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, SIGMORPHON '10, 78–86, Stroudsburg, PA, USA, 2010, Association for Computational Linguistics

[10] Monson, C.: ParaMor: from paradigm structure to natural language morphology induction. PhD thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 2009

[11] Monson, C., Carbonell, J., Lavie, A., Levin, L.: ParaMor: minimally supervised induction of paradigm structure and morphological analysis. In: Proceedings of the Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, 117–125, Prague, Czech Republic, June 2007, Association for Computational Linguistics

[12] Monson, C., Carbonell, J. G., Lavie, A., Levin, L. S.: ParaMor: finding paradigms across morphology. In: Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19–21, 2007, Revised Selected Papers, 900–907, 2007

[13] Oflazer, K., Nirenburg, S., McShane, M.: Bootstrapping morphological analyzers by combining human elicitation and machine learning. Computational Linguistics 27 (1) (2001), 59–85

[14] Rissanen, J.: Stochastic complexity in statistical inquiry. World Scientific Publishing Co, Singapore, 1989

[15] Schone, P., Jurafsky, D.: Knowledge-free induction of inflectional morphologies. In: Proceedings of the North American Chapter of the Association for Computational Linguistics, 183–191, 2001

[16] Tepper, M., Xia, F.: A hybrid approach to the induction of underlying morphology. In: Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP-2008), Hyderabad, India, January 7–12, 17–24, 2008

[17] Tepper, M., Xia, F.: Inducing morphemes using light knowledge. ACM Trans. Asian Lang. Inf. Process. 9 (3) (March 2010), 1–38

[18] Yarowsky, D., Wicentowski, R.: Minimally supervised morphological analysis by multimodal alignment. In: Proceedings of the 38th Meeting of the Association for Computational Linguistics, 207–216, 2000