Unsupervised Learning of Morphology by Using Syntactic Categories

Burcu Can, Suresh Manandhar
Department of Computer Science, University of York
York YO10 5DD UK
{burcu,suresh}@cs.york.ac.uk

Abstract

This paper presents a method for unsupervised learning of morphology that exploits the syntactic categories of words. Previous research [4][12] on the learning of morphology and syntax has shown that the two kinds of knowledge affect each other, making it possible to use one type of knowledge to help learn the other. In this work, we make use of syntactic information, i.e. Part-of-Speech (PoS) tags of words, to aid morphological analysis. We employ an existing unsupervised PoS tagging algorithm for inducing the PoS categories. A distributional clustering algorithm is developed for inducing morphological paradigms.

Categories and Subject Descriptors
I.2 [Artificial Intelligence]: I.2.6 Learning; I.2.7 Natural Language Processing

General Terms
Algorithms, Experimentation

Keywords
Morphological Analysis, Syntax, Unsupervised Learning, Clustering

1 Introduction

Morphological analysis is a crucial subtask for most Natural Language Processing (NLP) problems: Machine Translation, Question Answering, Information Retrieval, Part-of-Speech Tagging, etc. Morphological segmentation is a preprocessing step before a full morphological analysis can be done. For applications dealing with large corpora, morphological segmentation requires dealing with data sparseness. For example, a verb can have thousands of different forms in languages like Finnish (Karlsson [13]), and often not all of these forms are seen during training.

Morphology learning has been an active research topic since Harris [11] proposed the distributional hypothesis. One of the earliest works on morphology learning using Harris's ideas is Déjean's method [8].
In this work [8], Déjean finds the most frequent morphemes in the given unannotated text and, using this morpheme list, segments the text morphologically. Since then, a huge amount of progress has been made in the field. We can classify the existing approaches as follows: Minimum Description Length (MDL) models, Letter Successor Variety (LSV) models, semantic models (e.g. Latent Semantic Analysis, LSA), and probabilistic models. MDL-based models (Brent [2], Brent et al. [3], Goldsmith [9][10], Creutz and Lagus [7]) aim to minimize the space used by the corpus and the model by morphologically segmenting the words in the corpus. The LSV model has been employed by Bordag [1], who used letter frequencies to find split points in words. Snover et al. [16] describe a probabilistic model in which morphological paradigms are created gradually by choosing the number of stems, morphemes, and paradigms in a probabilistic, generative manner. Another generative model is due to Creutz [6], in which the lengths and the frequencies of the morphemes are used as prior information. Schone and Jurafsky [15] used LSA to capture the semantic relatedness of words to aid morphological segmentation.

All the above work has primarily been employed to learn simple, i.e. non-recursive, concatenative morphology; it does not directly address the recursive nature of the morphology of agglutinative languages. Monson [14] proposed a system for handling the morphology of agglutinative languages. His system achieved a precision of 52% on Turkish as evaluated in the Morpho Challenge Workshop, 2008.

There has been some work on the joint unsupervised learning of morphology and PoS tags. Hu et al. [12] extend the minimum description length (MDL) based framework due to Goldsmith [9] to explore the link between morphological signatures and PoS tags. Clark [4] experiments with the fixed endings of words for PoS clustering.
So morphologically similar words tend to belong to the same PoS cluster. Our current work can be viewed as going in a similar direction. In particular, we show that unsupervised PoS tagging can be effectively employed for the learning of morphology. However, the work presented here is not a method for the simultaneous learning of PoS categories and morphology. It is limited to learning morphology given that PoS categories have already been induced using an unsupervised PoS tagger.

2 Inducing Syntactic Categories

For the induction of syntactic categories, we used Clark's distributional clustering approach [5], which can be considered an instance of average link clustering. It should be emphasised, however, that any other method for unsupervised induction of PoS tags can be substituted without affecting the method presented in this paper.

Following Clark's approach, each word is clustered by using its context. A context consists of the previous word and the following word. Each word has a context distribution over all ordered pairs of left-context/right-context words. To measure the distributional similarity between words, KL divergence is used, which is defined as:

D(p || q) = Σ_x p(x) log (p(x) / q(x))    (1)

where p, q are the context distributions of the words being compared and x ranges over contexts. In his approach [5], the probability of a context for a target word is defined as:

p(<w1, w2>) = p(<c(w1), c(w2)>) p(w1 | c(w1)) p(w2 | c(w2))    (2)

where c(w1), c(w2) denote the PoS clusters of the words w1, w2 respectively.

The algorithm requires the number of clusters K to be specified in advance. In addition to the K clusters, one spare cluster is employed containing all unclustered words. During each iteration, one word is chosen from the spare cluster, namely the word having the minimum KL divergence with one of the K clusters. For each cluster, its context distribution is computed as the averaged distribution of all words in the cluster.
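To make Equation 1 concrete, the following sketch computes the divergence between two context distributions represented as dictionaries over (left-word, right-word) pairs. This is an illustration, not the authors' implementation; in particular, the epsilon smoothing for contexts unseen in q is our assumption, since the paper does not say how zero-probability contexts are handled.

```python
import math

def kl_divergence(p, q, epsilon=1e-10):
    """Compute D(p || q) = sum_x p(x) * log(p(x) / q(x)) for context
    distributions p and q, given as dicts mapping contexts to
    probabilities.  Contexts missing from q get a small epsilon
    probability so the divergence stays finite (our assumption)."""
    divergence = 0.0
    for context, p_x in p.items():
        if p_x > 0.0:
            q_x = q.get(context, epsilon)
            divergence += p_x * math.log(p_x / q_x)
    return divergence

# Toy context distributions over (left-word, right-word) pairs.
p = {("the", "runs"): 0.5, ("a", "walks"): 0.5}
q = {("the", "runs"): 0.25, ("a", "walks"): 0.75}
print(kl_divergence(p, p))  # identical distributions -> 0.0
print(kl_divergence(p, q))  # positive for differing distributions
```

In the clustering loop, p would be a word's context distribution and q the averaged distribution of a candidate cluster; the word is assigned to the cluster minimising this divergence.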
In addition, the KL divergence between clusters is computed after each iteration, and clusters are merged if the divergence is below a manually set threshold.

We set K = 77, the number of tags defined in the CLAWS tagset used for tagging the BNC (British National Corpus). We used the same number of clusters for Turkish and German. The final clusters show that the PoS clusters correspond to the major syntactic categories. The system finds PoS clusters that can be identified as proper nouns, verbs in past tense form, verbs in present continuous form, nouns, adjectives, adverbs, and so on. Some sample clusters are given below for English:

Cluster 1: much far badly deeply strongly thoroughly busy rapidly slightly heavily neatly widely closely easily profoundly readily eagerly
Cluster 2: made found held kept bought heard played left passed finished lost changed etc
Cluster 3: should may could would will might did does etc
Cluster 4: working travelling flying fighting running moving playing turning etc
Cluster 5: people men women children girls horses students pupils staff families etc

3 Inducing Morphological Paradigms

After the induction of the syntactic categories, the conditional probability of each morpheme x given its PoS cluster c, p(x|c), is computed. The possible morphemes in each PoS cluster are found by splitting each word in the cluster at all split points; the morphemes are then ranked and sorted according to their maximum likelihood estimates in each PoS cluster. A list of the highest ranked morphemes is given in Table 1 for English, German, and Turkish.
Table 1: Some high-ranked potential morphemes in PoS clusters

Cluster   English          German               Turkish
1         -s               -n, -en              -i, -si, -ri
2         -d, -ed          -e, -te              -mak, -mek, -mesi, -masi
3         -ng, -ing        -g, -ng, -ung        -an, -en
4         -y, -ly          -r, -er              -r, -ar, -er, -ler, -lar
5         -s, -rs, -ers    -n, -en, -rn, -ern   -r, -ir, -dir, -Ir, -dIr
6         -ing, -ng, -g    -ch, -ich, -lich     -e, -a

This ranking is used to eliminate potential non-morphemes with a low conditional probability, hence reducing the search space. In the next step, morphemes across PoS clusters are incrementally merged, forming the basis of the paradigm capturing mechanism. In each iteration, the morpheme pair across two different PoS clusters with the highest number of common stems is chosen for merging. Once a morpheme pair is merged, the words that belong to the newly formed paradigm are removed from their respective PoS clusters. Once a word is assigned to a paradigm, it cannot be part of any other paradigm. Thus, we postulate that a word can only belong to a single morphological paradigm.

Since, in our current framework, morphemes are tied to PoS clusters, our definition of a paradigm deviates from that of Goldsmith [9] in that a paradigm φ is a list of morpheme/cluster pairs, i.e. φ = {m1/c1, ..., mn/cn}. Associated with each paradigm is a list of stems, i.e. the list of stems that can combine with each of the morphemes mi to produce a word belonging to the ci PoS category. Algorithm 1 describes the complete paradigm capturing process.
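The candidate-extraction step described above can be sketched as follows: every word in a cluster is split at every internal position, and the resulting suffix candidates are ranked by their maximum likelihood estimate p(m|c); candidates at or below the threshold t are discarded. This is a simplified illustration, not the authors' implementation — in particular, the restriction to suffix candidates (mirroring the examples in Table 1) and the treatment of the stem side of each split are our assumptions.

```python
from collections import Counter

def rank_morphemes(cluster_words, threshold=0.1):
    """For one PoS cluster, split every word at every internal position,
    collect the suffix candidates, and estimate p(m | c) as the
    candidate's count divided by the total number of candidates.
    Candidates with p(m | c) <= threshold are dropped, as with the
    paper's threshold t."""
    counts = Counter()
    for word in cluster_words:
        for i in range(1, len(word)):   # every internal split point
            counts[word[i:]] += 1       # suffix candidate
    total = sum(counts.values())
    estimates = {m: c / total for m, c in counts.items()}
    return {m: p for m, p in estimates.items() if p > threshold}

# Toy cluster of present-continuous verbs (cf. Cluster 4 for English).
cluster = ["working", "playing", "moving"]
print(rank_morphemes(cluster, threshold=0.1))  # keeps 'ing', 'ng', 'g'
```

With t = 0.1 as in the paper, only the endings shared by all three words survive, which matches the pattern of Table 1, where a morpheme often appears alongside its shorter tails (e.g. -ing, -ng, -g).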
Some examples of sample paradigms captured are given below: English: ed ing : reclaim aggravat hogg trimm expell administer divert register stimulat shap rehabilitat exempt stiffen spar deceiv contaminat disciplin implement stabiliz feign mistreat extricat mimick alert seal etc s d : implicate ditche amuse overcharge equate despise torpedoe curse plie supersede preclude snare tangle eclipse relinquishe ambushe reimburse alienate conceive vetoe waive envie negotiate diagnose etc er ing : brows wring worship cropp cater stroll zipp moneymak tun chok hustl angl windsurf swindl cricket painkill climb heckl improvis scream scaveng panhandl lawmak bark clean lifesav beekeep toast matchmak bodybuild etc e ed : subsid liquidat redecorat exorcis amputat fertiliz reshap regulat foreclos infring eradicat reverberat chim centralis restructur crippl rehabilitat symbolis reinstat etc ly er : dark cheap slow quiet fair light high poor rich cool quick broad deep bright calm crisp mild clever etc 0 s : benchmark instrument pretzel wheelchair scapegoat spike infomercial catastrophe beard paycheck reserve abduction Turkish: i e : zemin faaliyetin torenler secim incelemeler eyalet nem takvim makineler yontemin becerisin gorusmeler teknigin merkezin iklim goruntuler etc i a : cevab bakimin mektuplar esnaf olayin akisin miktar kayd yasamay bulgular sular masraflarin heyecanin kalan haklarin anlamin etc i in : sanayiin degerlerin esin denizler duman teminat erkekler kurullarin birbirin vatandaslarimiz gelismesin milletvekillerin partisin de e : bolgesin duzeyin yonetimin dergisin sektorun birimlerin bolgelerin tumun bolumlerin tesis- lerin donemin kongresin evin etc mesi en : izlen yurutul degis uretil gerceklestiril desteklen gelistiril etc i 0 : iman cekim mahkemelerin orneklem gaflet yazman sanat trendler mahalleler eviniz hamamlar piller ogretim olimpiyat German: r n : kurze ehemalige eidgenoessische professionelle erste bescheidene ungewoehnliche ethnische unbekannte besondere 
nationalsozialistische deutsche
e en : praechtig gesichert dauerhaft bescheiden vereinbart biologisch natuerlich oekumenisch kantonal unterirdisch wissenschaftlich nahegelegen chinesisch
t en : funktionier konkurrier schneid mitwirk ansteig plaedier pfeif aufklaer schluck ausgleich weitermach abhol ankomm spazier speis aussteig aufhoer
er ung : versteiger unterdrueck erneuer vermarkt beschleunig besetz geschaeftsfuehr wirtschaftsfoerder finanzverwalt verhandl
s 0 : potential instrument flohmarkt vorhang pilotprojekt idol rechner thriller ensemble bebauungsplan empfinden defekt aufschwung

Algorithm 1 Algorithm for paradigm capturing using syntactic categories
1: Apply unsupervised PoS clustering to the input corpus
2: For each PoS cluster c and morpheme m, compute maximum likelihood estimates of p(m | c)
3: Keep all m (in c) with p(m | c) > t, where t is a threshold
4: repeat
5:   for all PoS clusters c1, c2 do
6:     Pick morphemes m1 in c1 and m2 in c2 with the highest number of common stems
7:     Store φ = {m1/c1, m2/c2} as a new paradigm
8:     Remove all words in c1 with morpheme m1 and associate these words with φ
9:     Remove all words in c2 with morpheme m2 and associate these words with φ
10:  end for
11:  for all paradigms φ1, φ2 such that Acc(φ1, φ2) > T, where T is a threshold do
12:    Create the new merged paradigm φ = φ1 ∪ φ2
13:    Associate all words from φ1 and φ2 with φ
14:    Delete paradigms φ1, φ2
15:  end for
16: until no morpheme pair with at least one common stem is left

4 Merging Paradigms

For capturing more general paradigms, paradigm merging is performed. We rank potential paradigm pairs by the ratio of common stems to the total number of stems captured by each paradigm. More precisely, given paradigms φ1, φ2, let P be the total number of common stems. Let N1 be the total number of stems in φ1 that are not present in φ2. Similarly, let N2 be the total number of stems in φ2 that are not present in φ1.
Then, we can define the expected paradigm accuracy of φ1 with respect to φ2 by:

Acc1 = P / (P + N1)    (3)

Acc2 is defined analogously. We use the average of Acc1 and Acc2 to compute the combined (averaged) expected accuracy of the merged paradigms φ1, φ2:

Acc(φ1, φ2) = (P / (P + N1) + P / (P + N2)) / 2    (4)

During each iteration, all the paradigm pairs having an expected accuracy greater than a given threshold value are merged. Once two paradigms are merged, stems that occur in only one of the paradigms inherit the morphemes from the other paradigm. This mechanism helps create a more general paradigm and helps recover missing word forms. Thus, although some of the word forms do not exist in the corpus, it becomes possible to capture these forms. Some example paradigms that are found by the system are given below:

English:
es ing e ed : sketch chew nipp debut met factor profit occurr err trudg participat necessitat stomp streak siphon stroll sprint drizzl firm climax gestur whipp roll tripp stemm dangl shuffl kindl broker chalk latch rippl collaborat chok summ propp pedal paralyz parad plough cramm slack wad saddl conjur tipp gallop totall catalogu bundl barg whittl retaliat straighten tick peek jabb slimm
s ing ed 0 : benchmark mothball weed snicker thread queue jack paw yacht implement import bracket whoop conflict spoof stunt bargain honor bird fingerprint excerpt handcuff veil comment

Turkish:
u a e i : yapabileceklerin kredisin hizmetleri'n sevdikleriniz yeter' transferlerin sevkin elimiz tehlikelerin sas mucizey tehditlerin bakir muhasebesin ed gayrimenkuller ecevit' defterim izlemelerin tescilin minarey tahsilin lastikler yerlestirmey
i lar li in : ruhsat semt ikilem reaksiyonlar harc tip prim gidilmis kaldirmis degistirmis bulunmayacak aktarmis bulunacak kapanacak yazilabilecek devredilmis degisecek gelmemis

German:
er 0 e en : kassiert beguenstigt eingeholt genuegt angelastet beruehrt beinhaltet zurueckgegeben beschleunigt initiiert abgestellt bewirkt mitgenommen
abgebrochen beruhigt besichtigt
te ung er ten t en lich e : fahr gebrauch blockier identifizier studier entfalt gestalt agier passier sprech berat tausch kauf such weck beug erreich bearbeit beobacht erleid ueberrasch halt helf oeffn pruef uebertreff bezahl spring fuell toet
0 te t er : lichtenberg limburg hill trier elmshorn dreieich praunheim heusenstamm heddernheim hellersdorf schmitt muehlheim lueneburg kassel schluechtern preungesheim rodgau bieber osnabrueck rodheim muenchen london lissabon seoul wedding treptow

5 Morphological Segmentation

For Morpho Challenge 2009, we first clustered all the words in the given corpora, thereby creating a set of PoS clusters. We then followed the steps described in the previous sections to induce the morphological paradigms. The wordlists provided in Morpho Challenge 2009 contain the words that need to be segmented. To assign a PoS cluster to a given word w from the wordlist, the context distribution of w is first computed. The word is then assigned the PoS cluster with the minimal KL divergence. In this case, we only consider words with a frequency greater than 10 to eliminate noise.

To segment the words in the word lists, we first check whether a word exists in one of the induced paradigms. We followed different procedures for known, unknown, and compound words:

5.1 Handling known words

If the word exists in one of the paradigms, it is segmented by using the morpheme in the paradigm in which the word is found. For example, suppose the following paradigm exists:

s ing ed 0 : benchmark mothball weed snicker thread queue jack paw yacht implement import bracket whoop

If the word 'importing' is to be morphologically analyzed, it is automatically segmented by using the morpheme 'ing'.

5.2 Handling unknown words

If the word does not exist in any of the paradigms, a sequence of segmentation rules is applied. Using the paradigms, we created a morpheme dictionary to split the words which do not belong to any of the paradigms.
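A minimal sketch of such dictionary-based splitting is given below, assuming the rightmost, longest-match strategy that the segmentation rules describe. The morpheme dictionary contents and the minimum remaining stem length are illustrative assumptions, not values from the paper.

```python
def strip_suffixes(word, morpheme_dict, min_stem=3):
    """Recursively strip the longest matching morpheme from the right
    end of the word.  Splitting repeats until no further split applies;
    capping match length at len(word) - min_stem keeps a minimum stem,
    which is our assumption rather than a rule stated in the paper."""
    segments = []
    remainder = word
    while True:
        match = None
        # Try the longest candidate ending first.
        for length in range(len(remainder) - min_stem, 0, -1):
            if remainder[-length:] in morpheme_dict:
                match = remainder[-length:]
                break
        if match is None:
            break
        segments.insert(0, match)
        remainder = remainder[:-len(match)]
    return [remainder] + segments

# A toy morpheme dictionary built from a paradigm such as "s ing ed 0".
morphemes = {"s", "ing", "ed", "er"}
print(strip_suffixes("benchmarkings", morphemes))  # ['benchmark', 'ing', 's']
print(strip_suffixes("yacht", morphemes))          # ['yacht'] (no split)
```

The same skeleton serves the compound handling of Section 5.3 if the dictionary lookup is replaced by a check for membership of the remainder's ending in the corpus vocabulary.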
All the morphemes in a paradigm are included in the morpheme dictionary unless the morphemes in that paradigm all begin with the same letter; in that case, only the longest morpheme is included in the dictionary. Using the morpheme dictionary, the word is scanned from the rightmost letter to check whether any ending of the word exists in the dictionary. The longest letter sequence (of the word) existing in the dictionary is chosen to split the word. The same process is repeated on the remainder until no further split can be applied.

5.3 Handling compounds

For compounds, such as 'hausaufgaben' in German or 'railway' in English, a recursive approach is applied to both known and unknown words. The compounding rules split a word recursively from the rightmost end to the left. If an ending sequence of letters exists as a word in the corpus, the sequence is split off, and the same procedure is repeated until no remaining part is itself a valid word in the corpus. When there are multiple matches, the longest match is chosen. This recursive search is also able to find prefixes, as it searches for valid sub-words within words.

The algorithm for the segmentation of words is given in Algorithm 2.

Algorithm 2 Morphological Segmentation
1: for all words w to be segmented do
2:   if w already exists in a paradigm φ then
3:     Split w using φ as w = u + m
4:   else
5:     u = w
6:   end if
7:   If possible, split u recursively from the rightmost end by using the morpheme dictionary as u = s1 + ... + sn; otherwise s1 = u
8:   If possible, split s1 into its sub-words recursively from the rightmost end as s1 = w1 + ... + wn
9: end for

6 Results

The model was evaluated in the Morpho Challenge 2009 competition. Below we describe the datasets we used and our model parameters, and finally we give our evaluation results.
6.1 Datasets

We used the datasets provided by Morpho Challenge 2009 and the Cross Language Evaluation Forum (CLEF) to train our system on 3 different languages: English, German, and Turkish. For PoS clustering, we used the corpora provided by Morpho Challenge 2009 (http://www.cis.hut.fi/morphochallenge2009). For the clustering of the words in the word lists to be segmented, we used the datasets supplied by the CLEF organization (http://www.clef-campaign.org/; English datasets: Los Angeles Times 1994 (425 MB), Glasgow Herald 1995 (154 MB); German datasets: Frankfurter Rundschau 1994 (320 MB), Der Spiegel 1994/95 (63 MB), SDA German 1994 (144 MB), SDA German 1995 (141 MB)). We used the CLEF datasets to obtain the context distributions of the words in German and in English. For Turkish, we used a manually collected archive of newspaper text.

6.2 Model Parameters

Our model is unsupervised, but it requires two parameters to be set manually. First, the threshold t on the conditional probability of a morpheme given its PoS cluster, p(m|c), needs to be fixed. We tested different values of this parameter for each language to find a suitable value through trial and error, and we set t = 0.1. Thus, only morphemes m with p(m|c) > 0.1 were considered. Second, the threshold T on the expected accuracy of merging two paradigms φ1, φ2, given in Equation 4, needs to be set. Smaller values of this threshold lead to bigger paradigms with more stems, but decrease the accuracy. Several experiments were performed to find its optimum value for different languages, and a value of T = 0.75 was chosen. Both thresholds t and T, once set, were unchanged across all experiments reported in this paper.

6.3 Evaluation & Results

The system was evaluated in Competition 1 of Morpho Challenge 2009. Precision is calculated by sampling pairs of words with the same morpheme(s) from the system output and checking these against the gold standard.
Recall is calculated in a similar way, but this time pairs of words with the same morpheme(s) are sampled from the gold standard and checked against the system output. The F-measure is the harmonic mean of precision and recall:

F-measure = 2 / (1/Precision + 1/Recall)    (5)

Evaluation results for English are given in Table 2.

Table 2: Evaluation results for English

Language    Precision    Recall    F-measure
English     58.52%       44.82%    50.76%

Evaluation results for German are given in Table 3. We conducted two different experiments for German. In the first experiment, we used only the compounding rules (see Section 5.3) on the German word list. Since German makes heavy use of compounding, the results in Table 3 show that the compounding rules alone have high precision but low recall. In the second experiment, we used the full unsupervised model developed in this paper.

Table 3: Evaluation results for German

Language             Precision    Recall    F-measure
German - compound    73.16%       15.27%    25.27%
German - normal      57.67%       42.67%    49.05%

Evaluation results for Turkish are given in Table 4. Two different experiments were conducted for Turkish. In the first experiment, a validity check was performed while splitting the word recursively to decide whether to split the word. The validity check simply checks the membership of the remaining word in the Turkish corpus: if the rest of the word after splitting off a morpheme exists in the corpus, the validity condition is assumed to be met. In the second experiment, no validity check is performed. Instead, the morpheme dictionary is used. The morpheme dictionary is constructed from the learnt morphological paradigms by extracting all their morphemes. The word is split recursively from the rightmost end by matching its endings with the morphemes in the morpheme dictionary. Our experiments show that the precision gets higher when a validity check is done, but the recall is reduced.
Since the Turkish dataset does not include all the forms of every word, the validity check is not reliable, leading to a lower recall.

Table 4: Evaluation results for Turkish

Language                 Precision    Recall    F-measure
Turkish (validity)       73.03%       8.89%     15.86%
Turkish (no validity)    41.39%       38.13%    39.70%

7 Conclusion & Future Work

In this paper, we have developed a model for unsupervised morphology learning that exploits the PoS categories induced by an unsupervised PoS tagger. To our knowledge, there has been very limited work on the combined learning of syntactic categories and morphology. Our results demonstrate that it is meaningful to use syntactic categorial information for morphology learning.

One problem with the current approach is that it requires a large corpus for PoS clustering. If a word does not have enough context information due to the corpus size, it cannot be clustered. The system then segments it by using just the morpheme dictionary, which in turn leads to inaccurate segmentations. The current system also requires the manual setting of some thresholds, and it is very sensitive to these thresholds.

In the near future, we plan to address the above issues with the current model. In particular, we are interested in generative models for the joint learning of morphology and PoS categories.

References

[1] Stephan Bordag. Unsupervised and knowledge-free morpheme segmentation and analysis. In Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum (CLEF), pages 881–891, 2007.
[2] Michael R. Brent. Minimal generative models: A middle ground between neurons and triggers. In Proceedings of the 5th International Workshop on Artificial Intelligence and Statistics, 1993.
[3] Michael R. Brent, Sreerama K. Murthy, and Andrew Lundberg. Discovering morphemic suffixes: a case study in MDL induction. In Fifth International Workshop on AI and Statistics, pages 264–271, 1995.
[4] Alexander Clark. Combining distributional and morphological information for part of speech induction. In Proceedings of the 10th Annual Meeting of the European Association for Computational Linguistics (EACL), pages 59–66, 2003.
[5] Alexander Clark. Inducing syntactic categories by context distribution clustering. In The Fourth Conference on Natural Language Learning (CoNLL), pages 91–94, 2000.
[6] Mathias Creutz. Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 280–287, 2003.
[7] Mathias Creutz and Krista Lagus. Unsupervised discovery of morphemes. In Proceedings of the ACL Workshop on Morphological and Phonological Learning, pages 21–30, 2002.
[8] Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora, 1998.
[9] John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, 2001.
[10] John Goldsmith. An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12:353–371, 2006.
[11] Zellig Sabbettai Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
[12] Yu Hu, I. Matveeva, J. Goldsmith, and C. Sprague. Using morphology and syntax together in unsupervised learning. In Proceedings of the Workshop on Psychocomputational Models of Human Language Acquisition, pages 20–27, 2005.
[13] Fred Karlsson. Finnish Grammar. WSOY, Juva, 1983.
[14] Christian Monson. ParaMor: From Paradigm Structure to Natural Language Morphology Induction. PhD thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 2008.
[15] Patrick Schone and Daniel Jurafsky. Knowledge-free induction of morphology using latent semantic analysis. In Proceedings of CoNLL-2000 and LLL-2000, pages 67–72, 2000.
[16] Matthew G. Snover, Gaja E. Jarosz, and Michael R. Brent. Unsupervised learning of morphology using a novel directed search algorithm: taking the first step. In Proceedings of the ACL Workshop on Morphological and Phonological Learning, 2002.