Unsupervised and Knowledge-free Morpheme Segmentation and Analysis

Stefan Bordag
University of Leipzig
sbordag@informatik.uni-leipzig.de

Abstract

This paper presents a revised version of an unsupervised and knowledge-free morpheme boundary detection algorithm based on letter successor variety (LSV) and a trie classifier [5]. Additionally, a morphemic analysis based on contextual similarity provides knowledge about the relatedness of the found morphs. For the boundary detection, the challenge of increasing the recall of found morphs while retaining a high precision is tackled by adding a compound splitter, iterating the LSV analysis and dividing the trie classifier into two distinctly applied classifiers. The result is a significantly improved overall performance and a decreased reliance on corpus size. Further possible improvements and analyses are discussed.

Keywords

letter successor variety, morpheme boundary detection, morpheme analysis, distributed similarity

1 Introduction

The algorithm presented in this paper (a recent implementation is available at http://wortschatz.uni-leipzig.de/~sbordag/) is a revised version of the letter successor variety (LSV) based algorithm [13, 12, 10] described and implemented previously [4, 5]. The additional component of morpheme analysis is based on a prototypical implementation described in [7].

The morpheme segmentation algorithm attempts to find morpheme boundaries within word forms. For a given input word form it produces a segmentation into morphs (as opposed to morphemes). It is based on the assumption that any grammatical function is expressed with only a small number of different affixes. For example, plural is expressed with only five different morphs in German: -en, -s, -e, -er (and zero). In essence, the algorithm measures the number of different letters occurring after a given substring with respect to some context of other words (in this case semantically similar ones), weights that value according to bi- and trigram probabilities and compares the resulting score to a threshold. Hence it is designed to handle concatenative morphology and it is likely to fail in finding morpheme boundaries in languages with other types of morphology. The algorithm is not rooted in any particular (linguistic) theory of morphology, especially since such theories tend to omit the fact that morphemes, as their basic units of interest, are not as easily observable as words. The knowledge about where a morph begins and ends is usually assumed to be given a priori.

The present implementation of the morpheme boundary detection consists of three distinct major parts: a compound splitter, a letter successor variety algorithm using contextual similarity of word forms and a trie-based machine learning step. Apart from the improved modularity, this version differs from earlier implementations in several aspects. Due to the low performance of the LSV based method in splitting longer words, a simple compound splitter algorithm is applied in a pre-processing step. The LSV part is iterated to increase recall with only a moderate loss of precision. The machine learning part (using a trie) is split into two parts, one with high precision and a subsequent one with high recall. According to an evaluation using the German CELEX [1], each change improves the overall performance slightly. Several possibilities of further improvements and analyses are discussed.
Any of the three major parts (compound splitter, LSV algorithm, trie classifier) of the described algorithm can be replaced by or merged with a different algorithm, which should facilitate the combination of this algorithm with others. The morpheme analysis part is based on statistical co-occurrence of the found morphs, subsequent contextual similarity and a basic rule learning algorithm. The rules are then used to find related morphs, where groups of related morphs represent a morpheme.

2 Letter Successor Variety

LSV is a measure of the number of different letters encountered after (or before) a certain substring, given a set of other strings as context. It is possible to use the entire word list as context for each string and its substrings [12, 10]. Alternatively, only a specific set of words may be used as context [5], if a method for the selection of relevant words is included. In order to use LSV to find true morpheme boundaries, this set ideally consists only of relevant words that share at least one grammatical feature with the input word. For example, if the input word is hurried, then relevant words are past tense forms. It is obvious that in such a case the number of different letters encountered before the substring -ed is maximized.

As has been shown earlier [7], using the entire word list for morpheme boundary detection (global LSV) is inferior to using a simulation of semantic similarity (contextual similarity based on comparing statistically significant co-occurrences) of words to find the relevant ones (local LSV). However, due to the power-law distribution of word frequencies, most words occur very rarely. This makes it impossible to compute a proper representation of their usage and accordingly to compare such words for usage similarity. Hence, local LSV based morpheme boundary detection might have a high precision, but is guaranteed to have a low recall. Additionally, it is possible to first globally find a set of contextually similar word pairs and then analyze their differences [18]. While this method also appears to have high precision, its recall is even lower than that of the LSV method.

2.1 Trie classifier

In order to increase the recall of the local LSV method, a machine learning method was proposed. It is based on training a Patricia compact trie (PCT) [17] with morpheme segmentations detected by the local LSV method. The trained trie can then be used to recursively split all words into morphs, irrespective of their frequency.

Training the trie, as also depicted in Figure 1, is performed as follows. Each known morpheme boundary is reformulated as a rule: the entire word is the input string, whereas the shorter half of the word (according to the morpheme boundary) is the class to be learned. The trie learns by adding nodes that represent letters of the word to be learned along with increasing the count of the class for each letter (see Figure 1). If more than one morpheme boundary is given in a word, then training is applied recursively, peeling off the outermost and shortest morphs first (from right to left). Hence, the word mis-under-stand-ing results in the three training instances misunderstanding (-ing), misunderstand (mis-) and understand (-stand). Two distinct tries, a forward-trie and a backward-trie, are used to learn prefixes and suffixes separately. The decision which trie to use for any given training instance is based on the length of the morphs: the longer half of the word probably contains the stem, whereas the shorter half is used as the class. In the case of the backward-trie, the word itself is reversed.
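The following sketch illustrates this training and classification scheme. It is only an approximation of the described procedure: it uses a plain letter trie instead of a Patricia compact trie, omits pruning, and the names MorphTrie, train and classify are chosen for this illustration only.

    from collections import defaultdict

    class MorphTrie:
        """Simplified (non-compact) trie storing class counts in every node."""
        def __init__(self):
            self.children = {}
            self.classes = defaultdict(int)   # class label -> frequency

        def train(self, string, label):
            # Add one node per letter and count the class along the path.
            node = self
            for letter in string:
                node = node.children.setdefault(letter, MorphTrie())
                node.classes[label] += 1

        def classify(self, string):
            # Follow the string as deep as possible and return the most
            # frequent class stored in the deepest matching node.
            node, deepest = self, None
            for letter in string:
                if letter not in node.children:
                    break
                node = node.children[letter]
                if node.classes:
                    deepest = node
            return max(deepest.classes, key=deepest.classes.get) if deepest else None

    # Training instances for mis-under-stand-ing, peeled off from right to left;
    # the backward-trie is trained on reversed strings so that suffixes line up.
    forward, backward = MorphTrie(), MorphTrie()
    backward.train("misunderstanding"[::-1], "-ing")
    forward.train("misunderstand", "mis-")
    backward.train("understand"[::-1], "-stand")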
Hence, in the example above, the first newly learned node for the instance misunderstanding (-ing) is the node containing g, with the class -ing and a frequency of 1, in the backward-trie.

[Figure 1: Illustration of training a PCT and then using it to classify previously unseen words.]

The classification is applied recursively as well: for an input string, both the backward and forward tries are used to obtain the most probable class. This results in up to two identified morpheme boundaries and hence up to three parts of the original word. Each part is then subjected to the same mechanism recursively until no further classifications can be found.

In the Morpho Challenge 2005 [15], both the local LSV method and a subsequent application of the trie learning were submitted separately. As expected, the LSV method had a high precision, but extremely low recall (only 1.9% for Finnish, for example). The application of the trie increased recall, but also lowered precision. The precision decreased due to overgeneralization, which occurs mostly because the algorithm receives only positive examples. Even for a well-represented input word, the contextually similar words do not necessarily share grammatical information with it. Therefore, a high-frequency input word such as matter, for which the LSV algorithm did not find any morpheme boundaries, cannot be taken as an example of a word where -er is not a suffix.

3 Refined Implementation

The above-mentioned weaknesses of the LSV + trie combination hold uniformly over all tested languages. The following modifications attempt to address some of these weaknesses, while trying to avoid language-specific rules or thresholds. The new version contains several changes: a recursive compound identifier, an iteration of the LSV algorithm and a splitting of the trie classification into two steps.

3.1 Identifying compounds

The LSV algorithm is good at finding affixes, because the set of contextually similar words often contains words with the same affix. However, words contextually similar to a compound do not necessarily contain other compounds, or compounds sharing parts with the input word. Particularly for semantically opaque compounds (i.e. where the meaning of the parts is not related to the meaning of the compound) this is guaranteed to be the case. Therefore it is mostly impossible for the LSV algorithm to find morpheme boundaries between the parts of a compound, unless the compound contains a very productive part, such as teacher (Mathelehrer, Physiklehrer, ...).

Since only a small sample set is sufficient for the trie to correctly classify most compounds later, it is not necessary to find all compounds at this point. The following simple mechanism tests whether a certain partition of the input word at the position i is a possible correct compound division:

    testDiv(word, i):
        minl = 4
        leftPart  = word[0..i]
        rightPart = word[i..]
        if (leftPart.length > minl and rightPart.length > minl
            and freq(leftPart) > 20 and freq(rightPart) > 20)
            leftPart  = findComp(leftPart)
            rightPart = findComp(rightPart)
            return leftPart - rightPart
        else
            return null

Assuming testDiv(word, i) also returns the sum of the frequencies of all parts, it is then possible to take the one partition of the input word that maximizes the frequency of the participating parts.
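A minimal runnable sketch of this mechanism is given below. It assumes a frequency dictionary freq mapping word forms to corpus counts; the function name find_comp and the toy frequencies are illustrative, and the score of a partition is taken here to be the summed frequency of its parts, as described above.

    MIN_LEN = 4     # minimal length of a compound part (exclusive)
    MIN_FREQ = 20   # minimal corpus frequency of a compound part (exclusive)

    def find_comp(word, freq):
        """Recursively split `word` at the division whose parts all exist in the
        corpus and whose summed part frequency is maximal; return (parts, score)."""
        best = None
        for i in range(MIN_LEN + 1, len(word) - MIN_LEN):
            left, right = word[:i], word[i:]
            if freq.get(left, 0) > MIN_FREQ and freq.get(right, 0) > MIN_FREQ:
                l_parts, l_score = find_comp(left, freq)
                r_parts, r_score = find_comp(right, freq)
                if best is None or l_score + r_score > best[1]:
                    best = (l_parts + r_parts, l_score + r_score)
        # If no valid division exists, the word is returned unsplit.
        return best if best is not None else ([word], freq.get(word, 0))

    # Toy example with assumed (hypothetical) frequencies:
    freq = {"mathe": 50, "lehrer": 400, "mathelehrer": 30}
    print(find_comp("mathelehrer", freq)[0])   # ['mathe', 'lehrer']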
In other words, this mechanism recursively divides a long word into shorter units. If both shorter units exist, their length exceeds 4 and their frequency exceeds 20, then that partition is a candidate, and the candidate with the highest overall score is taken as a (probably) valid partition of the compound. As Table 1 shows, the algorithm (as expected) has a high precision, but very low recall. In fact, it may have even lower recall for other languages. It also shows that training the trie classifier with this data directly indeed increases recall, but also incurs a rather strong loss in precision. It can be assumed that if compounding exists in a language, then this algorithm in combination with the trie classifier helps to find the parts of a large portion of the compounds. However, a more elaborate implementation is desirable at this point, especially since this algorithm is unable to partition any compounds containing linking elements.

                 Recall  Precision  F-measure
    compounds     10.30      88.33      18.44
    lsv iter 1    17.88      88.55      29.76
    lsv iter 3    23.96      84.34      37.31
    saveTrie      31.09      82.69      45.19

Table 1: Iterating the LSV algorithm and applying the modified trie classifier increases recall while keeping precision at high levels (all values in percent).

3.2 Iterated LSV algorithm

For the LSV algorithm, the ideal case is achieved when all words contextually similar to a given input word carry the same grammatical information. However, due to data sparseness, compounds, overly high co-occurrence frequency and other factors, this ideal state is achieved only for few words. In many cases only a few contextually similar words actually share grammatical information with the input word. Running the LSV algorithm may thus find some morpheme boundaries correctly and not find many others. It is important to note that in this setup (using contextually similar words as context) it almost never finds false morpheme boundaries. Table 1 shows that the first run of the LSV algorithm found very few (but very precise) morpheme boundaries. The boundaries that were found are based on the ideal cases where most contextually similar words indeed share grammatical information with the input word.

In order to facilitate the boundary identification for some of the remaining words, it is possible to iterate the LSV algorithm by incorporating knowledge produced in earlier iterations. A straightforward way of doing this is to add an additional factor to the computation of the LSV score: given a substring -ly of the input word clearly, if the same substring was identified as a morph in any (or all) of the contextually similar words, then the LSV score is increased. However, some very frequent words such as was or do-es are contextually similar to a large number of words, which in turn means that these frequent words might adversely influence the analyses of many other words, for example turning James into Jam-es. Therefore the increase of the LSV score has to be normalized against the number of words with the same substring and the number of contextually similar words.

To recall from [7], the formula to compute the left LSV score for the word w at the position i (the formula for the right score is analogous) is:

    lsv_l(w, i) = plsv_l(w, i) · fw_l(w, i) · ib(w, i)    (1)

This takes anomalies such as phonemes represented by several letters into account in a straightforward way. It assumes that plsv_l(w, i) is the plain number of different letters found to the right of the substring between the beginning of the word w and the position i.
fw_l(w, i) is the bi- or trigram-based frequency weight of the substring, whereas ib(w, i) is the inverse bigram weight. The previously acquired knowledge about morpheme boundaries can be used to compute prev_l(w, i) as the number of previously identified morphs pf_l(w, i), divided by 2 and multiplied with the quotient of the number of words containing the same substring, subf_l(w, i), and the size of the pruned list of contextually similar words, prune:

    prev_l(w, i) = pf_l(w, i) · 0.5 · (subf_l(w, i) / prune)    (2)

To prevent the previous analyses from overriding the analysis of the present word, the new LSV score is computed as a multiplication of the LSV score with the previous knowledge, which is capped at lsv_l(w, i) − 1:

    lsv2_l(w, i) = min(lsv_l(w, i) − 1, prev_l(w, i)) · lsv_l(w, i)    (3)

The same is applied, mirrored, to the right LSV score lsv_r(w, i), and the left and right scores are again summed to produce the final score lsv2(w, i), which is compared to a threshold (for example 6) to decide whether the position i in the word w is a morpheme boundary.

For example, the analyses of the most similar words of clear-ly might result in the following morpheme boundaries: closely, white, great-ly, legal-ly, clear, linear-ly, really, weakly, .... Hence, for the position 5 (which corresponds to -ly) in clearly, the number of previously identified morphs pf_r(w, i) is 3. The number of contextually similar words containing the same substring, subf_r(w, i), is 5, and the number of contextually similar words was 150. Hence, prev_r(clearly, 5) = 3 · 0.5 · (5/150) = 0.05, and thus the contribution of the previously acquired knowledge is only 0.05 in this case. As Table 1 shows, there are, however, many cases where the influence was sufficiently strong for the resulting LSV score to reach the threshold. The table also shows that iterating the LSV algorithm increases recall. However, it also incurs a certain precision loss, associated with words such as James that happen to be contextually similar to many other words in which -es really is a suffix (and was also identified as such).
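The update can be sketched compactly. The snippet below assumes that the base score of equation (1) has already been computed elsewhere; it implements equation (2) directly and combines it with the base score as an absolute, capped increase, which is how the worked example above reads (the printed equation (3) is multiplicative, so the exact combination in the original implementation may differ). All names and the threshold value are illustrative.

    THRESHOLD = 6.0   # example decision threshold mentioned in the text

    def prev_knowledge(n_prev_morphs, n_substring_words, n_similar_words):
        # Equation (2): previously found morphs, normalised by how many of the
        # contextually similar words contain the same substring.
        return n_prev_morphs * 0.5 * (n_substring_words / n_similar_words)

    def boosted_score(base_lsv, prev):
        # Assumption: the previous knowledge acts as an absolute increase of the
        # base LSV score, capped at base_lsv - 1 (cf. equation (3)).
        return base_lsv + min(base_lsv - 1.0, prev)

    def is_boundary(lsv_left, lsv_right, prev_left, prev_right):
        # A position is a boundary if the summed boosted scores reach the threshold.
        total = boosted_score(lsv_left, prev_left) + boosted_score(lsv_right, prev_right)
        return total >= THRESHOLD

    # Worked example from the text (clear-ly, position 5):
    print(round(prev_knowledge(3, 5, 150), 2))   # 0.05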
3.3 Split trie classification

Irrespective of its source, the obtained knowledge (i.e. from the simple compound identifier, the original LSV algorithm, or the improved LSV algorithm) is used to train the trie classifier, which is then applied to identify more morpheme boundaries. In the original version the trie produces the most probable class for an input string simply by searching for the deepest node in the trie. Since no further checking was introduced, decisions were in many cases made without considering further context. For example, the LSV algorithm found the morpheme boundary drama-tic. When analyzing plas-tic, the trie classifier would first find t as the deepest matching node. Since that node has only a single class stored, with a frequency count of 1, the classifier would decide in favor of -tic being a morph, with a maximal value of 1. In other words, no further context from the word is considered and the decision is made on the grounds of only a single training instance. However, simply forbidding all decisions that do not take a certain amount of the word into account would result in extremely low recall, such as 31% for German in Table 1.

The trie classification is thus split into two parts, a modified trie classifier and subsequently the original, unmodified trie classifier. The modified trie classifier returns a decision only if all of the following conditions are met:

• The deepest matching node must be at least two letters deeper than the class to be returned.
• The matching node must have a minimal distance of three from the root of the trie.
• The total sum of the frequencies of all classes stored in the deepest matching node must be larger than 5.

Table 1 shows that applying the modified trie classifier saveTrie increases recall by 8% while reducing precision by less than 2%. Table 2 additionally shows that the subsequent application of the original trie classifier further increases recall to a total of 66% while lowering precision to roughly 69%. The table also shows that applying the original trie classifier directly to any of the LSV iterations or even to the compound identification algorithm results in lower overall performance.

                 Recall  Precision  F-measure
    compounds     27.93      66.45      39.33
    lsv iter 1    57.66      71.00      63.64
    lsv iter 3    62.72      68.96      65.69
    saveTrie      66.10      68.92      67.48

Table 2: Applying the unmodified classifier increases recall by a large amount, but also lowers precision considerably (all values in percent).

3.4 Assessing the improvements

In order to measure the influence of the various proposed improvements, a number of experiments were run on the 3-million-sentence German corpus available for the Morpho Challenge 2007. The results of each improvement were measured and are depicted in Table 1. Additionally, the original trie classifier was applied to the results of each modification, as depicted in Table 2. These evaluations show that ultimately the local LSV implementation could be significantly improved. As such, it reaches performance similar to that reported in [5] despite being run on a significantly smaller corpus (3 million sentences vs. 11 million). On the other hand, the relatively small improvements achieved indicate that a significantly better morpheme boundary detection may only be achieved by combining this method with an entirely different approach, possibly one combining the various hitherto described approaches.

The results of the Morpho Challenge 2007 also show that currently the MDL based approaches to morpheme boundary detection [8, 2] mostly outperform the LSV based approach, especially in the more important Information Retrieval task evaluation. The most probable reason is that the LSV algorithm is good at detecting boundaries within high-frequency words, whereas the MDL based algorithms are better at detecting boundaries in longer words. Longer words tend to be less frequent and are thus more important for Information Retrieval than the more frequent words.

A manual analysis of the resulting word list revealed several possible improvements:

• An algorithm specifically designed to identify compounds by means of finding both parts of a word as single words in the same sentence (reformulations) might help to find the linking elements that currently remain undetected, such as in mitglieds-laend-er.
• In a post-processing step, an algorithm based on affix signatures such as the one proposed by [11] might find errors or generalize known morpheme boundaries better than the trie classifiers and ultimately avoid mistakes such as in-fra-struktur.
• A global morpheme vocabulary control mechanism, such as MDL [9, 14, 8, 2], might provide further evidence for or against certain morpheme boundaries and subsequently inhibit mistakes such as schwa-ech-er.

However, apart from the morpheme boundary detection, a clustering algorithm is needed that would cluster the various found morphs according to their grammatical function.
The contextual similarity on which the local LSV algorithm is based possibly already provides the corresponding information. Once the morpheme boundary detection achieves a sufficient quality, the contextually similar words of an input word could be taken as probably carrying a suffix with the same function, despite the suffix having a different form, such as the German plural endings -en, -s, -e, -er. The alternation identification algorithm reported earlier [7] shows that such paradigmatic operations are in principle possible on the morphological level (using only knowledge-free methods).

4 Morpheme analysis

Under the assumption that morpheme boundaries were correctly detected, it is possible to treat every single morph separately (similarly to a word) in a statistical co-occurrence analysis. This allows computing contextual similarity between morphs, instead of words. The following algorithm uses this procedure to find rules that relate various morphs to each other and then applies these rules to produce morphemic analyses of the words that originally occurred in the corpus:

    for each morph m
        for each contextually similar morph s of m
            if LD_Similar(s, m)
                r = makeRule(s, m)
                store(r -> s, m)

    for each word w
        for each morph m of w
            if in_store(m)
                sig = createSignature(m)
                write sig
            else
                write m

For each morph, the function LD_Similar(s, m) filters from the contextually most similar morphs those that differ only minimally, based on the Levenshtein distance (LD) [16] and the word lengths. This step could be replaced by a more elaborate clustering mechanism. Pairs with short morphs are only accepted if LD = 1; pairs with longer morphs may have a larger distance. The function makeRule(s, m) creates a hypothetical rule that explains the difference between two contextually similar morphs. For example, the morphs ion and ions have a Levenshtein distance of 1, so the function creates a rule -s (or n -> ns, to take more context into account) which says that s can be added to derive the second morph from the first one. This rule is then stored and associated with the pair of morphs that produced it. This later allows deciding between probably correct rules (if many morph pairs are associated with them) and incorrect ones.

The second part of the morphemic analysis then applies the acquired knowledge to the original word list. The goal is an analysis of the morphemic structure of all words, where a morpheme is represented by all its allomorphs. In the first step, each word is thus split into its morphs, according to the LSV and trie based algorithm described above. In the next step, all related morphs as stored by the first part of the morphemic analysis are retrieved for each morph of the input word. The function createSignature(m) produces a representation of each morpheme. For example, the original word fracturing was found to have two morphs: fractur and ing. The first morph is related to the two morphs fracture and fractures. The second morph is related to inag, ingu and iong. This results in the following analysis:

    fracturing > fractur.fracture.fractures > inag.ing.ingu.iong

It is noteworthy that this algorithm cannot distinguish between the various meanings of a single morph. In English, the suffix -s may be a plural marker if used with a noun or the third person singular marker if used with a verb. Given the extremely high frequency of some of these ambiguous morphs, the number of (at least partially) wrong analyses produced by the algorithm is likely to be high.
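A compact sketch of this procedure follows. It assumes one precomputed input that the text describes but does not show, namely a mapping sim from each morph to its contextually most similar morphs; the length cut-off in ld_similar and all helper names are illustrative choices, not the original implementation.

    from collections import defaultdict

    def levenshtein(a, b):
        # Standard dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def ld_similar(s, m):
        # Short morph pairs must differ by exactly one edit; longer pairs may
        # differ by more (the length cut-off of 4 is an assumed value).
        limit = 1 if min(len(s), len(m)) <= 4 else 2
        return 0 < levenshtein(s, m) <= limit

    def make_rule(s, m):
        # Describe the difference of two similar morphs after their common
        # prefix, e.g. ion / ions -> ('', 's'), i.e. the rule "-s".
        p = 0
        while p < min(len(s), len(m)) and s[p] == m[p]:
            p += 1
        return (s[p:], m[p:])

    def learn_rules(sim):
        rules, related = defaultdict(list), defaultdict(set)
        for m, similars in sim.items():
            for s in similars:
                if ld_similar(s, m):
                    rules[make_rule(s, m)].append((s, m))
                    related[m].add(s)
                    related[s].add(m)
        return rules, related

    def signature(m, related):
        # A morpheme is represented by the morph and all its related morphs.
        return ".".join(sorted(related.get(m, set()) | {m}))

    # Toy example reproducing the analysis of "fracturing" from the text:
    sim = {"fractur": ["fracture", "fractures"], "ing": ["inag", "ingu", "iong"]}
    rules, related = learn_rules(sim)
    print(signature("fractur", related) + " > " + signature("ing", related))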
Further research may revolve around using an unsupervised POS tag inducer [3] to distinguish between different word classes, or using a word sense induction algorithm [6] applied to morphs in order to induce the various meanings.

The results from the Morpho Challenge 2007 are surprising in that the morpheme analysis did not yield any significant changes to the evaluation results. This is despite the fact that on average nearly every single morpheme is represented by several morphs. After exploring the word lists for German, the most probable reasons for this appear to be any of the following:

• During the construction of the rules no context is taken into account. This often results in morphs being treated as related even though they merely happen to look similar and to share some contextual similarity. Hence, the benefits of the analysis and its errors might be cancelling each other out.
• Many of the morphs representing a morpheme are, in fact, only artifacts of the mistakes of the morpheme boundary detection algorithm. Thus, the morpheme analysis appears to be strongly influenced by the quality of the detected boundaries.
• When determining the validity of a rule, the number of morph pairs is taken into account, but not their frequency. As a result, many extremely rare morphs (without any impact on the evaluation) are merged correctly into morphemes, but many very frequent ones (with actual impact on the evaluation) are missed.

5 Conclusions

Whereas the changes introduced to the morpheme boundary detection improve the overall performance, they also add several more parameters to the entire process. The parameters do not have to be set specifically for each language, but a large number of parameters often indicates the possibility of overfitting. Yet, despite the improvements and the possibility of overfitting, the performance of knowledge-free morpheme boundary detection is far below what knowledge-rich (i.e. rule-based) systems achieve. Nevertheless, the significant improvements achieved in the Information Retrieval evaluation task of the Morpho Challenge 2007 sufficiently demonstrate the usefulness of such algorithms even in their current state.

Compared to other knowledge-free morpheme boundary detection algorithms, the version of the LSV algorithm described in this paper produces good results. The modular design of this algorithm allows for better interoperability with other algorithms. For example, the significant performance boost achieved by adding a compound splitter indicates that combining various underlying hypotheses is more likely to yield significant improvements than changes to any single method. Also, given that the simplest combination of algorithms, in the form of a voting algorithm, demonstrated an extraordinary increase in performance in the Morpho Challenge 2005, it is reasonable to assume that more direct combinations should perform even better.

The noise produced during the morpheme boundary detection, the missing method for distinguishing ambiguous affixes and other factors caused the subsequent morphemic analysis to produce apparently insignificant results. It becomes obvious that adding further algorithmic solutions representing other hypotheses about morpheme boundaries, as well as a more elaborate morphemic analysis, should be a significant step towards a true morphemic analysis similar to what can be done manually.

References

[1] R. Harald Baayen, Richard Piepenbrock, and Léon Gulikers. The CELEX lexical database (CD-ROM).
Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA, 1995.
[2] Delphine Bernhard. Unsupervised morphological segmentation based on segment predictability and word segments alignment. In Proceedings of the PASCAL Challenges Workshop on Unsupervised Segmentation of Words into Morphemes, April 2006.
[3] Chris Biemann. Unsupervised part-of-speech tagging employing efficient graph clustering. In Proceedings of the Student Research Workshop at the COLING/ACL, Sydney, Australia, July 2006.
[4] Stefan Bordag. Unsupervised knowledge-free morpheme boundary detection. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, September 2005.
[5] Stefan Bordag. Two-step approach to unsupervised morpheme segmentation. In Proceedings of the PASCAL Challenges Workshop on Unsupervised Segmentation of Words into Morphemes, Venice, Italy, April 2006.
[6] Stefan Bordag. Word sense induction: Triplet-based clustering and automatic evaluation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, Italy, April 2006.
[7] Stefan Bordag. Elements of Knowledge-free and Unsupervised Lexical Acquisition. PhD thesis, Department of Natural Language Processing, University of Leipzig, Leipzig, Germany, 2007.
[8] Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Publications in Computer and Information Science, Report A81, Helsinki, Finland, March 2005. Helsinki University of Technology.
[9] Carl de Marcken. The unsupervised acquisition of a lexicon from continuous speech. Memo 1558, MIT Artificial Intelligence Lab, 1995.
[10] Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, Workshop on Paradigms and Grounding in Natural Language Learning at NeMLaP3/CoNLL98, pages 295–299, Adelaide, Australia, January 1998.
[11] John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, 2001.
[12] Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371–385, 1974.
[13] Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190–222, 1955.
[14] Dimitar Kazakov. Unsupervised learning of word segmentation rules with genetic algorithms and inductive logic programming. Machine Learning, 43:121–162, April–May 2001.
[15] Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ebru Arisoy, and Murat Saraclar. Unsupervised segmentation of words into morphemes – Challenge 2005: An introduction and evaluation report. In Proceedings of the PASCAL Challenges Workshop on Unsupervised Segmentation of Words into Morphemes, Venice, Italy, 2006.
[16] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848, 1965.
[17] Donald R. Morrison. PATRICIA – practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514–534, October 1968.
[18] Patrick Schone and Daniel Jurafsky. Knowledge-free induction of inflectional morphologies. In Proceedings of the 2nd Annual Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, USA, June 2001.