ParaMor: Finding Paradigms across Morphology Christian Monson, Jaime Carbonell, Alon Lavie, Lori Levin Language Technologies Institute Carnegie Mellon University {cmonson, jgc+, alavie+, lsl+}@cs.cmu.ed Abstract Our algorithm, ParaMor, fared well in Morpho Challenge 2007 (Kurimo et al., 2007), a peer operated competition pitting against one another algorithms designed to discover the morphological structure of natural languages from nothing more than raw text. ParaMor constructs sets of affixes closely mimick- ing the paradigms of a language, and, with these structures in hand, annotates word forms with mor- pheme boundaries. Of the four language tracks in Morpho Challenge 2007, we entered ParaMor in Eng- lish and German. Morpho Challenge 2007 evaluated systems on their precision, recall, and balanced F1 at identifying morphological processes, whether those processes mark derivational morphology or in- flectional features. In English, ParaMor’s balanced precision and recall outperform at F1 an already so- phisticated baseline induction algorithm, Morfessor (Creutz, 2006). ParaMor placed fourth in English overall. In German, ParaMor suffers from a low morpheme recall. But combining ParaMor’s analyses with analyses from Morfessor results in a set of analyses that outperform either algorithm alone, and that place first in F1 among all algorithms submitted to Morpho Challenge 2007. Categories and Subject Descriptors I.2 [Artificial Intelligence]: I.2.7 Natural Language Processing Keywords Unsupervised Natural Language Morphology Induction, Paradigms 1 Introduction Performance at natural language processing tasks as different as speech recognition (Creutz, 2006) and machine translation (Goldwater and McClosky, 2005) can improve with careful morphological analysis. But building a morphological analyzer for a natural language requires expert language knowledge that may be in short supply. In this paper we describe ParaMor, an algorithm that automates the construction of a morphology analysis sys- tem for any language; and we present and discuss ParaMor’s performance in Morpho Challenge 2007 (Kurimo et al., 2007), a competition for algorithms that induce the morphology of natural languages from nothing more than unannotated text. 1.1 Paradigms: The Structure of Natural Language Morphology Both traditional and modern theories of inflectional morphology (Stump, 2001) organize natural language mor- phology by paradigms. Where a paradigm is the set of surface forms a lexeme can take as it inflects for relevant morphosyntactic features. Following suit, our work on unsupervised morphology induction also recognizes the paradigm as the natural organizational structure of inflectional morphology. One of the properties of paradigms we exploit in our work is that of the mutual exclusion of affixes. Consider Spanish verbs. Each verbal lexeme in Spanish can take upwards of 35 surface forms. Most of the surface forms of a Spanish verb mark tense or mood in combination with person and number, but here we focus on the rela- tively few non-finite forms of Spanish verbs. A Spanish verb can appear in exactly one of three non-finite forms: as a past participle, as a present participle, or in the infinitive. If the verb occurs as a past participle, then the verb takes additional suffixes. First, an obligatory suffix marks gender, an a marks feminine, an o masculine. Follow- ing the gender suffix either a plural suffix, s, appears or else there is no suffix at all. The lack of an explicit plural Form Gender Number Form Gender Number Feminine Singular a Ø Past Participle ad Masculine Plural o s Present Participle ando Infinitive ar Figure 1: Left: A fragment of the Spanish verbal paradigm. There are three morphosyntactic categories cov- ered in this paradigm fragment: first, form; second, gender; and third, number. Each of these three catego- ries appear in separate columns. And features within one feature column are mutually exclusive. Right: The suffixes filling the cells of the Spanish verbal paradigm fragment for the inflection class of ar verbs. suffix marks singular. The values of each individual morphosyntactic feature (form, gender, and number) are mutually exclusive. The Spanish lexeme administrar, given here in the infinitive, translates as to administer or manage. The feminine plural past participle of administrar is administradas which can refer to a group of women under administration, an in the managed help. There is no way for administrar or any other Spanish lex- eme to appear simultaneously in the infinitive and in a past participle form simultaneously: *admistrardas, *admistradasar. Figure 1 sketches the paradigm schema of Spanish nonfinite verb forms. In the left-hand table the feature values for the form, gender, and number features are given, while the right-hand table presents the surface forms of the suffixes realizing the corresponding feature values for verbs belonging to the class of regu- lar Spanish ar verbs. Our unsupervised morphology induction algorithm exploits the mutual exclusivity of feature-valued para- digms in two phases. ParaMor’s first phase identifies sets of mutually exclusive strings which mimic paradigms. ParaMor’s second phase segments word forms into morpheme-like pieces suggested by the discovered para- digms. Currently, ParaMor can isolate word final suffixes. ParaMor’s methods can be straightforwardly general- ized to prefixes and forthcoming work models sequences of concatenative morphemes. 1.2 Related Work In this section we highlight previously proposed minimally supervised approaches to the induction of morphol- ogy that, like ParaMor, draw on the unique structure of natural language morphology. One facet of NL morpho- logical structure commonly leveraged by morphology induction algorithms is that morphemes are recurrent building blocks of words. Brent et al. (1995), Goldsmith (2001), and Creutz (2006) emphasize the building block nature of morphemes when they each use recurring word segments to efficiently encode a corpus. These ap- proaches then hypothesize that those recurring segments which most efficiently encode a corpus are likely mor- phemes. Another technique that exploits morphemes as repeating sub-word segments encodes the lexemes of a corpus as a character tree, i.e. trie, (Harris, 1955; Hafer and Weis, 1974; Demberg, 2007), or as a finite state automaton (FSA) over characters (Johnson, H. and Martin, 2003; Altun and M. Johnson, 2001). A trie or FSA conflates multiple instances of a morpheme into a single sequence of states. The paradigm structure of NL mor- phology has also been previously leveraged. Goldsmith (2001) uses morphemes to efficiently encode a corpus, but he first groups morphemes into paradigm like structures he calls signatures. To date, the work that draws the most on paradigm structure is Snover (2002). Snover incorporates paradigm structure into a generative statistical model of morphology. 2 ParaMor We present our unsupervised morphology induction algorithm, ParaMor, by following an extended example of the analysis of the Spanish word administradas (administered). The word administradas occurs in the corpus of Spanish newswire on which we developed the ParaMor algorithm. This Spanish newswire corpus contains 50,000 types. We hope the detailed example we give here can flesh out the abstract step-by-step description of ParaMor in Monson et al. (2007). Before delving into ParaMor’s details we note two facts which guided algorithm design. First, in any given corpus, a particular lexeme will likely not occur in all possible inflected forms. But rather each lexeme will occur in some subset of its possible surface forms. Second, we expect inflected forms of a single lexeme to be corre- lated. That is, if we have observed several lexemes in inflected form A , and if B belongs to the same paradigm as A , then we can expect a significant fraction of those lexemes inflected as A to also occur in an inflected form with B . 2.1 A Search for Partial Paradigms ParaMor begins with a search for partial paradigms, where a partial paradigm is a set of candidate suffixes, and a candidate suffix is any word final substring. The word administradas gives rise to many candidate suffixes in- cluding: stradas, tradas, radas, adas, das, as, s, and Ø. Referring again to Figure 1, the candidate suffix s is a true morpheme of Spanish, marking plural. Additionally, the candidate suffixes as and adas, cleanly contain more than one suffix: The left edges of the word-final strings as and adas occur at Spanish morpheme bounda- ries. All other candidate suffixes derived from administradas incorrectly segment the word. The candidate suf- fixes radas, tradas, stradas, etc. erroneously include part of the stem, while das, in our analysis, places a mor- pheme boundary internal to the past participle morpheme ad. Of course, while we can discuss which candidate suffixes are reasonable and which are not, an unsupervised morphology induction system has no a priori knowl- edge of Spanish morphology. ParaMor does not know what strings are valid Spanish morphemes, is ParaMor aware of the feature value meanings associated with morphemes. Each candidate suffix may be derived from multiple word forms. The candidate suffix stradas occurs as the final substring of eight wordforms in our Spanish corpus, including the words administradas, arrastradas (wretched) and mostradas (accustomed). The candidate suffix s is a word final string of 10,662 wordforms in this same corpus, more than one fifth of the unique wordforms! When a candidate suffix is stripped from a surface word, we call the remaining word initial string a candidate stem. The (incorrect) candidate suffix stradas gives rise to eight (incorrect) candidate stems including admini, arra, and mo. ParaMor’s initial search for partial paradigms considers every candidate suffix derived from any word form in the input corpus as potentially part of a true inflectional paradigm. ParaMor’s search considers each non-null candidate suffix in turn, beginning with that candidate suffix which can attach to the most candidate stems, working toward suffixes which can attach to fewer stems. For each particular candidate suffix, f , ParaMor notes the candidate stems, T , to which f can attach, and then identifies the candidate suffix, f ′ , that forms separate corpus words with the largest number of stems in T . The candidate suffix f ′ is then added to the par- tial paradigm anchored by f . In our examples, all eight of the candidate stems that take stradas also form cor- pus words with the candidate suffix strada (words such as administrada, arrastrada, and mostrada) and hence strada would be added to the partial paradigm begun from stradas; similarly, the candidate suffix which can attach to the largest fraction of the 10,662 candidate stems which have a word final s is Ø, at 5501. Now with a partial paradigm containing two candidate suffixes, ParaMor resets T to be the set of candidate stems which form corpus words with both f and f ′ . ParaMor then searches for a third suffix which can form words with a large subset of this new T . ParaMor continues to add candidate suffixes until one of two halting criteria is met: 1. Since we expect suffixes from a single paradigm to be correlated, ParaMor stops growing a partial paradigm if no candidate suffix can form corpus words with at least a threshold fraction of the stems in the current partial paradigm. 2. ParaMor stops adding candidate suffixes if the stem evidence for the partial paradigm is too meager— ParaMor will only add a suffix to a partial paradigm if there are more stems than there are suffixes in the proposed partial paradigm. Figure 2 contains a number of search paths that ParaMor followed when analyzing our Spanish corpus. Most of the paths in Figure 2 are directly relevant to the analysis of administradas. Search paths begin at the bottom of Figure 2 and proceed upwards one candidate suffix at a time. In Spanish, the non-null candidate suffix that can attach to the most stems is s. The search path begun from s is the right-most search path shown in Figure 2. As discussed above, the null suffix, Ø, can attach to the largest number of candidate stems to which s can attach, and so the first search step adds Ø to the candidate suffix s. ParaMor then identifies the candidate suffix r as the suf- fix which can attach to the most stems to which s and Ø can both attach. But r can only form corpus words in combination with 287 or 5.2% of the 5501 stems to which s and Ø can attach. As such a severe drop in stem count does not convincingly suggest that the candidate suffix r is correlated with the candidates s and Ø, Pa- raMor does not add r, or any other suffix, to the now closed partial paradigm s.Ø. Experimentally we determined that, for Spanish, requiring at least 25% of stems to carry over when adding a candidate suffix seems to discover reasonable partial paradigms. ra rada radas a ada adas ado Ø da das do rado rados ran ados an ar aron ó dos n ndo r ron rar raron ró 1786 6051 167 ra rada radas a ada ado Ø da das do rado rados rar ados an ar aron ó dos n r ron raron ró 1786 6051 167 trada tradas rada radas rado a ada ado Ø da do trado trados rados rar raron ró an ar aron ó dos n r ron trar traron tró 167 1786 6051 167 trada tradas trado rada radas rado trados trar tró a ado an ar aron ó Ø da do n r ron rados rar ró 167 1786 6051 167 strada trada stradas tradas strado trado rada radas rado a ado an ar ó Ø do n r ron strar trar rados rar stró tró 1786 6051 167 7 30 strada trada rada radas strado trado a an ar ó Ø do n r a as o os rado rados strar trar 1786 6051 8981 stró tró 167 8 30 strada trada rada strado trado rado a an ar Ønr a o os Ørs stró tró rados 1786 6051 8981 287 9 30 167 strada trada rada Øs a an Ø es Øn ao strado trado rado 1786 10662 6051 8981 5501 12 30 167 strado trado rado an es n a s ... ... ... 15 30 167 1786 10662 6051 8981 10662 Figure 2: Eight search paths that ParaMor follows in search of likely partial paradigms. Search paths begin at the bottom of the figure and move upward. Candidate suffixes appear in bold. The underlined candidate suffix in each partial paradigm is the suffix added by the most recent search step. Each partial paradigm gives the number of candidate stems which attach to all candidate suffixes in that partial paradigm. Horizontal links be- tween partial paradigms connect sets of suffixes that differ only in their initial character. Continuing leftward from the s-anchored partial paradigm in Figure 2, ParaMor follows search paths from the candidate suffixes a, n, es, and an in turn. The 77th candidate suffix from which ParaMor grows a partial para- digm is rado. The search path from rado is the first path to build a partial paradigm that includes the candidate suffix radas, relevant for administradas. Similarly, search paths from trado and strado lead to partial paradigms which include the candidate suffixes tradas and stradas respectively. The search path from strado illustrates the second stopping criterion. From strado four candidate suffixes are added one at a time: strada, stró, strar, and stradas. Only seven candidate stems form words when combined singly with all five of these candidate suffixes. Adding any additional candidate suffix to these five suffixes brings the stem count down at least to six. Since six stems is not more than the six suffixes which would be in the resulting partial paradigm, ParaMor does not add a sixth candidate suffix. In our corpus of Spanish newswire text, ParaMor’s initial search identifies partial paradigms containing 92% of all ideal inflectional suffixes of Spanish, or 98% of the ideal suffixes that occurred at least twice in the corpus. Among the selected partial paradigms are those which contain portions of all nine true paradigms for our analy- sis of Spanish. The high recall of the initial search comes, of course, at the expense of precision. While our analysis provides nine true paradigms and 87 unique suffixes, 8339 partial paradigms are constructed containing 9889 unique candidate suffixes. The constructed partial paradigms have at least three readily apparent flaws. First, the candidate suffixes of many partial paradigms overlap. At the end of the initial search, there are 27 dis- tinct partial paradigms that contain the reasonable candidate suffix adas. Each of these 27 partial paradigms geminates from a distinct initial candidate suffix: an, en, ación, amos, etc. Second flaw, most constructed partial paradigms contain many fewer candidate suffixes than do the true paradigms of Spanish. And third, many partial paradigms include candidate suffixes possessed of an incorrect morpheme boundary. ParaMor addresses the first two flaws by merging together similar partial paradigms. And ParaMor addresses the third flaw while further ameliorating the second through filters which weed out less likely paradigm clusters. 2.2 Merging Partial Paradigms To merge partial paradigms ParaMor adapts greedy hierarchical agglomerative clustering. The details of the spe- cific clustering algorithm appear in Monson et al. (2007). Here we continue our Spanish example to illustrate how partial paradigms are merged. Figure 3 contains a small portion of the partial paradigm cluster that con- sumes the partial paradigm built from the candidate suffix an. The first eight steps of the partial paradigm search path from an appear in Figure 2. But the search path continues until there are fifteen candidate suffixes in the partial paradigm: a, aba, aban, ada, adas, ado, ados, an, ando, ar, aron, arse, ará, arán, and ó. The partial para- digm built from an appears on the center right of Figure 3. During clustering, an’s partial paradigm is merged with a cluster that has previously formed from a merger of two partial paradigms. These two partial paradigms and their merged cluster appear at the bottom left of Figure 3. ParaMor decides which partial paradigm clusters to merge by computing a similarity score between pairs of paradigm clusters. A variety of similarity metrics on partial paradigms are possible. Looking at Figure 3, it is clear that both the candidate suffix sets and the candi- date stem sets of partial paradigms can overlap. Consequently partial paradigms can share covered surface types. For example, the bottom two clusters of Figure 3 both contain the candidate suffix a and the candidate stem anunci, reconcatenating this stem and suffix we say that both of these partial paradigms cover the boundary an- notated word form anunci+a. ParaMor computes the similarity of partial paradigms, and their clusters, by com- paring just such sets of morpheme boundary annotated word forms. We have found that the particular similarity metric used does not significantly affect clustering. For the experiments we report here we use the cosine simi- larity for sets, given as X ∩ Y ( ) X Y 1 / 2 . It is interesting to note that similarity scores do not monotonically decrease moving up the tree structure of a particular cluster. Non-decreasing similarities is a consequence of computing similarities over sets of objects which are merged up the tree. Returning to our Spanish example word administradas, Clustering reduces, from 27 to 6, the number of distinct partial paradigms in which the candidate suffix adas occurs. Clustering also reduces the total number of separate partial paradigms to 7511 from 8339. 2.3 Filtering Partial Paradigm Clusters With the fragmentation of partial paradigms significantly reduced, ParaMor focuses on removing erroneously proposed partial paradigm clusters. After clustering we would expect that most sound clusters cover a reasonably large number of word forms of the corpus. So ParaMor’s first filtration step simply removes all partial paradigms which do not cover at least a threshold number of word forms. Monson et al. (2007) discusses our empirical pro- cedure to identify a reasonable threshold. ParaMor currently discards all partial paradigms which do not cover at least 37 word forms. This first filter drastically reduces the number of selected partial paradigms, from 7511 to 17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó Cosine Similarity: 0.715 532 Covered Types 15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó 16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó 25: anunci, aplic, apoy, celebr, consider, desarroll, desplaz, disput, elev, enfrent, estudi, expres, Cosine Similarity: 0.664 form, hall, integr, lanz, llam, lleg, llev, ocup, 451 Covered Types pas, present, realiz, registr, tom 375 Covered Types 15: a aba ada adas ado ados an ando ar 15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó ara aron arse ará arán ó 22: anunci, aplic, apoy, celebr, concentr, confirm, 23: anunci, apoy, confirm, consider, declar, desplaz, declar, elev, entreg, expres, fij, form, gan, disput, entreg, estudi, fij, gan, hall, inici, lanz, llam, inici,lanz, llam, llev, pas, present, realiz, tom lleg, llev, ocup, pas, present, public, realiz, tom 330 Covered Types 345 Covered Types Figure 3: A portion of a cluster of partial paradigms. The candidate suffixes of each partial paradigm or cluster node appear in bold, candidate stems are in italics. Suffixes in cluster nodes which uniquely originate in one child are underlined. Also noted is the number of types covered by each partial para- digm or cluster node. 137. Among the many discarded partial paradigms is one of the six remaining partial paradigms containing adas. Although adas can be a valid verbal suffix sequence, the discarded partial paradigm was built from forms includ- ing gradas (stairs) and hadas (fairies), both nouns. Also removed are all partial paradigms containing the incor- rect candidate suffix stradas—pseudo paradigms such as the partial paradigm built up from the candidate suffix strado presented at the far left of Figure 2. Of the 137 remaining partial paradigm clusters, more than a third clearly attempt to model a morpheme boundary to the left of a correct morpheme boundary. Among these left-leaning clusters are those containing the candidate suffixes tradas and radas, including clusters which subsume the partial paradigms built from the can- didate suffixes trado and rado given in Figure 2. To filter out left leaning clusters ParaMor implements a strat- egy inspired by Harris (1955). In a partial paradigm modeling a legitimate morpheme boundary, the candidate stems will likely take a wide variety of final characters, while, in reflection, the candidate suffixes will likely begin with a variety of characters. Conversely, in a partial paradigm attempting to place a morpheme boundary internal to a morpheme, the candidate stems will mostly end with the same character and the candidate suffixes will mostly begin with the same character. We apply this logic to build a filter that discards partial paradigm clusters with an obviously better morpheme boundary to the right of that proposed by the cluster. Specifically, ParaMor examines the suffixes in each cluster. If all the suffixes begin with the same character, then ParaMor recursively inspects the partial paradigms that would result from stripping off that initial character from all the suffixes in each partial paradigm that that cluster is built from. If more than half of the cluster’s base partial paradigms identify a likely morpheme boundary to the right, then that cluster is entirely removed. For example, consider the only cluster among the remaining 137 that contains the candidate suffix tradas. One of the partial paradigms this cluster is built from is that partial paradigm given in Figure 2 which geminates from the candidate stem trados, namely trada.tradas.trado.trados.trar.traron.tró. In Figure 2, this tradas-containing partial paradigm is linked to the right with the partial paradigm rada.radas.rado.rados.rar.raron.ró—obtaind by removing the initial t from each candidate suffix. Although not pictured in Figure 2, the partial paradigm containing radas is further connected to the partial paradigm ada.adas.ado.ados.ar.aron.ó through removal of the initial r. And the stems of this adas-containing partial paradigm end in a wide variety of characters, suggesting a morpheme boundary. We measure stem final character variety using entropy. If stem final character entropy falls above a threshold value then ParaMor takes that partial paradigm as modeling a morpheme boundary. We have found that even a conservative, low, entropy cutoff discards nearly all clusters which model a morpheme boundary too far to the left. Applying this filter leaves 80 clusters, and furthermore completely removes all clusters containg the candidate suffixes tradas and/or radas. ParaMor currently contains no method for discarding clusters which place a morpheme boundary to the right of the correct position. 2.4 Segmentation Finally, with a strong grasp on the paradigm structure, ParaMor straightforwardly segments the words of a cor- pus into morphemes. ParaMor’s current segmentation algorithm is perhaps the most simple paradigm inspired segmentation algorithm possible. Essentially, ParaMor strips off suffixes which likely participate in a paradigm. To segment any word, w , ParaMor identifies all partial paradigm clusters that contain a non-empty suffix that matches a word final string of w . For each such matching suffix, f ∈ C , where C is the cluster containing f , we strip f from w obtaining a stem t . If there is some second suffix f ′ ∈ C such that t. f ′ is a word form found in either the training or the test corpus, then ParaMor proposes a segmentation of w between t and f . ParaMor, here, identifies f and f ′ as mutually exclusive suffixes from the same paradigm. If ParaMor finds no complex analysis, then we propose w itself as the sole analysis of the word. Note that for each word form, ParaMor may propose multiple separate segmentation analyses each containing a single proposed stem and suf- fix. Let us finish out our extended example of the analysis of the word administradas. Among the 80 paradigm clusters that ParaMor accepts are clusters containing the candidate suffixes adas, das, as, and s. Of these, adas, as, and s identify correct morpheme boundaries, while das does not. The clusters containing candidate suffix das cannot be removed with either the size or the currently implemented morpheme boundary filters. Among the clusters which contain adas several also contain ada; similarly das and da, as and a, and s and Ø, each appear together in at least one cluster. Replacing, in administradas, adas with ada, das with da, as with a, or s with Ø results in the potential word form administrada. As administrada does occur in our Spanish corpus, ParaMor produces four separate analyses of the word administradas: administr +adas, administra +das, administrad +as, and administrada +s. Each of these four analyses appears as is in the file of analyzed words ParaMor produces. 3 Morpho Challenge 2007 Results and Conclusions We entered ParaMor in the English and the German tracks of Morpho Challenge 2007. In each track we submit- ted three systems. The first system we submitted was ParaMor alone. ParaMor’s algorithm has free parameters. We did not vary these parameters, but held each at a setting which produced reasonable Spanish suffix sets (Monson et al., 2007). The English and German corpora used in Morpho Challenge 2007 were larger than we had previously worked with. The English corpus contains nearly 385,000 types, while the German corpus con- tains more than 1.26 million types. ParaMor induced paradigmatic scheme-clusters over these larger corpora from just the top 50,000 most frequent types. But with the scheme-clusters in hand, ParaMor segmented all the types in each corpus. The second submitted system combines the analyses of ParaMor with the analyses of Morfessor (Creutz, 2006). We downloaded Morfessor Categories-MAP 0.9.2 (Creutz, 2007) and optimized Morfessor’s single pa- rameter separately for English and for German. We optimized Morfessor’s parameter against an F1 score calcu- lated following the methodology of Morpho Challenge 2007. The Morpho Challenge F1 score is found by com- paring Morfessor’s morphological analyses to analyses in human-built answer keys. The official Morpho Chal- lenge 2007 answer keys were not made available to the challenge participants. However, the official keys for English and German were created using the Celex database (Burnage, 1990), and Celex was available to us. Us- ing Celex we created our own morphological answer keys for English and German that, while likely not identical to the official gold standards, are quite similar. Optimizing Morfessor’s parameter renders the analyses we ob- tained from Morfessor no longer fully unsupervised. In the submitted combined system, we pooled Morfessor’s analyses with ParaMor’s in perhaps the most simple fashion possible: for each analyzed word we added Morfes- sor’s analysis as an additional, comma separated, analysis to the list of analyses ParaMor identified. Naively combining the analyses of two systems in this way increases the total number of morphemes in each word’s analyses—likely lowering precision but possibly increasing recall. English German Submitted Systems P R F1 P R F1 ParaMor & Morfessor 41.6 65.1 50.7 51.5 55.6 53.2 ParaMor 48.5 53.0 50.6 59.1 32.8 42.2 Morfessor Trained by Monson et al. 77.2 34.0 47.2 67.2 36.8 47.6 Bernhard-2 61.6 60.0 60.8 49.1 57.4 52.9 Bernhard-1 72.1 52.5 60.7 63.2 37.7 47.2 Pitler 74.7 40.6 52.3 N/A N/A N/A Bordag-5a 59.7 32.1 41.8 60.5 41.6 49.3 Zeman 53.0 42.1 46.9 52.8 28.5 37.0 Table 1: The official Precision, Recall, and F1 scores from Morpho Challenge 2007, to three signifi- cant digits. Only scores for submitted systems most relevant to a discussion of ParaMor are in- cluded. The third set of analyses we submitted to Morpho Challenge 2007 is the set Morfessor produced alone at the same optimized parameter settings used in our combined entry. Table 1 contains the official Morpho Challenge 2007 results for top placing systems in English and German. Measuring by F1, the clear winners on English are the two systems submitted by Bernhard. The ParaMor systems take fourth and fifth place. As expected, combining ParaMor’s and Morfessor’s analyses boosts recall over each individual system, but hurts English precision, negligibly increasing F1 over ParaMor alone. ParaMor’s more balanced precision and recall outperform the baseline Morfessor system with its precision centric analyses. In German, the combined ParaMor-Morfessor system achieved the highest F1 of any submitted system. Bern- hard is a close second just 0.3 absolute lower—a likely statistically insignificant difference. As with English, Morfessor alone scores well on precision; in contrast, ParaMor’s precision is significantly higher for German than in English. Combining two reasonable precision scores keeps the overall precision respectable. Both Pa- raMor and Morfessor alone have relatively low recall. But the combined system significantly improves recall over either system alone. Clearly ParaMor and Morfessor are complementary systems, identifying very different types of morphemes. Indeed, Morfessor is particularly designed to identify agglutinative sequences of morphemes, while ParaMor focuses on identifying productive paradigms of usually inflectional suffixes. To gauge ParaMor’s performance at its likely strength of inflectional morphology, we again used the Celex database to create morphological answer English German P R F1 P R F1 Morfessor 53.3 47.0 49.9 1.3 38.7 44.2 41.2 0.8 ParaMor 33.0 81.4 47.0 0.9 42.8 68.6 52.7 0.8 Table 2: ParaMor segmentations compared to Morfessor’s evaluated for Precision, Recall, F1, and standard deviation of F1, , against an answer key analyzed only for inflectional σ morphology. keys, this time analyzed only for inflectional morphology. Table 2 contains the results of ParaMor and Morfessor against these new inflectional answer keys. ParaMor attains remarkably high recall of inflectional morphological processes for both German and particularly English. Also notable, ParaMor’s precision is considerably lower measured against inflection only, as compared to measuring against both inflectional and derivational morphol- ogy. ParaMor is most likely identifying regular derivational processes in addition to a large fraction of the inflec- tional morphology. We are excited by ParaMor’s strong performance and are eager to extend our algorithm. Recent experiments suggest that the precision of ParaMor’s segmentations can be improved by building partial paradigms from cleaner language data. Perhaps ParaMor and Morfessor’s vastly different strategies for morphology induction can be combined in an even more fruitful fashion. We also intend to extend ParaMor to analyze sequences of affixes by combining separate analyses. Acknowledgements The research reported in this paper was funded in part by NSF grant number IIS-0121631. References Altun, Yasemin, and Mark Johnson. “Inducing SFA with -Transitions Using Minimum Description Length.” є Finite State Methods in Natural Language Processing Workshop at ESSLLI. Helsinki, Finland, 2001. Brent, Michael R., Sreerama K. Murthy, and Andrew Lundberg. “Discovering Morphemic Suffixes: A Case Study in MDL Induction.” The Fifth International Workshop on Artificial Intelligence and Statistics. Fort Lauderdale, Florida, 1995. Burnage, Gavin. Celex—A Guide for Users. Springer, Centre for Lexical information, Nijmegen, the Nether- lands, 1990. Creutz, Mathias. “Morpho project.” May 31, 2007. Creutz, Mathias. “Induction of the Morphology of Natural Language: Unsupervised Morpheme Segmentation with Application to Automatic Speech Recognition.” Ph.D. Thesis in Computer and Information Science, Re- port D13. Helsinki: University of Technology, Espoo, Finland, 2006. Demberg, Vera. “A Language-Independent Unsupervised Model for Morphological Segmentation.” Association for Computational Linguistics. Prague, Czech Republic, 2007. Goldsmith, John. “Unsupervised Learning of the Morphology of a Natural Language.” Computational Linguis- tics 27.2 (2001): 153-198. Goldwater, Sharon, and David McClosky. “Improving Statistical MT through Morphological Analysis.” Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Vancouver, B.C., Canada, 2005. Hafer, Margaret A., and Stephen F. Weiss. “Word Segmentation by Letter Successor Varieties.” Information Storage and Retrieval 10.11/12 (1974): 371-385. Harris, Zellig. “From Phoneme to Morpheme.” Language 31.2 (1955): 190-222. Reprinted in Harris 1970. Harris, Zellig. Papers in Structural and Transformational Linguists. Ed. D. Reidel, Dordrecht 1970. Johnson, Howard, and Joel Martin. “Unsupervised Learning of Morphology for English and Inuktitut.” Human Language Technology Conference / North American Chapter of the Association for Computational Linguis- tics. Edmonton, Canada: 2003. Kurimo, Mikko, Mathias Creutz, and Matti Varjokallio. “Unsupervised Morpheme Analysis – Morpho Chal- lenge 2007.” March 26, 2007. Monson, Christian, Jaime Carbonell, Alon Lavie, and Lori Levin. “ParaMor: Minimally Supervised Induction of Paradigm Structure and Morphological Analysis.” Computing and Historical Phonology: The Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology. Prague, Czech Republic, 2007. Snover, Matthew G. “An Unsupervised Knowledge Free Algorithm for the Learning of Morphology in Natural Languages.” Sever Institute of Technology, Computer Science Saint Louis, Missouri: Washington University, M.S. Thesis, 2002. Stump, Gregory T. Inflectional Morphology: A Theory of Paradigm Structure. Cambridge University Press. 2001.