Simple Morpheme Labelling in Unsupervised Morpheme Analysis Delphine Bernhard∗ Ubiquitous Knowledge Processing Lab Computer Science Department Technische Universität Darmstadt, Germany http://www.ukp.tu-darmstadt.de Abstract This paper presents my participation to the second Morpho Challenge. Results have been obtained with the algorithm already presented at Morpho Challenge 2005. The system takes a plain list of words as input and returns a list of labelled morphemic segments for each word. Morphemic segments are obtained by an unsupervised learning process which can directly be applied to different natural languages. The system first relies on segment predictability within the longest words in the input word list to identify a set of prefixes and suffixes. Stems are then acquired by stripping affixes from the words. In a third step, words sharing a common stem are compared and split in similar and dissimilar parts corresponding to morphemic segments. Finally, the best segmentation is chosen for each word among all possible segmentations. Results obtained at competition 1 (evaluation of the morpheme analyses) are better in English, Finnish and German than in Turkish. For information retrieval (competition 2), the best results are obtained when indexing is performed using Okapi (BM25) weighting for all morphemes minus those belonging to an automatic stop list made of the most common morphemes. Categories and Subject Descriptors H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing[Linguistic processing]; H.3.3 Information Search and Retrieval[Query formulation]; General Terms Measurement, Performance, Experimentation Keywords Morphology, Unsupervised Learning, Morpheme Analysis, Information Retrieval 1 Introduction Unsupervised morpheme analysis, which is the intended goal of the second Morpho Challenge, consists in automatically discovering a word’s morphemes using only minimal resources made up of a list of words and a text corpus in each language. This is a more challenging task than segmenting words into sub-lexical items, which was the objective of Morpho Challenge 2005. ∗ Former affiliation (until August 31, 2007): TIMC-IMAG, Grenoble, France For Morpho Challenge 2007, morphemic segments have to be identified but also labelled and hence disambiguated. On the one hand, morphemes are abstract units which may be realised by several surface forms, i.e. allomorphs. On the other hand, a single surface form may correspond to several homographic morphemes. In order to perform morpheme analysis and correctly label morphemic segments, it is therefore necessary to achieve a mapping between morpheme labels and their surface realisations. As I will show later, the system presented in this paper does not solve cases of allomorphy, since different surface forms will always be considered as different morphemes. It only partly aims at resolving cases of homography since identified morphemic segments are labelled with one of the following categories: stem/base, prefix, suffix and linking element. The system nevertheless achieved decent results in competition 1 for English, Finnish and German, but results in Turkish were less satisfactory. The rest of this paper is organised as follows. The algorithm is described in Section 2. Results obtained for Competitions 1 and 2 are presented in Section 3. Then, particular assumptions made in the system and pertaining to morpheme labelling are discussed in Section 4. Finally perspectives for the evolution of the system are given in Section 5. 2 Overview of the Method The algorithm is mostly identical to the one used for Morpho Challenge 2005, apart from a few very minor changes. A previous and more detailed description of the system can be found in [1]. The method takes as input a plain list of words without additional information. The text corpora provided for Morpho Challenge 2007 have therefore not been used. The output is a labelled segmentation of the input words. Labels belong to one of the following categories: stem, prefix, suffix and linking element. The algorithm can be sub-divided into 4 main steps, plus an additional step which may be performed to analyse another data set, given the list of segments learned after step 4. 2.1 Step 1: Extraction of prefixes and suffixes The objective of step 1 is to acquire a list of prefixes and suffixes. The longest words in the input word list are segmented based on the notion of segment predictability. This idea is recurrent in research on the segmentation of words into morphemes (see for instance [5, 8, 9]). It posits that a morpheme boundary can be hypothesised if it is difficult to predict the character (or string of characters) which follows, knowing an initial string of characters. In the system, segment predictability is modelled by computing the average maximum transition probabilities between all the substrings of a word coalescing at a given position k within the word. The variations of this measure make it possible to identify morpheme boundaries at positions where the average maximum transition probabilities reach a well-marked minimum. Figure 1 depicts the variations of the average maximum transition probabilities for the English word “hyperventilating”. Two morpheme boundaries are identified in this word, which corresponds to the following segmentation: “hyper + ventilat + ing”. Once a word has been segmented in this fashion, the longest and less frequent amongst the proposed segments is identified as a stem, if this segment also appears at least twice in the wordlist and at least once at the beginning of a word. In the example of Figure 1, the segment ‘ventilat’ will be identified as a valid stem. The identified stem is then used to acquire affixes. All the substrings preceding this stem in the input wordlist are added to the list of prefixes unless they are longer and less frequent than the stem. Correspondingly, all the substrings following this stem in the wordlist are added to the list of suffixes unless they are longer and less frequent than the stem. Moreover, one character-long prefixes are eliminated because these often lead to erroneous segmentations in later stages of the algorithm. h y p e r v e n t i l a t i n g . 0 1 0 . 8 0 . 6 0 . 4 0 . 2 0 . 0 H Y P E R V E N T I L A T I N G L e t t r e s Figure 1: Variations of the average maximum transition probabilities between substrings of the word “hyperventilating”. Identified segment boundaries are marked with a bold vertical line. This procedure ends when for N running words the number of new affixes among the affixes learned is inferior to the number of affixes which already belong to the list of prefixes and suffixes. 2.2 Step 2: Acquisition of stems The aim of the second step is to acquire a list of stems, using the prefixes and suffixes which have been previously identified. Stems are obtained by stripping off from each word in the input word list all the possible combinations of prefixes, suffixes and the empty string. In order to reduce the noise induced by such a simple method, some constraints are applied, especially a minimum length threshold of 3 characters for the stem. Note that the stems acquired by this method are not all minimal and may still contain affixes. 2.3 Step 3: Segmentation of words In a third step, all the words are segmented. Word segmentation is performed by comparing words which contain the same stem to one another. This consists in finding limits between shared and different segments and results in a segmentation of the words being compared. The outcome can be represented as an alignment graph for each stem. Figure 2 depicts the alignment graph obtained for the words containing the English stem “integrat”. ing d dis e integrat # re ion # - ist well s or fully Figure 2: Example segmentation of words sharing the stem “integrat”. Segments are subsequently labelled with one of the three affix types (prefix, suffix, linking element) according to their positions within the word, relatively to the stem. As a result of the alignment, prefixes and suffixes which do not belong to the list of affixes acquired after step 1 may be discovered. A validation procedure, similar to the one proposed by [5], is therefore applied. It consists in checking that the proportion of new affixes in the alignment graph does not exceed some threshold1 . All the segmentations made up of valid morphemic segments are stored. 2.4 Step 4: Selection of the best segmentation As a result of Step 3, several different segmentation may have been discovered for each word, since a word may contain more than one potential stem. In order the select the best possible segments, a best-first search strategy is applied on the potential segments of a word, privileging the most frequent segment when given a choice. Some frequency and morphotactic constraints are also checked (e.g. a prefix cannot be directly followed by a suffix, there has to be at least one stem in the proposed segmentation, etc.). 2.5 Step 5: Application of the learned segments to a new data set (optional) The morphemic segments identified after Step 4 can be used to segment any list of words or the whole list of words when segments are learned using only a subset of the list. An A*-like algorithm is used to find the best segmentation for each word, i.e. the segmentation with the lowest global cost. The global cost for a segmentation is the sum of the costs associated with each segment si . Two different segment cost functions have been used resulting in two different submissions for each language: f (si ) cost1 (si ) = −log P i f (si ) f (si ) cost2 (si ) = −log maxi [f (si )] Some morphotactic constraints, similar to those used at step 4, are also applied. 3 Morpho Challenge 2007 Experiments and Results The method has been applied to all of the four test languages of Morpho Challenge 2007. Mor- phemic segments have been learned using only a subset of the word lists provided for competition 1 (the 300,000 most frequent words), without taking into account contextual information found in the text corpora. These morphemic segments have then been used to segment all the words in the data sets provided both for competition 1 and 2, using either cost1 or cost2 . Moreover, no fine tuning of the three different parameters of the system (N , a and b) has been attempted. Earlier experiments have shown that default values N=5, a=0.8 and b=0.1 are globally reasonable and these have consequently been used for all four languages. 3.1 Results for Competition 1: Morpheme Analysis In competition 1, the system’s analyses have been compared to a linguistic gold standard in Finnish, Turkish, German and English. Table 1 details the precision, recall and F-measure obtained by the system. Method 1 corre- sponds to results obtained using cost1 and method 2 to results obtained using cost2 . As it had already pointed out at Morpho Challenge 2005, results for method 1, using cost1 , indicate higher precision but lower recall on all datasets. On the whole, better F-measures are obtained with method 2 for all languages. 1 There are actually two different thresholds, a and b. For details about these thresholds, see [1]. Language Method 1 Method 2 Precision Recall F-measure Precision Recall F-measure English 72.05 52.47 60.72 61.63 60.01 60.81 Finnish 75.99 25.01 37.63 59.65 40.44 48.20 German 63.20 37.69 47.22 49.08 57.35 52.89 Turkish 78.22 10.93 19.18 73.69 14.80 24.65 Table 1: Precision %, recall % and F-measure % obtained for competition 1. Results in Turkish are well under those obtained in English, German and Finnish and are characterised by very poor recall. Most of the other system’s recall is below 20% as well for this particular language. One possible explanation for this, apart from the inherent morphological complexity of the Turkish language, is that there are more different analyses per word in Turkish (at least in the provided gold standard sample), and therefore more ambiguities have to be solved in the proposed morpheme analyses than in the other languages. 3.2 Results for Competition 2: Information Retrieval In competition 2, the morphological analyses were used to perform information retrieval experi- ments in which words in queries and documents were replaced by their morphological components. Two different weighting methods have been used by the organisers: tf-idf and Okapi (BM25) with a stop list containing the most common morphemes. Moreover, experiments have been performed using two different lists of words: the list of words proposed for competition 1 (without new ) and the list of additional words proposed for competition 2 (with new ). Table 2 lists the results obtained with the tf-idf weighting and Table 3 lists the results obtained with the Okapi weighting. Method - word list English Finnish German method 1 - without new 0.2781 0.4016 0.3777 method 1 - with new 0.2777 0.3896 0.3720 method 2 - without new 0.2673 0.3984 0.3731 method 2 - with new 0.2682 0.3811 0.3703 Table 2: Average precision (AP%) obtained for competition 2 with tf-idf weighting for all mor- phemes. Method - word list English Finnish German method 1 - without new 0.3881 0.4183 0.4611 method 1 - with new 0.3900 0.4681 0.4729 method 2 - without new 0.3922 0.4425 0.4676 method 2 - with new 0.3943 0.4915 0.4625 Table 3: Average precision (AP%) obtained for competition 2 with Okapi weighting and removal of the most frequent morphemic segments. The results of the algorithm strongly depend on the weighting scheme used and are a lot better with the Okapi BM25 weighting and a stop list, whatever the method and the wordlist used. Moreover, while method 1 performs slightly better than method 2 with the tf-idf weighting, this tendency is reversed with the Okapi weighting (except for German). It is not clear how this could be accounted for but a possible explanation is that method 2, which is less precise, benefits more than method 1 from the removal of the most frequent morphemes. 4 Analysis and Discussion As mentioned in the introduction, the main novelty of Morpho Challenge 2007 is that the objective is to obtain a morpheme analysis of the word forms, which is a lot more demanding than just segmenting words. A morpheme analysis for a word corresponds to a list of labelled morphemes. A minimal morpheme analysis may consist of a list of unlabelled morphemic segments identified after morphological segmentation. The algorithm presented in this paper corresponds to an intermediary and simple solution since it labels the segments with general morpheme categories, which are detailed below. 4.1 Morpheme Categories The morphemic segments discovered by the system are labelled by one of the following categories: stem or base (B), prefix (P), suffix (S) and linking element (L). Tables 4(a) to 4(d) give some examples of the labelled morpheme analyses produced by the system compared with the gold standard analysis. Table 4: Example morpheme analyses (a) English Word Method 1 Method 2 Gold standard chilly chill B y S chill B y S chill A y s grow grow B grow B grow V planners’ planner B s S ’ S plann B er S s’ S plan N er s +PL +GEN sulking sulk B ing S sulk B ing S sulk V +PCP1 (b) Finnish Word Method 1 Method 2 Gold standard ikuisuus ikuis B uus B ikuis B uu L s S ikuinen A +DA-UUS linuxille linux B ille S linux B i S lle S linux N +ALL resoluutio resoluutio B resoluutio B resoluutio N vahingoittamatta vahingoi B tta S matta S vahingo B i S t S vahingoittaa V ta S mat B t S a S +DV-MA +ABE (c) German Word Method 1 Method 2 Gold standard bezwingen be P zwing B en S be P zwing B e S n S be zwing V +13PL kastrationen kastrat B ion S en S kastrat B ion S en S kastr R ation +PL risikoloser risiko B los S er S risiko B los S er S risiko N los +ADJ-er veranstaltungen veranstaltung B en S veranstalt B ung S en S ver anstalt N ung +PL (d) Turkish Word Method 1 Method 2 Gold standard avucuna a P v P ucu B na S a P v P ucu B na S avuc +POS2S +DAT ettik ettik B ettik B et +ADJ dik kolaCan kol B aCan B kol B aCan B kolaCan tanImlayabilirsiniz tanImlayabil B irsiniz S tanImlayabil B ir S tan +im DER +la DER siniz S yabil +TNS ir +PER2P The basic base, prefix and suffix categories are taken into account by several systems which perform unsupervised morphological analysis such as the Morfessor Categories systems [3, 4]. The linking element category is intended to encompass short segments (usually just one letter long) which link two words or word-forming elements in compounds. Example of linking elements are: • Hyphens as in “far-flung” • Neo-classical linking elements such as “-o-” in “psychology” or “-i-” in “insecticide”. • German Fugenelemente such as “-s-” in “Regierungskrise” Linking elements differ from the other categories of morphemes because they bring no semantic contribution to the overall meaning of the word. Simple morpheme labelling, as it is done by the system, is a first step towards full-fledged morphological tags in the sense that morphemic segment categories such as stem, suffix or prefix are more general than the labels given by linguistic analyses. For instance, the morphemic categories induced by the algorithm subsume the following mor- phological labels in English (see Table 4(a)): • stem category B: word categories as adjective (A), verb (V) or noun (N) • prefix category P: derivational prefixes as “un+” • suffix category S: inflectional suffixes as plural (+PL) and derivational suffixes as “+er” Note that no attempt is made by the system to distinguish between inflectional and derivational affixes. 4.2 Accuracy of Morpheme Labelling As stated in the introduction, a further objective of morpheme labelling is to disambiguate cases of allomorphy and homography. The recognition of allomorphy is beyond reach of the system in its current state. For instance, no distinction is made between the allomorphs of the English prefix in+ (im-, in-, ir-,) or of the suffix +able (-able, -ible) which will always be recognised as different morphemes, as shown by the following examples: • im P mov B able S • in P compar B able S • ir P resist B ible S Homography should be partially dealt with by the system when homographs belong to different morphemic categories, which excludes within-category homography as squash N and squash V where “squash” will only be identified as a stem. In order to verify this assumption, let us consider the following examples in German and English: • “los” in German: The morphemic segment “los” is highly ambiguous: it can be a derivational prefix (meaning ‘away’, ‘free’), a derivational suffix (meaning ‘less’) or even a stem. The segment “los” is correctly labelled as a suffix by the system in words like “riskolos” (risiko B los S) or “zwecklos” (zweck B los S). When the same segment occurs at the beginning of a word, it is always labelled as a stem, as in “losfahren” (los B fahren B, method 1) or “loslassen” (los B lass B e S n S, method 2). • “ship” in English: The morphemic segment “ship” is either a stem (meaning ‘vessel’) or a suffix which refers to a state. The segment “ship” is correctly labelled as a suffix by method 1 in words like “censorship” (censor B ship S) or “citizenship” (citizen B ship S), but is wrongly fragmented into s S hip S by method 2. The stem “ship” can be correctly identified either by method 1 or 2 when it is found at the beginning of a word, but not when it is found at the end of a word; for instance, “shipwreck” is analysed as ship B wreck B by methods 1 and 2, while “cargo-ship” is analysed as cargo B - L ship S by method 1 and cargo B - L s S hip S by method 2. The previous examples reveal both satisfactory and less satisfactory aspects of the current system. Over-segmentation by method 2, leading to a loss in precision, has already been identified at the previous Challenge. Moreover, morpheme labelling does not solve all detectable cases of homography between stems and affixes. Morphotactic constraints help in that respect, since they prevent a suffix from occurring at the beginning of a word, and thus the suffixes “los” or “ship” will not be identified at word initial positions. However, the final analysis privileges the most frequent segments, when several morpheme categories are morphotactically plausible. This tends to be favourable to affixes since they are usually more frequent than stems. 5 Future Work and Open Issues As shown in the previous section, morphotactic constraints, as they are currently implemented in the system, are not always sufficient and flexible enough to disambiguate between several homo- graphic segments. For the time being, these constraints are implemented as a simple deterministic automaton, which is the same for all languages. In the future versions of the system, it would be desirable to bootstrap these constraints from the data themselves, as suggested by [6]. Another direction for future research concerns the integration of corpus-derived information. Several algorithms have demonstrated the usefulness of contextual information for unsupervised morphological analysis, to complement orthographic information with semantic and syntactic con- straints. Corpus-derived knowledge can be used either at the beginning of the process [2, 7], or at the end [10]. In the first case, only words which are contextually similar are compared to dis- cover morphemes. In the second case, spurious morphological analyses are filtered out by taking semantic similarity into account. It seems that corpus-derived information could be incorporated in the first step of the current algorithm, in order to increase the precision of affix acquisition since most of the subsequent processes rely on the affixes acquired at step 1. Also, it is obviously worth investigating the use of text corpora to achieve finer-grained morpheme labelling and refine the very general categories used so far (stem, prefix and suffix). 6 Conclusion Even though requirements for Morpho Challenge 2007 were more demanding than for the previous challenge, the algorithm described in this paper still performed decently, especially in English, Finnish and German. As pointed out in the previous sections, the algorithm could however still be improved on many levels, which, hopefully, should be done by the next challenge. References [1] Delphine Bernhard. Unsupervised Morphological Segmentation Based on Segment Pre- dictability and Word Segments Alignment. In Mikko Kurimo, Mathias Creutz, and Krista Lagus, editors, Proceedings of the Pascal Challenges Workshop on the Unsupervised Segmen- tation of Words into Morphemes, pages 19–23, Venice, Italy, April 2006. http://www.cis. hut.fi/morphochallenge2005/P03 Bernhard.pdf. [2] Stefan Bordag. Two-step Approach to Unsupervised Morpheme Segmentation. In Proceed- ings of the Pascal Challenges Workshop on the Unsupervised Segmentation of Words into Morphemes, pages 25–29, Venice, Italy, April 2006. [3] Mathias Creutz and Krista Lagus. Induction of a Simple Morphology for Highly-Inflecting Languages. In Proceedings of the 7th Meeting of the ACL Special Interest Group in Compu- tational Phonology (SIGPHON), pages 43–51, Barcelona, 2004. [4] Mathias Creutz and Krista Lagus. Inducing the Morphological Lexicon of a Natural Language from Unannotated Text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05), pages 106–113, Espoo, Finland, June 2005. [5] Hervé Déjean. Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora. In D. Powers, editor, Proceedings of the CoNLL98 Workshop on Paradigms and Grounding in Language Learning, pages 295–298, 1998. [6] Vera Demberg. A language-independent unsupervised model for morphological segmentation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 920–927, Prague, Czech Republic, June 2007. Association for Computational Linguis- tics. [7] Dayne Freitag. Morphology Induction from Term Clusters. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 128–135, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. [8] Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371–385, 1974. [9] Zellig Harris. From phoneme to morpheme. Language, 31(2):190–222, 1955. [10] Patrick Schone and Daniel Jurafsky. Knowledge-Free Induction of Morphology Using Latent Semantic Analysis. In Proceedings of the Fourth Conference on Computational Natural Lan- guage Learning and of the Second Learning Language in Logic Workshop, Lisbon, Portugal, September 2000.