Annotated Suffix Tree Method for German Compound Splitting

Anna Shishkova and Ekaterina Chernyak

National Research University – Higher School of Economics, Moscow, Russia
asshishkova_1@edu.hse.ru, echernyak@hse.ru

Abstract. The paper presents an unsupervised and knowledge-free approach to compound splitting. Although the research focuses on German compounds, the method is expected to extend to other compounding languages. The approach is based on the annotated suffix tree (AST) method proposed and modified by Mirkin et al. To the best of our knowledge, annotated suffix trees have not yet been used for compound splitting. The main idea of the approach is to match all the substrings of a word (suffixes and prefixes separately) against an AST, determining the longest and sufficiently frequent substring to perform a candidate split. A simplification considers only the suffixes (or prefixes) and splits a word at the beginning of the selected suffix (the longest and sufficiently frequent one). The results are evaluated by precision and recall.

Keywords: compound splitting, annotated suffix tree, German language

1 Introduction

The Germanic languages, in particular German, make heavy use of compounds, i.e. words consisting of several parts. Such word combinations result in a larger vocabulary. Moreover, the word base may expand further even without the formation of completely new words, simply through compounding of existing ones. This especially complicates machine translation and search engine development. For example, without compound splitting it is necessary to store all the words Sommer (summer), Winter (winter), Wetter (weather), Märchen (fairy tale), Sommerwetter (summer weather), Winterwetter (winter weather), Wintermärchen (winter's tale) to translate them correctly. In contrast, a system equipped with a compound splitting tool needs only four words (Sommer, Winter, Wetter, Märchen) to represent the entire group.
What is more, such a system will easily translate the previously unseen word Sommermärchen. In terms of search engines, compounds reduce search result diversity. For instance, to find results containing Regenmantel (raincoat) for the query Regenschirm (umbrella), it is useful to split the compounds as well.

Section 2 reviews different approaches to this problem. Due to space limitations, the review is brief and incomplete, yet sufficient to provide a general outline of the problem. Section 3 introduces the proposed algorithm for compound splitting. Section 4 is devoted to the experimental results. Section 5 concludes.

2 Related Work

This paper combines two aspects: German compound splitting and the annotated suffix tree method. To correctly split a compound, it is necessary to study both areas.

2.1 Compound Splitting

In [3], F. Holz and C. Biemann point out that approaches to compound splitting can be divided into knowledge-intensive and knowledge-free ones. According to [3], knowledge-intensive splitters are based on handcrafted rules or use supervised learning on training sets of split words, while knowledge-free approaches are not language-specific and create a compound splitter by analyzing a raw text corpus.

The work [5] focuses on German, Dutch and Italian compounds. As a knowledge-free approach, it can be applied to any language with compound words. The authors present an algorithm that considers growing sequences of a word's initial letters (referred to as prefixes) one by one and matches them against a list of possible splitting candidates (all non-compound words) established beforehand. If a match is found, the current prefix is taken as the first part of the split. Parsing continues until either no characters remain or splitting fails because no further prefix matches the list. The authors evaluate their method on a set of around 700 complex nouns, achieving 60% precision and 50% recall.
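A lexicon-based splitter of this kind can be sketched as follows. This is a simplified illustration, not the implementation from [5]: the sample lexicon, the `min_len` threshold and the longest-prefix-first search order are our own assumptions.

```python
# Simplified sketch of a knowledge-free lexicon splitter: prefixes of the
# remaining string are matched against a precomputed list of non-compound
# words; a matching prefix becomes the next component.
# The lexicon, min_len and the longest-prefix-first order are illustrative
# assumptions, not details taken from [5].

def greedy_split(word, lexicon, min_len=3):
    """Return a list of components, or None if no full split exists."""
    if word in lexicon:
        return [word]
    # Try prefixes from longest to shortest so that full words win over
    # shorter accidental matches; recurse on the remainder of the string.
    for i in range(len(word) - min_len, min_len - 1, -1):
        prefix, rest = word[:i], word[i:]
        if prefix in lexicon:
            tail = greedy_split(rest, lexicon, min_len)
            if tail is not None:
                return [prefix] + tail
    return None

lexicon = {"sommer", "winter", "wetter", "regen", "schirm", "mantel"}
print(greedy_split("sommerwetter", lexicon))  # ['sommer', 'wetter']
print(greedy_split("regenmantel", lexicon))   # ['regen', 'mantel']
```

The recursion makes the failure mode concrete: the split succeeds only if every remaining component is in the lexicon, which is exactly where words with auxiliary parts cause trouble.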
However, as mentioned in [3], this approach fails if the compound includes words consisting of an independent root and an auxiliary part. F. Holz and C. Biemann give the example word Prüfungsvorbereitungsstress (stress that occurs when one prepares for an exam). The correct split should contain the word Prüfung, but there is an independent German root Prüf, while ung is a meaningless auxiliary part. Once Prüf is marked as the first part of the split, the analysis fails.

Building upon previous research, F. Holz and C. Biemann [3] develop a more sophisticated knowledge-free algorithm. During preprocessing, a word list is collected from a large raw corpus and annotated with frequencies. Then, all possible splits are generated and the program checks whether the precomputed word list contains the proposed parts. The split is chosen based on the lengths of the components and other measures, such as the largest geometric mean of the partial frequencies.

Next, the authors improve the results by introducing periphrase detection. They illustrate the idea with an example: Schweine-schnitzel is a pork cutlet, while Kinder-schnitzel means a cutlet for children. Despite the identical structure, the meanings are entirely different. To resolve such ambiguities, the authors look for periphrases in the source text. Given that the periphrases contain different prepositions (Schnitzel vom Schwein and Schnitzel für Kinder), the authors include the prepositions in the splits and thus facilitate understanding. This approach, however, goes beyond the scope of our paper.

One more alternative approach is described in [2]. The authors also consider German words, but in contrast to the previous paper the results do not transfer directly to other languages. Instead, part-of-speech tagged (POS-tagged) corpora are used for supervised learning.
Moreover, the aims of the task differ from those of the previous paper: the research addresses not the discovery of compounds, but the inflectional phenomena of the German language. To perform the task, prefixes and suffixes are also treated similarly to [2]. Afterwards, the results are matched against POS-tagged corpora.

2.2 Annotated Suffix Tree

Moving on to annotated suffix trees, the paper [4] defines an annotated suffix tree (AST) as a rooted tree where each non-root node corresponds to a character and the path from the root to a leaf encodes a suffix. In addition to the character label, each node is associated with a number that denotes the frequency of the corresponding fragment. The authors specify an important property of the frequencies: a node's frequency always equals the sum of its child nodes' frequencies. The parent node corresponds to a common prefix of several suffixes, and its frequency is the sum of the frequencies of these suffixes. Consequently, the frequency of a parent node equals the sum of the frequencies of the leaves it covers. Furthermore, the authors give an exact algorithm for constructing annotated suffix trees based on the sequential enumeration of all the suffixes of the strings in the analyzed collection.

The paper [1] deals with the use of the AST to calculate a relevance measure between a given string and a text. The authors compare the results of applying three popular measures: cosine similarity, probabilistic BM25 and a newly proposed measure. The proposed measure is based on the conditional probabilities of symbols in AST-induced text fragments. As confirmed in [1], the results obtained with the AST-based measure are superior to those of the other, more popular measures. The authors also emphasize that the AST is applicable to other tasks, such as categorization.

3 Our Method

To avoid terminological ambiguity, we use the term “annotated prefix tree” (APT) to denote an annotated suffix tree constructed from the inverted words.
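The AST construction by sequential enumeration of suffixes, and the APT as the same structure built over reversed words, can be sketched as follows. The dictionary-based node representation and the function names are our own illustrative choices, not the implementation of [4].

```python
# Minimal sketch of an annotated suffix tree: every suffix of every word
# is inserted character by character, and each node is annotated with the
# number of times its fragment occurred. The node layout and names are
# illustrative assumptions; an APT is the same tree built from reversed
# words.

class Node:
    def __init__(self):
        self.freq = 0        # annotation: occurrence count of the fragment
        self.children = {}   # character -> child Node

def build_ast(words):
    root = Node()
    for word in words:
        for start in range(len(word)):      # enumerate all suffixes
            node = root
            for ch in word[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root

def fragment_freq(root, fragment):
    """Frequency of a fragment = annotation of its last character's node."""
    node = root
    for ch in fragment:
        if ch not in node.children:
            return 0
        node = node.children[ch]
    return node.freq

ast = build_ast(["freizeit"])
apt = build_ast([w[::-1] for w in ["freizeit"]])
print(fragment_freq(ast, "zeit"))        # frequency of the suffix "zeit"
print(fragment_freq(apt, "frei"[::-1]))  # frequency of the prefix "frei"
```

A prefix lookup simply reverses the query string, mirroring the inverted-word construction of the APT.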
For instance, the prefixes of Zeit (time) are represented in an annotated tree as the suffixes of the dummy word tieZ, namely Z, eZ, ieZ, tieZ. Examples of both an AST and an APT created from the word Freizeit (free time) are shown in Fig. 1 (left and right, respectively).

Fig. 1. AST (left) and APT (right) created from the word Freizeit (free time)

We now describe our algorithm in detail. At the current stage, it aims to split the simplest compounds formed by a direct concatenation of two words. Thus, we consider compounds such as Kraftwerk (power station), which is formed from Kraft (power) and Werk (factory), but avoid more complex ones: Kleidungsstück (item of clothing), which includes not only Kleidung (clothing) and Stück (item) but also a connecting letter s, or Kirchturm (church steeple), which omits the letter e from the word Kirche (church). Further development of the algorithm should address the difficulties mentioned above, as well as the splitting of words formed from more than two parts.

The algorithm is as follows:

Step 1. Create an AST and an APT from a large raw list of (compound) words.

Step 2. For each suffix (prefix) of each word to be split, determine its frequency from the AST (APT). The frequency of a suffix (prefix) is defined by the frequency of the node corresponding to its last element in the AST (APT). Then pick the longest suffix (prefix) with frequency above a certain cutoff value. For our evaluation set, the optimal cutoff frequencies are calculated as the geometric mean of the suffix and prefix frequencies divided by the product of the suffix and prefix lengths.

The next step can be done in two ways.

Step 3a. The compound is split using either the AST or the APT alone. The suffix obtained at the previous step is proposed as the second part of the compound, and the remainder becomes the first. The case of prefix splitting (using the APT) is handled similarly.

Step 3b.
The AST and APT results are handled jointly, producing a combined split. Since prefix splitting shows better results in Step 3a, it is applied in most cases. However, if the frequency of the prefix obtained from prefix splitting is lower than that of every part of the suffix split, the latter is selected.

Step 3b is justified by the experiment presented in the following section. However, even without dedicated experiments, the superior results of prefix splitting can be explained from a linguistic point of view. German has a fixed set of common word endings. Consider an example: if two words that share an ending (e.g. Sommer and Winter) are the first parts of compounds with the same second part (Sommerwetter and Winterwetter), the frequency of the erroneous suffix (erwetter) increases. That may lead to a wrong suffix split, but it does not affect the prefix frequencies.

In the next section, we present the results produced by an implementation of the three options (prefix, suffix and combined splitters) and evaluate them by precision and recall.

4 Experiments

To build the AST and APT, we use a word list from Wiktionary, Category: German compound words. It consists of 7144 words, of which we use the 6740 with a well-described etymology. An example of a well-described etymology is, for Apfelblüte (apple blossom), Apfel + Blüte. The availability of etymology has no effect on splitting quality, but it allows checking the correctness of the splits. Since the algorithm only handles two-part compounds with no connecting letters, the experiment is conducted on an appropriate 4619-word subset of the original list.

A split may consist of a single part, which makes it unsuccessful. However, in the presence of non-compounds this might be the correct answer; such cases should be taken into account when calculating precision. Precision is measured as the ratio of the number of correctly determined splits to the number of all words for which splits were found.
The numerator of recall is the same, while the denominator is the number of all compounds in the evaluation set. Splitting examples for two different words are shown in Table 1.

Table 1. Splitting of Bierglas (glass of beer) and Apfelbaum (apple tree)

word: Bierglas
suffixes (freq.): s (5186), as (455), las (71), glas (16), rglas (4), erglas (3), ierglas (1), Bierglas (1)
suffix split: Bie+rglas (false)
prefixes (freq., reversed): B (423), iB (42), eiB (10), reiB (7), greiB (2), lgreiB (1), algreiB (1), salgreiB (1)
prefix split: Bier+glas (true)
combined split: Bier+glas (true), because rglas (4) < Bier (7) < Bie (10)

word: Apfelbaum
suffixes (freq.): m (2311), um (622), aum (66), baum (21), lbaum (3), elbaum (3), felbaum (1), pfelbaum (1), Apfelbaum (1)
suffix split: Apfelb+aum (false)
prefixes (freq., reversed): A (352), pA (13), fpA (9), efpA (9), lefpA (9), blefpA (3), ablefpA (1), uablefpA (1), muablefpA (1)
prefix split: Apfel+baum (true)
combined split: Apfel+baum (true), because Apfelb (3) < Apfel (9) < baum (21)

Using the AST provides precision and recall of about 48%, while using the APT improves these to 57%. The combined method further improves precision and recall, up to 63%. It is also worth pointing out that in 78% of the cases at least one of the splits (obtained either by the AST or the APT) is correct. This means that with a manual post-processing component, the algorithm could easily reach this value.

5 Conclusion

In this paper, we have proposed an algorithm for compound splitting based on the annotated suffix tree method. Precision and recall of more than 60% provide a good starting point. With manual post-processing, precision can reach 78%, which suggests a promising direction for further work if this heuristic is automated. At the current stage, the algorithm only splits simple compounds, which account for two-thirds of the total Wiktionary list. A minor modification of the algorithm should allow it to handle connecting letters or an independent third (middle) stem. However, such modifications have not been implemented yet.
Additionally, the extensions may cover more advanced cases, such as the removal of certain final letters from the parts of compounds. Finally, it is also necessary to carry out experiments on other word lists containing non-compounds, both at the tree-construction and the splitting stage.

References

1. Chernyak, E. L.: An approach to the problem of annotation of research publications. In: Proceedings of the Eighth International Conference on Web Search and Data Mining, pp. 429–434. ACM, New York (2014)
2. Finkler, W., Neumann, G.: Morphix: A fast realization of a classification-based approach to morphology. In: 4. Österreichische Artificial-Intelligence-Tagung, Wiener Workshop – Wissensbasierte Sprachverarbeitung (1998)
3. Holz, F., Biemann, C.: Unsupervised and knowledge-free learning of compound splits and periphrases. In: CICLing 2008: Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing, pp. 117–127. Springer, Heidelberg (2008)
4. Mirkin, B. G., Chernyak, E. L.: An AST method for scoring string-to-text similarity in semantic text analysis. In: Aleskerov, F. T., Goldengorin, B. I., Pardalos, P. M. (eds.) Clusters, Orders, Trees: Methods and Applications. In Honor of Boris Mirkin's 70th Birthday, vol. 92. Springer, Berlin (2014)
5. Monz, C., de Rijke, M.: Shallow morphological analysis in monolingual information retrieval for Dutch, German, and Italian. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2001. LNCS, vol. 2406, pp. 262–277. Springer, Heidelberg (2002)