Evaluating Neural Sequence Models for Splitting (Swiss) German Compounds

Don Tuggener
Zurich University of Applied Sciences
don.tuggener@zhaw.ch

Abstract

This paper evaluates unsupervised and supervised neural sequence models for the task of splitting (Swiss) German compound words. The models are compared to a state-of-the-art approach based on character ngrams and a simple heuristic that accesses a dictionary. We find that the neural models do not outperform the baselines on the German data, but excel when applied to out-of-domain data, i.e. splitting Swiss German compounds. We release our code and data,¹ namely the first annotated data set of Swiss German compounds.

1 Introduction

Splitting compound words is an important task when setting up pipelines for Natural Language Processing of the (Swiss²) German language. (Swiss) German features a long tail of word frequencies because compound words are not orthographically separated by whitespace as they are in e.g. English (e.g. Autobahnraststätte vs. highway service area). Thus, when mapping words to lexical resources such as word nets or embeddings, it is likely that some compounds are not represented in the resource.

Neural sequence models capture characteristics of word or character sequences (ngrams) in a latent representation and are thus hypothetically well suited for compound splitting. In this paper, we evaluate several neural sequence models on the task of (Swiss) German compound splitting. The models are compared to an unsupervised character ngram-based approach and a baseline that uses a dictionary. Furthermore, we present the first gold standard for splitting Swiss German compounds and evaluate the models on this newly available resource. Swiss German dialects have no official spelling; they are closely related to Standard German, but differ with respect to some historical language changes. We therefore use the Swiss German compounds as a kind of out-of-domain test set for the models.

1.1 Related Work

Several methods for automatically splitting word compounds exist. One approach is to use a dictionary to perform a full morphological analysis (Schmid et al., 2004). Others apply corpus statistics (Koehn and Knight, 2003) and combine them with linguistic heuristics (Weller-Di Marco, 2017). There are both supervised (Alfonseca et al., 2008; Henrich and Hinrichs, 2011) and unsupervised approaches (Macherey et al., 2011; Tuggener, 2016; Ziering and van der Plas, 2016) to the task. Riedl and Biemann (2016) and Ziering et al. (2016) explore distributional semantics on the premise that the constituents of a compound are semantically similar to the compound. Other work investigates the semantic relation between the constituents of compounds (Schulte im Walde et al., 2016). While most approaches focus on compounds in a single language, there is work that explores the task in a multilingual context (Alfonseca et al., 2008; Macherey et al., 2011).

While there exists work on morphological segmentation (Kann et al., 2016), to the best of our knowledge, there is no prior work on using neural sequence models to identify the head constituents of compounds.

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

¹ https://github.engineering.zhaw.ch/tuge/neural_compound_splitter
² Swiss German subsumes the Alemannic dialects spoken in Switzerland.
2 Neural sequence models for compound splitting

We first introduce the neural sequence models that we apply in the experiments and describe how we adapt them to the task of (Swiss) German compound splitting.

2.1 Unsupervised Recurrent Neural Network

The first model we explore is a character-based language model using a recurrent neural network (RNN). Language models aim to predict the next token in a sequence given the sequence history. Unlike ngram-based language models (e.g. Heafield et al., 2013), RNN-based models feature a hidden state that is updated after consuming a token (a character or a word). Based on the hidden state, a probability distribution over the token vocabulary is calculated using the softmax function, and the most likely token is selected when generating sequences. During training, the difference between the predicted probability and the given next token constitutes the loss that is backpropagated to update the model parameters (i.e. weights). Such RNN-based language models have been shown to outperform ngram-based models in terms of perplexity on general language modelling tasks (Mikolov et al., 2010, inter alia).

To employ RNNs for unsupervised compound splitting, we exploit an implementation detail that is commonly used when working with such approaches: Before training an RNN on a given token sequence, a special END token is appended to the sequence. This token is inserted into the vocabulary of the model. When using the trained RNN to generate a sequence, the END token serves as a stopping criterion, i.e. if the END token is the most likely next token, generation is terminated and the sequence is considered complete.

We adapt this idea and monitor the probability of emitting the END token while consuming a compound word character-wise. That is, in a character sequence x, we consider the position x_i with the highest probability of emitting the END token as the split position:

    argmax_i p(END | x_0 ... x_i)                                        (1)

Additionally, we are interested in positions in the sequence where the probability of generating the next given character is low, based on the hypothesis that such positions are indicators for a suitable split position:

    argmin_i p(x_{i+1} | x_0 ... x_i)                                    (2)

To combine the two features and determine the best split position, we sum the END token probability and the inverse probability of the next character at each position in the sequence x of length n and take the position with the highest score:

    argmax_i [ p(END | x_0 ... x_i) + (1 − p(x_{i+1} | x_0 ... x_i)) ]   (3)

During initial experiments, we found that this approach did indeed yield correct boundaries of free morphemes in German compounds, but struggled to find the correct boundary for splitting when compounds consist of more than two such free morphemes. For example, for the compound Autobahnraststätte (highway service area) with four free morphemes (Auto, Bahn, Rast, Stätte), the approach identified Autobahnrast as the body and Stätte as the head, instead of Autobahn and Raststätte. We hypothesized that one flaw of the approach is that when determining the best split position i in a sequence x, only the character sequence up to position i, i.e. x_0 ... x_i, is considered, and the remaining string, x_{i+1} ... x_n, is neglected. Therefore, we introduced a backward-looking RNN that consumes the sequence x in reverse order, which constitutes a bi-directional RNN (biRNN). We calculate the same two features for the backward-looking RNN as for the forward-looking one, and sum the scores of the forward- and backward-looking RNN at each position i in the sequence to determine the best position for a split.
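The following is a minimal sketch (not the released implementation) of how the split scores of Equations (1)–(3) can be computed once a forward and a backward character language model are available. The interface (per-prefix next-token distributions) and all function names are illustrative assumptions.

```python
from typing import Callable, Dict, List

Dist = Dict[str, float]                          # next-token distribution
CharLM = Callable[[str], List[Dist]]             # word -> one distribution per prefix


def split_scores(word: str, lm: CharLM) -> List[float]:
    """Score every split position i with p(END | x_0..x_i) + (1 - p(x_{i+1} | x_0..x_i))."""
    dists = lm(word)
    scores = []
    for i in range(len(word) - 1):               # a "split" after the last character is no split
        p_end = dists[i].get("END", 0.0)
        p_next_char = dists[i].get(word[i + 1], 0.0)
        scores.append(p_end + (1.0 - p_next_char))
    return scores


def best_split(word: str, forward_lm: CharLM, backward_lm: CharLM) -> int:
    """Sum forward and backward scores (biRNN) and return the best split index."""
    fwd = split_scores(word, forward_lm)
    # The backward LM consumes the reversed word; reversing its score list aligns
    # the score for a split after reversed position j with forward position n-2-j.
    bwd = list(reversed(split_scores(word[::-1], backward_lm)))
    combined = [f + b for f, b in zip(fwd, bwd)]
    return max(range(len(combined)), key=combined.__getitem__)
```

A returned index i corresponds to a split between x_i and x_{i+1}, i.e. the head candidate is word[i + 1:].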
Furthermore, we noted that the approach fails to place boundaries correctly if the compounds contain so-called Fugen elements (linking elements), as in e.g. Installation-s-Anweisung (installation instruction), where a Fugen-S glues together the free morphemes Installation and Anweisung. The approach places the split after the first free morpheme (Installation) and thus attaches the Fugen-S to the head (sanweisung), which yields an incorrect split. Therefore, we applied a regular expression (capturing bound morphemes that often occur before compound boundaries, e.g. -ions, -täts, -keits) that aims to remove the Fugen-S heuristically before identifying a splitting position (e.g. Installationsanweisung → Installationanweisung).

For this approach, we only need a collection of German words to train the character-based RNNs, as it is unsupervised with regard to the compound splitting task.

2.2 Supervised Recurrent Neural Network

A natural extension to the neural character-based language model is to add supervision. To determine the split in the unsupervised model, we summed probabilities that we deemed relevant, but did not train the RNN itself on the task of compound splitting. In the supervised approach, we use the hidden states of the trained character-based language model biRNNs when consuming a German compound word character-wise as features to train a binary classifier for the splitting decision. That is, at each position in the sequence, we concatenate the hidden states of the forward and backward RNN to create a feature vector. This vector is then fed to a fully connected layer, as shown in Figure 1. For determining the split, we take the position in the sequence that has the highest probability according to the binary classifier. During training, the split position is known and used to calculate and backpropagate the loss.

[Figure 1: Supervised RNN-MLP architecture, shown for the compound Türschloss (door lock), where the split position is at the character s.]
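A minimal PyTorch sketch of such an RNN-MLP classifier is shown below. The framework choice, the class name, and the joint wiring of biLSTM and MLP into one module are our illustrative assumptions (in the paper, the character language models are trained separately and their hidden states only serve as input features for the MLP); the layer sizes loosely follow the best configuration in Table 1.

```python
import torch
import torch.nn as nn


class SplitClassifier(nn.Module):
    """Bidirectional character LSTM + per-position MLP for the split decision (cf. Figure 1)."""

    def __init__(self, vocab_size: int, emb_dim: int = 32, hidden: int = 512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Concatenated forward/backward states -> 3-layer MLP -> split score per position
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) -> split logits: (batch, seq_len)
        states, _ = self.bilstm(self.emb(char_ids))
        return self.mlp(states).squeeze(-1)


# The predicted split is the position with the highest score; training would
# compare the logits against a 0/1 vector marking the gold split position
# (e.g. with nn.BCEWithLogitsLoss).
model = SplitClassifier(vocab_size=60)
logits = model(torch.randint(0, 60, (1, 10)))    # one compound of 10 characters
split_position = logits.argmax(dim=1)
```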
2.3 Sequence-to-sequence with attention

An alternative supervised neural model is the sequence-to-sequence (seq2seq) model (Sutskever et al., 2014). The model is applied to transform sequences into other sequences, e.g. in Machine Translation (Cho et al., 2014). It consists of an encoder component that encodes the input sequence into a latent representation, and a decoder that generates the desired output sequence based on the encoded input.

The initial version of the seq2seq model encodes the whole input sequence into one vector (i.e. the last hidden state of the encoder) on which the generation of the output is based. To address this limitation, Bahdanau et al. (2014) introduced an attention mechanism that lets the decoder peek at all hidden states of the input and apply a weight to each, representing its respective importance while generating the output token at each step during generation.

We apply this model to the compound splitting task by using the compounds as input and the head noun as the desired output.³ We hypothesize that the attention mechanism is helpful in our case, because not all characters in the input sequence are equally important when identifying the split boundary. The attention mechanism enables the decoder to focus on different character groups, which, ideally, represent (groups of) relevant free morphemes. We thus apply the seq2seq model with attention.

³ We also experimented with generating both the body and the head constituents, but obtained slightly better results with generating the head only.

3 Experiments

Having outlined our models, we now describe our data and the baselines, followed by the evaluation results.

3.1 Data

We use the dataset discussed in Henrich and Hinrichs (2011), i.e. a list of 75 000 German compounds and their head nouns extracted from GermaNet (Henrich and Hinrichs, 2010), a German wordnet.⁴ Henrich and Hinrichs (2011) used this resource to evaluate several approaches to compound splitting. They also included non-compounds in their evaluation; however, these are not provided in the resource. Therefore, we only compare the ability of our approaches to determine the correct split positions in known compounds (corresponding to the task in section 7.2 in Henrich and Hinrichs (2011)).⁵ Furthermore, we remove compounds from the list that contain a hyphen, since splitting them at the hyphen is straight-forward. We randomly split the remaining 73 133 compounds into 80% train and 20% test data.

⁴ http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml, we use version v12.0 (2017)
⁵ Clearly, an end-to-end system for compound splitting needs to be able to identify whether a given word constitutes a compound. Unfortunately, we are not able to evaluate our approaches in this regard here. Another resource containing non-compounds is discussed in Escartín (2014), but it does not seem to be available.

Additionally, we created the first gold standard of Swiss German compounds and their head nouns. We extracted the 150 longest Swiss German words in the SB-CH corpus (Grubenmann et al., 2018), which consists of Swiss German texts gathered from different sources (e.g. social media messages, business reports). We then manually annotated their (recursive) head nouns. For example, for the compound Wältuntergangskatastrophefilm (apocalypse catastrophe movie), we extract the heads Katastrophefilm and Film and consider both as a correct head in evaluation. We use this data as a kind of out-of-domain test set in the evaluation.

3.2 Baselines

We compare the neural models to two baselines:

Dictionary-based: The first baseline uses a dictionary and matches its words to the end of the compounds in the test set. If a word from the dictionary is found to be a substring at the end of a compound in the test set, it is taken as the head noun of that compound. Clearly, the order in which the dictionary is traversed matters, because the method stops after finding the first match. We experimented with sorting the dictionary from longest to shortest words and vice versa. Also, we included a subroutine that checks whether the found body (the remaining word after removing the head noun from the compound, potentially after removing a Fugen-S) is also in the dictionary, and favored splits for which both body and head are in the dictionary. The algorithm is outlined in Algorithm 1.

Algorithm 1 Dictionary-based compound splitting
1: Create dictionary D from train set
2: Sort D based on word length
3: procedure SPLIT(compound)
4:   for word ∈ D do
5:     if compound ends with word then
6:       head = word
7:       if body ∈ D then
8:         break
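A compact Python sketch of Algorithm 1 follows. It assumes the dictionary is built from the constituents seen in the training split; the helper names and the crude Fugen-S handling are illustrative simplifications, not the exact released code.

```python
from typing import Iterable, List, Optional, Tuple


def build_dictionary(train_constituents: Iterable[str]) -> List[str]:
    # Longest-first traversal worked best in our runs (cf. Table 1).
    return sorted({w.lower() for w in train_constituents}, key=len, reverse=True)


def split_compound(compound: str, dictionary: List[str]) -> Optional[Tuple[str, str]]:
    """Return (body, head), preferring splits whose body is also a known word."""
    compound = compound.lower()
    known = set(dictionary)
    fallback = None
    for word in dictionary:                                  # traversal order matters
        if len(word) < len(compound) and compound.endswith(word):
            body = compound[: -len(word)]
            if body in known or body.rstrip("s") in known:   # allow for a trailing Fugen-S
                return body, word                            # body and head both known: take this split
            if fallback is None:
                fallback = (body, word)                      # otherwise remember the first match
    return fallback
```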
CharSplit: The dictionary-based method is prone to fail where a head noun of a compound is never seen in the training corpus. CharSplit, proposed in Tuggener (2016), alleviates this by basing the splits on character ngrams rather than words. CharSplit calculates probabilities of ngrams (length 2 to 20) occurring at the beginning, middle, and end of words in an unlabeled corpus and computes a splitting score at each character position in a (compound) word. Tuggener (2016) found that this method outperforms several other splitters in the task of identifying correct splitting boundaries on the GermaNet data, achieving 95% accuracy.
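The following toy code illustrates the ngram-statistics idea described above: count how often character ngrams occur word-initially and word-finally in a corpus, then score each split position by how word-final the prefix ends and how word-initial the suffix starts. This is a deliberately simplified approximation, not the actual scoring formula of Tuggener (2016).

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngram_counts(words: Iterable[str], n_max: int = 5) -> Tuple[Counter, Counter]:
    begin, end = Counter(), Counter()
    for w in words:
        w = w.lower()
        for n in range(2, min(n_max, len(w)) + 1):
            begin[w[:n]] += 1                      # ngram at the beginning of a word
            end[w[-n:]] += 1                       # ngram at the end of a word
    return begin, end


def split_scores(compound: str, begin: Counter, end: Counter, n: int = 3) -> List[float]:
    compound = compound.lower()
    scores = []
    for i in range(n, len(compound) - n + 1):      # candidate split before position i
        prefix_final = end[compound[i - n:i]]      # how word-final the prefix ending is
        suffix_initial = begin[compound[i:i + n]]  # how word-initial the suffix start is
        scores.append(float(prefix_final * suffix_initial))
    return scores                                  # the split goes before the argmax position
```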
3.3 Evaluation

Next, we evaluate the models regarding their ability to identify correct splitting boundaries on the German and Swiss German data. Since we only evaluate on known compounds, we measure accuracy, i.e. the percentage of compounds for which the methods find the correct split. Results are given in Table 1.

Model               Parameters                                        Acc.
Dictionary          check long words first                            94.07
Dictionary          check short words first                           58.35
Dictionary          +favor known words as body                        95.18
CharSplit           include Fugen-S                                   90.65
CharSplit           remove Fugen-S                                    93.26
Unsup. biRNN        1 layer, 32 hidden size, 1 epoch                  67.34
Unsup. biLSTM       1 layer, 32 hidden size, 1 epoch                  78.13
Unsup. biLSTM       1 layer, 32 hidden size, 1 epoch, keep Fugen-S    71.51
Unsup. biLSTM       2 layers, 512 hidden size, 1 epoch                87.04
Unsup. biLSTM+MLP   MLP: 3 layers, 128-64-16 hidden size, 2 epochs    95.10
Seq2seq+attention   1 layer, 128 hidden size, 3 epochs                92.40

Table 1: Models, parameters, and splitting accuracy for different approaches, trained and evaluated on German compounds (GermaNet). Best results per category are in bold, best overall underlined.

Model               Acc.
Dictionary          20.67
CharSplit           36.67
Unsup. biLSTM       57.33
Unsup. biLSTM+MLP   68.67
Seq2seq+attention   52.00

Table 2: Splitting accuracy for different approaches evaluated on Swiss German compounds (SB-CH), using the best performing models trained on GermaNet.

The first striking observation is that the dictionary-based method outperforms all other approaches on the GermaNet data. The reason for the high accuracy lies in the overlap of the words in the train and test set. While there are no direct duplicates in the sets, almost all head nouns (97%) that need to be identified in the test set are included in the train set as constituents of a compound. For example, the test set contains the instance Fruchtnektar (fruit nectar) → Frucht → Nektar and the train set contains Bananennektar (banana nectar) → Banane → Nektar. As we see from the evaluation on the SB-CH data, the approach clearly fails when this overlap diminishes.

The CharSplit baseline performs 2 accuracy points below the dictionary method on the GermaNet data, but achieves better results on the Swiss German compounds. Relying on ngram representations of the German training data leads to an advantage when moving from the German training data to the related, but not identical, domain of Swiss German compounds.

For the unsupervised biLSTM, we found that training for more than 1 epoch did not improve results. However, increasing the model size in terms of layers and hidden state size benefited accuracy to a certain extent, and removing the Fugen-S is vital. Also, we found that a vanilla biRNN fares poorly compared to using a biLSTM. The results show that for the German compounds, the model falls behind the baselines by a considerable margin. However, on the Swiss German compounds, it leads to a large improvement compared to the baselines (+20 accuracy points).

For the combination of the unsupervised biLSTM coupled with the supervised MLP, we took the best performing biLSTM model (2 layers, 512 hidden size) to create the inputs for the MLP. We experimented with different numbers of layers, hidden state sizes, and epochs for the MLP and report results of the best performing configuration. On the German data, this approach is on par with the dictionary baseline. It also improves performance on the Swiss German compounds by +11 accuracy points compared to only using the unsupervised biLSTM, similar to the gains on the German data.

The seq2seq model was trained independently from the other models. It outperforms the unsupervised biLSTM on the German data, but not on the out-of-domain Swiss German test set. It falls behind the other supervised approach on both test sets, but features some other interesting properties discussed in the next section.

3.4 Output analysis

In this section, we qualitatively compare the outputs and other properties of the different approaches and discuss the (dis)advantages of each. In general, we hardly ever encountered system outputs that put splits at seemingly random positions within the compounds. The two main error types we observed are related to the Fugen-S and to choosing the correct split position for compounds consisting of more than two free morphemes.

Dictionary-lookup: As this method relies on a predefined list of known words, it is only able to split compounds whose head noun is contained in the dictionary. Hence, a main cause of errors is unknown head nouns, which especially affects performance on the Swiss German data. The method is robust against the Fugen-S, since it looks for known words at the compound end and hence never attaches a Fugen-S to a found head noun. However, it is the only approach that does not provide a way to distinguish compounds from non-compounds (i.e. it provides no measure for the confidence of a found split). That is, it does not provide any means to detect non-compounds. The word Fahrzeug (vehicle), for example, includes the head noun Zeug (thing), but Zeug is not a hypernym of Fahrzeug, and hence the word cannot be split without straying away from its meaning; the dictionary approach would nevertheless perform this split. Since we only evaluate the splitting methods on known compounds, this disadvantage does not affect the accuracy of the approach in our evaluation. In conclusion, this method works well when train and test data feature a largely overlapping vocabulary and the approach is coupled with a method to distinguish compounds from non-compounds.
CharSplit: This approach often fails when encountering the Fugen-S, i.e. it frequently attaches it to the head noun if the regular expression is not able to remove it before the split. However, this model improves performance on out-of-domain data compared to the dictionary approach, as it relies on an ngram representation of the training data. This enables it to better handle vocabulary differences between the training and testing domains. As an unsupervised approach, it only relies on an unannotated corpus to calculate the ngram probability distributions.

Unsupervised biLSTM: Similar to CharSplit, this approach also often attaches the Fugen-S to the head instead of the body if removing the Fugen-S with the regular expression fails. In that regard, the unsupervised biLSTM output is similar to CharSplit. However, the latent representation of the character sequences (compared to the ngram representation in CharSplit, which directly relies on the surface forms) allows it to generalize better to the out-of-domain data than CharSplit, yielding better splitting accuracy.

Unsupervised biLSTM + MLP: The combination of the unsupervised biLSTM and the supervised MLP yielded the best overall results in our experiments. As one of the supervised approaches, it does not seem to struggle with occurrences of the Fugen-S, and most errors stem from splitting compounds with multiple free morphemes at the wrong free morpheme boundary.

seq2seq: As the second supervised model, this model also does not struggle with removing Fugen elements. The main error cause is thus splitting compounds with multiple free morphemes at the incorrect free morpheme boundary, e.g. for the compound Parkleitsystem (parking guiding system), it generates the head System instead of Leitsystem. A unique error source for this model is that it often generates spelling errors in the (otherwise) correctly identified heads, e.g. for Neukonstruktion (new construction), it produces konstroktion as the head. Clearly, generating the head noun instead of just finding its starting character seems to overcomplicate the task in our setting and has an unnecessary impact on performance.

When moving to the out-of-domain data, the model shows a bigger performance drop than the biLSTM coupled with the MLP. One possible reason could be that the model relies more heavily on full words during testing, i.e. the input representation is constructed over the full compound and the attention mechanism does not focus strongly enough on relevant character sequences. A seq2seq model that replaces the input representation with convolution operations is presented in Gehring et al. (2017). An interesting experiment would be to evaluate whether character convolutions are a better option than the biLSTM for representing important character sequences.

Finally, we noted that when training the model to generate both the body and head constituents given the compound, it often produces the correct lemmatization of the body by removing Fugen elements from it (-es, -en, -s, -n; e.g. bundesforschungsminister → bund, forschungsminister or landesverwaltung → land, verwaltung). It thus seems to be the appropriate candidate for a neural model if identifying the lemmatized body of a compound is a goal of splitting compounds.

Output combination: To gain insight into how complementary the outputs of the different models are, we calculated the percentage of compounds that are split identically by all models, which is 76.63%. This suggests that there is a substantial difference in the outputs. We also calculated the upper bound splitting performance for ensembling the models. To do so, we regarded a compound as split correctly if at least one of the outputs contained the correct split. This upper bound achieves an accuracy of 99.56% on the GermaNet data, which indicates that the model outputs are sufficiently different to be combined in an ensemble system.
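The two ensemble statistics reported above can be computed as in the short sketch below: the share of compounds all models split identically, and the oracle upper bound that counts a compound as correct if any single model proposes the gold split. The dict-based input format is an illustrative assumption.

```python
from typing import Dict, List


def agreement(predictions: List[Dict[str, int]]) -> float:
    """Percentage of compounds for which all models predict the same split position."""
    compounds = predictions[0].keys()
    same = sum(1 for c in compounds if len({p[c] for p in predictions}) == 1)
    return 100.0 * same / len(compounds)


def oracle_accuracy(predictions: List[Dict[str, int]], gold: Dict[str, int]) -> float:
    """Upper bound: a compound counts as correct if at least one model finds the gold split."""
    hits = sum(1 for c, g in gold.items() if any(p.get(c) == g for p in predictions))
    return 100.0 * hits / len(gold)
```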
4 Conclusion

We evaluated three common neural sequence models for the task of splitting (Swiss) German compounds and compared them to ngram- and dictionary-based baselines. We found that for in-domain data (German compounds), the neural sequence models were not able to outperform the baselines, but that for out-of-domain data (Swiss German), they achieved vastly better accuracy in identifying splitting positions. We hypothesize that the latent representations of character sequences (in the form of vectors) allow the neural models to process similar character sequences in a similar way. Thus, when applying models trained on German data to Swiss German data, the models are able to resolve the differences between the languages: slightly modified but similar and corresponding ngram sequences lead to similar hidden representations, which in turn yield similar outputs (i.e. splitting positions), where ngram- or dictionary-based approaches, which directly rely on surface forms, fail. Future work in the direction of Alfonseca et al. (2008) will have to determine whether approaches based on neural sequence models are applicable to other language groups with similar spelling.

References

Enrique Alfonseca, Slaven Bilac, and Stefan Pharies. 2008. Decompounding query keywords from compounding languages. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 253–256. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics, Doha, Qatar.

Carla Parra Escartín. 2014. Chasing the perfect splitter: A comparison of different compound splitting tools. In LREC, pages 3340–3347.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. ArXiv e-prints.

Ralf Grubenmann, Don Tuggener, Pius von Däniken, Jan Deriu, and Mark Cieliebak. 2018. SB-CH: A Swiss German corpus with sentiment annotations. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018). Association for Computational Linguistics.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696. Sofia, Bulgaria.

Verena Henrich and Erhard Hinrichs. 2010. GernEdiT – the GermaNet editing tool. In Proceedings of the ACL 2010 System Demonstrations, pages 19–24.

Verena Henrich and Erhard Hinrichs. 2011. Determining immediate constituents of compounds in GermaNet. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 420–426.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural morphological analysis: Encoding-decoding canonical segments. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 961–967.

Philipp Koehn and Kevin Knight. 2003. Empirical methods for compound splitting. In Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics – Volume 1, pages 187–193. Association for Computational Linguistics.

Klaus Macherey, Andrew M. Dai, David Talbot, Ashok C. Popat, and Franz Och. 2011. Language-independent compound splitting with morphological operations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, pages 1395–1404. Association for Computational Linguistics.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Martin Riedl and Chris Biemann. 2016. Unsupervised compound splitting with distributional semantics rivals supervised methods. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 617–622.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition and inflection. In LREC, Lisbon, pages 1263–1266.

Sabine Schulte im Walde, Anna Hätty, and Stefan Bott. 2016. The role of modifier and head properties in predicting the compositionality of English and German noun-noun compounds: A vector-space perspective. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 148–158.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Don Tuggener. 2016. Incremental Coreference Resolution for German. Ph.D. thesis, University of Zurich.

Marion Weller-Di Marco. 2017. Simple compound splitting for German. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 161–166.

Patrick Ziering, Stefan Müller, and Lonneke van der Plas. 2016. Top a splitter: Using distributional semantics for improving compound splitting. In Proceedings of the 12th Workshop on Multiword Expressions, pages 50–55.

Patrick Ziering and Lonneke van der Plas. 2016. Towards unsupervised and language-independent compound splitting using inflectional morphological transformations. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–653.