Evaluating Neural Sequence Models for Splitting (Swiss) German Compounds

Don Tuggener
Zurich University of Applied Sciences
don.tuggener@zhaw.ch

Abstract

This paper evaluates unsupervised and supervised neural sequence models for the task of splitting (Swiss) German compound words. The models are compared to a state-of-the-art approach based on character ngrams and a simple heuristic that accesses a dictionary. We find that the neural models do not outperform the baselines on the German data, but excel when applied to out-of-domain data, i.e. splitting Swiss German compounds. We release our code and data,¹ namely the first annotated data set of Swiss German compounds.

1 Introduction

Splitting compound words is an important task when setting up pipelines for Natural Language Processing of the (Swiss²) German language. (Swiss) German features a long tail of word frequencies because compound words are not orthographically separated by whitespace as they are in e.g. English (e.g. Autobahnraststätte vs. highway service area). Thus, when mapping words to lexical resources such as word nets or embeddings, it is likely that some compounds are not represented in the resource.

Neural sequence models capture characteristics of word or character sequences (ngrams) in a latent representation and are thus hypothetically well suited for compound splitting. In this paper, we evaluate several neural sequence models on the task of (Swiss) German compound splitting. The models are compared to an unsupervised character ngram-based approach and a baseline that uses a dictionary. Furthermore, we present the first gold standard for splitting Swiss German compounds and evaluate the models on this newly available resource. Swiss German dialects have no official spelling; they are closely related to Standard German, but differ with respect to some historical language changes. We therefore use the Swiss German compounds as a kind of out-of-domain test set for the models.

1.1 Related Work

Several methods for automatically splitting word compounds exist. One approach is to use a dictionary to perform a full morphological analysis (Schmid et al., 2004). Others apply corpus statistics (Koehn and Knight, 2003) and combine them with linguistic heuristics (Weller-Di Marco, 2017). There are both supervised (Alfonseca et al., 2008; Henrich and Hinrichs, 2011) and unsupervised approaches (Macherey et al., 2011; Tuggener, 2016; Ziering and van der Plas, 2016) to the task. Riedl and Biemann (2016) and Ziering et al. (2016) explore distributional semantics on the premise that the constituents of a compound are semantically similar to the compound. Other work investigates the semantic relation between the constituents of compounds (Schulte im Walde et al., 2016). While most approaches focus on compounds in a single language, there is work that explores the task in a multilingual context (Alfonseca et al., 2008; Macherey et al., 2011).

While there exists work on morphological segmentation (Kann et al., 2016), to the best of our knowledge, there is no prior work on using neural sequence models to identify the head constituents of compounds.

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

¹ https://github.engineering.zhaw.ch/tuge/neural_compound_splitter
² Swiss German subsumes the Alemannic dialects spoken in Switzerland.
2 Neural sequence models for compound splitting

We first introduce the neural sequence models that we apply in the experiments and describe how we adapt them to the task of (Swiss) German compound splitting.

2.1 Unsupervised Recurrent Neural Network

The first model we explore is a character-based language model using a recurrent neural network (RNN). Language models aim to predict the next token in a sequence given the sequence history. Unlike ngram-based language models (e.g. Heafield et al., 2013), RNN-based models feature a hidden state that is updated after consuming a token (a character or a word). Based on the hidden state, a probability distribution over the token vocabulary is calculated using the softmax function, and the most likely token is selected when generating sequences. During training, the difference between the predicted probability and the given next token constitutes the loss that is backpropagated to update the model parameters (i.e. weights). Such RNN-based language models have been shown to outperform ngram-based models in terms of perplexity on general language modelling tasks (Mikolov et al., 2010, inter alia).

To employ RNNs for unsupervised compound splitting, we exploit an implementation detail that is commonly used when working with such approaches: Before training an RNN on a given token sequence, a special END token is appended to the sequence. This token is inserted into the vocabulary of the model. When using the trained RNN to generate a sequence, the END token serves as a stopping criterion, i.e. if the END token is the most likely next token, generation is terminated and the sequence is considered complete.

We adapt this idea and monitor the probability of emitting the END token while consuming a compound word character-wise. That is, in a character sequence x, we consider the position x_i with the highest probability of emitting the END token as the split position:

    argmax_i p(END | x_0 ... x_i)                                        (1)

Additionally, we are interested in positions in the sequence where the probability of generating the next given character is low, based on the hypothesis that such positions are indicators for a suitable split position:

    argmin_i p(x_{i+1} | x_0 ... x_i)                                    (2)

To combine the two features and determine the best split position, we sum the END token probability and the inverse probability of the next character at each position in the sequence x of length n and take the position with the highest score:

    argmax_i [ p(END | x_0 ... x_i) + (1 − p(x_{i+1} | x_0 ... x_i)) ]   (3)

During initial experiments, we found that this approach did indeed yield correct boundaries of free morphemes in German compounds, but struggled to find the correct boundary for splitting when compounds consist of more than two such free morphemes. For example, for the compound Autobahnraststätte (highway service area) with four free morphemes (Auto, Bahn, Rast, Stätte), the approach identified Autobahnrast as the body and Stätte as the head, instead of Autobahn and Raststätte. We hypothesized that one flaw of the approach is that when determining the best split position i in a sequence x, only the character sequence up to position i, i.e. x_0 ... x_i, is considered, and the remaining string, x_{i+1} ... x_n, is neglected. Therefore, we introduced a backward-looking RNN that consumes the sequence x in reverse order, which constitutes a bi-directional RNN (biRNN). We calculate the same two features for the backward-looking RNN as for the forward-looking one, and sum the scores of the forward- and backward-looking RNN at each position i in the sequence to determine the best position for a split.
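The following is a minimal sketch (not the released implementation) of how the split scores of Equations (1)–(3) can be computed once a forward and a backward character language model are available. The interface (per-prefix next-token distributions) and all function names are illustrative assumptions.

```python
from typing import Callable, Dict, List

Dist = Dict[str, float]                          # next-token distribution
CharLM = Callable[[str], List[Dist]]             # word -> one distribution per prefix


def split_scores(word: str, lm: CharLM) -> List[float]:
    """Score every split position i with p(END | x_0..x_i) + (1 - p(x_{i+1} | x_0..x_i))."""
    dists = lm(word)
    scores = []
    for i in range(len(word) - 1):               # a "split" after the last character is no split
        p_end = dists[i].get("END", 0.0)
        p_next_char = dists[i].get(word[i + 1], 0.0)
        scores.append(p_end + (1.0 - p_next_char))
    return scores


def best_split(word: str, forward_lm: CharLM, backward_lm: CharLM) -> int:
    """Sum forward and backward scores (biRNN) and return the best split index."""
    fwd = split_scores(word, forward_lm)
    # The backward LM consumes the reversed word; reversing its score list aligns
    # the score for a split after reversed position j with forward position n-2-j.
    bwd = list(reversed(split_scores(word[::-1], backward_lm)))
    combined = [f + b for f, b in zip(fwd, bwd)]
    return max(range(len(combined)), key=combined.__getitem__)
```

A returned index i corresponds to a split between x_i and x_{i+1}, i.e. the head candidate is word[i + 1:].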
Furthermore, we noted that the approach fails to place boundaries correctly if the compounds contain so-called Fugen elements (linking elements), as in e.g. Installation-s-Anweisung (installation instruction), where a Fugen-S glues together the free morphemes Installation and Anweisung. The approach places the split after the first free morpheme (Installation) and thus attaches the Fugen-S to the head (sanweisung), which yields an incorrect split. Therefore, we applied a regular expression (capturing bound morphemes that often occur before compound boundaries, e.g. -ions, -täts, -keits) that aims to remove the Fugen-S heuristically before identifying a splitting position (e.g. Installationsanweisung → Installationanweisung).

For this approach, we only need a collection of German words to train the character-based RNNs, as it is unsupervised with regard to the compound splitting task.

2.2 Supervised Recurrent Neural Network

A natural extension to the neural character-based language model is to add supervision. To determine the split in the unsupervised model, we summed probabilities that we deemed relevant, but did not train the RNN itself on the task of compound splitting. In the supervised approach, we use the hidden states of the trained character-based language model biRNNs when consuming a German compound word character-wise as features to train a binary classifier for the splitting decision. That is, at each position in the sequence, we concatenate the hidden states of the forward and backward RNN to create a feature vector. This vector is then fed to a fully connected layer, as shown in Figure 1. For determining the split, we take the position in the sequence that has the highest probability according to the binary classifier. During training, the split position is known and used to calculate and backpropagate the loss.

[Figure 1: Supervised RNN-MLP architecture, shown for the compound Türschloss (door lock), where the split position is at the character s.]
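A minimal PyTorch sketch of such an RNN-MLP classifier is shown below. The framework choice, the class name, and the joint wiring of biLSTM and MLP into one module are our illustrative assumptions (in the paper, the character language models are trained separately and their hidden states only serve as input features for the MLP); the layer sizes loosely follow the best configuration in Table 1.

```python
import torch
import torch.nn as nn


class SplitClassifier(nn.Module):
    """Bidirectional character LSTM + per-position MLP for the split decision (cf. Figure 1)."""

    def __init__(self, vocab_size: int, emb_dim: int = 32, hidden: int = 512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Concatenated forward/backward states -> 3-layer MLP -> split score per position
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) -> split logits: (batch, seq_len)
        states, _ = self.bilstm(self.emb(char_ids))
        return self.mlp(states).squeeze(-1)


# The predicted split is the position with the highest score; training would
# compare the logits against a 0/1 vector marking the gold split position
# (e.g. with nn.BCEWithLogitsLoss).
model = SplitClassifier(vocab_size=60)
logits = model(torch.randint(0, 60, (1, 10)))    # one compound of 10 characters
split_position = logits.argmax(dim=1)
```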
2.3 Sequence-to-sequence with attention

An alternative supervised neural model is the sequence-to-sequence (seq2seq) model (Sutskever et al., 2014). The model is applied to transform sequences into other sequences, e.g. in Machine Translation (Cho et al., 2014). It consists of an encoder component that encodes the input sequence into a latent representation, and a decoder that generates the desired output sequence based on the encoded input.

The initial version of the seq2seq model encodes the whole input sequence into one vector (i.e. the last hidden state of the encoder) on which the generation of the output is based. To address this limitation, Bahdanau et al. (2014) introduced an attention mechanism that lets the decoder peek at all hidden states of the input and apply a weight to each, representing its respective importance while generating the output token at each step during generation.

We apply this model to the compound splitting task by using the compounds as input and the head noun as the desired output.³ We hypothesize that the attention mechanism is helpful in our case, because not all characters in the input sequence are equally important when identifying the split boundary. The attention mechanism enables the decoder to focus on different character groups, which, ideally, represent (groups of) relevant free morphemes. We thus apply the seq2seq model with attention.

³ We also experimented with generating both the body and the head constituents, but obtained slightly better results with generating the head only.

3 Experiments

Having outlined our models, we now describe our data and the baselines, followed by the evaluation results.

3.1 Data

We use the dataset discussed in Henrich and Hinrichs (2011), i.e. a list of 75 000 German compounds and their head nouns extracted from GermaNet (Henrich and Hinrichs, 2010), a German wordnet.⁴ Henrich and Hinrichs (2011) used this resource to evaluate several approaches to compound splitting. They also included non-compounds in their evaluation; however, these are not provided in the resource. Therefore, we only compare the ability of our approaches to determine the correct split positions in known compounds (corresponding to the task in section 7.2 in Henrich and Hinrichs (2011)).⁵ Furthermore, we remove compounds from the list that contain a hyphen, since splitting them at the hyphen is straight-forward. We randomly split the remaining 73 133 compounds into 80% train and 20% test data.

⁴ http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml, we use version v12.0 (2017)
⁵ Clearly, an end-to-end system for compound splitting needs to be able to identify whether a given word constitutes a compound. Unfortunately, we are not able to evaluate our approaches in this regard here. Another resource containing non-compounds is discussed in Escartín (2014), but it does not seem to be available.

Additionally, we created the first gold standard of Swiss German compounds and their head nouns. We extracted the 150 longest Swiss German words in the SB-CH corpus (Grubenmann et al., 2018), which consists of Swiss German texts gathered from different sources (e.g. social media messages, business reports). We then manually annotated their (recursive) head nouns. For example, for the compound Wältuntergangskatastrophefilm (apocalypse catastrophe movie), we extract the heads Katastrophefilm and Film and consider both as a correct head in evaluation. We use this data as a kind of out-of-domain test set in the evaluation.

3.2 Baselines

We compare the neural models to two baselines:

Dictionary-based: The first baseline uses a dictionary and matches its words to the end of the compounds in the test set. If a word from the dictionary is found to be a substring at the end of a compound in the test set, it is taken as the head noun of that compound. Clearly, the order in which the dictionary is traversed matters, because the method stops after finding the first match. We experimented with sorting the dictionary from longest to shortest words and vice versa. Also, we included a subroutine that checks whether the found body (the remaining word after removing the head noun from the compound, potentially after removing a Fugen-S) is also in the dictionary, and favored splits for which both body and head are in the dictionary. The algorithm is outlined in Algorithm 1.

Algorithm 1 Dictionary-based compound splitting
1: Create dictionary D from train set
2: Sort D based on word length
3: procedure SPLIT(compound)
4:   for word ∈ D do
5:     if compound ends with word then
6:       head = word
7:       if body ∈ D then
8:         break
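A compact Python sketch of Algorithm 1 follows. It assumes the dictionary is built from the constituents seen in the training split; the helper names and the crude Fugen-S handling are illustrative simplifications, not the exact released code.

```python
from typing import Iterable, List, Optional, Tuple


def build_dictionary(train_constituents: Iterable[str]) -> List[str]:
    # Longest-first traversal worked best in our runs (cf. Table 1).
    return sorted({w.lower() for w in train_constituents}, key=len, reverse=True)


def split_compound(compound: str, dictionary: List[str]) -> Optional[Tuple[str, str]]:
    """Return (body, head), preferring splits whose body is also a known word."""
    compound = compound.lower()
    known = set(dictionary)
    fallback = None
    for word in dictionary:                                  # traversal order matters
        if len(word) < len(compound) and compound.endswith(word):
            body = compound[: -len(word)]
            if body in known or body.rstrip("s") in known:   # allow for a trailing Fugen-S
                return body, word                            # body and head both known: take this split
            if fallback is None:
                fallback = (body, word)                      # otherwise remember the first match
    return fallback
```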
CharSplit: The dictionary-based method is prone to fail where a head noun of a compound is never seen in the training corpus. CharSplit, proposed in Tuggener (2016), alleviates this by basing the splits on character ngrams rather than words. CharSplit calculates probabilities of ngrams (length 2 to 20) occurring at the beginning, middle, and end of words in an unlabeled corpus and computes a splitting score at each character position in a (compound) word. Tuggener (2016) found that this method outperforms several other splitters in the task of identifying correct splitting boundaries on the GermaNet data, achieving 95% accuracy.
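The following toy code illustrates the ngram-statistics idea described above: count how often character ngrams occur word-initially and word-finally in a corpus, then score each split position by how word-final the prefix ends and how word-initial the suffix starts. This is a deliberately simplified approximation, not the actual scoring formula of Tuggener (2016).

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngram_counts(words: Iterable[str], n_max: int = 5) -> Tuple[Counter, Counter]:
    begin, end = Counter(), Counter()
    for w in words:
        w = w.lower()
        for n in range(2, min(n_max, len(w)) + 1):
            begin[w[:n]] += 1                      # ngram at the beginning of a word
            end[w[-n:]] += 1                       # ngram at the end of a word
    return begin, end


def split_scores(compound: str, begin: Counter, end: Counter, n: int = 3) -> List[float]:
    compound = compound.lower()
    scores = []
    for i in range(n, len(compound) - n + 1):      # candidate split before position i
        prefix_final = end[compound[i - n:i]]      # how word-final the prefix ending is
        suffix_initial = begin[compound[i:i + n]]  # how word-initial the suffix start is
        scores.append(float(prefix_final * suffix_initial))
    return scores                                  # the split goes before the argmax position
```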
3.3 Evaluation

Next, we evaluate the models regarding their ability to identify correct splitting boundaries on the German and Swiss German data. Since we only evaluate on known compounds, we measure accuracy, i.e. the percentage of compounds for which the methods find the correct split. Results are given in Table 1.

Model               Parameters                                        Acc.
Dictionary          check long words first                            94.07
Dictionary          check short words first                           58.35
Dictionary          +favor known words as body                        95.18
CharSplit           include Fugen-S                                   90.65
CharSplit           remove Fugen-S                                    93.26
Unsup. biRNN        1 layer, 32 hidden size, 1 epoch                  67.34
Unsup. biLSTM       1 layer, 32 hidden size, 1 epoch                  78.13
Unsup. biLSTM       1 layer, 32 hidden size, 1 epoch, keep Fugen-S    71.51
Unsup. biLSTM       2 layers, 512 hidden size, 1 epoch                87.04
Unsup. biLSTM+MLP   MLP: 3 layers, 128-64-16 hidden size, 2 epochs    95.10
Seq2seq+attention   1 layer, 128 hidden size, 3 epochs                92.40

Table 1: Models, parameters, and splitting accuracy for different approaches, trained and evaluated on German compounds (GermaNet). Best results per category are in bold, best overall underlined.

Model               Acc.
Dictionary          20.67
CharSplit           36.67
Unsup. biLSTM       57.33
Unsup. biLSTM+MLP   68.67
Seq2seq+attention   52.00

Table 2: Splitting accuracy for different approaches evaluated on Swiss German compounds (SB-CH), using the best performing models trained on GermaNet.

The first striking observation is that the dictionary-based method outperforms all other approaches on the GermaNet data. The reason for the high accuracy lies in the overlap of the words in the train and test set. While there are no direct duplicates in the sets, almost all head nouns (97%) that need to be identified in the test set are included in the train set as constituents of a compound. For example, the test set contains the instance Fruchtnektar (fruit nectar) → Frucht → Nektar and the train set contains Bananennektar (banana nectar) → Banane → Nektar. As we see from the evaluation on the SB-CH data, the approach clearly fails when this overlap diminishes.

The CharSplit baseline performs 2 accuracy points below the dictionary method on the GermaNet data, but achieves better results on the Swiss German compounds. Relying on ngram representations of the German training data leads to an advantage when moving from the German training data to the related, but not identical, domain of Swiss German compounds.

For the unsupervised biLSTM, we found that training for more than 1 epoch did not improve results. However, increasing the model size in terms of layers and hidden state size benefited accuracy to a certain extent, and removing the Fugen-S is vital. Also, we found that a vanilla biRNN fares poorly compared to using a biLSTM. The results show that for the German compounds, the model falls behind the baselines by a considerable margin. However, on the Swiss German compounds, it leads to a large improvement compared to the baselines (+20 accuracy points).

For the combination of the unsupervised biLSTM coupled with the supervised MLP, we took the best performing biLSTM model (2 layers, 512 hidden size) to create the inputs for the MLP. We experimented with different numbers of layers, hidden state sizes, and epochs for the MLP and report results of the best performing configuration. On the German data, this approach is on par with the dictionary baseline. It also improves performance on the Swiss German compounds by +11 accuracy points compared to only using the unsupervised biLSTM, similar to the gains on the German data.

The seq2seq model was trained independently from the other models. It outperforms the unsupervised biLSTM on the German data, but not on the out-of-domain Swiss German test set. It falls behind the other supervised approach on both test sets, but features some other interesting properties discussed in the next section.

3.4 Output analysis

In this section, we qualitatively compare the outputs and other properties of the different approaches and discuss the (dis)advantages of each. In general, we hardly ever encountered system outputs that put splits at seemingly random positions within the compounds. The two main error types we observed are related to the Fugen-S and to choosing the correct split position for compounds consisting of more than two free morphemes.

Dictionary-lookup: As this method relies on a predefined list of known words, it is only able to split compounds whose head noun is contained in the dictionary. Hence, a main cause of errors is unknown head nouns, which especially affects performance on the Swiss German data. The method is robust against the Fugen-S, since it looks for known words at the compound end and hence never attaches a Fugen-S to a found head noun. However, it is the only approach that does not provide a way to distinguish compounds from non-compounds (i.e. it provides no measure for the confidence of a found split). That is, it does not provide any means to detect non-compounds. The word Fahrzeug (vehicle), for example, includes the head noun Zeug (thing), but Zeug is not a hypernym of Fahrzeug, and hence the word cannot be split without straying away from its meaning; the dictionary approach would nevertheless perform this split. Since we only evaluate the splitting methods on known compounds, this disadvantage does not affect the accuracy of the approach in our evaluation. In conclusion, this method works well when train and test data feature a largely overlapping vocabulary and the approach is coupled with a method to distinguish compounds from non-compounds.
CharSplit: This approach often fails when encountering the Fugen-S, i.e. it frequently attaches it to the head noun if the regular expression is not able to remove it before the split. However, this model improves performance on out-of-domain data compared to the dictionary approach, as it relies on an ngram representation of the training data. This enables it to better handle vocabulary differences between the training and testing domains. As an unsupervised approach, it only relies on an unannotated corpus to calculate the ngram probability distributions.

Unsupervised biLSTM: Similar to CharSplit, this approach also often attaches the Fugen-S to the head instead of the body if removing the Fugen-S with the regular expression fails. In that regard, the unsupervised biLSTM output is similar to CharSplit. However, the latent representation of the character sequences (compared to the ngram representation in CharSplit, which directly relies on the surface forms) allows it to generalize better to the out-of-domain data than CharSplit, yielding better splitting accuracy.

Unsupervised biLSTM + MLP: The combination of the unsupervised biLSTM and the supervised MLP yielded the best overall results in our experiments. As one of the supervised approaches, it does not seem to struggle with occurrences of the Fugen-S, and most errors stem from splitting compounds with multiple free morphemes at the wrong free morpheme boundary.

seq2seq: As the second supervised model, this model also does not struggle with removing Fugen elements. The main error cause is thus splitting compounds with multiple free morphemes at the incorrect free morpheme boundary, e.g. for the compound Parkleitsystem (parking guiding system), it generates the head System instead of Leitsystem. A unique error source for this model is that it often generates spelling errors in the (otherwise) correctly identified heads, e.g. for Neukonstruktion (new construction), it produces konstroktion as the head. Clearly, generating the head noun instead of just finding its starting character seems to overcomplicate the task in our setting and has an unnecessary impact on performance.

When moving to the out-of-domain data, the model shows a bigger performance drop than the biLSTM coupled with the MLP. One possible reason could be that the model relies more heavily on full words during testing, i.e. the input representation is constructed over the full compound and the attention mechanism does not focus strongly enough on relevant character sequences. A seq2seq model that replaces the input representation with convolution operations is presented in Gehring et al. (2017). An interesting experiment would be to evaluate whether character convolutions are a better option than the biLSTM for representing important character sequences.

Finally, we noted that when training the model to generate both the body and head constituents given the compound, it often produces the correct lemmatization of the body by removing Fugen elements from it (-es, -en, -s, -n; e.g. bundesforschungsminister → bund, forschungsminister or landesverwaltung → land, verwaltung). It thus seems to be the appropriate candidate for a neural model if identifying the lemmatized body of a compound is a goal of splitting compounds.

Output combination: To gain insight into how complementary the outputs of the different models are, we calculated the percentage of compounds that are split identically by all models, which is 76.63%. This suggests that there is a substantial difference in the outputs. We also calculated the upper bound splitting performance for ensembling the models. To do so, we regarded a compound as split correctly if at least one of the outputs contained the correct split. This upper bound achieves an accuracy of 99.56% on the GermaNet data, which indicates that the model outputs are sufficiently different to be combined in an ensemble system.
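The two ensemble statistics reported above can be computed as in the short sketch below: the share of compounds all models split identically, and the oracle upper bound that counts a compound as correct if any single model proposes the gold split. The dict-based input format is an illustrative assumption.

```python
from typing import Dict, List


def agreement(predictions: List[Dict[str, int]]) -> float:
    """Percentage of compounds for which all models predict the same split position."""
    compounds = predictions[0].keys()
    same = sum(1 for c in compounds if len({p[c] for p in predictions}) == 1)
    return 100.0 * same / len(compounds)


def oracle_accuracy(predictions: List[Dict[str, int]], gold: Dict[str, int]) -> float:
    """Upper bound: a compound counts as correct if at least one model finds the gold split."""
    hits = sum(1 for c, g in gold.items() if any(p.get(c) == g for p in predictions))
    return 100.0 * hits / len(gold)
```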
4 Conclusion

We evaluated three common neural sequence models for the task of splitting (Swiss) German compounds and compared them to ngram- and dictionary-based baselines. We found that for in-domain data (German compounds), the neural sequence models were not able to outperform the baselines, but that for out-of-domain data (Swiss German), they achieved vastly better accuracy in identifying splitting positions. We hypothesize that the latent representations of character sequences (in the form of vectors) allow the neural models to process similar character sequences in a similar way. Thus, when applying models trained on German data to Swiss German data, the models are able to resolve the differences between the languages: slightly modified but similar and corresponding ngram sequences lead to similar hidden representations, which in turn yield similar outputs (i.e. splitting positions), where ngram- or dictionary-based approaches, which directly rely on surface forms, fail. Future work in the direction of Alfonseca et al. (2008) will have to determine whether approaches based on neural sequence models are applicable to other language groups with similar spelling.

References

Enrique Alfonseca, Slaven Bilac, and Stefan Pharies. 2008. Decompounding query keywords from compounding languages. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 253–256. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics, Doha, Qatar.

Carla Parra Escartín. 2014. Chasing the perfect splitter: A comparison of different compound splitting tools. In LREC, pages 3340–3347.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. ArXiv e-prints.

Ralf Grubenmann, Don Tuggener, Pius von Däniken, Jan Deriu, and Mark Cieliebak. 2018. SB-CH: A Swiss German corpus with sentiment annotations. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018). Association for Computational Linguistics.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696. Sofia, Bulgaria.

Verena Henrich and Erhard Hinrichs. 2010. GernEdiT – the GermaNet editing tool. In Proceedings of the ACL 2010 System Demonstrations, pages 19–24.

Verena Henrich and Erhard Hinrichs. 2011. Determining immediate constituents of compounds in GermaNet. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 420–426.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural morphological analysis: Encoding-decoding canonical segments. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 961–967.

Philipp Koehn and Kevin Knight. 2003. Empirical methods for compound splitting. In Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics – Volume 1, pages 187–193. Association for Computational Linguistics.

Klaus Macherey, Andrew M. Dai, David Talbot, Ashok C. Popat, and Franz Och. 2011. Language-independent compound splitting with morphological operations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, pages 1395–1404. Association for Computational Linguistics.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Martin Riedl and Chris Biemann. 2016. Unsupervised compound splitting with distributional semantics rivals supervised methods. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 617–622.

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition and inflection. In LREC, Lisbon, pages 1263–1266.

Sabine Schulte im Walde, Anna Hätty, and Stefan Bott. 2016. The role of modifier and head properties in predicting the compositionality of English and German noun-noun compounds: A vector-space perspective. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 148–158.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Don Tuggener. 2016. Incremental Coreference Resolution for German. Ph.D. thesis, University of Zurich.

Marion Weller-Di Marco. 2017. Simple compound splitting for German. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 161–166.

Patrick Ziering, Stefan Müller, and Lonneke van der Plas. 2016. Top a splitter: Using distributional semantics for improving compound splitting. In Proceedings of the 12th Workshop on Multiword Expressions, pages 50–55.

Patrick Ziering and Lonneke van der Plas. 2016. Towards unsupervised and language-independent compound splitting using inflectional morphological transformations. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–653.