<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Neural Sequence Models for Splitting (Swiss) German Compounds</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Don Tuggener</string-name>
          <email>don.tuggener@zhaw.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Zurich University of Applied Sciences</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper evaluates unsupervised and supervised neural sequence models for the task of splitting (Swiss) German compound words. The models are compared to a state-of-the-art approach based on character ngrams and a simple heuristic that accesses a dictionary. We find that the neural models do not outperform the baselines on the German data, but excel when applied to out-of-domain data, i.e. splitting Swiss German compounds. We release our code and data, namely the first annotated data set of Swiss German compounds.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Splitting compound words is an important task when
setting up pipelines for Natural Language
Processing of the (Swiss) German language. (Swiss)
German features a long tail regarding word frequencies
due to the phenomenon that compound words are not
orthographically separated by whitespace as in e.g.
English (e.g. Autobahnraststätte vs. highway service
area). Thus, when mapping words to lexical resources
such as word nets or embeddings, it is likely that there
are compounds which are not represented in the
resource.</p>
      <p>Neural sequence models capture characteristics of
word or character sequences (ngrams) in a latent
representation and are thus hypothetically well-suited for
compound splitting. In this paper, we evaluate several
neural sequence models on the task of (Swiss)
German compound splitting. The models are compared
to an unsupervised character ngram-based approach
and a baseline that uses a dictionary. Furthermore,
we present the first gold standard for splitting Swiss
German compounds and evaluate the models on this
newly available resource. Swiss German dialects
have no official spelling; they are closely related to
Standard German but differ with respect to some
historical language changes. We use the Swiss
German compounds as a sort of out-of-domain test set for
the models.</p>
      <sec id="sec-1-1">
        <title>Related Work</title>
        <p>
          Several methods for automatic splitting of word
compounds exist. One approach is to use a dictionary
to perform a full morphological analysis
          <xref ref-type="bibr" rid="ref15">(Schmid
et al., 2004)</xref>
          . Others apply corpora statistics
          <xref ref-type="bibr" rid="ref11">(Koehn
and Knight, 2003)</xref>
          and combine them with linguistic
heuristics
          <xref ref-type="bibr" rid="ref19">(Weller-Di Marco, 2017)</xref>
          . There are both
supervised
          <xref ref-type="bibr" rid="ref1 ref9">(Alfonseca et al., 2008; Henrich and
Hinrichs, 2011)</xref>
          and unsupervised approaches
          <xref ref-type="bibr" rid="ref10 ref12 ref14 ref18 ref20 ref21">(Macherey
et al., 2011; Tuggener, 2016; Ziering and van der Plas,
2016)</xref>
          to the task. Riedl and Biemann (2016);
Ziering et al. (2016) explore distributional semantics on
the premise that the constituents of a compound are
semantically similar to the compound. Other work
researches the semantic relation between the
constituents of the compounds
          <xref ref-type="bibr" rid="ref16">(Schulte im Walde et al.,
2016)</xref>
          . While most approaches focus on compounds
in a single language, there exists work that explores
the task in a multilingual context
          <xref ref-type="bibr" rid="ref1 ref12">(Alfonseca et al.,
2008; Macherey et al., 2011)</xref>
          .
        </p>
        <p>
          While there exists work on morphological
segmentation
          <xref ref-type="bibr" rid="ref10">(Kann et al., 2016)</xref>
          , to the best of our
knowledge, there is no prior work on using neural sequence
models to identify head constituents of compounds.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Neural sequence models for compound splitting</title>
      <p>We first introduce the neural sequence models that we
apply in the experiments and how we adapt them to fit
the task of (Swiss) German compound splitting.</p>
      <sec id="sec-2-1">
        <title>Unsupervised Recurrent Neural Network</title>
        <p>
          The first model we explore is a character-based
language model using a recurrent neural network (RNN).
Language models aim to predict the next token in a
sequence given the sequence history. Different from
ngram-based language models
          <xref ref-type="bibr" rid="ref7">(Heafield et al., 2013,
e.g.)</xref>
          , RNN-based models feature a hidden state that
is updated after consuming a token (a character or
a word). Based on the hidden state, a probability
distribution over the token vocabulary is calculated
using the softmax function, and the most likely
token is selected when generating sequences. During
training, the difference between the predicted
probability and the given next token constitutes the loss
that is backpropagated to update the model
parameters (i.e. weights). Such RNN-based language models
have been shown to outperform ngram-based models
in terms of perplexity on general language modelling
tasks
          <xref ref-type="bibr" rid="ref13">(Mikolov et al., 2010, inter alia)</xref>
          .
        </p>
        <p>To employ RNNs for unsupervised compound
splitting, we exploit an implementation detail that is
commonly used when working with such approaches:
Before training an RNN on a given token sequence, a
special END token is appended to the sequence. This
token is inserted into the vocabulary of the model.
When using the trained RNN to generate a sequence,
the END token is used as a stopping criterion, i.e. if the
END token is the most likely next token, generation is
terminated and the sequence is considered complete.</p>
        <p>We adapt this idea and monitor the probability of
emitting the END token when consuming a compound
word character-wise. That is, in a character sequence
x, we consider the position x_i with the highest
probability of emitting the END token as the split position:
argmax_i p(END | x_0 . . . x_i)   (1)</p>
        <p>Additionally, we are interested in positions in the
sequence where the probability of generating the next
given character is low, based on the hypothesis that
such positions are indicators for a suitable split
position:
argmin_i p(x_i+1 | x_0 . . . x_i)   (2)</p>
        <p>To combine the two features and determine the best
split position, we sum the END token probability and
the inverse of the probability of the next character at
each position in the sequence x of length n and take
the position with the highest score:
argmax_i p(END | x_0 . . . x_i) + (1 − p(x_i+1 | x_0 . . . x_i))   (3)</p>
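<p>The combined scoring of Eqs. (1)-(3) can be sketched as follows. The probability values below are invented for illustration; in practice they would come from the character-level language model at each position.</p>

```python
# Sketch of the combined split scoring in Eq. (3): sum the END-token
# probability and the inverse probability of the observed next character
# at each position, then take the argmax. Values here are made up.
p_end  = [0.01, 0.02, 0.05, 0.60, 0.10, 0.03]  # p(END | x_0..x_i)
p_next = [0.90, 0.85, 0.70, 0.10, 0.80, 0.95]  # p(x_{i+1} | x_0..x_i)

scores = [pe + (1.0 - pn) for pe, pn in zip(p_end, p_next)]
split = max(range(len(scores)), key=scores.__getitem__)
print(split)  # → 3
```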
        <p>During initial experiments, we found that this
approach did indeed yield correct boundaries of free
morphemes in German compounds, but struggled to
find the correct boundary for splitting when
compounds consist of more than two such free
morphemes. For example, for the compound
Autobahnraststätte (highway service area) with four free
morphemes (Auto, Bahn, Rast, Stätte), the approach
identified Autobahnrast as the body and Stätte as the
head, instead of Autobahn and Raststätte. We
hypothesized that one flaw of the approach is that when
determining the best split position i in a sequence
x, only the character sequence up to position i, i.e.
x_0 . . . x_i, is considered, and the remaining string,
x_i+1 . . . x_n, is neglected. Therefore, we introduced a
backward-looking RNN that consumes the sequence
x in reverse order, which constitutes a bi-directional
RNN (biRNN). We calculate the same two features
for the backward-looking RNN as for the
forward-looking RNN and sum the scores of the forward- and
backward-looking RNNs at each position i in the
sequence to determine the best position for a split.</p>
        <p>Furthermore, we noted that the approach fails at
placing boundaries correctly if the compounds
contain so-called Fugen elements (linking elements), as
in e.g. Installation-s-Anweisung (installation
instruction), where a Fugen-S glues together the free
morphemes Installation and Anweisung. The approach
places the split after the first free morpheme
(Installation), and thus attaches the Fugen-S to the head
(sanweisung), which yields an incorrect split in the result.
Therefore, we applied a regular expression
(capturing bound morphemes that often occur before
compound boundaries, e.g. -ions, -täts, -keits) that aims
to remove the Fugen-S heuristically before identifying a
splitting position (e.g. Installationsanweisung →
Installationanweisung).</p>
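<p>The heuristic Fugen-S removal could be sketched as below. The exact pattern is an assumption on our part; it only illustrates the idea of deleting an s after bound morphemes that often precede compound boundaries.</p>

```python
import re

# Illustrative sketch of heuristic Fugen-S removal: drop an "s" that
# follows common bound morphemes (-ion, -tät, -keit, ...) when another
# letter follows it. The suffix list is an assumption, not the paper's
# exact regular expression.
FUGEN_S = re.compile(r"(ion|tät|keit|ung|heit|schaft)s(?=[a-zäöü])")

def remove_fugen_s(compound):
    """Lowercase the compound and delete a heuristically detected Fugen-S."""
    return FUGEN_S.sub(r"\1", compound.lower())

print(remove_fugen_s("Installationsanweisung"))  # → installationanweisung
```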
        <p>For this approach, we only need a collection of
German words to train the character-based RNNs, as it is
unsupervised regarding the compound splitting task.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Supervised Recurrent Neural Network</title>
        <p>A natural extension to the neural character-based
language model is to add supervision. To determine
the split in the unsupervised model, we summed
probabilities that we deemed relevant but did not
train the RNN itself on the task of compound
splitting. In the supervised approach, we use the hidden
states of the trained character-based language model
biRNNs when consuming a German compound word
character-wise as features to train a binary classifier
regarding the splitting decision. That is, at each
position in the sequence, we concatenate the hidden states
of the forward and backward RNN to create a feature
vector. This vector is then fed to a fully connected
layer, as shown in Figure 1. For determining the split,
we take the position in the sequence that has the
highest probability according to the binary classifier.
During training, the split position is known and used to
calculate and backpropagate the loss.</p>
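<p>A much-simplified sketch of the supervised scoring step follows: per position, the forward and backward hidden states are concatenated and scored, and the argmax position is the predicted split. A single linear layer stands in for the MLP, and the hidden states and weights are random stand-ins rather than trained values.</p>

```python
import random

# Illustrative sketch (not the trained model): concatenate the forward
# and backward hidden states at each character position, score the
# concatenation with a linear layer, and predict the argmax as the split.
random.seed(0)
T, H = 8, 16  # sequence length, hidden size per direction

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

h_fwd = [rand_vec(H) for _ in range(T)]  # stand-ins for biLSTM states
h_bwd = [rand_vec(H) for _ in range(T)]
w = rand_vec(2 * H)                      # stand-in for the trained scorer

def score(pos):
    feat = h_fwd[pos] + h_bwd[pos]       # concatenation, length 2H
    return sum(wi * fi for wi, fi in zip(w, feat))

split = max(range(T), key=score)         # predicted split position
print(split)
```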
        <p>
          The third model we apply is the sequence-to-sequence
(seq2seq) model, which is used to transform sequences into
other sequences in e.g. Machine Translation
          <xref ref-type="bibr" rid="ref2 ref3">(Cho
et al., 2014)</xref>
          . It consists of an encoder component that
encodes the input sequence into a latent
representation, and a decoder that generates the desired output
sequence based on the encoded input.
        </p>
        <p>The initial version of the seq2seq model encodes
the whole input sequence into one vector (i.e. the last
hidden state of the encoder) on which the generation
of the output is based. To address this limitation,
Bahdanau et al. (2014) introduced an attention mechanism
that lets the decoder peek at all hidden states in the
input and apply a weight to each which represents its
respective importance while generating the correct
output token at each step during generation.</p>
        <p>We apply this model to the compound splitting task
by using the compounds as input and the head noun
as the desired output.3 We hypothesize that the
attention mechanism is helpful in our case, because not
all characters in the input sequence are equally
important when identifying the split boundary. The attention
mechanism enables the decoder to focus on different
character groups, which, ideally, represent (groups of)
relevant free morphemes. We thus apply the seq2seq
model with attention.</p>
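<p>Training pairs for this setup might be constructed as below; the character tokenization and the end-of-sequence symbol are illustrative assumptions, not the paper's exact preprocessing.</p>

```python
# Hypothetical sketch of building seq2seq training pairs: the compound's
# character sequence is the input and the head's character sequence is
# the target. Token names (eos) are assumptions for illustration.
def make_pair(compound, head, eos="</s>"):
    src = list(compound.lower()) + [eos]
    tgt = list(head.lower()) + [eos]
    return src, tgt

src, tgt = make_pair("Parkleitsystem", "Leitsystem")
print(src[:4], tgt[:4])
```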
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>Having outlined our models, we now describe our
data and the baselines, followed by evaluation results.</p>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>
          We use the dataset discussed in Henrich and Hinrichs
(2011), i.e. a list of 75 000 German compounds and
their head nouns extracted from GermaNet
          <xref ref-type="bibr" rid="ref8">(Henrich
and Hinrichs, 2010)</xref>
          , a German wordnet.4 Henrich
and Hinrichs (2011) used this resource to evaluate
several approaches to compound splitting. They also
included non-compounds in their evaluation; however,
these are not provided in the resource. Therefore, we
only compare the ability of our approaches to
determine the correct split positions in known compounds
(corresponding to the task in section 7.2 in Henrich
and Hinrichs (2011)).5 Furthermore, we remove
compounds from the list that contain a hyphen, since
splitting them at the hyphen is straightforward. We
randomly split the remaining 73 133 compounds into
80% train and 20% test data.
        </p>
        <p>3We also experimented with generating both the body and the
head constituents, but obtained slightly better results with
generating the head only.</p>
        <p>4http://www.sfs.uni-tuebingen.de/lsd/compounds.shtml, we use version v12.0 (2017).</p>
        <p>
          Additionally, we created the first gold standard of
Swiss German compounds and their head nouns. We
extracted the 150 longest Swiss German words in
the SB-CH corpus
          <xref ref-type="bibr" rid="ref6">(Grubenmann et al., 2018)</xref>
          which
consists of Swiss German texts gathered from
different sources (e.g. social media messages, business
reports). We then manually annotated their
(recursive) head nouns. For example, for the compound
Wältuntergangskatastrophefilm (apocalypse
catastrophe movie), we extract the heads Katastrophefilm and
Film and consider both as correct heads in evaluation.
We use this data as a kind of out-of-domain test set in
the evaluation.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Baselines</title>
        <p>We compare the neural models to two baselines:
Dictionary-based: The first baseline uses a
dictionary and matches its words to the end of the
compounds in the test set. If a word from the
dictionary is found to be a substring at the end of a
compound in the test set, it is taken as the head noun
of that compound. Clearly, the order in which the
dictionary is traversed matters, because the method
stops after finding the first match. We experimented
with sorting the dictionary by longest to shortest
words and vice versa. Also, we included a subroutine
to check if the found body (the remaining word after
removing the head noun from the compound;
potentially removing the Fugen-S) is also in the
dictionary and favored those splits which have both
body and head in the dictionary. The algorithm is
outlined in Algorithm 1.</p>
        <p>CharSplit: The dictionary-based method is prone to
fail where a head noun of a compound is never seen
in the training corpus. CharSplit, proposed in
Tuggener (2016), alleviates this by basing the splits
on character ngrams rather than words. CharSplit
calculates the probabilities of ngrams (length 2 to 20)
occurring at the beginning, middle, and end of words in
an unlabeled corpus and calculates a splitting score at
each character position in a (compound) word.
Tuggener (2016) found that this method outperforms
several other splitters in the task of identifying
correct splitting boundaries on the GermaNet data,
achieving 95% accuracy.</p>
        <p>5Clearly, an end-to-end system for compound splitting needs
to be able to identify whether a given word constitutes a
compound. Unfortunately, we are not able to evaluate our approaches
in this regard here. Another resource containing non-compounds
is discussed in Escartín (2014), but it does not seem to be
available.</p>
        <p>Algorithm 1 Dictionary-based compound splitting
1: Create dictionary D from train set
2: Sort D based on word length
3: procedure SPLIT(compound)
4:   for word ∈ D do
5:     if compound ends with word then
6:       head = word
7:       if body ∈ D then
8:         break</p>
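<p>Algorithm 1 can be sketched in a few lines of Python. The Fugen suffix list and the helper names below are illustrative assumptions; only the overall control flow (match dictionary words at the compound end, favor splits whose body is also known) follows the algorithm.</p>

```python
# Sketch of the dictionary-based baseline (Algorithm 1). The simplified
# linking-element list FUGEN is an assumption for illustration.
FUGEN = ("es", "en", "s", "n")

def split_compound(compound, dictionary, longest_first=True):
    """Return (body, head) for the best dictionary match, or None."""
    lowered = {w.lower() for w in dictionary}
    fallback = None
    for word in sorted(dictionary, key=len, reverse=longest_first):
        if compound.lower().endswith(word.lower()) and len(word) < len(compound):
            body = compound[: len(compound) - len(word)]
            # favor splits whose body (possibly minus a Fugen element)
            # is itself a known word
            stems = [body] + [body[: -len(f)] for f in FUGEN if body.endswith(f)]
            if any(s.lower() in lowered for s in stems):
                return body, word            # both constituents known: stop
            if fallback is None:
                fallback = (body, word)      # remember first head-only match
    return fallback

dictionary = {"Bahn", "Auto", "Autobahn", "Rast", "Stätte", "Raststätte"}
print(split_compound("Autobahnraststätte", dictionary))  # → ('Autobahn', 'Raststätte')
```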
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation</title>
        <p>Next, we evaluate the models regarding their ability
to identify correct splitting boundaries on the German
and Swiss German data. Since we only evaluate on
known compounds, we measure accuracy, i.e. the
percentage of compounds for which the methods find the
correct split. Results are given in Table 1.</p>
        <p>The first striking observation is that the
dictionary-based method outperforms all other approaches on the
GermaNet data. The reason for the high accuracy lies
in the overlap of the words in the train and test set.
While there exist no direct duplicates in the sets,
almost all head nouns (97%) that need to be identified in
the test set are included in the train set as constituents
of a compound. For example, the test set contains the
instance Fruchtnektar (fruit nectar) → Frucht →
Nektar and the train set contains Bananennektar (banana
nectar) → Banane → Nektar. As we see from the
evaluation on the SB-CH data, the approach clearly
fails when this overlap diminishes.</p>
        <p>The CharSplit baseline performs 2 accuracy points
below the dictionary method on the GermaNet data,
but achieves better results on the Swiss German
compounds. Relying on ngram representations of the
German training data leads to an advantage when moving
from the German training data to the related, but not
identical domain of Swiss German compounds.</p>
        <p>For the unsupervised biLSTM, we found that
training for more than 1 epoch did not improve results.</p>
        <p>However, increasing the model size in terms of layers
and hidden state size benefited accuracy to a certain
extent, and removing the Fugen-S is vital. Also, we
found that a vanilla biRNN fares poorly compared to
using a biLSTM. The results show that for the German
compounds, the model falls behind the baselines by a
considerable margin. However, on the Swiss German
compounds, it leads to a large improvement compared
to the baselines (+20 accuracy points).</p>
        <p>For the combination of the unsupervised biLSTM
coupled with the supervised MLP, we took the best
performing biLSTM model (2 layers, 512 hidden size)
to create the inputs for the MLP. We experimented
with different numbers of layers, hidden state sizes,
and epochs for the MLP and report results of the
best performing configuration. On the German data,
this approach is on par with the dictionary baseline.
It also improves performance on the Swiss German
compounds by +11 accuracy points compared to only
using the unsupervised biLSTM.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Models, parameters, and splitting accuracy for different approaches. Trained and evaluated on German compounds (GermaNet). Best results per category are in bold, best overall underlined.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Parameters</th></tr>
            </thead>
            <tbody>
              <tr><td>Dictionary</td><td>check long words first</td></tr>
              <tr><td>Dictionary</td><td>check short words first</td></tr>
              <tr><td>Dictionary</td><td>+favor known words as body</td></tr>
              <tr><td>CharSplit</td><td>include Fugen-S</td></tr>
              <tr><td>CharSplit</td><td>remove Fugen-S</td></tr>
              <tr><td>Unsup. biRNN</td><td>1 layer, 32 hidden size, 1 epoch</td></tr>
              <tr><td>Unsup. biLSTM</td><td>1 layer, 32 hidden size, 1 epoch</td></tr>
              <tr><td>Unsup. biLSTM</td><td>1 layer, 32 hidden size, 1 epoch, keep Fugen-S</td></tr>
              <tr><td>Unsup. biLSTM</td><td>2 layers, 512 hidden size, 1 epoch</td></tr>
              <tr><td>Unsup. biLSTM+MLP</td><td>MLP: 3 layers, 128-64-16 hidden size, 2 epochs</td></tr>
              <tr><td>Seq2seq+attention</td><td>1 layer, 128 hidden size, 3 epochs</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-5">
        <title>Output analysis</title>
        <p>In this section, we qualitatively compare the outputs
and other properties of the different approaches and
discuss (dis)advantages of each.</p>
        <p>In general, we hardly ever encountered system
outputs that put splits at seemingly random positions
within the compounds. The two main errors we
observed are related to the Fugen-S and errors in
choosing the correct split position for compounds consisting
of more than two free morphemes.</p>
        <p>Dictionary-lookup: As this method relies on a
predefined list of known words, it is only able to split
compounds whose head noun is contained
in the dictionary. Hence, a main error cause is
unknown head nouns, which especially affects
performance on the Swiss German data. The method
is robust against the Fugen-S, since it looks for known
words at the compound end and hence never attaches
a Fugen-S to a found head noun. However, it is the
only approach that does not provide any means to
distinguish compounds from non-compounds (i.e. it
provides no measure of confidence for a found
split). For example, the word Fahrzeug (vehicle)
includes the head noun Zeug (thing), but Zeug is not a
hypernym of Fahrzeug, so the word cannot be
split without straying from its meaning; still, the
dictionary approach would perform this split. Since
we only evaluate the splitting methods on known
compounds, this disadvantage does not affect the
accuracy of the approach in the evaluation. In
conclusion, this method works well when train and
test data feature a largely overlapping vocabulary and
the approach is coupled with a method to distinguish
compounds from non-compounds.</p>
        <p>CharSplit: This approach often fails when
encountering the Fugen-S, i.e. it frequently attaches
it to the head noun if the regular expression is not
able to remove it before the split. However, this
model improves performance on out-of-domain data
compared to the dictionary approach, as it relies on
an ngram representation of the training data. This
enables it to better handle vocabulary differences
between training and testing domain compared to the
dictionary baseline. As an unsupervised approach, it
only relies on an unannotated corpus to calculate the
ngram probability distributions.</p>
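<p>A much-simplified sketch of a CharSplit-style scorer is shown below. The actual method of Tuggener (2016) uses ngrams of length 2 to 20 plus middle-of-word statistics and a more elaborate scoring formula; this toy version only illustrates the idea of scoring split positions by how word-initial the head and how word-final the body look.</p>

```python
from collections import Counter

# Toy CharSplit-style scorer (simplified, not the original method):
# count how often ngrams start or end words in a vocabulary, then score
# each candidate split of a compound by start/end plausibility.
def train_counts(vocab, max_n=5):
    starts, ends = Counter(), Counter()
    for w in vocab:
        w = w.lower()
        for n in range(2, min(max_n, len(w)) + 1):
            starts[w[:n]] += 1   # word-initial ngrams
            ends[w[-n:]] += 1    # word-final ngrams
    return starts, ends

def best_split(compound, starts, ends, max_n=5):
    w = compound.lower()
    scores = {}
    for i in range(2, len(w) - 1):          # candidate split positions
        body, head = w[:i], w[i:]
        scores[i] = starts[head[:max_n]] + ends[body[-max_n:]]
    return max(scores, key=scores.get)

starts, ends = train_counts(["autobahn", "bahnhof", "raststätte", "stätte", "rastplatz"])
i = best_split("autobahnraststätte", starts, ends)
print("autobahnraststätte"[:i], "autobahnraststätte"[i:])  # → autobahn raststätte
```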
        <p>Unsupervised biLSTM: Similar to CharSplit, the
approach also often fails by attaching the Fugen-S to the
head instead of the body if removing the Fugen-S
using the regular expression fails. In that regard, the
unsupervised biLSTM output is similar to CharSplit.
However, the latent representation of the character
sequences (compared to the ngram representation in
CharSplit that directly relies on the surface forms)
allows it to better generalize to the out-of-domain
data than CharSplit, yielding better splitting accuracy.
Unsupervised biLSTM + MLP: The combination of
the unsupervised biLSTM and the supervised MLP
yielded the best overall results in our experiments. As one
of the supervised approaches, it does not seem to
struggle with occurrences of the Fugen-S and most
errors stem from splitting compounds with multiple
free morphemes at the wrong free morpheme
boundary.
seq2seq: As the second supervised model, this model
also does not struggle with removing Fugen
elements. The main error cause is thus splitting
compounds with multiple free morphemes at the
incorrect free morpheme boundary, e.g. for the
compound Parkleitsystem (parking guiding system),
it generates the head System instead of Leitsystem.
A unique error source for this model is that it often
generates spelling errors in the (otherwise) correctly
identified heads, e.g. for Neukonstruktion (new
construction), it produces konstroktion as the head.
Clearly, generating the head noun instead of just
finding its starting character seems to overcomplicate
the task in our setting and has an unnecessary impact
on performance.</p>
        <p>When moving to the out-of-domain data, the model
shows a larger performance drop than the biLSTM
coupled with the MLP. One possible reason could be
that the model relies more heavily on full words
during testing, i.e. the input representation is
constructed over the full compound and the attention
mechanism does not focus strongly enough on
relevant character sequences. A seq2seq model that
replaces the input representation with convolution
operations is presented in Gehring et al. (2017). An
interesting experiment would be to evaluate if
character convolutions are the better option for
representing important character sequences than the
biLSTM.</p>
        <p>Finally, we noted that when training the model to
generate both the body and head constituents given
the compound, it often produces the correct
lemmatization of the body by removing Fugen
elements from it (-es, -en, -s, -n, e.g.
bundesforschungsminister → bund,
forschungsminister or landesverwaltung → land,
verwaltung). It thus seems to be the appropriate
candidate for a neural model if identifying the
lemmatized body of a compound is a goal of splitting
compounds.</p>
        <p>Output combination: To gain insight on how
complementary the outputs of the different models
are, we calculated the percentage of compounds that
are split identically by all models, which is 76.63%.
This suggests that there is a substantial difference in
the outputs.</p>
        <p>We also calculated the upper bound splitting
performance for ensembling the models. To do so,
we regarded a compound as split correctly if at least
one of the outputs contained the correct split. This
upper bound achieves an accuracy of 99.56% on the
GermaNet data, which indicates that the model
outputs are sufficiently different to be combined in an
ensemble system.</p>
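<p>The oracle upper bound described above can be sketched as follows: a compound counts as correctly split if at least one system's output matches the gold split. The split notation with "|" is illustrative.</p>

```python
# Sketch of the oracle "upper bound" ensemble evaluation: a compound is
# counted correct if any system output matches the gold split.
def oracle_accuracy(gold, system_outputs):
    correct = sum(
        any(out[i] == g for out in system_outputs)
        for i, g in enumerate(gold)
    )
    return correct / len(gold)

gold  = ["Auto|bahn", "Rast|stätte", "Install|ation"]
sys_a = ["Auto|bahn", "Ras|tstätte", "Install|ation"]
sys_b = ["Autob|ahn", "Rast|stätte", "Installa|tion"]
print(oracle_accuracy(gold, [sys_a, sys_b]))  # → 1.0
```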
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We evaluated three common neural sequence models
for the task of splitting (Swiss) German compounds
and compared them to ngram- and dictionary-based
baselines. We found that for in-domain data
(German compounds), the neural sequence models were
not able to outperform the baselines, but that for
out-of-domain data (Swiss German), they achieved vastly
better accuracy in identifying splitting positions. We
hypothesize that the latent representations of
character sequences in the neural models (in the form of
vectors) allow them to process similar character
sequences in a similar way. Thus, when applying
models trained on German data to Swiss German data, the
models are able to resolve the differences between the
languages: slightly modified but similar and
corresponding ngram sequences lead to similar
hidden representations, which in turn yield similar
outputs (i.e. splitting positions), where ngram- or
dictionary-based approaches, which directly rely on surface forms,
fail. Future work in the direction of Alfonseca et al.
(2008) will have to determine whether approaches based
on neural sequence models are applicable to other
language groups with similar spelling.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Enrique</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          , Slaven Bilac, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Pharies</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Decompounding query keywords from compounding languages</article-title>
          .
          <source>In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. Association for Computational Linguistics</source>
          , pages
          <fpage>253</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409</source>
          .
          <fpage>0473</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          .
          <article-title>Association for Computational Linguistics</article-title>
          , Doha, Qatar, pages
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Carla</given-names>
            <surname>Parra Escartín</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Chasing the perfect splitter: A comparison of different compound splitting tools</article-title>
          .
          <source>In LREC</source>
          . pages
          <fpage>3340</fpage>
          -
          <lpage>3347</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jonas</given-names>
            <surname>Gehring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Auli</surname>
          </string-name>
          , David Grangier,
          <string-name>
            <given-names>Denis</given-names>
            <surname>Yarats</surname>
          </string-name>
          , and
          <string-name>
            <surname>Yann N Dauphin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Convolutional sequence to sequence learning</article-title>
          .
          <source>ArXiv</source>
          e-prints .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Grubenmann</surname>
          </string-name>
          , Don Tuggener, Pius von Dniken, Jan Deriu, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SB-CH: A Swiss German corpus with sentiment annotations</article-title>
          .
          <source>In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018)</source>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Kenneth</given-names>
            <surname>Heafield</surname>
          </string-name>
          , Ivan Pouzyrevsky,
          <string-name>
            <given-names>Jonathan H.</given-names>
            <surname>Clark</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Koehn</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Scalable modified Kneser-Ney language model estimation</article-title>
          .
          <source>In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics</source>
          . Sofia, Bulgaria, pages
          <fpage>690</fpage>
          -
          <lpage>696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Verena</given-names>
            <surname>Henrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Hinrichs</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>GernEdiT - the GermaNet editing tool</article-title>
          .
          <source>In Proceedings of the ACL 2010 System Demonstrations</source>
          . pages
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Verena</given-names>
            <surname>Henrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Erhard</given-names>
            <surname>Hinrichs</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Determining immediate constituents of compounds in GermaNet</article-title>
          .
          <source>In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011</source>
          . pages
          <fpage>420</fpage>
          -
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Katharina</given-names>
            <surname>Kann</surname>
          </string-name>
          , Ryan Cotterell, and Hinrich Schütze.
          <year>2016</year>
          .
          <article-title>Neural morphological analysis: Encoding-decoding canonical segments</article-title>
          .
          <source>In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          . pages
          <fpage>961</fpage>
          -
          <lpage>967</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Koehn</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Knight</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Empirical methods for compound splitting</article-title>
          .
          <source>In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics - Volume 1</source>
          . Association for Computational Linguistics, pages
          <fpage>187</fpage>
          -
          <lpage>193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Klaus</given-names>
            <surname>Macherey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew M.</given-names>
            <surname>Dai</surname>
          </string-name>
          , David Talbot, Ashok C. Popat, and
          <string-name>
            <given-names>Franz</given-names>
            <surname>Och</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Language-independent compound splitting with morphological operations</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>
          . Association for Computational Linguistics, pages
          <fpage>1395</fpage>
          -
          <lpage>1404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Tomáš</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Martin Karafiát, Lukáš Burget, Jan Černocký, and
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Recurrent neural network based language model</article-title>
          .
          <source>In Eleventh Annual Conference of the International Speech Communication Association.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Riedl</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Unsupervised compound splitting with distributional semantics rivals supervised methods</article-title>
          .
          <source>In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pages
          <fpage>617</fpage>
          -
          <lpage>622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Schmid</surname>
          </string-name>
          , Arne Fitschen, and
          <string-name>
            <given-names>Ulrich</given-names>
            <surname>Heid</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>SMOR: A German computational morphology covering derivation, composition and inflection</article-title>
          .
          <source>In LREC</source>
          . Lisbon, pages
          <fpage>1263</fpage>
          -
          <lpage>1266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Sabine</given-names>
            <surname>Schulte im Walde</surname>
          </string-name>
          ,
          Anna Hätty, and Stefan Bott
          .
          <year>2016</year>
          .
          <article-title>The role of modifier and head properties in predicting the compositionality of English and German noun-noun compounds: A vector-space perspective</article-title>
          .
          <source>In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics</source>
          . pages
          <fpage>148</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , Oriol Vinyals, and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          . pages
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Don</given-names>
            <surname>Tuggener</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Incremental Coreference Resolution for German</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Zurich.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Marion</given-names>
            <surname>Weller-Di Marco</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Simple compound splitting for German</article-title>
          .
          <source>In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)</source>
          . pages
          <fpage>161</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Ziering</surname>
          </string-name>
          , Stefan Müller, and Lonneke van der Plas.
          <year>2016</year>
          .
          <article-title>Top a splitter: Using distributional semantics for improving compound splitting</article-title>
          .
          <source>In Proceedings of the 12th Workshop on Multiword Expressions</source>
          . pages
          <fpage>50</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Ziering</surname>
          </string-name>
          and Lonneke van der Plas.
          <year>2016</year>
          .
          <article-title>Towards unsupervised and language-independent compound splitting using inflectional morphological transformations</article-title>
          .
          <source>In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pages
          <fpage>644</fpage>
          -
          <lpage>653</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>