<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cutter - a Universal Multilingual Tokenizer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johannes Graën</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mara Bertamini</string-name>
          <email>bertaminimara@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Volk</string-name>
          <email>volk@cl.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (Swiss- Text 2018)</institution>
          ,
          <addr-line>Winterthur</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computational Linguistics University of Zurich</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Tokenization is the process of splitting running text into minimal meaningful units. In writing systems where a space character is used for word separation, this blank character typically acts as a token boundary. A simple tokenizer that only splits texts at space characters already achieves a notable accuracy, although it misses unmarked token boundaries and erroneously splits tokens that contain space characters. Moreover, different languages use the same characters for different purposes. Tokenization is thus a language-specific task (with code-switching being a particular challenge). Extra-linguistic tokens, however, are similar across many languages. These tokens include numbers, XML elements, email addresses and identifiers of concepts that are idiosyncratic to particular text variants (e.g., patent numbers).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title/>
<p>We present a framework for tokenization that
makes use of language-specific and
language-independent token identification rules. These
rules are stacked and applied recursively,
yielding a complete trace of the tokenization
process in the form of a tree structure. Rules
are easily adaptable to different languages and
text types. Unit tests reliably detect if new
token identification rules conflict with existing
ones and thus assure consistent tokenization
when extending the rule sets.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>Common wisdom has it that tokenization is a solved
problem. Yet, in practice, we often find ourselves in
bothersome trials of adapting tokenizers or their
output. This may be due to the fact that “down-stream”
processing tools require a different tokenization. Or it
may be because of special tokenization needs for
particular domains, genres or historical text variants.</p>
      <p>As an example of different tokenization needs,
consider splits of English negations in contracted forms
like didn’t and won’t . The Penn Treebank guidelines
suggest tokenizing those as did + n’t and wo + n’t .
Such splits are practical for information extraction or
sentiment analysis. But, of course, these splits make
searching a corpus (e.g. for linguistic investigations)
for negated forms unintuitive. Searches must then be
supported by a specific module that undoes the splits.</p>
      <p>Another example is the English phrase
a 12-ft boat . How shall we handle the
hyphenated length expression? Is this one or two or even
three tokens? We follow the rule that measurement
units are split from numerical values. This rule is
meant for altitude or speed and says that the number
is split from the unit (e.g. 2850m → 2850 , m ;
155km/h → 155 , km/h ). Following this rule, we
decided to also split the hyphenated length expression
into two tokens resulting in: a , 12 , -ft , boat . Once
identified as such, we can, of course, keep numerical
values and measurement units as single tokens, if
required by the following processing step.</p>
      <p>We work on the annotation of large multilingual
corpora, some of them diachronic for the last 150
years. In our work such tokenization issues abound.
We have therefore developed tokenization guidelines
which started out as check-lists for the various
language versions of our corpora. We then realized that
only a custom-built tokenizer with systematic tests
included will serve our purposes of high-quality
tokenization.</p>
      <p>Our tokenization approach does not include
normalization, which we see as a separate step
involving coding issues (like turning ligatures into letter
sequences, or certain spaces into non-breakable spaces)
or other simplifications (like turning American into
British English spelling, or Swiss German into
Standard German spelling).</p>
      <p>In this paper, we first describe existing tokenization
approaches and show that there is a need for
tokenizers that can be adapted to particular language and text
variants (Section 2). We then show why tokenization
is a challenging task by giving examples of ambiguous
cases. We argue that a tokenizer needs to possess
linguistic information and to consider long-distance
relations to be able to decide those cases (Section 3).</p>
      <p>Having outlined the problem, we describe our
tokenization approach (Section 4), and how we employ
unit testing to warrant high-quality tokenization while
allowing for the adaptation of the tokenizer
(Section 5). Finally, we show that there are cases where
tokenization decisions require commonsense
knowledge and which our tokenizer is not capable of handling
(Section 7). Future development (Section 8) will need
to involve syntactic parsing to solve those hard cases.</p>
    </sec>
    <sec id="sec-3">
      <title>2 Related Work</title>
      <p>
        The Stanford Tokenizer
        <xref ref-type="bibr" rid="ref10">(Manning et al., 2018)</xref>
        is
probably the most widely used tokenizer for English. It is
built on the basis of the tokenization rules in the Penn
Treebank.1 Following the Penn tokenization gets us a
long way for English, but it is not explicit enough to
address issues such as the hyphenated length
expression above, a 12-ft boat .
      </p>
<p>Its strengths are its speed and its numerous options
concerning the treatment of special symbols
(parentheses, ampersand, currency symbols and fractions).
In contrast, our tokenizer is highly modular and
adaptable to categories of texts that we did not consider
when compiling our guidelines. It also allows for a
combination of rule sets from different languages to
process texts with quotations or code switching, for
instance.</p>
      <p>
        <xref ref-type="bibr" rid="ref6">He and Kayaalp (2006)</xref>
        compare various tokenizers for the biomedical domain. Their results point to the need for standard tokenizers in order to ensure the interoperability of processing tools.
        <xref ref-type="bibr" rid="ref1">Cruz Díaz and Maña López (2015)</xref>
        follow up with an analysis of more recent tokenizers, also for the biomedical domain. They observe disagreement to a large extent between the tokenization decisions of those tools for the test cases they had identified beforehand. That observation is in agreement with
        <xref ref-type="bibr" rid="ref4">Habert et al. (1998)</xref>
        , who concluded more than 15 years earlier: “At the moment, tokenizers represent black boxes, the behavior and rationale of which are not made clear.”
      </p>
      <p>1 ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html</p>
      <p>
        Apart from rule-based tokenization, there are
machine learning approaches to tokenization as well. For
those approaches, a certain amount of training
material (i.e., both original and tokenized versions of the
same texts) is required.
        <xref ref-type="bibr" rid="ref8">Jurish and Würzner (2013)</xref>
        argue that sufficient training material could be extracted
from “treebanks or multi-lingual corpora”.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3 Tokenization Challenges</title>
      <p>Although the only decision to be taken by the tokenizer
is whether or not to place a token boundary between
any two adjacent characters, this task is not as
trivial as it seems at first glance. If two adjacent
characters are both letters, they typically belong to the same
token. English negations in contracted forms (like
didn’t ) as described above are one exception.</p>
      <p>A non-letter character (e.g., a punctuation mark)
followed by a letter frequently marks the boundary of
two tokens, while the opposite case (a letter
character followed by a non-letter character) does not show
a general preference; the right decision in these cases
often requires resorting to linguistic knowledge. We
cannot, for instance, decide if baby’s is one token or
two without knowing whether a text is written in
English ( ’s is a possessive marker of baby ) or Dutch
( baby’s means babies).</p>
<p>Apart from knowing a text’s language, which
includes knowledge of word formation and grammar,
sometimes long-distance relations between tokens that
belong together, such as brackets or quotation marks,
have to be determined in order to take the right
decision. An apostrophe following a German word
that phonetically ends in /s/ can be either a possessive
marker or the end of a single-quoted expression. If we
find another apostrophe in the same sentence
preceding the ambiguous one, such that it is immediately
followed by a letter character and preceded by a space
character, we have evidence for the quoted expression
and consequently mark both apostrophes as single
tokens (see Figure 1).2</p>
      <p>[Figure 1: tokenization tree for a German sentence containing the single-quoted expressions ' Kulturstadt Europas ' and ' Ei des Kolumbus ']</p>
<p>In many languages, both sentences and
abbreviations typically end with a period. While the
sentence-final period is a single token, abbreviations comprise
the period. To distinguish the two cases, we either need
to know all abbreviations of the language in question,
or we need a reliable way of determining sentence
boundaries.</p>
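<p>A minimal illustration of the abbreviation-list approach described above (the list here is hypothetical and tiny; a real list has many entries per language):</p>

```python
# Hypothetical, tiny abbreviation list for illustration only.
ABBREVIATIONS = {"e.g.", "etc.", "Dr.", "Abt.", "Art."}

def classify_period(word):
    """Decide whether a trailing period belongs to the token or stands alone."""
    if word in ABBREVIATIONS:
        return [word]               # abbreviation comprises the period
    if word.endswith("."):
        return [word[:-1], "."]     # sentence-final period is its own token
    return [word]
```

<p>This sketch sidesteps the harder alternative, reliable sentence boundary detection, which the paragraph above names as the other option.</p>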
    </sec>
    <sec id="sec-5">
      <title>4 The Cutter Implementation</title>
      <p>Simple tokenizers process text as a stream of
characters from left to right and take locally justified
decisions of whether to place a token boundary between
two adjacent characters. This approach is limited as
it is not capable of taking long-distance relations into
account.</p>
      <p>Our approach is to successively identify tokens
following an ordered list of patterns defined by advanced
regular expressions.3 Once identified, we ‘cut out’
the token (hence the name Cutter) and proceed by
applying the same patterns to the remaining parts, until
only empty character sequences remain. This
procedure generates a tree structure like the ones in Figure 1
and Figure 2.</p>
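<p>As a rough, hypothetical sketch (not Cutter’s actual code), the recursive cut-out procedure with an ordered rule list can be illustrated as follows; the rule names and patterns are invented for illustration:</p>

```python
import re

# Hypothetical, highly simplified rule list: each rule is (tag, pattern).
# More specific rules come first, so their matches are protected from the
# general space-splitting rule below.
RULES = [
    # protect the French lexicalized form aujourd'hui before general rules
    ("fr_lexicalized", re.compile(r"^(?P<left>.*?)(?P<token>aujourd'hui)(?P<right>.*)$")),
    # general rule: whitespace separates tokens
    ("space", re.compile(r"^(?P<left>.*?)(?P<token>\S+)(?P<right>\s.*)$")),
]

def cut(text):
    """Cut out the first matching token and recurse on the remainders."""
    text = text.strip()
    if not text:
        return []
    for tag, pattern in RULES:
        m = pattern.match(text)
        if m:
            return cut(m.group("left")) + [(tag, m.group("token"))] + cut(m.group("right"))
    return [("rest", text)]  # no rule applies: the remainder is one token
```

<p>Recording the recursive calls instead of flattening them would yield the tree-shaped trace the paper describes.</p>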
<p>The order of patterns that describe tokens and their
respective context is chosen such that the more
detailed or exceptional tokens are identified first,
followed by more common and standard tokens. That
way, sequences of characters that would otherwise be
split into several tokens can be protected by an
earlier match, which prevents that sequence from further
processing.</p>
      <p>2This is only necessary if the single typewriter apostrophe is
used instead of the proper left and right single quotation marks.</p>
      <p>
        3We use the so-called Perl Compatible Regular Expressions (PCRE) by
        <xref ref-type="bibr" rid="ref5">Hazel (1997)</xref>
        , including adjuvant features such as Unicode character properties
        <xref ref-type="bibr" rid="ref11">(The Unicode Consortium, 2017)</xref>
        , named capturing subpatterns and subpattern assertions.
      </p>
      <p>
        Tokens that contain spaces, for instance, need to be
matched by a pattern that prevents them from being
split by the general rule which mandates that spaces
(and other white space characters) are token
separators. For ease of reading, numbers are often separated
into groups of three digits. The international standard
for “quantities and units” stipulates the use of a small
space as separator
        <xref ref-type="bibr" rid="ref7">(ISO 80000-1, 2009, Section 7.3.1)</xref>
        ,
which is often realized as a standard space in electronic
texts. We identify numbers formatted in this way (e.g.,
50 000 ) as single tokens.
      </p>
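<p>A minimal sketch of such a protecting pattern, written in Python’s re syntax rather than the PCRE used by Cutter (the pattern itself is our illustration, not Cutter’s actual rule):</p>

```python
import re

# Match numbers grouped in threes by a space (or thin/no-break space),
# e.g. "50 000", so they can be protected before the general space rule.
GROUPED_NUMBER = re.compile(r"(?<!\S)\d{1,3}(?:[ \u202f\u00a0]\d{3})+(?!\S)")

text = "The budget grew from 50 000 to 1 250 000 euros."
protected = GROUPED_NUMBER.findall(text)  # ['50 000', '1 250 000']
```

<p>The lookarounds ensure that the grouped number is a whole token, not part of a longer digit sequence.</p>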
<p>Another example of tokens that need to be
protected are French words originally composed of
more than one lexical unit that nowadays form
a single lexical unit and should thus be
recognized as a single token. In the example shown in
Figure 2, aujourd’hui ‘today’ is identified as a
token in the first step, leaving On nous dit qu’ and
c’est le cas, encore faudra-t-il l’évaluer. as
remainders, which are subsequently further tokenized. To be
able to distinguish lexicalized forms (e.g., d’accord
→ d’accord ) from regular elision of vowels (e.g.,
d’accorder → d’ , accorder ), we need to incorporate
all lexicalized forms (e.g., entr’ouvert , c’est-à-dire ,
presqu’île ) into the patterns that constitute our
(tokenization) language model.</p>
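<p>The distinction between lexicalized forms and regular elision can be sketched as follows (hypothetical, abbreviated list and pattern; Cutter’s actual rules are PCRE-based and far more complete):</p>

```python
import re

# Abbreviated, hypothetical list of lexicalized forms to protect.
LEXICALIZED = {"aujourd'hui", "d'accord", "presqu'île", "c'est-à-dire", "entr'ouvert"}
# General elision rule: a single letter (or qu) plus apostrophe is split off.
ELISION = re.compile(r"^(?P<elided>(?:qu|[cdjlmnst])')(?P<rest>\w.*)$", re.IGNORECASE)

def split_elision(word):
    if word.lower() in LEXICALIZED:
        return [word]                                # protected: one token
    m = ELISION.match(word)
    if m:
        return [m.group("elided"), m.group("rest")]  # d'accorder -> d', accorder
    return [word]
```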
<p>In addition to the linguistic information encoded in
patterns, our tokenizer uses two word lists per
language. The first one contains abbreviations, in order to mark
their occurrences in the text as single tokens. The
second one consists of sentence-initial words, that is,
words that do not start with a capital letter except in
sentence-initial position, such as prepositions or
determiners. When we locate a word from this list in the
text, we mark it as a potential sentence starter, which in
particular contexts leads to a special empty token that
marks a sentence boundary. Sentences can be split at
those markers if no prior sentence segmentation has
been performed.4</p>
      <p>[Figure 2: tokenization tree for the French sentence On nous dit qu’aujourd’hui c’est le cas, encore faudra-t-il l’évaluer.]</p>
      <p>
        Pattern identification rules are composed of a list
of named capturing subpatterns
        <xref ref-type="bibr" rid="ref5">(see Hazel, 1997)</xref>
        that
extend to the whole text provided (i.e., they are
anchored both at the beginning and at the end of the text).
We distinguish between the actual token or tokens that
are identified by a rule (e.g., , in Figure 2) and the
remainders that await further processing ( est le cas and
encore faudra in Figure 2). A typical rule consists of
a left part, the actual token and a right part (see root
node in Figure 2), though it can also identify more
tokens (see root node in Figure 1).
      </p>
      <p>Some examples: Date expressions, for instance,
typically consist of a number of tokens (e.g.,
12. , und , 13. , Juni , 2018 ); pronouns in some
Romance languages can be concatenated (e.g.,
No vull posar-n’hi. → No vull posar , -n’ , hi ,
. ; Sim, dir-lhes-ia isso. → Sim, dir , -lhes , -ia ,
isso. ). Each identified pattern is assigned a tag,
which marks the corresponding language (if the rule
is language-dependent), the rule name, a running
letter (if there is more than one rule for the same
target) and a running number (if a rule identifies more
than one token).</p>
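<p>For illustration, a rule that identifies several tokens at once might look like the following named-group pattern, anchored at both ends of the input; the rule name and pattern are hypothetical, not Cutter’s actual rule:</p>

```python
import re

# Hypothetical rule "ca_pron_a": Catalan concatenated pronouns, identifying
# three tokens at once (posar, -n', hi) in a single anchored match.
CA_PRON_A = re.compile(
    r"^(?P<left>.*?\s)?"                               # remainder awaiting further processing
    r"(?P<token1>\w+)(?P<token2>-n')(?P<token3>hi)"    # the identified tokens
    r"(?P<right>\W.*)?$"                               # remainder awaiting further processing
)

m = CA_PRON_A.match("No vull posar-n'hi.")
```

<p>The left and right remainders would then be fed back into the rule cascade, while the three token groups are emitted with their tags.</p>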
      <p>
        4If a sufficiently large corpus exists for the language and text
variant in question, methods to learn sentence boundaries from the
corpus such as
        <xref ref-type="bibr" rid="ref6 ref9">(Kiss and Strunk, 2006)</xref>
        may perform better.
      </p>
<p>We envisage that our Cutter applies the
language-independent rules together with the rules for a
particular language to a text whose language is known and
uniform. Beyond that, rule sets of different languages
can be combined in case of code switching. Various
rule sets for the same language (e.g., for different text
variants) can also be combined individually.</p>
    </sec>
    <sec id="sec-6">
      <title>5 Unit Testing</title>
      <p>The architecture of our tokenizer has a modular
design to facilitate its adaptation to different user needs.
Towards this goal, we need to check that new rules do
not interfere with existing ones. Following the method
of unit testing, widely used in software development,
we collect text snippets for each nontrivial
tokenization problem. We then provide information on the correct
tokenization of those snippets according to our
guidelines. To make a unit test pass, our tokenizer needs to
perform tokenization exactly as indicated.</p>
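<p>Such a unit test can be sketched as plain assertions over snippet/tokenization pairs. The mini tokenizer below is a stand-in implementing only the three rules exercised here, not Cutter itself:</p>

```python
import re

# Stand-in tokenizer covering just three rules: grouped numbers stay whole,
# letters/hyphens are split from preceding digits, spaces separate tokens.
def tokenize(text):
    out = []
    for chunk in re.findall(r"\d{1,3}(?: \d{3})+|\S+", text):
        if re.fullmatch(r"\d{1,3}(?: \d{3})+", chunk):
            out.append(chunk)  # protected grouped number, e.g. "50 000"
        else:
            out.extend(re.split(r"(?<=\d)(?=[A-Za-z-])", chunk))
    return out

# Each unit test pins down the exact tokenization of one snippet:
def test_tokenization():
    assert tokenize("2850m") == ["2850", "m"]
    assert tokenize("a 12-ft boat") == ["a", "12", "-ft", "boat"]
    assert tokenize("50 000") == ["50 000"]

test_tokenization()
```

<p>If a rule change makes any assertion fail, the offending snippet points directly at the conflicting pattern.</p>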
      <p>If a unit test fails, the error can be either in the test or
in the rules. A test error can be based on contradictory
or unachievable tokenization guidelines (i.e.,
requiring commonsense knowledge) or an incomplete
manual tokenization (e.g., the annotator missed a comma).
A rule error typically results from a rule being too
restrictive or another rule being too general and thus
erroneously matching the unit test in question.</p>
<p>In both cases, iterative improvement of tests or rules
(or both) finally leads to a configuration where all tests
pass, which is the objective of unit testing and, in our
case, guarantees that a deterioration of tokenization
quality caused by language model changes is
detected immediately and can be traced back to a
particular change.</p>
    </sec>
    <sec id="sec-7">
      <title>6 Evaluation</title>
      <p>
        Aside from the implicit evaluation of our unit tests,
all of which we require to pass, we tested the
performance of our tokenizer on gold-tokenized texts. Such
gold-standard texts, together with their original
(untokenized) versions, are not easy to obtain. Corpora are
typically available as raw texts, while treebanks typically
feature the manually determined tokens, but not the
original, untokenized material.
        <xref ref-type="bibr" rid="ref8">Jurish and Würzner
(2013)</xref>
        approach this problem with de-tokenization
rules and a manual correction in particular cases.
      </p>
      <p>
        We built a test corpus based on the SMULTRON
treebanks
        <xref ref-type="bibr" rid="ref12 ref3">(Gustafson-Capková et al., 2007; Volk et
al., 2010)</xref>
        , for which we have the original texts.
Using those treebanks for comparison with the output of
our tokenizer is problematic, however, since the
tokenization guidelines that our tokenizer implements
originate, among others, from the experiences in the
creation of these very treebanks. Notwithstanding
the expected bias, we select those sentences from the
treebanks which we can identify in the original texts
(1528 unique sentences in total) by simply ignoring
any whitespace characters.5 This selection comprises
1173 sample sentences in four languages: English
(67), German (388), Spanish (184) and Swedish (534).
      </p>
      <p>An initial tokenization with Cutter yields 59
sentences with errors (0.5 %).6 In more than half of these
cases, the tokenization in the treebank deviates from
the pattern stipulated by the tokenization guidelines.
Another frequent issue is that the textual source does
not correspond to the text in the original document,
which comprises missing or superfluous whitespaces
and the representation of images as characters (see
Figure 3).</p>
      <p>We correct apparent anomalies in the input
sentences and remove sentences that cannot regularly be
represented as text, which are all key symbols, such
as the one shown in Figure 3. That way, we obtain a
small tokenization gold standard. It comprises 1165
sentences in the aforementioned four languages.</p>
      <p>When we tokenize the tests obtained from the gold
standard sample sentences with our tokenizer, we still
see an error rate of 1 %. By adjusting the existing rules
to include borderline cases (e.g., including the ± sign
into the definition of numbers), we could make all tests
pass. The error rate of two other popular tokenizers,
the ones in the NLTK7 and the Spacy8 NLP toolkits,
is at approximately 12 %.9 The comparatively high
error rate is due to both real tokenization errors, such as
splitting URLs, XML tags and ordinal numbers in
German,10 and, of course, debatable tokenization rules.
Should the German adjective 100%ige be left as one
token (Spacy), or be split into two tokens ( 100 , %ige ;
Cutter) or three tokens ( 100 , % , ige ; NLTK)?</p>
    </sec>
    <sec id="sec-8">
      <title>7 Features and Limitations</title>
      <p>
        Rule-based approaches in natural language
processing have widely been replaced by machine learning
approaches, since the latter are capable of handling
unanticipated situations by abstraction from observed
patterns. For tokenization of standard texts in
well-resourced languages, machine learning approaches
such as
        <xref ref-type="bibr" rid="ref8">(Jurish and Würzner, 2013)</xref>
        might have enough
data from which to learn those patterns. For
particular text categories and low-resourced languages,
however, providing the algorithm with sufficient training
data will require a substantial effort.
      </p>
<p>In our work with corpora in several languages, the
best approach turned out to be an iterative one. For
a new language, we start with an empty rule set (in
addition to the language-independent rules) and apply
it to the untokenized texts. We subsequently
generate unit tests from the errors that surfaced in manual
inspection, which we then address by defining
corresponding patterns (also consulting grammar books and
treebanks, if available). A few iterations of this
procedure lead to a collection of rules and tests, and lower
the tokenization error rate considerably.</p>
      <p>7https://www.nltk.org/</p>
      <p>8https://spacy.io/</p>
      <p>9The Spacy tokenizer has only been evaluated on English,
German and Spanish sentences as it has no model for Swedish.</p>
      <p>10The Spacy tokenizer also consistently splits compound
adjectives and nouns in English (e.g., low-cost, medium-voltage,
break-even), while the NLTK tokenizer alters all quotation marks.</p>
<p>For languages that have treebanks or
sentence-segmented corpora, we can automatically extract
sentence-initial words and add them to our list, if they
do not appear with a capital letter in other positions;
we might also want to filter for closed word classes
(i.e., prepositions, pronouns, etc.) here. If no resource
for gathering abbreviations is available, we need to
search for abbreviations in the given texts.</p>
      <p>In contrast to machine learning approaches,
erroneous tokenization decisions in our system can always
be traced back to a particular pattern, which facilitates
a quick remedy. The language model, however, is
inevitably incomplete and requires testing and
adaptation ahead of its application to new text variants.</p>
      <p>As mentioned above, some tokenization decisions
require a deeper understanding than what a sequence
of characters can provide. This is, for instance, the
case when abbreviations (without the period) coincide
with another word. We are only aware of German
examples such as Abt. for Abteilung ‘department’ vs.
Abt ‘abbot’ or Art. for Artikel ‘article’ vs. Art ‘kind’.
Even if we address this problem by excluding those
words from the abbreviation list and match them with a
dedicated rule that expects a succeeding number (e.g.,
in Abt. 3 ‘in department 3’; nach Art. 25
‘pursuant to article 25’), we can still come up with cases
that cannot be solved without dictionary lookups or
parsing. Compare, for instance:
1. Wir trafen den Abt. Bergbahnen sind seine
Leidenschaft.
‘We met the abbot. Mountain railways are his
passion.’
2. Wir sahen den Sprecher der Abt. Bergbahnen und
Wanderwege.
‘We saw the spokesman of the dept. of mountain
railways and hiking trails.’</p>
<p>Splitting undirected quotation marks (") results in
an information loss.11 After tokenization, it can no
longer be inferred whether such a quotation mark
signals the beginning or the end of a quotation. A more
careful tokenizer needs to preserve the information
whether the quotation mark was split from the
previous or from the following word. Our tokenizer
provides the option to alternatively return or suppress
white space tokens.12</p>
      <p>11The same is true for typewriter apostrophes (') as a
replacement of matching single quotation marks.</p>
    </sec>
    <sec id="sec-9">
      <title>8 Conclusions and Future Development</title>
<p>To resolve ambiguous cases, we propose to extend
the shallow processing of the tokenizer with a syntactic
parser that selects the more likely tokenization. To this
end, tokenization has to be performed several times
with alternative rules. Parsing likelihood as decision
maker is only required if different results are obtained.
For low-resourced languages where no parser exists,
a heuristic based on the identification of finite verb
forms could suffice.</p>
      <p>Rules are currently organized in sets, one for each
language and one for language-independent rules.
Each set comprises different stages, which are used to
interconnect different sets. Corresponding quotation
marks, for instance, need to be identified before any
token in between them splits the sentence into smaller
parts. Language-specific date expressions (e.g., with
ordinal numbers expressed as digits plus a period)
need to be processed before the language-independent
identification of numbers takes place.</p>
      <p>We think that instead of a limited list of stages, a
more dynamic data structure would be beneficial. We
already know which rules interfere (e.g., numbers with
spaces vs. spaces as separators), but this is not
explicitly reflected in the data. If we were to reorganize the
tokenization rule sets by means of a “before” relation
between pairs of rules, we could build a rule
dependency graph, which, serialized, would define the order
of rules to apply. In case of code-switching sentences,
the order of languages given would be decisive if no
order is enforced by that graph.</p>
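<p>Assuming the proposed “before” relation is stored as a predecessor mapping, serializing it is a standard topological sort; the rule names below are invented for illustration:</p>

```python
from graphlib import TopologicalSorter

# Hypothetical "before" relation: each rule maps to the set of rules
# that must be applied before it.
before = {
    "space_separator": {"grouped_number"},  # grouped numbers precede space splitting
    "number": {"date_expression"},          # digit+period dates precede plain numbers
    "token_split": {"quotation_pair"},      # paired quotes precede inner splits
}

# A serialized order of all rules that respects every "before" edge:
order = list(TopologicalSorter(before).static_order())
```

<p>Rules not related by any edge keep a free relative order, which is where the order of languages given could break ties in code-switching sentences.</p>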
      <p>We have used the tokenizer ourselves in a number
of projects. It supports several European languages,
including Romansh as a low-resourced language, and
more languages are in preparation. The tokenizer and
all our language models are freely available.13 We also
provide a web demo and a tokenization web service.</p>
      <p>12A sample sentence with whitespace tokens looks like this:
Suot , , il , , titel , , " , vacanzas , , e , , cultura , " ,
, as , , prouva , , d’ , eruir , , la , , funcziun , , da , ,
la , , lingua , , e , , cultura , , rumauntscha , , per , ,
il , , turissem , , i , ’l , , Grischun , .</p>
      <p>13http://pub.cl.uzh.ch/purl/cutter</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>This research was supported by the Swiss National
Science Foundation under grant 105215_146781/1
through the project “SPARCLING – Large-scale
Annotation and Alignment of Parallel Corpora for the
Investigation of Linguistic Variation”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Cruz</given-names>
            <surname>Díaz</surname>
          </string-name>
          ,
          <source>Noa Patricia and Manuel Jesús Maña López (Sept</source>
          .
          <year>2015</year>
          ).
          <article-title>“An Analysis of Biomedical Tokenization: Problems and Strategies”</article-title>
          .
          <source>In: Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis</source>
          . Lisbon, Portugal: Association for Computational Linguistics, pp.
          <fpage>40</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Graën</surname>
          </string-name>
          ,
          <string-name>
            <surname>Johannes</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>“Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning”</article-title>
          .
          <source>PhD thesis</source>
          . University of Zurich.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Gustafson-Capková</surname>
          </string-name>
          , Sofia, Yvonne Samuelsson, and
          <string-name>
            <surname>Martin Volk</surname>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>SMULTRON - The Stockholm MULtilingual parallel Treebank</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Habert</surname>
            , Benoit, Gilles Adda,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Adda-Decker</surname>
            , P. Boula de Marëuil,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ferrari</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Ferret</surname>
            , G. Illouz, and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Paroubek</surname>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>“Towards tokenization evaluation”</article-title>
          .
          <source>In: Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC)</source>
          . Vol.
          <volume>98</volume>
          , pp.
          <fpage>427</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Hazel</surname>
          </string-name>
          ,
          <string-name>
            <surname>Philip</surname>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>PCRE (Perl-compatible regular expressions)</article-title>
          . URL: https://www.pcre.org/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
          </string-name>
          , Ying and Mehmet
          <string-name>
            <surname>Kayaalp</surname>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>A Comparison of 13 Tokenizers on MEDLINE</article-title>
          .
          <source>Tech. rep. U.S. National Library of Medicine, Lister Hill National Center for Biomedical Communications.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>ISO</surname>
          </string-name>
          80000-
          <issue>1</issue>
          (
          <issue>Nov</issue>
          .
          <year>2009</year>
          ).
          <source>ISO 80000-1: Quantities and units - Part</source>
          <volume>1</volume>
          : General. Ed.
          <source>by ISO/TC 12 Technical Committee.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Jurish</surname>
          </string-name>
          , Bryan and
          <string-name>
            <surname>Kay-Michael Würzner</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>“Word and Sentence Tokenization with Hidden Markov Models”</article-title>
          .
          <source>In: Journal for Language Technology and Computational Linguistics 28.2</source>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Kiss</surname>
          </string-name>
          , Tibor and Jan
          <string-name>
            <surname>Strunk</surname>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>“Unsupervised Multilingual Sentence Boundary Detection”</article-title>
          .
          <source>In: Computational Linguistics 32.4</source>
          , pp.
          <fpage>485</fpage>
          -
          <lpage>525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , Christopher, Tim Grow, Teg Grenager,
          <source>Jenny Finkel, and John Bauer (Aug</source>
          .
          <year>2018</year>
          ). Stanford Tokenizer. URL: http://nlp.stanford.edu/software/tokenizer.shtml.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>The</given-names>
            <surname>Unicode Consortium</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <source>The Unicode Standard, Version</source>
          <volume>10</volume>
          .0.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Volk</surname>
            , Martin,
            <given-names>Anne Göhring</given-names>
          </string-name>
          , Torsten Marek, and
          <string-name>
            <surname>Yvonne Samuelsson</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <source>SMULTRON (version 3</source>
          .0)
          <article-title>- the Stockholm MULtilingual parallel TReebank</article-title>
          .
          <article-title>An English-French-German-Spanish-Swedish parallel treebank with sub-sentential alignments</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>