Cutter – a Universal Multilingual Tokenizer

Johannes Graën, Mara Bertamini, Martin Volk
Institute of Computational Linguistics, University of Zurich
graen@cl.uzh.ch, bertaminimara@gmail.com, volk@cl.uzh.ch

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

Abstract

Tokenization is the process of splitting running texts into minimal meaningful units. In writing systems where a space character is used for word separation, this blank character typically acts as a token boundary. A simple tokenizer that only splits texts at space characters already achieves a notable accuracy, although it misses unmarked token boundaries and erroneously splits tokens that contain space characters.

Different languages use the same characters for different purposes. Tokenization is thus a language-specific task (with code-switching being a particular challenge). Extra-linguistic tokens, however, are similar in many languages. These tokens include numbers, XML elements, email addresses and identifiers of concepts that are idiosyncratic to particular text variants (e.g., patent numbers).

We present a framework for tokenization that makes use of language-specific and language-independent token identification rules. These rules are stacked and applied recursively, yielding a complete trace of the tokenization process in the form of a tree structure. Rules are easily adaptable to different languages and text types. Unit tests reliably detect if new token identification rules conflict with existing ones and thus assure consistent tokenization when extending the rule sets.

1 Introduction

Common wisdom has it that tokenization is a solved problem. Yet, in practice, we often find ourselves in bothersome trials of adapting tokenizers or their output. This may be due to the fact that "down-stream" processing tools require a different tokenization. Or it may be because of special tokenization needs for particular domains, genres or historical text variants.

As an example of different tokenization needs, consider splits of English negations in contracted forms like didn't and won't. The Penn Treebank guidelines suggest tokenizing those as did + n't and wo + n't. Such splits are practical for information extraction or sentiment analysis. But, of course, these splits make searching a corpus (e.g. for linguistic investigations) for negated forms unintuitive. Searches must then be supported by a specific module that undoes the splits.

Another example is the English phrase a 12-ft boat. How shall we handle the hyphenated length expression? Is this one or two or even three tokens? We follow the rule that measurement units are split from numerical values. This rule is meant for expressions of altitude or speed, and says that the number is split from the unit (e.g. 2850m → 2850, m; 155km/h → 155, km/h). Following this rule, we decided to also split the hyphenated length expression into two tokens, resulting in: a, 12, -ft, boat. Once identified as such, we can, of course, keep numerical values and measurement units as single tokens if required by the following processing step.

We work on the annotation of large multilingual corpora, some of them diachronic, covering the last 150 years. In our work, such tokenization issues abound. We have therefore developed tokenization guidelines, which started out as check-lists for the various language versions of our corpora. We then realized that only a custom-built tokenizer with systematic tests included will serve our purposes of high-quality tokenization.
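The unit-splitting rule described above can be illustrated with a small pattern-based split. This is a simplified sketch, not the actual Cutter rule: the regular expression and the function name are our own stand-ins.

```python
import re

# Simplified stand-in for the unit-splitting rule described above:
# a number followed directly by a measurement unit is split into two
# tokens (2850m -> 2850, m; 155km/h -> 155, km/h).
MEASURE = re.compile(r"^(?P<number>\d+(?:[.,]\d+)?)(?P<unit>[A-Za-zµ]+(?:/[A-Za-z]+)?)$")

def split_measure(token):
    """Split a number+unit token into its parts, or return it unchanged."""
    m = MEASURE.match(token)
    if m:
        return [m.group("number"), m.group("unit")]
    return [token]

print(split_measure("2850m"))    # ['2850', 'm']
print(split_measure("155km/h"))  # ['155', 'km/h']
print(split_measure("boat"))     # ['boat']
```

As in the paper, a downstream step that prefers 2850m as a single token can simply skip this rule.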
Our tokenization approach does not include normalization, which we see as a separate step involving coding issues (like turning ligatures into letter sequences, or certain spaces into non-breakable spaces) or other simplifications (like turning American into British English spelling, or Swiss German into Standard German spelling).

In this paper, we first describe existing tokenization approaches and show that there is a need for tokenizers that can be adapted to particular language and text variants (Section 2). We then show why tokenization is a challenging task by giving examples of ambiguous cases. We argue that a tokenizer needs to possess linguistic information and to consider long-distance relations to be able to decide those cases (Section 3). Having outlined the problem, we describe our tokenization approach (Section 4) and how we employ unit testing to warrant high-quality tokenization while allowing for the adaptation of the tokenizer (Section 5). Finally, we show that there are cases where tokenization decisions require commonsense knowledge, which our tokenizer is not capable of handling (Section 7). Future development (Section 8) will need to involve syntactic parsing to solve those hard cases.

2 Related Work

The Stanford Tokenizer (Manning et al., 2018) is probably the most widely used tokenizer for English. It is built on the basis of the tokenization rules in the Penn Treebank.[1] Following the Penn tokenizations gets us a long way for English, but is not explicit enough to address issues such as the hyphenated length expression above, a 12-ft boat. Its strengths are its speed and the numerous options concerning the treatment of special symbols (parentheses, ampersand, currency symbols and fractions). In contrast, our tokenizer is highly modular and adaptable to categories of texts that we did not consider when compiling our guidelines. It also allows for a combination of rule sets from different languages to process texts with quotations or code switching, for instance.

[1] ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html

He and Kayaalp (2006) compare various tokenizers for the biomedical domain. Their results point to the need for standard tokenizers in order to ensure the interoperability of processing tools. Cruz Díaz and Maña López (2015) follow up with an analysis of more recent tokenizers, also for the biomedical domain. They observe disagreement to a large extent between the tokenization decisions of those tools on the test cases they had identified beforehand. That observation is in agreement with Habert et al. (1998), who had concluded more than 15 years earlier: "At the moment, tokenizers represent black boxes, the behavior and rationale of which are not made clear."

Apart from rule-based tokenization, there are machine learning approaches to tokenization as well. For those approaches, a certain amount of training material (i.e., both original and tokenized versions of the same texts) is required. Jurish and Würzner (2013) argue that sufficient training material could be extracted from "treebanks or multi-lingual corpora".

3 Tokenization Challenges

Although the only decision to be taken by the tokenizer is whether or not to place a token boundary between each two adjacent characters, this task is not as trivial as it seems at first glance. If two adjacent characters are both letters, they typically belong to the same token. English negations in contracted forms (like didn't), as described above, are one exception.

A non-letter character (e.g., a punctuation mark) followed by a letter frequently marks the boundary of two tokens, while the opposite case (a letter character followed by a non-letter character) does not show a general preference; the right decision in these cases often requires resorting to linguistic knowledge. We can, for instance, not decide whether baby's is one token or two without knowing whether a text is written in English ('s is a possessive marker of baby) or Dutch (baby's means babies).

Apart from knowing a text's language, which includes word formation and grammar knowledge, sometimes long-distance relations between tokens that belong together, such as brackets or quotation marks, have to be determined in order to take the right decision. An apostrophe following a German word that phonetically ends in /s/ can be both a possessive marker and the end of a single-quoted expression. If we find another apostrophe in the same sentence preceding the ambiguous one, such that it is followed immediately by a letter character and preceded by a space character, we have evidence for the quoted expression and consequently mark both apostrophes as single tokens (see Figure 1).[2]

[Figure 1: German sub-clause "da ich die Veranstaltung 'Kulturstadt Europas' für ein richtiges 'Ei des Kolumbus' halte" with typewriter apostrophes as tokenization tree. Every node represents a single decision of the tokenizer. Example taken from (Graën, 2018, p. 31).]

[2] This is only necessary if the single typewriter apostrophe is used instead of the proper left and right single quotation marks.

In many languages, both sentences and abbreviations typically end with a period. While the sentence-final period is a single token, abbreviations comprise the period. To distinguish both cases, we either need to know all abbreviations of the language in question, or we need a reliable way of determining sentence boundaries.
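The apostrophe-pairing evidence described above can be sketched as follows. This is a minimal simplification of the heuristic, assuming plain typewriter apostrophes; the function name and the exact pairing policy (pair each opening apostrophe with the next apostrophe) are our own.

```python
import re

# Sketch of the pairing heuristic: a typewriter apostrophe preceded by
# a space and immediately followed by a letter is evidence for an
# opening quote; it is paired with the next apostrophe that follows.
OPENING = re.compile(r"(?<= )'(?=\w)")

def find_quoted_spans(sentence):
    """Return (open, close) character offsets of single-quoted spans."""
    spans = []
    for m in OPENING.finditer(sentence):
        close = sentence.find("'", m.end())
        if close != -1:
            spans.append((m.start(), close))
    return spans

s = ("da ich die Veranstaltung 'Kulturstadt Europas' "
     "für ein richtiges 'Ei des Kolumbus' halte")
for start, end in find_quoted_spans(s):
    print(s[start:end + 1])
```

A real implementation would additionally have to back off to the possessive reading when no pairing apostrophe is found in the sentence.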
4 The Cutter Implementation

Simple tokenizers process text as a stream of characters from left to right and take locally justified decisions on whether to place a token boundary between two adjacent characters. This approach is limited, as it is not capable of taking long-distance relations into account.

Our approach is to successively identify tokens following an ordered list of patterns defined by advanced regular expressions.[3] Once identified, we 'cut out' the token (hence the name Cutter) and proceed by applying the same patterns to the remaining parts, until only empty character sequences remain. This procedure generates a tree structure like the ones in Figure 1 and Figure 2.

[3] We use the so-called Perl Compatible Regular Expressions (PCRE) by Hazel (1997), including adjuvant features such as Unicode character properties (The Unicode Consortium, 2017), named capturing subpatterns and subpattern assertions.

The order of patterns that describe tokens and their respective context is chosen such that the more detailed or exceptional tokens are identified first, followed by more common and standard tokens. That way, sequences of characters that would otherwise be split into several tokens can be protected by an earlier match, which prevents that sequence from further processing.

Tokens that contain spaces, for instance, need to be matched by a pattern that prevents them from being split by the general rule which mandates that spaces (and other white space characters) are token separators. For ease of reading, numbers are often separated into groups of three digits. The international standard for "quantities and units" stipulates the use of a small space as separator (ISO 80000-1, 2009, Section 7.3.1), which is often realized as a standard space in electronic texts. We identify numbers formatted in this way (e.g., 50 000) as single tokens.

Another example of tokens that need to be protected are French words that were originally composed of more than one lexical unit but nowadays form a single lexical unit and should thus be recognized as a single token. In the example shown in Figure 2, aujourd'hui 'today' is identified as a token in the first step, leaving On nous dit qu' and c'est le cas, encore faudra-t-il l'évaluer. as remainders, which are subsequently further tokenized. To be able to distinguish lexicalized forms (e.g., d'accord → d'accord) from regular elision of vowels (e.g., d'accorder → d', accorder), we need to incorporate all lexicalized forms (e.g., entr'ouvèrt, c'est-à-dire, presqu'île) into the patterns that constitute our (tokenization) language model.

[Figure 2: Tokenization tree of the French sentence "On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer.". Example taken from (Graën, 2018, p. 31).]

In addition to the linguistic information encoded in patterns, our tokenizer uses two word lists per language. The first one contains abbreviations, in order to mark their occurrences in the text as single tokens. The second one consists of sentence-initial words, that is, words that do not start with a capital letter except in sentence-initial position, such as prepositions or determiners. When we locate a word from this list in the text, we mark it as a potential sentence starter, which in particular contexts leads to a special empty token that marks a sentence boundary. Sentences can be split at those markers if no prior sentence segmentation has been performed.[4]

Pattern identification rules are composed of a list of named capturing subpatterns (see Hazel, 1997) that extend to the whole text provided (i.e., they are anchored both at the beginning and at the end of the text). We distinguish between the actual token or tokens that are identified by a rule (e.g., the comma in Figure 2) and the remainders that await further processing (est le cas and encore faudra in Figure 2). A typical rule consists of a left part, the actual token and a right part (see the root node in Figure 2), though it can also identify more tokens (see the root node in Figure 1).

Some examples: date expressions typically consist of a number of tokens (e.g., 12., und, 13., Juni, 2018); pronouns in some Romance languages can be concatenated (e.g., No vull posar-n'hi. → No vull posar, -n', hi, .; Sim, dir-lhes-ia isso. → Sim, dir, -lhes, -ia, isso.). Each identified pattern is assigned a tag, which marks the corresponding language (if the rule is language-dependent), the rule name, a running letter (if there is more than one rule for the same target) and a running number (if a rule identifies more than one token).

We envisage that our Cutter applies the language-independent rules together with the rules for a particular language to a text whose language is known and uniform. Beyond that, rule sets of different languages can be combined in case of code switching. Various rule sets for the same language (e.g., for different text variants) can also be combined individually.

5 Unit Testing

The architecture of our tokenizer has a modular design to facilitate its adaptation to different user needs. Towards this goal, we need to check that new rules do not interfere with existing ones. Following the method of unit testing, widely used in software development, we collect text snippets for each nontrivial tokenization problem. We then provide information on the correct tokenization of those snippets according to our guidelines. To make a unit test pass, our tokenizer needs to perform tokenization exactly as indicated.

If a unit test fails, the error can be either in the test or in the rules. A test error can be based on contradictory or unachievable tokenization guidelines (i.e., requiring commonsense knowledge) or on an incomplete manual tokenization (e.g., the annotator missed a comma). A rule error typically results from a rule being too restrictive or from another rule being too general and thus erroneously matching the unit test in question.
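The unit-testing scheme described above can be sketched as snippet/expected-token pairs checked against the tokenizer. The `tokenize` function below is a naive space-splitting placeholder (not the actual Cutter rule cascade), so the second case fails on purpose, illustrating how a missing rule surfaces.

```python
# Hypothetical sketch of the unit-testing scheme: each test case pairs
# a text snippet with the token sequence prescribed by the guidelines.
def tokenize(text):
    return text.split()  # placeholder, not the actual Cutter rules

TEST_CASES = [
    ("a boat", ["a", "boat"]),                     # passes even naively
    ("a 12-ft boat", ["a", "12", "-ft", "boat"]),  # needs a dedicated rule
]

def run_tests():
    """Return the list of failing cases as (snippet, expected, actual)."""
    failures = []
    for snippet, expected in TEST_CASES:
        actual = tokenize(snippet)
        if actual != expected:
            failures.append((snippet, expected, actual))
    return failures

for snippet, expected, actual in run_tests():
    print(f"FAIL {snippet!r}: expected {expected}, got {actual}")
```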
In both cases, iterative improvement of tests or rules (or both) finally leads to a configuration where all tests pass, which is the objective of unit testing and, in our case, guarantees that a deterioration of tokenization quality caused by language model changes is detected immediately and can be traced back to a particular change.

[4] If a sufficiently large corpus exists for the language and text variant in question, methods to learn sentence boundaries from the corpus, such as (Kiss and Strunk, 2006), may perform better.

6 Evaluation

Aside from the implicit evaluation of our unit tests, all of which we require to pass, we tested the performance of our tokenizer on gold-tokenized texts. Such gold-standard texts together with their original (untokenized) versions are not easy to obtain. Corpora are typically available as raw texts, while treebanks typically feature the manually determined tokens, but not the original, untokenized material. Jurish and Würzner (2013) approach this problem with de-tokenization rules and manual correction in particular cases.

We built a test corpus based on the SMULTRON treebanks (Gustafson-Capková et al., 2007; Volk et al., 2010), for which we have the original texts. Using those treebanks for comparison with the output of our tokenizer is problematic, however, since the tokenization guidelines that our tokenizer implements originate, among others, from the experiences gained in the creation of these very treebanks. Notwithstanding the expected bias, we select those sentences from the treebanks which we can identify in the original texts (1528 unique sentences in total) by simply ignoring any whitespace characters.[5] This selection comprises 1173 sample sentences in four languages: English (67), German (388), Spanish (184) and Swedish (534).

[5] From our point of view, a tokenizer is only allowed to remove input characters, not to alter them.

An initial tokenization with Cutter yields 59 sentences with errors (0.5 %).[6] In more than half of these cases, the tokenization in the treebank deviates from the pattern stipulated by the tokenization guidelines. Another frequent issue is that the textual source does not correspond to the text in the original document, which comprises missing or superfluous whitespace and the representation of images as characters (see Figure 3).

[6] Only the first error found in a sentence is counted.

[Figure 3: The 'up' and 'down' key symbols in this detail from a DVD player manual are represented as a circumflex diacritic and the letter 'v', respectively: eller med ^/v-knapparna på fjärkontrollen.]

We correct apparent anomalies in the input sentences and remove sentences that cannot regularly be represented as text; these all involve key symbols, such as the one shown in Figure 3. That way, we obtain a small tokenization gold standard. It comprises 1165 sentences in the aforementioned four languages.

When we tokenize the tests obtained from the gold standard sample sentences with our tokenizer, we still see an error rate of 1 %. By adjusting the existing rules to include borderline cases (e.g., including the ± sign in the definition of numbers), we could make all tests pass. The error rate of two other popular tokenizers, the ones in the NLTK[7] and spaCy[8] NLP toolkits, is at approximately 12 %.[9] The comparatively high error rate is due to both real tokenization errors, such as splitting URLs, XML tags and ordinal numbers in German,[10] and, of course, debatable tokenization rules. Should the German adjective 100%ige be left as one token (spaCy), or be split into two tokens (100, %ige; Cutter) or three tokens (100, %, ige; NLTK)?

[7] https://www.nltk.org/
[8] https://spacy.io/
[9] The spaCy tokenizer has only been evaluated on English, German and Spanish sentences as it has no model for Swedish.
[10] The spaCy tokenizer also consistently splits compound adjectives and nouns in English (e.g., low-cost, medium-voltage, break-even), while the NLTK tokenizer alters all quotation marks.

7 Features and Limitations

Rule-based approaches in natural language processing have widely been replaced by machine learning approaches, since the latter are capable of handling unanticipated situations by abstracting from observed patterns. For the tokenization of standard texts in well-resourced languages, machine learning approaches such as (Jurish and Würzner, 2013) might have enough data from which to learn those patterns. For particular text categories and low-resourced languages, however, providing the algorithm with sufficient training data will require a substantial effort.

In our work with corpora in several languages, the best approach turned out to be an iterative one. For a new language, we start with an empty rule set (in addition to the language-independent rules) and apply it to the untokenized texts. We subsequently generate unit tests from the errors that surfaced in manual inspection, which we then address by defining corresponding patterns (also consulting grammar books and treebanks, if available). A few iterations of this procedure lead to a collection of rules and tests, and lower the tokenization error rate considerably.

For languages that have treebanks or sentence-segmented corpora, we can automatically extract sentence-initial words and add them to our list if they do not appear with a capital letter in other positions; we might also want to filter for closed word classes (i.e., prepositions, pronouns, etc.) here. If no resource for gathering abbreviations is available, we need to search for abbreviations in the given texts.

In contrast to machine learning approaches, erroneous tokenization decisions in our system can always be traced back to a particular pattern, which facilitates a quick remedy. The language model, however, is inevitably incomplete and requires testing and adaptation ahead of its application to new text variants.

As mentioned above, some tokenization decisions require a deeper understanding than what a sequence of characters can provide. This is, for instance, the case when abbreviations (without the period) coincide with another word. We are only aware of German examples, such as Abt. for Abteilung 'department' vs. Abt 'abbot', or Art. for Artikel 'article' vs. Art 'kind'. Even if we address this problem by excluding those words from the abbreviation list and matching them with a dedicated rule that expects a succeeding number (e.g., in Abt. 3 'in department 3'; nach Art. 25 'pursuant to article 25'), we can still come up with cases that cannot be solved without dictionary lookups or parsing. Compare, for instance:

1. Wir trafen den Abt. Bergbahnen sind seine Leidenschaft.
'We met the abbot. Mountain railways are his passion.'

2. Wir sahen den Sprecher der Abt. Bergbahnen und Wanderwege.
'We saw the spokesman of the dept. of mountain railways and hiking trails.'

Splitting undirected quotation marks (") results in an information loss.[11] After tokenization, it can no longer be inferred whether such a quotation mark signals the beginning or the end of a quotation. A more careful tokenizer needs to preserve the information whether the quotation mark was split from the previous or from the following word. Our tokenizer provides the option to alternatively return or suppress whitespace tokens.[12]

[11] The same is true for typewriter apostrophes (') as a replacement for matching single quotation marks.
[12] A sample sentence with whitespace tokens looks like this: Suot , , il , , titel , , " , vacanzas , , e , , cultura , " , , as , , prouva , , d' , eruir , , la , , funcziun , , da , , la , , lingua , , e , , cultura , , rumauntscha , , per , , il , , turissem , , i , 'l , , Grischun , .

8 Conclusions and Future Development

To overcome ambiguous cases, we propose to extend the shallow processing of the tokenizer by a syntactic parser, to select the more likely tokenization. To this end, tokenization has to be performed several times with alternating rules. Parsing likelihood as decision maker is only required if different results are obtained. For low-resourced languages where no parser exists, a heuristic based on the identification of finite verb forms could suffice.

Rules are currently organized in sets, one for each language and one for language-independent rules. Each set comprises different stages, which are used to interconnect different sets. Corresponding quotation marks, for instance, need to be identified before any token in between them splits the sentence into smaller parts. Language-specific date expressions (e.g., with ordinal numbers expressed as digits plus a period) need to be processed before the language-independent identification of numbers takes place.

We think that instead of a limited list of stages, a more dynamic data structure would be beneficial. We already know which rules interfere (e.g., numbers with spaces vs. spaces as separators), but this is not explicitly reflected in the data. If we were to reorganize the tokenization rule sets by means of a "before" relation between pairs of rules, we could build a rule dependency graph, which, serialized, would define the order of rules to apply. In the case of code-switching sentences, the given order of languages would be decisive wherever no order is enforced by that graph.

We have used the tokenizer ourselves in a number of projects. It supports several European languages, including Romansh as a low-resourced language, and more languages are in preparation. The tokenizer and all our language models are freely available.[13] We also provide a web demo and a tokenization web service.

[13] http://pub.cl.uzh.ch/purl/cutter
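The "before" relation and rule dependency graph proposed above can be sketched with a topological sort. The rule names and the particular orderings below are invented for illustration; a real rule set would declare its own constraints.

```python
from graphlib import TopologicalSorter

# Hypothetical "before" relation: each rule maps to the set of rules
# that must be applied before it. Serializing the graph yields the
# order in which the rules are applied.
before = {
    "quotation_pairs": set(),                  # must run before inner splits
    "date_expressions": {"quotation_pairs"},
    "numbers_with_spaces": {"date_expressions"},
    "whitespace_split": {"numbers_with_spaces", "quotation_pairs"},
}

order = list(TopologicalSorter(before).static_order())
print(order)
```

A cycle in the declared constraints would raise `graphlib.CycleError`, which would flag contradictory rule orderings just as a failing unit test flags contradictory guidelines.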
Acknowledgments

This research was supported by the Swiss National Science Foundation under grant 105215_146781/1 through the project "SPARCLING – Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation".

References

Cruz Díaz, Noa Patricia and Manuel Jesús Maña López (Sept. 2015). "An Analysis of Biomedical Tokenization: Problems and Strategies". In: Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis. Lisbon, Portugal: Association for Computational Linguistics, pp. 40–49.

Graën, Johannes (2018). "Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning". PhD thesis. University of Zurich.

Gustafson-Capková, Sofia, Yvonne Samuelsson, and Martin Volk (2007). SMULTRON – The Stockholm MULtilingual parallel Treebank.

Habert, Benoit, Gilles Adda, M. Adda-Decker, P. Boula de Marëuil, S. Ferrari, O. Ferret, G. Illouz, and P. Paroubek (1998). "Towards tokenization evaluation". In: Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC). Vol. 98, pp. 427–431.

Hazel, Philip (1997). PCRE (Perl-compatible regular expressions). URL: https://www.pcre.org/.

He, Ying and Mehmet Kayaalp (2006). A Comparison of 13 Tokenizers on MEDLINE. Tech. rep. U.S. National Library of Medicine, Lister Hill National Center for Biomedical Communications.

ISO 80000-1 (Nov. 2009). ISO 80000-1: Quantities and units – Part 1: General. Ed. by ISO/TC 12 Technical Committee.

Jurish, Bryan and Kay-Michael Würzner (2013). "Word and Sentence Tokenization with Hidden Markov Models". In: Journal for Language Technology and Computational Linguistics 28.2, pp. 61–83.

Kiss, Tibor and Jan Strunk (2006). "Unsupervised Multilingual Sentence Boundary Detection". In: Computational Linguistics 32.4, pp. 485–525.

Manning, Christopher, Tim Grow, Teg Grenager, Jenny Finkel, and John Bauer (Aug. 2018). Stanford Tokenizer. URL: http://nlp.stanford.edu/software/tokenizer.shtml.

The Unicode Consortium (2017). The Unicode Standard, Version 10.0.

Volk, Martin, Anne Göhring, Torsten Marek, and Yvonne Samuelsson (2010). SMULTRON (version 3.0) – the Stockholm MULtilingual parallel TReebank. An English-French-German-Spanish-Swedish parallel treebank with sub-sentential alignments.