Cutter – a Universal Multilingual Tokenizer

Johannes Graën, Mara Bertamini, Martin Volk
Institute of Computational Linguistics, University of Zurich
graen@cl.uzh.ch, bertaminimara@gmail.com, volk@cl.uzh.ch

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

Abstract

Tokenization is the process of splitting running texts into minimal meaningful units. In writing systems where a space character is used for word separation, this blank character typically acts as a token boundary. A simple tokenizer that only splits texts at space characters already achieves a notable accuracy, although it misses unmarked token boundaries and erroneously splits tokens that contain space characters.

Different languages use the same characters for different purposes. Tokenization is thus a language-specific task (with code-switching being a particular challenge). Extra-linguistic tokens, however, are similar in many languages. These tokens include numbers, XML elements, email addresses and identifiers of concepts that are idiosyncratic to particular text variants (e.g., patent numbers).

We present a framework for tokenization that makes use of language-specific and language-independent token identification rules. These rules are stacked and applied recursively, yielding a complete trace of the tokenization process in the form of a tree structure. Rules are easily adaptable to different languages and text types. Unit tests reliably detect if new token identification rules conflict with existing ones and thus assure consistent tokenization when extending the rule sets.

1 Introduction

Common wisdom has it that tokenization is a solved problem. Yet, in practice, we often find ourselves in bothersome trials of adapting tokenizers or their output. This may be due to the fact that "down-stream" processing tools require a different tokenization. Or it may be because of special tokenization needs for particular domains, genres or historical text variants.

As an example of different tokenization needs, consider splits of English negations in contracted forms like didn't and won't. The Penn Treebank guidelines suggest tokenizing those as did + n't and wo + n't. Such splits are practical for information extraction or sentiment analysis. But, of course, these splits make searching a corpus (e.g. for linguistic investigations) for negated forms unintuitive. Searches must then be supported by a specific module that undoes the splits.

Another example is the English phrase a 12-ft boat. How shall we handle the hyphenated length expression? Is this one or two or even three tokens? We follow the rule that measurement units are split from numerical values. This rule is meant for expressions of altitude or speed, and says that the number is split from the unit (e.g. 2850m → 2850, m; 155km/h → 155, km/h). Following this rule, we decided to also split the hyphenated length expression into two tokens, resulting in: a, 12, -ft, boat. Once identified as such, we can, of course, keep numerical values and measurement units as single tokens if required by the following processing step.

We work on the annotation of large multilingual corpora, some of them diachronic, covering the last 150 years. In our work, such tokenization issues abound. We have therefore developed tokenization guidelines, which started out as check-lists for the various language versions of our corpora. We then realized that only a custom-built tokenizer with systematic tests included will serve our purposes of high-quality tokenization.
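The unit-splitting rule described above can be illustrated with a small pattern-based split. This is a simplified sketch, not the actual Cutter rule: the regular expression and the function name are our own stand-ins.

```python
import re

# Simplified stand-in for the unit-splitting rule described above:
# a number followed directly by a measurement unit is split into two
# tokens (2850m -> 2850, m; 155km/h -> 155, km/h).
MEASURE = re.compile(r"^(?P<number>\d+(?:[.,]\d+)?)(?P<unit>[A-Za-zµ]+(?:/[A-Za-z]+)?)$")

def split_measure(token):
    """Split a number+unit token into its parts, or return it unchanged."""
    m = MEASURE.match(token)
    if m:
        return [m.group("number"), m.group("unit")]
    return [token]

print(split_measure("2850m"))    # ['2850', 'm']
print(split_measure("155km/h"))  # ['155', 'km/h']
print(split_measure("boat"))     # ['boat']
```

As in the paper, a downstream step that prefers 2850m as a single token can simply skip this rule.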
Our tokenization approach does not include normalization, which we see as a separate step involving coding issues (like turning ligatures into letter sequences, or certain spaces into non-breakable spaces) or other simplifications (like turning American into British English spelling, or Swiss German into Standard German spelling).

In this paper, we first describe existing tokenization approaches and show that there is a need for tokenizers that can be adapted to particular language and text variants (Section 2). We then show why tokenization is a challenging task by giving examples of ambiguous cases. We argue that a tokenizer needs to possess linguistic information and to consider long-distance relations to be able to decide those cases (Section 3). Having outlined the problem, we describe our tokenization approach (Section 4) and how we employ unit testing to warrant high-quality tokenization while allowing for the adaptation of the tokenizer (Section 5). Finally, we show that there are cases where tokenization decisions require commonsense knowledge, which our tokenizer is not capable of handling (Section 7). Future development (Section 8) will need to involve syntactic parsing to solve those hard cases.

2 Related Work

The Stanford Tokenizer (Manning et al., 2018) is probably the most widely used tokenizer for English. It is built on the basis of the tokenization rules in the Penn Treebank.[1] Following the Penn tokenizations gets us a long way for English, but is not explicit enough to address issues such as the hyphenated length expression above, a 12-ft boat. Its strengths are its speed and the numerous options concerning the treatment of special symbols (parentheses, ampersand, currency symbols and fractions). In contrast, our tokenizer is highly modular and adaptable to categories of texts that we did not consider when compiling our guidelines. It also allows for a combination of rule sets from different languages to process texts with quotations or code switching, for instance.

[1] ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html

He and Kayaalp (2006) compare various tokenizers for the biomedical domain. Their results point to the need for standard tokenizers in order to ensure the interoperability of processing tools. Cruz Díaz and Maña López (2015) follow up with an analysis of more recent tokenizers, also for the biomedical domain. They observe disagreement to a large extent between the tokenization decisions of those tools on the test cases they had identified beforehand. That observation is in agreement with Habert et al. (1998), who had concluded more than 15 years earlier: "At the moment, tokenizers represent black boxes, the behavior and rationale of which are not made clear."

Apart from rule-based tokenization, there are machine learning approaches to tokenization as well. For those approaches, a certain amount of training material (i.e., both original and tokenized versions of the same texts) is required. Jurish and Würzner (2013) argue that sufficient training material could be extracted from "treebanks or multi-lingual corpora".

3 Tokenization Challenges

Although the only decision to be taken by the tokenizer is whether or not to place a token boundary between each two adjacent characters, this task is not as trivial as it seems at first glance. If two adjacent characters are both letters, they typically belong to the same token. English negations in contracted forms (like didn't), as described above, are one exception.

A non-letter character (e.g., a punctuation mark) followed by a letter frequently marks the boundary of two tokens, while the opposite case (a letter character followed by a non-letter character) does not show a general preference; the right decision in these cases often requires resorting to linguistic knowledge. We can, for instance, not decide whether baby's is one token or two without knowing whether a text is written in English ('s is a possessive marker of baby) or Dutch (baby's means babies).

Apart from knowing a text's language, which includes word formation and grammar knowledge, sometimes long-distance relations between tokens that belong together, such as brackets or quotation marks, have to be determined in order to take the right decision. An apostrophe following a German word that phonetically ends in /s/ can be both a possessive marker and the end of a single-quoted expression. If we find another apostrophe in the same sentence preceding the ambiguous one, such that it is followed immediately by a letter character and preceded by a space character, we have evidence for the quoted expression and consequently mark both apostrophes as single tokens (see Figure 1).[2]

[Figure 1: German sub-clause "da ich die Veranstaltung 'Kulturstadt Europas' für ein richtiges 'Ei des Kolumbus' halte" with typewriter apostrophes as tokenization tree. Every node represents a single decision of the tokenizer. Example taken from (Graën, 2018, p. 31).]

[2] This is only necessary if the single typewriter apostrophe is used instead of the proper left and right single quotation marks.

In many languages, both sentences and abbreviations typically end with a period. While the sentence-final period is a single token, abbreviations comprise the period. To distinguish both cases, we either need to know all abbreviations of the language in question, or we need a reliable way of determining sentence boundaries.
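The apostrophe-pairing evidence described above can be sketched as follows. This is a minimal simplification of the heuristic, assuming plain typewriter apostrophes; the function name and the exact pairing policy (pair each opening apostrophe with the next apostrophe) are our own.

```python
import re

# Sketch of the pairing heuristic: a typewriter apostrophe preceded by
# a space and immediately followed by a letter is evidence for an
# opening quote; it is paired with the next apostrophe that follows.
OPENING = re.compile(r"(?<= )'(?=\w)")

def find_quoted_spans(sentence):
    """Return (open, close) character offsets of single-quoted spans."""
    spans = []
    for m in OPENING.finditer(sentence):
        close = sentence.find("'", m.end())
        if close != -1:
            spans.append((m.start(), close))
    return spans

s = ("da ich die Veranstaltung 'Kulturstadt Europas' "
     "für ein richtiges 'Ei des Kolumbus' halte")
for start, end in find_quoted_spans(s):
    print(s[start:end + 1])
```

A real implementation would additionally have to back off to the possessive reading when no pairing apostrophe is found in the sentence.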
4 The Cutter Implementation

Simple tokenizers process text as a stream of characters from left to right and take locally justified decisions on whether to place a token boundary between two adjacent characters. This approach is limited, as it is not capable of taking long-distance relations into account.

Our approach is to successively identify tokens following an ordered list of patterns defined by advanced regular expressions.[3] Once identified, we 'cut out' the token (hence the name Cutter) and proceed by applying the same patterns to the remaining parts, until only empty character sequences remain. This procedure generates a tree structure like the ones in Figure 1 and Figure 2.

[3] We use the so-called Perl Compatible Regular Expressions (PCRE) by Hazel (1997), including adjuvant features such as Unicode character properties (The Unicode Consortium, 2017), named capturing subpatterns and subpattern assertions.

The order of patterns that describe tokens and their respective context is chosen such that the more detailed or exceptional tokens are identified first, followed by more common and standard tokens. That way, sequences of characters that would otherwise be split into several tokens can be protected by an earlier match, which prevents that sequence from further processing.

Tokens that contain spaces, for instance, need to be matched by a pattern that prevents them from being split by the general rule which mandates that spaces (and other white space characters) are token separators. For ease of reading, numbers are often separated into groups of three digits. The international standard for "quantities and units" stipulates the use of a small space as separator (ISO 80000-1, 2009, Section 7.3.1), which is often realized as a standard space in electronic texts. We identify numbers formatted in this way (e.g., 50 000) as single tokens.

Another example of tokens that need to be protected are French words that were originally composed of more than one lexical unit but nowadays form a single lexical unit and should thus be recognized as a single token. In the example shown in Figure 2, aujourd'hui 'today' is identified as a token in the first step, leaving On nous dit qu' and c'est le cas, encore faudra-t-il l'évaluer. as remainders, which are subsequently further tokenized. To be able to distinguish lexicalized forms (e.g., d'accord → d'accord) from regular elision of vowels (e.g., d'accorder → d', accorder), we need to incorporate all lexicalized forms (e.g., entr'ouvèrt, c'est-à-dire, presqu'île) into the patterns that constitute our (tokenization) language model.

[Figure 2: Tokenization tree of the French sentence "On nous dit qu'aujourd'hui c'est le cas, encore faudra-t-il l'évaluer.". Example taken from (Graën, 2018, p. 31).]

In addition to the linguistic information encoded in patterns, our tokenizer uses two word lists per language. The first one contains abbreviations, in order to mark their occurrences in the text as single tokens. The second one consists of sentence-initial words, that is, words that do not start with a capital letter except in sentence-initial position, such as prepositions or determiners. When we locate a word from this list in the text, we mark it as a potential sentence starter, which in particular contexts leads to a special empty token that marks a sentence boundary. Sentences can be split at those markers if no prior sentence segmentation has been performed.[4]

Pattern identification rules are composed of a list of named capturing subpatterns (see Hazel, 1997) that extend to the whole text provided (i.e., they are anchored both at the beginning and at the end of the text). We distinguish between the actual token or tokens that are identified by a rule (e.g., the comma in Figure 2) and the remainders that await further processing (est le cas and encore faudra in Figure 2). A typical rule consists of a left part, the actual token and a right part (see the root node in Figure 2), though it can also identify more tokens (see the root node in Figure 1).

Some examples: date expressions typically consist of a number of tokens (e.g., 12., und, 13., Juni, 2018); pronouns in some Romance languages can be concatenated (e.g., No vull posar-n'hi. → No vull posar, -n', hi, .; Sim, dir-lhes-ia isso. → Sim, dir, -lhes, -ia, isso.). Each identified pattern is assigned a tag, which marks the corresponding language (if the rule is language-dependent), the rule name, a running letter (if there is more than one rule for the same target) and a running number (if a rule identifies more than one token).

We envisage that our Cutter applies the language-independent rules together with the rules for a particular language to a text whose language is known and uniform. Beyond that, rule sets of different languages can be combined in case of code switching. Various rule sets for the same language (e.g., for different text variants) can also be combined individually.

5 Unit Testing

The architecture of our tokenizer has a modular design to facilitate its adaptation to different user needs. Towards this goal, we need to check that new rules do not interfere with existing ones. Following the method of unit testing, widely used in software development, we collect text snippets for each nontrivial tokenization problem. We then provide information on the correct tokenization of those snippets according to our guidelines. To make a unit test pass, our tokenizer needs to perform tokenization exactly as indicated.

If a unit test fails, the error can be either in the test or in the rules. A test error can be based on contradictory or unachievable tokenization guidelines (i.e., requiring commonsense knowledge) or on an incomplete manual tokenization (e.g., the annotator missed a comma). A rule error typically results from a rule being too restrictive or from another rule being too general and thus erroneously matching the unit test in question.
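The unit-testing scheme described above can be sketched as snippet/expected-token pairs checked against the tokenizer. The `tokenize` function below is a naive space-splitting placeholder (not the actual Cutter rule cascade), so the second case fails on purpose, illustrating how a missing rule surfaces.

```python
# Hypothetical sketch of the unit-testing scheme: each test case pairs
# a text snippet with the token sequence prescribed by the guidelines.
def tokenize(text):
    return text.split()  # placeholder, not the actual Cutter rules

TEST_CASES = [
    ("a boat", ["a", "boat"]),                     # passes even naively
    ("a 12-ft boat", ["a", "12", "-ft", "boat"]),  # needs a dedicated rule
]

def run_tests():
    """Return the list of failing cases as (snippet, expected, actual)."""
    failures = []
    for snippet, expected in TEST_CASES:
        actual = tokenize(snippet)
        if actual != expected:
            failures.append((snippet, expected, actual))
    return failures

for snippet, expected, actual in run_tests():
    print(f"FAIL {snippet!r}: expected {expected}, got {actual}")
```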
In both cases, iterative improvement of tests or rules (or both) finally leads to a configuration where all tests pass, which is the objective of unit testing and, in our case, guarantees that a deterioration of tokenization quality caused by language model changes is detected immediately and can be traced back to a particular change.

[4] If a sufficiently large corpus exists for the language and text variant in question, methods to learn sentence boundaries from the corpus, such as (Kiss and Strunk, 2006), may perform better.

6 Evaluation

Aside from the implicit evaluation of our unit tests, all of which we require to pass, we tested the performance of our tokenizer on gold-tokenized texts. Such gold-standard texts together with their original (untokenized) versions are not easy to obtain. Corpora are typically available as raw texts, while treebanks typically feature the manually determined tokens, but not the original, untokenized material. Jurish and Würzner (2013) approach this problem with de-tokenization rules and manual correction in particular cases.

We built a test corpus based on the SMULTRON treebanks (Gustafson-Capková et al., 2007; Volk et al., 2010), for which we have the original texts. Using those treebanks for comparison with the output of our tokenizer is problematic, however, since the tokenization guidelines that our tokenizer implements originate, among others, from the experiences gained in the creation of these very treebanks. Notwithstanding the expected bias, we select those sentences from the treebanks which we can identify in the original texts (1528 unique sentences in total) by simply ignoring any whitespace characters.[5] This selection comprises 1173 sample sentences in four languages: English (67), German (388), Spanish (184) and Swedish (534).

[5] From our point of view, a tokenizer is only allowed to remove input characters, not to alter them.

An initial tokenization with Cutter yields 59 sentences with errors (0.5 %).[6] In more than half of these cases, the tokenization in the treebank deviates from the pattern stipulated by the tokenization guidelines. Another frequent issue is that the textual source does not correspond to the text in the original document, which comprises missing or superfluous whitespace and the representation of images as characters (see Figure 3).

[6] Only the first error found in a sentence is counted.

[Figure 3: The 'up' and 'down' key symbols in this detail from a DVD player manual are represented as a circumflex diacritic and the letter 'v', respectively: eller med ^/v-knapparna på fjärkontrollen.]

We correct apparent anomalies in the input sentences and remove sentences that cannot regularly be represented as text; these all involve key symbols, such as the one shown in Figure 3. That way, we obtain a small tokenization gold standard. It comprises 1165 sentences in the aforementioned four languages.

When we tokenize the tests obtained from the gold standard sample sentences with our tokenizer, we still see an error rate of 1 %. By adjusting the existing rules to include borderline cases (e.g., including the ± sign in the definition of numbers), we could make all tests pass. The error rate of two other popular tokenizers, the ones in the NLTK[7] and spaCy[8] NLP toolkits, is at approximately 12 %.[9] The comparatively high error rate is due to both real tokenization errors, such as splitting URLs, XML tags and ordinal numbers in German,[10] and, of course, debatable tokenization rules. Should the German adjective 100%ige be left as one token (spaCy), or be split into two tokens (100, %ige; Cutter) or three tokens (100, %, ige; NLTK)?

[7] https://www.nltk.org/
[8] https://spacy.io/
[9] The spaCy tokenizer has only been evaluated on English, German and Spanish sentences as it has no model for Swedish.
[10] The spaCy tokenizer also consistently splits compound adjectives and nouns in English (e.g., low-cost, medium-voltage, break-even), while the NLTK tokenizer alters all quotation marks.

7 Features and Limitations

Rule-based approaches in natural language processing have widely been replaced by machine learning approaches, since the latter are capable of handling unanticipated situations by abstracting from observed patterns. For the tokenization of standard texts in well-resourced languages, machine learning approaches such as (Jurish and Würzner, 2013) might have enough data from which to learn those patterns. For particular text categories and low-resourced languages, however, providing the algorithm with sufficient training data will require a substantial effort.

In our work with corpora in several languages, the best approach turned out to be an iterative one. For a new language, we start with an empty rule set (in addition to the language-independent rules) and apply it to the untokenized texts. We subsequently generate unit tests from the errors that surfaced in manual inspection, which we then address by defining corresponding patterns (also consulting grammar books and treebanks, if available). A few iterations of this procedure lead to a collection of rules and tests, and lower the tokenization error rate considerably.

For languages that have treebanks or sentence-segmented corpora, we can automatically extract sentence-initial words and add them to our list if they do not appear with a capital letter in other positions; we might also want to filter for closed word classes (i.e., prepositions, pronouns, etc.) here. If no resource for gathering abbreviations is available, we need to search for abbreviations in the given texts.

In contrast to machine learning approaches, erroneous tokenization decisions in our system can always be traced back to a particular pattern, which facilitates a quick remedy. The language model, however, is inevitably incomplete and requires testing and adaptation ahead of its application to new text variants.

As mentioned above, some tokenization decisions require a deeper understanding than what a sequence of characters can provide. This is, for instance, the case when abbreviations (without the period) coincide with another word. We are only aware of German examples, such as Abt. for Abteilung 'department' vs. Abt 'abbot', or Art. for Artikel 'article' vs. Art 'kind'. Even if we address this problem by excluding those words from the abbreviation list and matching them with a dedicated rule that expects a succeeding number (e.g., in Abt. 3 'in department 3'; nach Art. 25 'pursuant to article 25'), we can still come up with cases that cannot be solved without dictionary lookups or parsing. Compare, for instance:

1. Wir trafen den Abt. Bergbahnen sind seine Leidenschaft.
'We met the abbot. Mountain railways are his passion.'

2. Wir sahen den Sprecher der Abt. Bergbahnen und Wanderwege.
'We saw the spokesman of the dept. of mountain railways and hiking trails.'

Splitting undirected quotation marks (") results in an information loss.[11] After tokenization, it can no longer be inferred whether such a quotation mark signals the beginning or the end of a quotation. A more careful tokenizer needs to preserve the information whether the quotation mark was split from the previous or from the following word. Our tokenizer provides the option to alternatively return or suppress whitespace tokens.[12]

[11] The same is true for typewriter apostrophes (') as a replacement for matching single quotation marks.
[12] A sample sentence with whitespace tokens looks like this: Suot , , il , , titel , , " , vacanzas , , e , , cultura , " , , as , , prouva , , d' , eruir , , la , , funcziun , , da , , la , , lingua , , e , , cultura , , rumauntscha , , per , , il , , turissem , , i , 'l , , Grischun , .

8 Conclusions and Future Development

To overcome ambiguous cases, we propose to extend the shallow processing of the tokenizer by a syntactic parser, to select the more likely tokenization. To this end, tokenization has to be performed several times with alternating rules. Parsing likelihood as decision maker is only required if different results are obtained. For low-resourced languages where no parser exists, a heuristic based on the identification of finite verb forms could suffice.

Rules are currently organized in sets, one for each language and one for language-independent rules. Each set comprises different stages, which are used to interconnect different sets. Corresponding quotation marks, for instance, need to be identified before any token in between them splits the sentence into smaller parts. Language-specific date expressions (e.g., with ordinal numbers expressed as digits plus a period) need to be processed before the language-independent identification of numbers takes place.

We think that instead of a limited list of stages, a more dynamic data structure would be beneficial. We already know which rules interfere (e.g., numbers with spaces vs. spaces as separators), but this is not explicitly reflected in the data. If we were to reorganize the tokenization rule sets by means of a "before" relation between pairs of rules, we could build a rule dependency graph, which, serialized, would define the order of rules to apply. In the case of code-switching sentences, the given order of languages would be decisive wherever no order is enforced by that graph.

We have used the tokenizer ourselves in a number of projects. It supports several European languages, including Romansh as a low-resourced language, and more languages are in preparation. The tokenizer and all our language models are freely available.[13] We also provide a web demo and a tokenization web service.

[13] http://pub.cl.uzh.ch/purl/cutter
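The "before" relation and rule dependency graph proposed above can be sketched with a topological sort. The rule names and the particular orderings below are invented for illustration; a real rule set would declare its own constraints.

```python
from graphlib import TopologicalSorter

# Hypothetical "before" relation: each rule maps to the set of rules
# that must be applied before it. Serializing the graph yields the
# order in which the rules are applied.
before = {
    "quotation_pairs": set(),                  # must run before inner splits
    "date_expressions": {"quotation_pairs"},
    "numbers_with_spaces": {"date_expressions"},
    "whitespace_split": {"numbers_with_spaces", "quotation_pairs"},
}

order = list(TopologicalSorter(before).static_order())
print(order)
```

A cycle in the declared constraints would raise `graphlib.CycleError`, which would flag contradictory rule orderings just as a failing unit test flags contradictory guidelines.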
Acknowledgments

This research was supported by the Swiss National Science Foundation under grant 105215_146781/1 through the project "SPARCLING – Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation".

References

Cruz Díaz, Noa Patricia and Manuel Jesús Maña López (Sept. 2015). "An Analysis of Biomedical Tokenization: Problems and Strategies". In: Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis. Lisbon, Portugal: Association for Computational Linguistics, pp. 40–49.

Graën, Johannes (2018). "Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning". PhD thesis. University of Zurich.

Gustafson-Capková, Sofia, Yvonne Samuelsson, and Martin Volk (2007). SMULTRON – The Stockholm MULtilingual parallel Treebank.

Habert, Benoit, Gilles Adda, M. Adda-Decker, P. Boula de Marëuil, S. Ferrari, O. Ferret, G. Illouz, and P. Paroubek (1998). "Towards tokenization evaluation". In: Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC). Vol. 98, pp. 427–431.

Hazel, Philip (1997). PCRE (Perl-compatible regular expressions). URL: https://www.pcre.org/.

He, Ying and Mehmet Kayaalp (2006). A Comparison of 13 Tokenizers on MEDLINE. Tech. rep. U.S. National Library of Medicine, Lister Hill National Center for Biomedical Communications.

ISO 80000-1 (Nov. 2009). ISO 80000-1: Quantities and units – Part 1: General. Ed. by ISO/TC 12 Technical Committee.

Jurish, Bryan and Kay-Michael Würzner (2013). "Word and Sentence Tokenization with Hidden Markov Models". In: Journal for Language Technology and Computational Linguistics 28.2, pp. 61–83.

Kiss, Tibor and Jan Strunk (2006). "Unsupervised Multilingual Sentence Boundary Detection". In: Computational Linguistics 32.4, pp. 485–525.

Manning, Christopher, Tim Grow, Teg Grenager, Jenny Finkel, and John Bauer (Aug. 2018). Stanford Tokenizer. URL: http://nlp.stanford.edu/software/tokenizer.shtml.

The Unicode Consortium (2017). The Unicode Standard, Version 10.0.

Volk, Martin, Anne Göhring, Torsten Marek, and Yvonne Samuelsson (2010). SMULTRON (version 3.0) – the Stockholm MULtilingual parallel TReebank. An English-French-German-Spanish-Swedish parallel treebank with sub-sentential alignments.