Context based number normalization using skip-chain conditional random fields

Linas Balčiūnas
Vytautas Magnus University / Kaunas University of Technology
Kaunas, Lithuania
linasb20@gmail.com

Abstract—Verbalizing numeric text tokens is a required task for various speech-related applications, including automatic speech recognition and text-to-speech synthesis. In morphologically rich languages, such conversion involves predicting implicit morphological properties of a corresponding numeral. In this paper, we propose first-order skip-chain Conditional Random Field (CRF) models and various preprocessing techniques to leverage different contextual information. We show that our best skip-chain CRF models achieve over 80% accuracy on a set of 2000 Lithuanian sentences.

Keywords—Number normalization, text normalization, conditional random field, natural language processing

© 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

I. INTRODUCTION

Number normalization is the task of replacing numeric tokens in a sentence with numerals (word tokens) using an appropriate inflected form of the numeral. Number normalization usually involves disambiguation, as the same numeric token needs to be mapped into different word forms depending on the context (e.g. '5 vaikai eina' → 'Penki vaikai eina' (five children are going) vs. '5 vaikų nėra' → 'Penkių vaikų nėra' (five children are missing)). Although number normalization can be considered part of the broader task of text normalization, formulating it as a separate task can be beneficial, since the process of number normalization may be quite complex depending on the morphological features of a language. In this paper, we describe the process of building and evaluating a number normalization system for Lithuanian. However, some of the techniques and models are language-independent and might be applied to other languages.

In Lithuanian, for example, the number '5' may represent any of 63 different words depending on sentence context. Predicting this relationship directly is rather difficult and would require a huge data set to properly learn numeral grammar. A simpler approach is to predict a Part of Speech (POS) tag and then generate the numeral accordingly. The POS tag contains all the morphological information needed by a language-specific, grammar-based numbers-to-words system [1]. This way, the possible result classes are shared across all numbers, and predicting the POS tag can be formulated as sequence labeling rather than a sequence-to-sequence task, because of the one-to-one relationship.

II. RELATED WORKS

As far as we know, there are no published works or publicly available applications performing number normalization based on sentence context for Lithuanian. However, many languages deal with similar morphological disambiguation problems. Russian and Lithuanian numbers share many morphological properties, including types (cardinal, ordinal), genders and cases. For the Russian language, there is existing research and there are systems for general text normalization, hand-written language-general grammar [2], [3], Recurrent Neural Networks (RNN) [4], and number normalization [5], [6], [7].

III. PREPROCESSING

A. Data

A small text corpus for training and evaluating text normalization models was collected and manually annotated. It consists of 1955 sentences containing 3143 numeric tokens. The sentences were inspected by linguists, who suggested a numeral word form as an appropriate replacement for every numeric token. In some ambiguous cases, a few reasonable alternatives were proposed. Some ambiguities were related to the use of pronominal numeral forms (e.g. '15 savaitė' (15th week) → 'penkiolikta savaitė' (non-pronominal form) or 'penkioliktoji savaitė' (pronominal form)). Other ambiguities were related to numeral case (e.g. '2019 vasarį' (2019 February) → 'du tūkstančiai devynioliktųjų vasarį' or 'du tūkstančiai devynioliktaisiais vasarį'). All suffixes that represented a 'normalization hint' were eliminated from the data set (e.g. '2019-aisiais' was replaced by '2019'). This made the training subset of the corpus more interesting for the training algorithm, increased the complexity of the normalization task, and reduced the normalization accuracy estimates on the test subset of the corpus.

The sentences of the corpus were pre-processed by a Hidden Markov Model (HMM) based POS tagger [8]. Every text token was labeled with a so-called 'detailed' (or composite) morphological label that contained the following information:
• Lemma
• Part of speech (Noun, Verb, Adjective, ...)
• Case (Nominative, Genitive, ...)
• Gender (Feminine, Masculine)
• Number (Singular, Plural)

Since important prediction decisions are based on the tagger-provided POS tags and lemmas, the morphological annotations were hand-corrected in the training-testing data set to ensure optimal performance. When using morphological analysis data, it is beneficial to divide POS tags into sub-labels in order to build more abstract grammar rules and filter out redundant information.

B. Number grammar

Similar, more saturated Natural Language Processing (NLP) sequence labeling tasks, such as POS tagging and Named Entity Recognition (NER), do not require hand-written language-specific grammar rules to achieve state-of-the-art performance [9]. However, Long Short-Term Memory neural networks coupled with Conditional Random Fields (LSTM-CRF), a data-driven sequence labeling approach, proved insufficient to achieve the desired number normalization quality given our training-data availability limitations. To efficiently leverage a small data set, morphosyntactic knowledge should be exploited to craft language- and task-specific grammar rules. All rules are constructed as conditional functions without any prior weighting. We use the following techniques to approximate and generalize the relationship between a number and its sentence context:
• Lemma classification (replacing certain word lemmas with a dedicated class name, for example, month names with '%Month')
• Number classification (see Table I)
• Verb classification (based on the syntactic feature of a verb governing the case of other parts of speech)
• Syntactic linking (see Section V. Long-Distance Dependencies)

TABLE I. NUMBER CLASSIFICATION EXAMPLE

Num.  | Roman | Int | Digit count | Req.Gen.* | Req.Sing.*
21    | -     | +   | 2           | -         | +
113.5 | -     | -   | 3           | +         | -
IV    | +     | +   | 1           | -         | -

* 'Requires Genitive' and 'Requires Singular' signify that the countable noun of a certain number must be of Genitive case or Singular number.

IV. MODELS

In this paper, mainly variations of Conditional Random Fields (CRF) are explored, since CRF achieved better baseline performance than its neural version (LSTM-CRF) and appears to be more suitable for our particular data set and grammar rule set.

A. Sub-Label models

To create a single sequence tagging model for this task, we would need 79 detailed morphological labels (different combinations of the sub-labels shown in Table II) corresponding to the output classes. With the currently available corpus, this would cause significant data scarcity problems: there are no training examples for a considerable number of classes, and many others are barely represented. A good way to address this data scarcity problem is to create three independent CRF models for Case, Type and Gender prediction and to combine their predictions at a later stage. This is preferable since there is no direct dependency relationship between these morphological categories, and operating with sub-labels allows creating a more abstract rule set. It is worth noting, though, that sub-label dependencies have been proven useful for NLP sequence labeling using CRF [10] in combination with composite labels. Additionally exploiting composite label dependencies might be beneficial for number normalization as well, and is worth exploring in future research.

TABLE II. MORPHOLOGICAL SUB-LABELS

Case           | Type/Number                 | Gender
Nominative     | Cardinal                    | Feminine
Genitive       | Ordinal singular            | Masculine
Dative         | Ordinal plural              | Not applicable
Accusative     | Ordinal definitive singular |
Instrumental   | Ordinal definitive plural   |
Locative       | Cardinal multiple           |
Not applicable | Month*                      |

* This class is designed for a number that could be substituted with a month name, for example, '2019-02-03' and '2019-February-03'.

B. Skip-Chain CRF

The linear-chain structure is usually used for sequence labeling with CRF, since additionally modeling non-linear relationships requires complicated inference algorithms and prior specification of such dependencies [11], [12]. For number normalization, a simplified version of the Skip-Chain Conditional Random Field (S-CRF) can be used, as shown in the gender prediction model comparison in Figure 1 and Figure 2 (for readability, we only show an English gloss of a Lithuanian model sentence). Both graphs are representations of Viterbi algorithm decoding (the same structure is used for encoding). Circles correspond to nodes and arrows to transitions. The weight of a node or transition is calculated as the sum of its conditional feature-set weights (unigrams for nodes and bigrams for transitions). 'f' and 'm' denote the 'feminine' and 'masculine' genders, while '0' represents the class of non-number tokens, which are not changed by the normalization task. The blue path is the correct path selected by the Viterbi algorithm.

[Fig. 1. First-Order Linear-Chain CRF. Label lattice over the gloss sentence '5 - 7 red apples' with candidate labels 0, f, m at every token.]

[Fig. 2. First-Order Skip-Chain CRF. The same sentence with label nodes only at the number tokens '5' and '7'.]

In a Linear-Chain CRF (L-CRF), most bigram features are useless since they connect to non-number tokens (in Figure 1 none of the transition weights are significant). This means we effectively have a zeroth-order CRF. The transitions that are actually important are those between numbers. To implement such dependencies, we make two sequences - the full (original) sequence and the skip sequence (numbers only). We build unigram features only for number tokens, but from the full sequence. This way, our unigram features exactly match those of the linear-chain model. Next, bigram features are built from the skip sequence. For encoding and decoding we use the skip sequence as well, since we do not build any feature functions for non-number tokens. Skip-chain models, as described above, have unaltered unigram and improved bigram function sets (for number tokens), while being significantly faster (see the graph simplification shown in Figure 1 and Figure 2). Our implementation uses a modified version of the CRFSharp toolkit [13].

V. LONG-DISTANCE DEPENDENCIES

Although the skip-chain structure quite reliably models some important long-distance relationships, it is not able to capture distant dependencies between number and non-number tokens (e.g. in '3 didžiųjų mobiliojo ryšio operatorių' → 'trijų mobiliojo ryšio operatorių' (three major mobile network operators), the case of the number '3' is determined by the case of the word 'operatorių' (operators)). CRF is generally unable to leverage such features and requires either hybridization, such as LSTM-CRF, or additional pre-processing. We propose identifying position- and distance-independent relationships using an ad-hoc set of linkage rules and formulating the perceived syntactic links as conditional functions of the CRF. For the Lithuanian language, we discern three directly related parts of speech (Noun, Verb, Preposition) in the number normalization task. For each, we use a different set of linkage rules to identify the tokens related to every number in the sentence, effectively performing partial syntactic analysis. To link prepositions and verbs to numbers, our rules rely solely on the morphological labels provided by the POS tagger. For nouns, the task of linking can be more precisely formulated as the identification of the noun which represents an object or quantity being counted by some number in the sentence. This is extremely important, since the countable noun carries crucial morphological information. For example, with successful identification, we no longer need to predict the gender of a numeral, since it is directly determined by the noun.

A. Countable noun identification

We determine the most likely countable noun in a two-step process. First, for a given numeric token d, we select all potentially countable noun tokens {n_i} according to the ad-hoc set of linkage rules for nouns. We cannot make an educated choice among the selected nouns on the basis of the available morphosyntactic annotation, since noun morphology does not have the property of 'countability'. To discriminate among potential countable nouns, semantic analysis is needed. We need to rate the set of selected nouns {n_i} according to some 'countability' measure ξ that depends on the numeric token d being normalized, and select the noun n_best with the highest ξ(d, n_i) rating:

n_best = arg max_i ξ(d, n_i)

Suppose that we have vector embeddings v(n) ∈ R^D for every noun n, obtained by an algorithm such as 'word2vec' [14]. Suppose also that we have designed a mapping φ that maps every numeric token d into a vector φ(d) ∈ R^D, such that φ(d) is the representative embedding of the set of nouns that are frequently counted by the numeric token d. If both assumptions hold, we can rate the set of potential countable nouns by estimating the cosine similarity between each selected noun and the corresponding representative vector, i.e.

ξ(d, n_i) = cosine-similarity(φ(d), v(n_i))    (1)

We have tested a few different approaches to designing the above-mentioned mapping φ(d). We searched a large unannotated text corpus for adjacent number-noun co-occurrences and made noun frequency lists for every numeric token found (around 350 thousand co-occurrences). The information present in a frequency list can be aggregated into a single vector by estimating the weighted average of the noun embeddings making up that list. Thus a representative (or central) embedding vector can be obtained for every numeric token. Although this tabular mapping from numeric tokens to representative vectors can be used in (1), it has serious limitations: the table contains many unreliable vectors for rare numbers, because of the lack of co-occurrences in the unannotated corpus. To circumvent the limitations of the tabular mapping, we used a Neural Network (NN) approach to build a continuous co-occurrence model. We built two different neural networks: one with a single input (corresponding to the mathematical value of the numeric token) and one with 7 inputs, corresponding to the decomposition of the numeric token into sub-parts (thousands, hundreds, ...) and including number features similar to Table I. The NN had 200 output units.

The evaluation of these models is shown in Table III. The baseline performance is obtained by the simple rule "take the first potentially countable noun to the right of a numeric token". Accuracy is measured using the whole CRF training data, extracting the situations where a choice between two or more nouns (2.41 on average) is needed.

TABLE III. COUNTABLE NOUN LINKING

Method          | Accuracy
Select first    | 68.77
1-input NN      | 84.11
7-input deep NN | 87.40

VI. EVALUATION

We evaluate the models with 5-fold cross-validation (except for countable noun identification in Section V, since its training and testing data sets were obtained from different sources). The accuracy of the different models is shown in Table IV. Combined accuracy estimates the accuracy of all three models together; the combined answer is considered correct if all three sub-labels are correct.

It is worth noting that our model is focused on grammatically correct, 'spoken' number normalization. This might not be desirable for systems like text-to-speech synthesis, hence a more standardized approach can be chosen. For the Lithuanian language, the numeral definiteness property could be removed from the prediction model, since it is not strictly constrained by grammar. This would increase language correctness and improve the Type prediction model and the combined accuracy, as shown in the last line of Table IV (the best performing model without the definiteness property).

TABLE IV. EVALUATION

                     | Case  | Type  | Gender | Combined
L-CRF                | 77.19 | 89.00 | 94.79  | 67.52
S-CRF                | 78.89 | 89.82 | 95.17  | 69.43
S-CRF+class.*        | 83.81 | 93.64 | 95.17  | 76.49
S-CRF+class.+syn.**  | 86.05 | 94.01 | 98.51  | 80.91
without Definiteness | 86.05 | 96.82 | 98.51  | 83.08

* classification, see Section III. Preprocessing
** syntactic analysis, see Section V. Long-Distance Dependencies

The accuracies above represent lower bounds on real-world number normalization performance. Firstly, in certain situations some sub-label prediction mistakes are irrelevant for numeral generation. For example, both '5, Cardinal, Genitive, Feminine' and '5, Cardinal, Genitive, Masculine' generate the same word representation, 'penkių'. Secondly, real-world sentences often contain suffixes (e.g. 'Kovo 11-ąją' → 'Kovo vienuoliktąją' (March 11th)) that either offer an unambiguous hint that solves the number normalization problem, or at least provide most of the needed morphological information, which can be used to correct prediction mistakes.

VII. CONCLUSIONS

In this paper, we describe a number normalization disambiguation model, which is needed to develop a context-dependent number-to-words system. The sequence-labeling approach allows us to normalize countable abbreviations and symbols (next to a number) effortlessly, since the morphological form of the countable noun can be extracted from the predicted label (e.g. 'nuo 5%' (from 5%) → 'nuo penkių procentų'). Our implementation based on this model is publicly available [15] and will in the future be integrated into a full Lithuanian text normalization system.

Number normalization errors are often directly dependent on morphological analysis mistakes, and we are currently working on improving both the vocabulary-grammar and the disambiguation sides of Lithuanian POS tagging to consequently increase number normalization accuracy.

Currently, we use the 'word2vec' [14] algorithm trained on a relatively small text corpus to produce word embeddings. Various improvements have since been made in encoding semantic information into vectors [16], [17], and using a more advanced method and a larger corpus would likely improve our model's performance.

Our achieved number normalization accuracy could be further improved by expanding the annotated training data, since a considerable number of errors are a direct result of data scarcity. Our approach generally lacks semantic and syntactic language understanding, though, so performing full syntactic sentence analysis in the preprocessing stage would be highly beneficial.

ACKNOWLEDGMENT

This research was supported by the project "Semantika 2" (No. 02.3.1-CPVA-V-527-01-0002). Special gratitude goes to our colleagues Lina Majauskaitė and Dovilė Stukaitė, who helped us in collecting and annotating the text corpus.

REFERENCES

[1] V. Dadurkevičius. dadurka/number-to-words-lt. [Online]. Available: https://github.com/dadurka/number-to-words-lt
[2] K. Wu, K. Gorman, and R. Sproat. (2016) Minimally supervised written-to-spoken text normalization.
[3] M. Wróbel, J. T. Starczewski, and C. Napoli, "Handwriting recognition with extraction of letter fragments," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2017, pp. 183-192.
[4] R. Sproat and N. Jaitly. (2016) RNN approaches to text normalization: A challenge.
[5] T. Kapuściński, R. K. Nowicki, and C. Napoli, "Comparison of effectiveness of multi-objective genetic algorithms in optimization of invertible s-boxes," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2017, pp. 466-476.
[6] K. Gorman and R. Sproat, "Minimally supervised number normalization," Transactions of the Association for Computational Linguistics, vol. 4, pp. 507-519, 2016. [Online]. Available: https://www.transacl.org/ojs/index.php/tacl/article/view/897/213
[7] T. Kapuściński, R. K. Nowicki, and C. Napoli, "Application of genetic algorithms in the construction of invertible substitution boxes," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2016, pp. 380-391.
[8] [Online]. Available: http://donelaitis.vdu.lt/main_helper.php?id=4&nr=7_2
[9] A. Akbik, D. Blythe, and R. Vollgraf, "Contextual string embeddings for sequence labeling," in Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, 2018, pp. 1638-1649. [Online]. Available: http://aclweb.org/anthology/C18-1139
[10] M. Silfverberg, T. Ruokolainen, K. Lindén, and M. Kurimo, "Part-of-speech tagging using conditional random fields: Exploiting sub-label dependencies for improved accuracy," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, 2014, pp. 259-264. [Online]. Available: http://aclweb.org/anthology/P14-2043
[11] M. Galley, "A skip-chain conditional random field for ranking meeting utterances by importance," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '06. Stroudsburg, PA, USA: Association for Computational Linguistics, 2006, pp. 364-372. [Online]. Available: http://dl.acm.org/citation.cfm?id=1610075.1610126
[12] J. Liu, M. Huang, and X. Zhu, "Recognizing biomedical named entities using skip-chain conditional random fields," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, ser. BioNLP '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 10-18. [Online]. Available: http://dl.acm.org/citation.cfm?id=1869961.1869963
[13] Z. Fu. zhongkaifu/crfsharp. [Online]. Available: https://github.com/zhongkaifu/CRFSharp
[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3781
[15] [Online]. Available: http://prn509.vdu.lt:9080/
[16] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[17] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.
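APPENDIX: ILLUSTRATIVE SKETCHES

The number classification of Table I (Section III.B) can be sketched as a small feature extractor. This is our illustrative approximation, not the paper's code: the function name is ours, and the 'Requires Genitive'/'Requires Singular' heuristics are a deliberately simplified stand-in for the actual Lithuanian agreement rules; they merely reproduce the three example rows of Table I.

```python
import re

def number_features(token: str) -> dict:
    """Derive the Table I feature columns for a numeric token (sketch)."""
    # Roman numeral detection and valuation (subtractive notation).
    roman = bool(re.fullmatch(r"[IVXLCDM]+", token))
    if roman:
        values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
        total, prev = 0, 0
        for ch in reversed(token):
            v = values[ch]
            total += -v if v < prev else v
            prev = max(prev, v)
        value = float(total)
    else:
        value = float(token.replace(",", "."))
    is_int = value == int(value)
    last, last_two = int(value) % 10, int(value) % 100
    # Simplified approximation of numeral-noun agreement (assumption, not
    # the paper's rule set): fractions and integers ending in 0 or 11..19
    # govern the genitive; integers ending in 1 (but not 11) take a
    # singular noun.
    req_gen = (not is_int) or last == 0 or 11 <= last_two <= 19
    req_sing = is_int and last == 1 and last_two != 11
    return {"roman": roman, "int": is_int,
            "digits": len(str(int(value))),
            "req_gen": req_gen, "req_sing": req_sing}
```

With these heuristics, the tokens '21', '113.5' and 'IV' reproduce the three rows of Table I.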
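The skip-chain construction of Section IV.B (two sequences: the full sequence for unigram features of number tokens, and the numbers-only skip sequence for bigram features) can be sketched as follows. The token representation and feature-string format are our own illustration, not the CRFSharp implementation used in the paper.

```python
def make_sequences(tokens):
    """Build the full (original) sequence and the skip sequence (numbers
    only), remembering each skip element's index in the full sequence."""
    full = [t for t, _ in tokens]
    skip = [(i, t) for i, (t, is_num) in enumerate(tokens) if is_num]
    return full, skip

def unigram_features(full, i, window=2):
    """Unigram features for a number token are drawn from the FULL
    sequence, so they exactly match those of a linear-chain model."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        word = full[j] if 0 <= j < len(full) else "<pad>"
        feats.append(f"U[{off}]={word}")
    return feats

def bigram_features(skip, k):
    """Bigram features connect consecutive NUMBER tokens (the skip
    sequence), capturing the number-number dependency that a linear
    chain wastes on adjacent non-number tokens."""
    if k == 0:
        return []
    (_, prev), (_, cur) = skip[k - 1], skip[k]
    return [f"B={prev}|{cur}"]
```

For the gloss sentence '5 - 7 red apples' of Figures 1 and 2, the skip sequence contains only '5' and '7', so the single bigram feature links the two numbers while the unigram context of '7' still sees '-', 'red' and 'apples'.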
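The countable noun rating of Section V.A, Eq. (1), combined with the tabular mapping φ(d) (a frequency-weighted average of co-occurring noun embeddings), can be sketched as below. Function names, toy embeddings and the plain-list vector format are assumptions for illustration; the paper's actual vectors come from 'word2vec' [14].

```python
import math

def representative_embedding(freq, embeddings):
    """Tabular phi(d): frequency-weighted average of the embeddings of
    nouns co-occurring with numeric token d."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = sum(freq.values())
    for noun, count in freq.items():
        for i, x in enumerate(embeddings[noun]):
            vec[i] += (count / total) * x
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_countable_noun(phi_d, candidates, embeddings):
    """Eq. (1): n_best = argmax_i cosine(phi(d), v(n_i))."""
    return max(candidates, key=lambda n: cosine(phi_d, embeddings[n]))
```

Given a frequency list in which a numeric token mostly co-occurs with 'apples'-like nouns, the rating prefers a semantically 'countable' candidate over an unrelated one, which is exactly the disambiguation step the linkage rules alone cannot perform.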