Context based number normalization using skip-chain conditional random fields

Linas Balčiūnas
Vytautas Magnus University / Kaunas University of Technology
Kaunas, Lithuania
linasb20@gmail.com

Abstract—Verbalizing numeric text tokens is a required task for various speech-related applications, including automatic speech recognition and text-to-speech synthesis. In morphologically rich languages, such conversion involves predicting implicit morphological properties of a corresponding numeral. In this paper, we propose first-order skip-chain Conditional Random Field (CRF) models and various preprocessing techniques to leverage different contextual information. We show that our best skip-chain CRF models achieve over 80% accuracy on a set of 2000 Lithuanian sentences.

Keywords—Number normalization, text normalization, conditional random field, natural language processing

© 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

I. INTRODUCTION

Number normalization is the task of replacing numeric tokens in a sentence with numerals (word tokens) using an appropriate inflected form of the numeral. Number normalization usually involves disambiguation, as the same numeric token needs to be mapped into different word forms depending on the context (e.g. '5 vaikai eina' → 'Penki vaikai eina' (five children are going) vs. '5 vaikų nėra' → 'Penkių vaikų nėra' (five children are missing)). Although number normalization can be considered part of the broader task of text normalization, formulating it as a separate task can be beneficial, since the process of number normalization may be quite complex depending on the morphological features of a language. In this paper, we describe the process of building and evaluating a number normalization system for Lithuanian. However, some of the techniques and models are language-independent and might be applied to other languages.

In Lithuanian, for example, the number '5' may represent any of 63 different words depending on sentence context. Predicting this relationship directly is rather difficult and would require a huge data set to properly learn numeral grammar. A simpler approach is to predict a Part of Speech (POS) tag and then generate the numeral accordingly. The POS tag contains all the morphological information needed by a language-specific, grammar-based numbers-to-words system [1]. This way, the possible result classes are shared across all numbers, and predicting the POS tag can be formulated as sequence labeling rather than a sequence-to-sequence task, because of the one-to-one relationship.

II. RELATED WORKS

As far as we know, there are no published works or publicly available applications performing number normalization based on sentence context for Lithuanian. However, many languages deal with similar morphological disambiguation problems. Russian and Lithuanian numbers share many morphological properties, including types (cardinal, ordinal), genders and cases. For the Russian language, there is existing research and there are systems for general text normalization, hand-written language-general grammar [2], [3], Recurrent Neural Networks (RNN) [4], and number normalization [5], [6], [7].

III. PREPROCESSING

A. Data

A small text corpus for training and evaluating text normalization models was collected and manually annotated. It consists of 1955 sentences containing 3143 numeric tokens. The sentences were inspected by linguists, who suggested a numeral word form as an appropriate replacement for every numeric token. In some ambiguous cases, a few reasonable alternatives were proposed. Some ambiguities were related to the use of pronominal numeral forms (e.g. '15 savaitė' (15th week) → 'penkiolikta savaitė' (non-pronominal form) or 'penkioliktoji savaitė' (pronominal form)). Other ambiguities were related to numeral case (e.g. '2019 vasarį' (2019 February) → 'du tūkstančiai devynioliktųjų vasarį' or 'du tūkstančiai devynioliktaisiais vasarį'). All suffixes that represented a 'normalization hint' were eliminated from the data set (e.g. '2019-aisiais' was replaced by '2019'). This made the training subset of the corpus more interesting for the training algorithm, increased the complexity of the normalization task, and reduced the normalization accuracy estimates on the test subset of the corpus.

The sentences of the corpus were pre-processed by a Hidden Markov Model (HMM) based POS tagger [8]. Every text token was labeled with a so-called 'detailed' (or composite) morphological label that contained the following information:
• Lemma
• Part of speech (Noun, Verb, Adjective, ...)
• Case (Nominative, Genitive, ...)
• Gender (Feminine, Masculine)
• Number (Singular, Plural)

Since important prediction decisions are based on the tagger-provided POS tags and lemmas, the morphological annotations were hand-corrected in the training-testing data set to ensure optimal performance. When using morphological analysis data, it is beneficial to divide POS tags into sub-labels in order to build more abstract grammar rules and filter out redundant information.

B. Number grammar

Similar, more saturated Natural Language Processing (NLP) sequence labeling tasks, such as POS tagging and Named Entity Recognition (NER), do not require hand-written language-specific grammar rules to achieve state-of-the-art performance [9]. However, Long Short-Term Memory neural networks coupled with Conditional Random Fields (LSTM-CRF), a data-driven sequence labeling approach, proved insufficient to achieve the desired number normalization quality given our training-data availability limitations. To efficiently leverage a small data set, morphosyntactic knowledge should be exploited to craft language- and task-specific grammar rules. All rules are constructed as conditional functions without any prior weighting. We use the following techniques to approximate and generalize the relationship between a number and its sentence context:
• Lemma classification (replacing certain word lemmas with a dedicated class name, for example, month names with '%Month')
• Number classification (see Table I)
• Verb classification (based on the syntactic feature of a verb governing the case of other parts of speech)
• Syntactic linking (see Section V. Long-Distance Dependencies)

TABLE I. NUMBER CLASSIFICATION EXAMPLE

Num.  | Roman | Int | Digit count | Req.Gen.* | Req.Sing.*
21    | -     | +   | 2           | -         | +
113.5 | -     | -   | 3           | +         | -
IV    | +     | +   | 1           | -         | -

* 'Requires Genitive' and 'Requires Singular' signify that the countable noun of a certain number must be of Genitive case or Singular number.

IV. MODELS

In this paper, mainly variations of Conditional Random Fields (CRF) are explored, since CRF achieved better baseline performance than its neural version (LSTM-CRF) and appears to be more suitable for our particular data set and grammar rule set.

A. Sub-Label models

To create a single sequence tagging model for this task, we would need 79 detailed morphological labels (different combinations of the sub-labels shown in Table II) corresponding to the output classes. With the currently available corpus, this would cause significant data scarcity problems: there are no training examples for a considerable number of classes, and many others are barely represented. A good way to address this data scarcity problem is to create three independent CRF models for Case, Type and Gender prediction and to combine their predictions at a later stage. This is preferable since there is no direct dependency relationship between these morphological categories, and operating with sub-labels allows creating a more abstract rule set. It is worth noting, though, that sub-label dependencies have been proven useful for NLP sequence labeling using CRF [10] in combination with composite labels. Additionally exploiting composite label dependencies might be beneficial for number normalization as well, and is worth exploring in future research.

TABLE II. MORPHOLOGICAL SUB-LABELS

Case           | Type/Number                 | Gender
Nominative     | Cardinal                    | Feminine
Genitive       | Ordinal singular            | Masculine
Dative         | Ordinal plural              | Not applicable
Accusative     | Ordinal definitive singular |
Instrumental   | Ordinal definitive plural   |
Locative       | Cardinal multiple           |
Not applicable | Month*                      |

* This class is designed for a number that could be substituted with a month name, for example, '2019-02-03' and '2019-February-03'.

B. Skip-Chain CRF

The linear-chain structure is usually used for sequence labeling with CRF, since additionally modeling non-linear relationships requires complicated inference algorithms and prior specification of such dependencies [11], [12]. For number normalization, a simplified version of the Skip-Chain Conditional Random Field (S-CRF) can be used, as shown in the gender prediction model comparison in Figure 1 and Figure 2 (for readability, we only show an English gloss of a Lithuanian model sentence). Both graphs are representations of Viterbi algorithm decoding (the same structure is used for encoding). Circles correspond to nodes and arrows to transitions. The weight of a node or transition is calculated as the sum of its conditional feature-set weights (unigrams for nodes and bigrams for transitions). 'f' and 'm' denote the 'feminine' and 'masculine' genders, while '0' represents the class of non-number tokens, which are not changed by the normalization task. The blue path is the correct path selected by the Viterbi algorithm.

[Fig. 1. First-Order Linear-Chain CRF. Label lattice over the gloss sentence '5 - 7 red apples' with candidate labels 0, f, m at every token.]

[Fig. 2. First-Order Skip-Chain CRF. The same sentence with label nodes only at the number tokens '5' and '7'.]

In a Linear-Chain CRF (L-CRF), most bigram features are useless since they connect to non-number tokens (in Figure 1 none of the transition weights are significant). This means we effectively have a zeroth-order CRF. The transitions that are actually important are those between numbers. To implement such dependencies, we make two sequences - the full (original) sequence and the skip sequence (numbers only). We build unigram features only for number tokens, but from the full sequence. This way, our unigram features exactly match those of the linear-chain model. Next, bigram features are built from the skip sequence. For encoding and decoding we use the skip sequence as well, since we do not build any feature functions for non-number tokens. Skip-chain models, as described above, have unaltered unigram and improved bigram function sets (for number tokens), while being significantly faster (see the graph simplification shown in Figure 1 and Figure 2). Our implementation uses a modified version of the CRFSharp toolkit [13].

V. LONG-DISTANCE DEPENDENCIES

Although the skip-chain structure quite reliably models some important long-distance relationships, it is not able to capture distant dependencies between number and non-number tokens (e.g. in '3 didžiųjų mobiliojo ryšio operatorių' → 'trijų mobiliojo ryšio operatorių' (three major mobile network operators), the case of the number '3' is determined by the case of the word 'operatorių' (operators)). CRF is generally unable to leverage such features and requires either hybridization, such as LSTM-CRF, or additional pre-processing. We propose identifying position- and distance-independent relationships using an ad-hoc set of linkage rules and formulating the perceived syntactic links as conditional functions of the CRF. For the Lithuanian language, we discern three directly related parts of speech (Noun, Verb, Preposition) in the number normalization task. For each, we use a different set of linkage rules to identify the tokens related to every number in the sentence, effectively performing partial syntactic analysis. To link prepositions and verbs to numbers, our rules rely solely on the morphological labels provided by the POS tagger. For nouns, the task of linking can be more precisely formulated as the identification of the noun which represents an object or quantity being counted by some number in the sentence. This is extremely important, since the countable noun carries crucial morphological information. For example, with successful identification, we no longer need to predict the gender of a numeral, since it is directly determined by the noun.

A. Countable noun identification

We determine the most likely countable noun in a two-step process. First, for a given numeric token d, we select all potentially countable noun tokens {n_i} according to the ad-hoc set of linkage rules for nouns. We cannot make an educated choice among the selected nouns on the basis of the available morphosyntactic annotation, since noun morphology does not have the property of 'countability'. To discriminate among potential countable nouns, semantic analysis is needed. We need to rate the set of selected nouns {n_i} according to some 'countability' measure ξ that depends on the numeric token d being normalized, and select the noun n_best with the highest ξ(d, n_i) rating:

n_best = arg max_i ξ(d, n_i)

Suppose that we have vector embeddings v(n) ∈ R^D for every noun n, obtained by an algorithm such as 'word2vec' [14]. Suppose also that we have designed a mapping φ that maps every numeric token d into a vector φ(d) ∈ R^D, such that φ(d) is the representative embedding of the set of nouns that are frequently counted by the numeric token d. If both assumptions hold, we can rate the set of potential countable nouns by estimating the cosine similarity between each selected noun and the corresponding representative vector, i.e.

ξ(d, n_i) = cosine-similarity(φ(d), v(n_i))    (1)

We have tested a few different approaches to designing the above-mentioned mapping φ(d). We searched a large unannotated text corpus for adjacent number-noun co-occurrences and made noun frequency lists for every numeric token found (around 350 thousand co-occurrences). The information present in a frequency list can be aggregated into a single vector by estimating the weighted average of the noun embeddings making up that list. Thus a representative (or central) embedding vector can be obtained for every numeric token. Although this tabular mapping from numeric tokens to representative vectors can be used in (1), it has serious limitations: the table contains many unreliable vectors for rare numbers, because of the lack of co-occurrences in the unannotated corpus. To circumvent the limitations of the tabular mapping, we used a Neural Network (NN) approach to build a continuous co-occurrence model. We built two different neural networks: one with a single input (corresponding to the mathematical value of the numeric token) and one with 7 inputs, corresponding to the decomposition of the numeric token into sub-parts (thousands, hundreds, ...) and including number features similar to Table I. The NN had 200 output units.

The evaluation of these models is shown in Table III. The baseline performance is obtained by the simple rule "take the first potentially countable noun to the right of a numeric token". Accuracy is measured using the whole CRF training data, extracting the situations where a choice between two or more nouns (2.41 on average) is needed.

TABLE III. COUNTABLE NOUN LINKING

Method          | Accuracy
Select first    | 68.77
1-input NN      | 84.11
7-input deep NN | 87.40

VI. EVALUATION

We evaluate the models with 5-fold cross-validation (except for countable noun identification in Section V, since its training and testing data sets were obtained from different sources). The accuracy of the different models is shown in Table IV. Combined accuracy estimates the accuracy of all three models together; the combined answer is considered correct if all three sub-labels are correct.

It is worth noting that our model is focused on grammatically correct, 'spoken' number normalization. This might not be desirable for systems like text-to-speech synthesis, hence a more standardized approach can be chosen. For the Lithuanian language, the numeral definiteness property could be removed from the prediction model, since it is not strictly constrained by grammar. This would increase language correctness and improve the Type prediction model and the combined accuracy, as shown in the last line of Table IV (the best performing model without the definiteness property).

TABLE IV. EVALUATION

                     | Case  | Type  | Gender | Combined
L-CRF                | 77.19 | 89.00 | 94.79  | 67.52
S-CRF                | 78.89 | 89.82 | 95.17  | 69.43
S-CRF+class.*        | 83.81 | 93.64 | 95.17  | 76.49
S-CRF+class.+syn.**  | 86.05 | 94.01 | 98.51  | 80.91
without Definiteness | 86.05 | 96.82 | 98.51  | 83.08

* classification, see Section III. Preprocessing
** syntactic analysis, see Section V. Long-Distance Dependencies

The accuracies above represent lower bounds on real-world number normalization performance. Firstly, in certain situations some sub-label prediction mistakes are irrelevant for numeral generation. For example, both '5, Cardinal, Genitive, Feminine' and '5, Cardinal, Genitive, Masculine' generate the same word representation, 'penkių'. Secondly, real-world sentences often contain suffixes (e.g. 'Kovo 11-ąją' → 'Kovo vienuoliktąją' (March 11th)) that either offer an unambiguous hint that solves the number normalization problem, or at least provide most of the needed morphological information, which can be used to correct prediction mistakes.

VII. CONCLUSIONS

In this paper, we describe a number normalization disambiguation model, which is needed to develop a context-dependent number-to-words system. The sequence-labeling approach allows us to normalize countable abbreviations and symbols (next to a number) effortlessly, since the morphological form of the countable noun can be extracted from the predicted label (e.g. 'nuo 5%' (from 5%) → 'nuo penkių procentų'). Our implementation based on this model is publicly available [15] and will in the future be integrated into a full Lithuanian text normalization system.

Number normalization errors are often directly dependent on morphological analysis mistakes, and we are currently working on improving both the vocabulary-grammar and the disambiguation sides of Lithuanian POS tagging to consequently increase number normalization accuracy.

Currently, we use the 'word2vec' [14] algorithm trained on a relatively small text corpus to produce word embeddings. Various improvements have since been made in encoding semantic information into vectors [16], [17], and using a more advanced method and a larger corpus would likely improve our model's performance.

Our achieved number normalization accuracy could be further improved by expanding the annotated training data, since a considerable number of errors are a direct result of data scarcity. Our approach generally lacks semantic and syntactic language understanding, though, so performing full syntactic sentence analysis in the preprocessing stage would be highly beneficial.

ACKNOWLEDGMENT

This research was supported by the project "Semantika 2" (No. 02.3.1-CPVA-V-527-01-0002). Special gratitude goes to our colleagues Lina Majauskaitė and Dovilė Stukaitė, who helped us in collecting and annotating the text corpus.

REFERENCES

[1] V. Dadurkevičius. dadurka/number-to-words-lt. [Online]. Available: https://github.com/dadurka/number-to-words-lt
[2] K. Wu, K. Gorman, and R. Sproat. (2016) Minimally supervised written-to-spoken text normalization.
[3] M. Wróbel, J. T. Starczewski, and C. Napoli, "Handwriting recognition with extraction of letter fragments," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2017, pp. 183-192.
[4] R. Sproat and N. Jaitly. (2016) RNN approaches to text normalization: A challenge.
[5] T. Kapuściński, R. K. Nowicki, and C. Napoli, "Comparison of effectiveness of multi-objective genetic algorithms in optimization of invertible s-boxes," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2017, pp. 466-476.
[6] K. Gorman and R. Sproat, "Minimally supervised number normalization," Transactions of the Association for Computational Linguistics, vol. 4, pp. 507-519, 2016. [Online]. Available: https://www.transacl.org/ojs/index.php/tacl/article/view/897/213
[7] T. Kapuściński, R. K. Nowicki, and C. Napoli, "Application of genetic algorithms in the construction of invertible substitution boxes," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2016, pp. 380-391.
[8] [Online]. Available: http://donelaitis.vdu.lt/main_helper.php?id=4&nr=7_2
[9] A. Akbik, D. Blythe, and R. Vollgraf, "Contextual string embeddings for sequence labeling," in Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, 2018, pp. 1638-1649. [Online]. Available: http://aclweb.org/anthology/C18-1139
[10] M. Silfverberg, T. Ruokolainen, K. Lindén, and M. Kurimo, "Part-of-speech tagging using conditional random fields: Exploiting sub-label dependencies for improved accuracy," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, 2014, pp. 259-264. [Online]. Available: http://aclweb.org/anthology/P14-2043
[11] M. Galley, "A skip-chain conditional random field for ranking meeting utterances by importance," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '06. Stroudsburg, PA, USA: Association for Computational Linguistics, 2006, pp. 364-372. [Online]. Available: http://dl.acm.org/citation.cfm?id=1610075.1610126
[12] J. Liu, M. Huang, and X. Zhu, "Recognizing biomedical named entities using skip-chain conditional random fields," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, ser. BioNLP '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 10-18. [Online]. Available: http://dl.acm.org/citation.cfm?id=1869961.1869963
[13] Z. Fu. zhongkaifu/crfsharp. [Online]. Available: https://github.com/zhongkaifu/CRFSharp
[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3781
[15] [Online]. Available: http://prn509.vdu.lt:9080/
[16] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[17] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.
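APPENDIX: ILLUSTRATIVE SKETCHES

The number classification of Table I (Section III.B) can be sketched as a small feature extractor. This is our illustrative approximation, not the paper's code: the function name is ours, and the 'Requires Genitive'/'Requires Singular' heuristics are a deliberately simplified stand-in for the actual Lithuanian agreement rules; they merely reproduce the three example rows of Table I.

```python
import re

def number_features(token: str) -> dict:
    """Derive the Table I feature columns for a numeric token (sketch)."""
    # Roman numeral detection and valuation (subtractive notation).
    roman = bool(re.fullmatch(r"[IVXLCDM]+", token))
    if roman:
        values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
        total, prev = 0, 0
        for ch in reversed(token):
            v = values[ch]
            total += -v if v < prev else v
            prev = max(prev, v)
        value = float(total)
    else:
        value = float(token.replace(",", "."))
    is_int = value == int(value)
    last, last_two = int(value) % 10, int(value) % 100
    # Simplified approximation of numeral-noun agreement (assumption, not
    # the paper's rule set): fractions and integers ending in 0 or 11..19
    # govern the genitive; integers ending in 1 (but not 11) take a
    # singular noun.
    req_gen = (not is_int) or last == 0 or 11 <= last_two <= 19
    req_sing = is_int and last == 1 and last_two != 11
    return {"roman": roman, "int": is_int,
            "digits": len(str(int(value))),
            "req_gen": req_gen, "req_sing": req_sing}
```

With these heuristics, the tokens '21', '113.5' and 'IV' reproduce the three rows of Table I.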
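The skip-chain construction of Section IV.B (two sequences: the full sequence for unigram features of number tokens, and the numbers-only skip sequence for bigram features) can be sketched as follows. The token representation and feature-string format are our own illustration, not the CRFSharp implementation used in the paper.

```python
def make_sequences(tokens):
    """Build the full (original) sequence and the skip sequence (numbers
    only), remembering each skip element's index in the full sequence."""
    full = [t for t, _ in tokens]
    skip = [(i, t) for i, (t, is_num) in enumerate(tokens) if is_num]
    return full, skip

def unigram_features(full, i, window=2):
    """Unigram features for a number token are drawn from the FULL
    sequence, so they exactly match those of a linear-chain model."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        word = full[j] if 0 <= j < len(full) else "<pad>"
        feats.append(f"U[{off}]={word}")
    return feats

def bigram_features(skip, k):
    """Bigram features connect consecutive NUMBER tokens (the skip
    sequence), capturing the number-number dependency that a linear
    chain wastes on adjacent non-number tokens."""
    if k == 0:
        return []
    (_, prev), (_, cur) = skip[k - 1], skip[k]
    return [f"B={prev}|{cur}"]
```

For the gloss sentence '5 - 7 red apples' of Figures 1 and 2, the skip sequence contains only '5' and '7', so the single bigram feature links the two numbers while the unigram context of '7' still sees '-', 'red' and 'apples'.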
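The countable noun rating of Section V.A, Eq. (1), combined with the tabular mapping φ(d) (a frequency-weighted average of co-occurring noun embeddings), can be sketched as below. Function names, toy embeddings and the plain-list vector format are assumptions for illustration; the paper's actual vectors come from 'word2vec' [14].

```python
import math

def representative_embedding(freq, embeddings):
    """Tabular phi(d): frequency-weighted average of the embeddings of
    nouns co-occurring with numeric token d."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = sum(freq.values())
    for noun, count in freq.items():
        for i, x in enumerate(embeddings[noun]):
            vec[i] += (count / total) * x
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_countable_noun(phi_d, candidates, embeddings):
    """Eq. (1): n_best = argmax_i cosine(phi(d), v(n_i))."""
    return max(candidates, key=lambda n: cosine(phi_d, embeddings[n]))
```

Given a frequency list in which a numeric token mostly co-occurs with 'apples'-like nouns, the rating prefers a semantically 'countable' candidate over an unrelated one, which is exactly the disambiguation step the linkage rules alone cannot perform.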