=Paper=
{{Paper
|id=Vol-2470/p7
|storemode=property
|title=Context based number normalization using skip-chain conditional random fields
|pdfUrl=https://ceur-ws.org/Vol-2470/p7.pdf
|volume=Vol-2470
|authors=Linas Balčiūnas
|dblpUrl=https://dblp.org/rec/conf/ivus/Balciunas19
}}
==Context based number normalization using skip-chain conditional random fields==
Context based number normalization using skip-chain conditional random fields
Linas Balčiūnas
Vytautas Magnus University
Kaunas University of Technology
Kaunas, Lithuania
linasb20@gmail.com
Abstract—Verbalizing numeric text tokens is a required task for various speech-related applications, including automatic speech recognition and text-to-speech synthesis. In morphologically rich languages, such conversion involves predicting the implicit morphological properties of the corresponding numeral. In this paper, we propose first-order skip-chain Conditional Random Field (CRF) models and various preprocessing techniques to leverage different kinds of contextual information. We show that our best skip-chain CRF models achieve over 80% accuracy on a set of 2000 Lithuanian sentences.

Keywords—Number normalization, text normalization, conditional random field, natural language processing

© 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
I. INTRODUCTION

Number normalization is the task of replacing numeric tokens in a sentence by numerals (word tokens), using an appropriately inflected form of the numeral. Number normalization usually involves disambiguation, as the same numeric token needs to be mapped into different word forms depending on the context (e.g. '5 vaikai eina' → 'Penki vaikai eina' (five children are going) vs. '5 vaikų nėra' → 'Penkių vaikų nėra' (five children are missing)). Although number normalization can be considered part of the broader task of text normalization, formulating it as a separate task can be beneficial, since the process of number normalization can be quite complex depending on the morphological features of a language. In this paper, we describe the process of building and evaluating a number normalization system for Lithuanian. However, some of the techniques and models are language-independent and might be applied to other languages. In Lithuanian, for example, the number '5' may, depending on sentence context, represent any of 63 different words. Predicting this relationship directly is rather difficult and would require a huge data set to properly learn the numeral grammar. A simpler approach is to predict a Part of Speech (POS) tag and then generate the numeral accordingly: the POS tag contains all the morphological information necessary to drive a language-specific, grammar-based numbers-to-words system [1]. This way, the possible result classes are shared across all numbers, and predicting the POS tag can be formulated as a sequence labeling task rather than a sequence-to-sequence task, because of the one-to-one relationship. A minimal sketch of this two-stage pipeline is given below.
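To make the two-stage approach concrete, here is a minimal sketch, assuming a `predict_labels` callable standing in for the sequence labeler developed in this paper and a `number_to_words` generator in the spirit of [1]; both names are hypothetical placeholders, not the actual implementation.

```python
# Two-stage number normalization: (1) predict a morphological label for
# every token, (2) expand numeric tokens with a grammar-based generator.

def normalize_numbers(tokens, predict_labels, number_to_words):
    labels = predict_labels(tokens)      # one label per token; '0' = not a number
    out = []
    for token, label in zip(tokens, labels):
        if label == "0":
            out.append(token)            # non-number tokens stay unchanged
        else:                            # e.g. ('Cardinal', 'Genitive', 'Masculine')
            out.append(number_to_words(token, label))
    return out
```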
II. RELATED WORKS

As far as we know, there are no published works or publicly available applications performing number normalization based on sentence context for Lithuanian. However, many languages face similar morphological disambiguation problems. Russian and Lithuanian numbers share many morphological properties, including types (cardinal, ordinal), genders and cases. For Russian, there is existing research on general text normalization with hand-written, language-general grammars [2], [3], with Recurrent Neural Networks (RNN) [4], and on number normalization [5], [6], [7].

III. PREPROCESSING

A. Data

A small text corpus for training and evaluating text normalization models was collected and manually annotated. It consists of 1955 sentences containing 3143 numeric tokens. The sentences were inspected by linguists, who suggested a numeral word form as an appropriate replacement for every numeric token. In some ambiguous cases, a few reasonable alternatives were proposed. Some ambiguities were related to the use of pronominal numeral forms (e.g. '15 savaitė' (15th week) → 'penkiolikta savaitė' (non-pronominal form) or 'penkioliktoji savaitė' (pronominal form)). Other ambiguities were related to numeral case (e.g. '2019 vasarį' (February 2019) → 'du tūkstančiai devynioliktųjų vasarį' or 'du tūkstančiai devynioliktaisiais vasarį'). All suffixes that represented a 'normalization hint' were eliminated from the data set (e.g. '2019-aisiais' was replaced by '2019'). This made the training subset of the corpus more interesting for the training algorithm, increased the complexity of the normalization task, and reduced the normalization accuracy estimates on the test subset of the corpus.
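A minimal sketch of this suffix-elimination step, assuming a simple digits-hyphen-ending pattern; the exact rules applied to the corpus are not specified in this paper.

```python
import re

# Strip 'normalization hint' suffixes from numeric tokens,
# e.g. '2019-aisiais' -> '2019'. The suffix pattern is an assumption.
HINT_SUFFIX = re.compile(r"\b(\d+)-[a-ząčęėįšųūž]+\b", re.IGNORECASE)

def strip_normalization_hints(sentence: str) -> str:
    return HINT_SUFFIX.sub(r"\1", sentence)

assert strip_normalization_hints("2019-aisiais metais") == "2019 metais"
```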
The sentences of the corpus were pre-processed by a Hidden Markov Model (HMM) based POS tagger [8]. Every text token was labeled with a so-called 'detailed' (or composite) morphological label that contained the following information:

• Lemma
• Part of speech (Noun, Verb, Adjective, ...)
• Case (Nominative, Genitive, ...)
• Gender (Feminine, Masculine)
• Number (Singular, Plural)
Since important prediction decisions are based on tagger-provided POS tags and lemmas, morphological annotations were hand-corrected in the training-testing data set to ensure optimal performance. When using morphological analysis data, it is beneficial to divide POS tags into sub-labels, to build more abstract grammar rules and to filter out redundant information.

B. Number grammar
Similar, more saturated Natural Language Processing (NLP) sequence labeling tasks, such as POS tagging and Named Entity Recognition (NER), do not require hand-written language-specific grammar rules to achieve state-of-the-art performance [9]. However, data-driven sequence labeling approaches based on Long Short-Term Memory networks coupled with Conditional Random Fields (LSTM-CRF) prove insufficient to achieve the desired number normalization quality, given the limited availability of training data. To efficiently leverage a small data set, morphosyntactic knowledge should be exploited to craft language- and task-specific grammar rules. All rules are constructed as conditional functions without any prior weighting. The following techniques are used to approximate and generalize the relationship between a number and its sentence context:

• Lemma classification (replacing certain word lemmas with a dedicated class name, for example, month names with '%Month')
• Number classification (see Table I)
• Verb classification (based on the syntactic feature of governing the case of other parts of speech)
• Syntactic linking (see Section V, Long-Distance Dependencies)
TABLE I. NUMBER CLASSIFICATION EXAMPLE

Num.    Roman   Int   Digit count   Req.Gen.*   Req.Sing.*
21      -       +     2             -           +
113.5   -       -     3             +           -
IV      +       +     1             -           -

* 'Requires Genitive' and 'Requires Singular' signify that a countable noun of the given number must be in the Genitive case or in the Singular.
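A sketch of how Table I-style features could be computed for a numeric token. The genitive/singular conditions below follow common Lithuanian counting rules and reproduce the table's three examples, but the actual grammar rule set of the system is more elaborate; treat this as an illustrative assumption.

```python
import re

ROMAN = re.compile(r"^[IVXLCDM]+$")

def roman_to_int(s: str) -> int:
    vals = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    nums = [vals[c] for c in s]
    # Subtract a symbol when a larger one follows it (e.g. IV = 4).
    return sum(v if v >= nxt else -v for v, nxt in zip(nums, nums[1:] + [1]))

def number_features(token: str) -> dict:
    is_roman = bool(ROMAN.match(token))
    value = roman_to_int(token) if is_roman else float(token.replace(",", "."))
    is_int = float(value).is_integer()
    last_two = int(value) % 100
    return {
        "roman": is_roman,
        "int": is_int,
        "digit_count": len(str(int(value))),   # digits of the integer part
        # Decimals, teens and round tens govern the Genitive of the counted noun.
        "req_genitive": (not is_int) or 10 <= last_two <= 20 or last_two % 10 == 0,
        # Numbers ending in 1 (but not 11) behave as singular.
        "req_singular": is_int and last_two % 10 == 1 and last_two != 11,
    }

print(number_features("21"))     # digit_count 2, req_singular True
print(number_features("113.5"))  # not an integer -> requires Genitive
print(number_features("IV"))     # Roman numeral, value 4
```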
IV. MODELS

In this paper, mainly variations of Conditional Random Fields (CRF) are explored, since CRF achieved a better baseline performance than its neural counterpart (LSTM-CRF) and appears to be more suitable for the particular data set and grammar rule set.
A. Sub-Label models

To create a single sequence tagging model for this task, we would need 79 detailed morphological labels (the different combinations of the sub-labels shown in Table II) corresponding to the output classes. With the currently available corpus, this would cause significant data scarcity problems: there are no training examples for a considerable number of classes, and many others are barely represented. A good way to address this data scarcity problem is to create three independent CRF models for Case, Type and Gender prediction and to combine their predictions at a later stage. This is preferable since there is no direct dependency relationship between these morphological categories, and operating with sub-labels allows creating a more abstract rule set. It is worth noting, though, that sub-label dependencies in combination with composite labels have proven useful for NLP sequence labeling with CRFs [10]. Exploiting composite label dependencies might be beneficial for number normalization as well, and is worth exploring in future research.

TABLE II. MORPHOLOGICAL SUB-LABELS

Case             Type/Number                   Gender
Nominative       Cardinal                      Feminine
Genitive         Ordinal singular              Masculine
Dative           Ordinal plural                Not applicable
Accusative       Ordinal definitive singular
Instrumental     Ordinal definitive plural
Locative         Cardinal multiple
Not applicable   Month*

* This class is designed for a number that could be substituted with a month name, for example, '2019-02-03' and '2019-February-03'.
B. Skip-Chain CRF

The linear-chain structure is usually used for sequence labeling CRFs, since additionally modeling non-linear relationships requires complicated inference algorithms and a prior specification of such dependencies [11], [12]. For number normalization, a simplified version of the Skip-chain Conditional Random Field (S-CRF) can be used, as shown in the gender prediction model comparison in Figure 1 and Figure 2 (for readability, we only show an English gloss of a sentence from the Lithuanian model). Both graphs are representations of Viterbi algorithm decoding (the same structure is used for encoding). Circles correspond to nodes and arrows to transitions; the weight of a node or transition is calculated as the sum of its conditional feature-set weights (unigrams for nodes, bigrams for transitions). 'f' and 'm' denote the 'feminine' and 'masculine' genders, while '0' represents the class of non-number tokens, which are not changed by the normalization task. The blue path is the correct path selected by the Viterbi algorithm. In a Linear-chain CRF (L-CRF), most bigram features are useless, since they connect to non-number tokens (in Figure 1, none of the transition weights are significant); this means we effectively have a zeroth-order CRF. The transitions that actually matter are those between numbers. To implement such dependencies, we build two sequences: the full (original) sequence and the skip sequence (numbers only). We build unigram features only for number tokens, but from the full sequence; this way, our unigram features exactly match those of the linear-chain model. Bigram features are then built from the skip sequence. For encoding and decoding we also use the skip sequence, since we do not build any feature functions for non-number tokens. Skip-chain models, as described above, have unaltered unigram and improved bigram function sets (for number tokens), while being significantly faster (see the graph simplification between Figure 1 and Figure 2). Our implementation uses a modified version of the CRFSharp toolkit [13]; a decoding sketch follows Figure 2.
Fig. 1. First-Order Linear-Chain CRF. (Gloss sentence '5 - 7 red apples'; every token carries the candidate labels 'f', 'm' and '0', and transitions connect all adjacent tokens.)

Fig. 2. First-Order Skip-Chain CRF. (The same sentence; only the number tokens '5' and '7' carry the candidate labels 'f' and 'm', and a single transition links them directly.)
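A minimal sketch of decoding over the skip sequence, under stated assumptions: the unigram scores (computed from full-sentence features of each number token) and the shared bigram transition matrix are taken as precomputed NumPy arrays. The real weights come from CRF training; this array interface is hypothetical and is not CRFSharp's API.

```python
import numpy as np

# Viterbi decoding over the skip (numbers-only) sequence. unigram[i, y]
# scores label y for the i-th number token (its features are drawn from
# the full sentence); bigram[y1, y2] scores the transition between
# consecutive number tokens in the skip sequence.

def viterbi_skip(unigram: np.ndarray, bigram: np.ndarray) -> list:
    n, k = unigram.shape
    score = unigram[0].copy()
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + bigram + unigram[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

# Toy run for Figure 2: labels {0: 'f', 1: 'm'} for the tokens '5' and '7';
# a positive same-gender transition weight makes the two numbers agree.
uni = np.array([[1.0, 0.2], [0.4, 0.5]])
big = np.array([[0.8, -0.8], [-0.8, 0.8]])
print(viterbi_skip(uni, big))  # -> [0, 0], i.e. 'f f'
```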
V. LONG-DISTANCE DEPENDENCIES

Although the skip-chain structure quite reliably models some important long-distance relationships, it is not able to capture distant dependencies between number and non-number tokens (e.g. in '3 didžiųjų mobiliojo ryšio operatorių' → 'trijų didžiųjų mobiliojo ryšio operatorių' (three major mobile network operators), the case of the number '3' is determined by the case of the word 'operatorių' (operators)). A CRF is generally unable to leverage such features and requires either hybridization, such as LSTM-CRF, or additional pre-processing. We propose identifying position- and distance-independent relationships using an ad-hoc set of linkage rules and formulating the perceived syntactic links as conditional functions of the CRF (a sketch follows below). For the Lithuanian number normalization task, we discern three directly related parts of speech (Noun, Verb, Preposition). For each, we use a different set of linkage rules to identify the tokens related to every number in the sentence, effectively performing partial syntactic analysis. To link prepositions and verbs to numbers, our rules rely solely on the morphological labels provided by the POS tagger. For nouns, the linking task can be more precisely formulated as the identification of the noun that represents the object or quantity being counted by a given number in the sentence. This is extremely important, since the countable noun carries crucial morphological information: for example, after successful identification, we no longer need to predict the gender of a numeral, since it is directly determined by the noun.
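The following sketch shows how a perceived syntactic link could be injected into the CRF as an ordinary feature, under stated assumptions: `linked_token` stands in for the ad-hoc linkage rules, and the (token, POS, case) triples for the tagger output; both interfaces are hypothetical.

```python
# Turn a perceived syntactic link into a unigram feature of the number
# token. `linked_token` is a hypothetical implementation of the linkage
# rules described above; `sentence` is assumed to be a list of
# (token, pos, case) triples produced by the POS tagger.

def syntactic_link_features(number_index: int, sentence: list, linked_token) -> list:
    features = []
    link = linked_token(number_index, sentence)  # index of linked token, or None
    if link is not None:
        _, pos, case = sentence[link]
        # e.g. 'LINK_Noun_Genitive' for '3 didžiųjų ... operatorių'
        features.append(f"LINK_{pos}_{case}")
    return features
```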
A. Countable noun identification

We determine the most likely countable noun in a two-step process. First, for a given numeric token d, we select all potentially countable noun tokens {n_i} according to the ad-hoc set of linkage rules for nouns. We cannot make an educated choice among the selected nouns on the basis of the available morphosyntactic annotation, since noun morphology does not have a 'countability' property; to discriminate among potential countable nouns, semantic analysis is needed. We therefore rate the set of selected nouns {n_i} according to some 'countability' measure ξ that depends on the numeric token d being normalized, and select the noun n_best with the highest rating:

n_best = argmax_{n_i} ξ(d, n_i)

Suppose that we have vector embeddings v(n) ∈ R^D for every noun n, obtained by an algorithm such as 'word2vec' [14]. Suppose also that we have designed a mapping φ that maps every numeric token d into a vector φ(d) ∈ R^D, such that φ(d) is the representative embedding of the set of nouns that are frequently counted by the numeric token d. If both assumptions hold, we can rate the set of potential countable nouns by estimating the cosine similarity between each selected noun and the corresponding representative vector, i.e.

ξ(d, n_i) = cosine-similarity(φ(d), v(n_i))    (1)
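A sketch of the rating step in equation (1), assuming `phi` (the representative-embedding mapping designed below) and `embed` (word2vec noun vectors) are given; both callables are placeholders.

```python
import numpy as np

# Rate candidate countable nouns by the cosine similarity of equation (1)
# and pick the best one. phi(d) returns the representative embedding of
# the numeric token d; embed(n) returns the word2vec vector of noun n.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_countable_noun(d: str, candidates: list, phi, embed) -> str:
    return max(candidates, key=lambda n: cosine(phi(d), embed(n)))
```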
We have tested a few different approaches to designing the above-mentioned mapping φ(d). We searched a large unannotated text corpus for adjacent number-noun co-occurrences and built a noun frequency list for every numeric token found (around 350 thousand co-occurrences). The information present in a frequency list can be aggregated into a single vector by taking the weighted average of the noun embeddings making up that list; thus a representative (or central) embedding vector can be obtained for every numeric token. Although this tabular mapping from numeric tokens to representative vectors can be used in (1), it has serious limitations: the table contains many unreliable vectors for rare numbers, because of the lack of co-occurrences in the unannotated corpus. To circumvent the limitations of the tabular mapping, we used a Neural Network (NN) approach to build a continuous co-occurrence model. We built two different neural networks: one with a single input (the mathematical value of the numeric token) and one with 7 inputs, corresponding to the decomposition of the numeric token into sub-parts (thousands, hundreds, ...) and including number features similar to those of Table I. Both networks had 200 output units.
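A sketch of the 7-input co-occurrence network as a PyTorch module. Only the 7 inputs and 200 outputs are given above; the hidden layer size, activation and loss function are assumptions.

```python
import torch
import torch.nn as nn

# Map a numeric token's 7-dimensional feature vector (sub-part
# decomposition plus Table I-style flags) to a 200-dimensional
# representative embedding phi(d). Hidden layer size is an assumption.
phi_net = nn.Sequential(
    nn.Linear(7, 128),
    nn.ReLU(),
    nn.Linear(128, 200),
)

optimizer = torch.optim.Adam(phi_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features: torch.Tensor, target_embedding: torch.Tensor) -> float:
    # Targets are the frequency-weighted average noun embeddings from the
    # co-occurrence lists (the tabular mapping), so the network smooths
    # the table over rare numbers.
    optimizer.zero_grad()
    loss = loss_fn(phi_net(features), target_embedding)
    loss.backward()
    optimizer.step()
    return loss.item()
```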
The evaluation of these models is shown in Table III. The baseline performance is obtained by the simple rule "take the first potentially countable noun to the right of the numeric token". Accuracy is measured on the whole CRF training data, extracting the situations where a choice between two or more nouns (2.41 on average) is needed.

TABLE III. COUNTABLE NOUN LINKING

Method            Accuracy
Select first      68.77
1-input NN        84.11
7-input deep NN   87.40

VI. EVALUATION

We evaluate the models with 5-fold cross-validation (except for countable noun identification in Section V, since its training and testing data sets were obtained from different sources). The accuracy of the different models is shown in Table IV. Combined accuracy estimates the accuracy of all three models together: the combined answer is considered correct only if all three sub-labels are correct (a minimal sketch of this measure follows).
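A sketch of the combined accuracy measure, assuming hypothetical list-based interfaces with gold labels stored as (case, type, gender) triples.

```python
# Combined accuracy as used in Table IV: a numeric token counts as correct
# only if the Case, Type and Gender predictions all match the annotation.

def combined_accuracy(gold, case_pred, type_pred, gender_pred) -> float:
    hits = sum(g == (c, t, n)
               for g, c, t, n in zip(gold, case_pred, type_pred, gender_pred))
    return hits / len(gold)
```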
It is worth noting that our model is focused on grammatically correct, 'spoken' number normalization. This might not be desirable for systems like text-to-speech synthesis, hence a more standardized approach can be chosen. For the Lithuanian language, the numeral definiteness property could be removed from the prediction model, since it is not strictly constrained by grammar. This would increase language correctness and improve the Type prediction and combined accuracy, as shown in the last line of Table IV (the best performing model without the definiteness property).

The accuracies above represent a lower bound on real-world number normalization performance. Firstly, in certain situations some sub-label prediction mistakes are irrelevant for numeral generation: for example, both '5, Cardinal, Genitive, Feminine' and '5, Cardinal, Genitive, Masculine' generate the same word representation 'penkių'. Secondly, real-world sentences often contain suffixes (e.g. 'Kovo 11-ąją' → 'Kovo vienuoliktąją' (March 11th)) that either offer an unambiguous hint solving the number normalization problem, or at least provide most of the needed morphological information, which can be used to correct prediction mistakes.
TABLE IV. EVALUATION

Model                  Case    Type    Gender   Combined
L-CRF                  77.19   89.00   94.79    67.52
S-CRF                  78.89   89.82   95.17    69.43
S-CRF+class.*          83.81   93.64   95.17    76.49
S-CRF+class.+syn.**    86.05   94.01   98.51    80.91
without Definiteness   86.05   96.82   98.51    83.08

* classification, see Section III, Preprocessing
** syntactic analysis, see Section V, Long-Distance Dependencies
VII. CONCLUSIONS

In this paper, we describe a number normalization disambiguation model, which is needed to develop a context-dependent number-to-words system. The sequence-labeling approach allows us to normalize countable abbreviations and symbols (next to a number) effortlessly, since the countable noun's morphological form can be extracted from the predicted label (e.g. 'nuo 5%' (from 5%) → 'nuo penkių procentų'). Our implementation based on this model is publicly available [15] and will in the future be integrated into a full Lithuanian text normalization system.

Number normalization errors are often directly dependent on morphological analysis mistakes, and we are currently working on improving both the vocabulary-grammar and the disambiguation sides of Lithuanian POS tagging to consequently increase number normalization accuracy.

Currently, we use the 'word2vec' [14] algorithm trained on a relatively small text corpus to produce word embeddings. Various improvements have been made in encoding semantic information into vectors [16], [17], and using a more advanced method and a larger corpus would likely improve our model's performance.

Our number normalization accuracy could be further improved by expanding the annotated training data, since a considerable amount of the errors is a direct result of data scarcity. On the other hand, our approach generally lacks semantic and syntactic language understanding, so performing full syntactic sentence analysis in the preprocessing stage would be highly beneficial.

ACKNOWLEDGMENT

This research was supported by the project "Semantika 2" (No. 02.3.1-CPVA-V-527-01-0002). Special gratitude goes to our colleagues Lina Majauskaitė and Dovilė Stukaitė, who helped us in collecting and annotating the text corpus.
REFERENCES

[1] V. Dadurkevičius, "dadurka/number-to-words-lt". [Online]. Available: https://github.com/dadurka/number-to-words-lt
[2] K. Wu, K. Gorman, and R. Sproat, "Minimally supervised written-to-spoken text normalization," 2016.
[3] M. Wróbel, J. T. Starczewski, and C. Napoli, "Handwriting recognition with extraction of letter fragments," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2017, pp. 183–192.
[4] R. Sproat and N. Jaitly, "RNN approaches to text normalization: A challenge," 2016.
[5] T. Kapuściński, R. K. Nowicki, and C. Napoli, "Comparison of effectiveness of multi-objective genetic algorithms in optimization of invertible s-boxes," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2017, pp. 466–476.
[6] K. Gorman and R. Sproat, "Minimally supervised number normalization," Transactions of the Association for Computational Linguistics, vol. 4, pp. 507–519, 2016. [Online]. Available: https://www.transacl.org/ojs/index.php/tacl/article/view/897/213
[7] T. Kapuściński, R. K. Nowicki, and C. Napoli, "Application of genetic algorithms in the construction of invertible substitution boxes," in International Conference on Artificial Intelligence and Soft Computing. Springer, 2016, pp. 380–391.
[8] [Online]. Available: http://donelaitis.vdu.lt/main_helper.php?id=4&nr=7_2
[9] A. Akbik, D. Blythe, and R. Vollgraf, "Contextual string embeddings for sequence labeling," in Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, 2018, pp. 1638–1649. [Online]. Available: http://aclweb.org/anthology/C18-1139
[10] M. Silfverberg, T. Ruokolainen, K. Lindén, and M. Kurimo, "Part-of-speech tagging using conditional random fields: Exploiting sub-label dependencies for improved accuracy," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, 2014, pp. 259–264. [Online]. Available: http://aclweb.org/anthology/P14-2043
[11] M. Galley, "A skip-chain conditional random field for ranking meeting utterances by importance," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '06. Stroudsburg, PA, USA: Association for Computational Linguistics, 2006, pp. 364–372. [Online]. Available: http://dl.acm.org/citation.cfm?id=1610075.1610126
[12] J. Liu, M. Huang, and X. Zhu, "Recognizing biomedical named entities using skip-chain conditional random fields," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, ser. BioNLP '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 10–18. [Online]. Available: http://dl.acm.org/citation.cfm?id=1869961.1869963
[13] Z. Fu, "zhongkaifu/CRFSharp". [Online]. Available: https://github.com/zhongkaifu/CRFSharp
[14] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3781
[15] [Online]. Available: http://prn509.vdu.lt:9080/
[16] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[17] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.