Text Normalization and Spelling Correction in Kazakh
                       Language

                      Gaukhar Slamova1 and Meruyert Mukhanova1
    1 Suleyman Demirel University, Engineering and Natural Sciences, Information Systems,

                        040900, Kaskelen, Almaty, Kazakhstan
                   {150103077, 150103018}@stu.sdu.edu.kz


       Abstract. Text normalization is significant step in preprocessing of informal,
       social media and short texts in the Natural Language Processing (NLP) tasks.
       Researches in the field are mostly on English, but not on the agglutinative lan-
       guages such as Kazakh, Korean, Japanese, which are determined as morpholog-
       ically rich languages, and complex compared to English. In this paper, we pre-
       sent text normalization and auto correction of words for Kazakh language, we
       convert informal text into grammatically correct form. To do the auto correction
       task, firstly we countered keyboard error while typing words, then choose the
       best match from them. Additionally, we categorized words to several groups
       and separated text into modules of words. The exact match score of the overall
       system on the provided datasets are 85.40 per cent.


1 Introduction

   Text normalization is the task of transforming informal writing into its standard
form in the language. It is an important processing step for a wide range of Natural
Language Processing (NLP) tasks such as text-to-speech synthesis, speech recogni-
tion, information extraction, parsing, and machine translation. (Richard Sproat, Alan
W. Black, Stanley F. Chen, Shankar Kumar, Mari Ostendorf, Christopher Richards,
2001) Text normalization involves merging different written forms of token into a
canonical normalized form; for example, a document may contain the equivalent
tokens “Mr.”, “Mr”, “mister”, and “Mister” that would all be normalized to a single
form (Nitin Indurkhya, Fred J. Damerau, 2010).
   Normalization poses multiple challenges, as we know it is a task of mapping all
out-of-vocabulary non-standard word tokens to in-vocabulary standard forms, to deal
with it we should convert raw text into grammatically correct sentence by modifying
punctuation and capitalization, and adding, removing, or reordering words. Also, we
gave specific values to some types as date, phone, currency, URL, etc. On informal
texts as usual a lot of mistakes, it is useful to correct them. To spelling correction task,
we consider keyboard typing mistakes, character repetition and other tools. In this
paper, we propose spelling correction and text preprocessing by mentioned above
techniques, it gives higher precision accuracy than other methodologies.
   The rest of this paper is organized as follows. In Section 2 we discuss previous ap-
proaches to the normalization problem. Section 3 presents our normalization frame-
work, including the actual normalization and learning procedures. In Section 4 we
introduce evaluation metric, and present experimental results of our model with re-
spect to several categories. Finally, we conclude in Section 5.


2 Related Work

   Early studies of text normalization include machine learning approach in text-to-
speech and social media, and with usage of neural network in it. In this paper, we use
similar method as in works which investigated text normalization in social media,
because of recent rise heavily informal writing in messaging applications, text nor-
malization is a huge problem of every language.
   Previous works handled text normalization process by producing noisy text where
normalized text go through a noisy channel; this approach called noisy channel mod-
el. (Moore, Eric Brill and Robert C., 2000) presented a method for modelling the
spelling correction as a noisy channel model based on string to string edits; this model
gives significant improvements compared to early studies. (Kristina Toutanova and
Robert C. Moore, 2002) enhanced the string to string edits model by modelling pro-
nunciation similarities between words achieved a substantial performance improve-
ment over the previous best performing models for spelling correction. (Monojit
Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and
Anupam Basu, 2007) introduced a supervised HMM channel model which adopted
the spellchecking metaphor based on character-level edit which has been extended by
(Paul Cook and Suzanne Stevenson, 2009) who used unsupervised noisy channel
model using probabilistic models for common abbreviation and various spelling er-
rors types. (Kobus Catherine, François Yvon, and Géraldine, 2008) presented French
SMS messages normalization process by normalizing the orthography with combina-
tion of Statistical Machine Translation and automatic speech recognition approaches.
(Bo Han and Timothy Baldwin, 2011) presented model for identifying and normaliz-
ing ill-formed words, generating correction candidates based on morphophonemic
similarity over SMS corpus and Twitter. (Joseph Kaufmann and Jugal Kalita, 2010)
used a machine translation approach with a pre-processor for syntactic normalization
rather than lexical. (Liu, Deana Pennell and Yang, 2011) presented two-phase method
for expanding abbreviations using a machine translation system trained at the charac-
ter level during the first phase and in the second phase utilizing an in-domain lan-
guage model, in the context of neighbouring words. (Fei Liu, Fuliang Weng, and Xiao
Jiang, 2012) proposed a cognitively-driven normalization system that integrates dif-
ferent human perspectives in normalizing the nonstandard tokens, including the en-
hanced letter transformation, visual priming, and string/phonetic similarity.
   There are fewer studies done on the agglutinative language comparing to English,
(Gülşen Eryiğit, Dilara Torunoğlu-Selamet, 2017) introduced social media text nor-
malization for Turkish by analyzing Web 2.0 Turkish texts, categorizing them into
seven types and providing candidate spelling correction words. (Mohammad Saloot,
Norisma Idris, Rohana Mahmud, 2014) propose an approach to normalize the Malay
Twitter messages based on corpus-driven analysis. (Panchapagesan Krishnamurthy,
P.P. Talukdar, N Sridhar, A.G. Ramakrishnan, 2004) introduced a novel approach to
text normalization, wherein tokenization and initial token classification are combined
into one stage followed by a second level of token sense disambiguation, is described.
(O. De Clercq, B. Desmet, S. Schulz, E. Lefever, V. Hoste, 2013) used multimodule
approach which rely on Machine Translation and transliteration-based system for
social media messages in the Dutch language. Agglutinative languages tend to have
longer words than fusional ones (Steffen Eger et al., 2016) and spelling correction
model would be complex, because of the morphology.
   To our knowledge, the work presented here is the first which observed normaliza-
tion in Kazakh language with the usage of auto correction methodology and value
categorization.


3 Evaluation

   In this section we introduce our normalization framework, which consider both
spelling correction and text preprocessing processes. Morphologically rich languages
such as Kazakh, Korean, Finnish, Arabic, Turkish, etc. are considered as highly in-
flectional; their characteristic is that one stem in these languages may have hundreds
of possible forms.


3.1   Spelling Correction

   Spelling errors are categorized into two classes: typographic and cognitive. Cogni-
tive errors phonetic or orthographic similarity of words; person does not know how to
spell a word. Typographic errors are related to the keyboard and hand/finger move-
ment where spelling errors happen because of two letters keys’ closeness on the key-
board. (Kukich, 1992)
                                   Figure 1. Keyboard.


  In the Figure 1, on the upper-right corner are shown Kazakh language letters. On
Kazakh alphabet there are 42 letters, where 9 vowels.
                 Table 1. Vowels and Consonants in Kazakh Language.
          Form        English
          Vowels      а, ә, е, о, ө, ұ, ү, ы, і
          Consonants б, г, ғ, д, ж, з, й, к, қ, л, м, н, ң, п, р, с, т, х, һ, ш
   To spelling correction Spelling errors have been classified into four types: Dele-
tion, Insertion, Substitution and Transposition. (Damerau, 1964) Deletion errors
where characters are repeated, as in қаты→қатты, is observed significantly more
frequently than in a non-repeating context showing that visually conspicuous errors
tend to be corrected. Substitution errors of visually similar characters (e.g., ага→аға)
are in fact very common. (Yukino Baba, Hisami Suzuki, 2012)
   We make correction within four parts:
      Selection Mechanism – choose candidate with the highest probability
      Candidate model – gives candidate for the given word.
      Language model – probability of the candidates acquireness on the text
      Error model – probability that another word was typed when author mean
          exact word.
   When we trying to find most likely correct candidate (x) to word out of all possible
candidates that has maximum probability to intended correction to given word, w:


  By Bayes’ Theorem it is equivalent to:


  Since P(w) is the same for every possible candidate c, we can factor it out, giving:


   Consider the misspelled word "сенін" and the two candidates "сенім" and "сенің".
Correction candidate "сенің" seems good because words look similar and only change
is "ң" to "н", it is an accusative case of noun. On the other hand, "сенім" is a very
common word and a noun, this is the correct spelling of word. The point is that to
estimate P(x|w) we consider both the probability of candidate and the probability of
the change from x to w.


3.1   Replacement rules

   Kazakh is morphologically rich language; one stem has a very large number of
word forms. It is not efficient to use a lexicon lookup for storing and checking all
possible candidates of word forms in the dataset. But morphological analyzer helps to
find all possible word forms, lemmas, and inflectional or derivational structures.
   Kazakh is generally verb-final, though various permutations on subject–object–
verb word order can be used. Inflectional and derivational morphology, both verbal
and nominal, in Kazakh, exists almost exclusively in the form
of agglutinative suffixes. Kazakh is a nominative-accusative, head-final, left-
branching, dependent-marking language. (Mukhamedova, Raikhangul, 2015)
                              Table 2. Declension of Words.
 Case     Possible Forms       шелек          кеме     бас "head"          тұз
                              "bucket"       "ship"                       "salt"
 Nom     —                   шелек          кеме         бас           тұз
 Acc     -ні, -ны, -ді, -ды, шелекті        кемені       басты         Тұзды
         -ті, -ты, -н
 Gen     -нің, -ның, -дің, - шелектің       кеменің      бастың        тұздың
         дың, -тің, -тың
 Dat     -ге, -ға, -ке, -қа, - Шелекке      кемеге       басқа         тұзға
         не, -на
 Loc     -де, -да, -те, -та   Шелекте       кемеде       баста         тұзда
 Abl     -ден, -дан, -тен, - шелектен       кемеден      бастан        тұздан
         тан, -нен, -нан
 Inst    -мен(ен) -бен(ен) шелекпен         кемемен      баспен        тұзбен
         -пен(ен)
   ( Zitouni and R. Sarikaya, 2009) list the below problems related to issue with ag-
glutinative languages:
      Increase in dictionary size;
      Poor language model probability estimation;
      Higher out-of-vocabulary rate;
      Inflection gap for machine translation

                   Table 3. Some form for the Kazakh word ‘Кітап’.
           Word form                            English
           Кітап                   Book
           Кітаптар                      Books
           Кітаптағы                     In the book
           Кітаптың                      Of the book
           Кітапқа                       To the book
           Кітапта                       At the book
           Кітаптан                      From the book
           Кітаппен                      With the book
           Кітап                         Book


   We make candidate generation for the nonstandard word forms. In informal texts
mostly used slangs, abbreviations, character repetitions, logograms, wrong letter cas-
es, spelling errors related to pronunciation, vowels misspelling errors. To normalize
such words, we make following candidate generation layer:
      Letter case transformations;
       Accent normalization
       Spelling correction
    Replacement rules considered as a regular expression pattern and used for handling
with character repetitions, emails, URLs, etc. Following word types tagged by the
specific labels:
       E-mails: labeled as @email[example@gmail.com]
       URLs: labeled as @URL[http://sdu.edu.kz]
       Emoticons: labeled as @emoji[>3]
       Money: labeled as @money[$500]
       Date: labeled as @date[25.02.2018]
       Phone: labeled as @phone[87772349134]
    Texts contain different word cases: uppercase, lowercase and mixed case. We con-
verted: uppercase words into first letter upper remaining letters lower, if the word
length less than five; lowercase word remains the same; and mixed case word into
first letter upper remaining letters lower.


4 Evaluation

   We performed evaluation for both word spelling correction and replacement rules.
For the training dataset we used most popular and valuable novels of Kazakh litera-
ture written by Mukhtar Auezov “Abai Zholy” (The path of Abai) which consists of
16893 words.
                         Table 4. Examples of word correction.
             Misspelling    Correct             Guess
             Қыздартың      Қыздардың           Қыздардың
             Сагыз          Сағыз               Сағыз
             Атам             Адам                   Атам
             Сенің            Сенім                  Сенің
             Сагыніш          Сағыныш                Сагыныш
             Жанын            Жаным                  Жанын


                          Table 5. Text Normalization results.
           System                           Accuracy, per cent
           Keyboard correction       90.7
           Replacement               80.1
           Total                        85.40
   As shown in Table 5, spelling correction with the usage of keyboard model errors
gave higher accuracy than word replacement to find normalized form of value. Noisy
non-standard words correction not inserts words into the dataset, it generates best fit
candidate to the misspelling word. We made testing to 500 words, constructed testing
dataset according to words from “Abai Zholy”. Instead of using lexicon lookup, we
propose to use keyboard model for Kazakh language.


4 Conclusions

   NLP is the recent field of science in the Kazakhstan, there is a lack of tools for
preprocessing and spelling correction. In this research, we aimed to explore the neces-
sary components for text normalization of a morphologically rich language, Kazakh,
for the further studies related to this field.
   In this article, we suggested to use social media and messaging normalization tech-
nique for Kazakh language. We hope to have provided a better insight into spelling
correction by the keyboard usage in Kazakh alphabet which contains 42 letters 16
characters more than English.


Acknowledgements

   We thank the anonymous reviewers for helpful comments and suggestions. We al-
so thank Kessikbayeva Gulshat for her comments on a preliminary version of this
work.


References

Zitouni and R. Sarikaya. (2009). Arabic diacritic restoration approach based on
         maximum entropy models. London, UK: Computer Speech & Language.
Bo Han and Timothy Baldwin. (2011). Lexical Normalisation of Short Text
         Messages: Makn Sens a #twitter. Portland, Oregon, USA: Proceedings of
         ACL-HLT.
Damerau, F. (1964). A technique for computer detection and correction of spelling
         errors. Communications of the ACM, 659-664.
Fei Liu, Fuliang Weng, and Xiao Jiang. (2012). A broad-coverage normalization
         system for social media language. ACL, 1035–1044.
Gülşen Eryiğit, Dilara Torunoğlu-Selamet. (2017). Social media text normalization
         for Turkish. Natural Language Engineering, 835-875.
Joseph Kaufmann and Jugal Kalita. (2010). Syntactic normalization of Twitter
         messages. Kharagpur, India: International Conference on Natural Language
         Processing.
Kobus Catherine, François Yvon, and Géraldine. (2008). Transcrire les SMS comme
         on reconnaît la parole. Actes de la Conférence sur le Traitement Automatique
         des Langues (pp. 128–138). Avignon, France: TALN’08.
Kristina Toutanova and Robert C. Moore. (2002). Pronunciation modeling for
         improved spelling correction. Philadelphia, USA: Proceedings of the 40th
         Annual Meeting on Association for Computational Linguistics, ACL.
Kukich, K. (1992). Techniques for automatically correcting. ACM Computing
        Surveys, 24(4).
Liu, Deana Pennell and Yang. (2011). A character-level machine translation approach
        for normalization of SMS abbreviations. IJCNLP, 974–982.
Mohammad Saloot, Norisma Idris, Rohana Mahmud. (2014). An architecture for
        Malay Tweet normalization. Information Processing & Management, 621–
        633.
Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar,
        and Anupam Basu. (2007). Investigation and modeling of the structure of
        texting language. International Journal of Document Analysis and
        Recognition, 157-174.
Moore, Eric Brill and Robert C. (2000). An improved error model for noisy channel
        spelling correction. Englewood Cliffs, NJ, USA: Proceedings of the 38th
        Annual Meeting on Association for Computational Linguistics.
Mukhamedova, Raikhangul. (2015). Kazakh: A Comprehensive Grammar.
        Routledge(ISBN 9781317573081).
Nitin Indurkhya, Fred J. Damerau. (2010). Handbook of Natural Language
        Processing (2 ed.). New York, US: Taylor&Francis Group, LLC.
O. De Clercq, B. Desmet, S. Schulz, E. Lefever, V. Hoste. (2013). Normalization of
        Dutch user-generated content. Proceedings of the 9th International
        Conference on Recent Advances in Natural Language Processing (pp. 179-
        88). Hissar, Bulgaria: RANLP'13.
Panchapagesan Krishnamurthy, P.P. Talukdar, N Sridhar, A.G. Ramakrishnan.
        (2004). Hindi Text Normalization. Conference: Fifth International
        Conference on Knowledge Based Computer Systems (KBCS) (p. 10).
        Hyderabad, India: KBCS.
Paul Cook and Suzanne Stevenson. (2009). An unsupervised model for text message
        normalization. Boulder, USA: CALC 09: Proceedings of the Workshop on
        Computational Approaches to Linguistic Creativity.
Richard Sproat, Alan W. Black, Stanley F. Chen, Shankar Kumar, Mari Ostendorf,
        Christopher Richards. (2001). Normalization of non-standard. Computer
        Speech & Language, 15(3), 287-333.
Steffen Eger et al. (2016). A comparison of four character-level string-to-string
        translation models for (OCR) spelling error correction. The Prague Bulletin
        of Mathematical Linguistics, 77–99.
Yukino Baba, Hisami Suzuki. (2012). How Are Spelling Errors Generated and
        Corrected? A Study of Corrected and Uncorrected Spelling Errors Using
        Keystroke Logs. Proceedings of the 50th Annual Meeting of the Association
        for Computational Linguistics (pp. 373-377). Jeju, Republic of Korea:
        Association for Computational Linguistics.