=Paper=
{{Paper
|id=Vol-3232/paper21
|storemode=property
|title=Challenges Of Using Character Level Statistical Machine Translation For Normalizing Old Estonian Texts
|pdfUrl=https://ceur-ws.org/Vol-3232/paper21.pdf
|volume=Vol-3232
|authors=Gerth Jaanimäe
|dblpUrl=https://dblp.org/rec/conf/dhn/Jaanimae22
}}
==Challenges Of Using Character Level Statistical Machine Translation For Normalizing Old Estonian Texts==
Challenges of Using Character Level Statistical Machine Translation for Normalizing Old Estonian Texts

Gerth Jaanimäe
University of Tartu, Institute of Estonian and General Linguistics, Jakobi 2, Tartu, 51005, Estonia

Abstract
This paper reports on experiments in normalizing 19th-century Estonian parish court records. Converting historical texts from an old to the contemporary spelling system, also known as normalizing, can be challenging in itself: there was no single orthographic standard, and where one existed, its rules were often not strictly followed, so the texts contain a great deal of variation. The paper also concentrates on the more specific issues related to Estonian as a morphologically rich language and presents the initial results of applying character level statistical machine translation to normalizing the parish court records from the 19th century. Morphological richness and the peculiarities of the old orthography can create ambiguity, which we attempted to resolve by training on word bigrams instead of single words. Also, as annotated training data is scarce and we assumed that more of it would yield better results, we tested the idea of creating additional artificial training data, a so-called silver standard: the old texts whose spelling was closest to modern Estonian were converted to the old spelling system, which is much simpler than the reverse process, and then added to the training set.

Keywords
natural language processing, historical texts, corpus linguistics, text normalization

1. Introduction

Historical texts are an invaluable resource for linguists, historians, genealogists and other people who use digital archives in their work. From a linguistic point of view these writings are interesting because they can provide insight into the dialects, vocabulary and grammar of the period in which they were written.
The 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), March 15–18, 2022, Uppsala, Sweden. EMAIL: gerth.jaanimae@ut.ee; ORCID: 0000-0002-9588-1642. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

These writings can be difficult to analyze automatically because of the differences between modern and old orthographies. Tools designed for contemporary language thus usually perform worse on them, and the texts have to be converted to the modern form, or in other words normalized [1]. Another approach would be adapting the tools to the older orthography; however, that would be very time consuming.

Estonian, on which this research is based, belongs to the Finno-Ugric language family and is morphologically rich, meaning that many different word forms can be created and thus more material is needed to cover the vocabulary. Another possible issue is that normalization can produce forms which are homonymous with forms of another word, which may lead to a falsely recognized lemma for a given word. Automatic detection of these errors is complicated, mainly because such words are often morphologically correct, and the sentences they form can also be in accordance with the rules of syntax.

The dataset used in this research consists of parish court records written in the 19th century. These texts were written mostly in Estonian and provide a valuable insight into the way of life, the relationships and the language that was used colloquially during this period. Some of these texts were written in the old Estonian orthography, some in the modern one and a small portion in the so-called transitional spelling system. Also, the texts contain a sizeable amount of dialectal variation.
These varieties make the texts especially interesting from a linguistic point of view, but at the same time more difficult to normalize. In this paper we discuss the issues described above and present the initial results of applying the statistical machine translation method to normalizing Estonian texts written in the 19th century.

The paper consists of the following sections. Section 2 gives an overview of the data used in this research and describes the issues related to it. Section 3 describes the normalization method and related work. Section 4 provides an overview of the preprocessing and the normalization experiments themselves. Section 5 summarizes the results of the experiments and attempts to explain them. Section 6 elaborates on these explanations and briefly discusses future plans.

2. Description of the dataset

The dataset analyzed in this research consists of parish court records written in the 19th century. Automatic analysis of these texts would make it possible to perform keyword searches and to use various NLP applications designed for the standard language. While NLP tools exist for standard Estonian, such as the Python library ESTNLTK [2], the material under study has some features that make it impossible or extremely difficult to apply them off the shelf. Also, as Estonian morphology contains fusional elements, searching for keywords using regular expressions would be impossible, or at least a lot of hard work: for example, the genitive and partitive forms of the South Estonian word susi 'wolf' are soe and sutt. Not only is the material written in an older spelling system and in non-standard Estonian, it was also handwritten, and due to the large variation in handwriting styles it would be difficult and error-prone to apply optical character recognition to it.
Thus the texts were first manually transcribed by volunteers in a crowdsourcing project launched by the National Archives of Estonia.2 Only after that could further processing and analysis be performed.

Many of these writings use the old spelling system, which was introduced around the end of the 17th century and was heavily influenced by the German orthography of the time. Its main rules were as follows:

1. The long vowel of a stressed open syllable is marked by a single letter.
2. The long vowel of a stressed closed syllable is marked by a digraph.
3. The short vowel of a stressed open syllable is marked by a double consonant [3].

The old spelling system was also ambiguous, as Table 1 shows [4]. Although for a human it is quite easy to make the correct decision based on context, it would be incredibly difficult for a normalization algorithm to know which of the modern equivalents is the correct one.

To make matters more complicated, there were two written languages in parallel use until the end of the 19th century, representing North vs. South Estonian. Eventually the North Estonian language and spelling standard became the single standard for the whole country. The spelling standard Estonians know and use today was introduced in 1843 and started gaining popularity in the 1870s. This means that although some material in the dataset is written in modern Estonian orthography, most of it is written in the older spelling, and some of it dates from a transitional period in which people still wrote some words in the earlier spelling out of habit [3].
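The one-to-many nature of the old-to-modern mapping described above can be illustrated with a naive candidate generator. This is a sketch, not the paper's code: the rules are applied without proper syllabification, so candidates are deliberately over-generated.

```python
def modern_candidates(old_word: str) -> set[str]:
    """Naively enumerate plausible modern spellings for an old-orthography word.

    Illustrative only: real normalization needs syllabification and context;
    this merely shows why the mapping is one-to-many (cf. Table 1).
    """
    candidates = {old_word}  # the spelling may already be modern
    vowels = "aeiouõäöü"
    # Rule 1 reversed: a single vowel in a stressed open syllable may stand
    # for a long vowel, so old 'ma' could be modern 'maa' as well as 'ma'.
    for i, ch in enumerate(old_word):
        if ch in vowels:
            candidates.add(old_word[:i + 1] + ch + old_word[i + 1:])
            break  # only the first (stressed) syllable is considered
    # Rule 3 reversed: a double consonant after a short vowel may collapse,
    # so 'munna' could be modern 'muna' (while 'kolli' may stay 'kolli').
    for i in range(len(old_word) - 1):
        if old_word[i] == old_word[i + 1] and old_word[i] not in vowels:
            candidates.add(old_word[:i] + old_word[i + 1:])
            break
    return candidates
```

For instance, `modern_candidates("kolli")` contains both `koli` 'stuff' and `kolli` 'monster (genitive)', exactly the ambiguity a normalizer has to resolve from context.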
2 https://www.ra.ee/vallakohtud/

Table 1
Differences between old and modern Estonian orthographies

Old spelling    Modern spelling    Meaning
ma              maa                land
ma              ma                 I
ramat           raamat             book
maalt           maalt              from the country
munna           muna               egg
teggi           tegi               did
kolli           koli               stuff
kolli           kolli              monster (genitive form)

South Estonian used to be considered a dialect of Estonian, but nowadays many linguists classify it as a separate language, since numerous grammatical and phonological differences suggest that South Estonian branched off from Proto-Finnic earlier [5]. As the main goal of this research is to normalize the texts to standard Estonian, North and South Estonian are nevertheless both treated here as dialects. The data can be divided into nine dialectal areas, which in turn group into North and South Estonian dialects. North Estonian: the central, insular, coastal, western, eastern and northeastern dialects. South Estonian: the Mulgi, Tartu and Võru dialects. The Mulgi dialect is an interesting case, as the official language in this area was North Estonian, although South Estonian was spoken colloquially.

In addition to the sizable amount of dialectal variation, these texts pose further challenges for normalization. Morphological richness, meaning that cases and derivations are used where other languages use prepositions and postpositions, creates extra difficulties. The main one is that there are inevitably many more different word forms to normalize, so the probability of mistakes increases significantly. There is also a much smaller number of frequently occurring prepositions, which would otherwise automatically raise the scores reflecting normalization quality.
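The effect of fusional morphology on simple keyword search, mentioned in Section 2 with the susi/soe/sutt example, can be made concrete with a toy illustration (not the paper's tooling; the form-to-lemma table is hand-built here):

```python
import re

# The South Estonian noun susi 'wolf' appears as soe (genitive) and
# sutt (partitive): no surface pattern built on the lemma matches them.
forms_in_text = ["soe", "sutt"]       # inflected forms as they occur
pattern = re.compile(r"susi\w*")      # naive keyword search on the lemma

misses = [w for w in forms_in_text if not pattern.fullmatch(w)]
# -> both forms are missed by the regex

# What is needed instead is a mapping from attested forms to lemmas
# (hand-built toy table; a real system would use a morphological analyser):
FORM_TO_LEMMA = {"susi": "susi", "soe": "susi", "sutt": "susi"}
```

This is why normalization to standard spelling, followed by proper morphological analysis, is preferable to surface-level search over the raw records.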
Another problem is the small amount of manually annotated data for training the machine learning algorithm: the annotation process is time consuming, human resources are limited, and due to dialectal variation, data from one region often does not work for normalizing texts from another region.

3. Method

Normalizing older texts by converting them to the standard modern spelling can be achieved using many different approaches, such as dictionaries, hand-written rules, edit distances, machine translation etc.

3.1. Method

The method used in the current investigation is often referred to as character level statistical machine translation, where the old and modern spelling systems are treated as two separate languages. As the "languages" are similar enough, words are processed as sentences and characters as words. This makes it possible to translate patterns of letters instead of just individual words, which makes the method more flexible than, for example, a dictionary-based one [6].

In order to overcome the challenges described in the previous section, the following measures were taken. To mitigate the problem of ambiguity, bigrams, in other words word pairs, were given to the algorithm instead of a single word at a time; the collocations provide the translated words with context, which could resolve the ambiguity. The scarcity of data could be addressed by creating additional artificial data for the algorithm to learn from, a so-called silver standard. Conversion from the contemporary spelling to the old spelling can be achieved with a small set of rules and is thus much easier than the reverse process. Texts were converted to the old spelling system and the resulting pairs of texts were given to the machine translation algorithm to learn from.

3.2.
Related work

Using character level statistical machine translation for normalizing historical texts is nothing new. One of the first experiments with this method normalized old Slovene texts written in the 18th and 19th century [6]. The method has also been extensively tested to compare its performance on English, Swedish, German, Icelandic and Hungarian [7]. Although there are more recent state-of-the-art methods, such as those based on neural networks, which usually perform better, they require large amounts of data; some researchers have found that neural methods even perform worse on smaller datasets [8, 9].

4. Description of experiments and setup

In order to evaluate statistical machine translation for normalizing this material, the following preprocessing steps and experiments were performed. A small set of parish court records, 153 in total, was randomly chosen for manual annotation and normalization. The annotation consists of morphological information, such as the lemma and inflectional information. It also contains the normalized form of every word, which is the main interest of our research.

Before training, the tokens were separated by newlines, the letters by whitespace, and punctuation was removed. The corpus was then divided into nine smaller datasets according to dialect. Each of these smaller datasets was randomly divided into a training set, a development set and a test set of 75%, 5% and 20% respectively. The software used for the translation process was Moses.3 Training the models consists of two steps: first the target language model is trained and after that the translation model. The former requires a corpus in the target language, in our case the normalized words, while the latter requires a parallel corpus. The scripts and related files are available on GitHub.4 The following subsections describe the different types of experiments.

4.1.
Baseline translation

The manually annotated corpus was used to train both the language model and the translation model without any additions or modifications. As the target language, in our case the normalized forms, is the same for every dialect, the training sets were merged into a single file for the language model. The training set was used to train the translation models and the development set to tune them using minimum error rate training (MERT). The accuracy on the test sets was calculated by comparing the translation to the normalized form found in the test set. For cross-validation purposes the corpus was reshuffled into training, development and test sets in ten iterations and the macro-average was taken. Table 2 shows how many tokens the datasets contain.

3 https://www.statmt.org/moses/
4 https://github.com/gerthjaanimae/csmt-parish-court-records

Table 2
Sizes of the datasets in tokens in the annotated corpus

Dialect        Records    Training set    Development set    Test set    Total
Eastern        3          880             58                 236         1174
Central        79         17842           1189               4759        23790
Northeastern   2          375             25                 100         500
Western        23         5063            337                1351        6751
Mulgi          21         5735            382                1530        7647
Coastal        7          1543            102                413         2058
Insular        40         9391            626                2505        12522
Tartu          26         6413            427                1711        8551
Võru           40         9269            617                2473        12359

4.2. Translation using the silver standard

In order to improve the quality of the translation and give the training algorithm more data to learn from, artificial data, the so-called silver standard, was introduced. As converting texts from contemporary Estonian to the old spelling system is much simpler than the reverse process, the old parish court records whose spelling was closest to modern Estonian were transformed into the older orthography. To determine which texts to convert, they were morphologically analysed with the Vabamorf tagger, a tool contained in the ESTNLTK library that extracts morphological information from a given word and can thereby determine whether a word belongs to modern Estonian or not [2].
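This selection step might be sketched as follows. The morphological analyser is stubbed out here with a toy lexicon so the idea is runnable; the actual pipeline uses the Vabamorf tagger from ESTNLTK for the modern-word check.

```python
MODERN_WORDS = {"maa", "muna", "tegi", "koli", "raamat"}  # toy stand-in lexicon

def is_modern(word: str) -> bool:
    """Stand-in for: the Vabamorf tagger returns an analysis for this word."""
    return word.lower() in MODERN_WORDS

def modern_share(tokens: list[str]) -> float:
    """Fraction of tokens the analyser accepts as modern Estonian."""
    if not tokens:
        return 0.0
    return sum(is_modern(t) for t in tokens) / len(tokens)

# Records with the highest share would be picked for conversion into old
# orthography (about 1100 texts in the paper's setting).
records = {
    "rec1": ["maa", "muna", "tegi"],      # fully modern spelling
    "rec2": ["ramat", "teggi", "koli"],   # mostly old spelling
}
ranked = sorted(records, key=lambda r: modern_share(records[r]), reverse=True)
```

Here `rec1` ranks first, mirroring how the records closest to the modern standard were chosen as the basis of the silver standard.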
The texts with the highest percentage of words conforming to modern Estonian (about 1100 texts) were transformed into the old system using the automatic syllabifier from the ESTNLTK library and some hand-crafted rules, the main ones being the following. A single letter referring to a consonant in the first syllable was doubled if the preceding vowel was short, for example koli > kolli 'stuff'. A double letter referring to a long vowel in the first syllable was reduced to a single letter, for example kooli > koli 'to school'.

As a result, the training and development sets became significantly larger; the test set remained the same as in the experiment above. After appending the silver standard to the portion reserved for the training and development sets, 90% of it went to the former and 10% to the latter. The rest of the process was identical to the one described above. Table 3 shows how many words the datasets contain in the silver standard corpus.

Table 3
Sizes of the datasets in words in the silver standard corpus

Dialect        Training set    Development set    Test set    Total
Eastern        2158            240                234         2632
Central        99065           11008              4758        114831
Northeastern   439             49                 100         588
Western        19545           2172               1350        23067
Mulgi          5506            612                1529        7647
Coastal        1482            165                411         2058
Insular        9016            1002               2504        12522
Tartu          6156            685                1710        8551
Võru           8899            989                2471        12359

4.3. Translation using a larger language model

The process was identical to the baseline experiment, except that the contemporary Estonian part of the silver standard corpus was added when training the target language model. For comparison, the language model in the baseline translation contained about 57000 tokens and the larger language model about 164000 tokens.

4.4. Translation using bigrams

As the older spelling of Estonian could be ambiguous, with one written form possibly corresponding to two different forms in the contemporary standard spelling (see Section 2), the use of bigrams was tested to mitigate this problem. As the word pairs containing punctuation were removed, the datasets became smaller.
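One possible shape for this bigram preparation is sketched below: adjacent word pairs are extracted, pairs containing punctuation are dropped, and each pair is turned into a character-level "sentence" for Moses. The word-boundary marker `_` is an assumed detail, not taken from the paper's scripts.

```python
def bigrams(tokens: list[str]) -> list[tuple[str, str]]:
    """Adjacent word pairs; pairs containing punctuation are dropped."""
    pairs = zip(tokens, tokens[1:])
    return [p for p in pairs if all(w.isalpha() for w in p)]

def to_char_sentence(bigram: tuple[str, str], sep: str = "_") -> str:
    """Join a word pair with a boundary marker and space out the characters,
    so that characters play the role of words in the SMT system."""
    joined = sep.join(bigram)
    return " ".join(joined)

tokens = ["teggi", "kolli", ","]
pairs = bigrams(tokens)             # the pair ending in "," is dropped
line = to_char_sentence(pairs[0])   # "t e g g i _ k o l l i"
```

Feeding such pairs lets the surrounding word act as context, which is how the bigram experiment attempts to resolve ambiguous old spellings.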
Otherwise the experiment was identical to the baseline translation.

5. Results

5.1. Results of the text normalization

Table 4 shows the macro-average accuracies across the 10 iterations on the test sets.

Table 4
Results of the character level statistical machine translation experiments on the test sets

Dialect        Baseline translation    Silver standard    Large language model    Word bigrams
Central        88.05%                  82.07%             89.68%                  86.26%
Coastal        86.90%                  89.59%             81.53%                  83.39%
Eastern        88.39%                  88.25%             88.43%                  82.66%
Insular        86.39%                  86.00%             87.34%                  81.31%
Northeastern   72.4%                   76.6%              80%                     72.2%
Western        84.37%                  81.2%              85.31%                  81.32%
Mulgi          84.01%                  85.88%             86.88%                  81.59%
Tartu          78.94%                  82.27%             83.89%                  71.76%
Võru           83.02%                  83.52%             86.41%                  75.75%
Average        83.61%                  83.93%             85.5%                   79.58%

As can be observed from Table 4, the best results were obtained using the baseline translation model together with the large target language model. A possible explanation is that the larger language model helps the algorithm learn the patterns of the target language. The scores were lowest when using word bigrams, which may have the simple reason that within the longer strings the bigrams create, the probability of mistakes increases significantly.

Across the dialects, the scores were highest when normalizing texts written in the central dialect, which is easily explained by the fact that modern standard Estonian is largely based on that dialect [3]. The scores were lowest when normalizing texts written in the northeastern dialect, due to the small amount of training data. Recall that the Mulgi, Tartu and Võru dialects belong to South Estonian and the rest to the North Estonian dialects.

5.2.
Results of the morphological analysis

To measure the performance of the normalization on the bigger corpus, the "translated" texts were analyzed using the Vabamorf tagger, which outputs the inflectional information for a given word; if that information cannot be retrieved, we can deduce that the word does not conform to modern Estonian orthography [2]. The morphological analysis was first performed on the unnormalized texts, and after that the different translation methods were compared. The corpus consisted of around 25000 records. This is admittedly a rough and error-prone estimate, as some words can easily receive an incorrect analysis because some old and dialectal word forms are homonymous with modern ones: for example, pesnud means 'washed' in standard modern Estonian but 'beaten' in South Estonian. Regardless of these issues, it still gives a general overview of the performance of the method on the larger data that has not been annotated.

Table 5
Scores of the morphological analysis

Dialect        Not normalized    Baseline translation    Silver standard    Large language model    Word bigrams
Central        73.03%            85.27%                  83.79%             87.82%                  81.76%
Coastal        78.00%            83.35%                  83.09%             80.53%                  81.03%
Eastern        63.22%            74.66%                  74.76%             79.62%                  71.67%
Insular        69.27%            82.43%                  83.38%             86.73%                  80.31%
Northeastern   76.73%            83.10%                  85.64%             76.00%                  84.50%
Western        72.41%            79.59%                  80.41%             85.81%                  78.84%
Mulgi          67.23%            83.40%                  82.02%             85.94%                  80.00%
Tartu          74.02%            77.55%                  80.31%             84.40%                  73.85%
Võru           57.71%            82.13%                  82.95%             86.13%                  73.35%
Whole corpus   69.81%            80.79%                  81.06%             85.19%                  77.22%

As is evident from Table 5, the scores were again highest when using the baseline translation together with the large target language model, and lowest when using word bigrams; the reasons are very likely the same as those described in Section 5.1. The results across the dialects were not as clear-cut as in the previous section, but the same tendencies apply, except for the northeastern dialect, which ranked surprisingly high.
The reasons for that could be that these texts were already relatively close to modern Estonian and that, due to the small number of texts, there is also less variation in vocabulary and thus a lower probability of mistakes.

6. Discussion

As can be seen from the previous sections, the accuracy was highest when performing the experiments with the larger language model and the baseline translation model, and lowest when using word bigrams. The accuracies of the baseline and silver standard experiments remained almost the same, and the scores of the morphological analysis reflect similar results. Although we expected much better results from the silver standard experiments, it is still too early to draw a definite conclusion, and the silver standard may simply need further development and tuning; for example, unstressed syllables are occasionally still converted incorrectly.

The scores also seem to be in accordance with the related work on character level machine translation. In the experiments of Pettersson et al. [7], the method performed better on English, Swedish and German (over 90% accuracy) and worse on Hungarian and Icelandic texts (around 80% and 70% accuracy respectively). One reason was that both the Hungarian and the Icelandic texts came from an earlier time period than the others; another was most probably that Hungarian is morphologically very rich and Icelandic richer than English [7]. As the same can be said about Estonian, lower accuracy is to be expected.

Also, as mentioned in Section 5.2, the scores reflecting the proportion of words conforming to the modern Estonian spelling system are rough estimates. There are examples where even a human, who usually has more knowledge of word meanings and context than the machine, can normalize a word in the wrong way, let alone an algorithm. For example, töisel päeval means 'on the second day' in South Estonian.
However, it is easy to mistakenly read it as 'on the day when people were hard at work', which is the meaning of the phrase in contemporary Estonian. Also, as the distribution of data across the dialects is uneven, this may contribute to the occasionally inconsistent results. It would be interesting to test combinations of different dialects that have some features in common.

7. Conclusion

Character level statistical machine translation showed promising results in normalizing old Estonian texts written in the 19th century. However, there is still a lot of work to be done to improve the quality and to mitigate the various issues that cropped up during the process. Above all, the silver standard has yet to be improved. Combining the machine translation with some hand-crafted rules might also improve the quality of the normalization. It would also be important to gather statistics about words that are already in their contemporary form but still get erroneously normalized.

8. Acknowledgements

The author wishes to thank his supervisors Kadri Muischnek, Siim Orasmaa and Külli Prillop for their help and support, and the National Archives of Estonia for the cooperation. This work has been supported by the national programme "Estonian language and culture in the digital age", grant EKKD29.

9. References

[1] M. Piotrowski, Natural Language Processing for Historical Texts, Morgan & Claypool, 2012.
[2] S. Laur, S. Orasmaa, D. Särg, P. Tammo, EstNLTK 1.6: Remastered Estonian NLP Pipeline, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 7152–7160.
[3] M. Erelt, Estonian Language, volume 1 of Linguistica Uralica Supplementary Series, Estonian Academy Publishers, 2007.
[4] R. Raag, Talurahvakeelest riigikeeleks, Atlex, Tartu, 2008.
[5] P. Kallio, The diversification of Proto-Finnic, in: Fibula, fabula, fact: The Viking Age in Finland, volume 18 of Studia Fennica, Suomalaisen Kirjallisuuden Seura, 2014, pp.
155–170.
[6] Y. Scherrer, T. Erjavec, Modernizing historical Slovene words with character-based SMT, in: Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing, 2013.
[7] E. Pettersson, J. Tiedemann, B. Megyesi, An SMT approach to automatic annotation of historical text, in: Proceedings of the Workshop on Computational Historical Linguistics, 2013.
[8] G. Tang, F. Cap, E. Pettersson, J. Nivre, An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1320–1331.
[9] N. Korchagina, Normalizing medieval German texts: From rules to deep learning, in: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, Linköping University Electronic Press, 2017, pp. 12–17.

10. Appendix

Table 6
Detailed results of the morphological analysis (words analyzed / total words, with percentage)

Dialect        Not normalized                  Baseline translation            Silver standard
Central        781533 / 1070212 (73.03%)       920342 / 1079288 (85.27%)       903427 / 1078193 (83.79%)
Coastal        19924 / 25542 (78.00%)          21374 / 25643 (83.35%)          21312 / 25649 (83.09%)
Eastern        335425 / 530571 (63.22%)        397637 / 532616 (74.66%)        397211 / 531299 (74.76%)
Insular        127510 / 184077 (69.27%)        153608 / 186354 (82.43%)        155333 / 186293 (83.38%)
Northeastern   709 / 924 (76.73%)              772 / 929 (83.10%)              793 / 926 (85.64%)
Western        221394 / 305732 (72.41%)        245921 / 308990 (79.59%)        248385 / 308882 (80.41%)
Mulgi          51587 / 76737 (67.23%)          64264 / 77052 (83.40%)          63210 / 77071 (82.02%)
Tartu          488030 / 659357 (74.02%)        516597 / 666120 (77.55%)        534237 / 665201 (80.31%)
Võru           163521 / 283357 (57.71%)        234506 / 285522 (82.13%)        236412 / 285013 (82.95%)
Whole corpus   2189633 / 3136509 (69.81%)      2555021 / 3162514 (80.79%)      2560320 / 3158527 (81.06%)

Dialect        Large language model            Word bigrams
Central        941064 / 1071540 (87.82%)       879963 / 1076215 (81.76%)
Coastal        20655 / 25649 (80.53%)          20783 / 25649 (81.03%)
Eastern        423828 / 532306 (79.62%)        381669 / 532566 (71.67%)
Insular        161607 / 186324 (86.73%)        149390 / 186012 (80.31%)
Northeastern   703 / 925 (76.00%)              785 / 929 (84.50%)
Western        262656 / 306104 (85.81%)        243366 / 308677 (78.84%)
Mulgi          66187 / 77016 (85.94%)          61425 / 76780 (80.00%)
Tartu          560173 / 663746 (84.40%)        482008 / 652671 (73.85%)
Võru           245785 / 285349 (86.13%)        211507 / 288358 (73.35%)
Whole corpus   2682658 / 3148959 (85.19%)      2430896 / 3147857 (77.22%)