1. Introduction

Challenges of Using Character Level Statistical Translation for Normalizing Old Estonian Texts Machine

Gerth Jaanimäe

0 0 University of Tartu, Institute of Estonian and General Linguistics , Jakobi 2, Tartu, 51005 , Estonia

235 243

This paper reports on experiments of normalizing the 19th century Estonian parish court records. Converting the historical texts from old to contemporary spelling system, also known as normalizing, can be challenging in itself due to the fact that there was no single orthographic standard or if there even was, often the rules were not strictly followed, so there was a lot of variation in the texts. This paper also concentrates on the more specific issues related to Estonian as a morphologically rich language and presents the initial results of applying the character level statistical machine translation normalization on the parish court records from the 19th century. Morphological richness and the peculiarities of the old orthography can create the problem of ambiguity, which we attempted to solve using word bigrams instead of single words for training. Also, as the annotated training data is scarce and we assumed that more of it helps us obtain better results, we tested the idea to create the artificial additional training data, the so-called silver standard. The old texts which's spellings were closest to modern Estonian were converted to the old spelling system, which is much simpler than the reverse process, and after that added to the training set.

1 natural language processing historical texts corpus linguistics text normalization

1. Introduction

Historical texts are invaluable resource for linguists, historians, genealogists and other people who use digital archives in their work. In the linguistic point of view these writings are interesting for the reason that they can provide an insight to the dialects, vocabulary and grammar used in the time period they were written in. These writings can be difficult to analyze automatically due to the differences between modern and old orthographies. Thus, the tools designed for contemporary language usually perform worse on them and they have to be converted to modern form, or in other words normalized [ 1 ]. Another approach would be adapting the tools to older orthography, however it would be very time consuming.

Estonian, which belongs to Finno-Ugric language family and on which this research is based, is a morphologically rich language, meaning that many different word forms can be created, and thus more material is needed to cover the vocabulary. Another issue that can occur is that some of the words normalized can create forms which are homonymous with forms of another word, which may cause falsely recognized lemmas for a given word. Automatic detection of these errors can be complicated, mainly because these words are often morphologically correct, and the sentences formed by them can also be in accordance with the rules of syntax.

The dataset that is used in this research consists of parish court records written in the 19th century. These texts were written mostly in Estonian and provide a valuable insight into the way of life, relationships and the language that was used colloquially during this time period. Some of these texts were written in old Estonian orthography, some in modern and a little portion in the so-called transitional spelling system. Also, the texts contain a sizeable amount of dialectal variation.

These varieties make them especially interesting from the linguistic point of view, however at the same time make them more difficult to normalize.

In this paper we discuss the issues described above and present the initial results of applying the statistical machine translation method for normalizing the Estonian texts written in the 19th century.

The paper consists of the following sections. Section 2 gives an overview of the data used in this research and describes the issues related to it. In Section 3, the normalization method and related work is described. Section 4 provides an overview of the preprocessing, and the normalization experiments themselves. Section 5 gives the summary of the results of the experiments and attempts to give the reasons behind them. In Section 6, the reasons are elaborated further and future plans are briefly discussed.

2. Description of the dataset

The dataset analyzed in this research consists of parish court records written in the 19th century. Automatic analysis of these texts would make it possible to perform keyword searches and use different NLP applications that are designed for standard language. While there exist NLP tools for standard Estonian, such as a Python library called ESTNLTK [ 2 ], the researched material have some features that make it impossible or extremely difficult to apply them off the shelf. Also, as Estonian morphology contains fusional elements, searching different keywords using regular expressions would be impossible or at least a lot of hard work. For example the genitive and partitive forms for South Estonian word susi ‘wolf’ is soe and sutt.

Not only is the material written in older spelling system and non-standard Estonian, they were also hand written and due to a big variation in the handwriting styles, it would be difficult and error-prone to use optical character recognition on them. Thus the texts were first manually transcribed by volunteers in the crowdsourcing project launched by the National Archives of Estonia.2 After that further processing and analysis could be performed.

Many of these writings are written in old spelling system which was introduced around the end of the 17th century and was heavily influenced by German orthography at the time. The main rules were as follows: 1. The long vowel of a stressed open syllable is marked by a single letter. 2. The long vowel of a stressed closed syllable is marked by a digraph. 3. The short vowel of a stressed open syllable is marked by a double consonant [ 3 ].

The old spelling system was also ambiguous as the Table 1 shows [ 4 ]. Although for a human it is quite easy to make the correct decision based on the context, it would be incredibly difficult for the normalization algorithm to know, which of the modern equivalents is the correct one.

To make matters more complicated there were two written languages in parallel use until the end of the 19th century representing North vs. South Estonian. Eventually the North Estonian language and spelling standard became the single standard for the whole country. The spelling standard Estonians know and use today was introduced in 1843 and started gaining popularity in the 1870s. This means that although there is some material in the dataset written in Modern Estonian orthography, most of it is written in older spelling and some of it during a transitional period, where people still wrote some words in the earlier spelling out of habit [ 3 ]. 2 https://www.ra.ee/vallakohtud/

South Estonian used to be considered a dialect of Estonian, but nowadays many linguists classify it as a separate language due to numerous grammatical and phonological differences suggesting that the South Estonian language branched off the Proto-Finnic language earlier on [ 5 ]. As the main goal of this research is to normalize the texts to standard Estonian, North and South Estonian are still treated as dialects. The data can be divided into nine different dialectal areas which in turn can be grouped into North and South Estonian dialects.

North Estonian: central, insular, coastal, western, eastern and northeastern dialects. South Estonian: Mulgi, Tartu and Võru dialects.

Mulgi dialect was an interesting case as the official language in this area was North Estonian, although colloquially South Estonian was spoken instead.

In addition to the sizable amount of dialectal variation, there are more challenges in normalizing these texts. Morphological richness, meaning that cases and derivations are used instead of prepositions and postpositions, poses some extra challenges in normalization. Main one being that there are inevitably many more different wordforms to normalize and thus probability of mistakes will be significantly increased. Also, there would be much smaller amount of frequently occurring prepositions that would automatically increase the scores reflecting the quality of normalization.

Another problem is the small amount of manually annotated data for training the machine learning algorithm as the annotation process is time consuming, human resources are limited and due to dialectal variations, data from one region often does not work for normalizing texts from another region.

3. Method

The method of normalizing older texts by converting them to standard modern spelling can be achieved using many different methods, such as dictionaries, rule-based approach, edit distances, machine translation etc. 3.1.

Method

The method used in the current investigation is often referred to as character level statistical machine translation, where the old and modern spelling systems are treated as two separate languages.

Also, as the “languages” are similar enough, the words are processed as sentences and characters as words. This makes it possible to translate the patterns of letters instead of just individual words, thus making it more flexible, compared to, for example, the dictionary-based method [ 6 ]. In order to overcome the challenges described in the previous section, the following processes were implemented.

In order to mitigate the problem of ambiguity, the bigrams, or in other words word pairs were used instead of giving a single word at a time for the algorithm to process. Therefore, the problem of ambiguity could possibly be solved thanks to the collocations providing the translated words the context.

The issue of scarcity of data could be solved by creating more artificial data for the algorithm to learn from, a so-called silver standard. The conversion from the contemporary spelling to old spelling can be achieved with a small amount of rules and thus could be done more easily than the reverse process. The texts were converted to the old spelling system and the pairs of texts were given for the machine translation algorithm to learn. 3.2.

Related work

Using character level statistical machine translation for normalizing historical texts is nothing new. One of the first experiments with this methods was to normalize old Slovene texts written in the 18th and 19th century [ 6 ]. It has been also extensively tested in order to compare its performance on English, Swedish, German, Icelandic and Hungarian [ 7 ]. Although there are more state of the art methods today, such as ones based on neural networks, which usually have better performance, they require large amounts of data. Some researchers have also found out that the method even performs worse on the smaller dataset [ 8 ] [ 9 ].

4. Description of experiments and setup

In order to evaluate the statistical machine translation for normalizing the text material the following preprocessing and experiments were performed.

A small set of parish court records, 153 in total, was randomly chosen for manual annotation and normalization. The annotation consists of morphological information, such as lemma and inflectional information. It also contains the normalized form for every given word, which is the main interest in our research.

Before training, the tokens were separated by newline, the letters by whitespace and the punctuation was removed. The corpus was then divided into nine smaller datasets according to the dialectal variations. After that these smaller datasets were randomly divided into training set, development set and test set in size of 75%, 5% and 20% respectively. The software used for the translation process was Moses3.

Training the models consisted from two steps. First the target language model is trained and after that the translation model. For the former the parallel corpus is needed and for the latter the corpus in the target language, or in our case, normalized words are required.

The scripts and related files are uploaded to Github.4

The following subsections describe different types of experiments. 4.1.

Baseline translation

The manually annotated corpus was used to train both the language model and translation model without any additions or modifications. As the target language or in our case normalized forms are in the same language for every dialect, the training sets were merged into single file for the language model.

The training set was used to train the translation models and the development set to tune them using minimal error rate training (mert). The accuracy on the test sets were calculated by comparing the translation to the normalized form found in the test set.

For cross validation purposes the corpus was shuffled in ten iterations into train, development and test sets and the macro-average was taken. Table 2 describes how many tokens the datasets contain. 3 https://www.statmt.org/moses/ 4 https://github.com/gerthjaanimae/csmt-parish-court-records

Translation using the silver standard

In order to improve the quality of the translation and give the training algorithm more data to learn, artificial data, the so called silver standard was introduced.

As converting texts from contemporary Estonian to old spelling system is much simpler than the reverse process, the old parish court records that had the spelling closest to modern Estonian were transformed into older orthography. In order to determine the texts to be converted, they were morphologically analysed using Vabamorf tagger, which is a tool for extracting morphological information from a given word and to determine if a word belongs to modern Estonian or not. It is contained in ESTNLTK library [ 2 ]. The texts that got the highest percentage of words in accordance to modern Estonian (about 1100 texts) were transformed to old system using the automatic syllabifier from the ESTNLTK library and some hand-crafted rules. The main ones being: the single letter refering to the consonant in the first syllable was doubled if the vowel was short. For example koli > kolli ‘stuff’. The double letters refering to a long vowel in a first syllable were singled. For example kooli > koli ‘to school’.

As a result the train and development sets got significantly larger. The test set remained the same as described in the experiment above. After appending the silver standard to the portion for the train and dev-sets, 90% of it went for former and 10% for latter.

Afterwards the process was identical to the one described above.

Table 3 describes how many words the datasets contain within the silver standard corpus.

Translation using larger language model

The process was identical to the baseline experiment, except the contemporary Estonian part of the silver standard corpus was added to train the target language model.

For comparison the language model in the baseline translation contained about 57000 tokens and the larger language model about 164000 tokens. 4.4.

Translation using bigrams

As the older spelling of Estonian could be ambiguous with one written form possibly corresponding to two different forms in the contemporary standard spelling (see section 2), the use of bigrams was tested to mitigate this problem.

As the word pairs containing punctuation were removed, the datasets became smaller.

Otherwise the experiment was identical to the baseline translation.

5. Results 5.1. Results of the text normalization

The following table describes the macro-average accuracies across 10 iterations on the test sets.

As can be observed from Table 4, the best results were obtained by using the baseline translation model together with large target language model. The explanation could be that the larger language model helps the algorithm to learn the patterns existing in the target language. The scores were the lowest using word bigrams, which could have a simple reason that within the longer strings the n-grams create, the probability of mistakes increases significantly. Across the dialects, the scores were the highest when normalizing the texts written in central dialect. That can be easily explained by the fact that modern standard Estonian is largely based on that dialect [ 3 ].

The scores were the lowest when normalizing the texts written in the northeastern dialect due to the small amount of data for training. It also has to be reminded that Mulgi, Tartu and Võru dialects belong into South Estonian and the rest into North Estonian dialects. 5.2.

Results of the morphological analysis

In order to measure the performance of the normalization on the bigger corpus, the “translated” texts were analyzed using Vabamorf tagger, which outputs the inflectional information for a given word and if it cannot be retrieved, we can deduce that it is not in accordance to the modern Estonian orthography [ 2 ].

The morphological analysis was performed first on the unnormalized texts and after that different translation methods were compared. The corpus consisted of around 25000 records.

Although it is a very rough and error-prone estimate, as some of the words can easily get incorrect analyses due to the fact that some of the old and dialectal word forms are homonymous with modern ones. For example, pesnud in standard modern Estonian means ‘washed’, in South-Estonian it means ‘beaten’. Regardless of the issues it still gives a general overview of the performance of the method used on the larger data that has not been annotated.

As is evident from Table 5, the scores were also the highest when using baseline translation together with large target language model and the lowest using word bigrams. The results are very likely the same as described in section 5.1. The results across the dialects were not so clear cut as in previous section, but the same tendencies also apply here, except the northeastern dialect, that ranked surprisingly high. The reasons for that could be that the texts were already relatively close to modern Estonian and due to the small amount of texts there is also lower amount of variation in vocabulary and thus also lower probability of mistakes.

6. Discussion

As it can be seen from the previous sections, the accuracy was the highest when performing the experiments using the larger language and baseline translation model and the lowest using word bigrams. The results remained almost the same when comparing the accuracies of baseline and silver standard experiments. The scores of the morphological analysis reflect similar results.

Although we expected much better results from the silver standard experiments, it is still too early to draw a definite conclusion and the silver standard might simply need further development and tuning. For example, the unstressed syllables are occasionally still incorrectly converted.

Also, the scores seem to be in accordance with the related work in character level machine translation. As can be seen from the Pettersson et al. experiments [ 7 ], the method performed better on English, Swedish and German (over 90% accuracy) and worse on Hungarian and Icelandic texts (around 80% and 70% accuracy respectively). One of the reasons was that both, the Hungarian and Icelandic texts, came from earlier time period compared to others, the other was most probably due to the fact that Hungarian is morphologically very rich and Icelandic richer compared to English [ 7 ]. As the same can be said about Estonian language, the lower accuracy can be expected.

Also, as mentioned in section 5.2, the scores reflecting the amount of words being in accordance with modern Estonian spelling system are rough estimates. There are some examples were even a human, who usually has more knowledge about word meanings and context than the machine, can normalize a word in a wrong way, let alone an algorithm. For example, töisel päeval means ‘on the second day’ in South Estonia. However, it can be easy to mistakenly give it a meaning ‘on the day when people were hard at work’, which is the meaning of the phrase in contemporary Estonian.

Also, as the distribution of data across different dialects is uneven, it can contribute to the occasionally inconsistent results. It would be interesting to test the combinations of different dialects that have some features in common.

7. Conclusion

Character level statistical machine translation showed promising results in normalizing old Estonian texts written in the 19th century. However, there is still a lot of work to be done in order to improve the quality and mitigate various issues that cropped up during the process. Mainly the silver standard has yet to be improved. Also combining the machine translation with some hand-crafted rules is something that might improve the quality of the normalization. It would be also important to gather the statistics about words that are already in their contemporary form, but still get erroneously normalized.

8. Acknowledgements

The author wishes to thank his supervisors Kadri Muischnek, Siim Orasmaa and Külli Prillop for their help and support and the National Archives of Estonia for the cooperation. This work has been supported by the national programme “Estonian language and culture in the digital age” grant EKKD29.

9. References

[1]

Piotrowski , Natural language processing for historical texts , Morgan & Claypool, 2012 .

[2]

Laur ,

Orasmaa ,

Särg , P. Tammo, EstNLTK 1 . 6: Remastered Estonian NLP Pipeline , in: Proceedings of the 12th Language Resources and Evaluation Conference , 2020 , pp. 7152 − 7160 .

[3]

Erelt , Estonian language , volume 1 of Linguistica Uralica Supplementary Series, Estonian Academy Publishers, 2007 .

[4]

Raag , Talurahvakeelest riigikeeleks, Atlex, Tartu, 2008 .

[5]

Kallio , The diversification of Proto-Finnic , in: volume 18 of Studia Fennica, Fibula, fabula, fact: The Viking Age in Finland, Suomalaisen Kirjallisuuden Seura, 2014 , pp. 155 - 170 .

[6]

Scherrer , T. Erjavec, Modernizing historical Slovene words with character-based SMT , in: 4th Biennial Workshop on Balto-Slavic Natural Language Processing , 2013 .

[7]

Pettersson ,

Tiedemann ,

Megyesi , An SMT approach to automatic annotation of historical text , in: Proceedings of the workshop on computational historical linguistics , 2013 .

[8]

Tang ,

Cap ,

Pettersson ,

Nivre , An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization, in: Proceedings of the 27th International Conference on Computational Linguistics , 2018 , pp. 1320 - 1331 .

[9]

Korchagina , Normalizing medieval German texts: From rules to deep learning , in: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language , Linköping University Electronic Press, 2017 , pp. 12 - 17 .

10. Appendix