<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Challenges of Using Character Level Statistical Machine Translation for Normalizing Old Estonian Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gerth Jaanimäe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tartu, Institute of Estonian and General Linguistics</institution>
          ,
          <addr-line>Jakobi 2, Tartu, 51005</addr-line>
          ,
          <country country="EE">Estonia</country>
        </aff>
      </contrib-group>
      <fpage>235</fpage>
      <lpage>243</lpage>
      <abstract>
        <p>This paper reports on experiments in normalizing 19th-century Estonian parish court records. Converting historical texts from the old to the contemporary spelling system, also known as normalization, is challenging in itself: there was no single orthographic standard, or where one existed its rules were often not strictly followed, so the texts contain a lot of variation. This paper also concentrates on the more specific issues related to Estonian as a morphologically rich language and presents the initial results of applying character-level statistical machine translation to normalize the parish court records. Morphological richness and the peculiarities of the old orthography can create ambiguity, which we attempted to resolve by using word bigrams instead of single words for training. Also, as annotated training data is scarce and we assumed that more of it would help us obtain better results, we tested the idea of creating artificial additional training data, the so-called silver standard. The old texts whose spelling was closest to modern Estonian were converted to the old spelling system, which is much simpler than the reverse process, and then added to the training set.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language processing</kwd>
        <kwd>historical texts</kwd>
        <kwd>corpus linguistics</kwd>
        <kwd>text normalization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Historical texts are an invaluable resource for linguists, historians, genealogists and other people who
use digital archives in their work. From a linguistic point of view, these writings are interesting
because they provide insight into the dialects, vocabulary and grammar of the period
in which they were written. They can be difficult to analyze automatically because of the differences
between the modern and old orthographies. Tools designed for the contemporary language therefore usually
perform worse on them, and the texts have to be converted to their modern form, in other words normalized
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Another approach would be to adapt the tools to the older orthography; however, this would be very
time-consuming.
      </p>
      <p>Estonian, which belongs to the Finno-Ugric language family and on which this research is based, is a
morphologically rich language, meaning that many different word forms can be created, and thus more
material is needed to cover the vocabulary. Another issue is that a normalized word can coincide
with a form of a different word, which may lead to an incorrect lemma being
assigned. Automatic detection of these errors can be complicated,
mainly because such words are often morphologically correct, and the sentences formed by them can
also be in accordance with the rules of syntax.</p>
      <p>The dataset that is used in this research consists of parish court records written in the 19th century.
These texts were written mostly in Estonian and provide a valuable insight into the way of life,
relationships and the language that was used colloquially during this time period. Some of these texts
were written in the old Estonian orthography, some in the modern one and a small portion in the so-called
transitional spelling system. The texts also contain a sizeable amount of dialectal variation.</p>
      <p>This variety makes them especially interesting from a linguistic point of view, but at the
same time makes them more difficult to normalize.</p>
      <p>In this paper we discuss the issues described above and present the initial results of applying the
statistical machine translation method for normalizing the Estonian texts written in the 19th century.</p>
      <p>The paper consists of the following sections. Section 2 gives an overview of the data used in this
research and describes the issues related to it. In Section 3, the normalization method and related work
are described. Section 4 provides an overview of the preprocessing and of the normalization experiments
themselves. Section 5 summarizes the results of the experiments and attempts to explain
them. In Section 6, the explanations are elaborated further and future plans are briefly
discussed.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of the dataset</title>
      <p>
        The dataset analyzed in this research consists of parish court records written in the 19th century.
Automatic analysis of these texts would make it possible to perform keyword searches and use different
NLP applications that are designed for standard language. While there exist NLP tools for standard
Estonian, such as a Python library called ESTNLTK [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the researched material has some features
that make it impossible or extremely difficult to apply them off the shelf. Also, as Estonian morphology
contains fusional elements, searching for keywords with regular expressions would be
impossible or at least very laborious. For example, the genitive and partitive forms of the South Estonian
word susi ‘wolf’ are soe and sutt.
      </p>
      <p>Not only is the material written in the older spelling system and in non-standard Estonian, it was also
handwritten, and because of the great variation in handwriting styles it would be difficult and error-prone
to use optical character recognition on it. The texts were therefore first manually transcribed by
volunteers in a crowdsourcing project launched by the National Archives of Estonia.<sup>2</sup> After that,
further processing and analysis could be performed.</p>
      <p>
        Many of these writings use the old spelling system, which was introduced around the end of
the 17th century and was heavily influenced by the German orthography of the time. The main rules were
as follows: 1. The long vowel of a stressed open syllable is marked by a single letter. 2. The long vowel
of a stressed closed syllable is marked by a digraph. 3. The short vowel of a stressed open syllable is
marked by a double consonant [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The old spelling system was also ambiguous, as Table 1 shows [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Although it is
quite easy for a human to make the correct decision based on the context, it would be very difficult for the
normalization algorithm to know which of the modern equivalents is the correct one.
      </p>
      <p>
        To make matters more complicated, two written languages were in parallel use until the end of
the 19th century, representing North and South Estonian. Eventually the North Estonian language and
spelling standard became the single standard for the whole country. The spelling standard Estonians
know and use today was introduced in 1843 and started gaining popularity in the 1870s. This means
that although there is some material in the dataset written in Modern Estonian orthography, most of it
is written in older spelling and some of it during a transitional period, where people still wrote some
words in the earlier spelling out of habit [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p><sup>2</sup> https://www.ra.ee/vallakohtud/</p>
      <p>
        South Estonian used to be considered a dialect of Estonian, but nowadays many linguists classify it
as a separate language due to numerous grammatical and phonological differences suggesting that the
South Estonian language branched off the Proto-Finnic language earlier on [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As the main goal of this
research is to normalize the texts to standard Estonian, North and South Estonian are still treated as
dialects. The data can be divided into nine different dialectal areas which in turn can be grouped into
North and South Estonian dialects.
      </p>
      <p>North Estonian: central, insular, coastal, western, eastern and northeastern dialects.
South Estonian: Mulgi, Tartu and Võru dialects.</p>
      <p>Mulgi dialect was an interesting case as the official language in this area was North Estonian,
although colloquially South Estonian was spoken instead.</p>
      <p>In addition to the sizable amount of dialectal variation, there are further challenges in normalizing
these texts. Morphological richness, meaning that cases and derivations are used instead of prepositions
and postpositions, poses extra challenges. The main one is that there are
inevitably many more distinct word forms to normalize, so the probability of mistakes increases
significantly. There are also far fewer frequently occurring prepositions,
which would otherwise automatically raise the scores that reflect the quality of normalization.</p>
      <p>Another problem is the small amount of manually annotated data for training the machine learning
algorithm as the annotation process is time consuming, human resources are limited and due to dialectal
variations, data from one region often does not work for normalizing texts from another region.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>Normalizing older texts by converting them to standard modern spelling can be
achieved with many different methods, such as dictionaries, rule-based approaches, edit distances and
machine translation.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Method</title>
      <p>The method used in the current investigation is often referred to as character level statistical machine
translation, where the old and modern spelling systems are treated as two separate languages.</p>
      <p>
        Also, as the “languages” are similar enough, the words are processed as sentences and characters as
words. This makes it possible to translate the patterns of letters instead of just individual words, thus
making it more flexible, compared to, for example, the dictionary-based method [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In order to
overcome the challenges described in the previous section, the following processes were implemented.
      </p>
      <p>To mitigate the problem of ambiguity, word bigrams, in other words word pairs, were given to
the algorithm instead of a single word at a time. The neighbouring word supplies
context, so the collocation can help the algorithm choose the correct modern
form.</p>
      <p>The scarcity of data could be addressed by creating additional artificial data for the algorithm to
learn from, a so-called silver standard. The conversion from the contemporary spelling to the old spelling
can be achieved with a small number of rules and is thus easier than the reverse
process. The texts were converted to the old spelling system, and the resulting pairs of texts were given to the
machine translation algorithm to learn from.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Related work</title>
      <p>
        Using character-level statistical machine translation for normalizing historical texts is nothing new.
One of the first experiments with this method was the normalization of old Slovene texts written in the 18th
and 19th centuries [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It has been also extensively tested in order to compare its performance on English,
Swedish, German, Icelandic and Hungarian [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Although there are more state-of-the-art methods today,
such as those based on neural networks, which usually perform better, they require large
amounts of data. Some researchers have also found that the neural methods can even perform worse on
smaller datasets [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>4. Description of experiments and setup</title>
      <p>In order to evaluate the statistical machine translation for normalizing the text material the following
preprocessing and experiments were performed.</p>
      <p>A small set of parish court records, 153 in total, was randomly chosen for manual annotation and
normalization. The annotation consists of morphological information, such as lemma and inflectional
information. It also contains the normalized form for every given word, which is the main interest in
our research.</p>
      <p>Before training, the tokens were separated by newlines, the letters by whitespace, and the punctuation
was removed. The corpus was then divided into nine smaller datasets according to the dialectal
variations. After that, each of these smaller datasets was randomly divided into a training set, a development set
and a test set comprising 75%, 5% and 20% of the data respectively. The software used for the translation process
was Moses.<sup>3</sup></p>
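      <p>The preprocessing described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the scripts used in the experiments: the function names, the punctuation set and the fixed random seed are all hypothetical.</p>
      <preformat preformat-type="code">
```python
import random
import string

# Assumption: every punctuation character is stripped; the exact set is not given in the text.
PUNCTUATION = set(string.punctuation) | set("„“”«»–…")


def to_char_sequence(token):
    """Return the token as space-separated characters, or None if nothing but punctuation remains."""
    letters = [ch for ch in token if ch not in PUNCTUATION]
    return " ".join(letters) if letters else None


def prepare_pairs(pairs):
    """Turn (old, normalized) word pairs into character-level 'sentence' pairs."""
    prepared = []
    for old, new in pairs:
        src, tgt = to_char_sequence(old), to_char_sequence(new)
        if src and tgt:
            prepared.append((src, tgt))
    return prepared


def split_dataset(items, seed=0):
    """Shuffle the items and split them into 75% training, 5% development and 20% test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.75)
    n_dev = int(len(shuffled) * 0.05)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```
      </preformat>
      <p>Each output line is one character-level “sentence”, so the word kolli becomes k o l l i before it is passed to the translation system.</p>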
      <p>Training the models consisted of two steps. First the target language model was trained, and after
that the translation model. The former requires a corpus in the target language, in our case the
normalized words, and the latter requires the parallel corpus.</p>
      <p>The scripts and related files are uploaded to GitHub.<sup>4</sup></p>
      <p>The following subsections describe the different types of experiments.</p>
    </sec>
    <sec id="sec-7">
      <title>4.1. Baseline translation</title>
      <p>The manually annotated corpus was used to train both the language model and the translation model
without any additions or modifications. As the target language, in our case the normalized forms, is
the same for every dialect, the training sets were merged into a single file for the language
model.</p>
      <p>The training set was used to train the translation models and the development set to tune them using
minimum error rate training (MERT). The accuracy on the test sets was calculated by comparing the
translation to the normalized form found in the test set.</p>
      <p>For cross validation purposes the corpus was shuffled in ten iterations into train, development and
test sets and the macro-average was taken. Table 2 describes how many tokens the datasets contain.</p>
      <p><sup>3</sup> https://www.statmt.org/moses/</p>
      <p><sup>4</sup> https://github.com/gerthjaanimae/csmt-parish-court-records</p>
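      <p>The evaluation in this subsection, word-level accuracy on a test set followed by a macro-average over the shuffled iterations, can be sketched as follows. The function names are hypothetical and the snippet only illustrates the arithmetic, assuming that the system output and the gold normalizations are aligned word by word.</p>
      <preformat preformat-type="code">
```python
def accuracy(predicted, gold):
    """Share of words whose predicted normalization equals the gold normalization."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of words")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def macro_average(scores):
    """Unweighted mean of per-iteration scores, so every iteration counts equally."""
    scores = list(scores)
    return sum(scores) / len(scores)
```
      </preformat>
      <p>With ten shuffled splits, accuracy would be called once per split and macro_average once on the ten resulting values.</p>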
    </sec>
    <sec id="sec-8">
      <title>4.2. Translation using the silver standard</title>
      <p>To improve the quality of the translation and give the training algorithm more data to learn from,
artificial data, the so-called silver standard, was introduced.</p>
      <p>
        As converting texts from contemporary Estonian to the old spelling system is much simpler than the
reverse process, the old parish court records whose spelling was closest to modern Estonian were
transformed into the older orthography. To determine which texts to convert, they were
morphologically analysed with the Vabamorf tagger, a tool that extracts morphological
information from a given word and can thus be used to determine whether a word belongs to modern Estonian. It is
contained in the ESTNLTK library [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The texts with the highest percentage of words conforming to
modern Estonian (about 1100 texts) were transformed to the old system using the automatic syllabifier from
the ESTNLTK library and some hand-crafted rules. The main rules were as follows. The single letter referring to
the consonant in the first syllable was doubled if the vowel was short, for example koli &gt; kolli ‘stuff’.
The double letters referring to a long vowel in the first syllable were reduced to a single letter, for example kooli &gt; koli ‘to
school’.
      </p>
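      <p>The two main rules can be illustrated with the following simplified Python sketch. It is not the conversion script used for the silver standard: instead of the ESTNLTK syllabifier it approximates the first syllable with regular expressions, it covers only these two rules, and it lowercases its input. The letter inventories and the function name are assumptions made for the illustration.</p>
      <preformat preformat-type="code">
```python
import re

# Assumed letter inventories; loanword letters such as c, q, w, x and y are left out.
VOWELS = "aeiouõäöü"
CONSONANTS = "bdfghjklmnprsštvzž"

# Long vowel written with a double letter, followed by a single consonant and a vowel.
LONG_OPEN = re.compile(rf"^([{CONSONANTS}]*)([{VOWELS}])\2([{CONSONANTS}])([{VOWELS}])")
# Short vowel followed by a single consonant and a vowel.
SHORT_OPEN = re.compile(rf"^([{CONSONANTS}]*)([{VOWELS}])([{CONSONANTS}])([{VOWELS}])")


def to_old_spelling(word):
    """Apply the two first-syllable rules to a word in modern spelling."""
    lower = word.lower()
    match = LONG_OPEN.match(lower)
    if match:
        onset, vowel, consonant, next_vowel = match.groups()
        # kooli becomes koli: the long vowel is written with a single letter.
        return onset + vowel + consonant + next_vowel + lower[match.end():]
    match = SHORT_OPEN.match(lower)
    if match:
        onset, vowel, consonant, next_vowel = match.groups()
        # koli becomes kolli: the consonant after the short vowel is doubled.
        return onset + vowel + consonant * 2 + next_vowel + lower[match.end():]
    return lower
```
      </preformat>
      <p>Words that match neither pattern, such as those with a diphthong or a consonant cluster after the first vowel, are returned unchanged by this sketch.</p>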
      <p>As a result, the training and development sets became significantly larger. The test set remained the same as
in the experiment described above. The silver standard was appended to the training and
development portions, with 90% of it going to the former and 10% to the latter.</p>
      <p>Afterwards the process was identical to the one described above.</p>
      <p>Table 3 describes how many words the datasets contain within the silver standard corpus.</p>
    </sec>
    <sec id="sec-9">
      <title>4.3. Translation using larger language model</title>
      <p>The process was identical to the baseline experiment, except the contemporary Estonian part of the
silver standard corpus was added to train the target language model.</p>
      <p>For comparison, the language model in the baseline translation contained about 57000 tokens and
the larger language model about 164000 tokens.</p>
    </sec>
    <sec id="sec-10">
      <title>4.4. Translation using bigrams</title>
      <p>As the older spelling of Estonian could be ambiguous with one written form possibly corresponding
to two different forms in the contemporary standard spelling (see section 2), the use of bigrams was
tested to mitigate this problem.</p>
      <p>As the word pairs containing punctuation were removed, the datasets became smaller.</p>
      <p>Otherwise the experiment was identical to the baseline translation.</p>
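      <p>One possible way to build the bigram training lines is sketched below. The text does not specify how the two words of a pair were joined, so the boundary symbol and the function names are assumptions, as is the requirement that the old and normalized token lists are aligned one to one.</p>
      <preformat preformat-type="code">
```python
import string

PUNCTUATION = set(string.punctuation)
BOUNDARY = "_"  # hypothetical word-boundary symbol, not taken from the paper


def word_bigrams(old_tokens, new_tokens):
    """Pair consecutive tokens, skipping every pair that contains punctuation."""
    pairs = []
    for i in range(len(old_tokens) - 1):
        old_pair = old_tokens[i:i + 2]
        new_pair = new_tokens[i:i + 2]
        if any(ch in PUNCTUATION for token in old_pair + new_pair for ch in token):
            continue
        pairs.append((old_pair, new_pair))
    return pairs


def to_char_line(words):
    """Space-separate the characters of a word pair and mark the word boundary."""
    return f" {BOUNDARY} ".join(" ".join(word) for word in words)
```
      </preformat>
      <p>Because every pair that touches a punctuation token is dropped, a sentence yields fewer bigrams than it has words, which is consistent with the smaller datasets noted above.</p>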
    </sec>
    <sec id="sec-11">
      <title>5. Results</title>
    </sec>
    <sec id="sec-12">
      <title>5.1. Results of the text normalization</title>
      <p>The following table describes the macro-average accuracies across 10 iterations on the test sets.</p>
      <p>
        As can be observed from Table 4, the best results were obtained by using the baseline translation
model together with the large target language model. A possible explanation is that the larger language
model helps the algorithm to learn the patterns existing in the target language. The scores were the
lowest when using word bigrams; a simple reason could be that within the longer strings that the bigrams
create, the probability of mistakes increases significantly. Across the dialects, the scores were the
highest when normalizing the texts written in the central dialect. This can be explained by the fact
that modern standard Estonian is largely based on that dialect [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The scores were the lowest when normalizing the texts written in the northeastern dialect due to the
small amount of data for training. It should also be noted that the Mulgi, Tartu and Võru dialects belong
to South Estonian and the rest to the North Estonian dialects.</p>
    </sec>
    <sec id="sec-13">
      <title>5.2. Results of the morphological analysis</title>
      <p>
        To measure the performance of the normalization on the larger corpus, the “translated” texts
were analyzed with the Vabamorf tagger, which outputs the inflectional information for a given word; if
this information cannot be retrieved, we can deduce that the word does not conform to modern Estonian orthography
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
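      <p>The resulting measure, the share of words for which an analysis is found, could be computed as in the sketch below. The analyzer is passed in as a plain function (is_known, a hypothetical name) because the exact Vabamorf call depends on the ESTNLTK version; the tiny lexicon only stands in for the real analyzer.</p>
      <preformat preformat-type="code">
```python
def recognized_share(tokens, is_known):
    """Return the fraction of tokens for which the analyzer finds an analysis."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    recognized = sum(1 for token in tokens if is_known(token))
    return recognized / len(tokens)


# Stand-in for a morphological analyzer: a tiny made-up lexicon of modern forms.
MODERN_FORMS = {"kooli", "koli"}


def in_toy_lexicon(token):
    """Pretend analysis: a token is 'known' if it appears in the toy lexicon."""
    return token.lower() in MODERN_FORMS


share = recognized_share(["kooli", "kolli", "koli"], in_toy_lexicon)
```
      </preformat>
      <p>Running the same function on the unnormalized and on the normalized version of a text gives two shares whose difference indicates how much the normalization helped.</p>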
      <p>The morphological analysis was performed first on the unnormalized texts and after that different
translation methods were compared. The corpus consisted of around 25000 records.</p>
      <p>This is a very rough and error-prone estimate, as some words can easily receive incorrect
analyses because some old and dialectal word forms are homonymous with modern
ones. For example, pesnud means ‘washed’ in standard modern Estonian, whereas in South Estonian it means
‘beaten’. Despite these issues, it still gives a general overview of how the method
performs on the larger data that has not been annotated.</p>
      <p>As is evident from Table 5, the scores were again the highest when using the baseline translation together
with the large target language model and the lowest when using word bigrams. The reasons are very likely the
same as those described in Section 5.1. The results across the dialects were not as clear-cut as in the previous
section, but the same tendencies apply here, except for the northeastern dialect, which ranked surprisingly
high. Possible reasons are that these texts were already relatively close to modern Estonian and
that, owing to the small number of texts, there is less variation in vocabulary and thus
a lower probability of mistakes.</p>
    </sec>
    <sec id="sec-14">
      <title>6. Discussion</title>
      <p>As can be seen from the previous sections, the accuracy was the highest in the
experiments that used the larger language model with the baseline translation model, and the lowest with word
bigrams. The accuracies of the baseline and silver
standard experiments were almost the same. The scores of the morphological analysis reflect similar results.</p>
      <p>Although we expected much better results from the silver standard experiments, it is still too early
to draw a definite conclusion and the silver standard might simply need further development and tuning.
For example, the unstressed syllables are occasionally still incorrectly converted.</p>
      <p>
        Also, the scores seem to be in accordance with the related work in character level machine
translation. As can be seen from the Pettersson et al. experiments [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the method performed better on
English, Swedish and German (over 90% accuracy) and worse on Hungarian and Icelandic texts (around
80% and 70% accuracy respectively). One reason was that both the Hungarian and the Icelandic
texts came from an earlier time period than the others; the other was most probably
that Hungarian is morphologically very rich and Icelandic is richer than English [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. As the same
can be said about the Estonian language, lower accuracy can be expected.
      </p>
      <p>Also, as mentioned in Section 5.2, the scores reflecting the share of words that conform
to the modern Estonian spelling system are rough estimates. There are examples where even a
human, who usually has more knowledge about word meanings and context than a machine, can
normalize a word incorrectly, let alone an algorithm. For example, töisel päeval means ‘on the
second day’ in South Estonia. However, it is easy to mistakenly read it as ‘on the day when
people were hard at work’, which is the meaning of the phrase in contemporary Estonian.</p>
      <p>Also, as the distribution of data across different dialects is uneven, it can contribute to the
occasionally inconsistent results. It would be interesting to test the combinations of different dialects
that have some features in common.</p>
    </sec>
    <sec id="sec-15">
      <title>7. Conclusion</title>
      <p>Character-level statistical machine translation showed promising results in normalizing old Estonian
texts written in the 19th century. However, much work remains to improve the
quality and mitigate the various issues that arose during the process. In particular, the silver standard has
yet to be improved. Combining machine translation with hand-crafted rules
might also improve the quality of the normalization. It would also be important to gather statistics
about words that are already in their contemporary form but are still erroneously normalized.</p>
    </sec>
    <sec id="sec-16">
      <title>8. Acknowledgements</title>
      <p>The author wishes to thank his supervisors Kadri Muischnek, Siim Orasmaa and Külli Prillop for
their help and support and the National Archives of Estonia for the cooperation. This work has been
supported by the national programme “Estonian language and culture in the digital age” grant EKKD29.</p>
    </sec>
    <sec id="sec-17">
      <title>9. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Piotrowski</surname>
          </string-name>
          ,
          <article-title>Natural language processing for historical texts</article-title>
          , Morgan &amp; Claypool,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Laur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orasmaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Särg</surname>
          </string-name>
          , P. Tammo,
          <article-title>EstNLTK 1.6: Remastered Estonian NLP Pipeline</article-title>
          ,
          <source>in: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7152</fpage>
          −
          <lpage>7160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Erelt</surname>
          </string-name>
          ,
          <article-title>Estonian language</article-title>
          , volume
          <volume>1</volume>
          of Linguistica Uralica Supplementary Series, Estonian Academy Publishers,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Raag</surname>
          </string-name>
          , Talurahvakeelest riigikeeleks, Atlex, Tartu,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kallio</surname>
          </string-name>
          ,
          <article-title>The diversification of Proto-Finnic</article-title>
          , in: volume
          <volume>18</volume>
          of Studia Fennica, Fibula, fabula, fact: The Viking Age in Finland, Suomalaisen Kirjallisuuden Seura,
          <year>2014</year>
          , pp.
          <fpage>155</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Scherrer</surname>
          </string-name>
          , T. Erjavec,
          <article-title>Modernizing historical Slovene words with character-based SMT</article-title>
          ,
          <source>in: 4th Biennial Workshop on Balto-Slavic Natural Language Processing</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pettersson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Megyesi</surname>
          </string-name>
          ,
          <article-title>An SMT approach to automatic annotation of historical text</article-title>
          ,
          <source>in: Proceedings of the workshop on computational historical linguistics</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pettersson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          ,
          <article-title>An Evaluation of Neural Machine Translation Models on Historical Spelling Normalization</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1320</fpage>
          -
          <lpage>1331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Korchagina</surname>
          </string-name>
          ,
          <article-title>Normalizing medieval German texts: From rules to deep learning</article-title>
          ,
          <source>in: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language</source>
          , Linköping University Electronic Press,
          <year>2017</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>