<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>User-friendly Search Possibilities for Early Latvian Texts: Challenges Posed by Automatic Conversion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Everita Andronova</string-name>
          <email>everita.andronova@lumii.lv</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Frīdenberga</string-name>
          <email>anna.fridenberga@lu.lv</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lauma Pretkalniņa</string-name>
          <email>lauma.pretkalnina@lumii.lv</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renāte Siliņa-Piņķe</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elga Skrūzmane</string-name>
          <email>elga.skruzmane@lu.lv</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anta Trumpa</string-name>
          <email>anta.trumpa@lu.lv</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pēteris Vanags</string-name>
          <email>peteris.vanags@lu.lv</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Mathematics and Computer Science, University of Latvia</institution>
          ,
          <addr-line>Raiņa bulv. 29, Rīga LV-1459</addr-line>
          ,
          <country country="LV">Latvia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Latvian Language Institute, University of Latvia</institution>
          ,
          <addr-line>Kalpaka bulv. 4, Rīga LV-1050</addr-line>
          ,
          <country country="LV">Latvia</country>
        </aff>
      </contrib-group>
      <fpage>168</fpage>
      <lpage>176</lpage>
      <abstract>
        <p>This paper deals with the Corpus of early written Latvian and explains the methodology for normalising historical spellings found in texts from the 16th-18th cc. It describes the types of replacements which will make searching early texts more convenient.</p>
      </abstract>
      <kwd-group>
        <kwd>Historical corpus</kwd>
        <kwd>conversion of old spelling into modern</kwd>
        <kwd>replacement algorithms</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Diachronic corpora are of high importance not only for linguistic research but also for those
interested in other fields of humanities (literature, history, sociology, etc.). Historical spelling is a
considerable obstacle for broader use of the Corpus of early written Latvian texts (henceforth the
Corpus) among non-linguists. Work is currently underway on providing user-friendly search
possibilities in the corpus.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Corpus of early written Latvian texts: some remarks on its history and scope</title>
      <p>The Corpus of early written Latvian texts was launched in 2003 after a short one-year project, but
the origins of this initiative date back to the 1990s, when some texts from the 17th c. were manually
typed in at the Institute of Mathematics and Computer Science, University of Latvia (henceforth IMCS,
UL). A great deal of work in the digitalisation of Latvian texts covering different time periods has been
actively carried out, but the main emphasis has of course been on modern texts, as they were crucial for
Latvian language processing [1].</p>
      <p>
        In 2002, the Corpus was developed with financial support from the University of Latvia. This was a
joint activity gathering together researchers from the IMCS, UL and the Department of the Baltic
Languages, UL. It was one of the first projects in digital humanities in Latvia. The various stages and
methodology of development of the corpus have been presented to the scholarly community elsewhere
[
        <xref ref-type="bibr" rid="ref2">2, 3</xref>
        ].
      </p>
      <p>
        The original sources were acquired from the National Library of Latvia, scanned and returned to the
library. Both the Academic Library of the University of Latvia and the National Archive of Latvia, State
Historical Archives of Latvia have become cooperation partners as well. One of the aims of the Corpus
was to give researchers an opportunity to access these early Latvian texts in one repository, therefore
not only word and frequency indices and a concordancer, but also facsimiles are available on the corpus
platform (http://senie.korpuss.lv/). For a long time, this was the only public resource providing access
to Latvian early texts, and it was highly appreciated by scholars and students. At the moment Latvian
sources are available not only at the Latvian National Digital Library (http://gramatas.lndb.lv/), the
largest resource of Latvian books, periodicals, maps and recordings, but also scattered across
European libraries where intensive digitalisation is taking place. For instance, the digital copy of G.
Dreszell’s Catechism ‘Swähta Bährno=Mahziba’ (1682) is housed at the Royal Danish Library
(https://www.kb.dk/e-mat/dod/12089000708F-color.pdf), while G. Elger’s ‘Geistliche Catholische Gesänge’ (1621)
is kept at the Vilnius University Library, which has kindly passed the scan to the developers of the Corpus.
The Corpus is still under development and is being supplemented with new sources
(cf. [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ] on adding short texts to the Corpus, mostly the occasional poetry of the 18th c.).
      </p>
      <p>
        The scope of the Corpus is Latvian texts from the beginnings of the written tradition in the early 16th
century until 1800. These are mostly printed Latvian monolingual sources (with some supplementary
texts in German or Latin). A couple of bilingual dictionaries (German-Latvian and Latvian-German)
have been added to the Corpus. Although the major sources are printed texts, some transcripts of the
manuscripts have been also included (see [
        <xref ref-type="bibr" rid="ref1 ref4">5</xref>
        ] on the issues of decrypting the Statutes of Linen weavers
(1625) housed in the National Archive of Latvia, State Historical Archives of Latvia). Typically for the
time, the texts represented in the Corpus are mostly religious ones (hymnals, texts of catechisms, holy
scriptures, the Lord's Prayer, etc.) and mostly translations from different German sources (but also from
Latin and Polish). Therefore, we can trace a number of features of German and Latin origin in the language
of early printings (for more on the linguistic characteristics of early texts, see [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">6, 7, 8</xref>
        ]). Nevertheless,
original texts have also been produced; one of the most remarkable is the 1,200-page ‘Sermon book’
by G. Mancelius published in three parts in 1654 and comprising historical and ethnographic facts along
with nice rhetorical figures of speech.
      </p>
      <p>Because the original sources are scattered across Europe, the developers opted for a
full-text corpus in order to facilitate access to them. This explains the choice of interactive word indices for
almost every single source (only large sources lack them), with the possibility of navigating from the index
to the full text (see Fig. 1).</p>
      <p>Some non-standard metadata were added, so texts can be selected by author, century and text type
(see Fig. 2).</p>
      <p>Originally the Corpus was supported by in-house mark-up and a Java-based concordancer. At the
end of 2021, it had a volume of 1.1 million running words.</p>
      <p>In 2022, a new release of the Corpus is in progress. It will be moved to a new corpus platform
(more on this below), and a number of new sources have been added. The current
characteristics of the Corpus are as follows: the size is ca. 1.75 million tokens, and more than 100
different sources are included. Every source has kept the unique identifier that was assigned when the
Corpus was first created and that encodes some metadata about the source. It consists of an abbreviation of
the author, source and year of publishing, thus ensuring sustainable development and not confusing
users.<sup>2</sup> For example, the identifier Manc1638_PhL stands for the German-Latvian phrase book
‘Phraseologia Lettica’ published by Georg Mancelius in 1638. Thirty-five known authors and a number of
unknown authors are represented in the Corpus.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Historical spelling in early prints: experience of others</title>
      <p>
        A good overview and comparison of different methods for normalising historical spellings is
presented in Bollmann’s study [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ]. Here, we would like to describe the experience of researchers in
neighbouring countries sharing a similar history of the development of early printed texts.
      </p>
      <p>
        Estonian colleagues working on the Corpus of old written Estonian (https://vakk.ut.ee/) have
developed the converter Vakker [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ], which also uses conversion rules and later consults a dictionary
to deal with early sources. Polish researchers developing the KorBa corpus (17th–18th cc. texts with
morphosyntactic annotation, https://korba.edu.pl) offer their users transliterated and transcribed
(normalised) texts. They deal with spelling normalisation only, keeping the original inflectional endings
and lexis unchanged. The aim is to have the spelling of the texts as similar to modern Polish as possible
in order to facilitate search in the corpus [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ].
      </p>
      <p>
        Lithuanian researchers also aim to create a universal search engine dealing with different spellings
in their Database of old writings (http://seniejirastai.lki.lt/home.php). Thus, historical spellings are
rewritten in the modern Lithuanian alphabet, unifying graphemes and ignoring orthography, but taking
into account normalisation of phonetics (eliminating dialectal features and solving assimilation issues).
The pre-processing handles forms where rules cannot be applied. Although the morphology is not
changed, some of the rewriting rules apply to the morpheme level. The number of rules applied varies
from 74 to 495 [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ].
      </p>
      <p>
        Researchers dealing with Early New High German texts have presented their semi-automatic
normalisation tool Norma [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ]. The normalisation is performed in two stages. A distinction is made
between normalisation (preferring forms close to the original) and modernisation (preferring forms close
to the modern language).
        <fn id="fn2"><label>2</label><p>The list of source abbreviations is available at http://senie.korpuss.lv/abbrevs.jsp.</p></fn>
      </p>
      <p>
        To sum up, several approaches can be applied in the normalisation of old writings (such as wordlist
substitution or rule-based methods). The frequently used terms transliteration [cf. 14] and
transcription are sometimes used with different meanings in Latvian studies. Our working group
uses the term conversion to denote the process and result of such a change, as it covers several steps:
transliteration, transcription, and adaptation [
        <xref ref-type="bibr" rid="ref14">15</xref>
        ]. Rules are set for every single source in the Corpus,
and subsequently the software engineer converts the text into normalised spelling.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Coding in the Corpus: From Windows-1257 to Unicode</title>
      <p>The original texts were mostly printed using Fraktur and other blackletter typefaces, but in the
Corpus they are presented in Latin transliteration. The ASCII code set – single and combined characters
– was used in text processing. These combined symbols represented a number of diacritic marks typical
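for early writings. Thus, seven variants of the letter a with different diacritic marks are encountered in the Corpus:
1. à U+00E0 (C1 Controls and Latin-1 Supplement)
2. á U+00E1 (C1 Controls and Latin-1 Supplement)
3. ã U+00E3 (C1 Controls and Latin-1 Supplement)
4. â U+00E2 (C1 Controls and Latin-1 Supplement)
5. ȧ U+0227 (Latin Extended-B)
6. ä U+00E4 (C1 Controls and Latin-1 Supplement)
7. ⱥ U+2C65 (Latin Extended-C)</p>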
      <p>In 2017, the conversion into Unicode format<sup>3</sup> was carried out to ensure more precise visualisation
of the original text and to facilitate its comprehension. Linguists created 73 conversion tables, one per
source. Unicode files were added to the Corpus as separate items for downloading, and all the symbols used
were merged into a single table (see http://senie.korpuss.lv/unicode/tabula.pdf). The conversion of new
sources added to the Corpus continues, and the number of tables has thus far reached 168.</p>
      <p>Since the Corpus was created over a long period of time and sources were added gradually, different
symbols were introduced for the same grapheme, e.g., the grapheme ë in VLH1685_Sal was represented
as e#, but the same letter ë in the manuscript source Fuer1650_70_1ms was presented as e". During the
conversion to Unicode such cases were unified, and this grapheme is in all cases represented with
Unicode symbol ë (00EB), which visually is the closest version to the original graphemes.</p>
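      <p>The unification of legacy codes described above can be pictured as a lookup table. The following Python sketch is illustrative only and is not the project's software: the names LEGACY_TO_UNICODE and unify_legacy_codes are hypothetical, only the two legacy codes for ë come from the text, and the sample string is artificial.</p>
      <preformat>
```python
# Hypothetical table mapping legacy ASCII combinations to Unicode graphemes.
# Only the two codes for the grapheme ë are taken from the paper.
LEGACY_TO_UNICODE = {
    "e#": "\u00EB",  # code used in VLH1685_Sal
    'e"': "\u00EB",  # code used in Fuer1650_70_1ms
}


def unify_legacy_codes(text, table=LEGACY_TO_UNICODE):
    """Replace every legacy combination with its single Unicode grapheme."""
    # Longer codes are handled first so that a long code is never split
    # by a shorter one that happens to be its prefix.
    for legacy in sorted(table, key=len, reverse=True):
        text = text.replace(legacy, table[legacy])
    return text


# Artificial sample string containing both legacy codes.
print(unify_legacy_codes('te# un te"'))
```
      </preformat>
      <p>In practice one such table would be needed per source, since the same legacy code may stand for different graphemes in different sources.</p>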
    </sec>
    <sec>
      <title>5. On-going modernisation of the Corpus (2020-2022)</title>
      <p>In 2020 the project ‘Digital Resources for Humanities: Integration and Development’ was initiated
to support the development of, and wider access to, digital resources. The modernisation of the Corpus will
be developed further within this project: the conversion from the old spelling into the modern one is
being carried out and a switch to the more advanced corpus management system NoSketch Engine is in
progress.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conversion into Modern Latvian orthography</title>
      <p>Unicode files now serve as input data for conversion of the texts into Modern Latvian orthography,
which will provide easier search and comprehension of the corpus material. This task presupposes not
only transliteration, but also morphological adaptation of Old Latvian spelling to the modern one. Both
procedures can be facilitated and accelerated by elaboration of certain rules of automatic conversion.
This paper presents problems that occur when performing automatic transliteration.
<fn id="fn3"><label>3</label><p>The project was funded by the University of Latvia within the academic development project ‘Switch of the Corpus of Early Written Latvian to Unicode’ (LU, No AAP2017/63).</p></fn></p>
    </sec>
    <sec id="sec-6">
      <title>6.1. Methodology</title>
      <p>As this research deals with the very first Latvian sources, the number of spelling and morphology
versions is very high and differs from source to source. The Corpus comprises both printed and
handwritten texts of different length. There is no stable Latvian orthographic system in this period yet;
we can observe attempts of different authors to offer their own writing systems. Taking into account
the facts mentioned above, we conclude that each source or at least each author requires an individual
conversion approach. In order to provide the best possible results, we opted for creating hand-crafted
conversion rule tables for every single source. These rules do not use any Latvian lexicon or language
processing tools, because to the best of our knowledge there are none for early Latvian. The historical
dictionary of Latvian (16–17th cc.) (www.tezaurs.lv/lvvv) is still too small (only ca 2000 entries) to be
of significant help for large-scale transliteration.</p>
      <p>The accepted conversion process consists of the following steps:
1. Development of tables of conversion rules for every source. Each rule is deterministic, i. e., it is
applied to every token it matches, and rules can stack on each other, namely, each token can
undergo multiple rule applications to reach its final converted form.
2. Implementation of the tables in the software algorithm and automatic conversion.
3. Post-editing: rereading the converted texts (all of a text, or part of it if the text is very long) and detecting
errors.
4. Error analysis and supplementation/correction of the tables, evaluating the usefulness of each
correction where possible.
5. Repeated automated conversion.
6. Quality assessment.</p>
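      <p>Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in the project: the function names are hypothetical, rules are assumed to be plain string replacements applied in table order, and the table is a small subset built from correspondences quoted in this paper for Mancelius’ texts.</p>
      <preformat>
```python
# Illustrative subset of one source's rule table: (old spelling, modern spelling).
# A real table is far longer and its order is fixed by the linguists.
RULES = [
    ("dſ", "dz"),
    ("gh", "g"),
    ("nn", "n"),
    ("ih", "ī"),
    ("eh", "ē"),
    ("ee", "ie"),
    ("ß", "s"),
]


def convert_token(token, rules=RULES):
    """Apply every rule, in table order, to a single token."""
    for old, new in rules:
        # Deterministic: a rule fires wherever it matches, and later rules
        # operate on the output of earlier ones (rules "stack").
        token = token.replace(old, new)
    return token


def convert_text(text, rules=RULES):
    """Convert a whitespace-separated text token by token."""
    return " ".join(convert_token(tok, rules) for tok in text.split())


print(convert_token("ghann"))     # two rules apply in sequence
print(convert_token("dſihrehß"))  # four rules apply in sequence
```
      </preformat>
      <p>Steps 3 to 5 then amount to editing the table and running the same function again, which is why corrected rules can be re-used to regenerate the converted text.</p>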
      <p>It should be noted that the converted text will not be identical to modern standard Latvian: the
main emphasis lies on making the root of the word recognisable, and vowel length in suffixes is ignored
at this stage.</p>
    </sec>
    <sec id="sec-7">
      <title>6.2. Characteristics of early Latvian sources and spelling conversion applied</title>
      <p>Our recent experience is based mostly on the texts of the 1st (from the beginning of the 16th c. until
the 1620s) and the 2nd (1631–1680s) period of Old Latvian. These texts are characterized by the greatest
amount of spelling variation, and thus they hopefully cover most of the potential issues.</p>
      <p>A high level of inconsistency in spelling and ambiguity of graphemes and grapheme combinations
is typical for the first period sources, which consist of mostly anonymous translations of religious texts
of various length. This can be illustrated by the large variety of spellings of the word ‘heart’ within a
single source: Szyrdtcz, Szirdes, ßirde, ßirdtcz, ßyrdtz, ßyrdtcz (UP1587). A comparison of several
sources reveals even greater diversity: Szirdees, Sczyrdtcz, czirdtcz, ßirde (Ench1615). Of course, the
conversion tables for these sources include rules converting a letter to another letter (ä&gt;ā), a grapheme
combination to a letter (ſch&gt;š), or one grapheme combination to another (like dcz&gt;dz), but due to the
high orthographic inconsistency, the source tables in this group have a disproportionate number of
so-called individual correspondences, in which a root is replaced by a root or a whole word by a whole
word (czedaatz&gt;dziedāts ‘sung’). As a result, the number of conversion rules in this group of sources is
relatively high; for example, the ‘Vndeudsche Psalmen’ (UP1587) has 1024 rules.</p>
      <p>
        Nine mid-17th century sources by Georg Mancelius make up the largest group of the second period.
G. Mancelius has an improved and more systematic spelling in comparison to texts of the previous
period; therefore, it was assumed that letter-to-letter replacement or letter-to-grapheme combination
correspondences would predominate in the conversion of Mancelius’ works. However, this assumption
was not completely borne out [
        <xref ref-type="bibr" rid="ref15">16</xref>
        ]. For example, in ‘Ten conversations’ by Mancelius (Manc1638_Run)
it is possible to replace some letters or grapheme combinations with a particular letter or grapheme
combination in modern writing. So, w&gt;v (pļawas&gt;pļavas ‘meadows’), v&gt;u (vs&gt;uz ‘to’, Vppe&gt;Upe
‘river’), ñ&gt;n (mañ&gt;man ‘for me’), ä&gt;e (rättais&gt;retais ‘the seldom’, wätz&gt;vecs ‘old’, Bährni&gt;Bērni
‘children’), à&gt;ā (Zeemà&gt;Ciemā ‘guest’), ee&gt;ie (Deena&gt;Diena ‘day’); in turn ie&gt;ī (brienums&gt;brīnums
‘miracle’), gh&gt;g (ghann&gt;gan ‘enough; ever’), tſch&gt;č (tſchettrus&gt;četrus ‘four’) etc. Double consonants
in most cases can be replaced by one consonant, e. g., bb&gt;b (drebb&gt;dreb ‘shiver’, labba&gt;laba ‘good’),
ļļ&gt;ļ (zeļļu&gt;ceļu ‘I pick up’, packaļļ&gt;pakaļ ‘after’), nn&gt;n (mann&gt;man ‘to me’), tt&gt;t (Ratti&gt;Rati
‘carriage’), rr&gt;r (turr&gt;tur ‘keeps’, Barribu&gt;Barību ‘food’), ŗŗ&gt;ŗ (kuŗŗam&gt;kuŗam ‘to whom’) etc. A
short vowel and the following letter h can be replaced by a long vowel, e. g., äh&gt;ē (Dähls&gt;Dēls ‘son’),
uh&gt;ū (truhx&gt;trūks ‘will lack’), eh&gt;ē (Drehbes&gt;Drēbes ‘clothes’), ih&gt;ī (dſihrehß&gt;dzīrēs ‘was going to’);
the only exception is oh&gt;o (Ohrmans&gt;Ormans ‘a coachman’), denoting a diphthong.
      </p>
      <p>The order of conversion rules is also crucial in many cases. For instance, the replacements iß&gt;iz and
Jß&gt;Iz should be completed before all other changes involving ß&gt;s. However, the number of exceptions
for several letters or grapheme combinations is still very high. During the process of conversion, it was
observed that, e. g., the long ſ without a stroke corresponds to modern s and z about equally often, and
the grapheme combination ſch corresponds to modern š and ž equally. A decision was made to write
conversion rules for separate grapheme combinations, namely, dſ&gt;dz, ſi&gt;zi, ſt&gt;st, ſp&gt;sp, ſm&gt;zm, ſl&gt;sl.
If necessary, lexical substitution of root to root or lexeme to lexeme was carried out, e. g., Meſch&gt;Mež
‘forest’. The same issues concern the conversion of long s with a stroke ẜ, as well as z, y, x, ß and the
grapheme combination tz. After several attempts, 190 rules were set, to be applied in a fixed order.</p>
    </sec>
    <sec id="sec-8">
      <title>6.3. Description of conversion rules</title>
      <p>On the basis of the conversion rule tables for the 16th and 17th cc. Latvian sources developed so far,
we may identify three main conversion rule groups, each with subgroups:
1. Unambiguous graphemic correspondences:
1) grapheme-to-grapheme conversion, e. g., à&gt;ā (Dahrſà&gt;dārzā ‘in a garden’);
2) grapheme combination to letter, e. g., tſch&gt;č (Lahtſchus&gt;lāčus ‘bears’ Pl.Acc.);
3) letter to grapheme combination, e. g., x&gt;ks (attmaxaht&gt;atmaksāt ‘to repay’);
4) grapheme combination to grapheme combination, e. g., ee &gt; ie (peedärr&gt;pieder ‘belongs’).
2. Positional (graphemic and morphemic) correspondences:
1) depending on the position in a word, e. g., in the beginning or in the middle of the word: tz&gt;c
(Tzilwäki&gt;cilvēki ‘men’), in the middle or at the end of the word: tz&gt;c (tapetz&gt;tāpēc ‘therefore’,
Swetze&gt;svece ‘candle’); at the end also tz&gt;ts (ſälltz&gt;zelts ‘gold’) or tz&gt;ds (Ghalltz&gt;galds ‘table’);
2) depending on neighbouring letters, e. g., aya&gt;āja (ißghaya&gt;izgāja ‘went out’); but ty&gt;tī
(nackty&gt;naktī ‘at night’).</p>
      <p>3. Individual (lexical) correspondences:
1) word roots, e. g., ſwäht&gt;svēt (ſwähtitam&gt;svētītam ‘blessed’); here position in
the word also matters, e. g., at the beginning of the word tytcz&gt;tic (tytczam&gt;ticam ‘we believe’), but at the end of the word
tytcz&gt;tīts (raxtytcz&gt;rakstīts ‘written’);
2) separate lexemes, e. g., ſöv&gt;sev ‘for oneself’.</p>
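      <p>Position-dependent correspondences such as those listed above lend themselves to regular expressions with word-boundary anchors. The sketch below is one possible encoding, given as an assumption rather than as the notation actually used in the project's tables; the names RULES and convert_token are hypothetical, and the three rules are the tytcz and x correspondences quoted in the list.</p>
      <preformat>
```python
import re

# Hypothetical encoding of positional rules as (pattern, replacement) pairs.
# Each token is matched on its own, so ^ and $ mark the word boundaries.
RULES = [
    (re.compile(r"^tytcz"), "tic"),   # at the beginning of a word
    (re.compile(r"tytcz$"), "tīts"),  # at the end of a word
    (re.compile(r"x"), "ks"),         # position-independent
]


def convert_token(token):
    """Apply the positional rules in table order to one token."""
    for pattern, replacement in RULES:
        token = pattern.sub(replacement, token)
    return token


print(convert_token("tytczam"))   # word-initial rule applies
print(convert_token("raxtytcz"))  # word-final rule, then the x rule
```
      </preformat>
      <p>With this encoding a token consisting of tytcz alone would be caught by the word-initial rule, because that rule comes first in the table; cases of this kind are where an explicit lexical rule or an exception list would be needed.</p>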
      <p>Undeniably, the older the source, the more inconsistency is observed in the representation of
different graphemes and phonemes. This is the reason why positional and individual correspondences
are prevalent in the process of conversion of the texts from the earliest period (before 1631), which in
turn increases the number of rules applied. Of course, setting individual correspondences is a
time-consuming task, but this is the only way to handle some of the instances where graphemes are
ambiguous.</p>
      <p>Taking into account the development of Latvian writing, the number of conversion rules gradually
decreases as fewer individual rules are needed and as writing becomes more homogeneous. The newer
a source is and the more consistent its spelling, the fewer positional and
individual correspondences, and hence the fewer conversion rules, are needed.</p>
      <p>The sequence of conversion rules is crucial, e. g., the rule ee&gt;ie can be applied only after the rule
ie&gt;ī has been implemented. In general, the sequence of correspondence rules is as follows: lexical,
morphemic, graphemic.</p>
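      <p>The effect of rule order can be checked on the two correspondences just mentioned. In this small Python sketch (the helper apply_rules and the two list names are hypothetical), the same pair of rules is applied in both orders to the forms Deena and brienums quoted earlier for Mancelius.</p>
      <preformat>
```python
def apply_rules(token, rules):
    """Apply (old, new) replacements one after another, in list order."""
    for old, new in rules:
        token = token.replace(old, new)
    return token


# Order recommended in the text: the older digraph ie is resolved first.
CORRECT_ORDER = [("ie", "ī"), ("ee", "ie")]
# Reversed order: the ie produced from ee is wrongly turned into ī as well.
WRONG_ORDER = [("ee", "ie"), ("ie", "ī")]

for word in ("Deena", "brienums"):
    print(word, apply_rules(word, CORRECT_ORDER), apply_rules(word, WRONG_ORDER))
```
      </preformat>
      <p>With the reversed order Deena is carried on past Diena to Dīna, whereas brienums is unaffected by the order, which illustrates why such errors only surface in some words during post-editing.</p>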
      <p>The number of conversion rules also depends on the size of text. The number of rules varies from
37 rules in the Lord’s Prayer to 1024 correspondences in ‘Vndeudsche Psalmen’ (1587). As the
orthography of the 18th c. texts is similar to the spelling predominating at the end of the 17th c., it could
be possible to create a conversion template which might be used for the bulk of the sources, with some
minor variations.</p>
    </sec>
    <sec id="sec-9">
      <title>7. Switch from in-house platform to NoSketch Engine</title>
      <p>In 2022 a new corpus version was released. The corpus was moved from an in-house platform to the
NoSketch Engine platform (http://nosketch.korpuss.lv/#dashboard?corpname=senie_unicode), because
the old platform is no longer maintained. The corpus is now available on a par with other Latvian
language corpora.</p>
      <p>
        During the migration to the new platform, we took special care to keep the unique address
of every token the same as in the old version. The address format makes it very convenient to
cite the particular wordform in articles and in the corpus-based Historical dictionary of Latvian
(https://tezaurs.lv/lvvv/; [
        <xref ref-type="bibr" rid="ref1 ref4">5</xref>
        ]). The address consists of source identifier, page, line or book of the Bible,
chapter and verse, as in Fig. 3.
      </p>
      <p>In NoSketch Engine the address is presented in a separate window, showing text identifier, year of
publication and page number. Simple metadata have been provided (author, century, year of
publication, title, text genre, and type (printed/ handwritten)).</p>
      <p>Different languages are encountered in the sources (mostly German, Latin, Polish, but also Greek,
Hebrew), which have been appropriately marked in the corpus. Even though these languages are not of
primary interest to this research, it is worth mentioning that NoSketch Engine will provide search
possibilities in these parts of texts; they were excluded from analysis in the old system.</p>
      <p>NoSketch Engine offers us a concordancer and wordlists of the original forms. At the moment, a
search can be done either by original forms or regular expressions describing original forms, but after
completion of conversion, searches will also be available by converted forms and regular expressions
describing converted forms. However, the search results (concordancer and wordlists) will be presented
in the original writing. At the moment we do not plan to publish conversions as whole texts, as we fear
that converted but not standardised text may confuse a number of corpus users with no research
background in early prints.</p>
    </sec>
    <sec id="sec-10">
      <title>8. Results and issues</title>
      <p>If any incorrectly recognised or typed wordforms are noticed, they are corrected in the Unicode file.
It turns out that pre-editing is preferable, e. g. expanded spacing in a word in a header should be
eliminated (like J E S U&gt;Jesu ‘Jesus’). Pre-editing concerns only formatting, but obvious original
spelling mistakes are defined as separate replacement rules. Therefore, the number of rules grows, but
we can re-use them and get a new version of the converted text.</p>
      <p>There are replacement rules supplemented with a list of exceptions; the number of exceptions might
reach ca. ten in some cases.</p>
      <p>In some languages where normalisation of historical spelling is performed, pre-editing takes care of
dialectal forms. However, we decided to leave them as in the original, e. g. ūz- (the prefix uz- in modern
Latvian), āz- (the prefix aiz- in modern Latvian), the verb form jir (ir) ‘is’ etc. Our practice is not to
intervene in the original text. In addition to this, we cannot solve highly complicated linguistic issues
in historical texts (thus, svēts ‘holy’ and its variant švēts are left as two forms because there is no clear
agreement on this yet).</p>
      <p>As a result, we obtain a converted text and a Unicode file which is as close as possible to the original.
After conversion, post-editing is performed, mistakes are evaluated, and new replacement rules are
written.</p>
      <p>
        Although this process is very time-consuming, the results show that source-based rules give rather
precise results. A major bonus of this approach is that differences between sources do not introduce
new errors in other, differently written sources, which was a major issue in [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ]. Another major
improvement compared to [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ] is the elimination of multiple transliteration variants per single token:
since all conversion rules in this project are mandatory and deterministic, only a single transliteration
per token is generated.
      </p>
      <p>Up to now, all replacement rules have been written by linguists; no machine learning method has
been applied. We hope that it will be possible to create a kind of template with base rules for the
conversion of further texts of the 18th c., in which spelling is not so idiosyncratic.</p>
    </sec>
    <sec id="sec-11">
      <title>9. Conclusions</title>
      <p>In this paper, we have described the methods used for normalisation of early Latvian sources,
identifying three main conversion rule groups with subgroups: 1) unambiguous graphemic
correspondences; 2) positional (graphemic and morphemic) correspondences; 3) individual (lexical)
correspondences. This will make texts more accessible to scholars in the humanities.</p>
    </sec>
    <sec>
      <title>10. Acknowledgements</title>
      <p>We would like to express our gratitude to the Department of Baltic Linguistics, University of Latvia
and the Latvian Language Institute for supporting the corpus development at different stages during 2002–2021.
We would especially like to thank Andrejs Spektors and Normunds Grūzītis from the AILab,
IMCS, UL for their long-term guidance.</p>
      <p>The modernization of the Corpus of early written Latvian has been undertaken within the framework
of the National Research Programme ‘Digital Resources of the Humanities’ (No:
VPP-IZM-DH2020/1-0001) funded by Latvian Council of Science of the Ministry of Education and Science. This
article has been prepared within the same project.</p>
    </sec>
    <sec>
      <title>11. References</title>
      <p>[1] A. Spektors, Latviešu valodas datorfonda izveide, Latvijas Zinātņu Akadēmijas Vēstis A 2 (2001)
74–82. URL: http://ailab.mii.lu.lv/aspekt/dfond.htm.</p>
      <p>[2] E. Andronova, The Corpus of Early Written Latvian: current state and future tasks, in: Proceedings
of Corpus Linguistics, Birmingham, UK, 2007. URL: https://www.birmingham.ac.uk/documents/college-artslaw/corpus/conferencearchives/2007/245Paper.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
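          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Andronova</surname>
          </string-name>
          ,
          <article-title>The Corpus of Early Written Latvian: current state and future tasks</article-title>
          , in:
          <source>Proceedings of Corpus Linguistics</source>
          , Birmingham, UK,
          <year>2007</year>
          . URL: https://www.birmingham.ac.uk/documents/college-artslaw/corpus/conferencearchives/2007/245Paper.pdf.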
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Milčonoka</surname>
          </string-name>
          ,
          <article-title>Latviešu valodas 17. gadsimta teksti internetā</article-title>
          ,
          <source>Baltu filoloģija</source>
          XII (1) (
          <year>2003</year>
          )
          <fpage>139</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Andronova</surname>
          </string-name>
          ,
          <article-title>Short texts in the Corpus of early written Latvian (www.korpuss.lv/senie)</article-title>
          , in: S. Reinsone, I. Skadiņa, A. Baklāne, J. Daugavietis (Eds.),
          <source>Proceedings of the 5th Conference on Digital Humanities in the Nordic Countries (DHN)</source>
          , CEUR Workshop Proceedings, volume
          <volume>2612</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>183</lpage>
          . URL: http://ceur-ws.org/Vol-2612/short1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Andronova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Siliņa-Piņķe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trumpa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vanags</surname>
          </string-name>
          ,
          <article-title>The Electronic Historical Latvian Dictionary Based on the Corpus of Early Written Latvian Texts</article-title>
          ,
          <source>Acta Baltico-Slavica</source>
          <volume>40</volume>
          . Pogranicze bałtycko-słowiańskie w aspekcie leksykalnym i leksykograficznym (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          . https://doi.org/10.11649/abs.2016.018.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vanags</surname>
          </string-name>
          ,
          <article-title>Die möglichen Formen deutschen Einflusses auf die grammatische und syntaktische Struktur der ältesten lettischen Texte</article-title>
          ,
          <source>Linguistica Baltica</source>
          <volume>2</volume>
          (
          <year>1993</year>
          )
          <fpage>163</fpage>
          -
          <lpage>181</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vanags</surname>
          </string-name>
          ,
          <article-title>Latvian texts in the 16th and 17th centuries: beginnings and development</article-title>
          , in: K. Ross, P. Vanags (Eds.),
          <source>Common Roots of the Latvian and Estonian Literary Languages</source>
          , Peter Lang, Frankfurt am Main etc.,
          <year>2008</year>
          , pp.
          <fpage>172</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vanags</surname>
          </string-name>
          ,
          <article-title>German Influence on the Christian Discourse of Early Written Latvian</article-title>
          , in: M. Kaukko, M. Norro, K.-M. Nummila, T. Toropainen, T. Fonsén (Eds.),
          <source>Languages in the Lutheran Reformation. Textual Networks and the Spread of Ideas</source>
          , Amsterdam University Press, Amsterdam,
          <year>2019</year>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>301</lpage>
          . doi-10.5117-9789462981553-ch12.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bollmann</surname>
          </string-name>
          ,
          <article-title>A Large-Scale Comparison of Historical Text Normalization Systems</article-title>
          ,
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , volume
          <volume>1</volume>
          . Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>3885</fpage>
          -
          <lpage>3898</lpage>
          . URL: https://aclanthology.org/N19-1389.pdf. doi: 10.18653/v1/N19-1389.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Prillop</surname>
          </string-name>
          ,
          <article-title>Kuidas märksõnastada vanu eestikeelseid tekste?</article-title>
          ,
          <source>Keel ja Kirjandus</source>
          ,
          <volume>2</volume>
          , (
          <year>2004</year>
          )
          <fpage>90</fpage>
          -
          <lpage>99</lpage>
          . URL: https://vakk.ut.ee/avaleht/Prillop-KK-2-2004.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Gruszczyński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Adamiec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bronikowska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kieraś</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Modrzejewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wieczorek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woliński</surname>
          </string-name>
          .
          <article-title>The Electronic Corpus of 17th- and 18th-century Polish Texts</article-title>
          .
          <source>Lang Resources &amp; Evaluation</source>
          .
          <volume>56</volume>
          (
          <year>2022</year>
          )
          <fpage>309</fpage>
          -
          <lpage>332</lpage>
          . https://doi.org/10.1007/s10579-021-09549-1.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Šinkūnas</surname>
          </string-name>
          ,
          <article-title>Senųjų raštų rašybos keitimas paieškos sistemai</article-title>
          , in: G. Judžentytė-Šinkūnienė, V. Zubaitienė (Eds.),
          <source>Baltų kalbų tekstų ir žodžių reikšmės</source>
          , Vilniaus universiteto leidykla, Vilnius,
          <year>2018</year>
          , pp.
          <fpage>389</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bollmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dipper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krasselt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petran</surname>
          </string-name>
          ,
          <article-title>Manual and Semi-automatic Normalization of Historical Spelling - Case Studies from Early New High German</article-title>
          ,
          <source>Proceedings of the KONVENS Workshop on Language Technology for Historical Text(s) (LThist2012)</source>
          , Wien, Austria,
          <year>2012</year>
          . URL: https://www.linguistics.rub.de/~dipper/pub/lthist12.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pretkalnina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paikens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gruzitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rituma</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Spektors</surname>
          </string-name>
          ,
          <article-title>Making historical Latvian texts more intelligible to contemporary readers</article-title>
          .
          <source>Proceedings of the LREC Workshop on Adaptation of Language Resources and Tools for Processing Cultural Heritage Objects</source>
          , LREC
          <year>2012</year>
          ,
          <fpage>29</fpage>
          -
          <lpage>35</lpage>
          . URL: https://www.researchgate.net/publication/230800163.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Andronova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Frīdenberga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pretkalniņa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Siliņa-Piņķe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Skrūzmane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trumpa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vanags</surname>
          </string-name>
          ,
          <article-title>Latviešu valodas senāko rakstu pieminekļu konvertācija mūsdienu rakstībā: iepriekšējā pieredze un automatizācijas mēģinājumi</article-title>
          ,
          <source>Aktuālas problēmas literatūras un kultūras pētniecībā: rakstu krājums</source>
          , atb. red. Anita Helviga. Liepāja, LiePA,
          <volume>27</volume>
          , (
          <year>2022</year>
          )
          <fpage>346</fpage>
          -
          <lpage>358</lpage>
          . URL: https://dom.lndb.lv/data/obj/1035006.html.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Andronova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Frīdenberga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pretkalniņa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Siliņa-Piņķe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Skrūzmane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trumpa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vanags</surname>
          </string-name>
          ,
          <article-title>Variantums kā konvertācijas izaicinājums: Georga Manceļa tekstu atveide mūsdienu rakstībā</article-title>
          ,
          <source>Letonika</source>
          (
          <year>2022</year>
          ). To appear.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>