=Paper= {{Paper |id=Vol-2717/paper12 |storemode=property |title=Towards improving OCR accuracy with Bulgarian Language Resources |pdfUrl=https://ceur-ws.org/Vol-2717/paper12.pdf |volume=Vol-2717 |authors=Ivan Kratchanov,Laska Laskova,Kiril Simov |dblpUrl=https://dblp.org/rec/conf/dhn/KratchanovLS20 }} ==Towards improving OCR accuracy with Bulgarian Language Resources== https://ceur-ws.org/Vol-2717/paper12.pdf
                               Towards improving OCR accuracy with
                                  Bulgarian Language Resources

                   Ivan Kratchanov1 [0000-0002-0430-7953], Laska Laskova2 [0000-0002-6931-9082], and
                                         Kiril Simov2 [0000-0003-3555-0179]

                          1 Digitization Centre, National Library “Ivan Vazov”, Plovdiv, Bulgaria
                                                   ivankra@gmail.com
                     2 AIaLT, Institute of Information and Communication Technologies, Sofia, Bulgaria
                                             {laska|kivs}@bultreebank.org


                           Abstract. In 2017, the National Library “Ivan Vazov”–Plovdiv embarked
                           on a digitalization project whose ultimate purpose is to provide both
                           learners and scholars with several types of content, including periodicals
                           and books published during the late Bulgarian National Revival and
                           afterwards, in the decades before the communist era (1870s–1940s).
                           We focus on the technical aspects of the digitalization project, which
                           involves optical character recognition (OCR) and requires proper handling
                           of Cyrillic texts. The paper provides insight into the library’s joint
                           activities with its partners from the Institute of Information and
                           Communication Technologies at the Bulgarian Academy of Sciences to develop
                           relevant tools and methodologies, and stresses the mutual benefits of the
                           cooperation. The library’s participation in the project CLaDA-BG,
                           integrated within the European CLARIN and DARIAH infrastructures,
                           offered a chance to take advantage of the multidisciplinary expertise of
                           the partnering organisations, to develop the best methodology for
                           OCR and, consequently, to enhance the methods of using and handling
                           the acquired machine-readable text.

                           Keywords: Digitization · Cultural Heritage · Digital Library ·
                           Optical Character Recognition · Spelling Models · Modern Bulgarian


                  1 Introduction
                  The paper discusses the current efforts of the National Library “Ivan Vazov”–
                  Plovdiv (NLIV) in making digitized content accessible to learners and scholars
                  and focuses on the technical aspects of a digitalization project that involves
                  optical character recognition (OCR) and requires the proper handling of Bulgarian
                  Cyrillic texts, especially texts published before the last major spelling reform
                  of 1945 (historical texts). It provides insight into the library’s joint activities
                  with the Institute of Information and Communication Technologies at the
                  Bulgarian Academy of Sciences (IICT-BAS) for the development of relevant tools
                  and methodologies.
                  Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
                  License Attribution 4.0 International (CC BY 4.0).




Twin Talks 2 and 3, 2020           Understanding and Facilitating Collaboration in Digital Humanities     115/143

                      Our goal is twofold: (1) to perform a correct OCR on historical texts and (2)
                  to normalize them, i.e. to convert various old spellings to the present one. The
                  first step is essential for the publication of the original documents (newspapers,
                  magazines, books, etc.). The second is important for at least two reasons: it makes
                  it possible for users to search a corpus of both historical and present-day documents
                  with a query input in the current Bulgarian orthography, and it allows for the
                  application of NLP tools built for contemporary Bulgarian to the normalized
                  texts.
                      In order to achieve our first goal, we planned several experiments. The best
                  option is, of course, to train professional OCR software to perform OCR for the
                  old Bulgarian orthography as well as possible. Thus, our first experiment was to
                  train the ABBYY FineReader system on a lexicon provided by the IICT-BAS group.
                  The lexicon is a version of the contemporary inflectional lexicon of Bulgarian in
                  which the word forms have been converted in accordance with the writing rules of
                  an old spelling. The conversion was based on rules that take into account the
                  combination of letters in the word form, their position and some relevant
                  grammatical features. As a result, the new “old” version of the lexicon
                  contains 1 121 872 word forms. After the training of ABBYY FineReader, we
                  performed an evaluation on 20 scanned pages from issue 1/1881 of the magazine
                  “Science”, a total of 5485 running words. The percentage of misrecognized words
                  dropped from 4.9% to 4.4%. The average number of misrecognized hyphenated
                  words per page was reduced from 6.9 to 5.55. Although these improvements are
                  modest, they show that training with knowledge resources is possible and that
                  it has the capacity to improve the OCR result.

                  2 The Problem: Spelling Variations and Old Orthography
                    Models in Bulgarian Printed Historical Texts
                  Optical recognition of, and access to, texts printed before the last orthographic
                  reform of the Bulgarian language (1945) are of utmost importance for any researcher
                  in the social sciences and humanities whose work is related to 18th–19th century
                  Bulgaria. The reform, known as the Fatherland’s Front Reform, brought about
                  the current form of the language written and spoken by Bulgarians today.
                  Before 1945, there were several attempts at creating an exhaustive set of
                  orthographic prescriptions (models) for written modern Bulgarian as opposed to the
                  example of Church Slavonic.
                      Among those models, some proved to have more impact than others [3, 7]:
                  the Drinov model (1870–1899), its slightly modified version, the Drinov-Ivanchev
                  model (1899–1921), the short-lived Omarchevski model (1921–1923) and an
                  updated version of the Drinov-Ivanchev orthography (1923–1945). They were
                  developed by various authorities: writers, educationalists, scientific organizations
                  such as the Bulgarian Literary Society (the predecessor of BAS), and specially
                  appointed committees. For all of them, there were several topics of major importance:
                      – modification of the Old Bulgarian alphabet in order to have an adequate
                        representation of the modern Bulgarian phonemes. For instance, whether the
                        letter щ should remain in the alphabet was a subject of ardent discussion.
                        While щ represents the consecutive pronunciation of the sounds /ʃ/ and /t/,
                        each of them has its own letter, ш and т, respectively. Some argued that
                        щ should be replaced by the combination of ш and т.
                      – mapping of sound changes onto letters. For example, in modern Bulgarian,
                        the sound /ɨ/, represented by the letter ы, has the reflex /i/, which is
                        already represented by the letter и, thus rendering ы redundant. From a
                        phonological point of view, keeping ы and several other redundant letters
                        (ѣ, ѧ, ѫ, ѩ, ѭ, і, ꙗ) in use was meaningless, but at a time when Bulgarian
                        identity was being (re)built, many considered those letters evidence and a
                        symbol of cultural continuity.
                      – selection of regional phonomorphological norms as the basis for the creation
                        of a standard language. Different dialects offered different solutions. One
                        question that remained open for decades, because of the substantial regional
                        linguistic variation, concerned the spelling of the endings of 1st and 2nd
                        conjugation present-tense verbs in the first-person singular and third-person
                        plural, for example вървя [vɤr’vʲɤ], ‘(I) am going’ and вървят [vɤr’vʲɤt],
                        ‘(they) are going’. Depending on their region of origin and/or considerations
                        about the prestige associated with some of the vernaculars, authors of various
                        prescriptive texts suggested different spellings. If the inflectional inventory
                        of the dialect included only the “hard endings” [ɤ]/[ɤt], the letter a seemed
                        to be the most appropriate choice: върва [vɤr’vɤ], върват [vɤr’vɤt]. The “soft
                        endings” [ʲɤ]/[ʲɤt] were represented in accordance with the spelling rules of
                        Old Bulgarian, that is, by the letter ѭ (вървѭ, вървѭт), or, alternatively,
                        by я (вървя, вървят) and even ꙗ (вървꙗ, вървꙗт).

                      Excerpt (1) below is from a newspaper article published in 1878. When compared
                  to its normalized version, given under each line, it gives a good idea of some of
                  the key differences between a Drinov type of spelling and the modern orthography
                  (in the normalized lines, ∅ marks a letter that the modern spelling drops):

                     (1) На телеграммата отъ 10 Юлия Главнокомандующийтъ на войскитѣ
                         На телеграмата от∅ 10 юли∅ главнокомандващият∅ на войските
                         позволи изнасяньето на хранитѣ отъ България .
                         позволи изнасян∅ето на храните от∅ България .
                         ‘In a telegram from 10 July, the Commander-in-chief
                         gave permission to export the food from Bulgaria.’

                      Except for the Omarchevski model, which replaced the two yers ъ and ь
                  altogether with ѫ and dropped silent letters, all other spelling models kept the
                  silent ъ and ь at the end of words phonetically ending in a consonant (in
                  this example, отъ [ot] and Главнокомандующийтъ [glavnoko’mandujuʃtijt]).
                  Here we also have an example of another redundant letter, ѣ, which denoted
                  /ɛ/ in Old Bulgarian (войскитѣ [voj’skite], хранитѣ [hra’nite]). In Western
                  Bulgarian dialects, the reflex of the vowel /ɛ/ is /e/, while in the majority of the
                  Eastern dialects it is /ja/. After the reform of 1945, a complex rule regulated
                  the replacement of ѣ with е or я depending on prosodic and phonetic factors.
                  The rest of the differences between the two spellings in example (1) are the
                  result either of dialect variation (изнасяньето [iz’nasʲanʲeto] vs. изнасянето
                  [iz’nasʲaneto]) or of the introduction of foreign norms: gemination (телеграммата
                  vs. телеграмата) and capitalization of the names of the months and job titles.
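The differences visible in excerpt (1) can be illustrated with a few string rules. The following is a minimal sketch, not the project's normalization method: the function names are hypothetical, the rules cover only spellings that occur in the excerpt, and ѣ is always mapped to е, which ignores the complex е/я rule mentioned above.

```python
import re

# Illustrative rules only: they cover spellings seen in excerpt (1) and are
# NOT the rule set used in the project.
def normalize_word(word: str) -> str:
    """Map an old-spelling word form to an approximate modern spelling."""
    # Drop the silent word-final yer (отъ -> от).
    word = re.sub(r"ъ$", "", word)
    # Simplifying assumption: always map ѣ to е (the real rule picks е or я).
    word = word.replace("ѣ", "е")
    # Undo the gemination of м (телеграммата -> телеграмата).
    word = word.replace("мм", "м")
    # Drop the soft sign before е (изнасяньето -> изнасянето).
    word = word.replace("ье", "е")
    return word


def normalize_text(text: str) -> str:
    """Normalize every whitespace-separated token of a text."""
    return " ".join(normalize_word(token) for token in text.split())


print(normalize_text("отъ войскитѣ изнасяньето на хранитѣ"))
```

Lexical replacements such as Главнокомандующийтъ vs. главнокомандващият cannot be handled by letter-level rules of this kind and would need a lexicon or a trained model.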
                      Observations on NLIV collections of historical texts show that until 1891,
                  more than a decade after the restoration of the Bulgarian state, different
                  publishing entities followed their own spelling and grammar conventions. That was
                  due to the fact that the elaboration of a fully-fledged language standard, or
                  language planning in general, was not among the top priorities for the Bulgarian
                  governments after the liberation of the country in 1878 [2]. Cyrillic texts printed
                  before 1945 contain a myriad of letter symbols such as ѣ, ѧ, ѫ, ѭ, etc., which were
                  gradually removed from the modern written language, eventually reducing the
                  number of letters in the alphabet to the current 30. These wide variations of
                  the officially accepted language are a serious hindrance to the success rate
                  of OCR.


                  3 The Solution: Machine-Readable and Normalized Texts
                  The goal of the project collaboration is to use the tools developed by the
                  technological partners in CLaDA-BG to minimize and correct errors in the
                  machine-readable texts produced by OCR software, and also to make their
                  normalization possible, so that users do not have to search for a word or
                  expression twice, in the new and in the old spelling. The retrieved
                  search results should include both.
                      Advancements in the area of accessibility are especially important in the
                  current times, marked by the COVID-19 pandemic. Indeed, as the demand
                  for credible e-resources surges, digital libraries have emerged as vital pathways
                  to high-quality e-books, journals and educational content. Statistics from the
                  world’s leading e-libraries testify to their cultural significance [4].


                  4 The Approach
                  4.1 Old Bulgarian Orthography Language Resources
                  The first major outcome of the work on the project was the preparation and
                  testing of a lexicon of old Bulgarian spelling word forms, to be used for the
                  purpose of assisting OCR. Initially, we decided to opt for a strategy where all
                  word forms from a modern Bulgarian lexicon (the electronic version of [5]) are
                  transformed to comply with the older orthography [6] developed by the linguist,
                  ethnographer and university professor Stoyan Romanski in 1933. The choice of the
                  prescriptive source was based on its comprehensiveness: it provides both a
                  detailed and clear definition of the rules and a lexicon. Last but not least, the
                  dictionary of Romanski represents a version of the Drinov-Ivanchev orthography
                  that was widespread in one of the most prolific periods in the history of
                  Bulgarian literature. Many of the literary works created between the two World
                  Wars are available in both modern and Drinov-Ivanchev spelling, which makes it
                  much easier to develop the parallel corpus necessary for the training of a neural
                  network model for normalization. The new “old” version of the lexicon was created
                  using a rule-based method in the XML-based CLaRK editor [8] and then imported into
                  FineReader as a user dictionary named CLADABG-MODEL. The testing was
                  conducted in the period March–April 2020. The program ABBYY FineReader
                  (ver. 14 and 15) was used to carry out recognition of 20 pages from issue 1/1881
                  of the magazine “Наука” (“Science”) from the holdings of NLIV, with call
                  number П РЦ-9. All pages were color scanned with an i2S CopyBook A2 scanner at
                  a resolution of 300 ppi, 24-bit, TIFF format, no compression.
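To make the idea of a rule-based conversion concrete, here is a minimal sketch of a single rule: adding the silent word-final ъ after a consonant, as described in Section 2. It is an assumption-laden illustration and not the CLaRK rule set; the function name, the consonant inventory (which leaves out й) and the absence of any grammatical conditions are simplifications made here.

```python
# Consonant letters after which this sketch appends a silent final ъ.
# Assumption: й is excluded and no grammatical features are consulted.
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")


def to_old_spelling(word: str) -> str:
    """Apply one illustrative rule: add ъ after a word-final consonant."""
    if word and word[-1].lower() in CONSONANTS:
        return word + "ъ"
    return word


# Word forms taken from the normalized line of excerpt (1).
modern_forms = ["от", "на", "позволи"]
old_forms = [to_old_spelling(form) for form in modern_forms]
print(list(zip(modern_forms, old_forms)))
```

Pairs of this kind, produced for every entry of an inflectional lexicon, would give a word list that can be loaded as a user dictionary; the actual conversion additionally depends on letter combinations, position and grammatical features.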


                  4.2 Experiments and Results

                  The purpose of the test was to determine to what extent the dictionary with
                  old word forms assists the software program in performing OCR of Bulgarian
                  texts printed before the orthographic reform of 1945. The dictionaries used by
                  FineReader are lists of words available in a specific language. The program relies
                  on dictionaries to increase the quality of recognition by reinforcing hypotheses
                  about words included in the dictionary. Custom dictionaries are especially useful
                  when the text contains many uncommon words [1].
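How a dictionary can reinforce a recognition hypothesis may be sketched as follows. This is not FineReader's algorithm, which is not described in the paper; the candidate scores, the bonus value and the function name are hypothetical. The example uses the ѫ/ж confusion reported in Section 4.2.

```python
def pick_hypothesis(candidates, dictionary, bonus=0.2):
    """Return the candidate word with the best score after a dictionary bonus.

    candidates: list of (word, score) pairs; the scores are hypothetical
    recognizer confidences in the range [0, 1].
    dictionary: set of known word forms.
    """
    def adjusted(item):
        word, score = item
        # Hypotheses found in the dictionary are reinforced by a fixed bonus.
        return score + (bonus if word in dictionary else 0.0)

    return max(candidates, key=adjusted)[0]


# Invented scores: the recognizer slightly prefers the reading with ж.
candidates = [("пжть", 0.55), ("пѫть", 0.45)]
old_dictionary = {"пѫть"}

print(pick_hypothesis(candidates, set()))           # no dictionary support
print(pick_hypothesis(candidates, old_dictionary))  # old-spelling dictionary
```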
                      The program has a built-in dictionary only for the modern Bulgarian
                  language. CLADABG-MODEL contains 1,121,872 words from the time before the
                  Fatherland’s Front Reform of 1945, including words that are no longer in use
                  and word forms with letters that were gradually removed from the alphabet of
                  modern Bulgarian, like ѣ, ѫ, ѧ and so on. Many of the valuable digitized library
                  holdings contain pre-1945 text, and the purpose of developing
                  CLADABG-MODEL was to test the hypothesis that its use will lead to a higher
                  recognition rate. The amount of the increase, if any, also had to be determined.
                  We used as the main indicator the percentage of misrecognized words4 in relation
                  to the total number of words. The counting was done manually.
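The indicator can be written down as a short function. In the experiment the counting was done manually; the sketch below only illustrates the definition, under the assumption that the printed words and the OCR output are already aligned one-to-one. The function name and the sample OCR error are invented for illustration.

```python
def misrecognized_percentage(printed_words, ocr_words):
    """Percentage of words whose OCR output differs from the printed form.

    Assumes both lists are aligned one-to-one. The printed form is the
    reference, so a misprint that is reproduced faithfully is not an error.
    """
    if len(printed_words) != len(ocr_words):
        raise ValueError("token lists must be aligned one-to-one")
    if not printed_words:
        return 0.0
    errors = sum(1 for printed, ocr in zip(printed_words, ocr_words)
                 if printed != ocr)
    return 100.0 * errors / len(printed_words)


printed = ["отъ", "войскитѣ", "на", "хранитѣ"]
ocr = ["отъ", "войскитѣ", "на", "хранить"]  # invented error: ѣ read as ь
print(misrecognized_percentage(printed, ocr))
```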
                      In the course of the test, two other characteristic features of the OCR process
                  and of the software program were measured: the degree of recognition of images
                  in grayscale (as opposed to those in color) and whether and how the FineReader-
                  reported parameter “Low-confidence characters” (expressed in percentage) can
                  serve as an indicator of the success of OCR.
                      The original paper version of the journal “Наука” is very well preserved, and,
                  accordingly, the resulting scanned files are close to the optimal characteristics
                  recommended for OCR. However, there is some darkening of the paper, which
                  reduces the contrast and distinctiveness of the letters. Also, the chosen font
                  (widely used back then) makes it difficult for the program to distinguish letters
                  with dominant vertical lines, such as и, п, н, ш, л (see Figure 1). The horizontal
                  lines converge and the letters fuse together, which further complicates the task
                  for the recognition algorithm.

                   4
                       A word counts as misrecognized if there is a discrepancy between a letter symbol
                       in the scanned primary word in image form and the same letter symbol in the
                       derivative machine-readable word. It is not considered incorrect recognition if the
                       primary word is spelled incorrectly and the derived word has correctly recognized
                       letter characters, thus duplicating the spelling error.




                                   Figure 1. Example of a word with merged letter symbols.



                     To test the effect of CLADABG-MODEL on recognition, 20 pages with uniform
                  text and font were scanned. The total number of words is 5485
                  and their average number per page is 274.25.
                     Minimal training was done to aid the recognition of traditionally problematic
                  symbols such as ѫ, which without prior training is always recognized as ж.


                  Table 1. Mean percentage of misrecognized words for 20 color-scanned pages (300 ppi,
                  24-bit, TIFF format, no compression).

                           Dictionary used                        Misrecognized words (%)
                           FineReader built-in dictionary         4.90
                           CLADABG-MODEL                          4.40
                           Combined                               4.50




                      A test was also included for the simultaneous, combined use of the two
                  dictionaries, the FineReader built-in Bulgarian dictionary and CLADABG-MODEL,
                  with recognition performed using two base languages: (1) “Bulgarian”
                  with a standard, present-day set of characters and the FineReader built-in
                  Bulgarian dictionary, and (2) “Bulgarian before 1945”, featuring a character set
                  with added old letter symbols, such as ѣ, ѫ, ѧ, etc., and the CLADABG-MODEL
                  dictionary. The combined dictionary test was included because, when the program
                  works only with CLADABG-MODEL, there is a risk of greater recognition failure
                  for words still in use in modern Bulgarian. The results are summarized in
                  Table 1. They show that recognition improves with CLADABG-MODEL. Although the
                  improvement is modest (on average 0.5 percentage points fewer misrecognized
                  words), it shows that this line of research is worth pursuing.
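The size of the improvement can be expressed both in percentage points and relative to the baseline. The short calculation below uses only the figures reported in Table 1; the function name is hypothetical.

```python
def differences(baseline: float, value: float):
    """Return (absolute difference in points, relative reduction in %)."""
    absolute = baseline - value
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


# Mean percentage of misrecognized words, as reported in Table 1.
baseline = 4.90  # FineReader built-in dictionary
table1 = {"CLADABG-MODEL": 4.40, "Combined": 4.50}

for name, value in table1.items():
    print(name, differences(baseline, value))
```

For CLADABG-MODEL this gives 0.5 percentage points, or roughly a 10% relative reduction of the error rate obtained with the built-in dictionary.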
                      The second test was related to the ability of FineReader to recognize the
                  hyphenation of words split at line breaks (see Table 2). In case of successful
                  recognition, the line break is omitted, so the split words are kept whole,
                  which enables searching, copying, etc.

                  Table 2. Average number of misrecognized line-break split words per page.

                           Dictionary used                        Misrecognized split words per page
                           FineReader built-in dictionary         6.90
                           CLADABG-MODEL                          5.55
                           Combined                               6.05
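The role of a lexicon in restoring words split at line breaks can be sketched as follows. This is an illustration under stated assumptions, not a description of FineReader's behavior: the function name is hypothetical, the sample hyphenation point is invented, and a split word is rejoined only when the joined form is found in the lexicon.

```python
def join_line_breaks(lines, lexicon):
    """Rejoin words hyphenated at line ends when the joined form is known."""
    words = []
    pending = None  # first half of a word split at the previous line end
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if pending is not None:
            joined = pending + tokens[0]
            if joined in lexicon:
                # Known word: keep it whole and consume the second half.
                words.append(joined)
                tokens = tokens[1:]
            else:
                # Unknown word: leave the hyphenated fragment untouched.
                words.append(pending + "-")
            pending = None
        if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("-"):
            pending = tokens.pop()[:-1]
        words.extend(tokens)
    if pending is not None:
        words.append(pending + "-")
    return words


lines = ["позволи изнася-", "нето на храните"]
print(join_line_breaks(lines, {"изнасянето"}))
print(join_line_breaks(lines, set()))
```

With a lexicon that lacks the old-spelling form, the fragments stay separate, which is one way a richer dictionary could reduce the counts in Table 2.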
                      The trend of a slight initial improvement with CLADABG-MODEL was
                  confirmed by the second test as well. Concerning the difference between
                  color and grayscale pages, recognition of the grayscale pages was only
                  slightly better, which does not justify prioritizing the grayscale scanning mode
                  or unnecessary file conversion.



                  5 A Three-way Collaboration Experience between NLIV
                    and IICT-BAS


                  The partnership between NLIV and IICT-BAS brought about intense teamwork
                  among three people: Ivan Kratchanov, a librarian, Laska Laskova, a
                  linguist, and Kiril Simov, a computer scientist. While the last two shared
                  the same professional physical space in Sofia, the communication with Ivan
                  Kratchanov, who is based in Plovdiv, was predominantly via e-mail, chat and
                  video calls. Other factors also played a significant role in the development of
                  the project. None of the three researchers involved is new to the challenges
                  posed by the interdisciplinary nature of the interaction: Kratchanov, who is
                  Head of the Digitization Centre at NLIV, has previous experience with digital image
                  processing, while Laskova and Simov have worked together for several years on
                  various projects in Natural Language Processing. After the initial discussion of
                  the workflow was concluded with a more or less clear definition of the specific
                  tasks, these tasks were distributed among the three team members with regard
                  to their expertise and access to resources.
                      The tasks performed at NLIV were related to the selection of digitized
                  materials from different genres and time periods and of different quality of
                  printing, paper, etc. Kratchanov also performed the training and evaluation of the
                  different OCR models. The colleagues at IICT-BAS worked on the creation of lexical
                  resources and their conversion to the different old spelling norms. Another
                  ongoing task for the team members at IICT-BAS is the creation of a parallel corpus
                  in several orthography representations.





                  6 Conclusions

                  Overall, the benefits of CLADABG-MODEL have been proven and its use is
                  highly recommended. The work on the lexicon will continue in order to streamline
                  the process as a whole and to increase its efficiency in terms of recognition
                  success.
                      There are two major reasons for these modest results. In the period from the
                  mid-19th century to 1945, many spelling systems were introduced and put to use,
                  while the “old” lexicon represents only one of them, albeit a widely accepted one,
                  from 1933. One solution to this problem is to create additional “old” versions of
                  the inflectional lexicon that will reflect various spelling models and their
                  codification in monolingual dictionaries, grammars and other documents.
                  Alternatively, we could enrich the “old” lexicon so that it encompasses several
                  spelling variants for each word form, much like a multilingual dictionary. Another
                  reason for the results obtained so far lies in the scarcity of personal names
                  represented in the lexicons, not to mention named entities of other categories, for
                  example organizations or products. We plan to solve this by adding lexical material
                  extracted from manually corrected OCR-ed texts.
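A lexicon that holds several spelling variants per word form could be organized along the following lines. This is a minimal sketch of one possible data structure, not the planned implementation; the variable and function names are hypothetical, and the variants are the ones listed in Section 2, without assigning them to particular spelling models.

```python
from collections import defaultdict

# Modern word form -> spelling variants mentioned in Section 2.
# No assignment of variants to specific spelling models is assumed.
VARIANTS = {
    "вървя": ["върва", "вървѭ", "вървя", "вървꙗ"],
    "вървят": ["върват", "вървѭт", "вървят", "вървꙗт"],
}


def build_reverse_index(variants):
    """Map every spelling variant back to its modern word form(s)."""
    index = defaultdict(set)
    for modern, forms in variants.items():
        for form in forms:
            index[form].add(modern)
    return index


index = build_reverse_index(VARIANTS)
all_forms = sorted(index)  # flat word list, e.g. for export as a word list

print(index["вървѭт"])
print(len(all_forms))
```

The reverse index serves normalization (variant to modern form), while the flat list of all variants is the kind of word list that a user dictionary for OCR requires.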
                      Besides training the OCR software, we envisage implementing a neural
                  network spellchecker for the OCR-ed historical texts. In this case the model will
                  rely on a wider context in order to predict the wrongly recognized words. In
                  order to train the models, we plan to automatically create a parallel corpus of
                  historical and modern texts using the “old” lexicons and pre-trained models.


                  Acknowledgements

                  This work was partially supported by the Bulgarian National Interdisciplinary
                  Research e-Infrastructure for Resources and Technologies in favor of the
                  Bulgarian Language and Cultural Heritage, part of the EU infrastructures CLARIN
                  and DARIAH – CLaDA-BG, Grant number DO01-272/16.12.2019.
                      We would like to thank Petya Osenova for the support during our work and
                  for her comments on the paper.


                  References
                  1. ABBYY Technology Portal: Dictionaries and OCR.
                     https://abbyy.technology/en:features:ocr:dictionary_support. Last accessed 8 Sept 2020
                  2. Andreychin, L.: Iz istoriyata na nasheto ezikovo stroitelstvo [From the History
                     of Our Language Construction]. Darzhavno izdatelstvo “Narodna prosveta”, Sofia
                     (1977) [In Bulgarian]
                  3. Danailova, V.: Basic factors triggering the spelling reform in the Bulgarian
                     Language. Crossing Boundaries in Culture and Communication 5(2), 51–56 (2014)
                  4. Falt, E., Das, P. P.: Digital libraries can ensure continuity as Covid-19 puts
                     brake to academic activity.
                     https://en.unesco.org/news/digital-libraries-can-ensure-continuity-covid-19-puts-brake-academic-activity.
                     Last accessed 11 Sept 2020





                  5. Popov, R., Simov, K., Vidinska, S.: Rechnik za pravogovor, pravopis,
                     punktuatsiya [Orthoepic, Spelling and Punctuation Dictionary]. Atlantis, Sofia
                     (1998) [In Bulgarian]
                  6. Romanski, S.: Pravopisen rechnik na balgarskiya ezik s posochvane izgovora i
                     udarenieto na dumite [Orthographic Dictionary of Bulgarian Language with Word
                     Pronunciation and Accent]. Knigoizdatelstvo “Kazanlashka dolina”, Sofia (1933)
                     [In Bulgarian]
                  7. Rusinov, R.: Istoriya na balgarskiya pravopis [A History of Bulgarian Orthography].
                     Nauka i izkustvo, Sofia (1981) [In Bulgarian]
                  8. Simov, K., Peev, Z., Kouylekov, M., Simov, A., Dimitrov, M., Kiryakov, A.: CLaRK
                     – an XML-based System for Corpora Development. In: Proceedings of the Corpus
                     Linguistics 2001 Conference, pp. 558–560. UCREL (2001)



