<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Workflow: Facing Printed Texts of Ancient, Medieval and Modern Greek Literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christina Tzogka</string-name>
          <email>ctzogka@datascouting.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fotini Koidaki</string-name>
          <email>koidaki@lit.auth.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stavros Doropoulos</string-name>
          <email>doro@datascouting.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Papastergiou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Efthymios Agrafiotis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katerina Tiktopoulou</string-name>
          <email>atiktopo@lit.auth.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stavros Vologiannidis</string-name>
          <email>svol@ihu.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DataScouting</institution>
          ,
          <addr-line>30 Vakchou Street, 54629 Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer, Informatics and Telecommunications Engineering, International Hellenic University</institution>
          ,
          <addr-line>Terma Magnisias, 62124 Serres</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Philology, Aristotle University of Thessaloniki</institution>
          ,
          <addr-line>54124 Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Book digitization is increasingly widespread, as it facilitates not only the dissemination and preservation of cultural heritage but also the analysis of large amounts of textual data and the extraction and discovery of knowledge in a faster, dynamic and interactive way. Quite often, OCR, as the core technology of book digitization, has to address major difficulties related to the condition of the primary source or to scanning issues. The main contribution of this paper is an extensive study of Tesseract, an open-source OCR system, including image pre-processing and text post-processing methods that overcome a variety of image handling problems. Additionally, a re-trained Greek language model, based on training on individual fonts plus training on image-text pairs, is provided. Finally, this paper proposes a pipeline of methods, including text line detection, that results in enhanced accuracy for Greek Literature documents, even when they consist of pages distorted by scanning issues or damaged physical material.</p>
      </abstract>
      <kwd-group>
        <kwd>OCR</kwd>
        <kwd>Scanned Document</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Training</kwd>
        <kwd>Text Line</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        New technologies have introduced new ways of researching, reading and preserving
written texts. The researcher can now search for information and extract
knowledge from a large number of textual documents, while the student and reader
can acquire textual content quickly and effortlessly by creating a personalized
and editable digital library [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. At the same time, the role of the library is
being modernized: it is being transformed into an intelligent entity capable of
managing, researching, classifying and analyzing information and of dealing even with the
most complex bibliographic needs of researchers.
      </p>
      <p>
        In order to be able to take advantage of the opportunities offered by the
digitization technologies in researching, reading and preserving Greek Literature, a
necessary condition is the conversion of the scanned literary documents into
computer editable text [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This paper presents the scientific findings as well as the
workflow for the development of a comprehensive OCR system trained to analyse
scans of books and recognise Greek polytonic characters of modern and ancient
texts.
      </p>
      <p>The authors sincerely acknowledge the valuable contribution of the digital humanities
research team, Maria Georgoula, Valando Landrou, Markia Liapi and Marianna Sylivrili,
who as project partners supported the post-processing phase by providing the ground-truth
data and valuable feedback on the OCR results.</p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Moreover, it utilizes and extends parts of the research conducted in the
context of the “Exploitation of Cultural Assets with computer-assisted Recognition,
Labeling and meta-data Enrichment (ECARLE)” project. The ECARLE project
concerns the development of an integrated Software as a Service (SaaS) that can
take scans of printed documents as input and export them as editable electronic
texts, enriched with semantic metadata about the structure of the document, its
publication, as well as semantic elements of cultural interest (e.g. person names,
place names, dates, book titles etc.). The scientific interests of the ECARLE project
focus on the publications of the 19th century, since this is the century during which
typography emerged and flourished in Greece, producing printed artefacts of major
cultural importance. To this end, the structure of the paper reflects the research
progression, which is organized in the following six main sections: a. OCR
Challenges, b. State Of The Art Review, c. Tesseract Training, d. Proposed Method, e.
Experimental Evaluation and f. Future Challenges.</p>
      <p>The main contribution of this paper is i. to provide an extensive study of the
OCR system, including image pre-processing techniques that address a variety of
image quality issues and text post-processing, ii. to provide a fine-tuned Greek
language model based on individual font training as well as training on image-text pairs, and
iii. to propose a pipeline of methods that enhances the accuracy of the digitisation of
Greek literary documents, even when they include worn (e.g. distorted or skewed) pages,
due to scanning issues or damaged physical material. The output of the already
available and the re-trained OCR models was reviewed by copy-editors
specialising in the subject matter, in order to create ground-truth data that will be used
for evaluation and further optimization.</p>
    </sec>
    <sec id="sec-2">
      <title>OCR Challenges</title>
      <p>
        This section describes in detail the kinds of challenges that the proposed
OCR system addresses. Before an image is fed into the OCR system, it
must be pre-processed, since it may contain imperfections [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which
complicate the process and reduce the quality of the generated text. Furthermore,
imperfections make it difficult to identify characters due to their deviation and
variability.
      </p>
      <p>Deviation: The letters in an image differ significantly from the structure
expected by the OCR system. This may be due to physical deterioration, or
printing errors, or even scanning imperfections.</p>
      <p>Variability: Identical characters that appear in one or more images
show variability, i.e. the same letter is represented by more than one glyph,
and these glyphs substitute for each other as stylistic alternates in the same font.</p>
      <p>The reasons for these defects could be categorized as physical material problems
and problems during the scanning.</p>
      <p>
        Problems of physical material: The physical material, depending on its storage
conditions, usage and age, may have deteriorated significantly compared to its initial
condition. Examples of such alterations are the fading of the letters, the thinning
of the paper, which makes the text on the back of the sheet visible from the front, and
stains, marks and watermarks added by the readers or the owner.
      </p>
      <p>
        Scanning issues: The physical material, depending on the procedure and
equipment used during the scanning, may not be accurately represented in its digital
facsimile. This may be due to faults in the placement of the object in the
scanner, lenses with limited sharpness, incorrect or inhomogeneous illumination
and refractions. The resulting defects may include warping [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], skewing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as well as the
appearance of shadows, color and contrast gradients.
      </p>
      <p>However, in addition to the scanning failures mentioned above, quite often there
are some expected scanning challenges. For instance, as the pages of a book are turned
during scanning, the degree of distortion changes, because the way the book
lies on the mounting surface depends on its geometry.</p>
      <p>These problems can be both local and global, and their intensity may differ
from page to page. For instance, pages in the center of the book, which are less
exposed to environmental conditions and fluctuations, may wear at a different rate
from the outer pages. Similar variations in wear rate are also expected to exist within the
page itself.</p>
    </sec>
    <sec id="sec-3">
      <title>State Of The Art Review</title>
      <p>
        As mentioned above, text recognition is a crucial and demanding
process, and achieving high performance is not trivial. The accuracy [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] of the digital text
produced depends on the extent of the physical deterioration suffered by the source
materials as well as the condition of the scanned images. For this reason the texts
used during experiments and evaluation are carefully selected on the
basis of combined criteria concerning external and internal factors. According to the
relevant international literature, the workflow for the conversion of digitized
material into computer-editable digital text is divided into the following stages: i. image
pre-processing, ii. feature extraction and character recognition, iii. post-processing.
The state-of-the-art OCR engines [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] that have been used in related works are the
following:
      </p>
      <p>1. ABBYY FineReader (www.abbyy.com) is an OCR engine widely regarded as defining the state of
the art for layout analysis and OCR, supporting about 200 recognition
languages.</p>
      <p>
        2. OCRopus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is an open-source engine (based on LSTM neural networks) with
significant recognition capabilities compared to glyph-based approaches.
Additionally, it can train new models from nothing more than
line-level input images paired with their ground-truth text.
      </p>
      <p>
        3. Tesseract [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is an open-source OCR engine (based on LSTM neural networks)
which provides a wide variety of trained mixed models. Moreover, Tesseract
supports language modelling and training either on specific fonts or on image
samples paired with their ground-truth text.
      </p>
      <p>In order to achieve the best possible performance, the open-source OCR software
Tesseract was selected. Our decision was mainly based on its popularity in the
literature on open-source state-of-the-art OCR engines as well as on its various capabilities
for further training and fine-tuning, which result in high performance.</p>
      <sec id="sec-3-1">
        <title>OCR System’s Features</title>
        <p>
          An OCR system includes a set of parameters, machine learning techniques and
training profiles so that it can identify as diverse items as possible. Taking into
account the most usual user requirements, the most common parameters refer to
the system’s ability to recognize a variety of languages, fonts, document types and
text alignments so that it can meet the most likely needs of its users. Although this
generalisation may satisfy the objectives of most use cases of such software, it
often creates problems:
        </p>
        <p>1. It is infeasible to include all possible use cases in the scope of the software,
particularly specialized and infrequently encountered ones. For example,
obsolete typography and archaic writing systems have been disregarded by the
OCR models, which results in undefined user stories and unsatisfied user needs.
Frequently, the existence of these scenarios is unknown to the system designer
due to their rarity, so they are not even included in the strategy
for selecting the optimal parameters and training forms of the systems.</p>
        <p>2. Generalization can reduce system performance, since its goal is to maximize
performance in as many usage scenarios as possible rather than in specific
scenarios. For example, an OCR system that has been trained to recognize a large
number of fonts is quite likely to make mistakes by confusing different
characters because of similar typographic features.</p>
        <p>Knowing that it is impossible
to cover all the software’s usage scenarios, the designers of these OCR systems
(e.g. OCRopus, Tesseract) allow their users to customize them according to
their needs. The configuration of OCR systems is mainly divided into three
categories:</p>
        <p>(a) Change OCR settings: This category includes various variables that
regulate the operation of the OCR system. These variables may be the set of
recognizable languages, the architecture of the recognition model, the page
segmentation etc.</p>
        <p>
          (b) Fine-tune training models: Each OCR system uses at its core a set of
machine learning models [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (e.g. CNN, LSTM, etc.). The parameters of
the models (e.g. number of neurons, layers, activation functions) have been
selected by the designers based on the general usage scenarios. In OCR
systems such as Tesseract, the user is given the opportunity to fine-tune
these models, in order to define a set of parameters that could be more
efficient than the predefined parameters for the specific task.
        </p>
        <p>
          (c) Train the system: Many language models provided by the OCR engines
are created from fonts and have not been trained on real texts. The training
procedure requires ground-truth data so that the model learns to recognise
not only characters, but also words and strings. The texts on which the
pre-trained models have been trained are generic, i.e. they do not include all
possible letter combinations of a language. Most OCR systems (e.g. OCRopus,
Tesseract) allow their users to choose the languages, characters and
corresponding texts/images that will be used to train the model.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Tesseract Training</title>
      <p>Tesseract is an open-source OCR software that supports various operating systems
(Windows, Linux, etc.) and includes a large number of pre-trained languages.</p>
      <sec id="sec-4-1">
        <title>Default Model</title>
        <p>Tesseract provides two packages for polytonic Greek, which divide the Greek polytonic
writing system between the ancient and the modern one:</p>
        <p>– Ancient Greek (GRC) is based on a vocabulary derived from texts before 1453,
the year of the conquest of Constantinople by the Ottoman troops.</p>
        <p>– Modern Greek (ELL) includes texts after 1453 with a strong emphasis on
modern Greek.</p>
        <p>Tesseract provides three different models (i.e. tessdata) for each language: one
normal, one optimized for accuracy (best) and one optimized for speed (fast). The best
model was used as a basis for optimizing the results for both Greek models.</p>
        <p>Evaluation: Both OCR models for polytonic Greek were evaluated
individually, without any image pre-processing of the evaluation data. In both cases the
system recognised the punctuation marks with great success, but at the same time
it failed to a very large extent on unknown fonts, since the Tesseract
Greek models have been trained on a number of the most common fonts (e.g. Arial,
DejaVu, Courier New etc.). Furthermore, there were phenomena of text
fragmentation, separation of a line into two or more lines, creation of spurious lines, etc.</p>
        <p>The evaluation (by copy-editors, see section 4.3) revealed that the quality of the
output of the default Tesseract model was extremely low (see Version-1 in Figures
3 and 4 below), as the text had been altered and disfigured to such an extent that
it was difficult to correct or even read. In fact, it was quite difficult for the human
eye even to match the digital text with the text of the scanned page. Consequently,
we concluded that it was necessary to train the system with machine-learning-assisted
correction algorithms.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Training Default Model (grc-ecarle)</title>
        <p>The corpus that we created for the training of the system includes content collected
from various websites that contain texts written in the Greek archaistic form, also
known as katharevousa (e.g. polytoniko.org), as well as from monasteries’ webpages
which publish texts in katharevousa. In the end, a corpus of 5,000 lines was built, whose
content was mainly Ancient Greek texts (especially historical texts of Thucydides).
The training focused on the “GFS Didot Classic” font, since it was observed to prevail
throughout the corpus. The training setup was based on Tessdoc/TrainingTesseract,
and the total number of epochs was 15,000. The corpus was split into training
(90%) and evaluation (10%) sets. In order to identify the font characters for which the
system had not been trained, we decided to train both the Greek models included
in Tesseract (grc and ell).</p>
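        <p>The 90%/10% split mentioned above can be illustrated with a minimal Python sketch. It is not the script used in the project: the function name split_corpus, the shuffling step and the fixed random seed are assumptions introduced here for illustration only.</p>
        <preformat><![CDATA[
```python
import random


def split_corpus(lines, eval_fraction=0.10, seed=0):
    """Split a list of text lines into (training, evaluation) subsets.

    Assumption: lines are shuffled before splitting so that both subsets
    cover the whole corpus; the paper only states the 90/10 proportions.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical 5,000-line corpus standing in for the real training text.
corpus = [f"line {i}" for i in range(5000)]
train_lines, eval_lines = split_corpus(corpus)
```
]]></preformat>
        <p>With a corpus of 5,000 lines this yields 4,500 training lines and 500 evaluation lines, with no line appearing in both subsets.</p>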
        <p>The exported trained language model was further trained on pairs of
ground-truth data and image samples (i.e. approximately 20 scanned pages) from
the available documents, for a total of 5,000 epochs. The final trained
model will hereafter be referred to as “grc-ecarle” and is used throughout the
proposed method that is presented below.</p>
        <p>Evaluation: The evaluation results of the newly trained model were significantly
better (see Version-2 in Figures 3 and 4), since the training focused on
a single font, as described above. In particular, the evaluation, which was
performed on image samples that had not undergone any pre-processing, showed
a tendency of the system to recognize easily the morphology of the
characters as well as the punctuation. This is mainly due to the fact that Tesseract
uses LSTM neural networks, which are able to learn long sequences of letters.
In this way, issues concerning punctuation, due to wear, are fixed. LSTMs
therefore allowed us to handle, and finally resolve, issues concerning the misrecognition
of:</p>
        <p>– capital letters and punctuation (e.g. comma, colon)</p>
        <p>– numbers and common symbols (; “” «» + = &amp; #)</p>
        <p>– glyphs that are above or below the line height</p>
        <p>Grc-ecarle trained model’s weakness: The trained model (“grc-ecarle”) has
shown excellent results in addressing some crucial problems. However, the evaluation
of the very first outputs revealed the weaknesses of the system and confirmed the
need to create ground-truth data for evaluation and for further accuracy
improvement. The weaknesses of the system concerned the identification of some characters,
which could be further improved, as well as the following transcription failures:</p>
        <p>– horizontal and vertical line segregations</p>
        <p>– addition of both noisy lines, which contain only symbols and numbers, and
empty lines (no OCRed text at all)</p>
        <p>– inverted content of consecutive lines</p>
        <p>It is worth mentioning that the image processing library (Leptonica) used by
Tesseract could not adequately address the image distortion and skew problems.
Therefore, it became obvious that further image pre-processing was required to
improve the results.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Copy-Editors Post-Process</title>
        <p>The post-processing phase is the last step of the OCR training workflow and involves
OCR error correction. Instead of an automated post-processing system, human-assisted
post-processing was preferred as more efficient for our dataset, since the
polytonic accent system requires expert editors. This process aims to create
validation data, which are essential in order to improve the system and to evaluate
its performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Proposed Method</title>
      <p>In order to address the weaknesses of the grc-ecarle trained model and further improve the
system, we constructed a pipeline of processes (Figure 1).</p>
      <sec id="sec-5-1">
        <title>Text-line Detection and Cropping</title>
        <p>
          Initially, we chose to split every page into lines, so that any
distortion could not affect the sequence of the lines. The first step of the OCR pipeline
is therefore text-line detection and cropping, in which the scanned images are
cut into text lines and the content is represented as lines instead of pages.
The open-source software dhSegment [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] was used for the purposes of text line
detection [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] in association with the baseline detection model.
        </p>
        <p>The open-source software dhSegment locates the area occupied by the respective
text line. More precisely, it applies an underlining technique, without providing the
exact bounding box of a line (Figure 2-a). This gap is filled by an algorithm designed
to generate bounding box coordinates (Figure 2-b) with high precision, so that
the corresponding text line can be cropped (Figure 2-c). In fact, this algorithm
estimates the line width, i.e. the vertical extent of a text line, as the distance between
the bottom of the current line and the bottom of the previous one (step 2.(b) - Algorithm 1).</p>
        <p>A scanning problem that the proposed algorithm below successfully addresses is the
division of a line into 2 or even 3 individual lines. The algorithm detects such
cases using a predefined threshold, and restores the line by joining its fragments
into a single line (step 3 - Algorithm 1). Additionally, the estimated line width
is normalized in order to handle lines whose width is significantly smaller or larger
than the average line width of the current page (step 4 - Algorithm
1). Then, it removes lines whose length is extremely small compared to a
threshold value that was defined taking into account the average line length (step
6 - Algorithm 1). Finally, the algorithm estimates the actual coordinates of the
bounding box (per line) in order to generate the cropped text line that will feed the
OCR system (steps 7 and 8 - Algorithm 1).</p>
        <p>Consequently, after applying dhSegment, an additional set of sub-processes is
implemented. These sub-processes make up the proposed algorithm, which consists
of the following steps:</p>
        <p>Algorithm 1: Text line generation</p>
        <p>1. extracting the full list of initial bounding boxes (xstart, yend, xend, yend) per line
and sorting them by ascending yend (second item of the initial bounding box).</p>
        <p>(a) (xstart, xend) correspond to the (left, right) points of the bounding box</p>
        <p>(b) ystart is not estimated through dhSegment sub-processes, since the output
delivered from its implementation corresponds to a thick line</p>
        <p>(c) yend corresponds to the bottom point of the bounding box</p>
        <p>2. constructing a list with the line width per text line</p>
        <p>(a) we re-defined yend as the maximum of the two yend points per bounding
box (xstart, yend, xend, yend)</p>
        <p>(b) we estimated the line width (ydiff) per line by subtracting the previous yend
from the current one</p>
        <p>3. merging consecutive lines if their (width) difference is under a threshold value
(minLineDiff)</p>
        <p>4. filtering ydiff by estimating its mean and standard deviation, in order to
normalize lines that have an extremely small or large width</p>
        <p>5. defining parameters for ystart estimation and text line cropping</p>
        <p>(a) minLineLength = minimum text line length, applied to discard lines
that do not contain valid text (e.g. symbols generated by noise)</p>
        <p>(b) yendExtend = small region by which yend is extended, to avoid cutting lines that may
have been scanned askew</p>
        <p>(c) minStartY = minimum value of ystart, applied when the line is at the top of
the page and a minimum starting height should be picked</p>
        <p>(d) ystartExtend = small region by which ystart is extended, to avoid cutting lines that may
have been scanned askew</p>
        <p>6. removing lines whose length is under the minLineLength threshold</p>
        <p>7. estimating the actual bounding boxes via a loop:</p>
        <p>(a) yend is taken to be the maximum of the two yend points in the initial
box (xstart, yend, xend, yend), plus yendExtend (extend box in the downward
direction)</p>
        <p>(b) if the line is at the top of the page (i.e. yend - ydiff &lt; 0) then
ystart = minStartY</p>
        <p>(c) if the line is not at the top of the page (i.e. yend - ydiff &gt; 0) then
ystart = yend - ydiff - ystartExtend (extend box in the upward direction)</p>
        <p>8. extracting the final text line bounding box and cropping it</p>
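        <p>The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the project's code. Several choices are assumptions made here: the default parameter values, the input format (one tuple of left x, left y, right x, right y per detected baseline), giving the first line the mean line width because it has no predecessor, and resetting outliers to the mean when they lie more than n_std standard deviations away from it.</p>
        <preformat><![CDATA[
```python
from statistics import mean, pstdev


def text_line_boxes(baselines, min_line_diff=10, min_line_length=50,
                    yend_extend=5, ystart_extend=5, min_start_y=0, n_std=2.0):
    """Turn detected baselines into (x_start, y_start, x_end, y_end) boxes."""
    # Steps 1 and 2(a): one y_end per baseline (the lower of its two
    # end points), sorted from the top of the page to the bottom.
    lines = sorted(((xs, xe, max(y1, y2)) for xs, y1, xe, y2 in baselines),
                   key=lambda line: line[2])
    if not lines:
        return []

    # Step 3: re-attach the fragments of a line that was split in pieces.
    merged = [lines[0]]
    for xs, xe, ye in lines[1:]:
        pxs, pxe, pye = merged[-1]
        if ye - pye < min_line_diff:
            merged[-1] = (min(pxs, xs), max(pxe, xe), max(pye, ye))
        else:
            merged.append((xs, xe, ye))

    # Step 2(b): line width = distance between consecutive y_end values.
    known = [merged[i][2] - merged[i - 1][2] for i in range(1, len(merged))]
    mu = mean(known) if known else merged[0][2]
    sd = pstdev(known) if known else 0.0

    # Step 4: the first line gets the mean width; outliers are reset to it.
    ydiffs = [mu] + [d if abs(d - mu) <= n_std * sd else mu for d in known]

    boxes = []
    for (xs, xe, ye), ydiff in zip(merged, ydiffs):
        if xe - xs < min_line_length:      # step 6: drop very short lines
            continue
        if ye - ydiff <= 0:                # step 7(b): line at top of page
            y_start = min_start_y
        else:                              # step 7(c), clamped to the page
            y_start = max(min_start_y, ye - ydiff - ystart_extend)
        # Step 7(a) extends the box downwards; step 8 would crop this box.
        boxes.append((xs, int(round(y_start)), xe, ye + yend_extend))
    return boxes
```
]]></preformat>
        <p>Each returned tuple is ordered as (left, upper, right, lower), so it can be handed to an image library's crop function to produce the text line image that feeds the OCR system.</p>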
      </sec>
      <sec id="sec-5-2">
        <title>Image Pre-processing</title>
        <p>
          The next step after text line detection and cropping is to apply a set of image
pre-processing techniques [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] to each text line. The image pre-processing step aims
to address issues related to image clarity. Many techniques
contribute cumulatively to this goal, each of which tries to identify and eliminate
a specific problem. As mentioned above, the image processing library
(Leptonica) used by Tesseract internally could not fix all the problems within an image.
Therefore, further image pre-processing of the inputs is required. The proposed
method involves the following techniques, which are applied in sequence:
        </p>
        <p>
          Dilation [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is used to remove shadow and noise around the letters.
In the dilation process the letters are expanded until even the smallest faded letters
are filled with color. Dilation is based on a rule where the state of every pixel in the
output image is determined by the corresponding pixel and its neighbors in the
input image.
        </p>
        <p>
          Image Normalization [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] changes the range of pixel intensity values and
makes the image look more familiar or normal to the human eye. This technique is
often used to increase contrast, which in turn facilitates noise removal. To
achieve this, all the high or extremely low frequency content is removed
from the image.
        </p>
        <p>Binarization converts the image from color or grayscale to
black-and-white in order to eliminate shadows, images and dots that do not correspond
to characters in the page, since these phenomena mislead the OCR
system. Optimal performance of an OCR system requires high contrast between
the characters of an image and the background.</p>
        <p>Blurring is a process that removes pixel gradations that correspond to noise.
This is achieved by convolving the image with a low-pass filter kernel that
removes high frequency content (e.g. noise, edges) from the image. As a result of
this technique, the edges in the image are slightly blurred, which leads to higher OCR
accuracy.</p>
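        <p>To make the four operations concrete, the following dependency-free Python sketch applies them to a grayscale image stored as a list of rows. It is a simplified illustration under stated assumptions, not the implementation used in the pipeline: the 3x3 neighborhood, min-max normalization, the fixed threshold of 128 and the mean (box) filter are choices made here, and dilation is written as a maximum filter, so it grows bright regions. Which side counts as text depends on the polarity of the image.</p>
        <preformat><![CDATA[
```python
def neighborhood_filter(img, reduce_fn):
    """Apply reduce_fn to the 3x3 neighborhood of every pixel.

    Neighborhoods are clipped at the image borders.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def dilate(img):
    # Each output pixel is the maximum of its neighborhood.
    return neighborhood_filter(img, max)


def normalize(img, lo=0, hi=255):
    # Min-max stretch of the intensity range to [lo, hi].
    flat = [p for row in img for p in row]
    pmin, pmax = min(flat), max(flat)
    if pmax == pmin:                      # constant image: nothing to stretch
        return [row[:] for row in img]
    return [[round(lo + (p - pmin) * (hi - lo) / (pmax - pmin)) for p in row]
            for row in img]


def binarize(img, threshold=128):
    # Fixed global threshold (an assumption; adaptive methods also exist).
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def blur(img):
    # Low-pass filtering with a 3x3 mean kernel.
    return neighborhood_filter(img, lambda win: round(sum(win) / len(win)))


def preprocess(img):
    # Order follows the text: dilation, normalization, binarization, blurring.
    return blur(binarize(normalize(dilate(img))))
```
]]></preformat>
        <p>For example, an isolated dark speck on a light background is removed by the dilation step, so the processed patch comes out uniformly white.</p>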
      </sec>
      <sec id="sec-5-3">
        <title>OCRed Text Cleaning and Merging</title>
        <p>At this point, input images have been properly processed and are in a suitable
condition for the OCR process. The next step is to start the OCR process, after defining
all the necessary parameters. More precisely, since the input images correspond to text
lines, it is necessary to use the appropriate page segmentation mode. By default
Tesseract expects a page of text when it segments an input image. Therefore, when
the input corresponds to a small text region, a different segmentation mode should
be tried. A complete list of supported modes is available in the official
documentation. For the purposes of the proposed method, we have chosen the mode that treats the
image as a single text line.</p>
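        <p>As an illustration, the command line for this configuration could be assembled as in the Python sketch below. The options -l, --psm and --tessdata-dir belong to the Tesseract command-line tool, and page segmentation mode 7 is the one that treats the image as a single text line. The helper name, the file names and the assumption that the trained model is installed as grc-ecarle.traineddata are hypothetical.</p>
        <preformat><![CDATA[
```python
def tesseract_line_command(image_path, output_base, lang="grc-ecarle",
                           psm=7, tessdata_dir=None):
    """Build the argument list for running Tesseract on one cropped line."""
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]
    if tessdata_dir is not None:
        # Directory that holds <lang>.traineddata (hypothetical location).
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Hypothetical file names for one cropped text line.
command = tesseract_line_command("line_001.png", "line_001")
```
]]></preformat>
        <p>The resulting list can be passed to subprocess.run(command, check=True), in which case Tesseract writes the recognized text to line_001.txt.</p>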
        <p>After completing the above-mentioned image processing and page segmentation,
the latest trained “grc-ecarle” language model is applied. In the new OCR
output every scanned page is represented as a series of individual text lines. Therefore, a
technique that attaches the individual text lines into a single textual unit is
applied. Meanwhile, a text cleaning process aims at the best possible clarity
by removing the following:</p>
        <p>Blank lines and lines with noise, i.e. lines in which the percentage of symbols and numbers
exceeds a threshold (e.g. 50%).</p>
        <p>Lines with extremely long words, i.e. sequences of characters longer than a cutoff
(e.g. tokenMaxLength = 21), which are rare in the Greek language and probably
correspond to noise.</p>
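        <p>The two cleaning rules and the merging of lines can be sketched in Python as follows. The 50% ratio and the cutoff of 21 characters are the example values given above. The function names, the decision to count every non-letter character (digits, punctuation, symbols) as noise, the Unicode NFC normalization and the use of a newline as the joining character are assumptions made for this illustration.</p>
        <preformat><![CDATA[
```python
import unicodedata


def is_noisy(line, max_symbol_ratio=0.5):
    """True for blank lines and for lines dominated by non-letters."""
    # NFC composes base letters with their accents, so a polytonic
    # letter is counted as one alphabetic character.
    chars = [c for c in unicodedata.normalize("NFC", line) if not c.isspace()]
    if not chars:
        return True
    non_letters = sum(1 for c in chars if not c.isalpha())
    return non_letters / len(chars) > max_symbol_ratio


def has_overlong_token(line, token_max_length=21):
    """True if any whitespace-separated token exceeds the cutoff."""
    tokens = unicodedata.normalize("NFC", line).split()
    return any(len(token) > token_max_length for token in tokens)


def clean_and_merge(lines):
    """Drop noisy lines and join the rest into one textual unit."""
    kept = [unicodedata.normalize("NFC", line).strip()
            for line in lines
            if not is_noisy(line) and not has_overlong_token(line)]
    return "\n".join(kept)


# Hypothetical per-line OCR output for one page.
ocr_lines = [
    "Θουκυδίδης Ἀθηναῖος ξυνέγραψε",
    "",
    "1 2 ; ; # 44 +",
    "τὸν πόλεμον",
    "α" * 25,
]
page_text = clean_and_merge(ocr_lines)
```
]]></preformat>
        <p>In this example only the first and fourth lines survive: the empty line and the line of digits and symbols are treated as noise, and the 25-character token exceeds the cutoff.</p>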
        <p>It is worth noting that all the above-mentioned threshold values are set
based on the language’s characteristics. It is quite possible that these values would not
result in an equally good OCR output for another language.</p>
        <p>This section presents the evaluation results for each of the three models tested. These
models correspond to: a.) Version-1 (Tesseract’s default Ancient Greek model), b.)
Version-2 (grc-ecarle trained model) and c.) Version-3 (proposed method).</p>
        <p>Table 1 presents the main features of the 3 versions of OCR models that were
implemented during the experiments. Specifically, the following are mentioned:</p>
        <p>– the mode of the input sample, either as a text line or as a full page</p>
        <p>– the implementation of image pre-processing techniques to the OCR input</p>
        <p>– the specific trained language model used for generating OCR output</p>
        <p>– the issues that were observed to degrade the accuracy of the final results</p>
      </sec>
      <sec id="sec-5-11">
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Model versions. Features and issues.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Version</th>
                <th>Input Mode</th>
                <th>Image Pre-process</th>
                <th>Trained Model</th>
                <th>Issues</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Version-1</td>
                <td>Full page</td>
                <td>No</td>
                <td>grc</td>
                <td>Extremely low character accuracy. Line separation into more lines. Insertion of empty and “noisy” lines (i.e. sequences of symbols and numbers).</td>
              </tr>
              <tr>
                <td>Version-2</td>
                <td>Full page</td>
                <td>No</td>
                <td>grc-ecarle</td>
                <td>Better character accuracy, but many cases of misspelled characters/numbers still remain. Fewer empty/“noisy” lines.</td>
              </tr>
              <tr>
                <td>Version-3</td>
                <td>Text line</td>
                <td>Yes</td>
                <td>grc-ecarle</td>
                <td>Enhancing character accuracy. Fixing text fragmentation issues. Optimal result.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-5-11-3">
          <p>For each of the above models, performance was tested on a random page sample.
Figures 3 and 4 show the progress in OCR quality across the versions
of the implemented models, for two different samples. The first sample would be
judged normal (undistorted) by a human reader, but it turned out to pose
several difficulties for the OCR system. The second sample is visibly distorted
due to material/scanning problems. Note that in both Figures 3 and 4
the Version-1 OCRed text has been edited: it consists of far more lines
than the output of the other versions, and a large percentage
of them are empty. We therefore removed a few empty
lines so that the example fits within the figure’s dimensions.</p>
          <p>
            The superiority of the proposed method (Version-3) is visible to the human eye.
Nevertheless, to confirm that the proposed model outperforms the other implemented
models, the character accuracy of each OCRed text in Figures 3 and 4 was estimated
and is presented in a bar chart (Figure 5). Character accuracy is estimated after
ground-truth text generation (post-processing phase), using the open-source tool
ocreval [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. The bar chart confirms
that Version-3 gives the best results, closely approaching the ground-truth text
(98.56% and 97.11% for the two images, respectively). Meanwhile,
in Figure 3 there is a large accuracy gap between Version-1, which does not exceed
50%, and Version-3, which approaches 99%.
          </p>
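          <p>For orientation, character accuracy of this kind is commonly defined as
(n - e) / n, where n is the number of ground-truth characters and e is the edit
(Levenshtein) distance between the ground truth and the OCR output. The sketch below
illustrates that definition in Python. It is not ocreval's implementation: the function
names are hypothetical, and ocreval's own counting conventions may differ in detail.</p>
          <preformat preformat-type="code"><![CDATA[
```python
# Illustrative only: a textbook character-accuracy computation, not ocreval itself.
def edit_distance(truth, ocr):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(prev[j] + 1,          # delete t
                            curr[j - 1] + 1,      # insert o
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_accuracy(truth, ocr):
    """Accuracy in percent, relative to the ground-truth length."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(truth, ocr)
    return 100.0 * (len(truth) - errors) / len(truth)
```
]]></preformat>
          <p>Under this definition the value can drop below zero when the OCR output contains
many spurious characters, for example a large number of inserted noisy lines.</p>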
          <p>Fig. 3: Model versions. The improved output of the third version for a random normal
image sample.</p>
          <p>Finally, to give a comprehensive overview of the performance of each
model, an evaluation set of 102 image samples was constructed. All image
samples have passed the post-processing step, and the corresponding ground-truth
text has already been produced. Each of the implemented versions was evaluated
on this set, and the character accuracy was calculated using the ocreval tool.
Table 2 summarizes the results of the evaluation, highlighting the superiority of the
proposed model over the other models on a larger evaluation set.</p>
          <p>There are several challenges for future research. Firstly, the proposed method relies on
the open-source Tesseract OCR engine, with further training, to achieve the
best OCR accuracy for Greek literature documents. A further
accuracy comparison with other current approaches could therefore be incorporated. Furthermore,
a sophisticated heuristic or machine learning approach could perform layout
recognition, dividing a whole document into front/body/back
sections and detecting headers, chapters, paragraphs, page numbers,
etc., within a document’s page.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Vandegrift</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Varner</surname>, <given-names>S.</given-names></string-name>:
          <article-title>Evolving in Common: Creating Mutually Supportive Relationships Between Libraries and the Digital Humanities</article-title>.
          <source>In: Journal of Library Administration</source>
          <volume>53</volume>(<issue>1</issue>),
          <fpage>67</fpage>-<lpage>78</lpage>
          (<year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Springmann</surname>, <given-names>U.</given-names></string-name>,
          <string-name><surname>Lüdeling</surname>, <given-names>A.</given-names></string-name>:
          <article-title>OCR of Historical Printings with an Application to Building Diachronic Corpora: A Case Study Using the RIDGES Herbal Corpus</article-title>.
          In ArXiv.org (<year>2017</year>), arxiv.org/abs/1608.02153.
          Last accessed 2 Nov 2020
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Rahnemoonfar</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Antonacopoulos</surname>, <given-names>A.</given-names></string-name>:
          <article-title>Restoration of Arbitrarily Warped Historical Document Images Using Flow Lines</article-title>
          .
          <source>In: Proceedings of the 2011 International Conference on Document Analysis and Recognition</source>
          , pp.
          <fpage>905</fpage>
          -
          <lpage>909</lpage>
          . IEEE Computer Society, Washington (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adhikari</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name><surname>Dasgupta</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Pradhan</surname>, <given-names>T.</given-names></string-name>
          :
          <article-title>An Adaptive Warp Correction Algorithm for Handwritten Text Images with Non-Linear Baselines</article-title>
          .
          <source>In: Proceedings of the 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . IEEE Computer Society, Washington (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name><surname>Yu</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Kong</surname>, <given-names>L.</given-names></string-name>:
          <article-title>Research on Deskew Algorithm of Scanned Image</article-title>.
          <source>In: Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA 2018)</source>, pp.
          <fpage>397</fpage>
          -
          <lpage>402</lpage>
          . IEEE Computer Society, Washington (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Traub</surname>, <given-names>M. C.</given-names></string-name>,
          <string-name><surname>van Ossenbruggen</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Hardman</surname>, <given-names>L.</given-names></string-name>:
          <article-title>Impact Analysis of OCR Quality on Research Tasks in Digital Archives</article-title>.
          <source>In: Research and Advanced Technology for Digital Libraries. Lecture Notes in Computer Science</source>,
          vol. <volume>9316</volume>, pp.
          <fpage>252</fpage>-<lpage>263</lpage>
          (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Strange</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNamara</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wodak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers</article-title>
          .
          <source>In: Digital Humanities Quarterly</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ) (
          <year>2014</year>
          ), http://www.digitalhumanities.org/dhq/vol/8/1/000168/000168.html.
          Last accessed 4 Oct 2017
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Reul</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springmann</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wick</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>State of the Art Optical Character Recognition of 19th Century Fraktur Scripts Using Open Source Engines</article-title>
          . In ArXiv.org (<year>2018</year>), https://arxiv.org/abs/1810.03436.
          Last accessed 2 Nov 2020
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Breuel</surname>, <given-names>T. M.</given-names></string-name>:
          <article-title>The OCRopus open source OCR system</article-title>.
          <source>In: Proceedings of SPIE - International Society for Optical Engineering</source>,
          https://www.deepdyve.com/lp/spie/the-ocropus-open-source-ocr-system-5xECzD6Gu0.
          Last accessed 4 Dec 2020
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><surname>Smith</surname>, <given-names>R.</given-names></string-name>:
          <article-title>An Overview of the Tesseract OCR Engine</article-title>.
          <source>In: 9th International Conference on Document Analysis and Recognition (ICDAR 2007)</source>, pp.
          <fpage>629</fpage>
          -
          <lpage>633</lpage>
          , IEEE Computer Society, Washington (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Dojčinović</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihajlović</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joković</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marković</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milovanović</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Neural Network Based Optical Character Recognition System</article-title>
          .
          <source>In: Proceedings of the 11th Symposium on Neural Network Applications in Electrical Engineering</source>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>114</lpage>
          . IEEE Computer Society, Washington (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ares</surname>
            ,
            <given-names>O. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seguin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>dhSegment: A generic deep-learning approach for document segmentation</article-title>
          .
          <source>In: 16th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>12</lpage>
          . IEEE Computer Society, Washington (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allebach</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouman</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          :
          <article-title>Text Line Detection Based on Cost Optimized Local Text Line Direction Estimation</article-title>
          .
          <source>In: SPIE Proceedings on Color Imaging XX: Displaying, Processing, Hardcopy, and Applications</source>
          , pp.
          <fpage>939507</fpage>
          . SPIE, California
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Bieniecki</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grabowski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rozenberg</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Image Preprocessing for Improving OCR Accuracy</article-title>
          .
          <source>In: Proceedings of 2007 International Conference on Perspective Technologies and Methods in MEMS Design</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>80</lpage>
          . IEEE Computer Society, Washington (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><surname>Raid</surname>, <given-names>A.M.</given-names></string-name>,
          <string-name><surname>Khedr</surname>, <given-names>W.M.</given-names></string-name>,
          <string-name><surname>El-dosuky</surname>, <given-names>M.A.</given-names></string-name>,
          <string-name><surname>Aoud</surname>, <given-names>M.</given-names></string-name>:
          <article-title>Image Restoration Based on Morphological Operations</article-title>
          . In
          <source>International Journal of Computer Science, Engineering and Information Technology</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          )
          <fpage>9</fpage>
          -
          <lpage>21</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><surname>Sane</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Agrawal</surname>, <given-names>R.</given-names></string-name>:
          <article-title>Pixel Normalization from Numeric Data as Input to Neural Networks: For Machine Learning and Image Processing</article-title>
          .
          <source>In: Proceedings of 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)</source>
          , pp.
          <fpage>2221</fpage>
          -
          <lpage>2225</lpage>
          . IEEE Computer Society, Washington (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          :
          <article-title>OCR Evaluation Tools for the 21st Century</article-title>
          .
          <source>In: Proceedings of the Workshop on Computational Methods for Endangered Languages</source>
          (<year>2019</year>), https://journals.colorado.edu/index.php/computel/article/view/345.
          Last accessed 4 Dec 2020.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>