=Paper=
{{Paper
|id=Vol-2984/paper16
|storemode=property
|title=Docreader labeling system for line type classifier (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2984/paper16.pdf
|volume=Vol-2984
|authors=Ilya S. Kozlov
|dblpUrl=https://dblp.org/rec/conf/itams/Kozlov21
}}
==Docreader labeling system for line type classifier (short paper)==
<pdf width="1500px">https://ceur-ws.org/Vol-2984/paper16.pdf</pdf>
<pre>
Docreader labeling system for line type classifier
Ilya S. Kozlov
Ivannikov Institute for System Programming of the RAS, Alexander Solzhenitsyn st. 25, Moscow, 109004,
Russian Federation


                                         Abstract
                                         We develop a document analysis system that extracts text and text metadata (such as font
                                         size and style) and restores the document structure. Some parts of the pipeline are based on
                                         machine learning and therefore require a labeled training dataset, which is created by manual
                                         labeling. In this article, we describe an approach to building a labeling system for the task of
                                         multiclass classification of document lines (paragraphs). The pipeline consists of several stages,
                                         ranging from obtaining the source documents to producing a ready-to-learn dataset. We consider
                                         scanned documents and documents in docx and txt formats. We focus on intra-team labeling,
                                         so we do not consider some problems common to the crowdsourcing approach
                                         (such as unscrupulous annotators).

                                         Keywords
                                         document structure analysis, PDF documents, document analysis


1. Introduction
As a rule, large documents are not uniform but are split into smaller parts. Scientific articles are
divided into sections, novels are divided into chapters, etc. The larger parts, in turn, are
divided into smaller parts, such as subsections or paragraphs. Thus a document can be represented
in the form of a tree.


Figure 1: Document: how the reader sees it and in the form of a tree


Information Technologies: Algorithms, Models, Systems (ITAMS), September 14, 2021, Irkutsk
   kozlov-ilya@ispras.ru (I. S. Kozlov)
 0000-0002-0145-1159 (I. S. Kozlov)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
   The physical representation of this logical structure is the table of contents (TOC). It serves to
facilitate navigation in large texts and can also be hierarchical. Thus in this article we
treat the tasks of logical structure extraction and TOC extraction as synonymous.
   Most methods of automatic TOC extraction are based on supervised machine learning
techniques and thus require a labeled dataset. One way to obtain a labeled dataset is to ask
human experts, called annotators, to label the data. Figure 2 shows the main pipeline (solid
lines) and the alternative path in which feature extraction and classification are performed by
the annotator (dotted lines).


Figure 2: Document: line classification pipeline


The creation of a labeled dataset is not a one-time task:
    • Each new type of document has its own kind of logical structure and thus requires its own
      labeled dataset. For example, we may work with scientific articles but also want to work
      with financial documents.
    • The existing labeled dataset may lack some important subcategory of documents.
      For example, a dataset of scientific articles may not contain any biological articles. In
      this case, the quality of logical structure extraction from biological articles will probably
      be low, and the best way to improve it is to expand the collection of labeled
      documents.
    • Generally, enlarging the training set is one of the simplest and most effective ways to
      improve the quality of a supervised machine learning model.
  This paper is organized as follows:
    • Section 2 describes related work: it gives a short description of the current state of the
      TOC extraction task and of the existing approaches to the creation of a training dataset.
    • Section 3 describes our approach to the creation of the dataset and the way we solve the
      problems that emerge when creating a labeled dataset for the task of
      TOC extraction.


2. Related Work
2.1. Logical Structure Extraction
One of the early surveys of the logical structure extraction task is [1]; most of the methods
considered there are rule-based. The survey defines most of the basic concepts, such as the tree
structure of a document.

   In 2008–2013 a series of competitions on TOC extraction from fiction books [2, 3, 4, 5]
was held. The competitions were based on two complementary metrics: a title-based measure and
a link-based measure. Both rule-based and machine learning-based solutions were proposed.

   In 2019–2021 the FinTOC series of competitions on TOC extraction from financial reports
[6, 7, 8] was held. There were two tasks: TOC extraction and title detection.
The difference is that in title detection participants were asked to classify each line of the
document as "Title" or "Not Title". There were two datasets, English and French documents,
so there could be more than one winner. Let us briefly list the winners' approaches:

    • The best solution [9] for title detection in FinTOC 2019 was based on an LSTM.
    • The best solution [10] for the TOC extraction task in FinTOC 2019 was based on a
      decision tree classifier.
    • The best title detection solution for English in FinTOC 2020 was based on neural
      networks [11].
    • The best TOC extraction solution for both English and French in FinTOC 2020 was based
      on a random forest classifier [12].
    • At the time of writing, the winner of FinTOC 2021 has not yet published
      the solution.

   The approach used in the Docreader project is based on the XGBoost classifier [13] and
is described in [14].
   We can conclude that most modern approaches to TOC extraction are based on machine
learning methods and, accordingly, require a labeled dataset.

2.2. Labeling Systems
Depending on the specific labeling problem, one can rely on crowdsourcing or do the job in-team.
Each approach has advantages and disadvantages. Crowdsourcing allows the creation of large
datasets with the assistance of external annotators, who often work for little money.
One can use Amazon Mechanical Turk (footnote 1) or Yandex Toloka (footnote 2) as a crowdsourcing
platform. The disadvantages of the crowdsourcing approach include:

    • Data containing state or commercial secrets cannot be handed over to external annotators.
    • Some annotators may work unscrupulously, so external quality control is required.
    • One is limited by the restrictions of the platform.

More information about crowdsourcing can be found in [15, 16].

 In the case of in-team labeling, the work is performed by team members, who are often paid
much more than crowdsourcing annotators. On the other hand, with in-team labeling some
problems are not relevant: usually one need not worry about unscrupulous annotators, and it is
easier to organize the labeling of secret documents. There are many tools for labeling; one of the
most powerful is Label Studio [17]. Label Studio has rich functionality: it is suitable for
segmentation, object detection, image classification, etc. We have used Label Studio but faced a
relatively long waiting time for image updates (about 1 second; see footnote 3). This is not a
problem for segmentation tasks, because one image takes much more than one second to label. But
in the case of image classification, the waiting time may exceed the task completion time, so we
use a much simpler program, ImageClassifier (footnote 4).

   1 https://www.mturk.com/
   2 https://toloka.yandex.ru


3. Proposed Method
The text line classification pipeline is organized as follows (Figure 2).
   1. Extraction of the text lines with metadata (font size and style, indents, etc.) from the
      document.
   2. Extraction of the features from the lines with metadata.
   3. Classification of the lines by their type.
   4. Construction of the document structure.
We need labeled data to train the classifier (step 3) and sometimes the feature extractor (step 2).
  As a rule, the result of a labeling task is a set of pairs of features 𝑋 and a label 𝑦. But the
way we extract lines with metadata and the way we extract features may change, and we do not want
to redo the data labeling task every time any step of the pipeline is changed.


3.1. Persistent line id
We add a special persistent id to each line with metadata; this id does not change when our
pipeline changes. Thus we are able to change the pipeline (for example, feature extraction)
without relabeling the training dataset. The way the persistent id is built differs between
kinds of documents.
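
The following is a minimal sketch of why a persistent id makes relabeling unnecessary. It is an
illustration only, not the Docreader implementation: the ids, the line records, and the two
feature extractors are hypothetical. Labels are stored against line ids alone, so the feature
matrix can be rebuilt whenever the feature extraction step changes.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stored labels: they reference persistent line ids only.
labels: Dict[str, str] = {"doc1_0": "title", "doc1_1": "text"}

# Hypothetical lines with metadata, as produced by the current extraction step.
lines: Dict[str, dict] = {
    "doc1_0": {"text": "1. Introduction", "font_size": 14},
    "doc1_1": {"text": "Some body text", "font_size": 10},
}


def features_v1(line: dict) -> List[float]:
    """First version of the feature extractor (illustrative)."""
    return [float(line["font_size"])]


def features_v2(line: dict) -> List[float]:
    """Changed feature extractor; the stored labels stay untouched."""
    return [float(line["font_size"]), float(len(line["text"]))]


def build_dataset(
    extract: Callable[[dict], List[float]]
) -> Tuple[List[List[float]], List[str]]:
    """Join stored labels with freshly computed features by line id."""
    ids = sorted(line_id for line_id in labels if line_id in lines)
    X = [extract(lines[line_id]) for line_id in ids]
    y = [labels[line_id] for line_id in ids]
    return X, y
```

Calling build_dataset(features_v1) and build_dataset(features_v2) yields the same labels with
different feature vectors, which is the property the persistent id is meant to provide.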

Scanned documents: Scanned documents are in fact images, so line detection is a separate
task, and we use Tesseract (footnote 5) for this purpose. A change of the Tesseract version may
change the text line location algorithm and even the number and the order
of the lines. To avoid repeating the labeling task each time we change the
Tesseract version, we save each found line in the form of its bounding box and the extracted
text.


    3 as of the beginning of 2019
    4 https://github.com/dronperminov/ImageClassifier
    5 https://github.com/tesseract-ocr/tesseract
Figure 3: Line with bounding box (red frame)


  From then on, we use the saved bounding boxes as the result of the Tesseract step.
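
One possible way to save the found lines is sketched below. The record layout (page number,
box coordinates, text) and the JSON serialization are assumptions made for illustration; the
text above only states that the bounding box and the extracted text are saved.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class SavedLine:
    # Hypothetical record layout: one detected text line of a scanned page.
    page: int
    left: int
    top: int
    width: int
    height: int
    text: str


def dump_lines(lines: List[SavedLine]) -> str:
    """Serialize detected lines so that later runs can reuse them."""
    return json.dumps([asdict(line) for line in lines], ensure_ascii=False)


def load_lines(payload: str) -> List[SavedLine]:
    """Restore the saved lines instead of running line detection again."""
    return [SavedLine(**item) for item in json.loads(payload)]
```

Loading the saved records instead of rerunning detection keeps the set and the order of the lines
fixed even if the OCR engine is upgraded.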

Txt document: A line in a txt document is defined by the document itself and the number of
the line. We define the line id as md5sum(document) + "_" + line_number.
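
A minimal sketch of this id for txt documents follows. The function name is hypothetical, and the
sketch assumes that the hash is taken over the raw bytes of the file, that the file is UTF-8
encoded, and that line numbers start at zero; none of these details are fixed by the text above.

```python
import hashlib
from typing import List


def txt_line_ids(document: bytes) -> List[str]:
    """Return one persistent id per line: md5 of the whole file plus the line number."""
    doc_hash = hashlib.md5(document).hexdigest()
    # Assumption: UTF-8 text, zero-based line numbers.
    n_lines = len(document.decode("utf-8").splitlines())
    return [f"{doc_hash}_{number}" for number in range(n_lines)]
```

Because the id depends only on the file content and the line number, it stays the same across
runs of the pipeline as long as the document itself is unchanged.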

Docx: Docx is the Microsoft Word format. A docx file is a zip archive of files in XML
format. One of the files consists of paragraphs; each paragraph contains text and some meta
information, given either explicitly or as a reference to a style. The paragraphs are located in
the file document.xml and the styles are defined in the file styles.xml; the archive can hold
other files as well. A paragraph is defined by the XML that produces it and by the docx file
itself. We define the line id as md5sum(document) + "_" + md5sum(paragraph_xml).
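
The docx id can be sketched with the Python standard library as follows. This is an illustration
under stated assumptions, not the Docreader code: the function name is hypothetical, the
paragraphs are read from word/document.xml inside the archive, and each paragraph is hashed
after being re-serialized by ElementTree rather than as the raw bytes from the file.

```python
import hashlib
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# Namespace of WordprocessingML elements such as w:p (paragraph).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_line_ids(docx_bytes: bytes) -> List[str]:
    """Return md5(document) + "_" + md5(paragraph XML) for every paragraph."""
    doc_hash = hashlib.md5(docx_bytes).hexdigest()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    ids = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Assumption: the re-serialized element stands in for the paragraph XML.
        paragraph_xml = ET.tostring(paragraph, encoding="utf-8")
        ids.append(f"{doc_hash}_{hashlib.md5(paragraph_xml).hexdigest()}")
    return ids
```

Note that under this scheme two paragraphs with identical XML in the same document receive the
same id, so a real implementation may need an extra disambiguating component.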

3.2. Creating tasks for annotators
We believe that whoever is interested in the labeled data should simplify the process for the
annotators as much as possible. The annotator should not have to install unfamiliar software with
cumbersome installation instructions, and should have direct access to the labeling instructions.
In our case, we reduce the task to image classification. The annotator gets a zip archive with
images of the lines with bounding boxes (as in Figure 3), the annotator instructions, and a
Dockerfile with all the dependencies. The task can be launched with two shell commands (provided
Docker is installed); after launching, one can use a web browser to perform the tasks. We use
ImageClassifier to create web interfaces for the labeling task. The annotator labels the document
line by line, the first line followed by the second one, and so on, so the context is preserved.
After finishing the labeling task, the annotator gets an archive with the results and can upload
it to the task server. When all annotators have uploaded their results, the task server merges
the answers and adds the original documents. As a result, we obtain a collection of documents and
labeled pairs of a line id and a label. The line id should be the same for each run of the
pipeline; we describe how to build such an id in subsection 3.1.
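
The merge step on the task server could look like the sketch below. The text above does not say
how overlapping answers are reconciled, so the majority vote used here is an assumption, as are
the function name and the representation of each uploaded result as a mapping from line id to
label.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def merge_answers(results: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-annotator results (line id -> label) into one labeling."""
    votes = defaultdict(Counter)
    for annotator_result in results:
        for line_id, label in annotator_result.items():
            votes[line_id][label] += 1
    # Assumption: when several annotators labeled a line, the majority wins.
    return {line_id: counter.most_common(1)[0][0] for line_id, counter in votes.items()}
```

If every line is given to exactly one annotator, the vote degenerates into a plain union of the
uploaded results.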

3.2.1. Creating tasks images
We have to create images with a bounding box around the text line. The way to create them
differs between kinds of documents.

Scanned documents: A scanned document is a picture, and we have saved the coordinates of each
line obtained from Tesseract. Thus the bounding box can be drawn with the help of the
OpenCV [18] or the PIL [19] library.

Txt documents: A txt document contains only text lines without metadata (such as font
size or font style). Thus one may draw the text with an image processing library without fear of
losing important meta information.

Docx documents: A docx document is the most difficult one to render as an image. Typically a
docx document contains a lot of valuable information about the text formatting, so we do not
want to draw the document as raw text and lose all the meta information. On the other hand,
rendering a docx document is complicated, so only some large libraries are
able to do it. A solution can be found in modifying the internal XML: one may add to a
paragraph the information that it should be enclosed in a box. To obtain the coordinates
of the box, we tried to convert the modified document and the original one into images and
subtract the second image from the first one, but we found that drawing the box shifts the
paragraphs. We also found that the boxes of neighboring paragraphs may be
merged if both boxes have the same color. All of the above forces us to use a more complicated
way of creating images with bounding boxes for docx:
   1. Create a pair of documents with boxes. Each box in the first document has a unique
      color, and the colors of the boxes in the second document alternate.
   2. Convert the pair of docx documents into a pair of pdfs and the pdfs into lists of images
      (see Figure 4).
   3. Subtract the second image from the first one and obtain an image whose only nonzero
      pixels are in the former boxes.
Figure 4: Pair of images with boxes
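
The subtraction step for docx pages can be illustrated with the pure-Python sketch below. It is
a simplified model, not the actual implementation: images are nested lists of RGB tuples, the
function names are hypothetical, and a pixel is treated as nonzero in the difference whenever the
two renderings disagree. In practice an array library would normally be used for this.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def difference_mask(first: Image, second: Image) -> List[List[bool]]:
    """True where the two renderings differ, that is, on the box frames."""
    return [
        [a != b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(first, second)
    ]


def box_of_color(
    first: Image, mask: List[List[bool]], color: Pixel
) -> Optional[Tuple[int, int, int, int]]:
    """Bounding box (left, top, right, bottom) of differing pixels of one color."""
    xs, ys = [], []
    for y, row in enumerate(first):
        for x, pixel in enumerate(row):
            # Keep only frame pixels drawn in this paragraph's unique color.
            if mask[y][x] and pixel == color:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)
```

Since both documents contain boxes, the text is shifted identically in both renderings and
cancels out in the difference, while the unique color in the first image identifies which
paragraph a frame belongs to.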


4. Summary
The task of creating a training dataset occurs regularly in the process of developing a machine
learning-based system. We have described our approach to the creation of a labeled dataset, an
approach that lets us avoid redoing data annotation with every change of our document handling
pipeline, and a way to reduce the task of line annotation to the task of image classification.
We believe that the need for human-labeled data is not a one-time task but a regularly occurring
problem, so machine learning systems should make it easy to create such tasks.


References
 [1] S. Mao, A. Rosenfeld, T. Kanungo, Document structure analysis algorithms: a literature
     survey, in: Document Recognition and Retrieval X, volume 5010, International Society for
     Optics and Photonics, 2003, pp. 197–207.
 [2] G. Kazai, A. Doucet, M. Landoni, Overview of the INEX 2008 book track, in: International
     Workshop of the Initiative for the Evaluation of XML Retrieval, Springer, 2008, pp. 106–123.
 [3] G. Kazai, A. Doucet, M. Koolen, M. Landoni, Overview of the INEX 2009 book track, in:
     International Workshop of the Initiative for the Evaluation of XML Retrieval, Springer,
     2009, pp. 145–159.
 [4] A. Doucet, G. Kazai, B. Dresevic, A. Uzelac, B. Radakovic, N. Todic, Setting up a competition
     framework for the evaluation of structure extraction from OCR-ed books, International
     Journal on Document Analysis and Recognition (IJDAR) 14 (2011) 45–52.
 [5] A. Doucet, G. Kazai, S. Colutto, G. Mühlberger, ICDAR 2013 competition on book structure
     extraction, in: 2013 12th International Conference on Document Analysis and Recognition,
     IEEE, 2013, pp. 1438–1443.
 [6] R. Juge, I. Bentabet, S. Ferradans, The FinTOC-2019 shared task: Financial document structure
     extraction, in: Proceedings of the Second Financial Narrative Processing Workshop (FNP
     2019), 2019, pp. 51–57.
 [7] N.-I. Bentabet, R. Juge, I. El Maarouf, V. Mouilleron, D. Valsamou-Stanislawski, M. El-Haj,
     The financial document structure extraction shared task (FinTOC 2020), in: Proceedings
     of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial
     Summarisation, 2020, pp. 13–22.
 [8] I. El Maarouf, J. Kang, A. Aitazzi, S. Bellato, M. Gan, M. El-Haj, The Financial Document
     Structure Extraction Shared Task (FinTOC 2021), in: The Third Financial Narrative
     Processing Workshop (FNP 2021), Lancaster, UK, 2021.
 [9] K. Tian, Z. J. Peng, Finance document extraction using data augmentation and attention,
     in: Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019), 2019,
     pp. 1–4.
[10] E. Giguet, G. Lejeune, Daniel@FinTOC-2019 shared task: TOC extraction and title detection,
     in: Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019), 2019,
     pp. 63–68.
[11] D. Premi, A. Badugu, H. Sharad Bhatt, AMEX-AI-LABS: Investigating transfer learning for
     title detection in table of contents generation, in: Proceedings of the 1st Joint Workshop
     on Financial Narrative Processing and MultiLing Financial Summarisation, COLING,
     Barcelona, Spain (Online), 2020, pp. 153–157. URL: https://www.aclweb.org/anthology/
     2020.fnp-1.26.
[12] D. Kosmajac, S. Taylor, M. Saeidi, DNLP@FinTOC’20: Table of contents detection in
     financial documents, in: Proceedings of the 1st Joint Workshop on Financial Narrative
     Processing and MultiLing Financial Summarisation, COLING, Barcelona, Spain (Online),
     2020, pp. 169–173. URL: https://www.aclweb.org/anthology/2020.fnp-1.29.
[13] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the
     22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
     KDD ’16, ACM, New York, NY, USA, 2016, pp. 785–794. URL: http://doi.acm.org/10.1145/
     2939672.2939785. doi:10.1145/2939672.2939785.
[14] A. O. Bogatenkova, I. S. Kozlov, O. V. Belyaeva, A. I. Perminov, Logical structure extraction
     from scanned documents, Proceedings of the Institute for System Programming of the
     RAS 32 (2020) 175–188.
[15] R. Gilyazev, D. Y. Turdakov, Active learning and crowdsourcing: A survey of optimization
     methods for data labeling, Programming and Computer Software 44 (2018) 476–491.
[16] A. Drutsa, V. Farafonova, V. Fedorova, O. Megorskaya, E. Zerminova, O. Zhilinskaya,
     Practice of efficient data collection via crowdsourcing at large-scale, arXiv preprint
     arXiv:1912.04444 (2019).
[17] M. Tkachenko, M. Malyuk, N. Shevchenko, A. Holmanyuk, N. Liubimov, Label Studio:
     Data labeling software, 2020-2021. URL: https://github.com/heartexlabs/label-studio, open
     source software available from https://github.com/heartexlabs/label-studio.
[18] G. Bradski, The OpenCV Library, Dr. Dobb’s Journal of Software Tools (2000).
[19] P. Umesh, Image processing in python, CSI Communications 23 (2012).

</pre>