=Paper= {{Paper |id=Vol-2543/spaper05 |storemode=property |title=Methods of Processing Large Collections of Scientific Documents and the Formation of Digital Mathematical Library |pdfUrl=https://ceur-ws.org/Vol-2543/spaper05.pdf |volume=Vol-2543 |authors=Alexander Elizarov,Evgeny Lipachev |dblpUrl=https://dblp.org/rec/conf/ssi/ElizarovL19 }} ==Methods of Processing Large Collections of Scientific Documents and the Formation of Digital Mathematical Library== https://ceur-ws.org/Vol-2543/spaper05.pdf
    Methods of Processing Large Collections of Scientific
              Documents and the Formation
             of Digital Mathematical Library
           A. M. Elizarov[0000-0003-2546-6897] and E. K. Lipachev[0000-0001-7789-2332]

                  N. I. Lobachevskii Institute of Mathematics and Mechanics,
              Higher School of Information Technologies and Intelligent Systems,
                           Kazan (Volga Region) Federal University
                 amelizarov@gmail.com, elipachev@gmail.com



        Abstract. Digital mathematical libraries are today one of the tools for integrat-
        ing mathematical knowledge. This integration method is based on the use of
        metadata. Our task was to create methods to programmatically extract the nec-
        essary objects from digital mathematical documents, establish semantic rela-
        tionships between them and generate the necessary sets of metadata. Based on
        the analysis of the structure of the set of documents under consideration and the
        stylistic features of their design, an algorithm has been developed for extracting
        their metadata, creating digital collections and then including them in the corre-
        sponding digital library. The algorithm is implemented as a software system
        and tested on the example of a set of files “Proceedings of the N.I. Lobachev-
        skii Mathematical Center” for 1998–2018. A corresponding digital collection
        has been created, which is included in the Lobachevskii Digital Mathematical
        Library (Lobachevskii DML, https://lobachevskii-dml.ru/).

        Keywords: Digital Collection, Digital Mathematics Library, Metadata, Seman-
        tic Relation, Semantic Method, Lobachevskii DML.


1       Introduction

Digital mathematical libraries are currently one of the tools for integrating mathematical
knowledge (see, for example, [1]–[5]). This integration method, like information
management on the Web, is based on the use of metadata [6]–[8]. Creating a new digital
collection from a set of files containing, for example, an archive of journal articles
involves a series of operations: coordinating formats, extracting and refining metadata,
and normalizing the metadata in accordance with established data schemas. Each
such operation requires the use of special methods and software tools that take into
account the specifics of the collection being processed and the rules for organizing a
digital library. When creating digital libraries, additional requirements are also im-
posed, in particular, on the composition and format of metadata [9]. The construction
of digital mathematical libraries requires the development of software tools that take
into account such features of mathematical documents as the presence of formulas,
notation, definitions, theorems, and proofs in texts. All of the above form a complex

Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


system of objects, interconnected both within the framework of the document under
consideration, and with other documents and objects in this field of mathematics [1,
4]. Our task was to create methods that allow the software to extract the necessary
objects from digital mathematical documents and establish semantic relations between
them [10–14].
   Periodic journals with a long history of publication have archives of articles that
differ significantly in the composition of the metadata presented in these articles. In
addition, style rules, fixed, for example, by MS Word templates or in .sty-files (for
the TeX system), have changed many times over the years. For example, in the
journal “Lobachevskii Journal of Mathematics” (the articles of this journal constitute
one of the collections of the digital mathematical library Lobachevskii DML), four
different .sty-files were used to design articles from 1998 to 2019. Algorithms for
extracting metadata use such stylistic features as the fonts for the title of the article,
its abstract and its set of keywords, as well as the structure of the document, which
determines the sequence of structural blocks (title, list of authors, etc.) (see, for
example, [15–17]).
    Automated processing of arrays of scientific documents is complicated by the vari-
ety of styles used in journals. As a result, various methods of extracting metadata are
required, which will take into account the structural features of a particular collection
(see, for example, [10, 18]).
    This work is devoted to the development of methods for creating digital scientific
collections from an array of heterogeneous digitized documents. Using the example of
processing a set of files containing volumes of “Proceedings of the N.I. Lobachevskii
Mathematical Center” (hereinafter referred to as the “Proceedings”) for 1998–2018,
we describe the process of forming the corresponding digital collection and its inclusion
in the Lobachevskii Digital Mathematical Library (Lobachevskii DML,
https://lobachevskii-dml.ru/), which is currently being formed at Kazan (Volga Region)
Federal University.
    The main purpose of this “Proceedings” is the publication of materials of mathe-
matical conferences. As a result, most volumes of the “Proceedings” contain dozens
of articles with a limited (from the modern point of view) metadata composition.
Since 1998 (the moment the first volume was released), several style rules for the
preparation of materials have been used, which affected the design of articles and the
choice of file formats of compiled collections.
    Let us single out the main tasks of forming a digital collection. Creating a digital
collection from the “Proceedings” array required:
    • division of volumes into separate articles;
    • extraction of the metadata describing each article;
    • generation of additional metadata that contains, in particular, a bibliographic
description of the article, a link to the article file in the digital collection, as
well as links to the profiles of the article authors on academic portals and in
scientometric databases (kpfu.ru, MathNet.ru, Scopus, DBLP, etc.).


2       Algorithm for Extracting Metadata and Semantic
        Relationships

Let us describe in more detail the main stages of the software processing of the selected
set of files containing the collections of articles. These stages are as follows:
    • extraction of metadata;
    • conversion of the metadata in accordance with XML schemas;
    • creation of a digital collection;
    • inclusion of the collection in the digital mathematical library Lobachevskii DML.


2.1     Clustering
First of all, file clustering was performed. As a result, the volumes were divided into
classes according to the similarity of their structure and design. From the release of the
first volume (1998) to the publication of volume 57, the latest to date, various rules were
used to prepare the materials. This affected the file formats of the volumes, as well as
the design of the articles themselves. For example, only some volumes contain indexes of
the Universal Decimal Classification (UDC). Table 1 shows, by volume number, the
differences in the use of structural elements and the composition of the metadata of this
collection.

                  Table 1. Metadata and building blocks in collection volumes.

 Element         Volumes in which the element is present
 e-mail          3, 4, 12, 15, 18, 19, 22, 24, 31, 35–40, 42–47, 49, 50, 52–57
 UDC             54, 55, 56, 57
 Contents        4, 9, 15, 17, 20, 22, 26, 29, 32, 33, 49
 Author Index    2, 3, 5, 6, 8, 9, 12, 13, 16, 18, 19, 21, 23, 24, 28, 31, 34–40, 42–47,
                 50, 52–57
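
    The grouping of volumes by structural similarity can be illustrated with the data of
Table 1. The sketch below is only an approximation of the clustering step: it assumes that
two volumes belong to the same class when they contain exactly the same set of the four
elements listed in the table. The actual similarity criteria used in the system are not
specified here, and the names expand and cluster_by_signature are illustrative.

```python
from collections import defaultdict


def expand(spec):
    """Expand a list of ints and (first, last) tuples into a set of volume numbers."""
    volumes = set()
    for item in spec:
        if isinstance(item, tuple):
            volumes.update(range(item[0], item[1] + 1))
        else:
            volumes.add(item)
    return volumes


# Volume numbers per structural element, transcribed from Table 1.
TABLE_1 = {
    "e-mail": expand([3, 4, 12, 15, 18, 19, 22, 24, 31, (35, 40), (42, 47),
                      49, 50, (52, 57)]),
    "UDC": expand([54, 55, 56, 57]),
    "Contents": expand([4, 9, 15, 17, 20, 22, 26, 29, 32, 33, 49]),
    "Author Index": expand([2, 3, 5, 6, 8, 9, 12, 13, 16, 18, 19, 21, 23, 24, 28, 31,
                            (34, 40), (42, 47), 50, (52, 57)]),
}


def cluster_by_signature(table, volume_count=57):
    """Group volumes 1..volume_count by the exact set of elements they contain.

    Assumption: identical element sets stand in for "similar structure and design".
    """
    clusters = defaultdict(list)
    for volume in range(1, volume_count + 1):
        signature = frozenset(name for name, vols in table.items() if volume in vols)
        clusters[signature].append(volume)
    return dict(clusters)


clusters = cluster_by_signature(TABLE_1)
for signature, volumes in clusters.items():
    print(sorted(signature), volumes)
```

    Under this simplification, the four volumes with UDC indexes form a class of their
own, because they are also the only volumes that combine UDC, e-mail and an author index.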


2.2     Metadata Extraction

Next, the collection files were processed in order to extract metadata describing both the
volume as a whole and the articles included in it. In particular, the page numbers of all
the articles of each volume were determined. To search for pages with article titles, we
developed an algorithm that uses the structural homogeneity of each volume and the
distinctive style used in the design of its articles.
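
    The search for title pages described above might look as follows. This is a minimal
sketch under two assumptions that are not stated in the text: each article starts on a
new page, and the title can be recognized by the font and size of the first line on the
page. The PageInfo structure and the font names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    number: int        # page number within the volume
    first_font: str    # font of the first text line on the page (hypothetical feature)
    first_size: float  # font size of that line, in points


def find_title_pages(pages, title_font, title_size):
    """Return the numbers of pages whose first line is set in the title style."""
    return [
        p.number
        for p in pages
        if p.first_font == title_font and abs(p.first_size - title_size) < 0.1
    ]


def page_ranges(title_pages, last_page):
    """Pair each start page with the page preceding the next start page."""
    ranges = []
    for i, start in enumerate(title_pages):
        end = title_pages[i + 1] - 1 if i + 1 < len(title_pages) else last_page
        ranges.append((start, end))
    return ranges


pages = [
    PageInfo(1, "Times-Roman", 10.0),
    PageInfo(2, "Times-Roman", 10.0),
    PageInfo(3, "Times-Bold", 14.0),
    PageInfo(4, "Times-Roman", 10.0),
    PageInfo(5, "Times-Bold", 14.0),
]
starts = find_title_pages(pages, "Times-Bold", 14.0)
print(page_ranges(starts, last_page=5))
```

    Because the title style differs between clusters of volumes, title_font and title_size
would have to be set separately for each cluster.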
   An important part of the structural analysis of documents is the identification of
blocks such as the title, the names of the authors, their affiliations, the abstract,
keywords and bibliographic records. Special ontologies have been developed to describe the
structure of scientific documents [19, 20]. For the semantic structuring of digital
content, the ontologies CiTO, DoCO, SWAN, SKOS, CERIF and SPAR are used (see, for
example, [21, 22]). An example of the distribution of such blocks according to structural
features and their description in terms of the DoCO ontology is given in [10].


    To extract article metadata, we defined rules for selecting article blocks based on
their characteristic features. Such features include, in particular, the styles used in
the design of the article (font, size, use of emphasis, and a number of others). The
quality of metadata extraction is improved by additional features that take into account:
    • text structuring (for example, the location of the word “Annotation” in front of
the abstract block);
    • the type of template used to record email addresses;
    • the position of the block in the text (for example, the document begins with the
title of the article).
    The main characteristics used by these rules are the position of the block in the
document and the font used in the text of the block. The rules made it possible to
identify not only the titles of articles, but also lists of authors, bibliography blocks,
and other metadata (for example, e-mail addresses and keywords) if they are present in
the text.
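
    The rules described above can be sketched as a small classifier that combines the
position of a block, its font, marker words and an email pattern. The Block structure,
the font name, the marker words and the order of the rules are assumptions made for
illustration; the rule set of the actual system is not reproduced here.

```python
import re
from dataclasses import dataclass

# Simplified email pattern; real templates for email records may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
AUTHOR_FONT = "Times-Italic"  # hypothetical font of the author list
MARKERS = {
    "abstract": ("annotation", "abstract"),
    "keywords": ("keywords", "key words"),
    "bibliography": ("references", "bibliography"),
}


@dataclass
class Block:
    index: int  # position of the block in the document, starting at 0
    text: str
    font: str


def classify(block, previous_label):
    """Assign a label to one block using position, email, marker and font rules."""
    text = block.text.strip()
    if block.index == 0:                      # the document begins with the title
        return "title"
    if EMAIL_RE.search(text):                 # email address template
        return "email"
    lowered = text.lower()
    for label, words in MARKERS.items():      # marker word in front of the block
        if lowered.startswith(words):
            return label
    if previous_label in ("title", "authors") and block.font == AUTHOR_FONT:
        return "authors"
    return "body"


def label_blocks(blocks):
    """Classify blocks in document order, passing each label on to the next block."""
    labels = []
    previous = None
    for block in blocks:
        previous = classify(block, previous)
        labels.append(previous)
    return labels


blocks = [
    Block(0, "On a boundary value problem", "Times-Bold"),
    Block(1, "A. B. Ivanov", "Times-Italic"),
    Block(2, "ivanov@example.org", "Courier"),
    Block(3, "Annotation. We study a boundary value problem.", "Times-Roman"),
    Block(4, "Keywords: boundary value problem", "Times-Roman"),
    Block(5, "Consider the equation in a bounded domain.", "Times-Roman"),
]
print(label_blocks(blocks))
```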
    Using text analysis methods [23, 24], terms were extracted from the documents of the
digital collection, and the sets of keywords included in the metadata were formed from
these terms.
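
    The specific text analysis method is not named above. As one common possibility,
candidate keywords can be ranked by TF-IDF weight, as in the following sketch. The
tokenizer, the stop-word list and the choice of TF-IDF itself are assumptions and are not
taken from the system described in this paper.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be language specific.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "for", "is", "on", "to", "we"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[^\W\d_]+", text.lower()) if w not in STOPWORDS]


def top_keywords(documents, k=3):
    """Return, for each document, its k terms with the highest TF-IDF weight."""
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    document_frequency = Counter()
    for tokens in tokenized:
        document_frequency.update(set(tokens))
    result = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n / document_frequency[term])
            for term, count in counts.items()
        }
        # Sort by descending score; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:k])
    return result


documents = [
    "the spline approximation of a spline surface",
    "the integral equation and the integral operator",
    "approximation of the operator",
]
print(top_keywords(documents, k=2))
```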
    Some metadata (such as the email addresses of authors and their affiliations) were
imported and updated from the profiles of the authors on academic sites and in
scientometric databases. This procedure relied on the semantic relationships established
during the formation of the digital collection.


2.3    XML-representation
We have developed an XML language for describing digital mathematical collections. It
consists of a set of tags and XML schemas based on the Journal Archiving and Interchange
Tag Suite (NISO JATS, https://jats.nlm.nih.gov/archiving/). The description of the
“Proceedings” collection was written in this language from the data obtained during the
processing of the array of files.
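
    The tag set of this language is not listed here. For illustration only, the sketch
below serializes the metadata of one article using standard NISO JATS element names
(article-meta, title-group, contrib-group, fpage, lpage). The function name, the layout
of the input dictionary and the sample values are hypothetical, and the output is not
validated against any schema.

```python
import xml.etree.ElementTree as ET


def article_to_jats(meta):
    """Build a JATS-style <article> element from a dictionary of extracted metadata."""
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    article_meta = ET.SubElement(front, "article-meta")

    title_group = ET.SubElement(article_meta, "title-group")
    ET.SubElement(title_group, "article-title").text = meta["title"]

    contrib_group = ET.SubElement(article_meta, "contrib-group")
    for surname, given_names in meta["authors"]:
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given_names

    # Volume number and first/last page of the article within the volume.
    ET.SubElement(article_meta, "volume").text = str(meta["volume"])
    ET.SubElement(article_meta, "fpage").text = str(meta["fpage"])
    ET.SubElement(article_meta, "lpage").text = str(meta["lpage"])
    return article


meta = {
    "title": "On a boundary value problem",   # sample values, not real records
    "authors": [("Ivanov", "A. B.")],
    "volume": 5,
    "fpage": 12,
    "lpage": 14,
}
article = article_to_jats(meta)
print(ET.tostring(article, encoding="unicode"))
```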


2.4    Splitting Volume Files into Article Files and Creating a Digital Collection
The next stage in creating the digital collection was dividing each volume of the
“Proceedings” into separate articles. To do this, tags whose attributes indicate the start
and end pages of the articles were read from the XML files that contain the metadata
descriptions of the volumes. After that, the volume files were divided into separate
documents, which were named in accordance with the rules of the digital collection.
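
    The splitting step can be sketched in two parts: reading the page ranges from the
volume description and copying the corresponding pages into new files. The element and
attribute names (volume, article, first-page, last-page) and the file naming rule are
hypothetical, since neither is given above. The split_pdf function assumes PyPDF2 2.0 or
later (the PdfReader and PdfWriter classes) and imports the library only when called.

```python
import xml.etree.ElementTree as ET


def read_split_plan(volume_xml):
    """Return (file name, first page, last page) for each article of a volume."""
    root = ET.fromstring(volume_xml)
    volume = root.get("number")
    plan = []
    for i, article in enumerate(root.iter("article"), start=1):
        first = int(article.get("first-page"))
        last = int(article.get("last-page"))
        if last < first:
            raise ValueError(f"article {i}: last page precedes first page")
        # Hypothetical naming rule; the real rules of the collection may differ.
        plan.append((f"vol{volume}_art{i:02d}.pdf", first, last))
    return plan


def split_pdf(volume_pdf, plan, page_offset=0):
    """Write one PDF file per plan entry.

    page_offset is the number of unnumbered pages at the start of the volume file.
    """
    from PyPDF2 import PdfReader, PdfWriter  # assumed version: PyPDF2 >= 2.0

    reader = PdfReader(volume_pdf)
    for filename, first, last in plan:
        writer = PdfWriter()
        for page_number in range(first, last + 1):
            # Printed page numbers start at 1, page indexes at 0.
            writer.add_page(reader.pages[page_number - 1 + page_offset])
        with open(filename, "wb") as output:
            writer.write(output)


sample = """
<volume number="5">
  <article first-page="3" last-page="5"/>
  <article first-page="6" last-page="6"/>
</volume>
"""
plan = read_split_plan(sample)
print(plan)
```

    Keeping the plan separate from the PDF operations allows the page ranges to be
checked before any files are written.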
   The system of metadata prepared by the algorithm described above made it possible to
form a digital collection of “Proceedings of the N.I. Lobachevskii Mathematical Center”
and to include it in the digital library Lobachevskii DML
(https://lobachevskii-dml.ru/) [11].


2.5    Software Implementation
The algorithm is implemented as a set of programs in C# that process files in the TeX,
OpenXML (.docx) and .pdf formats. TeX files were processed using standard functions that
implement operations with text strings. To work with pdf-files, we used the functions of
the PDFLib (https://www.pdflib.com) and iTextSharp
(https://www.nuget.org/packages/iTextSharp/) libraries. For documents presented as
docx-files, the “word/document.xml” file was parsed from the .docx archive in accordance
with the Office OpenXML format (see, for example, [25]).
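
    The implementation itself is written in C#. The idea of reading “word/document.xml”
from the .docx archive can nevertheless be shown with a short Python sketch that uses
only the standard library. The function names are illustrative, and make_minimal_docx
builds a test archive that contains only the document part, not a complete .docx file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace defined by the Office Open XML standard.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return the text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph consists of runs; the text of each run is stored in w:t elements.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_minimal_docx(paragraphs):
    """Build an in-memory archive holding only word/document.xml (for testing)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


print(docx_paragraphs(make_minimal_docx(["Title of the article", "A. B. Ivanov"])))
```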
    The extraction of individual articles was carried out by a program developed in
Python that uses the functions of the PyPDF2 library (http://pybrary.net/pyPdf/).


3      Conclusion

To include the digital mathematical collections of Kazan University in the international
scientific space, we propose methods for forming them from sets of documents presented in
various storage formats. Based on the analysis of the structure of the documents and the
stylistic features of their design, an algorithm for extracting their metadata has been
developed and implemented, using “Proceedings of the N.I. Lobachevskii Mathematical
Center” as an example.
    This work partially contains the results of the project “Monitoring and
standardization of the development and use of technologies for storing and analyzing big
data in the digital economy of the Russian Federation”, carried out as part of the
competence center program of the National Technological Initiative “Center for storing
and analyzing big data”, supported by the Ministry of Science and Higher Education of the
Russian Federation under the agreement of M.V. Lomonosov Moscow State University with the
Project Support Fund of the National Technological Initiative dated 15/08/2019
No. 7/1251/2019. The work was also partially supported by the Russian Foundation for
Basic Research (project 18-29-03086) and by the Russian Foundation for Basic Research and
the Government of the Republic of Tatarstan within the framework of scientific project
18-47-160012.


References
 1. Developing a 21st Century Global Library for Mathematics Research. The National Acad-
    emies Press, Washington (2014).
 2. Ion, P.: The Effort to Realize a Global Digital Mathematics Library. In: G.-M. Greuel et al.
    (Eds.). ICMS 2016, LNCS 9725. Springer International Publishing Switzerland, 458–466
    (2016). https://doi.org/10.1007/978-3-319-42432-3_59.
 3. Ion, P.D.F., Watt, S.M.: The Global Digital Mathematics Library and the International
    Mathematical Knowledge Trust. In: ICM 2017: Intelligent Computer Mathematics, 2017.
    Lecture Notes in Artificial Intelligence, vol. 10383, pp. 56–69. Springer (2017),
    https://doi.org/10.1007/978-3-319-62075-6_5.


 4. Elizarov, A.M., Lipachev, E.K., Zuev, D.S.: Digital Mathematical Libraries: Overview of
    Implementations and Content Management Services. CEUR Workshop Proceedings, vol.
    2022, pp. 317–325 (2017).
 5. Chebukov, D.E., Izaak, A.D., Misyurina, O.G., Pupyrev, Yu.A., and Zhizhchenko, A.B.:
    Math-Net.Ru as a Digital Archive of the Russian Mathematical Knowledge from the XIX
    Century to Today. Intelligent Computer Mathematics. LNCS, 7961, 344–348 (2013),
    https://doi.org/10.1007/978-3-642-39320-4_26.
 6. Gartner, R.: Metadata. Shaping Knowledge from Antiquity to the Semantic Web. Springer
    (2016).
 7. Sicilia, M.-A. (Ed.): Handbook of Metadata, Semantics and Ontologies. World Scientific
    Publishing Co. Pte. Ltd. (2014).
 8. Lubas, R., Jackson, A., Schneider, I.: The Metadata Manual. Chandos Publishing (2013).
 9. Alemu, G., Stevens, B.: An Emergent Theory of Digital Library Metadata. Elsevier Ltd.
    (2015).
10. Elizarov, A.M., Khaydarov, Sh.M., Lipachev, E.K.: Scientific Documents Ontologies for
    Semantic Representation of Digital Libraries. 2nd Russia and Pacific Conf. on
    Computer Technology and Applications, pp. 1–5 (2017),
    https://doi.org/10.1109/RPC.2017.8168064.
11. Elizarov, A.M., Lipachev, E.K.: Lobachevskii DML: Towards a Semantic Digital Mathe-
    matical Library of Kazan University. CEUR Workshop Proceedings, vol. 2022, pp. 326–
    333 (2017).
12. Elizarov, A.M., Kirillovich, A.V., Lipachev, E.K., Nevzorova, O.A., Solovyev, V.D., and
    Zhiltsov, N.G.: Mathematical knowledge representation: semantic models and formalisms.
    Lobachevskii J. of Mathematics, 35 (4), 348–354 (2014),
    https://doi.org/10.1134/S1995080214040143.
13. Elizarov, A.M., Kirillovich, A.V., Lipachev, E.K., Nevzorova, O.A.: Mathematical
    Knowledge Management: Ontological Models and Digital Technology. In: CEUR Work-
    shop Proceedings, vol. 1752, pp. 44–50 (2016).
14. Elizarov, A.M., Kirillovich, A.V., Lipachev, E.K., Nevzorova, O.A.: Digital Ecosystem
    OntoMath: Mathematical Knowledge Analytics and Management. In: Communications in
    Computer and Information Science, vol. 70, pp. 33–46. Springer (2017),
    https://doi.org/10.1007/978-3-319-57135-5_3.
15. Chen, J., Chen, H.: A Structured Information Extraction Algorithm for Scientific Papers
    based on Feature Rules Learning. Journal of Software, vol. 8, no. 1, pp. 55–62 (2013),
    https://doi.org/10.4304/jsw.8.1.55-62.
16. Ronzano, F., Saggion, H.: Dr. Inventor Framework: Extracting Structured Information
    from Scientific Publications. In: Japkowicz N., Matwin S. (eds) Discovery Science. Lec-
    ture Notes in Computer Science, vol 9356, Springer, Cham. (2015),
    https://doi.org/10.1007/978-3-319-24282-8_18.
17. Tkaczyk, D., Tarnawski, B. and Bolikowski, Ł.: Structured Affiliations Extraction from
    Scientific Literature. D-Lib Magazine, vol. 21, no. 11/12 (2015),
    https://doi.org/10.1045/november2015-tkaczyk.
18. Elizarov, A.M., Lipachev, E.K., and Khaydarov, S.M.: Automated system of services for
    processing of large collections of scientific documents. CEUR Workshop Proceedings, vol.
    1752, pp. 58–64 (2016).
19. Peroni, S.: Semantic Web Technologies and Legal Scholarly Publishing. Springer Interna-
    tional Publishing, (2014), https://doi.org/10.1007/978-3-319-04777-5.


20. Constantin, A., Peroni, S., Pettifer, S., Shotton, D., Vitali, F.: The Document Components
    Ontology (DoCO). Semantic Web, vol. 7, no. 2, pp. 167–181 (2016),
    https://doi.org/10.3233/SW-150177.
21. Ruiz-Iniesta, A., and Corcho, O.: A review of ontologies for describing scholarly and sci-
    entific documents. CEUR Workshop Proceedings, vol. 1155, pp. 1–12 (2014).
22. Kogalovsky, M.R., Parinov, S.I.: Scholarly Communication in a Semantically Enrichable
    Research Information System with Embedded Taxonomy of Scientific Relationships. In:
    Klinov P., Mouromtsev D. (eds) Knowledge Engineering and Semantic Web. Communica-
    tions in Computer and Information Science, Springer, vol 518, pp. 87–101 (2015),
    https://doi.org/10.1007/978-3-319-24543-0_7.
23. Ingersoll, G.S., Morton, T.S., Farris, A.L.: Taming Text. How to Find, Organize, and
    Manipulate It. Manning Publications Co. (2013).
24. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting
    Data. EMC Education Services (Ed.), Wiley (2015).
25. Standard ECMA-376 Office Open XML File Formats,
    http://www.ecma-international.org/publications/standards/Ecma-376.htm, last accessed
    2019/11/21.