<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Statistical Localization of Bibliographic Descriptions in Unstructured Full-Texts Documents</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Russia FSO Academy</institution>
          ,
          <addr-line>Orel</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The article describes the results of experiments on the automatic localization of bibliographic descriptions (single ones and groups within lists) drawn up according to GOST 7.0.100-2018 or similar standards. The experiments were performed on a set of unstructured full-text Russian-language documents of various styles. The proposed solution is based on several parameters of bibliographic descriptions: the distribution of their lengths (in characters), the frequency of prescribed punctuation characters, and autocorrelation coefficients. Using these features in explicit form to build simple classifiers yielded Recall and F1-score values comparable to those previously obtained with structural recognition methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Bibliographic description</kwd>
        <kwd>Bibliographic data</kwd>
        <kwd>Text mining</kwd>
        <kwd>Named entity recognition</kwd>
        <kwd>Statistical features</kwd>
        <kwd>Natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        One of the topical tasks in Natural Language Processing (NLP) is bibliographic
information extraction, which is used both to identify source documents and to support
bibliographic search. This task is a special case of metadata extraction. Such
functionality is used primarily in systems for the automatic quality assessment of scientific and
educational papers, in peer review, and in bibliometrics and plagiarism
detection applications. Extracting bibliographic information from scientific books, theses and articles
also saves researchers' time during literature acquisition. In
addition, «metadata information within citation also carries immense importance
especially in the domain of Scientometrics» [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Bibliographic information can also support
a variety of other tasks, including article recommendation and citation analysis.
      </p>
      <p>This task can also be characterized as a special case of text mining and entity
extraction within NLP. More precisely, it concerns the extraction of objects
represented by a set of named entities. The process of extracting bibliographic information
consists of three stages: (i) localization of bibliographic data, (ii) determination of the object's
boundaries and type, and (iii) identification of fields together with extraction and interpretation
of the bibliographic data. This paper presents the results of experiments to determine informative
features for the statistical localization of single and group (as part of a list) bibliographic
descriptions in unstructured full-text documents in Russian.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview</title>
      <p>Most Russian-language texts that contain bibliographic references
and descriptions are subject to the standards GOST 7.1-2003, GOST 7.0.5-2008 and
GOST 7.0.100-2018. These are primarily scientific, academic and
educational publications, technical reports, patents, official documents, etc. Other
formats (styles) of bibliographic description are also in use; however,
this article considers only the localization of bibliographic descriptions written in the notation of the
GOST standards. We focused mainly on bibliographic descriptions of books and
articles as the most traditional types of publications.</p>
      <p>
        An analysis of 10 thousand random bibliographic descriptions from the web portal of the Russian State
Library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] showed that 61% of them describe books, 17%
articles, 15% dissertations (theses), 5% electronic documents on optical
discs and 2% other sources.
      </p>
      <p>According to the above standards, bibliographic information is
presented in the form of bibliographic records, bibliographic descriptions (BD) and bibliographic
references (citations). The bibliographic description is the main part of
the bibliographic record and differs from the bibliographic reference (BR) in that it
strictly specifies the sequence and formatting of the fields (elements) containing
bibliographic data. The standards also define, for bibliographic descriptions, the term «prescribed
punctuation»: a set of characters used to separate fields within the BD (see Fig. 1).</p>
      <p>[Fig. 1. Diagram of bibliographic terms. Surviving node labels: Bibliographic List, Bibliographic Information, Citation, Bibliographic Description, Bibliographic Reference, Prescribed Punctuation, Bibliographic Data. Surviving relation labels: is a source of, is a part of, is a variant of, is an item of. Standards named in the diagram: GOST 7.0.100-2018, GOST 7.0.5-2008, GOST 7.0-99.]</p>
      <p>Thus, extracting bibliographic information from texts involves identifying
objects such as bibliographic records, bibliographic descriptions and bibliographic
references, and interpreting their constituent fields, which contain the bibliographic
data. Since a BD may be regarded as an extended version of a BR,
attention should be focused on identifying features that localize this object. It should
also be noted that in full-text documents bibliographic descriptions may occur in
various presentation forms, classified as shown in Fig. 2. The classification
items in the diagram are ordered so that the complexity of identifying the corresponding
BDs increases from right to left. The greatest difficulty is the automatic
recognition of short, single-level, single BDs whose format does not match any known style.
Conversely, multilevel, full, group (two or more in a row)
bibliographic descriptions drawn up according to GOST are the easiest to identify.</p>
      <p>Given this classification, the following aspects affect the problem of identifying BDs in a text:</p>
      <list list-type="order">
        <list-item>
          <p>Bibliographic descriptions may occur in groups (lists) at specific places or singly at arbitrary places within the text.</p>
        </list-item>
        <list-item>
          <p>In general, bibliographic descriptions can be multilingual information objects.</p>
        </list-item>
        <list-item>
          <p>Document texts can be structured or unstructured [<xref ref-type="bibr" rid="ref2">2</xref>]. Structured texts contain
special markup (tags) or dedicated fields for bibliographic descriptions. In
unstructured texts, bibliographic descriptions are not marked in any way.</p>
        </list-item>
        <list-item>
          <p>Bibliographic descriptions may be incomplete (one or several fields are missing) or
corrupted (required fields or markup elements, or their sequence, are
distorted). In addition, references to literature sometimes contain erroneous data, for
example, incorrect page numbers or publisher names. The primary source
of such distortions is human error.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-3">
      <title>Previous and Related Works</title>
      <p>
        A review of the available publications showed that the following approaches to
extracting bibliographic information from Russian-language texts of various kinds have been
implemented in practice:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>The «Antiplagiat» system [<xref ref-type="bibr" rid="ref3">3</xref>] includes a module for extracting bibliographic information.
However, unlike its other modules, no description of its algorithm is publicly available.</p>
        </list-item>
        <list-item>
          <p>The work [<xref ref-type="bibr" rid="ref4">4</xref>] presents a method for extracting bibliographic information from
full-text patent descriptions based on a library of patterns (regular expressions). After
the text has been searched for all templates, the text fragments
selected by each template are integrated. The authors report that 88% of bibliographic
references were extracted correctly and a further 9% partially.</p>
        </list-item>
        <list-item>
          <p>The work [<xref ref-type="bibr" rid="ref5">5</xref>] implements an approach to identifying bibliographic descriptions in a
bibliography based on a modified shingle algorithm. The author notes the low
speed of the developed system and the dependence of its accuracy on the composition
(completeness) of the required bibliographic database.</p>
        </list-item>
        <list-item>
          <p>The article [<xref ref-type="bibr" rid="ref6">6</xref>] presents a toolkit for extracting elements of bibliographic lists from
scientific texts based on the automatic generation of regular expressions. The author
reports that template generation was successful in 76% of cases (on a sample of 100
bibliographic references to electronic resources).</p>
        </list-item>
        <list-item>
          <p>Finally, the work [<xref ref-type="bibr" rid="ref7">7</xref>] discusses the extraction of bibliographic descriptions using regular
expressions. These are used to determine whether a block of text matches the
characteristics of a named entity according to templates that include
special characters. The authors point out that correct and accurate extraction of
bibliographic information requires tens to hundreds of regular expressions,
which significantly slows down text processing. In their opinion, it is advisable
to combine regular expressions with statistical recognition methods that localize
the text fragments where bibliographic information may be contained.</p>
        </list-item>
      </list>
      <p>
        For metadata extraction, various researchers have considered approaches based on
Hidden Markov Models (HMM), Conditional Random Fields (CRF), Support Vector
Machines (SVM), decision trees and neural networks [
        <xref ref-type="bibr" rid="ref1 ref2 ref7">1, 2, 7</xref>
        ]. Some researchers use
heuristic approaches, others prefer machine learning methods, and some studies
combine the advantages of both. In the vast majority of cases, however,
practical systems first parse the document to establish its structure and only then
extract metadata. According to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], various practical systems achieve a
recall of bibliographic information extraction of 57–90% and an F1-score
of 67–93%, using classifiers based on tens or hundreds of features.
Machine learning models are widely used because they adapt to different
document structures and text styles.
      </p>
      <p>
        Note that in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] the authors define a BD as a sequence of named entities. Indeed, some
standardized fields of bibliographic descriptions contain the names of persons or
geographical locations, while others contain only numeric values (see Fig. 3).
      </p>
      <p>[Fig. 3. Fields of a bibliographic description under GOST 7.1-2003, GOST 7.0.5-2008 and GOST 7.0.100-2018. Field names: Header, Title, Responsibility, Edition, Output data, Physical characteristic. Types of field (named entity): Person, Title, Location, Numeric, Alpha-numeric. Example values: «Касаткин, А.С.» and «Касаткин А.С., Немцов М.В.» (Header); «Электромеханика: учебник для вузов» (Title); «А.С. Касаткин, М.В. Немцов» (Responsibility); «6-е изд., перераб.» (Edition); «Москва: Высшая школа 2004» (Output data).]</p>
      <p>
        Recognizing and selecting named entities is a fairly simple task for structured text,
primarily because such text is created according to certain conventions and patterns. In
general, however, we are dealing with poorly structured texts. Therefore, direct recognition of BDs
by structural methods may be preceded by the localization of places in the text where the
presence of a BD is most likely. For this, it is reasonable to use a statistical classifier
based on a compact set of simple features. This paper therefore investigates the main
statistical properties of bibliographic descriptions. In contrast to the previously
presented data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the research here is carried out on a more representative information
base.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Computational Models for Localization of the Bibliographic Descriptions</title>
      <p>Consider a universal alphabet Ω, the set of all characters that can occur
in natural language texts, including the empty character ε. In practice, this can be the
Unicode character set. Let Σ be a set of prescribed punctuation characters (PPC) of
size |Σ|. In addition, let A = a<sub>0</sub>a<sub>1</sub>…a<sub>n−1</sub> be a string of characters over Ω of
length n = |A|.</p>
      <p>The statistical features of bibliographic descriptions considered below rest on
the following assumptions. In the general case, a BD is a text fragment of limited
length, composed of an arbitrary sequence of alphanumeric substrings
(fields of bibliographic data) separated by characters from the set Σ. In addition to the
usual grammatical punctuation marks, PPCs occur between the BD fields. Therefore,
prescribed punctuation characters should occur more frequently within a BD
than in the text as a whole, and bibliographic
descriptions can be localized using a sliding window. The first localization
model is specified by three parameters, M<sub>1</sub> = &lt;S, O, T<sub>PPC</sub>&gt;, where S is the window size,
O is the offset of each window relative to the previous one, and T<sub>PPC</sub> is the threshold
on the relative frequency of characters from the PPC set Σ. The model fits a Bayesian classification
scheme.</p>
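      <p>The sliding-window model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names <monospace>ppc_frequency</monospace> and <monospace>localize</monospace> are hypothetical, the PPC set is assumed to consist of the six characters listed for the logistic classifiers in the experiments, and the defaults are the parameter values &lt;256, 128, 0.045&gt; reported there.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Set, Tuple

# Assumed PPC set: comma, dot, colon, semicolon, hyphen, forward slash.
PPC: Set[str] = set(",.:;-/")


def ppc_frequency(window: str, ppc: Set[str] = PPC) -> float:
    """Relative frequency of prescribed punctuation characters in a window."""
    if not window:
        return 0.0
    return sum(ch in ppc for ch in window) / len(window)


def localize(
    text: str, size: int = 256, offset: int = 128, t_ppc: float = 0.045
) -> List[Tuple[int, int, float]]:
    """Return (start, end, frequency) for each window that reaches the threshold.

    Windows start at 0, offset, 2 * offset, ...; a tail shorter than one
    offset step is not covered by an extra window in this sketch.
    """
    hits = []
    last_start = max(len(text) - size, 0)
    for start in range(0, last_start + 1, offset):
        window = text[start:start + size]
        freq = ppc_frequency(window)
        if freq >= t_ppc:
            hits.append((start, start + len(window), freq))
    return hits
```
]]></preformat>
      <p>Overlapping windows that exceed the threshold can then be merged into candidate regions and passed to a structural recognizer.</p>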
      <p>Furthermore, because BDs are structured entities, the positions of prescribed
punctuation characters in them may show cyclic or periodic properties.
The corresponding indicators may therefore take increased values in the areas
of the text where BDs are located. We consider autocorrelation as such an indicator.</p>
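      <p>Let φ: Ω → {ε, 0, 1} be a homomorphism such that, for a character x from Ω,
        <disp-formula><label>(1)</label><tex-math><![CDATA[\varphi(x) = \begin{cases} 1, & x \in \Sigma, \\ \varepsilon, & x = \varepsilon, \\ 0, & \text{otherwise}. \end{cases}]]></tex-math></disp-formula>
      </p>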
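      <p>The periodic autocorrelation of a string A of length n is defined by
        <disp-formula><label>(2)</label><tex-math><![CDATA[c_k(A) = \sum_{i=0}^{n-1} \varphi(a_i)\,\varphi(a_{i+k}), \quad k \in \mathbb{N},\ 0 \le k \le \lfloor n/2 \rfloor,]]></tex-math></disp-formula>
where, the autocorrelation being periodic, the index i + k is taken modulo n.</p>
      <p>The averaged autocorrelation coefficient (AACC) is defined by
        <disp-formula><label>(3)</label><tex-math><![CDATA[\mathrm{AACC}_K(A) = \frac{1}{K} \sum_{k=0}^{K} c_k(A), \quad 0 \le K \le \lfloor n/2 \rfloor.]]></tex-math></disp-formula>
Accordingly, the second localization model is also specified by three parameters,
M<sub>2</sub> = &lt;S, O, T<sub>AC</sub>&gt;, where S is the window size, O is the offset of each window
relative to the previous one, and T<sub>AC</sub> is the threshold value of the AACC. It also fits
the Bayesian classification scheme.</p>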
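      <p>The autocorrelation feature can be illustrated by the following Python sketch. It rests on several assumptions that the text does not settle: the PPC set is taken to be the six characters used for the logistic classifiers, the index i + k wraps around modulo n (periodic autocorrelation), the averaged coefficient is the sum of c<sub>k</sub> over shifts 0…K divided by K, and the optional division by the window length is an added normalization. All function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Set

# Assumed PPC set: comma, dot, colon, semicolon, hyphen, forward slash.
PPC: Set[str] = set(",.:;-/")


def phi(text: str, ppc: Set[str] = PPC) -> List[int]:
    """Map each character to 1 if it is a PPC and to 0 otherwise."""
    return [1 if ch in ppc else 0 for ch in text]


def autocorrelation(bits: List[int], k: int) -> int:
    """Periodic autocorrelation c_k: the index i + k wraps around modulo n."""
    n = len(bits)
    return sum(bits[i] * bits[(i + k) % n] for i in range(n))


def aacc(text: str, max_shift: int = 30, normalise: bool = False) -> float:
    """Sum c_k over shifts 0..max_shift and divide by max_shift.

    normalise=True additionally divides by the window length; this is an
    assumption of the sketch, not something stated in the text.
    """
    bits = phi(text)
    n = len(bits)
    if not 1 <= max_shift <= n // 2:
        raise ValueError("max_shift must lie between 1 and n // 2")
    total = sum(autocorrelation(bits, k) for k in range(max_shift + 1))
    value = total / max_shift
    return value / n if normalise else value
```
]]></preformat>
      <p>Each call to <monospace>autocorrelation</monospace> costs O(n), so a window costs O(nK) overall, which is why the maximum shift K is worth tuning.</p>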
    </sec>
    <sec id="sec-5">
      <title>Experiments and Evaluation</title>
      <sec id="sec-5-1">
        <title>Datasets Overview</title>
        <p>
          The following datasets were used for the experiments. A sample of about 350,000
bibliographic descriptions was obtained from marked-up lists of references to texts posted
on the websites of the Russian State Library [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], the National Electronic Library [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
and the Scientific Electronic Library [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The source texts fall into three
categories: books (monographs), articles and theses (with auto-abstracts), in the
proportions 50%, 28% and 22%, respectively. Bibliographic descriptions were extracted
using the Selenium, PhantomJS and Scrapy tools from Python.
        </p>
        <p>
          A sample of full-text documents was taken from the Russian National Corpus posted
on the Internet and marked up by experts [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The sample comprises 2000 texts in five
functional styles (scientific, official-business, journalistic, spoken and literary). The
total length of the sample texts exceeds 20 million characters.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Parameter Estimation</title>
        <p>The Statistica 10 analytical software package was used to determine the parameters of
the distributions. A Kolmogorov-Smirnov test at a confidence level of
0.95 showed that the distribution of BD lengths follows a lognormal law with different
parameters for articles, books and theses (see Fig. 4).</p>
        <p>As Fig. 4 shows, articles and books mostly use short BDs (mean lengths of 220 and 244
characters, respectively), while theses often use extended and full BDs (mean length of
437 characters). As a result, the mathematical expectation and variance of BD lengths
are noticeably higher for theses. For convenience and to optimize further
calculations, the width of the BD localization window is set to 256 (2<sup>8</sup>) characters, a value close to the average
lengths for books and articles.</p>
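      <p>The fitted length distributions can be checked numerically with a short Python sketch. It assumes that the two numbers in each LN(·; ·) entry of the Fig. 4 legend are the parameters μ and σ of the logarithm of the length; the function names are hypothetical. Under that reading, the quoted lengths of 220, 244 and 437 characters lie close to exp(μ), which is the median of a lognormal law, while the arithmetic mean exp(μ + σ²/2) is somewhat larger.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

# Values from the Fig. 4 legend; reading them as (mu, sigma) of ln(length)
# is an assumption of this sketch.
LENGTH_MODELS = {
    "articles": (5.39, 0.27),
    "books": (5.50, 0.38),
    "theses": (6.08, 0.51),
}


def lognormal_median(mu: float, sigma: float) -> float:
    """Median of a lognormal law: exp(mu)."""
    return math.exp(mu)


def lognormal_mean(mu: float, sigma: float) -> float:
    """Arithmetic mean of a lognormal law: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


def lognormal_cdf(x: float, mu: float, sigma: float) -> float:
    """Probability that a description is at most x characters long."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))


if __name__ == "__main__":
    for name, (mu, sigma) in LENGTH_MODELS.items():
        share = lognormal_cdf(256, mu, sigma)
        print(name, lognormal_median(mu, sigma), lognormal_mean(mu, sigma), share)
```
]]></preformat>
      <p>The last printed column estimates the share of descriptions that fit entirely into one 256-character window under each fitted law.</p>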
        <p>Next, we tested the assumption that the frequencies of prescribed punctuation
characters (according to GOST 7.0.100-2018) within
bibliographic descriptions differ from those in the text as a whole.</p>
        <p>[Fig. 4. Probability density functions of the length of bibliographic descriptions (in characters; horizontal axis from 0 to 1000). Fitted lognormal distributions: Articles, LN(5.39; 0.27); Books, LN(5.50; 0.38); Theses, LN(6.08; 0.51).]</p>
        <p>Table 1 summarizes the occurrence frequencies of each prescribed
punctuation character (PPC) in texts and within bibliographic descriptions.
Since the use of PPCs in texts and in BDs differs (significantly for some characters),
the relative frequencies of these characters can serve as predictors in a logistic
regression model for classifying text blocks. Using a sliding
window 256 characters wide with an offset of 128 characters, blocks of
two types were selected from the initial dataset: blocks containing no BDs (text blocks) and blocks containing BDs
(bibliographic blocks). The parameters of the frequency distributions of the full
set of PPCs were then determined (see Fig. 5).</p>
        <p>[Fig. 5. Distributions of the relative frequency of the full set of prescribed punctuation characters, % (horizontal axis from 5 to 40).]</p>
        <p>Averaging the data over blocks of texts of all styles gave a threshold on the
relative frequency of the full set of PPCs in bibliographic descriptions of
T<sub>PPC</sub> = 0.045. At this threshold, the probabilities of Type I errors (α, «false positive»)
and Type II errors (β, «false negative») are 0.0122 and 0.0054, respectively (see Fig. 6).
The first classifier, with parameters M<sub>1</sub> = &lt;256, 128, 0.045&gt;, was evaluated
on samples of 70,000 text blocks of various styles and 150,000 bibliographic blocks
obtained with the sliding window. The naive Bayesian classifier at
this threshold identifies nearly all BD locations but
also captures some other syntactic constructs: addresses, lists of surnames and given
names, tabular data, mathematical and formal expressions, and legislative acts.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Autocorrelation Features</title>
        <p>Similarly, the parameters for the Bayesian model of recognition of bibliographic blocks
were determined based on the value of the average autocorrelation coefficient (AACC)
calculated on a string with a length of 256 characters (see Fig. 7).</p>
        <p>[Fig. 7. Probability density functions (PDF) of the AACC with fitted normal distributions N(0.024, 0.017) and N(0.14, 0.041).]</p>
        <p>Since calculating the autocorrelation is computationally expensive, we
optimized the maximum index of the autocorrelation coefficient used
in the calculation. The sum of the classifier errors reaches its minimum
at the shift K = 30 (see Fig. 8).
In general, the proposed method of using autocorrelation properties to
localize bibliographic references can, with some refinement, be used to visualize the
structure of documents (see Fig. 9).</p>
        <p>[Fig. 9. Visualization of document structure; labelled regions: Single BD, Bibliographic List.]</p>
        <p>In addition, the Fast Fourier transform (FFT) was applied to autocorrelation
sequences of 128 symbols, and the shape of the resulting power spectrum was examined. The
experiment was carried out on a sample of 45,000 blocks: 15,000 plain-text blocks, 15,000
blocks with mathematical expressions selected from physics, chemistry
and mathematics textbooks, and 15,000 bibliographic blocks. Averaging the
spectra gave an aggregate result that shows significant differences in
spectral characteristics between these types of content (see Fig. 10).</p>
        <p>[Fig. 10. Averaged power spectra for three types of content: Math expressions, Bibliographic descriptions, Plain text (horizontal axis from 1 to 61).]</p>
        <p>As a first approximation, the statistical features of bibliographic descriptions considered above
were used as features (in their original form,
without any transformations) to construct simple classifiers that assign text
fragments the width of a sliding window to one of two classes, text blocks or bibliographic
blocks. This makes it possible to localize bibliographic descriptions in an
unstructured full-text document. The first two classifiers follow a naive
Bayesian approach: the decision is made by comparing a single
feature value with a threshold.</p>
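        <p>The spectral comparison of autocorrelation sequences described earlier in this subsection can be illustrated as follows. The sketch uses a direct discrete Fourier transform in place of an FFT library call (both give the same values; the direct form is merely slower), and the scaling of the power by the sequence length is an assumption. The function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import cmath
from typing import List, Sequence


def power_spectrum(seq: Sequence[float]) -> List[float]:
    """Power |X_f|^2 / n for frequencies f = 0 .. n // 2.

    A direct O(n^2) discrete Fourier transform, used here instead of an FFT;
    dividing by n is a scaling choice made for this sketch.
    """
    n = len(seq)
    if n == 0:
        return []
    spectrum = []
    for f in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(seq)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def mean_spectrum(sequences: Sequence[Sequence[float]]) -> List[float]:
    """Average the spectra of equally long sequences of one content type."""
    spectra = [power_spectrum(s) for s in sequences]
    return [sum(column) / len(spectra) for column in zip(*spectra)]
```
]]></preformat>
        <p>Applying <monospace>mean_spectrum</monospace> separately to sequences from plain-text, mathematical and bibliographic blocks gives curves of the kind compared in Fig. 10.</p>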
        <p>In addition, two binary logistic classifiers (BLC) were built with a decision threshold
equal to zero (the two classes are −1 for a text block and +1 for a bibliographic block). The
first classifier is based on the normalized relative frequencies of the six most commonly
used prescribed punctuation characters: comma (x<sub>1</sub>), dot (x<sub>2</sub>), colon (x<sub>3</sub>), semicolon (x<sub>4</sub>),
hyphen (x<sub>5</sub>) and forward slash (x<sub>6</sub>). All six factors are significant.
          <disp-formula><label>(4)</label><tex-math><![CDATA[f_1(x) = -0.94 - 9.40 x_1 + 18.14 x_2 + 22.48 x_3 + 9.55 x_4 + 3.58 x_5 + 62.94 x_6]]></tex-math></disp-formula>
        </p>
        <p>The second BLC is based on the same set of features plus a seventh feature,
the normalized value of the averaged autocorrelation coefficient AACC<sub>30</sub> (x<sub>7</sub>).
All seven factors are significant.
          <disp-formula><label>(5)</label><tex-math><![CDATA[f_2(x) = -1.06 - 9.67 x_1 + 15.49 x_2 + 19.53 x_3 + 9.84 x_4 + 2.25 x_5 + 47.02 x_6 + 4.03 x_7]]></tex-math></disp-formula>
        </p>
        <p>For experimental verification of the presented solutions, 500 full-text documents,
one hundred texts of each style, were annotated with the help of volunteers in addition to the described
initial data. Each document contained a list of references
with 80–150 bibliographic descriptions; in addition, 121 documents contained a single BD,
230 contained paired BDs, and 149 contained both single and paired BDs in different parts
of the text.</p>
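        <p>The two logistic decision rules of equations (4) and (5) can be evaluated with the following Python sketch. The coefficients are copied from those equations. Plain relative frequencies stand in for the «normalized» frequencies, because the normalization is not specified in the text; this substitution and all function names are assumptions of the sketch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence, Tuple

# Order of the features x1..x6: comma, dot, colon, semicolon, hyphen, slash.
PPC_ORDER = (",", ".", ":", ";", "-", "/")

# Intercepts and weights copied from equations (4) and (5).
F1_INTERCEPT, F1_WEIGHTS = -0.94, (-9.40, 18.14, 22.48, 9.55, 3.58, 62.94)
F2_INTERCEPT, F2_WEIGHTS = -1.06, (-9.67, 15.49, 19.53, 9.84, 2.25, 47.02, 4.03)


def ppc_features(block: str) -> Tuple[float, ...]:
    """Relative frequency of each of the six characters in the block.

    Assumption: no further normalization is applied.
    """
    n = len(block) or 1
    return tuple(block.count(ch) / n for ch in PPC_ORDER)


def linear_score(intercept: float, weights: Sequence[float], x: Sequence[float]) -> float:
    """Intercept plus the weighted sum of the features."""
    if len(weights) != len(x):
        raise ValueError("feature vector has the wrong length")
    return intercept + sum(w * xi for w, xi in zip(weights, x))


def classify_f1(block: str) -> int:
    """+1 (bibliographic block) if the score exceeds zero, otherwise -1."""
    score = linear_score(F1_INTERCEPT, F1_WEIGHTS, ppc_features(block))
    return 1 if score > 0 else -1


def classify_f2(block: str, aacc30: float) -> int:
    """Same rule with a precomputed AACC value as the seventh feature."""
    x = ppc_features(block) + (aacc30,)
    score = linear_score(F2_INTERCEPT, F2_WEIGHTS, x)
    return 1 if score > 0 else -1
```
]]></preformat>
        <p>The forward slash carries by far the largest weight in both rules, so even a single slash in a window shifts the score noticeably towards the bibliographic class.</p>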
        <p>The classifiers were compared and evaluated using
Precision, Recall and F1-score. The test results are summarized in Table 2.
The obtained values of the AUC-ROC indicator can be described
as good, which shows that the statistical indicators considered are promising
classification features for extracting bibliographic information from
full-text documents. The obtained values of Recall, Precision and F1-score are comparable
with the results obtained in the above studies using structural
recognition methods. At the same time, the BD localization model is compact and relies on easily
computed features.</p>
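        <p>For reference, the block-level metrics named above can be computed as in the following sketch. It assumes that each block carries a true and a predicted label, +1 for a bibliographic block and −1 for a text block; the function name is hypothetical and the labels in the test are made up for illustration only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, Recall and F1-score for the positive (bibliographic) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```
]]></preformat>
        <p>Because localization is only the first stage, a high Recall matters most here: false positives can still be filtered out by the structural recognition that follows.</p>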
        <p>Specialized software was developed to implement the stages of the presented study. It
allows the model parameters (width of the sliding
window, parameters for calculating autocorrelation, character sets, etc.) to be varied and iterated over
specified ranges. The user can also visually inspect the results of BD
localization, that is, the text fragments selected from the input documents (see Fig. 11).</p>
        <p>The results support the following conclusions:</p>
        <list list-type="bullet">
          <list-item>
            <p>The lognormal law describes the probability distributions of the lengths of bibliographic
descriptions, as well as of the relative frequencies of the prescribed punctuation character
set.</p>
          </list-item>
          <list-item>
            <p>A large number of false positives were identified for scientific-style texts, which is
explained by an abundance of mathematical expressions with punctuation marks.
Therefore, the relative frequency of PPCs is a poor predictor for long (full) BDs,
which are typical for dissertations. It is not advisable to use this feature in its original
form.</p>
          </list-item>
          <list-item>
            <p>The Recall, Precision and F1-score values of the classifiers built on
the studied statistical features of BDs are comparable with the results
achieved in previous studies using structural recognition methods.</p>
          </list-item>
          <list-item>
            <p>It seems promising to increase the Recall of the presented
compact models so that the necessary precision can be achieved at the next stage of
processing, the application of regular expressions to the selected text fragments.</p>
          </list-item>
          <list-item>
            <p>Experimental data indicate that the studied statistical features and models are
applicable not only to texts in Russian, but also to texts in other alphabetic languages.</p>
          </list-item>
          <list-item>
            <p>There remains some potential for improving performance within the
framework of the presented models by varying the sliding window size, changing
the way the original text string is encoded into a binary sequence, changing the character sets, and
using more advanced metrics. It is also worth considering the use of
neural networks to solve the stated problem.</p>
          </list-item>
          <list-item>
            <p>The presented results are preliminary and intermediate and will be refined further.</p>
          </list-item>
        </list>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nasar</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaffry</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          :
          <article-title>Information extraction from scientific articles: a survey</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>117</volume>
          ,
          <fpage>1931</fpage>
          -
          <lpage>1990</lpage>
          . Springer, Heidelberg (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chenet</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Identify and extract entities from bibliography references in a free text</article-title>
          .
          <source>Master thesis</source>
          . University of Twente, Enschede (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Antiplagiat service Homepage, www.antiplagiat.ru, last accessed
          <year>2020</year>
          /09/21.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zatsman</surname>
            ,
            <given-names>I.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Havanskov</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shubnikov</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          :
          <article-title>Method of bibliographic information extraction from full-text descriptions of inventions</article-title>
          .
          <source>Informatics and Applications</source>
          ,
          <volume>7</volume>
          (
          <issue>4</issue>
          ),
          <fpage>52</fpage>
          -
          <lpage>65</lpage>
          . Russian Academy of Sciences, Moscow (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lutsenko</surname>
            ,
            <given-names>E.V.</given-names>
          </string-name>
          :
          <article-title>The application of ASC-analysis and "AIDOS" intelligent system to solve, in general, the problem of identifying the sources and authors of the standard, nonstandard and incorrect bibliographic descriptions</article-title>
          , http://ej.kubagro.ru/2014/09/pdf/32.pdf, last accessed
          <year>2020</year>
          /09/21.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sokolova</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          :
          <article-title>An extraction of the elements from bibliography based on automatically generated regular expressions</article-title>
          .
          <source>In: Proceedings of the All-Russian conference with international participation «Information and telecommunication technologies and mathematical modeling of high-tech systems»</source>
          ,
          <fpage>313</fpage>
          -
          <lpage>316</lpage>
          . Peoples' Friendship University of Russia, Moscow (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kolmogortsev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saraev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Extracting bibliography from texts with regular expressions</article-title>
          .
          <source>New Information Technologies in Automated Systems</source>
          ,
          <volume>20</volume>
          ,
          <fpage>82</fpage>
          -
          <lpage>88</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tkaczyk</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szostek</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedoryszak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          et al.:
          <article-title>CERMINE: automatic extraction of structured metadata from scientific literature</article-title>
          .
          <source>International Journal on Document Analysis and Recognition (IJDAR)</source>
          ,
          <volume>18</volume>
          ,
          <fpage>317</fpage>
          -
          <lpage>335</lpage>
          . Springer, Heidelberg (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Graschenko</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherkasov</surname>
            ,
            <given-names>N.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuzmin</surname>
            ,
            <given-names>N.S.</given-names>
          </string-name>
          :
          <article-title>Experience of automatic localization of bibliographic descriptions in Russian-language texts</article-title>
          .
          <source>New Information Technologies in Automated Systems</source>
          ,
          <volume>22</volume>
          ,
          <fpage>192</fpage>
          -
          <lpage>198</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Russian State Library Homepage, www.rsl.ru, last accessed
          <year>2020</year>
          /09/21.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. National Electronic Library Homepage, http://rusneb.ru, last accessed
          <year>2020</year>
          /09/21.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Scientific Electronic Library Homepage, www.elibrary.ru, last accessed
          <year>2020</year>
          /09/21.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Russian National Corpus, http://www.ruscorpora.ru/new/, last accessed
          <year>2020</year>
          /09/21.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>