<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Search Query Extension Semantics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>imir S</string-name>
        </contrib>
      </contrib-group>
      <fpage>325</fpage>
      <lpage>339</lpage>
      <abstract>
<p>The problem of extracting the most complete information from a semantic library by taking related documents into account is considered. Expert knowledge embedded in the subject area can be made available when the user obtains additional information from linked documents. A feature of the approach is the use of a shallow neural network algorithm to expand the search query in mathematical subject areas, where expert knowledge presupposes a significant scientific background of users. This problem can be addressed by means of semantic analysis in the knowledge space using machine learning algorithms. The paper investigates the construction of a vector representation of documents based on paragraphs, applied to the data array of the digital semantic library LibMeta. Each piece of text is labelled; both the whole document and its separate parts can be marked. Since the problem being solved was the enrichment of user queries with synonyms, an “indexing first, then training” approach was used when building the search model in conjunction with word2vec algorithms, in order to cover more information and give more accurate results.</p>
      </abstract>
      <kwd-group>
        <kwd>Search Model</kwd>
        <kwd>Word2vec</kwd>
        <kwd>Synonyms</kwd>
        <kwd>Query</kwd>
        <kwd>Query Extension</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Research on expanding a query to obtain the most complete
coverage of information has a long history [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7 ref8">1-8</xref>
        ]. The problem is directly related to
the user's understanding of the search subject, that is, to the level of competence of the
user and to the ability of the information retrieval system to draw on expert knowledge.
Ideally, query enhancement and refinement functionality presupposes an up-to-date
data and knowledge base and the ability to reformulate the original
query in order to improve the search result.
      </p>
      <p>
        Many approaches in this area have been developed with the advent of artificial
intelligence algorithms and the corresponding programming tools [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The first expert
system to use a query refinement technique, Dendral [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] was developed in 1965 for
the analysis of chemical compounds. Another example, a system based on medical
expertise, was MYCIN [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], presented to the scientific community in 1972. During the
dialogue, MYCIN offered options for diagnosing and further examining the
patient. Using about 500 inference rules, MYCIN performed at about the same level
of competence as blood-infection specialists and better than general practitioners.
      </p>
      <p>
        The next stage in introducing artificial intelligence into knowledge systems is
associated with the use of neural network algorithms [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Although the ideas of creating
mathematical models based on the functioning of biological neural networks have
been developing since 1943 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], their practical implementation gained popularity
with the accumulation of digitized data, that is, already in the 21st century. Some
researchers have described this as a new era for artificial intelligence, which had been
partially forgotten for a time. Search algorithms began to learn [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] from the accumulated
queries, storing the most frequent of them together with the corresponding answers. All
of this increased the response speed of search services and supported the
development of targeted suggestions and user hints.
      </p>
      <p>
        More complex links and structures are embedded in scientific libraries, which is
dictated by the logic of subject areas and requires more careful processing of links to
provide users with advanced query capabilities [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. One such subject area is
mathematics, where it is of interest to study and extend the mathematical encyclopedia
and to identify previously unaccounted-for semantic relationships between concepts
and formulas.
      </p>
      <p>
        This work is devoted to the use of shallow neural network algorithms [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to
expand the search query in mathematical subject areas based on the LibMeta [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
library, presented in the form of an ontology, and is a continuation of the authors'
research in this direction [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22 ref23 ref24">19-24</xref>
        ]. The description of the subject area is terminologically
limited to the terms of the mathematical encyclopedia [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. The text corpus
consists of mathematical articles, some of which are supplied with codes from the
MSC (https://msc2020.org/) and UDC
(https://teacode.com/online/udc/) thematic classifiers and follow a common structure.
      </p>
      <p>
        The LibMeta resources include a thesaurus on ordinary differential equations
(ODE) and dictionaries of special functions of the equations of mathematical physics. All
dictionaries are semantically linked to a mathematical encyclopedia [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. These
resources are used to analyze semantic relationships.
      </p>
      <p>
        This paper presents a search model (part 2); outlines a technique based on
algorithms for the vector representation of texts [
        <xref ref-type="bibr" rid="ref26 ref27 ref28 ref29">26-29</xref>
        ] (part 3); shows how the
search model is applied to add synonyms to a search query (part 4); and gives
examples of search query extension (part 5), which demonstrate how the model
improves search results, provide estimates of the recall and precision of the
algorithm, and illustrate the process of ranking documents.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Search Model</title>
      <p>The construction of the search model in LibMeta is based on three main points,
namely:
– converting documents to a searchable format;
– presenting queries in a format that expresses the user's information
needs;
– assessing how well a document matches the query.
In our case, to prepare the documents, the full texts were preprocessed
to remove the publisher's markup and to extract the main parts of the text.
A full-text index of the documents was then created, which makes it possible to
load and store the data efficiently and to provide quick access to it. Queries are written
in natural language and can be enriched with synonyms by the system. The assessment
of how well a document matches a query is subjective and depends on the
method used.</p>
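The first two points, converting documents into a searchable format and providing fast access to them, can be illustrated by a minimal inverted index. This is a toy Python sketch for illustration only, not the LibMeta implementation; the document texts are invented:

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every word of the query."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "The Cauchy problem for thermoelasticity equations",
    2: "A boundary value problem on the line",
}
index = build_index(docs)
```

Here `search(index, "Cauchy problem")` returns only document 1, while `search(index, "problem")` returns both, which is the behavior a query-expansion layer later builds on.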
      <p>
        One of the most commonly used models for representing documents and queries is
the vector space model [
        <xref ref-type="bibr" rid="ref26 ref27 ref28 ref29">26-29</xref>
        ]. In this model, both the query and the document are represented by vectors, and the
distance between them is measured to estimate the degree of closeness between the
document and the query.
      </p>
      <p>
        In vector notation, each word is associated with a weight, which can be calculated
in different ways. One of the most commonly used algorithms is the TF-IDF [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]
algorithm, whose main idea is that the more often a word appears in a
document, the more important it is, while the more common a word is across the
corpus of documents, the less important it is. Another common model is the
probabilistic model, based on an estimate of the likelihood that a document is
relevant to a particular query. One of the popular scoring algorithms in this model is
Okapi BM25 [
        <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
        ].
      </p>
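As an illustration of the TF-IDF idea just described, here is a toy Python implementation under the standard definitions; it is not the scoring code used in LibMeta, and the corpus below is invented:

```python
import math

def tf_idf(term, doc, corpus):
    """Toy TF-IDF weight: term frequency in `doc` times a smoothed
    inverse document frequency of `term` across `corpus` (a list of
    tokenized documents)."""
    tf = doc.count(term) / len(doc)
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + n_containing)) + 1
    return tf * idf

corpus = [
    "the cauchy problem for thermoelasticity equations".split(),
    "a boundary value problem on the line".split(),
    "compressed full text indexes".split(),
]
# A rare word ("cauchy") receives a higher weight than a common one ("the").
w_cauchy = tf_idf("cauchy", corpus[0], corpus)
w_the = tf_idf("the", corpus[0], corpus)
```

The smoothing constants vary between implementations; the point is only the trade-off stated in the text: frequent in the document raises the weight, frequent in the corpus lowers it.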
      <p>
        The main problem of any search model is to provide results relevant to
the user's information needs, from query analysis to the ranking of search results. This
work is devoted to ways of addressing this problem. One modern approach is to
use neural networks for text processing, since text is an example of data that can be
decomposed into smaller structures such as paragraphs, sentences, and words,
depending on the text. This approach makes it possible to capture the semantics of the
text, since closely related words or text fragments occur in the same contexts and lie
close together in the vector space. The search model used in this work is based on the
vector representation of words and documents built using the word2vec [
        <xref ref-type="bibr" rid="ref27 ref28 ref29">27-29</xref>
        ] neural
network algorithm [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>Integration of a neural network with an index can be done in the following ways:
– first train on the corpus of texts, then index the texts and use both in
search;
– index first, then train on the indexed data and use both in search;
– first train, then have the trained network extract or create useful resources,
and then index all the resources, both new and original.</p>
      <p>Since we were solving the problem of enriching user queries with synonyms, in the
LibMeta system we used the “indexing first, then training” approach, which, on the one
hand, provides more, and more accurate, results based on extended queries. On the
other hand, using the extended version of word2vec in conjunction with the LibMeta
search engine makes it possible to give users smarter recommendations based on
the documents found. This way of combining the search index and engine with a
neural network yields relevant models and ranking functions that adapt well to the
underlying data. The version of the model built on the LibMeta search index using
word2vec algorithms will hereinafter be abbreviated as wsgMath.</p>
      <p>Figure 1 schematically illustrates the operation of a search based on a neural
network: it receives a query string as input and returns synonyms for the query
using the model built by word2vec. Alternatively, the vector representation of a
document can be given as input; using the constructed model, the system then returns
recommendations in the form of a list of similar documents.</p>
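The synonym step sketched in Figure 1 can be illustrated as follows, assuming word vectors are already available. The hand-made vector table below is hypothetical; in the system the vectors come from a word2vec model trained on the indexed texts:

```python
import math

# Hypothetical toy word vectors; in practice they are produced by
# training word2vec on the indexed corpus.
VECTORS = {
    "problem":    [0.9, 0.1, 0.0],
    "equation":   [0.8, 0.2, 0.1],
    "inequality": [0.7, 0.3, 0.1],
    "matrix":     [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def synonyms(word, top_n=2):
    """Return the top_n words closest to `word` by cosine similarity."""
    others = [(w, cosine(VECTORS[word], v))
              for w, v in VECTORS.items() if w != word]
    others.sort(key=lambda p: p[1], reverse=True)
    return [w for w, _ in others[:top_n]]
```

With these toy vectors, `synonyms("problem")` returns "equation" and "inequality" ahead of "matrix", mirroring the high-proximity pair (problem, equation) discussed in Section 4.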
    </sec>
    <sec id="sec-3">
      <title>Vector Representation of Documents</title>
      <p>
        Studies show [
        <xref ref-type="bibr" rid="ref26 ref27 ref28 ref29">26-29</xref>
        ] that vector representations of text are well suited to capturing
the semantics of words, but the meaning and deeper semantics of text
documents depend on more than the meanings of individual words. For this reason, the
semantics of phrases and longer text fragments must also be studied.
      </p>
      <p>For convenience, we will use the term “paragraph” to denote not only a paragraph
as such, but also a fragment of a paragraph or several phrases from the text. In
our field, given the specific structure of a mathematical text, these can also be
theorems, lemmas, etc.</p>
      <p>Note that the term “important” fragment will also be used. In scientific texts, this is
an abstract, introduction, conclusion, theorem, etc. The term is introduced because
these elements of a scientific publication will be used as the defining ones for
documents belonging to a certain subject area.</p>
      <p>
        The research content consists of the resources of the LibMeta digital library [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], where,
along with the accumulated original thesauri and dictionaries (for special functions,
ordinary differential equations, mixed equations of mathematical physics), a
mathematical library is integrated [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        Therefore, to construct wsgMath with paragraph context taken into account,
we used an extension of the original word2vec algorithm, known as the
doc2vec algorithm [
        <xref ref-type="bibr" rid="ref29 ref33">29, 33</xref>
        ]. For this, one more component is added to the
vector during training. Thus, when training the “vectors of the words w”, the
“document vector d” is also trained, and upon completion of training we obtain a vector
representation of the document. As a result of processing the original content, a
representation of documents as a set of “related contents” was obtained. “Related
content” is a semantically similar article linked to articles from the mathematical
encyclopedia and the thesauri.
      </p>
      <p>The procedure for identifying such content is used to offer the user
semantically related documents. It is essential that, without the algorithm for
identifying related content, such documents would not appear in the search
results for a query, since they may not contain keywords from the query or may not be
directly related to the given subject area in other terms.</p>
      <p>The peculiarity of common search models, such as the vector space model with
TF-IDF, is that they take into account only individual terms. This approach does not
always lead to optimal results, because contextual information is discarded. The
context of a word is understood as the N words in the text before the word for which the
vector is constructed and the N words after it. In contrast to the TF-IDF model, the
individual elements of the vector are not interpretable; instead, the distance between
vectors is examined and interpreted as the semantic proximity of words.</p>
      <p>The proximity of texts is measured on the basis of their vector representations.
Using the search index and the vector representation of documents together leverages
the ability of these representations to capture the semantics of text when building
search models that are well adapted to the data.</p>
      <p>The main metrics for measuring the proximity of texts are cosine distance and
Euclidean distance, which are used to capture semantically similar words, sentences,
paragraphs, etc.</p>
    </sec>
    <sec id="sec-4">
      <title>Revealing Synonyms</title>
      <p>The analysis of mathematical texts is conventionally divided into the analysis of the
mathematical text itself as a whole, the analysis of formulas as a “separate language”
for representing mathematical knowledge, and the establishment of semantic
links between the text and the formulas. Below, only the analysis of the mathematical
text as a whole is considered.</p>
      <p>To extract synonyms for query terms from the constructed model, lexico-grammatical
patterns were used; these are one of the recognized methods for
extracting relations from text [34-38]. Based on the idea of using such patterns, we
investigated the task of extracting synonyms of concepts and of extracting or
constructing simple patterns from them to identify relations.</p>
      <p>The implementation of the model consists in applying an iterative research
algorithm, hereinafter called iraWsgMath. Its main stages are listed below:
</p>
      <p>allocation of synonyms of terms
As an example, we will demonstrate the query “Cauchy problem” (For the
convenience of the reader, the examples have been translated into English, but the work was
done for texts in Russian. In Russian the considering term is “задача Коши”), which
consists of two words, “problem” and “Cauchy”, each of which has its own
synonyms, which are presented in Table 1.</p>
      <p>The third column presents the query context as a single unit, Cauchy problem (задача
Коши). When its synonyms are extracted, it can be seen that the list includes the
adjective “boundary”, which also occurs among the synonyms of the individual words in
the first two columns.</p>
      <p>In this case, the term “Cauchy problem” itself has the following synonyms:
“Cauchy equation” and “Cauchy inequality”, which were determined on the basis of high
proximity estimates for the corresponding pairs of synonyms; for example, for the pair
(problem, equation) the proximity estimate is 0.84.</p>
      <p>Note that when constructing synonymous terms, synonyms of the word Cauchy
were not used, since it was recognized as the named entity Cauchy on the basis of a
dictionary that includes the list of persons mentioned in the mathematical encyclopedia.
At the same time, we note that Riemann appeared among the synonyms of Cauchy.</p>
      <p>– determination of classes of synonyms by part of speech.</p>
      <p>Consider a relation-extraction pattern based on the simple pattern adjective
&lt;term&gt;, which most often indicates generic relations: the original term
is a generic concept, and the combination matching the pattern is a specific
concept [36, 37, 38].</p>
      <p>Each word was considered separately, the synonyms were filtered by part of speech,
and a possible synonym (candidate) for the term was formed from them. After that,
sentences were formed for the term and its synonyms on the basis of the selected
templates.</p>
      <p>Based on these synonyms, the following phrases were formed in accordance with
the pattern “adjective &lt;query term&gt; &lt;synonyms of the query term&gt;”:
[Cauchy boundary value problem, Cauchy boundary equation,
Cauchy boundary inequality] (in Russian the word “краевой” was used).</p>
      <p>When compiling an extended query, it was also proposed to use the adjective
“boundary” as a synonym, so the additional queries [Cauchy boundary
value problem, Cauchy boundary equation, Cauchy boundary inequality] were used
(in Russian the word “краевой” was used).</p>
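The pattern “adjective &lt;query term&gt; &lt;synonyms of the query term&gt;” can be sketched as a simple string template. This is a toy illustration using the example data from the text; the real pipeline first filters the synonyms by part of speech:

```python
def expand_by_pattern(entity, adjective, heads):
    """Form candidate phrases '<entity> <adjective> <head>' for the
    query term and each of its synonyms (the heads)."""
    return [f"{entity} {adjective} {head}" for head in heads]

# Head words: the query term "problem" and its extracted synonyms.
candidates = expand_by_pattern(
    "Cauchy", "boundary", ["problem", "equation", "inequality"]
)
```

The resulting candidate phrases (“Cauchy boundary problem”, “Cauchy boundary equation”, “Cauchy boundary inequality”) are then checked against the index, so only phrases that actually occur in the corpus survive.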
      <p></p>
      <p>– selection of patterns for “capturing” relations.</p>
      <p>To analyze and construct more complex relation patterns of the form &lt;term&gt;
verb &lt;term&gt; in the thesaurus, the &lt;term&gt; verb pattern was considered. Using
this pattern to populate relations requires a separate analysis and is beyond the scope
of the word2vec-based algorithm considered in this article.</p>
      <p>In the process of training the model, several verbs were identified for patterns
using the algorithm under consideration. When analyzing context-sensitive synonyms,
the list of verbs that emerged was rather limited, which is not surprising given the
specifics of the subject area. It is restricted to verbs such as: apply, use, base,
prove, consider, define, depend, be, embody. Verbal nouns formed from these verbs
are also frequently used: application, use, basis, definition.
</p>
      <p>– improving the “quality of terms” and validating them.</p>
      <p>To improve the search for extended domain terms matching the templates for
multi-word terms in the thesaurus, possible spellings of the terms were considered; for
example, for “ordinary differential equation” the possible variants are “ODE”,
“ordinary DE”, etc. All possible spellings were treated as separate terms. Since there
are few such terms in the studied subset, they did not have a significant effect on the
results.</p>
      <p>Validation of the model and of the relations it extracts was performed on the basis
of the ODE thesaurus.</p>
      <p>
        The problem of synonyms and their extraction using word2vec together with a
search index is covered in more detail in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Examples</title>
      <p>The combined use of the full-text index [40, 41] and the wsgMath search model
makes it possible to extend the original query with synonyms. Extending queries with
synonyms without wsgMath requires pre-compiled synonym dictionaries. Resources
such as WordNet (https://wordnet.princeton.edu/) or RuWordNet
(https://ruwordnet.ru/ru) can be used, but the main problem is that synonyms from
pre-compiled dictionaries are not tied to the data being indexed, and their use does not
improve the results.</p>
      <p>Figure 2 shows the main steps of forming the wsgMath model for generating query
synonyms over the LibMeta content. The query string coming from the full-text search
interface passes through the Analyzer, a functional part of the model in which the
basic operations for interacting with the wsgMath model are performed. All the
operations described in the stages of the previous section constitute its main
functionality.</p>
      <p>The Analyzer splits a string into words, analyzes and transforms them. Synonyms
for the words are extracted from the wsgMath model and filtered, and an extended
query is formed, with the help of which the corresponding documents are retrieved
from the full-text index.</p>
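The extended query that the Analyzer forms can be sketched as follows, assuming a conventional full-text query syntax with parenthesized OR groups; the operator syntax is illustrative, not the exact LibMeta syntax:

```python
def build_extended_query(terms, synonyms):
    """For each query term, attach its synonyms in a parenthesized OR
    group; the groups are implicitly AND-ed by the full-text engine."""
    parts = []
    for term in terms:
        variants = [term] + synonyms.get(term, [])
        if len(variants) == 1:
            parts.append(term)
        else:
            parts.append("(" + " OR ".join(variants) + ")")
    return " ".join(parts)

q = build_extended_query(
    ["Cauchy", "problem"],
    {"problem": ["equation", "inequality"]},
)
```

Here `q` becomes `Cauchy (problem OR equation OR inequality)`, matching the structure of the extended query discussed in the example below: the named entity is left untouched, while the common noun is expanded with its filtered synonyms.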
      <p>A user's information need is defined as a chain of queries that leads the user to the
information needed. Each subsequent query in this chain is a refinement of the
previous one.</p>
      <p>A real information request, as a rule, consists of an initial query and
refinements. Let us consider an example in which the primary query leads to excessive
information noise while the refinement yields a more pertinent answer, and let us
compare the search results with and without the wsgMath model. For comparison,
statistical scores (denoted score) obtained using the TF-IDF algorithm are
calculated.</p>
      <p>The example below demonstrates three lists (Lists 1-3) with different scores
depending on how the query is expanded. For example, after searching with the test
query “Cauchy problem”, the user enters the refining query “Cauchy boundary value
problem” and finds the information of interest. The search index
contains 3654 scientific articles, of which only 637 mention the “Cauchy
problem”. Of these, 59 were selected for the user, since the query words were
found in significant parts of the document (title and abstract). With this approach,
the document of interest to the user is in 18th place. Part of the list (List 1) is shown
below; the “score” value shows how well a document matches the query and is
calculated from statistical characteristics such as TF-IDF.
List 1:
1. The Cauchy problem for the system of equations of the theory of elasticity and
thermoelasticity in space
score = 0.65376675
2. The Cauchy problem for the system of thermoelasticity equations in space
score = 0.64415324
……………………………………………………….
18. On the Well-Posedness of a Boundary Value Problem on the Line for Three
Analytic Functions
score = 0.5233538</p>
      <p>With the refining query “Cauchy boundary value problem”, the list of results
looks as shown below (List 2); the document of interest moves up to fifth position,
the number of documents matching the query text is reduced to 338, and the
user is recommended only 20 of them.</p>
      <p>List 2:
1. Projection procedures for non-local improvement of linearly controlled processes
score = 0.8902895
2. On one method of constructing parametric synthesis for a linear-quadratic
optimal control problem
score = 0.8708762
…………………………………………
5. On the Well-Posedness of a Boundary Value Problem on the Line for Three
Analytic Functions
score = 0.85024154</p>
      <p>Let us consider the situation in which the query “Cauchy problem” is extended
with synonyms and transformed, using the wsgMath model, into the form “boundary”,
“problem or equation or inequality Cauchy” (in Russian: “краевая или граничная”,
“задача или уравнение или неравенство Коши”). In the extended query, synonyms
are listed in parentheses and connected by the logical operation OR; the presence of at
least one of them is required. This approach slightly increases the recall of the
answer, and the precision also increases; therefore, the degree to which the user's
need is satisfied grows. The list of results obtained is shown below (List 3), and the
sought document is in second place. The number of documents matching the query is
395, the user receives the desired answer already in the first positions, and the size of
the result set returned by the system is 65.</p>
      <sec id="sec-5-6">
        <title>List 3</title>
        <p>1. On a positive radially symmetric solution of the Dirichlet problem for one
nonlinear equation and a numerical method for obtaining it
score = 0.9809638
2. On the Well-Posedness of a Boundary Value Problem on the Line for Three
Analytic Functions</p>
        <p>score = 0.9587569
3. On a positive radially symmetric solution of the Dirichlet problem for one
nonlinear equation and a numerical method for obtaining it
score = 0.9512307</p>
        <p>This example illustrates the effect of the approach already at the level of extending
queries with synonyms based on indexed documents. With this approach, all
suggested synonyms occur in the search engine index, so the extended query is
guaranteed to return answers to the user's queries.</p>
        <p>
          The use of the extended version of word2vec (called doc2vec or paragraph2vec in
different sources) [
          <xref ref-type="bibr" rid="ref29 ref33">29, 33</xref>
          ] makes it possible to introduce an additional
element, a label for a text fragment or for the entire document, and, based on the
vectors of these labels, to select similar documents not only by exact matches of
keywords or terms but also on the basis of the context of individual fragments or of
the entire document. As an illustration, Fig. 3 shows the main steps of this approach.
This feature is used to surface documents that are close in meaning and do not appear
in the search results but may be of interest to the user.
        </p>
        <p>Let us take a closer look at the process of ranking documents based on the wsgMath
model when searching for similar documents. When a document enters the system, its
current vector representation is retrieved, a search is performed, and the labels of the
nearest documents are returned, namely those whose cosine similarity exceeds a
certain threshold, determined experimentally as 0.6. Below is the result of this
procedure for the document that was the sought one in the previous example. The 9
documents closest to it were found, each with cosine similarity above 0.6.</p>
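The ranking step just described, keeping only documents whose similarity to the input document exceeds the experimentally chosen threshold of 0.6, can be sketched as follows; the scores are taken from List 4, and the shortened labels are illustrative:

```python
THRESHOLD = 0.6

def nearest_documents(similarities, threshold=THRESHOLD):
    """Keep (label, score) pairs above the threshold and sort them in
    descending order of cosine similarity."""
    kept = [(label, s) for label, s in similarities.items() if s > threshold]
    return sorted(kept, key=lambda p: p[1], reverse=True)

# Cosine similarities to the query document (first two values from
# List 4, plus one invented below-threshold example).
scores = {
    "singular integral equations": 0.8136,
    "Riemann boundary value problem": 0.8029,
    "unrelated document": 0.41,
}
ranked = nearest_documents(scores)
```

Only the two documents above the 0.6 threshold survive, in descending order of similarity; the below-threshold document is dropped rather than shown to the user.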
        <p>List 4:
1. Some classes of singular integral equations solvable in closed form
cosineSimilarity = 0.8136491179466248
2. Riemann's boundary value problem for a half-plane with a coefficient
exponentially decreasing at infinity
cosineSimilarity = 0.8028532266616821
3. Algorithm for constructing a quasiregular asymptotic representation of the
solution of singularly perturbed linear multipoint boundary value problems with fast
and slow variables
cosineSimilarity = 0.7246567010879517
4. Solution in closed form of an integral equation of convolution type in the
hyperelliptic case
cosineSimilarity = 0.6468908786773682
5. On biorthogonal systems generated by some involutive operators
cosineSimilarity = 0.6454607248306274
6. On linear periodic systems in the plane having matrices of the required form
cosineSimilarity = 0.6165973544120789
7. On integral equations for the Riemann function
cosineSimilarity = 0.6134763956069946
8. Gakhov's equation for an exterior mixed inverse boundary value problem with
respect to a parameter ...
cosineSimilarity = 0.6059825420379639
9. On a nonlinear integral equation of the first kind
cosineSimilarity = 0.6017340421676636</p>
        <p>Fig. 4 adds the steps that involve attribute-based search and shows how it
interacts with the previously described search components. Attribute-based search
delineates the boundaries within which documents are searched (by author, by year,
etc.); a transition to full-text search can then be performed within them, and/or the
results can be further refined on the basis of document similarity.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>A vector representation of documents is proposed for expanding the search query
and increasing the coverage of information per query. It is shown that the quality of
the answer to a query improves when semantically close text fragments are taken into
account.</p>
      <p>The model proposed in this work was tested on primary data, namely arrays of
articles not systematized by subject. Note that a technology for processing and
thematically classifying primary data using machine learning methods has also been
tested. This technology can be used for the subject classification of the texts of
scientific articles in Russian and for matching the selected subjects with the
English-language classification by comparing the MSC and UDC classifiers.</p>
      <p>Integration of a neural network with search indexes makes it possible to give users
smarter results based on the identified relations among documents.</p>
      <p>The considered search model can also be used for the thematic processing both of
primary texts of scientific articles and of texts already systematized and supplied with
keywords and links to classifiers. In the second case, this can help to identify
interdisciplinary research, as well as erroneous subject-area assignments, since not
only secondary documents but also the texts of the articles (primary documents) are
taken as the basis for the thematic analysis.</p>
      <p>Acknowledgement. The work is presented in the framework of the implementation
of the theme of the state assignment “Mathematical methods of data analysis and
forecasting” of the FRC CSC of RAS and was partially supported by grant
#20-07-00324 of the Russian Foundation for Basic Research.</p>
      <p>34. Bullinaria, J.A., Levy, J.P.: Extracting Semantic Representations from Word
Co-occurrence Statistics: A Computational Study. Behavior Research Methods, vol. 39, pp.
510–526 (2007).
35. Klaussner, C., Zhekova, D.: Lexico-syntactic patterns for automatic ontology building.
Proceedings of the Second Student Research Workshop associated with RANLP, pp. 109–114
(2011).
36. Raza, M.A., Mokhtar, R., Ahmad, N., Pasha, M., Pasha, U.: A Taxonomy and Survey
of Semantic Approaches for Query Expansion. IEEE Access, vol. 7, pp. 17823–17833
(2019). https://doi.org/10.1109/ACCESS.2019.2894679
37. Wang, C., Cao, L., Zhou, B.: Medical synonym extraction with concept space models.
https://arxiv.org/abs/1506.00528 (2015), last accessed 2021/07/27.
38. Mitchell, J., Lapata, M.: Vector-based Models of Semantic Composition (2008).
39. Polozov, I.K., Volkova, I.A.: Applying word2vec technology to shifter extraction task.
International Research Journal 4-1 (94) (2020).
40. Makinen, V.: Compact suffix array — a space-efficient full-text index. Fundamenta
Informaticae 56(1–2), 191–210 (2003).
41. Makinen, V., Navarro, G.: Compressed full-text indexes. ACM Computing Surveys 39
(1), 1–79 (2007). https://doi.org/10.1145/1216370.1216372</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Furnas</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dumais</surname>
          </string-name>
          , S.T.:
          <article-title>The vocabulary problem in human-system communication</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>30</volume>
          (
          <issue>11</issue>
          ),
          <fpage>964</fpage>
          -
          <lpage>971</lpage>
          (
          <year>1987</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bezdek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Oakman</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          :
          <article-title>A knowledge-based approach to online document retrieval system design</article-title>
          .
          <source>In Proc. ACM SIGART Int. Symp. Methodol. Intell. Syst.</source>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>120</lpage>
          (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.:</given-names>
          </string-name>
          <article-title>Query expansion using lexical-semantic relations</article-title>
          .
          <source>17th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf</source>
          . Retr., Dublin, Ireland (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automatic query expansion using SMART: TREC 3</article-title>
          , presented at the 3rd Text Retr.
          <source>Conf. (TREC)</source>
          (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Efthimiadis</surname>
            ,
            <given-names>E.N.</given-names>
          </string-name>
          :
          <article-title>Query expansion</article-title>
          .
          <source>Annu. Rev. Inf. Sci. Technol</source>
          .,
          <volume>31</volume>
          (
          <issue>5</issue>
          ),
          <fpage>121</fpage>
          -
          <lpage>187</lpage>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Guarino</surname>
          </string-name>
          , N.:
          <article-title>OntoSeek: Content-Based Access to the Web</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          , May-June, pp.
          <fpage>70</fpage>
          -
          <lpage>80</lpage>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bhogal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacFarlane</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A review of ontology based query expansion</article-title>
          , Inf. Process. Manage.,
          <volume>43</volume>
          (
          <issue>4</issue>
          ),
          <fpage>866</fpage>
          -
          <lpage>886</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frei</surname>
          </string-name>
          , H.:
          <article-title>Concept based query expansion</article-title>
          .
          <source>SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval Pittsburgh</source>
          , Pennsylvania, USA June 27 - July 01,
          <year>1993</year>
          . ACM New York, NY, USA, pp.
          <fpage>160</fpage>
          -
          <lpage>169</lpage>
          (
          <year>1993</year>
          ) https://doi.org/10.1145/160688.160713.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Berk</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <source>LISP: the Language of Artificial Intelligence</source>
          . New York: Van Nostrand Reinhold Company,
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          (
          <year>1985</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lindsay</surname>
            ,
            <given-names>R.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buchanan</surname>
            ,
            <given-names>B.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feigenbaum</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lederberg</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>DENDRAL: A Case Study of the First Expert System for Scientific Hypothesis Formation</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>61</volume>
          (
          <issue>2</issue>
          ),
          <fpage>209</fpage>
          -
          <lpage>261</lpage>
          (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lederberg</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>An Instrumentation Crisis in Biology</article-title>
          . Stanford University Medical School. Palo Alto (
          <year>1963</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Copeland</surname>
            ,
            <given-names>B.J.</given-names>
          </string-name>
          :
          <article-title>"MYCIN"</article-title>
          .
          <source>Encyclopedia Britannica</source>
          ,
          <volume>21</volume>
          Nov.
          <year>2018</year>
          , https://www.britannica.com/technology/MYCIN, last accessed
          <year>2021</year>
          /07/27.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gurney</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>An Introduction to Neural Networks</article-title>
          . CRC Press. London and New York (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>McCulloch</surname>
            ,
            <given-names>W.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pitts</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>A logical calculus of the ideas immanent in nervous activity</article-title>
          .
          <source>Bulletin of Mathematical Biophysics</source>
          <volume>5</volume>
          ,
          <fpage>115</fpage>
          -
          <lpage>133</lpage>
          (
          <year>1943</year>
          ). https://doi.org/10.1007/BF02478259.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. MachineLearning.ru, http://www.machinelearning.ru/,
          <source>last accessed</source>
          <year>2021</year>
          /07/27.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Gavrilova</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horoshevskij</surname>
            ,
            <given-names>V.F.</given-names>
          </string-name>
          :
          <article-title>Bazy znanij intellektualnyh sistem</article-title>
          .
          <source>SPb. Piter</source>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          :
          <article-title>Machine Learning with Shallow Neural Networks</article-title>
          .
          <source>In: Neural Networks and Deep Learning</source>
          . Springer, Cham. (
          <year>2018</year>
          ) https://doi.org/10.1007/978-3-
          <fpage>319</fpage>
          -94463-
          <issue>0</issue>
          _
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Serebryakov</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ataeva</surname>
            ,
            <given-names>O.M.</given-names>
          </string-name>
          :
          <article-title>Ontology based approach to modeling of the subject domain "Mathematics" in the digital library</article-title>
          .
          <source>Lobachevskii Journal of Mathematics</source>
          .
          <volume>42</volume>
          (
          <issue>8</issue>
          ), (
          <year>2021</year>
          ). pp.
          <fpage>1920</fpage>
          -
          <lpage>1934</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ataeva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serebryakov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuchkova</surname>
          </string-name>
          , N.:
          <article-title>Ontological Approach: Knowledge Representation and Knowledge Extraction</article-title>
          .
          <source>Lobachevskii Journal of Mathematics</source>
          .
          <volume>41</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1938</fpage>
          -
          <lpage>1948</lpage>
          (
          <year>2020</year>
          ) https://doi.org/10.1134/S1995080220100030 ISSN 19950802.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ataeva</surname>
            <given-names>O.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serebryakov</surname>
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuchkova</surname>
            <given-names>N.P.</given-names>
          </string-name>
          : Mathematical Physics Branches:
          <article-title>Identifying Mixed Type Equations</article-title>
          .
          <source>Lobachevskii Journal of Mathematics</source>
          .
          <volume>40</volume>
          (
          <issue>7</issue>
          ),
          <fpage>876</fpage>
          -
          <lpage>886</lpage>
          (
          <year>2019</year>
          ) https://doi.org/10.1134/S1995080219070047.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ataeva</surname>
            ,
            <given-names>O.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serebryakov</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuchkova</surname>
            ,
            <given-names>N.P.</given-names>
          </string-name>
          : Mathematical Physics Problems:
          <article-title>Thesaurus and Ontology. Selected Papers of the XXI International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2019) Kazan</article-title>
          , Russia, October 15-18
          . Vol-
          <volume>2523</volume>
          , pp.
          <fpage>158</fpage>
          -
          <lpage>168</lpage>
          , (
          <year>2019</year>
          ) http://ceur-ws.org/Vol2523/paper16.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Muromskij</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuchkova</surname>
            ,
            <given-names>N.P.</given-names>
          </string-name>
          :
          <article-title>Predstavlenie matematicheskih ponyatij v ontologii nauchnyh znanij</article-title>
          .
          <source>Ontologiya proektirovaniya. 9</source>
          (
          <issue>1</issue>
          ), (
          <volume>31</volume>
          ),
          <fpage>50</fpage>
          -
          <lpage>69</lpage>
          (
          <year>2019</year>
          ) https://doi.org/10.18287/2223-9537-2019-9-1-50-69.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Ataeva</surname>
            ,
            <given-names>O.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serebryakov</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuchkova</surname>
            ,
            <given-names>N.P.</given-names>
          </string-name>
          :
          <source>Query Expansion Method Application for Searching in Mathematical Subject Domains</source>
          ,
          <fpage>38</fpage>
          -
          <lpage>48</lpage>
          (
          <year>2020</year>
          ) http://ceur-ws.org/Vol2543/rpaper04.pdf, last accessed 2021/04/27.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Ataeva</surname>
            ,
            <given-names>O.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serebryakov</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuchkova</surname>
            ,
            <given-names>N.P.</given-names>
          </string-name>
          : Using Applied Ontology to Saturate Semantic Relations.
          <source>Lobachevskii Journal of Mathematics</source>
          .
          <volume>42</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1776</fpage>
          -
          <lpage>1785</lpage>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Vinogradov</surname>
            <given-names>I.M.</given-names>
          </string-name>
          :
          <source>Mathematical Encyclopedia</source>
          , Vol.
          <volume>1</volume>
          -
          <issue>5</issue>
          , Soviet Encyclopedia, Moscow, (
          <year>1982</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uren</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pacheco</surname>
          </string-name>
          , R.: LRD:
          <article-title>Latent Relation Discovery for Vector Space Expansion and Information Retrieval</article-title>
          .
          <source>Technical Report KMI-06-09. Conference: Advances in Web-Age Information Management, 7th International Conference,WAIM</source>
          <year>2006</year>
          , Hong Kong, China, June 17-19, 2006, Proceedings (2006). DOI: 10.1007/11775300_11.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>Proceedings of Workshop at ICLR</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yih</surname>
          </string-name>
          , W.T.,
          <string-name>
            <surname>Zweig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Linguistic Regularities in Continuous Space Word Representations</article-title>
          .
          <source>Proceedings of NAACL HLT</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          .
          <source>International Conference on Machine Learning</source>
          , pp.
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schütze</surname>
          </string-name>
          , H.:
          <article-title>Introduction to Information Retrieval</article-title>
          . Cambridge Univ. Press, Cambridge (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
          </string-name>
          , H.:
          <article-title>The Probabilistic Relevance Framework: BM25 and Beyond</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          .
          <volume>3</volume>
          (
          <issue>4</issue>
          ),
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          (
          <year>2009</year>
          ). DOI: 10.1561/1500000019.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Ataeva</surname>
            ,
            <given-names>O.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serebryakov</surname>
            ,
            <given-names>V.A.</given-names>
          </string-name>
          :
          <article-title>Ontologiya cifrovoj semanticheskoj biblioteki LibMeta</article-title>
          .
          <source>Informatics and Applications</source>
          .
          <volume>12</volume>
          (
          <issue>1</issue>
          ),
          <fpage>2</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <source>MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec</source>
          ,
          <source>American Journal of Information Science and Technology</source>
          .
          <volume>3</volume>
          (
          <issue>3</issue>
          ),
          <fpage>62</fpage>
          -
          <lpage>71</lpage>
          (
          <year>2019</year>
          ) https://doi.org/10.11648/j.ajist.20190303.12.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>