<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approaches to assessing the semantic similarity of texts in a multilingual space</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A.Kh. Khakimova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M.M. Charnine</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A.A. Klokov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E.G. Sokolov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>aida_khatif@mail.ru</string-name>
          <email>aida_khatif@mail.ru</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>mc@keywen.com</string-name>
          <email>mc@keywen.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>aaklokov@yandex.ru</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>evgeny.sokolov@phystech.edu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ANO «Scientific and Research Center for Information in Physics and Technique»</institution>
          ,
          <addr-line>Nizhny Novgorod</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>FRC CSC of the Russian Academy of Sciences</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Khakimova Aida Kh., PhD, docent, Kama Institute (Naberezhnye Chelny, Russia), ANO «Scientific and Research Center for Information in Physics and Technique» (Nizhny Novgorod</institution>
          ,
          <addr-line>Russia), Е-mail:</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a methodology for evaluating the semantic similarity of arbitrary texts in different languages. The study is based on the hypothesis that the proximity of vector representations of terms in a semantic space can be interpreted as semantic similarity in a cross-lingual environment. Each text is associated with a vector in a single multilingual semantic vector space, and the measure of semantic similarity of texts is determined by the proximity of the corresponding vectors. We propose a quantitative indicator, the Index of Semantic Textual Similarity (ISTS), that measures the degree of semantic similarity of multilingual texts on the basis of identified cross-lingual implicit semantic links. Its parameters are tuned using the correlation with the presence of formal references between documents. The measure of semantic similarity reflects the presence of common terms, phrases or word combinations in the two texts. The optimal parameters of the algorithm for identifying implicit links are selected on a thematic collection by maximizing the correlation between explicit and implicit links. The developed algorithm can facilitate the search for closely related documents in the analysis of multilingual patent documentation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As cross-language information retrieval gets more
attention, tools to measure cross-language semantic
similarity between documents become necessary. An
accurate assessment of the actual similarity between
documents is fundamental for many automatic text
analysis applications, such as thesaurus generation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
machine translation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], information search [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], automatic
generalization [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Text mining and knowledge management technologies
play a key role in many areas, including critical
infrastructures. Information search, document
classification, business analytics, forecasting
technologies, etc. are currently the most important
activities.</p>
      <p>Patent search, including monitoring competitors,
checking the novelty of an invention, or searching for
technical solutions in other fields of application, requires
a lot of effort.</p>
      <p>Comparing documents in different languages is
challenging for natural language processing applications,
and especially in machine translation applications.</p>
      <p>Cross-language matching of documents is carried out
in a patent search to protect an invention in more than one
country or region. A separate patent must be filed with
several patent offices in different languages. Before
applying for a patent, applicants conduct a preliminary
search for patents or documents revealing intellectual
property similar to the filed invention. In such a process, a
set of patents is requested in one language, using the
source document in another language as a request.</p>
      <p>
        To compare the received documents, it is necessary to
use cross-language similarity assessment functions. This
task can be formulated as discarding text pairs that are not
semantically equivalent [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The task is complicated by the
fact that when an invention is filed in different countries,
different standards may be used, which may lead to
discrepancies between the versions of the document in
different languages. This further complicates the task of
identifying semantic equivalents [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Natural language processing methods for text analysis
and data mining are used in the analysis of many types of
technical documentation. Functional analysis methods are
based on extracting interactions between the entities
described in the document.</p>
      <p>Linguistic analysis tools make it possible to identify
the key elements of a document by combining morphological,
syntactic, and semantic analysis. Applying linguistic
analysis methods to patent documents allows for
accelerated analysis and comparison of patents.</p>
      <p>The purpose of analyzing technical documentation
is, on the one hand, to discover possible ambiguities or
incompleteness and, on the other, to understand the
requirements with a view to possible formalization.</p>
      <p>
        The main problem here is that keyword searches do not
take into account synonyms or more abstract terms
associated with given query words. This means that if a
synonym is used for an important term in a patent
application, for example, a wire instead of a cable, a
keyword search may not reveal this relationship if an
alternative term was not explicitly included in the search
query. This is relevant since patent texts often use abstract
and general terms to describe the invention in order to
maximize protection [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>If we consider the Internet as a multilingual database,
a typical problem when searching for information is the
search for relevant documents in the collection of
documents by some key terms, or by the example of the
corresponding document. Assessing the semantic
similarity between words (phrases) is critical to assessing
whether a document meets user needs. Many information
retrieval systems, such as online library catalog systems,
web search engines, deal with multilingual documents and
must have tools to measure cross-language semantic
similarity.</p>
      <p>In recent decades, many studies have been carried out
aimed at improving the effectiveness of measures of
semantic similarity of words. However, studies of
semantic similarity mainly focus on English. This is partly
due to the limited availability of similarity criteria for
words in languages other than English. Since the
development of multilingual methods is necessary, there is
an urgent need to find a reliable basis for assessing
multilingual and interlingual semantic similarity.</p>
      <p>
        Although many areas require a multilingual
measurement of semantic similarity,
most algorithms measure semantic similarity between
words of the same language. Cross-language similarity
was first described in 2009 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for English-Spanish
cross-language data sets. Over the past few years, multilingual
word embeddings, which place lexical elements from several
languages in a single semantic space, have attracted
considerable attention from researchers [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9-11</xref>
        ].
      </p>
      <p>Cross-language applications are based on data mining
methods such as text clustering, which involves extracting
words or phrases from documents as features,
representing documents as feature vectors, and then
grouping documents into clusters based on the similarity of
the feature vectors. In a multilingual document collection,
the extracted features will refer to words in multiple languages.
Therefore, it is important to measure similarity
not only between words of one language, but also between
words of different languages.</p>
      <p>
        According to the concept of the information data space
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the information space should model a rich set of
relationships between data repositories. To model the
relationships between data repositories in data spaces, a
component is needed that can measure the semantic similarity
of cross-language pairs. Sources in a data space can
be relational databases, XML repositories, text databases,
web services, etc.
      </p>
      <p>
        The problem of plagiarism in a monolingual context has
been studied extensively [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Free machine translation tools help
spread cross-language plagiarism (plagiarism by
translation). In this relatively new field of research,
semantic text similarity has been assessed across
language pairs. The authors investigated various existing
approaches to detecting plagiarism across different language
pairs and found that if a method is effective for a
particular language pair, it will be equally effective for
another language pair with a sufficient number of available
lexical resources, i.e. the method can be optimized for
one case and effectively applied to another
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology for calculating the assessment of semantic similarity</title>
      <p>The technique includes the following steps (a minimal sketch of steps 4 and 5 is given after this list):
1) pre-processing of texts by replacing their terms with synset codes;
2) construction of quotation vectors by identifying common rare phrases (long quotes) in various documents using the relevant phrases method;
3) thematic analysis of the processed texts and building a set of available topics and the corresponding thematic document vectors using the LDA method, with the possibility of further clustering documents on topics/ideas into “baskets”/clusters;
4) construction, for each document, of an extended vector describing the presence of long citations, the statistics of the synsets included in it and their thematic composition, i.e. the document vector is the concatenation of the citation vector, the thematic vector and the synset statistics vector;
5) calculation of the similarity index between articles/documents (Index of Semantic Textual Similarity, ISTS) as the cosine measure of the corresponding article vectors;
6) calculation of the correlation between the formal connectedness of articles and their similarity index, taking into account the minimum and maximum thresholds of the ISTS;
7) the choice of values of the various calculation parameters (ISTS thresholds) based on the maximum correlation.</p>
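      <p>The sketch below illustrates steps 4 and 5 in Python: the document vector is assembled as the concatenation of a citation vector, a thematic vector and a synset statistics vector, and the ISTS of a pair of documents is the cosine of the corresponding vectors. The dimensions and values are illustrative assumptions, not data from the study.</p>
      <preformat>
# Minimal sketch of steps 4-5: extended document vectors and cosine ISTS.
# Vector contents and dimensions are illustrative assumptions.
import numpy as np

def extended_vector(citation_vec, topic_vec, synset_vec):
    """Concatenate the three per-document vectors into one extended vector."""
    return np.concatenate([citation_vec, topic_vec, synset_vec])

def ists(vec_a, vec_b):
    """Index of Semantic Textual Similarity as the cosine of two extended vectors."""
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(np.dot(vec_a, vec_b) / denom) if denom else 0.0

# Toy example: two documents with 4 quotation features, 8 topics, 5 synset counts.
doc1 = extended_vector(np.array([1, 0, 0, 1.0]), np.random.rand(8), np.array([2, 0, 1, 0, 3.0]))
doc2 = extended_vector(np.array([1, 0, 1, 0.0]), np.random.rand(8), np.array([1, 0, 1, 0, 2.0]))
print(ists(doc1, doc2))
      </preformat>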
      <p>The calculation method is selected according to the
maximum correlation of ISTS with formal links.</p>
      <p>The algorithm for the vector transformation of terms is based on
recurrent neural networks (RNN); see Fig. 1.</p>
      <p>Fig. 1. Graphs of the number of points for calculating the correlations of the current and future years for the indicators IFTm (upper)
and IFT, depending on the number of articles with the word in the last 3 years</p>
      <p>RNNs are used for tasks that involve a sequence of
words and phrases. Formally, at each step (after each newly
processed word), the RNN estimates, for every word in the
corpus, the probability that it will be the next word. In this
work, LSTM neurons, which are a special case of RNNs,
were used. Moreover, a bidirectional recurrent network
(biLSTM) was used: a biLSTM combines two LSTM networks,
one of which builds a language model from the beginning of
the sentence while the other simultaneously builds one
from the end.</p>
      <p>We used the simplest sequential model, consisting of
two layers. For the software implementation of the
proposed architecture in Python, the Jupyter Notebook
development environment was used. A linear layer was
attached to the biLSTM layer to solve the classification
problem (Fig. 2).</p>
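      <p>A minimal sketch of the described two-layer sequential model follows. The paper specifies only Python, a biLSTM layer and a linear classification layer; the Keras/TensorFlow framework, the layer size of 64 units (one of the sizes tried in the experiments) and the assumed maximum title length are our assumptions.</p>
      <preformat>
# Minimal sketch of the two-layer sequential model (biLSTM + linear layer),
# assuming Keras/TensorFlow; layer sizes and title length are assumptions.
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense

EMBED_DIM = 300   # Word2Vec embedding size used in the experiments
MAX_WORDS = 20    # assumed maximum number of words in a title

model = Sequential([
    Input(shape=(MAX_WORDS, EMBED_DIM)),
    # biLSTM layer: two LSTMs read the title from both ends
    Bidirectional(LSTM(64)),
    # linear (dense) layer attached for the classification problem
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.summary()
      </preformat>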
      <p>Vector representations of words (embeddings) were fed to
the input of the neural network.</p>
      <p>Word2Vec was used to convert each word from the title of
an article into a numeric vector. In the experiments, 300-dimensional
vectors were used (Word2Vec from the gensim
library allows changing the embedding dimension).</p>
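      <p>A minimal sketch of converting title words into 300-dimensional vectors with Word2Vec from the gensim library; the toy titles stand in for the DBLP data (in gensim 4.x the dimension parameter is vector_size, older versions call it size).</p>
      <preformat>
# Minimal sketch of Word2Vec title embeddings with gensim; toy corpus only.
from gensim.models import Word2Vec

titles = [
    ["bilingual", "word", "embeddings", "for", "machine", "translation"],
    ["semantic", "matching", "in", "search"],
]
# vector_size controls the embedding dimension (300 in the experiments)
w2v = Word2Vec(sentences=titles, vector_size=300, window=5, min_count=1, workers=2)
vector = w2v.wv["semantic"]   # 300-dimensional vector for one title word
print(vector.shape)
      </preformat>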
      <p>In our experiments, we consider the DBLP citation
network, a collection of articles on artificial intelligence
compiled by aminer.org. In this study, we intentionally
relied only on the title of the publication and its links.</p>
      <p>During the experiments, various neural network models
were tested. Experiments were conducted by varying the
number of neurons in the biLSTM layer (4,
8, 16, 32, 64, 128) and the number of neurons in the linear
layer (from 0 to 10). The best model achieved a score of
0.6131 on the ROC AUC metric.</p>
      <p>The time for calculating the forecast and evaluating its
accuracy was about 1 hour.</p>
      <p>To combine articles with similar topics into clusters,
we used generally accepted approaches to natural language
processing (NLP): clustering articles with the Latent
Dirichlet Allocation (LDA) method and visualizing the
results with Python libraries. After extracting the
data, preprocessing it, extracting tokens, stemming and
removing stop words, we applied the Latent Dirichlet
Allocation (LDA) algorithm (Fig. 2).</p>
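      <p>A minimal sketch of the LDA step using gensim; the toy token lists stand in for the preprocessed DBLP titles, and num_topics=8 matches the eight clusters reported below.</p>
      <preformat>
# Minimal sketch of topic modelling with gensim's LDA; toy documents only.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["neural", "network", "training", "algorithm"],
    ["signal", "processing", "filter", "noise"],
    ["robot", "reinforcement", "learning", "policy"],
]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

# num_topics=8 matches the eight clusters reported in the paper
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=8,
               passes=10, random_state=0)
theta = lda.get_document_topics(bow_corpus[0])   # per-document topic vector
print(theta)
      </preformat>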
      <p>LDA is a hierarchical Bayesian model that consists of
two levels: at the first level, a mixture whose components
correspond to “themes”; at the second level, a multinomial
variable with an a priori Dirichlet distribution that defines
the “distribution of topics” in the document.</p>
      <p>The principle of the model is as follows:
1) select the document length N;
2) select a vector θ ~ Dir(α), the vector of the “degree of expression” of each topic in this document;
3) for each of the N words w: choose a topic zn according to the distribution Mult(θ), then choose a word wn ~ p(wn|zn, β) with the probabilities given in β.</p>
      <p>For simplicity, we fix the number of topics k and
assume that β is just a set of parameters βi,j = p(wj = 1 | zi = 1),
which need to be estimated, and we will not worry about the
distribution of N. The joint distribution then looks like
this: p(θ, z, w | α, β) = p(θ | α) Πn p(zn | θ) p(wn | zn, β).</p>
      <p>Unlike conventional clustering with an a priori
Dirichlet distribution, we do not select a cluster once
and then draw words from that cluster; instead, for each
word we first select a topic from the distribution θ, and
only then relate the word to that topic.</p>
      <p>At the output, after training the LDA model, we obtain thematic
vectors θ, showing how topics are distributed
in each document, and distributions β, which show which
words are more likely in certain topics. In our case, we obtained
8 pronounced clusters corresponding to the following
directions:
1) computing systems and algorithms in them;
2) bioinformatics and data processing methods in it;
3) signal processing;
4) optimization methods and algorithms based on them;
5) problems related to theoretical informatics and computational complexity;
6) neural and computing networks;
7) issues regarding natural language processing (NLP) and programming languages;
8) robotics and self-learning systems (Reinforcement Learning).</p>
      <p>After the previous step, n-dimensional thematic
vectors of the articles are obtained. To compress the results
into a two-dimensional vector space, the t-SNE machine
learning algorithm was used. To visualize the clusters, we
used an interface written in JavaScript (Fig. 3).</p>
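      <p>A minimal sketch of the t-SNE compression using scikit-learn; the random matrix stands in for the n-dimensional thematic vectors, and the JavaScript visualization itself is not reproduced here.</p>
      <preformat>
# Minimal sketch of compressing thematic vectors to 2-D with t-SNE (scikit-learn).
import numpy as np
from sklearn.manifold import TSNE

theta_matrix = np.random.rand(200, 8)          # stand-in for 200 article topic vectors
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(theta_matrix)   # one (x, y) point per article
print(points_2d.shape)                         # (200, 2)
      </preformat>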
      <p>The previous approach was based on comparing
vectors at the megalemma level with a cosine measure, which
determined the semantic similarity of the texts. As a
development of this approach, based on the assumption
that, while maintaining the semantic similarity of phrases,
the ideas in them can be expressed in different words, we use
the Impact Factor of the Term (IFT) to assess the similarity
of documents.</p>
      <p>To compare articles expressing new ideas, we use the
hypothesis that new ideas are often expressed in terms with
a high impact factor (IFT). The IFT of a term is determined by
the average number of links to articles containing that term:
the higher the IFT, the stronger the citation trend and the
larger the number of formal links. If a pair of articles shares
a common term with a high IFT, the probability of a formal
link between them will be high.</p>
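      <p>A minimal sketch of computing the IFT of a term as the average number of citation links to the articles containing that term; the data structure and numbers are illustrative assumptions.</p>
      <preformat>
# Minimal sketch: IFT of a term as the average citation count of articles
# containing that term; data structures and numbers are assumptions.
from collections import defaultdict

def impact_factor_of_terms(articles):
    """articles: list of dicts with 'terms' (set of terms) and 'citations' (int)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for art in articles:
        for term in art["terms"]:
            totals[term] += art["citations"]
            counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}

articles = [
    {"terms": {"neural network", "lstm"}, "citations": 40},
    {"terms": {"neural network"}, "citations": 10},
    {"terms": {"fuzzy logic"}, "citations": 2},
]
print(impact_factor_of_terms(articles))   # e.g. IFT("neural network") = 25.0
      </preformat>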
      <p>Using multilingual synsets built for high IFT terms
(IFT terms), you can evaluate the similarity of articles in
any language. If there is a semantic similarity, estimated
by a cosine measure, it can be assumed that articles with
this term will be quoted with some probability.</p>
      <p>Whereas previously the similarity of megalemma vectors
determined the similarity of texts, we now use
extended vectors based on common rare phrases,
megalemmas and multilingual IFT synsets, as well as the
results of the thematic analysis. The similarity of the extended
vectors reflects the similarity of texts more accurately,
since it takes into account not only semantic but also
thematic similarity.</p>
      <p>Fig. 3. Cluster states in 1993. 1) computing systems and algorithms in them (pink); 2) bioinformatics and data processing methods in it
(purple); 3) signal processing (brown); 4) optimization methods and algorithms based on them (green); 5) problems related to
theoretical informatics and computational complexity (orange); 6) neural and computing networks (red); 7) issues regarding natural
language processing (NLP) and programming languages (blue); 8) robotics, and self-learning systems (Reinforcement Learning) (dark
orange); yellow - a “garbage” cluster with articles in German</p>
      <p>Our study is based on a model that represents ideas as
sets of terms and similar phrases in a
multilingual semantic field, and on the hypothesis that the
proximity of vector representations of terms in a
multilingual vector semantic space can be interpreted as
semantic similarity in a cross-language environment. We
propose a method of formalizing ideas by using terms with
a high IFT and megalemmas, which makes it possible to recognize
an idea expressed in different words. References, both
formal (bibliographic) and contextual (implicit, expressed
by matching IFT terms), are an expression of the
connection between ideas.</p>
      <p>High IFT terms are significant terms (or ideologically
significant). If the texts on the IFT synsets have the same
vector, then this means the presence of common ideas in
these texts and a significant similarity related to citation.</p>
      <p>The similarity in vectors of megalemmas also correlates
with formal links (as our previous experiments showed),
but to a much lesser extent. It is shown that megalemma
has a very low impact factor.</p>
      <p>It should be noted that the similarity of megalemma
vectors is more applicable to texts with a common
vocabulary; in this case, the degree of coincidence of their
thematic composition as a set of popular words is
calculated. The approach of calculating the similarity of
IFT/megalemma vectors is focused on comparing the
similarity of scientific texts with specific terminology,
even when the ideas have different lexical
expressions. Therefore, in the second case it becomes
possible to assess similarity more accurately from the
point of view of ideological similarity, since terms with a
high IFT are significant terms denoting ideas.</p>
      <p>Three types of semantic similarity can be considered
(based on implicit references): 1) similarity of the thematic
composition of popular / common words (word frequency
from 10 thousand or more); 2) the presence of common
significant IFT terms denoting specific ideas (frequency
51000); 3) the presence of common rare phrases (long
quotation) (frequency 2-100). These types differ in the
frequency of matching terms / phrases. The highest
frequency is typical for popular terms and megalemmas,
the lowest is for common rare phrases. The proposed
similarity assessment algorithm takes into account all
these types of similarities, giving appropriate weights.</p>
      <p>Thus, when identifying similarities and implicit
references, the entire frequency range of terms and phrases
is used.</p>
      <p>So, we build extended vectors from megalemmas and
multilingual IFT synsets, and these can be weighted
vectors whose elements carry weights. The larger the
impact factor, the higher the likelihood of a formal link
and the higher the weight of the corresponding vector element.
The cosine measure can work with weighted vectors in
which elements take large real values. Since our task is to
search for semantic similarity of articles that correlates with
the presence of formal links, increasing the weights
of IFT synsets in the extended vectors improves the quality of
the proposed algorithm.</p>
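      <p>A minimal sketch of the weighted cosine measure over extended vectors, where elements corresponding to IFT synsets receive larger weights; the particular weights and vectors are illustrative assumptions.</p>
      <preformat>
# Minimal sketch of a weighted cosine measure: IFT-synset elements of the
# extended vector are scaled by larger weights; the weighting is an assumption.
import numpy as np

def weighted_cosine(vec_a, vec_b, weights):
    wa, wb = vec_a * weights, vec_b * weights
    denom = np.linalg.norm(wa) * np.linalg.norm(wb)
    return float(np.dot(wa, wb) / denom) if denom else 0.0

# three megalemma features with weight 1 and two IFT-synset features with larger weights
weights = np.array([1.0, 1.0, 1.0, 3.0, 5.0])
doc1 = np.array([0.2, 0.0, 0.5, 1.0, 1.0])
doc2 = np.array([0.1, 0.3, 0.0, 1.0, 0.0])
print(weighted_cosine(doc1, doc2, weights))
      </preformat>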
      <p>Therefore, the algorithm for calculating ISTS is based
on assessing the similarity of vectors, expanded by adding
multilingual IFT synsets and weights, according to a
cosine measure, in order to determine the similarity of
texts. This takes into account the presence of formal links
between texts containing matching IFT terms. The method
may contain options that are determined/selected by the
optimization method according to the maximum
correlation of ISTS with formal links.</p>
      <p>The first version of the methodology for calculating the
multilingual Index of Ideological Influence (III) as the
number of similar subsequent / future articles / documents
has been developed.</p>
      <p>
        We consider similar subsequent articles to be articles
that will cite this document, i.e. articles linked to it by
formal links. Thus, the III looks for
trending articles containing trending IFT terms. We can also
calculate a second-level III: since one idea gives rise to
another, we can search for articles similar to the
articles found in the first stage (indirect
similarity). The mutual influence of articles is calculated
using the PageRank algorithm [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which increases the
significance/influence of texts/articles the more
(implicit) links they have with other significant/
influential texts.
      </p>
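      <p>A minimal sketch of the PageRank refinement over the graph of (implicit) links between articles, assuming the networkx library; the edge list is illustrative.</p>
      <preformat>
# Minimal sketch of weighting article influence with PageRank over the graph of
# implicit links, assuming networkx; the edges are illustrative.
import networkx as nx

graph = nx.DiGraph()
# an edge from a to b means article a has an (implicit) link to article b
graph.add_edges_from([("A", "B"), ("C", "B"), ("B", "D"), ("C", "D")])

scores = nx.pagerank(graph, alpha=0.85)
# articles referenced by many influential articles receive higher scores
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
      </preformat>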
      <p>IFT terms in scientific articles have an expiration date.
The value of the IFT is higher in the first years (3-4 years)
and then decreases (Fig. 4).</p>
      <p>Fig. 4. Graphs of the average values of the IFT term, depending on the number of articles with these terms and the speed of the trend. 1
- 0 years, 2 - 1 year, 3 - 2 years, 4 - 3 years, 5 - 4 years, 6 - 5 years</p>
      <p>Over time, some important terms are replaced by
others. If, in addition to the IFT, the year in which the
term was of high importance is included in the term vector,
then the vector also provides some information
about the age of the article, which makes it possible
to find shared ideas of a certain age when comparing
articles. This provides information on the dynamics of
the development of ideas. For example, the term NEURAL
NETWORKS has a long history, and in different years
various derivatives of this term were significant IFT terms,
for example, FUZZY NEURAL or RECURRENT neural
networks.</p>
      <p>
        So, the methodology for calculating the III contains the following steps:
1) search in the article for significant IFT terms;
2) compiling multilingual IFT synsets for these IFT terms;
3) on the basis of the IFT synsets, determining the forecast (regression analysis of previous IFT values and trend parameters);
4) refinement of the forecast using the PageRank algorithm [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which increases the significance/influence of texts/ideas the more (implicit) connections they have with other significant/influential texts.
      </p>
      <p>In this case, implicit links between texts/articles are
determined using the methodology for calculating the
index of semantic text similarity (ISTS).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>As a result, we observe the following pattern: the higher the
forecast of the IFT, the higher the III of the document. The
forecast value of the IFT applies equally to the text, the term, or the
idea. If there are several IFT terms in the text, the prediction can
be made from the most significant (highest)
IFT, or from statistics that take into account the
synergy of IFT terms when they occur together. An updated
forecast of III/IFT is obtained by regression analysis
using a number of indicators for the current year (IFT,
IFTm, external links) and similar indicators of previous
years.</p>
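      <p>A minimal sketch of the regression-based forecast using the current-year indicators (IFT, IFTm, external links), assuming scikit-learn; the training values are invented for illustration.</p>
      <preformat>
# Minimal sketch of forecasting next-year IFT/III from current-year indicators
# with a linear regression; all numbers are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# one row per term-year: [IFT, IFTm, external links]
X = np.array([[3.1, 2.0, 15], [0.8, 0.5, 3], [5.4, 4.1, 40], [2.2, 1.7, 12]])
y = np.array([3.5, 0.6, 6.0, 2.4])          # observed IFT in the following year

model = LinearRegression().fit(X, y)
print(model.predict([[4.0, 3.0, 25]]))      # forecast for a new term
      </preformat>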
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The Multilingual Index of Ideological Influence (III)
corresponds to the number of subsequent/future
articles/documents citing the source document that are
similar to the source document. We plan to consider a
number of index modifications taking into account the
cascade of citation (first and other levels) and the temporal
dynamics of the development of ideas. It is planned to
develop an algorithm for the updated forecast of III/IFT
using a number of indicators of the current year (IFT,
IFTm, external links) and similar indicators of previous
years.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>The reported study was funded by RFBR according to
the research projects № 18-07-00909, 19-07-00857 and
20-04-60185.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Jarmasz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szpakowicz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Roget's Thesaurus and Semantic Similarity</article-title>
          .
          <source>Recent Adv. Nat. Lang. Process. III Sel. Pap. from RANLP</source>
          <year>2003</year>
          , vol.
          <volume>111</volume>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Islam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inkpen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Unsupervised NearSynonym Choice using the Google Web 1T</article-title>
          .
          <source>ACM Trans. Knowl. Discov. Data</source>
          , vol. V, no.
          <source>June</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Semantic matching in search</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>7</volume>
          (
          <issue>5</issue>
          ):
          <fpage>343</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Aliguliyev</surname>
            <given-names>R. M.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>A new sentence similarity measure and sentence based extractive technique for automatic text summarization</article-title>
          .
          <source>Expert Systems with Applications</source>
          .
          <volume>36</volume>
          .
          <fpage>7764</fpage>
          -
          <lpage>7772</lpage>
          .
          10.1016/j.eswa.2008.11.022.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Wäschle</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Quantifying Cross-lingual Semantic Similarity for Natural Language Processing Applications</article-title>
          . Heidelberg. -
          <volume>139</volume>
          р.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Wäschle</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Riezler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Structural and topical dimensions in multi-task patent translation</article-title>
          .
          <source>In Proceedings of the 13th</source>
          <article-title>Conference of the European Chapter of the Association for Computational Linguistics (EACL)</article-title>
          .
          <source>Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , pages
          <fpage>818</fpage>
          -
          <lpage>828</lpage>
          , Avignon, France,
          <source>April 23 - 27</source>
          ,
          <year>2012</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Andersson</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>The Portability of Three Types of Text Mining Techniques into the Patent Text Genre</article-title>
          , chapter
          <volume>9</volume>
          , pages
          <fpage>241</fpage>
          -
          <lpage>280</lpage>
          . Springer Berlin. Heidelberg, Berlin, Heidelberg. ISBN 978-3-
          <fpage>662</fpage>
          -53817-3.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Eneko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Enrique</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keith</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jana</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marius</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Aitor</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>A study on similarity and relatedness using distributional and WordNet-based approaches</article-title>
          .
          <source>Proceedings of Human Language Technologies</source>
          :
          <article-title>The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics</article-title>
          (pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          ). Boulder, Colorado: Association for Computational Linguistics
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>W. Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Manning C.D.</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Bilingual word embeddings for phrase-based machine translation</article-title>
          .
          <source>In Proceedings of EMNLP</source>
          (pp.
          <fpage>1393</fpage>
          -
          <lpage>1398</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>de Melo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Wiktionary-based word embeddings</article-title>
          .
          <source>Proceedings of MT Summit XV</source>
          (pp.
          <fpage>346</fpage>
          -
          <lpage>359</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Ammar</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulcaire</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsvetkov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Massively multilingual word embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1602</source>
          .
          <year>01925</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Michael</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alon</surname>
            ,
            <given-names>Y. H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>David</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>From databases to data spaces: A new abstraction for information management</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>34</volume>
          (
          <issue>4</issue>
          ),
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Overview of the 6th International Competition on Plagiarism Detection</article-title>
          .
          <source>In PAN at CLEF 2014</source>
          . Sheffield, UK (pp.
          <fpage>845</fpage>
          -
          <lpage>876</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Ferrero</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Besacier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Agnes</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Using Word Embedding for Cross-Language Plagiarism Detection</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>
          ,
          <source>(EACL</source>
          <year>2017</year>
          ).
          <article-title>Association for Computational Linguistics</article-title>
          , Valencia, Spain, volume
          <volume>2</volume>
          (pp.
          <fpage>415</fpage>
          -
          <lpage>421</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Page</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winograd</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>The PageRank Citation Ranking: Bringing Order to the Web</article-title>
          .
          <source>In: Technical Report</source>
          . Stanford University, Stanford,
          <year>1998</year>
          . http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>