=Paper=
{{Paper
|id=Vol-2763/CPT2020_paper_s3-9
|storemode=property
|title=Approaches to assessing the semantic similarity of texts in a multilingual space
|pdfUrl=https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-9.pdf
|volume=Vol-2763
|authors=Aida Khakimova,Michael Charnine,Alexey Klokov,Evgenii Sokolov
}}
==Approaches to assessing the semantic similarity of texts in a multilingual space==
A.Kh. Khakimova1, M.M. Charnine2, A.A. Klokov2, E.G. Sokolov2
aida_khatif@mail.ru | mc@keywen.com | aaklokov@yandex.ru | evgeny.sokolov@phystech.edu
1 ANO «Scientific and Research Center for Information in Physics and Technique», Nizhny Novgorod, Russia;
2 FRC CSC of the Russian Academy of Sciences, Moscow, Russia
This paper develops a methodology for evaluating the semantic similarity of texts in different languages. The study is based on the hypothesis that the proximity of vector representations of terms in a semantic space can be interpreted as semantic similarity in a cross-lingual environment. Each text is associated with a vector in a single multilingual semantic vector space, and the semantic similarity of two texts is determined by the proximity of the corresponding vectors. We propose a quantitative indicator, the Index of Semantic Textual Similarity (ISTS), that measures the degree of semantic similarity of multilingual texts on the basis of identified cross-lingual implicit semantic links. The parameters of the method are set by correlating with the presence of formal references between documents. The measure of semantic similarity reflects the existence of common terms, phrases or word combinations in two texts. Optimal parameters of the algorithm for identifying implicit links are selected on a thematic collection by maximizing the correlation between explicit and implicit links. The developed algorithm can facilitate the search for closely related documents in the analysis of multilingual patent documentation.

Keywords: cross-lingual semantic similarity, semantic textual similarity measure, semantic implicit links, collection of documents, measure of similarity of texts, method of relevant phrases, vector representations for words.
1. Introduction

As cross-language information retrieval gets more attention, tools to measure cross-language semantic similarity between documents become necessary. An accurate assessment of the actual similarity between documents is fundamental for many automatic text analysis applications, such as thesaurus generation [1], machine translation [2], information search [3], and automatic summarization [4].

Text mining and knowledge management technologies play a key role in many areas, including critical infrastructures. Information search, document classification, business analytics, forecasting technologies, etc. are currently among the most important activities.

Patent search, including monitoring competitors, checking the novelty of an invention, or searching for technical solutions in other fields of application, requires a lot of effort.

Comparing documents in different languages is challenging for natural language processing applications, and especially for machine translation.

Cross-language matching of documents is carried out in a patent search when an invention is to be protected in more than one country or region. A separate patent must be filed with several patent offices in different languages. Before applying for a patent, applicants conduct a preliminary search for patents or documents disclosing intellectual property similar to the filed invention. In such a process, a set of patents is requested in one language, using the source document in another language as a query.

To compare the retrieved documents, cross-language similarity assessment functions are necessary. This task can be formulated as discarding text pairs that are not semantically equivalent [5]. The task is complicated by the fact that when an invention is filed in different countries, different standards may be used, which may lead to discrepancies between versions of the document in different languages. In this case, the task of identifying semantic equivalents is complicated [6].

Natural language processing methods for text analysis and data mining are used in the analysis of many types of technical documentation. Functional analysis methods are based on extracting interactions between the entities described in the document.

Linguistic analysis tools make it possible to identify key elements of a document by combining morphological, syntactic, and semantic analysis. Applying linguistic analysis methods to patent documents allows for accelerated analysis and comparison of patents.

The purpose of the analysis of technical documentation is to discover possible ambiguities or incompleteness on the one hand, and to understand the requirements with a view to possible formalization on the other.

The main problem here is that keyword searches do not take into account synonyms or more abstract terms associated with the given query words. This means that if a synonym is used for an important term in a patent application, for example, "wire" instead of "cable", a keyword search may not reveal this relationship if the alternative term was not explicitly included in the search query. This is relevant since patent texts often use abstract and general terms to describe the invention in order to maximize protection [7].

If we consider the Internet as a multilingual database, a typical problem when searching for information is the search for relevant documents in a collection by some key terms, or by the example of a corresponding document. Assessing the semantic similarity between words (phrases) is critical to assessing whether a document meets user needs. Many information retrieval systems, such as online library catalog systems and web search engines, deal with multilingual documents and must have tools to measure cross-language semantic similarity.

In recent decades, many studies have been carried out aimed at improving the effectiveness of measures of semantic similarity of words. However, studies of semantic similarity mainly focus on English. This is partly
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
due to the limited availability of similarity criteria for words in languages other than English. Since the development of multilingual methods is necessary, there is an urgent need to find a reliable basis for assessing multilingual and interlingual semantic similarity.

Although many areas require a multilingual measurement of semantic similarity, most algorithms measure semantic similarity between words of the same language. Cross-language similarity was first described in 2009 [8] for English-Spanish cross-language data sets. Over the past few years, multilingual word embeddings, which place lexical elements from several languages in a single semantic space, have attracted considerable attention from researchers [9-11].

Interlanguage applications are based on data mining methods such as text clustering, which includes extracting words or phrases from documents as features, representing documents as feature vectors, and then grouping documents into clusters based on the similarity of feature vectors. In a multilingual document collection, the extracted features will refer to multilingual words. Therefore, it is important to measure the similarity between words not only of one language, but also of different languages.

According to the concept of the information data space [12], the information space should model a rich set of relationships between data repositories. To model the relationships between data repositories in data spaces, a component is needed that can measure the semantic similarity between interlanguage pairs. Sources in a data space can be relational databases, XML repositories, text databases, web services, etc.

The problem of plagiarism detection in a monolingual context is well studied [13]. Free machine translation tools help spread cross-language plagiarism (plagiarism by translation). In this relatively new field of research, the definition of semantic text similarity for language pairs has been investigated. The authors studied various existing approaches to detecting plagiarism on different language pairs and found that if a method is effective for a particular language pair, it will be equally effective for another language pair with a sufficient number of available lexical resources, i.e. the method can be optimized for one case and effectively applied to another [14].

2. Methodology for calculating the assessment of semantic similarity

The technique includes the following steps:
1) pre-processing of texts by replacing their terms with synset codes;
2) construction of quotation vectors by identifying common rare phrases (long quotes) in different documents using the relevant phrases method;
3) thematic analysis of the processed texts and construction of a set of available topics and corresponding thematic document vectors using the LDA method, with the possibility of further clustering documents by topics/ideas into "baskets"/clusters;
4) construction, for each document, of an extended vector describing the presence of long citations, the statistics of the synsets included in it, and their thematic composition, i.e. the document vector is the concatenation of the citation vector, the thematic vector and the synset statistics vector;
5) calculation of the similarity index between articles/documents (Index of Semantic Textual Similarity, ISTS) as the cosine measure of the corresponding article vectors;
6) calculation of the correlation between the formal connectedness of articles and their similarity index, taking into account the minimum and maximum thresholds of the ISTS;
7) selection of the values of the various calculation parameters (ISTS thresholds) based on the maximum correlation.

The calculation method is selected according to the maximum correlation of ISTS with formal links.

The algorithm for the vector transformation of terms is based on recurrent neural networks (RNN) (Fig. 1).
Fig. 1. Graphs of the number of points for calculating the correlations of the current and future years for the indicators IFTm (upper) and IFT, depending on the number of articles with the word in the last 3 years
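Steps 4 and 5 of the methodology, concatenating each document's citation, thematic and synset-statistics vectors and comparing documents by cosine, can be sketched as follows. The vector names and dimensions are illustrative assumptions, not the actual representation used in the experiments:

```python
import numpy as np

def ists(doc_a, doc_b):
    """Concatenate the citation, thematic and synset-statistics vectors
    of each document (step 4) and take the cosine of the results (step 5)."""
    va = np.concatenate([doc_a["citation"], doc_a["thematic"], doc_a["synset"]])
    vb = np.concatenate([doc_b["citation"], doc_b["thematic"], doc_b["synset"]])
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Toy documents with hypothetical component dimensions.
rng = np.random.default_rng(0)
make_doc = lambda: {"citation": rng.random(5), "thematic": rng.random(8),
                    "synset": rng.random(20)}
a, b = make_doc(), make_doc()
score = ists(a, b)   # ISTS of two distinct documents
same = ists(a, a)    # identical documents give cosine 1
```

For non-negative component vectors the resulting index always lies in [0, 1], which makes it directly usable as a threshold quantity in steps 6-7.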
RNN is used for tasks involving sequences of words and phrases. Formally, at each step (after each newly processed word), an RNN estimates for each word in the corpus the probability that it will be the next word. In this work, LSTM neurons, a special case of RNN, were used. Moreover, a bidirectional recurrent biLSTM network was used. biLSTM is a combination of two LSTM networks, in which one network builds a language model from the beginning of the sentence and the second from the end.

We used the simplest sequential model, consisting of two layers. For the software implementation of the proposed architecture in Python, the Jupyter Notebook development environment was used. A linear layer was attached to the biLSTM layer to solve the classification problem (Fig. 2).
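The two-layer architecture just described, a bidirectional recurrent layer feeding a linear classification layer, can be illustrated with a simplified NumPy sketch. It uses plain tanh RNN cells instead of LSTM cells and random untrained weights, so it only shows the data flow, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(inputs, Wx, Wh, b):
    """Run a simple tanh recurrent cell over a sequence of word
    embeddings and return the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

def bidirectional_classify(seq, params):
    """Process the sequence left-to-right and right-to-left, concatenate
    the two final states, and apply a linear (logistic) output layer."""
    Wx_f, Wh_f, b_f, Wx_b, Wh_b, b_b, w_out, b_out = params
    h_fwd = rnn_pass(seq, Wx_f, Wh_f, b_f)        # forward pass
    h_bwd = rnn_pass(seq[::-1], Wx_b, Wh_b, b_b)  # backward pass
    h = np.concatenate([h_fwd, h_bwd])            # bidirectional state
    logit = w_out @ h + b_out                     # linear layer
    return 1.0 / (1.0 + np.exp(-logit))           # probability of a link

emb_dim, hid = 300, 16  # 300-dim embeddings (as in the paper), toy hidden size
params = (
    rng.normal(0, 0.1, (hid, emb_dim)), rng.normal(0, 0.1, (hid, hid)),
    np.zeros(hid),
    rng.normal(0, 0.1, (hid, emb_dim)), rng.normal(0, 0.1, (hid, hid)),
    np.zeros(hid),
    rng.normal(0, 0.1, 2 * hid), 0.0,
)
title = rng.normal(size=(7, emb_dim))  # a 7-word title, already embedded
p = bidirectional_classify(title, params)
```

In practice a deep-learning framework would supply trained LSTM cells; the point here is only the combination of forward and backward passes with a linear output head.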
At the input of the neural network, vector representations of words (embeddings) were applied. Word2Vec was used to convert each word from the title of an article into a numeric vector. In the experiments, vectors of dimension 300 were used (Word2Vec from the gensim library allows changing the embedding dimension).

In our experiments, we consider the DBLP citation network, a collection of articles on artificial intelligence compiled by aminer.org. In this study, we intentionally relied only on the title of a publication and its links.

During the experiments, various models of the neural network were tested. Experiments were conducted with a varying number of neurons in the biLSTM layer (4, 8, 16, 32, 64, 128) and in the linear layer (from 0 to 10). The best model gave an accuracy of 0.6131 according to the ROC AUC metric. The time for calculating the forecast and evaluating its accuracy was about 1 hour.

To combine articles with similar topics into clusters, we used generally accepted natural language processing (NLP) approaches: clustering articles using the Latent Dirichlet Allocation (LDA) method and visualizing the results with Python libraries. After extracting the data, preprocessing it, tokenizing, stemming and deleting stop words, we applied the Latent Dirichlet Allocation (LDA) algorithm (Fig. 2).

LDA is a hierarchical Bayesian model consisting of two levels: at the first level, a mixture whose components correspond to "topics"; at the second level, a multinomial variable with an a priori Dirichlet distribution that defines the "distribution of topics" in the document.

The principle of the model:
1) select the document length N;
2) select a vector θ ~ Dir(α), the vector of the "degree of expression" of each topic in this document;
3) for each of the N words w:
˗ choose a topic z_n from the distribution Mult(θ);
˗ choose a word w_n ~ p(w_n | z_n, β) with the probabilities given in β.

For simplicity, we fix the number of topics k and assume that β is just a set of parameters β_ij = p(w_j = 1 | z_i = 1), which need to be estimated, and we do not worry about the distribution over N. The joint distribution then looks like this:

p(θ, z, w | α, β) = p(θ | α) ∏_n p(z_n | θ) p(w_n | z_n, β)

Fig. 2. Scheme of the LDA model

Unlike conventional clustering with an a priori Dirichlet distribution, we do not select a cluster once and then look for words from this cluster; instead, for each word we first select a topic from the distribution θ, and only then relate the word to this topic.

At the output, after training the LDA model, thematic vectors θ are obtained, showing how topics are distributed in each document, and distributions β, which show which words are more likely in certain topics. In our case, we obtained 8 pronounced clusters corresponding to the following directions:
1) computing systems and algorithms in them;
2) bioinformatics and data processing methods in it;
3) signal processing;
4) optimization methods and algorithms based on them;
5) problems related to theoretical informatics and computational complexity;
6) neural and computing networks;
7) issues regarding natural language processing (NLP) and programming languages;
8) robotics and self-learning systems (Reinforcement Learning).

After the previous step, n-dimensional thematic vectors of articles are obtained. To compress the results into a two-dimensional vector space, the t-SNE machine learning algorithm was used. To visualize the clusters, we used an interface written in JavaScript (Fig. 3).

The previous approach was based on a comparison of vectors at the megalemma level using a cosine measure, which determined the semantic similarity of the texts. As a development of this approach, based on the assumption that, while the semantic similarity of phrases is maintained, the ideas in them can be expressed in different words, we use the Impact Factor of the Term (IFT) to assess the similarity of documents.

To compare articles expressing new ideas, we use the hypothesis that new ideas are often expressed in terms with a high impact factor IFT. IFT is determined by the average number of links to articles with this term: the higher the IFT, the higher the citation trend and the number of formal links. If a pair of articles share a term with a high IFT, the probability of a formal link between them will be high.

Using multilingual synsets built for high-IFT terms (IFT terms), one can evaluate the similarity of articles in any language. If there is a semantic similarity, estimated by a cosine measure, it can be assumed that articles with this term will be cited with some probability.

If previously the similarity of megalemma vectors determined the similarity of texts, now we use extended vectors based on common rare phrases, megalemmas and multilingual IFT synsets, as well as the results of thematic analysis. The similarity of extended vectors more accurately reflects the similarity of texts, since it takes into account not only semantic, but also thematic similarity.
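The LDA generative procedure described above (draw a topic mixture θ ~ Dir(α), then for each of the N words draw a topic z_n ~ Mult(θ) and a word w_n ~ p(w_n | z_n, β)) can be sketched as follows. The vocabulary, the prior α and the number of topics are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["network", "signal", "genome", "proof", "robot"]
k, n_words = 3, 8                     # number of topics k, document length N
alpha = np.full(k, 0.5)               # Dirichlet prior on topic mixtures
# beta[i, j] = p(w_j | z_i): per-topic word distributions (here random)
beta = rng.dirichlet(np.ones(len(vocab)), size=k)

theta = rng.dirichlet(alpha)          # theta ~ Dir(alpha): topic mixture
doc = []
for _ in range(n_words):              # for each of the N words:
    z = rng.choice(k, p=theta)        #   choose a topic z_n ~ Mult(theta)
    w = rng.choice(len(vocab), p=beta[z])  # choose a word w_n ~ p(w | z_n, beta)
    doc.append(vocab[w])
```

Training LDA inverts this process: given only the documents, it estimates θ for each document and β for each topic.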
Fig. 3. Cluster states in 1993. 1) computing systems and algorithms in them (pink); 2) bioinformatics and data processing methods in it
(purple); 3) signal processing (brown); 4) optimization methods and algorithms based on them (green); 5) problems related to
theoretical informatics and computational complexity (orange); 6) neural and computing networks (red); 7) issues regarding natural
language processing (NLP) and programming languages (blue); 8) robotics, and self-learning systems (Reinforcement Learning) (dark
orange); yellow - a “garbage” cluster with articles in German
Our study is based on a model of representing ideas in the form of sets of terms and similar phrases in a multilingual semantic field, and on the hypothesis that the proximity of vector representations of terms in a multilingual vector semantic space can be interpreted as semantic similarity in an interlanguage environment. We propose a method of formalizing ideas by using terms with high IFT and megalemmas, which makes it possible to recognize an idea expressed in different words. References, both formal (bibliographic) and contextual (implicit, expressed by matching IFT terms), are an expression of the connection between ideas.

High-IFT terms are significant terms (or ideologically significant ones). If texts have the same vector over the IFT synsets, this indicates the presence of common ideas in these texts and a significant similarity related to citation. The similarity of megalemma vectors also correlates with formal links (as our previous experiments showed), but to a much lesser extent. It has been shown that megalemmas have a very low impact factor.

It should be noted that similarity of megalemma vectors is more applicable to texts with a common vocabulary; in this case, the degree of coincidence of their thematic composition, as a set of popular words, is calculated. The approach of calculating the similarity of IFT/megalemma vectors is focused on comparing the similarity of scientific texts with specific terminology, despite the fact that ideas can have different lexical expressions. Therefore, in the second case, it becomes possible to assess similarity more accurately from the point of view of ideological similarity, since terms with a high IFT are significant terms denoting ideas.

Three types of semantic similarity can be considered (based on implicit references): 1) similarity of the thematic composition of popular/common words (word frequency from 10 thousand or more); 2) the presence of common significant IFT terms denoting specific ideas (frequency 5-1000); 3) the presence of common rare phrases (long quotations) (frequency 2-100). These types differ in the frequency of the matching terms/phrases: the highest frequency is typical for popular terms and megalemmas, the lowest for common rare phrases. The proposed similarity assessment algorithm takes all these types of similarity into account, giving them appropriate weights. Thus, when identifying similarities and implicit references, the entire frequency range of terms and phrases is used.

So, we build extended vectors from megalemmas and multilingual IFT synsets, and these can be weighted vectors whose elements carry weights. The larger the impact factor, the higher the likelihood of a formal link and the higher the weight of the vector element. The cosine measure can work with weighted vectors in which elements take large real values. Since our task is to search for semantic similarity of articles that correlates with the presence of formal links, increasing the weights of IFT synsets in the extended vectors improves the quality of the proposed algorithm.

Therefore, the algorithm for calculating ISTS is based on assessing, by a cosine measure, the similarity of vectors expanded by adding multilingual IFT synsets and weights, in order to determine the similarity of texts. This takes into account the presence of formal links between texts containing matching IFT terms. The method may contain options that are determined/selected by an optimization method according to the maximum correlation of ISTS with formal links.

The first version of the methodology for calculating the multilingual Index of Ideological Influence (III), defined as the number of similar subsequent/future articles/documents, has been developed.

We consider similar subsequent articles to be articles that will cite this document, i.e. those articles are similar that are linked by formal links. Thus, the III looks for trending articles containing trending IFT terms. A second-level III can also be calculated: since one idea gives rise to another, one can search for articles similar to the articles found at the first stage (indirect similarity). The mutual influence of articles is calculated using the PageRank algorithm [15], which increases the significance/influence of texts/articles the more (implicit) links they have with other significant/influential texts.

IFT terms in scientific articles have an expiration date. The value of IFT is higher in the first years (3-4 years), and then it decreases (Fig. 4).
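The IFT of a term, defined earlier as the average number of links to articles containing that term, can be sketched on a toy corpus. The article term sets and citation counts below are hypothetical:

```python
# Toy corpus: (set of title terms, number of citations received).
articles = [
    ({"recurrent", "neural", "networks"}, 120),
    ({"fuzzy", "neural", "control"}, 45),
    ({"signal", "processing"}, 10),
    ({"neural", "machine", "translation"}, 80),
]

def ift(term, corpus):
    """Average number of citations of the articles containing the term."""
    cites = [c for terms, c in corpus if term in terms]
    return sum(cites) / len(cites) if cites else 0.0

ift_neural = ift("neural", articles)   # (120 + 45 + 80) / 3
```

A high value of this average signals a trending term, which is the basis for weighting IFT synsets in the extended vectors.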
Fig. 4. Graphs of the average values of the IFT term, depending on the number of articles with these terms and the speed of the trend. 1 - 0 years, 2 - 1 year, 3 - 2 years, 4 - 3 years, 5 - 4 years, 6 - 5 years
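The PageRank refinement used for the mutual influence of articles can be sketched as a standard power iteration over an article link matrix. The 4-article link matrix is a toy assumption:

```python
import numpy as np

def pagerank(links, d=0.85, iters=50):
    """Power iteration over a column-normalized link matrix.
    links[i, j] = 1 if article j links (explicitly or implicitly) to article i."""
    n = links.shape[0]
    col = links.sum(axis=0)
    # Normalize columns so each article's out-links sum to 1;
    # articles with no out-links spread their weight uniformly.
    M = np.where(col > 0, links / np.maximum(col, 1e-12), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)
    return r / r.sum()

links = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
rank = pagerank(links)  # influence score per article, summing to 1
```

Articles that accumulate many links from other influential articles receive a higher score, which is how the forecast of III/IFT is refined.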
Over time, some important terms are replaced by others. If, in addition to the IFT, the year when the term was of high importance is introduced into the vector for the term, then some information about the age of an article can also be obtained from its vector, which makes it possible to find shared ideas of a certain age when comparing articles. This provides information on the dynamics of the development of ideas. For example, the term NEURAL NETWORKS has a long history, and in different years various derivatives of this term were significant IFT terms, for example, FUZZY NEURAL or RECURRENT neural networks.

So, the methodology for calculating the III contains the following steps:
1) search in the article for significant IFT terms;
2) compiling multilingual IFT synsets for these IFT terms;
3) on the basis of the IFT synsets, determining the forecast (regression analysis according to previous values of IFT and trend parameters);
4) refinement of the forecast using the PageRank algorithm [15], which increases the significance/influence of texts/ideas the more (implicit) connections they have with other significant/influential texts.
In this case, implicit links between texts/articles are determined using the methodology for calculating the Index of Semantic Textual Similarity (ISTS).

3. Results

As a result, we see the following pattern: the higher the forecast of the IFT, the higher the III of the document. The predictive value of the IFT is the same for a text, a term, or an idea. If there are several IFT terms in a text, a prediction can be made according to the most significant/highest IFT, or according to statistics that take into account the synergy of IFT terms when they occur together. An updated forecast of III/IFT is carried out by regression analysis using a number of indicators for the current year (IFT, IFTm, external links) and similar indicators of previous years.

4. Conclusion

The Multilingual Index of Ideological Influence (III) corresponds to the number of subsequent/future articles/documents citing the source document that are similar to the source document. We plan to consider a number of index modifications taking into account the cascade of citation (first and other levels) and the temporal dynamics of the development of ideas. It is planned to develop an algorithm for the updated forecast of III/IFT using a number of indicators of the current year (IFT, IFTm, external links) and similar indicators of previous years.

Acknowledgment

The reported study was funded by RFBR according to the research projects № 18-07-00909, 19-07-00857 and 20-04-60185.

References

[1] Jarmasz, M., Szpakowicz, S. (2003). Roget's Thesaurus and Semantic Similarity. Recent Advances in Natural Language Processing III: Selected Papers from RANLP 2003, vol. 111, 2004.
[2] Islam, A., Inkpen, D. (2012). Unsupervised Near-Synonym Choice using the Google Web 1T. ACM Transactions on Knowledge Discovery from Data, vol. V, pp. 1-19.
[3] Li, H., Xu, J. (2014). Semantic matching in search. Foundations and Trends in Information Retrieval, 7(5):343-469.
[4] Aliguliyev, R. M. (2009). A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Systems with Applications, 36, 7764-7772. 10.1016/j.eswa.2008.11.022.
[5] Wäschle, K. (2015). Quantifying Cross-lingual Semantic Similarity for Natural Language Processing Applications. Heidelberg. 139 p.
[6] Wäschle, K. and Riezler, S. (2012). Structural and topical dimensions in multi-task patent translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 818-828, Avignon, France, April 23-27, 2012.
[7] Andersson, L., Hanbury, A. and Rauber, A. (2017). The Portability of Three Types of Text Mining Techniques into the Patent Text Genre, chapter 9, pages 241-280. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-662-53817-3.
[8] Eneko, A., Enrique, A., Keith, H., Jana, K., Marius, P., & Aitor, S. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 19-27). Boulder, Colorado: Association for Computational Linguistics.
[9] Zou, W. Y., Socher, R., Cer, D. M. and Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP (pp. 1393-1398).
[10] de Melo, G. (2015). Wiktionary-based word embeddings. Proceedings of MT Summit XV (pp. 346-359).
[11] Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C. and Smith, N. A. (2016). Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
[12] Michael, J. F., Alon, Y. H., & David, M. (2005). From databases to dataspaces: A new abstraction for information management. SIGMOD Record, 34(4), 27-33.
[13] Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P. and Stein, B. (2014). Overview of the 6th International Competition on Plagiarism Detection. In PAN at CLEF 2014, Sheffield, UK (pp. 845-876).
[14] Ferrero, J., Besacier, L., Schwab, D. & Agnes, F. (2017). Using Word Embedding for Cross-Language Plagiarism Detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Valencia, Spain, volume 2 (pp. 415-421). Association for Computational Linguistics.
[15] Page, L., Brin, S., Motwani, R., Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford University, Stanford, 1998. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

About the authors

Khakimova Aida Kh., PhD, docent, Kama Institute (Naberezhnye Chelny, Russia), ANO «Scientific and Research Center for Information in Physics and Technique» (Nizhny Novgorod, Russia), E-mail: aida_khatif@mail.ru
Charnine Mikhail M., PhD, Senior Researcher, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail: mc@keywen.com
Klokov Alexey A., graduate student, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail: aaklokov@yandex.ru
Sokolov Evgenii G., graduate student, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail: evgeny.sokolov@phystech.edu