=Paper=
{{Paper
|id=Vol-2763/CPT2020_paper_s3-9
|storemode=property
|title=Approaches to assessing the semantic similarity of texts in a multilingual space
|pdfUrl=https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-9.pdf
|volume=Vol-2763
|authors=Aida Khakimova,Michael Charnine,Alexey Klokov,Evgenii Sokolov
}}
==Approaches to assessing the semantic similarity of texts in a multilingual space==
A.Kh. Khakimova1, M.M. Charnine2, A.A. Klokov2, E.G. Sokolov2
aida_khatif@mail.ru | mc@keywen.com | aaklokov@yandex.ru | evgeny.sokolov@phystech.edu
1 ANO «Scientific and Research Center for Information in Physics and Technique», Nizhny Novgorod, Russia;
2 FRC CSC of the Russian Academy of Sciences, Moscow, Russia
This paper develops a methodology for evaluating the semantic similarity of texts in different languages. The study is based on the hypothesis that the proximity of vector representations of terms in a semantic space can be interpreted as semantic similarity in a cross-lingual environment. Each text is associated with a vector in a single multilingual semantic vector space, and the semantic similarity of two texts is determined by the proximity of the corresponding vectors. We propose a quantitative indicator, the Index of Semantic Textual Similarity (ISTS), that measures the degree of semantic similarity of multilingual texts on the basis of identified cross-lingual implicit semantic links. The parameters of the method are set by correlating with the presence of formal references between documents. The measure of semantic similarity reflects the existence of common terms, phrases or word combinations in two texts. Optimal parameters of the algorithm for identifying implicit links are selected on a thematic collection by maximizing the correlation between explicit and implicit links. The developed algorithm can facilitate the search for closely related documents in the analysis of multilingual patent documentation.

Keywords: cross-lingual semantic similarity, semantic textual similarity measure, semantic implicit links, collection of documents, measure of similarity of texts, method of relevant phrases, vector representations for words.
1. Introduction

As cross-language information retrieval gets more attention, tools to measure cross-language semantic similarity between documents become necessary. An accurate assessment of the actual similarity between documents is fundamental for many automatic text analysis applications, such as thesaurus generation [1], machine translation [2], information search [3], and automatic summarization [4].

Text mining and knowledge management technologies play a key role in many areas, including critical infrastructures. Information search, document classification, business analytics, forecasting technologies, etc. are currently among the most important activities.

Patent search, including monitoring competitors, checking the novelty of an invention, or searching for technical solutions in other fields of application, requires a lot of effort.

Comparing documents in different languages is challenging for natural language processing applications, and especially for machine translation.

Cross-language matching of documents is carried out in a patent search when an invention is to be protected in more than one country or region. A separate patent must be filed with several patent offices in different languages. Before applying for a patent, applicants conduct a preliminary search for patents or documents disclosing intellectual property similar to the filed invention. In such a process, a set of patents is requested in one language, using the source document in another language as a query.

To compare the retrieved documents, cross-language similarity assessment functions are necessary. This task can be formulated as discarding text pairs that are not semantically equivalent [5]. The task is complicated by the fact that when an invention is filed in different countries, different standards may be used, which may lead to discrepancies between versions of the document in different languages. In this case, the task of identifying semantic equivalents is complicated [6].

Natural language processing methods for text analysis and data mining are used in the analysis of many types of technical documentation. Functional analysis methods are based on extracting interactions between the entities described in the document.

Linguistic analysis tools make it possible to identify key elements of a document by combining morphological, syntactic, and semantic analysis. Applying linguistic analysis methods to patent documents allows for accelerated analysis and comparison of patents.

The purpose of the analysis of technical documentation is to discover possible ambiguities or incompleteness on the one hand, and to understand the requirements with a view to possible formalization on the other.

The main problem here is that keyword searches do not take into account synonyms or more abstract terms associated with the given query words. This means that if a synonym is used for an important term in a patent application, for example, "wire" instead of "cable", a keyword search may not reveal this relationship if the alternative term was not explicitly included in the search query. This is relevant since patent texts often use abstract and general terms to describe the invention in order to maximize protection [7].

If we consider the Internet as a multilingual database, a typical problem when searching for information is the search for relevant documents in a collection by some key terms, or by the example of a corresponding document. Assessing the semantic similarity between words (phrases) is critical to assessing whether a document meets user needs. Many information retrieval systems, such as online library catalog systems and web search engines, deal with multilingual documents and must have tools to measure cross-language semantic similarity.

In recent decades, many studies have been carried out aimed at improving the effectiveness of measures of semantic similarity of words. However, studies of semantic similarity mainly focus on English. This is partly
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
due to the limited availability of similarity criteria for words in languages other than English. Since the development of multilingual methods is necessary, there is an urgent need to find a reliable basis for assessing multilingual and interlingual semantic similarity.

Although many areas require a multilingual measurement of semantic similarity, most algorithms measure semantic similarity between words of the same language. Cross-language similarity was first described in 2009 [8] for English-Spanish cross-language data sets. Over the past few years, multilingual word embeddings, which place lexical elements from several languages in a single semantic space, have attracted considerable attention from researchers [9-11].

Interlanguage applications are based on data mining methods such as text clustering, which includes extracting words or phrases from documents as features, representing documents as feature vectors, and then grouping documents into clusters based on the similarity of feature vectors. In a multilingual document collection, the extracted features will refer to multilingual words. Therefore, it is important to measure the similarity between words not only of one language, but also of different languages.

According to the concept of the information data space [12], the information space should model a rich set of relationships between data repositories. To model the relationships between data repositories in data spaces, a component is needed that can measure the semantic similarity between interlanguage pairs. Sources in a data space can be relational databases, XML repositories, text databases, web services, etc.

The problem of plagiarism detection in a monolingual context is well studied [13]. Free machine translation tools help spread cross-language plagiarism (plagiarism by translation). In this relatively new field of research, the definition of semantic text similarity for language pairs has been investigated. The authors studied various existing approaches to detecting plagiarism on different language pairs and found that if a method is effective for a particular language pair, it will be equally effective for another language pair with a sufficient number of available lexical resources, i.e. the method can be optimized for one case and effectively applied to another [14].

2. Methodology for calculating the assessment of semantic similarity

The technique includes the following steps:
1) pre-processing of texts by replacing their terms with synset codes;
2) construction of quotation vectors by identifying common rare phrases (long quotes) in different documents using the relevant phrases method;
3) thematic analysis of the processed texts and construction of a set of available topics and corresponding thematic document vectors using the LDA method, with the possibility of further clustering documents by topics/ideas into "baskets"/clusters;
4) construction, for each document, of an extended vector describing the presence of long citations, the statistics of the synsets included in it, and their thematic composition, i.e. the document vector is the concatenation of the citation vector, the thematic vector and the synset statistics vector;
5) calculation of the similarity index between articles/documents (Index of Semantic Textual Similarity, ISTS) as the cosine measure of the corresponding article vectors;
6) calculation of the correlation between the formal connectedness of articles and their similarity index, taking into account the minimum and maximum thresholds of the ISTS;
7) selection of the values of the various calculation parameters (ISTS thresholds) based on the maximum correlation.

The calculation method is selected according to the maximum correlation of ISTS with formal links.

The algorithm for the vector transformation of terms is based on recurrent neural networks (RNN) (Fig. 1).
Fig. 1. Graphs of the number of points for calculating the correlations of the current and future years for the indicators IFTm (upper) and IFT, depending on the number of articles with the word in the last 3 years
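Steps 4 and 5 of the methodology, concatenating each document's citation, thematic and synset-statistics vectors and comparing documents by cosine, can be sketched as follows. The vector names and dimensions are illustrative assumptions, not the actual representation used in the experiments:

```python
import numpy as np

def ists(doc_a, doc_b):
    """Concatenate the citation, thematic and synset-statistics vectors
    of each document (step 4) and take the cosine of the results (step 5)."""
    va = np.concatenate([doc_a["citation"], doc_a["thematic"], doc_a["synset"]])
    vb = np.concatenate([doc_b["citation"], doc_b["thematic"], doc_b["synset"]])
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Toy documents with hypothetical component dimensions.
rng = np.random.default_rng(0)
make_doc = lambda: {"citation": rng.random(5), "thematic": rng.random(8),
                    "synset": rng.random(20)}
a, b = make_doc(), make_doc()
score = ists(a, b)   # ISTS of two distinct documents
same = ists(a, a)    # identical documents give cosine 1
```

For non-negative component vectors the resulting index always lies in [0, 1], which makes it directly usable as a threshold quantity in steps 6-7.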
RNN is used for tasks involving sequences of words and phrases. Formally, at each step (after each newly processed word), an RNN estimates for each word in the corpus the probability that it will be the next word. In this work, LSTM neurons, a special case of RNN, were used. Moreover, a bidirectional recurrent biLSTM network was used. biLSTM is a combination of two LSTM networks, in which one network builds a language model from the beginning of the sentence and the second from the end.

We used the simplest sequential model, consisting of two layers. For the software implementation of the proposed architecture in Python, the Jupyter Notebook development environment was used. A linear layer was attached to the biLSTM layer to solve the classification problem (Fig. 2).
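The two-layer architecture just described, a bidirectional recurrent layer feeding a linear classification layer, can be illustrated with a simplified NumPy sketch. It uses plain tanh RNN cells instead of LSTM cells and random untrained weights, so it only shows the data flow, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(inputs, Wx, Wh, b):
    """Run a simple tanh recurrent cell over a sequence of word
    embeddings and return the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in inputs:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

def bidirectional_classify(seq, params):
    """Process the sequence left-to-right and right-to-left, concatenate
    the two final states, and apply a linear (logistic) output layer."""
    Wx_f, Wh_f, b_f, Wx_b, Wh_b, b_b, w_out, b_out = params
    h_fwd = rnn_pass(seq, Wx_f, Wh_f, b_f)        # forward pass
    h_bwd = rnn_pass(seq[::-1], Wx_b, Wh_b, b_b)  # backward pass
    h = np.concatenate([h_fwd, h_bwd])            # bidirectional state
    logit = w_out @ h + b_out                     # linear layer
    return 1.0 / (1.0 + np.exp(-logit))           # probability of a link

emb_dim, hid = 300, 16  # 300-dim embeddings (as in the paper), toy hidden size
params = (
    rng.normal(0, 0.1, (hid, emb_dim)), rng.normal(0, 0.1, (hid, hid)),
    np.zeros(hid),
    rng.normal(0, 0.1, (hid, emb_dim)), rng.normal(0, 0.1, (hid, hid)),
    np.zeros(hid),
    rng.normal(0, 0.1, 2 * hid), 0.0,
)
title = rng.normal(size=(7, emb_dim))  # a 7-word title, already embedded
p = bidirectional_classify(title, params)
```

In practice a deep-learning framework would supply trained LSTM cells; the point here is only the combination of forward and backward passes with a linear output head.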
At the input of the neural network, vector representations of words (embeddings) were applied. Word2Vec was used to convert each word from the title of an article into a numeric vector. In the experiments, vectors of dimension 300 were used (Word2Vec from the gensim library allows changing the embedding dimension).

In our experiments, we consider the DBLP citation network, a collection of articles on artificial intelligence compiled by aminer.org. In this study, we intentionally relied only on the title of a publication and its links.

During the experiments, various models of the neural network were tested. Experiments were conducted with a varying number of neurons in the biLSTM layer (4, 8, 16, 32, 64, 128) and in the linear layer (from 0 to 10). The best model gave an accuracy of 0.6131 according to the ROC AUC metric. The time for calculating the forecast and evaluating its accuracy was about 1 hour.

To combine articles with similar topics into clusters, we used generally accepted natural language processing (NLP) approaches: clustering articles using the Latent Dirichlet Allocation (LDA) method and visualizing the results with Python libraries. After extracting the data, preprocessing it, tokenizing, stemming and deleting stop words, we applied the Latent Dirichlet Allocation (LDA) algorithm (Fig. 2).

LDA is a hierarchical Bayesian model consisting of two levels: at the first level, a mixture whose components correspond to "topics"; at the second level, a multinomial variable with an a priori Dirichlet distribution that defines the "distribution of topics" in the document.

The principle of the model:
1) select the document length N;
2) select a vector θ ~ Dir(α), the vector of the "degree of expression" of each topic in this document;
3) for each of the N words w:
˗ choose a topic z_n from the distribution Mult(θ);
˗ choose a word w_n ~ p(w_n | z_n, β) with the probabilities given in β.

For simplicity, we fix the number of topics k and assume that β is just a set of parameters β_ij = p(w_j = 1 | z_i = 1), which need to be estimated, and we do not worry about the distribution over N. The joint distribution then looks like this:

p(θ, z, w | α, β) = p(θ | α) ∏_n p(z_n | θ) p(w_n | z_n, β)

Fig. 2. Scheme of the LDA model

Unlike conventional clustering with an a priori Dirichlet distribution, we do not select a cluster once and then look for words from this cluster; instead, for each word we first select a topic from the distribution θ, and only then relate the word to this topic.

At the output, after training the LDA model, thematic vectors θ are obtained, showing how topics are distributed in each document, and distributions β, which show which words are more likely in certain topics. In our case, we obtained 8 pronounced clusters corresponding to the following directions:
1) computing systems and algorithms in them;
2) bioinformatics and data processing methods in it;
3) signal processing;
4) optimization methods and algorithms based on them;
5) problems related to theoretical informatics and computational complexity;
6) neural and computing networks;
7) issues regarding natural language processing (NLP) and programming languages;
8) robotics and self-learning systems (Reinforcement Learning).

After the previous step, n-dimensional thematic vectors of articles are obtained. To compress the results into a two-dimensional vector space, the t-SNE machine learning algorithm was used. To visualize the clusters, we used an interface written in JavaScript (Fig. 3).

The previous approach was based on a comparison of vectors at the megalemma level using a cosine measure, which determined the semantic similarity of the texts. As a development of this approach, based on the assumption that, while the semantic similarity of phrases is maintained, the ideas in them can be expressed in different words, we use the Impact Factor of the Term (IFT) to assess the similarity of documents.

To compare articles expressing new ideas, we use the hypothesis that new ideas are often expressed in terms with a high impact factor IFT. IFT is determined by the average number of links to articles with this term: the higher the IFT, the higher the citation trend and the number of formal links. If a pair of articles share a term with a high IFT, the probability of a formal link between them will be high.

Using multilingual synsets built for high-IFT terms (IFT terms), one can evaluate the similarity of articles in any language. If there is a semantic similarity, estimated by a cosine measure, it can be assumed that articles with this term will be cited with some probability.

If previously the similarity of megalemma vectors determined the similarity of texts, now we use extended vectors based on common rare phrases, megalemmas and multilingual IFT synsets, as well as the results of thematic analysis. The similarity of extended vectors more accurately reflects the similarity of texts, since it takes into account not only semantic, but also thematic similarity.
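The LDA generative procedure described above (draw a topic mixture θ ~ Dir(α), then for each of the N words draw a topic z_n ~ Mult(θ) and a word w_n ~ p(w_n | z_n, β)) can be sketched as follows. The vocabulary, the prior α and the number of topics are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["network", "signal", "genome", "proof", "robot"]
k, n_words = 3, 8                     # number of topics k, document length N
alpha = np.full(k, 0.5)               # Dirichlet prior on topic mixtures
# beta[i, j] = p(w_j | z_i): per-topic word distributions (here random)
beta = rng.dirichlet(np.ones(len(vocab)), size=k)

theta = rng.dirichlet(alpha)          # theta ~ Dir(alpha): topic mixture
doc = []
for _ in range(n_words):              # for each of the N words:
    z = rng.choice(k, p=theta)        #   choose a topic z_n ~ Mult(theta)
    w = rng.choice(len(vocab), p=beta[z])  # choose a word w_n ~ p(w | z_n, beta)
    doc.append(vocab[w])
```

Training LDA inverts this process: given only the documents, it estimates θ for each document and β for each topic.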
Fig. 3. Cluster states in 1993. 1) computing systems and algorithms in them (pink); 2) bioinformatics and data processing methods in it
(purple); 3) signal processing (brown); 4) optimization methods and algorithms based on them (green); 5) problems related to
theoretical informatics and computational complexity (orange); 6) neural and computing networks (red); 7) issues regarding natural
language processing (NLP) and programming languages (blue); 8) robotics, and self-learning systems (Reinforcement Learning) (dark
orange); yellow - a “garbage” cluster with articles in German
Our study is based on a model of representing ideas in the form of sets of terms and similar phrases in a multilingual semantic field, and on the hypothesis that the proximity of vector representations of terms in a multilingual vector semantic space can be interpreted as semantic similarity in an interlanguage environment. We propose a method of formalizing ideas by using terms with high IFT and megalemmas, which makes it possible to recognize an idea expressed in different words. References, both formal (bibliographic) and contextual (implicit, expressed by matching IFT terms), are an expression of the connection between ideas.

High-IFT terms are significant terms (or ideologically significant ones). If texts have the same vector over the IFT synsets, this indicates the presence of common ideas in these texts and a significant similarity related to citation. The similarity of megalemma vectors also correlates with formal links (as our previous experiments showed), but to a much lesser extent. It has been shown that megalemmas have a very low impact factor.

It should be noted that similarity of megalemma vectors is more applicable to texts with a common vocabulary; in this case, the degree of coincidence of their thematic composition, as a set of popular words, is calculated. The approach of calculating the similarity of IFT/megalemma vectors is focused on comparing the similarity of scientific texts with specific terminology, despite the fact that ideas can have different lexical expressions. Therefore, in the second case, it becomes possible to assess similarity more accurately from the point of view of ideological similarity, since terms with a high IFT are significant terms denoting ideas.

Three types of semantic similarity can be considered (based on implicit references): 1) similarity of the thematic composition of popular/common words (word frequency from 10 thousand or more); 2) the presence of common significant IFT terms denoting specific ideas (frequency 5-1000); 3) the presence of common rare phrases (long quotations) (frequency 2-100). These types differ in the frequency of the matching terms/phrases: the highest frequency is typical for popular terms and megalemmas, the lowest for common rare phrases. The proposed similarity assessment algorithm takes all these types of similarity into account, giving them appropriate weights. Thus, when identifying similarities and implicit references, the entire frequency range of terms and phrases is used.

So, we build extended vectors from megalemmas and multilingual IFT synsets, and these can be weighted vectors whose elements carry weights. The larger the impact factor, the higher the likelihood of a formal link and the higher the weight of the vector element. The cosine measure can work with weighted vectors in which elements take large real values. Since our task is to search for semantic similarity of articles that correlates with the presence of formal links, increasing the weights of IFT synsets in the extended vectors improves the quality of the proposed algorithm.

Therefore, the algorithm for calculating ISTS is based on assessing, by a cosine measure, the similarity of vectors expanded by adding multilingual IFT synsets and weights, in order to determine the similarity of texts. This takes into account the presence of formal links between texts containing matching IFT terms. The method may contain options that are determined/selected by an optimization method according to the maximum correlation of ISTS with formal links.

The first version of the methodology for calculating the multilingual Index of Ideological Influence (III), defined as the number of similar subsequent/future articles/documents, has been developed.

We consider similar subsequent articles to be articles that will cite this document, i.e. those articles are similar that are linked by formal links. Thus, the III looks for trending articles containing trending IFT terms. A second-level III can also be calculated: since one idea gives rise to another, one can search for articles similar to the articles found at the first stage (indirect similarity). The mutual influence of articles is calculated using the PageRank algorithm [15], which increases the significance/influence of texts/articles the more (implicit) links they have with other significant/influential texts.

IFT terms in scientific articles have an expiration date. The value of IFT is higher in the first years (3-4 years), and then it decreases (Fig. 4).
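The IFT of a term, defined earlier as the average number of links to articles containing that term, can be sketched on a toy corpus. The article term sets and citation counts below are hypothetical:

```python
# Toy corpus: (set of title terms, number of citations received).
articles = [
    ({"recurrent", "neural", "networks"}, 120),
    ({"fuzzy", "neural", "control"}, 45),
    ({"signal", "processing"}, 10),
    ({"neural", "machine", "translation"}, 80),
]

def ift(term, corpus):
    """Average number of citations of the articles containing the term."""
    cites = [c for terms, c in corpus if term in terms]
    return sum(cites) / len(cites) if cites else 0.0

ift_neural = ift("neural", articles)   # (120 + 45 + 80) / 3
```

A high value of this average signals a trending term, which is the basis for weighting IFT synsets in the extended vectors.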
Fig. 4. Graphs of the average values of the IFT term, depending on the number of articles with these terms and the speed of the trend. 1 - 0 years, 2 - 1 year, 3 - 2 years, 4 - 3 years, 5 - 4 years, 6 - 5 years
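The PageRank refinement used for the mutual influence of articles can be sketched as a standard power iteration over an article link matrix. The 4-article link matrix is a toy assumption:

```python
import numpy as np

def pagerank(links, d=0.85, iters=50):
    """Power iteration over a column-normalized link matrix.
    links[i, j] = 1 if article j links (explicitly or implicitly) to article i."""
    n = links.shape[0]
    col = links.sum(axis=0)
    # Normalize columns so each article's out-links sum to 1;
    # articles with no out-links spread their weight uniformly.
    M = np.where(col > 0, links / np.maximum(col, 1e-12), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)
    return r / r.sum()

links = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
rank = pagerank(links)  # influence score per article, summing to 1
```

Articles that accumulate many links from other influential articles receive a higher score, which is how the forecast of III/IFT is refined.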
Over time, some important terms are replaced by others. If, in addition to the IFT, the year when the term was of high importance is introduced into the vector for the term, then some information about the age of an article can also be obtained from its vector, which makes it possible to find shared ideas of a certain age when comparing articles. This provides information on the dynamics of the development of ideas. For example, the term NEURAL NETWORKS has a long history, and in different years various derivatives of this term were significant IFT terms, for example, FUZZY NEURAL or RECURRENT neural networks.

So, the methodology for calculating the III contains the following steps:
1) search in the article for significant IFT terms;
2) compiling multilingual IFT synsets for these IFT terms;
3) on the basis of the IFT synsets, determining the forecast (regression analysis according to previous values of IFT and trend parameters);
4) refinement of the forecast using the PageRank algorithm [15], which increases the significance/influence of texts/ideas the more (implicit) connections they have with other significant/influential texts.
In this case, implicit links between texts/articles are determined using the methodology for calculating the Index of Semantic Textual Similarity (ISTS).

3. Results

As a result, we see the following pattern: the higher the forecast of the IFT, the higher the III of the document. The predictive value of the IFT is the same for a text, a term, or an idea. If there are several IFT terms in a text, a prediction can be made according to the most significant/highest IFT, or according to statistics that take into account the synergy of IFT terms when they occur together. An updated forecast of III/IFT is carried out by regression analysis using a number of indicators for the current year (IFT, IFTm, external links) and similar indicators of previous years.

4. Conclusion

The Multilingual Index of Ideological Influence (III) corresponds to the number of subsequent/future articles/documents citing the source document that are similar to the source document. We plan to consider a number of index modifications taking into account the cascade of citation (first and other levels) and the temporal dynamics of the development of ideas. It is planned to develop an algorithm for the updated forecast of III/IFT using a number of indicators of the current year (IFT, IFTm, external links) and similar indicators of previous years.

Acknowledgment

The reported study was funded by RFBR according to the research projects № 18-07-00909, 19-07-00857 and 20-04-60185.

References

[1] Jarmasz, M., Szpakowicz, S. (2003). Roget's Thesaurus and Semantic Similarity. Recent Advances in Natural Language Processing III: Selected Papers from RANLP 2003, vol. 111, 2004.
[2] Islam, A., Inkpen, D. (2012). Unsupervised Near-Synonym Choice using the Google Web 1T. ACM Transactions on Knowledge Discovery from Data, vol. V, pp. 1-19.
[3] Li, H., Xu, J. (2014). Semantic matching in search. Foundations and Trends in Information Retrieval, 7(5):343-469.
[4] Aliguliyev, R. M. (2009). A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Systems with Applications, 36, 7764-7772. 10.1016/j.eswa.2008.11.022.
[5] Wäschle, K. (2015). Quantifying Cross-lingual Semantic Similarity for Natural Language Processing Applications. Heidelberg. 139 p.
[6] Wäschle, K. and Riezler, S. (2012). Structural and topical dimensions in multi-task patent translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 818-828, Avignon, France, April 23-27, 2012.
[7] Andersson, L., Hanbury, A. and Rauber, A. (2017). The Portability of Three Types of Text Mining Techniques into the Patent Text Genre, chapter 9, pages 241-280. Springer Berlin Heidelberg, Berlin, Heidelberg. ISBN 978-3-662-53817-3.
[8] Eneko, A., Enrique, A., Keith, H., Jana, K., Marius, P., & Aitor, S. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 19-27). Boulder, Colorado: Association for Computational Linguistics.
[9] Zou, W. Y., Socher, R., Cer, D. M. and Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP (pp. 1393-1398).
[10] de Melo, G. (2015). Wiktionary-based word embeddings. Proceedings of MT Summit XV (pp. 346-359).
[11] Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C. and Smith, N. A. (2016). Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
[12] Michael, J. F., Alon, Y. H., & David, M. (2005). From databases to dataspaces: A new abstraction for information management. SIGMOD Record, 34(4), 27-33.
[13] Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P. and Stein, B. (2014). Overview of the 6th International Competition on Plagiarism Detection. In PAN at CLEF 2014, Sheffield, UK (pp. 845-876).
[14] Ferrero, J., Besacier, L., Schwab, D. & Agnes, F. (2017). Using Word Embedding for Cross-Language Plagiarism Detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Valencia, Spain, volume 2 (pp. 415-421). Association for Computational Linguistics.
[15] Page, L., Brin, S., Motwani, R., Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford University, Stanford, 1998. http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

About the authors

Khakimova Aida Kh., PhD, docent, Kama Institute (Naberezhnye Chelny, Russia), ANO «Scientific and Research Center for Information in Physics and Technique» (Nizhny Novgorod, Russia), E-mail: aida_khatif@mail.ru
Charnine Mikhail M., PhD, Senior Researcher, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail: mc@keywen.com
Klokov Alexey A., graduate student, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail: aaklokov@yandex.ru
Sokolov Evgenii G., graduate student, FRC CSC of the Russian Academy of Sciences, Moscow, Russia, E-mail: evgeny.sokolov@phystech.edu