=Paper=
{{Paper
|id=Vol-2763/CPT2020_paper_s3-11
|storemode=property
|title=Network Approach for Visualizing the Evolution of the Research of Cross-lingual Semantic Similarity
|pdfUrl=https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-11.pdf
|volume=Vol-2763
|authors=Aida Khakimova
}}
==Network Approach for Visualizing the Evolution of the Research of Cross-lingual Semantic Similarity==
<pdf width="1500px">https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-11.pdf</pdf>
<pre>
      Network Approach for Visualizing the Evolution of the Research of
                    Cross-lingual Semantic Similarity
                                                 Aida Kh. Khakimova
                                                 aida_khatif@mail.ru
       ANO «Scientific and Research Center for Information in Physics and Technique», Nizhny Novgorod, Russia

    The paper is devoted to the problem of the bibliometric study of publications on the topic “Cross-lingual Semantic Similarity”,
available in the Dimensions database. Visualization of scientific networks showed fragmentation of research, limited interaction of
organizations. Leading countries, leading organizations and authors are highlighted. Overlay visualization allowed us to assess the
trends in citing authors. The expansion of the geography of research is shown. For international cooperation, the uniformity of semantic
approaches to describing the concepts of critical infrastructure, incidents, resources and services related to their maintenance and
protection is important. The stated approaches can be applied for visualization and modeling of technological development in the
modern digital world. Semantic similarity is a longstanding problem in natural language processing (NLP). The semantic similarity
between two words represents the semantic proximity (or semantic distance) between two words or concepts. This is an important
problem in natural language processing, as it plays an important role in finding information, extracting information, text mining, web
mining and many other applications.
    Keywords: text mining, tech mining, cross-lingual semantic similarity, visualization, scientific network, bibliometrics

1. Introduction
    Linguistic similarities have been studied by researchers from different fields using numerous statistical,
linguistic and neuroscientific approaches.
    The semantic properties of languages are usually evaluated using word embeddings, which project a
vocabulary onto a vector space of a given number of dimensions in which the semantic relations between
words are preserved.
    In artificial intelligence and cognitive science, semantic similarities have been used for various scientific
assessments and measurements, as well as for decoding complex interfaces of conceptualizing feelings [1].
    Theoretically, semantic similarity refers to the idea of commonality in the characteristics of words or
concepts in a language. Although this is a property of the relationship between concepts or feelings, it can
also be defined as a measurement of the conceptual similarity between two words, sentences, paragraphs,
documents, or even two parts of a text.
    Recently, there has been a growing interest in finding semantically similar words in different languages
based on comparable data easily accessible from the Internet (for example, Wikipedia, news) [2, 3].
    According to Hotho et al. [4], text mining can be defined, like data mining, as the application of
algorithms and methods from the fields of machine learning and statistics to texts in order to find useful
patterns after pre-processing. Data mining algorithms can then be applied to the extracted data.
    Text analysis in big data analytics is becoming a powerful tool for processing unstructured text data and
analyzing it to extract new knowledge and identify meaningful models and correlations hidden in the data.
Text mining refers to the automatic or semi-automatic extraction of information and previously unknown
implicit patterns from huge volumes of unstructured text data such as natural language texts [5].
    Tech Mining refers to the application of text mining methods to technical documentation. For the
purposes of patent analysis, this is called “patent mining”. Tech Mining (TM) [6] uses text mining software
to exploit scientific and technical information resources and is used to inform technology management. It
combines an understanding of technological innovation processes with software tools for obtaining vital
scientific and technical knowledge.
    Many applications employ similarity functions to compute the semantic similarity between terms, and
most traditional approaches solve the problem using dictionaries such as WordNet. The main problem is
that many terms (e.g. abbreviations, acronyms, brand names etc.) are not covered by these kinds of
dictionaries [7]. As a result, semantic similarity measures based on this type of resource cannot be used
directly in these cases.
    The ever-growing volume of scientific results reflects a boom in technological innovation, but also
complicates efforts to obtain useful and concise information for solving problems. This difficulty extends
to tech mining, where the development of methods compatible with big data is an urgent task.
    In current patent analysis, numerous patent documents use different words to describe the same event,
which leads to semantic inconsistency, while polysemy arises from the many meanings that a single word
may have. To solve this problem, document analysis often requires combining synonyms into the same
semantic dimension.
    Methods for measuring the semantic similarity of texts are necessary for the development of
information retrieval, data mining and text analysis. Such methods help to avoid patent infringement when
developing technological capabilities to achieve future competitive advantages [8].
    The growing popularity of data science is also affecting high-tech industries. However, since these
industries usually have different core competencies (the creation of cyber-physical systems rather than, for
example, machine learning algorithms or data mining), to delve into the


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY
4.0)
science of data by specialists in the field, such as system engineers or architects, can be more cumbersome
than expected.
    In recent years, in order to help subject matter experts use data science, scientists have been developing
semantic search engines. For example, the Semantic Snake Charmer (SSC) [9] is a search engine based on
domain knowledge. SSC includes a natural language processing module that can convert relevant
documentation into several types of semantic graphs.

2. Related works
    An accurate assessment of the actual similarity between documents is fundamental for many automatic
text analysis applications, such as thesaurus generation [10], machine translation [11], question answering
[12], information retrieval [13], and automatic summarization.
    A semantic space is an attempt to model the characteristics of human semantic memory, guided by the
principle that words with similar meanings are found in similar linguistic contexts. A semantic space is a
vector space that captures meaning quantitatively in terms of co-occurrence statistics, where words (or
concepts) are represented as vectors in a high-dimensional space [14]. As a result, the similarity of word
meanings can be quantified by measuring the distance between their vectors in this space.
    Latent semantic analysis (LSA) is based on the fact that words with similar meanings tend to occur in
similar texts [15].
    Knowledge-based methods suffer from a limited vocabulary: they cover words commonly used in
general English literature and are often not suitable for specific domains.
    The vector space model is classically used to evaluate the semantic similarity between two documents.
Terms are represented in this semantic space as vectors called word embeddings. The possibility of
determining textual similarity from vector representations of terms, where the proximity of vectors is
interpreted as semantic similarity, is investigated in [16].
    The LSA method has an advantage over most modern information retrieval methods because it can
measure the similarity of two texts that use completely different words. However, there are morphological
problems with the correct identification of terms, as well as more fundamental problems with homonymy,
polysemy and synonymy. Techniques that depend on large corpora (e.g., LSA) tend to overestimate the
similarity score of compared sentence pairs, whether the sentences are relatively unrelated or relatively
related [17]. A study assessing the similarity between patent documents and scientific publications in the
field of biotechnology with LSA showed that in this case dimensionality reduction cut off valuable
information [18].
    Semantic spaces can be constructed using either the additive model or the multiplicative model. Neither
approach takes into account the word order among the components (i.e., words or phrases). Traditional
clustering algorithms usually rely on the Bag of Words (BOW) approach, and the obvious drawback of
BOW is that it ignores the semantic relationship between words.
    Researchers expanded distributional semantic models (DSM) to include the compositional structure of
the language and called these models compositional DSM (CDSM). CDSM models assume that the meaning
of a word can be interpreted from its context and that the meaning of a sentence can be obtained from its
components [19]. Central to CDSM is compositionality: the meaning of a complex expression is
determined by the meanings of its component expressions and the rules for combining them.
    Assessing semantic similarity between concepts is a key tool for improving the understanding of texts.
The structured knowledge provided by ontologies is widely used to evaluate similarity. However, in many
areas several ontologies are available that model the same concepts in different ways. Criteria for choosing
ontologies for assessing semantic similarity are described in [20].
    A measure for calculating the similarity between sentences or between documents using an ontology is
proposed in [21]. The similarity is evaluated using the concept vector of the document (or sentence),
formed by finding the links between the ontology terms and the content of the document (or sentence).
    The vector space model is used to identify potentially useful services and to evaluate web services [22].
Methods for information extraction and automatic semantic textual similarity assessment have been applied
to electronic health records (EHR) [23].
    Similarity measures are used to select a context-sensitive application that matches the current context
of the user. Personalization of services is directly related to the user's preferences and reflects contextual
information from the user's environment.
    A semantic similarity measure is a tool for assessing the similarity between instances of the context,
which makes it possible to select services according to their relevance for a given request, profile and set
of user preferences. With this approach, the context is considered as a set of spatio-temporal information
about the user, together with the user's preferences and interests, which is used as a factor in ranking
services by relevance [24].
    The data sets of the shared semantic textual similarity (STS) tasks have been widely used to study
sentence-level similarity and semantic representations [25-27].
    The CL-WES method [28] is based on the cosine similarity of distributed representations of sentences,
which are obtained as a weighted sum of the word vectors in a sentence. At the first stage, the Spanish
sentence is translated into English using Google Translate (i.e., the two sentences are expressed in the same
language); the two sentences are then compared.
    The similarity score of the cross-lingual pairs in English and Spanish was calculated as the average of
the corresponding ratings in the monolingual data sets [29]. The study was extended to five languages [30]:
English, German, Italian, Spanish and Farsi.
    The skip-gram model has become one of the most popular models for learning word representations in
NLP [31].
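
    The sentence comparison described above for CL-WES (a weighted sum of word vectors followed by
cosine similarity) can be illustrated with a minimal sketch. It is not the implementation from [28]: the
three-dimensional embeddings, the optional weights and the function names are invented for illustration,
and the translation step is omitted.

```python
import math

def sentence_vector(tokens, embeddings, weights=None):
    """Weighted sum of the word vectors of the tokens found in `embeddings`."""
    weights = weights or {}
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:          # skip out-of-vocabulary words
            continue
        w = weights.get(tok, 1.0)  # default weight 1.0 is an assumption
        for i, x in enumerate(vec):
            total[i] += w * x
    return total

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

# Toy embeddings, invented for this example only.
TOY = {
    "cat": [0.9, 0.1, 0.0], "dog": [0.8, 0.2, 0.0],
    "sleeps": [0.1, 0.9, 0.0], "rests": [0.2, 0.8, 0.1],
    "stock": [0.0, 0.1, 0.9], "falls": [0.1, 0.0, 0.8],
}

s1 = sentence_vector(["the", "cat", "sleeps"], TOY)
s2 = sentence_vector(["a", "dog", "rests"], TOY)
s3 = sentence_vector(["stock", "falls"], TOY)

print(cosine(s1, s2) > cosine(s1, s3))  # related pair scores higher
```

    Because the sentence vector is a sum, word order is ignored, which is the limitation of additive models
noted earlier in this section.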
    Determining semantic textual similarity across languages is an important step in the detection and
evaluation of cross-language plagiarism; research in this area is rare.
    A comparable corpus consists of documents in two or more languages or varieties that are not
translations of each other and deal with similar topics. Comparable corpora are, by definition, multilingual
and cross-lingual collections of text. The Internet can be used as a huge resource of multilingual texts.

3. Materials and methods
    To search for publications, the Dimensions database (https://app.dimensions.ai/) was used, which
provides individual users with open access to more than 95 million publication records and related metrics.
The search keywords used were “cross-lingual semantic similarity”. 2050 articles were found.
    VOSviewer (https://www.vosviewer.com/) was used to visualize scientific networks. VOSviewer uses a
distance-based approach to visualizing bibliometric networks. In a bibliometric network, there are often
large differences between nodes in the number of edges they have to other nodes.
    Popular nodes, for example those representing highly cited publications or highly prolific researchers,
may have several orders of magnitude more connections than their less popular counterparts. When
analyzing bibliometric networks, these differences between nodes are usually normalized. By default,
VOSviewer applies association strength normalization [32].

4. Results and Discussion
    2050 articles by 2825 authors from 64 countries were found. The dynamics of publications is shown in
Fig. 1. An exponential trend line was fitted; its coefficient of determination (R²), also called the
approximation confidence value, is 0.6648. The first publications date back to the 1980s, but research has
been growing since the beginning of the 21st century.

      Fig. 1. Dynamics of the number of publications devoted to the problem of cross-lingual semantic similarity

    With the help of VOSviewer, a co-authorship network was built. For the 2825 authors, the minimum
number of articles per author was set to five; 26 such authors were identified in 17 clusters. The largest
cluster included 4 authors. Fig. 2 shows that the authors are separated into small research groups.
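
    The exponential trend line and the determination coefficient reported for Fig. 1 can be computed
along the following lines. This is only a sketch: the yearly counts below are invented, the fit is an
ordinary least-squares fit on the logarithm of the counts, and the paper does not state whether its R² was
computed on the original or the log-transformed scale (here it is computed on the original scale).

```python
import math

def fit_exponential(years, counts):
    """Fit counts ~ a * exp(b * (year - years[0])) by least squares on log(counts)."""
    xs = [y - years[0] for y in years]
    ls = [math.log(c) for c in counts]   # requires strictly positive counts
    n = len(xs)
    mx, ml = sum(xs) / n, sum(ls) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxl = sum((x - mx) * (l - ml) for x, l in zip(xs, ls))
    b = sxl / sxx
    a = math.exp(ml - b * mx)
    return a, b

def predict(a, b, years):
    """Evaluate the fitted curve for the given years."""
    return [a * math.exp(b * (y - years[0])) for y in years]

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Invented yearly publication counts that double every year.
YEARS = [2000, 2001, 2002, 2003, 2004, 2005]
COUNTS = [2, 4, 8, 16, 32, 64]

a, b = fit_exponential(YEARS, COUNTS)
r2 = r_squared(COUNTS, predict(a, b, YEARS))
print(a, b, r2)
```

    For real publication counts the points scatter around the curve, and R² falls below 1, as with the
value of 0.6648 reported above.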


                                              Fig. 2. Collaboration Network
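
    The thresholding step behind Fig. 2 (keep authors with at least five articles, then look at how they are
linked by co-authorship) can be sketched as follows. This is an illustration under stated assumptions:
connected components are used as a simple stand-in for VOSviewer's own clustering, and the paper list and
the function name are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations

def coauthor_groups(papers, min_papers=5):
    """Group authors with >= min_papers articles by co-authorship links.

    papers: list of author-name lists, one list per article.
    Returns a list of sets; each set is one connected group of authors.
    """
    counts = Counter(a for authors in papers for a in set(authors))
    kept = {a for a, n in counts.items() if n >= min_papers}

    # Undirected co-authorship links between retained authors.
    links = defaultdict(set)
    for authors in papers:
        present = sorted(set(authors) & kept)
        for a, b in combinations(present, 2):
            links[a].add(b)
            links[b].add(a)

    # Depth-first search to collect connected components.
    groups, seen = [], set()
    for start in sorted(kept):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(links[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Invented example: threshold lowered to 2 to keep the data small.
PAPERS = [["A", "B"], ["A", "B"], ["A"], ["C"], ["C"], ["D", "A"]]
groups = coauthor_groups(PAPERS, min_papers=2)
print(groups)
```

    In this toy data author D falls below the threshold, so only two groups remain. Many small groups
relative to the number of retained authors correspond to the fragmentation discussed for Fig. 2.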
    We examined the collaboration network of organizations (Fig. 3). For the 684 organizations, the
minimum number of articles per organization was set to five; 64 such organizations were identified in 6
clusters. Fig. 3 shows that only a small number of universities interact. The largest cluster included 11
European universities and organizations: Dublin City University; Fondazione Bruno Kessler; German
Research Center for Artificial Intelligence; National University of Distance Education; Trinity College
Dublin; University of Alicante; University of Edinburgh; University of Sheffield; University of The Basque
Country; University of Trento; University of Wolverhampton.


                                            Fig. 3. Collaboration on organizations

    We examined the co-authorship network by country; the minimum number of articles was set to five. Of
the 64 countries, 35 are connected in five clusters (Fig. 4). The two largest clusters each included 9
countries. The first cluster included Austria, Canada, China, India, Iran, Japan, the Netherlands, Taiwan,
and the USA. The second cluster included Belgium, Finland, France, Greece, Slovenia, Spain, Switzerland,
Tunisia, and Great Britain.


                                              Fig. 4. Co-authorship by country

    In recent years the citation index has become the main measure of the value of both a scientist and an
institution, so we examined citation networks.
    We examined the citation network for documents; the minimum number of publications by the author was
taken equal to ten. 298 authors from 2050 were identified in 14 clusters (Fig. 5).
                                   Fig. 5. Citation from publications of the most cited authors

    The most cited author is Navigli, Roberto (759 citations) [29, 30]. Rosso, Paolo (239) and Moens,
Marie-Francine (216) [2] also have more than 200 citations.
    VOSviewer also supports overlay visualizations. In an overlay visualization, the color of a node
indicates a specific property of the node, for example, the year of publication.
    We presented the author citation network as an overlay visualization to assess citation trends (Fig. 6).
The figure shows that R. Navigli is the founder of research in the area.


                                         Fig. 6. Overlay citation network visualization
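
    The overlay idea, in which node color encodes a property such as publication year, can be sketched
as below. The assumptions are loud ones: the score used here is the mean publication year of an author's
articles, it is rescaled linearly to the interval [0, 1] as a position on a color scale, and the paper list
and function names are invented. VOSviewer's own scoring and color mapping are configurable and may
differ.

```python
from collections import defaultdict

def mean_year_by_author(papers):
    """papers: list of (year, [authors]). Returns {author: mean publication year}."""
    years = defaultdict(list)
    for year, authors in papers:
        for author in set(authors):
            years[author].append(year)
    return {a: sum(ys) / len(ys) for a, ys in years.items()}

def to_color_scale(scores):
    """Rescale scores linearly to [0, 1]; 0 = oldest, 1 = most recent."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:                      # all nodes share the same score
        return {k: 0.5 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

# Invented example data.
PAPERS = [(2010, ["A", "B"]), (2014, ["A"]), (2018, ["C"])]
scores = mean_year_by_author(PAPERS)
positions = to_color_scale(scores)
print(positions)
```

    Nodes with positions close to 1 correspond to the authors who appear as the most recent in an overlay
map.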

    A citation network by authors was built. The minimum number of publications by the author was taken
to be five. 16 authors were identified from 2050 in 2 clusters (Fig. 7). Mittal Namitali (2017), Rettinger
Achim, Gipp Bela, Li Juanzi and Zhan Lei (2016) are the most recent of the most cited authors.
                                       Fig. 7. Overlay visualization of the most cited authors

    The geographical aspects of citation were considered. A citation network for countries was built with a
minimum of five publications. 34 countries were identified (Fig. 8). The geography of research is
expanding: in recent years, Brazil, Czech Republic, Iran, Egypt and Tunisia have joined the research.


                                  Fig. 8. Overlay visualization of citation by affiliation of authors
5. Conclusions
    A bibliometric study of publications on the topic “Cross-lingual Semantic Similarity”, available in the
Dimensions database, was carried out. In recent years, there has been a significant increase in research.
Visualization of scientific networks using VOSviewer has shown fragmentation of research; small research
groups have been identified.
    The visualization of the co-authorship network across organizations showed limited interaction between
universities on cross-lingual semantic similarity. The largest cluster included 11 European universities and
organizations from Ireland, Italy, Germany, Spain, Scotland, and Great Britain.
    Visualization of the co-authorship network by country showed that 35 countries interact in research and
are connected in five clusters. The two largest clusters each included 9 countries; the leading countries in
them were the USA and China, and Great Britain and Spain.
    The visualization of the citation network revealed 298 of the most cited authors out of 2050. The most
cited author is Navigli, Roberto (759 citations). Rosso, Paolo (239) and Moens, Marie-Francine (216) [2]
also have more than 200 citations.
    Overlay visualization made it possible to evaluate the citation trends of the authors; it turned out that
the most cited author, Navigli, Roberto, is also the founder of research in this field [29, 30].
    The most recent cited authors are Mittal Namitali (2017), Rettinger Achim, Gipp Bela, Li Juanzi and
Zhan Lei (2016).
    Consideration of the geographical aspects of citation showed an expansion of the geography of research.
In recent years, Brazil, Czech Republic, Iran, Egypt and Tunisia have joined the research.

Acknowledgment
    The reported study was funded by RFBR according to the research projects № 18-07-00225, 18-07-00909,
18-07-01111 and 20-04-60185.

References

[1] Pandit, R., Sengupta, S., Naskar, S.K., Dash, N.S. and Sardar, M.M. (2019). Improving Semantic
     Similarity with Cross-Lingual Resources: A Study in Bangla - A Low Resourced Language. Informatics,
     6, 19; doi:10.3390/informatics6020019.
[2] Vulic, I., De Smet, W., and Moens, M.-F. (2011). Identifying word translations from comparable corpora
     using latent topic models. In Proceedings of ACL, pages 479-484.
[3] Prochasson, E. and Fung, P. (2011). Rare word translation extraction from aligned comparable
     documents. In Proceedings of ACL, pages 1327-1335.
[4] Hotho, A., Nürnberger, A. and Paaß, G. (2005). A brief survey of text mining. In Ldv Forum, Vol. 20(1),
     p. 19-62.
[5] Hassani, H., Beneki, C., Unger, S., Mazinani, M.T. and Yeganegi, M.R. (2020). Text Mining in Big Data
     Analytics. Big Data Cogn. Comput. 2020, 4, 1; doi:10.3390/bdcc4010001.
[6] Porter, A. L. (2005). Tech Mining. Competitive Intelligence Magazine. 8 (1): 30-37.
[7] Ali, A., Alfayez, F. and Alquhayz, H. (2018). Semantic Similarity Measures Between Words: A Brief
     Survey. Sci. Int. (Lahore), 30(6), 907-914, 2018.
[8] Wang, H. C., Chi, Y. C. and Hsin, P. L. (2018). Constructing Patent Maps Using Text Mining to
     Sustainably Detect Potential Technological Opportunities. Sustainability, 10, 3729;
     doi:10.3390/su10103729.
[9] Grappiolo, C., van Gerwen, E., Verhoosel, J. and Somers, L. (2019). The Semantic Snake Charmer Search
     Engine: A Tool to Facilitate Data Science in High-tech Industry Domains. In Proceedings of the 2019
     Conference on Human Information Interaction and Retrieval (CHIIR ’19). Association for Computing
     Machinery, New York, NY, USA, 355-359. DOI:https://doi.org/10.1145/3295750.3298915.
[10] Jarmasz, M. and Szpakowicz, S. (2003). Roget’s Thesaurus and Semantic Similarity. Recent Adv. Nat.
     Lang. Process. III Sel. Pap. from RANLP, vol. 111, 2004.
[11] Islam, A. and Inkpen, D. (2012). Unsupervised Near-Synonym Choice using the Google Web 1T. ACM
     Trans. Knowl. Discov. Data, vol. V, no. June, pp. 1-19.
[12] O’Shea, J., Bandar, Z., Crockett, K., and McLean, D. (2008). A Comparative Study of Two Short Text
     Semantic Similarity Measures. In Agent and Multi-Agent Systems: Technologies and Applications, vol.
     4953, N. Nguyen, G. Jo, R. Howlett, and L. Jain, Eds. Springer Berlin Heidelberg, pp. 172-181.
[13] Li, H. and Xu, J. (2014). Semantic matching in search. Foundations and Trends in Information
     Retrieval, 7(5):343-469.
[14] Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics. Cognitive
     science, 34(8), 1388-1429.
[15] Chen, B. (2009). Latent topic modelling of word co-occurrence information for spoken document
     retrieval. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2009,
     no. 2, pp. 3961-3964.
[16] Kenter, T., Rijke, M. de (2015). Short Text Similarity with Word Embeddings. CIKM '15 Proceedings of
     the 24th ACM International on Conference on Information and Knowledge Management, October 19-23,
     Melbourne, Australia. Pp. 1411-1420.
[17] Atoum, I. (2016). Efficient Hybrid Semantic Text Similarity using Wordnet and a Corpus. (IJACSA)
     International Journal of Advanced Computer Science and Applications, Vol. 7, No. 9, pp. 124-130.
[18] Magerman, T., Van Looy, B., Baesens, B. and Debackere, K. (2011). Assessment of Latent Semantic
     Analysis (LSA) text mining algorithms for large scale mapping of patent and scientific publication
     documents. Department Of Managerial Economics, Strategy And Innovation (MSI), October, 77 p.
[19] Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., & Zamparelli, R. (2014). Semeval-2014
     task 1: Evaluation of compositional distributional semantic models on full sentences through semantic
     relatedness and textual entailment. SemEval-2014.
[20] Batet, M. and Sánchez, D. (2015). Ontology Selection for Semantic Similarity Assessment. ICAART
     2015, At Lisbon, Portugal, Volume: 2 https://www.researchgate.net/publication/283877653
[21] Liu, H., Wang, P. (2014). Assessing Text Semantic Similarity Using Ontology. Journal Of Software, vol.
     9, no. 2, pp. 490-497.
[22] Maheswari, J.U., Karpagam, G.R., Indhumathy, S. (2014). Comparison of Web Service Similarity-
     Assessment Methods. International Journal of Computer Applications (0975 - 8887) Volume 98 - No.22.
[23] Moen, H. (2016). Distributional Semantic Models for Clinical Text Applied to Health Record
     Summarization. Thesis for the Degree of Philosophiae Doctor, Trondheim, May, NTNU (Norwegian
     University of Science and Technology, Faculty of Information Technology), 93 p.
[24] Guessoum, D., Miraoui, M., Tadj, C. (2015). Survey Of Semantic Similarity Measures In Pervasive
     Computing. International Journal On Smart Sensing And Intelligent Systems Vol. 8, no. 1, pp. 125-158.
[25] Arora, S., Liang, Y., and Ma, T. (2017). A simple but tough-to-beat baseline for sentence embeddings.
     In Proceedings of ICLR 2017. https://openreview.net/pdf?id=SyK00v5xx.
[26] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised learning of
     universal sentence representations from natural language inference data. CoRR abs/1705.02364.
     http://arxiv.org/abs/1705.02364.
[27] Pagliardini, M., Gupta, P., and Jaggi, M. (2017). Unsupervised Learning of Sentence Embeddings using
     Compositional n-Gram Features. arXiv https://arxiv.org/pdf/1703.02507.pdf.
[28] Ferrero, J., Besacier, L., Schwab, D., and Agnes, F. (2017). Using Word Embedding for Cross-Language
     Plagiarism Detection. In Proceedings of the 15th Conference of the European Chapter of the
     Association for Computational Linguistics, (EACL 2017). Association for Computational Linguistics,
     Valencia, Spain, volume 2, pages 415-421. http://aclweb.org/anthology/E/E17/E17-2066.pdf.
[29] Camacho-Collados, J. and Navigli, R. (2016). Find the word that does not belong: A framework for an
     intrinsic evaluation of word vector representations. In Proceedings of the ACL Workshop on Evaluating
     Vector Space Representations for NLP. Berlin, Germany, pages 43-50.
[30] Camacho-Collados, J., Taher Pilehvar, M., Collier, N., and Navigli, R. (2017). SemEval-2017 Task 2:
     Multilingual and cross-lingual semantic word similarity. In Proceedings of SemEval. Vancouver,
     Canada.
[31] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations
     in vector space. arXiv preprint arXiv:1301.3781
[32] Van Eck, N.J., and Waltman, L. How to normalize cooccurrence data? An analysis of some well-known
     similarity measures. 2009. Journal of the American Society for Information Science and Technology,
     60(8), 1635-1651.

About the authors
   Khakimova Aida Kh., PhD, docent, Kama Institute (Naberezhnye Chelny, Russia), ANO «Scientific and
Research Center for Information in Physics and Technique» (Nizhny Novgorod, Russia), E-mail:
aida_khatif@mail.ru

</pre>