=Paper=
{{Paper
|id=Vol-2763/CPT2020_paper_s3-11
|storemode=property
|title=Network Approach for Visualizing the Evolution of the Research of Cross-lingual Semantic Similarity
|pdfUrl=https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-11.pdf
|volume=Vol-2763
|authors=Aida Khakimova
}}
==Network Approach for Visualizing the Evolution of the Research of Cross-lingual Semantic Similarity==
Aida Kh. Khakimova, aida_khatif@mail.ru

ANO «Scientific and Research Center for Information in Physics and Technique», Nizhny Novgorod, Russia

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The paper is devoted to a bibliometric study of publications on the topic “Cross-lingual Semantic Similarity” available in the Dimensions database. Visualization of scientific networks showed fragmentation of research and limited interaction between organizations. Leading countries, organizations and authors are highlighted. Overlay visualization allowed us to assess trends in the citation of authors. The expansion of the geography of research is shown. For international cooperation, the uniformity of semantic approaches to describing the concepts of critical infrastructure, incidents, resources and services related to their maintenance and protection is important. The stated approaches can be applied to the visualization and modeling of technological development in the modern digital world.

Semantic similarity is a longstanding problem in natural language processing (NLP). The semantic similarity between two words represents the semantic proximity (or semantic distance) between two words or concepts. It plays an important role in information retrieval, information extraction, text mining, web mining and many other applications.

Keywords: text mining, tech mining, cross-lingual semantic similarity, visualization, scientific network, bibliometrics

===1. Introduction===

Linguistic similarities have been studied by researchers from different fields using numerous statistical, linguistic and neuroscientific approaches.

The semantic properties of languages are usually evaluated using word embeddings, which project a linguistic dictionary onto a vector space of a given number of dimensions in which the semantic relations between words are preserved.

In artificial intelligence and cognitive science, semantic similarity has been used for various scientific assessments and measurements, as well as for decoding complex interfaces of conceptualizing feelings [1]. Theoretically, semantic similarity refers to the idea of commonality in characteristics between words or concepts in a language. Although it is a property of the relationship between concepts or feelings, it can also be defined as a measurement of the conceptual similarity between two words, sentences, paragraphs, documents, or even two parts of a text.

Recently, there has been growing interest in finding semantically similar words in different languages based on comparable data easily accessible from the Internet (for example, Wikipedia or news) [2, 3].

According to Hotho et al. [4], text mining can be defined, like data mining, as the application of algorithms and methods from the fields of machine learning and statistics to texts in order to find useful patterns after pre-processing. Data mining algorithms can then be applied to the extracted data.

Text analysis in big data analytics is becoming a powerful tool for processing unstructured text data and analyzing it to extract new knowledge and identify meaningful models and correlations hidden in the data. Text mining refers to the automatic or semi-automatic extraction of information and previously unknown implicit patterns from huge volumes of unstructured text data such as natural language texts [5].

Tech Mining refers to the application of text mining methods to technical documentation. For the purposes of patent analysis, this is called “patent mining”. Tech Mining (TM) [6] uses text mining software to exploit scientific and technical information resources. Mining technology is used to inform technology management. It combines an understanding of technological innovation processes with software tools for obtaining vital scientific and technical knowledge.

Many applications have employed similarity functions to compute the semantic similarity between terms, and most traditional approaches solve the problem using dictionaries such as WordNet. The main problem is that many terms (e.g. abbreviations, acronyms, brand names) are not covered by these kinds of dictionaries [7]. As a result, semantic similarity measures based on this type of resource cannot be used directly in these cases.

Tech Mining is the application of text mining tools to scientific and technical information resources. The ever-growing volume of scientific results represents a boom in technological innovation, but also complicates efforts to obtain useful and concise information for solving problems. This problem extends to tech mining, where the development of methods compatible with big data is urgent.

In current patent analysis, numerous patent documents use different words to describe the same event, leading to semantic inconsistency and to polysemy due to the many meanings that may exist for a single word. To solve this problem, document analysis often requires combining synonyms into the same semantic dimension. On the other hand, different words can be used to describe the same events.

Methods for measuring the semantic similarity of texts are necessary for the development of information retrieval, data mining and text analysis. Such methods will help to avoid patent infringement in the development of technological capabilities to achieve future competitive advantages [8].

The growing popularity of data science is also affecting high-tech industries. However, since their core competencies usually lie elsewhere (the creation of cyber-physical systems rather than, for example, machine learning algorithms or data mining), delving into data science can be more cumbersome than expected for domain specialists such as system engineers or architects.

In recent years, in order to help subject matter experts use data science, scientists have been developing semantic search engines. For example, the Semantic Snake Charmer (SSC) [9] is a search engine based on domain knowledge. SSC includes a natural language processing module that can convert relevant documentation into several types of semantic graphs.

===2. Related works===

An accurate assessment of the actual similarity between documents is fundamental for many automatic text analysis applications, such as thesaurus generation [10], machine translation [11], question answering [12], information retrieval [13], and automatic summarization.

Semantic space is an attempt to model the characteristics of human semantic memory, guided by the principle that words with similar meanings are found in similar linguistic environments. A semantic space is a vector space that captures meaning quantitatively in terms of co-occurrence statistics, where words (or concepts) are represented as vectors in a high-dimensional space [14]. As a result, the similarity of the meanings of words can be quantified by measuring their distance in a high-dimensional vector space.

Latent semantic analysis (LSA) is based on the fact that words that have similar meanings tend to occur in similar texts [15].

Knowledge-based methods suffer from a limited vocabulary: they cover words commonly used in general English literature and are often not suitable for specific domains.

The vector space model is classically used to evaluate the semantic similarity between two documents. Terms are represented in this semantic space as vectors called word embeddings. The possibility of determining textual similarity from vector representations of terms in a semantic space, in which the proximity of vectors can be interpreted as semantic similarity, is investigated in [16].

The LSA method has an advantage over most modern information retrieval methods because it can measure the similarity of two texts that use completely different words. However, there are morphological problems with the correct identification of terms, as well as more fundamental problems with homonymy, polysemy and synonymy. Techniques that depend on large corpora (e.g., LSA) tend to overestimate the similarity of relatively unrelated or relatively related sentences. LSA overestimates the similarity score of compared pairs of sentences [17]. A study of similarity assessment between patent documents and scientific publications in the field of biotechnology by the LSA method showed that in this case dimensionality reduction cut off valuable information [18].

Semantic spaces can be constructed using either the additive model or the multiplicative model. Neither the additive nor the multiplicative approach to constructing a semantic space takes into account the word order among the components (i.e., words or phrases). Traditional clustering algorithms usually rely on the bag-of-words (BOW) approach, and the obvious drawback of BOW is that it ignores the semantic relationships between words.

Researchers extended distributional semantic models (DSM) to include the compositional structure of language and called these models compositional DSM (CDSM). CDSM models suggest that the meaning of a word can be interpreted by its context, and the meaning of a sentence can be obtained from its composition [19]. Central to CDSM is compositionality, that is, the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules for combining them.

Assessing semantic similarities between concepts is a key tool for improving the understanding of texts. The structured knowledge provided by ontologies is widely used to evaluate similarities. However, in many areas several ontologies modeling the same concepts in different ways are available. The paper [20] describes criteria for choosing ontologies for assessing semantic similarity.

A measure for calculating the similarity between sentences or between documents using an ontology has been proposed. The similarity is evaluated using the concept vector of the document (sentence), formed by finding the links between the ontology terms and the content of the document (sentence) [21].

The vector space model is used to identify potentially useful services and to evaluate web services [22]. Methods for information extraction and automatic semantic textual similarity assessment have been used for electronic health record (EHR) systems [23].

Similarity measures are used to select a context-sensitive application that matches the current context of the user. Personalization of services is directly related to the user's preferences and reflects contextual information from the user's environment.

A semantic similarity measure is a tool for assessing the similarity between instances of the context, which makes it possible to select services according to their relevance for a given request, profile and user preferences. With this approach, the context is considered as a set of spatio-temporal information about the user, together with the user's preferences and interests, which is used as a factor in ranking services by relevance [24].

The data sets of the common semantic textual similarity (STS) tasks have been widely used to study sentence-level similarity and semantic representations [25-27].

The CL-WES method [28] is based on the cosine similarity of distributed representations of sentences, which are obtained as a weighted sum of the word vectors in a sentence. At the first stage, the Spanish sentence is translated into English using Google Translate (so that both sentences are in the same language), and then the two sentences are compared.

The similarity score of the cross-lingual pairs in English and Spanish was calculated as the average of the corresponding ratings in the monolingual data sets [29]. The study [30] covered five languages: English, German, Italian, Spanish and Farsi.

The skip-gram model has become one of the most popular models for learning word representations in NLP [31].

Cross-lingual determination of semantic textual similarity is an important step for the detection and evaluation of cross-lingual plagiarism; research in this area is rare.

A comparable corpus consists of documents in two or more languages or varieties that are not translations of each other and deal with similar topics. Comparable corpora are, by definition, multilingual and cross-lingual collections of text. The Internet can be used as a huge resource of multilingual texts.

===3. Materials and methods===

To search for publications, the Dimensions database (https://app.dimensions.ai/) was used, which provides individual users with open access to more than 95 million publication records and related metrics. The search keywords used were “cross-lingual semantic similarity”. 2050 articles were discovered.

VOSviewer (https://www.vosviewer.com/) was used to visualize scientific networks. VOSviewer uses a distance-based approach to visualizing bibliometric networks. In a bibliometric network, there are often large differences between nodes in the number of edges they have to other nodes. Popular nodes, for example those representing highly cited publications or highly prolific researchers, may have several orders of magnitude more connections than their less popular counterparts. When analyzing bibliometric networks, a normalization for these differences between nodes is usually performed. VOSviewer by default applies the association strength normalization [32].

===4. Results and Discussion===

2050 articles by 2825 authors from 64 countries were discovered. The dynamics of publications is shown in Fig. 1. The trend line is exponential; the coefficient of determination (R²), also called the approximation confidence value, is 0.6648. The earliest publications date back to the 1980s, but research has been growing since the beginning of the 21st century.

Fig. 1. Dynamics of the number of publications devoted to the problem of cross-lingual semantic similarity

Using VOSviewer, a co-authorship network was built. For the 2825 authors, the minimum number of articles per author was set to five; 26 such authors were identified in 17 clusters. The largest cluster included 4 authors. Fig. 2 shows that the authors are separated into small research groups.

Fig. 2. Collaboration Network

We reviewed a collaboration network of organizations (Fig. 3). For the 684 organizations, the minimum number of articles per organization was set to five; 64 such organizations were identified in 6 clusters. Fig. 3 shows that only a small number of universities interact. The largest cluster included 11 European universities and organizations: Dublin City University; Fondazione Bruno Kessler; German Research Center for Artificial Intelligence; National University of Distance Education; Trinity College Dublin; University of Alicante; University of Edinburgh; University of Sheffield; University of the Basque Country; University of Trento; University of Wolverhampton.

Fig. 3. Collaboration on organizations

We examined a co-authorship network by country, with the minimum number of articles set to five. Of the 64 countries (2825 authors), 35 are connected in five clusters (Fig. 4). The two largest clusters each included 9 countries. The first cluster included Austria, Canada, China, India, Iran, Japan, the Netherlands, Taiwan, and the USA. The second cluster included Belgium, Finland, France, Greece, Slovenia, Spain, Switzerland, Tunisia, and Great Britain.

Fig. 4. Co-authorship by country

In recent years the citation index has been the main measure of the value of both a scientist and an institution, so we examined citation networks.

We examined the citation network for documents, with the minimum number of publications per author set to ten. 298 authors out of 2050 were identified in 14 clusters (Fig. 5).

Fig. 5. Citation from publications of the most cited authors

The most cited author is Navigli, Roberto (759 citations) [29, 30]. Rosso, Paolo (239) and Moens, Marie-Francine (216) [2] each have more than 200 citations.

VOSviewer also supports overlay visualizations. In an overlay visualization, the color of a node indicates a specific property of the node, for example, the year of publication. We presented the author citation network as an overlay visualization to assess citation trends (Fig. 6). The figure shows that R. Navigli is the founder in the area.

Fig. 6. Overlay citation network visualization

A citation network by authors was built. The minimum number of publications per author was set to five. 16 authors out of 2050 were identified in 2 clusters (Fig. 7). Mittal Namitali (2017), Rettinger Achim, Gipp Bela, Li Juanzi and Zhan Lei (2016) are the most recent of the most cited authors.

Fig. 7. Overlay visualization of the most cited authors

The geographical aspects of citation were considered. A citation network for countries was built with a minimum of five publications. 34 countries were identified (Fig. 8). The geography of research is expanding: in recent years, Brazil, the Czech Republic, Iran, Egypt and Tunisia have joined the research.
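The networks in this section were built in the VOSviewer interface, and the paper gives no code. As a rough illustration of the two steps that recur in every paragraph above (keeping only items with a minimum number of articles, then grouping the items that remain), a minimal Python sketch follows. The toy author lists and the function names `coauthorship_network`, `association_strength` and `connected_groups` are hypothetical. Connected components are only a crude stand-in for VOSviewer's own clustering, and the normalization is written in the plain form c_ij / (s_i · s_j) of the association strength discussed in [32], with any constant scaling factor omitted.

```python
from collections import Counter
from itertools import combinations


def coauthorship_network(papers, min_docs):
    """Keep authors with at least `min_docs` papers and link those who share a paper."""
    doc_counts = Counter(a for authors in papers for a in set(authors))
    kept = {a for a, n in doc_counts.items() if n >= min_docs}
    edges = Counter()
    for authors in papers:
        present = sorted(set(authors) & kept)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1  # link weight = number of joint papers
    return kept, edges


def association_strength(edges):
    """Divide each link weight by the product of the two nodes' total link weights."""
    totals = Counter()
    for (a, b), w in edges.items():
        totals[a] += w
        totals[b] += w
    return {(a, b): w / (totals[a] * totals[b]) for (a, b), w in edges.items()}


def connected_groups(nodes, edges):
    """Group nodes into connected components (not VOSviewer's clustering method)."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical toy records: each entry is the author list of one paper.
papers = [
    ["A", "B"], ["A", "B"], ["A", "C"], ["C", "A"],
    ["D", "E"], ["D", "E"], ["F"],
]
kept, edges = coauthorship_network(papers, min_docs=2)
normalized = association_strength(edges)
groups = connected_groups(kept, edges)
```

In this toy example author F falls below the threshold and is dropped, much as most of the 2825 authors fall below the five-article threshold. The raw A-B and D-E links both have weight 2, but after normalization the link to the well-connected author A is weaker than the D-E link, which is the kind of correction for popular nodes described in Section 3.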
Fig. 8. Overlay visualization of citation by affiliation of authors

===5. Conclusions===

A bibliometric study of publications on the topic “Cross-lingual Semantic Similarity”, available in the Dimensions database, was carried out. In recent years, there has been a significant increase in research.

Visualization of scientific networks using VOSviewer has shown fragmentation of research; small research groups have been identified.

The visualization of the co-authorship network across organizations showed limited university interaction on cross-lingual semantic similarity. The largest cluster included 11 European universities and organizations from Ireland, Italy, Germany, Spain, Scotland, and Great Britain.

Visualization of the co-authorship network by country showed that 35 countries interact in research and are connected in five clusters. The two largest clusters each included 9 countries. In these two clusters, the leading countries were the USA and China, and Great Britain and Spain, respectively.

The visualization of the citation network revealed 298 of the most cited authors out of 2050. The most cited author is Navigli, Roberto (759 citations). Rosso, Paolo (239) and Moens, Marie-Francine (216) [2] each have more than 200 citations.

Overlay visualization made it possible to evaluate the citation trends of the authors; it turned out that the most cited author, Navigli, Roberto, is also the founder of research in this field [29, 30].

The most recent of the most cited authors are Mittal Namitali (2017), Rettinger Achim, Gipp Bela, Li Juanzi and Zhan Lei (2016).

Consideration of the geographical aspects of citation showed an expansion of the geography of research. In recent years, Brazil, the Czech Republic, Iran, Egypt and Tunisia have joined the research.

===Acknowledgment===

The reported study was funded by RFBR according to the research projects № 18-07-00225, 18-07-00909, 18-07-01111 and 20-04-60185.

===References===

[1] Rajat Pandit, R., Sengupta, S., Naskar, S.K., Dash, N.S. and Sardar, M.M. (2019). Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla - A Low Resourced Language. Informatics, 6, 19; doi:10.3390/informatics6020019.

[2] Vulic, I., De Smet, W., and Moens, M.-F. (2011). Identifying word translations from comparable corpora using latent topic models. In Proceedings of ACL, pages 479-484.

[3] Prochasson, E. and Fung, P. (2011). Rare word translation extraction from aligned comparable documents. In Proceedings of ACL, pages 1327-1335.

[4] Hotho, A., Nürnberger, A. and Paaß, G. (2005). A brief survey of text mining. In Ldv Forum, Vol. 20(1), p. 19-62.

[5] Hassani, H., Beneki, C., Unger, S., Mazinani, M.T. and Yeganegi, M.R. (2020). Text Mining in Big Data Analytics. Big Data Cogn. Comput. 2020, 4, 1; doi:10.3390/bdcc4010001.

[6] Porter, A. L. (2005). Tech Mining. Competitive Intelligence Magazine. 8 (1): 30-37.

[7] Ali, A., Alfayez, F. and Alquhayz, H. (2018). Semantic Similarity Measures Between Words: A Brief Survey. Sci. Int. (Lahore), 30(6), 907-914, 2018.

[8] Wang, H. C., Chi, Y. C. and Hsin, P. L. (2018). Constructing Patent Maps Using Text Mining to Sustainably Detect Potential Technological Opportunities. Sustainability, 10, 3729; doi:10.3390/su10103729.

[9] Grappiolo, C., van Gerwen, E., Verhoosel, J. and Somers, L. (2019). The Semantic Snake Charmer Search Engine: A Tool to Facilitate Data Science in High-tech Industry Domains. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval (CHIIR ’19). Association for Computing Machinery, New York, NY, USA, 355-359. https://doi.org/10.1145/3295750.3298915.

[10] Jarmasz, M. and Szpakowicz, S. (2003). Roget’s Thesaurus and Semantic Similarity. Recent Adv. Nat. Lang. Process. III Sel. Pap. from RANLP, vol. 111, 2004.

[11] Islam, A. and Inkpen, D. (2012). Unsupervised Near-Synonym Choice using the Google Web 1T. ACM Trans. Knowl. Discov. Data, vol. V, no. June, pp. 1-19.

[12] O’Shea, J., Bandar, Z., Crockett, K., and McLean, D. (2008). A Comparative Study of Two Short Text Semantic Similarity Measures. In Agent and Multi-Agent Systems: Technologies and Applications, vol. 4953, N. Nguyen, G. Jo, R. Howlett, and L. Jain, Eds. Springer Berlin Heidelberg, pp. 172-181.

[13] Li, H. and Xu, J. (2014). Semantic matching in search. Foundations and Trends in Information Retrieval, 7(5):343-469.

[14] Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics. Cognitive science, 34(8), 1388-1429.

[15] Chen, B. (2009). Latent topic modelling of word co-occurence information for spoken document retrieval. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2009, no. 2, pp. 3961-3964.

[16] Kenter, T., Rijke, M. de (2015). Short Text Similarity with Word Embeddings. CIKM '15 Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, October 19-23, Melbourne, Australia, pp. 1411-1420.

[17] Atoum, I. (2016). Efficient Hybrid Semantic Text Similarity using Wordnet and a Corpus. (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 9, pp. 124-130.

[18] Magerman, T., Van Looy, B., Baesens, B. and Debackere, K. (2011). Assessment of Latent Semantic Analysis (LSA) text mining algorithms for large scale mapping of patent and scientific publication documents. Department Of Managerial Economics, Strategy And Innovation (MSI), October, 77 p.

[19] Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., & Zamparelli, R. (2014). Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval-2014.

[20] Batet, M. and Sánchez, D. (2015). Ontology Selection for Semantic Similarity Assessment. ICAART 2015, At Lisbon, Portugal, Volume: 2. https://www.researchgate.net/publication/283877653

[21] Liu, H., Wang, P. (2014). Assessing Text Semantic Similarity Using Ontology. Journal Of Software, vol. 9, no. 2, pp. 490-497.

[22] Maheswari, J.U., Karpagam, G.R., Indhumathy, S. (2014). Comparison of Web Service Similarity-Assessment Methods. International Journal of Computer Applications (0975 - 8887), Volume 98, No. 22.

[23] Moen, H. (2016). Distributional Semantic Models for Clinical Text Applied to Health Record Summarization. Thesis for the Degree of Philosophiae Doctor, Trondheim, May. NTNU (Norwegian University of Science and Technology, Faculty of Information Technology), 93 p.

[24] Guessoum, D., Miraoui, M., Tadj, C. (2015). Survey Of Semantic Similarity Measures In Pervasive Computing. International Journal On Smart Sensing And Intelligent Systems, Vol. 8, no. 1, pp. 125-158.

[25] Arora, S., Liang, Y., and Ma, T. (2017). A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of ICLR 2017. https://openreview.net/pdf?id=SyK00v5xx.

[26] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. CoRR abs/1705.02364. http://arxiv.org/abs/1705.02364.

[27] Pagliardini, M., Gupta, P., and Jaggi, M. (2017). Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. arXiv. https://arxiv.org/pdf/1703.02507.pdf.

[28] Ferrero, J., Besacier, L., Schwab, D., and Agnes, F. (2017). Using Word Embedding for Cross-Language Plagiarism Detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017). Association for Computational Linguistics, Valencia, Spain, volume 2, pages 415-421. http://aclweb.org/anthology/E/E17/E17-2066.pdf.

[29] Camacho-Collados, J. and Navigli, R. (2016). Find the word that does not belong: A framework for an intrinsic evaluation of word vector representations. In Proceedings of the ACL Workshop on Evaluating Vector Space Representations for NLP. Berlin, Germany, pages 43-50.

[30] Camacho-Collados, J., Taher Pilehvar, M., Collier, N., and Navigli, R. (2017). SemEval-2017 Task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of SemEval. Vancouver, Canada.

[31] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[32] Van Eck, N.J., and Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635-1651.

===About the author===

Khakimova Aida Kh., PhD, docent, Kama Institute (Naberezhnye Chelny, Russia), ANO «Scientific and Research Center for Information in Physics and Technique» (Nizhny Novgorod, Russia), E-mail: aida_khatif@mail.ru