Analysis of Lexical Ambiguity in Vector Space Models

Marta Vázquez Abuín (martavazquez.abuin@usc.gal)

Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Galicia, Spain

Keywords: lexical semantics, distributional semantics, Word Sense Disambiguation, Galician

ISSN 1613-0073

The aim of this PhD is to analyze the capabilities of current vector models to deal with lexical ambiguity, particularly in the context of polysemy and homonymy, with a special focus on how models based on Transformer architectures represent these semantic phenomena. To achieve the main objective we have set the following specific objectives: (i) to perform a comprehensive analysis of the state-of-the-art, (ii) to compile a training and evaluation dataset according to the Word in Context format, and (iii) to extend it with data from other languages, with the potential to create a cross-lingual dataset including false friends and equivalent forms. In addition, (iv) to evaluate the computational models and (v, vi) to conduct experiments to compare the results obtained above with human judgments. To do so, we will compile and create datasets to perform an evaluation in terms of lexical ambiguity. We will assess state-of-the-art models to see if they are able to identify the different meanings of ambiguous words in context. Finally, the human judgments on the same dataset will be evaluated and compared with the computational models that have been analyzed. This analysis will be carried out mainly for the Galician language, but also for Portuguese and Spanish according to the evolution of the research.

Introduction and Motivation

The development of deep learning and Transformer-based models in recent years has brought major improvements to the field of Natural Language Processing (NLP) [1,2]. Nevertheless, the identification and processing of lexical ambiguity remains a significant challenge for linguistic technologies, particularly in the context of automatic and unsupervised approaches [3,4,5]. One strategy for evaluating the ability of computational models to address lexical ambiguity is through Word Sense Disambiguation (WSD) tasks [6]. The objective of these tasks is to determine the precise meaning of a word in different contexts, analyze its behavior, and identify potential solutions [7,8,9].

In the computational modelling of lexical semantics, there are two main approaches, both inspired by theoretical proposals [10]: (a) symbolic modelling, where each element has a specific meaning in a network of semantic relations, and (b) continuous approaches, where lexical forms are represented in a vector space. For the first approach, a prominent example is Princeton WordNet (PWN) [11], one of the most frequently used lexical resources in natural language processing. WordNet is a database of semantic relations that groups words into synonym sets (or synsets) and links them to other words according to their shared meaning [11]. These synsets act as nodes within the semantic network, with the relations between them represented as edges [11,12]. Furthermore, it provides glosses and examples that illustrate word usages and clarify their meanings in context [12]. Continuous approaches include both models based on the distributional hypothesis [13,14] and newer models trained with deep learning architectures [15]. Here we can differentiate between two categories: static models such as Word2Vec [16], fastText [17] and GloVe [18], where each word is represented by a single vector regardless of the context in which it occurs; and contextualized models based on language models, which represent each occurrence of a word in a given context with a different vector that can potentially disambiguate the meaning of the word in context (BERT [1], ELMo [15] or GPT-3 [19]).
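The contrast between static and contextualized vectors can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: the two-dimensional vectors are invented, the two Galician phrases are made up, and `toy_contextual_vector` is only a crude stand-in (averaging with the context words) that does not reproduce how BERT, ELMo or GPT compute representations.

```python
from math import sqrt

# Toy static table: one vector per word form (values invented for illustration).
STATIC = {
    "canto": [0.5, 0.5],
    "poema": [1.0, 0.0],
    "parede": [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def static_vector(word, context):
    """A static model ignores the context: same form, same vector."""
    return STATIC[word]


def toy_contextual_vector(word, context):
    """Crude stand-in for contextualization: average the target's static
    vector with the mean vector of the known context words."""
    known = [STATIC[w] for w in context if w in STATIC and w != word]
    if not known:
        return STATIC[word]
    mean = [sum(dim) / len(known) for dim in zip(*known)]
    return [(a + b) / 2 for a, b in zip(STATIC[word], mean)]


ctx_poem = ["o", "canto", "do", "poema"]   # invented phrase, 'canto' sense
ctx_wall = ["o", "canto", "da", "parede"]  # invented phrase, 'edge' sense

static_sim = cosine(static_vector("canto", ctx_poem),
                    static_vector("canto", ctx_wall))
contextual_sim = cosine(toy_contextual_vector("canto", ctx_poem),
                        toy_contextual_vector("canto", ctx_wall))
print(static_sim, contextual_sim)
```

In this toy setting the static vectors of the two occurrences are identical, whereas the context-dependent vectors diverge, which is the property that makes contextualized models candidates for disambiguation.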

Lexical ambiguity phenomena, such as polysemy or homonymy, are pervasive across natural languages and present a challenge to computational models, as a single word form may have different meanings depending on the context [3,4,5]. In the context of this thesis, it is appropriate to describe these two phenomena. Polysemy involves a single lexeme with several related meanings depending on the context (e.g., the Galician coche 'car' referring to a four-wheeled vehicle or a train carriage). Homonymy, on the other hand, involves a single lexical form shared by different lexemes with independent meanings (e.g., the Galician canto as one of the divisions of an epic poem 'canto' or as the gap between two walls 'edge') [20].

For English, and other languages like Italian and German, there exists a comprehensive array of resources for training and evaluating the resolution of lexical ambiguity in language models. Some examples include the different WordNets and Word in Context datasets, such as WiC [21], XL-WiC [22] or ConSeC [23]. In contrast, the availability of these resources is very limited for under-resourced languages such as Galician. Furthermore, the representation of Galician in comparison with the other languages analyzed is often very small, as evidenced by the XL-WSD dataset [24].

The objective of this PhD is to examine the strategies followed by language models in the context of lexical ambiguity in Galician.¹

The remainder of this work is organized as follows: Section 2 examines the diverse tasks and challenges associated with lexical ambiguity and word sense disambiguation. Sections 3 and 4 present the objectives, hypotheses, and proposed research methodology. Finally, Section 5 presents the preliminary results of the research.

Related work

Lexical ambiguity is very important in human language understanding [5] and one of the most common problems in natural language technologies because it causes misinterpretation of natural language. As a result, identifying and resolving ambiguity is essential for improving the efficiency and reliability of these technologies [6,25].

Disambiguation is the process of resolving such ambiguity, and WSD is one of its aspects most studied by researchers [6]. Its main objective is to determine the sense of a word in a particular context, for example by associating words in context with the most appropriate entry in a predefined sense inventory [7].

One of the most widely used linguistic resources for lexical disambiguation is WordNet [11]. This lexical database has become the primary repository for senses in NLP and it is the source of the majority of the datasets and evaluation frameworks for WSD [7]. WordNet was initially developed for the English language. Over time, however, other languages began to develop their own lexical resources based on it, for example Galnet [26], the Galician WordNet [27], which is part of the Multilingual Central Repository [28]. Creating such resources is a significant challenge for most languages, especially those with limited linguistic resources, where the required investment of time and human effort is often a major obstacle. While new automatic and unsupervised methods have been developed for this purpose, they also need to be trained with data of sufficient quality and quantity to obtain adequate results. For this reason, the most popular option has been to extend these resources from PWN using different methods adapted to their development and needs, favoring automatic and unsupervised approaches that minimize human and economic effort.

However, there is still a notable disparity between English and other languages in terms of NLP tools and improvements, as well as the identification and treatment of lexical ambiguity. The gap is particularly significant when attempting to conduct research and work with under-resourced languages, such as Galician.

To assess the efficacy of automatic lexical disambiguation, WordNet has been employed to construct datasets such as WiC [21], which evaluates vector models in context, and its extended version XL-WiC [22]. Each instance in these datasets comprises a target word and two contexts, each with a specific meaning of the target word. The objective is to identify whether the target word has the same meaning in both contexts. However, Galician is not included in XL-WiC due to the limited availability of examples [22].
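As a minimal sketch of this task format, an instance can be stored as a small record and classified with a similarity threshold over the two contextual vectors of the target word. The field names, the example sentences, the threshold value and the function `predict_same_sense` are all hypothetical choices for illustration; they are not taken from the WiC release, and the vectors would in practice come from whichever model is being evaluated.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class WiCInstance:
    """One Word-in-Context item: a target word, two contexts, a gold label."""
    target: str
    context_1: str
    context_2: str
    same_sense: bool  # True if the target has the same meaning in both contexts


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def predict_same_sense(vec_1, vec_2, threshold=0.5):
    """Binary decision from the target word's two contextual vectors.
    The threshold is a hypothetical value that would be tuned on held-out data."""
    return cosine(vec_1, vec_2) >= threshold


# Invented example sentences for the polysemous Galician word 'coche'.
example = WiCInstance(
    target="coche",
    context_1="Aparcou o coche diante da casa.",
    context_2="Viaxamos no último coche do tren.",
    same_sense=False,
)

# Invented vectors standing in for the model's representation of 'coche'.
prediction = predict_same_sense([1.0, 0.0], [0.0, 1.0])
print(prediction == example.same_sense)
```

A threshold over cosine similarity is only one simple way to turn vectors into a binary decision; a trained classifier over the pair of vectors is an equally valid alternative.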

Regarding the models that will be evaluated, there are static models, inspired by the distributional hypothesis, where a single vector represents all the meanings of a word [14,16], and the newer models based on deep neural networks [9,29], which have revolutionized the field [2]. These language models can model the meaning of words in context, since a different vector is generated for each word in each sentence. Among the most widely used for these tasks are BERT [1], XLM-RoBERTa [30], DeBERTa [31], ELMo [15] and GPT [19].

Concerning the behavior of the different models with respect to lexical ambiguity, contextualized models [9] are effective in identifying homonyms and distinguishing between meanings [29,32,33]. However, accuracy declines when the contexts are similar, because the sense distinctions are fine-grained [8]. Loureiro et al. [8] show that Transformer models, specifically BERT, achieve high results and capture sense distinctions even with few examples, grouping polysemous words according to sense.

Concerning the human interpretation of lexical ambiguity, some studies have shown that large models can perform in a manner similar to human preference [34,35]. However, Trott et al. [36] found that human behavior cannot be explained purely by exposure to language statistics, which is what the assessed models rely on.

For Galician, the number of resources and datasets created for lexical disambiguation tasks and model evaluation is very low [33]. This situation places Galician at a disadvantage in relation to other languages in terms of WSD tasks and evaluation. In this context, the present research proposes the design and creation of datasets to evaluate computational models on cases of lexical ambiguity, as well as their behavior in cases of cross-lingual disambiguation, with special emphasis on false friends. Furthermore, we intend to examine and contrast the results with human interpretation.

Objectives

The objective of our research proposal is to conduct an analysis of the behaviour of current vector models with regard to lexical ambiguity in Galician. In order to achieve the primary objective, the following secondary objectives have been proposed.

1. To review the current state of the art in lexical ambiguity for single words and natural language processing

2. To define and compile training and evaluation datasets in Galician, and eventually in other languages (Portuguese and Spanish), following the Word in Context (WiC) format

3. To compile a cross-lingual dataset in Galician, Portuguese and Spanish with a focus on the analysis of false friends

4. To evaluate the performance of computational models in handling lexical ambiguity using the developed datasets

5. To evaluate human judgments on the same dataset as the computational models analyzed

6. To examine and compare the identification and recognition of the phenomena by humans and models

Research Questions

In the first stage of the PhD, the following questions were formulated to achieve the established objectives:

• RQ1: Is written language sufficient for the assessed models to learn how to resolve lexical ambiguity or, on the contrary, would other types of information (visual, speech, etc.) be necessary? H1: Our hypothesis is that it can be sufficient, but only under optimal conditions concerning training and computing resources, as in Loureiro et al. [8].

• RQ2: Can distributional models resolve lexical ambiguity? H2: We assume that it would differ depending on the semantic relationship and the assessed model.

• RQ3: Which models are better to resolve the ambiguity? H3: Our hypothesis is that the new models based on deep neural networks perform better than static models [37,5,38].

• RQ4: Can the analyzed methods demonstrate an enhanced capacity to comprehend and process lexical ambiguity in a manner that transcends the limitations of human understanding? H4: We assume that humans perform better on lexical ambiguity than current vector space models [37,32], although systematic errors can occur [34]. However, in certain models, the representations used to achieve this are largely aligned with human intuitions [38].

Methodology

The research will be conducted in the following stages, each with a specific methodology based on the current state of the art in computational linguistics and natural language processing. The initial stage will entail the creation of a training and evaluation dataset for Galician based on WiC [21]. The dataset will be created in accordance with the methodology proposed for the original WiC [21] and for the other languages of XL-WiC [22], using Galnet² to obtain the context sentences. Despite the valuable insights Galnet offers, its size is insufficient for training and evaluating WSD tools, particularly with regard to the number of example sentences [22], so it was necessary to expand both its examples and its words. As for the computational models to be assessed with this dataset, we will use models based on Transformer architectures, in addition to other architectures proposed over the course of the study.
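One plausible way to derive WiC-style pairs from a WordNet-like resource is sketched below: two example sentences of the same lemma form a positive pair if they come from the same synset and a negative pair otherwise. This is an illustrative assumption about the pairing step only; the function `build_wic_pairs`, the synset identifiers and the placeholder sentences are hypothetical, and any additional filtering applied in the original WiC methodology is not reproduced here.

```python
from itertools import combinations


def build_wic_pairs(lemma, examples_by_synset):
    """Create (lemma, sentence_1, sentence_2, same_sense) tuples.

    examples_by_synset maps a synset identifier to the example sentences
    in which the lemma is used with that sense.
    """
    pairs = []
    # Two examples from the same synset share a sense: positive pairs.
    for sentences in examples_by_synset.values():
        for s1, s2 in combinations(sentences, 2):
            pairs.append((lemma, s1, s2, True))
    # Examples from two different synsets differ in sense: negative pairs.
    for sents_a, sents_b in combinations(examples_by_synset.values(), 2):
        for s1 in sents_a:
            for s2 in sents_b:
                pairs.append((lemma, s1, s2, False))
    return pairs


# Placeholder inventory with hypothetical identifiers and sentences.
toy_inventory = {
    "synset-A": ["example sentence A1", "example sentence A2"],
    "synset-B": ["example sentence B1"],
}
pairs = build_wic_pairs("coche", toy_inventory)
for pair in pairs:
    print(pair)
```

The sketch also shows why the number of example sentences matters: a synset with a single example, such as `synset-B` here, cannot contribute any positive pair.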

With regard to human behavior and understanding in relation to lexical ambiguity, the objective is to analyze how people identify some semantic phenomena and to assess the limit at which one lemma changes its meaning to a new one, as well as the extent to which the senses of the same lexeme diverge from one another. These experiments will follow the standard practices in the field and will be guided by previously established methodology [39,32].

Moreover, the intention is to create a cross-lingual dataset that will include other languages, such as Portuguese and Spanish. The details of the dataset have yet to be determined, but it will be structured in accordance with the format previously outlined. Each instance will contain a word (which should have the same or a similar lexical form, taking into account orthographic conventions in both languages) with two contexts: one for Galician and one for the other language (Portuguese or Spanish). The target word has a specific meaning in each context, which may or may not be the same. The selection of comparable lexical forms allows for an examination of the performance of the analyzed models in the context of false friends in closely related languages.
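Since the details of this dataset are still open, the following is only a hypothetical sketch of what an instance and a candidate filter might look like. The field names, the use of `difflib.SequenceMatcher` as a measure of formal similarity, and the `min_ratio` value are all assumptions; the context strings are placeholders rather than real data.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class CrossLingualInstance:
    """A target form in Galician and in another language, one context each."""
    form_gl: str
    form_other: str
    other_lang: str      # "pt" or "es"
    context_gl: str
    context_other: str
    same_meaning: bool   # False would mark a false friend


def form_similarity(a, b):
    """Character-level similarity in [0, 1] between two lexical forms."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_candidate_pair(form_gl, form_other, min_ratio=0.75):
    """Keep pairs whose forms are identical or close enough to be confusable.
    min_ratio is a hypothetical cut-off, not a value from the source."""
    return form_similarity(form_gl, form_other) >= min_ratio


# Galician 'xeral' and Portuguese 'geral' differ only in orthography.
instance = CrossLingualInstance(
    form_gl="xeral",
    form_other="geral",
    other_lang="pt",
    context_gl="<Galician context placeholder>",
    context_other="<Portuguese context placeholder>",
    same_meaning=True,
)
print(is_candidate_pair(instance.form_gl, instance.form_other))
```

A purely character-based filter ignores systematic orthographic correspondences between the languages, so a rule-based normalization step could reasonably replace it.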

Finally, a quantitative analysis of the results will be conducted through an evaluation employing accuracy and correlation metrics.
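A minimal sketch of such an evaluation is given below, assuming accuracy for the binary same-meaning decision and Spearman's rank correlation for comparing model scores with human ratings. The choice of Spearman is an assumption, since the text does not name a specific correlation coefficient, and the input values are invented.

```python
from math import sqrt


def accuracy(gold, predicted):
    """Proportion of items whose predicted label equals the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented labels and scores, for illustration only.
acc = accuracy([True, False, True, True], [True, True, True, False])
rho = spearman([0.9, 0.2, 0.5, 0.7], [4.5, 1.0, 2.0, 3.5])
print(acc, rho)
```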

Preliminary results

In the initial phase of the investigation, it was determined that the number of Galician examples was insufficient for the required WSD tasks [22]. We decided to design a method [40] to increase the number of Galician examples by translating the English examples associated with a synset containing a Galician word in Galnet, relying on state-of-the-art neural machine translation systems, specifically NOS-MT-OpenNMT-gl-en [41]. In some cases, however, the synset does not have an associated Galician word. Therefore, bilingual word embeddings were employed as probabilistic dictionaries to search for new words. The following procedure was used: the Wikipedia versions of each language were processed with FreeLing [42] for Galician and UDPipe [43] for English. Two monolingual fastText models [17] were trained for each language, one on each version of the corpus: a lemmatized version, and another representing each word as a lemma_POS-tag pair. We then mapped the monolingual models to a shared vector space. Finally, we designed and evaluated straightforward heuristics to decide whether each new sentence can be added to Galnet. Following the preliminary experiments, we have expanded Galnet by more than 4.5k synsets and 13k examples. These are being employed to construct WiC-type datasets in Galician.
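The use of bilingual embeddings as a dictionary can be sketched as a nearest-neighbour search in the shared space: given the vector of an English entry, the Galician entries are ranked by cosine similarity and the top candidates are proposed. The sketch assumes the mapping to the shared space has already been done; the three-dimensional vectors, the exact `lemma_POS` key format and the function `translation_candidates` are invented for illustration and do not reproduce the heuristics of [40].

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def translation_candidates(source_vec, target_space, k=3):
    """Return the k target entries closest to source_vec, best first."""
    scored = [(word, cosine(source_vec, vec)) for word, vec in target_space.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors assumed to live in one shared space (values are invented).
english = {"car_NOUN": [0.9, 0.1, 0.0]}
galician = {
    "coche_NOUN": [0.8, 0.2, 0.1],
    "canto_NOUN": [0.0, 0.3, 0.9],
    "parede_NOUN": [0.1, 0.9, 0.2],
}

candidates = translation_candidates(english["car_NOUN"], galician, k=2)
for word, score in candidates:
    print(word, round(score, 3))
```

Restricting candidates to the same part of speech, which the lemma_POS-tag representation makes possible, is one simple kind of filter that could be applied on top of this ranking.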

¹ During the course of the thesis, we will also be able to carry out comparative analyses between other languages such as Portuguese and Spanish.

² https://ilg.usc.gal/galnet/

Acknowledgments

This project is carried out within the Research Group in Computational Linguistics (LComp, GI-2201), which is part of the Department of Spanish Language and Literature, Theory of Literature and General Linguistics of the University of Santiago de Compostela. It is part of the activity 2021-PG012 'Consolidación 2021 Modalidade C. Proxectos de excelencia - Exploración do coñecemento semántico en modelos vectoriais: homonimia, sinonimia, polisemia e idiomaticidade', following the research line 'Comprensión das linguas naturais: sintaxe e semántica computacionais' (PIESP0027) in the Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), with funding from a pre-doctoral grant from the Xunta de Galicia (ED481A-2024-070).

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019. doi:10.18653/v1/N19-1423.

[2] H. Zhang, M. O. Shafiq, Survey of transformers and towards ensemble learning using transformers for natural language processing, Journal of Big Data 11 (2024) 25.

[3] R. Navigli, Word sense disambiguation: A survey, ACM Computing Surveys (CSUR) 41 (2009).

[4] D. Loureiro, K. Rezaee, M. T. Pilehvar, J. Camacho-Collados, Analysis and Evaluation of Language Models for Word Sense Disambiguation, Computational Linguistics 47 (2021). doi:10.1162/coli_a_00405.

[5] A. Liu, Z. Wu, J. Michael, A. Suhr, P. West, A. Koller, S. Swayamdipta, N. A. Smith, Y. Choi, We're afraid language models aren't modeling ambiguity, 2023.

[6] A. Yadav, A. Patel, M. Shah, A comprehensive review on resolving ambiguities in natural language processing, AI Open 2 (2021). doi:10.1016/j.aiopen.2021.05.001.

[7] A. Raganato, J. Camacho-Collados, R. Navigli, Word sense disambiguation: A unified evaluation framework and empirical comparison, in: M. Lapata, P. Blunsom, A. Koller (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Volume 1 (Long Papers), Association for Computational Linguistics, Valencia, Spain, 2017.

[8] D. Loureiro, K. Rezaee, M. T. Pilehvar, J. Camacho-Collados, Language models and word sense disambiguation: An overview and analysis, CoRR abs/2008.11608 (2020).

[9] M. Apidianaki, From Word Types to Tokens and Back: A Survey of Approaches to Word Meaning Representation and Interpretation, Computational Linguistics 49 (2022). doi:10.1162/coli_a_00474.

[10] S. Trott, B. Bergen, Word meaning is both categorical and continuous, Psychological Review 130 (2023). doi:10.1037/rev0000420.

[11] C. Fellbaum, WordNet: An Electronic Lexical Database, 1998. doi:10.7551/mitpress/7287.001.0001.

[12] G. A. Miller, C. Fellbaum, WordNet then and now, Language Resources and Evaluation 41 (2007). doi:10.1007/s10579-007-9044-6.

[13] M. Baroni, A. Lenci, Distributional memory: A general framework for corpus-based semantics, Computational Linguistics 36 (2010). doi:10.1162/coli_a_00016.

[14] K. Erk, Vector space models of word meaning and phrase meaning: A survey, Language and Linguistics Compass 6 (2012). doi:10.1002/lnco.362.

[15] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Association for Computational Linguistics, New Orleans, Louisiana, 2018. doi:10.18653/v1/N18-1202.

[16] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Proceedings of Workshop at ICLR, 2013.

[17] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017). doi:10.1162/tacl_a_00051.

[18] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, 2020.

[20] D. A. Cruse, Lexical semantics, Cambridge textbooks in linguistics, Cambridge University Press, Cambridge, England, 1986.

[21] M. T. Pilehvar, J. Camacho-Collados, WiC: the word-in-context dataset for evaluating context-sensitive meaning representations, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019. doi:10.18653/v1/N19-1128.

[22] A. Raganato, T. Pasini, J. Camacho-Collados, M. T. Pilehvar, XL-WiC: A multilingual benchmark for evaluating semantic contextualization, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.emnlp-main.584.

[23] E. Barba, L. Procopio, R. Navigli, ConSeC: Word sense disambiguation as continuous sense comprehension, in: M.-F. Moens, X. Huang, L. Specia, S. W.-T. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021. doi:10.18653/v1/2021.emnlp-main.112.

[24] T. Pasini, A. Raganato, R. Navigli, XL-WSD: An extra-large and cross-lingual evaluation framework for word sense disambiguation, in: Proc. of AAAI, 2021.

[25] M. Abeysiriwardana, D. Sumanathilaka, A survey on lexical ambiguity detection and word sense disambiguation, arXiv:2403.16129, 2024.

[26] M. A. Solla Portela, X. Gómez Guinovart, Galnet: o WordNet do galego. Aplicacións lexicolóxicas e terminolóxicas, Revista Galega de Filoloxía 16 (2015). doi:10.17979/rgf.2015.16.0.1383.

[27] X. Gómez Guinovart, Galnet: WordNet 3.0 do galego, Linguamática 3 (2011).

[28] A. Gonzalez-Agirre, E. Laparra, G. Rigau, Multilingual central repository version 3.0, in: N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012.

[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.

[30] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.747.

[31] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: International Conference on Learning Representations, 2021.

[32] S. Nair, M. Srinivasan, S. Meylan, Contextualized word embeddings encode aspects of human-like word sense knowledge, in: M. Zock, E. Chersoni, A. Lenci, E. Santus (Eds.), Proceedings of the Workshop on the Cognitive Aspects of the Lexicon, Association for Computational Linguistics, 2020.

[33] M. Garcia, Exploring the representation of word meanings in context: A case study on homonymy and synonymy, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Volume 1 (Long Papers), Association for Computational Linguistics, 2021. doi:10.18653/v1/2021.acl-long.281.

[34] P. D. Rivière, A. L. Beatty-Martínez, S. Trott, Bidirectional transformer representations of (Spanish) ambiguous words in context: A new lexical resource and empirical analysis, 2024.

[35] G. Kamath, S. Schuster, S. Vajjala, S. Reddy, Scope ambiguities in large language models, Transactions of the Association for Computational Linguistics 12 (2024).

[36] S. Trott, C. Jones, T. Chang, J. Michaelov, B. Bergen, Do large language models know what humans know?, Cognitive Science 47 (2023) e13309.

[37] G. Wiedemann, S. Remus, A. Chawla, C. Biemann, Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings, in: Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, German Society for Computational Linguistics & Language Technology, Erlangen, Germany, 2019.

[38] W. Liao, Z. Wang, K. Shum, A. B. Chan, J. Hsiao, Do large language models resolve semantic ambiguities in the same way as humans? The case of word segmentation in Chinese sentence reading, in: Proceedings of the Annual Meeting of the Cognitive Science Society, volume 46, 2024.

[39] J. Haber, M. Poesio, Word sense distance in human similarity judgements and contextualised word embeddings, in: C. Howes, S. Chatzikyriakidis, A. Ek, V. Somashekarappa (Eds.), Proceedings of the Probability and Meaning Conference (PaM 2020), Association for Computational Linguistics, Gothenburg, 2020.

[40] M. Vázquez Abuín, M. Garcia, WordNet expansion with bilingual word embeddings and neural machine translation, in: EPIA Conference on Artificial Intelligence, Springer, 2024.

[41] P. Gamallo, D. Bardanca, J. R. Pichel, M. Garcia, S. Rodríguez-Rey, I. de Dios-Flores, Nos_mt-opennmt-en-gl, 2023.

[42] L. Padró, Analizadores multilingües en FreeLing, Linguamática 3 (2012).

[43] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: D. Zeman, J. Hajič (Eds.), Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018. doi:10.18653/v1/K18-2020.