
A Coruña, Spain

Email: marcos.garcia.gonzalez@usc.gal (M. Garcia); pablo.gamallo@usc.gal (P. Gamallo); martin.pereira@usc.gal (M. Pereira-Fariña); iria.dedios@usc.gal (I. de-Dios-Flores)

An exploration of the semantic knowledge in vector models: polysemy, synonymy and idiomaticity

Marcos Garcia

Pablo Gamallo

Martín Pereira-Fariña

Iria de-Dios-Flores

2022


In this paper, we present the project “An exploration of the semantic knowledge in vector models: polysemy, synonymy and idiomaticity”, funded by the Xunta de Galicia within the program “Consolidación e estruturación de unidades de investigación competitivas e outras accións de fomento: Proxectos de Excelencia”, with a duration of 5 years (2021-2026). The main objective of the project is to analyze how the most recent language models represent several aspects of lexical semantics: polysemy and homonymy, synonymy, and idiomaticity. The languages we work with are Galician-Portuguese (mainly in its Galician and Portuguese varieties), Spanish and English.

Keywords: lexical semantics, distributional semantics, language models

1. Introduction and objectives

The use of architectures based on artificial neural networks has become the dominant approach to natural language processing (NLP) in recent years [1], producing significantly better results in numerous areas than supervised models designed by selecting individual features of the target tasks [2]. This paradigm shift has promoted the popularization of vector models inspired by the distributional hypothesis [3, 4], which until then were mainly used in research in cognitive science and computational linguistics [5, 6, 7]. In this field, the implementation of computationally more efficient architectures, with drastic reductions in dimensionality [8], has sparked great interest in distributional semantics studies, boosted also by the findings about the various linguistic regularities encoded by these models [9]. This area, previously dominated by linguistically informed and more interpretable methodologies (e.g., using vectors built through syntactic dependencies [10]), has become one of the most productive in NLP research [11].

In this regard, the emergence of deep learning techniques using multilayer deep neural networks with millions of parameters (which require large computational infrastructures) has led to the proliferation of language models that perform NLP tasks more accurately. Among various others, we can highlight the public models ELMo (Embeddings from Language Models [12]) and the different variants of BERT (Bidirectional Encoder Representations from Transformers [13]).

The project presented in this paper fits into this new line of research and focuses on the analysis of the ability of these models to solve various types of lexical ambiguity:¹

1. Polysemy and homonymy, i.e., a single orthographic form that has different meanings (or senses) depending on the context. For example, school as a building, as an organization, or as a group of people (polysemy), or bank as a financial institution or as sloping raised land (homonymy).
2. Synonymy, i.e., different words expressing the same meaning in certain contexts (e.g., coach or bus to refer to a long motor vehicle).
3. Idiomaticity, i.e., multiword expressions (MWEs) whose meaning does not correspond to that of their constituent elements (e.g., glass ceiling as a social barrier for women).

¹ We broadly follow [14] for the definition of the phenomena mentioned here.

Taking the above into account, our research aims to fill a particularly important gap in the evaluation of these computational models by investigating the presence of various types of knowledge related to lexical semantics in several languages. Thus, the main goal of the project is to explore the most recent language models concerning the representation of polysemy and homonymy, synonymy and semantic compositionality, as well as to compare them with more interpretable distributional and compositional methods.

The results of the present project will be useful to advance the understanding of the semantic information encoded both in static distributional representations and in large language models trained with deep neural networks. In addition, and although the project is mainly focused on the exploration of models, both the datasets and the results of manual annotation will be an important contribution regarding the semantic interpretation of polysemy and homonymy, synonymy and idiomaticity by native speakers of various languages.

2. Methodology and work plan

To develop this project, we will use the following methodology and instrumental techniques, which in general correspond to the state-of-the-art research in NLP and computational linguistics.

Regarding the experimental design and the data collection, we will use standard methodologies from studies in semantics [14] and in psycholinguistics [15, 16], aimed at generating controlled stimuli. Likewise, to collect annotations from human informants, we will use crowdsourcing methods which will allow us to obtain data from native speakers quickly and efficiently, with quality control of the annotations [17].

Regarding the computational models, those based on Transformer architectures will be implemented using the transformers library, which includes the latest models based on deep learning. We may also use other open source libraries that incorporate additional models. To train and run static embeddings, we will use gensim² and the official tools released by the authors of other distributional methods based on interpretable syntactic dependencies (e.g., [18]).

² https://radimrehurek.com/gensim/

Finally, to compare the representations of the computational models with the values obtained from the human annotations, we will use three methods:

1. Precision scores, in evaluations with discrete values (e.g., homonymy or synonymy, and in the results of linear classifiers).
2. Correlation values, in graded evaluations (polysemy or idiomaticity).
3. Representational Similarity Analysis, to see if the models predict relative differences between examples of the same type (e.g., a word or MWE with the same meaning in different contexts) in a similar way to humans.

It should be noted that these methods have already been used in previous works, which we briefly mention below.

2.1. First results

Although we are at an early stage, we already have some published results, both from previous research directly related to this proposal and from work carried out since the beginning of the project. Thus, we have already presented various datasets with semantic idiomaticity annotation at token and type levels in English and Portuguese, and used them to evaluate several language models [19, 20]. In addition, we have created a new dataset in Galician-Portuguese, English and Spanish that includes examples of homonymy and synonymy in context, also used to compare various contextualization models and strategies [21].

More recently, we have compared Transformer models and distributional strategies based on syntactic dependencies in semantic compositionality tasks [18, 22]. Finally, we have participated in the co-organization of the task Multilingual Idiomaticity Detection and Sentence Embedding (SemEval 2022), in which we have presented new resources with annotation of semantic idiomaticity in context in Galician-Portuguese and English [23].

3. Work team

The project presented in this paper is carried out at the Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS) of the Universidade de Santiago de Compostela, and belongs to its scientific program in Natural Language Technologies. Accordingly, members of the center collaborate on different tasks of our work plan that fall within their respective areas of expertise.

Besides the principal investigator, the project has research and work teams formed by three PhDs with specializations in Computational Linguistics, Psycholinguistics, Logic and Computer Science. In collaboration with a pre-doctoral researcher and technical staff that will be hired with the project funds, these teams actively participate in the different stages of the project. Finally, we also rely on the collaboration of researchers from other universities, both Galician and international, with whom we have already participated in joint initiatives and projects with similar themes to the one presented in this paper.

Acknowledgments

Project funded by the Galician Government (Consolidación e estruturación de unidades de investigación competitivas e outras accións de fomento: Proxectos de Excelencia, ED431F 2021/01) and by a Ramón y Cajal grant (RYC2019-028473-I).

References

[1] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493–2537.

[2] T. Schnabel, I. Labutov, D. Mimno, T. Joachims, Evaluation methods for unsupervised word embeddings, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 298–307. URL: https://aclanthology.org/D15-1036. doi:10.18653/v1/D15-1036.

[3] Z. S. Harris, Distributional structure, Word 10 (1954) 146–162.

[4] J. R. Firth, A synopsis of linguistic theory 1930-1955, Studies in Linguistic Analysis (1957) 1–32. Reprinted in F.R. Palmer (Ed.), Selected Papers of J.R. Firth 1952–1959, London: Longman (1968).

[5] G. A. Miller, Empirical methods in the study of semantics, in: D. D. Steinberg, L. A. Jakobovits (Eds.), Semantics: An Interdisciplinary Reader in Philosophy, Linguistics and Psychology, 1971, pp. 569–585.

[6] T. K. Landauer, S. T. Dumais, A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review 104 (1997) 211.

[7] J. Mitchell, M. Lapata, Composition in distributional models of semantics, Cognitive Science 34 (2010) 1388–1429.

[8] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Workshop Proceedings of the International Conference on Learning Representations, 2013.

[9] T. Mikolov, W.-t. Yih, G. Zweig, Linguistic regularities in continuous space word representations, in: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, 2013, pp. 746–751. URL: https://aclanthology.org/N13-1090.

[10] S. Padó, M. Lapata, Dependency-based construction of semantic space models, Computational Linguistics 33 (2007) 161–199. URL: https://aclanthology.org/J07-2002. doi:10.1162/coli.2007.33.2.161.

[11] G. Boleda, Distributional semantics and linguistic theory, Annual Review of Linguistics 6 (2020) 213–234.

[12] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 2227–2237. URL: https://aclanthology.org/N18-1202. doi:10.18653/v1/N18-1202.

[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[14] D. A. Cruse, Lexical semantics, Cambridge University Press, 1986.

[15] R. L. Goldstone, Influences of categorization on perceptual discrimination, Journal of Experimental Psychology: General 123 (1994) 178.

[16] R. Richie, B. White, S. Bhatia, M. C. Hout, The spatial arrangement method of measuring similarity can capture high-dimensional semantic structures, Behavior Research Methods 52 (2020) 1906–1928.

[17] Munro, Bethard, Kuperman, V. T. Lai, Melnick, Potts, Schnoebelen, Tily, Crowdsourcing and language studies: the new generation of linguistic data, in: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, Association for Computational Linguistics, Los Angeles, 2010, pp. 122–130. URL: https://aclanthology.org/W10-0719.

[18] P. Gamallo, M. de Prada Corral, M. Garcia, Comparing Dependency-based Compositional Models with Contextualized Word Embeddings, in: Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART 2021), Volume 2, 2021, pp. 1258–1265.

[19] M. Garcia, T. Kramer Vieira, Scarton, Idiart, A. Villavicencio, Assessing the representations of idiomaticity in vector models with a noun compound dataset labeled at type and token levels, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP), ACL, 2021, pp. 2730–2741. URL: https://aclanthology.org/2021.acl-long.212. doi:10.18653/v1/2021.acl-long.212.

[20] M. Garcia, T. Kramer Vieira, Scarton, Idiart, A. Villavicencio, Probing for idiomaticity in vector space models, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 3551–3564. URL: https://aclanthology.org/2021.eacl-main.310. doi:10.18653/v1/2021.eacl-main.310.

[21] M. Garcia, Exploring the representation of word meanings in context: A case study on homonymy and synonymy, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3625–3640. URL: https://aclanthology.org/2021.acl-long.281. doi:10.18653/v1/2021.acl-long.281.

[22] P. Gamallo, M. Garcia, I. de-Dios-Flores, Evaluating Contextualized Vectors from Large Language Models and Compositional Strategies, Procesamiento del Lenguaje Natural 69 (2022).

[23] Tayyar Madabushi, Gow-Smith, M. Garcia, Scarton, Idiart, A. Villavicencio, SemEval-2022 task 2: Multilingual idiomaticity detection and sentence embedding, in: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, Seattle, United States, 2022, pp. 107–121. URL: https://aclanthology.org/2022.semeval-1.13.