<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A Coruña, Spain. marcos.garcia.gonzalez@usc.gal (M. Garcia); pablo.gamallo@usc.gal (P. Gamallo); martin.pereira@usc.gal (M. Pereira-Fariña); iria.dedios@usc.gal (I. de-Dios-Flores)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An exploration of the semantic knowledge in vector models: polysemy, synonymy and idiomaticity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcos Garcia</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Gamallo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martín Pereira-Fariña</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iria de-Dios-Flores</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In this paper, we present the project An exploration of the semantic knowledge in vector models: polysemy, synonymy and idiomaticity, funded by the Xunta de Galicia within the program “Consolidación e estruturación de unidades de investigación competitivas e outras accións de fomento: Proxectos de Excelencia”, with a duration of 5 years (2021-2026). The main objective of the project is the analysis of the most recent language models regarding the representation of several aspects of lexical semantics: polysemy and homonymy, synonymy and idiomaticity. The languages in which we are working are Galician-Portuguese (in its Galician and Portuguese varieties, fundamentally), Spanish and English.</p>
      </abstract>
      <kwd-group>
        <kwd>lexical semantics</kwd>
        <kwd>distributional semantics</kwd>
        <kwd>language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and objectives</title>
      <p>The use of architectures based on artificial neural networks has become the dominant approach to natural language processing (NLP) in recent years [1], producing significantly better results in numerous areas than supervised models designed by selecting individual features of the target tasks [2]. This paradigm shift has promoted the popularization of vector models inspired by the distributional hypothesis [3, 4], which until then were mainly used in research in cognitive science and computational linguistics [5, 6, 7]. In this field, the implementation of computationally more efficient architectures, with drastic reductions in dimensionality [8], has sparked great interest in distributional semantics studies, boosted also by the findings about the various linguistic regularities encoded by these models [9]. This area, previously dominated by linguistically informed and more interpretable methodologies (e.g., using vectors built through syntactic dependencies [10]), has become one of the most productive in NLP research [11].</p>
      <p>In this regard, the emergence of deep learning techniques using multilayer deep neural networks with millions of parameters (which require large computational infrastructures) has led to the proliferation of language models that perform NLP tasks more accurately. Among others, we can highlight the publicly available ELMo (Embeddings from Language Models [12]) and the different variants of BERT (Bidirectional Encoder Representations from Transformers [13]).</p>
      <p>The project presented in this paper fits into this new line of research and focuses on the analysis of the ability of these models to solve various types of lexical ambiguity:<sup>1</sup></p>
      <list list-type="order">
        <list-item>
          <p>Polysemy and homonymy, i.e., a single orthographic form that has different meanings (or senses) depending on the context. For example, school as a building, as an organization, or as a group of people (polysemy), or bank as a financial institution or as sloping raised land (homonymy).</p>
        </list-item>
        <list-item>
          <p>Synonymy, i.e., different words expressing the same meaning in certain contexts (e.g., coach or bus to refer to a long motor vehicle).</p>
        </list-item>
        <list-item>
          <p>Idiomaticity, i.e., multiword expressions (MWEs) whose meaning does not correspond to that of their constituent elements (e.g., glass ceiling as a social barrier for women).</p>
        </list-item>
      </list>
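      <p>As a rough illustration of how the phenomena listed above are usually probed in vector models, the following sketch compares occurrences of words through the cosine similarity of their vectors. The vectors, the keys and the threshold are invented for the example (they are not the output of any model), and the helper names cosine and same_sense are hypothetical.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical contextualized vectors: invented numbers, NOT model output.
toy_vectors = {
    "bank@finance": [0.9, 0.1, 0.0],  # "bank" in a financial context
    "bank@river": [0.1, 0.9, 0.1],    # "bank" in a riverside context
    "coach@travel": [0.2, 0.1, 0.9],  # "coach" as a vehicle
    "bus@travel": [0.3, 0.1, 0.9],    # "bus" in the same kind of context
}


def same_sense(key_a, key_b, threshold=0.8):
    """Guess whether two occurrences share a meaning.

    The threshold is an arbitrary assumption made for this example.
    """
    return cosine(toy_vectors[key_a], toy_vectors[key_b]) >= threshold


# Homonymy: same form, different meanings, so similarity should be low.
print(same_sense("bank@finance", "bank@river"))
# Synonymy: different forms, same meaning, so similarity should be high.
print(same_sense("coach@travel", "bus@travel"))
```
      </preformat>
      <p>With real models, the toy vectors would be replaced by the representations that each model produces for the target word or expression in its sentence.</p>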
      <p><sup>1</sup> We broadly follow [14] for the definition of the phenomena mentioned here.</p>
      <p>Taking the above into account, our research aims to fill a particularly important gap in the evaluation of these computational models by investigating the presence of various types of knowledge related to lexical semantics in several languages. Thus, the main goal of the project is to explore the most recent language models concerning the representation of polysemy and homonymy, synonymy and semantic compositionality, as well as to compare them with more interpretable distributional and compositional methods.</p>
      <p>The results of the present project will be useful, first, to advance the understanding of the semantic information encoded both in static distributional representations and in large language models trained with deep neural networks. In addition, and although the project is mainly focused on the exploration of models, both the datasets and the results of manual annotation will be an important contribution regarding the semantic interpretation of polysemy and homonymy, synonymy and idiomaticity by native speakers of various languages.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology and work plan</title>
      <p>To develop this project, we will use the following methodology and instrumental techniques, which in general correspond to the state of the art in NLP and computational linguistics research.</p>
      <p>Regarding the experimental design and the data collection, we will use standard methodologies from studies in semantics [14] and in psycholinguistics [15, 16], aimed at generating controlled stimuli. Likewise, to collect annotations from human informants, we will use crowdsourcing methods, which will allow us to obtain data from native speakers quickly and efficiently, with quality control of the annotations [<xref ref-type="bibr" rid="ref1">17</xref>].</p>
      <p>Regarding the computational models, those based on Transformer architectures will be implemented using the transformers library, which includes the latest models based on deep learning. We may also use other open-source libraries that incorporate additional models. To train and run static embeddings, we will use gensim<sup>2</sup> and the official tools released by the authors of other distributional methods based on interpretable syntactic dependencies (e.g., [<xref ref-type="bibr" rid="ref2">18</xref>]).</p>
      <p><sup>2</sup> https://radimrehurek.com/gensim/</p>
      <p>Finally, to compare the representations of the computational models with the values obtained from the human annotations, we will use three methods:</p>
      <list list-type="order">
        <list-item>
          <p>Precision scores, in evaluations with discrete values (e.g., homonymy or synonymy, and in the results of linear classifiers).</p>
        </list-item>
        <list-item>
          <p>Correlation values, in graded evaluations (polysemy or idiomaticity).</p>
        </list-item>
        <list-item>
          <p>Representational Similarity Analysis, to see if the models predict relative differences between examples of the same type (e.g., a word or MWE with the same meaning in different contexts) in a similar way to humans.</p>
        </list-item>
      </list>
      <p>It should be noted that these methods have already been used in previous works, which we briefly mention below.</p>
      <sec id="sec-2-1">
        <title>2.1. First results</title>
        <p>Although we are at an early stage, we already have some published results, both from previous research directly related to this proposal and from work carried out since the beginning of the project. Thus, we have already presented various datasets with semantic idiomaticity annotation at token and type levels in English and Portuguese, and used them to evaluate several language models [<xref ref-type="bibr" rid="ref3 ref4">19, 20</xref>]. In addition, we have created a new dataset in Galician-Portuguese, English and Spanish that includes examples of homonymy and synonymy in context, also used to compare various contextualization models and strategies [<xref ref-type="bibr" rid="ref5">21</xref>]. More recently, we have compared Transformer models and distributional strategies based on syntactic dependencies in semantic compositionality tasks [<xref ref-type="bibr" rid="ref2 ref6">18, 22</xref>]. Finally, we have participated in the co-organization of the task Multilingual Idiomaticity Detection and Sentence Embedding (SemEval 2022), in which we have presented new resources with annotation of semantic idiomaticity in context in Galician-Portuguese and English [<xref ref-type="bibr" rid="ref7">23</xref>].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Work team</title>
      <p>The project presented in this paper is carried out at the Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS) of the Universidade de Santiago de Compostela, and belongs to its scientific program in Natural Language Technologies. Accordingly, members of the center collaborate on the tasks of our work plan that fall within their respective areas of expertise.</p>
      <p>Besides the principal investigator, the project has research and work teams formed by three PhDs with specializations in Computational Linguistics, Psycholinguistics, Logic and Computer Science. In collaboration with a pre-doctoral researcher and technical staff that will be hired with the project funds, these teams actively participate in the different stages of the project. Finally, we also rely on the collaboration of researchers from other universities, both Galician and international, with whom we have already participated in joint initiatives and projects with similar themes to the one presented in this paper.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>Project funded by the Galician Government (Consolidación e estruturación de unidades de investigación competitivas e outras accións de fomento: Proxectos de Excelencia, ED431F 2021/01) and by a Ramón y Cajal grant (RYC2019-028473-I).</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[1] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493–2537.</p>
      <p>[2] T. Schnabel, I. Labutov, D. Mimno, T. Joachims, Evaluation methods for unsupervised word embeddings, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 298–307. URL: https://aclanthology.org/D15-1036. doi:10.18653/v1/D15-1036.</p>
      <p>[3] Z. S. Harris, Distributional structure, Word 10 (1954) 146–162.</p>
      <p>[4] J. R. Firth, A synopsis of linguistic theory 1930-1955, Studies in Linguistic Analysis (1957) 1–32. Reprinted in F.R. Palmer (Ed.), Selected Papers of J.R. Firth 1952–1959, London: Longman (1968).</p>
      <p>[5] G. A. Miller, Empirical methods in the study of semantics, in: D. D. Steinberg, L. A. Jakobovits (Eds.), Semantics: An Interdisciplinary Reader in Philosophy, Linguistics and Psychology, 1971, pp. 569–585.</p>
      <p>[6] T. K. Landauer, S. T. Dumais, A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review 104 (1997) 211.</p>
      <p>[7] J. Mitchell, M. Lapata, Composition in distributional models of semantics, Cognitive Science 34 (2010) 1388–1429.</p>
      <p>[8] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Workshop Proceedings of the International Conference on Learning Representations, 2013.</p>
      <p>[9] T. Mikolov, W.-t. Yih, G. Zweig, Linguistic regularities in continuous space word representations, in: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, 2013, pp. 746–751. URL: https://aclanthology.org/N13-1090.</p>
      <p>[10] S. Padó, M. Lapata, Dependency-based construction of semantic space models, Computational Linguistics 33 (2007) 161–199. URL: https://aclanthology.org/J07-2002. doi:10.1162/coli.2007.33.2.161.</p>
      <p>[11] G. Boleda, Distributional semantics and linguistic theory, Annual Review of Linguistics 6 (2020) 213–234.</p>
      <p>[12] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 2227–2237. URL: https://aclanthology.org/N18-1202. doi:10.18653/v1/N18-1202.</p>
      <p>[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[14] D. A. Cruse, Lexical semantics, Cambridge University Press, 1986.</p>
      <p>[15] R. L. Goldstone, Influences of categorization on perceptual discrimination, Journal of Experimental Psychology: General 123 (1994) 178.</p>
      <p>[16] R. Richie, B. White, S. Bhatia, M. C. Hout, The spatial arrangement method of measuring similarity can capture high-dimensional semantic structures, Behavior Research Methods 52 (2020) 1906–1928.</p>
    </sec>
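    <sec id="sec-sketch-evaluation">
      <title>Illustrative sketch of the comparison methods</title>
      <p>The three methods named in Section 2 for comparing model representations with human annotations (precision scores, correlation values and Representational Similarity Analysis) can be sketched in a few lines of plain Python. This is only a minimal sketch under stated assumptions: the function names are hypothetical, Spearman's rank correlation is assumed as the correlation measure, and Representational Similarity Analysis is approximated as the rank correlation between the model's pairwise cosine distances and the human pairwise dissimilarities, which is one common formulation and not necessarily the exact variant used in the project.</p>
      <preformat>
```python
import math


def precision(gold, predicted, positive=1):
    """Share of items predicted as `positive` that are also positive in gold."""
    hits = sum(1 for g, p in zip(gold, predicted) if p == positive and g == positive)
    predicted_positive = sum(1 for p in predicted if p == positive)
    return hits / predicted_positive if predicted_positive else 0.0


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def rsa(model_vectors, human_dissimilarity):
    """Rank correlation between model and human pairwise dissimilarities.

    Only the upper triangle of each dissimilarity matrix is compared.
    """
    n = len(model_vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    model = [cosine_distance(model_vectors[i], model_vectors[j]) for i, j in pairs]
    human = [human_dissimilarity[i][j] for i, j in pairs]
    return spearman(model, human)


# Toy data, invented for the example.
print(precision(gold=[1, 0, 1, 1], predicted=[1, 1, 1, 0]))
print(spearman([0.1, 0.4, 0.8], [1.0, 2.5, 4.0]))
print(rsa([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
          [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8], [0.9, 0.8, 0.0]]))
```
      </preformat>
      <p>In practice, the gold labels and human dissimilarities would come from the annotated datasets, and the model vectors from the contextualized or static representations under evaluation.</p>
    </sec>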
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Munro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kuperman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. T.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Melnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schnoebelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tily</surname>
          </string-name>
          ,
          <article-title>Crowdsourcing and language studies: the new generation of linguistic data</article-title>
          ,
          <source>in: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk</source>
          , Association for Computational Linguistics, Los Angeles,
          <year>2010</year>
          , pp.
          <fpage>122</fpage>
          -
          <lpage>130</lpage>
          . URL: https://aclanthology.org/W10-0719.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gamallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Prada Corral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <article-title>Comparing Dependency-based Compositional Models with Contextualized Word Embeddings</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART 2021)</source>
          , Volume
          <volume>2</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>1258</fpage>
          -
          <lpage>1265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Kramer</given-names>
            <surname>Vieira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Idiart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Villavicencio</surname>
          </string-name>
          ,
          <article-title>Assessing the representations of idiomaticity in vector models with a noun compound dataset labeled at type and token levels</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP)</source>
          , ACL,
          <year>2021</year>
          , pp.
          <fpage>2730</fpage>
          -
          <lpage>2741</lpage>
          . URL: https://aclanthology.org/2021.acl-long.212. doi:10.18653/v1/2021.acl-long.212.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Kramer</given-names>
            <surname>Vieira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Idiart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Villavicencio</surname>
          </string-name>
          ,
          <article-title>Probing for idiomaticity in vector space models</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>3551</fpage>
          -
          <lpage>3564</lpage>
          . URL: https://aclanthology.org/2021.eacl-main.310. doi:10.18653/v1/2021.eacl-main.310.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <article-title>Exploring the representation of word meanings in context: A case study on homonymy and synonymy</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>3625</fpage>
          -
          <lpage>3640</lpage>
          . URL: https://aclanthology.org/2021.acl-long.281. doi:10.18653/v1/2021.acl-long.281.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gamallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
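          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>de-Dios-Flores</surname>
          </string-name>
          ,
          <article-title>Evaluating Contextualized Vectors from Large Language Models and Compositional Strategies</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>69</volume>
          (
          <year>2022</year>
          ).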
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tayyar Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gow-Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Idiart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Villavicencio</surname>
          </string-name>
          ,
          <article-title>SemEval-2022 task 2: Multilingual idiomaticity detection and sentence embedding</article-title>
          ,
          <source>in: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)</source>
          , Association for Computational Linguistics, Seattle, United States,
          <year>2022</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>121</lpage>
          . URL: https://aclanthology.org/2022.semeval-1.13.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>