      ANDI @ CONcreTEXT: Predicting concreteness in context for
English and Italian using distributional models and behavioural norms

                                      Armand Stefan Rotaru
                                       Independent researcher
                                   armand.rotaru@gmail.com


                    Abstract

In this paper we describe our participation in the CONcreTEXT task of EVALITA 2020, which involved predicting subjective ratings of concreteness for words presented in context. Our approach, which ranked first in both the English and Italian subtasks, relies on a combination of context-dependent and context-independent distributional models, together with behavioural norms. We show that good results can be obtained for Italian by first automatically translating the Italian stimuli into English, and then using existing resources for both Italian and English.

1   Introduction

In our everyday life we rarely encounter words in isolation. Instead, we typically process words as part of sentences or phrases, and these linguistic contexts shape our understanding of individual words. However, for various reasons, the overwhelming majority of behavioural norms that have been collected so far focus only on single words or word pairs (Johns et al., 2020).

Thus, the EVALITA 2020 (Basile et al., 2020) CONcreTEXT Task (Gregori et al., 2020) represents a timely and valuable contribution to the study of context-dependent semantics. The task asks competitors to predict subjective ratings of concreteness for words presented within sentences. As mentioned by the organizers, being able to automatically compute contextual concreteness ratings would have several practical applications, such as identifying the use of figurative language, detecting words that might be difficult to understand for language learners, and allowing tighter control of contextual variables in psycholinguistic experiments.

In this paper we describe our computational models, based on pre-trained distributional models and behavioural norms, which ranked first in both the English and Italian tracks of the competition1. We find that the best performance can be obtained by employing a combination of transformer models developed in the last two years. Moreover, for Italian, it is possible to reach good levels of performance by relying on both the original stimuli and their English translation, which allows access to resources for both languages.

1.1   General description

In order to predict concreteness in context, we use information derived from three types of sources, namely behavioural norms and distributional models, the latter being either context-independent (i.e., a model outputs the same vector representation for a given word, regardless of the context in which the word is encountered) or context-dependent (i.e., a model outputs a potentially different representation for a given word, as a function of the context in which the word is presented).

Firstly, we employ behavioural norms collected for a wide variety of psycholinguistic factors. Of particular interest to us are norms for concreteness (Brysbaert et al., 2014), semantic diversity (Hoffman et al., 2013), age of acquisition (Kuperman et al., 2012), emotional dimensions (i.e., valence, arousal, and dominance; Mohammad, 2018), and sensorimotor dimensions (i.e., modality strengths for the tactile, auditory, olfactory, gustatory, visual, and interoceptive modalities; interaction strengths for the mouth/throat, hand/arm, foot/leg, head excluding mouth/throat, and torso effectors; Lynott et al., 2019), as well as frequency and contextual diversity counts (Van Heuven et al., 2014).

1 https://github.com/armandrotaru/TeamAndi-CONcreTEXT
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).

We focus on these specific factors since they are meaningfully related to word concreteness (see the previous references).

Secondly, we employ context-independent distributional models, namely Skip-gram (Mikolov et al., 2013), FastText (Bojanowski et al., 2017), GloVe (Pennington et al., 2014), and ConceptNet NumberBatch (Speer et al., 2017). Such models have been used in order to accurately predict a range of psycholinguistic variables, including concreteness (ρ = .88; Paetzold & Specia, 2016).

Thirdly, we employ context-dependent distributional models, namely BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), AlBERTo (Polignano et al., 2019), GPT-2 (Radford et al., 2019), Bart (Lewis et al., 2019), and ALBERT (Lan et al., 2020). Although they have become extremely popular after achieving human-level performance in various linguistic tasks (e.g., those in the GLUE benchmark; Wang et al., 2018), we are not aware of studies looking at whether such models can accurately predict (contextualized) subjective ratings. Nevertheless, since these models were specifically designed to process rich contextual information, they could be a valuable tool for predicting ratings of concreteness in context.

1.2   Predictors for English

We tested (combinations of) three groups of predictors. The first group was derived from large datasets of ratings for concreteness, semantic diversity, age of acquisition, emotional dimensions, and sensorimotor dimensions, as well as frequency and contextual diversity counts based on the SUBTLEX-UK and BNC corpora (see the references from the beginning of the previous section). In order to extend the coverage of the subjective ratings, we did not directly use them as predictors of concreteness in context. Instead, we relied on the Skip-gram, GloVe, and ConceptNet NumberBatch models as a means of estimating the subjective ratings for more than 100,000 words, via linear regression. For the frequency and contextual diversity counts, we kept the original values, as they already have very good coverage. The intersection of the two datasets, which includes more than 70,000 words, served as the basis for our predictors of concreteness. More specifically, for each variable V (e.g., semantic diversity), we generated four predictors, namely V(w), V(c), V(w) * V(c), and abs(V(w) - V(c)), where:

• V(w) denotes the value of V corresponding to the word w (e.g., w = “offend”). If w is not present in our norms, we set V(w) to the average value of V, computed over the entire norms;

• V(c) denotes the value of V corresponding to the context c in which the word w is encountered (e.g., w = “offend”, c = “Do not insult or ___ anyone .”). Computing this value involves calculating the average V(c) = (V(c_1) + … + V(c_N)) / N, where V(c_i) is the value of V corresponding to the i-th context word, calculated as described previously, and N is the number of words that make up the context.

These predictors allowed us to include both the individual contributions of word w and its context c, as well as certain interactions between w and c.
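
As a rough illustration of this step (a minimal sketch rather than the exact code behind our submission), the four predictors for a single norm V can be computed as follows; the norms dictionary and the example variable name are purely illustrative.

    import numpy as np

    def norm_predictors(target, context_words, norms):
        # norms: dict mapping a word to its (estimated) value of V,
        # e.g., semantic diversity; out-of-vocabulary words fall back to
        # the mean of V, computed over the entire norms, as described above.
        mean_v = float(np.mean(list(norms.values())))
        v_w = norms.get(target, mean_v)
        # V(c): average of V over the N words that make up the context
        v_c = float(np.mean([norms.get(t, mean_v) for t in context_words]))
        return [v_w, v_c, v_w * v_c, abs(v_w - v_c)]

    # e.g., norm_predictors("offend", "Do not insult or ___ anyone .".split(), sem_div_norms)
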
The second group was derived from Skip-gram, GloVe, and ConceptNet NumberBatch embeddings, as well as from the concatenation of the three types of embeddings. The vocabulary of the four models is that described in the discussion above. Given the large number of dimensions involved (i.e., 300 + 300 + 300 + 900 = 1,800), we first extracted the top 20 principal components from each model (although comparable results can also be obtained by using a larger number of components). Then, for each variable V (e.g., PC3 from the GloVe model) we generated four predictors, namely V(w), V(c), V(w) * V(c), and abs(V(w) - V(c)), following the same procedure as in the previous discussion. In addition, based on (Frassinelli et al., 2017), for each distributional model we added four predictors based on a measure of neighbourhood density (i.e., the mean cosine similarity between a vector and its closest 20 vectors), using the same procedure as described above.
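
A minimal sketch of these two steps for one of the four embedding spaces (here the concatenation) is given below, assuming a matrix named embeddings of shape (vocabulary size, 1,800); with a vocabulary of more than 70,000 words the similarity matrix would in practice be computed in chunks.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.metrics.pairwise import cosine_similarity

    # Top 20 principal components per word (each column plays the role of a variable V)
    components = PCA(n_components=20).fit_transform(embeddings)

    # Neighbourhood density: mean cosine similarity between each vector
    # and its 20 closest vectors (the word itself is excluded)
    sims = cosine_similarity(embeddings)
    np.fill_diagonal(sims, -np.inf)
    density = np.sort(sims, axis=1)[:, -20:].mean(axis=1)
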
erences from the beginning of the previous sec-            The third group was derived from the BERT,
tion). In order to extend the coverage of the sub-      GPT-2, Bart, and ALBERT models. We used the
jective ratings, we did not directly use them as        standard (base) versions of each model (i.e., with-
predictors of concreteness in context. Instead, we      out task-specific fine-tuning), as described in the
relied on the Skip-gram, GloVe, and ConceptNet          original papers, and obtained from the Hugging
NumberBatch models, as a means of estimating            Face repository (https://huggingface.co/models).
the subjective ratings for more than 100,000               Unlike for the previous two groups, the predic-
words, via linear regression. For the frequency         tors consist only of a word’s activations from the
and contextual diversity counts, we kept the orig-      last hidden layer (i.e., for the GPT-2, Bart, and
inal values, as they already have very good cover-      ALBERT models), or averaged from the last four
age. The intersection of the two datasets, which        hidden layers (i.e., for the BERT model).
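
With the Hugging Face transformers library, the BERT-based predictors can be obtained roughly along the following lines; the checkpoint name and the mean-pooling over sub-word pieces are assumptions made for this sketch, since multi-token targets can be handled in several ways.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

    def target_activations(sentence, char_start, char_end):
        # 768-dim vector for the target occupying sentence[char_start:char_end]
        enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
        offsets = enc.pop("offset_mapping")[0].tolist()
        with torch.no_grad():
            hidden = model(**enc).hidden_states          # tuple of (1, n_tokens, 768) tensors
        layers = torch.stack(hidden[-4:]).mean(dim=0)    # average of the last four layers
        # keep the sub-word tokens that overlap the target's character span
        keep = [i for i, (s, e) in enumerate(offsets)
                if s < char_end and e > char_start and e > s]
        return layers[0, keep].mean(dim=0)               # mean-pool over sub-word pieces

    vec = target_activations("Do not insult or offend anyone.", 17, 23)
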
Importantly, for each group of predictors we generated two sets of variables, based on two versions of the target words (i.e., the words rated by the participants). In the first set we used the uninflected form of the target words, taken from the TARGET column. In contrast, in the second set we used the inflected form of the target words, taken from the words in the TEXT column located at the positions specified in the INDEX column. More details can be found in Table 1.

For predicting ratings of concreteness in context, we employed ridge regression, with large values of the parameter lambda (i.e., strong regularization), after standardizing all the variables.
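
In scikit-learn terms, this step might look roughly as follows; lambda corresponds to the alpha parameter, and 500 is the value used for our submitted runs (see the captions of Figures 1 and 2), while the data matrices are illustrative placeholders.

    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # X_train: (n_items, n_predictors) matrix of predictors; y_train: mean concreteness ratings
    model = make_pipeline(StandardScaler(), Ridge(alpha=500.0))  # strong regularization
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
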
1.3   Predictors for Italian

Our approach was similar to that for English, but with certain significant changes, as follows:

• for the first group of predictors, we began by automatically translating the Italian stimuli (i.e., the TARGET and TEXT columns) into English, using the MarianMT translation model (Junczys-Dowmunt et al., 2018; see the sketch after this list). Next, for the translated stimuli we derived the predictors using the exact same procedure as in the case of English;

• for the second group of predictors, we employed Italian versions of the FastText and ConceptNet NumberBatch models, together with their concatenation. We derived the predictors based on the top 30 principal components for each model, rather than the top 20 principal components, as in the case of English (although comparable results can also be obtained by using a larger number of components);

• for the third group of predictors, we again employed the English translations and relied on the same models as for English, and also the RoBERTa model. For the BERT model, we only used the activations from the last hidden layer. We also added the AlBERTo model, but with the Italian stimuli.
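
The translation step in the first bullet can be sketched as follows; the particular Italian-to-English checkpoint is an assumption on our part, as is the example sentence.

    from transformers import MarianMTModel, MarianTokenizer

    checkpoint = "Helsinki-NLP/opus-mt-it-en"   # assumed Italian-to-English MarianMT model
    tokenizer = MarianTokenizer.from_pretrained(checkpoint)
    model = MarianMTModel.from_pretrained(checkpoint)

    def translate(sentences):
        batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
        return tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)

    # e.g., translate(["Non insultare od offendere nessuno."])
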
As in the case for English, we generated two sets of predictors, using either the uninflected or inflected forms of the target words, together with their corresponding English translations. More details can be found in Table 1.

Once more, we employed ridge regression, with large values of the parameter lambda (i.e., strong regularization), after standardizing all the variables.

2   Results and discussion

The results for English and Italian are shown in Figures 1 and 2, respectively, for various sets of predictors and regularization strengths. Results are averaged over 1,000 rounds of 5-fold cross-validation, using only the training dataset.
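
The evaluation loop can be approximated by the sketch below (repeated 5-fold cross-validation scored with Spearman's rho); it is illustrative rather than a verbatim excerpt of our pipeline.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.model_selection import KFold

    def repeated_cv_spearman(model, X, y, rounds=1000, folds=5, seed=0):
        rng = np.random.RandomState(seed)
        scores = []
        for _ in range(rounds):
            kf = KFold(n_splits=folds, shuffle=True, random_state=rng.randint(10**6))
            for train_idx, test_idx in kf.split(X):
                model.fit(X[train_idx], y[train_idx])
                rho, _ = spearmanr(model.predict(X[test_idx]), y[test_idx])
                scores.append(rho)
        return float(np.mean(scores))
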
For English, the results indicate that context-dependent models (Fig. 1c-d) outperform behavioural norms (Fig. 1a) and context-independent models (Fig. 1b). For the latter, even though we introduced contextual variables by averaging a given variable (e.g., concreteness) over the words that make up the context, it appears that this simple average does not properly capture contextual information and/or interactions between single-word and contextual information. The addition of the behavioural norms and/or context-independent models has a negligible effect on performance (Fig. 1e). In this respect, the excellent results for context-dependent models are likely due to several factors, such as the highly non-linear integration of contextual information, the use of attention mechanisms, and that of more sophisticated learning objectives (e.g., next sentence prediction).

Interestingly, predictors based on inflected targets consistently outperform those based on uninflected targets, especially for the context-dependent models. This shows that morphological information can be quite valuable. Also, even for the largest sets of predictors, consisting of more than 3,200 variables per 80 data points, the degree of regularization appears to matter very little, indicating surprisingly small levels of overfitting.

In the case of Italian, the findings are somewhat different from those for English. Performance is roughly 10% lower than that for English. This is expected, given that perfect translation from Italian to English is impossible, and that the majority of predictors depend on this translation. The gaps in performance between predictors for inflected vs uninflected targets (Fig. 2c-d), and between the various classes of predictors (Fig. 2a-e), are also smaller. Moreover, the performance of context-dependent models can be increased to a small degree by adding behavioural norms and/or context-independent models (Fig. 2f).

Our best models, as described in Figures 1 and 2, ranked first in both the English track (ρ = .83) and the Italian track (ρ = .75). The two correlations are smaller than those for the best models in the two figures, but this is likely to be an effect of distributional differences between the training set and the test set.

3   Conclusion

Our results suggest that a variety of approaches can be quite successfully employed in order to predict concreteness in context. The most effective predictors are those derived from context-dependent models (e.g., BERT), but relatively good results can also be obtained by using context-independent models (e.g., Skip-gram) and behavioural norms (e.g., ratings of semantic diversity).

Such an approach works very well for English, but less so for Italian, where the range of available predictors (i.e., pre-trained distributional models and large behavioural norms) is limited. One surprisingly effective solution to this problem is to simply translate the Italian stimuli into English, by relying on a neural machine translation system (e.g., MarianMT), and then make use of existing predictors for English. As an alternative to translating stimuli, it would be interesting to test whether comparable results can be obtained using multilingual versions of context-dependent models, such as BERT.

Acknowledgements

We would like to thank the anonymous reviewers for their comments and suggestions, as well as the organizers of the competition for their support.

Table 1. Type and number of predictors obtained from behavioural norms and distributional models. The same number of predictors is derived for both the inflected and uninflected versions of the target word. As predictors for the context-dependent models, we use the activations associated with the target when presented in context (i.e., we do not have separate predictors for the target, the context, and their potential interactions). More details regarding each set of predictors can be found in Subsections 1.2 and 1.3, as well as in Figures 1 and 2.


                                       Predictors for English

  Source of predictors                                        # preds.    # preds.    # preds.       # preds.
                                                              V(w)        V(c)        V(w) * V(c)    abs(V(w) - V(c))
  Behavioural norms (frequency, etc.)                         20          20          20             20
  Skip-gram (Google News – 100B)                              21          21          21             21
  GloVe (Common Crawl – 840B)                                 21          21          21             21
  ConceptNet NumberBatch
  (ConceptNet + Skip-gram + GloVe)                            21          21          21             21
  Concatenation of Skip-gram, GloVe,
  and ConceptNet NumberBatch                                  21          21          21             21
  ALBERT (last hidden layer)                                  768
  Bart (last hidden layer)                                    768
  BERT (last four hidden layers)                              768
  GPT-2 (last hidden layer)                                   768

                                       Predictors for Italian

  Source of predictors                                        # preds.    # preds.    # preds.       # preds.
                                                              V(w)        V(c)        V(w) * V(c)    abs(V(w) - V(c))
  Behavioural norms (frequency, etc.)                         20          20          20             20
  FastText (Common Crawl + Wikipedia)                         31          31          31             31
  ConceptNet NumberBatch
  (ConceptNet + Skip-gram + GloVe)                            31          31          31             31
  Concatenation of FastText
  and ConceptNet NumberBatch                                  31          31          31             31
  ALBERT (last hidden layer)                                  768
  AlBERTo (last hidden layer)                                 768
  Bart (last hidden layer)                                    768
  BERT (last hidden layer)                                    768
  GPT-2 (last hidden layer)                                   768
  RoBERTa (last hidden layer)                                 768
Figure 1: English: Spearman correlations between predicted and actual ratings, for various groups of predictors and regularization strengths (i.e., values of lambda). C-Dep. Mod.: the combination of the ALBERT, GPT-2, Bart, and BERT models; C-Indep. Mod.: the combination of the Skip-gram, GloVe, and ConceptNet NumberBatch models, their concatenation, and neighbourhood density measures; Beh. Norms: the predicted psycholinguistic ratings, together with frequency and contextual diversity counts. For the best four models, all predictors were derived from the inflected form of the target words. Our submission to the competition was based on C-Dep. Mod. + Beh. Norms (lambda = 500).
Figure 2: Italian: Spearman correlations between predicted and actual ratings, for various groups of predictors and regularization strengths (i.e., values of lambda). C-Dep. Mod.: the combination of the ALBERT, GPT-2, BERT, RoBERTa, Bart, and AlBERTo models; C-Indep. Mod.: the combination of the FastText and ConceptNet NumberBatch models, their concatenation, and neighbourhood density measures; Beh. Norms: the predicted psycholinguistic ratings, together with frequency and contextual diversity counts. For the best four models, all predictors were derived from the inflected form of the target words, except for the RoBERTa, FastText, and ConceptNet NumberBatch models (uninflected), and the behavioural norms (inflected and uninflected). Our submission to the competition was based on C-Dep. Mod. + C-Indep. Mod. + Beh. Norms (lambda = 500).
References

Basile, V., Croce, D., Di Maro, M., & Passaro, L. C. (2020). EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In V. Basile, D. Croce, M. Di Maro, & L. C. Passaro (Eds.), Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020). Online: CEUR.org.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904-911.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the NAACL-HLT (pp. 4171-4186). Stroudsburg, PA: ACL.

Frassinelli, D., Naumann, D., Utt, J., & im Walde, S. S. (2017). Contextual characteristics of concrete and abstract words. In C. Gardent & C. Retoré (Eds.), Proceedings of the IWCS (pp. 1-7). Stroudsburg, PA: ACL.

Gregori, L., Montefinese, M., Radicioni, D. P., Ravelli, A. A., & Varvara, R. (2020). CONcreTEXT @ Evalita2020: the Concreteness in Context Task. In V. Basile, D. Croce, M. Di Maro, & L. C. Passaro (Eds.), Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). Online: CEUR.org.

Hoffman, P., Lambon Ralph, M. A., & Rogers, T. T. (2013). Semantic diversity: A measure of semantic ambiguity based on variability in the contextual usage of words. Behavior Research Methods, 45(3), 718-730.

Johns, B. T., Jamieson, R. K., & Jones, M. N. (2020). The continued importance of theory: Lessons from big data approaches to language and cognition. In S. E. Woo, R. Proctor, & L. Tay (Eds.), Big data methods for psychological research: New horizons and challenges (pp. 277-295). Washington, DC: APA.

Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H., Heafield, K., Neckermann, T., Seide, F., Germann, U., Aji, A. F., Bogoychev, N., Martins, A., & Birch, A. (2018). Marian: Fast neural machine translation in C++. In F. Liu & T. Solorio (Eds.), Proceedings of the ACL - System Demonstrations (pp. 116-121). Stroudsburg, PA: ACL.

Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978-990.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the ICLR (pp. 1-17).

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2019). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the ACL (pp. 7871-7880). Stroudsburg, PA: ACL.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lynott, D., Connell, L., Brysbaert, M., Brand, J., & Carney, J. (2019). The Lancaster Sensorimotor Norms: Multidimensional measures of perceptual and action strength for 40,000 English words. Behavior Research Methods, 52, 1-21.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Y. Bengio & Y. LeCun (Eds.), Proceedings of the Workshop at the ICLR (pp. 1-12).

Mohammad, S. (2018). Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In I. Gurevych & Y. Miyao (Eds.), Proceedings of the ACL - Long Papers (pp. 174-184). Stroudsburg, PA: ACL.

Paetzold, G., & Specia, L. (2016). Inferring psycholinguistic properties of words. In K. Knight, A. Nenkova, & O. Rambow (Eds.), Proceedings of the NAACL-HLT (pp. 435-440). Stroudsburg, PA: ACL.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In A. Moschitti, B. Pang, & W. Daelemans (Eds.), Proceedings of the EMNLP (pp. 1532-1543). Stroudsburg, PA: ACL.

Polignano, M., Basile, P., de Gemmis, M., Semeraro, G., & Basile, V. (2019). AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In R. Bernardi, R. Navigli, & G. Semeraro (Eds.), Proceedings of CLiC-it. Aachen, Germany: CEUR.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog.

Speer, R., Chin, J., & Havasi, C. (2017). ConceptNet 5.5: An open multilingual graph of general knowledge. In S. P. Singh & S. Markovitch (Eds.), Proceedings of the AAAI (pp. 4444-4451). Palo Alto, CA: AAAI Press.

Van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In T. Linzen, G. Chrupała, & A. Alishahi (Eds.), Proceedings of the EMNLP Workshop BlackboxNLP (pp. 353-355). Stroudsburg, PA: ACL.