=Paper= {{Paper |id=Vol-2765/160 |storemode=property |title=CONcreTEXT @ EVALITA2020: The Concreteness in Context Task |pdfUrl=https://ceur-ws.org/Vol-2765/paper160.pdf |volume=Vol-2765 |authors=Lorenzo Gregori,Maria Montefinese,Daniele P. Radicioni,Andrea Amelio Ravelli,Rossella Varvara |dblpUrl=https://dblp.org/rec/conf/evalita/GregoriMRRV20 }} ==CONcreTEXT @ EVALITA2020: The Concreteness in Context Task== https://ceur-ws.org/Vol-2765/paper160.pdf
CONcreTEXT @ EVALITA2020:
The Concreteness in Context Task

Lorenzo Gregori, University of Florence, lorenzo.gregori@unifi.it
Maria Montefinese, University of Padua, maria.montefinese@unipd.it
Daniele P. Radicioni, University of Turin, daniele.radicioni@unito.it
Andrea Amelio Ravelli, Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR) - ItaliaNLP Lab, andreaamelio.ravelli@ilc.cnr.it
Rossella Varvara, University of Florence, rossella.varvara@unifi.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The focus of the CONcreTEXT task is conceptual concreteness: systems were solicited to compute a value expressing to what extent target concepts are concrete (i.e., more or less perceptually salient) within a given context of occurrence. To these ends, we have developed a new dataset which was annotated with concreteness ratings and used as gold standard in the evaluation of systems. Four teams participated in this first edition of the task, with a total of 15 runs submitted. Interestingly, these works extend information on conceptual concreteness available in existing (non-contextual) norms derived from human judgments with new knowledge from recently developed neural architectures, in much the same multidisciplinary spirit whereby the CONcreTEXT task was organized.

1    Introduction

Concept concreteness, that is, how directly a concept is related to sensorial experience (Brysbaert et al., 2014a), is a fundamental dimension of conceptual semantic representation that has attracted more and more interest and attention in psycholinguistics in the last decade. This dimension is usually assessed by participants' ratings on a Likert scale: concrete concepts lie on one side of the scale and refer to something that exists in reality and can be experienced immediately through the senses; abstract concepts lie on the opposite side of the scale and are grounded in the internal sensory experience and linguistic information. While concrete concepts have direct sensory referents (Crutch and Warrington, 2005) and greater availability of contextual information (Connell et al., 2018; Kousta et al., 2011; Montefinese et al., 2020), abstract concepts tend to be more emotionally valenced (Kousta et al., 2011) and less imageable (Montefinese et al., 2020; Garbarini et al., 2020).

The CONcreTEXT task challenges participants to build NLP systems to automatically assign a concreteness value to words in context. It is aimed at investigating how the concreteness information affects sense selection: differently from past research (Brysbaert et al., 2014b; Montefinese et al., 2014), we are interested in assessing the concreteness of concepts within the context of real sentences rather than in isolation. Additionally, the concreteness score is assumed to be a property of meanings rather than a property of word forms; thus, scoring the concreteness of a concept in context implicitly requires identifying its underlying sense, by handling lexical phenomena such as polysemy and homonymy.

Ordinary experience suggests that concepts’ concrete/abstract status can affect their semantic representation, and lexical access and processing: concrete meanings are acknowledged to be more quickly and easily delivered in human communication than abstract meanings (Bambini et al., 2014). Historically, it has been observed that concrete concepts are responded to more quickly than abstract concepts in lexical decision tasks (Bleasdale, 1987; Kroll and Merves, 1986), although more recent experiments have shown that abstract concepts might have an advantage when other variables have been accounted for (Kousta et al., 2011). Concrete concepts are also easier to encode and retrieve than abstract concepts (Romani et al., 2008; Miller and Roodenrys, 2009), are easier to make associations with (de Groot, 1989), and are more thoroughly described in definition tasks (Sadoski et al., 1997). Moreover, it generally takes less time to comprehend a concrete sentence than an abstract one (Haberlandt and Graesser, 1985; Schwanenflugel and Shoben, 1983). Thus, it has been proposed that different organizational principles govern semantic representations of concrete and abstract concepts: concrete concepts are predominantly organized by featural similarity measures, and abstract concepts by associative relations, co-occurrence patterns and syntactic information (Vigliocco et al., 2009).

All the surveyed features make the distinction between concreteness and abstractness a stimulating and challenging field for computational linguistics as well. Among the earliest attempts at grasping concreteness, we find works that investigated concreteness/abstractness information in its interplay with metaphor identification and figurative language more in general (Turney et al., 2011; and, more recently, Mensa et al., 2018b). Although concreteness information is acknowledged to be central to, e.g., word-sense induction and compositionality modeling (Hill et al., 2013), the contribution of concreteness/abstractness to semantic representations is not fully grasped and exploited in existing approaches and resources, with the notable exception of works aimed i) at learning multimodal embeddings, and how abstract and concrete representations can be acquired by multi-modal models (Hill and Korhonen, 2014); and ii) at exploring to what extent concreteness information is represented in the distributional patterns in corpora (Hill et al., 2013). Moreover, some approaches exist that attempted to create lexical resources by also employing common-sense information (Mensa et al., 2018a; Colla et al., 2018).

Characterizing tokens within sentences with their concreteness requires integrating both word-specific and contextual information. In our view, the CONcreTEXT task entails dealing with a relaxed form of word sense disambiguation; such aspects were faced by our participants by devising methods relying on both traditional knowledge-based approaches, and more recent language models and sequence-to-sequence models. Finally, as in many real-world cases, the provided trial data is rather scarce, in the order of a hundred sentences for the Italian language, and as many for English. This aspect forced our participants to face something similar to a ‘cold start’ problem. We hope that this edition of the CONcreTEXT task will be the first appointment in a series for those who are interested in the issues posed by contextual conceptual concreteness to research on natural language semantics.

2   Task Definition

The task CONcreTEXT (so dubbed after CONcreteness in conTEXT) focuses on automatic concreteness (and conversely, abstractness) recognition. Given a sentence along with a target word, we asked participants to propose a system able to assess the concreteness of a concept expressed by a given word within a sentence, on a 7-point Likert-like scale where 1 stands for completely abstract (e.g., ‘freedom’) and 7 for completely concrete (e.g., ‘car’). For example, in the sentence “In summer, wheat fields are coloured in yellow” the noun field refers to an entity that can be smelled, touched, and pointed to. In this case, on a scale ranging from 1 to 7 its concreteness may be evaluated as 7, because it refers to an extremely concrete concept. In contrast, the same noun field in the sentence “Physics is Alice’s research field” refers to a scientific subject, i.e., something that cannot be perceived through the five senses, but that can be explained through a linguistic description. In this sentence, the noun field may be evaluated as 1 because it refers to an extremely abstract concept. Moreover, the task targets can be halfway between completely abstract and completely concrete, as in the case of “Magnetic field attracts iron”, where the noun field refers to something more abstract compared to “wheat fields” but more concrete compared to “research field”. As anticipated, the concreteness score being assigned to the word should be evaluated in context: the word should not be considered in isolation, but as part of a given sentence.

Participants were invited to exploit all possible strategies to solve the task, including (but not limited to) knowledge bases, external training data, word embeddings, etc.
Table 1: Basic statistics on the CONcreTEXT dataset used as gold standard.

                                   Italian   English
    Unique Verb targets                52         44
    Unique Noun targets                96         73
    Num. Sentences                    550        534
    Num. Sentences Verb target        189        210
    Num. Sentences Noun target        361        324
    Avg. sent. length               14.43      14.33
    Avg. sent. length (no punct)    13.03      12.87
    Avg. full words per sent.        7.14       7.15
    Num. Annotators                   333        310
    Human ratings (HR)             18,726     16,522
    Min HR per sentence                30         30

Figure 1: Distribution of human ratings for the English and Italian datasets. [Two histograms, x-axis: ratings from 1 to 7, y-axis: 0.00 to 0.30. (a) English dataset (GOLD-EN). (b) Italian dataset (GOLD-IT).]


3    Dataset

The dataset used for this task has been taken from the English-Italian parallel section of The Human Instruction Dataset (Chocron and Pareti, 2018), derived from WikiHow instructions.[1] All such documents had been anonymized beforehand, so that the downloaded data present neither privacy nor data sensitivity issues.

The dataset is composed of overall 1,096 sentences, arranged as follows: 562 Italian sentences plus 534 English sentences. Each sentence contains a target term (either verb or noun) with its associated concreteness score (1–7 scale). This score is derived from the average of at least 30 human judgments from native Italian and English speakers about the concreteness of a target word in a given sentence (see Table 1 for the dataset numbers).

The reliability of the collected data within each language (Italian, English) for the trial and test phases was evaluated separately by applying split-half correlations corrected with the Spearman-Brown formula after randomly dividing the participants into two subgroups of equal size. All the reliability indexes were calculated on 10,000 different randomizations of the participants. The mean correlations between the two groups are very high for both the trial and test phases, ranging from a minimum of r = 0.87 for English (at the test phase) to a maximum of r = 0.98 for Italian (at the trial phase), showing that the resulting ratings are highly reliable and can be used across the entire Italian- and English-speaking populations.

The dataset has been split into trial and test data, with a 20–80 ratio. Trial data has been released with the concreteness scores, while the test data has been provided at the beginning of the evaluation window without any score.[2]

[1] The whole Human Instruction Dataset is freely available on Kaggle, https://www.kaggle.com/paolop/human-instructions-multilingual-wikihow
[2] The dataset employed in the CONcreTEXT task is available at the URL https://lablita.github.io/CONcreTEXT/.

4       Evaluation Measures and Baselines

We chose the Spearman correlation indices as our main evaluation measure; for the sake of completeness, we also report Pearson indices (substantially in accord with the Spearman ones). We chose the former measure because the collected ratings are not normally distributed, which makes the Spearman correlation more suited to the data. In fact, by running the Shapiro–Wilk test we obtained a p-value < 0.001. The non-normal distribution of the data is also confirmed by the plot of the gold standard ratings, as illustrated in Figure 1.

Two baselines have been designed for this task.

Baseline One. The first baseline for the Italian language is derived as follows. The fastText word embeddings have been acquired beforehand by training the model on the Italian dump of the WikiHow instructions. We chose fastText for its support for handling OOV terms (Bojanowski et al., 2017), which is a crucial feature in the present setting. The cited norms by Montefinese et al. (2014) (referred to as ‘the norms’ hereafter) have been used herein. The average score of terms in each input sentence $S = \{t_1, t_2, \ldots, t_K\}$ has been computed by scrolling through the content words of the sentence. Each term $t$ is searched in the norms: if the term is found, the associated concreteness score $c(t)$ is returned; otherwise, if the term is not present in the norms, the ranking of the $l$ ($l$ = 20,000) elements most similar to $t$ is generated through fastText. In this case, we scan the whole norms list and employ the concreteness score of the element in the norms closest to those in the fastText ranking. In either case we obtain a score for each and every term in the input sentence, so that the concreteness score of the target token $\hat{t}$ is computed as the averaged score of the terms in the input sentence:

$$c(\hat{t}) = \frac{1}{K} \sum_{i=1}^{K} c(t_i).$$

The first baseline for the English language is analogous to the Italian one, except for the fact that the English tokens from the norms are accessed in this case. The same strategy governs the handling of the fastText resource, which in this case has been trained on the English dump of the Human Instruction Dataset.

Baseline Two. The second baseline for the Italian language implements a simple lookup function. More specifically, input sentences have been translated into English through the Google Translate ajax API implementation, and then the concreteness scores associated with the terms in the norms by Brysbaert et al. (2014b) are retrieved (in the unlikely case the term is not found, it is dropped, thus not contributing to the final score). The concreteness score of the target term is thus set to the average concreteness of the terms in the given input sentence. Baseline Two for the English language likewise uses the norms by Brysbaert et al. (2014b) to retrieve the concreteness scores of all terms in the input sentence, and assigns to the target token the average concreteness score of the whole sentence.

5   Systems Descriptions

In this Section we briefly describe the systems that participated in the competition. As a first edition, the CONcreTEXT task received good feedback from the community, with 4 teams, overall 7 participants and 15 submitted system runs. In the next Section we report the results obtained by all such systems, while anonymizing a withdrawn participant.

5.1   ANDI

The ANDI team (Rotaru, 2020) proposed a system based on multiple classes of concreteness score predictors. The first class of predictors has been derived from large datasets of behavioral norms, collected for a wide variety of psycholinguistic factors. Besides well-known concreteness norms, ANDI also takes into account semantic diversity, age of acquisition, emotional and sensori-motor dimensions, as well as frequency and contextual diversity counts. The vocabulary resulting from the merging of these word collections comprises more than 70K words, and it is the base vocabulary used to extract all the predictors. The second class of predictors has been derived from context-independent distributional models, namely Skip-gram, GloVe, and NumberBatch embeddings, as well as from the concatenation of the three. The third class of predictors has been derived from features obtained through recent transformer models, i.e. context-dependent representations. The models exploited are: BERT, GPT-2, Bart, and ALBERT. The final rating has been computed through a ridge regression over the three classes.

5.2   CAPISCO

The CAPISCO team (Bondielli et al., 2020) submitted 3 systems for both Italian and English.

NON-CAPISCO. The first system computes a variation of Baseline Two; that is, the target concreteness is obtained by combining the concreteness value of the target term (taken in isolation) and the average concreteness of the whole sentence. The improvement over the baseline comes from weighting the concreteness of the target term and that of the context differently.

CAPISCO-CENTROIDS. This system is based on the assumption that close semantic spaces are characterized by similar concreteness scores. In this case the authors first build two centroids, one for concrete and one for abstract concepts, based on the norms by Brysbaert et al. (2014b) and Della Rosa et al. (2010), by employing fastText pre-trained embeddings. The concreteness score of a term is then computed by averaging the distance of the first 50 lexical substitutes of the target (identified through BERT) from the two polarized centroids. Introducing a list of target substitutes in a given context is thus the gist of this approach.
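The centroid idea behind CAPISCO-CENTROIDS can be sketched as follows. This is only a minimal illustration under stated assumptions, not the team's implementation: the toy 2-dimensional vectors stand in for fastText embeddings, the hard-coded substitute vectors stand in for the lexical substitutes that would be proposed by BERT, and the way the two distances are combined into one value (distance from the abstract centroid minus distance from the concrete centroid) is an assumption, since the description does not specify it. All function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid_score(substitute_vectors, concrete_centroid, abstract_centroid):
    """Average over the substitutes of
    (distance from abstract centroid) - (distance from concrete centroid).
    Assumed combination rule: positive values mean 'closer to concrete'."""
    diffs = [
        cosine_distance(v, abstract_centroid) - cosine_distance(v, concrete_centroid)
        for v in substitute_vectors
    ]
    return sum(diffs) / len(diffs)


# Toy stand-ins for embeddings of words taken from concreteness norms.
concrete_seed_vectors = [[1.0, 0.1], [0.9, 0.2]]
abstract_seed_vectors = [[0.1, 1.0], [0.2, 0.9]]
concrete_c = centroid(concrete_seed_vectors)
abstract_c = centroid(abstract_seed_vectors)

# Toy stand-ins for the embeddings of the substitutes of "field" in two contexts.
wheat_field_substitutes = [[0.8, 0.3], [1.0, 0.0]]
research_field_substitutes = [[0.2, 0.8], [0.0, 1.0]]

score_wheat = centroid_score(wheat_field_substitutes, concrete_c, abstract_c)
score_research = centroid_score(research_field_substitutes, concrete_c, abstract_c)
print(score_wheat > score_research)
```

With these toy vectors the substitutes of the “wheat fields” reading fall closer to the concrete centroid than those of the “research field” reading, so the first score is higher. Mapping such a raw value onto the 1–7 scale would require a further calibration step that is not shown here.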
CAPISCO-TRANSFORMERS. In this variant, the CAPISCO team fine-tuned a pre-trained BERT model on the concreteness rating task, by complementing the CONcreTEXT training data with newly generated training data. The new data generation is twofold: for each original sentence, new sentences are generated by replacing the target term with the first lexical substitutes derived with the BERT target masking approach. Then, more sentences are borrowed from Italian and English reference corpora.

5.3    KONKRETIKA

The KONKRETIKA team (Badryzlova, 2020) presented a system that first assigns a concreteness and an abstractness score to the target lemma, and then adjusts these values based on the surrounding context. In the first step, the system computes semantic similarity between the target vectors and a “seed list” consisting of abstract and concrete words (extracted from the MRC Psycholinguistic Database). In the second step, the values were adjusted to the sentential context considering the mean concreteness index of the entire sentence. The team submitted 4 runs based on a heuristically selected coefficient.

Table 2: Results for each run on English test set.

    System run          Spearman   Pearson   Eucl. dist.
    ANDI                   0.833     0.834      15.409
    NON-CAPISCO            0.785     0.787      35.663
    KONKRETIKA 3           0.663     0.668      28.613
    KONKRETIKA 1           0.651     0.667      29.933
    Baseline 2             0.554     0.567      38.451
    KONKRETIKA 4           0.542     0.545      29.836
    CAPISCO CENTR          0.542     0.538      48.864
    KONKRETIKA 2           0.541     0.545      30.322
    CAPISCO TRANS          0.504     0.501      29.927
    Baseline 1             0.382     0.377      31.738
    withdrawn run3        -0.013     0.067      41.109
    withdrawn run1        -0.124    -0.123      44.068
    withdrawn run2        -0.127    -0.129      43.890

Table 3: Results for each run on Italian test set.

    System run          Spearman   Pearson   Eucl. dist.
    ANDI                   0.749     0.749      19.950
    CAPISCO TRANS          0.625     0.617      24.367
    CAPISCO CENTR          0.615     0.609      28.608
    NON-CAPISCO            0.557     0.557      31.588
    Baseline 2             0.534     0.522      40.114
    Baseline 1             0.346     0.368      31.046
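The three figures reported for each run in Tables 2 and 3 (Spearman correlation, Pearson correlation, and Euclidean distance from the gold standard) can be computed along the lines of the following sketch. It is a plain-Python illustration and not the organizers' evaluation script: the function names and the toy `gold` and `predicted` vectors are assumptions made for the example, and ties are handled with average ranks, which is the usual convention for Spearman's coefficient.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def euclidean_distance(x, y):
    """Euclidean distance between the two score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy data (not taken from the task): gold ratings and system predictions.
gold = [6.5, 1.8, 4.2, 3.0, 5.1]
predicted = [6.0, 2.5, 4.0, 3.5, 5.9]

print(spearman(gold, predicted))
print(pearson(gold, predicted))
print(euclidean_distance(gold, predicted))
```

In this toy case the predictions preserve the ordering of the gold ratings exactly, so the Spearman correlation is 1 even though the predicted values differ from the gold ones; the Euclidean distance is the measure that captures that residual deviation.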
                                                       substantially confirm the results: for the results on
6     Results                                          English (Table 2) it is minimal for the output of
                                                       the A NDI system, and it increases while Spearman
Four teams participated in the CON CRE TEXT            correlation values decrease. The same trend is also
competition: A NDI, C APISCO, KON K RETI K A,          confirmed on Italian results (Table 3).
and a withdrawn team. A NDI and C APISCO de-
                                                          Tables 6 and 7 report disaggregated Spearman
veloped a system for both languages (English and
                                                       correlations for verbs and nouns. This allows
Italian), while KON K RETI K A participated in the
                                                       to highlight if and to what extent the participat-
English track only, and the same did the with-
                                                       ing systems obtained better results on either POS.
drawn participant. Each team was allowed to sub-
                                                       A NDI obtained the best results on both verbs and
mit the output of up to 4 system runs; the final
                                                       nouns in both languages. This system (and N ON -
ranking has been compiled based on the results of
                                                       C APISCO as well) obtained analogous results on
the best run.
                                                       verbs and nouns. On the whole, the rest of the
   In Tables 2 and 3 we present the score of each      systems obtained results clearly better on English
run for the English and Italian language, respec-      verbs and slightly better on Italian nouns. In par-
tively. Although, as mentioned, the Spearman in-       ticular, KON K RETI K A (English only) is strongly
dices were adopted as our main evaluation metrics,     biased on verbs: its performances on verbs are
we also report Pearson correlation indices and Eu-     higher in all 4 runs. C APISCO systems exhibit the
clidean distance, that may be useful to complete       most varied behavior.
the assessment of the results. The final ranking is
provided in Tables 4 and 5.                            7     Discussion
   We can observe a substantial agreement be-
tween Spearman and Pearson indices: the aver-          The obtained results confirm transformers as a
aged delta between such figures amounts to 0.012       good device to compute concreteness score for
and to 0.008 on the English and Italian dataset, re-   words in context. The virtues of transform-
spectively. Also the Euclidean distance seems to       ers in grasping contextual information are largely
    Table 4: Final ranking on English test set.          ness score of a word in context is a complex task,
   Team               Spear     Pears    Eucl.D          involving different semantic, cognitive and expe-
   A NDI               0.833    0.834    15.409          riential levels.
   CAPISCO             0.785    0.787    35.663             The high correlation obtained by the N ON -
   KON K RETI K A      0.663    0.668    28.613          C APISCO in the English task is somehow surpris-
   withdrawn          -0.013    0.067    41.109          ing, since this system makes use only of the mean
                                                         concreteness of the sentence (computed from ex-
     Table 5: Final ranking on Italian test set.
                                                         isting norms) as contextual information. This re-
      Team          Spear      Pears    Eucl.D           sult is thus related to the availability of existing
      A NDI         0.749      0.749    19.950           norms, but it shows that there is a link between
      CAPISCO       0.625      0.617    24.367           the concreteness score of a target word in context
                                                         and the concreteness scores of the words it oc-
Table 6: Spearman rank differences between               curs with. Further analysis are needed, but it sug-
nouns and verbs on English test set.                     gests that concrete interpretations of a target word
                       Spear.N     Spear.V        Diff   are associated with concrete context words. Of
 C APISCO T RANS        0.443       0.654        0.211   course, systems based exclusively on behavioral
 KONKRETIKA 4           0.502       0.701        0.199   norms are strongly dependent on the coverage of
 KONKRETIKA 2           0.502       0.683        0.181   the considered vocabulary. In fact, the N ON -
 C APISCO C ENTR        0.478       0.659        0.181   C APISCO Italian performances (obtained exploit-
 KONKRETIKA 3           0.629       0.762        0.133   ing a ∼ 1.2K vocabulary) are lower than all the
                                                         other systems, while on the English track it ranks
 KONKRETIKA 1           0.611       0.741        0.13
                                                         second (using a ∼ 70K vocabulary).
 A NDI                  0.836       0.857        0.021
 N ON -C APISCO         0.779       0.782        0.003
Table 7: Spearman rank differences between               8   Conclusions
nouns and verbs on Italian test set.
                  Spear.N    Spear.V    Diff
 NON-CAPISCO       0.579      0.507     0.072
 CAPISCO TRANS     0.607      0.667     0.060
 CAPISCO CENTR     0.625      0.591     0.034
 ANDI              0.762      0.749     0.013

known, but in the present setting we observe that their output can be
further improved by integrating behavioral information (this seems to be
one major difference between the systems ANDI and CAPISCO-TRANSFORMERS).
   The most important outcome of this challenge is the strong performance
of the ANDI system, which proves to be robust and reliable for the
considered task: the system obtains the best ranking in both languages, a
low deviation from the gold standard, and substantial stability in
processing both verbs and nouns. Moreover, the proposed system is ready to
be applied in a multi-language environment, given that non-English
sentences are automatically translated into English. The ANDI system
exploits different kinds of available resources and works with local and
contextual information. This shows that deriving the concrete-

We presented the results of the CONcreTEXT task at EVALITA 2020 (Basile et
al., 2020). The task challenges participants to build NLP systems that
automatically assign a concreteness score to words in context, evaluating
to what extent target concepts are concrete (i.e., more or less
perceptually salient) within a given context of occurrence. A novel
dataset was developed for this task as a multilingual comparable corpus
composed of 550 Italian sentences and 534 English sentences, annotated
with the concreteness/abstractness rating of target nouns and verbs. Three
teams completed their participation in the task, obtaining the following
ranking: ANDI (Rotaru, 2020), CAPISCO (Bondielli et al., 2020), and
KONKRETIKA (Badryzlova, 2020).
   Future work will address the following steps. First, we will improve
our dataset by including further languages, also from different language
families and under-resourced languages. The set of considered targets
should also be expanded, to ensure broader coverage of the dataset and
more significant results (thanks to the larger experimental base) for its
future users.
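The noun/verb comparison reported in the table above (Spear.N, Spear.V, Diff) can be illustrated with a minimal sketch. It assumes that Spear.N and Spear.V are Spearman correlations between gold and predicted scores computed separately on noun and verb targets, and that Diff is their absolute difference, which is consistent with the reported values. The function names and the toy data are hypothetical and are not taken from the task's evaluation scripts.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def pos_gap(items):
    """items: (pos, gold, predicted) triples, with pos in {'N', 'V'}."""
    scores = {}
    for pos in ("N", "V"):
        gold = [g for p, g, _ in items if p == pos]
        pred = [s for p, _, s in items if p == pos]
        scores[pos] = spearman(gold, pred)
    return scores["N"], scores["V"], abs(scores["N"] - scores["V"])


# Hypothetical toy ratings, invented for illustration only.
toy = [
    ("N", 6.5, 6.1), ("N", 2.0, 2.8), ("N", 4.3, 3.9), ("N", 5.1, 5.5),
    ("V", 3.2, 4.0), ("V", 5.8, 5.0), ("V", 1.9, 4.5), ("V", 4.4, 4.2),
]
spear_n, spear_v, diff = pos_gap(toy)
print(f"Spear.N={spear_n:.3f}  Spear.V={spear_v:.3f}  Diff={diff:.3f}")
```

Under these assumptions, a small Diff indicates that a system ranks noun and verb targets about equally well, which is the sense in which the discussion describes stability across the two parts of speech.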
References

Yulia Badryzlova. 2020. KONKRETIKA @ CONcreTEXT: Computing
  concreteness indexes with sigmoid transformation and adjustment
  for context. In Valerio Basile, Danilo Croce, Maria Di Maro, and
  Lucia C. Passaro, editors, Proceedings of the 7th evaluation
  campaign of Natural Language Processing and Speech tools for
  Italian (EVALITA 2020), Online. CEUR.org.
Valentina Bambini, Donatella Resta, and Mirko Grimaldi. 2014. A
  dataset of metaphors from the Italian literature: Exploring
  psycholinguistic variables and the role of context. PloS one,
  9(9):e105634.
Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro.
  2020. Evalita 2020: Overview of the 7th evaluation campaign of
  natural language processing and speech tools for Italian. In
  Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C.
  Passaro, editors, Proceedings of Seventh Evaluation Campaign of
  Natural Language Processing and Speech Tools for Italian. Final
  Workshop (EVALITA 2020), Online. CEUR.org.
Fraser A Bleasdale. 1987. Concreteness-dependent associative
  priming: Separate lexical organization for concrete and abstract
  words. Journal of Experimental Psychology: Learning, Memory, and
  Cognition, 13(4):582.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov.
  2017. Enriching word vectors with subword information.
Alessandro Bondielli, Gianluca E. Lebani, Lucia C. Passaro, and
  Alessandro Lenci. 2020. CAPISCO @ CONcreTEXT: (Un)supervised
  Systems to Contextualize Concreteness with Norming Data. In
  Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C.
  Passaro, editors, Proceedings of the 7th evaluation campaign of
  Natural Language Processing and Speech tools for Italian
  (EVALITA 2020), Online. CEUR.org.
Marc Brysbaert, Michaël Stevens, Simon De Deyne, Wouter
  Voorspoels, and Gert Storms. 2014a. Norms of age of acquisition
  and concreteness for 30,000 Dutch words. Acta psychologica,
  150:80–84.
Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014b.
  Concreteness ratings for 40 thousand generally known English
  word lemmas. Behavior research methods, 46(3):904–911.
Paula Chocron and Paolo Pareti. 2018. Vocabulary alignment for
  collaborative agents: a study with real-world multilingual
  how-to instructions. In IJCAI, pages 159–165.
D. Colla, E. Mensa, A. Porporato, and D.P. Radicioni. 2018.
  Conceptual Abstractness: From Nouns to Verbs. In Proceedings of
  the Fifth Italian Conference on Computational Linguistics
  (CLiC-it 2018), volume 2253. CEUR.
Louise Connell, Dermot Lynott, and Briony Banks. 2018.
  Interoception: the forgotten modality in perceptual grounding of
  abstract and concrete concepts. Philosophical Transactions of
  the Royal Society B: Biological Sciences, 373(1752):20170143.
Sebastian J Crutch and Elizabeth K Warrington. 2005. Abstract and
  concrete concepts have structurally different representational
  frameworks. Brain, 128(3):615–627.
Annette M de Groot. 1989. Representational aspects of word
  imageability and word frequency as assessed through word
  association. Journal of Experimental Psychology: Learning,
  Memory, and Cognition, 15(5):824.
Pasquale A Della Rosa, Eleonora Catricalà, Gabriella Vigliocco,
  and Stefano F Cappa. 2010. Beyond the abstract—concrete
  dichotomy: Mode of acquisition, concreteness, imageability,
  familiarity, age of acquisition, context availability, and
  abstractness norms for a set of 417 Italian words. Behavior
  research methods, 42(4):1042–1048.
Francesca Garbarini, Fabrizio Calzavarini, Matteo Diano, Monica
  Biggio, Carola Barbero, Daniele P Radicioni, Giuliano Geminiani,
  Katiuscia Sacco, and Diego Marconi. 2020. Imageability effect on
  the functional brain activity during a naming to definition
  task. Neuropsychologia, 137:107275.
Karl F Haberlandt and Arthur C Graesser. 1985. Component processes
  in text comprehension and some of their interactions. Journal of
  Experimental Psychology: General, 114(3):357.
Felix Hill and Anna Korhonen. 2014. Learning abstract concept
  embeddings from multi-modal data: Since you probably can’t see
  what I mean. In Proceedings of the 2014 Conference on Empirical
  Methods in Natural Language Processing (EMNLP), pages 255–265.
Felix Hill, Douwe Kiela, and Anna Korhonen. 2013. Concreteness and
  corpora: A theoretical and practical study. In Proceedings of
  the Fourth Annual Workshop on Cognitive Modeling and
  Computational Linguistics (CMCL), pages 75–83.
Stavroula-Thaleia Kousta, Gabriella Vigliocco, David P Vinson,
  Mark Andrews, and Elena Del Campo. 2011. The representation of
  abstract words: why emotion matters. Journal of Experimental
  Psychology: General, 140(1):14.
Judith F Kroll and Jill S Merves. 1986. Lexical access for
  concrete and abstract words. Journal of Experimental Psychology:
  Learning, Memory, and Cognition, 12(1):92.
Enrico Mensa, Aureliano Porporato, and Daniele P. Radicioni.
  2018a. Annotating concept abstractness by common-sense
  knowledge. In Chiara Ghidini, Bernardo Magnini, Andrea
  Passerini, and Paolo Traverso, editors, AI*IA 2018 – Advances in
  Artificial Intelligence, pages 415–428, Cham. Springer
  International Publishing.
Enrico Mensa, Aureliano Porporato, and Daniele P.
  Radicioni. 2018b. Grasping metaphors: Lexical
  semantics in metaphor analysis. In Aldo Gangemi,
  Anna Lisa Gentile, Andrea Giovanni Nuzzolese, Se-
  bastian Rudolph, Maria Maleshkova, Heiko Paul-
  heim, Jeff Z Pan, and Mehwish Alam, editors, The
  Semantic Web: ESWC 2018 Satellite Events, pages
  192–195, Cham. Springer International Publishing.
Leonie M Miller and Steven Roodenrys. 2009. The
  interaction of word frequency and concreteness in
  immediate serial recall. Memory & Cognition,
  37(6):850–865.
Maria Montefinese, Ettore Ambrosini, Beth Fairfield,
  and Nicola Mammarella. 2014. The adaptation of
  the Affective Norms for English Words (ANEW) for
  Italian. Behavior research methods, 46(3):887–903.
Maria Montefinese, Ettore Ambrosini, Antonino
 Visalli, and David Vinson. 2020. Catching the in-
 tangible: a role for emotion? Behavioral and Brain
 Sciences, 43.
Cristina Romani, Sheila Mcalpine, and Randi C Mar-
  tin.    2008.    Concreteness effects in different
  tasks: Implications for models of short-term mem-
  ory. Quarterly Journal of Experimental Psychology,
  61(2):292–323.
Armand Rotaru. 2020. ANDI @ CONcreTEXT:
  Predicting concreteness in context for English and
  Italian using distributional models and behavioural
  norms. In Valerio Basile, Danilo Croce, Maria
  Di Maro, and Lucia C. Passaro, editors, Proceedings
  of the 7th evaluation campaign of Natural Language
  Processing and Speech tools for Italian (EVALITA
  2020), Online. CEUR.org.
Mark Sadoski, William A Kealy, Ernest T Goetz, and
 Allan Paivio. 1997. Concreteness and imagery ef-
 fects in the written composition of definitions. Jour-
 nal of Educational Psychology, 89(3):518.
Paula J Schwanenflugel and Edward J Shoben. 1983.
  Differential context effects in the comprehension of
  abstract and concrete verbal materials. Journal of
  Experimental Psychology: Learning, Memory, and
  Cognition, 9(1):82.
Peter Turney, Yair Neuman, Dan Assaf, and Yohai Co-
  hen. 2011. Literal and metaphorical sense identifi-
  cation through concrete and abstract context. In Pro-
  ceedings of the 2011 Conference on Empirical Meth-
  ods in Natural Language Processing, pages 680–
  690.
Gabriella Vigliocco, Lotte Meteyard, Mark Andrews,
  and Stavroula Kousta. 2009. Toward a theory of
  semantic representation. Language and Cognition,
  1(2):219–247.