<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MACID - Multimodal ACtion IDentification: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rossella Varvara</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Gregori</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ABSTRACTION Research Group - University of Bologna</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Independent Researcher</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Florence</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the Multimodal ACtion IDentification challenge (MACID), part of the first CALAMITA competition. The objective of this task is to evaluate the ability of Large Language Models (LLMs) to differentiate between closely related action concepts based on textual descriptions alone. The challenge is inspired by the "find the intruder" task, where models must identify an outlier among a set of 4 sentences that describe similar yet distinct actions. The dataset is composed of “pushing” events, and it highlights action-predicate mismatches, where the same verb may describe different actions or different verbs may refer to the same action. Although currently mono-modal (text-only), the task is designed for future multimodal integration, linking visual and textual representations to enhance action recognition. By probing a model's capacity to resolve subtle linguistic ambiguities, the challenge underscores the need for deeper cognitive understanding in action-language alignment, ultimately testing the boundaries of LLMs' ability to interpret action verbs and their associated concepts.</p>
      </abstract>
      <kwd-group>
        <kwd>human action recognition</kwd>
        <kwd>action types</kwd>
        <kwd>find the intruder</kwd>
        <kwd>LLM</kwd>
        <kwd>CALAMITA</kwd>
        <kwd>CLiC-it</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. * Corresponding author. † These authors contributed equally. Email: andreaamelio.ravelli@unibo.it (A. A. Ravelli); rossella.varvara01@gmail.com (R. Varvara); lorenzo.gregori@unifi.it (L. Gregori). Web: https://www.unibo.it/sitoweb/andreaamelio.ravelli (A. A. Ravelli); https://scholar.google.com/citations?user=qAIgPcMAAAAJ (R. Varvara); https://cercachi.unifi.it/p-doc2-2022-0-A-2c303c2b3930-0.html (L. Gregori). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Human language and vision systems are deeply linked together, and the two may have a common evolutionary basis. According to the Mirror System Hypothesis [1], the mechanism that supports language in the human brain may have evolved atop the mirror neuron system for grasping, taking advantage of its ability to recognize a set of actions, and adapting it to deal with linguistic acts (i.e., utterances) and to discriminate linguistic objects (i.e., audio patterns for words). Thus, according to this hypothesis, humans “invented” language by adapting the pattern recognition system, initially developed within the vision system to recognize actions, to identify and imitate audio patterns, and to link them to real-world entities (i.e., objects and events) and their mental representation. In other words, language is a form of action, and it is probably from action capabilities that language emerged during human evolution. In this view, understanding and discriminating actions are of paramount importance for the broader scope of language understanding.</p>
      <p>Natural Language Processing is experiencing an unprecedented revolution due to the development of models capable of understanding and generating language; these models show human-like performance in solving many tasks (and above-human performance on some). Moreover, the recent development of multimodal LLMs has enabled deep reasoning tasks involving the simultaneous processing of both textual and visual data.</p>
      <p>With the MACID task at CALAMITA [2], we aim to challenge LLMs on their ability to finely discriminate between linguistic expressions referring to actions that are cognitively distinct but linguistically similar, because the same (or remarkably close) word labels are used to describe them. While the discrimination of very distant actions is quite a simple task (e.g. to distinguish between “opening a box” and “pressing a button”), grasping the nuances between actions that are much closer semantically is not so obvious (e.g. “pressing a button” and “pressing the wood”). These nuances are easy to highlight for a human, who can activate a simulated execution and thus find differences in motor execution, but a model without a physical dimension cannot. We aim to test to which degree an LLM can find the relevant information to recognize action concepts from their linguistic description. Moreover, visual information, in these scenarios, can facilitate the task for the computational model, providing more cues to disambiguate. For this reason, the proposed dataset has been conceived as a multimodal resource, with links between textual descriptions of actions and the short movie segments where these actions are performed.</p>
      <p>Currently, the CALAMITA challenge does not deal
with multi-modal LLMs, so for the first MACID
competition, we are presenting the text-only version of the
dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Challenge Description</title>
      <p>The task shares similarities with a word-sense discrimination task, since different senses of an action verb refer to different actions. However, the present task requires a deeper cognitive understanding of the sentences provided, given that the same action can be described through different predicates and, conversely, the same predicate can extend to a variety of actions. Indeed, the task forces the model to question a one-to-one relationship between meaning and form.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data description</title>
      <sec id="sec-3-1">
        <p>The challenge is mono-modal (i.e., text-only), but is ready to be turned into a multi-modal task (i.e., visual and linguistic information through video-caption pairs).</p>
        <p>We propose a task modeled over the typical “find the intruder” game, similarly to Chang et al. [3], but extending it to sentences instead of words in isolation. Among a group of 4 video-caption pairs, the model is asked to select the one that does not refer to the same kind of action as the other three. For the task to be challenging, we focus on action-predicate mismatches:
• different action concepts that may be defined by the same verb (e.g. “pressing a button” and “pressing the wood”);
• the expression of the same action concept through different verbs (e.g. “pressing a button” and “pushing a button”).</p>
        <p>We derived the data for this proposal from a small portion of the LSMDC dataset [4], which contains short video clips extracted from movies, along with English DVS (descriptive video services) transcriptions for visually impaired people. The LSMDC dataset is the result of merging two previous datasets, both built upon DVS from movies: the Max Planck Institut für Informatik Movie Description Dataset (MPII-MD) [5], and the Montreal Video Annotation Dataset (M-VAD) [6]. The subset considered for this task is a collection of video-caption pairs restricted to the variation of the actions (and action verbs) linked to “pushing” events.</p>
        <p>Data have been manually filtered and annotated [7] using the action conceptualization derived from the IMAGACT Multilingual and Multimodal Ontology of Actions [8]. IMAGACT is a multimodal and multilingual ontology of actions that provides a fine-grained categorization of action concepts, each represented by one or more visual prototypes in the form of recorded videos and 3D animations. IMAGACT currently contains 1,010 scenes that encompass the action concepts most commonly referred to in everyday language usage. Scenes belonging to the same action concept are grouped together and labeled with a unique identification number. The categorization of action concepts proposed in the theoretical framework behind IMAGACT has been validated in a series of experiments with a high inter-annotator agreement [9], confirming that the theoretical framework can be considered well-founded and reproducible.</p>
        <p>We wrote an Italian caption for each of the selected videos from LSMDC, which originally had only an English textual description. The captioning took into account the need to produce a natural-sounding Italian description, thus we chose the most appropriate verb (and construction) to describe the action depicted in each video.</p>
        <p>Moreover, we chose to keep the anonymization as proposed in the LSMDC, but instead of using SOMEONE as the only replacement for nouns, we chose to use general expressions such as il ragazzo (the boy), la donna (the woman), and so on. In this way, we removed some ambiguities from the original dataset (e.g., SOMEONE pushes SOMEONE).</p>
        <p>The MACID Task can also be framed as a multilingual
task, given the already available parallel English captions,
and the possibility to provide more translations in other
languages.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Data format</title>
          <p>The MACID dataset is available on HuggingFace.<sup>1</sup></p>
          <p>The dataset consists of groups of 4 captions (or video-caption pairs, in the case of the multimodal version), three of which belong to the same action concept, and one describing another action type.</p>
          <p>Data are released in CSV format (columns: id, s1, v1, s2, v2, s3, v3, s4, v4, intruder), with the following meaning:
• id: the tuple id;
• s1-4: the 4 sentences describing physical actions;
• v1-4: the 4 videos depicting physical actions;
• intruder: the number (1-4) of the sentence (and video) which is the intruder in the group.</p>
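          <p>As an illustration of the column layout described above, the following minimal sketch reads rows in that format with Python's standard csv module. The sample row is hypothetical: its id value and the video file names (clip_a.mp4 and so on) are placeholders, not values taken from the released file, and the function name read_tuples is introduced here only for the example.</p>
          <preformatted>
```python
import csv
import io

# Hypothetical one-row excerpt in the documented column layout; the id value
# and the video file names (clip_*.mp4) are placeholders, not dataset values.
SAMPLE_CSV = """id,s1,v1,s2,v2,s3,v3,s4,v4,intruder
1,I due ragazzi spingono il carrello verso la colonna,clip_a.mp4,La donna spinge la signora anziana sulla sedia a rotelle,clip_b.mp4,L'uomo spinge a terra l'aggressore,clip_c.mp4,L'infermiere spinge la barella,clip_d.mp4,3
"""


def read_tuples(handle):
    """Parse MACID-style CSV rows into dictionaries (illustrative helper)."""
    tuples = []
    for row in csv.DictReader(handle):
        tuples.append({
            "id": row["id"],
            "sentences": [row[f"s{i}"] for i in range(1, 5)],
            "videos": [row[f"v{i}"] for i in range(1, 5)],
            "intruder": int(row["intruder"]),  # 1-based position in the tuple
        })
    return tuples


tuples = read_tuples(io.StringIO(SAMPLE_CSV))
first = tuples[0]
# The intruder column is 1-based, so subtract 1 to index the Python list.
intruder_sentence = first["sentences"][first["intruder"] - 1]
print(intruder_sentence)
```
          </preformatted>
          <p>To read an actual copy of the dataset, the same function could be given an open file handle instead of the in-memory sample.</p>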
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>An additional folder with the video files is included in the dataset for future extension to the multimodal task. An example of the textual data follows.</p>
        <p>TUPLE_1
(1) I due ragazzi spingono il carrello verso la colonna (The two boys push the cart toward the column) [action id: 65431186]
(2) La donna spinge la signora anziana sulla sedia a rotelle (The woman pushes the elderly lady in the wheelchair) [action id: 65431186]
(3) L’uomo spinge a terra l’aggressore (The man pushes the attacker to the ground) [action id: 18ad2fa9]
(4) L’infermiere spinge la barella (The nurse pushes the gurney) [action id: 65431186]</p>
        <p>TUPLE_2
(1) La donna si spinge fuori dalla piscina (The woman pushes herself out of the pool) [action id: 950a69d5]
(2) L’uomo si solleva leggermente dalla donna sdraiata (The man lifts himself slightly off the lying woman) [action id: 950a69d5]
(3) Il ragazzo a terra si alza in ginocchio con fatica (The boy on the ground gets up to his knees with difficulty) [action id: 950a69d5]
(4) L’uomo preme il fazzoletto contro la sua narice (The man presses the tissue against his nostril) [action id: 8b2675f8]</p>
        <p><sup>1</sup> https://huggingface.co/datasets/loregreg/MACID</p>
        <p>For each group, the model must select the caption referring to the intruder action. The action ID will be hidden from the system and used for evaluating the model’s performance, but the ID of the corresponding video will be added, in order to enable researchers to also evaluate multimodal models.</p>
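        <p>The role of the hidden action IDs in evaluation can be sketched as follows. This is a minimal illustration, not the organizers' scoring script: it assumes that exactly three of the four action IDs in a tuple are identical, takes the position of the remaining one as the gold intruder, and computes accuracy as the share of tuples answered correctly. The function names and the predicted answers are hypothetical; the action IDs are those of the two example tuples.</p>
        <preformatted>
```python
from collections import Counter


def intruder_position(action_ids):
    """Return the 1-based position of the only action id that differs.

    Assumes exactly three of the four ids are identical.
    """
    counts = Counter(action_ids)
    if len(action_ids) != 4 or sorted(counts.values()) != [1, 3]:
        raise ValueError("expected three matching ids and one intruder")
    odd_id = min(counts, key=counts.get)  # the id that occurs once
    return action_ids.index(odd_id) + 1


def accuracy(gold, predicted):
    """Share of tuples whose predicted intruder matches the gold one."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return hits / len(gold)


# Action ids of the two example tuples, in sentence order.
tuple_1 = ["65431186", "65431186", "18ad2fa9", "65431186"]
tuple_2 = ["950a69d5", "950a69d5", "950a69d5", "8b2675f8"]

gold = [intruder_position(tuple_1), intruder_position(tuple_2)]
print(gold)                    # [3, 4]

# Hypothetical model answers: the first is right, the second is wrong.
print(accuracy(gold, [3, 1]))  # 0.5
```
        </preformatted>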
        <sec id="sec-3-2-1">
          <title>3.2. Example of prompts used for zero-shot evaluation</title>
          <p>The task is evaluated with a zero-shot prompt only. The
prompt used is reported in the example below.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
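        <p>Le seguenti 4 frasi sono descrizioni di azioni fisiche. Tre di queste azioni sono dello stesso tipo, mentre una è di un tipo diverso. Individua la frase che describe l’azione di tipo diverso rispondendo soltanto con il numero della frase (1, 2, 3 o 4).
1: I due ragazzi spingono il carrello verso la colonna
2: La donna spinge la signora anziana sulla sedia a rotelle
3: L’uomo spinge a terra l’aggressore
4: L’infermiere spinge la barella</p>
        <p>(English translation of the instruction: The following 4 sentences are descriptions of physical actions. Three of these actions are of the same type, while one is of a different type. Identify the sentence that describes the action of a different type, answering only with the number of the sentence (1, 2, 3 or 4).)</p>
        <p>Table 1: General details of the MACID dataset (rows: tuples, textual descriptions, videos, action types, action verbs).</p>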
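        <p>A prompt of the form reported in this section could be assembled from a 4-sentence tuple roughly as follows. This is a sketch under assumptions: the instruction text is copied as reported, but the exact line breaks between instruction and sentences, and the way a model reply is mapped to a number, are not specified in the paper; build_prompt and parse_answer are hypothetical helper names.</p>
        <preformatted>
```python
import re

# Instruction text copied as reported in the paper's example prompt.
INSTRUCTION = (
    "Le seguenti 4 frasi sono descrizioni di azioni fisiche. "
    "Tre di queste azioni sono dello stesso tipo, mentre una è di un tipo "
    "diverso. Individua la frase che describe l’azione di tipo diverso "
    "rispondendo soltanto con il numero della frase (1, 2, 3 o 4)."
)


def build_prompt(sentences):
    """Join the instruction and the numbered sentences, one per line.

    The newline-separated layout is an assumption of this sketch.
    """
    if len(sentences) != 4:
        raise ValueError("a MACID tuple has exactly 4 sentences")
    numbered = [f"{i}: {s}" for i, s in enumerate(sentences, start=1)]
    return "\n".join([INSTRUCTION] + numbered)


def parse_answer(reply):
    """Return the first standalone digit 1-4 in a reply, or None.

    This lenient parsing rule is an assumption, not the official one.
    """
    match = re.search(r"\b([1-4])\b", reply)
    return int(match.group(1)) if match else None


prompt = build_prompt([
    "I due ragazzi spingono il carrello verso la colonna",
    "La donna spinge la signora anziana sulla sedia a rotelle",
    "L’uomo spinge a terra l’aggressore",
    "L’infermiere spinge la barella",
])
print(prompt)
print(parse_answer("La frase 3."))
```
        </preformatted>
        <p>Replies that contain no standalone digit between 1 and 4 yield None here; how such replies are counted is left open in this sketch.</p>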
        <sec id="sec-3-3-1">
          <title>3.3. Detailed data statistics</title>
          <p>The MACID dataset consists of 100 tuples, each one containing 4 textual descriptions of human actions in the form of short sentences in Italian, and 4 video segments depicting those actions. See Table 1 for general details. The whole dataset is built using 307 hand-crafted captions, with each caption appearing at least once (either as positive sentence or as intruder), and at most 3 times (counting both possible roles).</p>
          <p>The dataset contains 18 action types, belonging to the semantic area of pushing events. Table 2 reports the frequency list of verbs used to describe the actions.</p>
          <p>In building the 4-sentence tuples, we balanced close and distant action concepts as much as possible, by choosing the intruder captions on the basis of the distance computed over the whole IMAGACT ontology [10, 11, 12]. Thus, we compiled the stimuli by paying attention to the distance between the action concepts of the three positive sentences and the intruder, trying to balance as much as possible between intruders with action concepts of high, medium or low similarity with respect to the action concept shared by the other three sentences in the stimulus. Furthermore, we also paid attention to creating stimuli which are varied in terms of action verbs, resulting in 5 possible patterns of verb distribution across the 4 sentences of a stimulus:
1. four different verbs, i.e. one unique verb per sentence (1_1_1_1);
2. three different verbs, with a couple of sentences with the same verb (2_1_1);
3. two different verbs, each shared by two sentences (2_2);
4. two different verbs, with three sentences sharing the same verb and one with a different one (3_1);
5. one verb in all the four sentences (4).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Metrics</title>
      <p>The evaluation metric proposed for the MACID Task is simple accuracy: participating models will be evaluated on the percentage of tuples in which they correctly select the intruder sentence among the 4 sentences of the tuple.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <p>The main limitation of the MACID Task dataset is its size. We propose a set of 100 4-sentence tuples, as the MACID Task is intended as a zero-shot LLMs-only challenge, thus we did not design it as a typical Machine Learning task with train(-dev)-test splitting. Having many more stimuli would make it possible to tackle the task with other kinds of models, and also to offer data exemplars to be used to better inform LLMs about the required behavior.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Project ERC2021-STG-101039777 (ABSTRACTION), funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arbib</surname>
          </string-name>
          , G. Rizzolatti,
          <article-title>Neural expectations: A possible evolutionary path from manual skills to language</article-title>
          ,
          <source>Communication and Cognition</source>
          <volume>29</volume>
          (
          <year>1996</year>
          )
          <fpage>393</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gerrish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <article-title>Reading tea leaves: How humans interpret topic models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>22</volume>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>Movie description</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>
          (
          <year>2017</year>
          )
          <fpage>94</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>A dataset for movie description</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>3202</fpage>
          -
          <lpage>3212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Torabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <article-title>Using descriptive video services to create a large data source for video annotation research</article-title>
          ,
          <source>arXiv preprint arXiv:1503.01070</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          ,
          <article-title>Annotation of linguistically derived action concepts in computer vision datasets</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Florence,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moneglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Frontini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gagliardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Monachini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panunzi</surname>
          </string-name>
          , et al.,
          <article-title>The IMAGACT visual ontology. An extendable multilingual infrastructure for the representation of lexical encoding of action</article-title>
          ,
          <source>in: Proceedings of the Ninth International Conference on Language Resources and Evaluation-LREC'14</source>
          ,
          <publisher-name>European Language Resources Association (ELRA)</publisher-name>
          ,
          <year>2014</year>
          , pp.
          <fpage>3425</fpage>
          -
          <lpage>3432</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gagliardi</surname>
          </string-name>
          ,
          <article-title>Rappresentazione dei concetti azionali attraverso prototipi e accordo nella categorizzazione dei verbi generali. una validazione statistica</article-title>
          ,
          <source>in: Proceedings of the First Italian Conference on Computational Linguistics-CLiC-it</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gregori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Varvara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          ,
          <article-title>Action type induction from multilingual lexical features</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>63</volume>
          (
          <year>2019</year>
          )
          <fpage>85</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gregori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Varvara</surname>
          </string-name>
          ,
          <article-title>Comparing refvectors and word embeddings in a verb semantic similarity task</article-title>
          ,
          <source>in: Proceedings of the 3rd Workshop on Natural Language for Artificial Intelligence, CEUR-WS.org</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gregori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moneglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panunzi</surname>
          </string-name>
          ,
          <article-title>Towards a crosslinguistic identification of action concepts. Automatic clustering of video scenes based on the IMAGACT multilingual ontology</article-title>
          ,
          <source>in: AREA II workshop. Annotation, Recognition and Evaluation of Action</source>
          , On line, Areaworkshop.org,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>