A Comparison of Representation Models in a Non-Conventional Semantic Similarity Scenario

Andrea Amelio Ravelli
University of Florence
andreaamelio.ravelli@unifi.it

Oier Lopez de Lacalle and Eneko Agirre
University of the Basque Country
oier.lopezdelacalle@ehu.eus, e.agirre@ehu.eus

Abstract

Representation models have shown very promising results in solving semantic similarity problems. Normally, their performance is benchmarked on well-tailored experimental settings, but what happens with unusual data? In this paper, we present a comparison between popular representation models tested in a non-conventional scenario: assessing action reference similarity between sentences from different domains. The action reference problem is not a trivial task, given that verbs are generally ambiguous and complex to treat in NLP. We set up four variants of the same tests to check whether different pre-processing may improve model performance. We also compare our results with those obtained on a common benchmark dataset for a similar task.[1]

[1] Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Verbs are the standard linguistic tool that humans use to refer to actions, and action verbs are very frequent in spoken language (∼50% of total verb occurrences) (Moneglia and Panunzi, 2007). These verbs are generally ambiguous and complex to treat in NLP tasks, because the relation between verbs and action concepts is not one-to-one: e.g. (a) pushing a button is cognitively separated from (b) pushing a table to the corner; action (a) can also be predicated through press, while move can be used for (b) and not vice versa (Moneglia, 2014). These represent two different pragmatic actions, regardless of the verb used to describe them and of all the possible objects that can undergo the action.

Another example could be the ambiguity behind a sentence like John pushes the bottle: is the agent applying a continuous and controlled force to move the object from position A to position B, or is he carelessly shoving the object away from its location? These are just two of the possible interpretations of this sentence as is, without any other lexical information or pragmatic reference.

Given these premises, it is clear that the task of automatically classifying sentences referring to actions in a fine-grained way (e.g. push/move vs. push/press) is not trivial at all, and even humans may need extra information (e.g. images, videos) to precisely identify the exact action. One way to approach it is to treat action reference similarity as a Semantic Textual Similarity (STS) problem (Agirre et al., 2012), on the assumption that lexical semantic information encodes, at a certain level, the action those words refer to. The simplest way is to make use of pre-computed word embeddings, which are ready to use for computing similarity between words, sentences and documents. Various models presented in past years make use of well-known static word embeddings, such as word2vec, GloVe and FastText (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). Recently, the best STS models rely on representations obtained from contextual embeddings, such as ELMo, BERT and XLNet (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019).

In this paper, we test the effectiveness of representation models in a non-conventional scenario, in which we do not have labeled data to train STS systems. Normally, STS is performed on sentence pairs that, on the one hand, can have very close or very distant meanings, i.e. the assertion of similarity is easy to formulate, and that, on the other hand, all derive from the same domain, thus sharing some syntactic regularities and vocabulary. In our scenario, we compute STS between textual data from two different resources, IMAGACT and LSMDC16 (described in Sections 5.1 and 5.2 respectively), in which the language used is highly different: from the first, short synthetic captions; from the latter, audio descriptions. The objective is to benchmark word embedding models on the task of estimating the action concept expressed by a sentence.
2 Related Works

Word embeddings are abstract representations of words in the form of dense vectors, specifically tailored to encode semantic information. They are an example of so-called transfer learning, as the vectors are built to minimize a certain objective function (e.g., guessing the next word in a sentence) but are successfully applied to different, unrelated tasks, such as searching for words that are semantically related. In fact, embeddings are typically tested on semantic similarity/relatedness datasets, where a comparison of the vectors of two words is meant to mimic a human score that assesses the grade of semantic similarity between them.

The success of word embeddings on similarity tasks has motivated methods to learn representations of longer pieces of text such as sentences (Pagliardini et al., 2017), as representing their meaning is a fundamental step in any task requiring some level of text understanding. However, sentence representation is a challenging task that has to consider aspects such as compositionality, phrase similarity, negation, etc. The Semantic Textual Similarity (STS) task (Cer et al., 2017) aims at extending traditional semantic similarity/relatedness measures between pairs of words in isolation to full sentences, and is a natural dataset to evaluate sentence representations. Through a series of campaigns, STS has distributed sets of manually annotated datasets where annotators measure the similarity among sentences with a score that ranges from 0 (no similarity) to 5 (full equivalence).

In recent years, evaluation campaigns that group together many semantic tasks have been set up, with the objective of measuring the performance of natural language understanding systems. The most well-known benchmarks are SentEval[2] (Conneau and Kiela, 2018) and GLUE[3] (Wang et al., 2019). They share many existing tasks and datasets, such as sentence similarity.

[2] https://github.com/facebookresearch/SentEval
[3] https://gluebenchmark.com/

3 Problem Formulation

We cast the problem as fine-grained action concept classification for the verbs in LSMDC16 captions (e.g. push as move vs. push as press, see Figure 1). Given a caption and the target verb from LSMDC16, our aim is to detect the most similar caption in IMAGACT that describes the action. The inputs to our model are the target caption and an inventory of captions that categorize the possible action concepts of the target verb. The model ranks the captions in the inventory according to their textual similarity with the target caption and, similar to a kNN classifier, assigns the action label of the k most similar captions.
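As an illustration of this ranking step, the following minimal Python sketch assumes that caption vectors have already been produced by one of the models in Section 4; the function names and the majority vote over the top k are our own illustrative choices, not the authors' released code.

```python
import numpy as np
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def classify_action(target_vec, inventory_vecs, inventory_labels, k=3):
    """Rank the IMAGACT caption inventory by similarity to the target LSMDC16
    caption and return the ranked AC labels plus a kNN-style majority vote."""
    sims = [cosine(target_vec, v) for v in inventory_vecs]
    order = np.argsort(sims)[::-1]                  # most similar first
    ranked = [inventory_labels[i] for i in order]
    predicted = Counter(ranked[:k]).most_common(1)[0][0]
    return ranked, predicted
```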
4 Representation Models

In this section we describe the pre-trained embeddings used to represent the contexts. Once we obtain the representation of each caption, the final similarity is computed as the cosine of the two representation vectors.

4.1 One-hot Encoding

This is the most basic textual representation, in which text is represented as a binary vector indicating the words occurring in the context (Manning et al., 2008). This way of representing text creates long and sparse vectors, but it has been successfully used in many NLP tasks.

4.2 GloVe

The Global Vector model (GloVe)[4] (Pennington et al., 2014) is a log-linear model trained to encode semantic relationships between words as vector offsets in the learned vector space, combining global matrix factorization and local context window methods. Since GloVe is a word-level vector model, we compute the mean of the vectors of all the words composing the sentence, in order to obtain the sentence-level representation. The pre-trained GloVe model considered in this paper is the 6B-300d one, with a vocabulary of 400k words, 300-dimensional vectors, and 6 billion training tokens.

[4] https://nlp.stanford.edu/projects/glove/
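A minimal sketch of these two sentence representations, assuming the 6B-300d vectors have been downloaded as a plain-text file; the file name and the whitespace tokenisation are illustrative simplifications.

```python
import numpy as np


def one_hot_vector(sentence, vocabulary):
    """Binary bag-of-words vector over a fixed vocabulary (Section 4.1)."""
    tokens = set(sentence.lower().split())
    return np.array([1.0 if word in tokens else 0.0 for word in vocabulary])


def load_glove(path="glove.6B.300d.txt"):
    """Load GloVe vectors from the plain-text distribution file."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def glove_sentence_vector(sentence, vectors, dim=300):
    """Mean of the word vectors of all in-vocabulary tokens (Section 4.2)."""
    word_vecs = [vectors[t] for t in sentence.lower().split() if t in vectors]
    if not word_vecs:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(word_vecs, axis=0)
```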
4.3 BERT

Bidirectional Encoder Representations from Transformers (BERT)[5] (Devlin et al., 2018) implements a novel methodology based on the so-called masked language model, which randomly masks some of the tokens from the input and predicts the original vocabulary id of the masked word based only on its context. Similarly to GloVe, we extract the token embeddings of the last layer and compute their mean vector to obtain the sentence-level representation. The BERT model used in our tests is BERT-Large Uncased (24 layers, 1024 hidden units, 16 attention heads, 340M parameters).

[5] https://github.com/google-research/bert

4.4 USE

The Universal Sentence Encoder (USE) (Cer et al., 2018) is a model for encoding sentences into embedding vectors, specifically designed for transfer learning in NLP. Based on a deep averaging network encoder, the model is trained on a variety of text lengths, such as sentences, phrases or short paragraphs, and on a variety of semantic tasks, including STS. The encoder returns the corresponding vector of the sentence, and we compute similarity using the cosine formula.
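The paper uses the original BERT-Large Uncased checkpoint; the sketch below reproduces the mean-of-last-layer pooling described in Section 4.3 with the HuggingFace transformers library, which is our substitution for the original TensorFlow code and may differ in details (for instance, the paper does not specify how the [CLS]/[SEP] tokens were handled).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased")
model.eval()


def bert_sentence_vector(sentence):
    """Mean of the last-layer token embeddings (1024 dimensions for BERT-Large)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: (1, seq_len, 1024); average over the token axis,
    # here including the special tokens.
    return outputs.last_hidden_state[0].mean(dim=0).numpy()
```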
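For the USE representation of Section 4.4, a comparable sketch with TensorFlow Hub could look as follows; the exact module version used by the authors is not stated, so the URL below is an assumption.

```python
import tensorflow_hub as hub

# Deep-averaging-network USE module from TF-Hub; assumed version.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")


def use_sentence_vectors(sentences):
    """Encode a list of sentences into 512-dimensional USE vectors."""
    return embed(sentences).numpy()
```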
5 Datasets

In this section, we briefly introduce the resources used to collect sentence pairs for our similarity test. Figure 1 shows some examples of the data, aligned by action concept.

5.1 IMAGACT

IMAGACT[6] (Moneglia et al., 2014) is a multilingual and multimodal ontology of action that provides a video-based translation and disambiguation framework for action verbs. The resource is built on an ontology containing a fine-grained categorization of action concepts (ACs), each represented by one or more visual prototypes in the form of recorded videos and 3D animations. IMAGACT currently contains 1,010 scenes, which encompass the actions most commonly referred to in everyday language usage.

[6] http://www.imagact.it

Verbs from different languages are linked to ACs on the basis of competence-based annotation by mother-tongue informants. All the verbs that productively predicate the action depicted in an AC video are in a local equivalence relation (Panunzi et al., 2018b), i.e. the property that different verbs (even with different meanings) can refer to the same action concept. Moreover, each AC is linked to a short synthetic caption (e.g. John pushes the button) for each locally equivalent verb in every language. These captions are formally defined, thus they only contain the minimum arguments needed to express an action.

We exploited the IMAGACT conceptualization because of its action-centric approach. In fact, compared to other linguistic resources, e.g. WordNet (Fellbaum, 1998), BabelNet (Navigli and Ponzetto, 2012) and VerbNet (Schuler, 2006), IMAGACT focuses on actions and represents them as visual concepts. Even if IMAGACT is a smaller resource, its action conceptualization is more fine-grained. Other resources have broader scopes, and for this reason senses referring to actions are often vague and overlapping (Panunzi et al., 2018a), i.e. all possible actions can be gathered under one synset. For instance, if we look at the senses of push in WordNet, we find that only 4 out of 10 synsets refer to concrete actions, and some of the glosses are not really exhaustive and can be applied to a wide set of different actions:

• push, force (move with force);
• push (press against forcefully without moving);
• push (move strenuously and with effort);
• press, push (make strenuous pushing movements during birth to expel the baby).

In such a categorization framework, all possible actions referred to by push can be gathered under the first synset, except for those specifically described by the other three.

For the experiments proposed in this paper, only the English captions have been used, in order to test our method in a monolingual scenario.

5.2 LSMDC16

The Large Scale Movie Description Challenge dataset[7] (LSMDC16) (Rohrbach et al., 2017) consists of a parallel corpus of 128,118 sentences obtained from audio descriptions for visually impaired people and from scripts, aligned to video clips from 200 movies. The dataset derives from the merging of two previously independent datasets, MPII-MD (Rohrbach et al., 2015) and M-VAD (Torabi et al., 2015). The language used in audio descriptions is particularly rich in references to physical actions, with respect to reference corpora (e.g. the BNC corpus) (Salway, 2007). For this reason, the LSMDC16 dataset can be considered a good source of video-caption pairs of action examples, comparable to the data in the IMAGACT resource.

[7] https://sites.google.com/site/describingmovies/home

Figure 1: An example of the aligned representation of action concepts in the two resources. On the left, action concepts with prototype videos and captions for all applicable verbs in IMAGACT; on the right, the video-caption pairs in LSMDC16, classified according to the depicted and described action.
6 Experiments

Given that the objective is not to discriminate distant actions (e.g. opening a door vs. taking a cup) but rather to distinguish actions referred to by the same verb or set of verbs, the experiments described herein have been conducted on a subset of the LSMDC16 dataset, which has been manually annotated with the corresponding ACs from IMAGACT. The annotation has been carried out by one expert annotator, trained on the IMAGACT conceptualization framework, and revised by a supervisor. In this way, we created a Gold Standard for the evaluation of the compared systems.

6.1 Gold Standard

The Gold Standard test set (GS) has been created by selecting one starting verb: push. This verb has been chosen because, as a general action verb, it is highly frequent in use, it applies to a high number of ACs in the IMAGACT Ontology (25 ACs), and it has a high number of occurrences in both IMAGACT and LSMDC16.

From the IMAGACT Ontology, all the verbs in a relation of local equivalence with push in each of its ACs have been queried[8], i.e. all the verbs that predicate at least one of the ACs linked to push. Then, all the captions in LSMDC16 containing one of those verbs have been manually annotated with the id of the corresponding AC. In total, 377 video-caption pairs have been annotated[9] with 18 ACs, and they have been paired with the 38 IMAGACT captions for the verbs linked to the same ACs, resulting in a total of 14,440 similarity judgements.

[8] The verbs collected for this experiment are: push, insert, press, ram, nudge, compress, squeeze, wheel, throw, shove, flatten, put, move. Move and put have been excluded from this list, because these verbs are too general and apply to a wide set of ACs, with the risk of introducing more noise in the computation of the similarity; flatten is connected to an AC that has no examples in LSMDC16, so it has been excluded too.
[9] Pairs with no action in the video, or pairs with a novel or difficult-to-assign AC, have been excluded from the test.

It is important to highlight that the manual annotation took into account the visual information conveyed together with the captions (i.e. the videos from both resources), which made it possible to precisely assign the most applicable AC to the LSMDC16 captions.

6.2 Pre-processing of the data

As stated in the introduction, STS methods are normally tested on data within the same domain. In an attempt to level out some differences between IMAGACT and LSMDC16, basic pre-processing has been applied.

The length of captions in the two resources varies: captions in IMAGACT are artificial, and they only contain the minimum syntactic/semantic elements needed to describe the AC; captions in LSMDC16 are transcriptions of more natural spoken language, and usually convey information on more than one action at the same time. For this reason, LSMDC16 captions have been split into shorter and simpler sentences. To do that, we parsed the original captions with StanfordNLP (Qi et al., 2018), and rewrote simplified sentences by collecting all the words in a dependency relation with the targeted verbs. Table 1 shows an example of the splitting process.

Table 1: Example of the split text after processing the output of the dependency parser. From the original caption (FULL) we obtain three sub-captions (SPLIT); only the one containing the target verb is used (✓), and the rest is ignored (✗).

FULL   As he crashes onto the platform, someone hauls him to his feet and pushes him back towards someone.
SPLIT  he crashes onto the platform and     ✗
       As someone hauls him to his feet     ✗
       pushes him back towards someone      ✓

The LSMDC16 dataset is anonymised, i.e. the pronoun someone is used in place of all proper names; on the contrary, captions in IMAGACT always have a proper name (e.g. John, Mary). We automatically substituted IMAGACT proper names with someone, to match LSMDC16.

Finally, we also removed stop-words, which are often the first lexical elements to be pruned out of texts prior to any computation, because they do not convey semantic information and sometimes introduce noise in the process. Stop-word removal has been applied at the moment of computing the similarity between caption pairs, i.e. tokens corresponding to stop-words have been used for the representation by the contextual models, but then discarded when computing the sentence representation.

With these pre-processing operations, we obtained 4 variants of the testing data (the splitting and the variant construction are sketched after this list):

• plain (LSMDC16 splitting only);
• anonIM (anonymisation of IMAGACT captions by substitution of proper names with someone);
• noSW (stop-word removal from both resources);
• anonIM+noSW (combination of the two previous ones).
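A rough sketch of the splitting step, using the stanza library as a stand-in for the StanfordNLP package cited in the paper; keeping the whole dependency subtree of the target verb is our simplification of the authors' procedure.

```python
import stanza

# stanza is the successor of the StanfordNLP package cited in the paper,
# so this pipeline is an approximation of the original setup.
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")


def split_on_verb(caption, target_lemma):
    """Keep only the words that depend (directly or transitively) on the
    target verb, in their original order; return None if the verb is absent."""
    sentence = nlp(caption).sentences[0]   # simplification: first sentence only
    words = sentence.words
    keep = {w.id for w in words if w.lemma == target_lemma and w.upos == "VERB"}
    if not keep:
        return None
    changed = True
    while changed:                         # collect the verb's dependency subtree
        changed = False
        for w in words:
            if w.head in keep and w.id not in keep:
                keep.add(w.id)
                changed = True
    return " ".join(w.text for w in words if w.id in keep)
```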
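The anonIM and noSW variants can be pictured with the following sketch; the proper-name set and the stop-word list are placeholders, not the actual lists used in the paper.

```python
# Placeholder lists: the actual name and stop-word inventories used in the
# paper are not given, so these are illustrative only.
IMAGACT_NAMES = {"John", "Mary"}
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "on", "his", "her", "him"}


def anonymise(caption):
    """anonIM variant: replace IMAGACT proper names with 'someone'."""
    return " ".join("someone" if tok in IMAGACT_NAMES else tok
                    for tok in caption.split())


def content_tokens(caption):
    """noSW variant: tokens kept when pooling the sentence representation.
    With contextual models the full caption is still fed to the encoder;
    stop-word positions are only dropped at pooling time."""
    return [tok for tok in caption.split() if tok.lower() not in STOP_WORDS]
```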
7 Results

To benchmark the performance of the four models, we also defined a baseline that, following a binomial distribution, randomly assigns an AC of the GS test set (in practice, the baseline is calculated analytically, without simulations). The parameters of the binomial are estimated from the GS test set. Table 2 shows the results at different recall@k (i.e. the ratio of examples containing the correct label in the top k answers) for the models tested.

Table 2: STS results for the models tested on the IMAGACT-LSMDC scenario.

Model             Pre-processing   recall@1  recall@3  recall@5  recall@10
One-hot encoding  plain            0.195     0.379     0.484     0.655
                  noSW             0.139     0.271     0.411     0.687
                  anonIM           0.197     0.400     0.482     0.624
                  anonIM+noSW      0.155     0.329     0.453     0.650
GloVe             plain            0.213     0.392     0.553     0.818
                  noSW             0.182     0.408     0.505     0.755
                  anonIM           0.218     0.453     0.568     0.774
                  anonIM+noSW      0.279     0.453     0.553     0.761
BERT              plain            0.245     0.439     0.539     0.632
                  noSW             0.247     0.484     0.558     0.679
                  anonIM           0.239     0.434     0.529     0.645
                  anonIM+noSW      0.200     0.384     0.526     0.668
USE               plain            0.213     0.403     0.492     0.616
                  noSW             0.171     0.376     0.461     0.563
                  anonIM           0.239     0.471     0.561     0.666
                  anonIM+noSW      0.179     0.426     0.518     0.637
Random baseline                    0.120     0.309     0.447     0.658

All models show slightly better results than the baseline, but not by much. Regarding pre-processing, no strategy (noSW, anonIM, anonIM+noSW) seems to make a difference. We were expecting low results, given the difficulty of the task: without taking the visual information into account, most of those caption pairs are ambiguous even for a human annotator.

Surprisingly, the GloVe model, the only one with static pre-trained embeddings based on statistical distribution, outperforms the baseline and the contextual models by ∼0.2 in recall@10. It is not an exciting result, but it shows that STS with pre-trained word embeddings might be effective for speeding up manual annotation tasks, at negligible computational cost. One reason that could explain the lower results obtained by the contextual models (BERT, USE) is that these systems have been penalized by the splitting process applied to LSMDC16 captions. The example in Table 1 shows a good splitting result, but processing some other captions leads to less natural sentence splittings, and this might influence the global result.

We ran similar experiments on the publicly available STS-benchmark dataset[10] (Cer et al., 2017), in order to see whether the models show similar behaviour when benchmarked on a more conventional scenario. The task is similar to the one presented herein: it consists in the assessment of pairs of sentences according to their degree of semantic similarity. In this task, models are evaluated by the Pearson correlation of machine scores with human judgments. Table 3 shows the expected results: the contextual models consistently outperform the GloVe-based model, and USE outperforms the rest by a large margin (about 20-30 points better overall). This confirms that model performance is task-dependent, and that results obtained in non-conventional scenarios can be counter-intuitive if compared to results obtained in conventional ones.

Table 3: Results on the STS-benchmark.

Model   Pre-processing   Pearson
GloVe   plain            0.336
BERT    plain            0.470
USE     plain            0.702

[10] http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
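For clarity, recall@k as reported in Table 2 can be computed as in the following sketch, assuming that for every annotated LSMDC16 caption we keep its gold AC and the list of AC labels of the IMAGACT captions ranked by the model (variable names are illustrative).

```python
def recall_at_k(gold_labels, ranked_label_lists, k):
    """Ratio of test captions whose gold AC appears among the AC labels of
    the top-k IMAGACT captions ranked by the model."""
    hits = sum(1 for gold, ranked in zip(gold_labels, ranked_label_lists)
               if gold in ranked[:k])
    return hits / len(gold_labels)


# e.g. recall_at_k(gold, rankings, k=10) reproduces the recall@10 column.
```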
8 Conclusions and Future Work

In this paper we presented a comparison of four popular representation models (one-hot encoding, GloVe, BERT, USE) on the task of semantic textual similarity in a non-conventional scenario: action reference similarity between sentences from different domains.

In the future, we would like to extend our Gold Standard dataset, not only in terms of size (i.e. more LSMDC16 video-caption pairs annotated with ACs from IMAGACT), but also in terms of annotators. It would be interesting to observe to what extent the visual stimuli offered by the video prototypes can be interpreted consistently by more than one annotator, and thus to calculate the inter-annotator agreement. Moreover, we plan to extend the evaluation to other representation models as well as to state-of-the-art supervised models, and see whether their performance in canonical tests is confirmed in our scenario. We would also like to augment the data used for this test by exploiting dense video captioning models, e.g. VideoBERT (Sun et al., 2019).

Acknowledgements

This research was partially supported by the Spanish MINECO (DeepReading RTI2018-096846-B-C21 (MCIU/AEI/FEDER, UE)), the ERA-Net CHISTERA LIHLITH project funded by the Agencia Estatal de Investigación (AEI, Spain), projects PCIN-2017-118/AEI and PCIN-2017-085/AEI, the Basque Government (excellence research group, IT1343-19), and the NVIDIA GPU grant program.

References

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 Task 6: A pilot on semantic textual similarity. In *SEM 2012 - 1st Joint Conference on Lexical and Computational Semantics, pages 385-393.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5(1):135-146.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1-14, Vancouver, Canada. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.

Massimo Moneglia and Alessandro Panunzi. 2007. Action predicates and the ontology of action across spoken language corpora. In Proceedings of the International Workshop on the Semantic Representation of Spoken Language (SRSL 2007), pages 51-58, Salamanca.

Massimo Moneglia, Susan Brown, Francesca Frontini, Gloria Gagliardi, Fahad Khan, Monica Monachini, and Alessandro Panunzi. 2014. The IMAGACT visual ontology. An extendable multilingual infrastructure for the representation of lexical encoding of action. In Proceedings of LREC, pages 3425-3432.

Massimo Moneglia. 2014. The variation of action verbs in multilingual spontaneous speech corpora. Spoken Corpora and Linguistic Studies, 61:152.
Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217-250.

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2017. Unsupervised learning of sentence embeddings using compositional n-gram features. CoRR, abs/1703.02507.

Alessandro Panunzi, Lorenzo Gregori, and Andrea Amelio Ravelli. 2018a. One event, many representations. Mapping action concepts through visual features. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Alessandro Panunzi, Massimo Moneglia, and Lorenzo Gregori. 2018b. Action identification and local equivalence of action verbs: the annotation framework of the IMAGACT ontology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227-2237.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. CoNLL 2018 Shared Task.

Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision, 123(1):94-120.

Andrew Salway. 2007. A corpus-based analysis of audio description. In Media for All, pages 151-174. Leiden.

Karin Kipper Schuler. 2006. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. CoRR, abs/1904.01766.

Atousa Torabi, Christopher J. Pal, Hugo Larochelle, and Aaron C. Courville. 2015. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR.