A Comparison of Representation Models in a Non-Conventional Semantic Similarity Scenario

Andrea Amelio Ravelli
University of Florence
andreaamelio.ravelli@unifi.it

Oier Lopez de Lacalle and Eneko Agirre
University of the Basque Country
oier.lopezdelacalle@ehu.eus, e.agirre@ehu.eus

Abstract

Representation models have shown very promising results in solving semantic similarity problems. Normally, their performance is benchmarked on well-tailored experimental settings, but what happens with unusual data? In this paper, we present a comparison between popular representation models tested in a non-conventional scenario: assessing action reference similarity between sentences from different domains. The action reference problem is not a trivial task, given that verbs are generally ambiguous and complex to treat in NLP. We set up four variants of the same tests to check whether different pre-processing may improve model performance. We also compare our results with those obtained on a common benchmark dataset for a similar task.[1]

[1] Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Verbs are the standard linguistic tool that humans use to refer to actions, and action verbs are very frequent in spoken language (∼50% of total verb occurrences) (Moneglia and Panunzi, 2007). These verbs are generally ambiguous and complex to treat in NLP tasks, because the relation between verbs and action concepts is not one-to-one: e.g. (a) pushing a button is cognitively separated from (b) pushing a table to the corner; action (a) can also be predicated through press, while move can be used for (b) and not vice versa (Moneglia, 2014). These represent two different pragmatic actions, regardless of the verb used to describe them and of all the possible objects that can undergo the action.

Another example could be the ambiguity behind a sentence like John pushes the bottle: is the agent applying a continuous and controlled force to move the object from position A to position B, or is he carelessly shoving the object away from its location? These are just two of the possible interpretations of this sentence as is, without any other lexical information or pragmatic reference.

Given these premises, it is clear that the task of automatically classifying sentences referring to actions in a fine-grained way (e.g. push/move vs. push/press) is not trivial at all, and even humans may need extra information (e.g. images, videos) to precisely identify the exact action. One way to approach it is to treat action reference similarity as a Semantic Textual Similarity (STS) problem (Agirre et al., 2012), on the assumption that lexical semantic information encodes, at a certain level, the action those words refer to. The simplest way is to make use of pre-computed word embeddings, which are ready to use for computing similarity between words, sentences and documents. Various models presented in past years make use of well-known static word embeddings, such as word2vec, GloVe and FastText (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017). Recently, the best STS models rely on representations obtained from contextual embeddings, such as ELMo, BERT and XLNet (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019).

In this paper, we test the effectiveness of representation models in a non-conventional scenario, in which we do not have labeled data to train STS systems. Normally, STS is performed on sentence pairs that, on the one hand, can have very close or very distant meanings, i.e. the assertion of similarity is easy to formulate, and that, on the other hand, all derive from the same domain, thus sharing some syntactic regularities and vocabulary. In our scenario, we compute STS between textual data from two different resources, IMAGACT and LSMDC16 (described in Sections 5.1 and 5.2 respectively), in which the language used is highly different: from the first, short synthetic captions; from the latter, audio descriptions. The objective is to benchmark word embedding models on the task of estimating the action concept expressed by a sentence.
2 Related Works

Word embeddings are abstract representations of words in the form of dense vectors, specifically tailored to encode semantic information. They are an example of so-called transfer learning, as the vectors are built to minimize a certain objective function (e.g., guessing the next word in a sentence) but are successfully applied to different, unrelated tasks, such as searching for words that are semantically related. In fact, embeddings are typically tested on semantic similarity/relatedness datasets, where a comparison of the vectors of two words is meant to mimic a human score that assesses the grade of semantic similarity between them.

The success of word embeddings on similarity tasks has motivated methods to learn representations of longer pieces of text such as sentences (Pagliardini et al., 2017), as representing their meaning is a fundamental step in any task requiring some level of text understanding. However, sentence representation is a challenging task that has to consider aspects such as compositionality, phrase similarity, negation, etc. The Semantic Textual Similarity (STS) task (Cer et al., 2017) aims at extending traditional semantic similarity/relatedness measures between pairs of words in isolation to full sentences, and is a natural dataset to evaluate sentence representations. Through a series of campaigns, STS has distributed sets of manually annotated datasets where annotators measure the similarity among sentences with a score that ranges from 0 (no similarity) to 5 (full equivalence).

In recent years, evaluation campaigns that group together many semantic tasks have been set up, with the objective of measuring the performance of natural language understanding systems. The most well-known benchmarks are SentEval[2] (Conneau and Kiela, 2018) and GLUE[3] (Wang et al., 2019). They share many existing tasks and datasets, such as sentence similarity.

[2] https://github.com/facebookresearch/SentEval
[3] https://gluebenchmark.com/

3 Problem Formulation

We cast the problem as fine-grained action concept classification for the verbs in LSMDC16 captions (e.g. push as move vs. push as press, see Figure 1). Given a caption and the target verb from LSMDC16, our aim is to detect the most similar caption in IMAGACT that describes the action. The inputs to our model are the target caption and an inventory of captions that categorize the possible action concepts of the target verb. The model ranks the captions in the inventory according to their textual similarity with the target caption and, similar to a kNN classifier, assigns the action label of the k most similar captions.
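As an illustration of this ranking step, the following minimal Python sketch assumes that caption vectors have already been produced by one of the models in Section 4; the function names and the majority vote over the top k are our own illustrative choices, not the authors' released code.

```python
import numpy as np
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def classify_action(target_vec, inventory_vecs, inventory_labels, k=3):
    """Rank the IMAGACT caption inventory by similarity to the target LSMDC16
    caption and return the ranked AC labels plus a kNN-style majority vote."""
    sims = [cosine(target_vec, v) for v in inventory_vecs]
    order = np.argsort(sims)[::-1]                  # most similar first
    ranked = [inventory_labels[i] for i in order]
    predicted = Counter(ranked[:k]).most_common(1)[0][0]
    return ranked, predicted
```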
4 Representation Models

In this section we describe the pre-trained embeddings used to represent the contexts. Once we obtain the representation of each caption, the final similarity is computed as the cosine of the two representation vectors.

4.1 One-hot Encoding

This is the most basic textual representation, in which text is represented as a binary vector indicating the words occurring in the context (Manning et al., 2008). This way of representing text creates long and sparse vectors, but it has been successfully used in many NLP tasks.

4.2 GloVe

The Global Vector model (GloVe)[4] (Pennington et al., 2014) is a log-linear model trained to encode semantic relationships between words as vector offsets in the learned vector space, combining global matrix factorization and local context window methods. Since GloVe is a word-level vector model, we compute the mean of the vectors of all the words composing the sentence, in order to obtain the sentence-level representation. The pre-trained GloVe model considered in this paper is the 6B-300d one, with a vocabulary of 400k words, 300-dimensional vectors, and 6 billion training tokens.

[4] https://nlp.stanford.edu/projects/glove/
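A minimal sketch of these two sentence representations, assuming the 6B-300d vectors have been downloaded as a plain-text file; the file name and the whitespace tokenisation are illustrative simplifications.

```python
import numpy as np


def one_hot_vector(sentence, vocabulary):
    """Binary bag-of-words vector over a fixed vocabulary (Section 4.1)."""
    tokens = set(sentence.lower().split())
    return np.array([1.0 if word in tokens else 0.0 for word in vocabulary])


def load_glove(path="glove.6B.300d.txt"):
    """Load GloVe vectors from the plain-text distribution file."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def glove_sentence_vector(sentence, vectors, dim=300):
    """Mean of the word vectors of all in-vocabulary tokens (Section 4.2)."""
    word_vecs = [vectors[t] for t in sentence.lower().split() if t in vectors]
    if not word_vecs:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(word_vecs, axis=0)
```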
4.3 BERT

Bidirectional Encoder Representations from Transformers (BERT)[5] (Devlin et al., 2018) implements a novel methodology based on the so-called masked language model, which randomly masks some of the tokens from the input and predicts the original vocabulary id of the masked word based only on its context. Similarly to GloVe, we extract the token embeddings of the last layer and compute their mean vector to obtain the sentence-level representation. The BERT model used in our tests is BERT-Large Uncased (24 layers, 1024 hidden units, 16 attention heads, 340M parameters).

[5] https://github.com/google-research/bert

4.4 USE

The Universal Sentence Encoder (USE) (Cer et al., 2018) is a model for encoding sentences into embedding vectors, specifically designed for transfer learning in NLP. Based on a deep averaging network encoder, the model is trained on a variety of text lengths, such as sentences, phrases or short paragraphs, and on a variety of semantic tasks, including STS. The encoder returns the corresponding vector of the sentence, and we compute similarity using the cosine formula.
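The paper uses the original BERT-Large Uncased checkpoint; the sketch below reproduces the mean-of-last-layer pooling described in Section 4.3 with the HuggingFace transformers library, which is our substitution for the original TensorFlow code and may differ in details (for instance, the paper does not specify how the [CLS]/[SEP] tokens were handled).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased")
model.eval()


def bert_sentence_vector(sentence):
    """Mean of the last-layer token embeddings (1024 dimensions for BERT-Large)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: (1, seq_len, 1024); average over the token axis,
    # here including the special tokens.
    return outputs.last_hidden_state[0].mean(dim=0).numpy()
```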
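For the USE representation of Section 4.4, a comparable sketch with TensorFlow Hub could look as follows; the exact module version used by the authors is not stated, so the URL below is an assumption.

```python
import tensorflow_hub as hub

# Deep-averaging-network USE module from TF-Hub; assumed version.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")


def use_sentence_vectors(sentences):
    """Encode a list of sentences into 512-dimensional USE vectors."""
    return embed(sentences).numpy()
```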
5 Datasets

In this section, we briefly introduce the resources used to collect sentence pairs for our similarity test. Figure 1 shows some examples of the data, aligned by action concept.

5.1 IMAGACT

IMAGACT[6] (Moneglia et al., 2014) is a multilingual and multimodal ontology of action that provides a video-based translation and disambiguation framework for action verbs. The resource is built on an ontology containing a fine-grained categorization of action concepts (ACs), each represented by one or more visual prototypes in the form of recorded videos and 3D animations. IMAGACT currently contains 1,010 scenes, which encompass the actions most commonly referred to in everyday language usage.

[6] http://www.imagact.it

Verbs from different languages are linked to ACs on the basis of competence-based annotation by mother-tongue informants. All the verbs that productively predicate the action depicted in an AC video are in a local equivalence relation (Panunzi et al., 2018b), i.e. the property that different verbs (even with different meanings) can refer to the same action concept. Moreover, each AC is linked to a short synthetic caption (e.g. John pushes the button) for each locally equivalent verb in every language. These captions are formally defined, thus they only contain the minimum arguments needed to express an action.

We exploited the IMAGACT conceptualization because of its action-centric approach. In fact, compared to other linguistic resources, e.g. WordNet (Fellbaum, 1998), BabelNet (Navigli and Ponzetto, 2012) and VerbNet (Schuler, 2006), IMAGACT focuses on actions and represents them as visual concepts. Even if IMAGACT is a smaller resource, its action conceptualization is more fine-grained. Other resources have broader scopes, and for this reason senses referring to actions are often vague and overlapping (Panunzi et al., 2018a), i.e. all possible actions can be gathered under one synset. For instance, if we look at the senses of push in WordNet, we find that only 4 out of 10 synsets refer to concrete actions, and some of the glosses are not really exhaustive and can be applied to a wide set of different actions:

• push, force (move with force);
• push (press against forcefully without moving);
• push (move strenuously and with effort);
• press, push (make strenuous pushing movements during birth to expel the baby).

In such a categorization framework, all possible actions referred to by push can be gathered under the first synset, except for those specifically described by the other three.

For the experiments proposed in this paper, only the English captions have been used, in order to test our method in a monolingual scenario.

5.2 LSMDC16

The Large Scale Movie Description Challenge dataset[7] (LSMDC16) (Rohrbach et al., 2017) consists of a parallel corpus of 128,118 sentences obtained from audio descriptions for visually impaired people and from scripts, aligned to video clips from 200 movies. The dataset derives from the merging of two previously independent datasets, MPII-MD (Rohrbach et al., 2015) and M-VAD (Torabi et al., 2015). The language used in audio descriptions is particularly rich in references to physical actions, with respect to reference corpora (e.g. the BNC corpus) (Salway, 2007). For this reason, the LSMDC16 dataset can be considered a good source of video-caption pairs of action examples, comparable to the data in the IMAGACT resource.

[7] https://sites.google.com/site/describingmovies/home

Figure 1: An example of the aligned representation of action concepts in the two resources. On the left, action concepts with prototype videos and captions for all applicable verbs in IMAGACT; on the right, the video-caption pairs in LSMDC16, classified according to the depicted and described action.
6 Experiments

Given that the objective is not to discriminate distant actions (e.g. opening a door vs. taking a cup) but rather to distinguish actions referred to by the same verb or set of verbs, the experiments described herein have been conducted on a subset of the LSMDC16 dataset, which has been manually annotated with the corresponding ACs from IMAGACT. The annotation has been carried out by one expert annotator, trained on the IMAGACT conceptualization framework, and revised by a supervisor. In this way, we created a Gold Standard for the evaluation of the compared systems.

6.1 Gold Standard

The Gold Standard test set (GS) has been created by selecting one starting verb: push. This verb has been chosen because, as a general action verb, it is highly frequent in use, it applies to a high number of ACs in the IMAGACT Ontology (25 ACs), and it has a high number of occurrences in both IMAGACT and LSMDC16.

From the IMAGACT Ontology, all the verbs in a relation of local equivalence with push in each of its ACs have been queried[8], i.e. all the verbs that predicate at least one of the ACs linked to push. Then, all the captions in LSMDC16 containing one of those verbs have been manually annotated with the id of the corresponding AC. In total, 377 video-caption pairs have been annotated[9] with 18 ACs, and they have been paired with the 38 IMAGACT captions for the verbs linked to the same ACs, resulting in a total of 14,440 similarity judgements.

[8] The verbs collected for this experiment are: push, insert, press, ram, nudge, compress, squeeze, wheel, throw, shove, flatten, put, move. Move and put have been excluded from this list, because these verbs are too general and apply to a wide set of ACs, with the risk of introducing more noise in the computation of the similarity; flatten is connected to an AC that has no examples in LSMDC16, so it has been excluded too.
[9] Pairs with no action in the video, or pairs with a novel or difficult-to-assign AC, have been excluded from the test.

It is important to highlight that the manual annotation took into account the visual information conveyed together with the captions (i.e. the videos from both resources), which made it possible to precisely assign the most applicable AC to the LSMDC16 captions.

6.2 Pre-processing of the data

As stated in the introduction, STS methods are normally tested on data within the same domain. In an attempt to level out some differences between IMAGACT and LSMDC16, basic pre-processing has been applied.

The length of captions in the two resources varies: captions in IMAGACT are artificial, and they only contain the minimum syntactic/semantic elements needed to describe the AC; captions in LSMDC16 are transcriptions of more natural spoken language, and usually convey information on more than one action at the same time. For this reason, LSMDC16 captions have been split into shorter and simpler sentences. To do that, we parsed the original captions with StanfordNLP (Qi et al., 2018), and rewrote simplified sentences by collecting all the words in a dependency relation with the targeted verbs. Table 1 shows an example of the splitting process.

Table 1: Example of the split text after processing the output of the dependency parser. From the original caption (FULL) we obtain three sub-captions (SPLIT); only the one containing the target verb is used (✓), and the rest is ignored (✗).

FULL   As he crashes onto the platform, someone hauls him to his feet and pushes him back towards someone.
SPLIT  he crashes onto the platform and     ✗
       As someone hauls him to his feet     ✗
       pushes him back towards someone      ✓

The LSMDC16 dataset is anonymised, i.e. the pronoun someone is used in place of all proper names; on the contrary, captions in IMAGACT always have a proper name (e.g. John, Mary). We automatically substituted IMAGACT proper names with someone, to match LSMDC16.

Finally, we also removed stop-words, which are often the first lexical elements to be pruned out of texts prior to any computation, because they do not convey semantic information and sometimes introduce noise in the process. Stop-word removal has been applied at the moment of computing the similarity between caption pairs, i.e. tokens corresponding to stop-words have been used for the representation by the contextual models, but then discarded when computing the sentence representation.

With these pre-processing operations, we obtained 4 variants of the testing data (the splitting and the variant construction are sketched after this list):

• plain (LSMDC16 splitting only);
• anonIM (anonymisation of IMAGACT captions by substitution of proper names with someone);
• noSW (stop-word removal from both resources);
• anonIM+noSW (combination of the two previous ones).
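A rough sketch of the splitting step, using the stanza library as a stand-in for the StanfordNLP package cited in the paper; keeping the whole dependency subtree of the target verb is our simplification of the authors' procedure.

```python
import stanza

# stanza is the successor of the StanfordNLP package cited in the paper,
# so this pipeline is an approximation of the original setup.
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")


def split_on_verb(caption, target_lemma):
    """Keep only the words that depend (directly or transitively) on the
    target verb, in their original order; return None if the verb is absent."""
    sentence = nlp(caption).sentences[0]   # simplification: first sentence only
    words = sentence.words
    keep = {w.id for w in words if w.lemma == target_lemma and w.upos == "VERB"}
    if not keep:
        return None
    changed = True
    while changed:                         # collect the verb's dependency subtree
        changed = False
        for w in words:
            if w.head in keep and w.id not in keep:
                keep.add(w.id)
                changed = True
    return " ".join(w.text for w in words if w.id in keep)
```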
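The anonIM and noSW variants can be pictured with the following sketch; the proper-name set and the stop-word list are placeholders, not the actual lists used in the paper.

```python
# Placeholder lists: the actual name and stop-word inventories used in the
# paper are not given, so these are illustrative only.
IMAGACT_NAMES = {"John", "Mary"}
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "on", "his", "her", "him"}


def anonymise(caption):
    """anonIM variant: replace IMAGACT proper names with 'someone'."""
    return " ".join("someone" if tok in IMAGACT_NAMES else tok
                    for tok in caption.split())


def content_tokens(caption):
    """noSW variant: tokens kept when pooling the sentence representation.
    With contextual models the full caption is still fed to the encoder;
    stop-word positions are only dropped at pooling time."""
    return [tok for tok in caption.split() if tok.lower() not in STOP_WORDS]
```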
7 Results

To benchmark the performance of the four models, we also defined a baseline that, following a binomial distribution, randomly assigns an AC of the GS test set (in practice, the baseline is calculated analytically, without simulations). The parameters of the binomial are estimated from the GS test set. Table 2 shows the results at different recall@k (i.e. the ratio of examples containing the correct label in the top k answers) for the models tested.

Table 2: STS results for the models tested on the IMAGACT-LSMDC scenario.

Model             Pre-processing   recall@1  recall@3  recall@5  recall@10
One-hot encoding  plain            0.195     0.379     0.484     0.655
                  noSW             0.139     0.271     0.411     0.687
                  anonIM           0.197     0.400     0.482     0.624
                  anonIM+noSW      0.155     0.329     0.453     0.650
GloVe             plain            0.213     0.392     0.553     0.818
                  noSW             0.182     0.408     0.505     0.755
                  anonIM           0.218     0.453     0.568     0.774
                  anonIM+noSW      0.279     0.453     0.553     0.761
BERT              plain            0.245     0.439     0.539     0.632
                  noSW             0.247     0.484     0.558     0.679
                  anonIM           0.239     0.434     0.529     0.645
                  anonIM+noSW      0.200     0.384     0.526     0.668
USE               plain            0.213     0.403     0.492     0.616
                  noSW             0.171     0.376     0.461     0.563
                  anonIM           0.239     0.471     0.561     0.666
                  anonIM+noSW      0.179     0.426     0.518     0.637
Random baseline                    0.120     0.309     0.447     0.658

All models show slightly better results than the baseline, but not by much. Regarding pre-processing, no strategy (noSW, anonIM, anonIM+noSW) seems to make a difference. We were expecting low results, given the difficulty of the task: without taking the visual information into account, most of those caption pairs are ambiguous even for a human annotator.

Surprisingly, the GloVe model, the only one with static pre-trained embeddings based on statistical distribution, outperforms the baseline and the contextual models by ∼0.2 in recall@10. It is not an exciting result, but it shows that STS with pre-trained word embeddings might be effective for speeding up manual annotation tasks, at negligible computational cost. One reason that could explain the lower results obtained by the contextual models (BERT, USE) is that these systems have been penalized by the splitting process applied to LSMDC16 captions. The example in Table 1 shows a good splitting result, but processing some other captions leads to less natural sentence splittings, and this might influence the global result.

We ran similar experiments on the publicly available STS-benchmark dataset[10] (Cer et al., 2017), in order to see whether the models show similar behaviour when benchmarked on a more conventional scenario. The task is similar to the one presented herein: it consists in the assessment of pairs of sentences according to their degree of semantic similarity. In this task, models are evaluated by the Pearson correlation of machine scores with human judgments. Table 3 shows the expected results: the contextual models consistently outperform the GloVe-based model, and USE outperforms the rest by a large margin (about 20-30 points better overall). This confirms that model performance is task-dependent, and that results obtained in non-conventional scenarios can be counter-intuitive if compared to results obtained in conventional ones.

Table 3: Results on the STS-benchmark.

Model   Pre-processing   Pearson
GloVe   plain            0.336
BERT    plain            0.470
USE     plain            0.702

[10] http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
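For clarity, recall@k as reported in Table 2 can be computed as in the following sketch, assuming that for every annotated LSMDC16 caption we keep its gold AC and the list of AC labels of the IMAGACT captions ranked by the model (variable names are illustrative).

```python
def recall_at_k(gold_labels, ranked_label_lists, k):
    """Ratio of test captions whose gold AC appears among the AC labels of
    the top-k IMAGACT captions ranked by the model."""
    hits = sum(1 for gold, ranked in zip(gold_labels, ranked_label_lists)
               if gold in ranked[:k])
    return hits / len(gold_labels)


# e.g. recall_at_k(gold, rankings, k=10) reproduces the recall@10 column.
```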
8 Conclusions and Future Work

In this paper we presented a comparison of four popular representation models (one-hot encoding, GloVe, BERT, USE) on the task of semantic textual similarity in a non-conventional scenario: action reference similarity between sentences from different domains.

In the future, we would like to extend our Gold Standard dataset, not only in terms of size (i.e. more LSMDC16 video-caption pairs annotated with ACs from IMAGACT), but also in terms of annotators. It would be interesting to observe to what extent the visual stimuli offered by the video prototypes can be interpreted consistently by more than one annotator, and thus to calculate the inter-annotator agreement. Moreover, we plan to extend the evaluation to other representation models as well as to state-of-the-art supervised models, and see whether their performance in canonical tests is confirmed in our scenario. We would also like to augment the data used for this test by exploiting dense video captioning models, e.g. VideoBERT (Sun et al., 2019).

Acknowledgements

This research was partially supported by the Spanish MINECO (DeepReading RTI2018-096846-B-C21 (MCIU/AEI/FEDER, UE)), the ERA-Net CHISTERA LIHLITH project funded by the Agencia Estatal de Investigación (AEI, Spain), projects PCIN-2017-118/AEI and PCIN-2017-085/AEI, the Basque Government (excellence research group, IT1343-19), and the NVIDIA GPU grant program.

References

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 Task 6: A pilot on semantic textual similarity. In *SEM 2012 - 1st Joint Conference on Lexical and Computational Semantics, pages 385-393.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5(1):135-146.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1-14, Vancouver, Canada. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.

Massimo Moneglia and Alessandro Panunzi. 2007. Action predicates and the ontology of action across spoken language corpora. In Proceedings of the International Workshop on the Semantic Representation of Spoken Language (SRSL 2007), pages 51-58, Salamanca.

Massimo Moneglia, Susan Brown, Francesca Frontini, Gloria Gagliardi, Fahad Khan, Monica Monachini, and Alessandro Panunzi. 2014. The IMAGACT visual ontology. An extendable multilingual infrastructure for the representation of lexical encoding of action. In Proceedings of LREC, pages 3425-3432.

Massimo Moneglia. 2014. The variation of action verbs in multilingual spontaneous speech corpora. Spoken Corpora and Linguistic Studies, 61:152.
Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217-250.

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2017. Unsupervised learning of sentence embeddings using compositional n-gram features. CoRR, abs/1703.02507.

Alessandro Panunzi, Lorenzo Gregori, and Andrea Amelio Ravelli. 2018a. One event, many representations. Mapping action concepts through visual features. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Alessandro Panunzi, Massimo Moneglia, and Lorenzo Gregori. 2018b. Action identification and local equivalence of action verbs: the annotation framework of the IMAGACT ontology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227-2237.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. CoNLL 2018 Shared Task.

Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision, 123(1):94-120.

Andrew Salway. 2007. A corpus-based analysis of audio description. In Media for All, pages 151-174. Leiden.

Karin Kipper Schuler. 2006. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. CoRR, abs/1904.01766.

Atousa Torabi, Christopher J. Pal, Hugo Larochelle, and Aaron C. Courville. 2015. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR.