Organizing the ASSIN 2 Shared Task

Livy Real1, Erick Fonseca2, and Hugo Gonçalo Oliveira3 [0000-0002-5779-8645]

1 B2W Digital / Grupo de Linguística Computacional – University of São Paulo, livyreal@gmail.com
2 Instituto de Telecomunicações, Lisboa, Portugal, erick.fonseca@lx.it.pt
3 CISUC, University of Coimbra, Portugal, hroliv@dei.uc.pt

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. We describe ASSIN 2, the second edition of a task on the evaluation of Semantic Textual Similarity (STS) and Textual Entailment (RTE) in Portuguese. The ASSIN 2 task uses as its dataset a collection of sentence pairs annotated with human judgments for textual entailment and semantic similarity. Interested teams could participate in either of the tasks (STS or RTE) or in both. Nine teams participated in STS and eight in RTE. A workshop on this task was collocated with STIL 2019, in Salvador, Brazil. This paper describes the ASSIN 2 task and gives an overview of the participating systems.

Keywords: Shared Task · Semantic Textual Similarity · Recognizing Textual Entailment · Natural Language Inference · Portuguese

1 Introduction

ASSIN4 stands for Avaliação de Similaridade Semântica e INferência Textual (Evaluating Semantic Similarity and Textual Entailment) and is an evaluation shared task in the scope of the computational processing of Portuguese. As in the first ASSIN (hereafter, ASSIN 1), ASSIN 2, the second edition of this task, consisted of two different tasks: Recognizing Textual Entailment (RTE), also known as Natural Language Inference (NLI), and Semantic Textual Similarity (STS).

4 Assim in Portuguese means 'in the same way', so arguably an adequate name for a similarity task.

Following ASSIN 1 [11], ASSIN 2 offered the interested community a new benchmark for computational semantic tasks in Portuguese, thus advancing the state of the art. The shared task was collocated with the Symposium in Information and Human Language Technology (STIL) in Salvador, BA, Brazil, with a workshop held on October 15th, 2019. A short paper on ASSIN 2 was published in the Proceedings of the 14th International Conference on the Computational Processing of Portuguese [24].

Briefly, as defined in a SemEval 2012 task [1] on the topic, Semantic Textual Similarity (STS) 'measures the degree of semantic equivalence between two sentences'. Given a set of pairs of textual fragments, often sentences, this kind of task asks for the assignment of a score for their similarity. The similarity scale adopted in ASSIN 2 ranges between 1 and 5, with 1 meaning that the sentences are totally different and 5 that they have virtually the same meaning. The pair O cachorro está pegando uma bola azul / Uma bola azul está sendo pega pelo cachorro5 is an example of a pair scored with 5, while A menina está andando de cavalo / O menino está borrifando as plantas com água6 is scored 1.

Recognizing Textual Entailment (RTE), or Natural Language Inference (NLI), is the task of predicting whether a given text entails another (i.e., whether a premise implies a hypothesis). The entailment relation holds when, from the premise [A], we can infer that another sentence [B] is also true; that is, from [A] we can conclude [B]. For the pair [A] Um macaco está provocando um cachorro no zoológico / [B] Um cachorro está sendo provocado por um macaco no zoológico7, we say that A entails B.
In contrast, for the pair [A] Um grupo de meninos em um quintal está brincando e um homem está de pé ao fundo / [B] Os meninos jovens estão brincando ao ar livre e o homem está sorrindo por perto8, there is no entailment relation from A to B9.

We follow the tradition of shared tasks for RTE that can be traced back to 2005, with the first PASCAL Challenge [7] targeting RTE in a corpus of 1,367 pairs annotated for entailment and non-entailment relations. Back then, the best teams (from MITRE and Bar Ilan) achieved an accuracy of 0.586. In the following PASCAL Challenges, different corpora and task designs were tried: paragraphs were used instead of short sentences (Challenge 3 [12]); contradictions were added to the data (Extended Challenge 3 [27]); non-aligned texts were given to the participants (Challenges 6 and 7); and, more recently, the task was presented as multilingual [22,23].

Regarding STS, shared tasks for English go back to SemEval 2012 [1]. More recently, in 2017 [5], Arabic and Spanish were also included. SemEval 2014 also included a related task that put together both semantic relatedness and textual entailment [19], after which we modeled our dataset. For both STS and RTE, that task used the SICK corpus ('Sentences Involving Compositional Knowledge') as its data source, the first corpus in the order of 10,000 sentence pairs annotated for inference. In 2015, SNLI, a corpus with more than 500,000 human-written English sentences annotated for NLI, was released [3] and, in 2017, RepEval [21] included the MultiNLI corpus, with more than 430,000 pairs annotated for NLI, covering different textual genres.

5 The dog is catching a blue ball / A blue ball is being caught by the dog.
6 The girl is riding the horse / The boy is spraying the plants with water.
7 A monkey is teasing a dog at the zoo / A dog is being teased by a monkey at the zoo.
8 A group of boys in a backyard are playing and a man is standing in the background / Young boys are playing outdoors and the man is smiling nearby.
9 One could possibly think there is an entailment relation between these sentences, since 'meninos' (boys) are always 'meninos jovens' (young boys) and the man standing would probably also be smiling. Since it is equally possible that the man nearby is not smiling, this pair is considered a non-entailment: it is possible that the two scenes happen at the same time, but it is not necessary.

When it comes to the processing of Portuguese, data availability and shared tasks for semantic processing are only starting to become popular. In 2016, ASSIN [11] was the first shared task for Portuguese STS and RTE. Its dataset included 10,000 pairs of annotated sentences, half in European Portuguese and half in Brazilian Portuguese. ASSIN 2 follows the goal of ASSIN by offering a new computational semantic benchmark to the community interested in the computational processing of Portuguese.

2 Task and Data Design

When designing ASSIN 2, we considered the previous experience of ASSIN and made some changes towards an improved task. This section describes the data used in the ASSIN 2 collection, its annotation process, the decisions taken and the main differences from ASSIN 1. It ends with a brief schedule of ASSIN 2.

2.1 Data Source

The ASSIN 1 dataset is based on news and imposes several linguistic challenges, such as temporal expressions and reported speech.
Following considerations of the ASSIN 1 organization [11], we opted to have a corpus specifically created for the tasks, as SNLI and MultiNLI, and containing only simple facts, as SICK. Therefore, the ASSIN 2 data was based on SICK-BR [25], a translation and manual adaptation of SICK [19], the corpus used in SemEval 2014, Task 1. SICK is known to be based on captions of pictures and to contain few complex linguistic phenomena, which perfectly suits our purposes. Since the ASSIN 2 collection is built upon SICK-BR, it only contains the Brazilian variant of Portuguese.

2.2 Data Balancing

Another goal considered in the data design was to have a corpus balanced in terms of RTE labels. Both the ASSIN and the SICK-BR data are unbalanced, offering many more neutral pairs than entailments. Even if this is more representative of actual language use, it is undesirable for machine learning techniques. Since SICK-BR has fewer than 25% entailment pairs, we had to create and annotate more of them. To create such pairs we followed a semi-automated strategy, starting from entailment pairs of SICK-BR and replacing words with synonyms or removing adverbial or adjectival phrases. All generated pairs were manually revised. We also manually created pairs hoping they would be annotated as entailments, but trying as much as possible not to introduce annotation artifacts [14].

2.3 Data Annotation

All the annotators involved in the ASSIN 2 annotation task have linguistic training, being professors, linguistics students or computational linguists. We would like to express the deepest appreciation to Alissa Canesim (Universidade Estadual de Ponta Grossa), Amanda Oliveira (Stilingue), Ana Cláudia Zandavalle (Stilingue), Beatriz Mastrantonio (Stilingue – Universidade Federal de Ouro Preto), Carolina Gadelha (Stilingue), Denis Owa (Pontifícia Universidade Católica de São Paulo), Evandro Fonseca (Stilingue), Marcos Zandonai (Universidade do Vale do Rio dos Sinos), Renata Ramisch (Núcleo Interinstitucional de Linguística Computacional – Universidade Federal de São Carlos) and Talia Machado (Universidade Estadual de Ponta Grossa) for taking part in this task with the single purpose of producing open resources for the community interested in the computational processing of Portuguese. We also thank the Group of Computational Linguistics from the University of São Paulo for making SICK-BR available, which served as the base for the ASSIN 2 corpus.

All the pairs created for ASSIN 2 were annotated by at least four native speakers of Brazilian Portuguese. The annotation task was conducted using an online tool prepared for the RTE and STS tasks, the same as in ASSIN 1. For the RTE task, only pairs annotated the same way by the majority of the annotators were actually used in the dataset. This means that at least three of four annotators agreed on every RTE label present in the ASSIN 2 collection. For the STS task, the label is the average of the scores given by all the annotators. The final result was a dataset with about 10,000 sentence pairs: 6,500 used for training, 500 for validation, and 2,448 for test, now available at https://sites.google.com/view/assin2/.
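As a rough illustration of this aggregation, the sketch below (with hypothetical variable names; not the organizers' actual scripts) keeps a pair only when at least three of the four annotators agree on the RTE label and averages the STS scores.

```python
from collections import Counter
from statistics import mean

def aggregate_pair(rte_votes, sts_scores, min_agreement=3):
    """Aggregate the individual annotations of one sentence pair.

    rte_votes: list of RTE labels, e.g. ["Entailment", "Entailment", "None", "Entailment"]
    sts_scores: list of similarity scores in [1, 5]
    Returns (rte_label, sts_score), or None if the pair should be discarded.
    """
    label, count = Counter(rte_votes).most_common(1)[0]
    if count < min_agreement:
        return None  # insufficient agreement: pair left out of the collection
    return label, round(mean(sts_scores), 1)

# Example: three of four annotators agree, so the pair is kept.
print(aggregate_pair(["Entailment", "Entailment", "None", "Entailment"],
                     [5.0, 4.5, 4.0, 4.5]))  # ('Entailment', 4.5)
```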
Since we wanted to have a balanced corpus and a sound annotation strategy, we opted for having only two RTE labels: entailment and non-entailment. Differently from ASSIN 1, we did not use the paraphrase label, because a paraphrase happens when there is a double entailment, which makes a third label for double entailments somewhat unnecessary. This was further motivated by the results of ASSIN 1, where systems had much difficulty outperforming the proposed baselines, which were the same as in ASSIN 2. For example, no participant run did better than the RTE baseline in Brazilian Portuguese. Thus, we decided to pursue a new task design having in mind its utility to the community.

In fact, our original intent was to follow a tradition in inference that pays as much attention to contradictions as to entailments, as in Zaenen et al. [28] and de Marneffe et al. [20], as well as most recent datasets. However, having a soundly annotated corpus for contradictions is not a trivial task. Firstly, defining contradictions and having functional guidelines for the phenomenon is a task on its own. While recent datasets aim to capture a "human" perspective of the phenomenon [4], semanticists and logicians have already pointed out that this lay perspective on contradictions can lead to much noise in inference annotation10, especially in the annotation of contradictions. For example, the work of Kalouli et al. [16] shows that almost 50% of the contradictions in the SICK dataset, around 15% of all the pairs, do not follow the basic 'logical' assumption that, if the premise (sentence A) contradicts the hypothesis (sentence B), the hypothesis (B) must also contradict the premise (A). After all, contradictions should be symmetric. Secondly, considering that we used SICK-BR as the base of our dataset, we would have needed to correct all the contradictions that were already in SICK, following Kalouli et al. [16], who find many inconsistencies in the annotation of contradictions. Another reason for excluding contradictions from ASSIN 2 is that we would not have a corpus balanced among the labels, since SICK (and SICK-BR) has fewer than 1,500 contradictions in a corpus of 10,000 pairs.

10 See the Crouch et al. vs. Manning controversy for details on this point [28,18,6].

Table 1. Examples of ASSIN 2 data

Premise | Hypothesis | RTE | STS
O cachorro castanho está correndo na neve | Um cachorro castanho está correndo na neve | Entails | 5
Alguns animais estão brincando selvagemente na água | Alguns animais estão brincando na água | Entails | 4.4
Dois meninos jovens estão olhando para a câmera e um está pondo sua língua para fora | Duas jovens meninas estão olhando uma câmera e uma está com a língua para fora | None | 3.7
A menina jovem está soprando uma bolha que é grande | Não tem nenhuma menina de rosa girando uma fita | None | 2.1
Um avião está voando | Um cachorro está latindo | None | 1

Table 1 illustrates the dataset with five annotated pairs. The ASSIN 2 collection is distributed in the same XML format adopted in ASSIN 1. Sentence pairs are marked by the <pair> element, which includes the elements <t> and <h>, respectively for the first and the second sentence. For illustrative purposes, Figure 1 represents the first example in Table 1.

    <pair entailment="Entailment" similarity="5">
      <t>O cachorro castanho está correndo na neve</t>
      <h>Um cachorro castanho está correndo na neve</h>
    </pair>

Fig. 1. Data format of the ASSIN 2 collection.
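For illustration only, the following minimal sketch reads this format with Python's standard library. It assumes the element and attribute names shown in Figure 1 (entailment and similarity on <pair>) and is not the official loader distributed with the task.

```python
import xml.etree.ElementTree as ET

def load_assin(path):
    """Read an ASSIN-style XML file into a list of (premise, hypothesis, label, score)."""
    pairs = []
    for pair in ET.parse(path).getroot().iter("pair"):
        premise = pair.find("t").text          # first sentence of the pair
        hypothesis = pair.find("h").text       # second sentence of the pair
        label = pair.get("entailment")         # "Entailment" or "None" (assumed attribute name)
        score = float(pair.get("similarity"))  # gold STS score, 1-5 (assumed attribute name)
        pairs.append((premise, hypothesis, label, score))
    return pairs

# Usage (hypothetical file name):
# for t, h, label, score in load_assin("assin2-train.xml"):
#     ...
```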
The deadline for result submission was 10 days later, on 26th September, and the official results were announced a few days after this. A physical workshop where ASSIN 2 was presented, as well as some partici- pations, was held on 15th October 2019, in Salvador, Brazil, collocated with the STIL 2019 symposium11 . On 2nd March 2020, a second opportunity was given to participants to present their work in the POP2 workshop, in Évora, Portugal, collocated with the PROPOR 2020 conference12 . 2.5 Metrics As it happened in ASSIN 1 and in other shared tasks with the same goal, systems’ performance on the RTE task was measured with the macro F1 of precision and recall as the main metric. For STS, performance was measured with the Pearson correlation index (⇢) between the gold and the submitted scores, with Mean Squared Error (MSE) computed as a secondary metric. The evaluation scripts can be found at https://github.com/erickrf/assin. 3 Participants and Results ASSIN 2 had a total of nine participating teams, five from Portugal and four from Brazil, namely: – CISUC-ASAPPj (Portugal) – CISUC-ASAPPy (Portugal) – Deep Learning Brasil (Brazil) – IPR (Portugal) – L2F/INESC (Portugal) – LIACC (Portugal) – NILC (Brazil) – PUCPR (Brazil) – Stilingue (Brazil) Each team could submit up to three runs and participate in both STS and RTE, or in only one of them. Moreover, each team could participate without attending the workshop venue, held in Salvador. We believe this was an im- portant point for increasing participation, because travelling expenses can be high, especially for those that were coming from Europe. The main drawback was that only four teams actually presented their approaches in the ASSIN 2 workshop, namely CISUC-ASAPP, CISUC-ASAPPy, Deep Learning Brasil and Stilingue. On the other hand, a total of six teams submitted a paper describing their participation, to be included in this volume. 6 STS RTE Team Run ⇢ MSE F1 Accuracy 1 0.642 0.63 0.560 58.91 CISUC-ASAPPj 2 0.652 0.61 0.606 62.05 3 0.616 0.82 0.576 59.76 1 0.726 0.58 0.401 53.10 CISUC-ASAPPpy 2 0.730 0.58 0.656 66.67 3 0.740 0.60 0.649 65.52 1 0.751 1.20 0.816 81.90 Deep Learning Brasil 2 0.785 0.59 0.883 88.32 3 0.657 0.65 0.333 50.00 1 0.826 0.52 0.876 87.58 IPR 2 -0.037 15.48 0.873 87.38 3 0.809 0.62 0.87 87.01 1 0.771 0.54 0.775 77.66 L2F/INESC 2 0.778 0.52 0.784 78.47 3 0.751 1.20 0.816 81.90 1 0.493 1.08 0.77 77.41 LIACC 2 0.459 1.02 0.72 73.20 3 0.458 1.04 0.733 74.31 1 0.729 0.64 0.871 87.17 NILC 2 0.729 0.64 0.868 86.85 3 0.729 0.64 0.865 86.56 1 0.643 0.90 N/A N/A PUCPR 2 0.678 0.85 N/A N/A 3 0.646 0.92 N/A N/A 1 0.748 0.53 0.788 78.84 Stilingue 2 0.800 0.39 0.866 86.64 3 0.817 0.47 0.866 86.64 WordOverlap (baseline) – 0.577 0.75 0.667 66.71 BoW sentence 2 (baseline) – 0.175 1.15 0.557 56.74 Infernal (baseline) – N/A N/A 0.742 74.18 Table 2. Results of each submitted run and baselines. 3.1 Results The results of the runs submitted by each team in the STS and RTE tasks are shown in Table 2, together with three baselines. Considering the Pearson correlation (⇢), the best result in STS (0.826) was achieved by the first run submitted by the IPR team, although the best MSE 11 http://comissoes.sbc.org.br/ce-pln/stil2019/ 12 https://sites.google.com/view/pop2-propor2020 7 was by the second run of Stilingue (0.39). We highlight that these were the only teams with ⇢ higher than 0.8. Although Pearson ⇢ was used as the main metric, this metric and the MSE are two di↵erent ways of analysing the results. 
3 Participants and Results

ASSIN 2 had a total of nine participating teams, five from Portugal and four from Brazil, namely:

– CISUC-ASAPPj (Portugal)
– CISUC-ASAPPpy (Portugal)
– Deep Learning Brasil (Brazil)
– IPR (Portugal)
– L2F/INESC (Portugal)
– LIACC (Portugal)
– NILC (Brazil)
– PUCPR (Brazil)
– Stilingue (Brazil)

Each team could submit up to three runs and participate in both STS and RTE, or in only one of them. Moreover, teams could participate without attending the workshop venue, held in Salvador. We believe this was an important point for increasing participation, because travelling expenses can be high, especially for those coming from Europe. The main drawback was that only four teams actually presented their approaches in the ASSIN 2 workshop, namely CISUC-ASAPPj, CISUC-ASAPPpy, Deep Learning Brasil and Stilingue. On the other hand, a total of six teams submitted a paper describing their participation, to be included in this volume.

Table 2. Results of each submitted run and baselines.

Team | Run | STS ρ | STS MSE | RTE F1 | RTE Accuracy
CISUC-ASAPPj | 1 | 0.642 | 0.63 | 0.560 | 58.91
CISUC-ASAPPj | 2 | 0.652 | 0.61 | 0.606 | 62.05
CISUC-ASAPPj | 3 | 0.616 | 0.82 | 0.576 | 59.76
CISUC-ASAPPpy | 1 | 0.726 | 0.58 | 0.401 | 53.10
CISUC-ASAPPpy | 2 | 0.730 | 0.58 | 0.656 | 66.67
CISUC-ASAPPpy | 3 | 0.740 | 0.60 | 0.649 | 65.52
Deep Learning Brasil | 1 | 0.751 | 1.20 | 0.816 | 81.90
Deep Learning Brasil | 2 | 0.785 | 0.59 | 0.883 | 88.32
Deep Learning Brasil | 3 | 0.657 | 0.65 | 0.333 | 50.00
IPR | 1 | 0.826 | 0.52 | 0.876 | 87.58
IPR | 2 | -0.037 | 15.48 | 0.873 | 87.38
IPR | 3 | 0.809 | 0.62 | 0.87 | 87.01
L2F/INESC | 1 | 0.771 | 0.54 | 0.775 | 77.66
L2F/INESC | 2 | 0.778 | 0.52 | 0.784 | 78.47
L2F/INESC | 3 | 0.751 | 1.20 | 0.816 | 81.90
LIACC | 1 | 0.493 | 1.08 | 0.77 | 77.41
LIACC | 2 | 0.459 | 1.02 | 0.72 | 73.20
LIACC | 3 | 0.458 | 1.04 | 0.733 | 74.31
NILC | 1 | 0.729 | 0.64 | 0.871 | 87.17
NILC | 2 | 0.729 | 0.64 | 0.868 | 86.85
NILC | 3 | 0.729 | 0.64 | 0.865 | 86.56
PUCPR | 1 | 0.643 | 0.90 | N/A | N/A
PUCPR | 2 | 0.678 | 0.85 | N/A | N/A
PUCPR | 3 | 0.646 | 0.92 | N/A | N/A
Stilingue | 1 | 0.748 | 0.53 | 0.788 | 78.84
Stilingue | 2 | 0.800 | 0.39 | 0.866 | 86.64
Stilingue | 3 | 0.817 | 0.47 | 0.866 | 86.64
WordOverlap (baseline) | – | 0.577 | 0.75 | 0.667 | 66.71
BoW sentence 2 (baseline) | – | 0.175 | 1.15 | 0.557 | 56.74
Infernal (baseline) | – | N/A | N/A | 0.742 | 74.18

3.1 Results

The results of the runs submitted by each team in the STS and RTE tasks are shown in Table 2, together with three baselines.

Considering the Pearson correlation (ρ), the best result in STS (0.826) was achieved by the first run submitted by the IPR team, although the best MSE was obtained by the second run of Stilingue (0.39). We highlight that these were the only teams with ρ higher than 0.8. Although Pearson ρ was used as the main metric, ρ and the MSE offer two different views on the results: a high ρ means that the ranking of the most similar pairs is closer to the one in the gold standard, while a low MSE means that the similarity scores themselves are closer to the gold ones. Both the best MSE and the best values of ρ are significantly better than the best results achieved in ASSIN 1, both in the official evaluation (ρ=0.73 [9]) and in post-evaluation experiments (ρ=0.75 [2]).

On RTE, Deep Learning Brasil had the best run (its second), considering both F1 and accuracy, though not very far from IPR, Stilingue and NILC. Again, the values achieved are higher than the best official results in ASSIN 1.

The globally higher performances suggest that, when compared to ASSIN 1, ASSIN 2 was an easier task. This might indeed be true, especially considering that, for RTE, the ASSIN 2 collection only used two labels, with paraphrases labelled as entailment and thus not "competing". The ASSIN 2 data was also designed to be easier and to avoid complex linguistic phenomena. Another point to keep in mind when comparing ASSIN 1 and ASSIN 2 is that, in this edition, competitors had access to a balanced corpus. This might also have contributed to the better performance of systems on the ASSIN 2 data. Still, we should also consider that, in the last two years, NLP has seen substantial advances in the representation of sentences and their meaning, which led to significant improvements in many tasks.

3.2 Approaches

The approaches followed by the participants show that the Portuguese NLP community is quickly adopting the most recent trends, with several teams (IPR, Deep Learning Brasil, L2F/INESC, Stilingue and NILC), including those with the best results, somehow exploring BERT [8] contextual embeddings, in some cases (IPR, NILC) fine-tuned for ASSIN 2. Some teams combined these with other features commonly used in STS / RTE, including string similarity measures (e.g., Jaccard for tokens, token n-grams and character n-grams), agreement in negation and sentiment, lexical-semantic relations (synonyms and hyponymy), as well as pre-trained classic word embeddings (e.g., word2vec, GloVe, fastText, all available for Portuguese as part of the NILC embeddings [15]). Besides BERT, non-pretrained neural models, namely LSTM Siamese Networks (PUCPR) and Transformers (Stilingue), were also used, while a few teams (ASAPPpy, ASAPPj) followed a more classic machine learning approach and learned a regressor from some of the previous features. Models were trained not only on the ASSIN 2 training collection, but also on data from ASSIN 1.

Towards the best Pearson ρ, the IPR team relied on a pre-trained multilingual BERT model, made freely available by the developers of BERT, which they fine-tuned with large Portuguese corpora. A neural network was built by adding one layer to the resulting BERT model and trained with ASSIN 1 (train and test) and ASSIN 2 (train) data.

Stilingue relied on the exploration of Transformers, trained with BERT [8] features, plus a set of 18 additional features covering sentiment and negation agreement, synonyms and hyponyms according to Onto.PT [13] and VerbNet [26], similarity, gender and number agreement, Jaccard similarity of shared tokens, verb tense, presence of the conjunction 'e' (and), similar and different tokens, sentence subject, and the cosine of sentence embeddings computed with fastText [15].
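As an illustration of the fine-tuning recipe that several of these teams describe, the sketch below fine-tunes a multilingual BERT checkpoint with a single regression output for STS, assuming the Hugging Face transformers and PyTorch libraries. The checkpoint name, toy data and hyperparameters are placeholders, not any team's actual configuration; for RTE, num_labels=2 and integer labels would be used instead.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: any BERT-like model covering Portuguese could be used.
MODEL_NAME = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 gives a single regression output, trained with an MSE loss.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

# Toy examples standing in for the ASSIN 2 training pairs.
train_pairs = [("Um avião está voando", "Um cachorro está latindo", 1.0),
               ("O cachorro castanho está correndo na neve",
                "Um cachorro castanho está correndo na neve", 5.0)]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # illustrative number of epochs
    premises, hypotheses, scores = zip(*train_pairs)
    batch = tokenizer(list(premises), list(hypotheses),
                      truncation=True, padding=True, return_tensors="pt")
    batch["labels"] = torch.tensor(scores, dtype=torch.float)
    loss = model(**batch).loss  # MSE between predicted and gold similarity
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```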
For RTE, the best run, by Deep Learning Brasil, was based on an ensemble of multilingual BERT and RoBERTa [17], which improves on the results of BERT, both fine-tuned on the ASSIN 2 data. However, for RoBERTa, this data was previously translated to English with Google Translate. The IPR team also relied on BERT and used ASSIN 1 data to fine-tune the model.

Our first baseline was the word overlap, which had very competitive results in ASSIN 1. It computes the ratio of overlapping tokens with respect to both the first and the second sentence, and trains a logistic/linear regressor (for RTE/STS) with these two features. A second baseline is inspired by Gururangan et al. [14] and trains the same algorithms on bag-of-words features extracted only from the second sentence of each pair. It aims to detect biases in the construction of the dataset. For RTE, a third baseline was considered, namely Infernal [10], a system based on hand-designed features, which has state-of-the-art results on ASSIN 1.
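The sketch below is a reconstruction of the two learned baselines from the description above, assuming scikit-learn; it is not the organizers' code. The RTE variants are shown, and the STS ones would swap in a linear regressor.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def overlap_features(premise, hypothesis):
    """Word-overlap baseline features: overlap ratio w.r.t. each sentence of the pair."""
    t, h = set(premise.lower().split()), set(hypothesis.lower().split())
    shared = len(t & h)
    return [shared / len(t), shared / len(h)]

def train_overlap_rte(pairs, labels):
    """Logistic regression on the two overlap ratios (linear regression would be used for STS)."""
    X = np.array([overlap_features(t, h) for t, h in pairs])
    return LogisticRegression().fit(X, labels)

def train_hypothesis_only_rte(pairs, labels):
    """Bias-detection baseline: bag-of-words features from the second sentence only."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform([h for _, h in pairs])
    return vectorizer, LogisticRegression(max_iter=1000).fit(X, labels)
```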
3.3 Results in Harder Pairs

Similarly to Gururangan et al. [14], we took all the pairs misclassified by our second baseline and called them the hard subset of the data. In other words, these are the pairs that cannot be correctly classified by looking only at the hypothesis, the second sentence of the pair. In order to provide an alternative view on the results, we analysed the participants' results on this subset. Results are shown in Table 3.

Though worse than the performance on the full collection, in Table 2, the differences are not as large as those reported by Gururangan et al. [14]. This is not surprising, as the second baseline had an F1 score only marginally above chance level13, indicating that the dataset does not suffer from annotation artifacts as seriously as SNLI. A particular outlier is the second run of IPR in STS, but its highly differing value is due to its already very low Pearson ρ on the original data. Still, eight runs had a decrease of more than 15% in RTE, suggesting they might have been exploiting some bias in the collection.

13 For comparison, Gururangan et al. [14] had 67% accuracy in a dataset with three classes.

Table 3. Results of each submitted run on the harder pairs and the difference towards the full results.

Team | Run | STS ρ* | STS MSE | STS ρ diff | RTE F1 | RTE Acc | RTE F1 diff
ASAPPj | 1 | 0.567 | 0.75 | -11.68% | 0.526 | 64.02 | -6.07%
ASAPPj | 2 | 0.578 | 0.70 | -11.35% | 0.558 | 64.59 | -7.92%
ASAPPj | 3 | 0.586 | 0.94 | -4.87% | 0.551 | 64.40 | -4.34%
ASAPPpy | 1 | 0.630 | 0.75 | -13.22% | 0.302 | 35.79 | -24.69%
ASAPPpy | 2 | 0.635 | 0.75 | -13.01% | 0.591 | 59.49 | -9.91%
ASAPPpy | 3 | 0.655 | 0.77 | -11.49% | 0.587 | 59.49 | -9.55%
IPR | 1 | 0.764 | 0.70 | -7.51% | 0.798 | 81.02 | -8.90%
IPR | 2 | 0.018 | 13.43 | -148.65% | 0.788 | 79.89 | -9.74%
IPR | 3 | 0.734 | 0.86 | -9.27% | 0.781 | 79.32 | -10.23%
LIACC | 1 | 0.368 | 1.20 | -25.35% | 0.630 | 63.36 | -18.18%
LIACC | 2 | 0.383 | 1.09 | -16.56% | 0.574 | 57.41 | -20.28%
LIACC | 3 | 0.348 | 1.15 | -24.02% | 0.581 | 58.07 | -20.74%
NILC | 1 | 0.632 | 0.88 | -13.31% | 0.777 | 78.94 | -10.79%
NILC | 2 | 0.632 | 0.88 | -13.31% | 0.774 | 78.56 | -10.83%
NILC | 3 | 0.632 | 0.88 | -13.31% | 0.760 | 77.05 | -12.14%
PUCPR | 1 | 0.528 | 1.23 | -17.88% | — | — | —
PUCPR | 2 | 0.562 | 1.17 | -17.11% | — | — | —
PUCPR | 3 | 0.528 | 1.25 | -18.27% | — | — | —
L2F/INESC | 1 | 0.593 | 0.77 | -15.04% | 0.639 | 66.76 | -11.98%
L2F/INESC | 2 | 0.677 | 0.73 | -12.19% | 0.644 | 65.34 | -16.90%
L2F/INESC | 3 | 0.684 | 0.71 | -12.08% | 0.658 | 66.95 | -16.07%
Deep Learning Brasil | 1 | 0.659 | 1.54 | -12.25% | 0.681 | 68.56 | -16.54%
Deep Learning Brasil | 2 | 0.718 | 0.75 | -8.54% | 0.804 | 81.40 | -8.95%
Deep Learning Brasil | 3 | 0.579 | 0.78 | -11.87% | 0.244 | 32.20 | -26.73%
Stilingue | 1 | 0.666 | 0.68 | -10.96% | 0.747 | 77.24 | -5.20%
Stilingue | 2 | 0.718 | 0.52 | -10.25% | 0.777 | 79.13 | -10.28%
Stilingue | 3 | 0.744 | 0.66 | -8.94% | 0.777 | 79.13 | -10.28%

4 Conclusions

ASSIN 2 was the second edition of ASSIN, a shared task targeting Recognizing Textual Entailment / Natural Language Inference and Semantic Textual Similarity in Portuguese. It had nine participating teams, from Portugal and Brazil, and, differently from the previous ASSIN edition [11], most of the systems outperformed the proposed baselines. We believe that the effort of having a simpler task in ASSIN 2 was beneficial, not only because systems could do better in this edition, but also because the ASSIN 2 corpus has a sound annotation strategy, comparable with previous shared tasks for English. Looking at the participation, it seems that the Portuguese processing community is now more interested in the proposed tasks.

Regarding the results achieved, it is notable that systems based on transfer learning obtained the best results in the competition, for both tasks. A note should be added on the Deep Learning Brasil team, which achieved the best scores for RTE with a strategy based on translating the data to English, making it possible to use more powerful models. However, it is possible that the nature of the data, which is a translated and adapted version of SICK, makes this strategy more sound than it would be in real-world scenarios. After all, the ASSIN 2 results may indicate how the pre-trained language models used, namely BERT and RoBERTa, rapidly improved the state of the art of a given task. For the future, we would like to discuss new ways of evaluating the generalization power of the proposed systems, since intrinsic metrics, considering only a subset of the data that follows exactly the same format as the training data, nowadays seem not to be enough to effectively evaluate the systems' performance.

References

1. Agirre, E., Diab, M., Cer, D., Gonzalez-Agirre, A.: SemEval-2012 task 6: A pilot on semantic textual similarity. In: Proc. 1st Joint Conf. on Lexical and Computational Semantics - Vol. 1: Proc. of main conference and shared task, and Vol. 2: Proc. of Sixth Intl. Workshop on Semantic Evaluation. pp. 385–393. Association for Computational Linguistics (2012)
2. Alves, A., Gonçalo Oliveira, H., Rodrigues, R., Encarnação, R.: ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese. In: Proceedings of 7th Symposium on Languages, Applications and Technologies (SLATE 2018). OASIcs, vol. 62, pp. 12:1–12:17. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (June 2018)
3. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 632–642. Association for Computational Linguistics, Lisbon, Portugal (Sep 2015)
4. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics (2015)
5. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 1–14. Association for Computational Linguistics (2017)
6. Crouch, R., Karttunen, L., Zaenen, A.: Circumscribing is not excluding: A response to Manning. http://web.stanford.edu/~laurik/publications/reply-to-manning.pdf
7. Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognising textual entailment challenge. In: Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment (2006)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019)
9. Fialho, P., Marques, R., Martins, B., Coheur, L., Quaresma, P.: INESC-ID@ASSIN: Medição de similaridade semântica e reconhecimento de inferência textual. Linguamática 8(2), 33–42 (2016)
10. Fonseca, E., Aluísio, S.M.: Syntactic knowledge for natural language inference in Portuguese. In: Villavicencio, A., Moreira, V., Abad, A., Caseli, H., Gamallo, P., Ramisch, C., Gonçalo Oliveira, H., Paetzold, G.H. (eds.) Computational Processing of the Portuguese Language. pp. 242–252. Springer, Cham (2018)
11. Fonseca, E., Santos, L., Criscuolo, M., Aluísio, S.: Visão geral da avaliação de similaridade semântica e inferência textual. Linguamática 8(2), 3–13 (2016)
12. Giampiccolo, D., Magnini, B., Dagan, I., Dolan, B.: The third PASCAL recognizing textual entailment challenge. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. pp. 1–9. Association for Computational Linguistics, Prague (Jun 2007)
13. Gonçalo Oliveira, H., Gomes, P.: ECO and Onto.PT: A flexible approach for creating a Portuguese wordnet automatically. Language Resources and Evaluation 48(2), 373–393 (2014)
14. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., Smith, N.A.: Annotation artifacts in natural language inference data. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). pp. 107–112. Association for Computational Linguistics, New Orleans, Louisiana (Jun 2018)
15. Hartmann, N.S., Fonseca, E.R., Shulby, C.D., Treviso, M.V., Rodrigues, J.S., Aluísio, S.M.: Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology. STIL 2017 (2017)
16. Kalouli, A.L., Real, L., de Paiva, V.: Correcting contradictions.
In: Proceedings of the Computing Natural Language Inference (CONLI) Workshop, 19 September 2017 (2017)
17. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
18. Manning, C.: Local textual inference: It's hard to circumscribe, but you know it when you see it – and NLP needs it. https://nlp.stanford.edu/~manning/papers/LocalTextualInference.pdf (2006)
19. Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., Zamparelli, R.: SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In: Proc. of 8th Intl. Workshop on Semantic Evaluation (SemEval 2014). pp. 1–8. Association for Computational Linguistics, Dublin, Ireland (2014)
20. de Marneffe, M.C., Rafferty, A.N., Manning, C.D.: Finding contradictions in text. In: Proceedings of ACL-08: HLT. pp. 1039–1047. Association for Computational Linguistics, Columbus, Ohio (Jun 2008)
21. Nangia, N., Williams, A., Lazaridou, A., Bowman, S.: The RepEval 2017 shared task: Multi-genre Natural Language Inference with sentence representations. In: Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP. pp. 1–10. Association for Computational Linguistics, Copenhagen, Denmark (Sep 2017)
22. Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L., Giampiccolo, D.: SemEval-2012 task 8: Cross-lingual textual entailment for content synchronization. In: Proceedings of *SEM (2012)
23. Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L., Giampiccolo, D.: SemEval-2013 task 8: Cross-lingual textual entailment for content synchronization. In: Proceedings of *SEM (2013)
24. Real, L., Fonseca, E., Gonçalo Oliveira, H.: The ASSIN 2 shared task: A quick overview. In: Computational Processing of the Portuguese Language - 14th International Conference, PROPOR 2020, Évora, Portugal, March 2-4, 2020, Proceedings, in press. LNCS, Springer (2020)
25. Real, L., Rodrigues, A., Vieira, A., Albiero, B., Thalenberg, B., Guide, B., Silva, C., de Oliveira Lima, G., C. S. Câmara, I., Stanojević, M., Souza, R., De Paiva, V.: SICK-BR: A Portuguese corpus for inference. In: Proceedings of 13th PROPOR (2018)
26. Scarton, C., Aluísio, S.: Towards a cross-linguistic VerbNet-style lexicon for Brazilian Portuguese. In: Proceedings of LREC 2012 Workshop on Creating Cross-language Resources for Disconnected Languages and Styles (2012)
27. Voorhees, E.M.: Contradictions and justifications: Extensions to the textual entailment task. In: Proceedings of ACL-08: HLT. pp. 63–71. Association for Computational Linguistics, Columbus, Ohio (Jun 2008)
28. Zaenen, A., Karttunen, L., Crouch, R.: Local textual inference: Can it be defined or circumscribed? In: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment. pp. 31–36. Association for Computational Linguistics, Ann Arbor, Michigan (Jun 2005)