
Organizing the ASSIN 2 Shared Task

Livy Real

livyreal@gmail.com 0

Erick Fonseca

erick.fonseca@lx.it.pt 2

Hugo Gonçalo Oliveira 1

0 B2W Digital/Grupo de Linguística Computacional - University of São Paulo
1 CISUC, University of Coimbra, Portugal
2 Instituto de Telecomunicações, Lisboa, Portugal

We describe ASSIN 2, the second edition of a task on the evaluation of Semantic Textual Similarity (STS) and Recognizing Textual Entailment (RTE) in Portuguese. The ASSIN 2 dataset is a collection of sentence pairs annotated with human judgments for textual entailment and semantic similarity. Interested teams could participate in either of the tasks (STS or RTE) or in both. Nine teams participated in STS and eight in RTE. A workshop on this task was collocated with STIL 2019, in Salvador, Brazil. This paper describes the ASSIN 2 task and gives an overview of the participating systems.

Keywords: Shared Task, Semantic Textual Similarity, Recognizing Textual Entailment, Natural Language Inference, Portuguese

1 Introduction

The similarity scale adopted in ASSIN 2 ranges between 1 and 5, with 1 meaning that the sentences are totally different and 5 that they have virtually the same meaning. The pair O cachorro está pegando uma bola azul/Uma bola azul está sendo pega pelo cachorro⁵ is an example of a pair scored with 5, while A menina está andando de cavalo/O menino está borrifando as plantas com água⁶ is scored 1.

⁴ Assim in Portuguese means ‘in the same way’, so arguably an adequate name for a similarity task.

Recognizing Textual Entailment (RTE), or Natural Language Inference (NLI), is the task of predicting whether a given text entails another (i.e., a premise implies a hypothesis). The entailment relation holds when, from the premise [A], we can infer that another sentence [B] is also true. That is, from [A] we can conclude [B]. For the pair [A] Um macaco está provocando um cachorro no zoológico/ [B] Um cachorro está sendo provocado por um macaco no zoológico⁷, we say A entails B. For the pair [A] Um grupo de meninos em um quintal está brincando e um homem está de pé ao fundo/ [B] Os meninos jovens estão brincando ao ar livre e o homem está sorrindo por perto⁸, in contrast, there is no entailment relation from A to B⁹.

We follow the tradition of shared tasks for RTE that can be traced back to 2005 with the first Pascal Challenge [7], targeting RTE in a corpus of 1,367 pairs annotated for entailment and non-entailment relations. Back then, the best teams (MITRE and Bar Ilan teams) achieved an accuracy of 0.586. In the next Pascal Challenges, different corpora and task designs were tried: paragraphs were used instead of short sentences (Challenge 3 [12]); contradictions were added to the data (Extended Challenge 3 [27]); non-aligned texts were given to the participants (Challenges 6 and 7) and, more recently, the task was presented as multilingual [22,23].

Regarding STS, shared tasks for English go back to SemEval 2012 [1]. Recently, in 2017 [5], Arabic and Spanish were also included. SemEval 2014 also included a related task on Compositionality, which brought together Semantic Relatedness and Textual Entailment [19], and after which we modeled our dataset. For both STS and RTE, this task used the SICK corpus (‘Sentences Involving Compositional Knowledge’) as its data source, the first corpus in the order of 10,000 sentence pairs annotated for inference.

In 2015, SNLI, a corpus with more than 500,000 human-written English sentences annotated for NLI, was released [3] and, in 2017, RepEval [21] included the MultiNLI corpus, with more than 430,000 pairs annotated for NLI, covering different textual genres.

⁵ The dog is catching a blue ball/A blue ball is being caught by the dog.
⁶ The girl is riding the horse/The boy is spraying the plants with water.
⁷ A monkey is teasing a dog at the zoo/A dog is being teased by a monkey at the zoo.
⁸ A group of boys in a backyard are playing and a man is standing in the background/Young boys are playing outdoors and the man is smiling nearby.
⁹ One could possibly think there is an entailment relation between these sentences, since ‘meninos’ (boys) are always ‘meninos jovens’ (young boys) and the man standing would probably also be smiling. Since it is equally possible that the man nearby is not smiling, this pair is considered a non-entailment; that is, it is possible that the two scenes happen at the same time, but it is not necessary.

When it comes to Portuguese processing, data availability and shared tasks for semantic processing are only beginning to become popular. In 2016, ASSIN [11] was the first shared task for Portuguese STS and RTE. Its dataset included 10,000 pairs of annotated sentences, half in European Portuguese and half in Brazilian Portuguese. ASSIN 2 follows the goal of ASSIN by offering a new computational semantic benchmark to the community interested in computational processing of Portuguese.

2 Task and Data Design

When designing ASSIN 2, we considered the previous experience of ASSIN and made some changes towards an improved task. This section describes the data used in the ASSIN 2 collection, its annotation process, decisions taken and the main differences from ASSIN 1. It ends with a brief schedule of ASSIN 2.

2.1 Data Source

The ASSIN 1 dataset is based on news and imposes several linguistic challenges, such as temporal expressions and reported speech. Following the reflections of the ASSIN 1 organization [11], we opted to have a corpus specifically created for the tasks, like SNLI and MultiNLI, and containing only simple facts, like SICK. Therefore, the ASSIN 2 data was based on SICK-BR [25], a translation and manual adaptation of SICK [19], the corpus used in SemEval 2014, Task 1. SICK is known to be based on captions of pictures and to have fewer complex linguistic phenomena, which suits our purposes well. Since the ASSIN 2 collection is built upon SICK-BR, it only contains the Brazilian variant of Portuguese.

2.2 Data Balancing

Another goal considered in data design was to have a balanced corpus in terms of RTE labels. Both ASSIN and SICK-BR data are unbalanced, offering many more neutral pairs than entailments. Even if this is more representative of how people actually use language, it is undesirable for machine learning techniques. Since fewer than 25% of the SICK-BR pairs are entailments, we had to create and annotate more of them. To create such pairs we followed a semi-automated strategy, starting from entailment SICK-BR pairs and replacing words with synonyms or removing adverbial or adjectival phrases in those. All generated pairs were manually revised. We also manually created pairs hoping they would be annotated as entailments, but trying as much as possible not to introduce artifact bias [14].

2.3 Data Annotation

All the annotators involved in the ASSIN 2 annotation task have linguistic training, being professors, linguistics students or computational linguists. We would like to express our deepest appreciation to Alissa Canesim (Universidade Estadual de Ponta Grossa), Amanda Oliveira (Stilingue), Ana Cláudia Zandavalle (Stilingue), Beatriz Mastrantonio (Stilingue - Universidade Federal de Ouro Preto), Carolina Gadelha (Stilingue), Denis Owa (Pontifícia Universidade Católica de São Paulo), Evandro Fonseca (Stilingue), Marcos Zandonai (Universidade do Vale do Rio dos Sinos), Renata Ramisch (Núcleo Interinstitucional de Linguística Computacional - Universidade Federal de São Carlos) and Talia Machado (Universidade Estadual de Ponta Grossa) for taking part in this task with the single purpose of producing open resources for the community interested in the computational processing of Portuguese. We also thank the Group of Computational Linguistics from the University of São Paulo for making available SICK-BR, which served as the base for the ASSIN 2 corpus.

All the pairs created for ASSIN 2 were annotated by at least four native speakers of Brazilian Portuguese. The annotation was conducted using an online tool prepared for the RTE and STS tasks, the same as in ASSIN 1.

For the RTE task, only pairs annotated the same way by the majority of the annotators were actually used in the dataset. This means that at least three of four annotators agreed on the RTE labels present in the ASSIN 2 collection. For the STS task, the label is the average of the scores given by all the annotators. The final result was a dataset with about 10,000 sentence pairs: 6,500 used for training, 500 for validation, and 2,448 for test, now available at https://sites.google.com/view/assin2/.
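The aggregation rule just described can be illustrated with a minimal sketch. It assumes exactly four annotators per pair and an agreement threshold of three, and it drops a pair entirely when no RTE label reaches the threshold; the function names and the toy annotations are hypothetical and are not taken from the annotation tool.

```python
from collections import Counter
from statistics import mean


def aggregate_rte(labels, min_agreement=3):
    """Return the majority RTE label, or None if too few annotators agree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None


def aggregate_sts(scores):
    """Return the average of the similarity scores given by the annotators."""
    return mean(scores)


# Toy annotations (hypothetical): four annotators per pair.
annotations = [
    {"rte": ["Entailment", "Entailment", "Entailment", "Entailment"], "sts": [5, 5, 5, 5]},
    {"rte": ["Entailment", "None", "Entailment", "Entailment"], "sts": [4, 5, 4, 5]},
    {"rte": ["Entailment", "None", "None", "Entailment"], "sts": [3, 4, 3, 4]},
]

dataset = []
for ann in annotations:
    label = aggregate_rte(ann["rte"])
    if label is None:
        # Assumption of this sketch: pairs without a 3-of-4 majority are discarded.
        continue
    dataset.append((label, aggregate_sts(ann["sts"])))

print(dataset)
```

In this toy run the third pair is discarded, because its annotators are split two against two.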

Since we wanted to have a balanced corpus and a sound annotation strategy, we opted for having only two RTE labels: entailment and non-entailment. Unlike in ASSIN 1, we did not use the paraphrase label, because a paraphrase is a double entailment (A entails B and B entails A), so annotating it with a third label is, in a sense, unnecessary. This was further motivated by the results of ASSIN 1, where systems had much difficulty outperforming the proposed baselines, which were the same as in ASSIN 2. For example, no participant run did better than the RTE baseline in Brazilian Portuguese. Thus, we decided to pursue a new task design having in mind its utility to the community.

In fact, our original intent was to follow a tradition in inference that pays attention to contradictions as much as to entailments, as in Zaenen et al. [28] and de Marneffe et al. [20], as well as in most recent datasets. However, having a soundly annotated corpus for contradictions is not a trivial task. Firstly, defining contradictions and having functional guidelines for the phenomenon is a task on its own. While recent datasets aim to have a “human” perspective of the phenomenon [4], semanticists and logicians have already pointed out that this lay perspective on contradictions can lead to much noise in inference annotation¹⁰, especially when annotating contradictions. For example, the work of Kalouli et al. [16] shows that almost 50% of the contradictions in the SICK dataset, around 15% of all the pairs, do not follow the basic ‘logical’ assumption that, if the premise (sentence A) contradicts the hypothesis (sentence B), the hypothesis (B) must also contradict the premise (A). After all, contradictions should be symmetric. Secondly, considering that we used SICK-BR as the base of our dataset, we would have needed to correct all the contradictions that were already in SICK, following Kalouli et al. [16], who found many inconsistencies in the annotation of contradictions. Another reason for excluding contradictions from ASSIN 2 is that the corpus would then not be balanced among the labels, since SICK (and SICK-BR) has fewer than 1,500 contradictions in a corpus of 10,000 pairs.

¹⁰ See the Crouch (et al.)-Manning controversy for details on this point [28,18,6].

| Premise | Hypothesis | RTE | STS |
|---|---|---|---|
| O cachorro castanho está correndo na neve | Um cachorro castanho está correndo na neve | Entails | 5 |
| Alguns animais estão brincando selvagemente na água | Alguns animais estão brincando na água | Entails | 4.4 |
| Dois meninos jovens estão olhando para a câmera e um está pondo sua língua para fora | Duas jovens meninas estão olhando uma câmera e uma está com a língua para fora | None | 3.7 |
| A menina jovem está soprando uma bolha que é grande | Não tem nenhuma menina de rosa girando uma fita | None | 2.1 |
| Um avião está voando | Um cachorro está latindo | None | 1 |
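In the collection, each pair is serialized as an XML `pair` element carrying `entailment`, `id` and `similarity` attributes, with the premise in `t` and the hypothesis in `h`. As a minimal sketch, such data could be read with Python's standard library as follows. The wrapping root element name (`entailment-corpus`), the function `read_pairs` and the dictionary keys are assumptions made for illustration and are not part of the official tooling.

```python
import xml.etree.ElementTree as ET

# A one-pair sample; the root element name is an assumption of this sketch.
SAMPLE = """<entailment-corpus>
  <pair entailment="Entailment" id="681" similarity="5">
    <t>O cachorro castanho está correndo na neve</t>
    <h>Um cachorro castanho está correndo na neve</h>
  </pair>
</entailment-corpus>"""


def read_pairs(xml_text):
    """Parse XML text into a list of dictionaries, one per sentence pair."""
    root = ET.fromstring(xml_text)
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "entailment": pair.get("entailment"),
            "similarity": float(pair.get("similarity")),
            "premise": pair.findtext("t"),
            "hypothesis": pair.findtext("h"),
        })
    return pairs


pairs = read_pairs(SAMPLE)
print(pairs[0]["entailment"], pairs[0]["similarity"])
```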

<pair entailment="Entailment" id="681" similarity="5">
  <t>O cachorro castanho está correndo na neve</t>
  <h>Um cachorro castanho está correndo na neve</h>
</pair>

2.4 Schedule

ASSIN 2 was announced in May 2019, in several NLP mailing lists. In June 2019, a Google Group was created for communication between the organization and participants or other interested people (https://groups.google.com/forum/#!forum/assin2). Training and validation data were released on 16th June and testing data on 16th September, which also marked the beginning of the evaluation period. The deadline for result submission was 10 days later, on 26th September, and the official results were announced a few days after this.

A physical workshop where ASSIN 2 was presented, along with some of the participating systems, was held on 15th October 2019, in Salvador, Brazil, collocated with the STIL 2019 symposium¹¹. On 2nd March 2020, a second opportunity was given to participants to present their work in the POP2 workshop, in Évora, Portugal, collocated with the PROPOR 2020 conference¹².

2.5 Metrics

As in ASSIN 1 and in other shared tasks with the same goal, systems’ performance on the RTE task was measured with macro F1 (computed from precision and recall) as the main metric. For STS, performance was measured with the Pearson correlation coefficient (ρ) between the gold and the submitted scores, with Mean Squared Error (MSE) computed as a secondary metric. The evaluation scripts can be found at https://github.com/erickrf/assin.

3 Participants and Results

ASSIN 2 had a total of nine participating teams, five from Portugal and four from Brazil, namely:

– CISUC-ASAPPj (Portugal)
– CISUC-ASAPPpy (Portugal)
– Deep Learning Brasil (Brazil)
– IPR (Portugal)
– L2F/INESC (Portugal)
– LIACC (Portugal)
– NILC (Brazil)
– PUCPR (Brazil)
– Stilingue (Brazil)

Each team could submit up to three runs and participate in both STS and RTE, or in only one of them. Moreover, each team could participate without attending the workshop in Salvador. We believe this was an important point for increasing participation, because travelling expenses can be high, especially for those coming from Europe. The main drawback was that only four teams actually presented their approaches in the ASSIN 2 workshop, namely CISUC-ASAPP, CISUC-ASAPPpy, Deep Learning Brasil and Stilingue. On the other hand, a total of six teams submitted a paper describing their participation, to be included in this volume.

The results of the runs submitted by each team in the STS and RTE tasks are shown in Table 2, together with three baselines.

[Table 2: results per run in the STS and RTE tasks. Rows: CISUC-ASAPPj, CISUC-ASAPPpy, Deep Learning Brasil, IPR, L2F/INESC, LIACC, NILC, PUCPR, Stilingue, WordOverlap (baseline), BoW sentence 2 (baseline), Infernal (baseline).]

Considering the Pearson correlation (ρ), the best result in STS (0.826) was achieved by the first run submitted by the IPR team, although the best MSE was achieved by the second run of Stilingue (0.39). We highlight that these were the only teams with ρ higher than 0.8. Although Pearson ρ was used as the main metric, this metric and the MSE are two different ways of analysing the results. A high ρ means that the ranking of most similar pairs is closer to the one in the gold standard, while a low MSE means that the similarity scores are closer to the gold ones. Both the best MSE and the best values of ρ are significantly better than the best results achieved in ASSIN 1, both in the official evaluation (ρ = 0.73 [9]) and in post-evaluation experiments (ρ = 0.75 [2]).

¹¹ http://comissoes.sbc.org.br/ce-pln/stil2019/
¹² https://sites.google.com/view/pop2-propor2020

On RTE, Deep Learning Brasil had the best run (second run), considering both F1 and Accuracy, though not very far from IPR, Stilingue and NILC. Again, the values achieved are higher than the best official results in ASSIN 1.
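To make the metrics discussed in this section concrete, the following is a minimal, standard-library sketch of Pearson ρ, MSE, macro F1 and accuracy. It is not the official evaluation script; function names and toy values are hypothetical. The STS example uses predictions that are uniformly shifted by one point, which keeps the correlation perfect while the MSE is far from zero.

```python
from math import sqrt


def pearson(gold, pred):
    """Pearson correlation between gold and predicted similarity scores."""
    n = len(gold)
    mean_g, mean_p = sum(gold) / n, sum(pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_g == 0 or var_p == 0:
        raise ValueError("Pearson correlation is undefined for constant scores")
    return cov / sqrt(var_g * var_p)


def mse(gold, pred):
    """Mean squared error between gold and predicted similarity scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores."""
    f1_scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(gold, pred):
    """Fraction of pairs whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy STS scores (hypothetical): every prediction is one point too high.
gold_sts = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]
rho, err = pearson(gold_sts, shifted), mse(gold_sts, shifted)

# Toy RTE labels (hypothetical).
gold_rte = ["Entailment", "Entailment", "None", "None"]
pred_rte = ["Entailment", "None", "None", "None"]
f1, acc = macro_f1(gold_rte, pred_rte), accuracy(gold_rte, pred_rte)
print(rho, err, f1, acc)
```

The shifted predictions preserve the ranking of the gold scores, so ρ stays at its maximum, whereas MSE penalizes the constant offset. This is why the two STS metrics can rank systems differently.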

The globally higher performances suggest that, when compared to ASSIN 1, ASSIN 2 was an easier task. This might, indeed, be true, especially considering that, for RTE, the ASSIN 2 collection only used two labels, due to Paraphrases being labelled as Entailment and thus not “competing”. The ASSIN 2 data was also designed to be simpler, without complex linguistic phenomena. Another point to keep in mind when comparing ASSIN 1 and ASSIN 2 is that in this edition, competitors had access to a balanced corpus. This might have also contributed to the better performance of systems on the ASSIN 2 data. Still, we should also consider that, in the last two years, NLP had substantial advances when it comes to the representation of sentences and their meaning, which led to significant improvements in many tasks.

3.2 Approaches

Approaches followed by the participants show that the Portuguese NLP community is quickly adopting the most recent trends, with several teams (IPR, Deep Learning Brasil, L2F/INESC, Stilingue and NILC), including those with the best results, exploring BERT [8] contextual embeddings in some way, some of which (IPR, NILC) were fine-tuned for ASSIN 2. Some teams combined these with other features commonly used in STS / RTE, including string similarity measures (e.g., Jaccard for tokens, token n-grams and character n-grams), agreement in negation and sentiment, lexical-semantic relations (synonymy and hyponymy), as well as pre-trained classic word embeddings (e.g., word2vec, GloVe, fastText, all available for Portuguese as part of the NILC embeddings [15]). Besides BERT, non-pretrained neural models, namely LSTM Siamese Networks (PUCPR) and Transformers (Stilingue), were also used, while a few teams (ASAPPpy, ASAPPj) followed a more classic machine learning approach, and learned a regressor from some of the previous features. Models were trained not only on the ASSIN 2 train collection, but also on data from ASSIN 1.
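As an illustration of the string similarity measures mentioned above, the sketch below computes Jaccard similarity over tokens, token bigrams and character trigrams. It is a generic example and not the implementation of any participating team; lowercase whitespace tokenization, the n-gram sizes and the value returned for two empty inputs are assumptions.

```python
def ngrams(items, n):
    """Set of contiguous n-grams over a sequence (tokens or characters)."""
    items = list(items)
    return {tuple(items[i:i + n]) for i in range(len(items) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty inputs are given similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


first = "um cachorro está latindo"
second = "um cachorro está correndo"
tokens_1, tokens_2 = first.split(), second.split()

token_sim = jaccard(tokens_1, tokens_2)
bigram_sim = jaccard(ngrams(tokens_1, 2), ngrams(tokens_2, 2))
char_sim = jaccard(ngrams(first, 3), ngrams(second, 3))
print(token_sim, bigram_sim, char_sim)
```

Here the two sentences share three of five distinct tokens and two of four distinct token bigrams. Character n-grams additionally give partial credit to words that differ only in part, such as the shared suffix of `latindo` and `correndo`.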

To achieve the best Pearson ρ, the IPR team relied on a pre-trained multilingual BERT model, made freely available by the developers of BERT, which they fine-tuned on large Portuguese corpora. A neural network was built by adding one layer to the resulting BERT model and trained on ASSIN 1 (train and test) and ASSIN 2 (train) data.

Stilingue relied on the exploration of Transformers, trained with BERT [8] features, plus a set of 18 additional features covering sentiment and negation agreement, synonyms and hyponyms according to Onto.PT [13] and VerbNet [26], similarity, gender and number agreement, Jaccard similarity of shared tokens, verb tense, presence of the conjunction ‘e’ (and), similar and different tokens, sentence subject, and cosine of sentence embeddings computed with FastText [15].

For RTE, the best run, by Deep Learning Brasil, was based on an ensemble of multilingual BERT and RoBERTa [17], a model that improves on the results of BERT, with both models fine-tuned on the ASSIN 2 data. However, for RoBERTa, this data was first translated to English with Google Translate. The IPR team also relied on BERT and used ASSIN 1 data to fine-tune the model.

Our first baseline was the word overlap, which had very competitive results in ASSIN 1. It computes the ratio of overlapping tokens in both the first and the second sentence, and trains a logistic/linear regressor (for RTE/STS) with these two features. A second baseline is inspired by Gururangan et al. [14] and trains the same algorithms on bag-of-words features extracted only from the second sentence of each pair. It aims to detect biases in the construction of the dataset. For RTE, a third baseline was considered, namely, Infernal [10], a system based on hand-designed features, which has state-of-the-art results on ASSIN 1.
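The two features of the word overlap baseline can be sketched as follows. This is an illustration rather than the organizers' code: lowercase whitespace tokenization and the use of token sets (ignoring repeated words) are assumptions, and the resulting feature pair would then be fed to a logistic or linear regressor, which is omitted here.

```python
def overlap_features(first, second):
    """Ratio of shared tokens relative to each sentence of the pair."""
    tokens_1 = set(first.lower().split())
    tokens_2 = set(second.lower().split())
    shared = tokens_1 & tokens_2
    ratio_1 = len(shared) / len(tokens_1) if tokens_1 else 0.0
    ratio_2 = len(shared) / len(tokens_2) if tokens_2 else 0.0
    return ratio_1, ratio_2


features = overlap_features(
    "Alguns animais estão brincando selvagemente na água",
    "Alguns animais estão brincando na água",
)
print(features)
```

For this pair, all six tokens of the second sentence occur in the first one, while the first sentence has one extra token, so the two ratios differ (6/7 and 1).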

3.3 Results in Harder Pairs

Similar to Gururangan et al. [14], we took all the pairs misclassified by our second baseline and called them a hard subset of the data. In other words, these pairs were not correctly classified by only looking at the hypothesis, the second sentence of the pair. In order to provide an alternative view on the results, we analysed the participants’ results on this subset.
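The construction of the hard subset can be sketched in a few lines. The labels below are toy values, and the function names are hypothetical; the sketch only assumes that the predictions of the hypothesis-only baseline and of a participating system are aligned with the gold labels by position.

```python
def hard_subset(gold_labels, baseline_predictions):
    """Indices of the pairs misclassified by the hypothesis-only baseline."""
    return [
        i
        for i, (gold, predicted) in enumerate(zip(gold_labels, baseline_predictions))
        if gold != predicted
    ]


def accuracy_on(indices, gold_labels, system_predictions):
    """Accuracy of a system restricted to the given pair indices."""
    if not indices:
        raise ValueError("empty subset")
    correct = sum(gold_labels[i] == system_predictions[i] for i in indices)
    return correct / len(indices)


# Toy data (hypothetical): gold labels, baseline output and one system's output.
gold = ["Entailment", "None", "Entailment", "None", "Entailment", "None"]
baseline = ["Entailment", "Entailment", "None", "None", "Entailment", "Entailment"]
system = ["Entailment", "None", "None", "None", "Entailment", "Entailment"]

hard = hard_subset(gold, baseline)
acc_full = accuracy_on(list(range(len(gold))), gold, system)
acc_hard = accuracy_on(hard, gold, system)
print(hard, acc_full, acc_hard)
```

A system whose score drops sharply from the full collection to the hard subset, as in this toy case, may be relying on the same hypothesis-only cues as the baseline.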

Results are shown in Table 3. Though worse than the performance in the full collection, in Table 2, the differences are not as large as those reported by Gururangan et al. [14]. This is not surprising, as the second baseline had an F1 score only marginally above chance level¹³, indicating that the dataset does not suffer from annotation artifacts as seriously as SNLI.

A particular outlier is the second run of IPR in STS. But the highly differing value is due to their already very low Pearson ρ in the original data. Still, eight runs had a decrease of more than 15% in RTE, suggesting they might have been exploiting some bias in the collection.

¹³ For comparison, Gururangan et al. [14] had 67% accuracy in a dataset with three classes.

[Table 3: results on the hard subset (columns include RTE Acc). Teams: ASAPPj, ASAPPpy, IPR, LIACC, NILC, PUCPR, L2F/INESC, Deep Learning Brasil, Stilingue.]

4 Conclusions

ASSIN 2 was the second edition of ASSIN, a shared task targeting Recognizing Textual Entailment / Natural Language Inference and Semantic Textual Similarity in Portuguese. It had nine participating teams, from Portugal and Brazil, and, unlike in the previous ASSIN edition [11], most of the systems outperformed the proposed baselines. We believe that the effort of having a simpler task in ASSIN 2 was beneficial, not only because systems could do better in this edition, but also because the ASSIN 2 corpus has a sound annotation strategy, comparable with previous shared tasks for English. Looking at the participation, it seems that the Portuguese processing community is now more interested in the proposed tasks.

On the results achieved, it is notable that systems based on transfer learning had better results in the competition for both tasks. A note should be added on the Deep Learning Brasil team, which achieved the best scores for RTE with a strategy based on translating the data to English, to make possible the use of more powerful models. However, it is possible that the nature of the data, which is a translated and adapted version of SICK, makes this strategy more sound than it would be in real-world scenarios. After all, ASSIN 2 results may indicate how the pre-trained language models used, namely BERT and RoBERTa, rapidly improved the state of the art of a given task. For the future, we would like to discuss new ways of evaluating the generalization power of the proposed systems, since intrinsic metrics, considering only a subset of the data that follows exactly the same format as the training data, no longer seem to be enough to effectively evaluate the systems’ performance.

References

1. Agirre, E., Diab, M., Cer, D., Gonzalez-Agirre, A.: SemEval-2012 task 6: A pilot on semantic textual similarity. In: Proc. 1st Joint Conf. on Lexical and Computational Semantics-Vol. 1: Proc. of main conference and shared task, and Vol. 2: Proc. of Sixth Intl. Workshop on Semantic Evaluation. pp. 385-393. Association for Computational Linguistics (2012)

2. Alves, A., Gonçalo Oliveira, H., Rodrigues, R., Encarnação, R.: ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese. In: Proceedings of 7th Symposium on Languages, Applications and Technologies (SLATE 2018). OASIcs, vol. 62, pp. 12:1-12:17. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (June 2018)

3. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 632-642. Association for Computational Linguistics, Lisbon, Portugal (Sep 2015)

4. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for learning natural language inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics (2015)

5. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 1-14. Association for Computational Linguistics (2017)

6. Crouch, R., Karttunen, L., Zaenen, A.: Circumscribing is not excluding: A response to Manning. http://web.stanford.edu/~laurik/publications/reply-to-manning.pdf

7. Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognising textual entailment challenge. Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognizing Textual Entailment. (2006)

8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171-4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019)

9. Fialho, P., Marques, R., Martins, B., Coheur, L., Quaresma, P.: INESC-ID@ASSIN: Medição de similaridade semântica e reconhecimento de inferência textual. Linguamática 8(2), 33-42 (2016)

10. Fonseca, E., Aluísio, S.M.: Syntactic knowledge for natural language inference in Portuguese. In: Villavicencio, A., Moreira, V., Abad, A., Caseli, H., Gamallo, P., Ramisch, C., Gonçalo Oliveira, H., Paetzold, G.H. (eds.) Computational Processing of the Portuguese Language. pp. 242-252. Springer, Cham (2018)

11. Fonseca, E., Santos, L., Criscuolo, M., Aluísio, S.: Visão geral da avaliação de similaridade semântica e inferência textual. Linguamática 8(2), 3-13 (2016)

12. Giampiccolo, D., Magnini, B., Dagan, I., Dolan, B.: The third PASCAL recognizing textual entailment challenge. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. pp. 1-9. Association for Computational Linguistics, Prague (Jun 2007)

13. Gonçalo Oliveira, H., Gomes, P.: ECO and Onto.PT: A flexible approach for creating a Portuguese wordnet automatically. Language Resources and Evaluation 48(2), 373-393 (2014)

14. Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S., Smith, N.A.: Annotation artifacts in natural language inference data. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). pp. 107-112. Association for Computational Linguistics, New Orleans, Louisiana (Jun 2018)

15. Hartmann, N.S., Fonseca, E.R., Shulby, C.D., Treviso, M.V., Rodrigues, J.S., Aluísio, S.M.: Portuguese word embeddings: Evaluating on word analogies and natural language tasks. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology. STIL 2017 (2017)

16. Kalouli, A.L., Real, L., de Paiva, V.: Correcting contradictions. In: Proceedings of Computing Natural Language Inference (CONLI) Workshop, 19 September 2017 (2017)

17. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

18. Manning, C.: Local textual inference: It's hard to circumscribe, but you know it when you see it - and NLP needs it. https://nlp.stanford.edu/~manning/papers/LocalTextualInference.pdf (2006)

19. Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., Zamparelli, R.: SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In: Proc. of 8th Intl. Workshop on Semantic Evaluation (SemEval 2014). pp. 1-8. Association for Computational Linguistics, Dublin, Ireland (2014)

20. de Marneffe, M.C., Rafferty, A.N., Manning, C.D.: Finding contradictions in text. In: Proceedings of ACL-08: HLT. pp. 1039-1047. Association for Computational Linguistics, Columbus, Ohio (Jun 2008)

21. Nangia, N., Williams, A., Lazaridou, A., Bowman, S.: The RepEval 2017 shared task: Multi-genre Natural Language Inference with sentence representations. In: Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP. pp. 1-10. Association for Computational Linguistics, Copenhagen, Denmark (Sep 2017)

22. Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L., Giampiccolo, D.: Semeval-2012 task 8: Cross-lingual textual entailment for content synchronization. In: Proceedings of *SEM (2012)

23. Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L., Giampiccolo, D.: Semeval-2013 task 8: Cross-lingual textual entailment for content synchronization. In: Proceedings of *SEM (2013)

24. Real, L., Fonseca, E., Oliveira, H.G.: The ASSIN 2 shared task: a quick overview. In: Computational Processing of the Portuguese Language - 13th International Conference, PROPOR 2020, Évora, Portugal, March 2-4, 2020, Proceedings. p. in press. LNCS, Springer (2020)

25. Real, L., Rodrigues, A., Vieira, A., Albiero, B., Thalenberg, B., Guide, B., Silva, C., de Oliveira Lima, G., C. S. Câmara, I., Stanojević, M., Souza, R., De Paiva, V.: SICK-BR: A Portuguese corpus for inference. In: Proceedings of 13th PROPOR (2018)

26. Scarton, C., Aluísio, S.: Towards a cross-linguistic VerbNet-style lexicon for Brazilian Portuguese. In: Proceedings of LREC 2012 Workshop on Creating Cross-language Resources for Disconnected Languages and Styles (2012)

27. Voorhees, E.M.: Contradictions and justifications: Extensions to the textual entailment task. In: Proceedings of ACL-08: HLT. pp. 63-71. Association for Computational Linguistics, Columbus, Ohio (Jun 2008)

28. Zaenen, A., Karttunen, L., Crouch, R.: Local textual inference: Can it be defined or circumscribed? In: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment. pp. 31-36. Association for Computational Linguistics, Ann Arbor, Michigan (Jun 2005)