<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Estimation of the Factual Correctness of Summaries of a Ukrainian-language Silver Standard Corpus</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Oleksandr</forename><surname>Bauzha</surname></persName>
							<email>asbauzha@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>Volodymyrska Street 64/13</addrLine>
									<postCode>01601</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Artem</forename><surname>Kramov</surname></persName>
							<email>artemkramov@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Seraf AI LLC</orgName>
								<address>
									<postBox>PO Box 3978</postBox>
									<postCode>60532</postCode>
									<settlement>Lisle</settlement>
									<region>Illinois</region>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Oleksandr</forename><surname>Yavorskyi</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Seraf AI LLC</orgName>
								<address>
									<postBox>PO Box 3978</postBox>
									<postCode>60532</postCode>
									<settlement>Lisle</settlement>
									<region>Illinois</region>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Information Technology and Implementation (IT&amp;I-2023)</orgName>
								<address>
									<addrLine>November 20-21</addrLine>
									<postCode>2023</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Estimation of the Factual Correctness of Summaries of a Ukrainian-language Silver Standard Corpus</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">72E0C15954717F5182DC2E52FE706B32</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T20:01+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Natural language processing</term>
					<term>factual correctness</term>
					<term>abstractive summarization</term>
					<term>low-resource languages</term>
					<term>multilingual models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, different metrics for estimating the factual correctness of summaries of a Ukrainian-language silver standard summarization corpus have been analyzed. The different state-of-the-art methods of detecting the factually inconsistent document-summary pairs have been considered first; moreover, the types of errors in current summarization datasets have been analyzed too. It has been shown that suggested metrics can be used for the discrimination of correct/incorrect document-summary pairs that may be useful for the automatic generation of a summarization corpus. The results obtained for the ground-truth samples may indicate the availability of many erroneous summaries: more than 50% of the test subset can contain factually inconsistent samples. Further analysis of the factual correctness of model-generated summaries showed better factual consistency between documents and summaries than the ground-truth summaries. However, due to the availability of noisy ground-truth samples, the generated summaries can still contain hallucinated information; applying the suggested metrics may allow filtering out erroneous samples, which should also increase the summarization model's performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Abstractive text summarization falls into the category of sequence-to-sequence natural language processing (NLP) tasks. Self-supervised pre-training of language models on large corpora <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>, followed by fine-tuning the corresponding model on a summarization dataset, has achieved remarkable success in the abstractive summarization of articles <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> and dialogues <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. However, these advances are mostly connected with high-resource languages (English, Chinese, etc.). Unfortunately, research on the abstractive summarization of Ukrainian documents is still at an initial stage. As with other NLP tasks for low-resource languages, the lack of human-written datasets remains the key obstacle to investigating the summarization of Ukrainian corpora <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>: while the summarization models themselves can potentially be created by projecting the corresponding English models into the Ukrainian-language space (e.g., a Ukrainian GPT-2 model was recently created in this way <ref type="bibr" target="#b8">[9]</ref>), verifying the quality of the summaries generated by such models remains a challenging task. One possible way to generate a summarization dataset is web-scraping of news portals <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>. Namely, the well-known XSum dataset <ref type="bibr" target="#b11">[12]</ref> was created by treating the headline of a news article as its summary. However, such an approach is not reliable, since headlines may contain extra information absent from the article itself, added to attract readers' attention.</p><p>To overcome this problem, the authors of the paper <ref type="bibr" target="#b12">[13]</ref> suggested extracting the summary of an article from its short description on the BBC news portal, resulting in the multilingual silver standard XL-Sum summarization dataset. The statistics of the Ukrainian-language subset (article-summary pairs) of the XL-Sum dataset are presented in Table <ref type="table">1</ref>. The corresponding Ukrainian-language summarization model was trained as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Number of samples for the Ukrainian-language part of the XL-Sum dataset according to the performed train-dev-test split.</p><p>The aforementioned automatic generation of document-summary pairs raises the following question: how can the quality of the collected summaries be verified automatically? While the coherency and fluency of the summaries can be assumed (the texts were written by editors), the factual consistency between each document and its summary still needs to be estimated. The authors of the XL-Sum dataset conducted a human evaluation of the summaries for 10 languages on a small subset (around 250 article-summary pairs). According to the results <ref type="bibr" target="#b12">[13]</ref>, up to 42% of the selected summaries contained extra information. The presence of such factual errors complicates the use of the dataset for verifying the quality of any summarization model; moreover, training a model on such samples can lead it to generate hallucinated summaries. Thus, the detection of factual errors in summaries is a relevant problem both for the analysis of the automatically generated dataset and for the estimation of the performance of the summarization model.</p><p>In this paper, a factual consistency metric for a Ukrainian-language document-summary pair is suggested. Namely, different cross-lingual approaches that can be applied to a wide range of languages are considered, followed by an analysis of their effectiveness. Moreover, the factual correctness of the Ukrainian-language summaries of the XL-Sum dataset is assessed with the retrieved metrics. In addition, the performance of the already trained Ukrainian summarization model in terms of the factual consistency of its generated summaries is analyzed as well.</p><p>Before creating the metric for estimating the factual consistency of a document-summary pair, we review existing approaches and open issues in this area. The next section is devoted to the analysis of different state-of-the-art methods for the detection and correction of factual mistakes in a summary given an input document.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>One of the key steps in factual consistency analysis is the creation of a dataset that defines the types of factual errors present in an erroneous summary. According to the paper <ref type="bibr" target="#b13">[14]</ref>, two approaches to dataset generation are mostly used: the entity-centric approach (Ent-C) and the generation-centric approach (Gen-C).</p><p>The Ent-C approach implies the transformation of a ground-truth summary into an erroneous one by applying different modification operations to its entities and noun phrases: entity swap, pronoun swap, negation, etc. The corresponding dataset (K2019) was first presented in the paper <ref type="bibr" target="#b14">[15]</ref> and was later used as a baseline for other methods. The ground-truth samples were taken from the CNN/DM dataset. The authors of the dataset also presented the FactCC method for detecting factual errors in a summary. Its main idea consists in fine-tuning the uncased BERT model <ref type="bibr" target="#b15">[16]</ref> for the binary classification of a document-summary pair (consistent/inconsistent) on the training dataset. As was shown, the FactCC method outperformed the MNLI-based approach <ref type="bibr" target="#b16">[17]</ref>, which interprets an entailment measure between a document and a summary as a factual consistency metric. In the paper <ref type="bibr" target="#b17">[18]</ref>, it was suggested to fine-tune the sequence-to-sequence BART model <ref type="bibr" target="#b3">[4]</ref> to generate the corrected version of a summary. Namely, a document and an inconsistent summary were concatenated and passed to the input of an encoder; the entire model was trained to generate the corrected consistent summary.
The authors of the paper <ref type="bibr" target="#b18">[19]</ref> proposed to mask each entity of a summary and then use the BERT model (the BertForQuestionAnswering architecture) to predict the answer spans in a source document. In contrast, the QAGS method <ref type="bibr" target="#b19">[20]</ref> automatically generates questions about the entities of a summary; a question-answering model is then used to find answers in both the source document and the summary and to verify that they match.</p><p>Unlike the Ent-C approach, the Gen-C approach <ref type="bibr" target="#b20">[21]</ref> consists in the transformation of a ground-truth summary by applying a paraphrasing model. The following assumption is made: the bottom-placed candidates of the beam search (e.g., the 10th best paraphrase) potentially contain erroneous facts. In contrast to the Ent-C-related methods, the authors <ref type="bibr" target="#b20">[21]</ref> considered the factual consistency problem at the level of dependency arcs retrieved from a syntactic parser: a dependency arc (fact) is entailed by a source document if the semantic relation between the corresponding head and child word is also entailed by the document. Elaborating on this assumption, the Dependency Arc Entailment (DAE) model was designed and trained to estimate the entailment of dependency arcs by a source document. To extend the treatment of dependency arcs as a representation of facts in a more general way, the FactGraph method <ref type="bibr" target="#b21">[22]</ref> was recently proposed. The main idea of the FactGraph method consists in decomposing the document and the summary into structured meaning representations. Such meaning representations define semantic concepts and their relations by generating a semantic graph for both the document and the summary.
Following the idea of the entailment of dependency arcs, the factual consistency was calculated based on the probability of establishing edges between the semantic concepts of a summary.</p><p>As mentioned in the papers <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b17">18]</ref>, the NLI-based models showed worse results than their counterparts. However, in the paper <ref type="bibr" target="#b22">[23]</ref>, the usage of NLI models was reconsidered with the SummaC method. Namely, while previous attempts focused on estimating the entailment of a document and a summary as a whole, the SummaC method considers their factual consistency at the level of sentences. The SummaC method outperformed FactCC, DAE, and QA-based methods, thus confirming that NLI models can be used for the estimation of the factual correctness of summaries. In parallel with our work, a factual consistency evaluation method for multilingual corpora based on an NLI model was recently suggested <ref type="bibr" target="#b23">[24]</ref>. The NLI model was created by fine-tuning the mT5-XXL model <ref type="bibr" target="#b24">[25]</ref> for the binary classification of a document-summary pair: the input data are represented as the concatenation of a document and a summary; the output binary value indicates whether the given pair is consistent or not. This classification model was later used for filtering inconsistent samples in the XL-Sum dataset and re-training the models. Such an approach allowed for better results in ROUGE scores and human scores (the Ukrainian language was not considered in those experiments). However, according to the annotators' conclusions, only 52% of the retrieved summaries (or even more for some languages) were factually consistent with the documents.
Moreover, estimating the entailment of a document-summary pair as a whole can contradict the recent results shown by the sentence-level SummaC method <ref type="bibr" target="#b22">[23]</ref>. We assume that considering the entailment of a document and a summary at the level of sentences may be crucial for the XL-Sum dataset: the collected summaries can potentially contain additional information (references, full names, positions, etc.) that may be revealed by increasing the granularity of the analysis of the document parts.</p><p>Finally, before applying the aforementioned methods or creating a new one, the following question should be answered: which types of factual errors are most expected in the XL-Sum dataset? To get insights, the corresponding statistics for the XSum dataset <ref type="bibr" target="#b11">[12]</ref>, which was also generated automatically, can be considered. In the paper <ref type="bibr" target="#b13">[14]</ref>, the authors conducted an error analysis of the summaries of XSum. Namely, the errors were classified into four main categories: entity-related (conflating two different entities, hallucinated entities); event-related (incorrect event description, agents, new event); noun-phrase-related (incorrect NP or NP modifiers, new NP, etc.); and others (grammar, noise). In addition, each category was divided into two subcategories: extrinsic (hallucination) and intrinsic (incorrect data interpretation) errors. According to the results <ref type="bibr" target="#b13">[14]</ref>, most of the errors are connected with the appearance of extrinsic errors of all categories. The ratio of intrinsic entity-related errors, which are typical for the aforementioned K2019 dataset, is relatively small. Thus, it was decided to rely on NLI-based approaches that can be useful for detecting the relevant types of errors. The next section describes the selected methods and metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Factual consistency estimation metrics</head><p>According to the previous section, the usage of NLI-based metrics seems useful for analyzing the different types of errors. Taking into account the findings of the SummaC zero-shot method <ref type="bibr" target="#b22">[23]</ref>, it was decided to process document-summary pairs at the level of sentences. Namely, given a pair of a document and a summary (doc, summary), let us represent both of them (D and S correspondingly) as lists of sentences:</p><formula xml:id="formula_0">D = {d_1, d_2, ..., d_n}, S = {s_1, s_2, ..., s_m} <label>(1, 2)</label></formula><p>For each summary sentence s_j, the maximum entailment (consistency) score over all document sentences is taken, forming the vector EntRed. In other words, the retrieved vector EntRed contains the best consistency score for each summary sentence. Then an output factual consistency score FactCons_Ent is calculated as the mean value of the vector EntRed:</p><formula xml:id="formula_2">FactCons_Ent = mean(EntRed) <label>(6)</label></formula><p>Aggregating the consistency scores of the summary sentences as an average value allows reducing FactCons_Ent in cases when some summary sentences are not consistent with any of the document sentences. Taking into account the potentially large ratio of hallucinated summaries in the XL-Sum dataset, such an approach may help to reveal erroneous samples. Figure <ref type="figure" target="#fig_0">1</ref> demonstrates an example of the detection of a factually inconsistent hallucinated sentence.</p><p>The summary sentence (s2), which describes the source of information in a news article, is not consistent with any of the document sentences; thus, its maximum consistency value is low. The presence of such a consistency outlier decreases the final factual consistency score FactCons_Ent.</p></div>
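The sentence-level computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `nli_entail` scorer is assumed to return an entailment probability in [0, 1] (in practice an XNLI-tuned NLI model would provide it), and the toy overlap scorer and example sentences below are hypothetical.

```python
import numpy as np

def factcons_ent(doc_sentences, summary_sentences, nli_entail):
    # Pairwise entailment matrix: rows = document sentences,
    # columns = summary sentences.
    E = np.array([[nli_entail(d, s) for s in summary_sentences]
                  for d in doc_sentences])
    # Best consistency score for each summary sentence (the EntRed vector).
    ent_red = E.max(axis=0)
    # Final score (formula 6): mean over summary sentences.
    return float(ent_red.mean())

# Toy stand-in for an NLI model (illustrative assumption):
# unigram overlap as a fake entailment probability.
def toy_entail(premise, hypothesis):
    p = set(premise.lower().replace(".", "").split())
    h = set(hypothesis.lower().replace(".", "").split())
    return len(p & h) / max(len(h), 1)

doc = ["The president visited Kyiv on Monday.", "He met local officials."]
summ = ["The president visited Kyiv.", "Aliens landed in Lviv."]
score = factcons_ent(doc, summ, toy_entail)  # hallucinated s2 drags the score down
```

Because the aggregation is a mean of per-sentence maxima, a single hallucinated summary sentence (like the second one above) lowers the overall score even when the other sentences are well supported.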
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental part</head><head n="4.1.">Inconsistent summaries discrimination</head><p>Before calculating and analyzing the metric values for the Ukrainian part of the XL-Sum dataset, it was decided to verify the ability of the different methods to discriminate between factually consistent and inconsistent summaries. The inconsistent summaries discrimination task is the following: given two document-summary pairs sharing a common document, where one pair contains a correct summary and the other an incorrect one, it is necessary to predict which pair contains the factually consistent summary. The accuracy is calculated as the ratio of correctly processed pairs to their total number.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Dataset</head><p>The test part of the Ukrainian-language XL-Sum dataset was analyzed. In order to generate a factually inconsistent sample for each document, the following rules were applied: an inconsistent summary should belong to another document, and the ROUGE-1 F1 measure between the document and the inconsistent summary should be higher than the corresponding value between the document and the consistent summary.</p><p>These rules allowed picking inconsistent summaries that can relate to the same topic as the document but contain different information, which makes the discrimination task more challenging. Half of the test dataset was analyzed, resulting in 1619 data points. The statistics of the dataset are available in Table <ref type="table" target="#tab_2">2</ref>. The Stanza package <ref type="bibr" target="#b25">[26]</ref> was used for tokenization; stemming was performed with the Ukrainian Stemmer library <ref type="bibr" target="#b26">[27]</ref>.</p></div>
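The selection rule above can be sketched as follows. This is an illustrative sketch: the paper tokenizes with Stanza and stems with a Ukrainian stemmer, whereas the helper below assumes pre-tokenized input; the function names are hypothetical.

```python
from collections import Counter

def rouge1_f1(reference_tokens, candidate_tokens):
    # Unigram-overlap ROUGE-1 F1 between two token lists.
    if not reference_tokens or not candidate_tokens:
        return 0.0
    ref, cand = Counter(reference_tokens), Counter(candidate_tokens)
    overlap = sum((ref & cand).values())
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def is_harder_negative(doc_toks, true_summary_toks, foreign_summary_toks):
    # Selection rule from the text: a summary from another document
    # qualifies as the inconsistent sample only if its ROUGE-1 F1 with
    # the document exceeds that of the true summary.
    return (rouge1_f1(doc_toks, foreign_summary_toks)
            > rouge1_f1(doc_toks, true_summary_toks))
```

Requiring a higher ROUGE-1 F1 than the true summary ensures the negative shares surface vocabulary with the document, so the discrimination cannot be solved by word overlap alone.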
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Metrics configurations</head><p>According to the previous section, it was suggested to use the NLI-based SummaC metric. The embedding-based SummaC metric (SummaC_Emb) was calculated with the sentence embedding models listed below:</p><p>paraphrase-multilingual-mpnet-base-v2 <ref type="bibr" target="#b27">[28]</ref>, a multilingual sentence embedding model based on the MPNet model <ref type="bibr" target="#b28">[29]</ref>; distiluse-base-multilingual-cased-v2 <ref type="bibr" target="#b27">[28]</ref>, a multilingual knowledge-distilled version of the multilingual Universal Sentence Encoder <ref type="bibr" target="#b29">[30]</ref>.</p><p>The NLI-based SummaC metric (SummaC_Ent) was implemented using the NLI model xlm-roberta-large-xnli, an XLM-RoBERTa model <ref type="bibr" target="#b30">[31]</ref> fine-tuned on the multilingual XNLI dataset <ref type="bibr" target="#b31">[32]</ref>.</p><p>All pre-trained models were taken from the Huggingface repository <ref type="bibr" target="#b32">[33]</ref>. The chosen multilingual models were used for the SummaC-based metrics since they were pre-trained on Ukrainian parallel data as well.</p></div>
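The embedding-based variant replaces NLI entailment with sentence-embedding cosine similarity. The sketch below assumes an `embed` function mapping sentences to vectors; in practice a model such as paraphrase-multilingual-mpnet-base-v2 (via the sentence-transformers library) would provide it, while the hashed bag-of-words `toy_embed` here is purely an illustrative stand-in.

```python
import numpy as np

def summac_emb(doc_sentences, summary_sentences, embed):
    D = embed(doc_sentences)       # (n_doc, dim) document sentence vectors
    S = embed(summary_sentences)   # (n_sum, dim) summary sentence vectors
    # L2-normalize so dot products become cosine similarities.
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    sims = D @ S.T                 # (n_doc, n_sum) similarity matrix
    # Max over document sentences, then mean over summary sentences.
    return float(sims.max(axis=0).mean())

# Deterministic toy embedding (illustrative assumption): hashed
# bag-of-words vectors instead of a real multilingual encoder.
def toy_embed(sentences, dim=16):
    out = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        for tok in sent.lower().split():
            out[i, sum(ord(c) for c in tok) % dim] += 1.0
    return out
```

The aggregation (max per summary sentence, then mean) mirrors the NLI-based variant, so the two configurations differ only in how the pairwise consistency matrix is filled.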
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3.">Results</head><p>Table <ref type="table">3</ref> shows the results of solving the inconsistent summaries discrimination task using the different metrics. In addition to the accuracy of discriminating incorrect/correct samples, the Pearson correlation coefficient (PCC) between the metrics and the ROUGE-1 score is also provided. As can be seen, the SummaC_Emb metric showed the best accuracy results. The SummaC_Emb metric based on the paraphrase-multilingual-mpnet-base-v2 model (the best option by accuracy) may be especially useful for the automatic construction of a summarization dataset, when it is necessary to map a document to a potential summary. For instance, the chapter-level subset of the BookSum summarization dataset <ref type="bibr" target="#b33">[34]</ref> was constructed by mapping a chapter of a book to the sentences of a summary of the entire book; we assume that the analyzed metrics can be used for the construction of a similar Ukrainian or even multilingual dataset as well.</p><p>Let us consider the Pearson correlation coefficient values between the metrics and the ROUGE-1 scores. Since a higher ROUGE-1 score should imply a lower metric value (by construction, incorrect summaries have higher ROUGE scores than correct ones), the PCC value should be low. As can be seen, the lowest (and negative) PCC value was obtained for the SummaC_Ent metric, indicating that this metric can be used for the detection of factually inconsistent summaries by setting a threshold value. Thus, this metric was later used to analyze the Ukrainian-language part of the XL-Sum dataset and the summarization model itself.</p></div>
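The PCC check above can be reproduced with a few lines. This is a generic sketch (the function name is ours); given per-sample metric values and ROUGE-1 scores, a strongly negative coefficient is the desired outcome here.

```python
import numpy as np

def pearson_corr(x, y):
    # Pearson correlation coefficient between two score lists,
    # e.g. a consistency metric vs. ROUGE-1 over the same samples.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

For example, perfectly anti-correlated score lists yield a coefficient of -1, the ideal behavior for a metric meant to flag high-ROUGE but inconsistent negatives.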
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Results of solving the inconsistent summaries discrimination task using different metrics: accuracy of the discrimination of correct/incorrect summaries and the Pearson correlation coefficient (PCC) between the metrics and the ROUGE-1 score of samples </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">XL-Sum dataset analysis</head><p>Firstly, let us analyze the Ukrainian test part of the dataset. The value of the SummaC_Ent metric was calculated across the dataset. The density of the distribution of the retrieved metric values is shown in Figure <ref type="figure" target="#fig_1">2</ref>. As can be seen, the distribution is skewed, and the 50th percentile equals 0.845. Referring to the paper <ref type="bibr" target="#b23">[24]</ref>, where a threshold value of 0.5 for the NLI model allowed filtering out almost half of the incorrect samples (but approximately 50% of the remaining summaries were still judged by human evaluation as factually inconsistent), it can be concluded that a higher threshold value has to be taken for SummaC_Ent as well. Indeed, the probability mass peak that starts from the 70th percentile can potentially indicate the threshold for filtering out incorrect summaries; however, this hypothesis should be verified later by an appropriate human evaluation.</p></div>
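The percentile-based filtering described above can be sketched as follows. The score values below are illustrative placeholders, not the paper's data; only the threshold 0.845 (the reported 50th percentile) comes from the text.

```python
import numpy as np

def filter_by_threshold(scores, threshold):
    # Indices of samples whose consistency score exceeds the threshold;
    # everything below is treated as potentially inconsistent.
    scores = np.asarray(scores)
    return np.nonzero(scores > threshold)[0].tolist()

# Illustrative per-sample SummaC_Ent scores (not the paper's data).
scores = [0.20, 0.95, 0.60, 0.88, 0.99]
median = float(np.percentile(scores, 50))   # 50th percentile of the sample
p70 = float(np.percentile(scores, 70))      # candidate threshold region
kept = filter_by_threshold(scores, 0.845)   # indices surviving the cut
```

Sweeping the threshold over the percentiles of the score distribution and inspecting the retained samples is the natural way to test the 70th-percentile hypothesis once human judgments are available.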
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Summarization model analysis</head><p>As the test dataset may contain many erroneous samples, it is hard to rely on the estimated ROUGE metrics. Thus, it was decided to calculate the SummaC_Ent metric for the summaries generated by the summarization model on the test dataset. The summaries were taken from the set provided by the authors <ref type="bibr" target="#b12">[13]</ref>. Figure <ref type="figure" target="#fig_2">3</ref> shows the retrieved distribution. As can be seen, the distribution of SummaC_Ent scores is skewed too.</p><p>In order to compare the results between ground-truth and model-generated summaries, it was decided to take the median value as an average score and the interquartile range (IQR) as a measure of the deviation of the metric. Table <ref type="table" target="#tab_4">4</ref> demonstrates the retrieved results. The median value of the metric for the model-predicted summaries is higher; moreover, its IQR value is lower. Thus, the summaries generated by the model appear to be even more factually consistent than the ground-truth summaries.</p><p>As can be seen from Figure <ref type="figure" target="#fig_2">3</ref>, there are some document-summary pairs whose SummaC_Ent value is close to zero. Moreover, as can be expected from a noisy hallucinated dataset, the summarization model learned some pattern relations present in the dataset (e.g., the positions of persons), which led to the generation of hallucinated content (see Figure <ref type="figure" target="#fig_3">4</ref> and Figure <ref type="figure" target="#fig_4">5</ref> for examples revealed by low values of the metric). The removal of such dataset samples using the suggested metric can help avoid this situation and provide a more robust summarization model in terms of its ability to generalize the knowledge of a source document.</p></div>
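The robust statistics used in the comparison above reduce to a short helper. This is a generic sketch (the function name is ours, and the example scores are illustrative, not Table 4's data).

```python
import numpy as np

def median_iqr(scores):
    # Median and interquartile range (Q3 - Q1): robust location and
    # spread statistics, preferred here over mean/std because the
    # score distributions are skewed.
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return float(med), float(q3 - q1)
```

A higher median together with a lower IQR, as reported for the model-predicted summaries, indicates scores that are both larger on average and more tightly clustered.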
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>In this paper, several metrics for estimating the factual consistency of documents and summaries were analyzed for processing the Ukrainian-language part of the XL-Sum corpus. Moreover, the effectiveness of the chosen SummaC metric was experimentally verified on the Ukrainian-language part of the XL-Sum corpus using different configurations and models. According to the results of the evaluation on the discrimination of factually correct/incorrect document-summary pairs, the best accuracy was achieved with the multilingual sentence embedding model. Such a result may indicate the advisability of utilizing the aforementioned model for related tasks such as the automatic construction of document-summary pairs for the generation of a silver standard Ukrainian summarization corpus. Moreover, the configuration of the SummaC metric with an NLI model showed the lowest (expected) correlation with the ROUGE score, which underlines the possibility of using this model for a further detailed analysis of factual mistakes. The analysis of the values of the chosen NLI-based metric for the ground-truth samples of the XL-Sum dataset may indicate the presence of at least 50% erroneous summaries, which matches the results of previous research. Moreover, the retrieved distribution of the metric values may indicate the presence of even more than 70% erroneous samples; however, the search for an appropriate threshold value for the considered metric still requires a more general human evaluation.</p><p>Finally, it was shown that the factual consistency metrics of model-generated summaries are higher than those of the ground-truth summaries. Nevertheless, the presence of generated summaries with an almost zero metric score may indicate the significant impact of the hallucinated dataset on the trained model. Further filtering of erroneous samples from the dataset using the considered metrics may allow the model to learn to generate more factually consistent summaries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">References</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Detection of a factually inconsistent summary sentence. The edge values indicate the maximum consistency score for each summary sentence. Since the summary sentence (s2) is not entailed by any of the document sentences, its consistency score is lower.</figDesc><graphic coords="5,147.32,84.65,300.29,211.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Density of the distribution of the SummaC_Ent metric values for the ground-truth summaries of the test part of the dataset</figDesc><graphic coords="7,118.88,115.95,357.25,245.42" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Density of the distribution of the SummaC_Ent metric values for the model-generated summaries</figDesc><graphic coords="8,119.75,103.30,355.05,234.72" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: A summary inconsistent with its document in terms of events: the entire summary statement contradicts the facts from the document (both are highlighted in orange)</figDesc><graphic coords="8,90.50,479.15,413.85,144.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: A summary contains two types of errors: a hallucinated entity (a person's name and position), marked in blue, and a contradiction of facts (the document states that the person suggests participating in a negotiation process, but the summary states the opposite)</figDesc><graphic coords="9,88.90,157.90,417.20,204.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Statistics of the generated dataset for the inconsistent summaries discrimination task: the number of samples, the average number of sentences per document, and the average number of sentences per summary</figDesc><table><row><cell>Samples number</cell><cell>Doc sentences</cell><cell>Summary sentences</cell></row><row><cell>1619</cell><cell>24.40 17.92 </cell><cell>1.43 0.65 </cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>Statistics of the EntSummaC metric for ground-truth and model-predicted summaries</figDesc><table><row><cell>Summaries</cell><cell>Median</cell><cell>IQR</cell></row><row><cell>Ground-truth</cell><cell>0.848</cell><cell>0.248</cell></row><row><cell>Model-predicted</cell><cell>0.958</cell><cell>0.186</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1907.11692</idno>
		<ptr target="http://arxiv.org/abs/1907.11692" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">SpanBERT: Improving pre-training by representing and predicting spans</title>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Weld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00300</idno>
		<ptr target="https://aclanthology.org/2020.tacl-1.5" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="64" to="77" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Saleh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.08777</idno>
		<title level="m">PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.703</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.703" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7871" to="7880" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Dialogue discourse-aware graph model and data augmentation for meeting summarization</title>
		<author>
			<persName><forename type="first">X</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Geng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Joint Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Controllable abstractive dialogue summarization with sketch supervision</title>
		<author>
			<persName><forename type="first">C.-S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stenetorp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-acl.454</idno>
		<ptr target="https://aclanthology.org/2021.findings-acl.454" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="5108" to="5122" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Evaluation of the coherence of Ukrainian texts using a transformer architecture</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kramov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pogorilyy</surname></persName>
		</author>
		<idno type="DOI">10.1109/ATIT50783.2020.9349355</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 2nd International Conference on Advanced Trends in Information Theory (ATIT)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="296" to="301" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Coreference resolution method using a convolutional neural network</title>
		<author>
			<persName><forename type="first">S</forename><surname>Pogorilyy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kramov</surname></persName>
		</author>
		<idno type="DOI">10.1109/ATIT49449.2019.9030596</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Advanced Trends in Information Theory (ATIT)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="397" to="401" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models</title>
		<author>
			<persName><forename type="first">B</forename><surname>Minixhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Paischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rekabsaz</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.naacl-main.293</idno>
		<ptr target="https://aclanthology.org/2022.naacl-main.293" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<meeting>the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics<address><addrLine>Seattle, United States</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3992" to="4006" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Automated extraction of structured information from a variety of web pages</title>
		<author>
			<persName><forename type="first">S</forename><surname>Pogorilyy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kramov</surname></persName>
		</author>
		<idno type="DOI">10.15407/pp2018.02.149</idno>
		<ptr target="https://pp.isofts.kiev.ua/ojs1/article/view/277" />
	</analytic>
	<monogr>
		<title level="j">Problems in Programming</title>
		<imprint>
			<biblScope unit="page" from="149" to="158" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">About the issue of algorithms formalized design for parallel computer architectures</title>
		<author>
			<persName><forename type="first">A</forename><surname>Anisimov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pogorilyy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vitel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied and computational mathematics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="140" to="151" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Don&apos;t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization</title>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D18-1206</idno>
		<ptr target="https://aclanthology.org/D18-1206" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1797" to="1807" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">XL-sum: Large-scale multilingual abstractive summarization for 44 languages</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bhattacharjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mubasshir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-F</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-B</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Rahman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Shahriyar</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-acl.413</idno>
		<ptr target="https://aclanthology.org/2021.findings-acl.413" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4693" to="4703" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Annotating and modeling fine-grained factuality in summarization</title>
		<author>
			<persName><forename type="first">T</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Durrett</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.naacl-main.114</idno>
		<ptr target="https://aclanthology.org/2021.naacl-main.114" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1449" to="1462" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Evaluating the factual consistency of abstractive text summarization</title>
		<author>
			<persName><forename type="first">W</forename><surname>Kryscinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>McCann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.750</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.750" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="9332" to="9346" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A broad-coverage challenge corpus for sentence understanding through inference</title>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N18-1101</idno>
		<ptr target="https://aclanthology.org/N18-1101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1112" to="1122" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Factual error correction for abstractive summarization models</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C K</forename><surname>Cheung</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.506</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.506" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="6251" to="6258" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Multi-fact correction in abstractive text summarization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C K</forename><surname>Cheung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.749</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.749" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="9320" to="9331" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Asking and answering questions to evaluate the factual consistency of summaries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.450</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.450" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5008" to="5020" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Evaluating factuality in generation with dependency-level entailment</title>
		<author>
			<persName><forename type="first">T</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Durrett</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.findings-emnlp.322</idno>
		<ptr target="https://aclanthology.org/2020.findings-emnlp.322" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3592" to="3603" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">FactGraph: Evaluating factuality in summarization with semantic graph representations</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F R</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dreyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.naacl-main.236</idno>
		<ptr target="https://aclanthology.org/2022.naacl-main.236" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<meeting>the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics<address><addrLine>Seattle, United States</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3238" to="3253" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">SummaC: Re-visiting NLI-based models for inconsistency detection in summarization</title>
		<author>
			<persName><forename type="first">P</forename><surname>Laban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schnabel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Bennett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00453</idno>
		<ptr target="https://aclanthology.org/2022.tacl-1.10" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="163" to="177" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">mFACE: Multilingual summarization with factual consistency evaluation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maynez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2212.10622</idno>
		<ptr target="https://arxiv.org/abs/2212.10622" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">mT5: A massively multilingual pre-trained text-to-text transformer</title>
		<author>
			<persName><forename type="first">L</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Al-Rfou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Siddhant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.naacl-main.41</idno>
		<ptr target="https://aclanthology.org/2021.naacl-main.41" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="483" to="498" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Stanza: A python natural language processing toolkit for many human languages</title>
		<author>
			<persName><forename type="first">P</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bolton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-demos.14</idno>
		<ptr target="https://aclanthology.org/2020.acl-demos.14" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="101" to="108" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Ukrainian stemmer</title>
		<author>
			<persName><forename type="first">V</forename><surname>Klim</surname></persName>
		</author>
		<ptr target="https://github.com/Desklop/Uk_Stemmer" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1410</idno>
		<ptr target="https://aclanthology.org/D19-1410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3982" to="3992" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">MPNet: Masked and permuted pre-training for language understanding</title>
		<author>
			<persName><forename type="first">K</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/c3a690be93aa602ee2dc0ccab5b7b67e-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="16857" to="16867" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Multilingual universal sentence encoder for semantic retrieval</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Law</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hernandez Abrego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-H</forename><surname>Sung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Strope</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kurzweil</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-demos.12</idno>
		<ptr target="https://aclanthology.org/2020.acl-demos.12" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="87" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Unsupervised cross-lingual representation learning at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.747</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.747" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="8440" to="8451" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">XNLI: Evaluating cross-lingual sentence representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rinott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D18-1269</idno>
		<ptr target="https://aclanthology.org/D18-1269" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2475" to="2485" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">Hugging Face</title>
		<author>
			<persName><forename type="first">Clément</forename><surname>Delangue</surname></persName>
		</author>
		<ptr target="https://huggingface.co/models" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">BookSum: A collection of datasets for long-form narrative summarization</title>
		<author>
			<persName><forename type="first">W</forename><surname>Kryscinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rajani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Radev</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2105.08209</idno>
		<ptr target="https://arxiv.org/abs/2105.08209" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
