<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-label Classification of Factual Incorrectness in Machine-Generated Summaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aniket Deroy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Subhankar Maity</string-name>
          <email>subhankar.ai@kgpian.iitkgp.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saptarshi Ghosh</string-name>
          <email>saptarshi.ghosh@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IIT Kharagpur</institution>
          ,
          <addr-line>Kharagpur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This study addresses the critical issue of factual inaccuracies in machine-generated text summaries, an increasingly prevalent problem in information dissemination. Recognizing the potential of such errors to compromise information reliability, we investigate the nature of factual inconsistencies across machine-summarized content. We introduce a prompt-based classification system that categorizes errors into four distinct types: misrepresentation, inaccurate quantities or measurements, false attribution, and fabrication. The participants are tasked with evaluating a corpus of machine-generated summaries against their original articles. Our methodology employs qualitative judgements to identify the occurrence of factual distortions. The results show that our prompt-based approaches are able to detect the type of errors in the summaries to some extent, although there is scope for improvement in our classification systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine-Generated</kwd>
        <kwd>Multi-label classification</kwd>
        <kwd>Prompting</kwd>
        <kwd>Large language model (LLM)</kwd>
        <kwd>Factual Incorrectness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>In an era where information dissemination is predominantly driven by digital platforms, the
accuracy and integrity of content have become paramount. Machine-generated summaries,
designed to distill complex articles into digestible formats, have gained traction because of
their efficiency and scalability. However, the susceptibility of these systems to introduce
factual errors poses a significant challenge. This research endeavors to meticulously analyze
the prevalence of factual inaccuracies within machine-generated summaries by establishing a
systematic methodology for identification and categorization.</p>
      <p>
        We propose a novel prompt-based framework [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that empowers participants to discern
and classify factual inaccuracies into one of four distinct types: misrepresentation, inaccurate
quantities or measurements, false attribution, and fabrication. Each category embodies unique
characteristics of factual errors, ranging from subtle misinterpretations to the deliberate creation
of non-existent facts. Misrepresentation refers to the skewed presentation of information that
can alter the perceived meaning. Inaccuracies in quantities or measurements involve numerical
or statistical deviations from the truth. False attribution represents the erroneous association of
statements or actions with individuals or entities. Lastly, fabrication denotes the most egregious
breach, where information is concocted without any factual foundation.
      </p>
      <p>
        This research serves as a critical investigation into the fidelity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] of machine-generated
summaries. By scrutinizing these summaries against their source articles, we aim to quantify the
extent of factual distortions and understand their implications. The ultimate goal is to enhance
the credibility of machine-generated content, ensuring that it serves as a reliable conduit for
knowledge and information in the digital age. Our results show that our novel prompt-based
approaches are capable of detecting the type of errors in the summaries to some extent, although
there is scope for improvement in our classification systems. The task is based on [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], the original track papers.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>In recent years, the field of natural language processing (NLP) has seen a significant shift towards
the development and utilization of large language models (LLMs). These LLMs, particularly
exemplified by OpenAI’s GPT series (Generative Pre-trained Transformer), have revolutionized
various NLP tasks. The foundational concept behind these models involves pre-training on vast
amounts of text data, enabling them to learn intricate language patterns and structures.</p>
      <p>
        Zero-shot prompting with LLMs has been leveraged across various tasks. In text generation,
these LLMs exhibit the capacity to produce coherent and contextually relevant content even
when prompted by unseen topics or styles. Translation tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] benefit from zero-shot
capabilities, allowing language conversion without specific paired training data. Sentiment analysis
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], intent classification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], named entity recognition [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], and multi-label text classification
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] are among other tasks where LLMs prompted in a zero-shot manner showcase robust
performance without explicit task-oriented training. Furthermore, question-answering [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
summarization tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] witness effective output through the zero-shot prompting approach,
offering pertinent answers and concise summaries without task-specific fine-tuning.
      </p>
      <p>
        GPT-3.5 Turbo represents a significant advancement in the landscape of LLMs. It builds on
the foundation laid by GPT-3 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], showcasing scale and potential improvement in training
methodologies, although specific details regarding the “Turbo” improvements remain proprietary
to OpenAI. GPT-3.5 Turbo’s training involves self-supervised learning on an extensive and
diverse corpus of Internet text, refining its language understanding and generation capabilities.
Leveraging the zero-shot learning paradigm, it excels in performing various natural language
processing tasks without specific fine-tuning, a hallmark feature carried forward from the
GPT-3 architecture. In the context of detecting factual incorrectness in machine-generated
summaries, the zero-shot prompting method utilizing LLMs presents a promising approach.
The methodology involves instructing the model with label descriptions and tasks, allowing it
to identify and classify factual inaccuracies without direct training on specific datasets. GPT-3.5
Turbo, known for its advanced zero-shot learning capabilities, stands as a potential solution for
discerning factual errors in machine-generated content.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Dataset</title>
      <p>We have been provided with the original articles and their incorrect summaries in the
training and test sets of ILSUM Task 2. There are 8,497 articles in the training set and 200
articles in the test set.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Task Definition</title>
      <p>The task focuses on identifying factual errors in machine-generated summaries. The objective
is to categorize each datapoint into different categories based on factual incorrectness in the
summaries.</p>
      <p>Possible types of factual incorrectness:
• Misrepresentation: This involves presenting information in a way that is misleading or
gives a false impression. This could be done by exaggerating certain aspects, understating
others, or twisting facts to fit a particular narrative.
• Inaccurate Quantities or Measurements: Factual incorrectness can occur when precise
quantities, measurements, or statistics are misrepresented, whether by error or intent.
• False Attribution: Incorrectly attributing a statement, idea, or action to a person or
group is another form of factual incorrectness.
• Fabrication: Making up data, sources, or events is the most severe form of factual
incorrectness. This involves creating “facts” that have no basis in reality.</p>
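      <p>For concreteness, one common way to represent such multi-label targets is a binary indicator vector over the four classes. This encoding is an illustrative convention, not part of the task specification:</p>
      <preformat>
```python
# Illustrative multi-label encoding over the four error classes.
LABELS = ["misrepresentation", "incorrect_quantities", "false_attribution", "fabrication"]

def encode(assigned):
    """Binary indicator vector: 1 if the class applies to the summary, else 0."""
    return [int(label in assigned) for label in LABELS]

# A summary with both a wrong number and a made-up event:
print(encode({"incorrect_quantities", "fabrication"}))  # [0, 1, 0, 1]
```
      </preformat>
      <p>A datapoint can thus carry any subset of the four labels, including more than one at once, which is what makes the task multi-label rather than multi-class.</p>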
    </sec>
    <sec id="sec-6">
      <title>5. Methodology</title>
      <sec id="sec-6-1">
        <title>5.1. Why Prompting?</title>
        <p>Prompting is a valuable approach to solving multi-label classification problems for several
reasons:
– Natural Language Bridge: Prompting allows the use of natural language to bridge the
gap between machine learning models and complex tasks without the need for extensive
reprogramming or model redesign. It essentially converts the classification task into a
text generation problem, which large language models are inherently good at solving.
– Transfer Learning: Through prompting, models that have been trained on vast datasets
can apply their learned knowledge to classify data across multiple labels. This transfer
learning is efficient because it leverages pre-existing knowledge without the need for
extensive additional training on specialized datasets.
– Flexibility: Prompt-based approaches are highly flexible and easily adapted to different
tasks and domains. This is particularly useful in multi-label classification, where the
relationships and distinctions between categories can be nuanced and context-dependent.
– Efficiency: Prompting can reduce the need for the large annotated datasets that are
typically required to train multi-label classifiers. By using prompts, models can often make
predictions without any labeled examples (zero-shot classification).</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Prompting approach</title>
        <p>The prompting approach involved employing the GPT-3.5 Turbo model in zero-shot mode
for the multi-label classification task of detecting factual incorrectness in machine-generated
summaries. The approach included instructing the GPT-3.5 Turbo model in zero-shot mode,
providing a set of label descriptions, and outlining the task to be executed. The hyperparameters
are as follows: temperature = {0.5, 0.6, 0.7, 0.8, 0.9}, max-tokens = 50, and stop = None. A
diagrammatic representation of the model is shown in Figure 1. The prompt we use for the
model is provided in Figure 2.</p>
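        <p>A minimal sketch of such a zero-shot setup is shown below. The prompt wording, the helper names, and the output-parsing convention here are illustrative assumptions, not the exact prompt of Figure 2; the commented-out API call shows roughly where GPT-3.5 Turbo would be invoked with the hyperparameters above:</p>
        <preformat>
```python
# Sketch of zero-shot multi-label prompting (prompt text is an assumption,
# not the exact prompt used in our runs).

LABELS = ["misrepresentation", "incorrect_quantities", "false_attribution", "fabrication"]

LABEL_DESCRIPTIONS = {
    "misrepresentation": "information presented in a misleading way",
    "incorrect_quantities": "wrong quantities, measurements, or statistics",
    "false_attribution": "statements or actions attributed to the wrong person or group",
    "fabrication": "invented facts, data, sources, or events",
}

def build_prompt(article, summary):
    """Compose a zero-shot instruction with the label descriptions and the task."""
    desc = "\n".join(f"- {k}: {v}" for k, v in LABEL_DESCRIPTIONS.items())
    return (
        "You are given a news article and a machine-generated summary.\n"
        f"Possible types of factual incorrectness:\n{desc}\n\n"
        f"Article: {article}\nSummary: {summary}\n\n"
        "List every error type that applies, comma-separated."
    )

def parse_labels(model_output):
    """Keep only known labels from a comma-separated answer, in canonical order."""
    found = {tok.strip().lower() for tok in model_output.replace(";", ",").split(",")}
    return [label for label in LABELS if label in found]

# The actual call would look roughly like (requires the openai package and a key):
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo", temperature=0.7, max_tokens=50, stop=None,
#     messages=[{"role": "user", "content": build_prompt(article, summary)}])
```
        </preformat>
        <p>Parsing the free-text completion back into the fixed label set keeps the downstream evaluation deterministic even though the generation itself is not.</p>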
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Algorithmic approach</title>
        <p>Here we discuss the various prompt-based algorithmic approaches that we took for the
problem of multi-label error classification. (Note that all our results are non-deterministic in
nature, i.e., the same hyperparameter settings can lead to different results on different test runs.)
– Algorithm 1 first prompts the LLM to check whether the given incorrect summary belongs
to the class misrepresentation. The labels fabrication, false_attribution, and
incorrect_quantities are then checked, in that order. Once one label is predicted for a given incorrect
summary, the algorithm stops. The simple heuristic behind first checking whether the
(incorrect summary, original document) pair belongs to the misrepresentation class is that the
misrepresentation class occurs in higher proportion in the training data. The pseudocode of
the algorithm is given in Algorithm 1.
– Algorithm 2 first prompts the LLM to check whether the given incorrect summary belongs
to the class false_attribution. The labels misrepresentation, fabrication, and
incorrect_quantities are then checked, in that order. Once one label is predicted for a given incorrect
summary, the algorithm stops. The simple heuristic behind first checking the false_attribution
class is that it occurs in higher proportion in the training data. The pseudocode of the
algorithm is given in Algorithm 2.
– Algorithm 3 first prompts the LLM to check whether the given incorrect summary belongs
to the class misrepresentation. The labels false_attribution, fabrication, and
incorrect_quantities are then checked, in that order. Once two labels are predicted for a given incorrect
summary, the algorithm stops. The simple heuristic behind first checking the
misrepresentation and false_attribution classes is that these classes occur in higher proportions in the
training data. The pseudocode of the algorithm is given in Algorithm 3.
– Algorithm 4 first prompts the LLM to check whether the given incorrect summary belongs
to the class misrepresentation. The labels fabrication, false_attribution, and
incorrect_quantities are then checked, in that order. Once two labels are predicted for a given incorrect
summary, the algorithm stops. The simple heuristic behind first checking the
misrepresentation and fabrication classes is that these classes occur in higher proportions in the training
data. The pseudocode of the algorithm is given in Algorithm 4.
– Algorithm 5 first prompts the LLM to check whether the given incorrect summary belongs
to the class false_attribution. The labels misrepresentation, fabrication, and
incorrect_quantities are then checked, in that order. The algorithm stops once four labels have been
predicted for a given incorrect summary; for some data points fewer than four labels are
predicted. We run GPT-3.5 Turbo at temperatures 0.5, 0.6, 0.7, 0.8, and 0.9, and then take
an ensemble of the five test runs by keeping every label that occurred at least twice for a
particular datapoint. This ensembling helped to improve accuracy and generalization,
bringing more stability to the outputs. The pseudocode of the algorithm is given in Algorithm 5.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Results</title>
      <p>The pseudocode of the five algorithms described in Section 5.3 is given below. Each
algorithm maintains a per-datapoint counter, prompts the LLM once per candidate label, and
stops early once the target number of labels has been predicted for that datapoint.</p>
      <preformat>
Algorithm 1 (stop after 1 label):
  counter ← 0
  For all data points:
    for each label in [misrepresentation, fabrication, false_attribution, incorrect_quantities]:
      if prompted LLM decides the (article, summary) pair belongs to class label then
        output the class label; counter ← counter + 1
      if counter == 1 for this datapoint then
        stop checking further labels for this datapoint

Algorithm 2: as Algorithm 1, with label order
  [false_attribution, misrepresentation, fabrication, incorrect_quantities].

Algorithm 3: as Algorithm 1, with label order
  [misrepresentation, false_attribution, fabrication, incorrect_quantities],
  stopping when counter == 2.

Algorithm 4: as Algorithm 1, with label order
  [misrepresentation, fabrication, false_attribution, incorrect_quantities],
  stopping when counter == 2.

Algorithm 5: as Algorithm 1, with label order
  [false_attribution, misrepresentation, fabrication, incorrect_quantities],
  stopping when counter == 4; run at temperatures 0.5, 0.6, 0.7, 0.8, and 0.9,
  keeping every label predicted in at least two of the five runs.
      </preformat>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusion and Future Work</title>
      <p>We were given the task of multi-label error classification, where we had to classify each
document into four classes, namely misrepresentation, fabrication, false_attribution, and
incorrect_quantities. We tried several prompt-based algorithmic approaches for this multi-label
error classification task, given as part of Task 2. We observed that Algorithm 5 (the ensembling
approach) obtained the best results. Future work would involve trying few-shot techniques and
trying larger language models such as GPT-4.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
           ,
           <article-title>OpenPrompt: An open-source framework for prompt-learning</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kryscinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
           <article-title>Evaluating the factual consistency of abstractive text summarization</article-title>
           , CoRR abs/1910.12840 (
           <year>2019</year>
           ). URL: http://arxiv.org/abs/1910.12840. arXiv:1910.12840.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
           <article-title>Indian language summarization at FIRE 2023</article-title>
           ,
           <source>in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India, December 15-18, 2023</source>
           , ACM,
           <year>2023</year>
           .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
           <article-title>Key takeaways from the second shared task on Indian language summarization (ILSUM 2023)</article-title>
           , in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.),
           <source>Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023</source>
           , CEUR Workshop Proceedings, CEUR-WS.org,
           <year>2023</year>
           .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Prompting large language model for machine translation: A case study</article-title>
          , in:
          <source>Proceedings of the 40th International Conference on Machine Learning, ICML'23</source>
          , JMLR.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nasukawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muraoka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhattacharjee</surname>
          </string-name>
          ,
          <article-title>A simple yet strong domain-agnostic de-bias method for zero-shot sentiment classification</article-title>
          , in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2023</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>3923</fpage>
          -
          <lpage>3931</lpage>
          . URL: https://aclanthology.org/2023.findings-acl.242. doi:10.18653/v1/2023.findings-acl.242.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tumbade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Vohra</surname>
          </string-name>
          ,
          <article-title>Exploring zero and few-shot techniques for intent classification</article-title>
          , in: S. Sitaram, B. Beigman Klebanov, J. D. Williams (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>744</fpage>
          -
          <lpage>751</lpage>
          . URL: https://aclanthology.org/2023.acl-industry.71. doi:10.18653/v1/2023.acl-industry.71.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>VicunaNER: Zero/few-shot named entity recognition using Vicuna</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.03253.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ameer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Zero-shot clinical entity recognition using ChatGPT</article-title>
          ,
          <year>2023</year>
          . arXiv:2303.16416.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Label prompt for multi-label text classification</article-title>
          ,
          <source>Applied Intelligence</source>
          <volume>53</volume>
          (
          <year>2023</year>
          )
          <fpage>8761</fpage>
          -
          <lpage>8775</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Safari</surname>
          </string-name>
          ,
          <article-title>Knowledge-augmented language model prompting for zero-shot knowledge graph question answering</article-title>
          , in: E. Hruschka, T. Mitchell, S. Rahman, D. Mladenić, M. Grobelnik (Eds.),
          <source>Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023)</source>
          , Association for Computational Linguistics, Toronto, ON, Canada,
          <year>2023</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>98</lpage>
          . URL: https://aclanthology.org/2023.matching-1.7. doi:10.18653/v1/2023.matching-1.7.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fabbri</surname>
          </string-name>
          , G. Durrett,
          <article-title>Prompted opinion summarization with GPT-3.5</article-title>
          , in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2023</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>9282</fpage>
          -
          <lpage>9300</lpage>
          . URL: https://aclanthology.org/2023.findings-acl.591. doi:10.18653/v1/2023.findings-acl.591.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          , Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>