Prompted Zero-Shot Multi-label Classification of Factual Incorrectness in Machine-Generated Summaries Aniket Deroy1 , Subhankar Maity1 and Saptarshi Ghosh1 1 IIT Kharagpur, Kharagpur, India Abstract This study addresses the critical issue of factual inaccuracies in machine-generated text summaries, an increasingly prevalent issue in information dissemination. Recognizing the potential of such errors to compromise information reliability, we investigate the nature of factual inconsistencies across machine- summarized content. We introduce a prompt-based classification system that categorizes errors into four distinct types: misrepresentation, inaccurate quantities or measurements, false attribution, and fabrication. The participants are tasked with evaluating a corpus of machine-generated summaries against their original articles. Our methodology employs qualitative judgements to identify the occurrence of factual distortions. The results show that our prompt-based approaches are able to detect the type of errors in the summaries to some extent, although there is scope for improvement in our classification systems. Keywords Multi-label classification, Prompting, Large language model (LLM), Factual Incorrectness 1. Introduction In an era where information dissemination is predominantly driven by digital platforms, the accuracy and integrity of content have become paramount. Machine-generated summaries, designed to distill complex articles into digestible formats, have gained traction because of their efficiency and scalability. However, the susceptibility of these systems to introduce factual errors poses a significant challenge. This research endeavors to meticulously analyze the prevalence of factual inaccuracies within machine-generated summaries by establishing a systematic methodology for identification and categorization. We propose a novel prompt-based framework [1] that empowers participants to discern and classify factual inaccuracies into one of four distinct types: misrepresentation, inaccurate quantities or measurements, false attribution, and fabrication. Each category embodies unique characteristics of factual errors, ranging from subtle misinterpretations to the deliberate creation of non-existent facts. Misrepresentation refers to the skewed presentation of information that can alter the perceived meaning. Inaccuracies in quantities or measurements involve numerical or statistical deviations from the truth. False attribution represents the erroneous association of Forum for Information Retrieval Evaluation, December 15-18 2023, India Envelope-Open roydanik18@kgpian.iitkgp.ac.in (A. Deroy); subhankar.ai@kgpian.iitkgp.ac.in (S. Maity); saptarshi.ghosh@gmail.com (S. Ghosh) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings statements or actions with individuals or entities. Lastly, fabrication denotes the most egregious breach, where information is concocted without any factual foundation. This research serves as a critical investigation into the fidelity [2] of machine-generated summaries. By scrutinizing these summaries against their source articles, we aim to quantify the extent of factual distortions and understand their implications. The ultimate goal is to enhance the credibility of machine-generated content, ensuring that it serves as a reliable conduit for knowledge and information in the digital age. Our results show that our novel prompt-based approaches are capable of detecting the type of errors in the summaries to some extent, although there is scope for improvement for our classification systems. The task is based on [3, 4] which are the original track papers. 2. Related Work In recent years, the field of natural language processing (NLP) has seen a significant shift towards the development and utilization of large language models (LLMs). These LLMs, particularly exemplified by OpenAI’s GPT series (Generative Pre-trained Transformer), have revolutionized various NLP tasks. The foundational concept behind these models involves pre-training on vast amounts of text data, enabling them to learn intricate language patterns and structures. Zero-shot prompting with LLMs has been leveraged across various tasks. In text generation, these LLMs exhibit the capacity to produce coherent and contextually relevant content even when prompted by unseen topics or styles. Translation tasks [5] benefit from zero-shot capabil- ities, allowing language conversion without specific paired training data. Sentiment analysis [6], intent classification [7], named entity recognition [8, 9], and multi-label text classification [10] are among other tasks where LLMs prompted in a zero-shot manner showcase robust performance without explicit task-oriented training. Furthermore, question-answering [11] and summarization tasks [12] witness effective output through the zero-shot prompting approach, offering pertinent answers and concise summaries without task-specific fine-tuning. GPT-3.5 Turbo represents a significant advancement in the landscape of LLMs. It builds on the foundation laid by GPT-3 [13], showcasing scale and potential improvement in training methodologies, although specific details regarding the “Turbo” improvements remain proprietary to OpenAI. GPT-3.5 Turbo’s training involves self-supervised learning on an extensive and diverse corpus of Internet text, refining its language understanding and generation capabilities. Leveraging the zero-shot learning paradigm, it excels in performing various natural language processing tasks without specific fine-tuning, a hallmark feature carried forward from the GPT-3 architecture. In the context of detecting factual incorrectness in machine-generated summaries, the zero-shot prompting method utilizing LLMs presents a promising approach. The methodology involves instructing the model with label descriptions and tasks, allowing it to identify and classify factual inaccuracies without direct training on specific datasets. GPT 3.5 Turbo, known for its advanced zero-shot learning capabilities, stands as a potential solution for discerning factual errors in machine-generated content. 3. Dataset We have been provided with the original articles and incorrect summaries in the training set and the testing set of ILSUM task 2. There are 8497 articles in the train set and 200 articles in the test set. 4. Task Definition The task focuses on identifying factual errors in machine-generated summaries. The objective is to categorize each datapoint into different categories based on factual incorrectness in the summaries. Possible types of factual incorrectness: • Misrepresentation: This involves presenting information in a way that is misleading or gives a false impression. This could be done by exaggerating certain aspects, understating others, or twisting facts to fit a particular narrative. • Inaccurate Quantities or Measurements: Factual incorrectness can occur when precise quantities, measurements, or statistics are misrepresented, whether by error or intent. • False Attribution: Incorrectly attributing a statement, idea, or action to a person or group is another form of factual incorrectness. • Fabrication: Making up data, sources, or events is a severe form of factual incorrectness. This involves creating “facts” that have no basis in reality. 5. Methodology 5.1. Why Prompting? Prompting is a valuable approach to solving multilabel classification problems for several reasons: – Natural Language Bridge: Prompting allows the use of natural language to bridge the gap between machine learning models and complex tasks without the need for extensive reprogramming or model redesign. It essentially converts the classification task into a text generation problem, which large language models are inherently good at solving. – Transfer Learning: Through prompting, models that have been trained on vast datasets can apply their learned knowledge to classify data across multiple labels. This transfer learning is efficient because it leverages pre-existing knowledge without the need for extensive additional training on specialized datasets. – Flexibility: Prompt-based approaches are highly flexible and easily adapted to different tasks and domains. This is particularly useful in multilabel classification, where the relationships and distinctions between categories can be nuanced and context-dependent. – Efficiency: Prompting can reduce the need for large annotated datasets that are typi- cally required to train multi-label classifiers. By using prompts, models can often make predictions without any examples(Zero-Shot classification). 5.2. Prompting approach The prompting approach involved employing the GPT-3.5 Turbo model in zero-shot mode for the multi-label classification task of detecting factual incorrectness in machine-generated summaries. The approach included instructing the GPT-3.5 Turbo model in zero-shot mode, providing a set of label descriptions, and outlining the task to be executed. The hyperparameters are as follows: temperature = {0.5, 0.6, 0.7, 0.8, 0.9}, max-tokens = 50, and stop = None. A diagrammatic representation of the model is shown in Figure 1. The prompt we use for the model is provided in Figure 2. Figure 1: An overview of GPT for zero-shot multi-label classification of factual incorrectness in machine- generated summaries. Figure 2: Prompt used for GPT-3.5 Turbo. Where, XX can be misrepresentation, fabrication, false_attri- bution, and incorrect_quantities. 5.3. Algorithmic approach Here we discuss the various prompt based algorithmic approaches that we took to attempt the problem of multi-label error classification:- 1 – Algorithm 1 tries to prompt the LLM to understand whether the given incorrect summary belongs to the class misrepresentation. Then the labels fabrication, false_attribution, and incorrect_quantities are checked, respectively. If we get one predicted label for a given incorrect summary, we stop the algorithm. The simple heuristic behind first checking whether the (incorrect summary, original document) pair belongs to class misrepresentation is the fact that the misrepresentation class occurs in higher proportions in the training data. The pseudocode of the algorithm is given in Algorithm 1. – Algorithm 2 tries to prompt the LLM in order to understand whether the given incorrect summary belongs to the class false_attribution. Then the labels misrepresentation, fabri- cation, and incorrect_quantities are checked, respectively. If we get one predicted label for a given incorrect summary, we stop the algorithm. The simple heuristic behind first checking whether the (incorrect summary, original document) pair belongs to the false_at- tribution class is the fact that the false_attribution class occurs in higher proportions in the training data. The pseudocode of the algorithm is given in Algorithm 2. – Algorithm 3 tries to prompt the LLM in order to understand whether the given incorrect summary belongs to the class misrepresentation. Then, the labels false_attribution, fabrication, and incorrect_quantities are checked, respectively. If we get two predicted labels for a given incorrect summary, we stop the algorithm. The simple heuristic behind first checking whether the (incorrect summary, original document) pair belongs to the misrepresentation, and false_attribution class is the fact that the misrepresentation, and false_attribution class occurs in higher proportions in the training data. The pseudocode of the algorithm is given in Algorithm 3. – Algorithm 4 tries to prompt the LLM in order to understand whether the given incorrect summary belongs to the class misrepresentation. Then the labels fabrication, false_attri- bution, and incorrect_quantities are checked, respectively. If we get two predicted labels for a given incorrect summary we stop the algorithm.The simple heuristic behind first checking whether the (incorrect summary, original document) pair belongs to the misrep- resentation, and fabrication class is the fact that the misrepresentation, and fabrication class occurs in higher proportions in the training data. The pseudocode of the algorithm is given in Algorithm 4. – Algorithm 5 tries to prompt the LLM in order to understand whether the given incor- rect summary belongs to the class false_attribution. Then the labels misrepresentation, fabrication, and incorrect_quantities are checked, respectively. If we get four predicted labels for a given incorrect summary, we stop the algorithm. There can be data points for which we get less than four correct data points. We run GPT-3.5 Turbo at different temperatures 0.5, 0.6, 0.7, 0.8, and 0.9 respectively. Then we take an ensemble of the five output test runs being run at different temperatures by considering all the labels that 1 We want to specify that all our results are non-deterministic in nature i.e. the same hyperparameter settings can lead to different results on different test runs. occurred at least twice for a particular datapoint. The ensembling method that we tried helped in providing improved accuracy and generalization bringing more stability into the nature of outputs. The pseudocode of the algorithm is given in Algorithm 5. 6. Results Table 1 shows the macro-F1 score considering both correct and incorrect labels. Table 2 shows the macro-F1 score considering only correct labels. We tried five different prompting-based algorithms. The best result is obtained for Algorithm 5 (Ensembling approach) which we explored in both Table 1 and Table 2. Team Name Method Run No. Macro-F1 Text Titans Algorithm 1 1 0.044 Text Titans Algorithm 2 2 0.024 Text Titans Algorithm 3 3 0.089 Text Titans Algorithm 4 4 0.112 Text Titans Algorithm 5(Ensemble) 5 0.156 Table 1 Macro-F1 score considering both correct and incorrect labels. Team Name Method Run No. Macro-F1 Text Titans Algorithm 1 1 0.152 Text Titans Algorithm 2 2 0.093 Text Titans Algorithm 3 3 0.291 Text Titans Algorithm 4 4 0.355 Text Titans Algorithm 5(Ensemble) 5 0.527 Table 2 Macro-F1 score considering only correct labels. 𝑐𝑜𝑢𝑛𝑡𝑒𝑟(𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠) ← 0; For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class misrepresentation then Then output the class misrepresentation counter(variable) = counter(variable)+1 if counter(variable)==1 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class fabrication then Then output the class fabrication counter(variable) = counter(variable)+1 if counter(variable)==1 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class false_attribution then Then output the class false_attribution counter(variable) = counter(variable)+1 if counter(variable)==1 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class incorrect_quantities then Then output the class incorrect_quantities counter(variable) = counter(variable)+1 if counter(variable)==1 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end Algorithm 1: Pseudocode 𝑐𝑜𝑢𝑛𝑡𝑒𝑟(𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠) ← 0; For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class false_attribution then Then output the class false_attribution counter(variable) = counter(variable)+1 if counter(variable)==1 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class misrepresentation then Then output the class misrepresentation counter(variable) = counter(variable)+1 if counter(variable)==1 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class fabrication then Then output the class fabrication counter(variable) = counter(variable)+1 if counter(variable)==1 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class incorrect_quantities then Then output the class incorrect_quantities counter(variable) = counter(variable)+1 if counter(variable)==1 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end Algorithm 2: Pseudocode 𝑐𝑜𝑢𝑛𝑡𝑒𝑟(𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠) ← 0; For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class misrepresentation then Then output the class misrepresentation counter(variable) = counter(variable)+1 if counter(variable)==2 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the correponding summary belongs to class false_attribution then Then output the class false_attribution counter(variable) = counter(variable)+1 if counter(variable)==2 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class fabrication then Then output the class fabrication counter(variable) = counter(variable)+1 if counter(variable)==2 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class incorrect_quantities then Then output the class incorrect_quantities counter(variable) = counter(variable)+1 if counter(variable)==2 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end Algorithm 3: Pseudocode 𝑐𝑜𝑢𝑛𝑡𝑒𝑟(𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠) ← 0; For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class misrepresentation then Then output the class misrepresentation counter(variable) = counter(variable)+1 if counter(variable)==2 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class fabrication then Then output the class fabrication counter(variable) = counter(variable)+1 if counter(variable)==2 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class false_attribution then Then output the class false_attribution counter(variable) = counter(variable)+1 if counter(variable)==2 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the correponding summary belongs to class incorrect_quantities then Then output the class incorrect_quantities counter(variable) = counter(variable)+1 if counter(variable)==2 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end Algorithm 4: Pseudocode 𝑐𝑜𝑢𝑛𝑡𝑒𝑟(𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠) ← 0; For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class misrepresentation then Then output the class misrepresentation counter(variable) = counter(variable)+1 if counter(variable)==4 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class false_attribution then Then output the class false_attribution counter(variable) = counter(variable)+1 if counter(variable)==4 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class fabrication then Then output the class fabrication counter(variable) = counter(variable)+1 if counter(variable)==4 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end For all data points: if prompted LLM to check whether the given article and the corresponding summary belong to class incorrect_quantities then Then output the class incorrect_quantities counter(variable) = counter(variable)+1 if counter(variable)==4 for a datapoint then stop checking for the next label for that datapoint else Do not perform any action end else Do not perform any action end Algorithm 5: Pseudocode 7. Conclusion and Future Work We were given the task of multi-label error classification where we had to classify a document into four classes namely misrepresentation, fabrication, false_attribution, and incorrect_quanti- ties. We tried several prompt-based algorithmic approaches for the multi-label error classifica- tion task that we were given as a part of Task 2. We observed that the Algorithm 5 (Ensembling approach) that we explored obtained the best results. Future work would involve trying a few-shot technique and trying larger language models such as GPT-4. References [1] N. Ding, S. Hu, W. Zhao, Y. Chen, Z. Liu, H. Zheng, M. Sun, Openprompt: An open-source framework for prompt-learning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2022, pp. 105–113. [2] W. Kryscinski, B. McCann, C. Xiong, R. Socher, Evaluating the factual consistency of abstractive text summarization, CoRR abs/1910.12840 (2019). URL: http://arxiv.org/abs/ 1910.12840. arXiv:1910.12840 . [3] S. Satapara, P. Mehta, S. Modha, D. Ganguly, Indian language summarization at fire 2023, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India. December 15-18, 2023, ACM, 2023. [4] S. Satapara, P. Mehta, S. Modha, D. Ganguly, Key takeaways from the second shared task on indian language summarization (ilsum 2023), in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India. December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023. [5] B. Zhang, B. Haddow, A. Birch, Prompting large language model for machine translation: A case study, in: Proceedings of the 40th International Conference on Machine Learning, ICML’23, JMLR.org, 2023. [6] Y. Zhao, T. Nasukawa, M. Muraoka, B. Bhattacharjee, A simple yet strong domain-agnostic de-bias method for zero-shot sentiment classification, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, As- sociation for Computational Linguistics, Toronto, Canada, 2023, pp. 3923–3931. URL: https: //aclanthology.org/2023.findings-acl.242. doi:10.18653/v1/2023.findings- acl.242 . [7] S. Parikh, M. Tiwari, P. Tumbade, Q. Vohra, Exploring zero and few-shot techniques for intent classification, in: S. Sitaram, B. Beigman Klebanov, J. D. Williams (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 744–751. URL: https://aclanthology.org/2023.acl-industry.71. doi:10.18653/v1/2023. acl- industry.71 . [8] B. Ji, Vicunaner: Zero/few-shot named entity recognition using vicuna, 2023. arXiv:2305.03253 . [9] Y. Hu, I. Ameer, X. Zuo, X. Peng, Y. Zhou, Z. Li, Y. Li, J. Li, X. Jiang, H. Xu, Zero-shot clinical entity recognition using chatgpt, 2023. arXiv:2303.16416 . [10] R. Song, Z. Liu, X. Chen, H. An, Z. Zhang, X. Wang, H. Xu, Label prompt for multi-label text classification, Applied Intelligence 53 (2023) 8761–8775. [11] J. Baek, A. Aji, A. Saffari, Knowledge-augmented language model prompting for zero- shot knowledge graph question answering, in: E. Hruschka, T. Mitchell, S. Rahman, D. Mladenić, M. Grobelnik (Eds.), Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023), Association for Computational Linguistics, Toronto, ON, Canada, 2023, pp. 70–98. URL: https://aclanthology.org/2023. matching-1.7. doi:10.18653/v1/2023.matching- 1.7 . [12] A. Bhaskar, A. Fabbri, G. Durrett, Prompted opinion summarization with GPT-3.5, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 9282–9300. URL: https://aclanthology.org/2023.findings-acl.591. doi:10.18653/v1/ 2023.findings- acl.591 . [13] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/ paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.