Prompted Zero-Shot Multi-label Classification of Factual Incorrectness in Machine-Generated Summaries
Aniket Deroy, Subhankar Maity and Saptarshi Ghosh
IIT Kharagpur, Kharagpur, India


Abstract
This study addresses the critical issue of factual inaccuracies in machine-generated text summaries, an
increasingly prevalent issue in information dissemination. Recognizing the potential of such errors to
compromise information reliability, we investigate the nature of factual inconsistencies across
machine-summarized content. We introduce a prompt-based classification system that categorizes errors into
four distinct types: misrepresentation, inaccurate quantities or measurements, false attribution, and
fabrication. The participants are tasked with evaluating a corpus of machine-generated summaries against
their original articles. Our methodology employs qualitative judgements to identify the occurrence of
factual distortions. The results show that our prompt-based approaches are able to detect the type of
errors in the summaries to some extent, although there is scope for improvement in our classification
systems.

Keywords
Multi-label classification, Prompting, Large language model (LLM), Factual Incorrectness


1. Introduction
In an era where information dissemination is predominantly driven by digital platforms, the
accuracy and integrity of content have become paramount. Machine-generated summaries,
designed to distill complex articles into digestible formats, have gained traction because of
their efficiency and scalability. However, the susceptibility of these systems to introduce
factual errors poses a significant challenge. This research endeavors to meticulously analyze
the prevalence of factual inaccuracies within machine-generated summaries by establishing a
systematic methodology for identification and categorization.
   We propose a novel prompt-based framework [1] that empowers participants to discern
and classify factual inaccuracies into one of four distinct types: misrepresentation, inaccurate
quantities or measurements, false attribution, and fabrication. Each category embodies unique
characteristics of factual errors, ranging from subtle misinterpretations to the deliberate creation
of non-existent facts. Misrepresentation refers to the skewed presentation of information that
can alter the perceived meaning. Inaccuracies in quantities or measurements involve numerical
or statistical deviations from the truth. False attribution represents the erroneous association of
statements or actions with individuals or entities. Lastly, fabrication denotes the most egregious
breach, where information is concocted without any factual foundation.

Forum for Information Retrieval Evaluation, December 15-18 2023, India
roydanik18@kgpian.iitkgp.ac.in (A. Deroy); subhankar.ai@kgpian.iitkgp.ac.in (S. Maity);
saptarshi.ghosh@gmail.com (S. Ghosh)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   This research serves as a critical investigation into the fidelity [2] of machine-generated
summaries. By scrutinizing these summaries against their source articles, we aim to quantify the
extent of factual distortions and understand their implications. The ultimate goal is to enhance
the credibility of machine-generated content, ensuring that it serves as a reliable conduit for
knowledge and information in the digital age. Our results show that our prompt-based
approaches are capable of detecting the type of errors in the summaries to some extent, although
there is scope for improvement in our classification systems. The task is described in the
original track papers [3, 4].


2. Related Work
In recent years, the field of natural language processing (NLP) has seen a significant shift towards
the development and utilization of large language models (LLMs). These LLMs, particularly
exemplified by OpenAI’s GPT series (Generative Pre-trained Transformer), have revolutionized
various NLP tasks. The foundational concept behind these models involves pre-training on vast
amounts of text data, enabling them to learn intricate language patterns and structures.
   Zero-shot prompting with LLMs has been leveraged across various tasks. In text generation,
these LLMs exhibit the capacity to produce coherent and contextually relevant content even
when prompted by unseen topics or styles. Translation tasks [5] benefit from zero-shot capabilities,
allowing language conversion without specific paired training data. Sentiment analysis
[6], intent classification [7], named entity recognition [8, 9], and multi-label text classification
[10] are among other tasks where LLMs prompted in a zero-shot manner showcase robust
performance without explicit task-oriented training. Furthermore, question-answering [11] and
summarization tasks [12] witness effective output through the zero-shot prompting approach,
offering pertinent answers and concise summaries without task-specific fine-tuning.
   GPT-3.5 Turbo represents a significant advancement in the landscape of LLMs. It builds on
the foundation laid by GPT-3 [13], with potential improvements in scale and training
methodologies, although specific details regarding the “Turbo” improvements remain proprietary
to OpenAI. GPT-3.5 Turbo’s training involves self-supervised learning on an extensive and
diverse corpus of Internet text, refining its language understanding and generation capabilities.
Leveraging the zero-shot learning paradigm, it excels in performing various natural language
processing tasks without specific fine-tuning, a hallmark feature carried forward from the
GPT-3 architecture. In the context of detecting factual incorrectness in machine-generated
summaries, the zero-shot prompting method utilizing LLMs presents a promising approach.
The methodology involves instructing the model with label descriptions and tasks, allowing it
to identify and classify factual inaccuracies without direct training on specific datasets. GPT-3.5
Turbo, known for its advanced zero-shot learning capabilities, stands as a potential solution for
discerning factual errors in machine-generated content.


3. Dataset
We have been provided with the original articles and incorrect summaries in the training set
and the test set of ILSUM Task 2. There are 8497 articles in the training set and 200 articles in
the test set.


4. Task Definition
The task focuses on identifying factual errors in machine-generated summaries. The objective
is to categorize each data point according to the types of factual incorrectness present in its
summary.
   Possible types of factual incorrectness:

    • Misrepresentation: This involves presenting information in a way that is misleading or
      gives a false impression. This could be done by exaggerating certain aspects, understating
      others, or twisting facts to fit a particular narrative.
    • Inaccurate Quantities or Measurements: Factual incorrectness can occur when precise
      quantities, measurements, or statistics are misrepresented, whether by error or intent.
    • False Attribution: Incorrectly attributing a statement, idea, or action to a person or
      group is another form of factual incorrectness.
    • Fabrication: Making up data, sources, or events is a severe form of factual incorrectness.
      This involves creating “facts” that have no basis in reality.
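
Because a single summary may contain more than one of these error types, each data point can be viewed as a subset of the four labels. The following is a minimal sketch of that view, assuming the label identifiers used later in this paper; the helper names `encode` and `decode` and the fixed label order are illustrative assumptions, not part of the track's data format.

```python
# Label identifiers as they appear later in this paper; the order is an assumption.
LABELS = ["misrepresentation", "incorrect_quantities", "false_attribution", "fabrication"]

def encode(labels):
    """Turn a collection of label names into a 0/1 indicator vector."""
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in LABELS]

def decode(vector):
    """Turn a 0/1 indicator vector back into a list of label names."""
    return [name for name, bit in zip(LABELS, vector) if bit]

example = encode({"fabrication", "misrepresentation"})
print(example)          # [1, 0, 0, 1]
print(decode(example))  # ['misrepresentation', 'fabrication']
```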


5. Methodology
5.1. Why Prompting?
Prompting is a valuable approach to solving multi-label classification problems for several
reasons:

   – Natural Language Bridge: Prompting allows the use of natural language to bridge the
     gap between machine learning models and complex tasks without the need for extensive
     reprogramming or model redesign. It essentially converts the classification task into a
     text generation problem, which large language models are inherently good at solving.
   – Transfer Learning: Through prompting, models that have been trained on vast datasets
     can apply their learned knowledge to classify data across multiple labels. This transfer
     learning is efficient because it leverages pre-existing knowledge without the need for
     extensive additional training on specialized datasets.
   – Flexibility: Prompt-based approaches are highly flexible and easily adapted to different
     tasks and domains. This is particularly useful in multi-label classification, where the
     relationships and distinctions between categories can be nuanced and context-dependent.
   – Efficiency: Prompting can reduce the need for the large annotated datasets that are
     typically required to train multi-label classifiers. By using prompts, models can often make
     predictions without any examples (zero-shot classification).


5.2. Prompting approach
The prompting approach employed the GPT-3.5 Turbo model in zero-shot mode for the
multi-label classification task of detecting factual incorrectness in machine-generated
summaries. The model was given a set of label descriptions and an outline of the task to be
executed, without any labeled examples. The hyperparameters are as follows: temperature ∈
{0.5, 0.6, 0.7, 0.8, 0.9}, max-tokens = 50, and stop = None. A diagrammatic representation of
the model is shown in Figure 1. The prompt we use for the model is provided in Figure 2.
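
To make the setup concrete, the following is a rough sketch of how one zero-shot query per label might be assembled together with the reported hyperparameters. The prompt wording, the label descriptions, and the function names `build_prompt` and `request_settings` are hypothetical; they are paraphrased from the task definition for illustration and do not reproduce the prompt in Figure 2 or any particular client library.

```python
# Hyperparameters reported in the text.
TEMPERATURES = [0.5, 0.6, 0.7, 0.8, 0.9]
MAX_TOKENS = 50
STOP = None

# Hypothetical label descriptions paraphrased from the task definition;
# the wording of the actual prompt (Figure 2) is not reproduced here.
LABEL_DESCRIPTIONS = {
    "misrepresentation": "information is presented in a misleading way",
    "incorrect_quantities": "quantities, measurements, or statistics are wrong",
    "false_attribution": "a statement or action is attributed to the wrong person or group",
    "fabrication": "data, sources, or events are made up",
}

def build_prompt(article, summary, label):
    """Build one zero-shot yes/no question for a single label (illustrative template)."""
    return (
        f"Article: {article}\n"
        f"Summary: {summary}\n"
        f"Label '{label}': {LABEL_DESCRIPTIONS[label]}.\n"
        f"Does the summary contain an error of type '{label}'? Answer yes or no."
    )

def request_settings(temperature):
    """Collect the sampling settings for one run; the key names are illustrative."""
    if temperature not in TEMPERATURES:
        raise ValueError("temperature outside the reported grid")
    return {"temperature": temperature, "max_tokens": MAX_TOKENS, "stop": STOP}

print(build_prompt("The team won 3-1.", "The team won 5-1.", "incorrect_quantities"))
print(request_settings(0.7))
```

Keeping the prompt builder separate from the sampling settings makes it straightforward to repeat the same query at each of the five temperatures.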


Figure 1: An overview of GPT for zero-shot multi-label classification of factual incorrectness in machine-
generated summaries.


Figure 2: Prompt used for GPT-3.5 Turbo, where XX can be misrepresentation, fabrication,
false_attribution, or incorrect_quantities.


5.3. Algorithmic approach
Here we discuss the prompt-based algorithmic approaches that we took for the problem of
multi-label error classification (see footnote 1):

    – Algorithm 1 first prompts the LLM to decide whether the given incorrect summary
      belongs to the class misrepresentation. The labels fabrication, false_attribution, and
      incorrect_quantities are then checked, in that order. As soon as we get one predicted
      label for a given incorrect summary, we stop the algorithm. The simple heuristic behind
      first checking whether the (incorrect summary, original document) pair belongs to the
      misrepresentation class is that this class occurs in higher proportions in the training
      data. The pseudocode of the algorithm is given in Algorithm 1.
    – Algorithm 2 first prompts the LLM to decide whether the given incorrect summary
      belongs to the class false_attribution. The labels misrepresentation, fabrication, and
      incorrect_quantities are then checked, in that order. As soon as we get one predicted
      label for a given incorrect summary, we stop the algorithm. The simple heuristic behind
      first checking whether the (incorrect summary, original document) pair belongs to the
      false_attribution class is that this class occurs in higher proportions in the training
      data. The pseudocode of the algorithm is given in Algorithm 2.
    – Algorithm 3 first prompts the LLM to decide whether the given incorrect summary
      belongs to the class misrepresentation. The labels false_attribution, fabrication, and
      incorrect_quantities are then checked, in that order. As soon as we get two predicted
      labels for a given incorrect summary, we stop the algorithm. The simple heuristic behind
      first checking whether the (incorrect summary, original document) pair belongs to the
      misrepresentation and false_attribution classes is that these classes occur in higher
      proportions in the training data. The pseudocode of the algorithm is given in Algorithm 3.
    – Algorithm 4 first prompts the LLM to decide whether the given incorrect summary
      belongs to the class misrepresentation. The labels fabrication, false_attribution, and
      incorrect_quantities are then checked, in that order. As soon as we get two predicted
      labels for a given incorrect summary, we stop the algorithm. The simple heuristic behind
      first checking whether the (incorrect summary, original document) pair belongs to the
      misrepresentation and fabrication classes is that these classes occur in higher
      proportions in the training data. The pseudocode of the algorithm is given in Algorithm 4.
    – Algorithm 5 first prompts the LLM to decide whether the given incorrect summary
      belongs to the class false_attribution. The labels misrepresentation, fabrication, and
      incorrect_quantities are then checked, in that order. We stop the algorithm only once we
      get four predicted labels for a given incorrect summary; there can be data points for
      which we get fewer than four predicted labels. We run GPT-3.5 Turbo at the temperatures
      0.5, 0.6, 0.7, 0.8, and 0.9. We then take an ensemble of the five test runs obtained at
      these temperatures by keeping, for each data point, all the labels that occurred at
      least twice. The ensembling method that we tried helped in providing improved accuracy
      and generalization, bringing more stability into the nature of the outputs. The
      pseudocode of the algorithm is given in Algorithm 5.

    1
     We want to specify that all our results are non-deterministic in nature, i.e., the same
     hyperparameter settings can lead to different results on different test runs.
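
The ensembling step described for Algorithm 5 can be sketched as a simple vote over the runs. This is a minimal illustration, assuming each run yields a list of predicted labels for the same data point; the function name `ensemble`, the parameter `min_votes`, and the example predictions are made up for illustration.

```python
from collections import Counter

def ensemble(runs, min_votes=2):
    """Keep every label predicted in at least `min_votes` of the runs.

    `runs` holds one collection of predicted labels per temperature setting,
    all for the same data point.
    """
    votes = Counter(label for run in runs for label in set(run))
    return sorted(label for label, count in votes.items() if count >= min_votes)

# Made-up predictions for one data point at temperatures 0.5 to 0.9.
runs = [
    ["misrepresentation"],
    ["misrepresentation", "fabrication"],
    ["false_attribution"],
    ["fabrication"],
    [],
]
print(ensemble(runs))  # ['fabrication', 'misrepresentation']
```

A label that appears in only one run (here `false_attribution`) is dropped, which is one way such a vote can dampen run-to-run variation.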


6. Results
Table 1 shows the macro-F1 score considering both correct and incorrect labels. Table 2 shows
the macro-F1 score considering only correct labels. We tried five different prompting-based
algorithms. The best result is obtained by Algorithm 5 (ensembling approach), as shown in
both Table 1 and Table 2.

                     Team Name      Method                   Run No.   Macro-F1
                     Text Titans    Algorithm 1              1         0.044
                     Text Titans    Algorithm 2              2         0.024
                     Text Titans    Algorithm 3              3         0.089
                     Text Titans    Algorithm 4              4         0.112
                     Text Titans    Algorithm 5 (Ensemble)   5         0.156

Table 1
Macro-F1 score considering both correct and incorrect labels.


                     Team Name      Method                   Run No.   Macro-F1
                     Text Titans    Algorithm 1              1         0.152
                     Text Titans    Algorithm 2              2         0.093
                     Text Titans    Algorithm 3              3         0.291
                     Text Titans    Algorithm 4              4         0.355
                     Text Titans    Algorithm 5 (Ensemble)   5         0.527

Table 2
Macro-F1 score considering only correct labels.
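
For readers unfamiliar with the metric, macro-F1 averages the per-label F1 scores with equal weight per label. The sketch below shows this standard computation for multi-label predictions; it is not the track's scoring script, and the convention of scoring a label as 0 when it never occurs in either the gold or the predicted labels is an assumption, as are the example data.

```python
LABELS = ["misrepresentation", "incorrect_quantities", "false_attribution", "fabrication"]

def f1(tp, fp, fn):
    """F1 from counts; returns 0.0 when the label never occurs (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 over parallel lists of label sets."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

# Made-up gold and predicted label sets for three data points.
gold = [{"misrepresentation"}, {"fabrication"}, {"misrepresentation", "false_attribution"}]
pred = [{"misrepresentation"}, {"misrepresentation"}, {"false_attribution"}]
print(macro_f1(gold, pred))  # 0.375
```

In this example the per-label scores are 0.5, 0.0, 1.0, and 0.0, so a label that is never predicted correctly pulls the average down as much as any other label.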

Input: data points d, each an (article, incorrect summary) pair; limit = 1
for each data point d do
    counter[d] ← 0
end
for each label c in (misrepresentation, fabrication, false_attribution, incorrect_quantities), in this order do
    for each data point d with counter[d] < limit do
        if the prompted LLM judges that the article and the corresponding summary of d belong to class c then
            output the class c for d
            counter[d] ← counter[d] + 1
        else
            do not perform any action
        end
    end
end
(once counter[d] == limit, no further labels are checked for data point d)
                               Algorithm 1: Pseudocode
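
The pseudocode above can also be read as a short routine that checks labels in a fixed order and stops once a given number of labels has been predicted. The sketch below is one possible reading in Python, not the authors' implementation; `classify`, `ask`, and `stub_ask` are hypothetical names, and `ask` stands in for the prompted LLM so that the example runs without any model access.

```python
def classify(article, summary, label_order, limit, ask):
    """Check labels in order and stop once `limit` labels have been predicted.

    `ask(article, summary, label)` is a hypothetical callable standing in for
    the prompted LLM; it returns True when the model assigns the label.
    """
    predicted = []
    for label in label_order:
        if ask(article, summary, label):
            predicted.append(label)
            if len(predicted) == limit:
                break  # stop checking further labels for this data point
    return predicted

# Label order used by Algorithm 1.
ORDER_1 = ["misrepresentation", "fabrication", "false_attribution", "incorrect_quantities"]

def stub_ask(article, summary, label):
    # Stand-in for the model: pretends that two labels apply.
    return label in {"fabrication", "incorrect_quantities"}

print(classify("article text", "summary text", ORDER_1, 1, stub_ask))
# ['fabrication']
print(classify("article text", "summary text", ORDER_1, 2, stub_ask))
# ['fabrication', 'incorrect_quantities']
```

Under this reading, the five algorithms differ only in the label order and in the value of `limit` (1, 2, or 4).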

Input: data points d, each an (article, incorrect summary) pair; limit = 1
for each data point d do
    counter[d] ← 0
end
for each label c in (false_attribution, misrepresentation, fabrication, incorrect_quantities), in this order do
    for each data point d with counter[d] < limit do
        if the prompted LLM judges that the article and the corresponding summary of d belong to class c then
            output the class c for d
            counter[d] ← counter[d] + 1
        else
            do not perform any action
        end
    end
end
(once counter[d] == limit, no further labels are checked for data point d)
                               Algorithm 2: Pseudocode

Input: data points d, each an (article, incorrect summary) pair; limit = 2
for each data point d do
    counter[d] ← 0
end
for each label c in (misrepresentation, false_attribution, fabrication, incorrect_quantities), in this order do
    for each data point d with counter[d] < limit do
        if the prompted LLM judges that the article and the corresponding summary of d belong to class c then
            output the class c for d
            counter[d] ← counter[d] + 1
        else
            do not perform any action
        end
    end
end
(once counter[d] == limit, no further labels are checked for data point d)
                               Algorithm 3: Pseudocode

Input: data points d, each an (article, incorrect summary) pair; limit = 2
for each data point d do
    counter[d] ← 0
end
for each label c in (misrepresentation, fabrication, false_attribution, incorrect_quantities), in this order do
    for each data point d with counter[d] < limit do
        if the prompted LLM judges that the article and the corresponding summary of d belong to class c then
            output the class c for d
            counter[d] ← counter[d] + 1
        else
            do not perform any action
        end
    end
end
(once counter[d] == limit, no further labels are checked for data point d)
                               Algorithm 4: Pseudocode

Input: data points d, each an (article, incorrect summary) pair; limit = 4
for each data point d do
    counter[d] ← 0
end
for each label c in (misrepresentation, false_attribution, fabrication, incorrect_quantities), in this order do
    for each data point d with counter[d] < limit do
        if the prompted LLM judges that the article and the corresponding summary of d belong to class c then
            output the class c for d
            counter[d] ← counter[d] + 1
        else
            do not perform any action
        end
    end
end
(once counter[d] == limit, no further labels are checked for data point d)
                               Algorithm 5: Pseudocode


7. Conclusion and Future Work
We were given the task of multi-label error classification, where each document had to be
classified with respect to four classes, namely misrepresentation, fabrication, false_attribution,
and incorrect_quantities. We tried several prompt-based algorithmic approaches for the
multi-label error classification task that we were given as a part of Task 2. We observed that
Algorithm 5 (the ensembling approach) obtained the best results. Future work would involve
trying a few-shot technique and trying larger language models such as GPT-4.


References
 [1] N. Ding, S. Hu, W. Zhao, Y. Chen, Z. Liu, H. Zheng, M. Sun, Openprompt: An open-source
     framework for prompt-learning, in: Proceedings of the 60th Annual Meeting of the
     Association for Computational Linguistics: System Demonstrations, 2022, pp. 105–113.
 [2] W. Kryscinski, B. McCann, C. Xiong, R. Socher, Evaluating the factual consistency of
     abstractive text summarization, CoRR abs/1910.12840 (2019). URL:
     http://arxiv.org/abs/1910.12840. arXiv:1910.12840.
 [3] S. Satapara, P. Mehta, S. Modha, D. Ganguly, Indian language summarization at fire
     2023, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval
     Evaluation, FIRE 2023, Goa, India. December 15-18, 2023, ACM, 2023.
 [4] S. Satapara, P. Mehta, S. Modha, D. Ganguly, Key takeaways from the second shared task
     on indian language summarization (ilsum 2023), in: K. Ghosh, T. Mandl, P. Majumder,
     M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation,
     Goa, India. December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.
 [5] B. Zhang, B. Haddow, A. Birch, Prompting large language model for machine translation:
     A case study, in: Proceedings of the 40th International Conference on Machine Learning,
     ICML’23, JMLR.org, 2023.
 [6] Y. Zhao, T. Nasukawa, M. Muraoka, B. Bhattacharjee, A simple yet strong domain-agnostic
     de-bias method for zero-shot sentiment classification, in: A. Rogers, J. Boyd-Graber,
     N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023,
     Association for Computational Linguistics, Toronto, Canada, 2023, pp. 3923–3931. URL:
     https://aclanthology.org/2023.findings-acl.242. doi:10.18653/v1/2023.findings-acl.242.
 [7] S. Parikh, M. Tiwari, P. Tumbade, Q. Vohra, Exploring zero and few-shot techniques for
     intent classification, in: S. Sitaram, B. Beigman Klebanov, J. D. Williams (Eds.), Proceedings
     of the 61st Annual Meeting of the Association for Computational Linguistics (Volume
     5: Industry Track), Association for Computational Linguistics, Toronto, Canada, 2023,
     pp. 744–751. URL: https://aclanthology.org/2023.acl-industry.71.
     doi:10.18653/v1/2023.acl-industry.71.
 [8] B. Ji, Vicunaner: Zero/few-shot named entity recognition using vicuna, 2023.
     arXiv:2305.03253.
 [9] Y. Hu, I. Ameer, X. Zuo, X. Peng, Y. Zhou, Z. Li, Y. Li, J. Li, X. Jiang, H. Xu, Zero-shot
     clinical entity recognition using chatgpt, 2023. arXiv:2303.16416.
[10] R. Song, Z. Liu, X. Chen, H. An, Z. Zhang, X. Wang, H. Xu, Label prompt for multi-label
     text classification, Applied Intelligence 53 (2023) 8761–8775.
[11] J. Baek, A. Aji, A. Saffari, Knowledge-augmented language model prompting for
     zero-shot knowledge graph question answering, in: E. Hruschka, T. Mitchell, S. Rahman,
     D. Mladenić, M. Grobelnik (Eds.), Proceedings of the First Workshop on Matching From
     Unstructured and Structured Data (MATCHING 2023), Association for Computational
     Linguistics, Toronto, ON, Canada, 2023, pp. 70–98. URL:
     https://aclanthology.org/2023.matching-1.7. doi:10.18653/v1/2023.matching-1.7.
[12] A. Bhaskar, A. Fabbri, G. Durrett, Prompted opinion summarization with GPT-3.5, in:
     A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational
     Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023,
     pp. 9282–9300. URL: https://aclanthology.org/2023.findings-acl.591.
     doi:10.18653/v1/2023.findings-acl.591.
[13] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan,
     R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
     S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei,
     Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan,
     H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran
     Associates, Inc., 2020, pp. 1877–1901. URL:
     https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.