A Framework for Automated Text Generation Benchmarking

Steven Layne 1,2, Sebastian Gehrmann 3, Franck Dernoncourt 1, Lidan Wang 1, Trung Bui 1 and Walter Chang 1
1 Adobe Research
2 University of Illinois Urbana-Champaign
3 Harvard University

Abstract
Researchers in areas such as translation and summarization need to compare their results to a wide range of published baselines that commonly use different evaluation methods. We aim to enable an easy comparison by presenting TextGen-Benchmarch, an open-sourced tool for streamlining the generation and evaluation of text. Text generation methods and evaluation metrics can easily be added to TextGen-Benchmarch, and its pipeline enables a more efficient comparison between methods: users supply corpora, systems, and evaluation techniques and receive comparison reports in easy-to-analyze tabular and graphical formats.

Keywords
Summarization, Text generation, Evaluation

1. Introduction

An in-depth evaluation and a fair comparison to the current literature are crucial parts of the development of Machine Learning (ML) systems. In addition to model-specific investigations, this evaluation process typically includes automated metrics that allow predictions to be compared to those of other approaches. However, subtle differences in output formatting or evaluation metrics can lead to drastically different reported results [1]. It is thus of particular importance to ensure a homogeneous evaluation environment that applies the same evaluation to each system output.

In the case of (conditional) text generation problems, the goal is to generate text that is conditioned on an input and subject to constraints defined by the task, for example, the length. Depending on the task, there are various metrics that can be applied for the evaluation, such as ROUGE [2, 3], METEOR [4], BLEU [5], NIST [6], or CIDEr [7]. A commonality between these metrics is that all of them compare a generated text against one or many, typically human-generated, references. These references are a demonstration of what an adequate result should look like for a given task and input. The metrics are used to give a quantitative measure of quality for generated text with respect to the reference(s). However, every metric uses different input and output formats. Moreover, some metrics like ROUGE and METEOR can be configured with multiple parameters. For example, the α parameter in ROUGE mediates the preference for precision or recall when computing F-measures [8]. Any change in the selection of options leads to a different result, and the evaluation options are typically not reported along with published results. Thus, it is necessary to compare outputs of multiple systems with the same options across multiple metrics to ensure a fair comparison.

We propose TextGen-Benchmarch, which simplifies the process of model evaluation by streamlining the benchmarking process and enabling the quick comparison of text-generation systems for a given task. The framework is agnostic to the underlying problem and implements a wide range of common evaluation metrics. Moreover, TextGen-Benchmarch provides a simple API to include additional models. During the evaluation, it can use either cached or user-provided predictions, or use the model API to run inference on a given sample. We demonstrate the effectiveness of the tool for the problem of extractive summarization and show how it can make a comparison between related approaches easier and more well-rounded.
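To make the sensitivity to evaluation options concrete, ROUGE-style F-measures combine precision P and recall R with a weight; one standard parameterization is

    F_α = 1 / (α / P + (1 − α) / R),

which reduces to the ordinary harmonic-mean F1 at α = 0.5. This is only an illustration: implementations differ in whether α weights precision or recall, and the exact convention should be taken from the metric's documentation [8]. Two evaluations that differ only in such an option can rank the same systems differently, which is why all compared systems must be scored with identical settings.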
2. Related Work

Some tools encapsulate different metrics into a single library so that users can evaluate their hypotheses against references using a shared interface. While these tools successfully enable the evaluation of a specific system, they are limited to a single system at a time. Therefore, each user is required to develop their own comparison.

Some libraries are also restricted in the compatible input formats. For example, the COCO (Common Objects in Context) caption evaluation library [9] provides an interface that was created to evaluate captioning results. It has support for BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. The evaluation library enables users of COCO caption to streamline the evaluation of their results, but it is limited to COCO-compatible input objects, as the library was intended to be used in the context of the MS-COCO Evaluation Server [9].

Other libraries can compare models with different input formats, but only for limited tasks. For example, Spark provides ML Pipelines (https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html), a high-level API for their data handlers. At the end of a pipeline, users may pass their results to evaluators which are designed for classification and regression models, and do not serve text generation models.

Figure 1: Demonstration of the TextGen-Benchmarch pipeline. The framework uses the Evaluator to compare gold references against predictions and generates plots and tables to summarize the results. The predictions can either take the form of predefined user input, cached outputs from previous runs, or text generated with the Text-Generator module, which can be extended with any model.

3. System Overview

The TextGen-Benchmarch framework is built in Python and provides a pipeline as illustrated in Figure 1. Before starting, TextGen-Benchmarch parses a configuration file that contains (1) the paths to datasets, (2) the systems, and (3) the metrics to be used. It additionally allows for descriptors of the text format. For example, if sentences are surrounded by tags that should be ignored during evaluation, this can be specified here. Any specified dataset must contain two sub-folders, samples and gold. The gold folder contains files with line-separated references (Python natively supports streaming line-separated files, which is why this is a formatting requirement). Samples are read in using Python's file-stream, which ensures minimal memory usage. The references can be stored either as plain text or as a JSON file to enable multiple references. In the JSON case, each line should be formatted as follows:

    {"references": ["ref 1", "ref 2", "ref 3"]}

TextGen-Benchmarch loads samples from the datasets specified in the configuration file. It parses the files and passes one document at a time to the Text-Generator. The Text-Generator returns model-generated text, which is then stored in the file system to be used during evaluation. Users may provide their own generated text in conjunction with model-generated texts or skip text generation entirely by turning off the generation in the configuration. Text generation is also skipped if TextGen-Benchmarch infers that a given dataset has already been processed with the model and is cached on the file system. If the evaluation is enabled, the user- and model-generated texts are evaluated against the reference texts. TextGen-Benchmarch currently supports ROUGE, METEOR, NIST, and BLEU scores. We provide additional details on how the library interfaces with data in Section 4.
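The exact configuration syntax is not prescribed here, so the following is a minimal hypothetical sketch of a dataset layout and configuration file that matches the description above; all field names, paths, and system identifiers are illustrative rather than the tool's actual schema:

    datasets/cnn-dm/
        samples/    (input documents)
        gold/       (line-separated references, plain text or JSON)

    {
        "datasets": ["datasets/cnn-dm", "datasets/duc-2004"],
        "systems": ["sumyLexRank", "smmrRE"],
        "metrics": ["ROUGE", "METEOR", "BLEU", "NIST"],
        "text_format": {"sentence_tag": "<s>", "pretokenized": false},
        "generate": true,
        "evaluate": true
    }

With such a file in place, a single run reads every listed corpus, produces (or reuses cached) system outputs, and scores them with every listed metric.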
4. Extending the System

TextGen-Benchmarch is designed to make it as easy as possible for users to add and remove text generators and metrics. TextGen-Benchmarch interfaces with two library files, one for metrics and one for text generators. Additions are made to these two libraries.

4.1. Adding text generators

The text generator library provides a single public method with two inputs: a targeted text generator and a text. The targeted text generator is called and returns the resulting text. A user can add additional models by adding a method for their model that takes a text as input and returns a generated text as output.

On load of the library class, information related to the format of samples is saved. This information includes separators for tokenized sentences and a Boolean that indicates whether the text is tokenized. Custom methods must use this information to decide how to preprocess the input text before passing it into the flow of their added model. Some text generators require sentences to be pre-tokenized, whereas other text generators have custom tokenizers and expect raw text. For additional convenience, we provide an interface for a tokenizer and a detokenizer with the library.
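As an illustration of what such an addition could look like, here is a minimal hypothetical sketch in Python; the names (SAMPLE_FORMAT, lead3) and the shape of the saved format information are ours and not the actual library interface:

    # Hypothetical sketch of a user-added generator method. SAMPLE_FORMAT stands in
    # for the format information the library saves on load (the sentence separator
    # and whether the input is already tokenized); it is not the real attribute name.
    SAMPLE_FORMAT = {"pretokenized": True, "sentence_separator": "</s>"}

    def lead3(text):
        """Toy extractive 'model': return the first three sentences as the summary."""
        if SAMPLE_FORMAT["pretokenized"]:
            sentences = text.split(SAMPLE_FORMAT["sentence_separator"])
        else:
            # Naive fallback split for raw, untokenized text.
            sentences = [s.strip() for s in text.split(".") if s.strip()]
        return " ".join(sentences[:3])

A method like this receives the input text, uses the saved format information to decide how to split or preprocess it, and returns plain generated text, which is all the pipeline expects from a generator.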
4.2. Adding Metrics

Adding metrics follows a process similar to the one outlined for text generators. The metric library provides a method with a single input: a custom Summary Reader Object (SRO). The SRO has two public methods, readOne and readAll. When readOne is called, a tuple of the form (prediction, references) is returned. When readAll is called, a list of all N tuples is returned, where N corresponds to the number of generated texts. The readOne and readAll methods are abstractions over Python's file-stream reader.
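To illustrate the shape of a metric built on this interface, here is a hypothetical sketch; the SRO is only assumed to expose readAll() as described above, and the metric itself is a toy unigram-overlap score rather than one of the metrics shipped with the tool:

    # Hypothetical sketch of a custom metric using the SRO interface described above.
    # sro.readAll() is assumed to return (prediction, references) tuples; everything
    # else here is illustrative.
    def unigram_overlap(sro):
        """Toy metric: average best fraction of reference unigrams covered by the prediction."""
        scores = []
        for prediction, references in sro.readAll():
            pred_tokens = set(prediction.lower().split())
            best = 0.0
            for reference in references:
                ref_tokens = set(reference.lower().split())
                if ref_tokens:
                    best = max(best, len(pred_tokens & ref_tokens) / len(ref_tokens))
            scores.append(best)
        return sum(scores) / len(scores) if scores else 0.0

A function like this, added to the metric library, can then be scored and reported alongside the built-in metrics.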
5. Report types

TextGen-Benchmarch provides the following report types.

• CSV: Fixed Metric generates a separate CSV for each metric. Each row is a different model and each column represents a corpus. This assists with comparisons of the same evaluation metric and set of summarizers against different corpora.
• CSV: Fixed Corpus generates a single report for one corpus. Each row represents a model and each column a metric. This assists with comparisons on the same corpus with a single set of models but against different metrics.
• Horizontal Barchart: Fixed Metric. Grouped by corpus, this shows scores on the X-axis, sorted by average metric score across corpora. This visualization helps draw comparisons between models across different corpora.

6. Example Reports

We demonstrate the usage of TextGen-Benchmarch on the extractive summarization problem. An extractive summary is defined as a subset of sentences from a number of documents (either one or many) that effectively summarizes the message of the input. Typical metrics for this task include ROUGE and METEOR. We present a comparison of popular non-parametric extractive summarizers on the DUC 2004 [10], ArXiv [11], and CNN-DM [12, 13] datasets. We compare smmrRE, our re-implementation of the SMMRY extractive summarizer (https://smmry.com/about), and Python's sumy summarizers (https://github.com/miso-belica/sumy).

For ArXiv and CNN-DM, we used 1,000 samples of the test set for demonstration purposes. Thus, the results should not be interpreted as official scores. They do, however, highlight some interesting variation between the performance of the summarizers across the different metrics. Figure 2 shows the METEOR scores. The order corresponds, from top to bottom, to a summarizer's rank when comparing the average score across all corpora. Here, smmrRE ranks first and sumyRandom comes in last.

Figure 2: METEOR score report on the DUC-2004, ArXiv, and CNN-DM corpora, showing the METEOR score for each system, sorted by corpus.

References

[1] S. A. Ellafi, Preprocessing and normalization for automatic evaluation of machine translation, Association for Computational Linguistics, 2005.
[2] C.-Y. Lin, E. Hovy, Automatic evaluation of summaries using n-gram co-occurrence statistics, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, Association for Computational Linguistics, 2003, pp. 71–78.
[3] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out: Proceedings of the ACL Workshop, volume 8, 2004.
[4] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, 2005, pp. 65–72.
[5] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002. URL: http://aclweb.org/anthology/P02-1040.
[6] G. Doddington, Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, 2002, pp. 138–145.
[7] R. Vedantam, C. Lawrence Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[8] C.-Y. Lin, A brief introduction of the ROUGE summary evaluation package, University of Southern California / Information Sciences Institute, 2005.
[9] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C. L. Zitnick, Microsoft COCO captions: Data collection and evaluation server, CoRR abs/1504.00325 (2015).
[10] P. Over, J. Yen, Introduction to DUC-2004: an intrinsic evaluation of generic news text summarization systems, 2004.
[11] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, N. Goharian, A discourse-aware attention model for abstractive summarization of long documents, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 615–621. URL: http://aclweb.org/anthology/N18-2097. doi:10.18653/v1/N18-2097.
[12] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, in: Advances in Neural Information Processing Systems, 2015, pp. 1693–1701.
[13] R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, B. Xiang, Abstractive text summarization using sequence-to-sequence RNNs and beyond, in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Association for Computational Linguistics, 2016, pp. 280–290. URL: http://aclweb.org/anthology/K16-1028. doi:10.18653/v1/K16-1028.