A Framework for Automated Text Generation Benchmarking

Steven Layne 1,2, Sebastian Gehrmann 3, Franck Dernoncourt 1, Lidan Wang 1, Trung Bui 1 and Walter Chang 1
1 Adobe Research
2 University of Illinois Urbana-Champaign
3 Harvard University

Abstract
Researchers in areas such as translation and summarization need to compare their results to a wide range of published baselines that commonly use different evaluation methods. We aim to enable an easy comparison by presenting TextGen-Benchmarch, an open-sourced tool for streamlining the generation and evaluation of text. Text generation methods and evaluation metrics can easily be added to TextGen-Benchmarch, and its pipeline enables a more efficient comparison between methods: users supply corpora, systems, and evaluation techniques and receive comparison reports in easy-to-analyze tabular and graphical formats.

Keywords
Summarization, Text generation, Evaluation

1. Introduction

An in-depth evaluation and a fair comparison to the current literature are crucial parts of the development of Machine Learning (ML) systems. In addition to model-specific investigations, this evaluation process typically includes automated metrics that allow predictions to be compared to those of other approaches. However, subtle differences in output formatting or evaluation metrics can lead to drastically different reported results [1]. It is thus of particular importance to ensure a homogeneous evaluation environment that applies the same evaluation to each system output.

In the case of (conditional) text generation problems, the goal is to generate text that is conditioned on an input and subject to constraints defined by the task, for example, the length. Depending on the task, there are various metrics that can be applied for the evaluation, such as ROUGE [2, 3], METEOR [4], BLEU [5], NIST [6], or CIDEr [7]. A commonality between these metrics is that all of them compare a generated text against one or many, typically human-generated, references. These references are a demonstration of what an adequate result should look like for a given task and input. The metrics are used to give a quantitative measure of quality for generated text with respect to the reference(s). However, every metric uses different input and output formats. Moreover, some metrics like ROUGE and METEOR can be configured with multiple parameters. For example, the α parameter in ROUGE mediates the preference for precision or recall when computing F-measures [8]. Any change in the selection of options leads to a different result, and the evaluation options are typically not reported along with published results. Thus, it is necessary to compare outputs of multiple systems with the same options across multiple metrics to ensure a fair comparison.

We propose TextGen-Benchmarch, which simplifies the process of model evaluation by streamlining the benchmarking process and enabling the quick comparison of text-generation systems for a given task. The framework is agnostic to the underlying problem and implements a wide range of common evaluation metrics. Moreover, TextGen-Benchmarch provides a simple API to include additional models. During the evaluation, it can use either cached or user-provided predictions, or use the model API to run inference on a given sample. We demonstrate the effectiveness of the tool for the problem of extractive summarization and show how it can make a comparison between related approaches easier and more well-rounded.
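To make the sensitivity to evaluation options concrete, ROUGE-style F-measures combine precision P and recall R with a weight; one standard parameterization is

    F_α = 1 / (α / P + (1 − α) / R),

which reduces to the ordinary harmonic-mean F1 at α = 0.5. This is only an illustration: implementations differ in whether α weights precision or recall, and the exact convention should be taken from the metric's documentation [8]. Two evaluations that differ only in such an option can rank the same systems differently, which is why all compared systems must be scored with identical settings.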
2. Related Work

Some tools encapsulate different metrics into a single library so that users can evaluate their hypotheses against references using a shared interface. While these tools successfully enable the evaluation of a specific system, they are limited to a single system at a time. Therefore, each user is required to develop their own comparison.

Some libraries are also restricted in the compatible input formats. For example, the COCO (Common Objects in Context) caption evaluation library [9] provides an interface that was created to evaluate captioning results. It has support for BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. The evaluation library enables users of COCO caption to streamline the evaluation of their results, but it is limited to COCO-compatible input objects, as the library was intended to be used in the context of the MS-COCO Evaluation Server [9].

Other libraries can compare models with different input formats, but only for limited tasks. For example, Spark provides ML Pipelines (https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html), a high-level API for their data handlers. At the end of a pipeline, users may pass their results to evaluators which are designed for classification and regression models, and do not serve text generation models.

Figure 1: Demonstration of the TextGen-Benchmarch pipeline. The framework uses the Evaluator to compare gold references against predictions and generates plots and tables to summarize the results. The predictions can either take the form of predefined user input, cached outputs from previous runs, or text generated with the Text-Generator module, which can be extended with any model.

3. System Overview

The TextGen-Benchmarch framework is built in Python and provides a pipeline as illustrated in Figure 1. Before starting, TextGen-Benchmarch parses a configuration file that contains (1) the paths to datasets, (2) the systems, and (3) the metrics to be used. It additionally allows for descriptors of the text format. For example, if sentences are surrounded by tags that should be ignored during evaluation, this can be specified here. Any specified dataset must contain two sub-folders, samples and gold. The gold folder contains files with line-separated references (Python natively supports streaming line-separated files, which is why this is a formatting requirement). Samples are read in using Python's file-stream, which ensures minimal memory usage. The references can be stored either as plain text or as a JSON file to enable multiple references. In the JSON case, each line should be formatted as follows:

    {"references": ["ref 1", "ref 2", "ref 3"]}

TextGen-Benchmarch loads samples from the datasets specified in the configuration file. It parses the files and passes one document at a time to the Text-Generator. The Text-Generator returns model-generated text, which is then stored in the file system to be used during evaluation. Users may provide their own generated text in conjunction with model-generated texts or skip text generation entirely by turning off the generation in the configuration. Text generation is also skipped if TextGen-Benchmarch infers that a given dataset has already been processed with the model and is cached on the file system. If the evaluation is enabled, the user- and model-generated texts are evaluated against the reference texts. TextGen-Benchmarch currently supports ROUGE, METEOR, NIST, and BLEU scores. We provide additional details on how the library interfaces with data in Section 4.
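The exact configuration syntax is not prescribed here, so the following is a minimal hypothetical sketch of a dataset layout and configuration file that matches the description above; all field names, paths, and system identifiers are illustrative rather than the tool's actual schema:

    datasets/cnn-dm/
        samples/    (input documents)
        gold/       (line-separated references, plain text or JSON)

    {
        "datasets": ["datasets/cnn-dm", "datasets/duc-2004"],
        "systems": ["sumyLexRank", "smmrRE"],
        "metrics": ["ROUGE", "METEOR", "BLEU", "NIST"],
        "text_format": {"sentence_tag": "<s>", "pretokenized": false},
        "generate": true,
        "evaluate": true
    }

With such a file in place, a single run reads every listed corpus, produces (or reuses cached) system outputs, and scores them with every listed metric.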
4. Extending the System

TextGen-Benchmarch is designed to make it as easy as possible for users to add and remove text generators and metrics. TextGen-Benchmarch interfaces with two library files, one for metrics and one for text generators. Additions are made to these two libraries.

4.1. Adding text generators

The text generator library provides a single public method with two inputs: a targeted text generator and a text. The targeted text generator is called and returns the resulting text. A user can add additional models by adding a method for their model that takes a text as input and returns a generated text as output.

On load of the library class, information related to the format of samples is saved. This information includes separators for tokenized sentences and a Boolean that indicates whether the text is tokenized. Custom methods must use this information to decide how to preprocess the input text before passing it into the flow of their added model. Some text generators require sentences to be pre-tokenized, whereas other text generators have custom tokenizers and expect raw text. For additional convenience, we provide an interface for a tokenizer and a detokenizer with the library.
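As an illustration of what such an addition could look like, here is a minimal hypothetical sketch in Python; the names (SAMPLE_FORMAT, lead3) and the shape of the saved format information are ours and not the actual library interface:

    # Hypothetical sketch of a user-added generator method. SAMPLE_FORMAT stands in
    # for the format information the library saves on load (the sentence separator
    # and whether the input is already tokenized); it is not the real attribute name.
    SAMPLE_FORMAT = {"pretokenized": True, "sentence_separator": "</s>"}

    def lead3(text):
        """Toy extractive 'model': return the first three sentences as the summary."""
        if SAMPLE_FORMAT["pretokenized"]:
            sentences = text.split(SAMPLE_FORMAT["sentence_separator"])
        else:
            # Naive fallback split for raw, untokenized text.
            sentences = [s.strip() for s in text.split(".") if s.strip()]
        return " ".join(sentences[:3])

A method like this receives the input text, uses the saved format information to decide how to split or preprocess it, and returns plain generated text, which is all the pipeline expects from a generator.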
4.2. Adding Metrics

Adding metrics follows a process similar to the one outlined for text generators. The metric library provides a method with a single input: a custom Summary Reader Object (SRO). The SRO has two public methods, readOne and readAll. When readOne is called, a tuple of the form (prediction, references) is returned. When readAll is called, a list of all N tuples is returned, where N corresponds to the number of generated texts. The readOne and readAll methods are abstractions over Python's file-stream reader.
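To illustrate the shape of a metric built on this interface, here is a hypothetical sketch; the SRO is only assumed to expose readAll() as described above, and the metric itself is a toy unigram-overlap score rather than one of the metrics shipped with the tool:

    # Hypothetical sketch of a custom metric using the SRO interface described above.
    # sro.readAll() is assumed to return (prediction, references) tuples; everything
    # else here is illustrative.
    def unigram_overlap(sro):
        """Toy metric: average best fraction of reference unigrams covered by the prediction."""
        scores = []
        for prediction, references in sro.readAll():
            pred_tokens = set(prediction.lower().split())
            best = 0.0
            for reference in references:
                ref_tokens = set(reference.lower().split())
                if ref_tokens:
                    best = max(best, len(pred_tokens & ref_tokens) / len(ref_tokens))
            scores.append(best)
        return sum(scores) / len(scores) if scores else 0.0

A function like this, added to the metric library, can then be scored and reported alongside the built-in metrics.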
5. Report types

TextGen-Benchmarch provides the following report types.

• CSV: Fixed Metric generates a separate CSV for each metric. Each row is a different model and each column represents a corpus. This assists with comparisons of the same evaluation metric and set of summarizers against different corpora.
• CSV: Fixed Corpus generates a single report for one corpus. Each row represents a model and each column a metric. This assists with comparisons on the same corpus with a single set of models but against different metrics.
• Horizontal Barchart: Fixed Metric. Grouped by corpus, this shows scores on the X-axis, sorted by average metric score across corpora. This visualization helps draw comparisons between models across different corpora.

6. Example Reports

We demonstrate the usage of TextGen-Benchmarch on the extractive summarization problem. An extractive summary is defined as a subset of sentences from a number of documents (either one or many) that effectively summarizes the message of the input. Typical metrics for this task include ROUGE and METEOR. We present a comparison of popular non-parametric extractive summarizers on the DUC 2004 [10], ArXiv [11], and CNN-DM [12, 13] datasets. We compare smmrRE, our re-implementation of the SMMRY extractive summarizer (https://smmry.com/about), and Python's sumy summarizers (https://github.com/miso-belica/sumy).

For ArXiv and CNN-DM, we used 1,000 samples of the test set for demonstration purposes. Thus, the results should not be interpreted as official scores. They do, however, highlight some interesting variation between the performance of the summarizers across the different metrics. Figure 2 shows the METEOR scores. The order corresponds, from top to bottom, to a summarizer's rank when comparing the average score across all corpora. Here, smmrRE ranks first and sumyRandom comes in last.

Figure 2: METEOR score report on the DUC-2004, ArXiv, and CNN-DM corpora, showing the METEOR score for each system, sorted by corpus.

References

[1] S. A. Ellafi, Preprocessing and normalization for automatic evaluation of machine translation, Association for Computational Linguistics, 2005.
[2] C.-Y. Lin, E. Hovy, Automatic evaluation of summaries using n-gram co-occurrence statistics, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, Association for Computational Linguistics, 2003, pp. 71–78.
[3] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out: Proceedings of the ACL Workshop, volume 8, 2004.
[4] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, 2005, pp. 65–72.
[5] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002. URL: http://aclweb.org/anthology/P02-1040.
[6] G. Doddington, Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, 2002, pp. 138–145.
[7] R. Vedantam, C. Lawrence Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[8] C.-Y. Lin, A brief introduction of the ROUGE summary evaluation package, University of Southern California / Information Sciences Institute, 2005.
[9] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C. L. Zitnick, Microsoft COCO captions: Data collection and evaluation server, CoRR abs/1504.00325 (2015).
[10] P. Over, J. Yen, Introduction to DUC-2004: an intrinsic evaluation of generic news text summarization systems, 2004.
[11] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, N. Goharian, A discourse-aware attention model for abstractive summarization of long documents, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 615–621. URL: http://aclweb.org/anthology/N18-2097. doi:10.18653/v1/N18-2097.
[12] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, in: Advances in Neural Information Processing Systems, 2015, pp. 1693–1701.
[13] R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, B. Xiang, Abstractive text summarization using sequence-to-sequence RNNs and beyond, in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Association for Computational Linguistics, 2016, pp. 280–290. URL: http://aclweb.org/anthology/K16-1028. doi:10.18653/v1/K16-1028.