<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scientific Chart Summarization: Datasets and Improved Text Modeling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hao Tan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chen-tse Tsai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujie He</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohit Bansal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bloomberg</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of North Carolina at Chapel Hill</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Chart figures usually convey the key message in a multimodal document. Understanding charts automatically and making charts more accessible becomes indispensable in the information era. In this paper, we study the chart summarization problem in which the goal is to generate sentences that describe the salient information in a chart image. To obtain training examples, we leverage image-caption pairs in multiple scientific areas. We create a dataset of single-chart images from research papers in PubMed Central (PMC) and arXiv. Most recent vision-and-language works focus on natural images. Several challenges in structured images such as charts are under-explored. One key property of charts is that the text components (e.g., legends and axis names) carry important information. In our proposed model, we not only use a standard visual encoder but also a text encoder to encode a chart image. The visual and textual representations are connected to a large pre-trained language decoder via pre-embedding and cross-attention approaches, respectively. Experimental results show that the proposed model is significantly better than an image captioning baseline.</p>
      </abstract>
      <kwd-group>
        <kwd>Chart Summarization</kwd>
        <kwd>Multimodal Learning</kwd>
        <kwd>Document Understanding</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Information graphics, such as line charts and bar charts, are essential and common components of a document. Charts are usually used for visually summarizing important information that a document intends to convey. Moreover, as shown in the study of Carberry et al. [1], information graphics in magazines and newspapers often convey messages that are not repeated in the text. Therefore, summarizing the primary message in a chart is an important step towards understanding a multimodal document. Potential applications of chart summarization include indexing information content for a search engine, making charts accessible for individuals with eyesight impairments, and simplifying information dissemination of technical visual info to a layperson.</p>
      <p>We have seen the success of image captioning works recently, which can be viewed as generating summaries for an image. However, this research has mostly focused on natural images while other types of images (e.g., structured images shown in Fig. 2) are under-explored. On the other hand, abstractive text summarization models also have been greatly improved due to the development of neural network models. However, these models only look at the text component in a document. In this work, we focus on the less-studied yet important task of ‘chart summarization’, where we want to generate a salient summary for structural charts. First, to obtain a large quantity of summaries of chart images, we leverage captions in scientific articles. Unlike magazines or newspapers, in which image captions could be less descriptive, captions in scientific papers tend to be more detailed and verbose. We build a chart summarization dataset from the papers in arXiv and PubMed Central (PMC) by assuming that captions are salient summaries of chart figures. Image captions in these data sources are written by the corresponding paper’s authors, and hence would be more natural in the language format. Since these articles also contain figures other than charts, we create crowdsourcing tasks to select single-chart images and collect these charts’ detailed types (e.g., line chart, bar chart, etc.).</p>
      <p>Different from the traditional captioning for natural images, there are two main challenges from the language perspective when the target images are charts: (1) Besides visual content, charts usually also contain text (e.g., legends and axis titles) which carries significant information of components in charts. (2) Charts are likely to be used in some specific domains, thus the language generation model may suffer from rare-word issues. To address these two challenges, we first use an optical character recognition (OCR) model to detect the text boxes in the charts. An OCR embedding layer is proposed to encode these extracted texts with their position information into vectors, and these vector representations are treated as another input to the language decoder through a cross-attention mechanism. Secondly, to endow the decoder with domain-specific knowledge, we use a large pre-trained language decoder instead of training it from scratch. The chart information is connected to this pre-trained language decoder via two approaches: pre-embedding and cross-attention. We empirically find that using pre-embedding for visual content and cross-attention for OCR representations gives the best results.</p>
      <p>We apply our models on our collected datasets of two scientific domains. We conduct both metric-based automatic evaluation and human-annotated qualitative evaluation. Experimental results show that our model with the integration of OCR and pre-trained language model significantly outperforms the baseline image captioning model. We also show the ablation studies that illustrate the effectiveness of our proposed methods.</p>
      <p><sup>a</sup>Equal Contribution. <sup>b</sup>Work done during an internship at Bloomberg.</p>
      <p>The second workshop on Scientific Document Understanding at AAAI 2022. © 2022 Bloomberg Finance L.P. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
      <sec>
        <title>2. Related Work</title>
        <p>Most work on understanding chart images involves chart type classification. Savva et al. [2] classify given chart images into 10 chart categories using an SVM classifier with visual bag-of-words and text-region features. With a similar model, Ray Choudhury and Giles [3] proposed a binary classifier to determine whether an image is a line chart. Siegel et al. [4] experimented with CNN-based models for classifying images they extracted from scholarly articles. In order to identify chart figures for training our summarization model, we build a binary classifier to identify common charts (e.g., line charts, bar charts, scatter plots, etc.).</p>
        <p>There is a line of works on interpreting text components in chart images [5, 6, 7, 8, 9, 4, 10, 11, 12, 13]. One of the applications here is to recover visual encodings for purposes of indexing and search. For example, Poco and Heer [14] proposed an end-to-end text analysis pipeline that identifies text elements in a chart image, determines their bounding boxes, and classifies their role in the chart (e.g., x-axis label, x-axis title, legend title, etc.). They also proposed a CNN model that classifies the type of graphical mark (e.g., bars or lines). We simply use a general purpose OCR tool for recognizing text in chart images and focus more on the text generation model. These better text analysis models could potentially improve our model performance, which we leave for future investigation. Kahou et al. [10] introduce FigureQA, a visual reasoning corpus of question-answer pairs over synthetic chart images. Instead of answering questions on the synthetic charts, we aim at directly summarizing real chart images.</p>
        <p>There are some earlier works on chart summarization. Elzer et al. [15] proposed SIGHT, a system that summarizes bar charts for visually impaired users. The system identifies one of the twelve message categories that can be conveyed by a bar chart and produces a logical form. This logic representation is then translated into natural language via templates. Demir et al. [16] built on top of SIGHT. The proposed system first identifies an additional set of propositions that may reflect some information in a bar chart by rules. These propositions are then organized and structured by a bottom-up planner. Finally, a surface realizer is applied to produce natural language summaries.</p>
        <p>Greenbacker et al. [17] built a corpus of human-written English summaries of line graphs. They selected 23 line graphs and asked annotators to summarize the most important information in each graph. As this process is difficult to scale up, we take the captions of chart images in scientific papers to represent the summaries instead. Greenbacker et al. [18] further used this corpus and proposed an abstractive summarization system for line charts. The system uses a Bayesian network to classify the intents of line segments, and then rules are applied to identify additional important informational propositions conveyed by the line graph. The sets of intents and propositions are pre-defined from the study on the corpus. They left the final step of generating natural language summaries from propositions as future work. Therefore, no evaluation results were shown.</p>
        <p>A common challenge of these earlier works is that they are limited to a fixed set of propositions and need to convert the selected propositions to natural language. Instead of using a pipeline with hand-crafted intents and propositions, we propose to leverage an end-to-end neural network, which has been shown to be powerful in generating coherent and grammatical sentences in the context of image captioning and abstractive text summarization.</p>
        <p>Another thread of related works is (natural) image captioning, which tries to generate descriptions for natural images. Vinyals et al. [19] first illustrate the end-to-end encoder-decoder architecture and Xu et al. [20] extend it with attention modules. Ranzato et al. [21] use reinforcement learning to eliminate exposure bias but require a large amount of data to reduce the high variance. Anderson et al. [22] take object-level information to enable fine-grained visual understanding. However, we empirically found that the detection features for natural images do not work well for charts (structural images). Previous vision-and-language pre-training, e.g., VLP [23] and OSCAR [24], uses pre-trained vision-and-language models to improve image captioning but requires a large in-domain corpus and heavy pre-training.</p>
      </sec>
      <sec>
        <title>3. Datasets Creation</title>
        <p>We create our datasets based on image-caption pairs that appear in public scientific papers. Different from the figures in magazines or newspapers where the captions could be less descriptive, figure captions in scientific articles tend to convey the key message of figures. The assumption here is that captions written by the paper authors could represent the most salient information in the figures, and therefore could serve as summaries of the corresponding figures. The overview of our datasets creation pipeline is shown in Figure 1. We consider two data sources: arXiv<sup>1</sup> and PMC.<sup>2</sup> ArXiv is a free distribution service and an open-access archive for scholarly articles in fields such as physics, computer science, and mathematics. PMC is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health’s National Library of Medicine. We take articles in the Open Access Subset.<sup>3</sup> These two data sources are chosen because they both provide structural data in addition to the PDF files. That is, we can obtain image-caption pairs by parsing the LaTeX source files provided by arXiv or the XML files provided by PMC. We write our own LaTeX parser for the arXiv data, and use a public PubMed parser<sup>4</sup> for parsing XML information.</p>
        <p>Although we can extract lots of image-caption pairs, most of the figures in these papers are not charts. Hence, to be able to train and evaluate the proposed chart summarization model, we need to identify which figures are charts. In this work, we focus on the common 5 chart types, including line, bar, scatter, pie, and area charts (Figure 2). Moreover, we further focus on the simplest case where images only contain a single chart. Figures with multiple charts or with any non-chart component will be considered as negative images in this work. In the following sections, we describe how we obtain single chart and chart type annotations.</p>
        <sec>
          <title>3.1. PubMed Central Data</title>
          <p>For PMC data, we create a crowdsourcing task to annotate whether a given image contains a single chart. We randomly sample 50,000 images from the papers published from 2011 to 2019. For each image, we ask annotators whether it is a single chart figure. If the answer is yes, the annotators are required to select a chart type from line, bar, scatter, pie, area, or other chart. Since this task is pretty simple, we ask two annotators to label each image in the first round. In most cases, two annotators agree on the labels. More specifically, the Fleiss’ kappa scores for “whether it’s a single chart” and “chart type” tasks are 0.56 and 0.73 respectively, which shows significant agreement.<sup>5</sup></p>
          <p>If there is a disagreement on either the single chart label or the chart type, we further ask three other annotators to perform a second round of annotation on these images. Finally, majority vote is applied to resolve conflicts among all five annotators. We note that single charts with “other” chart type are considered negative images in our experiments.</p>
          <p>Among 50,000 images, we obtain 7,397 positive images (single chart), including 3681 line charts, 3088 bar charts, 478 scatter charts, 125 pie charts, and 25 area charts. The positive ratio of the charts is about 13%. This low ratio is because most of the figures in scientific articles are non-chart figures (e.g., model architecture diagrams). In this work, we only use chart types in analyzing model performance. That is, chart type information is not included explicitly in model training.</p>
        </sec>
      </sec>
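      <p>The two-round labeling protocol described for the PMC data (two annotators first, three more on disagreement, then a majority vote, with “other” charts counted as negatives) can be sketched as follows. This is only a minimal illustration: the function name <monospace>resolve_label</monospace>, the label strings, and the tie-breaking behavior are assumptions made for the example and are not taken from the authors’ pipeline.</p>
      <preformat>
```python
from collections import Counter

NEGATIVE = "not_single_chart"  # hypothetical label string for negative images


def resolve_label(first_round, second_round=None):
    """Resolve one image's label from annotator votes.

    first_round: labels from the two first-round annotators.
    second_round: labels from three further annotators, used only
    when the first two disagree.
    """
    votes = list(first_round)
    if len(set(votes)) > 1:
        if second_round is None:
            raise ValueError("disagreement: a second round is required")
        votes += list(second_round)
    # Plurality over all collected votes. Ties fall back to first-seen
    # order, which is an assumption; the text does not discuss ties.
    label, _ = Counter(votes).most_common(1)[0]
    # Single charts of type "other" are treated as negative images.
    return NEGATIVE if label == "other" else label
```
</preformat>
      <p>For example, <monospace>resolve_label(["line", "bar"], ["bar", "bar", "line"])</monospace> resolves the disagreement with five votes in total, three of which are for the bar chart label.</p>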
      <sec id="sec-1-1">
        <title>3.2. ArXiv Data</title>
        <sec id="sec-1-1-1">
          <p><sup>1</sup>https://arxiv.org/ <sup>2</sup>https://www.ncbi.nlm.nih.gov/pmc/ <sup>3</sup>https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ <sup>4</sup>https://github.com/titipata/pubmed_parser</p>
        </sec>
        <sec id="sec-1-1-2">
          <p>We also build another dataset from the arXiv data. We take papers in the Computer Vision, Computation and Language, Machine Learning, Artificial Intelligence, and Neural and Evolutionary Computing fields from 2008 to 2020. Because of the copyright issue, we cannot put arXiv images on a public crowdsourcing platform. Instead, the authors went through and annotated 2000 randomly sampled figures with the same crowdsourcing interface that</p>
        </sec>
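        <p>The paper states that image-caption pairs for arXiv are obtained by parsing the LaTeX sources with a parser written by the authors, but it gives no details. The sketch below shows one simple way such an extraction could look. Everything in it is an assumption for illustration: the function names, the regular expressions, and the choice to keep only figure environments with exactly one <monospace>\includegraphics</monospace> command. It ignores comments, optional short captions, and subfigure packages.</p>
        <preformat>
```python
import re

# Body of a figure or figure* environment (non-greedy, across lines).
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}(.*?)\\end\{figure\*?\}", re.DOTALL)
# Path argument of \includegraphics, skipping an optional [..] argument.
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")


def read_braced(text, start):
    """Return the contents of the brace group that opens at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i]
    return None  # unbalanced braces


def extract_pairs(latex_source):
    """Return (image_path, caption) pairs for figures with one image."""
    pairs = []
    for match in FIGURE_RE.finditer(latex_source):
        body = match.group(1)
        images = GRAPHICS_RE.findall(body)
        cap = body.find("\\caption{")
        if len(images) != 1 or cap == -1:
            continue
        caption = read_braced(body, cap + len("\\caption"))
        if caption:
            # Collapse line breaks and repeated spaces inside the caption.
            pairs.append((images[0], " ".join(caption.split())))
    return pairs
```
</preformat>
        <p>The brace matching in <monospace>read_braced</monospace> is what allows captions with nested commands such as <monospace>\textbf{...}</monospace> to be read in full, which a single regular expression would cut short.</p>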
        <sec id="sec-1-1-3">
          <p><sup>5</sup>https://en.wikipedia.org/wiki/Fleiss%27_kappa</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Methodology</title>
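      <p>In this section, we introduce the proposed models and training strategies for the chart summarization task. In this task, the model needs to generate a sequence of words <inline-formula><tex-math>\{w_i\}</tex-math></inline-formula> describing the contents of a chart <inline-formula><tex-math>I</tex-math></inline-formula>. We start with introducing the basic captioning model. To enhance in-image text understanding and endow external knowledge, we incorporate an OCR encoder and a pre-trained language decoder. Lastly, we propose a simple semi-supervised learning and domain adaptation approach using a chart classifier.</p>
      <sec id="sec-2-1">
        <title>4.1. Base Model</title>
        <p>Our base model is adopted from the attentive encoder-decoder model for image captioning proposed in Xu et al. [20]. A ResNet-101 [25] visual feature extractor encodes the chart into a 7 × 7 × 2048 dimensional feature map, where each vector in the feature map corresponds to a grid region of the image. Feature maps are then flattened to 49 × 2048 feature sequences <inline-formula><tex-math>\{f_k\}</tex-math></inline-formula>.</p>
        <p><disp-formula><label>(1)</label><tex-math>\{f_k\}_{k=1}^{49} = \mathrm{ResNet}(I)</tex-math></disp-formula></p>
        <p>At each decoding step <inline-formula><tex-math>i</tex-math></inline-formula>, the LSTM [26] language decoder outputs the hidden output <inline-formula><tex-math>h_i</tex-math></inline-formula> and cell <inline-formula><tex-math>c_i</tex-math></inline-formula> by reading the embedding of the previous word together with the previous hidden output and cell. An attention module (denoted as <inline-formula><tex-math>\mathrm{Att}_{h \rightarrow f}</tex-math></inline-formula>) then attends to the feature sequence <inline-formula><tex-math>\{f_k\}</tex-math></inline-formula> with the hidden output <inline-formula><tex-math>h_i</tex-math></inline-formula> as a query. The context <inline-formula><tex-math>\hat{f}_i</tex-math></inline-formula> and the hidden vector <inline-formula><tex-math>h_i</tex-math></inline-formula> are merged into an attentive hidden vector <inline-formula><tex-math>\hat{h}_i</tex-math></inline-formula> with a fully-connected layer:</p>
        <p><disp-formula><tex-math>\tilde{w}_{i-1} = \mathrm{embedding}(w_{i-1})</tex-math></disp-formula>
        <disp-formula><tex-math>h_i, c_i = \mathrm{LSTM}(\tilde{w}_{i-1}, h_{i-1}, c_{i-1})</tex-math></disp-formula>
        <disp-formula><tex-math>\hat{f}_i = \mathrm{Att}_{h \rightarrow f}(h_i, \{f_k\})</tex-math></disp-formula>
        <disp-formula><tex-math>\hat{h}_i = \tanh(W_1 [\hat{f}_i ; h_i] + b_1)</tex-math></disp-formula></p>
        <p>The probability of generating the <inline-formula><tex-math>j</tex-math></inline-formula>-th token at time step <inline-formula><tex-math>i</tex-math></inline-formula> is the softmax over a linear transformation of the attentive hidden <inline-formula><tex-math>\hat{h}_i</tex-math></inline-formula>. The loss <inline-formula><tex-math>\mathcal{L}_i</tex-math></inline-formula> is the negative log likelihood of the ground truth token <inline-formula><tex-math>w_i^*</tex-math></inline-formula>:</p>
        <p><disp-formula><tex-math>p_i(w_{i,j}) = \mathrm{softmax}\left(W_w \hat{h}_i + b_w\right)</tex-math></disp-formula>
        <disp-formula><tex-math>\mathcal{L}_i = -\log p_i(w_i^*)</tex-math></disp-formula></p>
      </sec>
      <sec id="sec-2-4">
        <title>4.2. Text Understanding</title>
        <p>Different from natural image captioning, the summarization of charts heavily relies on the understanding of text inside the images. However, the ResNet visual encoder (in Section 4.1) is insensitive to the text in the images (as shown in Singh et al. [11] as well), thus we need to build a pipeline to extract the text information from the images. Specifically, we first use Tesseract [27] to extract a sequence of <inline-formula><tex-math>M</tex-math></inline-formula> texts <inline-formula><tex-math>\mathrm{text}_m</tex-math></inline-formula> with their positions <inline-formula><tex-math>\mathrm{pos}_m</tex-math></inline-formula> from the image <inline-formula><tex-math>I</tex-math></inline-formula>.</p>
        <p><disp-formula><tex-math>\{(\mathrm{text}_m, \mathrm{pos}_m)\}_{m=1}^{M} = \mathrm{OCR}(I)</tex-math></disp-formula></p>
        <p>Since the characters in charts are usually in small font and sometimes blurred with the chart content, the copy mechanism [28, 29] that directly brings the text into the final summarization does not provide good results. We instead use a shallow text embedding layer to project the OCR text to dense vector representations, which denoises the OCR detection results. We also encode the position of the OCR along with the text representation since the spatial information indicates the properties of the text (e.g., in the legend, in the title, or inside the chart):</p>
        <p><disp-formula><label>(2)</label><tex-math>o_m = \mathrm{Emb}_{\mathrm{text}}(\mathrm{text}_m) + W_{\mathrm{pos}}\, \mathrm{pos}_m</tex-math></disp-formula></p>
        <p>These OCR representations are treated as another view of the charts and the language decoder simultaneously attends to the OCR information <inline-formula><tex-math>\{o_m\}</tex-math></inline-formula> and visual image features <inline-formula><tex-math>\{f_k\}</tex-math></inline-formula>. The final hidden output <inline-formula><tex-math>\tilde{h}_i</tex-math></inline-formula> is calculated based on the concatenation of the visually attended vector <inline-formula><tex-math>\tilde{f}_i</tex-math></inline-formula>, the OCR attended vector <inline-formula><tex-math>\tilde{o}_i</tex-math></inline-formula>, and the hidden state <inline-formula><tex-math>h_i</tex-math></inline-formula>.</p>
        <p><disp-formula><label>(3)</label><tex-math>\tilde{f}_i = \mathrm{Att}_{h \rightarrow f}(h_i, \{f_k\})</tex-math></disp-formula></p>
        <p>[Figure 3: model overview with components ResNet, Tokenizer, OCR Text and Position (X, Y) inputs, Fixed-Len Transformer, Word Embedding, and Pre-trained Language Decoder; example output: “Fluorescence emission spectrum recorded from the ...”]</p>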
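        <p>The OCR embedding and the two-view attention of this section (Equations 2 to 5) can be illustrated with a small dependency-free sketch. It is not the authors’ implementation. In particular, the paper does not state the scoring function of the attention module, so plain dot-product attention is assumed here, and the function names and the list-based vector representation are hypothetical.</p>
        <preformat>
```python
import math


def ocr_embedding(token_vec, pos, w_pos):
    """Eq. 2 sketch: text embedding plus a linear projection of the position."""
    projected = [sum(w * p for w, p in zip(row, pos)) for row in w_pos]
    return [t + q for t, q in zip(token_vec, projected)]


def attend(query, keys):
    """Dot-product attention (assumed): softmax-weighted average of keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(weights, keys)) / total
            for d in range(dim)]


def fuse(h, visual_feats, ocr_feats, w2, b2):
    """Eq. 3-5 sketch: attend to both views, concatenate, project, tanh."""
    concat = attend(h, visual_feats) + attend(h, ocr_feats) + list(h)
    return [math.tanh(sum(w * x for w, x in zip(row, concat)) + b)
            for row, b in zip(w2, b2)]
```
</preformat>
        <p>Here <monospace>fuse</monospace> expects each row of <monospace>w2</monospace> to be as long as the concatenation of the visual context, the OCR context, and the hidden state, mirroring the concatenation described in the text.</p>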
        <p><disp-formula><label>(4)</label><tex-math>\tilde{o}_i = \mathrm{Att}_{h \rightarrow o}(h_i, \{o_m\})</tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math>\tilde{h}_i = \tanh(W_2 [\tilde{f}_i, \tilde{o}_i, h_i] + b_2)</tex-math></disp-formula></p>
        <p>We next replace the original attentive hidden <inline-formula><tex-math>\hat{h}_i</tex-math></inline-formula> (in Sec. 4.1) with this OCR-enhanced hidden output <inline-formula><tex-math>\tilde{h}_i</tex-math></inline-formula> in succeeding decoding steps.</p>
      </sec>
      <sec>
        <title>4.3. Pre-trained Language Decoder</title>
        <p>When summarizing charts in news or scientific papers, a faithful description of the chart contents also relies on external knowledge, and hence a pre-trained language decoder might help the generation. As shown in Figure 3, we illustrate our model which integrates a pre-trained language decoder GPT-2 [30].<sup>6</sup> As described in the previous section, we have two image encoders (i.e., ResNet encoder and OCR text encoder) to process the image content and image text respectively. The ResNet encoder maps the features into a squared feature map (the purple vector blocks in Figure 3) where each vector corresponds to a part of image content. We will view this feature map as a sequence of vectors (as in Eq. 1) in the following procedures. The OCR encoder (Section 4.2) maps the chart into a sequence of recognized words and their positions on the chart. The OCR embedding layer (Eq. 2) adds the word embedding and the position encoding into one vector for each OCR entry (the yellow vectors in Figure 3).</p>
        <p>In order to connect this visual and textual information from the image to the language decoder, we adopt two ways: appending pre-embeddings and adding cross-attention layers. The pre-embedding approach is to concatenate the sequence of visual vectors before the word embeddings, thus the language decoder will take this concatenation as input (e.g., the concatenation of red blocks and blue blocks in Figure 3). The cross-attention approach adds cross-attention layers [34] inside the language decoder to fuse visual information. The cross-attention layers contain residual short-cut connections, thus the decoder still benefits from the pre-trained weights with these additional layers.</p>
        <p>As shown in Figure 3, we use the pre-embedding approach for the features from the visual image content (i.e., from the ResNet encoder) and use the cross-attention layers for the OCR texts. The idea of this specific design is that the generation would be led by the image content and will use the OCR information to generate concrete words. We empirically find that it is the best combination to fuse information into the language decoder, and we show the comparison in Section 6.2. In detail, the length of the ResNet feature map is 49 and the order of the features is not aligned with the positional encoding in the pre-trained language decoder. We thus do not directly append it before the word embedding but use a fixed-length transformer to map it to a sequence of 10 vectors (the red blocks in Figure 3; we only draw 3 vectors for simplicity). The fixed-length transformer is built by transformer decoder layers [34] with only positional embedding (without word embedding). We use only 1 layer in our experiments.</p>
        <p><sup>6</sup>The method could also be applied to other pre-trained language decoders such as XLNet [31], T5 [32], and BART [33].</p>
      </sec>
      <sec>
        <title>4.4. Semi-Supervised Learning and Domain Adaptation</title>
        <p>Although we can extract abundant image-caption pairs, most figures in scientific articles do not contain a chart as we discussed in Section 3. If we want to reserve enough human-annotated examples for the metric-based evaluation purpose, that leaves very little data for training, especially for the arXiv domain in which we only have hundreds of single-chart images. Therefore, we leverage semi-supervised learning techniques to take advantage of large unannotated data and use domain adaptation to transfer to other datasets. Both of these two methods rely on a chart classifier that we will introduce first.</p>
        <p>Chart Classifier. The key component in getting more training examples is a classifier that can identify single-chart images. We take the ResNet [25] as the visual backbone and use a binary linear classifier after the mean-pooled features. Instead of freezing the backbone model as in the previous works [20], we fine-tune the classifier with a small learning rate, <inline-formula><tex-math>10^{-4}</tex-math></inline-formula>. We find that this standard classifier reaches good results (see Appendix for details).</p>
        <p>Semi-Supervised Learning. In the semi-supervised learning setup, we have labeled data (Section 3) and we want to improve the performance from the unlabeled data. The unlabeled data contains both charts and non-chart images (e.g., model figures in scientific publications and natural images in news). Including these non-chart images in training data will introduce noise and thus lead to an increment in training time. To provide clean data in semi-supervised learning, we filter the unlabeled data with our chart classifier and train the summarization model based on the filtered data. In this way, we increase the amount of data and the coverage of topics.</p>
        <p>Domain Adaptation. Different from semi-supervised learning, domain adaptation focuses on transferring the labeled dataset into another domain. Naïve transferring without training on the target domain would under-fit the target distribution and we empirically show its ineffectiveness in Appendix. To solve this issue, we use a similar approach to the semi-supervised learning that trains the proposed summarization model on the dataset created by the chart classifier. More specifically, since we have far fewer labeled charts in the arXiv domain, we treat it as the target domain whereas PMC data is the source domain. We train the chart classifier on the PMC data, and apply it on the images from arXiv papers to obtain a large amount of single-chart images.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results</title>
      <p>In this section, we evaluate our proposed methods on our collected datasets of two domains: PMC and arXiv. We start with describing the experiment setups and show results with both automatic metric-based evaluation and human evaluation.</p>
      <sec>
        <title>5.1. Experimental Setup</title>
        <p>Data Setup. The supervised learning setup is conducted on our annotated PMC dataset. We randomly sample 1,000 charts as the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1.</p>
        <p>In order to increase the number of training examples, we apply the proposed semi-supervised learning technique (Section 4.4). The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from PMC data to build this binary classifier. After the model converges on the training set, we calibrate the classifier to optimize the recall with a precision over 99% on the validation set. Since we have lots of images, we can afford a lower recall for high-quality positive examples. We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images we used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from the year of 2011 to 2019. After applying the chart classifier, we obtain 13,637 single chart images which could serve as additional training examples for the summarization model.</p>
        <p>For domain adaptation, we take charts and captions from arXiv as the target domain. As described in Section 3, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the previous semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples. We split these 22,044 examples into training data (19,840) and validation data (2,204) with a ratio of 9:1.</p>
        <p>Model Setup. For the base model, we use a ResNet-101 model from the Torchvision [35] library.<sup>7</sup> We resize the image into 224 × 224 and the backbone model maps it to a 7 × 7 × 2048 feature map. We sort the OCR-extracted texts by their confidence and only keep the top 20 texts for post-processing. Since we want the image position to be related to the OCR position, we do not apply random resize and cropping but directly resize the chart into 224 × 224. For the pre-trained GPT-2 [30] model, we downloaded the small GPT-2 model from Hugging Face’s Transformers [36]. The GPT-2 implementation has support for cross-attention layers as in Vaswani et al. [34] and we use it to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. More implementation and hyperparameter details can be found in Appendix.</p>
        <p>[Table 1: automatic evaluation results for the Base Model, + OCR, and + GPT-2 variants. Surviving column headers: BLEU, CIDEr. Surviving entries: 7, 11, 4.32, 4.28, 4.71, 5.39; their row and column assignment is not recoverable.]</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th/><th>Baseline Better</th><th>Final Model Better</th><th/></tr>
            </thead>
            <tbody>
              <tr><td>PMC</td><td>20</td><td>70</td><td>3</td></tr>
              <tr><td>arXiv</td><td>37</td><td>50</td><td>2</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <label>Table 3</label>
          <caption><p>Results regarding different types of charts.</p></caption>
          <table>
            <thead>
              <tr><th/><th>BLEU</th><th>ROUGE-L</th></tr>
            </thead>
            <tbody>
              <tr><td>All</td><td>4.47</td><td>12.46</td></tr>
              <tr><td>Line Chart</td><td>4.44</td><td>12.70</td></tr>
              <tr><td>Bar Chart</td><td>4.77</td><td>12.30</td></tr>
              <tr><td>Scatter Chart</td><td>5.96</td><td>16.63</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
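      <p>The calibration step in the data setup (choosing the classifier operating point that maximizes recall while keeping precision above 99% on the validation set) can be sketched as a simple threshold search. The function name <monospace>calibrate_threshold</monospace>, its arguments, and the handling of tied scores are assumptions for this illustration; the paper does not describe how the calibration was implemented.</p>
      <preformat>
```python
def calibrate_threshold(scores, labels, min_precision=0.99):
    """Return the score threshold with the highest recall whose precision
    on the validation set exceeds min_precision, or None if none does.

    scores: classifier confidences; labels: 1 for single-chart, 0 otherwise.
    An image is predicted positive when its score is >= the threshold.
    """
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    best = None  # (recall, threshold)
    tp = 0
    for i, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Evaluate only at the end of a run of equal scores, so that the
        # threshold selects exactly the first i ranked images.
        if i != len(ranked) and ranked[i][0] == score:
            continue
        precision = tp / i
        recall = tp / total_pos if total_pos else 0.0
        if precision > min_precision and (best is None or recall > best[0]):
            best = (recall, score)
    return None if best is None else best[1]
```
</preformat>
      <p>Because the unlabeled pool is large, a conservative threshold that sacrifices recall still leaves many images, which matches the trade-off described in the text.</p>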
      <sec id="sec-3-1">
        <title>5.2. Metric-based Evaluation</title>
      </sec>
      <sec id="sec-3-2">
        <title>5.3. Human Evaluation</title>
        <sec id="sec-3-2-1">
          <title>In order to conduct eficient evaluation, we take the au</title>
          <p>tomatic language metrics to evaluate our model. We In order to get a faithful evaluation, we conduct a human
report the BLEU [37], ROUGE-L [38], METEOR [39], and evaluation on 100 randomly sampled examples for PMC
CIDEr [40] as in previous image captioning papers. As and arXiv. The human evaluation is conducted by the
shown in Table 1, we compare our proposed models (in authors and their colleagues (4 in total) since this task
Section 4.2 and Section 4.3) with the baseline captioning requires a certain expert knowledge. We use both base
model (in Section 4.1) on both PMC and arXiv datasets. captioning model and our final model (with OCR encoder
The model with OCR text encoder is strictly better than and GPT-2 decoder)8 to generate two summaries. Each
the baseline captioning model for every metrics, which image with the generated summaries from the two
modindicates that the in-chart text understanding is very els is annotated by all 4 annotators. We randomly shufle
important for generating good summarization for scien- the order of these two summaries and only show the A/B
tific charts. The integration of the pre-trained language labels to the human annotators. The human annotators is
model (GPT-2) further enhances the performance over asked to choose one from the four options: “Both Good”,
the OCR encoder results. The pre-trained decoder shows “Both Bad”, “A wins”, and “B wins”. As shown in Table 2,
more improvement on the semi-supervised setup since our proposed model significantly outperforms the
basethe model needs enough data to learn the weights in the line model for both datasets. Moreover, we find that our
ifxed-length transformer and the cross-attention mod- annotators have a high agreement on which generated
ules, which bridge the vision encoder and the language sentence is better since this scientific summarization is
decoder. mostly about facts and salience.</p>
          <p>Note that the CIDEr score of the +GPT-2 model is lower than that of the +OCR model on the PMC dataset under the supervised setup. We find that this is due to the size of the data. The smaller size of the PMC data gives the learned model a stronger bias towards the original GPT-2 generation. Namely, although the model generates more fluent sentences (reflected in the high BLEU score), it is biased towards the GPT-2 prior and uses mostly common words. This bias is captured by the CIDEr metric’s over-weighting protocol. However, under the semi-supervised setting, the CIDEr score is higher with GPT-2 because of the adequate amount of data. This also demonstrates the usefulness of the proposed semi-supervised approach.</p>
          <sec>
            <title>6. Analysis</title>
            <p>In this section, we provide a fine-grained analysis to illustrate the effectiveness of each component in the proposed pipeline. We first demonstrate the results for different chart types and cross-domain evaluation in Section 6.1. In Section 6.2, we empirically show the advantage of our pre-embedding and cross-attention combination.</p>
          </sec>
          <sec>
            <title>6.1. Different Chart Categories</title>
          </sec>
        </sec>
        <sec id="sec-3-2-2">
          <p>During our data collection, we also let the annotators select the type of the chart (Figure 2). In this paper, we</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>7https://pytorch.org/docs/stable/torchvision/models.html</title>
        </sec>
        <sec id="sec-3-2-4">
          <title>8 The PMC model uses the semi-supervised setup.</title>
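          <p>[Table 4: input settings for the two connection types, in extraction order. First column: None, Concat, None, Img, OCR, Concat. Second column: None, None, Concat, OCR, Img, Concat.]</p>
          <p>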
aim for a general chart summarization model that does not rely on the details of each chart type. Here we analyze the performance of the proposed model on each chart category with our final model trained on PMC (Semi-Supervised). In Table 3, we show the results for the three most common chart types (i.e., “Line”, “Bar”, “Scatter”), which have a sufficient amount of data (513 for Line, 400 for Bar, and 57 for Scatter) to support automatic metric-based evaluation. Although line charts contribute the most to the training and test data, their BLEU score is the lowest compared to the results for bar charts and scatter charts. The reason might be that the image features produced by convolutional neural networks (CNNs) are insensitive to the properties (e.g., trends, crossings) of the curved lines. At the same time, the CNN could capture the local intensity of points and thus shows higher results for scatter charts. Based on this observation, we think that using visual encoders that are specifically designed for understanding the curved lines in charts might be a promising future direction.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>6.2. Pre-Embeddings and Cross-Attention Layers</title>
      </sec>
      <sec id="sec-3-4">
        <sec id="sec-3-4-1">
          <p>In Section 4.3, we discuss two ways to connect the visual information to the language decoder: the pre-embedding approach and the additional cross-attention layers. In Table 4, we show the results of different combinations on the PMC (semi-supervised) dataset. “Img” and “OCR” indicate using the image output and the OCR representations, respectively, as the input to the pre-embedding approach or the cross-attention layers. “None” means that we do not use any input and thus exclude the corresponding parameters. “Concat” means that we concatenate the image output and the OCR representations and use the result as the input. We can see that our approach (Img for Pre-Embed and OCR for Cross-Att) is comparable to its reverse (OCR for Pre-Embed and Img for Cross-Att) and is much better than the other alternatives.</p>
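          <p>To make the four settings concrete, the following pure-Python sketch shows how each setting could select the feature sequence fed to a connection. It is an illustration under assumptions rather than the model code: the names select_input and route_inputs are hypothetical, features are plain lists of vectors, and “Concat” is assumed to join the two sequences along the sequence axis.</p>
          <preformat><![CDATA[
```python
def select_input(setting, img_feats, ocr_feats):
    """Pick the feature sequence for one connection type.

    setting is one of "None", "Img", "OCR", "Concat" (the labels in Table 4).
    """
    if setting == "None":
        return None  # the connection, and hence its parameters, is not used
    if setting == "Img":
        return list(img_feats)
    if setting == "OCR":
        return list(ocr_feats)
    if setting == "Concat":
        # Assumption: concatenation is along the sequence axis.
        return list(img_feats) + list(ocr_feats)
    raise ValueError(f"unknown setting: {setting}")


def route_inputs(pre_embed, cross_att, img_feats, ocr_feats):
    """Return the inputs for the pre-embedding and cross-attention paths."""
    return {
        "pre_embed": select_input(pre_embed, img_feats, ocr_feats),
        "cross_att": select_input(cross_att, img_feats, ocr_feats),
    }
```
]]></preformat>
          <p>With this helper, the configuration used in our approach corresponds to route_inputs("Img", "OCR", img_feats, ocr_feats), and its reverse swaps the first two arguments.</p>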
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>6.3. Chart Classification Performance</title>
        <sec id="sec-3-5-1">
          <p>In both the semi-supervised learning and domain adaptation setups, we use a classifier to identify single-chart images from a large pool of automatically extracted image-caption pairs. Since the images filtered by the classifier will be further used as data augmentation, we take the F1 score as the main metric to balance precision and recall. We start with frozen ResNet-101 [25] features and an additional linear classifier. This setup achieves a 90% F1 score. After fine-tuning the backbone model on our data, the model achieves an F1 score of 94.9%. We also tried adding other neural modules (e.g., attentive modules and detection branches) and enhanced visual backbones, but we do not observe a significant improvement on the test set.</p>
          <p>When we use this classifier in the semi-supervised and domain adaptation setups, we calibrate the classification threshold to maintain a precision over 99%, since we have lots of unannotated images. At this precision level, we achieve a recall of 59.8% and a precision of 99.2%. We keep the same classification threshold and test it on our annotated arXiv test split. The precision and recall are 93.4% and 65.7%, respectively.</p>
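          <p>The calibration step described above (keep precision above a target, then take whatever recall remains) can be sketched as follows. This is a minimal pure-Python illustration under assumed inputs, not the implementation used in our experiments: the function name calibrate_threshold and the toy scores are hypothetical, and we assume an image is predicted as a single chart when its classifier score is at least the threshold.</p>
          <preformat><![CDATA[
```python
def calibrate_threshold(scores, labels, min_precision=0.99):
    """Choose a score threshold on held-out data.

    Returns (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if no
    threshold qualifies. An item is predicted positive when
    score >= threshold. labels are 1 (single chart) or 0 (other).
    """
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Only cut between distinct scores so tied items share one prediction.
        is_last = i + 1 == len(ranked)
        if not is_last and ranked[i + 1][0] == score:
            continue
        precision = tp / (tp + fp)
        recall = tp / total_pos if total_pos else 0.0
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (score, precision, recall)
    return best


# Toy validation set (hypothetical numbers).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
labels = [1, 1, 1, 0, 1, 0]
strict = calibrate_threshold(scores, labels, min_precision=1.0)
```
]]></preformat>
          <p>On the toy data, requiring perfect precision selects the threshold 0.8 and gives up the positive scored 0.6, which mirrors the trade-off in the text: a strict precision target is paid for with lower recall.</p>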
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusions</title>
      <p>In this paper, we propose datasets and models for
summarizing scientific charts, a specific type of structured
images. We construct datasets from PMC and arXiv by
leveraging crowdsourcing and the figure captions in the
papers. To enable better understanding of text components in
charts and to endow the model with external knowledge,
we propose to use an OCR encoder and a pre-trained
language decoder on top of a standard image captioning
model. In our experiments, we show the effectiveness of
our models in terms of both automatic evaluation metrics
and human evaluation.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>The authors thank Bloomberg’s AI Engineering team, especially Alakananda Vempala, Ketevan Tsereteli, and Anju Kambadur for helpful feedback and directions. Additional thanks to the anonymous reviewers for their insights. Hao Tan acknowledges support from Bloomberg’s Data Science Ph.D. Fellowship.</p>
        <p>fication, analysis and redesign of chart images, in: Proceedings of the 24th annual ACM symposium on User interface software and technology, 2011, pp. 393–402.</p>
        <p>[3] S. Ray Choudhury, C. L. Giles, An architecture for information extraction from figures in digital libraries, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 667–672.</p>
        <p>[4] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, A. Farhadi, Figureseer: Parsing result-figures in research papers, in: European Conference on Computer Vision, Springer, 2016, pp. 664–680.</p>
        <p>[5] W. Huang, C. L. Tan, A system for understanding imaged infographics and its applications, in: Proceedings of the 2007 ACM symposium on Document engineering, 2007, pp. 9–18.</p>
        <p>[6] S. Demir, S. Carberry, K. F. McCoy, Summarizing information graphics textually, Computational Linguistics 38 (2012) 527–574.</p>
        <p>[7] Z. Chen, M. Cafarella, E. Adar, Diagramflyer: A search engine for data-driven diagrams, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 183–186.</p>
        <p>[8] S. R. Choudhury, S. Wang, C. L. Giles, Scalable algorithms for scholarly figure mining and semantics, in: Proceedings of the International Workshop on Semantic Big Data, 2016, pp. 1–6.</p>
        <p>[9] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images, in: European Conference on Computer Vision, Springer, 2016, pp. 235–251.</p>
        <p>[10] S. E. Kahou, A. Atkinson, V. Michalski, Á. Kádár, A. Trischler, Y. Bengio, Figureqa: An annotated figure dataset for visual reasoning, in: ICLR Workshop, 2018.</p>
        <p>[11] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, M. Rohrbach, Towards vqa models that can read, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326.</p>
        <p>[12] T. Hiippala, M. Alikhani, J. Haverinen, T. Kalliokoski, E. Logacheva, S. Orekhova, A. Tuomainen, M. Stone, J. A. Bateman, Ai2d-rst: A multimodal corpus of 1000 primary school science diagrams, Language Resources and Evaluation (2020) 1–28.</p>
        <p>[13] N. Methani, P. Ganguly, M. M. Khapra, P. Kumar, Plotqa: Reasoning over scientific plots, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1527–1536.</p>
        <p>[14] J. Poco, J. Heer, Reverse-engineering visualizations: Recovering visual encodings from chart images, in: Computer Graphics Forum, volume 36, Wiley Online Library, 2017, pp. 353–363.</p>
        <p>[15] S. Elzer, E. Schwartz, S. Carberry, D. Chester, S. Demir, P. Wu, A browser extension for providing visually impaired users access to the content of bar charts on the web, in: WEBIST (2), Citeseer, 2007, pp. 59–66.</p>
        <p>[16] S. Demir, S. Carberry, K. McCoy, Generating textual summaries of bar charts, in: Proceedings of the Fifth International Natural Language Generation Conference, Association for Computational Linguistics, Salt Fork, Ohio, USA, 2008, pp. 7–15. URL: https://www.aclweb.org/anthology/W08-1103.</p>
        <p>[17] C. Greenbacker, S. Carberry, K. McCoy, A corpus of human-written summaries of line graphs, in: Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop, Association for Computational Linguistics, Edinburgh, Scotland, 2011, pp. 23–27. URL: https://www.aclweb.org/anthology/W11-2703.</p>
        <p>[18] C. Greenbacker, P. Wu, S. Carberry, K. McCoy, S. Elzer, Abstractive summarization of line graphs from popular media, in: Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages, Association for Computational Linguistics, Portland, Oregon, 2011, pp. 41–48. URL: https://www.aclweb.org/anthology/W11-0506.</p>
        <p>[19] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.</p>
        <p>[20] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International conference on machine learning, 2015, pp. 2048–2057.</p>
        <p>[21] M. Ranzato, S. Chopra, M. Auli, W. Zaremba, Sequence level training with recurrent neural networks, in: International Conference on Learning Representations, 2016.</p>
        <p>[22] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.</p>
        <p>[23] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, J. Gao, Unified vision-language pre-training for image captioning and vqa, in: AAAI, 2019.</p>
        <p>[24] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: European Conference on Computer Vision, Springer, 2020, pp. 121–137.</p>
        <p>[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.</p>
        <p>[26] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.</p>
        <p>[27] R. Smith, An overview of the tesseract ocr engine, in: Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, IEEE, 2007, pp. 629–633.</p>
        <p>[28] J. Gu, Z. Lu, H. Li, V. O. Li, Incorporating copying mechanism in sequence-to-sequence learning, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1631–1640.</p>
        <p>[29] A. See, P. J. Liu, C. D. Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1073–1083.</p>
        <p>[30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners (2019).</p>
        <p>[31] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, in: Advances in neural information processing systems, 2019, pp. 5753–5763.</p>
        <p>[32] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, JMLR (2019).</p>
        <p>[33] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: ACL, 2020.</p>
        <p>[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.</p>
        <p>[35] S. Marcel, Y. Rodriguez, Torchvision the machine-vision package of torch, in: Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1485–1488.</p>
        <p>[36] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.</p>
        <p>[37] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, 2002, pp. 311–318.</p>
        <p>[38] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.</p>
        <p>[39] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.</p>
        <p>[40] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.</p>
        <p>[41] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR, 2015.</p>
        <p>[42] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT (1), 2019.</p>
      </sec>
      <sec id="sec-5-2">
        <p>The supervised learning setup is conducted on our annotated English PMC dataset in Sec. 3. We keep 1,000 charts in the test set and split the remaining charts into training (5,819) and validation (646) sets with a ratio of 9:1. We train our model on the training set and tune the hyperparameters on the validation set. The test set is only used to report results. We train for 200 epochs on this small dataset. All our code is written in PyTorch and all experiments converge in 4 to 5 hours on 1 Titan V GPU.</p>
        <p>For the base model, we use a ResNet-101 model from the Torchvision [35] library<sup>9</sup>. We resize the image to 224 × 224 and the backbone model maps it to a 7 × 7 grid of 2048-dimensional vectors. We use 512 dimensions for the LSTM and 256 dimensions for the word embedding. The attentive hidden states have the same size as the hidden states (512 dimensions). We use the Adam [41] optimizer with a fixed learning rate of 10<sup>−4</sup>. The batch size is 64.</p>
        <p>For the OCR model, we sort the OCR texts by their confidence and remove the empty texts. We keep the top 20 OCR texts for post-processing. We use 512 dimensions for the OCR feature representations (yellow blocks in Fig. 3). Since we want the image positions to be aligned with the OCR positions, we do not apply random resizing and cropping but directly resize the chart to 224 × 224.</p>
        <p>For the pre-trained GPT-2 [30] model, we downloaded the small GPT-2 model (124M parameters) from Hugging Face’s Transformers [36]<sup>10</sup>. The GPT-2 implementation supports cross-attention layers as in Vaswani et al. [34], and we use them to attend to the OCR features. For the fixed-length transformer, we use 1 layer with the same architecture as the GPT-2 model but do not apply the causal attention mask. We use the Adam [41] optimizer with a weight decay of 0.01, following the practice in Devlin et al. [42]. We do not use weight decay for the layer normalization layers and biases. We use a linear warmup with a peak learning rate of 10<sup>−4</sup>. The first 5% of the steps are warmup steps. The batch size is 64.</p>
        <p>In order to increase the number of training examples, we apply the proposed semi-supervised learning technique. The single-chart classifier is based on the ResNet-101 model and is fine-tuned on our datasets. We use the 50,000 human-labeled images (7,465 positives) from the PMC data to build this classifier. The training, validation, and test sets have 5,819, 646, and 1,000 data points, respectively. The data split is the same as in the above supervised learning setup. After the model converges on the training set, we calibrate the classifier to optimize the recall with a precision over 99% on the validation set. Since we have lots of images, we can afford a lower recall in exchange for high-quality positive examples.</p>
        <p>We then use this classifier to filter the unlabeled images in the PMC data to augment the training set. More specifically, besides the 50,000 images we used in the crowdsourcing task, there are 137,928 remaining articles in our PMC collection from the years 2011 to 2019. After applying the chart classifier, we obtain 13,637 single-chart images, which serve as additional training examples for the summarization model. The hyperparameters of the summarization model are the same as the ones used in the supervised setup. For the models trained on this dataset, we use a max sequence length of 80 and train for 100 epochs. The other hyperparameters are the same as for the small supervised PMC data for each model.</p>
        <p>For domain adaptation, we take charts and captions from English arXiv as the target domain. As described in the dataset section, we have manually annotated 370 single-chart images in this domain, which serve as the test set. We use the same chart classifier as in the previous semi-supervised learning setup to annotate 140,000 arXiv images. This results in 22,044 positive examples. We split these 22,044 examples into training data (19,840) and validation data (2,204) with a ratio of 9:1. The summarization model is trained on the training data, tuned on the validation data, and finally evaluated on the manually annotated test set. For the models trained on this dataset, we use a max sequence length of 40 since the captions in arXiv are much shorter. Since we halve the max sequence length, we train for 200 epochs and thus roughly keep the same computational resources for both datasets.</p>
        <sec>
          <title>B. Details of Data Collection</title>
          <p>The crowdsourcing task is conducted on Appen<sup>11</sup>. There are 2263 distinct annotators from 50 countries. Since the task is to classify image types, it does not require native English speakers. The top 5 countries are Venezuela (53%), USA (23%), Egypt (8%), Colombia (2%), and Peru (1.4%). We paid one cent per judgement (image). For the first round of annotation tasks, the Fleiss’ kappa scores for the “whether it’s a single chart” and “chart type” tasks are 0.56 and 0.73, respectively, which shows pretty significant agreement.</p>
        </sec>
        <sec>
          <title>C. Additional Analysis</title>
        </sec>
        <sec>
          <title>C.1. Cross-Domain Transferability</title>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <p>To illustrate the need for domain adaptation led by the chart classifier (in Sec. 4.4), we show the low cross-domain transferability of models in this section. Each row in Table 5 indicates the results of our final model trained on the designated dataset, while each column indicates the evaluation results on the corresponding test set. The model does</p>
      </sec>
      <sec id="sec-5-4">
        <title>9https://pytorch.org/docs/stable/torchvision/models.html 10https://github.com/huggingface/transformers 11client.appen.com</title>
        <p>not transfer well between different domains, probably because of the different figure and captioning conventions of different communities. The different topics also introduce diverging vocabularies.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>D. Ethical Considerations</title>
      <p>The technique developed in this paper would help automatically summarize news, articles, and publications that involve charts. It would also help visually impaired people understand the content of charts. It would fail when the OCR detector misses the key information in a chart, which would lead to an unfaithful summary of the chart. Since we use a pre-trained language decoder in our final model, the generated summaries might be biased towards the pre-training domain of the language decoder. Regarding the dataset collection, we resolved all legal and licensing issues for the PMC dataset before showing the images to annotators. More specifically, we only use articles with CC BY licenses from the Open Access Subset of PMC. For the arXiv data, a small test set is annotated by the authors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Carberry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Demir</surname>
          </string-name>
          ,
          <article-title>Information graphics: an untapped resource for digital libraries</article-title>
          ,
          <source>in: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>581</fpage>
          -
          <lpage>588</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Savva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chhajta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agrawala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          , Revision: Automated classi-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>