Dual Reinforcement-Based Specification Generation for Image De-Rendering
Ramakanth Pasunuru♣ David Rosenberg† Gideon Mann† Mohit Bansal♣
♣UNC Chapel Hill    †Bloomberg LP
{ram,mbansal}@cs.unc.edu
{drosenberg44,gmann16}@bloomberg.net
Abstract
Advances in deep learning have led to promising progress
in inferring graphics programs by de-rendering computer-
generated images. However, current methods do not explore
which decoding methods lead to better inductive bias for in-
ferring graphics programs. In our work, we first explore the
effectiveness of LSTM-RNN versus Transformer networks
as decoders for order-independent graphics programs. Since
these are sequence models, we must choose an ordering of
[...] inductive bias in the decoder via multiple diverse rewards based both on the graphics program specification and the rendered image. We also explore the combination of these complementary rewards. We achieve state-of-the-art results on two graphics program generation datasets.

Figure 1: Example images from the abstract scene dataset (left) and the Noisy Shapes dataset (right), along with portions of their specifications.
1 Introduction
The large majority of computer vision work deals in the domain of natural images or video. However, there is tremendous potential for applying computer vision techniques to computer-generated images, such as plots, charts, schematics, complicated math formulas, and even a page of printed text. For these domains, there is often a domain-specific language for precisely specifying the image, such as matplotlib code for a chart, PicTeX for a schematic¹, and LaTeX for math formulas and text. "De-rendering" a computer-generated image back to the original (or a different) domain-specific language specification can be a useful first step in many tasks, such as changing the visual appearance of an image (Huang et al. 2016; Wu, Tenenbaum, and Kohli 2017) or extracting information contained in an image (Cliche et al. 2017; Mishchenko and Vassilieva 2011).

The de-rendering problem is part of a larger class of "image-to-text" problems, in which an input image is mapped to some sequence of output tokens. The neural encoder-decoder approach has proved to be very successful for this class of problems, including image captioning (Karpathy and Fei-Fei 2015; Xu et al. 2015), handwriting recognition (Bluche, Louradour, and Messina 2016), as well as the de-rendering problem for math formulas (Deng et al. 2017) and graphics images (Ellis et al. 2018; Wu, Tenenbaum, and Kohli 2017). In this paper, we improve these encoder-decoder models for the specific case of graphical images, via methods based on Transformer models with both cross-entropy training and reinforcement learning with up to two "dual modality" reward functions. De-rendering graphical images is a problem that differs in several interesting ways from image captioning and OCR problems. Two examples of the de-rendering problem we consider are shown in Fig. 1. Each image is an input, and a portion of the desired output is displayed below each image. In de-rendering, every object in the image must be described in the specification, and typically many output tokens are required to describe each object. Thus outputs from de-rendering are typically much longer than those in image captioning datasets (Chen et al. 2015), since caption labels (e.g., in COCO (Lin et al. 2014)) tend to focus on simple descriptions involving only the most salient objects in the image. OCR and de-rendering are similar in that they encode information about all elements of the image, but the order of the output sequence in OCR is completely determined by the image, while in de-rendering, the output sequences represent sets², and as such the final rendering is invariant to a large degree of reordering in the output sequence (e.g., by shuffling the sub-sequences of tokens that correspond to separate objects).

We start our investigation with a basic image captioning model (similar to Wu, Tenenbaum, and Kohli (2017)) and extend it with an attention mechanism. We then swap out the LSTM-RNN decoder for a Transformer network (Vaswani et al. 2017). Our original motivation for this replacement was that output generation requires long-term dependencies to avoid representing the same object multiple times. As mentioned above, de-rendered output sequences can be quite long, and we expected the multi-head attention mechanism of the Transformer to handle long-range dependencies better than the LSTM-RNN. Unexpectedly, we found another advantage of Transformers over LSTM-RNNs for handling output sequences that can be reordered in many ways and still be correct. We expand on this in Sec. 7.1. To our knowledge, we are the first to use Transformer networks for de-rendering graphical images, and we find this change is a significant source of our performance improvement.

Another challenge with graphics de-rendering is that changing one or a few tokens in the specification can cause a significant change in many pixel values (e.g., by changing the location or color of a large object). Conversely, one can have two images that are very close visually, yet have completely different specifications. To this end, we explore error minimization in the image as well as the specification space via a dual-modality, two-way reward reinforcement learning approach (Williams 1992; Zaremba and Sutskever 2015). We train with non-differentiable reward functions that reflect performance measures of interest in both the image space and the specification space (the "dual modes"). We further explore training a single model using rewards from both modalities, with the hope that we get complementary feedback from each.

We empirically evaluate our methods on two image de-rendering datasets: the Noisy Shapes dataset (Ellis et al. 2018) and the Abstract Scene dataset (Zitnick and Parikh 2013; Wu, Tenenbaum, and Kohli 2017). Our Transformer models trained with cross-entropy loss achieve very significant improvements over previous work on these datasets. We show even more improvement when we train the Transformer models with policy gradient-based methods, first via single-modality rewards and then further via dual-modality joint rewards. Finally, in our analysis we find evidence that the performance of Transformers is relatively insensitive to the ordering of objects in the output sequence, while the performance of LSTM-RNNs can decay substantially for a poorly chosen object ordering. This suggests that the advantage of Transformers over LSTM-RNNs may be particularly strong in tasks where we are using an output sequence to represent an unordered set of objects.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹https://ctan.org/pkg/pictex?lang=en
²We say sets, rather than sequences, because in our datasets object ordering does not affect the rendering.

2 Related Work

De-rendering a computer-generated image to a domain-specific language provides an abstraction that is easy to change, store, compare, and match to other images. As a consequence, there has been recent interest and work in this area. Huang et al. (2016) used CNNs to translate a hand-drawn sketch of an object (e.g., jewellery) to a fixed set of parameters for a procedural model. In a similar vein, Nishida et al. (2016) proposed a simple procedural grammar as a building block to turn sketches into realistic 3D models. Ellis et al. (2018) proposed an automatic visual program induction model to infer programs from hand-drawn images, where the images are encoded via CNNs and a multi-layer perceptron predicts a distribution over drawing commands. Ha and Eck (2018) presented a recurrent neural network based sketch-rnn for conditional and unconditional sketch generation of common objects, constrained by a very simple set of primitives. Their model describes images as pen movements, either in a drawing mode or in a non-drawing mode. Unlike our approach, this program is highly sequence dependent and non-compositional: while there are different valid solutions obtainable by re-ordering, one cannot arbitrarily shuffle the sequence of pen movements. Liu et al. (2019) infer scene programs by exploiting hierarchical object-based scene representations. Sun et al. (2018) proposed a neural program synthesizer that generates underlying programs for behaviorally diverse demonstration videos. In this work, we use Transformer networks (Vaswani et al. 2017) for decoding the specification from the given input image. Transformers have been used in other generation tasks such as image and video captioning (Sharma et al. 2018; Zhou et al. 2018); however, we are the first to use Transformer networks for the image de-rendering problem. Vinyals, Bengio, and Kudlur (2015) show that an LSTM trained with shuffled (unordered) targets using cross-entropy has a substantial drop in performance compared to natural orderings. Our results support their findings, and moreover we find that Transformers, by contrast, are relatively insensitive to the ordering of the objects.
Recently, policy gradient-based reinforcement learning (RL) methods have been widely used for sequence generation tasks: machine translation (Ranzato et al. 2016), image captioning (Ranzato et al. 2016; Rennie et al. 2017), and textual summarization (Paulus, Xiong, and Socher 2018; Pasunuru and Bansal 2018). Daumé, Langford, and Marcu (2009) proposed to improve sequence generation by allowing a model to use its own predictions at training time, extending their work in structured prediction. In the context of program synthesis, Bunel et al. (2018) used RL for generating semantically correct programs. In the context of image de-rendering, Wu, Tenenbaum, and Kohli (2017) proposed a neural scene de-rendering model (NSD) with a neural encoder and a graphics engine as a decoder. The encoder has an object proposal generator that produces segment proposals, and then it tries to interpret objects and their properties from these segments. They use RL to better sample the proposals and use the rendered-image reconstruction error as reward. Recently, Ganin et al. (2018) introduced an adversarially trained agent that learns, via reinforcement learning and without any supervision, to generate a program that is executed by a graphics engine to interpret and sample images. In contrast, our work presents two complementary rewards (one in image space and another in specification space) in a reinforcement learning setup for the image de-rendering problem.

3 Models

Task. For each task we consider, there is a simple graphics specification language that can be used to specify a particular image. While differing in details, the overall scheme of the specifications is the same for each. A specification consists of a set of "objects", and each object is specified by a set of properties. Examples of an object specification for each of our tasks can be seen in Fig. 1. Given an image rendered from a specification, our task is to "de-render" this image back to the original specification. We can evaluate a predicted specification by looking for exact matches between the objects in the predicted specification and the objects in the original specification. We can summarize the object matches with standard measures, such as precision, recall, F1, and intersection-over-union. While these measures describe performance on a single image, we can average them across a collection of images to get an overall performance measure for a method. We provide more details in Sec. 5.2. Another approach to evaluation is to generate the image corresponding to a predicted specification and see how well it matches the original image, using some reasonable metric on the space of images.

Reduction to sequence prediction. While each specification is represented by a set of objects with specific properties, our models require sequences of tokens. We convert the set of objects to a sequence of tokens via some ordering of the objects. We investigate various approaches to ordering (Sec. 7.1) and find that ordering by object type works best. Once the model predicts a sequence of tokens, we can parse it back into the original structure to compute performance measures and our reward functions for reinforcement learning.
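To make the reduction concrete, the sketch below shows one possible serialization of an object set into a token sequence ordered by object type, together with the inverse parse used to compute metrics and rewards. The Obj structure, property layout, and special tokens are illustrative assumptions, not the datasets' actual vocabularies.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Obj:
    # Hypothetical object specification: a type name plus a property dictionary.
    obj_type: str
    props: dict

def spec_to_tokens(objects: List[Obj]) -> List[str]:
    """Serialize an (unordered) set of objects into a token sequence,
    ordering objects by type, which worked best in our experiments."""
    tokens = []
    for obj in sorted(objects, key=lambda o: o.obj_type):
        tokens.append(obj.obj_type)
        for name in sorted(obj.props):
            tokens.append(f"{name}={obj.props[name]}")
    tokens.append("<eos>")
    return tokens

def tokens_to_spec(tokens: List[str]) -> List[Obj]:
    """Parse a predicted token sequence back into a set of objects,
    so that rewards and evaluation metrics can be computed on objects."""
    objects, current = [], None
    for tok in tokens:
        if tok == "<eos>":
            break
        if "=" not in tok:                  # a new object starts with its type token
            current = Obj(tok, {})
            objects.append(current)
        elif current is not None:
            name, value = tok.split("=", 1)
            current.props[name] = value
    return objects
```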
3.1 Image-to-LSTM Sequence Model

Our baseline model is similar to an image captioning model with an attention mechanism (Xu et al. 2015). We use the ResNet-18 architecture (He et al. 2016) for encoding the input image, and we use an LSTM-RNN for predicting the corresponding specification as a sequence of tokens.

We denote the convolutional features from the ResNet-18 as {f_i}_{i=1}^{m}, where f_i ∈ R^d. For any decoder output token o, let E_o ∈ R^{d'} denote its embedding, which is learned during training. Let s_t be the decoder state at step t, o_t the output token at step t, and c_t the image context vector at step t, which is defined below. Then at step t, the decoder state s_t is given by

s_t = F(c_t, s_{t-1}, E_{o_{t-1}}),    (1)

where F is a trainable non-linear function. The context vector c_t is a convex combination of the image features, c_t = Σ_{i=1}^{m} α_{t,i} f_i, where the α_{t,i} are "attention weights" defined as

α_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{m} exp(e_{t,k}),    (2)
e_{t,i} = v^T tanh(W f_i + U s_{t-1} + b),    (3)

where v, W, U, and b are trainable parameters.
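A minimal PyTorch sketch of one decoder step implementing Eqs. (1)-(3) is given below; the module names, tensor shapes, and the use of an LSTM cell for F are our assumptions about one reasonable realization, not the original implementation.

```python
import torch
import torch.nn as nn

class AttnLSTMDecoderStep(nn.Module):
    """One step of the LSTM decoder with additive attention (Eqs. 1-3)."""
    def __init__(self, d_feat, d_embed, d_hidden):
        super().__init__()
        self.W = nn.Linear(d_feat, d_hidden, bias=False)     # W f_i
        self.U = nn.Linear(d_hidden, d_hidden, bias=True)    # U s_{t-1} + b
        self.v = nn.Linear(d_hidden, 1, bias=False)          # v^T tanh(.)
        self.cell = nn.LSTMCell(d_feat + d_embed, d_hidden)  # F(c_t, s_{t-1}, E_{o_{t-1}})

    def forward(self, feats, prev_embed, prev_state):
        # feats: (batch, m, d_feat); prev_embed: (batch, d_embed)
        h_prev, c_prev = prev_state
        e = self.v(torch.tanh(self.W(feats) + self.U(h_prev).unsqueeze(1)))  # (batch, m, 1)
        alpha = torch.softmax(e, dim=1)                 # attention weights, Eq. (2)
        context = (alpha * feats).sum(dim=1)            # c_t, convex combination of features
        h, c = self.cell(torch.cat([context, prev_embed], dim=-1), (h_prev, c_prev))
        return h, (h, c), alpha.squeeze(-1)
```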
Figure 2: Our Image-Transformer model.

3.2 Image-to-Transformer Sequence Model

Recently, there has been an increasing amount of interest in Transformer networks (Vaswani et al. 2017), which are said to train faster and to better capture long-term dependencies than LSTM-based RNN models. In our specification prediction problem, the length of the specification can be large, and we need long-term dependencies to avoid generating objects that have already been generated. This suggests Transformer networks would be a better fit for our scenario. In this work, we only use the decoder part of the Transformer network (Vaswani et al. 2017). The Transformer encoder is designed for sequence inputs, and we replace it with the ResNet-18 CNN described above. We give a high-level description of the Transformer decoder below and refer to Vaswani et al. (2017) for full details.

The decoder of the Transformer has a stack of N identical layers containing self-attention modules, normalization modules, and feed-forward modules, along with a positional encoding module for the output embeddings (see Fig. 2). While the original model in Vaswani et al. (2017) used N=6, through hyperparameter tuning we found N=4 to work better for our problem. Otherwise, we used the hyperparameter settings of Vaswani et al. (2017). The decoder has two attention modules: one for attending to the image convolution features and a self-attention module for attending to previous positions in the decoder state.

Attention in Transformer. As shown in Fig. 2, we have two attention mechanisms in the model: one attending to the CNN features and another attending to different parts of the decoder state. They share the same structure, which we describe below.

An attention mechanism in the Transformer can be viewed as a mapping from a query (Q) and key-value (K, V) pairs to an output. Attention weights are computed from the query and the keys, and those weights are used with the values to compute the output of the attention module. Empirically, rather than performing a single attention function, it works better to linearly project the queries, keys, and values with different learned projection layers, perform the attention function in parallel, and concatenate the outputs to form the final attention module output. This is the multi-head attention mechanism (MH), defined as follows:

MH(Q, K, V) = Concat(head_1, ..., head_h) W^O    (4)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (5)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (6)

where d_k is the dimension of the queries and keys, and W_i^Q, W_i^K, and W_i^V are the projection matrices.
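Below is a sketch of Eqs. (4)-(6) in PyTorch (masking for the self-attention case is omitted for brevity); in practice a library module such as torch.nn.MultiheadAttention computes the same quantity.

```python
import math
import torch
import torch.nn as nn

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (6). Q, K, V: (batch, heads, len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (batch, heads, len_q, len_k)
    return torch.softmax(scores, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    """Multi-head attention, Eqs. (4)-(5): project, attend per head, concatenate."""
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.proj_q = nn.Linear(d_model, d_model)
        self.proj_k = nn.Linear(d_model, d_model)
        self.proj_v = nn.Linear(d_model, d_model)
        self.proj_o = nn.Linear(d_model, d_model)          # W^O

    def split(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d_k).transpose(1, 2)   # (batch, h, n, d_k)

    def forward(self, query, key, value):
        heads = attention(self.split(self.proj_q(query)),
                          self.split(self.proj_k(key)),
                          self.split(self.proj_v(value)))       # (batch, h, n, d_k)
        b, _, n, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(b, n, self.h * self.d_k)
        return self.proj_o(concat)
```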
Position-wise Feed-Forward Networks. In addition to the attention sub-layers, each of the layers in the Transformer decoder contains a fully connected feed-forward network that is applied to each position of the decoder separately and identically. This network is defined as

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2,    (7)

where W_1, W_2, b_1, and b_2 are linear projection parameters that are shared across positions but differ from layer to layer.
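A sketch of Eq. (7); the inner dimension d_ff is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Position-wise feed-forward network, Eq. (7), applied identically at every position."""
    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)    # x W_1 + b_1
        self.w2 = nn.Linear(d_ff, d_model)    # (.) W_2 + b_2

    def forward(self, x):                     # x: (batch, len, d_model)
        return self.w2(torch.relu(self.w1(x)))
```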
make a direct comparison between the predicted specifica-
Positional Encoding. In the model described thus far, the tion and the ground truth specification, and one of the re-
model is symmetric with respect to sequence position. For wards is based in “image space”, which compares the image
example, at the bottom right of Fig. 2, the model has no rendered from the predicted specification with original input
structural way to determine which output embeddings come image. We also investigate using these rewards in combina-
from which part of the output sequence. To remedy this is- tion, with the hope that there is complementary information
sue, we concatenate a “positional encoding” (PE) to the em- in the feedback based on the two spaces.
bedding representation of the tokens. We use the sine and Intersection-Over-Union Reward (IOU) As mentioned
cosine functions for positional encoding: in Sec. 3, after the specification is predicted as a sequence
PE(pos, 2i) = sin(pos/100002i/dmodel ) of tokens, we can parse the sequence into a set of object
(8) specifications. The intersection-over-union (IOU) reward is
PE(pos, 2i + 1) = cos(pos/100002i/dmodel ) based in specification space. Roughly speaking, the IOU re-
where pos is the position, i is the dimension, and dmodel is ward gives credit for predicting objects that exactly match
the dimension of the embedding vector representation. objects in the ground truth specification, and penalizes both
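Eq. (8) can be computed as below (assuming an even d_model); the commented lines at the end indicate the concatenation with the output embeddings described above.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding, Eq. (8): PE[pos, 2i] and PE[pos, 2i+1]."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# The encoding is combined with the (shifted) output token embeddings before the
# first decoder layer; here we concatenate along the feature dimension, e.g.:
#   decoder_input = np.concatenate(
#       [embeddings, positional_encoding(seq_len, d_model)], axis=-1)
```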
4 Dual-Modality Two-Way Reinforcement Learning

Traditionally, sequence generation models are trained using a cross-entropy loss. More recently, a policy gradient-based reinforcement learning approach has been explored for sequence generation tasks (Ranzato et al. 2016; Rennie et al. 2017), which has two advantages over cross-entropy optimization: (1) it avoids the exposure bias issue, which arises from the mismatch between train-time and test-time decoding in cross-entropy training (Bengio et al. 2015; Ranzato et al. 2016); and (2) it allows direct optimization of the evaluation metric of interest, even if it is not differentiable. To this end, we use a policy gradient-based approach via rewards in both the specification space and the image space. We also explore joint rewards based on these two spaces, to better capture feedback that is complementary between the two modalities.

For this reward optimization, we use the REINFORCE algorithm (Williams 1992; Zaremba and Sutskever 2015) to learn a policy p_θ that produces a distribution over sequences o^s for any given input. We seek a policy p_θ that maximizes the expected reward of a label sequence o^s drawn according to the predicted distribution. Equivalently, we minimize the following loss function, averaged across all training inputs:

L_RL = -E_{o^s ~ p_θ}[r(o^s)],    (9)

where o^s is the sequence of sampled tokens, with o^s_t sampled at time step t of the decoder. We can approximate the gradient of this loss function with respect to the parameters θ using a single sample o^s drawn from p_θ as:

∇_θ L_RL = -(r(o^s) - b_e) ∇_θ log p_θ(o^s),    (10)

where the baseline estimator b_e is included for variance reduction (Zaremba and Sutskever 2015). There are several ways to calculate the baseline estimator; we employ the effective SCST approach (Rennie et al. 2017).
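The following sketch shows the resulting training objective with a self-critical baseline, where the reward of the greedy decode plays the role of b_e; the decoder interface (sample, greedy) is a hypothetical one, not the actual code.

```python
import torch

def scst_loss(decoder, image_feats, reward_fn):
    """Self-critical REINFORCE surrogate loss (Eqs. 9-10).

    Assumed interface:
      decoder.sample(feats) -> (sampled token ids, sum of their log-probs, with grad)
      decoder.greedy(feats) -> greedily decoded token ids (no gradient needed)
      reward_fn(tokens)     -> scalar reward, e.g. the IOU or image-distance reward
    """
    sample_tokens, sample_logprob = decoder.sample(image_feats)
    with torch.no_grad():
        baseline_tokens = decoder.greedy(image_feats)
    r_sample = reward_fn(sample_tokens)
    r_baseline = reward_fn(baseline_tokens)        # baseline b_e: reward of the greedy decode
    # Backpropagating through this surrogate yields the gradient in Eq. (10):
    # -(r(o^s) - b_e) * d/dtheta log p_theta(o^s)
    return -(r_sample - r_baseline) * sample_logprob
```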
4.1 Rewards

In this work we consider three different reward functions. Two of the rewards are based in "specification space" and make a direct comparison between the predicted specification and the ground-truth specification, and one reward is based in "image space" and compares the image rendered from the predicted specification with the original input image. We also investigate using these rewards in combination, with the hope that there is complementary information in the feedback based on the two spaces.

Figure 3: Example showing samples from our model on the abstract scene dataset and the corresponding rewards in specification and image space. For simplicity, not all object properties are shown in specification space.

Intersection-Over-Union Reward (IOU). As mentioned in Sec. 3, after the specification is predicted as a sequence of tokens, we can parse the sequence into a set of object specifications. The intersection-over-union (IOU) reward is based in specification space. Roughly speaking, the IOU reward gives credit for predicting objects that exactly match objects in the ground-truth specification, and it penalizes both predicting objects that do not match ground-truth objects and failing to predict objects that are part of the ground truth. More formally, let {o_i}_{i=1}^{m} and {o*_j}_{j=1}^{n} represent the objects in the predicted and ground-truth specifications, respectively. Then the IOU reward is defined as:

r_iou = count({o_i}_{i=1}^{m} ∩ {o*_j}_{j=1}^{n}) / count({o_i}_{i=1}^{m} ∪ {o*_j}_{j=1}^{n})    (11)

An object o_i in the predicted specification is the same as an object o*_j in the ground-truth specification if and only if all the properties of the two objects match exactly.
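Eq. (11) translates directly into code if each object is represented as a hashable tuple of all its properties, so that a "match" is exact equality; duplicates are treated as a multiset.

```python
from collections import Counter

def iou_reward(predicted, ground_truth):
    """IOU reward, Eq. (11): exact object matches over the union of both object sets.

    `predicted` and `ground_truth` are lists of hashable object specifications
    (e.g. tuples of all properties); duplicate objects are handled as multisets.
    """
    pred, gold = Counter(predicted), Counter(ground_truth)
    intersection = sum((pred & gold).values())
    union = sum((pred | gold).values())
    return intersection / union if union > 0 else 0.0
```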
Inference Reward. Our second reward, which we call the "inference reward", is also a reward in specification space. The name is based on the "inference error", a performance measure introduced in Wu, Tenenbaum, and Kohli (2017) for the Abstract Scenes dataset. While IOU is based on exact matches between predicted objects and ground-truth objects, the inference error and inference reward are based on the number of properties (within objects) that correctly match the corresponding properties in the ground truth. For those properties specifying location in pixel coordinates, we follow Wu, Tenenbaum, and Kohli (2017) and divide the space of each coordinate into 20 bins of equal size, and we consider it a match if the predicted and ground-truth locations are in the same bin. We define the inference error as the fraction of predicted properties that fail to match the corresponding ground-truth properties. The inference reward is one minus the inference error.
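One possible reading of this reward as code is sketched below: properties are compared one by one, with pixel-coordinate properties quantized into 20 equal bins first. The position-wise pairing of predicted and ground-truth objects and the property layout are simplifying assumptions.

```python
def quantize(value, max_value, n_bins=20):
    """Map a coordinate in [0, max_value) to one of n_bins equal-size bins."""
    return min(int(value * n_bins / max_value), n_bins - 1)

def inference_reward(pred_objects, gold_objects, coord_props, max_values):
    """Inference reward = 1 - inference error (fraction of mismatched properties).

    Objects are dicts mapping property name -> value; `coord_props` lists the
    pixel-coordinate properties compared after 20-bin quantization, with their
    ranges in `max_values`. Objects are paired position-wise here for simplicity.
    """
    total, mismatched = 0, 0
    for pred, gold in zip(pred_objects, gold_objects):
        for name, gold_value in gold.items():
            total += 1
            pred_value = pred.get(name)
            if name in coord_props and pred_value is not None:
                match = (quantize(pred_value, max_values[name])
                         == quantize(gold_value, max_values[name]))
            else:
                match = pred_value == gold_value
            if not match:
                mismatched += 1
    return 1.0 - (mismatched / total if total else 0.0)
```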
Image Distance Reward. Our third and final reward, the "image distance reward", is in image space. We define it generically first, as it takes slightly different forms on our two datasets. If we let I and I^R represent vectorized versions of the input image and the image rendered from the predicted specification, respectively, then we define the image distance as

d_img = ||I ⊖ Ψ(I^R)||_2^2,    (12)

where ||·||_2 is the ℓ2-norm, Ψ is a dataset-specific transform of the rendered image, and ⊖ is a dataset-specific elementwise comparison.

For the Noisy Shapes dataset, we follow Ellis et al. (2018) and take Ψ to be a Gaussian blurring function, as the objects in the target image have noise (see Fig. 1). We take ⊖ to be simple subtraction. The image reward for this dataset is:

r_img = c / d_img    (13)

where c is a tunable parameter.

For the Abstract Scene dataset, ⊖ is a logical operator that takes the value 0 in every position where the pixel values "match", and 1 in every other position. The range of possible pixel values is 0-255 and, similarly to the discretization of positions in the inference reward, we divide the pixel value range into 20 equal-size buckets and consider pixel values to match if they are in the same bucket. We take Ψ to be the identity function. The image reward for the Abstract Scene dataset is then defined as:

r_img = 1 - d_img / (w·h)    (14)

where w and h are the width and height of the image.
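The two instantiations of Eqs. (12)-(14) might look as follows; the blur width, the scale c, and the assumption of 2-D grayscale pixel arrays are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noisy_shapes_reward(image, rendered, c=1.0, sigma=2.0):
    """Eqs. (12)-(13): Psi = Gaussian blur, elementwise subtraction, r = c / d_img."""
    diff = image.astype(float) - gaussian_filter(rendered.astype(float), sigma=sigma)
    d_img = float(np.sum(diff ** 2))
    return c / d_img if d_img > 0 else c       # guard against a perfect reconstruction

def abstract_scene_reward(image, rendered, n_buckets=20):
    """Eqs. (12), (14): pixels 'match' if they fall in the same of 20 value buckets;
    d_img counts the mismatches and the reward is 1 - d_img / (w * h)."""
    bucket = lambda img: np.clip(img.astype(int) * n_buckets // 256, 0, n_buckets - 1)
    mismatch = (bucket(image) != bucket(rendered)).astype(float)
    return 1.0 - float(mismatch.sum()) / mismatch.size
```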
Joint Dual-Modality Reward. Since we expect the rewards based in specification space to be complementary to the reward based in image space, we want a way to combine rewards across the two spaces. One way to combine two rewards is to create a weighted combination of the individual rewards to formulate a joint reward. Another approach is to alternate the reward used during the learning process (Pasunuru and Bansal 2018). In this work, we follow the latter approach, as the former requires expensive tuning for scale and weight balancing. Let r_1 and r_2 be the two reward functions that we want to optimize. In our approach, we first take a_1 optimization steps to minimize the reinforcement learning loss L_RL1(r_1; θ) (i.e., we use a_1 mini-batches). Then we take a_2 optimization steps to minimize the reinforcement learning loss L_RL2(r_2; θ). We repeat this cycle of steps until convergence. All other optimization parameters, such as step size, remain the same for each set of steps. The values a_1 and a_2 are tuning parameters.³ The two rewards r_1 and r_2 could be based on different aspects of the output, such as the IOU and image distance rewards.

³Pasunuru and Bansal (2018) set a_1 and a_2 to 1, without tuning.
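A sketch of this alternating schedule, reusing the hypothetical scst_loss helper from the Sec. 4 sketch; a_1:a_2 is the mixing ratio reported in Sec. 5.3.

```python
import itertools

def train_joint_reward(decoder, optimizer, data_loader, reward_1, reward_2, a1=1, a2=4):
    """Alternate between a1 mini-batches of L_RL(r1) and a2 mini-batches of L_RL(r2),
    over one pass of data_loader, which is assumed to yield (image_feats, target) pairs."""
    schedule = itertools.cycle([reward_1] * a1 + [reward_2] * a2)
    for (image_feats, _), reward_fn in zip(data_loader, schedule):
        loss = scst_loss(decoder, image_feats, reward_fn)   # defined in the Sec. 4 sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```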
Model                         Precision  Recall  F1    IOU   IOU_1.0  IOU_0.8  IOU_0.6
CROSS-ENTROPY LOSS
Image2LSTM+atten.             98.7       98.5    98.6  97.6  90.7     95.3     98.8
Image2Transformer             99.1       99.1    99.1  98.5  94.1     97.3     99.1
IMAGE2TRANSFORMER WITH REINFORCE LOSS
IOU Reward                    99.4       99.3    99.3  98.8  95.0     98.0     99.4
Image-distance Reward         99.4       99.2    99.3  98.8  94.5     98.0     99.5
Image-distance + IOU Reward   99.4       99.3    99.3  98.8  95.0     98.1     99.4

Table 1: Performance of various models on the Noisy Shapes dataset.
5 Experimental Setup

5.1 Datasets

Noisy Shapes Dataset. Ellis et al. (2018) provide a synthetic dataset of images containing multiple simple objects (lines, circles, and rectangles), each with various properties that can be specified. The images are specified using a small subset of LaTeX drawing commands. Additional noise is introduced into the rendered images by rescaling image intensity, translating the image by a few pixels, rendering the LaTeX using the pencildraw style, and randomly perturbing the positions and sizes of the LaTeX drawing commands. The dataset was created by randomly sampling image specifications with between 1 and 12 objects, excluding any specifications that lead to images with overlapping objects. The size of each image is 256x256. The dataset contains 100,000 images paired with specifications, from which we use 1,000 for testing and the rest for training.

Abstract Scene Dataset. The Abstract Scene dataset (Zitnick and Parikh 2013) contains 10,020 images, each of which has 3-18 objects. There are over 100 types of objects, each of which is specified by two integers, one indicating a broad category (e.g., sky object, animal, boy, girl) and another indicating a subcategory (e.g., girl pose, animal type, etc.). Each object can be drawn at one of 3 scales, with or without a horizontal flip, and at any pixel location in the 500x400 image. These properties are specified by 4 additional integers. Thus each object is specified by 6 integers. There are often heavy occlusions among these objects when rendered in an image (see the input image in Fig. 2). However, the objects are rendered in a deterministic order based on the object types and other properties, and thus the image is independent of the order of the objects in the specification. Similar to Wu, Tenenbaum, and Kohli (2017), we randomly sample 90% of the images for training and use the rest for testing.

5.2 Evaluation Metrics

Noisy Shapes Dataset. As described in the Task description of Sec. 3, we can summarize performance on a single image with precision, recall, F1, and IOU (intersection over union) at the object level. Following previous work (Ellis et al. 2018), we summarize the performance of a method by averaging these metrics across all test examples (i.e., a macro average). Further, we also report IOU_k, defined as the percent of test examples for which the IOU score is greater than or equal to k.
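A sketch of how these object-level metrics and their macro average (including IOU_k) can be computed, reusing the multiset matching from the IOU reward; the representation of objects as hashable tuples is an assumption.

```python
from collections import Counter

def object_metrics(predicted, ground_truth):
    """Object-level precision, recall, F1, and IOU for a single image."""
    pred, gold = Counter(predicted), Counter(ground_truth)
    matched = sum((pred & gold).values())
    precision = matched / max(len(predicted), 1)
    recall = matched / max(len(ground_truth), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    iou = matched / max(sum((pred | gold).values()), 1)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

def macro_average(per_image_metrics, ks=(1.0, 0.8, 0.6)):
    """Average each metric over all test images and report IOU_k coverage."""
    n = len(per_image_metrics)
    summary = {key: sum(m[key] for m in per_image_metrics) / n
               for key in per_image_metrics[0]}
    for k in ks:
        summary[f"iou_{k}"] = sum(m["iou"] >= k for m in per_image_metrics) / n
    return summary
```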
Abstract Scene Dataset. For the Abstract Scene dataset, following previous work (Wu, Tenenbaum, and Kohli 2017), we report the specification inference error and the image reconstruction error based on a micro average across all test examples. As described in Sec. 4.1, the inference error is based on the percentage of incorrectly inferred values (i.e., how many properties of objects do not match the ground truth) for the specification, and the image reconstruction error is based on the percentage of incorrect pixel predictions. During these evaluations, all continuous variables (pixel values, and x and y coordinates) are quantized into 20 bins. Additionally, we report the macro-average-based IOU metric described for the Noisy Shapes dataset.

5.3 Training Details

In all of our models, we encode the image information via ResNet-18 (He et al. 2016), taking the penultimate layer's features as the output of the image encoder. For the LSTM-RNN, we use a hidden state size of 128, an input token embedding size of 128, and a batch size of 64. For Transformer networks, we use the same hidden and embedding sizes, and we use 4 decoder layers. We use the Adam optimizer (Kingma and Ba 2015) with the default learning rate of 0.001 for all cross-entropy models, and a learning rate of 0.0001 for all reinforcement learning based models. For the Noisy Shapes dataset, the maximum decoder length is fixed to 80, and we use a vocabulary of size 27, whose tokens are placeholders for object properties. For the Abstract Scene dataset, the maximum decoder length is fixed to 100, and we use a vocabulary of size 1078, which represents all the object properties. For the joint reward optimization, we use a mixing ratio of 1:1 for the Noisy Shapes dataset and 1:4 for the Abstract Scene dataset.
6 Results

6.1 Results on the Noisy Shapes Dataset

We first compare the performance of the LSTM-RNN model (Image2LSTM+atten) to the Transformer-based model, when both are trained with cross-entropy loss. We see in Table 1 that the Transformer model dominates on all measures. In particular, we highlight IOU_1.0, which measures the percent of examples on which the predicted specification exactly matches the ground-truth specification. While the LSTM-RNN model achieves 90.7% IOU_1.0, the Transformer model achieves 94.1%, an impressive 36.5% reduction in the number of errors. We see similar performance improvements on the other metrics.

We now compare the Transformer model trained with reinforcement learning, using various reward functions, to training with cross-entropy loss. Table 1 shows that, although all three reward variations have roughly the same performance, they all show significant improvement over cross-entropy training, on all measures.⁴ For example, the model trained with the IOU reward achieves 95.0% IOU_1.0, an impressive 15.3% reduction in the number of errors compared to the same model trained with cross-entropy loss, and a 46.2% reduction compared to the original LSTM-RNN model. The performance improvement on the other measures is at least as good.

⁴The improvement of our Transformer models trained with reinforcement learning over the corresponding cross-entropy models is statistically significant with p < 0.01, based on the bootstrap test (Noreen 1989; Efron and Tibshirani 1994).
Model                   Infer. Error  Recons. Error  Avg. Error  IOU
PREVIOUS WORK
CNN+LSTM (2017)         45.31         41.38          43.84       -
NSD (full) (2017)       42.74         21.55          32.14       -
CROSS-ENTROPY LOSS
Image2LSTM+atten.       17.27         15.70          16.48       32.06
Image2Transformer       8.78          10.92          9.85        58.54
IMAGE2TRANSFORMER WITH REINFORCE LOSS
IOU Reward              7.91          10.50          9.20        61.29
Inference Reward        7.81          10.75          9.28        59.35
Recons. Reward          8.34          9.99           9.16        62.44
Inference + Recons.     8.21          10.12          9.16        61.54
IOU + Recons.           8.05          10.04          9.04        62.45

Table 2: Model performance on the Abstract Scene dataset. Errors: lower is better; IOU: higher is better.

6.2 Results on the Abstract Scene Dataset

In Table 2, we see the performance of various models on the Abstract Scene dataset, for the metrics described in Sec. 5.2. We first note that even our baseline LSTM-RNN model (Image2LSTM+atten) shows a very large error reduction compared to the results presented in Wu, Tenenbaum, and Kohli (2017) (first 4 rows of the table). This highlights the importance of an attention mechanism in these tasks. For the models trained with cross-entropy, the Transformer model shows an additional remarkable improvement over the LSTM-RNN model, across all measures.

For reinforcement learning with the Transformer model, we tried three different reward functions, corresponding to three of our performance metrics: inference error, reconstruction error, and IOU. All the Transformer models trained with REINFORCE outperformed the model trained with cross-entropy loss on each of the error measures.⁵ For inference error, the model trained with the inference reward did the best, as one might hope and expect: compared to the cross-entropy-trained Transformer, the inference error was reduced by 11.0%. For reconstruction error (image-based), the best performing model was the one trained with the reconstruction reward, which reduced the reconstruction error by 8.5% compared to the cross-entropy-trained version. When evaluating performance using the average of the inference and reconstruction errors, one of our joint-reward models performed best, though interestingly, not the one that uses the corresponding inference and reconstruction rewards. The best performing model for this measure used the IOU and reconstruction rewards, suggesting that the IOU reward carries more information complementary to the reconstruction error than the inference reward does. For the IOU measure, the model trained with the IOU reward did well, but when trained jointly with the IOU and reconstruction rewards, it performed the best. This suggests that using image-based feedback during training (reconstruction error) can be beneficial even when the ultimate goal (IOU) depends only on the specification output.

⁵For the IOU and inference reward models, this improvement is statistically significant for all metrics except reconstruction error. For the reconstruction reward model, the improvement is significant for all but the inference error metric. For the dual (IOU+Recons.) reward model, the difference is significant for all metrics (p < 0.01 for each test).

7 Analysis

7.1 LSTM vs. Transformer Networks

As noted above, for the Abstract Scene and Noisy Shapes datasets that we consider, the order of the objects in the specification does not affect the final image. Nevertheless, for training both the LSTM-RNN and the Transformer models, one must choose an ordering. We ran an experiment on the Noisy Shapes dataset in which we tried ordering the objects by shape size, by shape type, and by shape position in the rendered image. We found that ordering by shape type worked best across our models, so that is what we used for our main results in Table 1. We also wanted to investigate how important it is to have the objects in some sensible order, compared to a random ordering. Table 3 shows the results of our two models when trained with cross-entropy on specification sequences where the objects are put in random order. We find that the LSTM-RNN model's performance drops dramatically (e.g., IOU_1.0 drops from 90.7% to 72.0%), while the drop with Transformer networks is quite small (e.g., IOU_1.0 drops from 94.1% to 93.2%).⁶ This is additional evidence for Transformers being the preferred model for tasks of this type.

⁶Note that the number of model parameters is approximately the same (11.8M for the Transformer model and 11.5M for the LSTM model). Further, Transformer models are 2.5x faster to train in comparison to the LSTM models. During inference, both models take approximately the same time.
Model                F1    IOU   IOU_1.0
Image2LSTM+atten.    95.2  92.0  72.0
Image2Transformer    99.0  98.3  93.2

Table 3: Performance of LSTM-RNN and Transformer networks on the Noisy Shapes dataset when specifications have randomly ordered objects.

7.2 Performance vs. Data Size
We conduct an experiment in which we vary the percentage of Noisy Shapes data used during our models' training from 10% to 100%, in steps of 20%. We observe that with less data (10%-40%), the RL-based model is approximately 2 points better (on the IOU_1.0 metric) than its corresponding cross-entropy baseline. As we use more data (>60%), the gap between the RL and cross-entropy models decreases to 1 point. This suggests that RL, which has the advantage of exploration, is more powerful when data is limited.

Figure 4: Comparing the ground-truth images from the Noisy Shapes and Abstract Scene datasets with the rendered images of predicted specifications (columns: Ground-truth Images, Transformer Baseline Rendered Images, Transformer RL Rendered Images).

7.3 Output Examples

Fig. 4 presents the rendered images of the specifications predicted by the Image2Transformer cross-entropy model and the corresponding RL-based model, with IOU+Image-distance as the reward for the Noisy Shapes dataset and IOU+Recons. as the reward for the Abstract Scene dataset. In the first example (top row in Fig. 4), the cross-entropy model predicts an extra 'line shape' which is not present in the ground truth. Our RL model correctly predicts exactly the shapes present in the ground truth; however, neither model gets the type of 'line shape' correct in a couple of instances. In the second example (second row in Fig. 4), the cross-entropy model predicts an extra object (glasses), which is not present in the ground-truth image, and also misses the cap on the snake. The RL model improves on the cross-entropy model by not adding any extra objects, but it also misses the cap. In the third example, both rendered images look very similar to the ground truth, but the cross-entropy model predicts one of the objects (glasses) slightly off in position. Our RL model was able to accurately position the glasses (bottom row in Fig. 4). The better performance of the RL model may be due to the image-space component of the error signal, which is more sensitive to position errors, while the cross-entropy loss gives the same penalty to all incorrect positions regardless of the error size.

8 Conclusion

We present various neural de-rendering models based on LSTMs with an attention mechanism and on Transformer networks. Further, we introduce complementary dual rewards (one in specification space and another in image space), optimize them via reinforcement learning, and achieve state-of-the-art results. Finally, our results and analyses suggest that Transformers are a better choice than LSTMs for unordered sequence prediction tasks.

Acknowledgments

We thank the reviewers for their helpful comments. This work was partially supported by NSF-CAREER Award 1846185, ARO-YIP Award W911NF-18-1-0336, and a Microsoft PhD Fellowship. The views contained in this article are those of the authors and not of the funding agency.

References
Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NeurIPS'15, 1171-1179. Cambridge, MA, USA: MIT Press.
Bluche, T.; Louradour, J.; and Messina, R. O. 2016. Scan, attend and read: End-to-end handwritten paragraph recognition with MDLSTM attention. CoRR abs/1604.03286.
Bunel, R.; Hausknecht, M.; Devlin, J.; Singh, R.; and Kohli, P. 2018. Leveraging grammar and reinforcement learning for neural program synthesis. In ICLR. OpenReview.net.
Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR abs/1504.00325.
Cliche, M.; Rosenberg, D.; Madeka, D.; and Yee, C. 2017. Scatteract: Automated extraction of data from scatter plots. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 135-150. Springer.
Daumé, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine Learning 75(3):297-325.
Deng, Y.; Kanervisto, A.; Ling, J.; and Rush, A. M. 2017. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, 980-989. JMLR.org.
Efron, B., and Tibshirani, R. J. 1994. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Boca Raton, Florida, USA: Chapman & Hall/CRC.
Ellis, K.; Ritchie, D.; Solar-Lezama, A.; and Tenenbaum, J. B. 2018. Learning to infer graphics programs from hand-drawn images. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NeurIPS'18, 6062-6071. Red Hook, NY, USA: Curran Associates Inc.
Ganin, Y.; Kulkarni, T.; Babuschkin, I.; Eslami, S. M. A.; and Vinyals, O. 2018. Synthesizing programs for images using reinforced adversarial learning. In Proceedings of the 35th International Conference on Machine Learning.
Ha, D., and Eck, D. 2018. A neural representation of sketch drawings. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.
Huang, H.; Kalogerakis, E.; Yumer, E.; and Mech, R. 2016. Shape synthesis from sketches via procedural models and convolutional networks. IEEE Transactions on Visualization and Computer Graphics 2.
Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128-3137.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Bengio, Y., and LeCun, Y., eds., 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Lin, T.; Maire, M.; Belongie, S. J.; Bourdev, L. D.; Girshick, R. B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. CoRR abs/1405.0312.
Liu, Y.; Wu, Z.; Ritchie, D.; Freeman, W. T.; Tenenbaum, J. B.; and Wu, J. 2019. Learning to describe scenes with programs. In ICLR.
Mishchenko, A., and Vassilieva, N. 2011. Chart image understanding and numerical data extraction. In 2011 Sixth International Conference on Digital Information Management, 115-120. IEEE.
Nishida, G.; Garcia-Dorado, I.; Aliaga, D. G.; Benes, B.; and Bousseau, A. 2016. Interactive sketching of urban procedural models. ACM Transactions on Graphics (TOG) 35(4):130.
Noreen, E. W. 1989. Computer-Intensive Methods for Testing Hypotheses. Wiley New York.
Pasunuru, R., and Bansal, M. 2018. Multi-reward reinforced summarization with saliency and entailment. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 646-653.
Paulus, R.; Xiong, C.; and Socher, R. 2018. A deep reinforced model for abstractive summarization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2016. Sequence level training with recurrent neural networks. In ICLR.
Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition, 1179-1195.
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556-2565.
Sun, S.-H.; Noh, H.; Somasundaram, S.; and Lim, J. 2018. Neural program synthesis from diverse demonstration videos. In International Conference on Machine Learning, 4797-4806.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS, 5998-6008.
Vinyals, O.; Bengio, S.; and Kudlur, M. 2015. Order matters: Sequence to sequence for sets. In ICLR.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229-256.
Wu, J.; Tenenbaum, J. B.; and Kohli, P. 2017. Neural scene de-rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 699-707.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2048-2057.
Zaremba, W., and Sutskever, I. 2015. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521.
Zhou, L.; Zhou, Y.; Corso, J. J.; Socher, R.; and Xiong, C. 2018. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8739-8748.
Zitnick, C. L., and Parikh, D. 2013. Bringing semantics into focus using visual abstraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3009-3016.