Dual Reinforcement-Based Specification Generation for Image De-Rendering
Ramakanth Pasunuru♣ David Rosenberg† Gideon Mann† Mohit Bansal♣
♣UNC Chapel Hill    †Bloomberg LP
{ram,mbansal}@cs.unc.edu
{drosenberg44,gmann16}@bloomberg.net
Abstract
Advances in deep learning have led to promising progress
in inferring graphics programs by de-rendering computer-
generated images. However, current methods do not explore
which decoding methods lead to better inductive bias for in-
ferring graphics programs. In our work, we first explore the
effectiveness of LSTM-RNN versus Transformer networks
as decoders for order-independent graphics programs. Since
these are sequence models, we must choose an ordering of
[...] inductive bias in the decoder via multiple diverse rewards based both on the graphics program specification and the rendered image. We also explore the combination of these complementary rewards. We achieve state-of-the-art results on two graphics program generation datasets.

Figure 1: Example images from the abstract scene dataset (left) and the Noisy Shapes dataset (right), along with portions of their specifications.
1 Introduction
The large majority of computer vision work deals in the domain of natural images or video. However, there is tremendous potential for applying computer vision techniques to computer-generated images, such as plots, charts, schematics, complicated math formulas, and even a page of printed text. For these domains, there is often a domain-specific language for precisely specifying the image, such as matplotlib code for a chart, PicTeX for a schematic¹, and LaTeX for math formulas and text. "De-rendering" a computer-generated image back to the original (or a different) domain-specific language specification can be a useful first step in many tasks, such as changing the visual appearance of an image (Huang et al. 2016; Wu, Tenenbaum, and Kohli 2017) or extracting information contained in an image (Cliche et al. 2017; Mishchenko and Vassilieva 2011).

The de-rendering problem is part of a larger class of "image-to-text" problems, in which an input image is mapped to some sequence of output tokens. The neural encoder-decoder approach has proved to be very successful for this class of problems, including image captioning (Karpathy and Fei-Fei 2015; Xu et al. 2015), handwriting recognition (Bluche, Louradour, and Messina 2016), as well as the de-rendering problem for math formulas (Deng et al. 2017) and graphics images (Ellis et al. 2018; Wu, Tenenbaum, and Kohli 2017). In this paper, we improve these encoder-decoder models for the specific case of graphical images, via methods based on Transformer models with both cross-entropy training and reinforcement learning with up to two "dual modality" reward functions. De-rendering graphical images is a problem that differs in several interesting ways from image captioning and OCR problems. Two examples of the de-rendering problem we consider are shown in Fig. 1. Each image is an input, and a portion of the desired output is displayed below each image. In de-rendering, every object in the image must be described in the specification, and typically many output tokens are required to describe each object. Thus outputs from de-rendering are typically much longer than those in image captioning datasets (Chen et al. 2015), since caption labels (e.g., in COCO (Lin et al. 2014)) tend to focus on simple descriptions involving only the most salient objects in the image. OCR and de-rendering are similar in that they encode information about all elements of the image, but the order of the output sequence in OCR is completely determined by the image, while in de-rendering, the output sequences represent sets², and as such the final rendering is invariant to a large degree of reordering in the output sequence (e.g., by shuffling the sub-sequences of tokens that correspond to separate objects).

We start our investigation with a basic image captioning model (similar to Wu, Tenenbaum, and Kohli (2017)) and extend it with an attention mechanism. We then swap out the LSTM-RNN decoder for a Transformer network (Vaswani et al. 2017). Our original motivation for this replacement was that output generation requires long-term dependencies to avoid representing the same object multiple times. As mentioned above, de-rendered output sequences can be quite long, and we expected the multi-head attention mechanism of the Transformer to handle long-range dependencies better than the LSTM-RNN. Unexpectedly, we found another advantage of Transformers over LSTM-RNNs for handling output sequences that can be reordered in many ways and still be correct. We expand on this in Sec. 7.1. To our knowledge, we are the first to use Transformer networks for de-rendering graphical images, and we find this change is a significant source of our performance improvement.

Another challenge with graphics de-rendering is that changing one or a few tokens in the specification can cause a significant change in many pixel values (e.g., by changing the location or color of a large object). Conversely, one can have two images that are very close visually, yet have completely different specifications. To this end, we explore error minimization in the image as well as the specification space via a dual-modality, two-way reward reinforcement learning approach (Williams 1992; Zaremba and Sutskever 2015). We train with non-differentiable reward functions that reflect performance measures of interest in both the image space and the specification space (the "dual modes"). We further explore training a single model using rewards from both modalities, with the hope that we get complementary feedback from each.

We empirically evaluate our methods on two image de-rendering datasets: the Noisy Shapes dataset (Ellis et al. 2018) and the Abstract Scene dataset (Zitnick and Parikh 2013; Wu, Tenenbaum, and Kohli 2017). Our Transformer models trained with cross-entropy loss achieve very significant improvements over previous work on these datasets. We show even more improvement when we train the Transformer models with policy gradient-based methods, first via single-modality rewards and then further via dual-modality joint rewards. Finally, in our analysis we find evidence that the performance of Transformers is relatively insensitive to the ordering of objects in the output sequence, while the performance of LSTM-RNNs can decay substantially for a poorly chosen object ordering. This suggests that the advantage of Transformers over LSTM-RNNs may be particularly strong in tasks where we are using an output sequence to represent an unordered set of objects.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹https://ctan.org/pkg/pictex?lang=en
²We say sets, rather than sequences, because in our datasets object ordering does not affect the rendering.

2 Related Work

De-rendering a computer-generated image to a domain-specific language provides an abstraction that is easy to change, store, compare, and match to other images. As a consequence, there has been recent interest and work in this area. Huang et al. (2016) used CNNs to translate a hand-drawn sketch of an object (e.g., jewellery) to a fixed set of parameters for a procedural model. In a similar vein, Nishida et al. (2016) proposed a simple procedural grammar as a building block to turn sketches into realistic 3D models. Ellis et al. (2018) proposed an automatic visual program induction model to infer programs from hand-drawn images, where the images are encoded via CNNs and a multi-layer perceptron predicts a distribution over drawing commands. Ha and Eck (2018) presented a recurrent neural network based sketch-rnn for conditional and unconditional sketch generation of common objects, constrained by a very simple set of primitives. Their model describes images as pen movements, either in a drawing mode or in a non-drawing mode. Unlike our approach, this program is highly sequence dependent and non-compositional: while there are different valid solutions obtainable by re-ordering, one cannot arbitrarily shuffle the sequence of pen movements. Liu et al. (2019) infer scene programs by exploiting hierarchical object-based scene representations. Sun et al. (2018) proposed a neural program synthesizer that generates underlying programs for behaviorally diverse demonstration videos. In this work, we use Transformer networks (Vaswani et al. 2017) for decoding the specification from the given input image. Transformers have been used in other generation tasks such as image and video captioning (Sharma et al. 2018; Zhou et al. 2018); however, we are the first to use Transformer networks for the image de-rendering problem. Vinyals, Bengio, and Kudlur (2015) show that an LSTM trained with shuffled (unordered) targets using cross-entropy has a substantial drop in performance compared to natural orderings. Our results support their findings, and moreover we find that Transformers, by contrast, are relatively insensitive to the ordering of the objects.
Recently, policy gradient-based reinforcement learning (RL) methods have been widely used for sequence generation tasks: machine translation (Ranzato et al. 2016), image captioning (Ranzato et al. 2016; Rennie et al. 2017), and textual summarization (Paulus, Xiong, and Socher 2018; Pasunuru and Bansal 2018). Daumé, Langford, and Marcu (2009) proposed to improve sequence generation by allowing a model to use its own predictions at training time, extending their work in structured prediction. In the context of program synthesis, Bunel et al. (2018) used RL for generating semantically correct programs. In the context of image de-rendering, Wu, Tenenbaum, and Kohli (2017) proposed a neural scene de-rendering model (NSD) with a neural encoder and a graphics engine as a decoder. The encoder has an object proposal generator that produces segment proposals, and then it tries to interpret objects and their properties from these segments. They use RL to better sample the proposals and use the rendered-image reconstruction error as reward. Recently, Ganin et al. (2018) introduced an adversarially trained agent that learns, via reinforcement learning and without any supervision, to generate a program that is executed by a graphics engine to interpret and sample images. In contrast, our work presents two complementary rewards (one in image space and another in specification space) in a reinforcement learning setup for the image de-rendering problem.

3 Models

Task. For each task we consider, there is a simple graphics specification language that can be used to specify a particular image. While differing in details, the overall scheme of the specifications is the same for each. A specification consists of a set of "objects", and each object is specified by a set of properties. Examples of an object specification for each of our tasks can be seen in Fig. 1. Given an image rendered from a specification, our task is to "de-render" this image back to the original specification. We can evaluate a predicted specification by looking for exact matches between the objects in the predicted specification and the objects in the original specification. We can summarize the object matches with standard measures, such as precision, recall, F1, and intersection-over-union. While these measures describe performance on a single image, we can average them across a collection of images to get an overall performance measure for a method. We provide more details in Sec. 5.2. Another approach to evaluation is to generate the image corresponding to a predicted specification and see how well it matches the original image, using some reasonable metric on the space of images.

Reduction to sequence prediction. While each specification is represented by a set of objects with specific properties, our models require sequences of tokens. We convert the set of objects to a sequence of tokens via some ordering of the objects. We investigate various approaches to ordering (Sec. 7.1) and find that ordering by object type works best. Once the model predicts a sequence of tokens, we can parse it back into the original structure to compute performance measures and our reward functions for reinforcement learning.
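To make the reduction concrete, the sketch below shows one possible serialization of an object set into a token sequence ordered by object type, together with the inverse parse used to compute metrics and rewards. The Obj structure, property layout, and special tokens are illustrative assumptions, not the datasets' actual vocabularies.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Obj:
    # Hypothetical object specification: a type name plus a property dictionary.
    obj_type: str
    props: dict

def spec_to_tokens(objects: List[Obj]) -> List[str]:
    """Serialize an (unordered) set of objects into a token sequence,
    ordering objects by type, which worked best in our experiments."""
    tokens = []
    for obj in sorted(objects, key=lambda o: o.obj_type):
        tokens.append(obj.obj_type)
        for name in sorted(obj.props):
            tokens.append(f"{name}={obj.props[name]}")
    tokens.append("<eos>")
    return tokens

def tokens_to_spec(tokens: List[str]) -> List[Obj]:
    """Parse a predicted token sequence back into a set of objects,
    so that rewards and evaluation metrics can be computed on objects."""
    objects, current = [], None
    for tok in tokens:
        if tok == "<eos>":
            break
        if "=" not in tok:                  # a new object starts with its type token
            current = Obj(tok, {})
            objects.append(current)
        elif current is not None:
            name, value = tok.split("=", 1)
            current.props[name] = value
    return objects
```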
3.1 Image-to-LSTM Sequence Model

Our baseline model is similar to an image captioning model with an attention mechanism (Xu et al. 2015). We use the ResNet-18 architecture (He et al. 2016) for encoding the input image, and we use an LSTM-RNN for predicting the corresponding specification as a sequence of tokens.

We denote the convolutional features from the ResNet-18 as {f_i}_{i=1}^{m}, where f_i ∈ R^d. For any decoder output token o, let E_o ∈ R^{d'} denote its embedding, which is learned during training. Let s_t be the decoder state at step t, o_t the output token at step t, and c_t the image context vector at step t, which is defined below. Then at step t, the decoder state s_t is given by

s_t = F(c_t, s_{t-1}, E_{o_{t-1}}),    (1)

where F is a trainable non-linear function. The context vector c_t is a convex combination of the image features, c_t = Σ_{i=1}^{m} α_{t,i} f_i, where the α_{t,i} are "attention weights" defined as

α_{t,i} = exp(e_{t,i}) / Σ_{k=1}^{m} exp(e_{t,k}),    (2)
e_{t,i} = v^T tanh(W f_i + U s_{t-1} + b),    (3)

where v, W, U, and b are trainable parameters.
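A minimal PyTorch sketch of one decoder step implementing Eqs. (1)-(3) is given below; the module names, tensor shapes, and the use of an LSTM cell for F are our assumptions about one reasonable realization, not the original implementation.

```python
import torch
import torch.nn as nn

class AttnLSTMDecoderStep(nn.Module):
    """One step of the LSTM decoder with additive attention (Eqs. 1-3)."""
    def __init__(self, d_feat, d_embed, d_hidden):
        super().__init__()
        self.W = nn.Linear(d_feat, d_hidden, bias=False)     # W f_i
        self.U = nn.Linear(d_hidden, d_hidden, bias=True)    # U s_{t-1} + b
        self.v = nn.Linear(d_hidden, 1, bias=False)          # v^T tanh(.)
        self.cell = nn.LSTMCell(d_feat + d_embed, d_hidden)  # F(c_t, s_{t-1}, E_{o_{t-1}})

    def forward(self, feats, prev_embed, prev_state):
        # feats: (batch, m, d_feat); prev_embed: (batch, d_embed)
        h_prev, c_prev = prev_state
        e = self.v(torch.tanh(self.W(feats) + self.U(h_prev).unsqueeze(1)))  # (batch, m, 1)
        alpha = torch.softmax(e, dim=1)                 # attention weights, Eq. (2)
        context = (alpha * feats).sum(dim=1)            # c_t, convex combination of features
        h, c = self.cell(torch.cat([context, prev_embed], dim=-1), (h_prev, c_prev))
        return h, (h, c), alpha.squeeze(-1)
```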
Figure 2: Our Image-Transformer model.

3.2 Image-to-Transformer Sequence Model

Recently, there has been an increasing amount of interest in Transformer networks (Vaswani et al. 2017), which are said to train faster and to better capture long-term dependencies than LSTM-based RNN models. In our specification prediction problem, the length of the specification can be large, and we need long-term dependencies to avoid generating objects that have already been generated. This suggests Transformer networks would be a better fit for our scenario. In this work, we only use the decoder part of the Transformer network (Vaswani et al. 2017). The Transformer encoder is designed for sequence inputs, and we replace it with the ResNet-18 CNN described above. We give a high-level description of the Transformer decoder below and refer to Vaswani et al. (2017) for full details.

The decoder of the Transformer has a stack of N identical layers containing self-attention modules, normalization modules, and feed-forward modules, along with a positional encoding module for the output embeddings (see Fig. 2). While the original model in Vaswani et al. (2017) used N=6, through hyperparameter tuning we found N=4 to work better for our problem. Otherwise, we used the hyperparameter settings of Vaswani et al. (2017). The decoder has two attention modules: one for attending to the image convolution features and a self-attention module for attending to previous positions in the decoder state.

Attention in Transformer. As shown in Fig. 2, we have two attention mechanisms in the model: one attending to the CNN features and another attending to different parts of the decoder state. They share the same structure, which we describe below.

An attention mechanism in the Transformer can be viewed as a mapping from a query (Q) and key-value (K, V) pairs to an output. Attention weights are computed from the query and the keys, and those weights are used with the values to compute the output of the attention module. Empirically, rather than performing a single attention function, it works better to linearly project the queries, keys, and values with different learned projection layers, perform the attention function in parallel, and concatenate the outputs to form the final attention module output. This is the multi-head attention mechanism (MH), defined as follows:

MH(Q, K, V) = Concat(head_1, ..., head_h) W^O    (4)
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (5)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (6)

where d_k is the dimension of the queries and keys, and W_i^Q, W_i^K, and W_i^V are the projection matrices.
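Below is a sketch of Eqs. (4)-(6) in PyTorch (masking for the self-attention case is omitted for brevity); in practice a library module such as torch.nn.MultiheadAttention computes the same quantity.

```python
import math
import torch
import torch.nn as nn

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (6). Q, K, V: (batch, heads, len, d_k)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (batch, heads, len_q, len_k)
    return torch.softmax(scores, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    """Multi-head attention, Eqs. (4)-(5): project, attend per head, concatenate."""
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.proj_q = nn.Linear(d_model, d_model)
        self.proj_k = nn.Linear(d_model, d_model)
        self.proj_v = nn.Linear(d_model, d_model)
        self.proj_o = nn.Linear(d_model, d_model)          # W^O

    def split(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d_k).transpose(1, 2)   # (batch, h, n, d_k)

    def forward(self, query, key, value):
        heads = attention(self.split(self.proj_q(query)),
                          self.split(self.proj_k(key)),
                          self.split(self.proj_v(value)))       # (batch, h, n, d_k)
        b, _, n, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(b, n, self.h * self.d_k)
        return self.proj_o(concat)
```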
Position-wise Feed-Forward Networks. In addition to the attention sub-layers, each of the layers in the Transformer decoder contains a fully connected feed-forward network that is applied to each position of the decoder separately and identically. This network is defined as

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2,    (7)

where W_1, W_2, b_1, and b_2 are linear projection parameters that are shared across positions but differ from layer to layer.
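A sketch of Eq. (7); the inner dimension d_ff is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Position-wise feed-forward network, Eq. (7), applied identically at every position."""
    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)    # x W_1 + b_1
        self.w2 = nn.Linear(d_ff, d_model)    # (.) W_2 + b_2

    def forward(self, x):                     # x: (batch, len, d_model)
        return self.w2(torch.relu(self.w1(x)))
```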
make a direct comparison between the predicted specifica-
Positional Encoding. In the model described thus far, the tion and the ground truth specification, and one of the re-
model is symmetric with respect to sequence position. For wards is based in “image space”, which compares the image
example, at the bottom right of Fig. 2, the model has no rendered from the predicted specification with original input
structural way to determine which output embeddings come image. We also investigate using these rewards in combina-
from which part of the output sequence. To remedy this is- tion, with the hope that there is complementary information
sue, we concatenate a “positional encoding” (PE) to the em- in the feedback based on the two spaces.
bedding representation of the tokens. We use the sine and Intersection-Over-Union Reward (IOU) As mentioned
cosine functions for positional encoding: in Sec. 3, after the specification is predicted as a sequence
PE(pos, 2i) = sin(pos/100002i/dmodel ) of tokens, we can parse the sequence into a set of object
(8) specifications. The intersection-over-union (IOU) reward is
PE(pos, 2i + 1) = cos(pos/100002i/dmodel ) based in specification space. Roughly speaking, the IOU re-
where pos is the position, i is the dimension, and dmodel is ward gives credit for predicting objects that exactly match
the dimension of the embedding vector representation. objects in the ground truth specification, and penalizes both
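Eq. (8) can be computed as below (assuming an even d_model); the commented lines at the end indicate the concatenation with the output embeddings described above.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding, Eq. (8): PE[pos, 2i] and PE[pos, 2i+1]."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# The encoding is combined with the (shifted) output token embeddings before the
# first decoder layer; here we concatenate along the feature dimension, e.g.:
#   decoder_input = np.concatenate(
#       [embeddings, positional_encoding(seq_len, d_model)], axis=-1)
```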
4 Dual-Modality Two-Way Reinforcement Learning

Traditionally, sequence generation models are trained using a cross-entropy loss. More recently, a policy gradient-based reinforcement learning approach has been explored for sequence generation tasks (Ranzato et al. 2016; Rennie et al. 2017), which has two advantages over cross-entropy optimization: (1) it avoids the exposure bias issue, which arises from the mismatch between train-time and test-time decoding in cross-entropy training (Bengio et al. 2015; Ranzato et al. 2016); and (2) it allows direct optimization of the evaluation metric of interest, even if it is not differentiable. To this end, we use a policy gradient-based approach via rewards in both the specification space and the image space. We also explore joint rewards based on these two spaces, to better capture feedback that is complementary between the two modalities.

For this reward optimization, we use the REINFORCE algorithm (Williams 1992; Zaremba and Sutskever 2015) to learn a policy p_θ that produces a distribution over sequences o^s for any given input. We seek a policy p_θ that maximizes the expected reward of a label sequence o^s drawn according to the predicted distribution. Equivalently, we minimize the following loss function, averaged across all training inputs:

L_RL = -E_{o^s ~ p_θ}[r(o^s)],    (9)

where o^s is the sequence of sampled tokens, with o^s_t sampled at time step t of the decoder. We can approximate the gradient of this loss function with respect to the parameters θ using a single sample o^s drawn from p_θ as:

∇_θ L_RL = -(r(o^s) - b_e) ∇_θ log p_θ(o^s),    (10)

where the baseline estimator b_e is included for variance reduction (Zaremba and Sutskever 2015). There are several ways to calculate the baseline estimator; we employ the effective SCST approach (Rennie et al. 2017).
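The following sketch shows the resulting training objective with a self-critical baseline, where the reward of the greedy decode plays the role of b_e; the decoder interface (sample, greedy) is a hypothetical one, not the actual code.

```python
import torch

def scst_loss(decoder, image_feats, reward_fn):
    """Self-critical REINFORCE surrogate loss (Eqs. 9-10).

    Assumed interface:
      decoder.sample(feats) -> (sampled token ids, sum of their log-probs, with grad)
      decoder.greedy(feats) -> greedily decoded token ids (no gradient needed)
      reward_fn(tokens)     -> scalar reward, e.g. the IOU or image-distance reward
    """
    sample_tokens, sample_logprob = decoder.sample(image_feats)
    with torch.no_grad():
        baseline_tokens = decoder.greedy(image_feats)
    r_sample = reward_fn(sample_tokens)
    r_baseline = reward_fn(baseline_tokens)        # baseline b_e: reward of the greedy decode
    # Backpropagating through this surrogate yields the gradient in Eq. (10):
    # -(r(o^s) - b_e) * d/dtheta log p_theta(o^s)
    return -(r_sample - r_baseline) * sample_logprob
```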
4.1 Rewards

In this work we consider three different reward functions. Two of the rewards are based in "specification space" and make a direct comparison between the predicted specification and the ground-truth specification, and one reward is based in "image space" and compares the image rendered from the predicted specification with the original input image. We also investigate using these rewards in combination, with the hope that there is complementary information in the feedback based on the two spaces.

Figure 3: Example showing samples from our model on the abstract scene dataset and the corresponding rewards in specification and image space. For simplicity, not all object properties are shown in specification space.

Intersection-Over-Union Reward (IOU). As mentioned in Sec. 3, after the specification is predicted as a sequence of tokens, we can parse the sequence into a set of object specifications. The intersection-over-union (IOU) reward is based in specification space. Roughly speaking, the IOU reward gives credit for predicting objects that exactly match objects in the ground-truth specification, and it penalizes both predicting objects that do not match ground-truth objects and failing to predict objects that are part of the ground truth. More formally, let {o_i}_{i=1}^{m} and {o*_j}_{j=1}^{n} represent the objects in the predicted and ground-truth specifications, respectively. Then the IOU reward is defined as:

r_iou = count({o_i}_{i=1}^{m} ∩ {o*_j}_{j=1}^{n}) / count({o_i}_{i=1}^{m} ∪ {o*_j}_{j=1}^{n})    (11)

An object o_i in the predicted specification is the same as an object o*_j in the ground-truth specification if and only if all the properties of the two objects match exactly.
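Eq. (11) translates directly into code if each object is represented as a hashable tuple of all its properties, so that a "match" is exact equality; duplicates are treated as a multiset.

```python
from collections import Counter

def iou_reward(predicted, ground_truth):
    """IOU reward, Eq. (11): exact object matches over the union of both object sets.

    `predicted` and `ground_truth` are lists of hashable object specifications
    (e.g. tuples of all properties); duplicate objects are handled as multisets.
    """
    pred, gold = Counter(predicted), Counter(ground_truth)
    intersection = sum((pred & gold).values())
    union = sum((pred | gold).values())
    return intersection / union if union > 0 else 0.0
```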
Inference Reward. Our second reward, which we call the "inference reward", is also a reward in specification space. The name is based on the "inference error", a performance measure introduced in Wu, Tenenbaum, and Kohli (2017) for the Abstract Scenes dataset. While IOU is based on exact matches between predicted objects and ground-truth objects, the inference error and inference reward are based on the number of properties (within objects) that correctly match the corresponding properties in the ground truth. For those properties specifying location in pixel coordinates, we follow Wu, Tenenbaum, and Kohli (2017) and divide the space of each coordinate into 20 bins of equal size, and we consider it a match if the predicted and ground-truth locations are in the same bin. We define the inference error as the fraction of predicted properties that fail to match the corresponding ground-truth properties. The inference reward is one minus the inference error.
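One possible reading of this reward as code is sketched below: properties are compared one by one, with pixel-coordinate properties quantized into 20 equal bins first. The position-wise pairing of predicted and ground-truth objects and the property layout are simplifying assumptions.

```python
def quantize(value, max_value, n_bins=20):
    """Map a coordinate in [0, max_value) to one of n_bins equal-size bins."""
    return min(int(value * n_bins / max_value), n_bins - 1)

def inference_reward(pred_objects, gold_objects, coord_props, max_values):
    """Inference reward = 1 - inference error (fraction of mismatched properties).

    Objects are dicts mapping property name -> value; `coord_props` lists the
    pixel-coordinate properties compared after 20-bin quantization, with their
    ranges in `max_values`. Objects are paired position-wise here for simplicity.
    """
    total, mismatched = 0, 0
    for pred, gold in zip(pred_objects, gold_objects):
        for name, gold_value in gold.items():
            total += 1
            pred_value = pred.get(name)
            if name in coord_props and pred_value is not None:
                match = (quantize(pred_value, max_values[name])
                         == quantize(gold_value, max_values[name]))
            else:
                match = pred_value == gold_value
            if not match:
                mismatched += 1
    return 1.0 - (mismatched / total if total else 0.0)
```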
Image Distance Reward. Our third and final reward, the "image distance reward", is in image space. We define it generically first, as it takes slightly different forms on our two datasets. If we let I and I^R represent vectorized versions of the input image and the image rendered from the predicted specification, respectively, then we define the image distance as

d_img = ||I ⊖ Ψ(I^R)||_2^2,    (12)

where ||·||_2 is the ℓ2-norm, Ψ is a dataset-specific transform of the rendered image, and ⊖ is a dataset-specific elementwise comparison.

For the Noisy Shapes dataset, we follow Ellis et al. (2018) and take Ψ to be a Gaussian blurring function, as the objects in the target image have noise (see Fig. 1). We take ⊖ to be simple subtraction. The image reward for this dataset is:

r_img = c / d_img    (13)

where c is a tunable parameter.

For the Abstract Scene dataset, ⊖ is a logical operator that takes the value 0 in every position where the pixel values "match", and 1 in every other position. The range of possible pixel values is 0-255 and, similarly to the discretization of positions in the inference reward, we divide the pixel value range into 20 equal-size buckets and consider pixel values to match if they are in the same bucket. We take Ψ to be the identity function. The image reward for the Abstract Scene dataset is then defined as:

r_img = 1 - d_img / (w·h)    (14)

where w and h are the width and height of the image.
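The two instantiations of Eqs. (12)-(14) might look as follows; the blur width, the scale c, and the assumption of 2-D grayscale pixel arrays are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def noisy_shapes_reward(image, rendered, c=1.0, sigma=2.0):
    """Eqs. (12)-(13): Psi = Gaussian blur, elementwise subtraction, r = c / d_img."""
    diff = image.astype(float) - gaussian_filter(rendered.astype(float), sigma=sigma)
    d_img = float(np.sum(diff ** 2))
    return c / d_img if d_img > 0 else c       # guard against a perfect reconstruction

def abstract_scene_reward(image, rendered, n_buckets=20):
    """Eqs. (12), (14): pixels 'match' if they fall in the same of 20 value buckets;
    d_img counts the mismatches and the reward is 1 - d_img / (w * h)."""
    bucket = lambda img: np.clip(img.astype(int) * n_buckets // 256, 0, n_buckets - 1)
    mismatch = (bucket(image) != bucket(rendered)).astype(float)
    return 1.0 - float(mismatch.sum()) / mismatch.size
```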
Joint Dual-Modality Reward. Since we expect the rewards based in specification space to be complementary to the reward based in image space, we want a way to combine rewards across the two spaces. One way to combine two rewards is to create a weighted combination of the individual rewards to formulate a joint reward. Another approach is to alternate the reward used during the learning process (Pasunuru and Bansal 2018). In this work, we follow the latter approach, as the former requires expensive tuning for scale and weight balancing. Let r_1 and r_2 be the two reward functions that we want to optimize. In our approach, we first take a_1 optimization steps to minimize the reinforcement learning loss L_RL1(r_1; θ) (i.e., we use a_1 mini-batches). Then we take a_2 optimization steps to minimize the reinforcement learning loss L_RL2(r_2; θ). We repeat this cycle of steps until convergence. All other optimization parameters, such as step size, remain the same for each set of steps. The values a_1 and a_2 are tuning parameters.³ The two rewards r_1 and r_2 could be based on different aspects of the output, such as the IOU and image distance rewards.

³Pasunuru and Bansal (2018) set a_1 and a_2 to 1, without tuning.
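A sketch of this alternating schedule, reusing the hypothetical scst_loss helper from the Sec. 4 sketch; a_1:a_2 is the mixing ratio reported in Sec. 5.3.

```python
import itertools

def train_joint_reward(decoder, optimizer, data_loader, reward_1, reward_2, a1=1, a2=4):
    """Alternate between a1 mini-batches of L_RL(r1) and a2 mini-batches of L_RL(r2),
    over one pass of data_loader, which is assumed to yield (image_feats, target) pairs."""
    schedule = itertools.cycle([reward_1] * a1 + [reward_2] * a2)
    for (image_feats, _), reward_fn in zip(data_loader, schedule):
        loss = scst_loss(decoder, image_feats, reward_fn)   # defined in the Sec. 4 sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```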
Model                         Precision  Recall  F1    IOU   IOU_1.0  IOU_0.8  IOU_0.6
CROSS-ENTROPY LOSS
Image2LSTM+atten.             98.7       98.5    98.6  97.6  90.7     95.3     98.8
Image2Transformer             99.1       99.1    99.1  98.5  94.1     97.3     99.1
IMAGE2TRANSFORMER WITH REINFORCE LOSS
IOU Reward                    99.4       99.3    99.3  98.8  95.0     98.0     99.4
Image-distance Reward         99.4       99.2    99.3  98.8  94.5     98.0     99.5
Image-distance + IOU Reward   99.4       99.3    99.3  98.8  95.0     98.1     99.4

Table 1: Performance of various models on the Noisy Shapes dataset.
5 Experimental Setup

5.1 Datasets

Noisy Shapes Dataset. Ellis et al. (2018) provide a synthetic dataset of images containing multiple simple objects (lines, circles, and rectangles), each with various properties that can be specified. The images are specified using a small subset of LaTeX drawing commands. Additional noise is introduced into the rendered images by rescaling image intensity, translating the image by a few pixels, rendering the LaTeX using the pencildraw style, and randomly perturbing the positions and sizes of the LaTeX drawing commands. The dataset was created by randomly sampling image specifications with between 1 and 12 objects, excluding any specifications that lead to images with overlapping objects. The size of each image is 256x256. The dataset contains 100,000 images paired with specifications, from which we use 1,000 for testing and the rest for training.

Abstract Scene Dataset. The Abstract Scene dataset (Zitnick and Parikh 2013) contains 10,020 images, each of which has 3-18 objects. There are over 100 types of objects, each of which is specified by two integers, one indicating a broad category (e.g., sky object, animal, boy, girl) and another indicating a subcategory (e.g., girl pose, animal type, etc.). Each object can be drawn at one of 3 scales, with or without a horizontal flip, and at any pixel location in the 500x400 image. These properties are specified by 4 additional integers. Thus each object is specified by 6 integers. There are often heavy occlusions among these objects when rendered in an image (see the input image in Fig. 2). However, the objects are rendered in a deterministic order based on the object types and other properties, and thus the image is independent of the order of the objects in the specification. Similar to Wu, Tenenbaum, and Kohli (2017), we randomly sample 90% of the images for training and use the rest for testing.

5.2 Evaluation Metrics

Noisy Shapes Dataset. As described in the Task description of Sec. 3, we can summarize performance on a single image with precision, recall, F1, and IOU (intersection over union) at the object level. Following previous work (Ellis et al. 2018), we summarize the performance of a method by averaging these metrics across all test examples (i.e., a macro average). Further, we also report IOU_k, defined as the percent of test examples for which the IOU score is greater than or equal to k.
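A sketch of how these object-level metrics and their macro average (including IOU_k) can be computed, reusing the multiset matching from the IOU reward; the representation of objects as hashable tuples is an assumption.

```python
from collections import Counter

def object_metrics(predicted, ground_truth):
    """Object-level precision, recall, F1, and IOU for a single image."""
    pred, gold = Counter(predicted), Counter(ground_truth)
    matched = sum((pred & gold).values())
    precision = matched / max(len(predicted), 1)
    recall = matched / max(len(ground_truth), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    iou = matched / max(sum((pred | gold).values()), 1)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

def macro_average(per_image_metrics, ks=(1.0, 0.8, 0.6)):
    """Average each metric over all test images and report IOU_k coverage."""
    n = len(per_image_metrics)
    summary = {key: sum(m[key] for m in per_image_metrics) / n
               for key in per_image_metrics[0]}
    for k in ks:
        summary[f"iou_{k}"] = sum(m["iou"] >= k for m in per_image_metrics) / n
    return summary
```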
Abstract Scene Dataset. For the Abstract Scene dataset, following previous work (Wu, Tenenbaum, and Kohli 2017), we report the specification inference error and the image reconstruction error based on a micro average across all test examples. As described in Sec. 4.1, the inference error is based on the percentage of incorrectly inferred values (i.e., how many properties of objects do not match the ground truth) for the specification, and the image reconstruction error is based on the percentage of incorrect pixel predictions. During these evaluations, all continuous variables (pixel values, and x and y coordinates) are quantized into 20 bins. Additionally, we report the macro-average-based IOU metric described for the Noisy Shapes dataset.

5.3 Training Details

In all of our models, we encode the image information via ResNet-18 (He et al. 2016), taking the penultimate layer's features as the output of the image encoder. For the LSTM-RNN, we use a hidden state size of 128, an input token embedding size of 128, and a batch size of 64. For Transformer networks, we use the same hidden and embedding sizes, and we use 4 decoder layers. We use the Adam optimizer (Kingma and Ba 2015) with the default learning rate of 0.001 for all cross-entropy models, and a learning rate of 0.0001 for all reinforcement learning based models. For the Noisy Shapes dataset, the maximum decoder length is fixed to 80, and we use a vocabulary of size 27, whose tokens are placeholders for object properties. For the Abstract Scene dataset, the maximum decoder length is fixed to 100, and we use a vocabulary of size 1078, which represents all the object properties. For the joint reward optimization, we use a mixing ratio of 1:1 for the Noisy Shapes dataset and 1:4 for the Abstract Scene dataset.
6 Results

6.1 Results on the Noisy Shapes Dataset

We first compare the performance of the LSTM-RNN model (Image2LSTM+atten) to the Transformer-based model, when both are trained with cross-entropy loss. We see in Table 1 that the Transformer model dominates on all measures. In particular, we highlight IOU_1.0, which measures the percent of examples on which the predicted specification exactly matches the ground-truth specification. While the LSTM-RNN model achieves 90.7% IOU_1.0, the Transformer model achieves 94.1%, an impressive 36.5% reduction in the number of errors. We see similar performance improvements on the other metrics.

We now compare the Transformer model trained with reinforcement learning, using various reward functions, to training with cross-entropy loss. Table 1 shows that, although all three reward variations have roughly the same performance, they all show significant improvement over cross-entropy training, on all measures.⁴ For example, the model trained with the IOU reward achieves 95.0% IOU_1.0, an impressive 15.3% reduction in the number of errors compared to the same model trained with cross-entropy loss, and a 46.2% reduction compared to the original LSTM-RNN model. The performance improvement on the other measures is at least as good.

⁴The improvement of our Transformer models trained with reinforcement learning over the corresponding cross-entropy models is statistically significant with p < 0.01, based on the bootstrap test (Noreen 1989; Efron and Tibshirani 1994).
Model                   Infer. Error  Recons. Error  Avg. Error  IOU
PREVIOUS WORK
CNN+LSTM (2017)         45.31         41.38          43.84       -
NSD (full) (2017)       42.74         21.55          32.14       -
CROSS-ENTROPY LOSS
Image2LSTM+atten.       17.27         15.70          16.48       32.06
Image2Transformer       8.78          10.92          9.85        58.54
IMAGE2TRANSFORMER WITH REINFORCE LOSS
IOU Reward              7.91          10.50          9.20        61.29
Inference Reward        7.81          10.75          9.28        59.35
Recons. Reward          8.34          9.99           9.16        62.44
Inference + Recons.     8.21          10.12          9.16        61.54
IOU + Recons.           8.05          10.04          9.04        62.45

Table 2: Model performance on the Abstract Scene dataset. Errors: lower is better; IOU: higher is better.

6.2 Results on the Abstract Scene Dataset

In Table 2, we see the performance of various models on the Abstract Scene dataset, for the metrics described in Sec. 5.2. We first note that even our baseline LSTM-RNN model (Image2LSTM+atten) shows a very large error reduction compared to the results presented in Wu, Tenenbaum, and Kohli (2017) (first 4 rows of the table). This highlights the importance of an attention mechanism in these tasks. For the models trained with cross-entropy, the Transformer model shows an additional remarkable improvement over the LSTM-RNN model, across all measures.

For reinforcement learning with the Transformer model, we tried three different reward functions, corresponding to three of our performance metrics: inference error, reconstruction error, and IOU. All the Transformer models trained with REINFORCE outperformed the model trained with cross-entropy loss on each of the error measures.⁵ For inference error, the model trained with the inference reward did the best, as one might hope and expect: compared to the cross-entropy-trained Transformer, the inference error was reduced by 11.0%. For reconstruction error (image-based), the best performing model was the one trained with the reconstruction reward, which reduced the reconstruction error by 8.5% compared to the cross-entropy-trained version. When evaluating performance using the average of the inference and reconstruction errors, one of our joint-reward models performed best, though interestingly, not the one that uses the corresponding inference and reconstruction rewards. The best performing model for this measure used the IOU and reconstruction rewards, suggesting that the IOU reward carries more information complementary to the reconstruction error than the inference reward does. For the IOU measure, the model trained with the IOU reward did well, but when trained jointly with the IOU and reconstruction rewards, it performed the best. This suggests that using image-based feedback during training (reconstruction error) can be beneficial even when the ultimate goal (IOU) depends only on the specification output.

⁵For the IOU and inference reward models, this improvement is statistically significant for all metrics except reconstruction error. For the reconstruction reward model, the improvement is significant for all but the inference error metric. For the dual (IOU+Recons.) reward model, the difference is significant for all metrics (p < 0.01 for each test).

7 Analysis

7.1 LSTM vs. Transformer Networks

As noted above, for the Abstract Scene and Noisy Shapes datasets that we consider, the order of the objects in the specification does not affect the final image. Nevertheless, for training both the LSTM-RNN and the Transformer models, one must choose an ordering. We ran an experiment on the Noisy Shapes dataset in which we tried ordering the objects by shape size, by shape type, and by shape position in the rendered image. We found that ordering by shape type worked best across our models, so that is what we used for our main results in Table 1. We also wanted to investigate how important it is to have the objects in some sensible order, compared to a random ordering. Table 3 shows the results of our two models when trained with cross-entropy on specification sequences where the objects are put in random order. We find that the LSTM-RNN model's performance drops dramatically (e.g., IOU_1.0 drops from 90.7% to 72.0%), while the drop with Transformer networks is quite small (e.g., IOU_1.0 drops from 94.1% to 93.2%).⁶ This is additional evidence for Transformers being the preferred model for tasks of this type.

⁶Note that the number of model parameters is approximately the same (11.8M for the Transformer model and 11.5M for the LSTM model). Further, Transformer models are 2.5x faster to train in comparison to the LSTM models. During inference, both models take approximately the same time.
Model                F1    IOU   IOU_1.0
Image2LSTM+atten.    95.2  92.0  72.0
Image2Transformer    99.0  98.3  93.2

Table 3: Performance of LSTM-RNN and Transformer networks on the Noisy Shapes dataset when specifications have randomly ordered objects.

7.2 Performance vs. Data Size
We conduct an experiment in which we vary the percentage of Noisy Shapes data used during our models' training from 10% to 100%, in steps of 20%. We observe that with less data (10%-40%), the RL-based model is approximately 2 points better (on the IOU_1.0 metric) than its corresponding cross-entropy baseline. As we use more data (>60%), the gap between the RL and cross-entropy models decreases to 1 point. This suggests that RL, which has the advantage of exploration, is more powerful when data is limited.

Figure 4: Comparing the ground-truth images from the Noisy Shapes and Abstract Scene datasets with the rendered images of predicted specifications (columns: Ground-truth Images, Transformer Baseline Rendered Images, Transformer RL Rendered Images).

7.3 Output Examples

Fig. 4 presents the rendered images of the specifications predicted by the Image2Transformer cross-entropy model and the corresponding RL-based model, with IOU+Image-distance as the reward for the Noisy Shapes dataset and IOU+Recons. as the reward for the Abstract Scene dataset. In the first example (top row in Fig. 4), the cross-entropy model predicts an extra 'line shape' which is not present in the ground truth. Our RL model correctly predicts exactly the shapes present in the ground truth; however, neither model gets the type of 'line shape' correct in a couple of instances. In the second example (second row in Fig. 4), the cross-entropy model predicts an extra object (glasses), which is not present in the ground-truth image, and also misses the cap on the snake. The RL model improves on the cross-entropy model by not adding any extra objects, but it also misses the cap. In the third example, both rendered images look very similar to the ground truth, but the cross-entropy model predicts one of the objects (glasses) slightly off in position. Our RL model was able to accurately position the glasses (bottom row in Fig. 4). The better performance of the RL model may be due to the image-space component of the error signal, which is more sensitive to position errors, while the cross-entropy loss gives the same penalty to all incorrect positions regardless of the error size.

8 Conclusion

We present various neural de-rendering models based on LSTMs with an attention mechanism and on Transformer networks. Further, we introduce complementary dual rewards (one in specification space and another in image space), optimize them via reinforcement learning, and achieve state-of-the-art results. Finally, our results and analyses suggest that Transformers are a better choice than LSTMs for unordered sequence prediction tasks.

Acknowledgments

We thank the reviewers for their helpful comments. This work was partially supported by NSF-CAREER Award 1846185, ARO-YIP Award W911NF-18-1-0336, and a Microsoft PhD Fellowship. The views contained in this article are those of the authors and not of the funding agency.

References
Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NeurIPS'15, 1171-1179. Cambridge, MA, USA: MIT Press.
Bluche, T.; Louradour, J.; and Messina, R. O. 2016. Scan, attend and read: End-to-end handwritten paragraph recognition with MDLSTM attention. CoRR abs/1604.03286.
Bunel, R.; Hausknecht, M.; Devlin, J.; Singh, R.; and Kohli, P. 2018. Leveraging grammar and reinforcement learning for neural program synthesis. In ICLR. OpenReview.net.
Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR abs/1504.00325.
Cliche, M.; Rosenberg, D.; Madeka, D.; and Yee, C. 2017. Scatteract: Automated extraction of data from scatter plots. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 135-150. Springer.
Daumé, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine Learning 75(3):297-325.
Deng, Y.; Kanervisto, A.; Ling, J.; and Rush, A. M. 2017. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, 980-989. JMLR.org.
Efron, B., and Tibshirani, R. J. 1994. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Boca Raton, Florida, USA: Chapman & Hall/CRC.
Ellis, K.; Ritchie, D.; Solar-Lezama, A.; and Tenenbaum, J. B. 2018. Learning to infer graphics programs from hand-drawn images. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NeurIPS'18, 6062-6071. Red Hook, NY, USA: Curran Associates Inc.
Ganin, Y.; Kulkarni, T.; Babuschkin, I.; Eslami, S. M. A.; and Vinyals, O. 2018. Synthesizing programs for images using reinforced adversarial learning. In Proceedings of the 35th International Conference on Machine Learning.
Ha, D., and Eck, D. 2018. A neural representation of sketch drawings. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.
Huang, H.; Kalogerakis, E.; Yumer, E.; and Mech, R. 2016. Shape synthesis from sketches via procedural models and convolutional networks. IEEE Transactions on Visualization and Computer Graphics 2.
Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128-3137.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Bengio, Y., and LeCun, Y., eds., 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Lin, T.; Maire, M.; Belongie, S. J.; Bourdev, L. D.; Girshick, R. B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. CoRR abs/1405.0312.
Liu, Y.; Wu, Z.; Ritchie, D.; Freeman, W. T.; Tenenbaum, J. B.; and Wu, J. 2019. Learning to describe scenes with programs. In ICLR.
Mishchenko, A., and Vassilieva, N. 2011. Chart image understanding and numerical data extraction. In 2011 Sixth International Conference on Digital Information Management, 115-120. IEEE.
Nishida, G.; Garcia-Dorado, I.; Aliaga, D. G.; Benes, B.; and Bousseau, A. 2016. Interactive sketching of urban procedural models. ACM Transactions on Graphics (TOG) 35(4):130.
Noreen, E. W. 1989. Computer-Intensive Methods for Testing Hypotheses. Wiley New York.
Pasunuru, R., and Bansal, M. 2018. Multi-reward reinforced summarization with saliency and entailment. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 646-653.
Paulus, R.; Xiong, C.; and Socher, R. 2018. A deep reinforced model for abstractive summarization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
Ranzato, M.; Chopra, S.; Auli, M.; and Zaremba, W. 2016. Sequence level training with recurrent neural networks. In ICLR.
Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition, 1179-1195.
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556-2565.
Sun, S.-H.; Noh, H.; Somasundaram, S.; and Lim, J. 2018. Neural program synthesis from diverse demonstration videos. In International Conference on Machine Learning, 4797-4806.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS, 5998-6008.
Vinyals, O.; Bengio, S.; and Kudlur, M. 2015. Order matters: Sequence to sequence for sets. In ICLR.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229-256.
Wu, J.; Tenenbaum, J. B.; and Kohli, P. 2017. Neural scene de-rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 699-707.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2048-2057.
Zaremba, W., and Sutskever, I. 2015. Reinforcement learning neural Turing machines. arXiv preprint arXiv:1505.00521.
Zhou, L.; Zhou, Y.; Corso, J. J.; Socher, R.; and Xiong, C. 2018. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8739-8748.
Zitnick, C. L., and Parikh, D. 2013. Bringing semantics into focus using visual abstraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3009-3016.