<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Conceptualization of a GAN for future frame prediction</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of KwaZulu-Natal</institution>
          ,
          <addr-line>Durban, RSA</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1925</year>
      </pub-date>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>The generation of future frames of a video involves the analysis of the previous t-i frames and the subsequent prediction of the following t+j frames. The majority of state-of-the-art models are able to accurately predict a single future frame that exhibits a high degree of photorealism. The effectiveness of these models at generating quality results decreases as the number of generated frames increases, due to the divergence of the solution space. The solution space becomes multimodal, and optimization of traditional loss functions, such as MSE loss, does not adequately model this multimodality, so the resultant frames are blurred. The conceptualization of a GAN that generates several plausible future frames with adequate motion representation and a high degree of photorealism is presented.</p>
      </abstract>
      <kwd-group>
        <kwd>GANs</kwd>
        <kwd>Transformation</kwd>
        <kwd>ConvLSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The prediction of future frames has several applications in autonomous
decision-making areas, including self-driving cars, social robots and video completion [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For
example, a SocialGAN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] determines plausible and socially acceptable walking
trajectories of people, thereby aiding navigation in human-centric environments. GANs
([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) have been a popular approach to training spatio-temporal models
for future frame prediction. The constituent components of a GAN are a generator and a
discriminator, engaged in a minimax game [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. GANs, however, are difficult to train
and are susceptible to mode collapse.
      </p>
      <p>
        In transformation space, the generator extracts
transformations between adjacent input frames. It subsequently predicts a future
transformation and applies it to the last frame of the input to generate the next frame and so
forth. The source of variability is thus modelled directly, eliminating the need to store
low-level details of the input. The resultant model requires fewer parameters, which
simplifies learning. Furthermore, the spatial data of the input is conserved.
      </p>
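      <p>
        The following toy sketch illustrates this transformation-space prediction loop. It is
not the architecture of any cited model: the translation-only transform extractor and the
linear-extrapolation predictor are illustrative placeholders for the learned components.
      </p>
      <preformat>
# Toy sketch of prediction in transformation space (illustrative names only).
import numpy as np
from scipy.ndimage import affine_transform

def estimate_transform(prev_frame, next_frame):
    """Placeholder transform extractor: a pure translation found by brute-force
    search over small shifts (a real model would learn this mapping)."""
    def err(shift):
        dy, dx = shift
        shifted = np.roll(np.roll(prev_frame, dy, axis=0), dx, axis=1)
        return np.mean((shifted - next_frame) ** 2)
    shifts = [(dy, dx) for dy in range(-2, 3) for dx in range(-2, 3)]
    return np.array(min(shifts, key=err), dtype=float)

def predict_next_transform(transforms):
    """Placeholder predictor: linearly extrapolate the most recent motion."""
    return transforms[-1] + (transforms[-1] - transforms[-2])

def apply_translation(frame, t):
    # affine_transform maps output coordinates to input coordinates, hence -t.
    return affine_transform(frame, np.eye(2), offset=-t, order=1, mode="nearest")

def rollout(context_frames, n_future):
    """Extract transforms between adjacent inputs, then repeatedly predict the
    next transform and apply it to the last frame to generate future frames."""
    transforms = [estimate_transform(a, b)
                  for a, b in zip(context_frames[:-1], context_frames[1:])]
    frame, outputs = context_frames[-1], []
    for _ in range(n_future):
        t = predict_next_transform(transforms)
        frame = apply_translation(frame, t)
        outputs.append(frame)
        transforms.append(t)
    return outputs

# Toy usage: a bright square translating one pixel per frame.
frames = [np.zeros((32, 32)) for _ in range(4)]
for i, f in enumerate(frames):
    f[8 + i:16 + i, 8 + i:16 + i] = 1.0
print(len(rollout(frames, n_future=3)))  # 3 predicted frames
      </preformat>
      <p>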
To model spatio-temporal relationships in video data, networks include either CNNs,
RNNs or both. The standard for sequential modelling tasks is RNNs, such as LSTMs,
due to their ability to represent long-term temporal dependencies. A CNN that exhibits a
similar efficacy is the Temporal Convolutional Network (TCN). A TCN in conjunction
with a dilated CNN to model temporal and spatial dependencies respectively was
implemented by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A similar approach was undertaken by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], with a PGGAN modelling
spatial dependencies instead. Another attempt at sequential modelling utilizing CNNs
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was an architecture in which a network was replicated through time. The resultant
model was a ‘peculiar RNN’ as parameters were now shared across time whilst still
convolving spatial data. A CNN-LSTM architecture was implemented by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to predict
future frames of synthetic video data. These aspects were later united by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] into a single
network, a convolutional LSTM (ConvLSTM). A stacked ConvLSTM, coupled with a
Spatial Transformer Network (STN) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], addressed the problem of future frame
prediction and determined the state of motion of a robot arm. The representation of motion is
improved by models that operate in transformation space ([
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). Such a model,
a CGAN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], was evaluated using a Two-Alternative Forced Choice (2AFC) test. The
generated video was preferred only 30.6% of the time over its ground-truth counterpart.
      </p>
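      <p>
        For reference, the ConvLSTM cell introduced by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can be sketched minimally as follows. This is an illustrative simplification in which
the peephole connections of the original formulation are omitted and all sizes are arbitrary;
it is not the exact cell used by any of the models cited above.
      </p>
      <preformat>
# Minimal ConvLSTM cell (simplified sketch; peephole terms omitted).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Replaces the matrix multiplications of a standard LSTM with convolutions,
    so the hidden and cell states retain their spatial layout."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # convolutional cell-state update
        h = o * torch.tanh(c)           # spatially structured hidden state
        return h, c

# Toy usage: run a short frame sequence through a single cell.
cell = ConvLSTMCell(in_channels=1, hidden_channels=8)
h = torch.zeros(1, 8, 32, 32)
c = torch.zeros(1, 8, 32, 32)
for _ in range(4):
    h, c = cell(torch.rand(1, 1, 32, 32), (h, c))
print(h.shape)  # torch.Size([1, 8, 32, 32])
      </preformat>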
    </sec>
    <sec id="sec-2">
      <title>Proposed Model</title>
      <p>
        In a bid to address the issues of motion representation, photorealism and plausibility of
generated frames, this research proposes the implementation of a CGAN. The
discriminator of the CGAN receives the context frames paired with either ground-truth
future frames or generated future frames, and is deceived only by sequences of frames
that exhibit plausibility. A mini-batch standard deviation layer is added to one of the
last layers of the Progressively Growing Network (PGN) discriminator, aiding in the
prevention of mode collapse. The generator comprises 7 stacked ConvLSTMs,
similar to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and preserves spatial data whilst modelling the complex dynamics of the
data. Hidden Layer5 parameterizes a modified STN and the output of ConvLSTM5 is
a predicted affine transformation matrix for each separate ‘good feature’ in the frame.
The STN is modified to operate on points determined by the Shi-Tomasi corner detection
algorithm, for which transformations are then predicted. The model also predicts a
compositing mask over each transformation. The generated frame is reconstructed by applying
predicted affine transformations, merged by masking, to the last input frame.
      </p>
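      <p>
        A minimal sketch of the mini-batch standard deviation layer described above follows,
assuming the simplest single-statistic variant; its exact placement within the PGN
discriminator and the conditioning on context frames are not prescribed here.
      </p>
      <preformat>
# Sketch of a mini-batch standard deviation layer (simplified, single statistic).
import torch
import torch.nn as nn

class MinibatchStdDev(nn.Module):
    """Appends one feature map holding the average over-batch standard deviation,
    giving the discriminator a direct signal about sample diversity and thereby
    helping to discourage mode collapse."""
    def forward(self, x):                     # x: (N, C, H, W)
        std = x.std(dim=0, unbiased=False)    # per-feature std over the batch
        stat = std.mean().view(1, 1, 1, 1).expand(x.size(0), 1, x.size(2), x.size(3))
        return torch.cat([x, stat], dim=1)    # (N, C + 1, H, W)

# Toy usage.
print(MinibatchStdDev()(torch.rand(8, 16, 4, 4)).shape)  # torch.Size([8, 17, 4, 4])
      </preformat>
      <p>
        The reconstruction step of the generator, in the spirit of the masked-transformation
approach of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], can likewise be sketched as follows. The affine transformation matrices and
compositing-mask logits that the ConvLSTM stack and the Shi-Tomasi feature selection would
produce are replaced by random placeholders; only the warp-and-composite operation is shown.
      </p>
      <preformat>
# Sketch of frame reconstruction: warp the last input frame with K predicted
# affine transforms (STN-style) and blend the warps with compositing masks.
import torch
import torch.nn.functional as F

def compose_frame(last_frame, thetas, mask_logits):
    """last_frame: (N, C, H, W); thetas: (N, K, 2, 3); mask_logits: (N, K, H, W)."""
    n, c, h, w = last_frame.shape
    warped = []
    for j in range(thetas.size(1)):
        # Spatial-transformer style warp for the j-th predicted transformation.
        grid = F.affine_grid(thetas[:, j], size=(n, c, h, w), align_corners=False)
        warped.append(F.grid_sample(last_frame, grid, align_corners=False))
    warped = torch.stack(warped, dim=1)                     # (N, K, C, H, W)
    masks = torch.softmax(mask_logits, dim=1).unsqueeze(2)  # (N, K, 1, H, W)
    return (masks * warped).sum(dim=1)                      # (N, C, H, W)

# Toy usage with placeholder predictions for K = 4 transformations.
frame = torch.rand(1, 3, 64, 64)
thetas = torch.eye(2, 3).repeat(1, 4, 1, 1) + 0.01 * torch.randn(1, 4, 2, 3)
mask_logits = torch.randn(1, 4, 64, 64)
print(compose_frame(frame, thetas, mask_logits).shape)  # torch.Size([1, 3, 64, 64])
      </preformat>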
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Aigner, S., Körner, M.: “FutureGAN: Anticipating the Future Frames of Video Sequences using Spatio-Temporal 3d Convolutions in Progressively Growing GANs.” arXiv preprint arXiv:1810.01325 (2018).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Finn, C., Goodfellow, I., Levine, S.: “Unsupervised Learning for Physical Interaction through Video Prediction.” arXiv preprint arXiv:1605.07157 (2016).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Goodfellow, I.: “NIPS 2016 Tutorial: Generative Adversarial Networks.” (2016).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A.: “Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks.” arXiv preprint arXiv:1803.10892 (2018).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S.: “Stochastic Adversarial Video Prediction.” arXiv preprint arXiv:1804.01523 (2018).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Lotter, W., Kreiman, G., Cox, D.: “Unsupervised Learning of Visual Structure using Predictive Generative Networks.” arXiv preprint arXiv:1511.06380 (2016).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Shi, X., Chen, Z., Wang, H., Yeung, D.: “Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting.” arXiv preprint arXiv:1506.04214 (2015).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. van Amersfoort, J., Kannan, A., Ranzato, M. A., Szlam, A., Tran, D., Chintala, S.: “Transformation-based Models of Video Sequences.” arXiv preprint arXiv:1701.08435 (2017).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Vondrick, C., Torralba, A.: “Generating the Future with Adversarial Transformers.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2992-3000 (2017).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Jia, Y. T., Hu, S. M., Martin, R.: “Video Completion using Tracking and Fragment Merging.” The Visual Computer 21(8-10), pp. 601-610 (2005).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>