<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Conceptualization of a GAN for future frame prediction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Nirvana</forename><surname>Pillay</surname></persName>
							<email>nirvanap02@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University of KwaZulu-Natal</orgName>
								<address>
									<settlement>Durban</settlement>
									<country>RSA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Conceptualization of a GAN for future frame prediction</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0A6FC44A6CB3ABBF43F7B181952FAD85</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T10:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>GANs</term>
					<term>Transformation</term>
					<term>ConvLSTM</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The generation of future frames of a video involves the analysis of the previous t-i frames and the subsequent prediction of the following t+j frames. The majority of state-of-the-art models can accurately predict a single future frame that exhibits a high degree of photorealism. Their effectiveness declines as the number of generated frames increases, because the solution space diverges: it becomes multimodal, and optimizing traditional loss functions, such as MSE loss, does not adequately model this multimodality, so the resultant frames are blurred. The conceptualization of a GAN that generates several plausible future frames with adequate motion representation and a high degree of photorealism is presented.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>The prediction of future frames has several applications in autonomous decision-making areas, including self-driving cars, social robots and video completion <ref type="bibr" target="#b9">[10]</ref>. For example, a SocialGAN <ref type="bibr" target="#b3">[4]</ref> determines plausible and socially acceptable walking trajectories of people, thus aiding navigation of human-centric environments. GANs (<ref type="bibr" target="#b0">[1]</ref>, <ref type="bibr" target="#b3">[4]</ref>, <ref type="bibr" target="#b4">[5]</ref>, <ref type="bibr" target="#b5">[6]</ref>, <ref type="bibr" target="#b8">[9]</ref>) have been a popular approach to training spatio-temporal models for future frame prediction. The constituent components of a GAN are a generator and a discriminator, engaged in a minimax game <ref type="bibr" target="#b2">[3]</ref>. GANs, however, are difficult to train and are susceptible to mode collapse. In transformation space, the generator extracts transformations between adjacent input frames. It then predicts a future transformation and applies it to the last input frame to generate the next frame, and so on. Because the source of variability is modelled directly, the need to store low-level details of the input is eliminated; the resultant model requires fewer parameters, which simplifies learning, and the spatial data of the input is conserved. An architecture of similar efficacy is the Temporal Convolutional Network (TCN). A TCN in conjunction with a dilated CNN, modelling temporal and spatial dependencies respectively, was implemented by <ref type="bibr" target="#b8">[9]</ref>. A similar approach was undertaken by <ref type="bibr" target="#b0">[1]</ref>, with a PGGAN modelling spatial dependencies instead. Another attempt at sequential modelling utilizing CNNs <ref type="bibr" target="#b7">[8]</ref> was an architecture in which a network was replicated through time. 
The resultant model was a 'peculiar RNN' as parameters were now shared across time whilst still convolving spatial data. A CNN-LSTM architecture was implemented by <ref type="bibr" target="#b5">[6]</ref> to predict future frames of synthetic video data. These aspects were later united by <ref type="bibr" target="#b6">[7]</ref> into a single network, a convolutional LSTM (ConvLSTM). A stacked ConvLSTM, coupled with a Spatial Transformer Network (STN) <ref type="bibr" target="#b1">[2]</ref>, addressed the problem of future frame prediction and determined the state of motion of a robot arm. The representation of motion is improved by models that operate in transformation space ( <ref type="bibr" target="#b1">[2]</ref>, <ref type="bibr" target="#b7">[8]</ref>, <ref type="bibr" target="#b8">[9]</ref>). Such a model, a CGAN <ref type="bibr" target="#b8">[9]</ref> was evaluated using a Two-Alternative Forced Choice (2AFC) test. The generated video was preferred only 30.6% of the time over its ground-truth counterpart.</p></div>
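The transformation-space approach described above can be sketched with a deliberately simplified example: assume the only transformation between frames is a global integer translation, estimate it between the last two frames by circular cross-correlation, and re-apply it to the last frame under a constant-velocity assumption. This is a minimal numpy illustration of the idea, not any of the cited models, and the function names are hypothetical:

```python
import numpy as np

def estimate_shift(prev, curr):
    """Estimate a global integer (dy, dx) translation from prev to curr
    via the argmax of their circular cross-correlation (computed with FFTs)."""
    prev = prev - prev.mean()   # zero-mean frames make the peak unambiguous
    curr = curr - curr.mean()
    corr = np.fft.ifft2(np.fft.fft2(prev).conj() * np.fft.fft2(curr)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = prev.shape           # map wrapped indices into a signed range
    return (dy - h if dy > h // 2 else dy, dx - w if dx > w // 2 else dx)

def predict_next_frame(frames):
    """Extract the transformation between the last two frames and apply
    it again to the last frame (a constant-velocity motion model)."""
    dy, dx = estimate_shift(frames[-2], frames[-1])
    return np.roll(frames[-1], shift=(dy, dx), axis=(0, 1))
```

Because only the motion is modelled, the low-level appearance of the last frame is carried over untouched, which is the parameter-saving property noted above; the learned models cited replace the single global translation with richer, per-feature transformations.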
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Proposed Model</head><p>In a bid to address the issues of motion representation, photorealism and plausibility of generated frames, this research proposes the implementation of a CGAN. The discriminator of the CGAN receives the context frames coupled with either ground-truth or generated future frames, and is deceived only by sequences of frames that exhibit plausibility. A mini-batch standard deviation layer is added to one of the last layers of the Progressively Growing Network (PGN) discriminator, aiding in the prevention of mode collapse. The generator comprises seven stacked ConvLSTMs, similar to <ref type="bibr" target="#b1">[2]</ref>, and preserves spatial data whilst modelling the complex dynamics of the data. The fifth hidden layer parameterizes a modified STN, and the output of ConvLSTM5 is a predicted affine transformation matrix for each separate 'good feature' in the frame.</p><p>The STN is modified to operate on points determined by the Shi-Tomasi corner detection algorithm, for which transformations are then predicted. The model also predicts a compositing mask over each transformation. The generated frame is reconstructed by applying the predicted affine transformations, merged by masking, to the last input frame. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Schematic of CGAN Generator</figDesc><graphic coords="2,134.15,535.90,326.92,127.57" type="bitmap" /></figure>
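The mini-batch standard deviation layer mentioned above can be sketched in a few lines of numpy. This follows the general PGGAN-style statistic (the average over-the-batch standard deviation of the features, appended as an extra feature map); the grouping and exact placement used inside the actual PGN discriminator are design choices not specified here, and the function name is illustrative:

```python
import numpy as np

def minibatch_stddev_layer(x):
    """Append a feature map holding the average over-the-batch standard
    deviation of every feature: a collapsed generator produces batches
    with little variation, which this statistic exposes to the
    discriminator.

    x: activations of shape (N, C, H, W); returns (N, C + 1, H, W)."""
    std = x.std(axis=0)        # per-feature stddev across the batch
    mean_std = std.mean()      # single scalar summary of batch diversity
    n, _, h, w = x.shape
    extra = np.full((n, 1, h, w), mean_std, dtype=x.dtype)
    return np.concatenate([x, extra], axis=1)
```

A generator suffering mode collapse emits near-identical samples; the appended channel then carries a value near zero, giving the discriminator an easy cue to reject the whole batch.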
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">FutureGAN: Anticipating the Future Frames of Video Sequences using Spatio-Temporal</title>
		<author>
			<persName><forename type="first">S</forename><surname>Aigner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Körner</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.01325</idno>
	</analytic>
	<monogr>
		<title level="m">3d Convolutions in Progressively Growing GANs. arXiv preprint</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Unsupervised Learning for Physical Interaction through Video Prediction</title>
		<author>
			<persName><forename type="first">C</forename><surname>Finn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1605.07157</idno>
	</analytic>
	<monogr>
		<title level="j">arXiv preprint</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">NIPS 2016 Tutorial: Generative Adversarial Networks</title>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alahi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.10892</idno>
	</analytic>
	<monogr>
		<title level="j">arXiv preprint</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">X</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ebert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Finn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.01523</idno>
		<title level="m">Stochastic Adversarial Video Prediction. arXiv preprint</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Unsupervised Learning of Visual Structure using Predictive Generative Networks</title>
		<author>
			<persName><forename type="first">W</forename><surname>Lotter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kreiman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cox</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1511.06380</idno>
	</analytic>
	<monogr>
		<title level="m">arXiv preprint</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting</title>
		<author>
			<persName><forename type="first">X</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yeung</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1506.04214</idno>
	</analytic>
	<monogr>
		<title level="j">arXiv preprint</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Transformation-based Models of Video Sequences. arXiv preprint</title>
		<author>
			<persName><forename type="first">J</forename><surname>Van Amersfoort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kannan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Ranzato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Szlam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chintala</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1701.08435</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Generating the Future with Adversarial Transformers</title>
		<author>
			<persName><forename type="first">C</forename><surname>Vondrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Torralba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2992" to="3000" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Video Completion using Tracking and Fragment Merging</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">T</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Martin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Visual Computer</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">8-10</biblScope>
			<biblScope unit="page" from="601" to="610" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
