<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards an AI Holodeck: Generating Virtual Scenes from Sparse Natural Language Input</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jason Smith</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazanin Alsadat Tabatabaei Anaraki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Atefeh Mahdavi Goloujeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karan Khosla</string-name>
          <email>kkhosla7g@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Magerko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          , Atlanta, Georgia,
          <country>USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Holodeck, a virtual reality simulator from the television show Star Trek, is known as the “holy grail” of interactive narrative experiences. However, while there have been approaches to various components of a theoretical Holodeck, scene generation from dialogue is often overlooked. This paper introduces a prototype AI Holodeck application for scene generation, demonstrating the use of Natural Language Processing and a corpus of spatial data. The application creates scenes from user input text and fills those scenes with objects and relationships not explicitly defined by the user. This paper discusses potential use cases of scene generation in creating environments for interactive narrative, virtual reality, and other development opportunities.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Holodeck is a fictional virtual reality device in the
television show Star Trek, taking the form of a blank room
that generates interactive characters and objects dictated by
voice commands from people inside of it. It represents a
“holy grail” of interactive virtual reality
        <xref ref-type="bibr" rid="ref20">(Spector 2013)</xref>
        , has
been at the forefront of discussion of the role of AI in
digital storytelling
        <xref ref-type="bibr" rid="ref17">(Murray 2017)</xref>
        , and has been the
inspiration for a number of projects integrating narrative with
visual and audio generation
        <xref ref-type="bibr" rid="ref16 ref21">(Swartout et al. 2006; Marks,
Estevez, and Connor 2014)</xref>
        . These prior approaches have
focused on graphics and visualization over the mechanisms of
scene generation, the processes by which interactive
applications create spatial environments from user input criteria.
      </p>
      <p>For example, a Holodeck-inspired scene generation
system could be used in the following scenario:</p>
      <p>Sarah has an idea for a game about a detective. She
is thinking of the events in the game but having a hard time
imagining the space, so she uses the AI Holodeck to see how
the space would look. “Holodeck, give me a detective’s
office.” The Holodeck renders an office using objects sourced
from a database. There is a desk and a chair, a window
behind the desk, and a shelf in the corner. She then thinks
that a desk lamp could make it more mysterious at night.
“Holodeck, put a lamp on the desk.” The Holodeck adds
a lamp on the desk and a notebook beside it. She thinks,
“Yes, there should be a notebook on the desk!” But she feels
that the metal desk looks too rough for the office. She says,
“Holodeck, give me a wooden desk.” The Holodeck
renders the new wooden desk by adding “wooden” to its initial
database search of desks and replacing the original desk.</p>
      <p>
        To facilitate the creation of scenes – and to populate them
– some semantically annotated datasets are currently
available that categorize objects with relative positions and sizes
        <xref ref-type="bibr" rid="ref2 ref6 ref9">(Forbes and Choi 2017; Chang et al. 2015b)</xref>
        .
      </p>
      <p>
        Scene generation applications are able to parse these
datasets to create a “scene template”, a constrained mapping
of a scene’s objects and the basic spatial relationships
between them
        <xref ref-type="bibr" rid="ref15 ref3">(Chang, Savva, and Manning 2014)</xref>
        . Systems
like these use Natural Language Processing pipelines, such
as the techniques in the CoreNLP library
        <xref ref-type="bibr" rid="ref15">(Manning et al.
2014)</xref>
        , to add items specified in a user’s input text to a scene.
      </p>
      <p>However, current research does not address the addition
of elements that are both 1) unspecified by a user and 2)
gathered from semantically annotated datasets in order to
maintain relevance to the user-described scene. Therefore, this
paper aims to address the following research question: Can
semantically annotated datasets be used to extract
context-informed scene templates in a text-to-scene generation
application, including items not specified in the input text?</p>
      <p>In this paper, we introduce an AI Holodeck application.
This system draws from previous NLP and scene generation
work in order to create scenes with appropriate elements that
were not specified by the user. For example, if the user
inputs a farm, then the application may populate the scene with
things commonly found in a farm such as cows, hay barrels,
and fields of crops. The AI Holodeck can be used in a
variety of scenarios: examples include visual story generation,
game design and prototyping, interior design sketching or
idea generation, and creating virtual or fantasy worlds.</p>
      <p>The remainder of this paper explains how our system
draws from and synthesizes existing works in the domains
of natural language processing and scene generation, details
the design of our AI Holodeck application, and discusses
potential applications for integrating scenes generated by this
system with other interactive mediums.
</p>
    </sec>
    <sec id="sec-2">
      <title>Scene Recognition</title>
      <p>
        Scene graphs have been extensively used and explored in
the contexts of scene understanding and semantic image
captioning
        <xref ref-type="bibr" rid="ref12">(Johnson et al. 2015)</xref>
        . Representing objects in a
scene as nodes and their relationships as edges of a graph
makes it possible to represent the content of scenes,
generate new scenes, or manipulate scenes by modifying the
corresponding scene graph
        <xref ref-type="bibr" rid="ref22 ref8">(Dhamo et al. 2020)</xref>
        . Once the
objects and their relationships are learned, they can be used to
meaningfully place objects in the scene. Scene graphs can
be based on 3D scenes, reconstructed 3D scenes
        <xref ref-type="bibr" rid="ref22">(Wald et al.
2020)</xref>
        , images or text-image combinations.
      </p>
      <p>
        Depending on the data format, different approaches have
been devised to extract the scene content. SceneGen focuses
on novel representations of scene graphs which embed
positional and orientation information of a set of objects present
in a given room to achieve the most realistic placement
        <xref ref-type="bibr" rid="ref13">(Keshavarzi et al. 2020)</xref>
        .
      </p>
      <p>
        Other literature has investigated learning the implicit
positional relationship between objects using transformers and
attention mechanisms. In SceneFormer
        <xref ref-type="bibr" rid="ref25">(Wang, Yeshwanth,
and Nießner 2020)</xref>
        , the authors represent 3D objects and their
corresponding environment through a sequence of numbers
representing object category, location of objects in the
environment, object orientation and dimensions of the room.
      </p>
      <p>
        CLIP, a recent transformer-based model, connects text and
images to understand image content. This
discriminative model can be used to predict image content at scale by
encoding the image and caption in parallel
        <xref ref-type="bibr" rid="ref19">(Radford et al.
2021)</xref>
        . This model in combination with generative
models can be used to generate scenes
        <xref ref-type="bibr" rid="ref10">(Galatolo, Cimino, and
Vaglini 2021)</xref>
        , but the effectiveness of this approach depends on the
level of precision and detail required for the generated scene.
      </p>
      <p>We infer the object categories and their positional
relationships directly from images for two reasons: 1) images
are more accessible than 3D models, and more image
datasets are available; and 2) images function as representations of
real-world situations and convey properties that may be
manipulated in 3D scenes, such as messy desks or cluttered
rooms.</p>
    </sec>
    <sec id="sec-3">
      <title>Text-to-Scene Generation</title>
      <p>
        WordsEye
        <xref ref-type="bibr" rid="ref7">(Coyne and Sproat 2001)</xref>
        , one of the earlier works
on text-to-scene generation, relies on explicit
descriptions that follow a template of objects and their positions.
Requiring specific inputs in the format of “the [object] is
[distance] [position] the [object]” makes for a rigid and
unnatural user experience.
      </p>
      <p>
        Annotating 3D datasets with natural language descriptions
is one approach to improving on the unnatural experience
of this prior work
        <xref ref-type="bibr" rid="ref2 ref6">(Chang et al. 2015a)</xref>
        . However, the
text query is not the only component that contributes to
a natural user experience. SceneSeer breaks down the
problem of text-to-scene generation into scene parsing, scene
inference, scene generation, and scene interaction.
      </p>
      <p>To avoid the unnatural language caused by strict input
requirements like those of WordsEye, the SceneSeer authors also add
objects that are not mentioned but are relevant to the mentioned
objects. They select inferred objects by searching the object
hierarchy and adding the explicit object’s parent objects
with the highest probability (Chang et al. 2017). In
addition to this, our approach considers the environment and
the objects that have a higher probability of co-occurring with the
explicit objects in that specific context. Our focus is not to
produce the most sophisticated interior layout possible, but
rather to demonstrate a natural language-based scene
generation system that creates appropriate scenes fitting the user’s
input contextually and thematically.</p>
    </sec>
    <sec id="sec-4">
      <title>Scene Manipulation</title>
      <p>We explored scene manipulation research from the
perspectives of content and interaction. In terms of content, research
on scene manipulation focuses on scene-level or object-level
manipulation, which target different purposes
such as object removal or image blending.</p>
      <p>
        Recent research investigates scene manipulation through
updating the existing scene graph
        <xref ref-type="bibr" rid="ref22 ref8">(Dhamo et al. 2020)</xref>
        and
regenerating the scene. Other approaches explore
modifying images using the semantic label maps or boundary maps
extracted from the image
        <xref ref-type="bibr" rid="ref24">(Wang et al. 2018)</xref>
        . However, due
to challenges presented by distinguishing different objects
of the same type (e.g. several different cars in the scene),
our approach is limited to single instances of an object in a
scene.
      </p>
      <p>
        In terms of interaction, prior research explored different
methods or modes of interaction allowing users to
manipulate scenes. SceneSeer (Chang et al. 2017) enables users
to manipulate the scene with textual commands like
“replace the bowl with a red lamp”. In Scribbling Speech
        <xref ref-type="bibr" rid="ref26">(Yang
2018)</xref>
        , a speech-to-image generation tool, users interact with
the interface through sound and modify the scene in a
step-by-step process using natural language. Objects are
placed at different depths of the scene. We use the
scene graph modification approach in our refactoring
process, and users can modify the scene by adding new queries.
      </p>
      <p>Unlike the above tools, which visually render a scene, our
approach also hosts the Holodeck component models in an
API-like format, allowing for integration into a variety of
applications such as Unity.</p>
      <sec id="sec-4-1">
        <title>System Design</title>
        <p>As seen in Figure 1, our Holodeck scene generator contains a
full pipeline to collect input text and form a visual
representation of a scene. Our system generates a scene template for
any input text, determining objects and their locations in
order for them to be placed in a scene. Then, implicit
connections between objects in our semantically annotated datasets
are used to add additional nodes to the scene template,
creating a more vibrant scene. Objects are then mapped to a
graphical representation of a scene in sequence, while
resolving any collisions between them. Finally, a lightweight
interface allows for easily understood
demonstration.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Scene Templates</title>
      <p>
        User input utterances are parsed through the CoreNLP
library
        <xref ref-type="bibr" rid="ref15">(Manning et al. 2014)</xref>
        . The application separates
sentences into dependency trees composed of subjects, objects,
and descriptors. Each subject and object is stored as a node,
and each descriptor is stored as a property of that node.
Descriptors concerning relative position between objects (such
as “above” or “below”) are stored inside properties
specifying a cardinal direction. Phrases such as “on top of” and
“over” are all treated as the same “above” direction, and
phrases such as “beside” or “by” are set to either “left” or “right”.
These connections form a scene template
        <xref ref-type="bibr" rid="ref15 ref3">(Chang, Savva,
and Manning 2014)</xref>
        , a collection of the various spatial
relations between objects in a scene. The scene template
allows all objects in a scene to be connected either directly or
through an intermediate object, such as in Figure 2.
      </p>
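      <p>As a rough illustration of the structure described above, the following Python sketch builds a scene template from relation triples. It assumes the dependency parse has already been reduced to (subject, phrase, object) triples, so no CoreNLP or NLTK call is shown. The names <monospace>Node</monospace>, <monospace>DIRECTIONS</monospace>, <monospace>normalize</monospace>, and <monospace>build_template</monospace> are illustrative and do not necessarily match the actual implementation, and mapping “beside” and “by” to “right” is an arbitrary choice between the two sides.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field

# Phrase-to-direction table. The groupings follow the text; the exact phrase
# list and the choice of "right" for ambiguous phrases are assumptions.
DIRECTIONS = {
    "above": "above", "on top of": "above", "over": "above",
    "below": "below", "under": "below",
    "left of": "left", "right of": "right",
    "beside": "right", "by": "right",
}
OPPOSITE = {"above": "below", "below": "above", "left": "right", "right": "left"}


@dataclass
class Node:
    name: str
    descriptors: list = field(default_factory=list)
    relations: dict = field(default_factory=dict)  # direction -> [node names]


def normalize(phrase):
    """Map a positional phrase to one of four directions, or None."""
    return DIRECTIONS.get(phrase.lower().strip())


def build_template(triples, descriptors=None):
    """Build a scene template from (subject, phrase, object) triples."""
    template = {}
    for subject, phrase, obj in triples:
        direction = normalize(phrase)
        if direction is None:
            raise ValueError(f"unsupported positional phrase: {phrase!r}")
        subj_node = template.setdefault(subject, Node(subject))
        obj_node = template.setdefault(obj, Node(obj))
        # "lamp on top of desk": the desk lists the lamp under "above",
        # and the lamp lists the desk under "below".
        obj_node.relations.setdefault(direction, []).append(subject)
        subj_node.relations.setdefault(OPPOSITE[direction], []).append(obj)
    for name, words in (descriptors or {}).items():
        template.setdefault(name, Node(name)).descriptors.extend(words)
    return template


template = build_template(
    [("lamp", "on top of", "desk"), ("chair", "beside", "desk")],
    descriptors={"desk": ["wooden"]},
)
print(template["desk"])
```
]]></preformat>
      <p>Storing each relation in both directions keeps every object reachable from any other object it is connected to, which is the property the scene template relies on.</p>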
      <p>
        CoreNLP provides an in-depth
dependency parser that can handle complex sentence
structures. However, we also offer an offline version using
the NLTK library
        <xref ref-type="bibr" rid="ref1 ref18">(Bird, Klein, and Loper 2009)</xref>
        to extract
objects, properties and their corresponding locations.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Implicit Nodes and Positional Relations</title>
      <p>In order to create a more fleshed-out scene, our system adds
additional nodes which are not explicitly mentioned by the
user. Figure 3 provides an example of this concept. In this
example, the user is creating an office space and has used
the text “There is a couch and a table” as the first input.
Since the “couch” and “table” are explicitly mentioned by
the user, these nodes are created and used in the scene
template. The system then adds implicit nodes, bringing
in other objects such as a rug, a window, and a book
because these are objects usually found near a couch or table
in an office. If the object mentioned by the user is not
usually found in such an environment, e.g., “a horse in an office”,
our system searches for other objects that have a high
probability of being found in an office space rather than objects
that are typically found near a horse.</p>
      <p>With this method, the objects surrounding an object are
dependent on which environment that object is in. Figures 3
and 4 show this difference by using the same input sentence
“There is a couch and a table” in different environments of
“bedroom” and “library”.</p>
      <p>Finally, our system prioritizes the explicitly defined
positional relations over the implicit relations created by the
system. Figure 5 shows that a computer is implicitly brought
to the scene after the two input sentences “There is a table” and
“There is a chair”. In Figure 6, the user adds the input “The
computer is on top of the table”, moving the computer to the
explicitly specified position.</p>
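      <p>The selection of implicit nodes and the priority of explicit relations can be sketched as follows. This is a minimal illustration rather than the actual implementation: the table <monospace>COOCCURRENCE</monospace>, its contents, the <monospace>"*"</monospace> key for environment-wide objects, and the functions <monospace>add_implicit_nodes</monospace> and <monospace>merge_relations</monospace> are all hypothetical, and the limit of two implicit objects per explicit object is an assumed parameter.</p>
      <preformat><![CDATA[
```python
# Hypothetical table: environment -> anchor object -> (neighbour, direction)
# pairs sorted by descending frequency. "*" holds the objects most commonly
# found in the environment regardless of anchor.
COOCCURRENCE = {
    "office": {
        "table": [("computer", "above"), ("book", "above"), ("rug", "below")],
        "couch": [("rug", "below"), ("window", "left")],
        "*": [("desk", "right"), ("window", "left")],
    }
}


def add_implicit_nodes(explicit_objects, environment, table, per_object=2):
    """Return implicit object -> (direction, anchor) for one environment."""
    env = table.get(environment, {})
    placed = {}
    for anchor in explicit_objects:
        # An anchor that is unusual for the environment ("a horse in an
        # office") falls back to objects typical of the environment itself.
        candidates = env.get(anchor, env.get("*", []))
        added = 0
        for neighbour, direction in candidates:
            if added == per_object:
                break
            if neighbour in explicit_objects or neighbour in placed:
                continue
            placed[neighbour] = (direction, anchor)
            added += 1
    return placed


def merge_relations(implicit, explicit):
    """Explicit relations override implicit ones for the same object."""
    merged = dict(implicit)
    merged.update(explicit)
    return merged


implicit = add_implicit_nodes(["couch", "table"], "office", COOCCURRENCE)
# The user later states where the computer goes; that statement wins.
final = merge_relations(implicit, {"computer": ("above", "couch")})
print(final)
```
]]></preformat>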
      <p>
        Datasets Used for Extracting Implicit Relations. In
order to create a dataset of potential positional relations, we
used the MIT Indoor Scenes Dataset
        <xref ref-type="bibr" rid="ref18">(Quattoni and Torralba
2009)</xref>
        , which contains 67 indoor categories and a total of
15620 annotated JPEG images. We sorted the objects found
in each indoor category based on the number of occurrences
in that category. Additionally, for each object found in a
specific category, we looked at the objects found at immediate
and far distances from the specified object. We divided these
surrounding objects based on their positional relation to the
specified object (e.g., below, on top of) and sorted them based
on the number of occurrences. We exported this information
as a JSON file for our system’s use.
      </p>
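      <p>The counting and sorting steps described above could look roughly like the sketch below. The input format (one dictionary per image with a category, an object list, and relation triples) is a hypothetical simplification of the dataset annotations, the distinction between immediate and far distance is omitted, and <monospace>build_relation_table</monospace> is an illustrative name.</p>
      <preformat><![CDATA[
```python
import json
from collections import Counter, defaultdict


def build_relation_table(images):
    """Count objects and neighbours per category, sorted by frequency.

    Each image is a dict with "category", "objects" (list of names) and
    "relations" (list of (neighbour, relation, anchor) triples).
    """
    object_counts = defaultdict(Counter)
    neighbour_counts = defaultdict(Counter)  # key: (category, anchor, relation)
    for image in images:
        category = image["category"]
        object_counts[category].update(image["objects"])
        for neighbour, relation, anchor in image["relations"]:
            neighbour_counts[(category, anchor, relation)][neighbour] += 1

    table = {}
    for category, counts in object_counts.items():
        table[category] = {
            "objects": [name for name, _ in counts.most_common()],
            "relations": {},
        }
    for (category, anchor, relation), counts in neighbour_counts.items():
        entry = table.setdefault(category, {"objects": [], "relations": {}})
        by_relation = entry["relations"].setdefault(anchor, {})
        by_relation[relation] = [name for name, _ in counts.most_common()]
    return table


images = [
    {"category": "office", "objects": ["desk", "lamp", "chair"],
     "relations": [("lamp", "on top of", "desk")]},
    {"category": "office", "objects": ["desk", "lamp", "book"],
     "relations": [("lamp", "on top of", "desk"), ("book", "on top of", "desk")]},
    {"category": "office", "objects": ["desk"], "relations": []},
]
table = build_relation_table(images)
print(json.dumps(table, indent=2))  # the same string could be written to a file
```
]]></preformat>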
    </sec>
    <sec id="sec-7">
      <title>Scene Visualization</title>
      <p>When generating a scene, the system places bounding boxes
representing each object in the scene template into a 3D
graph. It looks up the size of each object in the
ShapeNetSem metadata. If no size is found, default values
are used for the output graph. The algorithm used to
prioritize object placement first queues the objects at the bottom of a
scene (objects with no “below” parameter) and then recursively
adds the objects in those objects’ “above” property to the graph,
stacking objects on top of each other.</p>
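      <p>A minimal sketch of this bottom-up placement is given below. It assumes a template in which each object maps to a dictionary of direction properties, uses an assumed default size, and lays floor-level objects side by side along the x axis, which is a simplification introduced only for this example. The names <monospace>place_objects</monospace> and <monospace>DEFAULT_SIZE</monospace> are hypothetical.</p>
      <preformat><![CDATA[
```python
DEFAULT_SIZE = (1.0, 1.0, 1.0)  # (width, depth, height); assumed fallback


def place_objects(template, sizes):
    """Return name -> (x, y, z, width, depth, height).

    (x, y, z) is the minimum corner of the bounding box. template maps each
    object name to a dict of direction -> list of object names.
    """
    boxes = {}

    def place(name, x, z):
        width, depth, height = sizes.get(name, DEFAULT_SIZE)
        boxes[name] = (x, 0.0, z, width, depth, height)
        # Recursively stack everything listed in this object's "above" property.
        for upper in template.get(name, {}).get("above", []):
            if upper not in boxes:
                place(upper, x, z + height)

    next_x = 0.0
    for name, relations in template.items():
        # Only objects with no "below" entry rest on the floor; they are the
        # starting points of the recursion.
        if relations.get("below") or name in boxes:
            continue
        place(name, next_x, 0.0)
        # Assumption: floor objects are laid out side by side along x.
        next_x += sizes.get(name, DEFAULT_SIZE)[0]
    return boxes


template = {
    "desk": {"above": ["lamp"]},
    "lamp": {"below": ["desk"]},
    "chair": {},
}
boxes = place_objects(template, {"desk": (2.0, 1.0, 0.8)})
print(boxes)
```
]]></preformat>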
      <p>As each object is added to the graph, collisions are
detected. Objects with lower priority (determined by their
place in the scene template) are shifted in the direction
corresponding to their property name until their bounding
boxes no longer overlap with the other object. For example,
if one object is “above” the other, it will be shifted vertically.
</p>
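      <p>The collision step can be illustrated with axis-aligned bounding boxes. In the sketch below, boxes are (minimum corner, size) pairs, and the lower-priority box is moved along the axis given by its relation. Instead of shifting in small increments, the example jumps directly to the first non-overlapping position, which is a simplification; <monospace>overlaps</monospace>, <monospace>resolve</monospace>, and the <monospace>AXIS</monospace> table are hypothetical names.</p>
      <preformat><![CDATA[
```python
# Relation name -> (axis index, sign). Axis 0 is x (left/right) and
# axis 2 is z (below/above); this convention is assumed for the example.
AXIS = {"left": (0, -1), "right": (0, 1), "below": (2, -1), "above": (2, 1)}


def overlaps(a, b):
    """True if two boxes given as (min_corner, size) intersect in 3D."""
    (a_min, a_size), (b_min, b_size) = a, b
    return all(
        a_min[i] < b_min[i] + b_size[i] and b_min[i] < a_min[i] + a_size[i]
        for i in range(3)
    )


def resolve(moving, fixed, direction):
    """Shift the lower-priority box along its relation axis until clear."""
    if not overlaps(moving, fixed):
        return moving
    axis, sign = AXIS[direction]
    (m_min, m_size), (f_min, f_size) = moving, fixed
    new_min = list(m_min)
    if sign > 0:
        # Move to the far face of the fixed box.
        new_min[axis] = f_min[axis] + f_size[axis]
    else:
        # Move so the moving box ends where the fixed box begins.
        new_min[axis] = f_min[axis] - m_size[axis]
    return (tuple(new_min), m_size)


desk = ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
lamp = ((0.5, 0.2, 0.5), (0.5, 0.5, 0.5))
print(resolve(lamp, desk, "above"))
```
]]></preformat>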
    </sec>
    <sec id="sec-8">
      <title>Interface Design</title>
      <p>The AI Holodeck application uses a Tkinter
(https://docs.python.org/3/library/tkinter.html) interface (see
Figure 7), activated from the command line.</p>
      <p>The application opens a window with a menu for
selecting a scene found in the Indoor Scenes dataset, a prompt for
entering text or microphone input, and a display of the
objects currently registered from the scene. When a user selects
the “Create Graph” button, objects found in the input text are
added to the list of objects. The scene is then displayed as a
movable, 3D matplotlib (https://matplotlib.org/) graph in a separate window.</p>
      <p>
        The application allows for a number of command line
arguments. Mode selects either text or voice input. When vocal
input is activated, a recording button is added to the
interface beside the text box. Pressing this button activates a
continuous microphone stream until a sentence is recognized,
which then populates the text field and automatically
activates the graph creation function. Model selects either NLTK
        <xref ref-type="bibr" rid="ref14">(Loper and Bird 2002)</xref>
        or CoreNLP
        <xref ref-type="bibr" rid="ref15">(Manning et al. 2014)</xref>
        as
the model used to generate a dependency tree. The NLTK model is
usable offline, while CoreNLP requires a separate command
line prompt to begin a server with an internet connection.
However, the CoreNLP model allows for more variety in
sentence structure. Future iterations of this
system will be evaluated with a comparison of error between the two
models.
      </p>
      <p>
A fully realized AI Holodeck application will require a
relative positioning and collision detection system that allows
for more spatial relationships than just “above”, “below”,
“left”, and “right”. In particular, size-dependent
relationships such as “inside” will allow generated scenes to have
a greater amount of realism and variety.
      </p>
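      <p>For reference, the Mode and Model command-line arguments described earlier in this section could be declared with Python's standard <monospace>argparse</monospace> module roughly as follows. The flag spellings, the default values, and the function name <monospace>build_parser</monospace> are assumptions made for illustration and may differ from the actual application.</p>
      <preformat><![CDATA[
```python
import argparse


def build_parser():
    """Command-line options for input mode and parsing model (illustrative)."""
    parser = argparse.ArgumentParser(
        description="Scene generator command-line sketch"
    )
    parser.add_argument(
        "--mode", choices=["text", "voice"], default="text",
        help="input method; 'voice' adds a recording button to the interface",
    )
    parser.add_argument(
        "--model", choices=["nltk", "corenlp"], default="nltk",
        help="parser backend; 'nltk' works offline, 'corenlp' needs a server",
    )
    return parser


# An explicit argument list is parsed here so the example does not depend
# on how the script itself was invoked.
args = build_parser().parse_args(["--mode", "voice", "--model", "corenlp"])
print(args.mode, args.model)
```
]]></preformat>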
      <p>This system is also limited in the fidelity of the
visualizations it is able to create. Objects are represented only as
a bounding box labeled with the object’s name. A more
sophisticated visualization application would include indexing
of a database of 3D models, in order to dynamically
populate generated scenes with appropriate representations of the
objects inside.</p>
      <p>Additionally, we plan on modifying the system to allow
for the manual removal and repositioning of objects. As
users correct the system output to fit the needs of the scene
they are trying to create, the stored database will update with
new spatial relationships and as such the system will be able
to learn from generated scenes.</p>
      <p>Our system is robust in terms of affording future
modifications. For example, in this phase of the project, we have
used the MIT dataset to extract possible positional
relationships between various objects. These positional
relationships can be continually modified or extended by using other
datasets, not limited to visual datasets, such as datasets of
narrative texts. Other techniques – such as deep learning –
could also be beneficial in removing our need for annotated
visual datasets by extracting spatial relations automatically
from other collections of images or narrative text.</p>
      <p>The visualization of objects can be modified and extended
to various platforms. Our system provides a scene
template and graphical representation as formatted data and data
structures extracted from the input text, which can be used
by various platforms to create a detailed visualization. As
explained in the system design of this paper, this data
includes the various objects in the scene, their properties,
positional relations and center-points for placement of the
objects in the scene. Hence, the visual modification could be
either in the form of changing visualization platform or in the
form of adding new objects to the dataset of 2D/3D models
used in these platforms.</p>
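      <p>One way to hand this data to an external platform is a plain JSON document. The sketch below is only an illustration of that idea: the field names, the bounding-box layout (minimum corner followed by size), and the function <monospace>export_scene</monospace> are hypothetical and are not a description of the actual output format.</p>
      <preformat><![CDATA[
```python
import json


def export_scene(template, boxes):
    """Serialize objects, properties, relations and center points to JSON.

    template: name -> {"descriptors": [...], "relations": {direction: [...]}}
    boxes:    name -> (x, y, z, width, depth, height), minimum corner first
    """
    objects = []
    for name, node in template.items():
        x, y, z, width, depth, height = boxes[name]
        objects.append({
            "name": name,
            "properties": node.get("descriptors", []),
            "relations": node.get("relations", {}),
            # Center point of the bounding box, for placing a model.
            "center": [x + width / 2, y + depth / 2, z + height / 2],
            "size": [width, depth, height],
        })
    return json.dumps({"objects": objects}, indent=2)


template = {
    "desk": {"descriptors": ["wooden"], "relations": {"above": ["lamp"]}},
    "lamp": {"relations": {"below": ["desk"]}},
}
boxes = {
    "desk": (0.0, 0.0, 0.0, 2.0, 1.0, 0.8),
    "lamp": (0.0, 0.0, 0.8, 1.0, 1.0, 1.0),
}
print(export_scene(template, boxes))
```
]]></preformat>
      <p>A consumer such as a game engine would then only need a JSON reader and a lookup from object names to its own 2D or 3D models.</p>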
      <sec id="sec-8-1">
        <title>Future Work</title>
        <p>In future iterations of the system, we will include a
separate narrative text interpretation module. This module will be
comprised of a series of models trained on literature, which
will provide additional scene details given a user’s
starting input. The current implementation, for both the NLTK
and CoreNLP models, primarily uses simple sentences
comprised of subject-object pairs in each clause. Training
models on literature will enhance the system’s ability to capture
information from input sentences with greater structural
variety.</p>
        <p>Other prospective improvements in the software are
documentation and user functionality to facilitate connections
to software such as Unity and virtual reality for integration
with game development. This addition will be used to
explore manipulation of objects already in a scene, and the
placement or movement of objects with both input vocal
commands and gestures. Additionally, once objects are
generated in a higher-fidelity graphical environment, modifiers
can be extracted from the input. These modifiers include
adjectives describing the scene and the objects within. They
will be added to the scene template and transmitted to the
external environment in order to generate visuals more
appropriate for the user query.</p>
        <p>Future work will also include evaluation of the
application. The first evaluation will include separate analysis of
the NLP and scene generation models, in terms of precision
in the sentences they parse as well as the relevance of
objects generated for a scene. System performance will also be
measured in terms of stability and framerate while
generating large numbers of objects. Finally, we will conduct user
studies measuring the strength of the system as a piece of
interactive technology as well as users’ experience.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Natural language processing with Python: analyzing text with the natural language toolkit</article-title>
          .
          <publisher-name>O'Reilly Media, Inc.</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Monroe</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Savva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Potts</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2015a</year>
          .
          <article-title>Text to 3d scene generation with rich lexical grounding</article-title>
          .
          <source>arXiv preprint arXiv:1505.06289</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Savva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Semantic parsing for text to 3d scene generation</article-title>
          .
          <source>In Proceedings of the ACL 2014 Workshop on Semantic Parsing</source>
          ,
          <fpage>17</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
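      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>A. X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Eric</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Savva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>SceneSeer: 3D scene design with natural language</article-title>
          .
          <source>arXiv preprint arXiv:1703.00050</source>
          .
        </mixed-citation>
      </ref>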
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>A. X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Funkhouser</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guibas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hanrahan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Savarese</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
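          ;
          <string-name>
            <surname>Savva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; et al.
          <year>2015b</year>
          .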
          <article-title>Shapenet: An information-rich 3d model repository</article-title>
          .
          <source>arXiv preprint arXiv:1512.03012</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Coyne</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sproat</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>WordsEye: An automatic text-to-scene conversion system</article-title>
          .
          <source>In Proceedings of the 28th annual conference on Computer graphics and interactive techniques</source>
          ,
          <fpage>487</fpage>
          -
          <lpage>496</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Dhamo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Farshad</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Laina</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Navab</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hager</surname>
            ,
            <given-names>G. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tombari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Rupprecht</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Semantic image manipulation using scene graphs</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>5213</fpage>
          -
          <lpage>5222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Forbes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Verb physics: Relative physical knowledge of actions and objects</article-title>
          .
          <source>arXiv preprint arXiv:1706.03799</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Galatolo</surname>
            ,
            <given-names>F. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cimino</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Vaglini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search</article-title>
          .
          <source>arXiv preprint arXiv:2102.01645</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Krishna</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Stark</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.-J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shamma</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Image retrieval using scene graphs</article-title>
          .
          <source>In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>3668</fpage>
          -
          <lpage>3678</lpage>
          . doi:10.1109/CVPR.2015.7298990.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Keshavarzi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Caldas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Scenegen: Generative contextual scene augmentation using scene graph priors</article-title>
          .
          <source>arXiv preprint arXiv:2009.12395</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>NLTK: The natural language toolkit</article-title>
          .
          <source>arXiv preprint cs/0205028</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>McClosky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations</source>
          ,
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Marks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Estevez</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Connor</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Towards the Holodeck: fully immersive virtual reality visualisation of scientific and engineering data</article-title>
          .
          <source>In Proceedings of the 29th International Conference on Image and Vision Computing New Zealand</source>
          ,
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Hamlet on the holodeck: The future of narrative in cyberspace</article-title>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Quattoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Recognizing indoor scenes</article-title>
          .
          <source>In 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>413</fpage>
          -
          <lpage>420</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hallacy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goh</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mishkin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; et al.
          <year>2021</year>
          .
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          .
          <source>arXiv preprint arXiv:2103.00020</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Spector</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Holodeck: Holy Grail or Hollow Promise? Part 1</article-title>
          . URL https://www.gamesindustry.biz/articles/2013-07-31-holodeck-holy-grail-or-hollow-promise-part-1.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Swartout</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gratch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>W. L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kyriakakis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>LaBore</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lindheim</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Marsella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Miraglia</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Toward the holodeck: Integrating graphics, sound, character and story</article-title>
          .
          <source>Technical report, UNIVERSITY OF SOUTHERN CALIFORNIA MARINA DEL REY CA INST FOR CREATIVE . . . .</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Wald</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
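          ;
          <string-name>
            <surname>Dhamo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Navab</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tombari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>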
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>Learning 3D semantic scene graphs from 3D indoor reconstructions</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>3961</fpage>
          -
          <lpage>3970</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.-C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.-Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.-Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kautz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Catanzaro</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>High-resolution image synthesis and semantic manipulation with conditional GANs</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <fpage>8798</fpage>
          -
          <lpage>8807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yeshwanth</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Nießner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Sceneformer: Indoor scene generation with transformers</article-title>
          .
          <source>arXiv preprint arXiv:2012.09793</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Scribbling Speech</article-title>
          . URL https://experiments.withgoogle.com/scribbling-speech.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>