Towards an AI Holodeck: Generating Virtual Scenes from Sparse Natural Language Input

Jason Smith, Nazanin Alsadat Tabatabaei Anaraki, Atefeh Mahdavi Goloujeh, Karan Khosla, Brian Magerko
Georgia Institute of Technology, Atlanta, Georgia, USA
{jsmith775, nazanin.tbt, atefehmahdavi, kkhosla7}@gatech.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The Holodeck, a virtual reality simulator from the television show Star Trek, is known as the “holy grail” of interactive narrative experiences. However, while there have been approaches to various components of a theoretical Holodeck, scene generation from dialogue is often overlooked. This paper introduces a prototype AI Holodeck application for scene generation, demonstrating the use of Natural Language Processing and a corpus of spatial data. The application creates scenes from user input text and fills those scenes with objects and relationships not explicitly defined by the user. This paper discusses potential use cases of scene generation in creating environments for interactive narrative, virtual reality, and other development opportunities.

1 Introduction

The Holodeck is a fictional virtual reality device in the television show Star Trek, taking the form of a blank room that generates interactive characters and objects dictated by voice commands from the people inside it. It represents a “holy grail” of interactive virtual reality (Spector 2013), has been at the forefront of discussion of the role of AI in digital storytelling (Murray 2017), and has been the inspiration for a number of projects integrating narrative with visual and audio generation (Swartout et al. 2006; Marks, Estevez, and Connor 2014). These prior approaches have focused on graphics and visualization over the mechanisms of scene generation, the processes with which interactive applications create spatial environments from user input criteria.

For example, a Holodeck-inspired scene generation system could be used in the following scenario: Sarah has an idea for a game about a detective. She is thinking through the events of the game but is having a hard time imagining the space, so she uses the AI Holodeck to see how the space would look. “Holodeck, give me a detective’s office.” The Holodeck renders an office using objects sourced from a database: there is a desk and a chair, a window behind the desk, and a shelf in the corner. She then thinks that a desk lamp could make it more mysterious at night. “Holodeck, put a lamp on the desk.” The Holodeck adds a lamp on the desk and a notebook beside it. She thinks, “Yes, there should be a notebook on the desk!” But she feels the metal desk looks really rough in the office. She says, “Holodeck, give me a wooden desk.” The Holodeck renders the new wooden desk by adding “wooden” to its initial database search for desks and replacing the original desk.

To facilitate the creation of scenes, and to populate them, some semantically annotated datasets are currently available that categorize objects with relative positions and sizes (Forbes and Choi 2017; Chang et al. 2015b).

Scene generation applications are able to parse these datasets to create a “scene template”, a constrained mapping of a scene’s objects and the basic spatial relationships between them (Chang, Savva, and Manning 2014). Systems like these use Natural Language Processing pipelines, such as the techniques in the CoreNLP library (Manning et al. 2014), to add items specified in a user’s input text to a scene.

However, the addition of elements that are both 1) unspecified by the user and 2) gathered from semantically annotated datasets in order to maintain relevance to the user-described scene is missing from current research. Therefore, this paper aims to address the following research question: can semantically annotated datasets be used to extract context-informed scene templates in a text-to-scene generation application, including items not specified in the input text?

In this paper, we introduce an AI Holodeck application. This system draws from previous NLP and scene generation work in order to create scenes with appropriate elements that were not specified by the user. For example, if the user inputs a farm, the application may populate the scene with things commonly found on a farm, such as cows, hay bales, and fields of crops. The AI Holodeck can be used in a variety of scenarios: examples include visual story generation, game design and prototyping, interior design sketching or idea generation, and creating virtual or fantasy worlds.

The remainder of this paper explains how our system draws from and synthesizes existing work in the domains of natural language processing and scene generation, details the design of our AI Holodeck application, and discusses potential applications for integrating scenes generated by this system with other interactive media.

2 Related Work

2.1 Scene Recognition

Scene graphs have been extensively used and explored in the contexts of scene understanding and semantic image captioning (Johnson et al. 2015). Representing the objects in a scene as nodes and their relationships as the edges of a graph makes it possible to represent the content of a scene and to generate new scenes or manipulate existing ones by modifying the corresponding scene graph (Dhamo et al. 2020). Once the objects and their relationships are learned, this information can be used to place objects meaningfully in the scene. Scene graphs can be based on 3D scenes, reconstructed 3D scenes (Wald et al. 2020), images, or text-image combinations.

Depending on the data format, different approaches have been devised to extract the scene content. SceneGen focuses on novel representations of scene graphs that embed the position and orientation information of a set of objects present in a given room to achieve the most realistic placement (Keshavarzi et al. 2020).
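As a concrete illustration of the node-and-edge representation described above, the sketch below shows one minimal way a scene graph could be stored in code. The class layout, object names, and relation labels are assumptions made for this example, not a structure taken from any of the cited systems.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Objects as nodes (name -> attributes), relationships as labeled edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (subject, relation, object) triples

    def add_object(self, name, **attributes):
        self.nodes[name] = attributes

    def relate(self, subject, relation, obj):
        self.edges.append((subject, relation, obj))

# Build a toy graph; "manipulating" the scene amounts to editing nodes or edges.
graph = SceneGraph()
graph.add_object("cat", color="black")
graph.add_object("chair", material="wood")
graph.add_object("desk")
graph.relate("cat", "on", "chair")
graph.relate("chair", "left_of", "desk")
print(graph.edges)
```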
Other literature has investigated learning the implicit positional relationships between objects using transformers and attention mechanisms. In SceneFormer (Wang, Yeshwanth, and Nießner 2020), the authors represent 3D objects and their corresponding environment through a sequence of numbers encoding object category, object location in the environment, object orientation, and the dimensions of the room.

CLIP, a recent transformer-based model, connects text and images to understand image content. This discriminative model can be used to predict image content at scale by encoding the image and caption in parallel (Radford et al. 2021). In combination with generative models it can be used to generate scenes (Galatolo, Cimino, and Vaglini 2021), but the effectiveness of this approach depends on the level of precision and detail required for the generated scene.

We infer object categories and their positional relationships directly from images for two reasons: 1) images are more accessible than 3D models, and more datasets are available; and 2) images function as representations of real-world situations and convey properties, such as messy desks or cluttered rooms, that may later be manipulated in 3D scenes.

2.2 Text-to-Scene Generation

WordsEye (Coyne and Sproat 2001), one of the earlier works on text-to-scene generation, relies on explicit descriptions following a template of objects and their positions. Requiring specific inputs in the format “the [object] is a [distance] [position] the [object]” makes for a rigid and unnatural user experience.

Annotating 3D datasets with natural language descriptions is one approach to improving on this unnatural experience (Chang et al. 2015a). However, the text query is not the only component that contributes to a natural user experience. SceneSeer breaks the problem of text-to-scene generation down into scene parsing, scene inference, scene generation, and scene interaction.

To avoid the unnatural language caused by strict input requirements like those of WordsEye, SceneSeer also brings in objects that are not mentioned but are relevant to the mentioned objects, selecting these inferred objects by searching the object hierarchy and bringing in the explicit object’s parent objects with the highest probability (Chang et al. 2017). In addition to this, our approach considers the environment and selects objects that have the highest probability of co-appearing with the explicit objects in that specific context. Our focus is not to produce the most sophisticated interior layout possible, but rather to demonstrate a natural language-based scene generation system that creates appropriate scenes fitting the user’s input contextually and thematically.

2.3 Scene Manipulation

We explored scene manipulation research from the perspectives of content and interaction. In terms of content, research on scene manipulation focuses on either scene-level or object-level manipulation, which target different purposes such as object removal or image blending.

Recent research investigates scene manipulation by updating the existing scene graph (Dhamo et al. 2020) and regenerating the scene. Other approaches explore modifying images using semantic label maps or boundary maps extracted from the image (Wang et al. 2018). However, due to the challenge of distinguishing different objects of the same type (e.g., several different cars in a scene), our approach is limited to single instances of an object in a scene.

In terms of interaction, prior research has explored different methods or modes of interaction that allow users to manipulate scenes. SceneSeer (Chang et al. 2017) enables users to manipulate the scene with textual commands like “replace the bowl with a red lamp”. In Scribbling Speech (Yang 2018), a speech-to-image generation tool, users interact with the interface through sound and modify the scene step by step using natural language, with objects placed at different depths of the scene. We use the scene graph modification approach in our refactoring process, and users can modify the scene by adding new queries. Unlike the above tools, which visually render a scene, our approach also hosts the Holodeck component models in an API-like format, allowing for integration into a variety of applications such as Unity.

3 System Design

As seen in Figure 1, our Holodeck scene generator contains a full pipeline to collect input text and form a visual representation of a scene. The system generates a scene template for any input text, determining objects and their locations so that they can be placed in a scene. Then, implicit connections between objects in our semantically annotated datasets are used to add additional nodes to the scene template, creating a more vibrant scene. Objects are then mapped to a graphical representation of the scene in sequence, while resolving any collisions between them. Finally, a lightweight interface was created to allow for an easily understood demonstration.

Figure 1: System flow for the AI Holodeck application.
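To make the flow in Figure 1 concrete, the toy sketch below strings the three stages together end to end. Each stage is reduced to a stub, and the vocabulary, co-occurrence table, and placeholder positions are all invented for the example; Sections 3.1-3.3 describe what the real stages do.

```python
def parse_to_scene_template(text):
    # Stage 1 (Sec. 3.1): extract explicit objects; spatial relations are omitted here.
    vocabulary = ["couch", "table", "chair", "desk"]
    return {word: {} for word in vocabulary if word in text.lower()}

def add_implicit_nodes(template, environment):
    # Stage 2 (Sec. 3.2): add objects that commonly co-occur in this environment.
    co_occurrence = {"office": ["rug", "window", "book"], "library": ["bookshelf", "lamp"]}
    for obj in co_occurrence.get(environment, []):
        template.setdefault(obj, {"implicit": True})
    return template

def place_objects(template):
    # Stage 3 (Sec. 3.3): assign placeholder positions; the real system stacks
    # bounding boxes and resolves collisions before visualizing the result.
    return {name: (2.0 * i, 0.0, 0.0) for i, name in enumerate(template)}

template = parse_to_scene_template("There is a couch and a table")
template = add_implicit_nodes(template, "office")
print(place_objects(template))
```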
3.1 Scene Templates

User input utterances are parsed with the CoreNLP library (Manning et al. 2014). The application separates sentences into dependency trees comprised of subjects, objects, and descriptors. Each subject and object is stored as a node, and each descriptor is stored as a property of that node. Descriptors concerning the relative position between objects (such as “above” or “below”) are stored inside properties specifying a cardinal direction. Phrases such as “on top of” and “over” are all treated as the same “above” direction, and phrases such as “beside” or “by” are set to either “left” or “right”.

These connections form a scene template (Chang, Savva, and Manning 2014), a collection of the various spatial relations between objects in a scene. The scene template allows all objects in a scene to be connected either directly or through an intermediate object, such as in Figure 2.

Figure 2: An example scene template, with inputs “There is a black cat on a wooden chair.” and “The chair is to the left of the desk”.

CoreNLP makes it possible to use an in-depth dependency parser, allowing complex sentence structures to be parsed. However, we also offer an offline version that uses the NLTK library (Bird, Klein, and Loper 2009) to extract objects, properties, and their corresponding locations.
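The sketch below illustrates the kind of normalization and node construction described in this subsection. To stay self-contained it uses keyword matching instead of a real CoreNLP or NLTK dependency parse, and the synonym table and template layout are simplified assumptions rather than the application’s actual data structures.

```python
import re

# Spatial phrases normalized to a single direction; "beside"/"by" collapse to one side.
DIRECTION_SYNONYMS = {
    "on top of": "above", "over": "above", "on": "above",
    "under": "below", "below": "below",
    "to the left of": "left", "to the right of": "right",
    "beside": "left", "by": "left",
}

def parse_sentence(sentence, template=None):
    """Add the objects and (at most) one spatial relation from a simple sentence."""
    template = {} if template is None else template
    text = sentence.lower().rstrip(".")
    # Try longer phrases first so "on top of" wins over "on".
    for phrase, direction in sorted(DIRECTION_SYNONYMS.items(), key=lambda kv: -len(kv[0])):
        pattern = (rf"(?:a|an|the)\s+(?:\w+\s+)*?(\w+)\s+(?:is\s+)?"
                   rf"{re.escape(phrase)}\s+(?:a|an|the)\s+(?:\w+\s+)*(\w+)")
        match = re.search(pattern, text)
        if match:
            subject, anchor = match.groups()
            template.setdefault(subject, {})[direction] = anchor
            template.setdefault(anchor, {})
            return template
    # No spatial phrase found: just register any "a/an <object>" mentions as bare nodes.
    for match in re.finditer(r"\b(?:a|an)\s+(\w+)", text):
        template.setdefault(match.group(1), {})
    return template

template = parse_sentence("There is a black cat on a wooden chair.")
template = parse_sentence("The chair is to the left of the desk.", template)
print(template)  # {'cat': {'above': 'chair'}, 'chair': {'left': 'desk'}, 'desk': {}}
```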
3.2 Implicit Nodes and Positional Relations

In order to create a more fleshed-out scene, our system adds additional nodes which are not explicitly mentioned by the user. Figure 3 provides an example of this concept. In this example, the user is creating an office space and has used the text “There is a couch and a table” as the first input. Since the “couch” and “table” are explicitly mentioned by the user, these nodes are created and used in the scene template. The addition of implicit nodes then allows the system to bring in other objects such as a rug, a window, and a book, because these are objects usually found near a couch or table in an office. If the object mentioned by the user is not usually found in such an environment, e.g., “a horse in an office”, our system searches for other objects that have a high probability of being found in an office space rather than objects that are typically found near a horse.

Figure 3: An example of adding implicit nodes, with the input sentence “There is a couch and a table.” being used to create an office space.

With this method, the objects surrounding an object depend on which environment that object is in. Figures 3 and 4 show this difference by using the same input sentence, “There is a couch and a table”, in different environments of “bedroom” and “library”.

Figure 4: An example of adding implicit nodes, with the input sentence “There is a couch and a table.” being used to create a library space.

Finally, our system prioritizes explicitly defined positional relations over the implicit relations created by the system. Figure 5 shows that a computer is implicitly brought into the scene after the two input sentences “There is a table” and “There is a chair”. In Figure 6, the user adds the input “The computer is on top of the table”, moving the computer to the explicitly specified position.

Figure 5: An example of adding implicit nodes. A computer is implicitly added to the scene.

Figure 6: An example of prioritizing explicit relations over implicit ones. The user requests the computer to be moved to the top of the table.
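A simplified version of the co-occurrence-based selection described in this subsection is sketched below. The co-occurrence table stands in for the JSON file described in the next subsection, and its contents, the fallback rule, and the per-object limit are assumptions made for the example rather than the system’s actual values.

```python
# Hypothetical co-occurrence data: environment -> object -> frequently nearby objects.
CO_OCCURRENCE = {
    "office":  {"couch": ["rug", "window", "book"], "table": ["lamp", "chair"]},
    "library": {"couch": ["bookshelf", "lamp"], "table": ["book", "chair"]},
}

def add_implicit_nodes(template, environment, per_object=2):
    env_table = CO_OCCURRENCE.get(environment, {})
    for explicit in list(template):
        candidates = env_table.get(explicit)
        if candidates is None:
            # Object is unusual for this environment ("a horse in an office"):
            # fall back to objects common in the environment overall.
            candidates = [obj for nearby in env_table.values() for obj in nearby]
        for obj in candidates[:per_object]:
            template.setdefault(obj, {"implicit": True})
    return template

template = {"couch": {}, "table": {}}
print(add_implicit_nodes(template, "office"))
# -> couch and table plus implicit rug, window, lamp, and chair nodes
```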
Datasets Used for Extracting Implicit Relations. In order to create a dataset of potential positional relations, we used the MIT Indoor Scenes Dataset (Quattoni and Torralba 2009), which contains 67 indoor categories and a total of 15,620 annotated JPEG images. We sorted the objects found in each indoor category based on the number of occurrences in that category. Additionally, for each object found in a specific category, we looked at the objects found at immediate and far distances from the specified object. We divided these surrounding objects based on their positional relation to the specified object (e.g., below, on top of) and sorted them based on the number of occurrences. We exported this information as a JSON file for our system’s use.
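The sketch below shows one way such a JSON export could be produced. Since the dataset itself consists of annotated images, the step that reduces each image to (category, object, relation, neighbour) tuples is omitted; the example tuples, file name, and key format are assumptions for illustration only.

```python
import json
from collections import Counter, defaultdict

# Assume each annotated image has already been reduced to relation tuples.
observations = [
    ("office", "desk", "above", "lamp"),
    ("office", "desk", "above", "computer"),
    ("office", "desk", "above", "computer"),
    ("office", "desk", "left", "chair"),
]

counts = defaultdict(Counter)
for category, obj, relation, neighbour in observations:
    counts[(category, obj, relation)][neighbour] += 1

# Sort neighbours by number of occurrences and export for the generator to load.
export = {
    f"{category}/{obj}/{relation}": [name for name, _ in counter.most_common()]
    for (category, obj, relation), counter in counts.items()
}
with open("implicit_relations.json", "w") as handle:
    json.dump(export, handle, indent=2)
```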
3.3 Scene Visualization

When generating a scene, the system places bounding boxes representing each object in the scene template into a 3D graph. It searches for sizes for each object in the ShapeNetSem metadata; if none are found, they are replaced with default values for the output graph. The algorithm used to prioritize object placement queues the objects on the bottom of a scene (objects with no “below” parameter) and recursively adds the objects in those objects’ “above” property to the graph, stacking objects on top of each other.

As each object is added to the graph, collisions are detected. Objects with lower priority (determined by their place in the scene template) are shifted in the direction corresponding to their property name until their bounding boxes no longer overlap with the other object. For example, if one object is “above” the other, it will be shifted vertically.
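The placement pass described above can be summarized in a few lines. The sketch below uses placeholder heights instead of ShapeNetSem sizes and a template layout in which each node lists the objects stacked above it, so it is an illustration of the stacking and collision-shifting idea rather than the system’s implementation.

```python
DEFAULT_HEIGHT = 1.0

def place(template, heights=None):
    heights = heights or {}
    positions = {}  # name -> (x, bottom_z, height)

    def place_at(name, x, z):
        h = heights.get(name, DEFAULT_HEIGHT)
        # Collision resolution: shift upward past any box already in this column.
        for other_x, other_z, other_h in positions.values():
            if other_x == x and z < other_z + other_h:
                z = other_z + other_h
        positions[name] = (x, z, h)
        # Recursively stack everything listed in this object's "above" property.
        for child in template.get(name, {}).get("above", []):
            place_at(child, x, z + h)

    # Queue the objects on the bottom of the scene: those with no "below" parent.
    floor_objects = [n for n, props in template.items() if "below" not in props]
    for i, name in enumerate(floor_objects):
        place_at(name, x=2.0 * i, z=0.0)
    return positions

template = {"table": {"above": ["computer"]}, "computer": {"below": "table"}, "chair": {}}
print(place(template))
```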
3.4 Interface Design

The AI Holodeck application uses a Tkinter interface (https://docs.python.org/3/library/tkinter.html; see Figure 7), activated from the command line.

The application opens a window with a menu for selecting a scene found in the Indoor Scenes dataset, a prompt for entering text or microphone input, and a display of the objects currently registered in the scene. When a user selects the “Create Graph” button, objects found in the input text are added to the list of objects. The scene is then displayed as a movable 3D matplotlib (https://matplotlib.org/) graph in a separate window.

The application allows for a number of command line arguments. Mode selects either text or voice input. When vocal input is activated, a recording button is added to the interface beside the text box. Pressing this button activates a continuous microphone stream until a sentence is recognized, which then populates the text field and automatically activates the graph creation function. Model selects either NLTK (Loper and Bird 2002) or CoreNLP (Manning et al. 2014) as the model used to generate dependency parses. The NLTK model is usable offline, while CoreNLP requires a separate command line prompt to start a server with an internet connection. However, the CoreNLP model allows for more variety in sentence structure. Examination of future iterations of this system will include a comparison of error rates between the two models.

Figure 7: User interface for the AI Holodeck application.
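For readers who want a sense of how such switches could be wired up, the fragment below sketches the two arguments with argparse. The flag names, choices, and defaults are assumptions made for illustration and do not reflect the application’s actual command line.

```python
import argparse

parser = argparse.ArgumentParser(description="AI Holodeck demo interface (illustrative)")
parser.add_argument("--mode", choices=["text", "voice"], default="text",
                    help="take input from the text box or from the microphone")
parser.add_argument("--model", choices=["nltk", "corenlp"], default="nltk",
                    help="dependency model: NLTK runs offline; CoreNLP needs a "
                         "separately started server but handles more sentence structures")
args = parser.parse_args()
print(f"mode={args.mode}, model={args.model}")
```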
4 Discussion

A fully realized AI Holodeck application will require a relative positioning and collision detection system that allows for more spatial relationships than just “above”, “below”, “left”, and “right”. In particular, size-dependent relationships such as “inside” will allow generated scenes to have a greater amount of realism and variety.

This system is also limited in the fidelity of the visualizations it is able to create. Objects are represented only as a bounding box labeled with the object’s name. A more sophisticated visualization application would index a database of 3D models in order to dynamically populate generated scenes with appropriate representations of the objects inside.

Additionally, we plan to modify the system to allow for the manual removal and repositioning of objects. As users correct the system output to fit the needs of the scene they are trying to create, the stored database will update with new spatial relationships, and as such the system will be able to learn from generated scenes.

Our system is designed to afford future modifications. For example, in this phase of the project we have used the MIT dataset to extract possible positional relationships between various objects. These positional relationships can be continually modified or extended using other datasets, not limited to visual datasets, such as collections of narrative text. Other techniques, such as deep learning, could also help remove our need for annotated visual datasets by extracting spatial relations automatically from other collections of images or narrative text.

The visualization of objects can be modified and extended to various platforms. Our system provides the scene template and graphical representation as formatted data and data structures extracted from the input text, which can be used by various platforms to create a detailed visualization. As explained in the system design section of this paper, this data includes the various objects in the scene, their properties, positional relations, and center points for placing the objects in the scene. Hence, visual modification could take the form of either changing the visualization platform or adding new objects to the dataset of 2D/3D models used by these platforms.

5 Future Work

In future iterations of the system, we will include a separate narrative text interpretation module. This module will comprise a series of models trained on literature, which will provide additional scene details given a user’s starting input. The current implementation, for both the NLTK and CoreNLP models, primarily handles simple sentences composed of subject-object pairs in each clause. Training models on literature will enhance the system’s ability to capture information from input sentences with a higher structural variety.

Other prospective improvements to the software are documentation and user functionality to facilitate connections to software such as Unity and virtual reality for integration with game development. This addition will be used to explore the manipulation of objects already in a scene, and the placement or movement of objects with both vocal commands and gestures. Additionally, once objects are generated in a higher-fidelity graphical environment, modifiers can be extracted from the input. These modifiers include adjectives describing the scene and the objects within it. They will be added to the scene template and transmitted to the external environment in order to generate visuals more appropriate for the user query.

Future work will also include evaluation of the application. The first evaluation will include separate analyses of the NLP and scene generation models, in terms of the precision of the sentences they parse as well as the relevance of the objects generated for a scene. System performance will also be measured in terms of stability and framerate while generating large numbers of objects. Finally, we will conduct user studies measuring the strength of the system as a piece of interactive technology as well as users’ experience.

References

Bird, S.; Klein, E.; and Loper, E. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc.

Chang, A.; Monroe, W.; Savva, M.; Potts, C.; and Manning, C. D. 2015a. Text to 3D scene generation with rich lexical grounding. arXiv preprint arXiv:1505.06289.

Chang, A.; Savva, M.; and Manning, C. D. 2014. Semantic parsing for text to 3D scene generation. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, 17–21.

Chang, A. X.; Eric, M.; Savva, M.; and Manning, C. D. 2017. SceneSeer: 3D scene design with natural language. arXiv preprint arXiv:1703.00050.

Chang, A. X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. 2015b. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012.

Coyne, B.; and Sproat, R. 2001. WordsEye: An automatic text-to-scene conversion system. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 487–496.

Dhamo, H.; Farshad, A.; Laina, I.; Navab, N.; Hager, G. D.; Tombari, F.; and Rupprecht, C. 2020. Semantic image manipulation using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5213–5222.

Forbes, M.; and Choi, Y. 2017. Verb physics: Relative physical knowledge of actions and objects. arXiv preprint arXiv:1706.03799.

Galatolo, F. A.; Cimino, M. G.; and Vaglini, G. 2021. Generating images from caption and vice versa via CLIP-guided generative latent space search. arXiv preprint arXiv:2102.01645.

Johnson, J.; Krishna, R.; Stark, M.; Li, L.-J.; Shamma, D. A.; Bernstein, M. S.; and Fei-Fei, L. 2015. Image retrieval using scene graphs. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3668–3678. doi:10.1109/CVPR.2015.7298990.

Keshavarzi, M.; Parikh, A.; Zhai, X.; Mao, M.; Caldas, L.; and Yang, A. 2020. SceneGen: Generative contextual scene augmentation using scene graph priors. arXiv preprint arXiv:2009.12395.

Loper, E.; and Bird, S. 2002. NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028.

Manning, C. D.; Surdeanu, M.; Bauer, J.; Finkel, J. R.; Bethard, S.; and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.

Marks, S.; Estevez, J. E.; and Connor, A. M. 2014. Towards the Holodeck: Fully immersive virtual reality visualisation of scientific and engineering data. In Proceedings of the 29th International Conference on Image and Vision Computing New Zealand, 42–47.

Murray, J. H. 2017. Hamlet on the Holodeck: The Future of Narrative in Cyberspace. MIT Press.

Quattoni, A.; and Torralba, A. 2009. Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 413–420. IEEE.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.

Spector, W. 2013. Holodeck: Holy Grail or Hollow Promise? Part 1. URL https://www.gamesindustry.biz/articles/2013-07-31-holodeck-holy-grail-or-hollow-promise-part-1.

Swartout, W.; Hill, R.; Gratch, J.; Johnson, W. L.; Kyriakakis, C.; LaBore, C.; Lindheim, R.; Marsella, S.; Miraglia, D.; and Moore, B. 2006. Toward the Holodeck: Integrating graphics, sound, character and story. Technical report, University of Southern California, Institute for Creative Technologies, Marina del Rey, CA.

Wald, J.; Dhamo, H.; Navab, N.; and Tombari, F. 2020. Learning 3D semantic scene graphs from 3D indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3961–3970.

Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; and Catanzaro, B. 2018. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8798–8807.

Wang, X.; Yeshwanth, C.; and Nießner, M. 2020. SceneFormer: Indoor scene generation with transformers. arXiv preprint arXiv:2012.09793.

Yang, X. 2018. Scribbling Speech. URL https://experiments.withgoogle.com/scribbling-speech.