Towards an AI Holodeck: Generating Virtual Scenes from Sparse Natural Language Input

Jason Smith, Nazanin Alsadat Tabatabaei Anaraki, Atefeh Mahdavi Goloujeh, Karan Khosla, Brian Magerko
Georgia Institute of Technology, Atlanta, Georgia, USA
{jsmith775, nazanin.tbt, atefehmahdavi, kkhosla7}@gatech.edu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The Holodeck, a virtual reality simulator from the television show Star Trek, is known as the “holy grail” of interactive narrative experiences. However, while there have been approaches to various components of a theoretical Holodeck, scene generation from dialogue is often overlooked. This paper introduces a prototype AI Holodeck application for scene generation, demonstrating the use of Natural Language Processing and a corpus of spatial data. The application creates scenes from user input text and fills those scenes with objects and relationships not explicitly defined by the user. This paper discusses potential use cases of scene generation in creating environments for interactive narrative, virtual reality, and other development opportunities.

1 Introduction

The Holodeck is a fictional virtual reality device in the television show Star Trek, taking the form of a blank room that generates interactive characters and objects dictated by voice commands from the people inside it. It represents a “holy grail” of interactive virtual reality (Spector 2013), has been at the forefront of discussion of the role of AI in digital storytelling (Murray 2017), and has been the inspiration for a number of projects integrating narrative with visual and audio generation (Swartout et al. 2006; Marks, Estevez, and Connor 2014). These prior approaches have focused on graphics and visualization over the mechanisms of scene generation, the processes with which interactive applications create spatial environments from user input criteria.

For example, a Holodeck-inspired scene generation system could be used in the following scenario: Sarah has an idea for a game about a detective. She is thinking through the events of the game but is having a hard time imagining the space, so she uses the AI Holodeck to see how the space would look. “Holodeck, give me a detective’s office.” The Holodeck renders an office using objects sourced from a database: there is a desk and a chair, a window behind the desk, and a shelf in the corner. She then thinks that a desk lamp could make it more mysterious at night. “Holodeck, put a lamp on the desk.” The Holodeck adds a lamp on the desk and a notebook beside it. She thinks, “Yes, there should be a notebook on the desk!” But she feels the metal desk looks really rough in the office. She says, “Holodeck, give me a wooden desk.” The Holodeck renders the new wooden desk by adding “wooden” to its initial database search for desks and replacing the original desk.

To facilitate the creation of scenes, and to populate them, some semantically annotated datasets are currently available that categorize objects with relative positions and sizes (Forbes and Choi 2017; Chang et al. 2015b).

Scene generation applications are able to parse these datasets to create a “scene template”, a constrained mapping of a scene’s objects and the basic spatial relationships between them (Chang, Savva, and Manning 2014). Systems like these use Natural Language Processing pipelines, such as the techniques in the CoreNLP library (Manning et al. 2014), to add items specified in a user’s input text to a scene.

However, the addition of elements that are both 1) unspecified by the user and 2) gathered from semantically annotated datasets in order to maintain relevance to the user-described scene is missing from current research. Therefore, this paper aims to address the following research question: can semantically annotated datasets be used to extract context-informed scene templates in a text-to-scene generation application, including items not specified in the input text?

In this paper, we introduce an AI Holodeck application. This system draws from previous NLP and scene generation work in order to create scenes with appropriate elements that were not specified by the user. For example, if the user inputs a farm, the application may populate the scene with things commonly found on a farm, such as cows, hay bales, and fields of crops. The AI Holodeck can be used in a variety of scenarios: examples include visual story generation, game design and prototyping, interior design sketching or idea generation, and creating virtual or fantasy worlds.

The remainder of this paper explains how our system draws from and synthesizes existing work in the domains of natural language processing and scene generation, details the design of our AI Holodeck application, and discusses potential applications for integrating scenes generated by this system with other interactive media.

2 Related Work

2.1 Scene Recognition

Scene graphs have been extensively used and explored in the contexts of scene understanding and semantic image captioning (Johnson et al. 2015). Representing the objects in a scene as nodes and their relationships as the edges of a graph makes it possible to represent the content of a scene and to generate new scenes or manipulate existing ones by modifying the corresponding scene graph (Dhamo et al. 2020). Once the objects and their relationships are learned, this information can be used to place objects meaningfully in the scene. Scene graphs can be based on 3D scenes, reconstructed 3D scenes (Wald et al. 2020), images, or text-image combinations.

Depending on the data format, different approaches have been devised to extract the scene content. SceneGen focuses on novel representations of scene graphs that embed the position and orientation information of a set of objects present in a given room to achieve the most realistic placement (Keshavarzi et al. 2020).
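As a concrete illustration of the node-and-edge representation described above, the sketch below shows one minimal way a scene graph could be stored in code. The class layout, object names, and relation labels are assumptions made for this example, not a structure taken from any of the cited systems.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Objects as nodes (name -> attributes), relationships as labeled edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (subject, relation, object) triples

    def add_object(self, name, **attributes):
        self.nodes[name] = attributes

    def relate(self, subject, relation, obj):
        self.edges.append((subject, relation, obj))

# Build a toy graph; "manipulating" the scene amounts to editing nodes or edges.
graph = SceneGraph()
graph.add_object("cat", color="black")
graph.add_object("chair", material="wood")
graph.add_object("desk")
graph.relate("cat", "on", "chair")
graph.relate("chair", "left_of", "desk")
print(graph.edges)
```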
Other literature has investigated learning the implicit positional relationships between objects using transformers and attention mechanisms. In SceneFormer (Wang, Yeshwanth, and Nießner 2020), the authors represent 3D objects and their corresponding environment through a sequence of numbers encoding object category, object location in the environment, object orientation, and the dimensions of the room.

CLIP, a recent transformer-based model, connects text and images to understand image content. This discriminative model can be used to predict image content at scale by encoding the image and caption in parallel (Radford et al. 2021). In combination with generative models it can be used to generate scenes (Galatolo, Cimino, and Vaglini 2021), but the effectiveness of this approach depends on the level of precision and detail required for the generated scene.

We infer object categories and their positional relationships directly from images for two reasons: 1) images are more accessible than 3D models, and more datasets are available; and 2) images function as representations of real-world situations and convey properties, such as messy desks or cluttered rooms, that may later be manipulated in 3D scenes.

2.2 Text-to-Scene Generation

WordsEye (Coyne and Sproat 2001), one of the earlier works on text-to-scene generation, relies on explicit descriptions following a template of objects and their positions. Requiring specific inputs in the format “the [object] is a [distance] [position] the [object]” makes for a rigid and unnatural user experience.

Annotating 3D datasets with natural language descriptions is one approach to improving on this unnatural experience (Chang et al. 2015a). However, the text query is not the only component that contributes to a natural user experience. SceneSeer breaks the problem of text-to-scene generation down into scene parsing, scene inference, scene generation, and scene interaction.

To avoid the unnatural language caused by strict input requirements like those of WordsEye, SceneSeer also brings in objects that are not mentioned but are relevant to the mentioned objects, selecting these inferred objects by searching the object hierarchy and bringing in the explicit object’s parent objects with the highest probability (Chang et al. 2017). In addition to this, our approach considers the environment and selects objects that have the highest probability of co-appearing with the explicit objects in that specific context. Our focus is not to produce the most sophisticated interior layout possible, but rather to demonstrate a natural language-based scene generation system that creates appropriate scenes fitting the user’s input contextually and thematically.

2.3 Scene Manipulation

We explored scene manipulation research from the perspectives of content and interaction. In terms of content, research on scene manipulation focuses on either scene-level or object-level manipulation, which target different purposes such as object removal or image blending.

Recent research investigates scene manipulation by updating the existing scene graph (Dhamo et al. 2020) and regenerating the scene. Other approaches explore modifying images using semantic label maps or boundary maps extracted from the image (Wang et al. 2018). However, due to the challenge of distinguishing different objects of the same type (e.g., several different cars in a scene), our approach is limited to single instances of an object in a scene.

In terms of interaction, prior research has explored different methods or modes of interaction that allow users to manipulate scenes. SceneSeer (Chang et al. 2017) enables users to manipulate the scene with textual commands like “replace the bowl with a red lamp”. In Scribbling Speech (Yang 2018), a speech-to-image generation tool, users interact with the interface through sound and modify the scene step by step using natural language, with objects placed at different depths of the scene. We use the scene graph modification approach in our refactoring process, and users can modify the scene by adding new queries. Unlike the above tools, which visually render a scene, our approach also hosts the Holodeck component models in an API-like format, allowing for integration into a variety of applications such as Unity.

3 System Design

As seen in Figure 1, our Holodeck scene generator contains a full pipeline to collect input text and form a visual representation of a scene. The system generates a scene template for any input text, determining objects and their locations so that they can be placed in a scene. Then, implicit connections between objects in our semantically annotated datasets are used to add additional nodes to the scene template, creating a more vibrant scene. Objects are then mapped to a graphical representation of the scene in sequence, while resolving any collisions between them. Finally, a lightweight interface was created to allow for an easily understood demonstration.

Figure 1: System flow for the AI Holodeck application.
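To make the flow in Figure 1 concrete, the toy sketch below strings the three stages together end to end. Each stage is reduced to a stub, and the vocabulary, co-occurrence table, and placeholder positions are all invented for the example; Sections 3.1-3.3 describe what the real stages do.

```python
def parse_to_scene_template(text):
    # Stage 1 (Sec. 3.1): extract explicit objects; spatial relations are omitted here.
    vocabulary = ["couch", "table", "chair", "desk"]
    return {word: {} for word in vocabulary if word in text.lower()}

def add_implicit_nodes(template, environment):
    # Stage 2 (Sec. 3.2): add objects that commonly co-occur in this environment.
    co_occurrence = {"office": ["rug", "window", "book"], "library": ["bookshelf", "lamp"]}
    for obj in co_occurrence.get(environment, []):
        template.setdefault(obj, {"implicit": True})
    return template

def place_objects(template):
    # Stage 3 (Sec. 3.3): assign placeholder positions; the real system stacks
    # bounding boxes and resolves collisions before visualizing the result.
    return {name: (2.0 * i, 0.0, 0.0) for i, name in enumerate(template)}

template = parse_to_scene_template("There is a couch and a table")
template = add_implicit_nodes(template, "office")
print(place_objects(template))
```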
3.1 Scene Templates

User input utterances are parsed with the CoreNLP library (Manning et al. 2014). The application separates sentences into dependency trees comprised of subjects, objects, and descriptors. Each subject and object is stored as a node, and each descriptor is stored as a property of that node. Descriptors concerning the relative position between objects (such as “above” or “below”) are stored inside properties specifying a cardinal direction. Phrases such as “on top of” and “over” are all treated as the same “above” direction, and phrases such as “beside” or “by” are set to either “left” or “right”.

These connections form a scene template (Chang, Savva, and Manning 2014), a collection of the various spatial relations between objects in a scene. The scene template allows all objects in a scene to be connected either directly or through an intermediate object, such as in Figure 2.

Figure 2: An example scene template, with inputs “There is a black cat on a wooden chair.” and “The chair is to the left of the desk”.

CoreNLP makes it possible to use an in-depth dependency parser, allowing complex sentence structures to be parsed. However, we also offer an offline version that uses the NLTK library (Bird, Klein, and Loper 2009) to extract objects, properties, and their corresponding locations.
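The sketch below illustrates the kind of normalization and node construction described in this subsection. To stay self-contained it uses keyword matching instead of a real CoreNLP or NLTK dependency parse, and the synonym table and template layout are simplified assumptions rather than the application’s actual data structures.

```python
import re

# Spatial phrases normalized to a single direction; "beside"/"by" collapse to one side.
DIRECTION_SYNONYMS = {
    "on top of": "above", "over": "above", "on": "above",
    "under": "below", "below": "below",
    "to the left of": "left", "to the right of": "right",
    "beside": "left", "by": "left",
}

def parse_sentence(sentence, template=None):
    """Add the objects and (at most) one spatial relation from a simple sentence."""
    template = {} if template is None else template
    text = sentence.lower().rstrip(".")
    # Try longer phrases first so "on top of" wins over "on".
    for phrase, direction in sorted(DIRECTION_SYNONYMS.items(), key=lambda kv: -len(kv[0])):
        pattern = (rf"(?:a|an|the)\s+(?:\w+\s+)*?(\w+)\s+(?:is\s+)?"
                   rf"{re.escape(phrase)}\s+(?:a|an|the)\s+(?:\w+\s+)*(\w+)")
        match = re.search(pattern, text)
        if match:
            subject, anchor = match.groups()
            template.setdefault(subject, {})[direction] = anchor
            template.setdefault(anchor, {})
            return template
    # No spatial phrase found: just register any "a/an <object>" mentions as bare nodes.
    for match in re.finditer(r"\b(?:a|an)\s+(\w+)", text):
        template.setdefault(match.group(1), {})
    return template

template = parse_sentence("There is a black cat on a wooden chair.")
template = parse_sentence("The chair is to the left of the desk.", template)
print(template)  # {'cat': {'above': 'chair'}, 'chair': {'left': 'desk'}, 'desk': {}}
```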
3.2 Implicit Nodes and Positional Relations

In order to create a more fleshed-out scene, our system adds additional nodes which are not explicitly mentioned by the user. Figure 3 provides an example of this concept. In this example, the user is creating an office space and has used the text “There is a couch and a table” as the first input. Since the “couch” and “table” are explicitly mentioned by the user, these nodes are created and used in the scene template. The addition of implicit nodes then allows the system to bring in other objects such as a rug, a window, and a book, because these are objects usually found near a couch or table in an office. If the object mentioned by the user is not usually found in such an environment, e.g., “a horse in an office”, our system searches for other objects that have a high probability of being found in an office space rather than objects that are typically found near a horse.

Figure 3: An example of adding implicit nodes, with the input sentence “There is a couch and a table.” being used to create an office space.

With this method, the objects surrounding an object depend on which environment that object is in. Figures 3 and 4 show this difference by using the same input sentence, “There is a couch and a table”, in different environments of “bedroom” and “library”.

Figure 4: An example of adding implicit nodes, with the input sentence “There is a couch and a table.” being used to create a library space.

Finally, our system prioritizes explicitly defined positional relations over the implicit relations created by the system. Figure 5 shows that a computer is implicitly brought into the scene after the two input sentences “There is a table” and “There is a chair”. In Figure 6, the user adds the input “The computer is on top of the table”, moving the computer to the explicitly specified position.

Figure 5: An example of adding implicit nodes. A computer is implicitly added to the scene.

Figure 6: An example of prioritizing explicit relations over implicit ones. The user requests the computer to be moved to the top of the table.
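A simplified version of the co-occurrence-based selection described in this subsection is sketched below. The co-occurrence table stands in for the JSON file described in the next subsection, and its contents, the fallback rule, and the per-object limit are assumptions made for the example rather than the system’s actual values.

```python
# Hypothetical co-occurrence data: environment -> object -> frequently nearby objects.
CO_OCCURRENCE = {
    "office":  {"couch": ["rug", "window", "book"], "table": ["lamp", "chair"]},
    "library": {"couch": ["bookshelf", "lamp"], "table": ["book", "chair"]},
}

def add_implicit_nodes(template, environment, per_object=2):
    env_table = CO_OCCURRENCE.get(environment, {})
    for explicit in list(template):
        candidates = env_table.get(explicit)
        if candidates is None:
            # Object is unusual for this environment ("a horse in an office"):
            # fall back to objects common in the environment overall.
            candidates = [obj for nearby in env_table.values() for obj in nearby]
        for obj in candidates[:per_object]:
            template.setdefault(obj, {"implicit": True})
    return template

template = {"couch": {}, "table": {}}
print(add_implicit_nodes(template, "office"))
# -> couch and table plus implicit rug, window, lamp, and chair nodes
```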
Datasets Used for Extracting Implicit Relations. In order to create a dataset of potential positional relations, we used the MIT Indoor Scenes Dataset (Quattoni and Torralba 2009), which contains 67 indoor categories and a total of 15,620 annotated JPEG images. We sorted the objects found in each indoor category based on the number of occurrences in that category. Additionally, for each object found in a specific category, we looked at the objects found at immediate and far distances from the specified object. We divided these surrounding objects based on their positional relation to the specified object (e.g., below, on top of) and sorted them based on the number of occurrences. We exported this information as a JSON file for our system’s use.
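The sketch below shows one way such a JSON export could be produced. Since the dataset itself consists of annotated images, the step that reduces each image to (category, object, relation, neighbour) tuples is omitted; the example tuples, file name, and key format are assumptions for illustration only.

```python
import json
from collections import Counter, defaultdict

# Assume each annotated image has already been reduced to relation tuples.
observations = [
    ("office", "desk", "above", "lamp"),
    ("office", "desk", "above", "computer"),
    ("office", "desk", "above", "computer"),
    ("office", "desk", "left", "chair"),
]

counts = defaultdict(Counter)
for category, obj, relation, neighbour in observations:
    counts[(category, obj, relation)][neighbour] += 1

# Sort neighbours by number of occurrences and export for the generator to load.
export = {
    f"{category}/{obj}/{relation}": [name for name, _ in counter.most_common()]
    for (category, obj, relation), counter in counts.items()
}
with open("implicit_relations.json", "w") as handle:
    json.dump(export, handle, indent=2)
```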
3.3 Scene Visualization

When generating a scene, the system places bounding boxes representing each object in the scene template into a 3D graph. It searches for sizes for each object in the ShapeNetSem metadata; if none are found, they are replaced with default values for the output graph. The algorithm used to prioritize object placement queues the objects on the bottom of a scene (objects with no “below” parameter) and recursively adds the objects in those objects’ “above” property to the graph, stacking objects on top of each other.

As each object is added to the graph, collisions are detected. Objects with lower priority (determined by their place in the scene template) are shifted in the direction corresponding to their property name until their bounding boxes no longer overlap with the other object. For example, if one object is “above” the other, it will be shifted vertically.
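The placement pass described above can be summarized in a few lines. The sketch below uses placeholder heights instead of ShapeNetSem sizes and a template layout in which each node lists the objects stacked above it, so it is an illustration of the stacking and collision-shifting idea rather than the system’s implementation.

```python
DEFAULT_HEIGHT = 1.0

def place(template, heights=None):
    heights = heights or {}
    positions = {}  # name -> (x, bottom_z, height)

    def place_at(name, x, z):
        h = heights.get(name, DEFAULT_HEIGHT)
        # Collision resolution: shift upward past any box already in this column.
        for other_x, other_z, other_h in positions.values():
            if other_x == x and z < other_z + other_h:
                z = other_z + other_h
        positions[name] = (x, z, h)
        # Recursively stack everything listed in this object's "above" property.
        for child in template.get(name, {}).get("above", []):
            place_at(child, x, z + h)

    # Queue the objects on the bottom of the scene: those with no "below" parent.
    floor_objects = [n for n, props in template.items() if "below" not in props]
    for i, name in enumerate(floor_objects):
        place_at(name, x=2.0 * i, z=0.0)
    return positions

template = {"table": {"above": ["computer"]}, "computer": {"below": "table"}, "chair": {}}
print(place(template))
```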
3.4 Interface Design

The AI Holodeck application uses a Tkinter interface (https://docs.python.org/3/library/tkinter.html; see Figure 7), activated from the command line.

The application opens a window with a menu for selecting a scene found in the Indoor Scenes dataset, a prompt for entering text or microphone input, and a display of the objects currently registered in the scene. When a user selects the “Create Graph” button, objects found in the input text are added to the list of objects. The scene is then displayed as a movable 3D matplotlib (https://matplotlib.org/) graph in a separate window.

The application allows for a number of command line arguments. Mode selects either text or voice input. When vocal input is activated, a recording button is added to the interface beside the text box. Pressing this button activates a continuous microphone stream until a sentence is recognized, which then populates the text field and automatically activates the graph creation function. Model selects either NLTK (Loper and Bird 2002) or CoreNLP (Manning et al. 2014) as the model used to generate dependency parses. The NLTK model is usable offline, while CoreNLP requires a separate command line prompt to start a server with an internet connection. However, the CoreNLP model allows for more variety in sentence structure. Examination of future iterations of this system will include a comparison of error rates between the two models.

Figure 7: User interface for the AI Holodeck application.
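For readers who want a sense of how such switches could be wired up, the fragment below sketches the two arguments with argparse. The flag names, choices, and defaults are assumptions made for illustration and do not reflect the application’s actual command line.

```python
import argparse

parser = argparse.ArgumentParser(description="AI Holodeck demo interface (illustrative)")
parser.add_argument("--mode", choices=["text", "voice"], default="text",
                    help="take input from the text box or from the microphone")
parser.add_argument("--model", choices=["nltk", "corenlp"], default="nltk",
                    help="dependency model: NLTK runs offline; CoreNLP needs a "
                         "separately started server but handles more sentence structures")
args = parser.parse_args()
print(f"mode={args.mode}, model={args.model}")
```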
4 Discussion

A fully realized AI Holodeck application will require a relative positioning and collision detection system that allows for more spatial relationships than just “above”, “below”, “left”, and “right”. In particular, size-dependent relationships such as “inside” will allow generated scenes to have a greater amount of realism and variety.

This system is also limited in the fidelity of the visualizations it is able to create. Objects are represented only as a bounding box labeled with the object’s name. A more sophisticated visualization application would index a database of 3D models in order to dynamically populate generated scenes with appropriate representations of the objects inside.

Additionally, we plan to modify the system to allow for the manual removal and repositioning of objects. As users correct the system output to fit the needs of the scene they are trying to create, the stored database will update with new spatial relationships, and as such the system will be able to learn from generated scenes.

Our system is designed to afford future modifications. For example, in this phase of the project we have used the MIT dataset to extract possible positional relationships between various objects. These positional relationships can be continually modified or extended using other datasets, not limited to visual datasets, such as collections of narrative text. Other techniques, such as deep learning, could also help remove our need for annotated visual datasets by extracting spatial relations automatically from other collections of images or narrative text.

The visualization of objects can be modified and extended to various platforms. Our system provides the scene template and graphical representation as formatted data and data structures extracted from the input text, which can be used by various platforms to create a detailed visualization. As explained in the system design section of this paper, this data includes the various objects in the scene, their properties, positional relations, and center points for placing the objects in the scene. Hence, visual modification could take the form of either changing the visualization platform or adding new objects to the dataset of 2D/3D models used by these platforms.

5 Future Work

In future iterations of the system, we will include a separate narrative text interpretation module. This module will comprise a series of models trained on literature, which will provide additional scene details given a user’s starting input. The current implementation, for both the NLTK and CoreNLP models, primarily handles simple sentences composed of subject-object pairs in each clause. Training models on literature will enhance the system’s ability to capture information from input sentences with a higher structural variety.

Other prospective improvements to the software are documentation and user functionality to facilitate connections to software such as Unity and virtual reality for integration with game development. This addition will be used to explore the manipulation of objects already in a scene, and the placement or movement of objects with both vocal commands and gestures. Additionally, once objects are generated in a higher-fidelity graphical environment, modifiers can be extracted from the input. These modifiers include adjectives describing the scene and the objects within it. They will be added to the scene template and transmitted to the external environment in order to generate visuals more appropriate for the user query.

Future work will also include evaluation of the application. The first evaluation will include separate analyses of the NLP and scene generation models, in terms of the precision of the sentences they parse as well as the relevance of the objects generated for a scene. System performance will also be measured in terms of stability and framerate while generating large numbers of objects. Finally, we will conduct user studies measuring the strength of the system as a piece of interactive technology as well as users’ experience.

References

Bird, S.; Klein, E.; and Loper, E. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc.

Chang, A.; Monroe, W.; Savva, M.; Potts, C.; and Manning, C. D. 2015a. Text to 3D scene generation with rich lexical grounding. arXiv preprint arXiv:1505.06289.

Chang, A.; Savva, M.; and Manning, C. D. 2014. Semantic parsing for text to 3D scene generation. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, 17–21.

Chang, A. X.; Eric, M.; Savva, M.; and Manning, C. D. 2017. SceneSeer: 3D scene design with natural language. arXiv preprint arXiv:1703.00050.

Chang, A. X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. 2015b. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012.

Coyne, B.; and Sproat, R. 2001. WordsEye: An automatic text-to-scene conversion system. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 487–496.

Dhamo, H.; Farshad, A.; Laina, I.; Navab, N.; Hager, G. D.; Tombari, F.; and Rupprecht, C. 2020. Semantic image manipulation using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5213–5222.

Forbes, M.; and Choi, Y. 2017. Verb physics: Relative physical knowledge of actions and objects. arXiv preprint arXiv:1706.03799.

Galatolo, F. A.; Cimino, M. G.; and Vaglini, G. 2021. Generating images from caption and vice versa via CLIP-guided generative latent space search. arXiv preprint arXiv:2102.01645.

Johnson, J.; Krishna, R.; Stark, M.; Li, L.-J.; Shamma, D. A.; Bernstein, M. S.; and Fei-Fei, L. 2015. Image retrieval using scene graphs. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3668–3678. doi:10.1109/CVPR.2015.7298990.

Keshavarzi, M.; Parikh, A.; Zhai, X.; Mao, M.; Caldas, L.; and Yang, A. 2020. SceneGen: Generative contextual scene augmentation using scene graph priors. arXiv preprint arXiv:2009.12395.

Loper, E.; and Bird, S. 2002. NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028.

Manning, C. D.; Surdeanu, M.; Bauer, J.; Finkel, J. R.; Bethard, S.; and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.

Marks, S.; Estevez, J. E.; and Connor, A. M. 2014. Towards the Holodeck: Fully immersive virtual reality visualisation of scientific and engineering data. In Proceedings of the 29th International Conference on Image and Vision Computing New Zealand, 42–47.

Murray, J. H. 2017. Hamlet on the Holodeck: The Future of Narrative in Cyberspace. MIT Press.

Quattoni, A.; and Torralba, A. 2009. Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 413–420. IEEE.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.

Spector, W. 2013. Holodeck: Holy Grail or Hollow Promise? Part 1. URL https://www.gamesindustry.biz/articles/2013-07-31-holodeck-holy-grail-or-hollow-promise-part-1.

Swartout, W.; Hill, R.; Gratch, J.; Johnson, W. L.; Kyriakakis, C.; LaBore, C.; Lindheim, R.; Marsella, S.; Miraglia, D.; and Moore, B. 2006. Toward the Holodeck: Integrating graphics, sound, character and story. Technical report, University of Southern California, Institute for Creative Technologies, Marina del Rey, CA.

Wald, J.; Dhamo, H.; Navab, N.; and Tombari, F. 2020. Learning 3D semantic scene graphs from 3D indoor reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3961–3970.

Wang, T.-C.; Liu, M.-Y.; Zhu, J.-Y.; Tao, A.; Kautz, J.; and Catanzaro, B. 2018. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8798–8807.

Wang, X.; Yeshwanth, C.; and Nießner, M. 2020. SceneFormer: Indoor scene generation with transformers. arXiv preprint arXiv:2012.09793.

Yang, X. 2018. Scribbling Speech. URL https://experiments.withgoogle.com/scribbling-speech.