Semantically Annotated 3D Material Supporting the Design of Natural User Interfaces for Architectural Heritage

Valeria Cera
Department of Architecture, University of Naples "Federico II", Naples, Italy
valeria.cera@unina.it

Antonio Origlia
URBAN/ECO Research Center, University of Naples "Federico II", Naples, Italy
antonio.origlia@unina.it

Francesco Cutugno
Department of Electrical Engineering and Information Technology, University of Naples "Federico II", Naples, Italy
cutugno@unina.it

Massimiliano Campi
Department of Architecture, University of Naples "Federico II", Naples, Italy
campi@unina.it

ABSTRACT
With the advent of artificial intelligence and natural user interfaces, the need for multimedia material that can be semantically interpreted in real time becomes critical. In the field of 3D architectural survey, a significant amount of research has been conducted to allow domain experts to represent semantic data while keeping spatial references. Such data becomes valuable for natural user interfaces designed to let non-expert users obtain information about architectural heritage. In this paper, we present the architectural data collection and annotation procedure adopted in the Cultural Heritage Resources Orienting Multimodal Experiences (CHROME) project. This procedure aims at providing conversational agents with fast access to fine-detailed semantic data linked to the available 3D models. We will discuss how this will make it possible to support multimodal user interaction and generate cultural heritage presentations.

CCS CONCEPTS
• Human-centered computing → User centered design; Information visualization;

KEYWORDS
Semantic annotation, architectural survey, interaction design

ACM Reference Format:
Valeria Cera, Antonio Origlia, Francesco Cutugno, and Massimiliano Campi. 2018. Semantically Annotated 3D Material Supporting the Design of Natural User Interfaces for Architectural Heritage. In Proceedings of 2nd Workshop on Advanced Visual Interfaces for Cultural Heritage (AVI-CH 2018). Vol. 2091. CEUR-WS.org, Article 7. http://ceur-ws.org/Vol-2091/paper7.pdf, 4 pages.

AVI-CH 2018, May 29, 2018, Castiglione della Pescaia, Italy
© 2018 Copyright held by the owner/author(s).

1 INTRODUCTION AND RELATED WORK
Recent advances in graphics hardware, together with the availability of professional video-game engines, have opened a number of possibilities to develop innovative approaches for cultural heritage presentation. The use of game engines has been shown to produce beneficial effects on interaction quality with systems based on advanced knowledge representation and dialogue-based interaction (e.g. [6]). In particular, the use of conversational agents, represented in the form of 3D avatars moving in virtual reconstructions, provides a natural way to access information. Establishing a dialogue with an artificial character is becoming an increasingly frequent way to interact with technological devices.

The annotation of digital models lets scholars associate spatial shapes with the heterogeneous data describing them through the use of semantic descriptors. The most relevant approach to this kind of semantic annotation is presented in [1] and is based on the geometrical segmentation of architectural digital artefacts. These become collections of separate elements, organised using part-whole relationships. Each entity is identified by a precise concept in a specialised domain thesaurus: the architectural dictionary. Different geometrical representations (point clouds, NURBS, textured meshes, etc.) are linked to the objects represented by the terms included in the dictionary, depending on the specific descriptive objectives. Each geometrical element can be linked to a single semantic descriptor, while a semantic descriptor may be associated with multiple geometrical elements. More recently, the original methodology has been updated [5] and implemented as a cloud-based service called Aioli (www.aioli.cloud/). Using the projective relationship between bidimensional and tridimensional representations, the semantic annotation of digital models obtained through a set of reference images is produced by segmenting the same reference images, thus removing the need for a geometrical segmentation. Images sharing the same semantic label may be linked to one or more specific terms in a controlled vocabulary, or they may be characterised with customised attributes.

Semantically annotated 3D models contain a significant amount of data that, to promote cultural heritage, may be used to let non-expert users navigate cultural contents by developing interactive technologies. These technologies should be designed to assist the exploration of the large amount of information available for cultural heritage (texts, images, 3D models, etc.) in an engaging way. To tackle this problem, we pursue the use of conversational agents, in the form of 3D avatars, immersed in the digital representations of cultural artefacts. With semantic processing techniques coming from different domains (e.g. Natural Language Processing, Computer Vision), semantic labels make it possible to link separate sources of information and generate a consistent presentation.

In this paper, we present the architectural data collection pipeline we adopted to obtain the 3D meshes representing relevant parts of the San Lorenzo Charterhouse in Padula (Italy) and how we annotated them with semantic information. The obtained data represent a multi-faceted documentation of architectural heritage describing both geometrical detail and visual experience. We also present the work in progress on a software architecture designed to link the semantically enriched 3D data to textual resources describing the represented artefact. This architecture will be used to support natural user interaction [10] through the use of Social Signal Processing [8] techniques and game engines.

2 DATA COLLECTION
The Charterhouse of San Lorenzo, in Padula, and its monumental staircase represent the selected case study, which is used to test the developed pipeline, spanning from the 3D data acquisition to the semantic annotation process. The staircase, made of local white stone, was built towards the end of the eighteenth century and has an elliptical plan and a double ramp. Closed outside by an octagonal tower, it leads to the first floor of the great cloister, used by the Carthusians for their weekly walk. Several surveying techniques were employed to produce a 3D reality-based model, suitable for dissemination purposes in virtual and interactive environments. In order to obtain a geometrically accurate 3D model, the survey was performed using a terrestrial laser scanner (TLS). Given the morphology of the staircase, its materials and colors, the geometrical data has been integrated with data collected during a photogrammetric campaign. This results in a physically accurate model that also delivers a photorealistic view of the surveyed cultural site, based on state-of-the-art techniques.

Starting from the entrance, the positions of the different acquisitions have been organised to cover the entire volume of the monument, taking into account the tangency of the surfaces and shadows. A continuous-wave Faro Focus 3D S120 laser scanner was used to perform a total of 40 scans, positioning the scanner uniformly along an ascending path on the eastern side and a descending one on the western side, with a spatial resolution of 6 mm at 10 m. A terrestrial photogrammetric survey was carried out mainly for texturing purposes. Using a Canon EOS 1300D reflex camera and an 18-55 mm zoom lens set at 24 mm, about 380 images were acquired to obtain better color information for the final texturing of the 3D digital model.

Figure 1: Normal map of a sample segment. RGB values represent the normal vector coefficients driving the lighting simulation of details.

Figure 2: Color map of a sample segment. RGB information is computed by comparing high and low poly meshes.

3 DATA PROCESSING
The complete range-based 3D point cloud was obtained employing a classical processing procedure: the adjacent TLS stations were aligned using a rigid transformation based on planar printed checkerboard targets and spheres. A final point cloud of about 500 million points was obtained.

After a manual cleaning of vegetation and artefacts caused by noise, a polygonal mesh model was generated using a Delaunay triangulation algorithm. A final mesh of about 392.4 million triangles was obtained this way. Once the editing of the triangulated model was completed, texture mapping was carried out using the images from the photogrammetric survey. To optimise the computational management of the models during online rendering, instead of generating a single whole mesh, the process was set to divide the result into subparts. Each part is defined by an automatic subdivision of the model, constrained to a maximum of 5 million vertices per subpart.

Figure 3: The rendered 3D model of the great staircase in the San Lorenzo Charterhouse.

Considering the aims of dissemination and communication, the textured model was then simplified through geometric optimisation. The quadric edge collapse algorithm was applied to obtain a polygonal mesh that allowed fluid real-time rendering while preserving an adequate level of perceived detail. The mesh produced with this procedure was collapsed with a target of 1% of the vertices of the initial mesh.
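As a minimal sketch of the cost function that commonly drives this kind of edge collapse simplification, the following Python fragment accumulates plane quadrics and evaluates the error of a candidate vertex position. It assumes the Garland and Heckbert quadric error metric; the text does not state which tool or variant was used, and all function names here are hypothetical.

```python
def plane_from_triangle(a, b, c):
    """Return (nx, ny, nz, d) for the triangle's plane, with a unit normal.

    Points on the plane satisfy nx*x + ny*y + nz*z + d = 0.
    Assumes the triangle is not degenerate (non-zero area).
    """
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    # Cross product of the two edge vectors gives the face normal.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    length = (nx * nx + ny * ny + nz * nz) ** 0.5
    nx, ny, nz = nx / length, ny / length, nz / length
    return (nx, ny, nz, -(nx * a[0] + ny * a[1] + nz * a[2]))


def quadric(plane):
    """4x4 fundamental quadric of a plane: outer product of (nx, ny, nz, d)."""
    return [[plane[i] * plane[j] for j in range(4)] for i in range(4)]


def add_quadrics(q1, q2):
    """Element-wise sum; a vertex quadric is the sum over its adjacent faces."""
    return [[q1[i][j] + q2[i][j] for j in range(4)] for i in range(4)]


def collapse_cost(q, point):
    """Sum of squared distances from `point` to the planes accumulated in `q`."""
    h = (point[0], point[1], point[2], 1.0)  # homogeneous coordinates
    return sum(h[i] * q[i][j] * h[j] for i in range(4) for j in range(4))


# Hypothetical example: two coplanar triangles (plane z = 0) sharing a vertex.
t1 = plane_from_triangle((0, 0, 0), (1, 0, 0), (0, 1, 0))
t2 = plane_from_triangle((0, 0, 0), (0, 1, 0), (-1, 0, 0))
q = add_quadrics(quadric(t1), quadric(t2))
print(collapse_cost(q, (1, 1, 2)))   # point at distance 2 from both planes
print(collapse_cost(q, (5, -3, 0)))  # point lying on both planes
```

In a full decimation loop, the quadrics of the two endpoints of an edge would be summed and the edge with the lowest cost collapsed first, repeating until the vertex budget (here, 1% of the initial mesh) is reached.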
The deviation between the original and the decimated meshes was measured by calculating the Hausdorff distance. The approximation error was below 1 cm. To retain geometrical fidelity in the visualisation task, we compensate for this error by computing normal maps that result from the comparison of the high poly and the low poly meshes, shown in Figure 1. Following the same approach used to bake normals on the low poly mesh, color information is baked using the high poly mesh. The resulting color texture is shown in Figure 2. This way, although geometrical data is lost during decimation, the simulated behaviour of light in the rendering engine takes into account the effect of the removed details. A rendered example of the final result is shown in Figure 3.

From a cultural heritage documentation point of view, it is desirable to preserve both geometrical fidelity and the visual experience. Considering that the error of the measures acquired with the laser scanner is approximately 2 mm, for geometrical documentation purposes an error of 1 cm is considered significant, so the high poly mesh must be stored. On the other hand, to document the visual experience, it is only necessary to retain the effect geometry has on lighting. Normal maps make it possible to retain these effects, although the original geometry is not present in the low poly mesh, and let a rendering engine keep high frame rates. To improve the final quality of the visualised 3D models, Ambient Occlusion maps were also computed.

4 ANNOTATION
Semantic annotation consists of linking abstract concepts to the relevant parts of a 3D mesh. Since these concepts must be represented in standard formats in order to be recognisable, a reference source must be selected. In our case, we refer to the Art and Architecture Thesaurus (AAT) [7]: a controlled domain vocabulary containing generic terms and other data concerning the represented concepts. These are connected using hierarchical, equivalence and associative relationships. The annotation consists of maps describing semantic concepts, applied to the 3D model like a texture, thus avoiding the need to geometrically segment the architectural artefact. For each concept in the AAT found in the digital architectural model, a semantic map is created and assigned the same unique ID under which the concept it represents is recorded in the AAT. As an improvement with respect to previous approaches, the semantic information is represented as a grayscale map: each map uses the model's UV map to record which polygons in the digital model are relevant for the concept it represents. In our approach, white indicates high relevance, while black indicates no relevance. An example of a semantic map is shown in Figure 4. Using semantic maps and reference IDs for the annotated concepts allows the integration of multiple sources of information (texts, images, audio recordings, etc.) sharing the same annotation scheme. Cross-referencing these sources opens the possibility of producing advanced interfaces that link the descriptions a specific artefact has in separate domains.

Figure 4: A semantic map for the pediment concept.

The possibility of using gradients in the map lets annotators refine the quality of the semantic data. This way, it is possible to express, rather than a binary relevance of each vertex for a given concept, a relevance level for that concept. This is important in the field of architectural heritage, as it is not always possible to classify an element in a unique and precise way, and becomes useful when an architectural element cannot be assigned to a specific category. The same applies to situations in which it is not possible to indicate exactly where one architectural element becomes another. This also makes it possible to combine semantic maps produced by multiple annotators into a final map by computing the mean value for each UV coordinate, similarly to what has been done in other fields where annotation uncertainty is important, such as emotion annotation [4].

5 INTERACTION MANAGEMENT
In the scenario of automated information providers for architectural heritage, semantically annotated material can be used to generate cross-domain presentations. Moreover, the possibility of adding virtual characters to the scene allows the elicitation of social signals and the use of natural, multimodal commands. To take advantage of these possibilities, we designed a software architecture combining specialised modules for interaction management and knowledge representation. This architecture includes: a) a graph database (Neo4j) to represent knowledge, b) a dialogue manager (OpenDial) to handle interaction, c) a game engine (Unreal Engine 4) to control the virtual character, d) a voice synthesizer (Mivoq) and e) a Kinect sensor to collect users' activity data.

Neo4j [9] is an open source graph database manager that has been applied to a large number of tasks related to data representation (e.g. [2]). OpenDial [3] is a dialogue management framework based on probabilistic rules, aiming at merging the best of rule-based and probabilistic dialogue management. Probabilistic rules, in OpenDial, are used to set up and update a Bayesian network consisting of variables that represent the current dialogue state, including uncertainty. A utility-based approach is used to compute the next system action, if any. The virtual avatar and the real-time rendering of the obtained artefacts are controlled using Unreal Engine 4 (www.unrealengine.com). The voice of the avatar is dynamically generated using the Mivoq Voice Synthesis Engine (www.mivoq.it). User gestures and speech are detected with the Kinect sensor and are continuously forwarded to the game engine.

With this approach, the task of handling raw user data is assigned to a module designed to manage complex, dynamic interfaces that include video, audio and user control systems, while high-level decision processes are delegated to the dialogue manager. This component accesses the encyclopedic knowledge represented in the graph database to select the most appropriate response to a user input. The response consists of an abstract plan that may include text extracted from the knowledge base, clarification requests or generic action instructions (e.g. enter another environment). Deciding how to implement the action is assigned to the game engine. This choice is motivated by the dynamic nature of the interaction between the users and the avatar, but also between the avatar and the 3D surroundings: a reactive behavioural logic is needed to manage interrupts caused by both implicit and explicit user activity. Also, the generation and monitoring of social signals must be performed in real time to ensure consistency with the users' behaviour. Lastly, the relative position of the avatar with respect to the concepts that are relevant for the generated utterances must also be evaluated in real time in order to generate pointing gestures.

To support multimodal commands from the users and allow a richer interaction, the user skeletons provided by the Kinect sensor will be used: by exploiting the raycasting system included in the engine, it is possible to emit a single, invisible ray from the tip of the arm bone to capture collision events between the ray and the objects in the scene. From the data included in the collision event, it is then possible to retrieve the UV coordinates of the vertex that is closest to the collision point. These UV coordinates can then be used to query the semantic maps of the object the ray collided with to extract relevance information for the annotated concepts. This information can then be passed to the dialogue manager when a speech command is detected and multimodal fusion has been performed. The details of the interaction management strategy will be formalised on the basis of audiovisual recordings of expert art historians presenting the Campanian Charterhouses to small groups of visitors, which are currently being collected in the framework of the CHROME project.

6 CONCLUSIONS AND FUTURE WORK
We have presented the work in progress in the framework of the CHROME project. We have described the workflow leading from 3D architectural data collected with laser scanners and photogrammetry to an interactive system designed to present such data in a rich and entertaining way. Using high and low poly meshes with normal maps to retain the necessary details in real-time rendering, we document architectural heritage from both a geometrical and a visual experience point of view. Furthermore, an original method to semantically annotate the low poly meshes has been developed to allow a direct link between concepts in the AAT thesaurus and geometric parts, introducing the possibility of representing uncertainty. The semantic data produced with this workflow will allow the development of 3D conversational agents able to refer to the reconstructed environment.

7 ACKNOWLEDGMENTS
Antonio Origlia's work is funded by the Italian PRIN project Cultural Heritage Resources Orienting Multimodal Experiences (CHROME) #B52F15000450001.

REFERENCES
[1] Livio De Luca. 2006. Relevé et multi-représentations du patrimoine architectural: Définition d'une approche hybride pour la reconstruction 3D d'édifices. Ph.D. Dissertation. Sciences de l'Homme et Société. Arts et Métiers ParisTech.
[2] Felix Dietze, Johannes Karoff, André Calero Valdez, Martina Ziefle, Christoph Greven, and Ulrik Schroeder. 2016. An Open-Source Object-Graph-Mapping Framework for Neo4j and Scala: Renesca. In International Conference on Availability, Reliability, and Security. Springer, 204–218.
[3] Pierre Lison and Casey Kennington. 2016. OpenDial: A toolkit for developing spoken dialogue systems with probabilistic rules. ACL 2016 (2016), 67.
[4] G. McKeown, M. F. Valstar, R. Cowie, and Maja Pantic. 2010. The SEMAINE Corpus of Emotionally Coloured Character Interactions. In Proc. of ICME. 1079–1084.
[5] Tommy Messaoudi, Philippe Véron, Gilles Halin, and Livio De Luca. 2018. An ontological model for the reality-based 3D annotation of heritage building conservation state. Journal of Cultural Heritage 29 (2018), 100–112.
[6] Antonio Origlia, Piero Cosi, Antonio Rodà, and Claudio Zmarich. 2017. A dialogue-based software architecture for gamified discrimination tests. In Proc. of GHItaly. http://ceur-ws.org/Vol-1956/
[7] Toni Petersen. 1990. Developing a New Thesaurus for Art and Architecture. Library Trends 38, 4 (1990), 644–658.
[8] Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. Social signal processing: Survey of an emerging domain. Image and vision computing 27, 12 (2009), 1743–1759.
[9] Jim Webber. 2012. A programmatic introduction to Neo4j. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity. ACM, 217–218.
[10] Daniel Wigdor and Dennis Wixon. 2011. Brave NUI world: designing natural user interfaces for touch and gesture. Elsevier.
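The semantic map lookup outlined in Section 5, together with the multi-annotator averaging mentioned in Section 4, can be sketched as follows. This is only an illustration under stated assumptions: maps are held as nested lists of grayscale values in the range 0 to 255, UV coordinates lie in [0, 1] with row 0 at v = 0, lookup is nearest-texel, and the concept identifiers are placeholders rather than real AAT IDs. None of the function names come from the described system.

```python
def mean_map(maps):
    """Per-texel mean of equally sized grayscale maps (one per annotator)."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(width)]
            for r in range(height)]


def relevance_at(semantic_map, u, v):
    """Relevance in [0, 1] at UV coordinates: white (255) -> 1.0, black -> 0.0.

    Assumption: row 0 corresponds to v = 0; a real engine may flip the v axis.
    """
    height, width = len(semantic_map), len(semantic_map[0])
    col = min(max(int(u * width), 0), width - 1)   # clamp to valid texels
    row = min(max(int(v * height), 0), height - 1)
    return semantic_map[row][col] / 255.0


def query_concepts(maps_by_concept, u, v, threshold=0.0):
    """Return {concept_id: relevance} for concepts above `threshold` at (u, v)."""
    scores = {cid: relevance_at(m, u, v) for cid, m in maps_by_concept.items()}
    return {cid: s for cid, s in scores.items() if s > threshold}


# Hypothetical 2x2 maps from two annotators for the same concept.
annotator_a = [[255, 0], [0, 0]]
annotator_b = [[255, 255], [0, 0]]
merged = mean_map([annotator_a, annotator_b])

# Placeholder concept IDs; a real system would use the AAT identifiers.
maps_by_concept = {"concept_A": merged, "concept_B": [[0, 0], [0, 0]]}
print(query_concepts(maps_by_concept, u=0.75, v=0.25))
```

The dictionary returned by `query_concepts` stands in for the relevance information that would be forwarded to the dialogue manager once a pointing gesture has been resolved to a UV coordinate.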