Semantically Annotated 3D Material Supporting the Design of
         Natural User Interfaces for Architectural Heritage
Valeria Cera
Department of Architecture, University of Naples "Federico II"
Naples, Italy
valeria.cera@unina.it

Antonio Origlia
URBAN/ECO Research Center, University of Naples "Federico II"
Naples, Italy
antonio.origlia@unina.it

Francesco Cutugno
Department of Electrical Engineering and Information Technology, University of Naples "Federico II"
Naples, Italy
cutugno@unina.it

Massimiliano Campi
Department of Architecture, University of Naples "Federico II"
Naples, Italy
campi@unina.it

ABSTRACT
With the advent of artificial intelligence and natural user interfaces, the need for multimedia material that can be semantically interpreted in real time becomes critical. In the field of 3D architectural survey, a significant amount of research has been conducted to allow domain experts to represent semantic data while keeping spatial references. Such data becomes valuable for natural user interfaces designed to let non-expert users obtain information about architectural heritage. In this paper, we present the architectural data collection and annotation procedure adopted in the Cultural Heritage Resources Orienting Multimodal Experiences (CHROME) project. This procedure aims at providing conversational agents with fast access to fine-grained semantic data linked to the available 3D models. We discuss how this will make it possible to support multimodal user interaction and generate cultural heritage presentations.

CCS CONCEPTS
• Human-centered computing → User centered design; Information visualization;

KEYWORDS
Semantic annotation, architectural survey, interaction design

ACM Reference Format:
Valeria Cera, Antonio Origlia, Francesco Cutugno, and Massimiliano Campi. 2018. Semantically Annotated 3D Material Supporting the Design of Natural User Interfaces for Architectural Heritage. In Proceedings of 2nd Workshop on Advanced Visual Interfaces for Cultural Heritage (AVI-CH 2018). Vol. 2091. CEUR-WS.org, Article 7. http://ceur-ws.org/Vol-2091/paper7.pdf, 4 pages.

AVI-CH 2018, May 29, 2018, Castiglione della Pescaia, Italy
© 2018 Copyright held by the owner/author(s).

1 INTRODUCTION AND RELATED WORK
Recent advances in graphics hardware, together with the availability of professional video-game engines, have opened a number of possibilities to develop innovative approaches for cultural heritage presentation. The use of game engines has been shown to produce beneficial effects on interaction quality with systems based on advanced knowledge representation and dialogue-based interaction (e.g. [6]). In particular, the use of conversational agents, represented in the form of 3D avatars moving in virtual reconstructions, provides a natural way to access information. Establishing a dialogue with an artificial character is becoming an increasingly common way to interact with technological devices.

The annotation of digital models lets scholars associate spatial shapes with the heterogeneous data describing them through the use of semantic descriptors. The most relevant approach to this kind of semantic annotation is presented in [1] and is based on the geometrical segmentation of architectural digital artefacts. These become collections of separate elements, organised using part-whole relationships. Each entity is identified by a precise concept in a specialised domain thesaurus: the architectural dictionary. Different geometrical representations (point clouds, NURBS, textured meshes, etc.) are linked to the objects represented by the terms included in the dictionary, depending on the specific descriptive objectives. Each geometrical element can be linked to a single semantic descriptor, while a semantic descriptor may be associated with multiple geometrical elements. More recently, the original methodology has been updated [5] and implemented as a cloud-based service called Aioli (www.aioli.cloud/). Using the projective relationship between bidimensional and tridimensional representations, the semantic annotation of digital models, obtained through a set of reference images, is produced by segmenting the same reference images, thus removing the need for a geometrical segmentation. Images sharing the same semantic label may be linked to one or more specific terms in a controlled vocabulary, or they may be characterised with customised attributes. Semantically annotated 3D models contain a significant amount of data. To promote cultural heritage, interactive technologies can be developed that use this data to let non-expert users navigate cultural contents. These technologies should be designed to assist the exploration of the large amount of information available for cultural heritage (texts, images, 3D models, etc.) in an engaging way. To tackle this problem, we pursue the use of conversational agents, in the form of 3D avatars, immersed in the digital representations of cultural artefacts. Using semantic
processing techniques coming from different domains (e.g. Natural Language Processing, Computer Vision, etc.), it is possible, using semantic labels, to link separate sources of information and generate a consistent presentation.

In this paper, we present the architectural data collection pipeline we adopted to obtain the 3D meshes representing relevant parts of the San Lorenzo Charterhouse in Padula (Italy) and how we annotated them with semantic information. The obtained data represent a multi-faceted documentation of architectural heritage describing both geometrical detail and visual experience. We also present the work in progress on a software architecture designed to link the semantically enriched 3D data to textual resources describing the represented artefact. This architecture will be used to support natural user interaction [10] through the use of Social Signal Processing [8] techniques and game engines.

2 DATA COLLECTION
The Charterhouse of San Lorenzo, in Padula, and its monumental staircase are the selected case study, used to test the developed pipeline from 3D data acquisition to the semantic annotation process. The staircase, made of local white stone, was built towards the end of the eighteenth century and has an elliptical plan and a double ramp. Closed outside by an octagonal tower, it leads to the first floor of the great cloister, used by the Carthusians for their weekly walk. Several surveying techniques were employed to produce a 3D reality-based model, suitable for dissemination purposes in virtual and interactive environments. In order to obtain a geometrically accurate 3D model, the survey was performed using a terrestrial laser scanner (TLS). Given the morphology of the staircase, its materials and colors, the geometrical data has been integrated with data collected during a photogrammetric campaign. This results in a physically accurate model that also delivers a photorealistic view of the surveyed cultural site, based on state of the art techniques.

Starting from the entrance, the positions of the different acquisitions have been organised to cover the entire volume of the monument, taking into account the tangency of the surfaces and shadows. A continuous-wave Faro Focus 3D S120 laser scanner was used to perform a total of 40 scans, positioning the scanner uniformly along an ascending path (for the eastern side) and a descending one (for the western side), with a spatial resolution of 6 mm at 10 m. A terrestrial photogrammetric survey was carried out mainly for texturing purposes. Using a Canon EOS 1300D reflex camera with an 18-55 mm zoom lens set at 24 mm, about 380 images were acquired to obtain better color information for the final texturing of the 3D digital model.

3 DATA PROCESSING
The complete range-based 3D point cloud was obtained employing a classical processing procedure: the adjacent TLS stations were aligned using a rigid-body transformation based on printed planar checkerboard targets and spheres. A final point cloud of about 500 million points was obtained.

Figure 1: Normal map of a sample segment. RGB values represent the normal vector coefficients driving the lighting simulation of details.

Figure 2: Color map of a sample segment. RGB information is computed by comparing high and low poly meshes.

Figure 3: The rendered 3D model of the great staircase in the San Lorenzo Charterhouse.

Figure 4: A semantic map for the pediment concept.

After a manual cleaning of vegetation and artefacts caused by noise, a polygonal mesh model was generated using a Delaunay triangulation algorithm. A final mesh of about 392.4 million triangles was obtained this way. Once the editing of the triangulated model was completed, texture mapping was carried out using the images from the photogrammetric survey. To optimise the computational management of the models during online rendering, instead of generating a single mesh, the process was set to divide the result into subparts. Each part is defined by an automatic subdivision of the model, constrained to a maximum of 5 million vertices per subpart. Considering the aims of dissemination and communication, the textured model was then simplified with a subsequent geometric optimisation. The quadric edge collapse algorithm was applied to obtain a polygonal mesh that allowed fluid real-time rendering while preserving an adequate level of perceived detail. The mesh produced with this procedure was decimated with a target of 1% of the vertices of the initial mesh.
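To make the decimation criterion more concrete, the following is a minimal sketch of the quadric error metric that commonly drives edge-collapse simplification. It assumes that the edge collapse algorithm named above belongs to this family; the tool and parameters actually used are not stated in the text. All function names are hypothetical, and a full implementation would also need a priority queue of candidate edges and mesh connectivity updates.

```python
from math import sqrt


def plane_from_triangle(a, b, c):
    """Return (nx, ny, nz, d) with a unit normal, so that n.p + d = 0 on the plane."""
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    # Cross product of the two edge vectors gives the triangle normal.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    length = sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        raise ValueError("degenerate triangle")
    nx, ny, nz = nx / length, ny / length, nz / length
    d = -(nx * a[0] + ny * a[1] + nz * a[2])
    return (nx, ny, nz, d)


def quadric(planes):
    """Accumulate the 4x4 matrix sum(p p^T) over the planes adjacent to a vertex."""
    q = [[0.0] * 4 for _ in range(4)]
    for p in planes:
        for i in range(4):
            for j in range(4):
                q[i][j] += p[i] * p[j]
    return q


def quadric_error(q, point):
    """Evaluate v^T Q v for v = (x, y, z, 1): sum of squared distances to the planes."""
    v = (point[0], point[1], point[2], 1.0)
    return sum(v[i] * q[i][j] * v[j] for i in range(4) for j in range(4))


def collapse_cost(q1, q2, candidate):
    """Cost of merging two vertices into `candidate`: error under the summed quadrics."""
    merged = [[q1[i][j] + q2[i][j] for j in range(4)] for i in range(4)]
    return quadric_error(merged, candidate)


if __name__ == "__main__":
    # Toy example: one triangle on the plane z = 0, one on the plane x = 0.
    ground = plane_from_triangle((0, 0, 0), (1, 0, 0), (0, 1, 0))
    wall = plane_from_triangle((0, 0, 0), (0, 1, 0), (0, 0, 1))
    # A point 3 units from the wall and 4 units above the ground: 3^2 + 4^2.
    print(quadric_error(quadric([ground, wall]), (3.0, 0.0, 4.0)))  # 25.0
```

In a decimation loop, edges with the lowest collapse cost are removed first, which is why flat regions lose vertices long before sharp architectural details do.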
The deviation between the original and the decimated meshes was measured by calculating the Hausdorff distance. The approximation error was below 1 cm. To retain geometrical fidelity in the visualisation task, we compensate for this error by computing normal maps that result from the comparison of the high poly and the low poly meshes, shown in Figure 1. Following the same approach used to bake normals on the low poly mesh, color information is baked using the high poly mesh. The resulting color texture is shown in Figure 2. This way, although geometrical data is lost during decimation, the simulated behaviour of light in the rendering engine takes into account the effect of the removed details. A rendered example of the final result is shown in Figure 3.

From a cultural heritage documentation point of view, it is desirable to preserve both geometrical fidelity and the visual experience. Considering that the error of the measures acquired with the laser scanner is approximately 2 mm, an error of 1 cm is considered significant for geometrical documentation purposes, so the high poly mesh must be stored. On the other hand, to document the visual experience, it is only necessary to retain the effect geometry has on lighting. Normal maps make it possible to retain these effects, although the original geometry is not present in the low poly mesh, and let a rendering engine keep high frame rates. To improve the final quality of the visualised 3D models, Ambient Occlusion maps were also computed.

4 ANNOTATION
Semantic annotation consists of linking abstract concepts to the relevant parts of a 3D mesh. Since these concepts must be represented in standard formats in order to be recognisable, a reference source must be selected. In our case, we refer to the Art and Architecture Thesaurus (AAT) [7]: a controlled domain vocabulary containing generic terms and other data concerning the represented concepts. These are connected using hierarchical, equivalence and associative relationships. The annotation consists of maps describing semantic concepts, applied to the 3D model like a texture, thus avoiding the need to geometrically segment the architectural artefact. For each concept in the AAT found in the digital architectural model, a semantic map is created and assigned the same unique ID under which the concept it represents is recorded in the AAT. As an improvement with respect to previous approaches, the semantic information is represented as a grayscale map: each map records which polygons in the digital model are relevant for the concept it represents by using the model's UV map. In our approach, white indicates high relevance, while black indicates no relevance. An example of a semantic map is shown in Figure 4. Using semantic maps and reference IDs for the annotated concepts allows the integration of multiple sources of information (texts, images, audio recordings, etc.) sharing the same annotation scheme. Cross-referencing these sources opens the possibility to produce advanced interfaces that link the descriptions a specific artefact has in separate domains.

The possibility of using gradients in the map lets annotators refine the quality of the semantic data. This way, it is possible to express a graded relevance level of each vertex for a given concept, rather than a binary relevance. This is important in the field of architectural heritage, as it is not always possible to classify an element in a unique and precise way, and it becomes useful when an architectural element cannot be assigned to a specific category. The same applies to situations in which it is not possible to indicate where, exactly, an architectural element becomes another one. This also makes it possible to combine semantic maps produced by multiple annotators into a final map by computing the mean value at each UV coordinate, similarly to what has been done in other fields where annotation uncertainty is important, such as emotion annotation [4].

5 INTERACTION MANAGEMENT
In the scenario of automated information providers for architectural heritage, semantically annotated material can be used to generate cross-domain presentations. Moreover, the possibility of adding virtual characters to the scene allows the elicitation of social signals and the use of natural, multimodal commands. To take advantage
of these possibilities, we designed a software architecture combining specialised modules for interaction management and knowledge representation. This architecture includes: a) a graph database (Neo4j) to represent knowledge, b) a dialogue manager (OpenDial) to handle interaction, c) a game engine (Unreal Engine 4) to control the virtual character, d) a voice synthesizer (Mivoq) and e) a Kinect sensor to collect users' activity data.

Neo4j [9] is an open source graph database manager that has been applied to a large number of tasks related to data representation (e.g. [2]). OpenDial [3] is a dialogue management framework based on probabilistic rules, aiming at merging the best of rule-based and probabilistic dialogue management. Probabilistic rules, in OpenDial, are used to set up and update a Bayesian network consisting of variables that represent the current dialogue state, including uncertainty. A utility-based approach is used to compute the next system action, if any. The virtual avatar and the real-time rendering of the obtained artefacts are controlled using the Unreal Engine 4 (www.unrealengine.com). The voice of the avatar is dynamically generated using the Mivoq Voice Synthesis Engine (www.mivoq.it). User gestures and speech are detected with the Kinect sensor and are continuously forwarded to the game engine.

With this approach, the task of handling raw user data is assigned to a module designed to manage complex, dynamic interfaces that include video, audio and user control systems, while high-level decision processes are delegated to the dialogue manager. This component accesses the encyclopedic knowledge represented in the graph database to select the most appropriate response to a user input. The response consists of an abstract plan that may include text extracted from the knowledge base, clarification requests or generic action instructions (e.g. enter another environment). Deciding how to implement the action is assigned to the game engine. This choice is motivated by the dynamic nature of the interaction between the users and the avatar, but also between the avatar and the 3D surroundings: a reactive behavioural logic is needed to manage interrupts caused by both implicit and explicit user activity. Also, social signal generation and monitoring must be performed in real time to ensure consistency with the users' behaviour. Lastly, the relative position of the avatar with respect to the concepts that are relevant for the generated utterances must also be evaluated in real time in order to generate pointing gestures.

To support multimodal commands from the users and allow a richer interaction, the user skeletons provided by the Kinect sensor will be used: by exploiting the raycasting system included in the engine, it is possible to cast a single invisible ray from the tip of the arm bone and capture collision events between the ray and the objects in the scene. From the data included in the collision event, it is then possible to retrieve the UV coordinates of the vertex that is closest to the collision point. These UV coordinates can then be used to query the semantic maps of the object the ray collided with, in order to extract relevance information for the annotated concepts. This information can then be passed to the dialogue manager when a speech command is detected and multimodal fusion has been performed. The details of the interaction management strategy will be formalised on the basis of audiovisual recordings of expert art historians presenting the Campanian Charterhouses to small groups of visitors, which are currently being collected in the framework of the CHROME project.

6 CONCLUSIONS AND FUTURE WORK
We have presented the work in progress in the framework of the CHROME project. We have described the workflow leading from 3D architectural data collected with laser scanners and photogrammetry to an interactive system designed to present such data in a rich and entertaining way. Using high and low poly meshes with normal maps to retain the necessary details in real-time rendering, we document architectural heritage from both a geometrical and a visual experience point of view. Furthermore, an original method to semantically annotate the low poly meshes has been developed to allow a direct link between concepts in the AAT thesaurus and geometric parts, introducing the possibility to represent uncertainty. The semantic data produced with this workflow will allow the development of 3D conversational agents able to refer to the reconstructed environment.

7 ACKNOWLEDGMENTS
Antonio Origlia's work is funded by the Italian PRIN project Cultural Heritage Resources Orienting Multimodal Experience (CHROME) #B52F15000450001.

REFERENCES
[1] Livio De Luca. 2006. Relevé et multi-représentations du patrimoine architectural: Définition d'une approche hybride pour la reconstruction 3D d'édifices. Ph.D. Dissertation. Sciences de l'Homme et Société. Arts et Métiers ParisTech.
[2] Felix Dietze, Johannes Karoff, André Calero Valdez, Martina Ziefle, Christoph Greven, and Ulrik Schroeder. 2016. An Open-Source Object-Graph-Mapping Framework for Neo4j and Scala: Renesca. In International Conference on Availability, Reliability, and Security. Springer, 204–218.
[3] Pierre Lison and Casey Kennington. 2016. OpenDial: A toolkit for developing spoken dialogue systems with probabilistic rules. ACL 2016 (2016), 67.
[4] G. McKeown, M. F. Valstar, R. Cowie, and Maja Pantic. 2010. The SEMAINE Corpus of Emotionally Coloured Character Interactions. In Proc. of ICME. 1079–1084.
[5] Tommy Messaoudi, Philippe Véron, Gilles Halin, and Livio De Luca. 2018. An ontological model for the reality-based 3D annotation of heritage building conservation state. Journal of Cultural Heritage 29 (2018), 100–112.
[6] Antonio Origlia, Piero Cosi, Antonio Rodà, and Claudio Zmarich. 2017. A dialogue-based software architecture for gamified discrimination tests. In Proc. of GHItaly. http://ceur-ws.org/Vol-1956/
[7] Toni Petersen. 1990. Developing a New Thesaurus for Art and Architecture. Library Trends 38, 4 (1990), 644–658.
[8] Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. Social signal processing: Survey of an emerging domain. Image and vision computing 27, 12 (2009), 1743–1759.
[9] Jim Webber. 2012. A programmatic introduction to Neo4j. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity. ACM, 217–218.
[10] Daniel Wigdor and Dennis Wixon. 2011. Brave NUI world: designing natural user interfaces for touch and gesture. Elsevier.
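As an illustration of the semantic-map mechanism described in Sections 4 and 5, the sketch below averages grayscale maps from several annotators and then looks up concept relevance at the UV coordinates returned by a ray collision. It is a simplified model under stated assumptions, not the CHROME implementation: maps are plain nested lists of values in [0, 1] (1 for white, 0 for black), UV coordinates are normalised with the origin at the top-left texel, lookup is nearest-texel, and the concept identifiers and function names are hypothetical placeholders rather than real AAT IDs.

```python
def mean_semantic_map(annotator_maps):
    """Average several same-sized grayscale maps texel by texel."""
    if not annotator_maps:
        raise ValueError("at least one map is required")
    height = len(annotator_maps[0])
    width = len(annotator_maps[0][0])
    for m in annotator_maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all maps must share the same resolution")
    count = len(annotator_maps)
    return [
        [sum(m[r][c] for m in annotator_maps) / count for c in range(width)]
        for r in range(height)
    ]


def relevance_at(semantic_map, u, v):
    """Nearest-texel lookup of the relevance stored at normalised (u, v)."""
    height = len(semantic_map)
    width = len(semantic_map[0])
    # Clamp so that u = 1.0 or v = 1.0 still map to the last texel.
    col = min(max(int(u * width), 0), width - 1)
    row = min(max(int(v * height), 0), height - 1)
    return semantic_map[row][col]


def query_concepts(maps_by_concept, u, v, threshold=0.0):
    """Return {concept_id: relevance} for concepts whose relevance exceeds threshold."""
    result = {}
    for concept_id, semantic_map in maps_by_concept.items():
        relevance = relevance_at(semantic_map, u, v)
        if relevance > threshold:
            result[concept_id] = relevance
    return result


if __name__ == "__main__":
    # Two annotators disagree on how far a concept extends over a 2x2 map.
    annotator_1 = [[1.0, 0.0], [0.0, 0.0]]
    annotator_2 = [[0.5, 0.0], [1.0, 0.0]]
    merged = mean_semantic_map([annotator_1, annotator_2])
    maps = {"concept_a": merged, "concept_b": [[0.0, 0.0], [0.0, 0.0]]}
    # UV coordinates as they might come from a ray collision event.
    print(query_concepts(maps, 0.1, 0.9))  # {'concept_a': 0.5}
```

The dictionary returned by `query_concepts` is the kind of relevance information that could be handed to the dialogue manager once a speech command has been detected; a threshold above zero would filter out weakly relevant concepts.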