<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantically Annotated 3D Material Supporting the Design of Natural User Interfaces for Architectural Heritage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valeria Cera</string-name>
          <email>valeria.cera@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Cutugno</string-name>
          <email>cutugno@unina.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Origlia</string-name>
          <email>antonio.origlia@unina.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimiliano Campi</string-name>
          <email>campi@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Architecture, University of Naples, "Federico II"</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical Engineering and Information Technology, University of Naples "Federico II"</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>URBAN/ECO Research Center, University of Naples, "Federico II"</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2091</volume>
      <abstract>
        <p>With the advent of artificial intelligence and natural user interfaces, the need for multimedia material that can be semantically interpreted in real time becomes critical. In the field of 3D architectural survey, a significant amount of research has been conducted to allow domain experts to represent semantic data while keeping spatial references. Such data becomes valuable for natural user interfaces designed to let non-expert users obtain information about architectural heritage. In this paper, we present the architectural data collection and annotation procedure adopted in the Cultural Heritage Resources Orienting Multimodal Experiences (CHROME) project. This procedure aims at providing conversational agents with fast access to fine-grained semantic data linked to the available 3D models. We discuss how this will make it possible to support multimodal user interaction and generate cultural heritage presentations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Human-centered computing → User centered design;
Information visualization;</p>
      <p>Keywords: Semantic annotation, architectural survey, interaction design</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION AND RELATED WORK</title>
      <p>
        Recent advances in graphics hardware, together with the
availability of professional video-game engines, have opened a number of
possibilities to develop innovative approaches for cultural heritage
presentation. The use of game engines has been shown to produce
beneficial effects on interaction quality with systems based on
advanced knowledge representation and dialogue-based interaction
(e.g. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). In particular, the use of conversational agents, represented
in the form of 3D avatars moving in virtual reconstructions,
provides a natural way to access information. Establishing a dialogue
with an artificial character is becoming an increasingly common
way to interact with technological devices.
      </p>
      <p>
        The annotation of digital models lets scholars associate spatial
shapes with the heterogeneous data describing them through the
use of semantic descriptors. The most relevant approach to this
kind of semantic annotation is presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and it is based
on the geometrical segmentation of architectural digital artefacts.
These become collections of separate elements, organised using
part-whole relationships. Each entity is identified by a precise
concept in a specialised domain thesaurus: the architectural dictionary.
Different geometrical representations (point clouds, NURBS, textured
meshes, etc.) are linked to the objects represented by the terms
included in the dictionary, depending on the specific descriptive
objectives. Each geometrical element can be linked to a single
semantic descriptor, while a semantic descriptor may be associated with
multiple geometrical elements. More recently, the original
methodology has been updated [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and implemented as a cloud-based
service called Aioli. Using the projective relationship between
two-dimensional and three-dimensional representations, the semantic
annotation of digital models, obtained through a set of reference
images, is produced by segmenting the same reference images, thus
removing the need for a geometrical segmentation. Images sharing
the same semantic label may be linked to one or more specific
terms in a controlled vocabulary, or they may be characterised with
customised attributes. Semantically annotated 3D models contain a
significant amount of data that, to promote cultural heritage, may
be used to let non-expert users navigate cultural contents through
interactive technologies. These technologies should be
designed to assist the exploration of the large amount of
information available for cultural heritage (texts, images, 3D models,
etc.) in an engaging way. To tackle this problem, we pursue the
use of conversational agents, in the form of 3D avatars, immersed
in the digital representations of cultural artefacts. Using semantic
processing techniques coming from different domains (e.g.
Natural Language Processing, Computer Vision, etc.) together with
semantic labels, it is possible to link separate sources of information and
generate a consistent presentation.
      </p>
      <p>
        In this paper, we present the architectural data collection pipeline
we adopted to obtain the 3D meshes representing relevant parts of
the San Lorenzo Charterhouse in Padula (Italy) and how we
annotated them with semantic information. The obtained data represent
a multi-faceted documentation of architectural heritage describing
both geometrical detail and visual experience. We also present the
work in progress on a software architecture designed to link the
semantically enriched 3D data to textual resources describing the
represented artefact. This architecture will be used to support
natural user interaction [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] through the use of Social Signal Processing
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] techniques and game engines.
      </p>
    </sec>
    <sec id="sec-3">
      <title>DATA COLLECTION</title>
      <p>The Charterhouse of San Lorenzo in Padula, and in particular its monumental
staircase, is the selected case study, used to test
the developed pipeline, spanning from the 3D data acquisition to
the semantic annotation process. The staircase, made of local white
stone and built towards the end of the eighteenth century, has an
elliptical plan and a double ramp. Enclosed on the outside by an octagonal
tower, it leads to the first floor of the great cloister, used by the
Carthusians for their weekly walk. Several surveying techniques
were employed to produce a 3D reality-based model suitable for
dissemination purposes in virtual and interactive environments. To
obtain a geometrically accurate 3D model, the survey was
performed using a terrestrial laser scanner (TLS). Given the
morphology of the staircase, its materials and colors, the geometrical
data was integrated with data collected during a
photogrammetric campaign. This results in a physically accurate model that
also delivers a photorealistic view of the surveyed cultural site,
based on state-of-the-art techniques.</p>
      <p>Starting from the entrance, the positions of the different
acquisitions were organised to cover the entire volume of the
monument, taking into account the tangency of the surfaces and
shadows. A continuous-wave Faro Focus 3D S120 laser scanner
was used to perform a total of 40 scans, positioning the
scanner uniformly along an ascending path for the eastern side
and a descending one for the western side, with a spatial
resolution of 6 mm at 10 m. A terrestrial photogrammetric survey was
carried out mainly for texturing purposes. Using a Canon
EOS 1300D reflex camera with an 18-55 mm zoom lens set at 24 mm, about 380
images were acquired to obtain better color information for the
final texturing of the 3D digital model.</p>
    </sec>
    <sec id="sec-4">
      <title>DATA PROCESSING</title>
      <p>The complete range-based 3D point cloud was obtained with
a classical processing procedure: the adjacent TLS stations were
aligned using a rigid-body transformation based on planar printed
checkerboard targets and spheres. A final point cloud of about 500
million points was obtained.</p>
      <p>After manual cleaning of vegetation and artefacts caused by
noise, a polygonal mesh model was generated using a Delaunay
triangulation algorithm. A final mesh of about 392.4 million
triangles was obtained this way. Once the editing of the triangulated model
was completed, texture mapping was carried out using the
images from the photogrammetric survey. To optimise the
computational management of the models during online rendering, instead
of generating a single mesh, the process was set to divide the result
into subparts. Each part is defined by an automatic subdivision of
the model, constrained to a maximum of 5 million
vertices per subpart. Considering the aims of dissemination and
communication, the textured model was then simplified through
geometric optimisation. The quadric edge collapse algorithm
was applied to obtain a polygonal mesh that allowed fluid real-time
rendering while preserving an adequate level of perceived detail.
The decimation target was set to 1% of the vertices of the initial mesh.</p>
      <p>The deviation between the original and the decimated meshes
was measured by calculating the Hausdorff distance. The
approximation error was below 1 cm. To retain geometrical fidelity in the
visualisation task, we compensate for this error by computing normal
maps that result from the comparison of the high poly and the low
poly meshes, shown in Figure 1. Following the same approach used
to bake normals on the low poly mesh, color information is baked
using the high poly mesh. The resulting color texture is shown in
Figure 2. This way, although geometrical data is lost during
decimation, the simulated behaviour of light in the rendering engine takes
into account the effect of the removed details. A rendered example
of the final result is shown in Figure 3.</p>
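      <p>As an illustration of the deviation measure mentioned above, the following minimal sketch computes a symmetric Hausdorff distance between two small point sets. It is a brute-force version working on sampled points only; the function names and the toy coordinates (in metres) are assumptions made for this example and are not taken from the survey data or from the tool actually used.</p>
      <preformat><![CDATA[
```python
import math


def directed_hausdorff(source, target):
    """Largest distance from a point in `source` to its nearest point in `target`."""
    return max(min(math.dist(p, q) for q in target) for p in source)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy samples in metres (hypothetical, not measured values).
high_poly = [(0.0, 0.0, 0.0), (0.5, 0.004, 0.0), (1.0, 0.0, 0.0)]
low_poly = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]

deviation = hausdorff(high_poly, low_poly)
exceeds_one_cm = deviation >= 0.01  # compare against the 1 cm bound
```
]]></preformat>
      <p>On real meshes the distance would normally be evaluated on dense surface samples rather than on a handful of vertices, and with a spatial index instead of the quadratic search used here.</p>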
      <p>From a cultural heritage documentation point of view, it is
desirable to preserve both geometrical fidelity and the visual experience.
Considering that the error of the measurements acquired with the laser
scanner is approximately 2 mm, for geometrical documentation
purposes an error of 1 cm is considered significant, so the high poly
mesh must be stored. On the other hand, to document the visual
experience, it is only necessary to retain the effect geometry has
on lighting. Normal maps retain these effects, although
the original geometry is not present in the low poly mesh, and let
a rendering engine keep high frame rates. To improve the final
quality of the visualised 3D models, Ambient Occlusion maps were
also computed.</p>
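      <p>To make the role of normal maps more concrete, the sketch below shows the common convention of storing a unit normal as an 8-bit RGB texel and reading it back. The helper names are hypothetical and the remapping from [-1, 1] to [0, 255] is the usual convention, assumed here rather than confirmed for the baking tool used in this work.</p>
      <preformat><![CDATA[
```python
import math


def encode_normal(nx, ny, nz):
    """Map a normal vector to an 8-bit RGB triple, as stored in a normal map."""
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        raise ValueError("zero-length normal")
    unit = (nx / length, ny / length, nz / length)
    # Components in [-1, 1] are remapped to [0, 255].
    return tuple(round((c * 0.5 + 0.5) * 255) for c in unit)


def decode_normal(r, g, b):
    """Recover an approximate unit normal from an RGB texel."""
    raw = [c / 255 * 2.0 - 1.0 for c in (r, g, b)]
    length = math.sqrt(sum(c * c for c in raw))
    return tuple(c / length for c in raw)


# A normal pointing straight out of the surface gives the typical bluish texel.
flat_texel = encode_normal(0.0, 0.0, 1.0)
```
]]></preformat>
      <p>The rendering engine uses the decoded normal, instead of the normal of the simplified geometry, when shading each pixel, which is how the lighting effect of the removed detail is retained.</p>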
    </sec>
    <sec id="sec-5">
      <title>ANNOTATION</title>
      <p>
        Semantic annotation consists of linking abstract concepts to the
relevant parts of a 3D mesh. Since these concepts must be represented
in standard formats, in order to be recognisable, a reference source
must be selected. In our case, we refer to the Art and Architecture
Thesaurus (AAT) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: a controlled domain vocabulary containing
generic terms and other data concerning the represented concepts.
These are connected using hierarchical, equivalence and associative
relationships. The annotation consists in the use of maps describing
semantic concepts, applied to the 3D model like a texture, thus
avoiding the need to geometrically segment the architectural artefact. For
each concept in the AAT found in the digital architectural model, a
semantic map is created and assigned the unique ID under which the
concept it represents is recorded in the AAT. As an improvement
with respect to previous approaches, the semantic information is
represented as a grayscale map: each map records which polygons
in the digital model are relevant for the concept it represents by
using the model’s UV map. In our approach, white indicates high
relevance, while black indicates no relevance. An example of a
semantic map is shown in Figure 4. Using semantic maps and reference
IDs for the annotated concepts allows the integration of multiple
sources of information (texts, images, audio recordings, etc.)
sharing the same annotation scheme. Cross-referencing these sources
opens the possibility of producing advanced interfaces that link the
descriptions a specific artefact has in separate domains.
      </p>
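      <p>The lookup implied by this scheme can be sketched as follows. Semantic maps are modelled as small grids of 8-bit gray values keyed by a concept identifier; the identifiers, the 2x2 maps and the function names are placeholders invented for this example (they are not real AAT IDs), and the convention that row 0 corresponds to v = 0 is an assumption.</p>
      <preformat><![CDATA[
```python
def relevance_at(semantic_map, u, v):
    """Return the relevance in [0, 1] stored at UV coordinates (u, v).

    `semantic_map` is a list of rows of gray values (0 = black, 255 = white);
    u and v are expected to lie in [0, 1].
    """
    height = len(semantic_map)
    width = len(semantic_map[0])
    # Clamp to the last texel so that u == 1.0 or v == 1.0 stays in range.
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return semantic_map[row][col] / 255


def concepts_at(maps, u, v):
    """List (concept ID, relevance) pairs with non-zero relevance at (u, v)."""
    hits = [(cid, relevance_at(m, u, v)) for cid, m in maps.items()]
    return [(cid, r) for cid, r in hits if r > 0]


# Hypothetical maps keyed by placeholder concept IDs.
semantic_maps = {
    "concept-A": [[255, 0], [0, 0]],
    "concept-B": [[0, 128], [0, 255]],
}
```
]]></preformat>
      <p>Because every map shares the UV layout of the model, adding a new concept only requires adding a new map under its ID, without touching the geometry.</p>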
      <p>
        The possibility of using gradients in the map lets annotators
refine the quality of the semantic data. This way, it is possible to
express a graded relevance level of each vertex for a given
concept, rather than a binary one. This is important in the
field of architectural heritage, as it is not always possible to classify
an element in a unique and precise way, and becomes useful when
an architectural element cannot be assigned to a specific category.
The same applies to situations in which it is not possible to indicate
where, exactly, an architectural element becomes another one. This
also makes it possible to combine semantic maps produced by
multiple annotators into a final map by computing the mean
value for each UV coordinate, similarly to what has been done
in other fields where annotation uncertainty is important, such as
emotions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
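      <p>A minimal sketch of the multi-annotator combination described above is given below: maps of equal size are averaged texel by texel. The function name and the one-row example maps are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def mean_map(annotator_maps):
    """Average several equally sized grayscale maps texel by texel."""
    if not annotator_maps:
        raise ValueError("at least one map is required")
    height = len(annotator_maps[0])
    width = len(annotator_maps[0][0])
    count = len(annotator_maps)
    return [
        [sum(m[row][col] for m in annotator_maps) / count for col in range(width)]
        for row in range(height)
    ]


# Two hypothetical annotators disagreeing on where an element ends.
annotator_1 = [[255, 255, 0]]
annotator_2 = [[255, 0, 0]]
consensus = mean_map([annotator_1, annotator_2])
```
]]></preformat>
      <p>Texels on which the annotators disagree end up with an intermediate gray value, which is read as a lower relevance for the concept at that location.</p>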
    </sec>
    <sec id="sec-6">
      <title>INTERACTION MANAGEMENT</title>
      <p>In the scenario of automated information providers for architectural
heritage, semantically annotated material can be used to generate
cross-domain presentations. Moreover, the possibility of adding
virtual characters to the scene allows the elicitation of social signals
and the use of natural, multimodal commands. To take advantage
of these possibilities, we designed a software architecture
combining specialised modules for interaction management and
knowledge representation. This architecture includes: a) a graph database
(Neo4j) to represent knowledge, b) a dialogue manager (OpenDial) to
handle interaction, c) a game engine (Unreal Engine 4) to control
the virtual character, d) a voice synthesizer (Mivoq) and e) a Kinect
sensor to collect users’ activity data.</p>
      <p>
        Neo4j [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is an open source graph database manager that has
been applied to a large number of tasks related to data
representation (e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). OpenDial [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a dialogue management framework
based on probabilistic rules, aiming at merging the best of rule-based
and probabilistic dialogue management. Probabilistic rules, in
OpenDial, are used to set up and update a Bayesian network consisting
of variables that represent the current dialogue state, including
uncertainty. A utility-based approach is used to compute the next
system action, if any. The virtual avatar and the real-time rendering
of the obtained artefacts are controlled using Unreal Engine 4.
The voice of the avatar is dynamically generated using the Mivoq
Voice Synthesis Engine. User gestures and speech are detected
with the Kinect sensor and are continuously forwarded to the game
engine.
      </p>
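      <p>The idea of utility-based action selection under uncertainty can be illustrated with the generic sketch below. It does not use the OpenDial API or its rule format: the state names, utilities and functions are hypothetical, and the choice of returning no action when the best expected utility is not positive is an assumption made to mirror the "if any" above.</p>
      <preformat><![CDATA[
```python
def expected_utility(belief, utilities, action):
    """Expected utility of `action` under a belief over dialogue states."""
    return sum(p * utilities.get((state, action), 0.0) for state, p in belief.items())


def select_action(belief, utilities, actions):
    """Return the action with the highest positive expected utility, else None.

    `actions` is assumed to be a non-empty list.
    """
    best = max(actions, key=lambda a: expected_utility(belief, utilities, a))
    return best if expected_utility(belief, utilities, best) > 0 else None


# Hypothetical belief over what the user asked about, and hand-picked utilities.
belief = {"asked_about_staircase": 0.7, "asked_about_cloister": 0.3}
utilities = {
    ("asked_about_staircase", "describe_staircase"): 1.0,
    ("asked_about_cloister", "describe_staircase"): -1.0,
    ("asked_about_staircase", "describe_cloister"): -1.0,
    ("asked_about_cloister", "describe_cloister"): 1.0,
}
actions = ["describe_staircase", "describe_cloister"]
chosen = select_action(belief, utilities, actions)
```
]]></preformat>
      <p>With a more balanced belief no action reaches a positive expected utility, which is the situation in which a dialogue system would typically fall back to a clarification request.</p>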
      <p>With this approach, the task of handling raw user data is assigned
to a module designed to manage complex, dynamic interfaces that
include video, audio and user control systems, while high-level
decision processes are delegated to the dialogue manager. This
component accesses the encyclopedic knowledge represented in the
graph database to select the most appropriate response to a user
input. The response consists of an abstract plan that may include
text extracted from the knowledge base, clarification requests or
generic action instructions (e.g. enter another environment).
Deciding how to implement the action is assigned to the game engine.
This choice is motivated by the dynamic nature of the interaction
between the users and the avatar, but also between the avatar and
the 3D surroundings: a reactive behavioural logic is needed to
manage interrupts caused by both implicit and explicit user activity. Also,
social signals generation and monitoring must be performed in real
time to ensure consistency with the users’ behaviour. Lastly, the
relative position of the avatar with respect to the concepts that are
relevant for the generated utterances must also be evaluated in real
time in order to generate pointing gestures.</p>
      <p>To support multimodal commands from the users and allow a
richer interaction, the user skeletons provided by the Kinect sensor
will be used: by exploiting the raycasting system included in the
engine, it is possible to cast a single invisible ray from
the tip of the arm bone and capture collision events between the ray
and the objects in the scene. From the data included in the collision
event, it is then possible to obtain the UV coordinates of the
vertex that is closest to the collision point. These UV coordinates
can then be used to query the semantic maps of the object the ray
collided with to extract relevance information for the annotated
concepts. These values can then be passed to the dialogue manager when
a speech command is detected and multimodal fusion has been
performed. The details of the interaction management strategy will
be formalised on the basis of audiovisual recordings of expert art
historians presenting the Campanian Charterhouses to small groups
of visitors, which are currently being collected in the framework of
the CHROME project.</p>
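      <p>The pointing pipeline described above can be sketched independently of the game engine as follows. The collision point is assumed to be already available from the engine's raycast; the mesh, the maps, the concept identifiers and the function names are hypothetical, and nothing here reproduces the Unreal Engine API.</p>
      <preformat><![CDATA[
```python
import math


def nearest_vertex_uv(hit_point, vertices):
    """Return the UV pair of the vertex closest to `hit_point`.

    `vertices` is a list of (position, uv) tuples for the mesh that was hit.
    """
    _, uv = min(vertices, key=lambda item: math.dist(item[0], hit_point))
    return uv


def sample(gray_map, uv):
    """Read a relevance value in [0, 1] from a grayscale map at `uv`."""
    height, width = len(gray_map), len(gray_map[0])
    col = min(int(uv[0] * width), width - 1)
    row = min(int(uv[1] * height), height - 1)
    return gray_map[row][col] / 255


def pointed_concepts(hit_point, vertices, maps):
    """Rank annotated concepts by relevance at the pointed location."""
    uv = nearest_vertex_uv(hit_point, vertices)
    scores = {cid: sample(m, uv) for cid, m in maps.items()}
    return sorted(
        ((cid, s) for cid, s in scores.items() if s > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical mesh, semantic maps and collision point.
vertices = [((0.0, 0.0, 0.0), (0.1, 0.1)), ((1.0, 0.0, 0.0), (0.9, 0.1))]
maps = {"concept-A": [[255, 64]], "concept-B": [[0, 200]]}
ranked = pointed_concepts((0.9, 0.1, 0.0), vertices, maps)
```
]]></preformat>
      <p>The ranked list is the kind of information that could be handed to the dialogue manager together with the recognised speech command, so that an utterance such as a question about "this" element can be resolved against the pointed concept.</p>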
    </sec>
    <sec id="sec-7">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>We have presented the work in progress in the framework of the
CHROME project. We have described the workflow leading from
3D architectural data collected with laser scanners and
photogrammetry to an interactive system designed to present such data in a
rich and entertaining way. Using high and low poly meshes with
normal maps to retain the necessary details in real-time
rendering, we document architectural heritage from both a geometrical and
a visual experience point of view. Furthermore, an original
method to semantically annotate the low poly meshes has been
developed to allow a direct link between concepts in the AAT
and geometric parts, introducing the possibility of representing
uncertainty. The semantic data produced with this workflow will
allow the development of 3D conversational agents able to refer to
the reconstructed environment.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>Antonio Origlia’s work is funded by the Italian PRIN project
Cultural Heritage Resources Orienting Multimodal Experience (CHROME)
#B52F15000450001.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Livio</given-names>
            <surname>De Luca</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Relevé et multi-représentations du patrimoine architectural Définition d'une approche hybride pour la reconstruction 3D d'édifices</article-title>
          .
          <source>Ph.D. Dissertation. Sciences de l'Homme et Société. Arts et Métiers ParisTech.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Felix</given-names>
            <surname>Dietze</surname>
          </string-name>
          , Johannes Karoff, André Calero Valdez, Martina Ziefle, Christoph Greven, and
          <string-name>
            <given-names>Ulrik</given-names>
            <surname>Schroeder</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An Open-Source Object-Graph-Mapping Framework for Neo4j and Scala: Renesca</article-title>
          . In International Conference on Availability, Reliability, and Security. Springer,
          <fpage>204</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Lison</surname>
          </string-name>
          and
          <string-name>
            <given-names>Casey</given-names>
            <surname>Kennington</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>OpenDial: A toolkit for developing spoken dialogue systems with probabilistic rules</article-title>
          .
          <source>ACL</source>
          <year>2016</year>
          (
          <year>2016</year>
          ),
          <fpage>67</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>McKeown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Valstar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cowie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Maja</given-names>
            <surname>Pantic</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The SEMAINE Corpus of Emotionally Coloured Character Interactions</article-title>
          .
          <source>In Proc. of ICME</source>
          .
          <fpage>1079</fpage>
          -
          <lpage>1084</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Tommy</given-names>
            <surname>Messaoudi</surname>
          </string-name>
          , Philippe Véron, Gilles Halin, and Livio De Luca.
          <year>2018</year>
          .
          <article-title>An ontological model for the reality-based 3D annotation of heritage building conservation state</article-title>
          .
          <source>Journal of Cultural Heritage</source>
          <volume>29</volume>
          (
          <year>2018</year>
          ),
          <fpage>100</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Origlia</surname>
          </string-name>
          , Piero Cosi, Antonio Rodà, and
          <string-name>
            <given-names>Claudio</given-names>
            <surname>Zmarich</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A dialogue-based software architecture for gamified discrimination tests</article-title>
          .
          <source>In Proc. of GHItaly</source>
          . http://ceur-ws.org/Vol-1956/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Toni</given-names>
            <surname>Petersen</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>Developing a New Thesaurus for Art and Architecture</article-title>
          .
          <source>Library Trends</source>
          <volume>38</volume>
          ,
          <issue>4</issue>
          (
          <year>1990</year>
          ),
          <fpage>644</fpage>
          -
          <lpage>658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Vinciarelli</surname>
          </string-name>
          , Maja Pantic, and
          <string-name>
            <given-names>Hervé</given-names>
            <surname>Bourlard</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Social signal processing: Survey of an emerging domain</article-title>
          .
          <source>Image and Vision Computing</source>
          <volume>27</volume>
          ,
          <issue>12</issue>
          (
          <year>2009</year>
          ),
          <fpage>1743</fpage>
          -
          <lpage>1759</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jim</given-names>
            <surname>Webber</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A programmatic introduction to Neo4j</article-title>
          .
          <source>In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity. ACM</source>
          ,
          <fpage>217</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Wigdor</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dennis</given-names>
            <surname>Wixon</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Brave NUI world: designing natural user interfaces for touch and gesture</article-title>
          . Elsevier.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>