<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Software Architecture for Multimodal Semantic Perception Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Luca Buoncompagni</string-name>
          <email>luca.buoncompagni@edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Alessandro Carfì</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fulvio Mastrogiovanni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bioengineering, Robotics and Systems Engineering, University of Genoa</institution>
          ,
          <addr-line>Via Opera Pia 13, 16145, Genoa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Robots need advanced perceptive systems to interact with the environment and with humans. The integration of different perception modalities increases the system's reliability and provides a richer environmental representation. This article proposes a general-purpose architecture to fuse semantic information extracted by different perceptive modules. Moreover, the article describes a mockup implementation of our general-purpose architecture that fuses geometric features, computed from point clouds, with Convolutional Neural Network (CNN) classifications, based on images.</p>
      </abstract>
      <kwd-group>
        <kwd>robot perception</kwd>
        <kwd>multimodal perception</kwd>
        <kwd>multimodal fusion</kwd>
        <kwd>late fusion</kwd>
        <kwd>semantic perception</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Multimodal perception has gained much attention, both for its bio-inspired nature
and for the benefits it can provide in terms of reliability and richness of
information. Indeed, the integration of multiple perception modalities can
increase the reliability of shared information while adding to the final
representation information exclusive to a particular modality. Robotic systems are an
interesting application scenario for multimodal perception, since they typically
have different sensors that can be integrated to enhance the robot's understanding
of the environment.</p>
      <p>
        The multimodal perception paradigm requires a fusion process integrating
information from all the modalities; an extensive overview of fusion techniques
is presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The fusion process can be performed at the feature level, i.e., early
fusion, or at the decision level, i.e., late fusion [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In early fusion, features extracted from
the raw data are combined and then analysed as a whole; on the contrary, in
late fusion, the outputs of all the perceptive modules are merged to obtain the
final output. Both late [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and early [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] fusion have been used in robotics for
multimodal recognition of objects. Late fusion offers particular advantages in
terms of modularity.
      </p>
      <p>[Figure 1: architecture overview. The perception modules M1 … Mm («meta
node») process raw data and provide items' features through the outputs O1 … Om
to the Features Selector node, which produces the items union R and the items
intersection F; the Correlation Table Manager node computes the correlation
tables T; the Reasoner node produces the matching items indexes U; the Features
Matcher node fuses the correlated items into the output P.]</p>
      <p>Each time a new sensor is installed, the module processing its data can be
easily integrated into the system. Furthermore, this approach encourages
reusability: when a well-known technique to extract information from a sensor is
available, it can easily be adapted to the particular use case.</p>
      <p>To enhance the modularity and reusability of code in robotics, we propose an
architecture for multimodal perception using late fusion. Late fusion requires a
common representation to be shared among all the module outputs. Because of
its intuitiveness, we have designed a semantic representation in which each item
detected by the perception modules is associated with a list of semantic
characteristics, which in this paper are simply named features. The architecture
uses features shared between different modalities to correlate items.</p>
    </sec>
    <sec id="sec-1a">
      <title>A Modular Software Architecture Overview</title>
      <p>The proposed architecture (an implementation is available at
https://github.com/EmaroLab/mmodal_perception_fusion), shown in Figure 1,
performs a late fusion of distinct perception modules, resulting in a structure P
provided as output. The perceptive modules {M_i, ∀ i ∈ [1…m]} have an
unconstrained input interface I_i and a well-defined output structure O_i. In
particular, M_i generates a set of semantic items X_ij ∈ O_i described by features
through a map ⟨v_ij⟩_s that relates a semantic key (s ∈ S_i) to a value (v_ij^s),
as shown in Table 1. Remarkably, we assume that in all key-value maps the keys
are unique, and we define the set</p>
      <p>containing the semantic keys of the whole system as S = ∪_{i=1}^m S_i. The
features describing an item X_ij span a subset of S; note that a given value
v_ij^s might not exist (∄ v_ij^s). Finally, the output P has the same structure
as O_i but, while the latter contains key-value maps generated by a single
module, P is created by the merging process, possibly using features from all
the perception modalities.</p>
      <p>
        The key-value structure is expressive, flexible, and suitable as input for further
symbolic reasoning, e.g., based on the Web Ontology Language (OWL), which is
compatible with the Robot Operating System (ROS) through a bridge presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Indeed, each feature of a perceived item is represented with a semantic key, which
belongs to the symbolic domain (i.e., it is encoded as a string), and a value, which
can be a boolean, a real or natural number, or another symbol,
e.g., X_ij = {⟨radius, 0.3⟩, ⟨cluttered, true⟩, ⟨color, red⟩}.
      </p>
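      <p>For concreteness, such an item can be sketched as a plain key-value map; the following Python snippet is only illustrative (the variable names are ours, not part of the reference implementation).</p>

```python
# A semantic item X_ij maps semantic keys (strings) to values, which may be
# booleans, numbers, or other symbols (illustrative sketch, not the repo's API).
x_ij = {"radius": 0.3, "cluttered": True, "color": "red"}

# The set of semantic keys S_i used by module M_i is then simply:
s_i = set(x_ij.keys())
print(sorted(s_i))  # ['cluttered', 'color', 'radius']
```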
      <p>The architecture interfaces with the perception modules through the Features
Selector, which manages the synchronisation of the incoming data and generates
R and F, where R is the union of all the perceived items and F is a structure
containing only the values with shared keys. The Correlation Table Manager
computes the correlation tables T as a function of the distance between features,
considering only the features contained in F. This map is used by the Reasoner
to identify lists of items that can be merged, and the corresponding item indexes
are stored in U. Finally, the Features Matcher uses the indexes stored in U to
fuse correlated items and provides as output a set of new items P.</p>
    </sec>
    <sec id="sec-1b">
      <title>Software Interfaces for Multimodal Perception Fusion</title>
      <p>As described in the previous section, the proposed architecture is designed to
work with modules that provide outputs through the O_i interface, which is
formally defined as O_i = {X_ij, ∀ j ∈ [1…n(i)]}, where n(i) represents the
number of items perceived by the i-th module at some instant of time, and each
item is represented with a map of features X_ij = ⟨v_ij⟩_s. Given the outputs O_i
of the different modules, we define their union as the concatenation of all the
items perceived</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Examples of semantic features (s) and values (v_ij^s) describing perceived items.</p>
        </caption>
        <table>
          <thead>
            <tr><th>time [h:m:s.ms]</th><th>position [m]</th><th>shape</th></tr>
          </thead>
          <tbody>
            <tr><td>09:37:45.92</td><td>(.42, .13, .04)</td><td>sphere</td></tr>
            <tr><td>09:37:46.03</td><td>(.37, -.21, .02)</td><td>cylinder</td></tr>
            <tr><td>09:37:46.85</td><td>(.31, -.22, .03)</td><td/></tr>
            <tr><td>09:37:47.35</td><td>(.17, .34, .04)</td><td/></tr>
            <tr><td>09:37:46.20</td><td>(.45, .11, .05)</td><td/></tr>
            <tr><td>09:37:46.31</td><td>(.21, .33, .03)</td><td/></tr>
            <tr><td>09:37:46.37</td><td>(.34, -.19, .02)</td><td/></tr>
            <tr><td>09:37:46.42</td><td>(.31, -.22, .03)</td><td/></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>by all the modules, i.e.,</p>
      <p>R := ∪_{i=1}^m O_i = {X_ij : ∀ i ∈ [1…m], j ∈ [1…n(i)]}.</p>
      <p>On the other hand, we define the intersection operator as the collection of pairs
of items X_hq and X_kp where all the features related to non-common keys are
removed, and the remaining values referring to the common keys, v_hq^z and
v_kp^z with z ∈ Z_{hq,kp} = {s : ∀ s ∈ S, ∃ v_hq^s, v_kp^s ∈ R, h ≠ k} ⊆ S,
are structured as H_{hq,kp}^z = (⟨v_hq⟩_z, ⟨v_kp⟩_z). Finally, the intersection
is defined as</p>
      <p>F := ∩_{i=1}^m O_i = {H_{hq,kp}^z : ∀ z ∈ Z_{hq,kp}, k, h ∈ [1…m],
q ∈ [1…n(h)], p ∈ [1…n(k)]}.</p>
      <p>Remarkably, our architecture correlates items perceived by different modules
based on features with common semantic keys. In particular, if H_{hq,kp}^z = ∅,
the hq-th and kp-th items cannot be directly correlated and, if F = ∅, no items
can be correlated.</p>
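      <p>Assuming the same illustrative key-value encoding of items, the union R and the intersection F can be sketched as follows; the dictionary layout and the function names are our assumptions, not the repository's API.</p>

```python
def union(outputs):
    """R: concatenate all items X_ij from all modules, indexed by (i, j)."""
    return {(i, j): item
            for i, module_items in outputs.items()
            for j, item in module_items.items()}

def intersection(outputs):
    """F: for each pair of items from different modules, keep only the
    values whose semantic keys are shared (the set Z_hq,kp)."""
    r = union(outputs)
    f = {}
    for (h, q), item_hq in r.items():
        for (k, p), item_kp in r.items():
            if h >= k:  # pairs from different modules only, each pair once
                continue
            shared = set(item_hq) & set(item_kp)  # Z_hq,kp
            if shared:
                f[(h, q, k, p)] = {z: (item_hq[z], item_kp[z]) for z in shared}
    return f

# Two modules (m = 2), one item each; they share the keys time and position.
outputs = {1: {1: {"time": 1.0, "position": (0.4, 0.1), "shape": "sphere"}},
           2: {1: {"time": 1.1, "position": (0.4, 0.1), "label": "ball"}}}
f = intersection(outputs)
print(set(f[(1, 1, 2, 1)]))  # {'time', 'position'} (set order may vary)
```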
      <p>Let Φ = {φ_z, ∀ z ∈ Z_{hq,kp}} be a set of distance functions associated with
the hq-th and kp-th items; thus, each distance can be computed as
φ_z(v_hq^z, v_kp^z) = d_{hq,kp}^z ∈ [0, ∞). We define the correlation score
between the hq-th and kp-th items as
f_{hq,kp} = −tanh( Σ_z d_{hq,kp}^z / w ) + 1 ∈ [0, 1];
in this way, low distance values are mapped to high correlation scores, and w is
a parameter that can be tuned to modulate the behaviour of the mapping function.
Through the computation of f_{hq,kp} for all the pairs of perceived items in F,
we obtain a set of tables T = {T_hk, ∀ h, k ∈ [1…m], h ≠ k} (thus T collects
m(m−1)/2 tables), where T_hk is a table of size n(h) × n(k).</p>
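      <p>The correlation score can be sketched as a small function; note that the exact analytic form had to be reconstructed from the surrounding text (a score in [0, 1], low distances mapped to high scores, w as a tuning parameter), so the formula below is an assumption rather than the paper's verbatim expression.</p>

```python
import math

def correlation_score(distances, w=1.0):
    """Map non-negative per-key feature distances d^z to a score in [0, 1]:
    zero total distance gives 1, large distances approach 0.
    Reconstructed form: f = 1 - tanh(sum(d) / w); w modulates the decay."""
    return 1.0 - math.tanh(sum(distances) / w)

# Identical features correlate maximally; distant ones approach zero.
assert correlation_score([0.0, 0.0], w=2.0) == 1.0
assert correlation_score([2.0, 3.0], w=2.0) < correlation_score([0.5], w=2.0)
```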
      <p>The system uses the correlation tables T as a grounded representation to
reason on the best matching among the X_ij items. Such a reasoning generates a
set U = {U_e, ∀ e ∈ [1…g]}, where g is the number of objects perceived by the
architecture (i.e., real objects), and U_e is a list of ij-th indexes associated
with the l-th items that can be merged to describe the e-th real object, i.e.,
U_e = ⟨i, j⟩_l. From R we extract all the l-th items {X_ij, ∀ ⟨i, j⟩ ∈ U_e},
which have z-th shared and y-th unique features. Fusing the l-th items generates
P_e = ⟨v_e⟩_z ∪ ⟨v_e⟩_y, where a fusing function is used to compute v_e^z from
{v_ij^z, ∀ ⟨i, j⟩ ∈ U_e} and v_e^y from {v_ij^y, ∀ ⟨i, j⟩ ∈ U_e}.
Finally, the architecture output is P = {P_e, ∀ e ∈ [1…g]}.</p>
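      <p>The fusion of the items indexed by U_e into a new item P_e can be sketched as follows; the helper name and the averaging function passed to it are illustrative assumptions (the mockup implementation described later uses, e.g., the geometric mean for time and position).</p>

```python
def fuse(items, fuse_fn):
    """Fuse a list of items (key-value maps) matched as the same real object.
    Shared keys are combined with fuse_fn (the paper's fusing function);
    unique keys are copied as-is. Illustrative sketch only."""
    merged = {}
    keys = set().union(*items)  # all semantic keys over the matched items
    for key in keys:
        values = [item[key] for item in items if key in item]
        merged[key] = values[0] if len(values) == 1 else fuse_fn(values)
    return merged

# Two matched items: "time" is shared (averaged), the rest is unique.
p_e = fuse([{"time": 2.0, "shape": "sphere"}, {"time": 4.0, "label": "ball"}],
           fuse_fn=lambda vs: sum(vs) / len(vs))
# → {'time': 3.0, 'shape': 'sphere', 'label': 'ball'} (key order may vary)
```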
    </sec>
    <sec id="sec-2">
      <title>Implementation</title>
      <p>
To provide an application example, we have built an implementation that uses
images and point clouds to detect objects in a tabletop scenario (as shown in
Figure 2). The architecture has been implemented using the ROS
middleware, specifically with two perception modules (i.e., m = 2): M1 and M2. The
point clouds are processed by M1 with a stack of RANSAC simulations to
segment the objects lying on the table [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Each j-th item perceived by M1 can
be described by one or more of the features contained in S1 = {time, shape,
position, orientation, radius, high, vertex}. On the other hand, M2
exploits a Convolutional Neural Network (CNN) from the TensorFlow tutorial [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to
detect objects and assign them a describing label. Each j-th item perceived by
M2 can be described by one or more of the features contained in S2 = {time,
label, position}. Therefore, the common features of objects detected by the two
modules are contained in Z_{1p,2q} = {time, position}.
      </p>
      <p>The correlation table T12 has been computed as described above, while the
two φ_z functions have been defined as Euclidean distances. To finally merge the
information from M1 and M2, we have used an algorithm that explores T12 to
find the row and column indexes of the cells containing a high correlation score.
The algorithm ensures that each index cannot occur twice in U_e (i.e., each
object detected by M1 is associated with at most one object detected by M2), and
conflicts are resolved by prioritising higher correlation scores. Finally, to
merge all the objects, we have defined the fusing function for time and position
as the geometric mean.</p>
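      <p>The conflict-resolution strategy described above can be sketched as a greedy search over T12; the function below is an illustrative reconstruction under stated assumptions (a score threshold, highest scores first), not the actual algorithm from the repository.</p>

```python
def greedy_match(table, threshold=0.5):
    """Greedily pick (row, col) pairs from a correlation table T12,
    highest score first, so that no row or column index occurs twice;
    cells below the threshold are never matched. Illustrative sketch."""
    cells = sorted(((score, r, c)
                    for r, row in enumerate(table)
                    for c, score in enumerate(row)),
                   reverse=True)
    used_rows, used_cols, pairs = set(), set(), []
    for score, r, c in cells:
        if score >= threshold and r not in used_rows and c not in used_cols:
            pairs.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return pairs

# Row 1 correlates with both columns; the higher score (0.9) wins row 0,
# so row 1 falls back to column 1 despite its 0.8 score for column 0.
t12 = [[0.9, 0.2],
       [0.8, 0.7]]
print(greedy_match(t12))  # [(0, 0), (1, 1)]
```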
    </sec>
    <sec id="sec-3">
      <title>Discussions and Conclusions</title>
      <p>The paper proposed a general-purpose architecture for late semantic fusion.
Indeed, it can accommodate an arbitrary set of perception modules that process
different data sources, provided that they generate a specific type of outcome,
defined through the semantic items' features. These semantic structures are
flexible, and the architecture uses them to correlate items perceived by
different modules, providing a fused representation as output.</p>
      <p>The architecture relies on the distance between shared features to compute
the correlation between items, and it requires a reasoner for item matching and
a function for item fusion. We analysed how to orchestrate such elements in a
general scenario, and we presented a simple implementation based on RANSAC
and CNNs.</p>
      <p>We argued that the general case requires a further investigation of the
distance functions between complex features (e.g., colour, shape, etc.), as well
as of the types of reasoning to be performed with the computed correlation
tables. On the other hand, such tables are expressive, allowing complex
decisions for item fusion. For example, they contain all the information needed
to merge objects with partially shared features through transitivity properties.
Future developments of this work will include a wider integration of perceptive
modules and an experimental evaluation of the architecture.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irving</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Tensorflow: a system for large-scale machine learning</article-title>
          .
          <source>In: OSDI</source>
          . vol.
          <volume>16</volume>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aldoma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tombari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prankl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richtsfeld</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Stefano</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vincze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multimodal cue integration through hypotheses verification for rgb-d object recognition and 6dof pose estimation</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)</source>
          . pp.
          <fpage>2104</fpage>
          -
          <lpage>2111</lpage>
          . IEEE, Karlsruhe, Germany (May
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Atrey</surname>
            ,
            <given-names>P.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hossain</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El Saddik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kankanhalli</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          :
          <article-title>Multimodal fusion for multimedia analysis: A survey</article-title>
          .
          <source>Multimedia Systems</source>
          <volume>16</volume>
          (
          <issue>6</issue>
          ) (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Buoncompagni</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Capitanelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mastrogiovanni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A ROS multi-ontology references services: OWL reasoners and application prototyping issues</article-title>
          .
          <source>In: Proceedings of the 5th Italian Workshop on Artificial Intelligence and Robotics (AIRO) A workshop of the XVII International Conference of the Italian Association for Artificial Intelligence. CEUR-WS</source>
          , Trento, Italy (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buoncompagni</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mastrogiovanni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A software architecture for object perception and semantic representation</article-title>
          .
          <source>In: Proceedings of the 2nd Italian Workshop on Artificial Intelligence and Robotics (AIRO) A workshop of the XIV International Conference of the Italian Association for Artificial Intelligence</source>
          . vol.
          <volume>1544</volume>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>124</lpage>
          . CEUR-WS, Ferrara, Italy (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Eitel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spinello</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgard</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Multimodal deep learning for robust rgb-d object recognition</article-title>
          .
          <source>In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          . pp.
          <fpage>681</fpage>
          -
          <lpage>687</lpage>
          . IEEE, La Jolla, California, USA (October
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          :
          <article-title>Early versus late fusion in semantic video analysis</article-title>
          .
          <source>In: Proceedings of the 13th annual ACM international conference on Multimedia</source>
          . pp.
          <fpage>399</fpage>
          -
          <lpage>402</lpage>
          . ACM, Singapore
          (November
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>