<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-modal Sense-Making</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro OLTRAMARI</string-name>
          <email>Alessandro.Oltramari.ext@us.bosch.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bosch Research and Technology Center</institution>
          ,
          <addr-line>Pittsburgh</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bosch Research and Technology Center</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        By learning how to make sense of the environment we live in, we humans survived
in the wilderness: escaping predators, enduring natural catastrophes and epidemics, and
overcoming the intrinsic limitations of our own species. But what is this
“sensemaking” capability, anyway? Although at first sight the notion may sound naive, it can
be traced back to Newell and Simon’s theory of cognition [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]: through sensory
stimuli we accumulate experiences, generalize and reason over them, “storing” the
resulting knowledge in long-term memory; the dynamic combination of live
experience and knowledge during task execution enables us to make time-sensitive
rational decisions, evaluating how good (or bad) a decision was by factoring in
feedback from the environment.
      </p>
      <p>
        Do you see any artificial being around that exhibits these properties? Of course
you don’t: despite the progress robotics has made over the last decade (e.g., some
of the most dexterous results being BigDog [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Baxter [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), embodied AI is still in
its infancy. Let me manage expectations, then, and reframe the question without the
burden of embodiment: are you aware of any AI system capable of processing
multimodal sensor data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in real time, identifying with a high degree of accuracy both the events
represented in the data (e.g., a gunshot, people taking cover) and their context of
reference (e.g., an armed robbery in a bank)? Well, if you attended TriCoLoRe 2018, you may
have heard my own answer to that question: we are getting there, but there are still
significant gaps that need to be filled.
      </p>
      <p>
        First of all, the explosion of deep neural networks, namely networks with a
large number of intermediate layers, has extended the breadth and depth of machine
learning, to the point that these algorithms, running on powerful GPU clusters, play a
major role in the “artificial brains” of self-driving cars [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. But, regardless of hype [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
and dramatic setbacks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], these systems are de facto not reliable: weather conditions,
anomalous behavior of vehicles and pedestrians, street lighting, and all sorts of
adversarial situations that the environment can naturally present have experimentally
demonstrated how error-prone deep learning solutions still are. These limits, though,
shouldn’t really surprise us once we recognize that autonomous vehicles lack
“sensemaking” capabilities. No human driver relies exclusively on her senses behind the
wheel! The decisions we make are the result of a continuous evaluation of the context,
in which perceptual cues are constantly (and seemingly unconsciously) combined with
background knowledge of the surroundings and with common sense: for instance, driving in
an area where college students go clubbing on a Friday night requires extra attention to
the erratic behavior of possibly intoxicated jaywalkers. And if you don’t brave the weekend
nightlife, the following examples may sound more familiar: residents generally know
which areas of the neighborhood might have icy road conditions on a frigid winter day,
where in the city flooding is more frequent after a powerful storm, which streets are
more likely to have kids playing around after school, and which intersections tend to
have poor lighting. Currently, this type of common knowledge is not being used to
assist self-driving cars.
      </p>
      <p>
        Like humans, machines can only make sense of the intricacies of physical and
social reality by combining perception and knowledge. This is not just a theoretical
tenet: from an empirical standpoint, the performance of purely data-driven AI is close
to reaching a plateau. Knowledge is required not only for complex tasks like autonomous
driving [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or natural language understanding [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], but also for relatively simpler
applications. For instance, the company Vicarious showed that a system trained on
only 1,406 images, but endowed with spatial knowledge, can break captchas with
significantly higher accuracy than state-of-the-art deep neural nets trained with ~8
million images.
      </p>
      <p>Far from being a new idea, but definitely boosted by the technological
breakthroughs of our era, the integration of symbolic knowledge and
subsymbolic learning is poised to become important in AI again. I firmly believe that
multi-modal sense-making will serve as a testbed for such integration.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Newell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Simon</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <year>1972</year>
          .
          <article-title>Human problem solving</article-title>
          (Vol.
          <volume>104</volume>
          , No. 9). Englewood Cliffs, NJ: Prentice-Hall.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Newell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>1994</year>
          .
          <article-title>Unified theories of cognition</article-title>
          . Harvard University Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] https://www.bostondynamics.com/bigdog</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] https://robots.ieee.org/robots/baxter/</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] https://www.heykuri.com/</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Baltrušaitis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahuja</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <year>2019</year>
          .
          <article-title>Multimodal machine learning: A survey and taxonomy</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>41</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>423</fpage>
          -
          <lpage>443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Bojarski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Testa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dworakowski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flepp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jackel</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monfort</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muller</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>End to end learning for self-driving cars</article-title>
          .
          <source>arXiv preprint arXiv:1604.07316</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] https://www.nytimes.com/2018/11/27/business/self-driving-cars-autonomous-vehicles.html
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] https://www.nytimes.com/interactive/2018/03/20/us/self-driving-uber-pedestrian-killed.html
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ichise</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mita</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sasaki</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>Core Ontologies for Safe Autonomous Driving</article-title>
          . In International Semantic Web Conference (Posters &amp; Demos).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>McShane</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>2017</year>
          .
          <article-title>Natural language understanding (NLU, not NLP) in cognitive systems</article-title>
          .
          <source>AI Magazine</source>
          ,
          <volume>38</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>43</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>