<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modeling of Human Mental-Image Based Understanding of Spatiotemporal Language for Intuitive Human-Robot Interaction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rojanee Khummongkol</string-name>
          <email>rojanee.kh@up.ac.th</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Engineering, University of Phayao</institution>
          ,
          <addr-line>Phayao</addr-line>
          ,
          <country country="TH">Thailand</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Masao Yokota Department of System Management, Fukuoka Institute of Technology</institution>
          ,
          <addr-line>Fukuoka</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <fpage>29</fpage>
      <lpage>46</lpage>
      <abstract>
        <p>Mental Image Directed Semantic Theory (MIDST) has proposed a model of human mental imagery and its description language, Lmd. Lmd is a knowledge representation language that has already been applied to integrative multimedia understanding intended to facilitate intuitive human-robot interaction, especially language-centered interaction between ordinary people and home robots. The most remarkable feature of Lmd is its capability of formalizing spatiotemporal events in good correspondence with human/robotic sensations and actions, which can lead to integrative computation of sensory, motory and conceptual information. This paper sketches MIDST and its application, namely the natural language understanding system named conversation management system (CMS), which is intended to simulate human mental-image based understanding of natural language, and overviews related work. CMS was evaluated in a psychological experiment and showed good agreement with human subjects in answering questions about stimulus sentences, which inevitably involved spatiotemporal reasoning.</p>
        <p>2012 ACM Subject Classification: General and reference → General literature; General and reference. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of Speaking of Location 2019: Communicating about Space, Regensburg, Germany, September 2019. Editors: K. Stock, C.B. Jones and T. Tenbrink. Published at http://ceur-ws.org</p>
      </abstract>
      <kwd-group>
        <kwd>Natural language understanding</kwd>
        <kwd>Mental image model</kwd>
        <kwd>Human-robot interaction</kwd>
        <kwd>Knowledge representation</kwd>
        <kwd>Spatiotemporal reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>For ordinary people, natural language (NL) is the most important of the various
communication media because the syntax and semantics shared by its users allow it to convey
the exact intention of the sender to the receiver. This is not necessarily the case for
other media, such as gesture, and so NL can also play the most crucial role in the intuitive
human-robot interaction (iHRI) intended here and shown in Figure 1. This figure implies
that the robot should find and solve problems in a knowledge representation language
(KRL) while communicating with the human in NL. In such a scenario,
the robot must be provided with a very powerful artificial intelligence (AI) for integrative
comprehension of perceptual information (i.e., sensory or motory data) and conceptual
information (i.e., lexical knowledge or world knowledge). In particular, its capability of
natural language understanding (NLU) (or more broadly, natural language processing (NLP))
should be much more cognitively elaborated than the conventional approaches (e.g., [26];
[19]; [7]; [27]) in order to cope with the symbol grounding problem ([8]).</p>
      <p>In the field of ontology, special attention has been paid to spatial (more exactly,
spatiotemporal) language covering geography because its constituent concepts stand in highly
complex relationships to the underlying physical reality, accompanied by fundamental issues
of human cognition (for example, ambiguity, vagueness, temporality, identity, ...)
that appear in varied and subtle expressions ([9]). For facilitating iHRI, spatial language is also the
most important of all sublanguages, especially when the human and the robot must share knowledge
of the spatial arrangement of household objects such as desks and tables.</p>
      <p>As is well known, people do not perceive the external world as it is, which naturally leads
to human-specific cognition and conception of the external world. For example, as shown in
Figure 2, people often perceive continuous forms among separately located objects, so-called
spatial gestalts in the field of psychology, and refer to them with such an expression as ‘Nine
disks are placed in the shape of X’. As another example, people intuitively and easily
understand that the following expressions S1 and S2 describe the same scene in the
external world. This is also the case for S3 and S4.</p>
      <p>(S1) The path sinks to the brook.
(S2) The path rises from the brook.
(S3) The roads meet there.
(S4) The roads separate there.</p>
      <p>It is, however, extremely difficult for robots to reach such a paradoxical understanding in
a systematic way because these expressions are assumed to reflect not so much purely
objective geometrical relations as human mental activity in the cognition of the
objects involved, inevitably employing mental image operations (e.g., [28], [30], [29], [32]).
However, most conventional approaches to spatial language understanding have focused on
computing ostensible geometric relations (i.e., topological, directional and metric relations)
conceptualized as spatial prepositions and the like, considering some properties or functions of
the objects involved (e.g., [5]; [16]; [2]). From the semantic viewpoint, spatial expressions
have the virtue of relating in some way to the visual scenes being described. Therefore, their
semantic descriptions can be grounded in perceptual representations that are, possibly, cognitively
inspired and that cope with all kinds of spatial expressions, including verb-centered ones such as
S1-S4 as well as preposition-centered ones. In particular, these verb-centered expressions are
assumed to strongly reflect a certain dynamism in human perception of the objects involved.
This implies that conventional approaches to spatial language understanding will inevitably
lead to a serious cognitive divide between humans and robots that causes miscommunication
between them. That is, AI should be more cognitive ([28]; [24]).</p>
      <p>Judging from our own psychological experiences, mental images must be deeply involved in our
thinking. It is considered that people can create fictive or non-veridical stories thanks to
mental images independent of the real world. More practically, it is quite ordinary to understand
a spatiotemporal (or 4D) expression in NL with the mental image of the scene
described by it. Therefore, such a human mental process is worth simulating by computer
in order to facilitate iHRI.</p>
      <p>MIDST (Mental Image Directed Semantic Theory) (e.g., [28]; [32]) has proposed a
dynamic model of human sensory cognition yielding an omnisensory image of the world. In
MIDST, natural event concepts (i.e., event concepts in NL) are classified into two
categories, ‘Temporal Change Events’ and ‘Spatial Change Events’. These are defined
as temporal and spatial changes (or constancies) in certain attributes of physical objects,
respectively, with S1-S4 included in the latter. Both types of events are uniformly
analyzable as temporally parameterized loci in attribute spaces, to be described
in a logical form called a ‘locus formula’. MIDST has already been applied to several types
of computerized intelligent systems (e.g., [28]; [11]), and there is a feedback loop between
the theory and these systems for their mutual refinement.</p>
      <p>This paper sketches MIDST and our NLU system named conversation management system
(CMS), intended to simulate human mental-image based understanding of natural language,
together with its evaluation based on a psychological experiment. The remainder of this paper is
organized as follows. Section II considers human mental-image based understanding of
natural language and Section III presents a brief description of MIDST. Section IV describes
the methodology for NLU based on MIDST. Section V gives a brief description of CMS and
its evaluation based on a psychological experiment. Lastly, Section VI concludes this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Mental-image based NLU</title>
      <p>For example, read the assertion S5 and answer the questions S6 and S7. Almost
without exception, we cannot answer these questions correctly without reasoning based on the
mental images evoked by these expressions.</p>
      <p>(S5) Mary was in the tram heading for the town. She had a bag with her.
(S6) Was the tram carrying Mary?
(S7) Was the bag heading for the town?</p>
      <p>This kind of reasoning is considered to be among those required for the Winograd
Schema Challenge (WSC) ([15]), which would defeat conventional NLU systems adapted
for the Turing Test, even the renowned quiz champion AI, Watson by IBM ([6]). The WSC
is a more cognitively inspired variant of the Textual Entailment Challenge ([3]), designed to eliminate
cheap tricks intended to pass the Turing Test, which is essentially based on behaviorism in
the field of psychology.</p>
      <p>There are a considerable number of cognitively motivated studies on NL semantics or
pragmatics that are explicitly or implicitly associated with mental images (e.g., [14]; [17]; [21]; [22];
[13]). However, almost none of them are intended for NLU, because NLU inevitably requires
systematic methodologies for both representation and computation of mental imagery.
Likewise, much interesting research on mental images themselves in association with human
thinking modes has been reported from various fields ([23]), but none of it is from the
viewpoint of NLU, either.</p>
      <p>In contrast, MIDST is intended for systematic representation and
computation of NL semantics and, more broadly, of human knowledge grounded in the world through
mental images.</p>
      <p>Our mental images of the external world are originally acquired through our inborn
sensory systems, and therefore it is worth considering our perceptual processes. As already
mentioned, we do not perceive the external world as it is. That is, our perception does not
begin with objective data gained through artificial sensors but with subjective sensation
intrinsically (or subconsciously) articulated with the contours of the objects involved and the gestalts
among them, as shown in Figure 2. Here, this kind of articulation is called intrinsic
articulation, attributed to our subconscious propensities toward the external world. Then, at the
next stage, as active perception, we consciously direct our attention to elaborate (or calibrate)
the intrinsic articulation by reasoning based on various kinds of knowledge and arrive at an
interpretation of the sensation as a spatiotemporal relation among its significant portions or
constituents. Here, this elaboration of intrinsic articulation is called semantic articulation,
which is the particular concern of MIDST. The neural network architectures prevailing today are
based on simple-minded algorithms and essentially provide a machine with intrinsic
articulation but not semantic articulation of the stimuli presented.</p>
      <p>Surveying conventional methodologies for robotic NLU, almost all of them have provided
robotic systems with such quasi-natural language expressions as ‘move(Velocity, Distance,
Direction)’, ‘find(Object, Shape, Color)’, etc., for human instruction or suggestion, uniquely
related to computer programs for deploying sensors/motors as their semantics (e.g., [1]; [4]).
These expression schemas, however, are too linguistic or coarse to represent and compute
sensory/motory events in the integrative way required for the intuitive human-robot interaction
intended here. This is also the case for AI planning (‘action planning’), which deals with the
development of representation languages (i.e., KRLs) for planning problems and with the
development of algorithms for plan construction ([25]).</p>
      <p>In order to tackle a complex problem domain, the first thing to do is to design or select
a KRL suitable for constructing a well-structured problem formulation, namely, a
representation. Among conventional KRLs, those based on first-order logic have been
the most prevalent because of the good availability of deductive inference engines built into
computer languages such as Prolog. According to these schemes, for example,
the semantic relation between ‘x carry y’ and ‘y move’ is often translated into such
a representation as (∀x, y)(carry(x, y) ⊃ move(y)). As easily imagined, such a declarative
definition will enable an NLU system to answer correctly such a question as “When Jim
carried the box, did it move?” but it will be of no use for a robot that must recognize or produce any
external event referred to by ‘x carry y’ or ‘y move’ in a dynamic and incompletely known
environment, unlike Winograd’s block world ([26]). That is, this type of logical expression
as it stands can give only combinations of dummy tokens at best. For example, carry(x, y) and
move(y) are substitutable with w013(x, y) and w025(y), respectively, which do not represent
any word concepts or meanings at all but are merely the coded names of such concepts or meanings.
If you find any inconvenience with this kind of substitution, that is because the coded names lack the
symbol grounding ([8]) that your lexical knowledge of English provides. Schank’s Conceptual Dependency
theory ([20]) was an attempt to decrease paraphrastic variety in knowledge representation
by employing a small set of coded names of concepts called conceptual primitives, although
its expressive power was very limited.</p>
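      <p>The point about dummy tokens can be made concrete with a minimal sketch. The following Python fragment is an illustration only and is not part of MIDST or CMS; the function name forward_chain and the rule encoding are assumptions introduced here. It derives move(y) from carry(x, y) and then repeats the same inference with the coded names w013 and w025, showing that the inference engine never depends on what the predicate names mean.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical illustration: a tiny forward chainer for the rule
# carry(x, y) ⊃ move(y). All names here are assumptions, not CMS code.

def forward_chain(facts, rules):
    """Apply single-premise rules until no new fact can be derived.

    facts: set of (predicate, args) tuples, e.g. ("carry", ("Jim", "box"))
    rules: list of (premise_predicate, conclusion_predicate, arg_indices),
           where arg_indices selects which premise arguments are passed on.
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion, indices in rules:
            for pred, args in list(derived):
                if pred == premise:
                    new_fact = (conclusion, tuple(args[i] for i in indices))
                    if new_fact not in derived:
                        derived.add(new_fact)
                        changed = True
    return derived


# carry(x, y) ⊃ move(y): pass on the second argument only.
rules = [("carry", "move", (1,))]
facts = {("carry", ("Jim", "box"))}
answer = ("move", ("box",)) in forward_chain(facts, rules)

# The same inference with opaque tokens: nothing in the engine changes,
# so the predicate names carry no grounded meaning by themselves.
coded_rules = [("w013", "w025", (1,))]
coded_facts = {("w013", ("Jim", "box"))}
coded_answer = ("w025", ("box",)) in forward_chain(coded_facts, coded_rules)

print(answer, coded_answer)
```
]]></preformat>
      <p>Both queries succeed in exactly the same way, which is why such a declarative definition alone cannot tell a robot what to sense or do.</p>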
      <p>This fact requires that a cognitive robot be provided with procedural definitions of
word meanings grounded in the external world, as well as declarative ones for reasoning, in
order both to operate its sensors and actuators appropriately and to communicate properly in natural
language with humans. It follows that some interface
must be employed for translation between declarative and procedural definitions of word
meanings, and the problem is how to realize such a translator systematically. Conventional
KRLs, however, do not lend themselves to such systematization because they are not cognitively
designed, namely, not systematically grounded in sensors or actuators. That is, they are not
provided with their semantics explicitly but are implicitly grounded in natural language word
concepts that are interpretable for people but have never been grounded in the world
well enough for robots to cognize their environments or themselves through NL expressions.</p>
    </sec>
    <sec id="sec-3">
      <title>Brief Description of MIDST</title>
      <p>MIDST has proposed a dynamic model of human sensory cognition, yielding an omnisensory
image of the world, and its description language named ‘mental image description language’
(Lmd). This formal language is a KRL based on predicate logic. In MIDST,
omnisensory mental images are modeled as “Loci in Attribute Spaces”. An attribute space
corresponds to a certain measuring instrument, such as a barometer or a thermometer,
and the loci represent the movements of its indicator.</p>
      <p>For example, the moving red triangular object shown in Figure 3 is assumed to be
perceived as loci in three attribute spaces, namely those of ‘Location’, ‘Color’ and
‘Shape’, in the observer’s brain as the result of intrinsic articulation. At the next stage,
semantic articulation, a general locus is articulated by “Atomic Loci” as depicted in
Figure 4 and formulated as (1).</p>
      <p>L(x, y, p, q, a, g, k) (1)</p>
      <p>The intuitive interpretation of (1) is as follows: “Matter ‘x’ causes Attribute ‘a’
of Matter ‘y’ to keep (p = q) or change (p ≠ q) its values temporally (g = Gt) or
spatially (g = Gs) over a time-interval, where the values ‘p’ and ‘q’ are relative
to the standard ‘k’.”</p>
      <p>When g = Gt, the locus indicates monotonic change or constancy of the attribute in
the time domain, and when g = Gs, in the space domain. The former is called a
‘temporal change event’ and the latter a ‘spatial change event’. For example, the motion of
the ‘bus’ described by S8 is a temporal change event and the ranging or extension of the
‘road’ described by S9 is a spatial change event; their meanings or concepts are formulated as (2) and
(3), respectively, where ‘A12’ denotes the attribute ‘Physical Location’. These two formulas
differ only in the term ‘Event Type’ (= g).</p>
      <p>(S8) The bus runs from Tokyo to Osaka.</p>
      <p>(∃x, y, k)L(x, y, Tokyo, Osaka, A12, Gt, k) ∧ bus(y) (2)</p>
      <p>(S9) The road runs from Tokyo to Osaka.</p>
      <p>(∃x, y, k)L(x, y, Tokyo, Osaka, A12, Gs, k) ∧ road(y) (3)</p>
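      <p>As a minimal sketch of how the atomic locus L(x, y, p, q, a, g, k) might be held as a data structure, the following Python fragment encodes the loci of (2) and (3). The class name AtomicLocus, the helper differing_fields and the use of plain strings for matters, values and standards are assumptions made for illustration; they are not taken from the IMAGES or CMS implementations.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, fields, replace

GT = "Gt"  # event type: temporal change event
GS = "Gs"  # event type: spatial change event


@dataclass(frozen=True)
class AtomicLocus:
    """One atomic locus L(x, y, p, q, a, g, k) (hypothetical encoding)."""
    x: str  # matter causing the event
    y: str  # matter whose attribute is concerned
    p: str  # initial value of the attribute
    q: str  # final value of the attribute
    a: str  # attribute, e.g. "A12" for Physical Location
    g: str  # event type, GT or GS
    k: str  # standard the values are relative to

    def is_change(self):
        # p = q means the value is kept, otherwise it changes.
        return self.p != self.q

    def event_kind(self):
        return "temporal change event" if self.g == GT else "spatial change event"


def differing_fields(first, second):
    """Names of the terms in which two atomic loci differ."""
    return [f.name for f in fields(first)
            if getattr(first, f.name) != getattr(second, f.name)]


# (2) The bus runs from Tokyo to Osaka; (3) the road runs from Tokyo to Osaka.
bus_runs = AtomicLocus("x", "bus", "Tokyo", "Osaka", "A12", GT, "k")
road_runs = replace(bus_runs, y="road", g=GS)

print(bus_runs.event_kind())
print(road_runs.event_kind())
print(differing_fields(bus_runs, road_runs))
```
]]></preformat>
      <p>Apart from the matter concerned (bus or road), the two loci differ only in the event type term g, mirroring the remark about (2) and (3).</p>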
      <p>The formal language Lmd employs ‘tempo-logical connectives’ representing both
logical and temporal relations between loci. Articulated loci are combined with tempo-logical
conjunctions, of which ‘SAND (∧0)’ and ‘CAND (∧1)’ are the most frequently used, standing
for ‘Simultaneous AND’ and ‘Consecutive AND’ and conventionally symbolized as ‘Π’ and ‘·’,
respectively. For example, the expression (4) is the definition of the English verb concept
‘fetch’ depicted in Figure 5. This implies such a temporal change event that x goes for y
and then comes back with it, where the special symbol λ is employed to denote free variables
explicitly.
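(λx, y)fetch(x, y) ⇔ (λx, y)(∃p1, p2, k)L(x, x, p1, p2, A12, Gt, k) · (L(x, x, p2, p1, A12, Gt, k) Π L(x, y, p2, p1, A12, Gt, k)) ∧ x ≠ y ∧ p1 ≠ p2 (4)</p>
      <p>It has been often argued that human active sensing processes may affect perception and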
It has been often argued that human active sensing processes may affect perception and
in turn conceptualization and recognition of the physical world. The difference between
temporal and spatial change event concepts can be attributed to the relationship between the
Attribute Carrier (AC) and the Focus of Attention of the Observer (FAO). To be brief, the
FAO is fixed at one point on the AC in a temporal change event but runs about on the AC
in a spatial change event. Consequently, as shown in Figure 6, the bus and the FAO move
together in the case of S8 while the FAO solely moves along the road in the case of S9. That
is, all loci in attribute spaces correspond one to one with movements or, more
generally, temporal change events of the FAO. This implies that an Lmd expression
can suggest to a robot what should be attended to in its environment, and how.
This is also why S1 and S2 (as well as S3 and S4) can refer to the same scene in spite of their
appearances: what ‘sinks’ or ‘rises’ is the FAO. Their conceptual descriptions are
given as (5) and (6), respectively, where ‘A13’, ‘↑’ and ‘↓’ refer to the attribute ‘Direction’ and
its values ‘upward’ and ‘downward’, respectively. This fact is generalized as the ‘Postulate
of Reversibility of a Spatial Change Event (PRS)’, which can be one of the principal
cognitive laws belonging to people’s intuitive common-sense knowledge about geography.
Such pairs of conceptual descriptions are called equivalent in the PRS, and the paired
sentences are treated as paraphrases of each other.</p>
      <p>(∃x, y, p, z, k1, k2)L(x, y, p, z, A12, Gs, k1) Π L(x, y, ↓, ↓, A13, Gs, k2) ∧ path(y) ∧ brook(z) ∧ p ≠ z (5)</p>
      <p>(∃x, y, p, z, k1, k2)L(x, y, z, p, A12, Gs, k1) Π L(x, y, ↑, ↑, A13, Gs, k2) ∧ path(y) ∧ brook(z) ∧ p ≠ z (6)</p>
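      <p>The PRS can be sketched as an operation on locus descriptions. The Python fragment below is a simplified illustration under stated assumptions: each description is a set of SAND-combined atomic loci, the predicates path(y) and brook(z) are folded into the terms as plain names, and reversing a spatial change event means swapping the start and end values of every locus while inverting the values of the Direction attribute (A13). The names Locus, reverse_event and equivalent_under_prs are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import NamedTuple


class Locus(NamedTuple):
    """Atomic locus L(x, y, p, q, a, g, k) (hypothetical encoding)."""
    x: str
    y: str
    p: str
    q: str
    a: str
    g: str
    k: str


OPPOSITE_DIRECTION = {"down": "up", "up": "down"}


def reverse_locus(locus):
    """Describe the same spatial extent with the FAO running the other way."""
    if locus.g != "Gs":
        raise ValueError("PRS applies to spatial change events only")
    if locus.a == "A13":
        # Direction values flip and also swap their order.
        return locus._replace(p=OPPOSITE_DIRECTION[locus.q],
                              q=OPPOSITE_DIRECTION[locus.p])
    return locus._replace(p=locus.q, q=locus.p)


def reverse_event(loci):
    return frozenset(reverse_locus(locus) for locus in loci)


def equivalent_under_prs(first, second):
    return first == second or reverse_event(first) == second


# (5) The path sinks to the brook.  (6) The path rises from the brook.
s1 = frozenset({
    Locus("x", "path", "p", "brook", "A12", "Gs", "k1"),
    Locus("x", "path", "down", "down", "A13", "Gs", "k2"),
})
s2 = frozenset({
    Locus("x", "path", "brook", "p", "A12", "Gs", "k1"),
    Locus("x", "path", "up", "up", "A13", "Gs", "k2"),
})

print(equivalent_under_prs(s1, s2))
```
]]></preformat>
      <p>Under these assumptions the two descriptions are judged equivalent, so the paired sentences would be treated as paraphrases; a temporal change event is deliberately rejected by reverse_locus.</p>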
      <p>As another example of a spatial change event, Figure 7 concerns the perception of the
formation of multiple isolated objects, where the FAO runs along an imaginary object called an
‘Imaginary Space Region (ISR)’. This spatial event can be verbalized as S10 using the
preposition ‘between’ and formulated as (7) or (8), corresponding also to such concepts as
‘row’, ‘line-up’, etc. Note that the ISR is intended to include spatial gestalts conceptually,
but it is assumed that people can also imagine ISRs consciously or arbitrarily during semantic
articulation.</p>
      <p>(S10) Y is between X and Z.</p>
      <p>(∃x, y, p, q, k1, k2)(L(x, y, X, Y, A12, Gs, k1) Π L(x, y, p, p, A13, Gs, k2)) · (L(x, y, Y, Z, A12, Gs, k1) Π L(x, y, q, q, A13, Gs, k2)) ∧ ISR(y) ∧ p = q (7)</p>
      <p>(∃x, y, p, k1, k2)(L(x, y, Z, Y, A12, Gs, k1) · L(x, y, Y, X, A12, Gs, k1)) Π L(x, y, p, p, A13, Gs, k2) ∧ ISR(y) (8)</p>
      <p>To the best of our knowledge, no other theory or method can provide spatiotemporal
expressions with semantic interpretations in such a systematic way, where both temporal and
spatial change events are simply and adequately formulated by controlling the Event
Type term of the atomic locus formula, reflecting the FAO movement. About 50 attributes (Table 1)
were extracted exclusively from commonly used English and Japanese words contained in
certain thesauri (e.g., [18]). Most of them correspond to the sensory receptive fields in the human
brain. Correspondingly, seven categories of standards (Table 2) were extracted that are
necessary for representing relative values of each attribute.</p>
      <p>These findings imply that ordinary people live their everyday lives attending to tens of
attributes of the matters in the world and cognizing them in comparison with several kinds of
standards. That is, with a verbal hint, a robot can operate its sensors or actuators very efficiently
and economically; otherwise it is extremely difficult for the robot to determine which parts
of its environment are significant for people, because there are too many things to attend
to.</p>
      <p>In MIDST, natural language expressions (i.e., surface structures) and Lmd expressions (i.e.,
conceptual structures) are mutually translatable through the surface dependency structure by
utilizing syntactic rules and word meaning descriptions (e.g., [28]).</p>
      <p>A word meaning description Mw is given by (9) as a pair of ‘Concept Part (Cp)’ and
‘Unification Part (Up)’.</p>
      <p>Mw ⇔ [Cp : Up] (9)</p>
      <p>The Cp of a word W is a logical formula, while its Up is a set of operations for unifying the
Cps of W’s syntactic governors or dependents. For example, the meaning of the English verb
‘x carry y’ is approximately given by (10), where A12 is the attribute ‘Physical Location’.</p>
      <p>[(λx, y)(∃p1, p2, k)L(x, x, p1, p2, A12, Gt, k) Π L(x, y, p1, p2, A12, Gt, k) ∧ x ≠ y ∧ p1 ≠ p2 : ARG(Dep.1, x); ARG(Dep.2, y);] (10)</p>
      <p>The Up above consists of two operations to unify the arguments of the first dependent
(Dep.1) and the second dependent (Dep.2) of the current word with the variables x and
y, respectively. Here, Dep.1 and Dep.2 refer to the ‘subject’ and the ‘object’ of ‘carry’,
respectively. Therefore, the sentence ‘Mary carries a book’ is translated into (11) via the
dependency structure, and vice versa, as depicted in Figure 8.</p>
      <p>(∃y, p1, p2, k)L(Mary, Mary, p1, p2, A12, Gt, k) Π L(Mary, y, p1, p2, A12, Gt, k) ∧ Mary ≠ y ∧ p1 ≠ p2 ∧ book(y) (11)</p>
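      <p>The role of the Unification Part can be sketched in a few lines of Python. This is a simplified illustration, not the IMAGES algorithm: the Cp of ‘carry’ is encoded as a list of tuples, only the ARG operation is handled, and the names CARRY_CP, CARRY_UP and apply_up, as well as the "neq" marker for inequality, are assumptions introduced here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical encoding of the word meaning description (10) for 'carry'.
# Each atom is a tuple; variables are plain strings such as "x" and "y".
CARRY_CP = [
    ("L", "x", "x", "p1", "p2", "A12", "Gt", "k"),
    ("L", "x", "y", "p1", "p2", "A12", "Gt", "k"),
    ("neq", "x", "y"),
    ("neq", "p1", "p2"),
]
CARRY_UP = [("ARG", "Dep.1", "x"), ("ARG", "Dep.2", "y")]


def apply_up(cp, up, dependents):
    """Unify the governor's variables with its dependents' arguments.

    dependents maps a slot name to (term, noun_predicate_or_None).
    Returns the instantiated atoms plus the predicates contributed by nouns.
    """
    binding = {}
    extra = []
    for op, slot, var in up:
        if op != "ARG":
            raise ValueError("only ARG is sketched here")
        term, predicate = dependents[slot]
        binding[var] = term
        if predicate is not None:
            extra.append((predicate, term))
    atoms = [
        (atom[0],) + tuple(binding.get(term, term) for term in atom[1:])
        for atom in cp
    ]
    return atoms + extra


# 'Mary carries a book': Dep.1 is the constant Mary,
# Dep.2 is a variable y constrained by book(y).
formula_11 = apply_up(
    CARRY_CP, CARRY_UP, {"Dep.1": ("Mary", None), "Dep.2": ("y", "book")}
)
for atom in formula_11:
    print(atom)
```
]]></preformat>
      <p>The resulting atoms correspond term by term to (11), with the existential quantification left implicit.</p>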
      <p>For another example, the meaning description of the English preposition ‘x (verb)
through y’ is also approximately given by (12).</p>
      <p>[(λx, y)(∃p1, z, p3, g, k, p4, k0)(L(x, y, p1, z, A12, g, k) · L(x, y, z, p3, A12, g, k)) Π L(x, y, p4, p4, A13, g, k0) ∧ p1 ≠ z ∧ z ≠ p3 : ARG(Dep.1, z); IF (Gov = Verb) → PAT(Gov, (1, 1)); IF (Gov = Noun) → ARG(Gov, y);] (12)</p>
      <p>The Up above is for unifying the Cps of the word itself, its governor (Gov, a verb or a
noun) and its dependent (Dep.1, a noun). The second argument (1,1) of the command PAT
indicates the underlined part of (12) and in general (i, j) refers to the partial formula covering
from the ith to the jth atomic formula of the current Cp. This part is the pattern common
to both the Cps to be unified and is called the ‘Unification Handle (Uh)’; when it is missing, the
Cps are combined simply with ‘∧’.</p>
      <p>Therefore the sentences S11-S13 are interpreted as (13)-(15), respectively. The underlined
parts of these formulas are the results of PAT operations. The expression (16) is the Cp of
the adjective ‘long’ implying ‘there is some value greater than some standard of length (A02)’
which is often simplified as (17).</p>
      <p>(S11) The train runs through the tunnel.</p>
      <p>(∃x, y, p1, z, p3, k, p4, k0)(L(x, y, p1, z, A12, Gt, k) · L(x, y, z, p3, A12, Gt, k)) Π L(x, y, p4, p4, A13, Gt, k0) ∧ p1 ≠ z ∧ z ≠ p3 ∧ train(y) ∧ tunnel(z) (13)</p>
      <p>(S12) The path runs through the forest.</p>
      <p>(∃x, y, p1, z, p3, k, p4, k0)(L(x, y, p1, z, A12, Gs, k) · L(x, y, z, p3, A12, Gs, k)) Π L(x, y, p4, p4, A13, Gs, k0) ∧ p1 ≠ z ∧ z ≠ p3 ∧ path(y) ∧ forest(z) (14)</p>
      <p>(S13) The path through the forest is long.</p>
      <p>(∃x, y, p1, z, p3, x1, k, q, k1, p4, k0)(L(x, y, p1, z, A12, Gs, k) · L(x, y, z, p3, A12, Gs, k)) Π L(x, y, p4, p4, A13, Gs, k0) ∧ L(x1, y, q, q, A02, Gt, k1) ∧ p1 ≠ z ∧ z ≠ p3 ∧ q &gt; k1 ∧ path(y) ∧ forest(z) (15)</p>
      <p>(∃x1, y1, q, k1)L(x1, y1, q, q, A02, Gt, k1) ∧ q &gt; k1 (16)</p>
      <p>(∃x1, y1, k1)L(x1, y1, Long, Long, A02, Gt, k1) (17)</p>
      <p>Every version of our intelligent system IMAGES (e.g., [28]; [11]) can perform text
understanding based on word meaning descriptions as follows.</p>
      <p>Firstly, a text is parsed into a surface dependency structure (or more than one if
syntactically ambiguous). Secondly, each surface dependency structure is translated into a conceptual
structure (or more than one if semantically ambiguous) using word-meaning descriptions.
Finally, each conceptual structure is semantically evaluated.</p>
      <p>The fundamental semantic computations on a text are the detection of semantic anomalies,
ambiguities and paraphrase relations ([10]). Semantic anomaly detection is very important
for preventing robots from performing meaningless computations and actions in response to such a
verbal command as S14.</p>
      <p>(S14) Find a moving object which is stationary.</p>
      <p>Consider such a conceptual structure as (18), where ‘A29’ is the attribute ‘Taste’ and ‘Sw’ is
the value for ‘sweetness’. This locus formula can correspond to the English sentence ‘The
desk is sweet’, which is usually semantically anomalous because a ‘desk’ ordinarily has no
taste. The anonymous variable ‘_’ defined by (22) is often used instead of the variable bound
by an existential quantifier, for the sake of simplicity.</p>
      <p>(∃x)L(_, x, Sw, Sw, A29, Gt, _) ∧ desk(x) (18)</p>
      <p>This kind of semantic anomaly can be detected by the following process. Firstly, assume
the commonsense knowledge of ‘desk’ as (19), where ‘A39’ refers to the attribute ‘Vitality’.
The special symbols ‘*’ and ‘/’ are defined as (20) and (21), representing ‘always’ and ‘no
value’, respectively.</p>
      <p>(∃x)desk(x) ↔ (λx)(... L∗(_, x, /, /, A29, Gt, _) ∧ ... ∧ L∗(_, x, /, /, A39, Gt, _) ∧ ...) (19)</p>
      <p>X∗ ↔ (∀t1, t2)X Π ε(t1, t2) (20)</p>
      <p>L(..., /, ...) ↔ ∼(∃p)L(..., p, ...) (21)</p>
      <p>L(..., _, ...) ↔ (∃x)L(..., x, ...) (22)</p>
      <p>Secondly, the postulates (23) and (24) are utilized. The formula (23) means that if one
of two loci exists in every time interval, then they can coexist, and the formula (24) states that
a matter never has different values of an attribute at a time.</p>
      <p>X ∧ Y∗ .⊃. X Π Y (23)</p>
      <p>L(x, y, p1, q1, a, g, k) Π L(z, y, p2, q2, a, g, k) .⊃. p1 = p2 ∧ q1 = q2 (24)</p>
      <p>Lastly, the semantic anomaly of ‘sweet desk’ is detected by using (18)-(24). That is, the
formula (25) below is finally deduced from (18)-(23) and violates the commonsense given by
(24), that is, “Sw ≠ /”.</p>
      <p>(∃x)L(_, x, Sw, Sw, A29, Gt, _) Π L(_, x, /, /, A29, Gt, _) (25)</p>
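      <p>The anomaly check of (18)-(25) can be approximated by a very small sketch. The Python fragment below is an illustration under stated assumptions and not the IMAGES procedure: commonsense knowledge is reduced to a hypothetical table (COMMONSENSE) of attribute values that a kind of matter always has, and an asserted value that differs from such a value is reported as a conflict in the sense of postulate (24).</p>
      <preformat preformat-type="code"><![CDATA[
```python
NO_VALUE = "/"  # the special symbol for 'no value', as in (21)

# Hypothetical commonsense table in the spirit of (19):
# a desk always has no Taste (A29) and no Vitality (A39).
COMMONSENSE = {
    "desk": {"A29": NO_VALUE, "A39": NO_VALUE},
}


def find_anomalies(kind, asserted):
    """Return (attribute, asserted_value, commonsense_value) conflicts.

    asserted maps an attribute to the value claimed by the sentence.
    Two different values of one attribute at a time violate postulate (24).
    """
    always = COMMONSENSE.get(kind, {})
    return [
        (attribute, value, always[attribute])
        for attribute, value in asserted.items()
        if attribute in always and always[attribute] != value
    ]


def plausible_heads(candidates, asserted):
    """Keep only the candidate nouns that the assertion does not contradict."""
    return [kind for kind in candidates if not find_anomalies(kind, asserted)]


sweet = {"A29": "Sw"}  # 'is sweet': Taste has the value Sw
print(find_anomalies("desk", sweet))
print(plausible_heads(["coffee", "desk"], sweet))
```
]]></preformat>
      <p>In this sketch ‘sweet desk’ yields one conflict on A29, while ‘coffee’, about which the table says nothing, remains a plausible candidate.</p>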
      <p>This process is also employed for resolving such a syntactic ambiguity as found in S15.
That is, the semantic anomaly of ‘sweet desk’ is detected and ‘sweet coffee’ is eventually
adopted as the plausible interpretation.</p>
      <p>(S15) Bring me the coffee on the desk, which is sweet.</p>
      <p>If a text has multiple plausible interpretations, it is semantically ambiguous. In this case,
IMAGES will ask for further information in order to disambiguate it. In another case, if
two different texts are interpreted into the same locus formula, they are paraphrases of each
other. The detection of paraphrase relations is very useful for deleting redundant information.</p>
    </sec>
    <sec id="sec-4">
      <title>Conversation Management System</title>
      <p>Our conversation management system (CMS), the latest version of IMAGES, understands
the user’s assertions or questions in Lmd and responds to them by text or animation. The
general performance of CMS has already been published in ([11]); therefore, the focus here is
on how well it can simulate human mental-image based understanding (MBU) of
spatiotemporal expressions in NL. This capability was evaluated in a psychological
experiment and showed good agreement with human subjects in answering questions about
stimulus sentences, which inevitably involved spatiotemporal reasoning.</p>
      <p>The stimulus sentences presented to CMS and the human subjects were I1-I3, shown below.</p>
      <p>(I1) Tom was with the book in the bus running from Town to University.
(I2) Tom was with the book in the car driven from Town to University by Mary.
(I3) Tom kept the book in a box before he drove the car from Town to University with
the box.</p>
      <p>(PMV) (∀...)L(z, x, p, q, Λt) Π L(w, y, x, x, Λt) → L(z, x, p, q, Λt) Π L(w, y, p, q, Λt)</p>
      <p>(PSC) (∀...)L(z, x, p, q, Λt) Π L(w, y, x, x, Λt) → L(z, x, p, q, Λt) Π L(z, y, p, q, Λt)</p>
      <p>(PCV) (∀...)L(z, x, p, p, Λt) · X → L(z, x, p, p, Λt) · (L(z, x, p, p, Λt) Π X)</p>
      <p>For example, when p ≠ q, PMV reads that if ‘z causes x to move from p to q as w causes
y to be with x’ then ‘w causes y to move from p to q’. Similarly, PSC reads that if ‘z causes
x to move from p to q as w causes y to be with x’ then ‘z causes y to move from p to q as
well as x’. Unlike these two, PCV is conditional, reading that if ‘z keeps x at p
(until some event X happens)’ then ‘it will continue’. That is, PCV is valid only when X
does not contradict L(z, x, p, p, Λt). These postulates are also applicable to the scene
described by S5 in order to answer S6 and S7.</p>
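      <p>How a postulate such as PSC could be applied to the scene of S5 is sketched below in Python. This is a rough illustration rather than the CMS inference procedure. Several simplifications are assumed: every locus concerns Physical Location (A12) with event type Gt and is reduced to a tuple (cause, carrier, start, end); the tram is assumed to move itself from an unknown place p0; and ‘carry’ is tested in the sense of (10). The function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def close_under_psc(loci):
    """Repeatedly apply a PSC-like rule until nothing new is derived.

    Each locus is (cause, carrier, start, end). A locus (w, y, x, x)
    says that y stays located at x. If z moves x from p to q while y
    stays at x, then z also moves y from p to q.
    """
    derived = set(loci)
    changed = True
    while changed:
        changed = False
        for z, x, p, q in list(derived):
            if p == q:
                continue  # x is not moving, so nothing is passed on
            for w, y, start, end in list(derived):
                if start == x and end == x and y != x:
                    moved = (z, y, p, q)
                    if moved not in derived:
                        derived.add(moved)
                        changed = True
    return derived


def carries(loci, x, y):
    """'x carry y' as in (10): x moves itself and moves y along the same path."""
    return any(
        cause == x and carrier == x and p != q and (x, y, p, q) in loci
        for cause, carrier, p, q in loci
    )


def heads_for(loci, y, destination):
    """True if some locus moves y toward the given destination."""
    return any(
        carrier == y and q == destination and p != q
        for _, carrier, p, q in loci
    )


# (S5) Mary was in the tram heading for the town. She had a bag with her.
s5 = {
    ("tram", "tram", "p0", "town"),   # the tram moves toward the town
    ("Mary", "Mary", "tram", "tram"),  # Mary stays in the tram
    ("Mary", "bag", "Mary", "Mary"),   # the bag stays with Mary
}
scene = close_under_psc(s5)

print(carries(scene, "tram", "Mary"))   # (S6) Was the tram carrying Mary?
print(heads_for(scene, "bag", "town"))  # (S7) Was the bag heading for the town?
```
]]></preformat>
      <p>Both questions are answered positively only after the derived loci are added; on the stated facts of S5 alone, nothing says that the bag moves.</p>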
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>Katz and Fodor ([10]) were the first to analyze human semantic processing
ability. Based on their own experiences, they claimed that people (more specifically, fluent
speakers) can at least detect, in a text or between texts, the semantic properties or relations
listed below, and presented a model of the disambiguation process called ‘selection restriction’
employing lexical information roughly specified by semantic markers and distinguishers in
English.</p>
      <p>(a) semantic ambiguity
(b) semantic anomaly
(c) paraphrase relation (i.e., semantic identity between different expressions)</p>
      <p>To the best of our knowledge, no systematic implementation of these functions has been
reported for any NLU or NLP system other than our own work ([28]; [11]). Among them,
the most essential for NLU is the detection of the paraphrase relation, because the other two become possible
once equality (or inequality) can be determined between the knowledge representations (or
semantic representations) of different NL expressions.</p>
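      <p>The dependence of the three functions on an equality test can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: a reading is a hand-written list of loci (causer, entity, start, end, attr), normalization is reduced to ignoring order and duplicates, and treating an expression as anomalous when every reading contains a locus rejected by world knowledge is our simplification for the example. All function names and the toy readings are hypothetical and do not describe how CMS is implemented.</p>
      <preformat>
```python
# Stand-in for the physical-location attribute written as Λt in the text.
LOCATION = "Lambda_t"


def normalize(loci):
    """Canonical form of a conjunction of loci: order and duplicates ignored."""
    return frozenset(tuple(locus) for locus in loci)


def distinct_readings(readings):
    """Set of normalized readings of one expression."""
    return {normalize(reading) for reading in readings}


def is_paraphrase(readings_a, readings_b):
    """True when the two expressions share an identical normalized reading."""
    shared = distinct_readings(readings_a).intersection(distinct_readings(readings_b))
    return len(shared) != 0


def is_ambiguous(readings):
    """True when an expression has two or more unequal readings."""
    return len(distinct_readings(readings)) not in (0, 1)


def is_anomalous(readings, contradicts):
    """True when every reading contains a locus rejected by world knowledge."""
    return all(any(contradicts(locus) for locus in reading) for reading in readings)


def self_moving_book(locus):
    """Toy world knowledge: a book does not move by itself."""
    causer, entity, start, end, attr = locus
    return causer == "Book" and entity == "Book" and start != end


# Hand-written readings (hypothetical) for three example expressions.
carried = [[("Bus", "Tom", "Town", "University", LOCATION)]]
was_moved = [[("Bus", "Tom", "Town", "University", LOCATION)]]
book_ran = [[("Book", "Book", "Town", "University", LOCATION)]]

print(is_paraphrase(carried, was_moved))
print(is_ambiguous(carried))
print(is_anomalous(book_ran, self_moving_book))
```
      </preformat>
      <p>All three checks reduce to comparing normalized representations, so their quality rests on how well the normalization step maps equivalent expressions to one representation.</p>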
      <p>As is easily understood, the quality of this function depends on the capability of the adopted
KRL to normalize knowledge representations, that is, to assign a single knowledge representation
to expressions with the same meaning. However, as our psychological experience of NLU suggests, we utilize
tacit or explicit knowledge associated with the words involved in order to process an NL
expression semantically. This is also the case for NLU systems. That is, they must be
provided with knowledge adequate for the purpose, for example, lexical and
ontological knowledge, computably formalized in a KRL, and Lmd can be a KRL appropriate
for such a purpose, as shown by the result of our psychological experiment.</p>
      <p>The system CMS was designed to disambiguate an input sentence, arriving at its most plausible
semantic interpretation by semantic computation (i.e., inference) in Lmd. Our psychological
experiment revealed that the human subjects recalled their own experiences in association
with the entity names and selected the dependency corresponding to their most
familiar experience among all the possibilities. For example, consider the stimulus sentence
I1. How can the machine know who or what was running from Town to University: Tom,
the book, or the bus? Here, to enumerate the syntactic possibilities, Dependency Grammar is employed
to determine the relations between head words and their dependents. In principle,
I1 can have twelve possible dependency trees, that is, it can be syntactically ambiguous in
twelve ways. In this case, the names (e.g., Tom, bus, book) made the subjects recall the
images formulated as (26)-(28), where A ≈&gt; B reads that A evokes B, and
+ and - denote whether the image is positive (i.e., probable) or negative (i.e., improbable),
respectively.</p>
      <p>Tom ≈&gt; {+L(_, Tom, Human, Human, Θt), +L(Tom, Tom, p, q, Λt), … } (26)
Book ≈&gt; {−L(Book, Book, p, q, Λt), +L(Human, Book, Human, Human, Λt), … } (27)
Bus ≈&gt; {+L(Bus, Bus, p, q, Λt), +L(Bus, x, p, q, Λt), +L(_, Human, Bus, Bus, Λt), … } (28)</p>
      <p>In (26), Θt represents ‘Quality (or Category) (i.e., A41)’ with g = Gt, and so
+L(_, Tom, Human, Human, Θt) is interpretable as ‘it is positive that Tom is a human’. In
the same way, +L(Tom, Tom, p, q, Λt) is interpretable as ‘it is positive that Tom moves by himself’, and
−L(Book, Book, p, q, Λt) as ‘it is negative that a book moves by itself’. According to semantic
preferences such as (26)-(28), CMS infers that it was not the book that ran but either Tom or the bus, and it
reaches the final decision that the bus did because Tom was static in the bus.</p>
      <p>Disambiguation is the most serious problem for any NLP system. Most current approaches
to it are based on statistics over certain text corpora. However, they lead to
the most plausible syntactic interpretation but not to the most plausible semantic
interpretation grounded in the world concerned, which is most essential for operating robots appropriately by
words. In the research field of spatial language understanding, for example, the task
of spatial role labeling ([12]) is intended to formalize the representation of spatial concepts
and relations in natural language text so that they can be mapped to qualitative spatial representation
models by means of machine learning techniques. This is in the same line as the UIMA
(Unstructured Information Management Architecture) approach employed for Watson ([7]),
specialized to extract spatial information from natural language texts, but its applicability
to disambiguation or deeper understanding like ours remains questionable because it
returns spatial representations approximated by coded names at best.</p>
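      <p>The preference-based choice of the runner in I1 described in this section can be sketched in Python as follows. The sketch is a deliberate simplification under stated assumptions: the signed images of (26)-(28) are reduced to one Boolean per name (whether self-motion is positive), the static relations of I1 are written by hand, and the names SELF_MOTION, STATIC_AT, plausible_movers and choose_runner are hypothetical. It is not the inference procedure of CMS, which operates on Lmd expressions.</p>
      <preformat>
```python
# Signed self-motion images, hand-copied from the self-motion entries of
# (26)-(28): True for +L(X, X, p, q, Λt), False for -L(X, X, p, q, Λt).
SELF_MOTION = {"Tom": True, "Book": False, "Bus": True}

# Assumed static relations stated by I1: Tom is in the bus and the book is
# with Tom, so neither changes its place relative to its holder.
STATIC_AT = {"Tom": "Bus", "Book": "Tom"}


def plausible_movers(candidates, self_motion):
    """Keep the candidates whose self-motion image is positive."""
    return [name for name in candidates if self_motion.get(name, False)]


def choose_runner(candidates, self_motion, static_at):
    """Drop plausible movers that the sentence keeps static at another entity."""
    movers = plausible_movers(candidates, self_motion)
    return [name for name in movers if name not in static_at]


candidates = ["Tom", "Book", "Bus"]
print(plausible_movers(candidates, SELF_MOTION))
print(choose_runner(candidates, SELF_MOTION, STATIC_AT))
```
      </preformat>
      <p>The first step mirrors the inference that the book did not run, and the second mirrors the final decision in favor of the bus on the grounds that Tom was static in it.</p>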
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>Robotic NLU intended to simulate human NLU based on mental images was described.
CMS at the present stage has been evaluated against human subjects in our
psychological experiment on MBU and has shown good agreement with them in NLU
performance. The semantic computation in Lmd performed by CMS is based on simple and
general rules about atomic loci; hence, CMS runs feasibly in Python except for the computational
cost of the Animation Generator. As for the coverage of word concepts, a considerable
number of spatial terms have been analyzed across various kinds of English words, such as
prepositions, verbs, and adverbs, categorized as Dimensions, Form and Motion in the class
SPACE of Roget’s thesaurus, and it was found that almost all the concepts of 4D events
can be defined using only 5 kinds of attributes for FAO (the focus of attention of
the observer), namely, Physical location (A12), Direction (A13), Trajectory (A15), Mileage
(A17) and Topology (A44). This implies that spatiotemporal information systems with NL
interfaces are quite feasible in terms of the size of the knowledge to be installed.</p>
      <p>Future work will include the development of learning facilities for the automatic acquisition of
word concepts from sensory data and through language-centered interaction between humans
and robots in real environments, in the same way as human language acquisition.
In addition, the semantic plausibilities of names denoted as (26)-(28) could be obtained more
efficiently and automatically from suitable corpora of large size and good quality.</p>
      <p>Terry Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191,
1972. URL: http://www.sciencedirect.com/science/article/pii/0010028572900023,
doi:https://doi.org/10.1016/0010-0285(72)90002-3.</p>
      <p>Terry Winograd. Shifting viewpoints: Artificial intelligence and human–computer
interaction. Artificial Intelligence, 170(18):1256–1258, 2006. Special Review Issue. URL:
http://www.sciencedirect.com/science/article/pii/S0004370206000920,
doi:https://doi.org/10.1016/j.artint.2006.10.011.</p>
      <p>Masao Yokota. An approach to natural language understanding based on a mental image
model. In NLUCS, pages 22–31, 2005.</p>
      <p>Masao Yokota. Towards a universal knowledge representation language for ubiquitous
intelligence based on mental image directed semantic theory. In Jianhua Ma, Hai Jin, Laurence T.
Yang, and Jeffrey J.-P. Tsai, editors, Ubiquitous Intelligence and Computing, pages 1124–1133,
Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.</p>
      <p>Masao Yokota. Towards a universal language for distributed intelligent robot networking.
In 2006 IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages
3304–3309, Oct 2006. doi:10.1109/ICSMC.2006.384628.</p>
      <p>Masao Yokota. Aware computing in spatial language understanding guided by cognitively
inspired knowledge representation. Appl. Comp. Intell. Soft Comput., 2012:5:5–5:5, January
2012. URL: http://dx.doi.org/10.1155/2012/184103, doi:10.1155/2012/184103.</p>
      <p>Masao Yokota. Natural language understanding and cognitive robotics. CRC press, in press.</p>
      <p>Questions about I1 (Tom was with the book in the bus running from Town to University.)
Q1: What ran?
Q2: What was in the bus?
Q3: What traveled from Town to University?
Q4: Did the bus carry Tom from Town to University?
Q5: Did the bus move Tom from Town to University?
Q6: Did the bus carry the book from Town to University?
Q7: Did the bus move the book from Town to University?
Q8: Did Tom carry the book from Town to University?
Q9: Did Tom move the book from Town to University?</p>
      <sec id="sec-6-1">
        <title>Questions about I2 (Tom was with the book in the car driven from Town to University by Mary.)</title>
        <p>Q1: What was driven?
Q2: What was in the car?
Q3: What traveled from Town to University?
Q4: Did the car carry Tom from Town to University?
Q5: Did the car move Tom from Town to University?
Q6: Did the car carry the book from Town to University?
Q7: Did the car move the book from Town to University?
Q8: Did Tom carry the book from Town to University?
Q9: Did Tom move the book from Town to University?
Q10: Did Mary carry the car from Town to University?
Q11: Did Mary carry the book from Town to University?
Q12: Did Mary carry Tom from Town to University?</p>
      </sec>
      <sec id="sec-6-2">
        <title>Questions about I3 (Tom kept the book in a box before he drove the car from Town to University with the box.)</title>
        <p>Q1: What traveled from Town to University?
Q2: Did the car carry Tom from Town to University?
Q3: Did the car carry the box from Town to University?
Q4: Did the car carry the book from Town to University?
Q5: Did Tom carry the car from Town to University?
Q6: Did Tom carry the box from Town to University?
Q7: Did Tom carry the book from Town to University?
Q8: Did the box carry the book from Town to University?</p>
      </sec>
      <sec id="sec-6-3">
        <title>Answers by CMS (questions about I1)</title>
        <p>A1: bus
A2: Tom, book
A3: Tom, bus, book
A4: yes
A5: yes
A6: yes
A7: yes
A8: yes
A9: yes</p>
      </sec>
      <sec id="sec-6-4">
        <title>Answers by CMS (questions about I2)</title>
        <p>A1: car
A2: Mary, Tom, book
A3: Mary, Tom, book, car
A4: yes
A5: yes
A6: yes
A7: yes
A8: yes
A9: yes
A10: yes
A11: yes
A12: yes</p>
      </sec>
      <sec id="sec-6-5">
        <title>Answers by CMS (questions about I3)</title>
        <p>A1: Tom, book, box, car
A2: yes
A3: yes
A4: yes
A5: yes
A6: yes
A7: yes
A8: yes</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>Robotics and Autonomous Systems</source>
          ,
          <volume>43</volume>
          (
          <issue>2</issue>
          ):
          <fpage>85</fpage>
          -
          <lpage>96</lpage>
          ,
          <year>2003</year>
          . Perceptual Anchoring:
          <article-title>Anchoring Symbols to Sensor Data in Single and Multiple Robot Systems</article-title>
          . URL: http://www.sciencedirect.com/science/article/pii/S0921889003000216, doi:https://doi.org/10.1016/S0921-8890(03)00021-6.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Kenny R.</given-names>
            <surname>Coventry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mercè</given-names>
            <surname>Prat-Sala</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lynn</given-names>
            <surname>Richards</surname>
          </string-name>
          .
          <article-title>The interplay between geometry and function in the comprehension of over, under, above, and below</article-title>
          .
          <source>Journal of Memory and Language</source>
          ,
          <volume>44</volume>
          (
          <issue>3</issue>
          ):
          <fpage>376</fpage>
          -
          <lpage>398</lpage>
          ,
          <year>2001</year>
          . URL: http://www.sciencedirect.com/science/article/pii/S0749596X00927426, doi:https://doi.org/10.1006/jmla.2000.2742.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Ido</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Glickman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          .
          <article-title>The PASCAL recognising textual entailment challenge</article-title>
          . In Joaquin Quiñonero-Candela, Ido Dagan, Bernardo Magnini, and Florence d'Alché Buc, editors,
          <source>Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment</source>
          , pages
          <fpage>177</fpage>
          -
          <lpage>190</lpage>
          , Berlin, Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Evan</given-names>
            <surname>Drumwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Victor</given-names>
            <surname>Ng-Thow-Hing</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Maja J.</given-names>
            <surname>Mataric</surname>
          </string-name>
          .
          <article-title>Toward a vocabulary of primitive task programs for humanoid robots</article-title>
          .
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Max J.</given-names>
            <surname>Egenhofer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert D.</given-names>
            <surname>Franzosa</surname>
          </string-name>
          .
          <article-title>Point-set topological spatial relations</article-title>
          .
          <source>International Journal of Geographical Information Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>161</fpage>
          -
          <lpage>174</lpage>
          ,
          <year>1991</year>
          . URL: https://doi.org/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>