<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using attentive focus to discover action ontologies from perception</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amitabha Mukerjee</string-name>
          <email>amit@cse.iitk.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science &amp; Engineering Indian Institute of Technology Kanpur</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <fpage>32</fpage>
      <lpage>38</lpage>
      <abstract>
        <p>The word “symbol”, as it is used in logic and computational theory, is considerably different from its usage in cognitive linguistics and in everyday life. Formal approaches that define symbols in terms of other symbols ultimately need to be grounded in perceptual-motor terms. Based on cognitive evidence that the earliest action structures may be learned from perception alone, we propose to use attentive focus to identify the agents participating in an action, map the characteristics of their interaction, and ultimately discover actions as clusters in perceptuo-temporal space. We demonstrate the applicability of this approach by learning actions from simple 2D image sequences, and then recognizing 3D actions with the learned predicates. This mapping, which also identifies the objects involved in the interaction, informs us about the argument structure of the verb, and may help guide syntax. Ontologies in such systems are learned as different granularities in the clustering space; action hierarchies emerge as membership relations between actions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Learning the concepts for concrete objects requires the
perceptual system to abstract across visual presentations
of these objects. In contrast, modeling actions presents
a more complex challenge [Fleischman and Roy, 2005],
[Sugiura and Iwahashi, 2007]. Yet actions are the
central structure for organizing concepts; the corresponding
language units (verbs) also act as “heads” (predicates)
in sentences, controlling how an utterance is to be
interpreted. Typically the structure for an action/verb
includes a set of possible constituents that participate
in the action, and also some constraints on the type of
action (e.g. the type of motion that may constitute “A
chases B”).</p>
      <p>In this work, we consider learning the structure of
actions from image sequences. Cognitively,
there is evidence that some action schemas are acquired
through perception in a pre-linguistic stage [Mandler,
2004]; later these are reinforced via participation, and
may eventually seed linguistic aspects such as argument
structure.</p>
      <p>We postulate that a key aspect of this process is the
role of perceptual attention [Regier, 2003],[Ballard and
Yu, 2003]. Thus, an action involving two agents may
involve attention shifts between them, which helps limit
the set of agents participating in the action. The set of
agents participating in an action eventually generalizes
to the argument structure. In [Ballard and Yu, 2003],
human gaze was directly tracked and matched with
language fragments, and verbs such as “picking up” and
“stapling” were associated with certain actions.
However, the verbal concepts learned were specific to the
context, and no attempt was made to generalize these
into action schemas, applicable to new scenes or
situations. Top-down attention guided by linguistic inputs is
used to identify objects in [Roy and Mukherjee, 2005].
More recently, in [Guha and Mukerjee, 2007] attentive
focus is used to learn labels for simple motion
trajectories, but this is also restricted to a particular visual
domain.</p>
    </sec>
    <sec id="sec-2">
      <title>From Percept to Concept to Symbol</title>
      <p>The word “symbol”, as it is used in logic and
computational theory, is considerably different from its usage
in cognitive linguistics and in everyday life. The OED
defines it as “Something that stands for, represents, or
denotes something else”. This meaning carries over to
the cognitive usage, where it is viewed as a tight coupling
of a set of mental associations (the semantic pole) with
the psychological impression of the sound (the
phonological pole) [Langacker, 1999]. Formally, however, a
symbol is detached from any meaning: it is just a token
constructed from some finite alphabet, related only to other
such tokens. A computer system dealing with such symbols
can define many relations among them, but finds it difficult
to relate them to the world, and hence also difficult to keep
the relations between symbols up to date. The objective of
this work is to try to align a
symbol to a perceptual stimulus, so as to provide grounding
for the symbols used in language or in reasoning.</p>
      <p>In other work, we have addressed the question of
learning the language label (or the phonological pole)
of a symbol [Satish and Mukerjee, 2008]. Here we
focus on modeling the semantic pole, especially with
respect to action ontologies. Such models, called Image
Schema in Cognitive Linguistics [Langacker, 1999] or
Perceptual Schema in Experimental Psychology
[Mandler, 2004], involve abstractions on low-level features
extracted from sensorimotor modalities (positions and
velocities), as well as the argument structure.</p>
      <p>We ask here whether, given a system observing a
simple 2D scene (see Fig. 1) with shapes like squares and
circles chasing each other, it is possible for it to cluster
all 2-agent interactions in some meaningful way into a
set of action schemas. If so, do these action schemas
relate reliably to any useful conceptual structures?
Further, is it possible to learn relationships between these
action schemata, thus constructing a primitive ontology?
Note that all this has to take place without any language,
and without any human input in any form.</p>
      <p>Constructing such action templates has a long history
in computer vision, but most approaches gather statistics
in view-specific ways, with an emphasis on recognition
[Xiang and Gong, 2006]. We restrict ourselves to two-object
interactions, using no priors, and our feature vectors are
combinations of the relative position and velocity vectors of
the objects (we use a simple inner product). We perform
unsupervised clustering on the spatio-temporal feature
space using the Merge Neural Gas algorithm [Strickert
and Hammer, 2005]; the resulting clusters constitute our
action schemas. By considering different levels of cluster
granularity in the unsupervised learning process, we also
learn finer action concepts as subsets of coarse ones,
resulting in an action hierarchy which may be thought
of as a rudimentary ontology.</p>
      <p>Having learned the action schema based on a given
input, we apply it to recognize novel 2-body interactions
in a 3D fixed camera video, in which the depth of a
foreground object is indicated by its image y-coordinate.
We show that the motion features of humans can be
labelled using the action schemas learned.</p>
      <sec id="sec-2-1">
        <title>Analysis: Role of Attentive Focus</title>
        <p>One of the key issues we explore in this work is the
relevance of perceptual attention. It turns out that
restricting computation to attended events somehow results in
a better correlation with motions that are named in
language. This may reflect a bias in conceptualization
towards actions that attract attention. Like other models
that use attention to associate agents or actions to
language [Ballard and Yu, 2003; Guha and Mukerjee, 2007],
we use attentive focus to constrain the region of visual
salience, and thereby the constituents participating in an
action. We use a computational model of dynamic visual
attention [Singh et al., 2006] to identify agents possibly
in focus.</p>
        <p>In order to analyze the different types of motion
possible in the scene, we first perform a qualitative analysis
of the motions. We assume that all objects have an
intrinsic frame with a privileged “front” direction defined
either by its present direction of motion, or by the last
such observed direction. Let the reference object be A; then
the pose of the located object B w.r.t. the frame of A
can be described as a 2-dimensional qualitative vector
[Forbus et al., 1987], where each axis is represented as
{−, 0, +} instead of quantitative values. This results in
eight possible non-colliding states for the pose of B. In
each pose, the velocity of B is similarly encoded,
resulting in 9 possible velocities (including non-moving).</p>
        <p>This encoding results in 72 possible relations;
distinguishing the situation where the reference object A is
moving from that where it is stationary results in a total of
144 possible states. Linguistic labels (Come-Close (CC),
Move-Away (MA), Chase (CH), Go-Around (GoA),
Move-Together (MT), Move-Opposite (MO)) are manually
assigned to these qualitative relative motion states.
The motions in nearly half the states do not appear to
have clear linguistic terms associated with them, and these
unnamed interactions are left unlabelled. The remaining
class assignments are shown in Figure 2. The qualitative
classification for the frames in Fig. 1 is shown in
Fig. 3.</p>
        <p>Next, we analyze the frequency of these cases as
observed in the Chase video. Fig. 2 compares the frequency
of the qualitative states with a non-stationary first object
in two situations: one where all possible object pairs are
considered (no attentive focus), and one where, using
attentive cues, only pairs of agents attended to within a
temporal window of 20 frames become candidates for mutual
interaction, all other agent pairings being ignored. The
frequency of indeterminate qualitative cases is 58% in the
first situation and 24% in the second. Thus, attentive focus
biases the learning towards relations that we have names
for in language.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Visual Attention</title>
        <p>We consider a bottom-up model of visual attention (not
dependent on the task at hand) [Itti, 2000], specifically one
designed to capture bottom-up attention in dynamic scenes
based on motion saliency [Singh et al., 2006]. Objects,
rather than pixels, are taken as the attentive foci. A motion
saliency map is computed from optical flow, and a confidence
map is introduced to assign higher salience to objects not
visited for a long time. A small foveal bias is introduced
to mediate in favour of proximal fixations over large
saccadic motions. A Winner-Take-All network on the combined
saliency map then selects the most salient object for
fixation.</p>
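        <p>The combination rule of [Singh et al., 2006] is only summarized here, so the following is a minimal sketch of one plausible reading: per-object motion saliency weighted by a confidence term (favouring long-unvisited objects) and a foveal bias, followed by a winner-take-all choice. The weighting functions and constants are illustrative assumptions.</p>
        <preformat><![CDATA[
import numpy as np

def most_salient_object(motion_saliency, frames_since_visit,
                        positions, current_fixation,
                        conf_gain=0.05, fovea_sigma=100.0):
    """Winner-take-all over a combined per-object saliency score.

    motion_saliency: per-object saliency derived from optical flow.
    frames_since_visit: frames since each object was last fixated.
    positions, current_fixation: 2D image coordinates.
    conf_gain, fovea_sigma: illustrative constants, not from the paper.
    """
    # confidence map: higher salience for long-unvisited objects
    confidence = 1.0 + conf_gain * np.asarray(frames_since_visit)
    # foveal bias: prefer fixations proximal to the current one
    dist = np.linalg.norm(np.asarray(positions) - current_fixation, axis=1)
    foveal_bias = np.exp(-(dist ** 2) / (2 * fovea_sigma ** 2))
    combined = np.asarray(motion_saliency) * confidence * foveal_bias
    return int(np.argmax(combined))  # winner-take-all
]]></preformat>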
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Name</title>
      <p>pos·velDiff
pos·velSum</p>
    </sec>
    <sec id="sec-4">
      <title>Formula</title>
      <p>(~xB − ~xA) · (~vB − ~vA)
(~xB − ~xA) · (~vB + ~vA)
foveal bias is introduced to mediate in favour of proximal
fixations against large saccadic motions.
Winner-TakeAll network on the combined saliency map gives the most
salient object for fixation.
4</p>
      <sec id="sec-4-1">
        <title>Unsupervised Perceptual Clustering</title>
        <p>Perceptual systems return certain abstractions of the
raw sensory data - “features” - which are used for
recognition, motor control, categorization, etc. In this work
we use two features that capture the interaction of two
agents. All learning takes place in the space of these two
features (Table 1); the first feature combines the relative
position of the pair with their velocity difference, the
second with their velocity sum.</p>
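        <p>Concretely, the two features of Table 1 can be computed per frame as in the sketch below (the function and variable names are ours); note that pos·velDiff is negative exactly when the pair is approaching, since it is half the time-derivative of the squared distance.</p>
        <preformat><![CDATA[
import numpy as np

def dyadic_features(x_A, v_A, x_B, v_B):
    """The two inner-product features of Table 1 for an (A, B) pair.

    x_*: 2D positions; v_*: 2D velocities (e.g. frame differences).
    pos_vel_diff < 0 iff B and A are approaching; pos_vel_sum is
    sensitive to the ordering of the pair (this matters for the
    argument-order experiment later).
    """
    rel_pos = np.asarray(x_B) - np.asarray(x_A)
    pos_vel_diff = rel_pos @ (np.asarray(v_B) - np.asarray(v_A))
    pos_vel_sum = rel_pos @ (np.asarray(v_B) + np.asarray(v_A))
    return np.array([pos_vel_diff, pos_vel_sum])
]]></preformat>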
        <p>These feature vectors are then clustered into
categories in an unsupervised manner, based on a notion of
distance between individuals. We use the Merge Neural
Gas (MNG) algorithm [Strickert and Hammer, 2005] for
unsupervised learning; it has been shown to be well-suited
for processing complex dynamic sequences compared to other
existing models for temporal data processing, such as the
Temporal Kohonen Map and the Recursive SOM. This class of
temporal learning algorithms is more flexible with respect
to state specifications and time history than HMMs or
VLMMs. The MNG algorithm also performs better than
unsupervised clustering algorithms such as K-Windows
[Vrahatis et al., 2002] and DBSCAN [Ester et al., 1996]
because, unlike them, it utilizes the temporal information
present in the frame sequences.</p>
    </sec>
    <sec id="sec-5">
      <title>Merge Neural Gas algorithm</title>
      <p>The Neural Gas algorithm [Martinetz and Schulten,
1994] learns important topological relations in a given
set of input vectors (signals) in an unsupervised manner
by means of a simple Hebb-like learning rule. It takes a
distribution of high-dimensional data, P(ξ) and returns
a densely connected network resembling the topology of
the input.</p>
      <p>For input feature vectors arriving from temporally
connected data, the basic neural gas algorithm can be
generalized by including explicit context representation
which utilizes the temporal ordering present in the
feature vectors of the frames, resulting in the Merge Neural
Gas algorithm [Strickert and Hammer, 2005]. Here, a
context vector is adjusted based on the present winning
neuron's weight and context. Cluster labels for the frames
are obtained in the final iteration of the algorithm based
on the winner neuron.</p>
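      <p>A minimal sketch of the merge mechanism is given below: each neuron keeps a weight and a context vector, the distance to an input blends an input-space term with a context-space term, and the global context is a merge of the previous winner's weight and context. The hyper-parameter names and values (alpha, beta, learning rate, rank decay) are illustrative assumptions, not the settings of [Strickert and Hammer, 2005].</p>
      <preformat><![CDATA[
import numpy as np

def mng_train(data, n_neurons=20, epochs=10, alpha=0.5, beta=0.5,
              lr=0.1, rank_decay=2.0, seed=0):
    """Sketch of Merge Neural Gas training on a frame sequence."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n_neurons, data.shape[1]))  # weight vectors
    c = np.zeros_like(w)                             # context vectors
    for _ in range(epochs):
        winner = 0
        for x in data:
            # global context: merge of previous winner's weight and context
            context = (1 - beta) * w[winner] + beta * c[winner]
            # blended distance in input space and context space
            d = (1 - alpha) * np.sum((x - w) ** 2, axis=1) \
                + alpha * np.sum((context - c) ** 2, axis=1)
            winner = int(np.argmin(d))
            # neural-gas style rank-based adaptation toward x and context
            ranks = np.argsort(np.argsort(d))
            h = (lr * np.exp(-ranks / rank_decay))[:, None]
            w += h * (x - w)
            c += h * (context - c)
    return w, c
]]></preformat>
      <p>Cluster labels are then read off in one final pass that records the winner neuron for each frame.</p>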
      <sec id="sec-5-1">
        <title>Concept Acquisition: Chase video</title>
        <p>Unsupervised clustering with the Merge Neural Gas
algorithm is performed on the feature vectors from the video,
corresponding to object pairs that were in attentive
focus around the same time. Salient objects in a scene
are ordered by a computational model of bottom-up
dynamic attention [Singh et al., 2006]. The most salient
object is determined for each frame, and other objects
that were salient within k frames before and after (we
use k = 10) are considered as attended simultaneously.
Dyadic feature vectors are computed for all object pairs
in these 2k frames.</p>
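        <p>A sketch of this attentive pruning step, assuming the attention model has already produced the most salient object id for every frame:</p>
        <preformat><![CDATA[
def attended_pairs(salient_obj_per_frame, k=10):
    """Object pairs attended to within k frames of each other.

    salient_obj_per_frame: most salient object id for each frame.
    Returns, per frame, the candidate pairs for feature extraction;
    all other agent pairings are ignored.
    """
    pairs = []
    for t, obj in enumerate(salient_obj_per_frame):
        window = salient_obj_per_frame[max(0, t - k): t + k + 1]
        for other in set(window):
            if other != obj:
                pairs.append((t, frozenset((obj, other))))
    return pairs
]]></preformat>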
        <p>Owing to the randomized nature of the algorithm, the
number of clusters varies from run to run. Clusters with
fewer than ten frames are dropped. With the aging
parameter set to 30, the number of clusters came out to
be four in 90% of the runs; the set of four clusters with
the highest total classification accuracy (see Table 2) is
considered below.</p>
        <p>In order to validate these clusters against human
concepts, we asked three subjects (male, Hindi-English/
Telugu-English bilinguals, ages 22, 20, and 30) to label
the scenes in the video. They were shown the video twice,
and in the third viewing they were asked to speak out
one of three action labels (CC, MA, Chase), which was
recorded. Given the label and the frame at which it was
uttered, the actual event boundaries and participating
objects for the ground-truth data were assigned by
inspection. In cases of disagreement, we took the majority
view.</p>
        <p>The percentage accuracies shown in Table 2 do not
reflect the degree of match: although an event may last
over 15 frames, detecting even 10 of those frames is
usually quite helpful. This can be seen in Fig. 6, which
presents results along a timeline for Chase; each row
reflects a different combination of agents (small square,
big square, circle). At first glance, figures like Fig. 6
would seem to reflect a higher accuracy than the 84% in
Table 2.</p>
        <p>[Figure: coarse clusters C1-C4 plotted over Feature 1 (pos·velDiff); axis range −5000 to 5000.]</p>
        <p>A surprising result was found when experimenting
with the edge aging parameter in the Merge Neural Gas
algorithm. The number of clusters increases as the aging
parameter is decreased, and at one stage eight clusters were
formed (edge aging parameter = 16). The Total
Classification Accuracy (TCA) was about 51%, and we would have
discarded the result, but inspecting the frames revealed
that the clusters may reflect a hierarchy of action types.
Thus cluster C1 from the earlier classification (majority
correlation = CC) was broken up into C1, C5, and C6. C1 was
found to contain frames where both objects are moving
towards each other, whereas C5 contains frames where the
smaller object is stationary and the other moves closer.
Thus Come-Close and Move-Away each appear to be
sub-classified into three classes (two one-object-static
cases, and one both-moving case). This ‘finer’
classification is given in Table 3.</p>
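        <p>Such membership relations between the coarse and fine clusterings can be read off mechanically; the sketch below links a fine cluster to a coarse one when most of its frames fall inside it (the overlap threshold is an illustrative assumption).</p>
        <preformat><![CDATA[
from collections import Counter

def cluster_hierarchy(coarse_labels, fine_labels, threshold=0.9):
    """Parent/child links between two cluster granularities.

    coarse_labels, fine_labels: per-frame cluster ids from two runs
    (e.g. edge aging parameter 30 vs. 16). A fine cluster becomes a
    child of the coarse cluster containing at least `threshold` of
    its frames, yielding e.g. C1 -> [C1, C5, C6] as in the text.
    """
    children = {}
    for f in set(fine_labels):
        frames = [i for i, lab in enumerate(fine_labels) if lab == f]
        parent, count = Counter(coarse_labels[i] for i in frames).most_common(1)[0]
        if count / len(frames) >= threshold:
            children.setdefault(parent, []).append(f)
    return children
]]></preformat>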
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Argument order in Action Schemas</title>
      <p>In another experiment, we investigated the importance
of argument ordering by re-classifying the same frames,
but reversing the order of the objects used in the dyadic
vector computation. Earlier, if the larger object was
arg1 (the reference object), it now became arg2 (the
non-reference object). If the corresponding concept changed,
especially if it flipped, this would reflect a semantic
necessity to preserve the argument order; otherwise the
arguments were commutative. Using the coarser clusters,
we observe that the argument order is immaterial for C1 and
C2 (CC and MA respectively), since the majority relation is
unchanged (black). On the other hand, both C3 and C4
(correlations with Chase) are flipped (Table 4). Thus, the
fact that argument order is important for Chase is learned
implicitly within the action schema itself. The
non-commutativity of CC<sub>one-object-static</sub> and
MA<sub>one-object-static</sub> could not be established
because of the skewed distribution of frames in the input
video amongst the two sub-classes for these action
verbs.</p>
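      <p>This behaviour already follows from the algebra of the Table 1 features: swapping the arguments leaves pos·velDiff unchanged but negates pos·velSum, so only the second feature can encode which agent is the reference object. A quick check, reusing the dyadic_features sketch from earlier:</p>
      <preformat><![CDATA[
import numpy as np

rng = np.random.default_rng(1)
x_A, v_A, x_B, v_B = rng.normal(size=(4, 2))

f_AB = dyadic_features(x_A, v_A, x_B, v_B)
f_BA = dyadic_features(x_B, v_B, x_A, v_A)

assert np.allclose(f_AB[0], f_BA[0])    # pos.velDiff: order immaterial
assert np.allclose(f_AB[1], -f_BA[1])   # pos.velSum: flips with order
]]></preformat>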
    </sec>
    <sec id="sec-7">
      <title>Comparison with K-Windows</title>
    </sec>
    <sec id="sec-8">
      <title>Clustering</title>
      <p>We compare the clustering accuracy obtained by the
unsupervised Merge Neural Gas algorithm with that of the
K-Windows algorithm [Vrahatis et al., 2002]. K-Windows is
an improvement of the K-Means clustering algorithm, with
better time complexity and clustering accuracy. We set the
value of k in this algorithm to 4 and run it on the input
feature vectors obtained after attentive pruning. The
initial cluster points for the algorithm are set randomly.
Table 5 gives the clustering results obtained.</p>
      <p>The lower accuracy (as compared to the results in
Table 2) is expected because K-Windows treats each feature
vector as a separate entity, without utilizing the
information present in the temporal ordering of the
frames.</p>
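      <p>The K-Windows procedure itself is not reproduced here; as a stand-in, the sketch below runs plain k-means (which K-Windows refines) with k = 4 and random initialization, which likewise treats every feature vector as an isolated point.</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.cluster import KMeans

def static_baseline(features, k=4, seed=0):
    """Frame-independent clustering baseline (k-means stand-in).

    Like K-Windows, this ignores the temporal ordering of the
    frames that Merge Neural Gas exploits.
    """
    km = KMeans(n_clusters=k, init="random", n_init=10, random_state=seed)
    return km.fit_predict(np.asarray(features))
]]></preformat>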
      <sec id="sec-8-1">
        <title>Recognizing actions in 3D</title>
        <p>In order to test the effectiveness of the clusters learned,
we test the recognition of motions from a 3D video of
three persons running around in a field (Fig. 7). In
human classification of the action categories (into one of
CC, MA, Chase), the dominant predicate in the video
(777 out of 991 frames) is Chase.</p>
        <p>In the image-processing stage, the system learns the
background over the initial frames, based on which it
segments out the foreground blobs. It is then able to track
all three agents using the mean-shift algorithm. Assuming
the camera height is near eye level, the bottom-most
point in each blob corresponds to that agent’s contact
with the ground, from which its depth can be determined
within some scaling error (157 frames with extensive
occlusion between agents were omitted). Given this depth,
one can solve for the lateral position; thus we are able to
obtain, from a single-view video, the (x, y) coordinates
for each agent in each frame, within a constant scale.
Based on these, the relative pose and motion parameters
are computed for each agent pair, and therefrom the
features as outlined earlier. These feature vectors are then
classified using the action schemas (coarse clusters)
already obtained from the 2D Chase video (Table 6).</p>
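        <p>The depth recovery can be sketched with a pinhole model under the paper's fixed, eye-level camera assumption; the intrinsics below (principal point, focal length, camera height) are illustrative, and errors in them only rescale the recovered coordinates.</p>
        <preformat><![CDATA[
def ground_position(u, v, u0=320.0, v0=240.0, f=800.0, h=1.6):
    """Ground-plane (X, Z) of a blob's bottom-most pixel (u, v).

    For a horizontal camera at height h, a ground point at depth Z
    projects to image row v with v - v0 = f * h / Z, so
    Z = f * h / (v - v0); the lateral position then follows from
    X = (u - u0) * Z / f.
    """
    if v <= v0:
        raise ValueError("contact point must lie below the horizon row")
    Z = f * h / (v - v0)   # depth from the image y-coordinate
    X = (u - u0) * Z / f   # lateral position at that depth
    return X, Z
]]></preformat>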
      </sec>
      <sec id="sec-8-2">
        <title>Discussion and Conclusion</title>
        <p>We have outlined how our unsupervised approach learns
action schemas of two-agent interactions resulting in an
action ontology. The image-schematic nature of the
clusters is validated by producing a description for a 3D
video. The approach provided here underlines the role
of concept argument structures in aligning with linguistic
expressions, and that of bottom-up dynamic attention in
pruning the visual input and in aligning linguistic focus.</p>
        <p>Once a few basic concepts are learned, other
concepts can be learned without direct grounding, by using
conceptual blending mechanisms on the concepts themselves.
These operations are often triggered by linguistic cues,
so that new concepts, along with their labels, are learned
together at a later stage. Indeed, the vast majority of our
vocabulary is learned later, purely from linguistic input
[Bloom, 2000]. But this is only possible because of the
grounded nature of the first few concepts, without which
the later concepts could not be grounded. Thus the
perceptually grounded nature of the very first concepts is
crucial to subsequent compositions.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Ballard and Yu</source>
          , 2003]
          <string-name>
            <surname>Dana</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Ballard</surname>
            and
            <given-names>Chen</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>A multimodal learning interface for word acquisition</article-title>
          .
          <source>In International Conference on Acoustics,Speech and Signal Processing(ICASSP03)</source>
          , volume
          <volume>5</volume>
          , pages
          <fpage>784</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>April 2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Bloom</source>
          , 2000]
          <string-name>
            <given-names>Paul</given-names>
            <surname>Bloom</surname>
          </string-name>
          .
          <article-title>How Children Learn the Meanings of Words</article-title>
          . MIT Press, Cambridge, MA,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Ester et al.,
          <year>1996</year>
          ]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hans-Peter Kriegel</surname>
            , Jorg Sander, and
            <given-names>Xiaowei</given-names>
          </string-name>
          <string-name>
            <surname>Xug</surname>
          </string-name>
          .
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
          <source>In Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Fleischman and Roy</source>
          , 2005]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Fleischman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Deb</given-names>
            <surname>Roy</surname>
          </string-name>
          .
          <article-title>Why verbs are harder to learn than nouns: Initial insights from a computational model of intention recognition in situated word learning</article-title>
          .
          <source>In Proceedings of the 27th Annual Meeting of the Cognitive Science Society</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Forbus et al.,
          <year>1987</year>
          ]
          <string-name>
            <given-names>Kenneth D</given-names>
            <surname>Forbus</surname>
          </string-name>
          , Paul Nielsen, and
          <string-name>
            <given-names>Boi</given-names>
            <surname>Faltings</surname>
          </string-name>
          .
          <article-title>Qualitative kinematics: A framework</article-title>
          .
          <source>In IJCAI</source>
          , pages
          <fpage>430</fpage>
          -
          <lpage>436</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Guha and Mukerjee</source>
          , 2007]
          <string-name>
            <given-names>Prithwijit</given-names>
            <surname>Guha</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amitabha</given-names>
            <surname>Mukerjee</surname>
          </string-name>
          .
          <article-title>Language label learning for visual concepts discovered from video sequences</article-title>
          . In Lucas Paletta, editor,
          <source>Attention in Cognitive Systems. Theories and Systems from an Interdisciplinary Viewpoint</source>
          , volume
          <volume>4840</volume>
          , pages
          <fpage>91</fpage>
          -
          <lpage>105</lpage>
          . Springer LNCS, Berlin / Heidelberg,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Itti</source>
          , 2000]
          <string-name>
            <given-names>L.</given-names>
            <surname>Itti</surname>
          </string-name>
          .
          <article-title>Models of Bottom-Up and Top-Down Visual Attention</article-title>
          .
          <source>PhD thesis</source>
          , Pasadena, California,
          <year>Jan 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>[Langacker</source>
          , 1999]
          <article-title>Ronald Wayne Langacker</article-title>
          .
          <source>Grammar and Conceptualization</source>
          . Berlin/New York: Mouton de Gruyer,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Mandler</source>
          , 2004]
          <string-name>
            <given-names>J M</given-names>
            <surname>Mandler. Foundations</surname>
          </string-name>
          of Mind. Oxford University Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[Martinetz and Schulten</source>
          , 1994]
          <string-name>
            <given-names>T.</given-names>
            <surname>Martinetz</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Schulten</surname>
          </string-name>
          .
          <article-title>Topology representing networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>507</fpage>
          -
          <lpage>522</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>[Regier</source>
          , 2003]
          <string-name>
            <given-names>Terry</given-names>
            <surname>Regier</surname>
          </string-name>
          .
          <article-title>Emergent constraints on wordlearning: A computational review</article-title>
          .
          <source>Trends in Cognitive Sciences</source>
          ,
          <volume>7</volume>
          :
          <fpage>263</fpage>
          -
          <lpage>268</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>[Roy and Mukherjee</source>
          , 2005]
          <string-name>
            <given-names>Deb</given-names>
            <surname>Roy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Niloy</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          .
          <article-title>Towards situated speech understanding: visual context priming of language models</article-title>
          .
          <source>Computer Speech and Language</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>227</fpage>
          -
          <lpage>248</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>[Satish and Mukerjee</source>
          , 2008]
          <string-name>
            <given-names>G.</given-names>
            <surname>Satish</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukerjee</surname>
          </string-name>
          .
          <article-title>Acquiring linguistic argument structure from multimodal input using attentive focus</article-title>
          .
          <source>In 7th IEEE International Conference on Development and Learning</source>
          ,
          <source>2008. ICDL</source>
          <year>2008</year>
          , pages
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>[Singh</surname>
          </string-name>
          et al.,
          <year>2006</year>
          ]
          <string-name>
            <given-names>Vivek</given-names>
            <surname>Kumar</surname>
          </string-name>
          <string-name>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Subranshu</given-names>
            <surname>Maji</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Amitabha</given-names>
            <surname>Mukerjee</surname>
          </string-name>
          .
          <article-title>Confidence based updation of motion conspicuity in dynamic scenes</article-title>
          .
          <source>In Third Canadian Conference on Computer and Robot Vision</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>[Strickert and Hammer</source>
          , 2005]
          <string-name>
            <given-names>Marc</given-names>
            <surname>Strickert</surname>
          </string-name>
          and
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Hammer</surname>
          </string-name>
          .
          <article-title>Merge som for temporal data</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>64</volume>
          :
          <fpage>39</fpage>
          -
          <lpage>71</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>[Sugiura and Iwahashi</source>
          , 2007]
          <string-name>
            <given-names>Komei</given-names>
            <surname>Sugiura</surname>
          </string-name>
          and
          <string-name>
            <given-names>Naoto</given-names>
            <surname>Iwahashi</surname>
          </string-name>
          .
          <article-title>Learning object-manipulation verbs for humanrobot communication</article-title>
          .
          <source>In WMISI '07: Proceedings of the [Vrahatis</source>
          et al.,
          <year>2002</year>
          ]
          <string-name>
            <surname>Michael N Vrahatis</surname>
          </string-name>
          , Basilis Boutsinas, Panagiotis Alevizos, and
          <string-name>
            <given-names>Georgios</given-names>
            <surname>Pavlides</surname>
          </string-name>
          .
          <article-title>The new kwindows algorithm for improving thek -means clustering algorithm</article-title>
          .
          <source>Journal of Complexity</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <fpage>375</fpage>
          -
          <lpage>391</lpage>
          ,
          <year>March 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>[Xiang and Gong</source>
          , 2006]
          <string-name>
            <given-names>T.</given-names>
            <surname>Xiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gong</surname>
          </string-name>
          .
          <article-title>Beyond tracking: Modelling activity and understanding behaviour</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>67</volume>
          (
          <issue>1</issue>
          ):
          <fpage>21</fpage>
          -
          <lpage>51</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>