=Paper=
{{Paper
|id=Vol-481/paper-4
|storemode=property
|title=Using Attentive Focus to Discover Action Ontologies from Perception
|pdfUrl=https://ceur-ws.org/Vol-481/paper-3.pdf
|volume=Vol-481
|dblpUrl=https://dblp.org/rec/conf/nesy/Mukerjee09
}}
==Using Attentive Focus to Discover Action Ontologies from Perception==
Amitabha Mukerjee
Department of Computer Science & Engineering
Indian Institute of Technology Kanpur
amit@cse.iitk.ac.in
Abstract

The word "symbol", as it is used in logic and computational theory, is considerably different from its usage in cognitive linguistics and in everyday life. Formal approaches that define symbols in terms of other symbols ultimately need to be grounded in perceptual-motor terms. Based on cognitive evidence that the earliest action structures may be learned from perception alone, we propose to use attentive focus to identify the agents participating in an action, map the characteristics of their interaction, and ultimately discover actions as clusters in perceptuo-temporal space. We demonstrate its applicability by learning actions from simple 2D image sequences, and then demonstrate the learned predicates by recognizing 3D actions. This mapping, which also identifies the objects involved in the interaction, informs us about the argument structure of the verb, and may help guide syntax. Ontologies in such systems are learned as different granularities in the clustering space; action hierarchies emerge as membership relations between actions.

1 Introduction

Learning concepts for concrete objects requires the perceptual system to abstract across visual presentations of these objects. In contrast, modeling actions presents a more complex challenge [Fleischman and Roy, 2005], [Sugiura and Iwahashi, 2007]. Yet actions are the central structure for organizing concepts; the corresponding language units (verbs) also act as "heads" (predicates) in sentences, controlling how an utterance is to be interpreted. Typically, the structure for an action/verb includes a set of possible constituents that participate in the action, and also some constraints on the type of action (e.g. the type of motion that may constitute "A chases B").

In this work, we consider the learning of the structure of actions, based on image sequences. Cognitively, there is evidence that some action schemas are acquired through perception in a pre-linguistic stage [Mandler, 2004]; later these are reinforced via participation, and may eventually seed linguistic aspects such as argument structure.

We postulate that a key aspect of this process is the role of perceptual attention [Regier, 2003], [Ballard and Yu, 2003]. Thus, an action involving two agents may involve attention shifts between them, which helps limit the set of agents participating in the action. The set of agents participating in an action eventually generalizes to the argument structure. In [Ballard and Yu, 2003], human gaze was directly tracked and matched with language fragments, and verbs such as "picking up" and "stapling" were associated with certain actions. However, the verbal concepts learned were specific to the context, and no attempt was made to generalize these into action schemas applicable to new scenes or situations. Top-down attention guided by linguistic inputs is used to identify objects in [Roy and Mukherjee, 2005]. More recently, in [Guha and Mukerjee, 2007], attentive focus is used to learn labels for simple motion trajectories, but this too is restricted to a particular visual domain.

1.1 From Percept to Concept to Symbol

The word "symbol", as it is used in logic and computational theory, is considerably different from its usage in cognitive linguistics and in everyday life. The OED defines it as "Something that stands for, represents, or denotes something else". This meaning carries over to the cognitive usage, where it is viewed as a tight coupling of a set of mental associations (the semantic pole) with the psychological impression of the sound (the phonological pole) [?]. Formally, however, a symbol is detached from any meaning; it is just a token constructed from some finite alphabet, and is related only to other such tokens. A computer system dealing with such symbols can define many relations among symbols, but finds it difficult to relate them to the world, and this also makes it difficult to keep the relations between symbols up to date. The objective of this work is to try to align a symbol to a perceptual stimulus, so as to provide grounding for the symbols used in language or in reasoning.

In other work, we have addressed the question of learning the language label (or the phonological pole) of a symbol [Satish and Mukerjee, 2008]. Here we focus on modeling the semantic pole, especially with respect to action ontologies. Such models, called Image Schemas in Cognitive Linguistics [Langacker, 1999] or Perceptual Schemas in Experimental Psychology [Mandler, 2004], involve abstractions over low-level features extracted from sensorimotor modalities (positions and velocities), as well as the argument structure.
We ask here whether, given a system observing a simple 2D scene (see Fig. 1) with shapes like squares and circles chasing each other, it is possible for the system to cluster all 2-agent interactions in some meaningful way into a set of action schemas. If so, do these action schemas relate reliably to any useful conceptual structures? Further, is it possible to learn relationships between these action schemata, thus constructing a primitive ontology? Note that all this has to take place without any language, and without any human inputs in any form.

Figure 1: Scenes from the 2D video "Chase": three agents, "big square", "small square" and "circle", play and chase each other. Velocities are shown as gray arrows.

Constructing such action templates has a long history in computer vision, but most approaches gather statistics in view-specific ways, with an emphasis on recognition [Xiang and Gong, 2006; ?]. We restrict ourselves to two-object interactions, using no priors, and our feature vectors are combinations of the relative position and velocity vectors of the objects (we use a simple inner product). We perform unsupervised clustering on the spatio-temporal feature space using the Merge Neural Gas algorithm [Strickert and Hammer, 2005]; the resulting clusters constitute our action schemas. By considering different levels of cluster granularity in the unsupervised learning process, we also learn subsets of coarse concepts as finer action concepts, resulting in an action hierarchy which may be thought of as a rudimentary ontology.

Having learned the action schemas from a given input, we apply them to recognize novel 2-body interactions in a 3D fixed-camera video, in which the depth of a foreground object is indicated by its image y-coordinate. We show that the motion features of humans can be labelled using the action schemas learned.

2 Analysis: Role of Attentive Focus

One of the key issues we explore in this work is the relevance of perceptual attention. It turns out that restricting computation to attended events results in a better correlation with motions that are named in language. This may reflect a bias in conceptualization towards actions that attract attention. Like other models that use attention to associate agents or actions with language [Ballard and Yu, 2003; Guha and Mukerjee, 2007], we use attentive focus to constrain the region of visual salience, and thereby the constituents participating in an action. We use a computational model of dynamic visual attention [Singh et al., 2006] to identify the agents possibly in focus.

In order to analyze the different types of motion possible in the scene, we first perform a qualitative analysis of the motions. We assume that all objects have an intrinsic frame with a privileged "front" direction, defined either by the present direction of motion or by the last such observed direction. Let the reference object be A; then the pose of the located object B w.r.t. the frame of A can be described as a 2-dimensional qualitative vector [Forbus et al., 1987], where each axis is represented as {−, 0, +} instead of quantitative values. This results in eight possible non-colliding states for the pose of B. In each pose, the velocity of B is similarly encoded, resulting in 9 possible velocities (including non-moving).
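To make the encoding concrete, the following sketch (our own illustration, not the paper's implementation; the function names are ours, and we read "the velocity of B" as B's velocity expressed in A's intrinsic frame) maps the state of a located object B relative to a reference object A into one of the 8 x 9 x 2 = 144 qualitative states:

```python
import numpy as np

def sign3(value, eps=1e-6):
    """Map a scalar to the qualitative alphabet {-, 0, +}."""
    if value > eps:
        return '+'
    if value < -eps:
        return '-'
    return '0'

def qualitative_state(x_A, v_A, x_B, v_B, front_A, eps=1e-6):
    """Qualitative relative state of located object B w.r.t. reference object A.

    front_A is A's privileged 'front' direction (its current or last observed
    direction of motion).  Returns (pose, velocity, A_moving), where pose and
    velocity are pairs over {-, 0, +} expressed in A's intrinsic frame, and
    A_moving distinguishes a moving from a stationary reference object.
    """
    front_A = np.asarray(front_A, dtype=float)
    front = front_A / (np.linalg.norm(front_A) + eps)
    left = np.array([-front[1], front[0]])          # 90 degrees counter-clockwise
    to_frame = lambda v: (float(np.dot(v, front)), float(np.dot(v, left)))

    rel_pos = to_frame(np.asarray(x_B) - np.asarray(x_A))
    rel_vel = to_frame(np.asarray(v_B))

    pose = (sign3(rel_pos[0], eps), sign3(rel_pos[1], eps))      # 8 non-colliding poses
    velocity = (sign3(rel_vel[0], eps), sign3(rel_vel[1], eps))  # 9 velocity directions
    a_moving = np.linalg.norm(v_A) > eps                         # doubles the state count
    return pose, velocity, a_moving
```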
This results in 72 possible relations, and distinguishing the situation when the reference object A is moving from that when it is stationary results in a total of 144 possible states. Linguistic labels (Come-Close (CC), Move-Away (MA), Chase (CH), Go-Around (GoA), Move-Together (MT), Move-Opposite (MO)) are manually assigned to each of these qualitative relative-motion states. The motions in nearly half the states do not appear to have clear linguistic terms associated with them, and these undenominated interactions are left empty. The remaining class assignments are shown in Figure 2. The qualitative classification for the frames in Fig. 1 is shown in Fig. 3.

Figure 2: Qualitative analysis of two-object interaction: single-frame qualitative classification for (a) a stationary first object and (b) a first object moving horizontally to the right. The x-axis gives the different positions of the second object and the y-axis gives the different velocity directions (including zero velocity) of the second object w.r.t. the first object at the origin. Cases where the motion does not have a simple English label are blank. The other labels are: Come Closer (CC), Move {Away, Opposite, Together} (MA, MO, MT), Chase (CH) and Go Around (GoA).

Figure 3: Single-frame qualitative classification for the frames in Fig. 1. The big square is taken as the first object. The labels assigned are MA (left) and CC (right). P and V refer to the position and velocity in the reference frame with origin at the first object and x-axis along its velocity.

Next, we analyze the frequency of these cases as observed in the Chase video. Fig. 4 compares the frequency of the qualitative states with a non-stationary first object in the situation where all possible object pairs are considered (no attentive focus), versus that where, using attentive cues, pairs of agents attended to within a temporal window of 20 frames become candidates for mutual interaction and all other agent pairings are ignored. The frequency of indeterminate qualitative cases is 58% in the first situation and 24% in the second. Thus, attentive focus biases the learning towards relations that we have names for in language.

Figure 4: Without and with attention: frequency map of the single-frame qualitative classification for the case of a non-stationary first object. The percentage of feature vectors that cannot be labelled (shown in red) is 58% without attention, and 24% with attentive focus.

3 Visual Attention

We consider a bottom-up model of visual attention (not dependent on the task at hand) [Itti, 2000]; specifically, a model designed to capture bottom-up attention in dynamic scenes based on motion saliency [Singh et al., 2006]. Objects, rather than pixels, are taken as the attentive foci. A motion saliency map is computed from optical flow, and a confidence map is introduced to assign higher salience to objects not visited for a long time. A small foveal bias is introduced to mediate in favour of proximal fixations over large saccadic motions. A Winner-Take-All network on the combined saliency map gives the most salient object for fixation.
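As a rough illustration of how such an object-level winner-take-all selection might be organized (a schematic sketch under our own assumptions; the weighting scheme and constants are ours, not taken from [Singh et al., 2006]):

```python
import numpy as np

def select_fixation(flow_energy, frames_since_visit, positions, last_fixation,
                    w_conf=0.3, w_fovea=0.2):
    """Pick the most salient object (winner-take-all over a combined score).

    flow_energy        : per-object motion saliency, e.g. mean optical-flow
                         magnitude inside each object's blob
    frames_since_visit : per-object count since the object was last fixated
    positions          : per-object 2D centroids
    last_fixation      : 2D position of the previous fixation
    The weights are illustrative, not values from the paper.
    """
    motion = np.asarray(flow_energy, dtype=float)
    motion /= motion.max() + 1e-9

    # confidence term: grows for objects not visited for a long time
    confidence = 1.0 - np.exp(-np.asarray(frames_since_visit, dtype=float) / 30.0)

    # foveal bias: favour fixations close to the previous one
    dist = np.linalg.norm(np.asarray(positions, dtype=float)
                          - np.asarray(last_fixation, dtype=float), axis=1)
    foveal = np.exp(-dist / (dist.max() + 1e-9))

    saliency = motion + w_conf * confidence + w_fovea * foveal
    return int(np.argmax(saliency))          # index of the winning object
```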
4 Unsupervised Perceptual Clustering

Perceptual systems return certain abstractions of the raw sensory data - "features" - which are used for recognition, motor control, categorization, etc. In this work we use two features that capture the interaction of two agents. All learning takes place in the space of these two features (Table 1); the first feature combines the relative position with the relative velocity, and the second combines the relative position with the sum of the two velocities.

   Name          Formula
   pos·velDiff   (x_B − x_A) · (v_B − v_A)
   pos·velSum    (x_B − x_A) · (v_B + v_A)

Table 1: Dyadic feature formulae. A and B refer to the two objects; A is the Reference Object (the more salient, usually the larger, of the two objects) and B the Located Object in the feature computation; v_A refers to the velocity vector of A, x_A to the position vector of A, and '·' denotes the inner product of the vectors.
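For a single frame, the two dyadic features reduce to two inner products; a minimal sketch (our own code, not the authors' implementation) is:

```python
import numpy as np

def dyadic_features(x_A, v_A, x_B, v_B):
    """Dyadic features of Table 1 for reference object A and located object B.

    pos·velDiff = (x_B - x_A) . (v_B - v_A): positive when the separation is
                  growing, negative when the objects are closing in.
    pos·velSum  = (x_B - x_A) . (v_B + v_A): large in magnitude when both
                  objects move along the line joining them, as in a chase.
    """
    rel_pos = np.asarray(x_B) - np.asarray(x_A)
    pos_vel_diff = float(np.dot(rel_pos, np.asarray(v_B) - np.asarray(v_A)))
    pos_vel_sum = float(np.dot(rel_pos, np.asarray(v_B) + np.asarray(v_A)))
    return pos_vel_diff, pos_vel_sum
```

One such feature pair is computed per frame for each attended object pair before clustering.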
These feature vectors are then clustered into categories in an unsupervised manner, based on a notion of distance between individual vectors. For unsupervised learning we use the Merge Neural Gas (MNG) algorithm [Strickert and Hammer, 2005], which has been shown to be well suited to processing complex dynamic sequences compared to other existing models for temporal data processing, such as the Temporal Kohonen Map and the Recursive SOM. This class of temporal learning algorithms is more flexible with respect to state specifications and time history than HMMs or VLMMs. The MNG algorithm also performs better than unsupervised clustering algorithms such as K-Windows [Vrahatis et al., 2002] and DBSCAN [Ester et al., 1996], because, unlike those algorithms, it utilizes the temporal information present in the frame sequences.

4.1 Merge Neural Gas algorithm

The Neural Gas algorithm [Martinetz and Schulten, 1994] learns important topological relations in a given set of input vectors (signals) in an unsupervised manner, by means of a simple Hebb-like learning rule. It takes a distribution of high-dimensional data, P(ξ), and returns a densely connected network resembling the topology of the input.

For input feature vectors arriving from temporally connected data, the basic Neural Gas algorithm can be generalized by including an explicit context representation that utilizes the temporal ordering present in the feature vectors of the frames, resulting in the Merge Neural Gas algorithm [Strickert and Hammer, 2005]. Here, a context vector is adjusted based on the present winning neuron's data. Cluster labels for the frames are obtained in the final iteration of the algorithm, based on the winner neuron.
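To make the context mechanism concrete, the following is a compact sketch of merge-style neural gas training, following the usual formulation of merged context vectors (parameter names and values are our own; this is not the authors' implementation):

```python
import numpy as np

class MergeNeuralGas:
    """Minimal merge neural gas: each neuron keeps a weight and a context vector."""

    def __init__(self, n_neurons, dim, alpha=0.5, beta=0.5, eps=0.05, lam=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_neurons, dim))   # weight vectors (match the input)
        self.C = np.zeros((n_neurons, dim))          # context vectors (match the merged context)
        self.alpha, self.beta = alpha, beta          # context weight in distance / in merging
        self.eps, self.lam = eps, lam                # learning rate, neighbourhood range
        self.prev_winner = 0
        self.context = np.zeros(dim)

    def _distances(self, x):
        # merged context: blend of the previous winner's weight and context
        self.context = (1 - self.beta) * self.W[self.prev_winner] \
                       + self.beta * self.C[self.prev_winner]
        return (1 - self.alpha) * np.sum((self.W - x) ** 2, axis=1) \
               + self.alpha * np.sum((self.C - self.context) ** 2, axis=1)

    def step(self, x):
        """One adaptation step for the feature vector of one frame."""
        ranking = np.argsort(self._distances(x))     # neural-gas rank of every neuron
        for rank, i in enumerate(ranking):
            h = self.eps * np.exp(-rank / self.lam)  # rank-based neighbourhood function
            self.W[i] += h * (x - self.W[i])
            self.C[i] += h * (self.context - self.C[i])
        self.prev_winner = int(ranking[0])
        return self.prev_winner                      # winner neuron = cluster label

    def classify(self, x):
        """Winner neuron for x without adapting the network (recognition time)."""
        self.prev_winner = int(np.argmin(self._distances(x)))
        return self.prev_winner
```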
5 Concept Acquisition: Chase video

Unsupervised clustering with the Merge Neural Gas algorithm is applied to the feature vectors from the video, corresponding to object pairs that were in attentive focus at around the same time. Salient objects in a scene are ordered by a computational model of bottom-up dynamic attention [Singh et al., 2006]. The most salient object is determined for each frame, and other objects that were salient within k frames before and after (we use k = 10) are considered to be attended simultaneously. Dyadic feature vectors are computed for all object pairs in these 2k frames.
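A minimal sketch of this pairing step, under our own reading of the procedure (the helper name and the per-frame "most salient object" list are illustrative):

```python
def attended_pairs(salient_object, k=10):
    """Candidate object pairs for interaction, from a per-frame saliency winner.

    salient_object : list mapping frame index -> id of the most salient object.
    Two objects are considered co-attended if each is the saliency winner
    within k frames of the other; all other pairings are ignored.
    """
    pairs = set()
    for t, obj in enumerate(salient_object):
        window = salient_object[max(0, t - k): t + k + 1]
        for other in set(window):
            if other != obj:
                pairs.add((t, tuple(sorted((obj, other)))))
    return pairs   # (frame, (object_i, object_j)) tuples to compute dyadic features for
```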
Owing to the randomized nature of the algorithm, the number of clusters varies from run to run. Clusters with fewer than ten frames are dropped. With the aging parameter set to 30, the number of clusters came out to be four in 90% of the runs; the set of four clusters with the highest total classification accuracy (Table 2) is considered below.

   Action   C1    C2    C3    C4    Tot   %    TCA
   CC       399     6    10    29   444   90
   MA        16   311     5    48   380   82    84
   Chase     21    59   149   154   383   79

Table 2: Clustering accuracy: the entry in the ith row and jth column gives the number of frames with the ith action label assigned to the jth NG cluster. % is the fraction of vectors of an action correctly classified, out of the total vectors of that type. Total Classification Accuracy (TCA) is the % of all vectors correctly classified.

Figure 5: Feature vectors of the four clusters from the MNG algorithm, plotted in the feature space (Feature 1 = pos·velDiff, Feature 2 = pos·velSum): CC - C1, MA - C2, Chase (Reference Object is the chaser) - C3, Chase (Reference Object is the leader) - C4. The clusters reflect the spatio-temporal proximity of the vectors.

In order to validate these clusters against human concepts, we asked three subjects (male, Hindi-English/Telugu-English bilinguals, aged 22, 20 and 30) to label the scenes in the video. They were shown the video twice, and in the third viewing they were asked to speak out one of three action labels (CC, MA, Chase), which was recorded. Given the label and the frame at which it was uttered, the actual event boundaries and participating objects for the ground-truth data were assigned by inspection. In cases of disagreement, we took the majority view.

The percentage accuracies shown in Table 2 do not reflect the degree of match: although an event may last over 15 frames, detecting even 10 of those frames is usually quite helpful. This can be seen in Fig. 6, which presents results along a timeline for Chase; each row reflects a different combination of agents (small square, big square, circle). At first glance, figures like Fig. 6 would seem to reflect a higher accuracy than the 84% of Table 2.

Figure 6: Comparison of human and algorithm labelling of "chase" over the first 1500 frames. Because of our choice of reference object, frames in the first row are in C4 and frames in the second row are in C3.

A surprising result was found when experimenting with the edge aging parameter in the Merge Neural Gas algorithm. The number of clusters increases as the aging parameter is decreased, and at one stage eight clusters were formed (edge aging parameter = 16). The Total Classification Accuracy (TCA) was about 51% and we would have discarded the result, but inspecting the frames revealed that the clusters may reflect a hierarchy of action types. Thus cluster C1 from the earlier classification (majority correlation = CC) was broken up into C1, C5 and C6; C1 was found to contain frames where both objects are moving towards each other, whereas C5 contains frames where the smaller object is stationary and the other moves closer. Thus Come-Closer and Move-Away each appear to be sub-classified into three classes (two one-object-static cases and one both-moving case). This 'finer' classification is given in Table 3.
           C1    C2    C3    C4    C5    C6    C7    C8
   CC     201     3     9    20   189    21     1     0
   MA       8   126     4    45     9     1   181     6
   Chase    1     9   142   151    13     9    32    26

Table 3: Hierarchical clustering: using a larger number of clusters reveals a sub-classification; e.g. frames classified as CC in Table 2 are now in C1, C5 or C6, reflecting two cases of CC(one object static) or one case of CC(both moving).

5.1 Argument order in Action Schemas

In another experiment, we investigated the importance of argument ordering by re-classifying the same frames, but reversing the order of the objects used in the dyadic vector computation: where the larger object was earlier arg1, the reference object, it now became arg2, the non-reference object. If the corresponding concept changed, and especially if it flipped, this would reflect a semantic necessity to preserve the argument order; otherwise the arguments would be commutative. Using the coarser clusters, we observe that the argument order is immaterial for C1 and C2 (CC and MA respectively), since the majority relation is unchanged. On the other hand, both C3 and C4 (the clusters correlated with Chase) are flipped (Table 4). Thus, the fact that argument order matters for Chase is learned implicitly within the action schema itself. The non-commutativity of CC(one object static) and MA(one object static) could not be established, because of the skewed distribution of frames in the input video between the two sub-classes of these action verbs.

               C1    C2    C4    C3
   Cluster 1   390    20    11    15
   Cluster 2     9   323    15    29
   Cluster 3     6    12     1   145
   Cluster 4    22    48   152     9

Table 4: Relevance of argument order: the value in the ith row and jth column gives the number of vectors that were originally in Cluster i and are assigned to Cluster j when the object order is switched in the dyadic feature vectors. Note that C3 and C4, the clusters corresponding to Chase, are flipped.
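Algebraically, the swap has a simple effect on the two features of Table 1; we note this as an observation of our own, not a derivation from the paper. Exchanging A and B leaves pos·velDiff unchanged and reverses the sign of pos·velSum, which is consistent with the CC and MA clusters being stable under the swap while the two Chase clusters, which differ mainly in the sign of pos·velSum (whether the reference object is the chaser or the leader), exchange places. A one-line sketch:

```python
def swap_arguments(pos_vel_diff, pos_vel_sum):
    """Dyadic features (Table 1) after exchanging the reference and located objects.

    With rel = x_B - x_A, the original features are
        pos_vel_diff = rel . (v_B - v_A),   pos_vel_sum = rel . (v_B + v_A).
    Swapping A and B negates rel and exchanges the velocities, so the first
    feature is unchanged while the second only changes sign.
    """
    return pos_vel_diff, -pos_vel_sum
```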
5.2 Comparison with K-Windows Clustering

We compare the clustering accuracy obtained by the unsupervised Merge Neural Gas algorithm with that of the K-Windows algorithm [Vrahatis et al., 2002]. K-Windows is an improvement of the K-Means clustering algorithm, with better time complexity and clustering accuracy. We set the value of k in this algorithm to 4 and run it on the input feature vectors obtained after attentive pruning; the initial cluster points for the algorithm are set randomly. Table 5 gives the clustering results obtained.

   Action   C1    C2    C3    C4    Total   %    TCA
   CC       277    19    51    97     444   62
   MA        29   234    61    56     380   61    59
   Chase     95    83    91   114     383   54

Table 5: Clustering accuracy with K-Windows (k = 4): the entry in the ith row and jth column gives the number of frames with the ith action label assigned to the jth cluster. % is the fraction of vectors of an action correctly classified, out of the total vectors of that type. Total Classification Accuracy (TCA) is the % of all vectors correctly classified.

The lower accuracy (as compared to the results in Table 2) is expected, because K-Windows treats each feature vector as a separate entity, without utilizing the information present in the temporal ordering of the frames.

6 Recognizing actions in 3D

In order to test the effectiveness of the clusters learned, we test the recognition of motions from a 3D video of three persons running around in a field (Fig. 7). In the human classification of the action categories (into one of CC, MA, Chase), the dominant predicate in the video (777 out of 991 frames) is Chase.

In the image processing stage, the system learns the background over the initial frames and uses it to segment out the foreground blobs. It is then able to track all three agents using the Meanshift algorithm. Assuming the camera height is near eye level, the bottom-most point in each blob corresponds to that agent's contact with the ground, from which its depth can be determined within some scaling error (157 frames with extensive occlusion between agents were omitted). Given this depth, one can solve for the lateral position; thus, from a single-view video, we obtain the (x, y) coordinates of each agent in each frame, within a constant scale. Based on these, the relative pose and motion parameters are computed for each agent pair, and therefrom the features outlined earlier. These feature vectors are then classified using the action schemas (coarse clusters) already obtained from the 2D Chase video (Table 6).
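A schematic version of this ground-plane back-projection (our own sketch under a pinhole-camera assumption; the principal point and focal length are illustrative parameters, not values reported in the paper):

```python
def ground_position(u, v, principal_point, focal_px):
    """Ground-plane position of a tracked agent, in units of the camera height.

    Assumptions (ours, for illustration): pinhole camera with a horizontal
    optical axis at roughly eye level above a flat ground plane; (u, v) is the
    bottom-most pixel of the agent's blob, i.e. its contact with the ground.
    Since the camera height is unknown, the returned (X, Z) are correct only
    up to that constant scale, which suffices for relative-motion features.
    """
    cu, cv = principal_point
    dv = v - cv                      # pixels below the horizon line
    if dv <= 0:
        raise ValueError("ground contact point must lie below the horizon")
    Z = focal_px / dv                # depth along the optical axis (units of camera height)
    X = (u - cu) / dv                # lateral offset (same units)
    return X, Z
```

Per-frame velocities for the dyadic features can then be obtained by finite-differencing these positions.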
Figure 7: Test video: scenes from the 3D video.

           C1    C2    C3    C4    Chase total   %
   Chase   13    15   496   253           777   96

Table 6: Distribution of Chase frames (ground truth) from the 3D video across the Neural Gas clusters.

7 Discussion and Conclusion

We have outlined how our unsupervised approach learns action schemas of two-agent interactions, resulting in an action ontology. The image-schematic nature of the clusters is validated by producing a description for a 3D video. The approach presented here underlines the role of concept argument structures in aligning with linguistic expressions, and that of bottom-up dynamic attention in pruning the visual input and in aligning linguistic focus.

Once a few basic concepts are learned, other concepts can be learned without direct grounding, by using conceptual blending mechanisms on the concepts themselves. These operations are often triggered by linguistic cues, so that new concepts, as well as their labels, are learned together at a later stage. Indeed, the vast majority of our vocabulary is learned later, purely from linguistic input [Bloom, 2000]. But this is only possible because of the grounded nature of the first few concepts, without which the later concepts could not be grounded. Thus the perceptually grounded nature of the very first concepts is crucial to subsequent compositions.

Figure 8: Image schemas identified for actions: "Red Chase Green", "Move Away(Red,Yellow)", "Move Away(Green,Yellow)".

References

[Ballard and Yu, 2003] Dana H. Ballard and Chen Yu. A multimodal learning interface for word acquisition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP03), volume 5, pages 784-7, April 2003.

[Bloom, 2000] Paul Bloom. How Children Learn the Meanings of Words. MIT Press, Cambridge, MA, 2000.

[Ester et al., 1996] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.

[Fleischman and Roy, 2005] Michael Fleischman and Deb Roy. Why verbs are harder to learn than nouns: Initial insights from a computational model of intention recognition in situated word learning. In Proceedings of the 27th Annual Meeting of the Cognitive Science Society, 2005.

[Forbus et al., 1987] Kenneth D. Forbus, Paul Nielsen, and Boi Faltings. Qualitative kinematics: A framework. In IJCAI, pages 430-436, 1987.

[Guha and Mukerjee, 2007] Prithwijit Guha and Amitabha Mukerjee. Language label learning for visual concepts discovered from video sequences. In Lucas Paletta, editor, Attention in Cognitive Systems: Theories and Systems from an Interdisciplinary Viewpoint, volume 4840, pages 91-105. Springer LNCS, Berlin / Heidelberg, 2007.

[Itti, 2000] L. Itti. Models of Bottom-Up and Top-Down Visual Attention. PhD thesis, Pasadena, California, Jan 2000.

[Langacker, 1999] Ronald Wayne Langacker. Grammar and Conceptualization. Mouton de Gruyter, Berlin/New York, 1999.

[Mandler, 2004] J. M. Mandler. Foundations of Mind. Oxford University Press, 2004.

[Martinetz and Schulten, 1994] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1994.

[Regier, 2003] Terry Regier. Emergent constraints on word-learning: A computational review. Trends in Cognitive Sciences, 7:263-268, 2003.

[Roy and Mukherjee, 2005] Deb Roy and Niloy Mukherjee. Towards situated speech understanding: visual context priming of language models. Computer Speech and Language, 19(2):227-248, 2005.

[Satish and Mukerjee, 2008] G. Satish and A. Mukerjee. Acquiring linguistic argument structure from multimodal input using attentive focus. In 7th IEEE International Conference on Development and Learning (ICDL 2008), pages 43-48, 2008.

[Singh et al., 2006] Vivek Kumar Singh, Subranshu Maji, and Amitabha Mukerjee. Confidence based updation of motion conspicuity in dynamic scenes. In Third Canadian Conference on Computer and Robot Vision, 2006.

[Strickert and Hammer, 2005] Marc Strickert and Barbara Hammer. Merge SOM for temporal data. Neurocomputing, 64:39-71, 2005.

[Sugiura and Iwahashi, 2007] Komei Sugiura and Naoto Iwahashi. Learning object-manipulation verbs for human-robot communication. In WMISI '07: Proceedings of the 2007 Workshop on Multimodal Interfaces in Semantic Interaction, pages 32-38, New York, NY, USA, 2007. ACM.

[Vrahatis et al., 2002] Michael N. Vrahatis, Basilis Boutsinas, Panagiotis Alevizos, and Georgios Pavlides. The new k-windows algorithm for improving the k-means clustering algorithm. Journal of Complexity, 18(1):375-391, March 2002.

[Xiang and Gong, 2006] T. Xiang and S. Gong. Beyond tracking: Modelling activity and understanding behaviour. International Journal of Computer Vision, 67(1):21-51, 2006.