
Simulating Actions with the Associative Self-Organizing Map

Miriam Buonamente

miriam.buonamente@unipa.it 1

Haris Dindo

haris.dindo@unipa.it 1

Magnus Johnsson

magnus@magnusjohnsson.se 0

0 Lund University Cognitive Science, Lundagard, 222 22 Lund, Sweden
1 RoboticsLab, DICGIM, University of Palermo, Viale delle Scienze, Ed. 6, 90128 Palermo, Italy

We present a system that can learn to represent actions as well as to internally simulate the likely continuation of their initial parts. The method we propose is based on the Associative Self-Organizing Map (A-SOM), a variant of the Self-Organizing Map. By emulating the way the human brain is thought to perform pattern recognition tasks, the A-SOM learns to associate its activity with different inputs over time, where the inputs are observations of others' actions. Once the A-SOM has learnt to recognize actions, it uses this learning to predict the continuation of an observed initial movement of an agent, in this way reading its intentions. We evaluate the system's ability to simulate actions in an experiment with good results, and we discuss its generalization ability. The presented research is part of a bigger project aiming at endowing an agent with the ability to internally represent action patterns and to use these to recognize and simulate others' behaviour.

Keywords: Associative Self-Organizing Map, Neural Network, Action Recognition, Internal Simulation, Intention Understanding

Robots are on the verge of becoming a part of human society. The aim is to augment human capabilities with automated and cooperative robotic devices so as to make life more convenient and safe. Robotic agents could be applied in several fields, such as general assistance with everyday tasks for elderly and disabled people, enabling them to live independent and comfortable lives like people without disabilities. To meet this demand, natural and intuitive interfaces, which allow inexperienced users to employ their robots easily and safely, have to be implemented.

Efficient cooperation between humans and robots requires continuous and complex intention recognition; agents have to understand and predict human intentions and motion. In our daily interactions, we depend on the ability to understand the intent of others, which allows us to read others' minds. In a simple dance, two persons coordinate their steps and their movements by subliminally predicting each other's intentions. In the same way, in multi-agent environments, two or more agents that cooperate (or compete) to perform a certain task have to mutually understand their intentions.

Intention recognition can be defined as the problem of inferring an agent's intention through the observation of its actions. This problem has been addressed in several fields of human-robot collaboration [1]. In robotics, intention recognition has been addressed in many contexts, such as social interaction [2] and learning by imitation [3] [4] [5].

Intention recognition requires a wide range of evaluative processes including, among others, the decoding of biological motion and the ability to recognize tasks. This decoding is presumably based on the internal simulation [6] of other people's behaviour within our own nervous system. The visual perception of motion is a particularly crucial source of sensory input. It is essential to be able to pick out the motion in order to predict the actions of other individuals. Johansson's experiment [7] showed that humans, just by observing points of light, were able to perceive and understand movements. By looking at biological motion, such as Johansson's walkers, humans attribute mental states such as intentions and desires to the observed movements. Recent neurobiological studies [8] corroborate Johansson's experiment by arguing that the human brain can perceive actions by observing only the human body poses, called postures, during action execution. Thus, actions can be described as sequences of consecutive human body poses, in terms of human body silhouettes [9] [10] [11]. Many neuroscientists believe that the ability to understand the intentions of other people just by observing them depends on the so-called mirror-neuron system in the brain [12], which comes into play not only when an action is performed, but also when a similar action is observed. It is believed that this mechanism is based on the internal simulation of the observed action and the estimation of the actor's intentions on the basis of a representation of one's own intentions [13].

Our long-term goal is to endow an agent with the ability to internally represent motion patterns and to use these patterns to recognize and simulate others' behaviour. The study presented here is part of a bigger project whose first step was to efficiently represent and recognize human actions [14] by using the Associative Self-Organizing Map (A-SOM) [15]. In this paper we use the same biologically-inspired model to predict an agent's intentions by internally simulating the behaviour likely to follow initial movements. As humans do effortlessly, agents have to be able to elicit the likely continuation of the observed action even if an obstacle or other factors obscure their view. Indeed, as we will see below, the A-SOM can remember perceptual sequences by associating the current network activity with its own earlier activity. Due to this ability, the A-SOM can receive an incomplete input pattern and continue to elicit the likely continuation, i.e. carry out sequence completion of perceptual activity over time.

We have tested the A-SOM on the simulation of observed actions using a dataset made of images depicting only the part of the person's body involved in the movement. The images used to create this dataset were taken from the "INRIA 4D repository" (see footnote 3), a publicly available dataset of movies representing 13 common actions: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point, pick up, and throw (see Fig. 1).

This paper is organized as follows: a short presentation of the A-SOM network is given in Section 2. Section 3 presents the method and the experiments for evaluating performance. Conclusions and future work are outlined in Section 4.

2 Associative Self-Organizing Map

The A-SOM is an extension of the Self-Organizing Map (SOM) [16] which learns to associate its activity with the activity of other neural networks. It can be considered a SOM with additional (possibly delayed) ancillary input from other networks (Fig. 2).

Ancillary connections can also be used to connect the A-SOM to itself, thus associating its activity with its own earlier activity. This makes the A-SOM able to remember and to complete perceptual sequences over time. Several simulations have shown that the A-SOM, once it has received some initial input, can continue to elicit the likely following activity in the near future even though no further input is received [17] [18].

The A-SOM consists of an $I \times J$ grid of neurons with a fixed number of neurons and a fixed topology. Each neuron $n_{ij}$ is associated with $r + 1$ weight vectors $w^a_{ij} \in \mathbb{R}^n$ and $w^1_{ij} \in \mathbb{R}^{m_1}$, $w^2_{ij} \in \mathbb{R}^{m_2}$, ..., $w^r_{ij} \in \mathbb{R}^{m_r}$. All the elements of all the weight vectors are initialized with real numbers randomly selected from a uniform distribution between 0 and 1, after which all the weight vectors are normalized, i.e. turned into unit vectors.
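The initialization just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the Ikaros implementation used for the experiments; the function names, the nested-list layout and the fixed seed are assumptions made for the example.

```python
import math
import random

def init_weight_vector(dim, rng):
    """Draw each element uniformly from [0, 1) and scale to unit length."""
    w = [rng.random() for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in w))
    # An all-zero draw is practically impossible, but guard against it anyway.
    return [v / norm for v in w] if norm > 0.0 else w

def init_asom_weights(rows, cols, dims, seed=0):
    """Return weights[i][j][k], the k-th weight vector of neuron n_ij.

    dims = [n, m_1, ..., m_r] lists the size of the main input followed by
    the sizes of the r ancillary inputs, so each neuron gets r + 1 vectors.
    """
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    return [[[init_weight_vector(d, rng) for d in dims]
             for _ in range(cols)]
            for _ in range(rows)]

# A 3 x 4 map with a 5-dimensional main input and one ancillary input.
# The ancillary size 12 = 3 * 4 matches a map connected to itself.
weights = init_asom_weights(rows=3, cols=4, dims=[5, 12])
```

Choosing the ancillary dimension equal to the number of neurons reflects the case in which the map's own earlier activity, rearranged into a vector, is fed back as ancillary input.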

At time $t$ each neuron $n_{ij}$ receives $r + 1$ input vectors $x^a(t) \in \mathbb{R}^n$ and $x^1(t - d_1) \in \mathbb{R}^{m_1}$, $x^2(t - d_2) \in \mathbb{R}^{m_2}$, ..., $x^r(t - d_r) \in \mathbb{R}^{m_r}$, where $d_p$ is the time delay for input vector $x^p$, $p = 1, 2, \ldots, r$.

The main net input $s_{ij}$ is calculated using the standard cosine metric

$$s_{ij}(t) = \frac{x^a(t) \cdot w^a_{ij}(t)}{\|x^a(t)\| \, \|w^a_{ij}(t)\|}. \tag{1}$$

The activity in the neuron $n_{ij}$ is given by

$$y_{ij} = \left[ y^a_{ij}(t) + y^1_{ij}(t) + y^2_{ij}(t) + \ldots + y^r_{ij}(t) \right] / (r + 1), \tag{2}$$

where the main activity $y^a_{ij}$ is calculated by using the softmax function [19]

$$y^a_{ij}(t) = \frac{(s_{ij}(t))^m}{\max_{ij} (s_{ij}(t))^m},$$

where $m$ is the softmax exponent.

The ancillary activity $y^p_{ij}(t)$, $p = 1, 2, \ldots, r$, is calculated by again using the standard cosine metric

$$y^p_{ij}(t) = \frac{x^p(t - d_p) \cdot w^p_{ij}(t)}{\|x^p(t - d_p)\| \, \|w^p_{ij}(t)\|}.$$

The weights $w^a_{ijk}$ are adapted by

$$w^a_{ijk}(t + 1) = w^a_{ijk}(t) + \alpha(t) \, G_{ijc}(t) \left[ x^a_k(t) - w^a_{ijk}(t) \right],$$

where $0 \le \alpha(t) \le 1$ is the adaptation strength with $\alpha(t) \to 0$ when $t \to \infty$. The neighbourhood function

$$G_{ijc}(t) = e^{-\frac{\|r_c - r_{ij}\|}{2\sigma^2(t)}}$$

is a Gaussian function decreasing with time, and $r_c \in \mathbb{R}^2$ and $r_{ij} \in \mathbb{R}^2$ are location vectors of neurons $c$ and $n_{ij}$ respectively.

The weights $w^p_{ijl}$, $p = 1, 2, \ldots, r$, are adapted by

$$w^p_{ijl}(t + 1) = w^p_{ijl}(t) + \beta \, x^p_l(t - d_p) \left[ y^a_{ij}(t) - y^p_{ij}(t) \right],$$

where $\beta$ is the adaptation strength.

All weights $w^a_{ijk}(t)$ and $w^p_{ijl}(t)$ are normalized after each adaptation.

Footnote 3: The repository is available at http://4drepository.inrialpes.fr. It offers several movies representing sequences of actions. Each video is captured from 5 different cameras. For the experiments in this paper we chose the movie "Julien1" with the frontal camera view "cam0".
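To make the activity and adaptation rules above concrete, the following Python sketch performs one time step for an A-SOM with a single ancillary input ($r = 1$). It is a minimal illustration under stated assumptions, not the authors' C++/Ikaros code: the function names are hypothetical, $\alpha$ and $\sigma$ are held constant instead of decaying with time, inputs are assumed non-negative, the neuron $c$ is taken to be the one with the strongest main activity, and a null input vector is assumed to yield zero activity.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is a null vector)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # assumption: a null input elicits no activity
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def unit(w):
    """Scale w to unit length (returned unchanged if it is a null vector)."""
    n = math.sqrt(sum(a * a for a in w))
    return [a / n for a in w] if n > 0.0 else list(w)

def asom_step(wa, wp, xa, xp, m=2, alpha=0.1, beta=0.1, sigma=1.0, learn=True):
    """One step with main input xa and (already delayed) ancillary input xp.

    wa[i][j] and wp[i][j] are the main and ancillary weight vectors of n_ij.
    Returns the total activity as a dict keyed by (i, j), and the winner c.
    """
    rows, cols = len(wa), len(wa[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    # Main net input (cosine metric) raised to the softmax exponent m.
    s_pow = {c: cosine(xa, wa[c[0]][c[1]]) ** m for c in cells}
    peak = max(s_pow.values())
    ya = {c: (s_pow[c] / peak if peak > 0.0 else 0.0) for c in cells}
    # Ancillary activity (cosine metric) and total activity for r = 1.
    yp = {c: cosine(xp, wp[c[0]][c[1]]) for c in cells}
    y = {c: (ya[c] + yp[c]) / 2.0 for c in cells}
    # Assumption: the winner c is the neuron with the strongest main activity.
    winner = max(cells, key=lambda c: ya[c])
    if learn:
        for (i, j) in cells:
            dist = math.hypot(winner[0] - i, winner[1] - j)
            g = math.exp(-dist / (2.0 * sigma ** 2))  # neighbourhood G_ijc
            wa[i][j] = unit([w + alpha * g * (x - w)
                             for w, x in zip(wa[i][j], xa)])
            delta = ya[(i, j)] - yp[(i, j)]
            wp[i][j] = unit([w + beta * x * delta
                             for w, x in zip(wp[i][j], xp)])
    return y, winner

# Tiny 2 x 2 example with a 3-dimensional main input.
wa = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
      [[0.0, 0.0, 1.0], [0.6, 0.8, 0.0]]]
wp = [[[0.5] * 4 for _ in range(2)] for _ in range(2)]
y, winner = asom_step(wa, wp, xa=[1.0, 0.0, 0.0], xp=[1.0, 0.0, 0.0, 0.0])
```

When the A-SOM is connected to itself, `xp` would be the total activity from the previous step rearranged into a vector, so that calling `asom_step` repeatedly with a null `xa` lets the ancillary connections alone drive the activity.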

In this paper the ancillary input vector $x^1$ is the activity of the A-SOM from the previous iteration rearranged into a vector with the time delay $d_1 = 1$.

We want to evaluate whether the bio-inspired model introduced and tested for the action recognition task in [14] (Fig. 3) is also able to simulate the continuation of the initial part of an action. To this end, we tested the simulation capabilities of the A-SOM. The aim of the experiment is to verify whether the network is able to receive an incomplete input pattern and continue to elicit the likely continuation of recognized actions. Actions, defined as single motion patterns performed by a single human [20], are described as sequences of body postures.

The dataset of actions is the same as the one we used for the recognition experiment in [14]. It consists of more than 700 postural images representing 13 different actions. Since we want the agent to be able to simulate one action at a time, we split the original movie into 13 different movies: one movie for each action (see Fig. 1). Each frame is preprocessed to reduce the noise and to improve its quality, and the posture vectors are extracted (see Section 3.1 below). The posture vectors are used to create the training set required to train the A-SOM. Our final training set is composed of about 20000 samples, where every sample is a posture vector.

The created input is used to train the A-SOM network. The training lasted for about 90000 iterations. The generated weight file is used to execute the tests. All code for the experiments presented in this paper was implemented in C++ using the neural modelling framework Ikaros [21]. The following sections detail the preprocessing phase as well as the results obtained.

To reduce the computational load and to improve performance, the movies should have the same duration and the images should depict only the part of the body involved in the movement. Reducing the number of images for each movie to 10 gives a good compromise that keeps the actions seamless and fluid while preserving the quality of the movie. As Fig. 4 shows, reducing the number of images in the "walk action" movie does not affect the quality of the action reproduction.

Consecutive images were subtracted so as to depict only the part of the body involved in the action, thereby focusing attention exclusively on the movement. This operation further reduced the number of frames for each movie to 9, without affecting the quality of the video. As can be seen in Fig. 5, in the "walk action" only the arm is involved in the movement.
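The subtraction of consecutive images can be illustrated with the short Python sketch below. It assumes that frames are equally sized 2-D lists of grey levels and that the absolute pixel-wise difference is taken; neither detail is specified in the text, and the function name is hypothetical.

```python
def frame_differences(frames):
    """Absolute pixel-wise difference of each pair of consecutive frames.

    frames is a list of equally sized 2-D lists of grey levels.
    N frames yield N - 1 difference images, in which pixels that did not
    change between two frames become 0.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([[abs(c - p) for p, c in zip(prow, crow)]
                      for prow, crow in zip(prev, curr)])
    return diffs

# Ten toy 2 x 2 frames in which only the top-left pixel changes over time.
frames = [[[i, 5], [5, 5]] for i in range(10)]
diffs = frame_differences(frames)
```

With 10 input frames the function returns 9 difference images, which matches the reduction from 10 to 9 frames per movie described above.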

To further improve the system's performance, we need to produce binary images of fixed and small size. By using a fixed bounding box that includes the part of the body performing the action, we cropped the images, eliminating anything not involved in the movement. In this way, we simulate an attentive process in which the human eye observes and follows only the salient parts of the action. To obtain smaller representations, the binary images depicting the actions were shrunk to $30 \times 30$ matrices. Finally, the obtained matrix representations were vectorized to produce 9 posture vectors $p \in \mathbb{R}^D$, where $D = 900$, for each action. These posture vectors are used as input to the A-SOM.

3.2 Action Simulation

The objective was to verify whether the A-SOM is able to internally simulate the likely continuation of initial actions. Thus, we fed the trained A-SOM with incomplete input patterns and expected it to continue to elicit activity patterns corresponding to the remaining part of the action. The action recognition task has already been tested in [14] with good results. The system we set up was the same as the one used in [14] and consists of one A-SOM connected to itself with time-delayed ancillary connections. To evaluate the A-SOM, 13 sequences, each containing 9 posture vectors, were constructed as explained above. Each of these sequences represents an action. The posture vectors represent the binary images that form the videos and depict only the part of the human body involved in the action, see Fig. 6.
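As a rough illustration of how one already cropped image could be turned into a posture vector with $D = 900$ elements, consider the Python sketch below. The threshold, the nearest-neighbour sampling used for shrinking and the function name are assumptions made for the example; the text does not state which resizing method was actually used.

```python
def to_posture_vector(image, size=30, threshold=0):
    """Binarize an image, shrink it to size x size and flatten it row by row.

    image is a 2-D list of grey levels (assumed to be already cropped to the
    bounding box). Shrinking uses nearest-neighbour sampling, which is an
    assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    vector = []
    for r in range(size):
        src_r = r * height // size      # source row for output row r
        for c in range(size):
            src_c = c * width // size   # source column for output column c
            vector.append(1 if image[src_r][src_c] > threshold else 0)
    return vector

# Toy 60 x 90 image with a bright block in the top-left corner.
image = [[255 if (r < 2 and c < 3) else 0 for c in range(90)] for r in range(60)]
p = to_posture_vector(image)
```

Nine such vectors, one per difference image, would then form the sequence that represents one action.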

We fed the A-SOM with one sequence at a time, each time reducing the number of posture vectors at the end of the sequence and replacing them with null vectors (representing no input). In this way, we created the incomplete input that the A-SOM has to complete. The conducted experiment consisted of several tests. The first one used the sequences consisting of all 9 frames, with the aim of recording the coordinates of the activity centres generated by the A-SOM and using these values as reference values for the further iterations. Each subsequent test used sequences with one frame less (replaced by a null vector representing no input), and the A-SOM had the task of completing the frame sequence by eliciting activity corresponding to the activity representing the remaining part of the sequence. The last test included only the sequences made of one frame (followed by 8 null vectors representing no input).
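The construction of the incomplete test sequences described above can be sketched as follows. The function name and the list-based representation of posture vectors are assumptions made for the illustration.

```python
def truncated_sequences(sequence):
    """Build the test inputs for one action.

    Returns a list of (kept, test_sequence) pairs. The first pair keeps all
    posture vectors (the reference run); each following pair keeps one vector
    less and pads the end with null vectors, down to a single kept vector.
    """
    dim = len(sequence[0])
    total = len(sequence)
    tests = []
    for kept in range(total, 0, -1):
        padding = [[0] * dim for _ in range(total - kept)]  # "no input"
        tests.append((kept, sequence[:kept] + padding))
    return tests

# Toy action made of 9 posture vectors with 4 elements each.
action = [[k + 1] * 4 for k in range(9)]
tests = truncated_sequences(action)
```

For an action of 9 posture vectors this yields 9 test sequences, from the complete reference sequence down to one frame followed by 8 null vectors.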

The centres of activity generated by the A-SOM at each iteration were collected in tables, and colour coding was used to indicate the ability (or inability) of the A-SOM to predict the action continuation. Dark green indicates that the A-SOM predicted the right centre of activity; light green indicates that the A-SOM predicted a value close to the expected centre of activity; and red indicates that the A-SOM could not predict the right value, see Fig. 7. The ability to predict varies with the type of action. For actions like "sit down" and "punch", the A-SOM needed 8 images to predict the rest of the sequence, whereas for the "walk" action the A-SOM needed only 4 images to complete the sequence. In general the system needed between 4 and 9 inputs to internally simulate the rest of the actions. This is a reasonable result, since even humans cannot be expected to predict the intended action of another agent without a reasonable amount of initial information. For example, looking at the initial part of an action like "punch", we can hardly say what the person is going to do. It could be "punch" or "point"; we need more frames to determine exactly which action is performed. In the same way, looking at a person starting to walk, we cannot say in advance whether the person will walk, turn around or even kick, because the initial postures are all similar to one another.
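The colour coding can be expressed as a small comparison between the predicted and the reference centres of activity, as in the Python sketch below. The text does not say how close a "close" prediction has to be, so the tolerance of one grid step (measured as Chebyshev distance) is purely an assumption, as are the function names.

```python
def classify_prediction(predicted, reference, tolerance=1):
    """Colour-code one predicted centre of activity against its reference.

    Centres are (row, column) grid coordinates. 'dark green' means an exact
    match, 'light green' a centre within `tolerance` grid steps (an assumed
    threshold), and 'red' anything farther away.
    """
    if predicted == reference:
        return "dark green"
    distance = max(abs(predicted[0] - reference[0]),
                   abs(predicted[1] - reference[1]))
    return "light green" if distance <= tolerance else "red"

def classify_sequence(predicted_centres, reference_centres):
    """Colour-code every centre of a simulated sequence."""
    return [classify_prediction(p, r)
            for p, r in zip(predicted_centres, reference_centres)]

colours = classify_sequence([(3, 4), (3, 5), (7, 9)],
                            [(3, 4), (3, 4), (3, 4)])
```

A table such as the one in Fig. 7 would then correspond to one such list of colours per test sequence.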

The results obtained through this experiment allowed us to speculate about the ability of the A-SOM to generalize. Generalization is the network's ability to recognize inputs it has never seen before. Our idea is that if the A-SOM is able to recognize images as similar by generating close or equal centres of activity, then it will also be able to recognize an image it has never encountered before if this is similar to a known image. We checked whether similar images had the same centres of activity and whether similar centres of activity corresponded to similar images. The results show that the A-SOM generated very close or equal values for very similar images, see Fig. 8. Actions like "turn around", "walk" and "get up" contain some frames that are very similar to each other, and for such frames the A-SOM generates the same centres of activity. This ability is validated by selecting some centres of activity and verifying that they correspond to similar images. The "check watch", "get up", "point" and "kick" actions include in their sequences frames depicting a movement of the arm that can be attributed to all of them. For these frames the A-SOM elicits the same centre of activity, see Fig. 9. The results presented here support the belief that our system is also able to generalize.

In this paper, we proposed a new method for internally simulating behaviours of observed agents. The experiment presented here is part of a bigger project whose aim is to develop a cognitive system endowed with the ability to read others' intentions. The method is based on the A-SOM, a novel variant of the SOM, whose recognition and classification ability has already been tested in [14]. In our experiment, we connected the A-SOM to itself with time-delayed ancillary connections, and the system was trained and tested with a set of images depicting the part of the body performing the movement. The results presented here show that the A-SOM can receive some initial sensory input and internally simulate the rest of the action without any further input.

Moreover, we verified the ability of the A-SOM to recognize input never encountered before, with encouraging results. In fact, the A-SOM recognizes similar actions by eliciting close or identical centres of activity.

We are currently working on improving the system to increase its recognition and simulation abilities.

Acknowledgements

The authors gratefully acknowledge the support from the Linnaeus Centre Thinking in Time: Cognition, Communication, and Learning, financed by the Swedish Research Council, grant no. 349-2007-8695.

1. Awais, M., Henrich, D.: Human-robot collaboration by intention recognition using probabilistic state machines. In: Robotics in Alpe-Adria-Danube Region (RAAD), 2010 IEEE 19th International Workshop on Robotics. (2010) 75-80

2. Breazeal, C.: Designing sociable robots. The MIT Press (2004)

3. Chella, A., Dindo, H., Infantino, I.: A cognitive framework for imitation learning. Robotics and Autonomous Systems 54(5) (2006) 403-408

4. Chella, A., Dindo, H., Infantino, I.: Imitation learning and anchoring through conceptual spaces. Applied Artificial Intelligence 21(4-5) (2007) 343-359

5. Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5) (2009) 396-483

6. Hesslow, G.: Conscious thought as simulation of behaviour and perception. Trends in Cognitive Sciences 6 (2002) 242-247

7. Johansson, G.: Visual perception of biological motion and a model for its analysis. Perception & Psychophysics 14(2) (1973) 201-211

8. Giese, M.A., Poggio, T. Nat Rev Neurosci 4(3) (March 2003) 179-192

9. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell. 29(12) (2007) 2247-2253

10. Iosifidis, A., Tefas, A., Pitas, I.: View-invariant action recognition based on artificial neural networks. IEEE Trans. Neural Netw. Learning Syst. 23(3) (2012) 412-424

11. Gkalelis, N., Tefas, A., Pitas, I.: Combining fuzzy vector quantization with linear discriminant analysis for continuous human movement recognition. IEEE Transactions on Circuits Systems Video Technology 18(11) (2008) 1511-1521

12. Rizzolatti, G., Craighero, L.: The mirror-neuron system. Annual Review of Neuroscience 27 (2004) 169-192

13. Goldman, A.I.: Simulating minds: The philosophy, psychology, and neuroscience of mindreading. (2) (2006)

14. Buonamente, M., Dindo, H., Johnsson, M.: Recognizing actions with the associative self-organizing map. In: Proceedings of the XXIV International Conference on Information, Communication and Automation Technologies (ICAT 2013). (2013)

15. Johnsson, M., Balkenius, C., Hesslow, G.: Associative self-organizing map. In: Proceedings of IJCCI. (2009) 363-370

16. Kohonen, T.: Self-Organization and Associative Memory. Springer Verlag (1988)

17. Johnsson, M., Gil, D., Balkenius, C., Hesslow, G.: Supervised architectures for internal simulation of perceptions and actions. In: Proceedings of BICS. (2010)

18. Johnsson, M., Mendez, D.G., Hesslow, G., Balkenius, C.: Internal simulation in a bimodal system. In: Proceedings of SCAI. (2011) 173-182

19. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)

20. Turaga, P.K., Chellappa, R., Subrahmanian, V.S., Udrea, O.: Machine recognition of human activities: A survey. IEEE Trans. Circuits Syst. Video Techn. 18(11) (2008) 1473-1488

21. Balkenius, C., Moren, J., Johansson, B., Johnsson, M.: Ikaros: Building cognitive models for robots. Advanced Engineering Informatics 24(1) (2010) 40-48