<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Moving Object Segmentation using Visual Attention</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Emerson</forename><forename type="middle">J</forename><surname>Olaya</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Robotics and Advanced Manufacturing Group</orgName>
								<orgName type="institution">Cinvestav Saltillo Ramos Arizpe</orgName>
								<address>
									<postCode>25900</postCode>
									<settlement>Coah</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">L</forename><surname>Abril Torres-Méndez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Robotics and Advanced Manufacturing Group</orgName>
								<orgName type="institution">Cinvestav Saltillo Ramos Arizpe</orgName>
								<address>
									<postCode>25900</postCode>
									<settlement>Coah</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Moving Object Segmentation using Visual Attention</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">79EE38AC01E7EE6FDD422DA7FF05C030</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T13:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We have developed a visual attention algorithm that combines previously existing methods to solve the problem of segmenting moving objects in real time. Our approach allows us to select regions of interest over which we force the fixations of a vision mechanism. We create saliency maps from characteristics that stand out within the scene. The number of maps that can be extracted from an image is huge, so we use only some of them to avoid high latencies that could harm the performance of our system. Our approach is of special interest when the system has no specific object to look for. Thus, a scene is explored in a more natural way than by simply sweeping it out point by point in some order or by waiting patiently for an object of interest to appear. With this, we ensure that all relevant visual information in the scene is taken into account according to the priorities and objectives of the system.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Segmentation is one of the fundamental problems in image processing. For humans and most animals this task is relatively straightforward. For machine vision systems, however, segmentation can be a very complex task due to the mechanisms involved and the fact that different interpretations about what to segment may exist. Most image retrieval algorithms are based on global features extracted from static images <ref type="bibr" target="#b13">[14]</ref>. However, when a user is interested in retrieving a moving object from a set of images (i.e., video frames), the need to extract reliable local features, i.e., to segment the object from its background, turns out to be a difficult task. For the particular problem of segmenting moving objects in real time, also known as active segmentation <ref type="bibr" target="#b9">[10]</ref>, it is relevant to pay attention to the objects of interest in the scene and choose a fixation point within the area or region that covers them. Once a fixation point is chosen, the surrounding visual characteristics can be easily extracted and grouped according to their properties. Existing segmentation approaches <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b1">2]</ref> assume that a fixation point from which to start the segmentation is given. However, using a visual attention model can solve the problem of automatically selecting a fixation point. The fixation points selected should ideally be located on the object of interest and close to its center to facilitate segmentation.</p><p>Distinct science areas, such as psychology, physics, cognitive neurology and computer science, have studied visual attention for more than a century. Our approach is based on the field of cognitive neurology, where one of the main goals is to study eye movements to obtain information about human visual attention. 
However, as we are dealing with artificial vision systems, rather than focusing on the eyes (cameras), we need to focus first on deciding which visual characteristics may be more relevant within a dynamic scene and then direct the cameras to them. We explore the entire scene in a more natural way than by simply sweeping it out point by point in some order or by waiting patiently for an object of interest to appear. With this, we ensure that all relevant visual information in the scene is taken into account according to the priorities and objectives of the system. The outline of this paper is as follows. In Section 2, we describe related work on visual attention models. Section 3 briefly describes the artificial visual system we constructed. In Section 4 we give details of our visual attention approach. Section 5 describes the saliency map generated together with the experimental results. Finally, Section 6 presents the conclusions and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Different methods for visual attention exist in the literature. We propose a visual attention approach based on a combination of three existing methods. First, there is the approach used by von Helmholtz <ref type="bibr" target="#b16">[17]</ref>, who observed the relation between the eyes and an involuntary segmentation process of the visual field, in which the eyes are attracted to objects that have not already been tracked. In other words, we involuntarily select regions of space (the "where") based on visual characteristics that are, generally, outside the fovea, and then, under the approach of James <ref type="bibr" target="#b5">[6]</ref>, we voluntarily fix our attention on the selected region (the "what"), with the goal of identifying it, exploring it or simply not losing sight of it. These two approaches were reinforced by Nakayama and Mackeben <ref type="bibr" target="#b10">[11]</ref>, who gave evidence of this dichotomy in attention: first the quick and transitory aspect and then the slow and steady one. In the 1980s, Klein <ref type="bibr" target="#b6">[7]</ref> presented evidence of a new component of attention called inhibition of return, which consists of a type of selective attenuation of regions on a saliency map, preventing the focus of attention from being directed to regions already visited. This new component bears a certain similarity to Helmholtz's approach, in the sense that eyes being attracted to unknown or new regions is somewhat equivalent to their being repelled from already explored regions.</p><p>We have mentioned fixing the gaze on a point of interest in the scene, but the next question arises: How to choose the point of interest within an unknown scene? Humans and some animals solve this problem by using a very powerful biological tool known as visual attention. 
As established by Itti <ref type="bibr" target="#b4">[5]</ref>, the main purpose of visual attention is to direct the gaze to objects of interest; it is for this reason that visual attention and eye movements are closely related. Therefore, the ability to visually understand or interpret a scene goes together with the object recognition problem, which restricts the selection of the regions that must be attended. Based on this, it is established that people use a combination of two approaches: the bottom-up approach, in which the direction of gaze is determined by using relevant visual characteristics based directly on the visual information; and the top-down approach, where visual cues are used depending on the task to carry out (e.g., exploring, tracking or searching for objects of interest). The bottom-up approach is based on the hypothesis that certain visual characteristics (i.e., pre-attentive ones) inherently attract attention (e.g., color, contrast, intensity, edges, etc.). The top-down approach requires additional information to establish the preferences in the estimation of the visual attention map.</p><p>In addition to the characteristics mentioned above, there exist other similarities between our artificial visual system and the human visual system. One is the geometric configuration of the cameras, i.e., we have the coplanarity restriction (via software) between the optical axes of the cameras and a point of interest within the scene. Another characteristic of particular importance concerns the motor abilities of the human visual system <ref type="bibr" target="#b15">[16]</ref>. Our system was designed with kinematic abilities similar to those of the human eye, but with different dynamics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">The visual system design</head><p>The first aspect to consider when designing an active vision system is the type of configuration we want our system to have. We have implemented an active visual system based on the Fick architecture <ref type="bibr" target="#b3">[4]</ref>, similar to that presented in <ref type="bibr" target="#b12">[13]</ref>. Through a visual servo structure we can force the fixation point onto static or moving objects by using a vision system with a Fick architecture <ref type="bibr" target="#b3">[4]</ref> of 4 DoF for camera motion. After the fixation point is reached we estimate its 3D location using extra-retinal signals <ref type="bibr" target="#b17">[18]</ref>, which come from the rotational encoders that are used to calibrate and modify the binocular disparity <ref type="bibr" target="#b0">[1]</ref>. In Figure <ref type="figure" target="#fig_0">1</ref> we illustrate the final physical assembly of our active visual system. We highlight that the similarity of our artificial visual system to the human one, far from being anthropomorphic, lies rather in its qualities of being active, its geometry and its functionality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">The visual attention algorithm</head><p>We have considered the distinct methods and ideas mentioned in Section 2 to develop our visual attention algorithm. The algorithm allows us to select regions of interest over which we force the fixations of a vision mechanism. We then create saliency maps from characteristics that stand out within the scene. The number of maps that can be extracted from an image is huge, so we use only some of them to avoid high latencies that could harm the performance of our system.</p><p>In order to select an object of interest within the image scene we have developed a typical bottom-up visual attention model based on Itti's model <ref type="bibr" target="#b4">[5]</ref> (see Figure <ref type="figure" target="#fig_1">2</ref>), with the only difference that we use foveated images. We associate a degree of preference or weight to each of the extracted characteristics (saliency maps), such as movement, color, distance to the center of the fovea, etc. These weights can be modulated as a function of the task to be carried out. For example, if the task consists of following a red object, we give more weight to the visual cues of color and movement and inhibit the characteristics of contrast, depth, illumination, etc. Once a point of interest is selected in either of the two images obtained from our active stereo system, we need to solve the correspondence problem. With this information we can calculate the desired position to force the fixation point onto the point of interest. A Kalman filter <ref type="bibr" target="#b18">[19]</ref> is used for tracking; it predicts the future position of the tracked object based on previous observations, and the object model is used to recognize the object in the other image.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Visual abilities</head><p>Our active stereo system is capable of fovealizing points of interest or previously known objects within the scene by using a conventional CCD and an exponential retinotopic mapping (explained in Section 4.2) to resample the images. Working with foveated images significantly reduces the time needed to extract information from the images. However, their variable resolution makes eye movements necessary to drag the projection of the object to the fovea of each retina (camera). One of the key characteristics of active vision is its real-time requirement. Having cameras able to move allows searching for new strategies to decrease the response time. This is the main motivation for working with foveated images. One of the objectives of a system with foveated vision is to achieve a good combination between a big aperture angle of the camera and a significant decrease in the number of pixels, reaching the maximum resolution over the regions of interest (fovea) <ref type="bibr" target="#b11">[12]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Retinotopic mapping</head><p>At a biological level, the term retinotopic means that points in the scene that are projected near each other on the retina are mapped near each other in the striate cortex, i.e., the retinal topography is respected. Although it is presumed that in primates the retinotopic mapping is polar-logarithmic <ref type="bibr" target="#b14">[15]</ref>, we have defined our retinotopic mapping as an inverse Cartesian mapping. It is an inverse mapping because we make it exponential from the memory to the image space using the following equation:</p><formula xml:id="formula_0">X = x + sgn(x)⌊K^|x| − 1⌋, (<label>1</label></formula><formula xml:id="formula_1">)</formula><p>where ⌊w⌋ denotes the floor function of w, w ∈ R; X is a vector that represents the coordinates of a pixel (U,V) in the original image, x corresponds to the coordinates of the pixel in the compressed image (u, v), where {X, x ∈ Z^2}; and K &gt; 1, K ∈ R is a constant. In this way, a pixel (u, v) given in the compressed image takes the corresponding value of the pixel (U,V) obtained from Equation <ref type="formula" target="#formula_0">1</ref>, as follows:</p><formula xml:id="formula_2">[U,V] = [u + sgn(u)⌊K_1^|u| − 1⌋, v + sgn(v)⌊K_2^|v| − 1⌋].<label>(2)</label></formula><p>Equation 1 describes an infinite family of functions to resample the image. The exponential component establishes how the resolution diminishes according to the pixel's distance to the fovea. Therefore, we need to define K carefully so that the image is resampled including the borders and the aperture angle of the camera is preserved without going beyond the limits of the image.</p><p>A good way to define the constant K in a general form is:</p><formula xml:id="formula_3">K_x = (S_x/2 − s_x/2 + 1)^(1/(s_x/2)), K_y = (S_y/2 − s_y/2 + 1)^(1/(s_y/2)). 
(<label>3</label></formula><formula xml:id="formula_4">)</formula><p>where S_n and s_n (see Figure <ref type="figure" target="#fig_2">3</ref>) represent the sizes of the positive and negative axes of the original and compressed images when the origin is displaced to pixel (X 1 ,Y 1 ). Equation 3 is used to design a mapping that decreases the area of the images coming from the camera to a quarter of the total size, i.e., from 480 × 640 to 240 × 320, with the origin (fovea) in the center of the image. In this way, we obtain K_x = 161^(1/160) and K_y = 121^(1/120). The complexity of this algorithm can be inferred from Equation <ref type="formula" target="#formula_0">1</ref>, which selects the pixels from the buffer with which a new image is formed to be processed. Although this image is smaller, it preserves the same aperture angle as the original image. This allows us to have the effect shown in Figure <ref type="figure" target="#fig_2">3b</ref>, reducing the image while keeping the retinotopic property of the sensor. </p></div>
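As an illustration of the retinotopic mapping described above, the following sketch (not from the original paper; function names, the per-axis constants and the border clipping are our assumptions) resamples an image with the exponential mapping of Equations 1 and 3:

```python
import numpy as np

def foveation_constant(S, s):
    # Equation 3: K = (S/2 - s/2 + 1)^(1/(s/2)); e.g. S=640, s=320 gives 161^(1/160)
    return (S / 2 - s / 2 + 1) ** (1.0 / (s / 2))

def foveate(img, sx, sy):
    """Resample img (H x W) to sy x sx with the exponential retinotopic mapping.
    Origin (fovea) at the image centre; X = x + sgn(x) * floor(K^|x| - 1)."""
    Sy, Sx = img.shape[:2]
    Kx, Ky = foveation_constant(Sx, sx), foveation_constant(Sy, sy)
    out = np.zeros((sy, sx) + img.shape[2:], dtype=img.dtype)
    cy, cx = Sy // 2, Sx // 2          # fovea position in the source image
    for v in range(sy):
        y = v - sy // 2                # centred coordinate in compressed image
        Y = y + int(np.sign(y)) * int(np.floor(Ky ** abs(y) - 1))
        for u in range(sx):
            x = u - sx // 2
            X = x + int(np.sign(x)) * int(np.floor(Kx ** abs(x) - 1))
            out[v, u] = img[np.clip(cy + Y, 0, Sy - 1), np.clip(cx + X, 0, Sx - 1)]
    return out
```

Near the fovea the mapping is close to the identity (pixels keep their size), while toward the borders each compressed pixel skips exponentially more source pixels, which matches the effect shown in Figure 3b.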
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Segmentation using visual attention</head><p>The ability of the human brain to process images is far superior to any computational algorithm created. For example, it is excellent at extracting the color of an object even in the presence of external variables, such as illumination. When we segment objects, many different cues or characteristics are extracted.</p><p>Color is one of the most relevant and commonly used characteristics to segment objects. The color of an object not only depends on the chemical composition of its surface, but also on the environmental conditions: the illumination intensity, the number and color of the illumination sources, the location and shape of the object, the intrinsic and extrinsic parameters of the sensors, etc. Based on this knowledge we have developed a simple but fast algorithm that keeps the color constancy of objects. The procedure consists of converting the RGB format of each color pixel (of the fovealized image) to its quaternion representation, assuming that the reception of the luminous spectrum by the camera's charge-coupled device is linear under illumination changes. This representation helps in the segmentation because the direction of the color vector xi + yj + zk is less sensitive to illumination changes, thus facilitating segmentation by color thresholding. The quaternion representation is defined as:</p><formula xml:id="formula_5">P = w + xi + yj + zk,<label>(4)</label></formula><p>where w is the intensity of the pixel, and can be defined with the color components of the RGB triangle:</p><formula xml:id="formula_6">w = √(R^2 + G^2 + B^2)/√3, x = R/(R + G + B), y = G/(R + G + B), z = B/(R + G + B),<label>(5)</label></formula><p>where R, G, B correspond to the red, green and blue components, respectively. 
Therefore, we say that a pixel P u,v = w u,v + x u,v i + y u,v j + z u,v k, where u and v are the pixel coordinates within the image, will be of the color of interest if its direction vector v = (x, y, z) matches that of the searched color P = 0 + xi + yj + zk, so that adding the additive inverse of P cancels the direction component, i.e.:</p><formula xml:id="formula_7">(P u,v − P) = w u,v + 0i + 0j + 0k.<label>(6)</label></formula><p>The result of converting an image to its quaternion representation is a format not well understood by the computer as an image, but it can be visualized if we decompose it in two: the magnitude w and the direction (xi + yj + zk), as can be observed in Figure <ref type="figure" target="#fig_3">4</ref>. It can be noted that the quaternion representation makes the colors less sensitive to variations in illumination. Note that objects of the same color but under different illumination in the left image appear practically identical in the right image. Although this does not completely solve the problem, it greatly facilitates the search for the object of interest. In other words, what we do is classify colors by grouping collinear vectors into just one vector.</p></div>
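The per-pixel conversion of Equations 4 and 5 can be sketched as follows (a minimal illustration; the function name and the handling of black pixels are our assumptions):

```python
import math

def rgb_to_quaternion(R, G, B):
    """Map an RGB pixel to (w, x, y, z): w is the intensity (Eq. 5) and
    (x, y, z) the chromatic direction, which is largely invariant to a
    uniform scaling of the illumination."""
    s = R + G + B
    w = math.sqrt(R * R + G * G + B * B) / math.sqrt(3)
    if s == 0:                     # black pixel: no chromatic direction
        return w, 0.0, 0.0, 0.0
    return w, R / s, G / s, B / s
```

Scaling (R, G, B) by a constant factor scales only w and leaves (x, y, z) unchanged, which is why thresholding on the direction component is robust to illumination changes.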
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Obtaining the color histogram</head><p>By using Equation <ref type="formula" target="#formula_6">5</ref>, which implies that (x + y + z) = 1, we can project the colors onto one plane (the RGB triangle) and form histograms like the one shown in Figure <ref type="figure" target="#fig_4">5</ref>. To obtain these histograms, the first step is to locate each color in the plane; there exist infinitely many ways to do this. In Figure <ref type="figure" target="#fig_4">5</ref>, it can be observed that the red color is located at coordinates (255, 0), the blue at (0, 255) and the green at (383, 383). The easiest way is to put the blue at the origin, the red at (0, 255) and the green at (255, 128). This way a pixel's RGB components are located in the coordinate plane at (x, y) = (g, b + g/2). The next step is to define the desired resolution in the RGB triangle. The parameter we choose to restrict the resolution is the number of transitions (n) between two primary colors. The last step is to count how many pixels fall in each of these slots and make a graph.</p></div>
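The histogram construction above can be sketched as follows. The planar coordinates (g, b + g/2) follow the text; the bin-scaling and clamping details, and the function name, are our assumptions:

```python
from collections import Counter

def triangle_histogram(pixels, n=16):
    """2D colour histogram over the RGB chromaticity triangle. Each pixel's
    normalised components (r + g + b = 1) are placed at the planar point
    (g, b + g/2) and binned into n x n slots; n plays the role of the number
    of transitions between two primary colours."""
    hist = Counter()
    for R, G, B in pixels:
        s = R + G + B
        if s == 0:
            continue               # black pixels carry no chromaticity
        g, b = G / s, B / s
        u, v = g, b + g / 2.0      # planar coordinates (both in [0, 1])
        hist[(min(int(u * n), n - 1), min(int(v * n), n - 1))] += 1
    return hist
```

Counting pixels per slot in this plane gives the histogram whose peaks identify the dominant chromaticities in the region of interest.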
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3">Searching for the object of interest</head><p>The search for the object of interest is based on the local characteristics extracted by the visual attention algorithm. The algorithm first selects the point of interest within the scene and then extracts some characteristics of that area, such as its histogram, the contrast map and the rate of change (explained in the next section). We describe here the algorithm used to extract the geometric characteristics of an object. To start, we assume that the object has a characteristic or set of characteristics C, such as color, intensity, texture, etc. The algorithm begins by sweeping the images from top to bottom in search of C. The moment this characteristic is found at a pixel P u,v , the algorithm starts surrounding the pixel by searching all 8-connected neighboring pixels in clockwise direction in order to obtain its silhouette. In this way, we can quickly obtain the contour of a set of pixels with a common characteristic, without using derivative-based operators that do not guarantee a closed curve and require thin filters that are sensitive to the borders that the object may have inside. Once the silhouette of the object is obtained in vector form, we can extract useful information to achieve the recognition of an object model previously saved in memory, such as the area (moment of order zero), perimeter, shape (compactness or regularity factor), centroid, etc. If we look for an object that has been found in previous frames, it is not necessary to search the whole image. If we know the sampling rate of the camera and the maximum velocity of the motors, we can generate a search radius from the last position where the object was seen. In our case, we make the search in a spiral form, increasing the radius r by two pixels every 2π radians. The angular increment is given by π/(r − 1) + 0.01. 
Continuing with the example of Figure <ref type="figure" target="#fig_3">4</ref>, we now show in Figure <ref type="figure" target="#fig_5">6</ref> experimental results of the search algorithm described above. In this case the characteristic C to search for is the red color. It can be observed that the algorithm is robust enough to detect, within a predefined search zone, the objects that contain the searched characteristic.</p></div>
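The spiral search around the last known position can be sketched as follows. This is an illustrative reconstruction: the reading of the angular increment as π/(r − 1) + 0.01, the rounding and the duplicate filtering are our assumptions:

```python
import math

def spiral_offsets(max_radius, start=(0, 0)):
    """Generate pixel positions spiralling outward from the last known object
    position: the radius grows by two pixels every 2*pi radians and the
    angular step shrinks with the radius so neighbouring samples stay about
    one pixel apart."""
    cx, cy = start
    theta, r = 0.0, 1.0
    seen, out = set(), []
    while r <= max_radius:
        p = (cx + int(round(r * math.cos(theta))),
             cy + int(round(r * math.sin(theta))))
        if p not in seen:          # rounding can revisit a pixel; skip repeats
            seen.add(p)
            out.append(p)
        theta += math.pi / max(r - 1.0, 1.0) + 0.01
        r = 1.0 + 2.0 * theta / (2.0 * math.pi)   # +2 px per full turn
    return out
```

Scanning pixels in this order checks the most likely positions first, so a slowly moving object is usually re-found after only a few tests.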
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Saliency map</head><p>One of the most successful models in computational visual attention was proposed by Koch and Ullman <ref type="bibr" target="#b7">[8]</ref>, who based their model on a certain type of topographical map, totally compatible with the previously presented approaches. They associated a saliency measure, based on the extraction of visual cues, with each region of the image, forming with all of them a two-dimensional attention map known as a saliency map. To build a saliency map we need to integrate the information extracted from multiple visual cues, such as color, geometry, optical flow and intensity, known as pre-attentive cues <ref type="bibr" target="#b4">[5]</ref>, with which we calculate the maps of relevant characteristics, such as contrast in color and intensity, motion, color histograms, geometry, etc. All these characteristics are then weighted depending on the task to solve and combined to form the saliency map; the winner-take-all algorithm is then used to select the most relevant region in the map. In the following sections we describe the algorithm to obtain the saliency map.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Contrast and color maps</head><p>As users we can define a priority or degree of interest for the different colors based on the task. We also want our robot to have a certain degree of preference for some color or set of colors. Therefore, we set a weight different from zero for this map only when there is a search task.</p><p>Contrast is defined as the difference in color and intensity between a region and its surroundings. We can literally use this definition to generate the contrast map, which can be thought of as a 3D filter, either a high-pass filter or the magnitude of the gradient. The contrast map, Mc, is the result of the convolution in the space domain of the image I ∈ R^3 in its quaternion representation with the filter</p><formula xml:id="formula_8">F ∈ R^2: Mc = I * F, Mc ∈ R^2</formula><p>, where the value of each pixel of Mc is obtained by using the discrete convolution (Equation <ref type="formula" target="#formula_9">7</ref>), using a filter of (n × m × 4), as follows:</p><formula xml:id="formula_9">Mc(x, y) = ∑_{i=1}^{n} ∑_{j=1}^{m} ∑_{k=1}^{4} I(i, j, k) f(i, j).<label>(7)</label></formula><p>In Figure <ref type="figure">7</ref>, the colors that seem to highlight the most are marked in the RGB triangle with a white cross; these are, however, the ones farthest from the background color in the triangle (indicated with a circle surrounding the white cross). Figure <ref type="figure">8</ref> shows the contrast map implemented with a magnitude-of-the-gradient filter. In the three cases the circles that highlight the most are the ones indicated with the white cross, and the ones that highlight the least are those near the background color, marked with a black cross. What we can infer is that the saliency of the contrast is a function of the absolute difference between a pixel and its neighbors, either vectorial for colors or scalar for intensities. This is the reason we use a magnitude-of-the-gradient filter, besides it being faster than a high-pass filter.</p><p>Figure <ref type="figure">7</ref>: Color and grayscale images where different contrasts can be appreciated. The background color is represented in the RGB triangle with a cross and a circle, and in the grayscale bar with an arrow; the rest of the colors are marked with crosses or small lines, respectively. It can be observed in the color images that the white crosses highlight the colors that are far away from the background color.</p><p>Figure <ref type="figure">8</ref>: Color and intensity contrast maps of the images in Fig. <ref type="figure">7</ref> using a magnitude-of-the-gradient filter. It can be seen that the circles that highlight the most are the ones indicated with the white cross. The saliency of the contrast seems to be a function of the absolute difference between a pixel and its neighbors.</p><p>We now show the contrast maps of a scene full of high contrasts in Figure <ref type="figure" target="#fig_6">9</ref>, where the regions of maximum contrast are shown in red.</p></div>
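A magnitude-of-the-gradient contrast map, as described above, can be sketched as follows (an illustrative sketch, not the paper's implementation; summing the per-channel gradient magnitudes is our simplification of the multi-channel case):

```python
import numpy as np

def contrast_map(img):
    """Contrast map as the gradient magnitude of each channel, summed over
    channels; img is H x W x C (e.g. the 4-channel quaternion image)."""
    img = img.astype(float)
    gy, gx = np.gradient(img, axis=(0, 1))        # per-channel finite differences
    return np.sqrt(gx ** 2 + gy ** 2).sum(axis=2)  # collapse channels to one map
```

A gradient-magnitude filter touches each pixel once with a small neighbourhood, which is why it is cheaper than a full high-pass convolution while producing a comparable contrast map.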
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Motion map</head><p>The individual motion of pixels in an image is known as optical flow and is measured by finding corresponding pixels in two consecutive frames. On one hand, the measurement of optical flow can be highly complex due to the similarity between pixels, and with current hardware technology it is almost impossible to implement this algorithm in a real-time system. On the other hand, the main attentional map is the motion map, as it is related to the most important task of the system: tracking moving objects. An easy way to obtain a motion map without solving the correspondence problem is by using a motion filter of absolute rate <ref type="bibr" target="#b2">[3]</ref>. This filter consists of subtracting two images (I) taken at different instants of time (t, t − 1) and observing the regions for which the squared difference is maximized. The absolute rate of motion is given by M t = |I t − I t−1 |. The result can be seen in Figure <ref type="figure" target="#fig_7">10</ref>. M t can operate as a motion map; however, when convolving M t with a mean filter, we obtain a more precise reading of the region with the greatest motion. Obtaining this map can be complicated when the vision system is active, as the motion of the camera generates visual flow over the whole image; for this reason, this map is used exclusively at the end of each saccadic movement, when the motion of the cameras is practically null.</p></div>
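The absolute-rate motion map with the subsequent mean filter can be sketched as follows (a straightforward illustration; the kernel size and edge padding are our assumptions):

```python
import numpy as np

def motion_map(frame_t, frame_prev, k=5):
    """Absolute frame difference |I_t - I_{t-1}| followed by a k x k mean
    filter; meaningful only when the cameras are (nearly) still."""
    d = np.abs(frame_t.astype(float) - frame_prev.astype(float))
    kernel = np.ones((k, k)) / (k * k)
    H, W = d.shape
    pad = k // 2
    dp = np.pad(d, pad, mode="edge")
    out = np.zeros_like(d)
    for i in range(H):                 # direct (non-separable) convolution keeps
        for j in range(W):             # the sketch short; a box filter via
            out[i, j] = (dp[i:i + k, j:j + k] * kernel).sum()  # cumsums is faster
    return out
```

Averaging the raw difference smears isolated sensor noise while reinforcing coherent blobs of change, so the maximum of the filtered map sits on the region with the greatest motion.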
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Inhibition of return</head><p>Inhibition of return is one of the most important components of the visual attention process, as it retains in memory the regions that were already attended in order to encourage the collection of information from regions that have not been attended. This inhibitory effect does not depend on visual cues but on their spatial locations, for which we store the position of the fixation points and induce an artificial potential field built in the robot's workspace, so that the inhibition map is the projection of the sum of the potential fields onto the image. We define the potential field at a point as:</p><formula xml:id="formula_10">U(θ) = 1 − ∑_{i=1}^{n} e^(−θ_i^2/σ^2), (<label>8</label></formula><formula xml:id="formula_11">)</formula><p>where θ is the angle between vectors A and B, and σ is the constant used to restrict the dilation of the potential field so that it does not exceed a solid angle greater than that occupied by the fovea. In order to project the effect of the potential field onto an inhibition of return map (IoR), we use the coordinates (u, v) of vector B: IoR(u, v) = U(θ). </p></div>
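Equation 8 can be evaluated as in the following sketch (the negative sign of the exponent is our reading of the garbled original, consistent with a potential that decays away from attended directions; the function name and default σ are assumptions):

```python
import math

def inhibition(thetas, sigma=0.1):
    """U(theta) = 1 - sum_i exp(-theta_i^2 / sigma^2): near an already
    attended direction (theta_i ~ 0) the map drops toward 0 (inhibited),
    far from all attended directions it stays close to 1."""
    return 1.0 - sum(math.exp(-(t * t) / (sigma * sigma)) for t in thetas)
```

Multiplying (or adding) this field into the saliency map suppresses regions already visited, implementing the inhibition-of-return component of Klein's evidence mentioned in Section 2.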
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Map integration</head><p>Under this computational model, visual attention is seen as a cost function that varies over time. Each visual characteristic constitutes a variable that, depending on its manipulation, can make the attentional focus of the system fluctuate, giving the appearance of animal or human behavior when facing an unknown scene. The integration of all maps is the most difficult and challenging part of the whole design. In our system we assign the preferences (weights) to each map according to the priority of each of the tasks. We show in Figure <ref type="figure" target="#fig_8">11</ref> an experimental result of the saliency map obtained after integrating all the maps described above. In the sequence of images we show a scene where the object of interest is the red sphere. It can be seen that the sphere moves away from the fovea (located first at the center of the image) and therefore we used an attention map based on the color of the sphere to find it. Although there are other red objects in the scene, the histogram resolution is good enough to broadly distinguish the sphere from the other objects. Even though this map is enough for this case, there will be other cases in which very similar objects exist and the difference is not so well marked. Those cases can be solved using information about the motion of the cameras in order to see the regions of interest at high resolution, obtain fine details and facilitate segmentation.</p></div>
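The weighted integration of the characteristic maps and the winner-take-all selection can be sketched as follows (an illustrative sketch; the dictionary-based interface is our assumption, and a simple weighted sum stands in for the combination step):

```python
import numpy as np

def saliency(maps, weights):
    """Weighted combination of characteristic maps (same shape) into one
    saliency map; a map absent from `weights` contributes nothing, which is
    how task-irrelevant cues are inhibited."""
    s = np.zeros_like(next(iter(maps.values())), dtype=float)
    for name, m in maps.items():
        s += weights.get(name, 0.0) * m
    return s

def winner_take_all(s):
    """Return the (row, col) of the most salient location."""
    return np.unravel_index(np.argmax(s), s.shape)
```

Re-weighting at run time is what lets the same pipeline switch between tasks, e.g. raising the motion and color weights when following the red sphere.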
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions and Future Work</head><p>We have presented a visual attention algorithm that combines existing approaches for segmenting moving objects in real time. First, the proposed image compression technique (foveated vision) with multiple resolution allows us to process images four to seven times faster and accomplish the objective of scanning the complete field of view without having to use special cameras. Second, the use of a quaternion representation of the RGB space together with the color histograms allow us to identify objects in a robust manner before illumination changes; so, we can use the same algorithm during the day with ambient illumination and during the night with artificial light. One clear disadvantage, however, about</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The active visual system constructed for this research work.</figDesc><graphic coords="3,237.94,166.53,136.06,95.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A diagram of the proposed bottom-up visual attention model that uses foveated images.Adapted to our model from Itti's model<ref type="bibr" target="#b4">[5]</ref>. First, the virtual retina algorithm is applied to the input images. Second, from the foveated images we extract some characteristics (color, motion, etc.), obtain their map and associate a weight to each of them depending on the task to be carried. After that a characteristics combination step is performed followed by a filtering step that inhibits those already visited places on the image. Finally, a saliency map is constructed which is used to segment the moving object of interest according to the Winner-take-all (WTA) algorithm.</figDesc><graphic coords="4,135.89,72.01,340.15,299.18" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: (a) Original image, (b) the compressed image. Note in (b) how objects near the fovea keep their original size while far objects decrease in size and resolution.</figDesc><graphic coords="5,179.99,343.95,126.99,154.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Visualization of an image in its quaternion representation. Left image is the original image and to its right is the quaternion representation: the grayscale intensity w and the vectors of color directions. Note how the colors look more homogeneous, eliminating the shading produced by the illumination conditions and objects geometry.</figDesc><graphic coords="6,133.63,72.00,344.70,147.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Histogram represented on the RGB triangle of the image in Fig.4. The red color is located in coordinates (255, 0), the blue in (0, 255) and the green in (383, 383). By using Eq. 5, each color pixel on the image can be projected on the one plane (i.e. the RGB triangle).</figDesc><graphic coords="7,192.59,72.00,226.77,175.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Experimental results of the polar searching of objects. (a) shows the area of the pixels to search; (b) presents in black the pixels that has characteristic C; and (c) shows the objects found "of relevant size" for which their silhouette was extracted.</figDesc><graphic coords="7,145.41,530.77,104.31,137.63" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Contrast and color map of a scene full of high contrasts. Left image shows the input color image and the right image the contrast map. It can be observed that the regions of maximum contrast are in red.</figDesc><graphic coords="9,147.23,552.40,317.48,109.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 10 :</head><label>10</label><figDesc>Figure 10: Motion absolute rate of a scene. The left image is the original image, the middle image is the motion absolute rate and the right image is the normalized motion map, where the regions that change the most are those that attract more the attention.</figDesc><graphic coords="10,154.04,258.83,303.87,124.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 11 :</head><label>11</label><figDesc>Figure 11: Saliency map based on the red sphere. (a-b) are the original and foveated images, respectively; (c-d) quaternion representation of (b); (e) histogram map slightly rotated on the horizontal axis; (f) the same map inclined and rotated on the Z axis.</figDesc><graphic coords="11,373.10,198.70,131.52,98.37" type="bitmap" /></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>using color is the increase in the processing time. This forces us to select carefully the maps to be integrated and reduce its number to three. We hope in the near future to improve the latencies of the processing algorithms to include other maps, such as depth maps to direct the sight to areas where no measure of depth has been done.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Acknowledgments</head><p>The authors thank the National Council of Science and Technology (CONACyT) for funding this project.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Horizontal and vertical disparity, eye position, and stereoscopic slant perception</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">T</forename><surname>Backus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Banks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Crowell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Van Ee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Vision Research</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="page" from="1143" to="1170" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Active 3d scene segmentation and detection of unknown objects</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bjorkman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kragic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Robotics and Automation (ICRA)</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Active vision for sociable robots</title>
		<author>
			<persName><forename type="first">C</forename><surname>Breazeal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Edsinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fitzpatrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Scassellati</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. on Systems, Man, and Cybernetics</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">5</biblScope>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A new approach to visual servoing in robotics</title>
		<author>
			<persName><forename type="first">B</forename><surname>Espiau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chaumette</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rives</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Robotics and Automation</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">3</biblScope>
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Computational modelling of visual attention</title>
		<author>
			<persName><forename type="first">L</forename><surname>Itti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Koch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature Reviews Neuroscience</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>James</surname></persName>
		</author>
		<title level="m">The Principles of Psychology</title>
				<imprint>
			<publisher>Harvard University Press</publisher>
			<date type="published" when="1981">1981</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Inhibition of return</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Klein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Trends Cognitive Science</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="138" to="147" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Selecting one among the many: a simple network implementing shifts in selective visual attention</title>
		<author>
			<persName><forename type="first">C</forename><surname>Koch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ullman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Artificial Intelligence Lab Memo No</title>
				<meeting><address><addrLine>Cambridge</addrLine></address></meeting>
		<imprint>
			<publisher>MIT</publisher>
			<date type="published" when="1984">1984</date>
			<biblScope unit="volume">770</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Active segmentation with fixation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Aloimonos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Fah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Computer Vision (ICCV)</title>
				<meeting>the International Conference on Computer Vision (ICCV)</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Active segmentation for robotics</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Aloimonos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fermuller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="3133" to="3139" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Sustained and transient components of visual attention</title>
		<author>
			<persName><forename type="first">K</forename><surname>Nakayama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mackeben</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Experimental Psychology: Human Perception and Performance</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="453" to="471" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Binocular tracking using log polar mapping</title>
		<author>
			<persName><forename type="first">N</forename><surname>Oshiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Maru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nishikawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Miyazaki</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/RSJ Intl. Conference on Intelligent Robots and Systems</title>
				<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="791" to="798" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A head-eye system -analysis and design</title>
		<author>
			<persName><forename type="first">Eklundh</forename><surname>Pahlavan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVGIP: Image Understanding: Special issue on purposive, qualitative and active vision</title>
				<imprint>
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Photobook: Content-based manipulation of image databases</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pentland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">W</forename><surname>Picard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sclaroff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="233" to="254" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Spatial mapping in the primate sensory projection: Analytic structure and relevance to perception</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">L</forename><surname>Schwartz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biological Cybernetics</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="181" to="194" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The brainstem control of saccadic eye movements</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Sparks</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature Rev. Neuroscience</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Treatise on physicological optics</title>
		<author>
			<persName><forename type="first">H</forename><surname>Helmholtz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Optical Society of America</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<date type="published" when="1925">1925</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The use of size matching to demonstrate the effectiveness of accomodation and convergence as cues for distance</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Floor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Percept. Psychophys</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="423" to="428" />
			<date type="published" when="1971">1971</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">An introduction to the kalman filter</title>
		<author>
			<persName><forename type="first">G</forename><surname>Welch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bishop</surname></persName>
		</author>
		<idno>TR 95-041</idno>
		<imprint>
			<date type="published" when="1995">1995</date>
		</imprint>
		<respStmt>
			<orgName>University of North Carolina, Department of Computer Science</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Tech. Report</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
