<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Preliminary Study Towards a Fuzzy Model for Visual Attention</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anca</forename><surname>Ralescu</surname></persName>
							<email>anca.ralescu@uc.edu</email>
							<affiliation key="aff0">
								<orgName type="department">EECS Department</orgName>
								<orgName type="institution">University of Cincinnati</orgName>
								<address>
									<addrLine>ML 0030</addrLine>
									<postCode>45221</postCode>
									<settlement>Cincinnati</settlement>
									<region>OH</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Isabelle</forename><surname>Bloch</surname></persName>
							<email>isabelle.bloch@telecom-paristech.fr</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Institut Mines Telecom</orgName>
								<orgName type="department" key="dep2">Telecom Paristech</orgName>
								<orgName type="institution">CNRS LTCI</orgName>
								<address>
									<settlement>Paris</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Roberto</forename><surname>Cesar</surname></persName>
							<email>cesar@ime.usp.br</email>
							<affiliation key="aff2">
								<orgName type="department">IME</orgName>
								<orgName type="institution">University of Sao Paulo</orgName>
								<address>
									<settlement>Sao Paulo</settlement>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Preliminary Study Towards a Fuzzy Model for Visual Attention</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5AC8EB11917075898960C572C5A1F0A9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T22:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Attention, in particular visual attention, has been a subject of studies in various disciplines, including cognitive science, experimental psychology, and computer vision. In cognitive science and experimental psychology the objective is to develop theories that can explain the attention phenomenon of cognition. In computer vision, the objective is to inform image understanding systems by hypotheses on the human visual attention. There is, however, very little influence of studies across these two disciplines. In a departure from this state of affairs, this study seeks to develop an algorithmic approach to visual attention as part of an image understanding system, by starting with a theory of visual attention put forward in experimental psychology. In the process, it will become useful to revise some of the concepts of this theory, in particular by adopting fuzzy set based representations and the necessary calculus for them.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>As a subject of human cognition, attention has attracted great interest from the fields of cognitive science and experimental psychology.</p><p>Visual attention is a wide field, largely addressed in the literature and covering many different aspects. Some works related to the present paper are briefly reviewed here, without aiming at exhaustiveness. One approach relies on Gestalt theory, and Gestalt and computer vision models are compared by <ref type="bibr" target="#b4">(Desolneux, Moisan, and Morel 2003)</ref>. Two sets of experiments for Gestalt detection methods are carried out and compared to computationally predicted results. Object size and noise are the two parameters taken into account in these experiments. The authors indicate that the qualitative thresholds predicted by the proposed computational approach to gestalt detection fit human perception.</p><p>Another approach is purely computational and based on image information. An important review on visual attention modeling is presented by <ref type="bibr">(Borji and Itti 2013)</ref>. The important aspect of saliency-based attention is specifically addressed in this review. Nearly 65 models are reviewed and classified in a didactical taxonomy that helps clarify the field. Visual saliency refers to a bottom-up phenomenon whereby some scene regions are detected as more prominent than others due to some visual features. There are different biological and computational approaches to model such phenomena. For instance, the center-surround hypothesis (a common issue for the analysis of receptive fields in the retina) is a classical model for bottom-up saliency <ref type="bibr" target="#b7">(Gao, Mahadevan, and Vasconcelos 2008)</ref>. In such settings, Gao and co-authors <ref type="bibr" target="#b7">(Gao, Mahadevan, and Vasconcelos 2008)</ref> incorporate discriminant features and a decision-theoretic model for saliency characterization. 
Saliency detection is important in many different imaging and vision applications <ref type="bibr" target="#b15">(Yan et al. 2013;</ref><ref type="bibr" target="#b16">Yang et al. 2013)</ref>. For instance, in medical imaging, saliency maps are useful to guide model-based image segmentation <ref type="bibr" target="#b6">(Fouquier, Atif, and Bloch 2012)</ref>, thus merging top-down and bottom-up approaches.</p><p>The mechanism of attention has been studied intensively in the fields of psychology and cognitive science, (Kahneman 1973), <ref type="bibr" target="#b12">(Treisman and Gelade 1980)</ref>, <ref type="bibr" target="#b13">(Treisman 1988)</ref>, <ref type="bibr" target="#b14">(Treisman 2014)</ref>, <ref type="bibr" target="#b8">(Humphreys 2014)</ref>, <ref type="bibr" target="#b1">(Bundesen, Habekost, and Kyllingsbaek 2005)</ref>, <ref type="bibr" target="#b2">(Bundesen, Vangkilde, and Petersen 2014)</ref>. In this paper we focus on the theory of visual attention introduced in <ref type="bibr" target="#b3">(Bundesen 1990)</ref>, where visual recognition and attentional selection are considered as the task of perceptual categorization, basically deciding to which category an object or element of the visual field belongs.</p><p>Following the notation of <ref type="bibr" target="#b3">(Bundesen 1990)</ref>, throughout this paper x is an input item, e.g. an image or image region, or more generally an item to be categorized or recognized. The collection of all items x is denoted by S. A category is denoted by i and the collection of all categories is denoted by R. A category can stand for an ontological category (e.g., an object, or a scene), or for subsets in the range of a particular attribute (e.g., red for the attribute color). Regardless of the situation, the conceptual treatment of categories and/or items is the same. E(x, i) denotes the event/statement "x is in category i". 
When viewed as an event, one can talk about its probability; when viewed as a statement, one can talk about its truth or its possibility.</p><p>From this point on, the paper is organized as follows: Section 2 contains a brief review of TVA concepts and its two mechanisms, filtering and pigeonholing. Section 3 presents the motivation for the introduction of fuzzy sets and the fuzzy mechanisms of filtering and pigeonholing. Conclusions and future research are in Section 4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">The Theory of Visual Attention (TVA)</head><p>In this section, we review and comment on the main concepts and modeling steps of the Theory of Visual Attention (TVA) by <ref type="bibr" target="#b3">(Bundesen 1990</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Attentional Weight</head><p>One of the main concepts introduced in TVA is that of the attentional weight, defined as follows:</p><formula xml:id="formula_0">w(x) = Σ_{i∈R} η(x, i)π(i)<label>(1)</label></formula><p>What are the possible interpretations of the quantities in Equation (1)? If η(x, i) is interpreted as the salience of x for category i, then w(x) could be interpreted as the salience of x across the family of categories R, averaged with respect to category pertinence. From the point of view of computer vision, η(x, i) is simply the output of an operator designed to provide information for category i.</p><p>Note that the pertinence of a category is (or must be) considered with respect to a task, which could be a categorization at a higher semantic/ontological level. Adopting this point of view, the product η(x, i)π(i) can then be interpreted as the pertinence of item x to the task with respect to which category i had pertinence π(i). More precisely, one can define π(x, T_i) = η(x, i)π(i)</p><p>as the pertinence of x to T_i, where T_i is the task to which category i has pertinence value π(i). For example, suppose that i is the color category "red" of the attribute color. Furthermore, suppose that the color category "red" has pertinence π(red) to the task of identifying visually an object such as, for instance, the "flag of some country". Now let x be a region in an image, and η(x, red) the output of evaluating it with respect to the color "red". Then π(x, T_red) = η(x, red)π(red) is the pertinence of x to the task T_red.</p><p>Taking max/min with respect to x yields:</p><formula xml:id="formula_1">x_max,red = arg max_{x∈S} π(x, T_red),</formula><p>the region in the input which is most pertinent to T_red, and</p><formula xml:id="formula_2">x_min,red = arg min_{x∈S} π(x, T_red),</formula><p>the region in the input which is least pertinent to T_red. 
Similarly, taking max/min over categories yields</p><formula xml:id="formula_3">i_max = arg max_{i∈R} π(i); i_min = arg min_{i∈R, π(i)&gt;0} π(i)</formula><p>the most/least pertinent categories, respectively. The condition π(i) &gt; 0 ensures that categories which are not pertinent at all, i.e. with π(i) = 0, are not taken into account, so the trivial case π(i_min) = 0 is never obtained. Then, for fixed x, η(x, i_max) and η(x, i_min) are the strengths of evidence for x to be in the highest/lowest pertinence category, and π(x, T_max) = η(x, i_max)π(i_max) and π(x, T_min) = η(x, i_min)π(i_min) are the importances of x to the tasks corresponding to the categories of highest/lowest pertinence value. Versions of the following "flag example" will be used in this paper to illustrate various points.</p><p>Example 1 Let T stand for the task to determine if an object identified in an image corresponds to a "flag of some country". The decision is to be based on color information only. Assume several color categories and their respective pertinences as shown in Table <ref type="table">1</ref>.</p><p>Table <ref type="table">1</ref>: Color categories and their respective pertinence values to the task "Identify flag of a country".</p><p>Color category i and category pertinence π(i):</p><formula xml:id="formula_4">red 0.8
yellow 0.3
black 0.1
green 0.2
(max π(i), i_max) = (0.8, red)
(min π(i), i_min) = (0.1, black)</formula><p>In this example π(x, T_red) = 0.8η(x, red); π(x, T_black) = 0.1η(x, black).</p><p>In Equation (<ref type="formula" target="#formula_0">1</ref>) only those categories i with π(i) &gt; 0 contribute to w(x). 
This means that categories which are not pertinent (i.e., π(i) = 0) are never considered for x, even when η(x, i) is very large.</p><p>To summarize, with the interpretation of η(x, i)π(i) as described above, the attentional weight w(x) defined by Equation (1) is the cumulative pertinence of x to a task T, obtained from the strength of the sensory evidence given by x to all categories, in proportion to their pertinence to the task T.</p></div>
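As an illustration (not part of the original model's specification), the computation of the attentional weight in Equation (1) on the flag example can be sketched in Python; the η values below are hypothetical sensory-evidence outputs, not values from the paper:

```python
# Sketch of Equation (1): w(x) = sum over i of eta(x, i) * pi(i),
# using the pertinence values of the flag example (Table 1).

pertinence = {"red": 0.8, "yellow": 0.3, "black": 0.1, "green": 0.2}

def attentional_weight(eta_x, pertinence):
    """w(x): pertinence-weighted sum of sensory evidence over categories.
    A category with pi(i) = 0 contributes nothing, however large eta(x, i) is."""
    return sum(eta_x.get(i, 0.0) * pi for i, pi in pertinence.items())

# Hypothetical sensory evidence for one image region x:
eta_x = {"red": 0.9, "yellow": 0.1, "black": 0.0, "green": 0.05}
w = attentional_weight(eta_x, pertinence)

# Pertinence of x to the task tied to the most pertinent category (Section 2.1):
pi_x_T_red = eta_x["red"] * pertinence["red"]
```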
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Hazard Function</head><p>In <ref type="bibr" target="#b3">(Bundesen 1990</ref>) the notion of a hazard function ν(x, i) is introduced as ν(x, i) = Prob(E(x, i)), that is, the probability that item x is in category i (e.g., image region x is red). It is assumed (see the 2nd assumption in <ref type="bibr" target="#b3">(Bundesen 1990</ref>)) that ν is computed as:</p><formula xml:id="formula_5">ν(x, i) = η(x, i)β(i)w(x)<label>(2)</label></formula><p>where η(x, i) and w(x) are as described above<ref type="foot" target="#foot_0">1</ref>, and β(i) is introduced to indicate a bias for category i. Since ν is interpreted as a probability, ν(x, i) ∈ [0, 1], which is ensured when η(x, i), β(i), w(x) ∈ [0, 1], without additional constraints on these values. Moreover, when R is an exhaustive set of exclusive (non-overlapping) categories, then ν should be normalized so that Σ_{i∈R} ν(x, i) = 1, in order to really satisfy its interpretation from <ref type="bibr" target="#b3">(Bundesen 1990</ref>) as a probability. More recently, in (Bundesen, Vangkilde, and Petersen 2014), β(i) is decomposed as</p><formula xml:id="formula_6">β(i) = Ap(i)u(i)<label>(3)</label></formula><p>where A ∈ [0, 1] is the level of alertness, and p(i) and u(i) are, respectively, the prior probability and the utility of category i. One can imagine that A also varies with the category, in which case A in Equation (<ref type="formula" target="#formula_6">3</ref>) is replaced by an A_i. This is justified by the fact that one may be more alert to one category than to others. In an image processing system, A, or A_i, could be tied to the performance of the image processing operators used. The components p(i), u(i) of β(i), and hence β(i) itself, must also be tied to a (higher level) task T. While p(i) may be obtained from past data and experiments on the task T, u(i) seems to be purely subjective, and to a large extent its role seems to overlap with that of π(i). 
Plugging w(x) and β(i) into (2) results in</p><formula xml:id="formula_7">ν(x, i) = Aη(x, i)p(i)u(i) Σ_{j∈R} η(x, j)π(j) = Ap(i)u(i)[η(x, i)²π(i) + η(x, i) Σ_{j≠i} η(x, j)π(j)]<label>(4)</label></formula><p>which suggests that the most important role in computing ν(x, i) is played by the sensory evidence. In particular, ν's largest value is obtained when</p><formula xml:id="formula_8">A = p(i) = u(i) = 1</formula><p>(i.e. under maximum alertness, maximum prior probability, and maximum utility), and in that case ν(x, i) is a function only of the sensory evidence. Stated differently, this means that A, p(i) and u(i) can only decrease the value of ν(x, i). However, they may provide a mechanism to account for different types of subjective information, and for ranking the values of ν(x, i) when they enter its definition as shown in Equations (<ref type="formula" target="#formula_5">2</ref>)-(<ref type="formula" target="#formula_7">4</ref>). The justification in (Bundesen, Vangkilde, and Petersen 2014) of Equation (<ref type="formula" target="#formula_6">3</ref>) is based on the fact that when any one of A, p(i), or u(i) is null, then β(i) = 0. However, the same result holds when these quantities enter the definition of β not through a product, but through other operations, such as the min or, more generally, t-norms.</p><p>The fact that the value of ν(x, i) decreases when Ap(i)u(i) &lt; 1 (i.e. at least one of these three values is less than 1, u(i) for instance) can be interpreted as follows: x will less probably be categorized in i if, for instance, the utility for i is low, which means that we do not really care for this category. This also goes with the interpretation as a rate of encoding information in memory, according to (Bundesen 1990), even without considering time information.</p><p>The two mechanisms for visual attention proposed in <ref type="bibr" target="#b3">(Bundesen 1990</ref>), filtering and pigeonholing, are described next.</p></div>
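A minimal numerical sketch of Equations (2) and (3), with hypothetical values (none come from the paper); it illustrates the observation that A, p(i), and u(i) can only decrease ν(x, i):

```python
# Sketch of the hazard function: nu(x, i) = eta(x, i) * beta(i) * w(x),
# with the decomposed bias beta(i) = A * p(i) * u(i) of Equation (3).

def beta(A, p_i, u_i):
    """Bias for category i: alertness times prior probability times utility."""
    return A * p_i * u_i

def nu(eta_xi, beta_i, w_x):
    """Equation (2): hazard value for item x and category i."""
    return eta_xi * beta_i * w_x

# Under maximum alertness, prior, and utility, beta(i) = 1 and nu depends
# only on the sensory evidence and the attentional weight:
b_max = beta(1.0, 1.0, 1.0)
v_max = nu(0.9, b_max, 0.8)

# Any A, p(i), u(i) below 1 can only decrease nu(x, i):
v_low = nu(0.9, beta(0.7, 0.5, 0.4), 0.8)
```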
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Filtering</head><p>Filtering <ref type="bibr" target="#b3">(Bundesen 1990</ref>) refers to the mechanism of selecting an item x ∈ S (given a higher level task) for a target category i. This mechanism seeks to (F1) increase ν(x, i) for some category i, while (F2) not changing the conditional probability of E(x, i) given that x is categorized.</p><p>Filtering can be achieved by increasing w(x) as follows. For a category j ∈ R assume π′(j) = aπ(j), where a &gt; 1. Then w(x) of Equation (1) becomes w′(x) = Σ_{i∈R, i≠j} η(x, i)π(i) + η(x, j)π′(j) = Σ_{i∈R, i≠j} η(x, i)π(i) + aη(x, j)π(j) &gt; w(x). Therefore, ν(x, i) becomes ν′(x, i) = η(x, i)β(i)w′(x) &gt; ν(x, i), which satisfies condition (F1) above. Computing now P(x is i | x is categorized) yields:</p><formula xml:id="formula_9">P(x is i | x is categorized) = ν(x, i) / Σ_{k∈R} ν(x, k) = η(x, i)β(i)w(x) / (w(x) Σ_{k∈R} η(x, k)β(k)) = η(x, i)β(i) / Σ_{k∈R} η(x, k)β(k)<label>(5)</label></formula><p>which does not depend on w, hence satisfies condition (F2). In Equation (<ref type="formula">5</ref>) the numerator is ν(x, i) since {x is i} ⊂ {x is categorized} and therefore P(x is i, x is categorized) = P(x is i), while the denominator uses an assumption of non-overlapping categories to write P(x is categorized) as Σ_{k∈R} ν(x, k). Dropping the constraint of non-overlapping categories is discussed later in this study.</p></div>
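The filtering argument can be checked numerically. The following sketch, with hypothetical η, β, and π values over two categories, scales one pertinence by a = 2 and verifies both conditions:

```python
# Numerical check of filtering (F1, F2): scaling one pertinence pi(j) by
# a > 1 increases w(x), hence nu(x, i), but leaves
# P(x is i | x is categorized) unchanged since w(x) cancels in the ratio.

def w(eta_x, pi):
    """Equation (1): attentional weight."""
    return sum(eta_x[i] * pi[i] for i in pi)

def nu(eta_x, beta, pi, i):
    """Equation (2): hazard value."""
    return eta_x[i] * beta[i] * w(eta_x, pi)

def conditional(eta_x, beta, pi, i):
    """P(x is i | x is categorized), Equation (5)."""
    total = sum(nu(eta_x, beta, pi, k) for k in pi)
    return nu(eta_x, beta, pi, i) / total

eta_x = {"red": 0.9, "yellow": 0.2}   # hypothetical sensory evidence
beta = {"red": 0.6, "yellow": 0.5}    # hypothetical biases
pi = {"red": 0.8, "yellow": 0.3}      # hypothetical pertinences
pi_scaled = {"red": 0.8, "yellow": 0.3 * 2.0}   # a = 2 applied to "yellow"

f1 = nu(eta_x, beta, pi_scaled, "red") > nu(eta_x, beta, pi, "red")
f2 = abs(conditional(eta_x, beta, pi_scaled, "red")
         - conditional(eta_x, beta, pi, "red")) < 1e-12
```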
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Pigeonholing</head><p>For a fixed item x ∈ S, pigeonholing <ref type="bibr" target="#b3">(Bundesen 1990</ref>) refers to the mechanism of selecting a category i ∈ R (given a higher level task), across a set of items S. It seeks to:</p><p>(P1) increase Σ_{x∈S} ν(x, i) for a category i pertinent to the task, such that (P2) for all j ∈ R, j ≠ i, Σ_{x∈S} ν(x, j) does not change.</p><p>Pigeonholing can be done by increasing β(i) for some i ∈ R as follows: For category i ∈ R, let</p><formula xml:id="formula_10">β′(i) = aβ(i), with a &gt; 1. Then ν′(x, i) = η(x, i)β′(i)w(x) = aη(x, i)β(i)w(x) &gt; η(x, i)β(i)w(x) = ν(x, i).</formula><p>Summing over x ∈ S yields</p><formula xml:id="formula_11">P′(i is selected) = Σ_{x∈S} η(x, i)β′(i)w(x) &gt; P(i is selected),<label>(6)</label></formula><p>which achieves (P1). At the same time, it is clear that for any other category j ≠ i, P(j is selected) does not change, and hence (P2) is satisfied too.</p><p>Equation (<ref type="formula" target="#formula_11">6</ref>) uses the assumption that the items x are non-overlapping, for example that they form a partition of the image. However, this partition need not be crisp, i.e. it may allow overlapping x's, for example when these are stated in qualitative terms. In such cases, Equation (<ref type="formula" target="#formula_11">6</ref>) does not hold. Dropping the constraint of non-overlapping items, discussed later, leads to a different interpretation of ν(x, i).</p></div>
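The pigeonholing argument can be checked numerically in the same way; all values below are hypothetical:

```python
# Numerical check of pigeonholing (P1, P2): scaling beta(i) for one
# category by a > 1 increases sum over x of nu(x, i), and leaves the sums
# for every other category j != i unchanged.

def selection_strengths(eta, beta, weights):
    """For each category i: sum over items x of eta(x, i) * beta(i) * w(x)."""
    return {i: sum(e * beta[i] * w for e, w in zip(eta[i], weights))
            for i in beta}

eta = {"red": [0.9, 0.2, 0.4],        # eta(x, i) for three items x
       "yellow": [0.1, 0.8, 0.3]}
weights = [0.7, 0.5, 0.9]             # w(x) for the same three items
beta = {"red": 0.6, "yellow": 0.5}
beta_scaled = {"red": 2.0 * 0.6, "yellow": 0.5}   # a = 2 applied to "red" only

before = selection_strengths(eta, beta, weights)
after = selection_strengths(eta, beta_scaled, weights)

p1 = after["red"] > before["red"]          # (P1): strength of "red" increases
p2 = after["yellow"] == before["yellow"]   # (P2): "yellow" is untouched
```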
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Fuzzy Mechanisms for Visual Attention</head><p>We consider in this section the situations where the values of the attentional weight and/or category pertinence are not exact. In such situations these values may be represented as fuzzy sets, and therefore the computation of the categorization of an item must resort to calculus with fuzzy sets. First, let us see why indeed such situations may arise.</p><p>Recall that in its original definition, for a given input x and category i, the strength of sensory evidence for E(x, i) is η(x, i) ∈ [0, 1]. Assuming that η(x, i) is the output of an operator/test for category i on item x, this output may be inexact because of the inexact nature of the category i. For example, if the category i = red of the attribute color, then for a given input pixel value x this category holds "more or less" and it may not be useful to commit to an exact 0/1 value.</p><p>Likewise, in its original definition, the pertinence π(i) of a category conveys its importance. Obviously, given a collection of visual categories and a task, the categories may be distinguished by their pertinence values. Moreover, several categories may have the same maximum importance for the given task. As an example, consider the pertinence of color categories for the detection of an object which is known to have one of two possible color categories, white or yellow, from the collection of all possible color categories. In this case, it is useful to be able to encode</p><formula xml:id="formula_12">π(white) = π(yellow) = 1,</formula><p>which is possible when π is considered as a possibility distribution on the color categories, regardless of the number of color categories allowed. By contrast, using a probability based approach, the cardinality of R, the collection of categories, restricts the values assigned to these equally possible categories to at most 0.5. 
That is, π(white) = π(yellow) ≤ 0.5 with equality when R = {yellow, white}.</p></div>
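The contrast can be made concrete with the following sketch; the category lists and values are illustrative:

```python
# A possibility distribution may assign full pertinence 1 to both "white"
# and "yellow", while a probability distribution over the same categories
# must sum to 1, capping each of two equally probable candidates at 0.5.

colors = ["white", "yellow", "red", "green"]

possibility = {c: 1.0 if c in ("white", "yellow") else 0.0 for c in colors}

# Equally probable candidates under a probability model:
probability = {c: 0.5 if c in ("white", "yellow") else 0.0 for c in colors}
```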
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">A new definition for w(x)</head><p>The departure point for the new definition of w(x) is the interpretation of a special case of Equation (<ref type="formula" target="#formula_0">1</ref>). Let R_a = {i ∈ R | π(i) = a} and consider the special case R = R_0 ∪ R_1, that is, all categories in R are either "fully" pertinent, π(i) = 1 (i ∈ R_1), or not pertinent, π(i) = 0 (i ∈ R_0). Then (1) becomes</p><formula xml:id="formula_13">w(x) = Σ_{i∈R_1} η(x, i)</formula><p>Next let η_max = max_{i∈R_1} η(x, i), and recall that η(x, i) ≤ 1. Then</p><formula xml:id="formula_14">w(x) ≤ Σ_{i:π(i)=1} η_max = η_max Σ_{i∈R_1} 1 = η_max |R_1| ≤ |R_1|,</formula><p>where |R_1| denotes the cardinality of the set R_1. That is, w(x) is bounded by the number of categories i with pertinence π(i) = 1. If η(x, i) = 1 for all i ∈ R_1, then w(x) is exactly the number of such categories. This meaning of w(x) is very natural and appealing. Indeed, one would expect the item x to count to the extent that it supports more categories. To generalize this notion, define, for fixed x ∈ S and fixed task T, µ_(x,T)(i) = η(x, i)π_T(i), the degree to which category i, pertinent to task T, is supported by the (data) item x, as shown by the strength of sensory evidence η(x, i). Therefore, µ_(x,T) : R → [0, 1] is the membership function of a fuzzy set on the set of categories.<ref type="foot" target="#foot_1">2</ref> The weight of item x is now defined as the cardinality of this fuzzy set. That is,</p><formula xml:id="formula_15">w(x) = Card {(i, µ_x(i)) | i ∈ R}<label>(7)</label></formula><p>Several formulas for the cardinality of a fuzzy set have been put forward. Here, for illustration purposes, the definition from <ref type="bibr" target="#b10">(Ralescu 1986</ref>) is used to obtain</p><formula xml:id="formula_16">Card({µ_x(i) | i ∈ R})(k) = µ_x,(k) ∧ (1 − µ_x,(k+1))<label>(8)</label></formula><p>where µ_x,(k) denotes the kth largest value of µ_x(•), and µ_x,(|R|+1) = 0. 
Thus, the cardinality defined in Equation (<ref type="formula">7</ref>) is a fuzzy set on {0, ..., |R|}. For an exact value of w(x), the 0.5-level set of w(x) (which is an interval), or its classic cardinality, can be used <ref type="bibr" target="#b11">(Ralescu 1995)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">A new definition for β(i)</head><p>Following the discussion in Section 2.2, define</p><formula xml:id="formula_17">β(i) = min{A, p(i), u(i)}<label>(9)</label></formula><p>As in the case of β defined in (3), β(i) = 0 whenever A = 0, or p(i) = 0, or u(i) = 0, and the discussion of <ref type="bibr" target="#b3">(Bundesen 1990</ref>) holds: that is, category i biases the selection to the extent that the system is alert, and category i is possible and useful. Alternatively, (9) means that the bias for the selection of i cannot be greater than the system alertness, the possibility of i, or its utility. Furthermore, replacing the product by the min also eliminates the possibility of values for β smaller than each one of A, p(i), and u(i), which is the well-known drowning effect of multiplying positive values smaller than 1. More importantly, it should be mentioned that the min can handle ordinal or qualitative values, without needing to specify precise numbers. Specifying such precise values might be difficult when subjective assessments are made. By contrast, in the case of such assessments, ordinal or qualitative values are usually easily produced.</p><p>As already mentioned, in the fuzzy set framework, the product and the min are but two particular cases of a t-norm (conjunction operator). A, p(i), and u(i) are interpreted, respectively, as the degree of alertness, the possibility (rather than probability) of i being selected, and the utility of category i, and the bias for i is defined as the conjunction of these. This interpretation makes (9) meaningful beyond a mere computational artifice. 
Another choice for defining β is to select a more general aggregation operator H : [0, 1] × [0, 1] × [0, 1] → [0, 1], which would allow more than one of A, p(i), u(i) to contribute towards β.</p></div>
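The new definitions of w(x) and β(i) can be sketched as follows. The `fuzzy_cardinality` function implements Equation (8) after (Ralescu 1986), with the padding conventions µ_(0) = 1 and µ_(|R|+1) = 0 being our reading of the definition; the membership values are hypothetical:

```python
def fuzzy_cardinality(memberships):
    """Fuzzy cardinality of Equation (8): with memberships sorted so that
    mu_(1) >= ... >= mu_(n), the degree that the cardinality equals k is
    min(mu_(k), 1 - mu_(k+1)), giving a fuzzy set on {0, ..., n}."""
    mu = sorted(memberships, reverse=True)
    mu = [1.0] + mu + [0.0]          # conventions: mu_(0) = 1, mu_(n+1) = 0
    return [min(mu[k], 1.0 - mu[k + 1]) for k in range(len(memberships) + 1)]

# mu_x(i) = eta(x, i) * pi(i) for four categories (hypothetical values):
mu_x = [0.9, 0.6, 0.2, 0.0]
card = fuzzy_cardinality(mu_x)       # the new w(x) of Equation (7)

def beta_min(A, p_i, u_i):
    """The min-based bias of Equation (9): null whenever any component is
    null, and never greater than alertness, possibility, or utility."""
    return min(A, p_i, u_i)
```

For these memberships the most possible cardinality is 2, matching the intuition that x strongly supports two categories.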
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">A new definition for ν(x, i)</head><p>With the new definitions of w(x) and β(i), the meaning of ν(x, i) also changes from a probability to a possibility, more precisely Possibility(x is i):</p><formula xml:id="formula_18">Possibility(x is i) = H(η(x, i), β(i), w(x))<label>(10)</label></formula><p>where H is again an aggregation operator, and hence the definition of ν(x, i) from <ref type="bibr" target="#b3">(Bundesen 1990</ref>) is a particular case, when H is the product.</p><p>For defining H, one may rely on the huge literature on information fusion, for which fuzzy set theory provides a number of useful operators (see e.g. <ref type="bibr" target="#b5">(Dubois and Prade 1985;</ref><ref type="bibr" target="#b14">Yager 1991;</ref><ref type="bibr" target="#b0">Bloch 1996)</ref> for reviews on fuzzy fusion operators). The large choice offered by these operators allows modeling different combination behaviors (conjunctive, disjunctive, compromise, etc.), with different degrees (e.g. the min is a less severe conjunction than the product). Operators can also behave differently depending on whether the values to be combined are small, large, of the same order of magnitude, or have different priorities. The operator H could also be set differently for the three values. For instance, η and w, which depend on x, could be combined using an operator H_1, and the result combined with β, which depends on i only, using another operator H_2.</p></div>
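Equation (10) with two choices of H, the product (which recovers the original ν of Equation (2)) and the min, can be sketched as follows; the numeric inputs are illustrative:

```python
import math

def possibility(eta_xi, beta_i, w_x, H):
    """Equation (10): nu as a possibility, for an aggregation operator H
    mapping a list of values in [0, 1] to a value in [0, 1]."""
    return H([eta_xi, beta_i, w_x])

v_prod = possibility(0.9, 0.6, 0.8, math.prod)  # product: the original form
v_min = possibility(0.9, 0.6, 0.8, min)         # min: a less severe conjunction
```

On [0, 1] the min always dominates the product, which is one concrete sense in which it is the less severe of the two conjunctions.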
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusions and Future Work</head><p>This paper discussed an attentional model developed in the fields of psychology and cognitive science and set in a probabilistic framework. The basic concepts of this model were discussed and an alternative, fuzzy set based approach was suggested. In the fuzzy set framework, modeling would be easier and more natural (for instance, replacing numbers by ordinal or qualitative values), and it would allow for more flexible ways of combining the different terms. This discussion paves the way for a new attentional model, whose complete development is left for future work.</p></div>			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Note that the expression of <ref type="bibr" target="#b3">(Bundesen 1990</ref>) involves a normalized version of w, i.e. w(x)/Σ_{x∈S} w(x). Here we implicitly assume that w is normalized, in order to simplify the equations.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">In the following, assuming only one task, T , for ease of notation, the subscript T will be dropped, to write µx(i).</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Acknowledgments</head><p>Anca Ralescu's contribution was partially supported by a visit to Telecom ParisTech.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Information Combination Operators for Data Fusion: A Comparative Review with Classification</title>
		<author>
			<persName><forename type="first">I</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><surname>Borji</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="185" to="207" />
			<date type="published" when="1996">1996. 2013</date>
		</imprint>
	</monogr>
	<note>IEEE Transactions on Systems, Man, and Cybernetics</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A neural theory of visual attention: bridging cognition and neurophysiology</title>
		<author>
			<persName><forename type="first">C</forename><surname>Bundesen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Habekost</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kyllingsbaek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychological review</title>
		<imprint>
			<biblScope unit="volume">112</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">291</biblScope>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Bundesen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vangkilde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Petersen</surname></persName>
		</author>
		<title level="m">Recent developments in a computational theory of visual attention</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">Vision research</note>
	<note>tva</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A theory of visual attention</title>
		<author>
			<persName><forename type="first">C</forename><surname>Bundesen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychological review</title>
		<imprint>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page">523</biblScope>
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Computational gestalts and perception thresholds</title>
		<author>
			<persName><forename type="first">A</forename><surname>Desolneux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Moisan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-M</forename><surname>Morel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Physiology-Paris</title>
		<imprint>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="311" to="324" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A Review of Fuzzy Set Aggregation Connectives</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dubois</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Prade</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="85" to="121" />
			<date type="published" when="1985">1985</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Sequential model-based segmentation and recognition of image structures driven by visual features and spatial relations</title>
		<author>
			<persName><forename type="first">G</forename><surname>Fouquier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Atif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bloch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Vision and Image Understanding</title>
		<imprint>
			<biblScope unit="volume">116</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="146" to="165" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The discriminant center-surround hypothesis for bottom-up saliency</title>
		<author>
			<persName><forename type="first">D</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mahadevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Vasconcelos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="497" to="504" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Feature confirmation in object perception: Feature integration theory 26 years on from the Treisman Bartlett lecture</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Humphreys</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Quarterly Journal of Experimental Psychology</title>
		<imprint>
			<biblScope unit="page" from="1" to="49" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>just-accepted</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Attention and Effort</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kahneman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1973">1973</date>
			<publisher>Prentice-Hall</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A note on rule representation in expert systems</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Ralescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="193" to="203" />
			<date type="published" when="1986">1986</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Cardinality, quantifiers, and the aggregation of fuzzy criteria</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ralescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Fuzzy Sets and Systems</title>
		<imprint>
			<biblScope unit="volume">69</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="355" to="365" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A feature-integration theory of attention</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Treisman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gelade</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cognitive Psychology</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="97" to="136" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Features and objects: The fourteenth Bartlett memorial lecture</title>
		<author>
			<persName><forename type="first">A</forename><surname>Treisman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Quarterly Journal of Experimental Psychology</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="201" to="237" />
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The psychological reality of levels of processing</title>
		<author>
			<persName><forename type="first">A</forename><surname>Treisman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Levels of processing in human memory</title>
		<imprint>
			<biblScope unit="page" from="301" to="330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14a">
	<analytic>
		<title level="a" type="main">Connectives and quantifiers in fuzzy sets</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Yager</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Fuzzy Sets and Systems</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="page" from="39" to="75" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Hierarchical saliency detection</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1155" to="1162" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Saliency detection via graph-based manifold ranking</title>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ruan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-H</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="3166" to="3173" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
