<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reduction dimension of bags of visual words with FCA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ngoc Bich Dao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karell Bertet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnaud Revel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire L3i, University of La Rochelle</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In image retrieval involving bags of visual words, dimension reduction is a fundamental data preprocessing task. In recent years, several methods have been proposed for the supervised and unsupervised cases. In the supervised case, the problem has been addressed with encouraging results. In the unsupervised case, however, dimension reduction remains a challenge. In this article, we propose an application to image retrieval of a logical dimension reduction method based on Formal Concept Analysis (FCA). This method reduces a closure system without, theoretically, any loss of information. In our context, combining this method with bags of visual words is original. Experimental results on five data sets (COREL, CALTECH256, VOC2005, VOC2012 and MIR Flickr) are analyzed to show the influence of the data structures and the parameters on the reduction factor.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Thanks to the generalization of multimedia devices, huge collections of digital images are available today. As far as mining in multimedia documents is concerned, web search engines usually give poor results: such results fall far short of expectations regarding the semantics of the documents. Content Based Image Retrieval (CBIR)[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been investigated for decades in order to answer this problem. The main idea is to build a description based on the image content, and to find similarities between descriptions. Classically, visual features are extracted from images and then compiled into an index or signature giving a dense description of each image. To perform the retrieval, a similarity function compares the index of the query with those of the collection, and a ranking of the results according to the computed similarity is proposed to the user. The detection of visual features can be performed by a SIFT detector[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or a dense grid, both of which select a large number of interest points (up to several thousands) from the images. Each of these points is then described by a SIFT-like descriptor. To limit the dimension of the description space, a vector quantization (usually k-means) is performed in order to cluster similar interest points into "visual words" and to generate a dictionary of visual words (usually up to 1000 words). The signature of an image is then composed of the set of visual words corresponding to the feature points detected in the image (forming a "bag of visual words"[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Comparing two images then consists in comparing their bags of visual words. The processing cost introduced by these techniques makes them difficult to use with large amounts of images, such as a query on the Internet.
      </p>
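      <p>As a rough sketch of the pipeline above (detect interest points, describe them, quantize the descriptors into a dictionary of visual words, and histogram them into a signature), the following Python toy uses random vectors in place of real SIFT descriptors and a minimal k-means; the function names are illustrative, not from the paper.</p>

```python
import numpy as np

def build_dictionary(descriptors, k, iters=20, seed=0):
    """Toy k-means: cluster descriptors into k 'visual words' (cluster centers)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # assign every descriptor to its nearest center, then recompute centers
        dist = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bovw_signature(descriptors, dictionary):
    """Bag of visual words: occurrence histogram of the nearest word of each point."""
    dist = np.linalg.norm(descriptors[:, None] - dictionary[None], axis=2)
    return np.bincount(dist.argmin(axis=1), minlength=len(dictionary))

rng = np.random.default_rng(1)
descriptors = rng.normal(size=(200, 8))      # stand-ins for SIFT descriptors
dictionary = build_dictionary(descriptors, k=10)
signature = bovw_signature(descriptors, dictionary)
```

      <p>Two images are then compared through a similarity function on their signatures, for instance histogram intersection or cosine similarity.</p>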
      <p>Supervised data is labeled (the data has ground truth), and classification methods are required to deal with the categorization problem. In the unsupervised case, data is unlabeled, so clustering methods are used to gather similar observations into the same cluster. Classification and clustering have many applications across computer science, such as bioinformatics, numerical analysis, machine learning, data mining and pattern recognition, where the data may contain a very large set of features: the description of the data is high-dimensional and therefore needs to be reduced. However, reducing the features while preserving the quality of the data is still challenging.</p>
      <p>To manage high-dimensional description spaces, reduction techniques have been proposed. These techniques are widely used as a data preprocessing step in machine learning and pattern recognition. This step can usually increase the accuracy of subsequent steps such as classification or clustering, while the computational cost and running time of those steps may be significantly decreased. In the statistics and machine learning literature, two main strategies are distinguished: feature transformation and feature selection, both usable in the supervised and unsupervised cases. Feature transformation consists in transforming the given set of features into a new one: when the new feature set is larger than the original one, this is called feature generation; when it is smaller, feature extraction. Feature selection methods instead select features from the original set. This approach is interesting in domains that prefer to keep the existing features in order to maintain their physical meaning.</p>
      <p>In this article, we propose a logical, unsupervised feature reduction method issued from FCA to address the visual word reduction problem in a CBIR system. In FCA, data are organised into a "context" by a set of observations (called "objects", "samples" or "experimental units" in other fields) and a set of features (also known as "attributes", "parameters" or "variables" in the computer science, machine learning and statistics communities) associated with each observation.</p>
      <p>Context reduction is a simple, polynomial treatment in FCA, classically applied to the whole context, thus reducing both observations and features. This treatment is based on a nice result establishing that the concept lattice of the context can be reduced to a minimal one, while preserving its graph structure, by deleting some redundant observations and features. For example, when two attributes are shared by exactly the same objects, they belong to the same concepts of the concept lattice; they are therefore redundant, and one of these two attributes can be deleted while preserving the concept lattice structure. In our case, we focus on feature reduction of a context. Our algorithm accepts as input the closure operator of the context on the attribute set, and returns the redundant attributes. This algorithm thus extends the classical attribute reduction of a context to the more general case of data described by a closure operator. Moreover, we propose a new application in image analysis: feature reduction of visual words.</p>
      <p>This paper is organized as follows. In order to introduce our approach, we recall some definitions of formal concepts in Section 2.1. Section 2.2 details our proposed method. Section 3 shows some experimental results on real data. Finally, Section 4 ends this paper with a conclusion and perspectives.</p>
    </sec>
    <sec id="sec-2">
      <title>The proposed features selection method</title>
      <p>The feature reduction algorithm we propose is a logical, unsupervised method stemming from FCA, where a concept lattice, defined from a binary table, represents the description of all object-attribute combinations. When the concept lattice structure is preserved after the deletion of some attributes and objects, these attributes are "redundant" for the lattice structure and can be deleted from the initial data without affecting the structure of object-attribute combinations. Therefore, from a theoretical point of view, the description of the data is equivalently represented by a concept lattice from which the "redundant" attributes and objects have been deleted.</p>
      <p>The reduction is a simple, polynomial treatment in FCA, classically decomposed into two steps: attribute reduction and object reduction. In this article, we focus on attribute/feature reduction, thus on the detection of attributes that are redundant for the concept lattice structure restricted to attributes. A nice result establishes that both parts of a concept (A, B) are closed sets defined on the object and attribute sets, and the concept lattice restricted to the attributes/objects is called a closure lattice.</p>
      <p>In the first subsection, we introduce the notions of closure lattice according to a closure operator, reduced closure lattice and redundant attributes. In the second subsection, we present the reduction algorithm aiming at removing redundant attributes, with a closure operator as input. It is thus a generic algorithm that can be applied either to the attributes or to the objects of a binary table, but also to any closure system.</p>
      <sec id="sec-2-1">
        <title>Reduced lattice</title>
        <p>
          In FCA, the relationship between a set of attributes I and a set of objects O is described by a formal context (O, I, (f, g)), where f(A) is the set of attributes shared by a subset A of objects, and g(B) is the set of objects sharing a subset B of attributes. One can derive two closure systems from a context. The first one is defined on the set of attributes I, with f∘g as closure operator. The second one is defined on the set of objects O, with g∘f as closure operator[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. A closure system (φ, S) is defined by a closure operator φ on a set S, i.e. a map on P(S) satisfying the three following properties: φ is isotone, extensive and idempotent. A subset X ⊆ S is called closed if φ(X) = X (see Table 2). The set system F of all closed subsets, fitted out with the inclusion relation ⊆, forms a lattice usually called the closure lattice (see Fig. 1a). See the survey of Caspard and Monjardet[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] for more details about closure systems. There are infinitely many set systems whose closure lattices are isomorphic. A reduced closure lattice is a closure lattice defined on a set S of smallest size among all isomorphic closure lattices. A nice result[
          <xref ref-type="bibr" rid="ref18 ref20">20,18</xref>
          ] establishes that a closure system is reduced when, for each x ∈ S, the closure φ(x) is a join-irreducible (Equation 1):
        </p>
        <p>∀x ∈ S, ∀Y ⊆ S such that x ∉ Y: φ(x) ≠ φ(Y)   (1)</p>
        <p>Therefore, a non-reduced closure system contains reducible elements, i.e. elements which do not satisfy Equation 1. Each reducible element x ∈ S is then equivalent to a set Ex ⊆ S of equivalent elements with x ∉ Ex and φ(x) = φ(Ex). Reducible elements can be removed without affecting the structure of the closure lattice. The reduction of a closure system then consists in removing or replacing each reducible element x ∈ S by its equivalent set Ex.</p>
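        <p>To make the definitions concrete, here is a small runnable sketch (not from the paper): the closure operator f∘g of a toy binary context, and a brute-force check of the join-irreducibility condition of Equation (1). The names `closure` and `reducible` and the toy context are illustrative.</p>

```python
import numpy as np
from itertools import combinations

def closure(context, attrs):
    """phi(B) = f(g(B)): attributes shared by every object having all of B."""
    objs = np.all(context[:, list(attrs)], axis=1)                   # g(B)
    return frozenset(np.flatnonzero(np.all(context[objs], axis=0)))  # f(g(B))

def reducible(context, x):
    """Naive Equation (1) test: does some Y with x not in Y give phi(Y) = phi({x})?"""
    others = [y for y in range(context.shape[1]) if y != x]
    cx = closure(context, [x])
    return any(closure(context, Y) == cx
               for r in range(len(others) + 1)
               for Y in combinations(others, r))

# toy context (objects x attributes): attribute 2 holds exactly when 0 and 1 both hold
ctx = np.array([[1, 1, 1],
                [1, 0, 0],
                [0, 1, 0]], dtype=bool)
# here attribute 2 is reducible (phi({2}) = phi({0, 1})); attributes 0 and 1 are not
```

      <p>This brute-force test is exponential in |S|, which is exactly why the polynomial algorithm of the next subsection works on the precedence graph instead.</p>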
      </sec>
      <sec id="sec-2-2">
        <title>Proposed reduction algorithm</title>
        <p>The algorithm we propose is a generic reduction algorithm, since it only needs a closure operator as input. Thus it can be applied with the same complexity to any closure system, and in particular to a context by considering the attributes, using f∘g as closure operator.</p>
        <p>(a) The context</p>
        <p>A direct application of the definition (see Eq. 1) would imply an exponential cost, by checking whether any subset Y ⊆ S is equivalent to each x ∈ S. We use the precedence relation (precedence graph) for a polynomial reduction. The precedence graph is defined on the set S, with an edge between two elements x, y ∈ S when φ(x) ⊆ φ(y). This graph is clearly acyclic for a reduced closure system. We propose a generic algorithm in 3 steps:</p>
        <p>Step 1: Standardization. Check if there exist x, y ∈ S such that φ(x) = φ(y). When φ(x) = φ(y), then x and y belong to the same strongly connected component of the graph. Each strongly connected component X ⊆ S contains elements xi, xj such that φ(xi) = φ(xj) for all xi ≠ xj ∈ X. Thus we can delete all elements of X except one representative element x ∈ X. The obtained precedence graph is then acyclic.</p>
        <p>Step 2: Clarification. Check if there exists x ∈ S such that φ(x) = φ(∅). When such an x exists, then φ(x) is included in φ(y) for any y ∈ S, so x is the only source of the precedence graph. The clarification test therefore only has to be performed for graphs with one source.</p>
        <p>Step 3: Reduction. Check, for any x ∈ S, if there exists a set Ex ⊆ S such that x ∉ Ex and φ(x) = φ(Ex). One can observe that an element x with only one immediate predecessor y is not reducible, because it would be equivalent to y, and would thus belong to a strongly connected component already removed in the first step. If there exists Ex ⊆ S such that φ(x) = φ(Ex), then the elements of Ex are clearly predecessors of x in the precedence graph since φ(y) ⊆ φ(x) for all y ∈ Ex. Moreover, this test can be restricted to the maximal predecessors of x. Therefore, this treatment only has to be performed for elements with more than one immediate predecessor, and the equality has to be checked against the set of immediate predecessors of x.</p>
        <p>This algorithm takes a closure operator φ on a set S as input. Its output is the reducible element set X ⊆ S and the equivalent element set Ex for each x ∈ X.</p>
        <p>Alg. 1 reduces a closure system in O(|S|·cφ + |S|² log |S|), where cφ is the cost of a closure generation and |S| is the number of nodes. Indeed, the precedence graph can be initialized in O(|S|·cφ + |S|² log |S|) by computing the closures in O(|S|·cφ) and then comparing pairs of closures in O(|S|² log |S|). The strongly connected components can then be computed using Kosaraju's algorithm by two passes of depth-first search, thus with a complexity in O(|S| + |A|) ⊆ O(|S|²), where |A| is the number of edges in the graph. Standardization and clarification are clearly in O(|S|) by a simple pass over the graph. Finally, the reduction step considers the immediate predecessors of each x ∈ S in O(|S|²), and then computes and compares two closures in O(|S|·cφ + |S|² log |S|). Therefore, Alg. 1 computes the attribute-reduced context in O(|I|²|O| + |I|² log |I|), since a closure can be obtained in O(|I|·|O|).</p>
        <p>Input: a closure operator φ on a set S
Output: the reducible element set X ⊆ S, and the equivalent element set Ex for each x ∈ X

init a set Res with ∅;
init a graph G with S as set of nodes;
// Precedence graph
foreach (x, y) ∈ S × S do
    if φ(x) ⊆ φ(y) then
        add the edge (x, y) in G;
    end
end
compute the set CFC of the strongly connected components of G;
let source be the sources of the graph G;
// Step (1): Standardization
foreach C ∈ CFC do
    choose y ∈ C;
    foreach x ∈ C such that x ≠ y do
        add x in Res with Ex = {y}; delete x from the graph G;
    end
end
// Step (2): Clarification
if |source| = 1 and φ(source) = φ(∅) then
    add source in Res with Esource = ∅; delete source from G;
end
// Step (3): Reduction
foreach x ∈ G do
    let P be the set of immediate predecessors of x in G;
    if |P| ≠ 1 and φ(x) = φ(P) then
        add x in Res with Ex = P; delete x from the graph G;
    end
end
return Res, (Ex)x∈Res;</p>
        <p>Algorithm 1: Reduction of a closure system</p>
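        <p>A runnable Python sketch of Algorithm 1 follows (not the authors' implementation): for brevity it detects the strongly connected components by direct closure equality rather than Kosaraju's algorithm, and tests precedence with subset comparisons. The names `make_phi` and `reduce_closure_system` are illustrative.</p>

```python
import numpy as np

def make_phi(context):
    """Closure operator f∘g of a binary (objects x attributes) context."""
    def phi(attrs):
        objs = np.all(context[:, list(attrs)], axis=1)
        return frozenset(np.flatnonzero(np.all(context[objs], axis=0)))
    return phi

def reduce_closure_system(S, phi):
    """Return {reducible element x: its equivalent set Ex}."""
    res = {}
    cl = {x: phi([x]) for x in S}
    # Step (1) standardization: equal closures = same strongly connected
    # component; keep one representative per component.
    groups = {}
    for x in S:
        groups.setdefault(cl[x], []).append(x)
    alive = []
    for members in groups.values():
        alive.append(members[0])
        for x in members[1:]:
            res[x] = {members[0]}
    # precedence: y precedes x when phi(y) is strictly contained in phi(x)
    preds = {x: {y for y in alive if cl[y] < cl[x]} for x in alive}
    # Step (2) clarification: a unique source whose closure equals phi(empty set)
    sources = [x for x in alive if not preds[x]]
    if len(sources) == 1 and cl[sources[0]] == phi([]):
        res[sources[0]] = set()
        alive.remove(sources[0])
    # Step (3) reduction: x whose closure equals that of its immediate predecessors
    for x in list(alive):
        p = preds[x] & set(alive)
        imm = {y for y in p if not any(cl[y] < cl[z] for z in p)}
        if len(imm) > 1 and phi(imm) == cl[x]:
            res[x] = imm
            alive.remove(x)
    return res

# toy context: attribute 2 holds exactly when attributes 0 and 1 both hold
ctx = np.array([[1, 1, 1],
                [1, 0, 0],
                [0, 1, 0]], dtype=bool)
reduced = reduce_closure_system(range(ctx.shape[1]), make_phi(ctx))
# attribute 2 is reported reducible with E = {0, 1}
```

      <p>On this toy context the algorithm removes attribute 2, equivalent to the set {0, 1}, exactly as the reduction step of Algorithm 1 prescribes.</p>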
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimentation</title>
      <sec id="sec-3-1">
        <title>Datasets</title>
        <p>In our experiments, we compare the performance of the proposed method on different image data sets. Each image in a data set is described by a vector composed of the occurrence frequencies of its visual words, where a set of visual words is defined for each data set. Table 3 describes the different data sets used in our experiments, and the methods applied to generate the whole bag of visual words.</p>
        <p>Database | Images nb | Features nb | Detector | Descriptor | Dictionary of visual words
VOC2012[<xref ref-type="bibr" rid="ref21">21</xref>] | 17124 | 4096 | HarrisLaplace | CMI (Colour Moment Invariants)[<xref ref-type="bibr" rid="ref22">22</xref>] | Random selection of all key points
MIR Flickr[<xref ref-type="bibr" rid="ref23">23</xref>] | 24991 | 4096 | HarrisLaplace | CMI¹ | Random selection of all key points
COREL[<xref ref-type="bibr" rid="ref24">24</xref>] | 4998 | 500 | SIFT | SIFT[<xref ref-type="bibr" rid="ref2">2</xref>] | K-means[<xref ref-type="bibr" rid="ref25">25</xref>] (OpenCV)
CALTECH256[<xref ref-type="bibr" rid="ref26">26</xref>] | 30607 | 500 | SIFT | SIFT² | K-means (OpenCV)
Dataset 1 (VOC2005)[<xref ref-type="bibr" rid="ref27">27</xref>] | 1354 | 262 | HarrisLaplace and Laplacian³ | SIFT | K-means (OpenCV)</p>
        <p>As mentioned earlier, the algorithm we propose requires binary values indicating, for each object, whether it possesses a given attribute or not. Since each image is described by a visual word occurrence frequency vector, its values can vary from 0 to a maximum value depending on the image size and the quantity of visual words in the image. For instance, if an image is entirely painted black, there is only one visual word "black" for the whole image, with a large frequency, and the vector will be sparse. Conversely, an image with a patchwork of colors is described by a frequency vector mainly composed of low but non-zero values. To be able to compare several images, it is thus necessary to normalize their frequency vectors before binarization.
1 http://koen.me/research/colordescriptors/
2 http://www.robots.ox.ac.uk/~vgg/research/affine/#software
3 http://lear.inrialpes.fr/people/dorko/downloads.html</p>
        <p>Normalization As mentioned before, the visual word occurrence frequency can be very high in some images and insignificant in others. In order to compare the visual words, several strategies can be adopted.</p>
        <p>First of all, it is necessary to find a "max" value in the data set and then divide the visual word frequency by this max value, to map the values into the range 0 to 1. Two ways to define the max value are considered in this article.</p>
        <p>Normalization by line (image) With this type of normalization, a max value is computed for each image, as the maximum frequency value of the corresponding image. The interpretation of this normalization is that we consider as significant the ratio between the different attributes of a given image. This kind of normalization depends neither on the database size nor on the image size. However, the normalized values do not account for the ratio of the same attribute between the images in the database.</p>
        <p>Normalization by column (feature) Normalization by column finds the maximum frequency value for each attribute in the database. With this approach, the correspondence between the images in the database is taken into account. The drawback is that each time a new image is inserted into the database, the normalized values must be recomputed. Besides, the image size must also be taken into account. Table 4 gives an illustrated example.</p>
        <p>(a) Initial data
      f1   f2   f3   f4
img1   1    0   50    5
img2  10    9    1    8
img3   0    0    0   99</p>
        <p>(b) After normalization by line
      f1    f2    f3    f4
img1  0.02  0     1     0.1
img2  1     0.9   0.1   0.8
img3  0     0     0     1</p>
        <p>(c) After normalization by column
      f1    f2    f3    f4
img1  0.1   0     1     0.05
img2  1     1     0.02  0.08
img3  0     0     0     1</p>
        <p>Binarization After the normalization, we simply binarize the normalized values by comparing them with a threshold varying from 0 to 0.9. At a threshold of one, in the normalization by line case, most of the attributes in an image could fall below the threshold. To avoid removing all the visual words from an image, the highest threshold has been set to 0.9.</p>
        <p>Reduction The next phase of the algorithm is to apply our reduction method, which is itself composed of three steps (clarification, standardization, reduction). Indeed, before applying the proposed method to bags of visual words, we must remove all the visual words that appear in every image (resp. in no image). This step corresponds to the clarification. The standardization step removes a feature whose image vector is identical to the image vector of another feature. At last, in the reduction step, all the features which are combinations of other features are removed.</p>
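        <p>The two normalizations and the binarization can be reproduced in a few lines; a minimal numpy sketch, where the matrix reproduces Table 4(a), `binarize` is an illustrative name, and every row and column is assumed to have a non-zero maximum:</p>

```python
import numpy as np

# Table 4(a): visual word frequencies for 3 images and 4 features
freq = np.array([[ 1., 0., 50.,  5.],
                 [10., 9.,  1.,  8.],
                 [ 0., 0.,  0., 99.]])

# normalization by line: divide each image (row) by its own maximum frequency
by_line = freq / freq.max(axis=1, keepdims=True)
# normalization by column: divide each feature (column) by its maximum over images
by_column = freq / freq.max(axis=0, keepdims=True)

def binarize(normalized, threshold):
    """Keep a visual word only when its normalized frequency exceeds the threshold."""
    return normalized > threshold
```

      <p>Note that with a high threshold, normalization by line still keeps at least the most frequent word of each image (its normalized value is 1), whereas normalization by column can leave an image with no word at all.</p>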
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>In this section, we detail the results obtained with our reduction method on the 5 data sets described in Section 3.1. To analyze the behavior of our method, and the contribution of each step of the algorithm, we introduce the ratio of removed features for each step of the reduction algorithm as follows:</p>
        <p>ρ1 = a / Natt,   ρ2 = b / (Natt − a),   ρ3 = c / (Natt − a − b)</p>
        <p>where a (resp. b and c) is the number of attributes removed in the standardization (resp. clarification and reduction) step, and Natt is the total number of attributes. Figure 2 shows the evolution of ρ1, ρ2, ρ3 with regard to the threshold level, for both normalization types: line and column.</p>
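        <p>These ratios are straightforward to compute; a minimal sketch, where the sample numbers are purely illustrative:</p>

```python
def removal_ratios(n_att, a, b, c):
    """Removal ratios of the standardization (a), clarification (b) and
    reduction (c) steps, each relative to the attributes still present."""
    rho1 = a / n_att
    rho2 = b / (n_att - a)
    rho3 = c / (n_att - a - b)
    return rho1, rho2, rho3

# e.g. 100 attributes: 10 removed, then 18 of the remaining 90, then 36 of 72
rho1, rho2, rho3 = removal_ratios(100, 10, 18, 36)
```
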
        <p>The maximum ratios of removed attributes for the data sets (CALTECH, COREL, VOC2005, MIR Flickr, VOC2012) are approximately equal to 0.67%, 2.6%, 22.5%, 95% and 96% respectively. The impact of the reduction is more noticeable on the last three data sets. This phenomenon can be explained by the bag of visual words generation, since the two data sets MIR Flickr and VOC2012 are composed of randomly selected visual words stemming from the keypoint set. Conversely, the data sets CALTECH, COREL and VOC2005 are composed of bags of visual words defined by the SIFT detector and descriptor, and by a K-means clustering. Thus, the randomly selected visual words are less consistent.</p>
        <p>We can also observe that the percentage of removed attributes increases as the binarization threshold increases. With an increasing threshold, only the most frequent words are kept, so more attributes are potentially equivalent and removed.</p>
        <p>At last, there is no attribute reduction in step 1 (the ρ1 value) with a normalization by column, because this kind of normalization cannot generate empty columns. Moreover, a normalization by line keeps the most frequent attributes in each image, whereas a normalization by column keeps the most frequent images for each attribute. To summarize, the number of removed attributes depends on the visual words generation, on the chosen binarization threshold and on the normalization process (by line or column). However, care should be taken that the greater the binarization threshold, the smaller the number of remaining images, except in the case of normalization by line.</p>
        <p>Fig. 2: Evolution of ρ1, ρ2, ρ3 with regard to the binarization threshold for CALTECH, COREL, VOC2005, MIR Flickr and VOC2012: (a) normalization by line; (b) normalization by column.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and perspective</title>
      <p>In this article, we have presented a logical feature selection method for bags of visual words. This method, stemming from Formal Concept Analysis, is a closure system reduction without, theoretically, any loss of information: the lattice describing the data is preserved by the reduction treatment. In our context, combining the proposed method with bags of visual words is original. The experiments show that the number of deleted features can be significant, depending on the data set and the binarization treatment. Moreover, it is possible to perform both an object and an attribute reduction.</p>
      <p>A finer analysis could be obtained in the supervised case, by comparing classification performance before and after reduction. Moreover, the number of potentially deleted objects could also be useful to automatically define a good binarization threshold in the supervised case: while the suppression of objects belonging to the same class is to be encouraged, we must avoid removing objects of different classes. Object reduction can easily be performed by applying our reduction algorithm to the object set.</p>
      <p>At last, we plan to study the number of deleted attributes and deleted objects (of the same class / of different classes) to evaluate the complexity of a data set and the quality of its visual words.</p>
      <p>Acknowledgment: We would like to thank Thierry URRUTY, Nhu Van NGUYEN and Dounia AWAD, who extracted the bags of visual words we used in this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Smeulders</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
          </string-name>
          , R.:
          <article-title>Content-based image retrieval at the end of the early years</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>22</volume>
          (
          <year>2000</year>
          )
          <fpage>1349</fpage>
          -
          <lpage>1380</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Object recognition from local scale-invariant features</article-title>
          .
          <source>In: Proceedings of the International Conference on Computer Vision</source>
          , Kerkyra (
          <year>1999</year>
          )
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bosch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munoz</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Scene Classification Via pLSA</article-title>
          . In Leonardis,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bischof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Pinz</surname>
          </string-name>
          , A., eds.
          <source>: 9th European Conference on Computer Vision</source>
          . Volume
          <volume>3954</volume>
          of Lecture Notes in Computer Science, Graz, Austria, Springer Berlin Heidelberg (
          <year>2006</year>
          )
          <fpage>517</fpage>
          -
          <lpage>530</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Tufféry</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Data mining et statistique décisionnelle: L'intelligence des données</article-title>
          .
          <source>Technip edn.</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Belohlavek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kruse</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vychodil</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Discovery of optimal factors in binary data via a novel method of matrix decomposition</article-title>
          .
          <source>Journal of Computer and System Sciences</source>
          <volume>76</volume>
          (
          <year>2010</year>
          )
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Fisher, R.A.:
          <article-title>The use of multiple measurements in taxonomic problems</article-title>
          .
          <source>The Annals of Eugenics</source>
          <volume>7</volume>
          (
          <year>1936</year>
          )
          <fpage>179</fpage>
          -
          <lpage>188</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hotelling</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Analysis of a complex of statistical variables into principal components</article-title>
          .
          <source>Journal of Educational Psychology</source>
          <volume>24</volume>
          (
          <year>1933</year>
          )
          <fpage>417</fpage>
          -
          <lpage>441</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
          </string-name>
          , H.:
          <article-title>Feature selection for high-dimensional data: A fast correlation-based filter solution</article-title>
          .
          <source>In: Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003)</source>
          , Washington DC (
          <year>2003</year>
          )
          <fpage>856</fpage>
          –
          <lpage>863</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Correlation-based feature subset selection for machine learning</article-title>
          .
          <source>PhD thesis</source>
          , University of Waikato, Hamilton, New Zealand
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Battiti</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Using mutual information for selecting features in supervised neural net learning</article-title>
          .
          <source>IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council</source>
          <volume>5</volume>
          (
          <year>1994</year>
          )
          <fpage>537</fpage>
          –
          <lpage>550</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Rakotomalala</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lallich</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Construction d'arbres de décision par optimisation</article-title>
          .
          <source>Revue Extraction des Connaissances et Apprentissage</source>
          <volume>16</volume>
          (
          <year>2002</year>
          )
          <volume>685</volume>
          {
          <fpage>703</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kononenko</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Estimating attributes: Analysis and extensions of RELIEF</article-title>
          . In
          <string-name>
            <surname>Bergadano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raedt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , eds.:
          <source>Machine Learning: ECML-94. Volume 784 of Lecture Notes in Computer Science</source>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>1994</year>
          )
          <fpage>171</fpage>
          –
          <lpage>182</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niyogi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Laplacian score for feature selection</article-title>
          .
          <source>In: Neural Information Processing Systems Foundation</source>
          , MIT Press (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Devaney</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ram</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficient Feature Selection in Conceptual Clustering</article-title>
          .
          <source>In: Machine Learning: Proceedings of the Fourteenth International Conference</source>
          , Nashville, TN (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Dy</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodley</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <article-title>Feature Selection for Unsupervised Learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>845</fpage>
          –
          <lpage>889</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashua</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>6</volume>
          (
          <year>2005</year>
          )
          <fpage>1855</fpage>
          –
          <lpage>1887</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Elghazel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aussem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Unsupervised feature selection with ensemble learning</article-title>
          .
          <source>Machine Learning</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Barbut</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monjardet</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Ordre et classification: algèbre et combinatoire</article-title>
          . Hachette, Paris (
          <year>1970</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Caspard</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monjardet</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The lattices of closure systems, closure operators, and implicational systems on a finite set: a survey</article-title>
          .
          <source>Discrete Applied Mathematics</source>
          <volume>127</volume>
          (
          <year>2003</year>
          )
          <fpage>241</fpage>
          –
          <lpage>269</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Birkhoff</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <source>Lattice Theory</source>
          . 1st edn. American Mathematical Society
          (
          <year>1940</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <source>The PASCAL Visual Object Classes (VOC) Challenge</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Mindru</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuytelaars</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gool</surname>
            ,
            <given-names>L.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moons</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Moment invariants for recognition under changing viewpoint and illumination</article-title>
          .
          <source>Computer Vision and Image Understanding</source>
          <volume>94</volume>
          (
          <year>2004</year>
          )
          <fpage>3</fpage>
          –
          <lpage>27</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Huiskes</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lew</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          :
          <article-title>The MIR Flickr retrieval evaluation</article-title>
          .
          <source>In: Proceedings of the 1st ACM international conference on Multimedia information retrieval - MIR '08</source>
          , New York, USA, ACM Press (
          <year>2008</year>
          )
          <fpage>39</fpage>
          –
          <lpage>43</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Carneiro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasconcelos</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Supervised Learning of Semantic Classes for Image Annotation and Retrieval</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>29</volume>
          (
          <year>2007</year>
          )
          <fpage>394</fpage>
          –
          <lpage>410</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Macqueen</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>Some Methods for Classification and Analysis of Multivariate Observations</article-title>
          .
          <source>In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability</source>
          . (
          <year>1967</year>
          )
          <fpage>281</fpage>
          –
          <lpage>297</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Griffin</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holub</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Caltech-256 Object Category Dataset</article-title>
          .
          <source>Technical report</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.
          :
          <article-title>The 2005 PASCAL Visual Object Classes Challenge</article-title>
          .
          <source>In: First PASCAL Machine Learning Challenges Workshop</source>
          ,
          MLCW
          <year>2005</year>
          . Volume
          <volume>3944</volume>
          of Lecture Notes in Computer Science, Berlin, Heidelberg, Springer Berlin Heidelberg (
          <year>2005</year>
          )
          <fpage>117</fpage>
          –
          <lpage>176</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>