
Feature Selection Methods for Remote Sensing Images Classification

E. Goncharova

A. Gaidel

Image Processing Systems Institute, Branch of the Federal Scientific Research Centre “Crystallography and Photonics” of the Russian Academy of Sciences, 151 Molodogvardeyskaya st., 443001, Samara, Russia

Samara National Research University, 34 Moskovskoe Shosse, 443086, Samara, Russia

2017

86 91

Different methods of feature selection are used to improve the performance of remote sensing image classification. In this work two feature selection methods are examined. The first is based on discriminant analysis, and the second on building a regression model. Histogram and textural features are used as image characteristics. Experiments on the UC Merced Land Use remote sensing dataset show the effectiveness of these methods. The highest fraction of correctly classified images reached 95%. The dimension of the initial feature space was reduced from 18 features to 3.

Keywords: feature selection; classification; remote sensing images; discriminant analysis; regression analysis

1. Introduction

2. The object of the study

3. Methods

3.1. Feature extraction

where $R$, $G$, and $B$ are the intensities of the red, green, and blue components of the image resolution cell with coordinates $(m, n)$, respectively.

$I(m, n)$ takes values from $0$ to $L - 1$, where $L$ is the number of gray levels.

Many different features can characterize an image. In this work we use histogram features, which describe the distribution of gray values. If the discrete image is treated as a two-dimensional stochastic process, we can estimate the distribution of its gray values and, from it, the raw moments (2) and central moments (3):

$$\nu_k = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} I^k(i, j), \qquad (2)$$

$$\mu_k = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( I(i, j) - \nu_1 \right)^k. \qquad (3)$$

The calculated features are:

- mean intensity: $\bar{I} = \nu_1$, and also $\bar{I}_R$, $\bar{I}_G$, $\bar{I}_B$ (the mean intensities of the red, green, and blue components respectively);
- second raw moment (mean energy): $\nu_2$;
- standard deviation: $s = \sqrt{\mu_2}$;
- skewness: $\gamma_1 = \mu_3 / \mu_2^{3/2}$;
- kurtosis (a measure of the “tailedness” of the probability distribution): $\gamma_2 = \mu_4 / \mu_2^{2} - 3$.

The autocorrelation matrix (4) describes dependence among the pixels of an image [4]:

$$R(m, n) = \frac{1}{(M - |m|)(N - |n|)} \sum_{i} \sum_{j} I(i, j)\, I(i + m, j + n). \qquad (4)$$

The spatial dependence matrix for a displacement $(d_1, d_2)$ is

$$P_{d_1 d_2}(i, j) = \left| \left\{ (m, n) \in \{1, 2, \dots, M\} \times \{1, 2, \dots, N\} \;\middle|\; I(m, n) = i,\ I(m + d_1, n + d_2) = j \right\} \right|, \quad i, j = \overline{0, L - 1}.$$
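As an illustration of the moment-based features and the spatial dependence matrix, a minimal NumPy sketch is given below. It assumes a gray-level image stored as a 2-D integer array; the function names `histogram_features` and `cooccurrence` are hypothetical and not taken from the paper, and the skewness and kurtosis use the standard moment ratios.

```python
import numpy as np

def histogram_features(img):
    """Moment-based features of a 2-D gray-level array (illustrative only)."""
    x = img.astype(np.float64)
    nu1 = x.mean()                     # first raw moment: mean intensity
    nu2 = (x ** 2).mean()              # second raw moment: mean energy
    mu2 = ((x - nu1) ** 2).mean()      # central moments of order 2, 3, 4
    mu3 = ((x - nu1) ** 3).mean()
    mu4 = ((x - nu1) ** 4).mean()
    # Note: a constant image has mu2 == 0, so the two ratios are undefined.
    return {
        "mean": nu1,
        "energy": nu2,
        "std": np.sqrt(mu2),
        "skewness": mu3 / mu2 ** 1.5,
        "kurtosis": mu4 / mu2 ** 2 - 3.0,
    }

def cooccurrence(img, d1, d2, levels):
    """Count pixel pairs with I(m, n) = i and I(m + d1, n + d2) = j."""
    M, N = img.shape
    P = np.zeros((levels, levels), dtype=np.int64)
    # Keep only the positions whose displaced neighbour lies inside the image.
    m0, m1 = max(0, -d1), min(M, M - d1)
    n0, n1 = max(0, -d2), min(N, N - d2)
    src = img[m0:m1, n0:n1]
    dst = img[m0 + d1:m1 + d1, n0 + d2:n1 + d2]
    np.add.at(P, (src.ravel(), dst.ravel()), 1)   # unbuffered accumulation
    return P
```

A rotation-insensitive matrix could then be obtained by averaging `cooccurrence` results over several displacements of the same length.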

Textural features are extracted from the spatial dependence matrices, which are calculated for eight different displacements $(d_1, d_2)$: $(1, 0)$, $(0, 1)$, $(1, 1)$, $(1, -1)$, $(2, 0)$, $(0, 2)$, $(2, 2)$, $(2, -2)$. To obtain rotation-invariant features, they are extracted from the averaged matrices. Thus, eight more textural features can be defined, among them the angular second moment. In their definitions, $P(i, j)$ is an element of the matrix averaged over the four directions $(1, 0)$, $(0, 1)$, $(1, 1)$, $(1, -1)$ or $(2, 0)$, $(0, 2)$, $(2, 2)$, $(2, -2)$; $R$ is the number of neighboring pixel pairs; $M_X$, $M_Y$ are the row and column means; $D_X$, $D_Y$ are the row and column variances.

3.2. Feature selection methods

To extract the subset of informative features, two methods were examined. The first belongs to discriminant analysis theory: we choose the set of features that provides the largest value of the criterion $J(Q)$ [6].

Let $\Omega$ be the set of objects to be recognized. In this work an element of this set is a feature vector $x_k \in \mathbb{R}^K$, where $K$ is the number of features. The set is divided into two classes $\Delta = \{\Omega_0, \Omega_1\}$ with the following properties:

1) $\Omega_0 \cup \Omega_1 = \Omega$;
2) $\Omega_0 \cap \Omega_1 = \varnothing$.

Let $\Phi(x_k): \Omega \to \Delta$ be the ideal operator that maps an object to its class. Since the ideal operator is unknown, another operator $\tilde{\Phi}(x_k): \Omega \to \Delta$ is constructed. $\tilde{\Phi}(x_k)$ predicts the class of an input object using the information obtained from a training set $U \subseteq \Omega$, in which the class of each object is known.

Since the features can be measured in different units, they should first be standardized to zero mean and unit variance. For this purpose the expected value

$$M(i) = \frac{1}{|U|} \sum_{k=1}^{|U|} x_k(i), \quad i = \overline{1, K}, \quad M \in \mathbb{R}^{K},$$

and the variance

$$R(i, i) = \frac{1}{|U|} \sum_{k=1}^{|U|} \left( x_k(i) - M(i) \right)^2, \quad i = \overline{1, K}, \quad R \in \mathbb{R}^{K \times K},$$

should be estimated for each feature.

The feature vectors can then be standardized by applying formula (5):

$$\tilde{x}_k(i) = \frac{x_k(i) - M(i)}{\sqrt{R(i, i)}}, \quad k = \overline{1, |U|}, \quad i = \overline{1, K}. \qquad (5)$$

Image Processing, Geoinformation Technology and Information Security / E. Goncharova, A. Gaidel

The criterion $J(Q)$ is expressed through $\operatorname{tr} R$, where $Q$ is the current set of features; $R$ is the mixture covariance matrix; $R_j$ is the within-class covariance matrix; $P(\Omega_j)$ is the prior probability of class $\Omega_j$, here $P(\Omega_j) = \frac{1}{2}$.

Thus, the more the scatter between the two classes exceeds the average within-class scatter, the better the selected set of features is.

To form the set of the most informative descriptors, a greedy strategy of adding one feature at a time was applied. Let the initial feature set be empty: $Q_0 = \varnothing$. At step $i$ we consider all sets of the form $Q_{i,j} = Q_{i-1} \cup \{j\}$ and calculate the criterion $J_{i,j} = J(Q_{i,j})$.

Then we choose the set that maximizes the criterion:

$$Q_i = Q_{i-1} \cup \Big\{ \arg\max_{j \in [1; K]_{\mathbb{Z}} \setminus Q_{i-1}} J_{i,j} \Big\} = Q_{i-1} \cup \Big\{ \arg\max_{j \in [1; K]_{\mathbb{Z}} \setminus Q_{i-1}} J\big(Q_{i-1} \cup \{j\}\big) \Big\}.$$

These steps are iterated until the required number of features is obtained.

The second approach is based on regression analysis, which estimates the relationship between a dependent variable and one or more independent variables.

We assume that the number of the class to which $x_k$ belongs is the dependent variable $y(x_k)$. This implies that the feature vector $x_k$ influences $y(x_k)$, and the regression model (6) can be built as follows:

$$y = X\theta + \varepsilon, \qquad (6)$$

where $y = (y_1\ y_2\ \dots\ y_n)^T$ is the output vector; $X$ is the feature matrix; $\theta = (\theta_0\ \theta_1\ \dots\ \theta_{|Q|})^T$ is the vector of regression weights; $\varepsilon = (\varepsilon_1\ \varepsilon_2\ \dots\ \varepsilon_n)^T$ is the error vector.

The unknown coefficients of the vector $\theta$ are determined from the training set data via the ordinary least squares method:

$$(y - X\theta)^T (y - X\theta) \to \min_{\theta}.$$
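A minimal sketch of the least-squares fit is shown below. It assumes that the weight $\theta_0$ corresponds to an intercept, so a column of ones is prepended to the feature matrix; the function name `fit_ols` is hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares weights for y ~ [1, X] @ theta; theta[0] is the intercept."""
    A = np.column_stack([np.ones(len(X)), X])   # assumed intercept column
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta
```

`np.linalg.lstsq` is used instead of explicitly inverting $X^T X$ because it stays stable when features are strongly correlated.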



The importance of each feature is directly related to its weight in the regression equation (6). On this assumption, a greedy strategy of removing one feature at a time can be applied to form the set of informative descriptors.
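The removal strategy can be sketched as below. The sketch assumes standardized features, refits the model after every removal, and compares the weights by absolute value; the absolute value is an assumption of this example (a large negative weight is treated as informative), and the name `backward_elimination` is hypothetical.

```python
import numpy as np

def backward_elimination(X, y, n_features):
    """Drop the feature with the smallest regression weight until n_features remain."""
    selected = list(range(X.shape[1]))
    while len(selected) > n_features:
        A = np.column_stack([np.ones(len(X)), X[:, selected]])
        theta, *_ = np.linalg.lstsq(A, y, rcond=None)
        weights = np.abs(theta[1:])          # skip the intercept; ASSUMED magnitude rule
        selected.pop(int(np.argmin(weights)))
    return selected
```

The returned list holds the column indices of the retained features in their original order.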

Let the initial feature set $Q_0 = Q$ contain all the analyzed features. At each step $i$ the linear regression model $y_i = X_i \theta_i$ is built in the corresponding feature space. Then the feature with the minimal coefficient is removed from the set according to the following rule:

$$Q_{i+1} = Q_i \setminus \Big\{ \arg\min_{j \in [1; K]_{\mathbb{Z}} \cap Q_i} \theta_i(j) \Big\}.$$

As in the previous case, these steps are iterated until the required number of features is obtained.

To estimate the classification power of the obtained feature subsets, nearest-neighbor classification is carried out. The Euclidean distance in feature space is defined as follows:

$$\rho(x, y) = \sqrt{\sum_{i=1}^{K} \left( x(i) - y(i) \right)^2}.$$

The classifier assigns to the vector $x$ the class of its closest point in the training set. In terms of computational complexity, this method is rather simple in comparison with others. However, since the classifier is memory-based, its memory and computation requirements may become excessive when the training set is large. Asymptotically, the nearest-neighbor misclassification rate is no more than twice the Bayes error rate [7].
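The nearest-neighbor rule with the Euclidean distance can be sketched in a few lines of NumPy. The function name `nearest_neighbor_predict` is hypothetical, and the brute-force distance matrix is only practical for small training sets such as the one used here.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test):
    """Give each test vector the class of its closest training vector."""
    diff = X_test[:, None, :] - X_train[None, :, :]   # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=2))           # Euclidean distances
    return y_train[np.argmin(dist, axis=1)]
```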

The nearest-neighbor error rate is assessed as follows:

$$\varepsilon = \frac{\left| \left\{ x_k \in \tilde{U} \;\middle|\; \tilde{\Phi}(x_k) \ne \Phi(x_k) \right\} \right|}{|\tilde{U}|}, \quad k = \overline{1, |\tilde{U}|},$$

where $\tilde{U}$ is the test set.

4. Results and Discussion

To assess the performance of the proposed approaches, two image sets from the UC Merced Land Use remote sensing dataset were used. This dataset includes aerial optical images belonging to different classes (agricultural field, forest, beach, etc.), 100 images per class. Each image measures 256×256 pixels in the RGB color space. Two classes of images (agricultural fields and forest) are examined in this work. Figure 1 illustrates sample images belonging to the two classes.

To carry out the experiments we used 5-fold cross-validation. The results obtained with the discriminant and regression analysis methods are shown in tables 1 and 2 respectively.
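A possible way to estimate the nearest-neighbor error rate with 5-fold cross-validation is sketched below. The random, non-stratified split and the fixed seed are assumptions of this example, since the splitting protocol is not described; the name `cv_error_1nn` is hypothetical.

```python
import numpy as np

def cv_error_1nn(X, y, n_folds=5, seed=0):
    """Fraction of misclassified objects over all folds of a 1-NN classifier."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    folds = np.array_split(order, n_folds)       # assumed: plain random folds
    errors = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)
        diff = X[test_idx][:, None, :] - X[train_idx][None, :, :]
        dist = np.linalg.norm(diff, axis=2)
        pred = y[train_idx][np.argmin(dist, axis=1)]
        errors += int((pred != y[test_idx]).sum())
    return errors / len(X)
```

To avoid optimistic estimates, standardization and feature selection would normally be refitted on the training part of each fold.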


Having analyzed the results, we can conclude that the discriminant analysis method performed best on this classification task. The lowest classification error rate of 0.05 was achieved in the three-dimensional feature space consisting of $\bar{I}_R$, $\bar{I}$, and $s$. The studied textural features had no significant effect on the quality of this classification. Including more textural characteristics that account for the correlation of features at various distances may improve the performance of this feature group.


5. Conclusion

Thus, a subset of informative features was extracted for the task of remote sensing image classification. On the images from the UC Merced Land Use dataset, the histogram features produced the best outcome. Note that the images were represented in the RGB color space; hence the mean intensities of these three components had a considerable impact on the discriminatory power.

The feature vector selected with the discriminant analysis method produced the best classification performance (using the nearest-neighbor classification method) on the images from the UC Merced Land Use dataset. The minimal classification error rate was 0.05, so the proportion of correctly classified images was 95%. This rate was achieved in the reduced three-dimensional feature space consisting of the descriptors $\bar{I}_R$, $\bar{I}$, and $s$.

Thus, applying the feature selection methods improves the image classification performance. In this study, a combination of three of the 18 initial descriptors proved to be informative, while the other features increased the misclassification rate.

The method based on the discriminant analysis criterion provided good results and can be applied to the task of feature selection. In future work we plan to consider more features that can characterize an image, as well as multiclass classification, which should lead to more general results.

Acknowledgements

The work was partially supported by the Russian Foundation of Basic Research (grant 16-41-630761 р_а), the Russian Federation Ministry of Education and Science as a part of Samara University's competitiveness enhancement program in 2013-2020, and the RAS based research program “Bioinformatics, modern information technologies and mathematical methods in medicine”.

References

[1] Sheng G, Yang W, Xu T, Sun H. High-resolution satellite scene classification using a sparse coding based multiple feature combination. International Journal of Remote Sensing 2012; 33(8): 2395-2412.

[2] Glumov, Myasnikov. Method of the informative features selection on the digital images. Computer Optics 2007; 31(3): 73-76. (in Russian)

[3] Gaidel, Zelter, Kapishnikov, Khramov. Computed tomography texture analysis capabilities in diagnosing a chronic obstructive pulmonary disease. Computer Optics 2014; 38(4): 843-850.

[4] Gaidel, Pervushkin. Research of the textural features for the bony tissue diseases diagnostics using the roentgenograms. Computer Optics 2013; 37(1): 113-119. (in Russian)

[5] Haralick, Shanmugam, Dinstein. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics 1973; 3: 610-621.

[6] Goncharova, Gaidel, Khramov. Statistical study of the factors affecting the cardiovascular disease. Information Technology and Nanotechnology 2016; 1020-1025. (in Russian)

[7] Fukunaga. Introduction to statistical pattern recognition. San Diego: Academic Press, 1990; 592 p.