Weighted Discriminant Embedding: Discriminant Subspace Learning for Imbalanced Medical Data Classification

Tobey H. Ko¹, Zhonglei Gu², Yang Liu²,³
¹ Department of Industrial and Manufacturing Systems Engineering, University of Hong Kong, HKSAR, China
² Department of Computer Science, Hong Kong Baptist University, HKSAR, China
³ Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
tobeyko@hku.hk, csygliu@comp.hkbu.edu.hk, cszlgu@comp.hkbu.edu.hk

Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
This working notes paper introduces a model for the automatic prediction of diseases based on multimedia data collected in hospitals. To perform the prediction efficiently while using as little training data as possible, we develop a two-stage learning strategy: weighted discriminant embedding (WDE) first projects the original data to a low-dimensional feature subspace, and the cost-sensitive nearest neighbor (CS-NN) method then performs disease prediction in the learned subspace. The proposed approach is evaluated on the MediaEval 2018 Medico Multimedia Task.

1    INTRODUCTION
The MediaEval 2018 Medico Multimedia Task [3] aims to improve the efficiency of detecting medical abnormalities in machine-intelligence-assisted medical diagnosis while using as little information as possible. It seeks an integrated approach that assists medical experts' decision-making by combining video and image information with other sensory information. In this paper, we introduce a two-stage learning strategy for the efficient detection of diseases from multimedia and sensory information. The first stage is a dimensionality reduction step that projects the original data to a low-dimensional feature representation using weighted discriminant embedding (WDE), which improves the efficiency of the learning process while preserving the key discriminant information of the original data. The second stage employs the cost-sensitive nearest neighbor (CS-NN) method to make the prediction in the learned subspace.

2    WEIGHTED DISCRIMINANT EMBEDDING
Let $\mathcal{X} = \{(\mathbf{x}_1, l_1), \cdots, (\mathbf{x}_n, l_n)\}$ be the training set, where $\mathbf{x}_i \in \mathbb{R}^D$ ($i = 1, \ldots, n$) denotes the feature representation of the $i$-th sample, $l_i \in \{1, \cdots, C\}$ denotes the label of $\mathbf{x}_i$, $n$ denotes the number of data samples in the set, $C$ denotes the number of classes, and $D$ denotes the original dimension of the data. Given the training set, weighted discriminant embedding (WDE) aims to learn a transformation matrix $\mathbf{W} \in \mathbb{R}^{D \times d}$ ($d \le D$) that projects the original high-dimensional data to a low-dimensional subspace $\mathcal{Z} = \mathbb{R}^d$ in which the weighted discriminant information is preserved.

In this year's Medico task, the numbers of samples in different classes are highly imbalanced. To improve the algorithm's ability to detect the rarer classes correctly, we require that, in the learned subspace, data samples belonging to the same class be as close to each other as possible, and that nearby data samples from different classes be separated from each other as much as possible. Both requirements are weighted more heavily for the rarer classes.

To minimize the weighted intra-class scatter, we present the following objective:

\[ \mathbf{W} = \arg\min_{\mathbf{W}} \; \mathrm{tr}\Big( \sum_{i,j=1}^{n} A_{ij} \mathbf{W}^T (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{W} \Big), \tag{1} \]

where $A_{ij} = (I_i + I_j)/2$ if $l_i = l_j$, and $0$ otherwise. Here $I_i$ indicates the importance of class $l_i$ and is defined using the entropy-based formulation [2]:

\[ I_i = -\frac{(1 - p_i)^2}{p_i} \log(p_i), \tag{2} \]

where $p_i$ denotes the proportion of class $l_i$ in the dataset. In Eq. (2), a small proportion indicates high importance. Eq. (1) can be rewritten as:

\[ \mathbf{W} = \arg\min_{\mathbf{W}} \; \mathrm{tr}(\mathbf{W}^T \mathbf{L}_A \mathbf{W}), \tag{3} \]

where $\mathbf{L}_A$ is a Laplacian matrix [1] defined as $\mathbf{L}_A = \mathbf{D}_A - \mathbf{A}$, with $\mathbf{D}_A$ being a diagonal matrix defined as $(D_A)_{ii} = \sum_{j=1}^{n} (A)_{ij}$ ($i = 1, \cdots, n$).

Similarly, we define the following objective function to maximize the weighted inter-class scatter:

\[ \mathbf{W} = \arg\max_{\mathbf{W}} \; \mathrm{tr}\Big( \sum_{i,j=1}^{n} B_{ij} \mathbf{W}^T (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{W} \Big), \tag{4} \]

where $B_{ij} = N_{ij} (I_i + I_j)/2$ if $l_i \neq l_j$, and $0$ otherwise. Here $N_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2)$ measures the closeness between two data samples. Eq. (4) can be rewritten as:

\[ \mathbf{W} = \arg\max_{\mathbf{W}} \; \mathrm{tr}(\mathbf{W}^T \mathbf{L}_B \mathbf{W}), \tag{5} \]

where $\mathbf{L}_B = \mathbf{D}_B - \mathbf{B}$, with $\mathbf{D}_B$ being a diagonal matrix defined as $(D_B)_{ii} = \sum_{j=1}^{n} (B)_{ij}$ ($i = 1, \cdots, n$).
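
As an illustration of Eqs. (2) to (5), the following Python sketch computes the class importance values and the two Laplacian matrices with NumPy. The function names `class_importance` and `wde_laplacians`, the convention that samples are stored as rows of `X`, and the toy data are assumptions made for this example; they are not taken from the original method description.

```python
import numpy as np

def class_importance(labels):
    """Eq. (2): I = -((1 - p)^2 / p) * log(p), with p the class proportion."""
    classes, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    importance = -((1.0 - p) ** 2 / p) * np.log(p)
    return dict(zip(classes.tolist(), importance.tolist()))

def wde_laplacians(X, labels, sigma=1.0):
    """Return (L_A, L_B) for samples stored as rows of X (shape n x D)."""
    labels = np.asarray(labels)
    imp = class_importance(labels)
    I = np.array([imp[l] for l in labels.tolist()])   # importance per sample
    pair_imp = (I[:, None] + I[None, :]) / 2.0        # (I_i + I_j) / 2
    same = labels[:, None] == labels[None, :]         # True where l_i == l_j

    # Closeness N_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), computed from the
    # Gram matrix so that no n x n x D array is needed.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dist = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    N = np.exp(-sq_dist / (2.0 * sigma ** 2))

    A = np.where(same, pair_imp, 0.0)       # intra-class weights, Eq. (1)
    B = np.where(same, 0.0, N * pair_imp)   # inter-class weights, Eq. (4)

    L_A = np.diag(A.sum(axis=1)) - A        # L_A = D_A - A
    L_B = np.diag(B.sum(axis=1)) - B        # L_B = D_B - B
    return L_A, L_B

# Toy usage (hypothetical data): three samples of class 0, one of class 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
labels = [0, 0, 0, 1]
L_A, L_B = wde_laplacians(X, labels, sigma=1.0)
```

In this toy set the single class-1 sample receives a much larger importance value than the class-0 samples, which is the behaviour Eq. (2) is meant to produce for rare classes.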


We integrate Eqs. (3) and (5) to form a unified objective function of WDE:

\[ \mathbf{W} = \arg\max_{\mathbf{W}} \; \mathrm{tr}\left( \frac{\mathbf{W}^T \mathbf{L}_B \mathbf{W}}{\mathbf{W}^T \mathbf{L}_A \mathbf{W}} \right). \tag{6} \]

The optimal $\mathbf{W}$ that maximizes the objective function in Eq. (6) is composed of the normalized eigenvectors corresponding to the $d$ largest eigenvalues of the following eigen-decomposition problem:

\[ \mathbf{L}_B \mathbf{w} = \lambda \mathbf{L}_A \mathbf{w}. \tag{7} \]

A high-dimensional data sample $\mathbf{x}_i$ can then be mapped to the subspace by $\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$.

3    RESULTS AND ANALYSIS
To evaluate our approach, we test its performance on the MediaEval 2018 Medico Multimedia Task. The task provides a development set (5,293 samples) and a test set (8,740 samples). For each sample, we use six types of features: the 168-D JCD feature; the 18-D Tamura feature; the 33-D ColorLayout feature; the 80-D EdgeHistogram feature; the 256-D AutoColorCorrelogram feature; and the 630-D PHOG feature. The total dimension is 1,185.

We participate in two subtasks: 1) Classification of diseases and findings; and 2) Fast and efficient classification. For both subtasks, we submit five runs:
      ∙ For Run 1 (on both subtasks), we use all the data from the development set for training;
      ∙ For Run 2 (on both subtasks), we randomly select 50% of the data of each class from the development set for training;
      ∙ For Run 3 (on both subtasks), we randomly select 50% of the data of each class from the development set, together with the remaining data in the “out-of-patient” and “instruments” classes, for training;
      ∙ For Run 4 (on both subtasks), we randomly select 25% of the data of each class from the development set for training;
      ∙ For Run 5 (on both subtasks), we randomly select 25% of the data of each class from the development set, together with the remaining data in the “out-of-patient” and “instruments” classes, for training.

In the training stage, we use the training data to learn the transformation matrix $\mathbf{W}$ via WDE. We set $\sigma = 1$ and the subspace dimension $d = 50$. In the test stage, we use the obtained $\mathbf{W}$ to map both training and test data to the 50-D subspace, and then use the cost-sensitive nearest neighbor (CS-NN) method for the final classification in the learned subspace. The cost of misclassifying data of class $c$ ($c = 1, \cdots, C$) as another class is defined as $cost_c = n/n_c$, with $n$ and $n_c$ being the total number of training samples and the number of training samples in class $c$, respectively.

Tables 1 and 2 report the results of our approach on subtask 1 and subtask 2, respectively. Although the accuracy looks good, the overall performance is far from satisfactory, as the results on the other four criteria (recall, precision, F1 score, and Rk) are relatively low. The reason might be that the proposed WDE is a linear mapping method, which is not sufficient to capture the complex discriminant information embedded in the high-dimensional feature space. This motivates us to consider extending our method to the nonlinear case to improve the performance. Furthermore, by comparing Run 2 with Run 3 and Run 4 with Run 5, we observe that even when we use all the data from the minority classes (i.e., the “out-of-patient” and “instruments” classes), the performance is not improved. The reason might be that the number of samples in these two classes is too small to represent the “real” distribution of the classes. One possible solution is to employ oversampling techniques to reasonably and faithfully generate samples for the minority classes.

Table 1: Results of our approach on the first subtask of MediaEval 2018 Medico Multimedia Task.

         Recall    Precision   Accuracy   F1 Score   Rk
Run 1    0.5001    0.4917      0.9471     0.4830     0.5357
Run 2    0.4415    0.4294      0.9384     0.4251     0.4612
Run 3    0.3947    0.3670      0.9320     0.3728     0.4035
Run 4    0.3553    0.3333      0.9256     0.3324     0.3511
Run 5    0.3019    0.2814      0.9186     0.2812     0.2918

Table 2: Results of our approach on the second subtask of MediaEval 2018 Medico Multimedia Task.

         Recall    Precision   Accuracy   F1 Score   Rk
Run 1    0.5005    0.4917      0.9471     0.4830     0.5357
Run 2    0.4181    0.3857      0.9337     0.4251     0.4193
Run 3    0.4259    0.4085      0.9350     0.4040     0.4348
Run 4    0.3430    0.3107      0.9231     0.3135     0.3293
Run 5    0.3257    0.3053      0.9227     0.3057     0.3246

4    CONCLUSION
In this paper, we propose a subspace learning method called weighted discriminant embedding (WDE), which aims to discover the discriminant subspace for imbalanced datasets. After dimensionality reduction, the cost-sensitive nearest neighbor method is used for classification. We plan to extend our work in two directions. First, we will generalize our approach to the nonlinear case to enhance its data representation ability. Second, we will incorporate oversampling methods into our approach to make it more robust on imbalanced learning problems.

ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61503317, in part by the General Research Fund (GRF) from the Research Grants Council (RGC) of Hong Kong SAR under Project HKBU12202417, and in part by the SZSTI Grant with the Project Code JCYJ20170307161544087.
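
For readers who want to experiment with the pipeline, the sketch below illustrates the eigen-decomposition of Eq. (7) and a cost-sensitive nearest neighbor rule using $cost_c = n/n_c$. Several points are assumptions rather than statements of the original implementation: the generalized problem is solved on D×D scatter matrices `S_A` and `S_B` (for samples stored as rows of a matrix X, these would be formed as X^T L_A X and X^T L_B X), a small ridge term is added so that the Cholesky factorization exists, and the class cost is used as a vote weight among the `k` nearest neighbors, since the text does not spell out how the cost enters the decision rule. The function names are hypothetical.

```python
import numpy as np

def top_generalized_eigvecs(S_B, S_A, d, ridge=1e-8):
    """Eigenvectors of S_B w = lambda * S_A w for the d largest lambda."""
    D = S_A.shape[0]
    # Assumption: a small ridge keeps S_A positive definite for Cholesky.
    S_A = S_A + ridge * np.eye(D)
    C = np.linalg.cholesky(S_A)                    # S_A = C @ C.T
    C_inv = np.linalg.inv(C)
    M = C_inv @ S_B @ C_inv.T                      # whitened symmetric problem
    eigvals, U = np.linalg.eigh((M + M.T) / 2.0)   # eigenvalues in ascending order
    W = C_inv.T @ U[:, ::-1][:, :d]                # back-transform the top d
    return W / np.linalg.norm(W, axis=0)           # unit-length columns

def cs_nn_predict(Y_train, labels_train, Y_test, k=5):
    """k-NN vote in the embedded space, each vote weighted by cost_c = n / n_c.

    Assumption: using the cost as a vote weight is one possible reading of
    CS-NN; the source only defines the cost itself.
    """
    labels_train = np.asarray(labels_train)
    classes, counts = np.unique(labels_train, return_counts=True)
    cost = labels_train.size / counts              # cost_c = n / n_c
    class_index = np.searchsorted(classes, labels_train)
    # Squared Euclidean distances between test and training embeddings.
    d2 = ((Y_test[:, None, :] - Y_train[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]        # indices of the k neighbors
    votes = np.zeros((Y_test.shape[0], classes.size))
    for c in range(classes.size):
        votes[:, c] = cost[c] * (class_index[nearest] == c).sum(axis=1)
    return classes[np.argmax(votes, axis=1)]

# Toy usage (hypothetical data).
S_B = np.diag([3.0, 1.0])
S_A = np.eye(2)
W = top_generalized_eigvecs(S_B, S_A, d=1)   # dominant direction is the first axis

Y_train = np.array([[0.0, 0.0], [0.0, 0.1], [0.0, 0.2], [0.0, 0.3], [1.0, 0.0]])
labels_train = [0, 0, 0, 0, 1]
Y_test = np.array([[0.6, 0.0], [-1.0, 0.0]])
pred = cs_nn_predict(Y_train, labels_train, Y_test, k=3)
```

In the toy example, the first test point has two class-0 neighbors and one class-1 neighbor among its three nearest samples. A plain majority vote would choose class 0, whereas the cost weighting (1.25 per class-0 vote, 5 for the class-1 vote) assigns it to the rare class 1. With `k=1` the weighting has no effect, so a different way of applying the cost would be needed in that setting.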


REFERENCES
 [1] M. Belkin and P. Niyogi. 2003. Laplacian Eigenmaps for
     Dimensionality Reduction and Data Representation. Neural
     Comput. 15, 6 (2003), 1373–1396.
 [2] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. 2017. Focal
     Loss for Dense Object Detection. In 2017 IEEE International
     Conference on Computer Vision (ICCV). 2999–3007.
 [3] K. Pogorelov, M. Riegler, P. Halvorsen, T. de Lange, K. R.
     Randel, D.-T. Dang-Nguyen, M. Lux, and O. Ostroukhova.
     Medico Multimedia Task at MediaEval 2018. In Proceedings of
     the MediaEval 2018 Workshop. CEUR-WS, Sophia Antipolis,
     France, 29–31 October, 2018.