=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_39
|storemode=property
|title=Majority Voting of Heterogeneous Classifiers for Finding Abnormalities in the Gastro-Intestinal Tract
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_39.pdf
|volume=Vol-2283
|authors=Zeshan Khan,Muhammad Atif Tahir
|dblpUrl=https://dblp.org/rec/conf/mediaeval/KhanT18
}}
==Majority Voting of Heterogeneous Classifiers for Finding Abnormalities in the Gastro-Intestinal Tract==
<pdf width="1500px">https://ceur-ws.org/Vol-2283/MediaEval_18_paper_39.pdf</pdf>
<pre>
           Majority voting of Heterogeneous Classifiers for finding
                 abnormalities in the Gastro-Intestinal Tract
                                                             Zeshan Khan, Muhammad Atif Tahir
        School of Computer Science, National University of Computer and Emerging Sciences, Karachi Campus, Pakistan
                                             {zeshan.khan,atif.tahir}@nu.edu.pk

ABSTRACT
An endoscopy is a procedure in which a doctor uses specialized
instruments to view and operate on the internal organs and vessels
of the body. This paper aims to detect diseases and abnormalities
in the Gastro-Intestinal Tract using multimedia data. It differs from
other projects in the medical domain because it does not use medical
imaging such as X-rays or CT scans. The training dataset, which
comprises 5293 images, is provided by the MediaEval Benchmarking
Initiative for Multimedia Evaluation. The data is collected during
traditional colonoscopy procedures. Techniques from the fields of
multimedia content analysis (to extract information from the visual
data) and machine learning (for classification) have been used. On
the test data, an accuracy of 98%, an F1 score of 0.75 and an MCC of
0.76 are achieved using majority voting of logistic regression,
random forest, and extra trees classifiers.

1    INTRODUCTION
Medical image diagnosis is one of the most challenging tasks in
computer vision. Most recent work has focused on CT scans, X-rays,
and MRI. The Medico Task of 2018 [5]¹ challenged its participants to
predict abnormalities in the Gastro-Intestinal (GI) tract through
endoscopic examination [4]. The challenge therefore relies on
multimedia images rather than traditional medical images [4]. Deep
analysis of GI tract images can help to predict abnormalities and
diseases in their early stages. A total of 5293 images were used for
training and 8740 were reserved for testing. Several pre-processing
techniques were applied, and machine learning models were trained to
build an accurate system.

¹ http://www.multimediaeval.org/mediaeval2018/medico/index.html
Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

2    APPROACH
Feature engineering is one of the most challenging and important
parts of any machine learning problem. Figure 1 shows the proposed
model. Discriminating features are a prerequisite for function
approximation. The task organizers provided 6 pre-computed visual
features for every image: JCD, Tamura, Color Layout, Edge Histogram,
Auto Color Correlogram and PHOG. Alongside these pre-computed visual
features, deep learning features are also used to extract meaningful
information for classification, since some visual features can be
extracted using deep networks. As a training set of 5293 images is
too small to train a deep learning model, a pre-trained VGG19 model
is used [6]. VGG19 is a very deep convolutional network with 19
weight layers (16 convolutional layers and 3 fully-connected layers)
for large-scale image classification. The model is pre-trained on the
large dataset from the ImageNet challenge, its last 2 layers are
retrained on these medical images, and it is then used to extract
rich visual concepts.

Figure 1: Proposed Model.

Logistic regression [1], random forest [2] and extremely randomized
trees [3] classifiers are trained on the extracted features. There
are two categories of features: the pre-computed texture features and
the VGG features, i.e. those extracted with the pre-trained VGG19
model. The final model is an ensemble that applies weighted majority
voting among all the independent models trained on all features. The
weight of each independent classifier in the ensemble is its
accuracy, expressed as a percentage. Various advanced machine
learning techniques were also investigated, but the best results were
obtained with logistic regression, random forest and extremely
randomized trees, so only these are reported in this paper.

The interesting characteristics of this competition included the
limited data available to train the models and the class imbalance.
Resampling is used to generate more data for each class. It generated
additional instances of each class and resulted in the same number of
instances for each of the 16 available classes. It also enlarged the
training dataset, and the enlarged dataset is used to train and
validate the different models.
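
The weighted majority voting used by the ensemble in this section can
be sketched in a few lines of plain Python. This is only an
illustration under stated assumptions, not the authors' code: the
function name, the example labels and the weight values are
hypothetical, and the weights stand in for the per-classifier
accuracies mentioned in the text.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total weight for one image.

    predictions: one predicted label per classifier.
    weights: one non-negative weight per classifier (assumed here to be
             that classifier's accuracy, as described in the text).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # max() keeps the first label that reaches the top score, so a tie
    # goes to the label that was seen first.
    return max(totals, key=totals.get)


# Hypothetical votes from logistic regression, random forest and
# extremely randomized trees, with made-up accuracy weights.
votes = ["polyps", "esophagitis", "polyps"]
accuracies = [0.90, 0.95, 0.88]
print(weighted_majority_vote(votes, accuracies))  # polyps
```

In scikit-learn, the same idea is available through VotingClassifier
with voting="hard" and a weights list; whether the authors used that
class is not stated in the paper.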
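
The paper does not say which resampling method was used to balance
the 16 classes in Section 2. The sketch below assumes plain random
oversampling with replacement up to the size of the largest class,
which is one common way to obtain equal class counts. The function
name and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def oversample_to_balance(samples, labels, seed=0):
    """Duplicate randomly chosen samples until all classes are equal in size.

    Assumption: random oversampling with replacement; the source only
    states that resampling equalized the number of instances per class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy example: three imbalanced classes with 6, 3 and 1 samples.
toy_samples = list(range(10))
toy_labels = ["a"] * 6 + ["b"] * 3 + ["c"]
balanced_samples, balanced_labels = oversample_to_balance(toy_samples, toy_labels)
print(len(balanced_samples))  # 18
```

If duplicates are created before the data is split into folds, copies
of the same image can end up in both the training and the validation
fold, which tends to inflate validation scores. Resampling inside
each training fold avoids this.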


3    RESULTS AND ANALYSIS
The logistic regression, extremely randomized trees and random forest
models have been implemented using Python's scikit-learn package. We
trained logistic regression, random forest, and extremely randomized
trees on both deep and global features. The results are first
evaluated on the training data using 10-fold cross-validation. By
applying the proposed model, we obtained an accuracy of 97%, an F1
score of 0.90 and an MCC of 0.81 under 10-fold cross-validation of
the training data. Based on this initial investigation, the following
runs were submitted to evaluate the performance of the classifiers
independently. Three of the runs focus on fast result generation and
three on accuracy.
       • Run1 Ensemble of 7 features [JCD, Tamura, Color Layout,
          Edge Histogram, Auto Color Correlogram, PHOG and VGG
          features] trained on 60 images, using voting of the
          logistic regression, random forest and extremely
          randomized trees classification algorithms.
       • Run2 Same as Run1 but trained on 300 images.
       • Run3 Same as Run1 but trained on all 5293 images.
       • Run4 Ensemble of 6 features [JCD, Tamura, Color Layout,
          Edge Histogram, Auto Color Correlogram, and PHOG] trained
          on 60 images, using voting of the logistic regression,
          random forest and extremely randomized trees
          classification algorithms.
       • Run5 Same as Run4 but trained on 300 images.
       • Run6 Same as Run4 but trained on all 5293 images.

Table 1: Accuracy, F1, and MCC of the different runs on the test
data.

                 Accuracy       F1       MCC
        Run1      0.956       0.625     0.614
        Run2      0.957       0.587     0.603
        Run3      0.954       0.549     0.572
        Run4      0.961       0.611     0.597
        Run5      0.976       0.745     0.741
        Run6      0.979       0.752     0.756

    Table 1 summarizes the evaluation measures for all runs. The best
run achieves an accuracy of 97.9%, with an F1 score of 0.75 and an
MCC of 0.76. It is interesting to see that the best run is obtained
using only global features, without any deep learning features. We
will investigate in future why deep features perform poorly. Initial
investigation has indicated that many samples that should belong to
class "ulcerative-colitis" are misclassified as class "esophagitis"
when deep features are used. The best run is Run6, in which all 5293
images are used with an ensemble of 6 features (JCD, Tamura, Edge
Histogram, Color Layout, Auto Color Correlogram and PHOG). Logistic
regression, random forest and extremely randomized trees are used as
classifiers with weighted majority voting. Table 2 is the confusion
matrix summarized over all classes. A total of 1469 samples are
misclassified. Two categories are mainly responsible for the
misclassification: "dyed-lifted-polyps" and "dyed-resection-margins".
Around 500 samples are misclassified in these 2 categories (Tables 3
and 4). Table 5 shows the confusion matrix for the polyps versus
non-polyps class. Overall, performance is satisfactory, but
state-of-the-art texture and local features still need to be
investigated to further improve the performance.

Table 2: Confusion matrix summarized over all 16 classes.

        Predicted / Actual           ALL       non-ALL
        ALL                         7271          1469
        non-ALL                     1469        129631

Table 3: Confusion matrix for class dyed-lifted-polyps versus
non-dyed-lifted-polyps. df = dyed-lifted.

        Predicted / Actual           df-polyps    non-df-polyps
        dyed-lifted-polyps              339            236
        non-dyed-lifted-polyps          217           7948

Table 4: Confusion matrix for class dyed-resection-margins versus
non-dyed-resection-margins. dr = dyed-resection.

        Predicted class / Actual class   dr-margins   non-dr-margins
        dyed-resection-margins              387            232
        non-dyed-resection-margins          177           7944

Table 5: Confusion matrix for class polyps versus non-polyps.

        Predicted / Actual           polyps    non-polyps
        polyps                         241         281
        non-polyps                     133        8085

4     CHALLENGES AND FUTURE WORK
The results for many classes are quite accurate. However, some
classes still confuse the system. Future work will target these
classes hierarchically and improve performance using local features.

5     CONCLUSION
A model to classify gastro-intestinal abnormalities using endoscopic
images is presented. Training (5293 samples) and testing (8740
samples) data were provided by the MediaEval Benchmarking Initiative
for Multimedia Evaluation. As mentioned in the introduction, the
study used multimedia content analysis, machine learning and ensemble
learning techniques for classification. The best results were
obtained with majority voting of three models (logistic regression,
random forest and extremely randomized trees) on 6 different features
(JCD, Tamura, Color Layout, Edge Histogram, Auto Color Correlogram
and PHOG), which resulted in an accuracy of 97.9% with an F1 score of
0.75 and an MCC of 0.76 on the test data.
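
As a worked check of the reported measures, the sketch below
recomputes accuracy, precision, recall, F1 and MCC from the counts
printed in Tables 2 to 5. It assumes that the first row of each table
holds the actual positives (true positives, then false negatives) and
the second row the actual negatives (false positives, then true
negatives). The table headers do not make the orientation explicit,
so this is an assumption, but F1, MCC and accuracy are unchanged if
false positives and false negatives are swapped. The function and
variable names are hypothetical.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Standard metrics for a 2x2 confusion matrix (non-degenerate counts)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Counts copied from the tables as (tp, fn, fp, tn), under the row
# orientation assumed in the text above.
TABLES = {
    "Table 2 (all classes)": (7271, 1469, 1469, 129631),
    "Table 3 (dyed-lifted-polyps)": (339, 236, 217, 7948),
    "Table 4 (dyed-resection-margins)": (387, 232, 177, 7944),
    "Table 5 (polyps)": (241, 281, 133, 8085),
}

for name, counts in TABLES.items():
    metrics = binary_metrics(*counts)
    summary = ", ".join(f"{key}={value:.3f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

The four cells of Table 2 sum to 139840 = 16 x 8740, which suggests
that the table pools one-vs-rest decisions over the 16 classes for
the 8740 test images. Under that reading the pooled accuracy is
(7271 + 129631) / 139840, about 0.979, in line with Run6 in Table 1.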


REFERENCES
[1] Christopher M. Bishop. 2006. Pattern Recognition and Machine
    Learning. Springer.
[2] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (01 Oct
    2001), 5–32.
[3] Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely
    randomized trees. Machine Learning 63, 1 (2006), 3–42.
[4] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz,
    Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto
    Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin
    Schmidt, Michael Riegler, and Pål Halvorsen. 2017. KVASIR: A
    Multi-Class Image Dataset for Computer Aided Gastrointestinal
    Disease Detection. In Proceedings of the 8th ACM on Multimedia
    Systems Conference (MMSys’17). ACM, 164–169.
[5] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de
    Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux,
    and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval
    2018. In MediaEval’18, 29-31 October 2018, Sophia Antipolis, France.
[6] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional
    Networks for Large-Scale Image Recognition. CoRR (2014).

</pre>