Multimedia for Medicine: The Medico Task at MediaEval 2017

Michael Riegler1, Konstantin Pogorelov1,2, Pål Halvorsen1,2, Kristin Ranheim Randel2,3, Sigrun Losada Eskeland4, Duc-Tien Dang-Nguyen5, Mathias Lux6, Carsten Griwodz1,2, Concetto Spampinato7, Thomas de Lange3

1 Simula Research Laboratory, Norway; 2 University of Oslo, Norway; 3 Cancer Registry of Norway, Norway; 4 Vestre Viken Hospital Trust, Norway; 5 Dublin City University, Ireland; 6 University of Klagenfurt, Austria; 7 University of Catania, Italy

michael@simula.no, konstantin@simula.no

MediaEval'17, 13-15 September 2017, Dublin, Ireland. Copyright held by the owner/author(s).

ABSTRACT

The Multimedia for Medicine Medico Task, running for the first time as part of MediaEval 2017, focuses on detecting abnormalities, diseases and anatomical landmarks in images captured by medical devices in the gastrointestinal tract. The task characteristics are described, including the use case and its challenges, the dataset with ground truth, the required participant runs and the evaluation metrics.

1 INTRODUCTION

The Medico task tackles the challenge of predicting diseases based on multimedia data collected in hospitals, with the additional requirements to use as little training data as possible, to perform the analysis efficiently in terms of processing time, and to generate automatic text reports (summaries) of the findings. The task differs from well-known medical imaging tasks like the ImageCLEF medical tasks (http://www.imageclef.org/) [1, 7] in that it (i) has only multimedia data (videos and images) and no medical imaging data (CT scans, etc.), (ii) asks for using as little training data as possible, and (iii) also evaluates the approaches regarding processing time. Furthermore, the automatically generated reports are a novel part of the task, but since they are very hard to evaluate, this subtask is experimental this year.

It is a typical assumption that visual analysis, as it is already provided by the computer vision and medical image processing communities today, is sufficient to solve healthcare multimedia challenges [6]. Although we concede that these methods are indeed essential contributors to promising approaches, we have come to the understanding that analysing images and videos alone does not solve the challenges in medical fields such as endoscopy or ultrasound, because of the task complexity and the needs of both medical experts and patients. Neither does it make serious use of the multitude of additional information sources, including sensors and temporal information [3, 8, 9].

The Medico task is designed to help improve the health care system through the application of multimedia research knowledge and methods, to reach the next level of computer- and multimedia-assisted diagnosis, detection and interpretation of abnormalities. This is useful in multiple scenarios. For example, in some areas of the human body, such as the gastrointestinal (GI) tract, the detection of abnormalities and diseases in early stages can significantly improve the chance of successful treatment and survival. Through endoscopic examinations (insertion of a camera into the gastrointestinal tract), diseases can be detected visually, even before they become symptomatic. This is particularly the case for colorectal cancer (in the large bowel) or its cancer precursors (colorectal polyps), which can be detected through colonoscopy or capsule endoscopy. The challenge is, however, that both medical experts and machines currently fail to detect all polyps [6]. Moreover, previous research in this area, in computer vision and medical imaging, has created visual augmentations of the interior of the body. To automatically detect and locate abnormalities, visual representations alone are not sufficient. There is a need for image and video processing, analysis, information search and retrieval, in combination with other sensor data and assistance from medical experts, and it all needs integration [5].

Here, participants are asked to look beyond computer vision and medical imaging to show the potential of multimedia research going far beyond well-known scenarios like the analysis of content on YouTube and Flickr. For this detection task, we provide Kvasir, a large public dataset [4] containing videos and images from the GI tract showing different diseases and anatomical landmarks. The ground truth is provided by medical experts (specialists in GI endoscopy) annotating the dataset, and the data is split into training and test data. Based on this, the participants are asked to solve four subtasks, where the first two are mandatory and the last two are optional: (i) classify diseases with as few images in the training dataset as possible; (ii) solve the classification problem in a fast and efficient way; (iii) run the second task on the same hardware (supported platforms are Linux, macOS and Windows); and (iv) automatically create a text report for a medical doctor for three video cases. The task can be tackled by leveraging techniques from multiple multimedia-related disciplines, including (but not limited to) machine learning (classification), multimedia content analysis and multimodal fusion.

2 DATASET DETAILS

The Kvasir dataset (http://datasets.simula.no/kvasir/) [4] consists of 8,000 GI tract images that are annotated and verified by medical doctors (experienced endoscopists) for the ground truth. It includes 8 classes showing anatomical landmarks, pathological findings or endoscopic procedures in the GI tract, i.e., 1,000 images for each class, split into 500 for training and 500 for testing.
The anatomical landmarks are Z-line, pylorus and cecum, while the pathological findings include esophagitis, polyps and ulcerative colitis. In addition, we provide two sets of images related to the removal of polyps, the dyed and lifted polyp and the dyed resection margins. The dataset consists of images with different resolutions, from 720x576 up to 1920x1072 pixels, and is organized by sorting the images into separate folders named according to their content. Some of the included images have a green sub-picture illustrating the position and configuration of the endoscope inside the bowel, delivered by an electromagnetic imaging system (ScopeGuide, Olympus Europe). This sub-picture may support the interpretation of the image. As mentioned before, the whole dataset is split into two equally sized development and test datasets. Both the development and the test datasets consist of 4,000 images, 500 images for each class, stored in two archives: an images archive and a features archive. In the development dataset, the images are stored in separate folders named according to the classes the images belong to. In the test dataset, all the images are stored in one folder. The image files are encoded using JPEG compression. The encoding settings can vary across the dataset, and they reflect the a priori unknown endoscopic equipment settings.

Furthermore, the features archive contains the extracted visual feature descriptors for all the images in the images archive. The extracted visual features are global image features, i.e., JCD, Tamura, ColorLayout, EdgeHistogram, AutoColorCorrelogram and PHOG. Each feature vector consists of a number of floating point values, where the size of the vector depends on the feature: 168 (JCD), 18 (Tamura), 33 (ColorLayout), 80 (EdgeHistogram), 256 (AutoColorCorrelogram) and 630 (PHOG) [2]. The extracted visual features are stored in separate folders and text files named according to the name and path of the corresponding image files. Each file consists of six lines, one line per feature, where a line consists of a feature name separated from the feature vector by a colon. Each feature vector consists of the corresponding number of floating point values separated by commas. The file extension of the extracted visual feature files is ".features".
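As a concrete illustration, a parser for the documented file format (one line per feature, feature name separated from comma-separated floats by a colon) might look as follows. This is a sketch, not official task code; the exact feature-name spellings inside the files are assumptions.

```python
# Expected vector lengths per feature, as listed in the dataset description.
# The dictionary keys are assumed spellings of the feature names.
EXPECTED_SIZES = {
    "JCD": 168, "Tamura": 18, "ColorLayout": 33,
    "EdgeHistogram": 80, "AutoColorCorrelogram": 256, "PHOG": 630,
}

def parse_features_file(path):
    """Parse one ".features" file: return a dict feature name -> list of floats."""
    features = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each non-empty line is "FeatureName:v1,v2,...,vn".
            name, _, values = line.partition(":")
            features[name] = [float(v) for v in values.split(",")]
    return features
```

A loaded file can then be sanity-checked against EXPECTED_SIZES before using the vectors as classifier input.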
For the automatic report generation, we use three videos depicting diseases or findings that can be found in the Kvasir dataset. The goal is to generate reports describing the three videos for medical experts, having automatic report generation in mind.

3 EVALUATION METRICS AND TASKS

For the evaluation of detection accuracy, we use several standard metrics (more detailed descriptions can be found on the task web page). True positive (TP) is the number of correctly identified positive samples. True negative (TN) is the number of correctly identified negative samples. False positive (FP) is the number of samples wrongly identified as positive. False negative (FN) is the number of samples wrongly identified as negative. Recall (frequently called sensitivity) is the ratio of samples correctly identified as positive among all existing positive samples. Precision is the ratio of samples correctly identified as positive among all returned samples. Specificity is the ratio of negatives correctly identified as negatives. Accuracy is the percentage of correctly identified true and false samples. The Matthews correlation coefficient (MCC) takes into account true and false positives and negatives, and is a balanced measure even if the classes are of very different sizes. The F1 score measures a test's accuracy by calculating the harmonic mean of precision and recall. We also evaluate the amount of training data that has been used to achieve good results and the speed (processing performance) of the classification.
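The metric definitions above can be written directly in terms of the four counts. A minimal sketch (helper names are our own; degenerate zero-denominator cases are not handled except in the MCC):

```python
def recall(tp, fn):            # sensitivity: positives found among all positives
    return tp / (tp + fn)

def precision(tp, fp):         # positives found among all returned positives
    return tp / (tp + fp)

def specificity(tn, fp):       # negatives correctly identified among all negatives
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):  # fraction of all samples identified correctly
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):      # harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def mcc(tp, tn, fp, fn):       # Matthews correlation coefficient (binary case)
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0
```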
For the evaluation, the participants must submit one run for each of the required subtasks defined below. Additionally, they can optionally submit three more runs for any of the described subtasks, i.e., participants can submit up to five runs in total.

Required subtasks. The detection subtask is a multi-class classification task for diseases in the GI tract. Participants have to use the visual information in the provided dataset, where the goal is to maximize the algorithm's performance in terms of detection accuracy, and where the amount of training data is also taken into account. Detection is evaluated based on the metrics above (all should be reported), but the ranking is made using the MCC and the amount of training data used. The official metric is a multi-class generalization of the MCC. This generalization is called the R_K statistic (for K different classes) and is defined in terms of a K x K confusion matrix. The R_K statistic is in essence a correlation coefficient between the observed and predicted classifications for the K different classes; it returns a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 corresponds to prediction no better than random, and a value below 0 indicates disagreement between prediction and observation (lower values correspond to stronger disagreement). The minimum value of the R_K statistic lies between -1 and 0, depending on the true distribution, while the maximum value is always +1.
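Under the standard (Gorodkin) formulation of the multi-class MCC, the R_K statistic can be computed from the confusion matrix as sketched below. Variable names are ours, and this is an illustration of the statistic, not the official evaluation code.

```python
def r_k_statistic(cm):
    """R_K statistic from a K x K confusion matrix cm
    (rows = true classes, columns = predicted classes)."""
    k = len(cm)
    s = sum(sum(row) for row in cm)           # total number of samples
    c = sum(cm[i][i] for i in range(k))       # correctly classified samples
    t = [sum(row) for row in cm]              # samples per true class
    p = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # per predicted class
    num = c * s - sum(t[i] * p[i] for i in range(k))
    den = ((s * s - sum(x * x for x in p))
           * (s * s - sum(x * x for x in t))) ** 0.5
    return num / den if den else 0.0          # convention: 0 when undefined
```

For K = 2 this reduces to the binary MCC, and a perfect (diagonal) confusion matrix yields +1.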
The efficient detection subtask addresses the speed of the classification. The classification of diseases has to be achieved as fast as possible in terms of data processing, using any computation speed-up techniques. The goal is to find a balance between the algorithm's performance in terms of detection accuracy and its performance in terms of data processing speed, keeping in mind that the problem area requires real-time processing while lacking data. For the evaluation, the processing time is weighted by the detection accuracy.

Optional subtasks. The efficient detection on the same hardware subtask is the same as the efficient detection subtask above, but all submitted solutions are tested on the same hardware. The organizers run the code provided by the participants on the same hardware, and the evaluation is again based on the processing time weighted by the detection accuracy.

The experimental report generation subtask asks the participants to automatically create a text report for a medical doctor describing the detection results for three video cases. A definition of what a text report is, what it should contain (a list of requirements) and a description of what the medical experts do with the report is provided. The assessment then follows the list of requirements, and the report is assessed manually by two of our medical partners in terms of its usefulness in the medical context and whether it satisfies existing demands for the documentation of endoscopic procedures.

4 DISCUSSION AND OUTLOOK

The task itself can be seen as very challenging, hard to solve and hard to evaluate. Due to its novel use case, we hope to motivate many researchers to look into the field of medical multimedia. Performing research that can have societal impact will be an important part of multimedia research in the future. We hope that the Medico task can help raise awareness of the topic, but also provide an interesting and meaningful use case to researchers.

REFERENCES

[1] Bogdan Ionescu, Henning Müller, Mauricio Villegas, Helbert Arenas, Giulia Boato, Duc-Tien Dang-Nguyen, Yashin Dicente Cid, Carsten Eickhoff, Alba Garcia Seco de Herrera, Cathal Gurrin, Bayzidul Islam, Vassili Kovalev, Vitali Liauchuk, Josiane Mothe, Luca Piras, Michael Riegler, and Immanuel Schwall. 2017. Overview of ImageCLEF 2017: Information extraction from images. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 8th International Conference of the CLEF Association, CLEF 2017 (LNCS 10439). Springer.

[2] Mathias Lux and Savvas A. Chatzichristofis. 2008. LIRE: Lucene Image Retrieval: An extensible Java CBIR library. In Proceedings of the 16th ACM International Conference on Multimedia. ACM, 1085-1088.

[3] Konstantin Pogorelov, Sigrun Losada Eskeland, Thomas de Lange, Carsten Griwodz, Kristin Ranheim Randel, Håkon Kvale Stensland, Duc-Tien Dang-Nguyen, Concetto Spampinato, Dag Johansen, Michael Riegler, and others. 2017. A holistic multimedia system for gastrointestinal tract disease detection. In Proceedings of the 8th ACM Multimedia Systems Conference (MMSys). ACM, 112-123.
[4] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM Multimedia Systems Conference (MMSys). ACM, 164-169.

[5] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Peter Thelin Schmidt, Carsten Griwodz, Dag Johansen, Sigrun Losada Eskeland, and Thomas de Lange. 2016. GPU-accelerated real-time gastrointestinal diseases detection. In Proceedings of the IEEE International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 185-190.

[6] Michael Riegler, Mathias Lux, Carsten Griwodz, Concetto Spampinato, Thomas de Lange, Sigrun L. Eskeland, Konstantin Pogorelov, Wallapak Tavanapong, Peter T. Schmidt, Cathal Gurrin, Dag Johansen, Håvard Johansen, and Pål Halvorsen. 2016. Multimedia and medicine: Teammates for better disease detection and survival. In Proceedings of the 2016 ACM Multimedia Conference (ACM MM). ACM, 968-977.

[7] Mauricio Villegas, Henning Müller, Alba García Seco de Herrera, Roger Schaer, Stefano Bromuri, Andrew Gilbert, Luca Piras, Josiah Wang, Fei Yan, Arnau Ramisa, and others. 2016. General overview of ImageCLEF at the CLEF 2016 labs. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages (LNCS 9822). Springer, 267-285.

[8] Yi Wang, Wallapak Tavanapong, Johnny Wong, JungHwan Oh, and Piet C. De Groen. 2011. Computer-aided detection of retroflexion in colonoscopy. In Proceedings of the 24th International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 1-6.

[9] Yi Wang, Wallapak Tavanapong, Johnny Wong, JungHwan Oh, and Piet C. De Groen. 2015. Polyp-Alert: Near real-time feedback during colonoscopy. Computer Methods and Programs in Biomedicine 120, 3 (2015), 164-179.