=Paper= {{Paper |id=Vol-1984/Mediaeval_2017_paper_3 |storemode=property |title=Multimedia for Medicine: The Medico Task at MediaEval 2017 |pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_3.pdf |volume=Vol-1984 |authors=Michael Riegler,Konstantin Pogorelov,Pål Halvorsen,Kristin Ranheim Randel,Sigrun Losada Eskeland,Duc-Tien Dang-Nguyen,Mathias Lux,Carsten Griwodz,Concetto Spampinato,Thomas de Lange |dblpUrl=https://dblp.org/rec/conf/mediaeval/RieglerPHREDLGS17 }} ==Multimedia for Medicine: The Medico Task at MediaEval 2017== https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_3.pdf

Multimedia for Medicine: The Medico Task at MediaEval 2017
Michael Riegler1, Konstantin Pogorelov1,2, Pål Halvorsen1,2, Kristin Ranheim Randel2,3, Sigrun Losada Eskeland4, Duc-Tien Dang-Nguyen5, Mathias Lux6, Carsten Griwodz1,2, Concetto Spampinato7, Thomas de Lange3
1 Simula Research Laboratory, Norway
2 University of Oslo, Norway
3 Cancer Registry of Norway, Norway
4 Vestre Viken Hospital Trust, Norway
5 Dublin City University, Ireland
6 University of Klagenfurt, Austria
7 University of Catania, Italy

michael@simula.no,konstantin@simula.no
Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
The Multimedia for Medicine Medico Task, running for the first time as part of MediaEval 2017, focuses on detecting abnormalities, diseases and anatomical landmarks in images captured by medical devices in the gastrointestinal tract. The task characteristics are described, including the use case and its challenges, the dataset with ground truth, the required participant runs and the evaluation metrics.

1 INTRODUCTION
The Medico task tackles the challenge of predicting diseases based on multimedia data collected in hospitals, with the additional requirements to use as little training data as possible, to perform the analysis efficiently with regard to processing time, and to generate automatic text reports (summaries) of the findings. The task differs from well-known medical imaging tasks like the ImageCLEF medical tasks (http://www.imageclef.org/) [1, 7] in that it (i) has only multimedia data (videos and images) and no medical imaging data (CT scans, etc.), (ii) asks for using as little training data as possible and (iii) also evaluates the approaches regarding processing time. Furthermore, the automatically generated reports are a novel part of the task, but since they are very hard to evaluate, this subtask is experimental this year.

It is a typical assumption that visual analysis, as it is already provided by the computer vision and medical image processing communities today, is sufficient to solve healthcare multimedia challenges [6]. Although we concede that these methods are indeed essential contributors to promising approaches, we have come to the understanding that analysing images and videos alone does not solve the challenges in medical fields such as endoscopy or ultrasound, because of the task complexity and the needs of both medical experts and patients. Neither does it make serious use of the multitude of additional information sources, including sensors and temporal information [3, 8, 9].

The Medico task is designed to help improve the health care system through the application of multimedia research knowledge and methods to reach the next level of computer- and multimedia-assisted diagnosis, detection and interpretation of abnormalities. This is useful in multiple scenarios. For example, in some areas of the human body, such as the gastrointestinal (GI) tract, the detection of abnormalities and diseases in early stages can significantly improve the chance of successful treatment and survival. Through endoscopic examinations (insertion of a camera in the gastrointestinal tract), diseases can be detected visually, even before they become symptomatic. This is particularly the case for colorectal cancer (in the large bowel) or its cancer precursors (colorectal polyps), which can be detected through colonoscopy or capsule endoscopy. The challenge is, however, that both medical experts and machines currently fail to detect all polyps [6]. Moreover, in previous research in this area, computer vision and medical imaging have created visual augmentations of the interior of a body. To automatically detect and locate abnormalities, visual representations are not sufficient. There is a need for image and video processing, analysis, information search and retrieval, in combination with other sensor data and assistance from medical experts, and it all needs integration [5].

Here, participants are asked to look beyond computer vision and medical imaging to show the potential of multimedia research going far beyond well-known scenarios like analysis of content on YouTube and Flickr. For this detection task, we provide Kvasir, a large public dataset [4] containing videos and images from the GI tract showing different diseases and anatomical landmarks. The ground truth is provided by medical experts (specialists in GI endoscopy) annotating the dataset, and the data is split into training and test data. Based on this, the participants are asked to solve four subtasks, of which the first two are mandatory and the last two are optional: (i) classify diseases with as few images in the training dataset as possible; (ii) solve the classification problem in a fast and efficient way; (iii) run the second task on the same hardware (supported platforms are Linux, macOS and Windows); and (iv) automatically create a text report for a medical doctor for three video cases. The task can be addressed by leveraging techniques from multiple multimedia-related disciplines, including (but not limited to) machine learning (classification), multimedia content analysis and multimodal fusion.

2 DATASET DETAILS
The Kvasir dataset (http://datasets.simula.no/kvasir/) [4] consists of 8,000 GI tract images that are annotated and verified by medical doctors (experienced endoscopists) for the ground truth. It includes 8 classes showing anatomical landmarks, pathological findings or endoscopic procedures in the GI tract, i.e., 1,000 images for each class, split into 500 for training and 500 for testing. The anatomical landmarks are Z-line, pylorus and cecum, while the pathological findings include esophagitis, polyps and ulcerative colitis. In addition, we provide two sets of images related to removal of polyps, the dyed and lifted polyp and the dyed resection margins. The dataset consists of images with different resolutions from 720x576 up to 1920x1072 pixels and is organized by sorting them into separate folders named according to the content. Some of the included images have a green sub-picture in the image illustrating the position and configuration of the endoscope inside the bowel, delivered from an electromagnetic imaging system (ScopeGuide, Olympus Europe). This sub-picture may support the interpretation of the image.

As mentioned before, the whole dataset is split into two equally sized development and test datasets. Both the development and the test datasets consist of 4,000 images, 500 images for each class, stored in two archives: an images archive and a features archive. In the development dataset, the images are stored in separate folders named according to the classes that the images belong to. In the test dataset, all the images are stored in one folder. The image files are encoded using JPEG compression. The encoding settings can vary across the dataset, and they reflect the a priori unknown endoscopic equipment settings.

Furthermore, the features archive contains the extracted visual feature descriptors for all the images in the images archive. The extracted visual features are the global image features, i.e., JCD, Tamura, ColorLayout, EdgeHistogram, AutoColorCorrelogram and PHOG. Each feature vector consists of a number of floating point values. The size of the vector depends on the feature. The sizes of the feature vectors are: 168 (JCD), 18 (Tamura), 33 (ColorLayout), 80 (EdgeHistogram), 256 (AutoColorCorrelogram) and 630 (PHOG) [2]. The extracted visual features are stored in separate folders and text files named according to the name and the path of the corresponding image files. Each file consists of six lines, one line per feature, and each line consists of a feature name separated from the feature vector by a colon. Each feature vector consists of a corresponding number of floating point values separated by commas. The extension of the extracted visual feature files is ".features".

For the automatic report generation, we use three videos depicting diseases or findings that can be found in the Kvasir dataset. The goal is to generate reports describing the three videos for medical experts, with automatic report generation in mind.

3 EVALUATION METRICS AND TASKS
For the evaluation of detection accuracy, we use several standard metrics (more detailed descriptions are on the task web page). True positive represents the number of correctly identified positive samples. True negative shows the number of correctly identified negative samples. False positive is the number of samples wrongly identified as positive. False negative denotes the number of samples wrongly identified as negative. Recall (frequently called sensitivity) is the ratio of samples that are correctly identified as positive among all existing positive samples. Precision shows the ratio of samples that are correctly identified as positive among the returned samples. Specificity represents the ratio of negatives that are correctly identified as negatives. Accuracy is the percentage of correctly identified positive and negative samples. Matthews correlation coefficient (MCC) takes into account true and false positives and negatives, and is a balanced measure even if the classes are of very different sizes. F1 score is a measure of a test’s accuracy, calculated as the harmonic mean of the precision and recall. We also evaluate the amount of training data that has been used to achieve good results and the speed (processing performance) of the classification. For the evaluation, the participants must submit one run for each of the required subtasks defined below. Additionally, they can optionally submit three more for any of the described subtasks, i.e., participants can submit up to five runs in total.

Required subtasks. The detection subtask is a task for multi-class classification of diseases in the GI tract. Participants have to use visual information in the provided dataset, where the goal is to maximize the algorithm’s performance in terms of detection accuracy, and where the amount of training data is also taken into account. Detection is evaluated based on the metrics above (all should be reported), but a ranking is made using MCC and the amount of training data used. The official metric is a multi-class generalization of the MCC. This generalization is called the R_K statistic (for K different classes) and is defined in terms of a K × K confusion matrix. The R_K statistic is in essence a correlation coefficient between the observed and predicted classifications (for K different classes); it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 corresponds to no better than random prediction and a value < 0 indicates disagreement between prediction and observation (a lower negative value corresponds to stronger disagreement). The minimum value of the R_K statistic lies between −1 and 0, depending on the true distribution. The maximum value is always +1.

The efficient detection subtask addresses the speed of the classification. The classification of diseases has to be achieved as fast as possible in terms of data processing, using any computation speed-up techniques. The goal is to find a balance between the algorithm’s performance in terms of detection accuracy and the performance in terms of data processing speed, while keeping in mind that the problem area requires real-time processing while lacking data. For the evaluation, the processing time weighted by the detection accuracy is used.

Optional subtasks. The efficient detection on same hardware subtask is the same as the efficient detection subtask above, but all submitted solutions are tested on the same hardware. The organizers run the code provided by the participants on the same hardware, and the evaluation is again based on the processing time weighted by detection accuracy.

The experimental report generation subtask asks the participants to automatically create a text report for a medical doctor describing the detection results for three video cases. A definition of what a text report is, what it should contain (list of requirements) and a description of what the medical experts do with the report is provided. The assessment then follows the list of requirements, and the report is assessed manually by two of our medical partners in terms of usefulness in the medical context and whether it satisfies existing demands for documentation of endoscopic procedures.

4 DISCUSSION AND OUTLOOK
The task itself can be seen as very challenging, hard to solve and hard to evaluate. Due to its novel use case, we hope to motivate a lot of researchers to have a look into the field of medical multimedia. Performing research that can have societal impact will be an important part of multimedia research in the future. We hope that the Medico task can help to raise awareness of the topic but also provide an interesting and meaningful use case to researchers.
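
As an illustration of the feature file layout described in Section 2 (six lines per ".features" file, each holding a feature name, a colon, and comma-separated floating point values), the following is a minimal Python sketch of a reader. It is not part of the task tooling: the function names `parse_features`, `check_sizes` and `load_features` are hypothetical, and the exact spelling of the feature names inside the files is an assumption based on the names listed in the text.

```python
from pathlib import Path

# Vector lengths as listed in the dataset description.
# Assumption: the names in the files are spelled exactly like this.
EXPECTED_SIZES = {
    "JCD": 168,
    "Tamura": 18,
    "ColorLayout": 33,
    "EdgeHistogram": 80,
    "AutoColorCorrelogram": 256,
    "PHOG": 630,
}


def parse_features(text):
    """Parse the content of one '.features' file into {name: [floats]}."""
    features = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only: name on the left, vector on the right.
        name, _, values = line.partition(":")
        features[name.strip()] = [float(v) for v in values.split(",") if v.strip()]
    return features


def check_sizes(features, expected=EXPECTED_SIZES):
    """Return the feature names whose vector length differs from the listed size."""
    return [name for name, size in expected.items()
            if len(features.get(name, [])) != size]


def load_features(path):
    """Read and parse a single '.features' file from disk."""
    return parse_features(Path(path).read_text())
```

Calling `check_sizes(load_features(path))` on a file would return an empty list if all six vectors have the documented lengths, which can serve as a quick sanity check before training.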
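
The per-class metrics defined in Section 3 can be written down compactly from the four confusion counts. The sketch below uses the standard textbook formulas; the function name `binary_metrics` is hypothetical, accuracy is returned as a fraction rather than a percentage, and reporting 0.0 for a ratio with a zero denominator is a convention assumed here, not something the task description specifies.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute standard detection metrics from true/false positive/negative counts."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)            # sensitivity
    precision = ratio(tp, tp + fp)
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = ratio(2 * precision * recall, precision + recall)
    # Matthews correlation coefficient for the binary case.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }
```

For a multi-class problem such as the eight Kvasir classes, one common approach is to compute these values per class in a one-vs-rest fashion; whether the official evaluation does so is not stated in the text.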

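
The R_K statistic named in Section 3 as the official metric can be computed directly from a K × K confusion matrix. The sketch below follows the commonly used formulation of the multi-class MCC; the function name `rk_statistic`, the row/column orientation of the matrix, and the choice to return 0.0 when the denominator is zero are assumptions made for illustration and may differ from the organizers' evaluation code.

```python
import math


def rk_statistic(confusion):
    """Multi-class MCC (R_K) from a K x K confusion matrix.

    Assumed orientation: confusion[i][j] is the number of samples of
    true class i that were predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(row) for row in confusion]
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    # Covariance-like terms between the true and predicted class assignments.
    cov_tp = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    cov_tt = total * total - sum(t * t for t in true_counts)
    cov_pp = total * total - sum(p * p for p in pred_counts)

    denom = math.sqrt(cov_tt * cov_pp)
    # Assumed convention: the statistic is reported as 0.0 when it is undefined,
    # e.g. when every sample is predicted as the same class.
    return cov_tp / denom if denom else 0.0
```

For K = 2 this reduces to the binary MCC, and a purely diagonal matrix gives +1, matching the properties stated in the text.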
REFERENCES
[1] Bogdan Ionescu, Henning Müller, Mauricio Villegas, Helbert Arenas, Giulia Boato, Duc-Tien Dang-Nguyen, Yashin Dicente Cid, Carsten Eickhoff, Alba Garcia Seco de Herrera, Cathal Gurrin, Bayzidul Islam, Vassili Kovalev, Vitali Liauchuk, Josiane Mothe, Luca Piras, Michael Riegler, and Immanuel Schwall. 2017. Overview of ImageCLEF 2017: Information extraction from images. In Experimental IR Meets Multilinguality, Multimodality, and Interaction 8th International Conference of the CLEF Association, CLEF 2017 (LNCS 10439). Springer.
[2] Mathias Lux and Savvas A Chatzichristofis. 2008. Lire: lucene image retrieval: an extensible java cbir library. In Proceedings of the 16th ACM international conference on Multimedia. ACM, 1085–1088.
[3] Konstantin Pogorelov, Sigrun Losada Eskeland, Thomas de Lange, Carsten Griwodz, Kristin Ranheim Randel, Håkon Kvale Stensland, Duc-Tien Dang-Nguyen, Concetto Spampinato, Dag Johansen, Michael Riegler, and others. 2017. A holistic multimedia system for gastrointestinal tract disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSYS). ACM, 112–123.
[4] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Kvasir: A Multi-Class Image Dataset for Computer Aided Gastrointestinal Disease Detection. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSYS). ACM, 164–169.
[5] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Peter Thelin Schmidt, Carsten Griwodz, Dag Johansen, Sigrun Losada Eskeland, and Thomas de Lange. 2016. GPU-accelerated real-time gastrointestinal diseases detection. In Proceedings of the IEEE International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 185–190.
[6] Michael Riegler, Mathias Lux, Carsten Griwodz, Concetto Spampinato, Thomas de Lange, Sigrun L Eskeland, Konstantin Pogorelov, Wallapak Tavanapong, Peter T Schmidt, Cathal Gurrin, Dag Johansen, Håvard Johansen, and Pål Halvorsen. 2016. Multimedia and Medicine: Teammates for better disease detection and survival. In Proceedings of the 2016 ACM Multimedia Conference (ACM MM). ACM, 968–977.
[7] Mauricio Villegas, Henning Müller, Alba García Seco de Herrera, Roger Schaer, Stefano Bromuri, Andrew Gilbert, Luca Piras, Josiah Wang, Fei Yan, Arnau Ramisa, and others. 2016. General overview of imageCLEF at the CLEF 2016 labs. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages (LNCS 9822). Springer, 267–285.
[8] Yi Wang, Wallapak Tavanapong, Johnny Wong, JungHwan Oh, and Piet C De Groen. 2011. Computer-aided detection of retroflexion in colonoscopy. In Proceedings of the 24th International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 1–6.
[9] Yi Wang, Wallapak Tavanapong, Johnny Wong, Jung Hwan Oh, and Piet C De Groen. 2015. Polyp-alert: Near real-time feedback during colonoscopy. Computer methods and programs in biomedicine 120, 3 (2015), 164–179.