Multimedia for Medicine: The Medico Task at MediaEval 2017

Michael Riegler1, Konstantin Pogorelov1,2, Pål Halvorsen1,2, Kristin Ranheim Randel2,3, Sigrun Losada Eskeland4, Duc-Tien Dang-Nguyen5, Mathias Lux6, Carsten Griwodz1,2, Concetto Spampinato7, Thomas de Lange3

1 Simula Research Laboratory, Norway; 2 University of Oslo, Norway; 3 Cancer Registry of Norway, Norway; 4 Vestre Viken Hospital Trust, Norway; 5 Dublin City University, Ireland; 6 University of Klagenfurt, Austria; 7 University of Catania, Italy

michael@simula.no, konstantin@simula.no

MediaEval'17, 13-15 September 2017, Dublin, Ireland. Copyright held by the owner/author(s).

ABSTRACT

The Multimedia for Medicine Medico Task, running for the first time as part of MediaEval 2017, focuses on detecting abnormalities, diseases and anatomical landmarks in images captured by medical devices in the gastrointestinal tract. The task characteristics are described, including the use case and its challenges, the dataset with ground truth, the required participant runs and the evaluation metrics.

1 INTRODUCTION

The Medico task tackles the challenge of predicting diseases based on multimedia data collected in hospitals, with the additional requirements to use as little training data as possible, to perform the analysis efficiently in terms of processing time, and to generate automatic text reports (summaries) of the findings. The task differs from well-known medical imaging tasks like the ImageCLEF medical tasks (http://www.imageclef.org/) [1, 7] in that it (i) has only multimedia data (videos and images) and no medical imaging data (CT scans, etc.), (ii) asks for using as little training data as possible, and (iii) also evaluates the approaches regarding processing time. Furthermore, the automatically generated reports are a novel part of the task, but since they are very hard to evaluate, this subtask is experimental this year.

It is a typical assumption that visual analysis, as it is already provided by the computer vision and medical image processing communities today, is sufficient to solve healthcare multimedia challenges [6]. Although we concede that these methods are indeed essential contributors to promising approaches, we have come to the understanding that analysing images and videos alone does not solve the challenges in medical fields such as endoscopy or ultrasound, because of the task complexity and the needs of both medical experts and patients. Neither does it make serious use of the multitude of additional information sources, including sensors and temporal information [3, 8, 9].

The Medico task is designed to help improve the health care system through the application of multimedia research knowledge and methods, to reach the next level of computer- and multimedia-assisted diagnosis, detection and interpretation of abnormalities. This is useful in multiple scenarios. For example, in some areas of the human body, such as the gastrointestinal (GI) tract, the detection of abnormalities and diseases in early stages can significantly improve the chance of successful treatment and survival. Through endoscopic examinations (insertion of a camera into the gastrointestinal tract), diseases can be detected visually, even before they become symptomatic. This is particularly the case for colorectal cancer (in the large bowel) or its cancer precursors (colorectal polyps), which can be detected through colonoscopy or capsule endoscopy. The challenge is, however, that both medical experts and machines currently fail to detect all polyps [6]. Moreover, previous research in this area, in computer vision and medical imaging, has created visual augmentations of the interior of the body. To automatically detect and locate abnormalities, visual representations alone are not sufficient. There is a need for image and video processing, analysis, information search and retrieval, in combination with other sensor data and assistance from medical experts, and it all needs integration [5].

Here, participants are asked to look beyond computer vision and medical imaging to show the potential of multimedia research going far beyond well-known scenarios like the analysis of content on YouTube and Flickr. For this detection task, we provide Kvasir, a large public dataset [4] containing videos and images from the GI tract showing different diseases and anatomical landmarks. The ground truth is provided by medical experts (specialists in GI endoscopy) annotating the dataset, and the data is split into training and test data. Based on this, the participants are asked to solve four subtasks, where the first two are mandatory and the last two are optional: (i) classify diseases with as few images in the training dataset as possible; (ii) solve the classification problem in a fast and efficient way; (iii) run the second task on the same hardware (supported platforms are Linux, macOS and Windows); and (iv) automatically create a text report for a medical doctor for three video cases. The task can be tackled by leveraging techniques from multiple multimedia-related disciplines, including (but not limited to) machine learning (classification), multimedia content analysis and multimodal fusion.

2 DATASET DETAILS

The Kvasir dataset (http://datasets.simula.no/kvasir/) [4] consists of 8,000 GI tract images that are annotated and verified by medical doctors (experienced endoscopists) for the ground truth. It includes 8 classes showing anatomical landmarks, pathological findings or endoscopic procedures in the GI tract, i.e., 1,000 images for each class, split into 500 for training and 500 for testing.
The anatomical landmarks are Z-line, pylorus and cecum, while the pathological findings include esophagitis, polyps and ulcerative colitis. In addition, we provide two sets of images related to the removal of polyps, the dyed and lifted polyp and the dyed resection margins. The dataset consists of images with different resolutions, from 720x576 up to 1920x1072 pixels, and is organized by sorting the images into separate folders named according to their content. Some of the included images have a green sub-picture illustrating the position and configuration of the endoscope inside the bowel, delivered by an electromagnetic imaging system (ScopeGuide, Olympus Europe). This sub-picture may support the interpretation of the image. As mentioned before, the whole dataset is split into two equally sized development and test datasets. Both the development and the test datasets consist of 4,000 images, 500 images for each class, stored in two archives: an images archive and a features archive. In the development dataset, the images are stored in separate folders named according to the classes the images belong to. In the test dataset, all the images are stored in one folder. The image files are encoded using JPEG compression. The encoding settings can vary across the dataset, and they reflect the a priori unknown endoscopic equipment settings.

Furthermore, the features archive contains the extracted visual feature descriptors for all the images in the images archive. The extracted visual features are global image features, i.e., JCD, Tamura, ColorLayout, EdgeHistogram, AutoColorCorrelogram and PHOG. Each feature vector consists of a number of floating point values, where the size of the vector depends on the feature: 168 (JCD), 18 (Tamura), 33 (ColorLayout), 80 (EdgeHistogram), 256 (AutoColorCorrelogram) and 630 (PHOG) [2]. The extracted visual features are stored in separate folders and text files named according to the name and path of the corresponding image files. Each file consists of six lines, one line per feature, where a line consists of a feature name separated from the feature vector by a colon. Each feature vector consists of the corresponding number of floating point values separated by commas. The file extension of the extracted visual feature files is ".features".
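As a concrete illustration, a parser for the documented file format (one line per feature, feature name separated from comma-separated floats by a colon) might look as follows. This is a sketch, not official task code; the exact feature-name spellings inside the files are assumptions.

```python
# Expected vector lengths per feature, as listed in the dataset description.
# The dictionary keys are assumed spellings of the feature names.
EXPECTED_SIZES = {
    "JCD": 168, "Tamura": 18, "ColorLayout": 33,
    "EdgeHistogram": 80, "AutoColorCorrelogram": 256, "PHOG": 630,
}

def parse_features_file(path):
    """Parse one ".features" file: return a dict feature name -> list of floats."""
    features = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each non-empty line is "FeatureName:v1,v2,...,vn".
            name, _, values = line.partition(":")
            features[name] = [float(v) for v in values.split(",")]
    return features
```

A loaded file can then be sanity-checked against EXPECTED_SIZES before using the vectors as classifier input.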
For the automatic report generation, we use three videos depicting diseases or findings that can be found in the Kvasir dataset. The goal is to generate reports describing the three videos for medical experts, having automatic report generation in mind.

3 EVALUATION METRICS AND TASKS

For the evaluation of detection accuracy, we use several standard metrics (more detailed descriptions can be found on the task web page). True positive (TP) is the number of correctly identified positive samples. True negative (TN) is the number of correctly identified negative samples. False positive (FP) is the number of samples wrongly identified as positive. False negative (FN) is the number of samples wrongly identified as negative. Recall (frequently called sensitivity) is the ratio of samples correctly identified as positive among all existing positive samples. Precision is the ratio of samples correctly identified as positive among all returned samples. Specificity is the ratio of negatives correctly identified as negatives. Accuracy is the percentage of correctly identified true and false samples. The Matthews correlation coefficient (MCC) takes into account true and false positives and negatives, and is a balanced measure even if the classes are of very different sizes. The F1 score measures a test's accuracy by calculating the harmonic mean of precision and recall. We also evaluate the amount of training data that has been used to achieve good results and the speed (processing performance) of the classification.
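The metric definitions above can be written directly in terms of the four counts. A minimal sketch (helper names are our own; degenerate zero-denominator cases are not handled except in the MCC):

```python
def recall(tp, fn):            # sensitivity: positives found among all positives
    return tp / (tp + fn)

def precision(tp, fp):         # positives found among all returned positives
    return tp / (tp + fp)

def specificity(tn, fp):       # negatives correctly identified among all negatives
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):  # fraction of all samples identified correctly
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):      # harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def mcc(tp, tn, fp, fn):       # Matthews correlation coefficient (binary case)
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0
```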
For the evaluation, the participants must submit one run for each of the required subtasks defined below. Additionally, they can optionally submit three more runs for any of the described subtasks, i.e., participants can submit up to five runs in total.

Required subtasks. The detection subtask is a multi-class classification task for diseases in the GI tract. Participants have to use the visual information in the provided dataset, where the goal is to maximize the algorithm's performance in terms of detection accuracy, and where the amount of training data is also taken into account. Detection is evaluated based on the metrics above (all should be reported), but the ranking is made using the MCC and the amount of training data used. The official metric is a multi-class generalization of the MCC. This generalization is called the R_K statistic (for K different classes) and is defined in terms of a K x K confusion matrix. The R_K statistic is in essence a correlation coefficient between the observed and predicted classifications for the K different classes; it returns a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 corresponds to prediction no better than random, and a value below 0 indicates disagreement between prediction and observation (lower values correspond to stronger disagreement). The minimum value of the R_K statistic lies between -1 and 0, depending on the true distribution, while the maximum value is always +1.
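Under the standard (Gorodkin) formulation of the multi-class MCC, the R_K statistic can be computed from the confusion matrix as sketched below. Variable names are ours, and this is an illustration of the statistic, not the official evaluation code.

```python
def r_k_statistic(cm):
    """R_K statistic from a K x K confusion matrix cm
    (rows = true classes, columns = predicted classes)."""
    k = len(cm)
    s = sum(sum(row) for row in cm)           # total number of samples
    c = sum(cm[i][i] for i in range(k))       # correctly classified samples
    t = [sum(row) for row in cm]              # samples per true class
    p = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # per predicted class
    num = c * s - sum(t[i] * p[i] for i in range(k))
    den = ((s * s - sum(x * x for x in p))
           * (s * s - sum(x * x for x in t))) ** 0.5
    return num / den if den else 0.0          # convention: 0 when undefined
```

For K = 2 this reduces to the binary MCC, and a perfect (diagonal) confusion matrix yields +1.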
The efficient detection subtask addresses the speed of the classification. The classification of diseases has to be achieved as fast as possible in terms of data processing, using any computation speed-up techniques. The goal is to find a balance between the algorithm's performance in terms of detection accuracy and its performance in terms of data processing speed, keeping in mind that the problem area requires real-time processing while lacking data. For the evaluation, the processing time is weighted by the detection accuracy.

Optional subtasks. The efficient detection on the same hardware subtask is the same as the efficient detection subtask above, but all submitted solutions are tested on the same hardware. The organizers run the code provided by the participants on the same hardware, and the evaluation is again based on the processing time weighted by the detection accuracy.

The experimental report generation subtask asks the participants to automatically create a text report for a medical doctor describing the detection results for three video cases. A definition of what a text report is, what it should contain (a list of requirements) and a description of what the medical experts do with the report is provided. The assessment then follows the list of requirements, and the report is assessed manually by two of our medical partners in terms of its usefulness in the medical context and whether it satisfies existing demands for the documentation of endoscopic procedures.

4 DISCUSSION AND OUTLOOK

The task itself can be seen as very challenging, hard to solve and hard to evaluate. Due to its novel use case, we hope to motivate many researchers to look into the field of medical multimedia. Performing research that can have societal impact will be an important part of multimedia research in the future. We hope that the Medico task can help raise awareness of the topic, but also provide an interesting and meaningful use case to researchers.

REFERENCES

[1] Bogdan Ionescu, Henning Müller, Mauricio Villegas, Helbert Arenas, Giulia Boato, Duc-Tien Dang-Nguyen, Yashin Dicente Cid, Carsten Eickhoff, Alba Garcia Seco de Herrera, Cathal Gurrin, Bayzidul Islam, Vassili Kovalev, Vitali Liauchuk, Josiane Mothe, Luca Piras, Michael Riegler, and Immanuel Schwall. 2017. Overview of ImageCLEF 2017: Information extraction from images. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 8th International Conference of the CLEF Association, CLEF 2017 (LNCS 10439). Springer.

[2] Mathias Lux and Savvas A. Chatzichristofis. 2008. LIRE: Lucene Image Retrieval: An extensible Java CBIR library. In Proceedings of the 16th ACM International Conference on Multimedia. ACM, 1085-1088.

[3] Konstantin Pogorelov, Sigrun Losada Eskeland, Thomas de Lange, Carsten Griwodz, Kristin Ranheim Randel, Håkon Kvale Stensland, Duc-Tien Dang-Nguyen, Concetto Spampinato, Dag Johansen, Michael Riegler, and others. 2017. A holistic multimedia system for gastrointestinal tract disease detection. In Proceedings of the 8th ACM Multimedia Systems Conference (MMSys). ACM, 112-123.
[4] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. 2017. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM Multimedia Systems Conference (MMSys). ACM, 164-169.

[5] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Peter Thelin Schmidt, Carsten Griwodz, Dag Johansen, Sigrun Losada Eskeland, and Thomas de Lange. 2016. GPU-accelerated real-time gastrointestinal diseases detection. In Proceedings of the IEEE International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 185-190.

[6] Michael Riegler, Mathias Lux, Carsten Griwodz, Concetto Spampinato, Thomas de Lange, Sigrun L. Eskeland, Konstantin Pogorelov, Wallapak Tavanapong, Peter T. Schmidt, Cathal Gurrin, Dag Johansen, Håvard Johansen, and Pål Halvorsen. 2016. Multimedia and medicine: Teammates for better disease detection and survival. In Proceedings of the 2016 ACM Multimedia Conference (ACM MM). ACM, 968-977.

[7] Mauricio Villegas, Henning Müller, Alba García Seco de Herrera, Roger Schaer, Stefano Bromuri, Andrew Gilbert, Luca Piras, Josiah Wang, Fei Yan, Arnau Ramisa, and others. 2016. General overview of ImageCLEF at the CLEF 2016 labs. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages (LNCS 9822). Springer, 267-285.

[8] Yi Wang, Wallapak Tavanapong, Johnny Wong, JungHwan Oh, and Piet C. De Groen. 2011. Computer-aided detection of retroflexion in colonoscopy. In Proceedings of the 24th International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 1-6.

[9] Yi Wang, Wallapak Tavanapong, Johnny Wong, JungHwan Oh, and Piet C. De Groen. 2015. Polyp-Alert: Near real-time feedback during colonoscopy. Computer Methods and Programs in Biomedicine 120, 3 (2015), 164-179.