VISCERAL – VISual Concept Extraction challenge in RAdioLogy: ISBI 2014 Challenge Organization

Oscar Alfonso Jiménez del Toro1, Orcun Goksel2, Bjoern Menze2, Henning Müller1, Georg Langs3, Marc-André Weber4, Ivan Eggel1, Katharina Gruenberg4, Markus Holzer3, András Jakab3, Georgios Kontokotsios5, Markus Krenn3, Tomàs Salas Fernandez6, Roger Schaer1, Abdel Aziz Taha5, Marianne Winterstein4, Allan Hanbury5

1 University of Applied Sciences Western Switzerland, Switzerland
2 Swiss Federal Institute of Technology (ETH) Zürich, Switzerland
3 Medical University of Vienna, Austria
4 University of Heidelberg, Germany
5 Vienna University of Technology, Austria
6 Catalan Agency for Health Information, Assessment and Quality, Spain

Abstract

The VISual Concept Extraction challenge in RAdioLogy (VISCERAL) project was developed as a cloud-based infrastructure for the evaluation of medical image analysis on large data sets. As part of this project, the ISBI 2014 (International Symposium on Biomedical Imaging) challenge was organized using the VISCERAL data set and shared cloud framework. Two tasks were selected to exploit and compare multiple state-of-the-art solutions designed for big data medical image analysis. Segmentation and landmark localization results from the submitted algorithms were compared to manually annotated ground truth in the VISCERAL data set. This paper presents an overview of the challenge setup and data set, as well as the evaluation metrics applied to the various results submitted to the challenge. The participants presented their algorithms during an organized session at ISBI 2014, with lively discussions in which the importance of comparing approaches on tasks sharing a common data set was highlighted.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: O. Goksel (ed.): Proceedings of the VISCERAL Organ Segmentation and Landmark Detection Benchmark at the 2014 IEEE International Symposium on Biomedical Imaging (ISBI), Beijing, China, May 1st, 2014, published at http://ceur-ws.org

1 Introduction

Computational approaches that can be scaled to large amounts of medical data are needed to tackle the ever-growing data resources produced daily by hospitals [Doi05]. For health professionals, handling this enormous amount of medical data during clinical routine is limited by its complexity and scale. It is also very time-consuming, and hence requires unsupervised and automatic methods to perform the necessary analysis and processing for data interpretation. Many algorithms and techniques for big data analysis already exist; however, most research groups do not have access to large-scale annotated medical data with which to develop such approaches for medical images. Distributing these big data sets (on the order of terabytes) requires efficient and scalable storage and computing capabilities. Evaluation campaigns and benchmarks can objectively compare multiple state-of-the-art algorithms to determine the optimal solution for a certain clinical task [HMLM14, GSdHKCDF+13].

The Visual Concept Extraction Challenge in Radiology (VISCERAL) project was developed as a cloud-based infrastructure for the evaluation of medical image analysis techniques on large data sets [LMMH13]. The shared cloud environment in which the VISCERAL project takes place allows these data to be accessed and processed without duplicating them or moving them to the participants' side.
Since the data are stored centrally and not distributed outside the cloud environment, the legal and ethical requirements attached to such data sets can also be satisfied; even confidential data sets can be benchmarked in this way, as participants can only access a small training data set [EILI+10]. The cloud infrastructure is provided and funded by the VISCERAL project. Participants receive computationally powerful virtual machines in the shared cloud infrastructure, which they access remotely while working on the training data and tuning their algorithms. Participant access is withdrawn during the evaluation phase and only the organizers access the machines. The algorithms are thus brought to the data to perform automated processing and data mining. The performance of these methods can therefore be evaluated on real clinical imaging data, and the outcomes can be reused to improve the methods.

The whole-body 3D medical imaging data provided by VISCERAL includes a small subset with ground truth annotated by experienced radiologists. Through evaluation campaigns, challenges, benchmarks and competitions, tasks of general interest can be selected to compare the algorithms on a large scale. This manually annotated gold corpus can be used to identify high-quality methods, which can then be combined to create a much larger "reasonably annotated" data set that is satisfactory but perhaps not as reliable as manual annotation. This silver corpus will be created using fusion techniques, based on the agreement between the segmentations produced by the algorithms on a large-scale data set. This maximizes the gain from manual annotation and also identifies strong differences between participating systems on the annotated organs.

2 ISBI Challenge Framework

The registration procedure for the ISBI challenge was that of the VISCERAL benchmark series, which includes several campaigns. Participants filled in their details and uploaded a signed participation agreement form, which covers the ethics requirements for usage of the data. Since the VISCERAL data set is stored on the Azure cloud, each participant then received access to an Azure virtual cloud-computing instance. There were five operating systems to choose from: Windows 2012, Windows 2008, Ubuntu Server 14.04 LTS, openSUSE 13.1 and CentOS 6.5. All cloud-computing instances have an 8-core CPU with 16 GB RAM, so that the different proposed solutions are given the same computing capabilities. Participants get administrator rights on their virtual machine (VM) and can access it remotely to deploy their algorithms and add any supporting libraries or applications. The VISCERAL training data set can then be accessed and downloaded securely within the VMs through secured URL links.

Figure 1: The ISBI training set.

2.1 Data Set

The medical images contained in the VISCERAL data set were acquired during daily clinical routine. Data sets of children (<18 years) were not included, following the recommendations of the ethics committee. In the provided data sets, multiple organs are visible and depicted at a resolution sufficient to reliably detect each organ and delineate its borders. This ensures that a large number of organs and structures can be segmented in a single data set.
The data set consists of computed tomography (CT) scans and magnetic resonance (MR) imaging with and without contrast enhancement, so that the participants' algorithms can be evaluated on several modalities, contrasts and MR sequence directions, making sure that algorithms are not optimized for one specific machine or protocol. The participants of the ISBI VISCERAL challenge used the training set available from the VISCERAL Anatomy2 benchmark. The contents of this data set are detailed below.

2.1.1 CT Scans

There are 15 unenhanced whole-body CT volumes acquired from patients with bone marrow neoplasms, such as multiple myeloma, to detect osteolysis. The field of view spans from (and including) the head to the knees (see Fig. 2, A). The in-plane resolution ranges from 0.977/0.977 mm to 1.405/1.405 mm, and the between-plane resolution is 3 mm or more.

15 contrast-enhanced CT scans of the trunk, acquired in patients with malignant lymphoma, are also included. They have a large field of view from the corpus mandibulae to the lower part of the pelvis (see Fig. 2, B). They have an in-plane resolution between 0.604/0.604 mm and 0.793/0.793 mm, and a between-plane resolution of at least 3 mm.

2.1.2 MR Scans

15 whole-body MR scans in two sequences (30 in total) are also part of the training set. They were acquired in patients with multiple myeloma to detect focal and/or diffuse bone marrow infiltration. Both a coronal T1-weighted sequence and a fat-suppressed T2-weighted or STIR (short tau inversion recovery) sequence of the whole body are available for each of the 15 patients. The field of view starts at (and includes) the head and ends at the feet (see Fig. 2, C). The in-plane resolution is 1.250/1.250 mm, and the between-plane resolution is 5 mm.

Figure 2: Sample data set volumes. A) Whole-body unenhanced CT; B) contrast-enhanced CT of the trunk; C) whole-body unenhanced MR; D) contrast-enhanced MR of the abdomen.

To improve the segmentation of smaller organs (such as the adrenal glands), 15 T1 contrast-enhanced fat-saturated MR scans of the abdomen are also included. They were acquired in oncological patients with likely metastases within the abdomen. The field of view starts at the top of the diaphragm and extends to the lower part of the pelvis (see Fig. 2, D). They have an in-plane resolution between 0.840/0.804 mm and 1.302/1.302 mm, and a between-plane resolution of 3 mm.

2.1.3 Annotated Structures and Landmarks

There are in total 60 manually annotated volumes in this ISBI challenge training set. The available data contain segmentations and landmarks of several anatomical structures in different imaging modalities, e.g. CT and MRI. The two categories of annotations and results are:

• Region segmentations: These regions correspond to anatomical structures (e.g. right lung) or sub-parts in volume data. The 20 anatomical structures that make up the training set are: trachea, left/right lungs, sternum, vertebra L1, left/right kidneys, left/right adrenal glands, left/right psoas major muscles, left/right rectus abdominis muscles, thyroid gland, liver, spleen, gallbladder, pancreas, urinary bladder and aorta. Not all structures are visible or within the field of view in all images, leading to varying numbers of annotations per structure (see Fig. 1 for a detailed breakdown).
• Landmarks: Anatomical landmarks are the locations of selected anatomical structures that should be identifiable in the different image sequences available in the data set. Up to 53 anatomical landmarks (see Fig. 1) are located in the data set volumes: left/right clavicles, left/right crista iliaca, symphysis, left/right trochanter major, left/right trochanter minor, aortic arch, trachea bifurcation, aorta bifurcation, vertebrae C2–C7, Th1–Th12, L1–L5, xyphoideus, aortic valve, left/right sternoclavicular, VCI bifurcation, left/right tuberculums, left/right renal pelvises, left/right bronchus, left/right eyes, left/right ventricles, left/right ischiadicum and coronaria.

In total, the training set comprises 60 volumes containing 890 manually segmented anatomical structures and 2420 manually located anatomical landmarks. Some of the anatomical structures in the volumes were not segmented when the annotators considered that there was insufficient tissue contrast to perform the segmentation or to locate the landmark. Other structures are missing or not included in the training set because of anatomical variations (e.g. a missing kidney) or radiologic pathological signs (e.g. an aortic aneurysm). Landmarks are easy and quick to annotate, whereas precise organ segmentation is time-consuming even when using automatic tools.

2.1.4 Test Set

The test set contains 20 manually annotated volumes. Each modality (whole-body CT, thorax and abdomen contrast-enhanced CT, whole-body MR and abdomen contrast-enhanced MR) is represented by 5 volumes. The anatomical structures and landmarks contained in the selected volumes were used to evaluate the participants' algorithms.

2.2 ISBI VISCERAL Challenge Submission

Participants can select the structures and modalities for which they choose to participate. The outputs are therefore evaluated per structure and per modality. The evaluation of the ISBI challenge was organized differently from the general VISCERAL evaluation framework so that the evaluation could be completed in the relatively short time frame available. For this challenge, the test set volumes were made available in the cloud some weeks ahead of the challenge. The participants themselves computed the annotations (segmentations and/or landmark locations) in their VMs and stored them on their VM storage. The files could then be submitted from within the VM through an uploading script provided to the participants. The script stored the output files in a cloud container created individually for each participant for the challenge. A list of the available ground truth segmentations of the test set was used to filter out duplicates and output files with incorrect file names, and to ensure that all files were consistent with the organizers' participant ID list.

2.3 Evaluation Software

The VISCERAL evaluation tool was used to evaluate the output segmentations and landmark locations against the ground truth. This software was also included in the VM assigned to each participant. The evaluation tool implements different evaluation metrics, such as (1) distance-based metrics, (2) spatial overlap metrics and (3) probabilistic and information-theoretic metrics. The most suitable subset of the metrics was used in the analysis of the results, and all metrics were made available to the participants.

For the output segmentations of the ISBI challenge, the following evaluation metrics were selected:

• DICE coefficient [ZWB+04]
• Adjusted Rand Index [VPYM11]
• Interclass Correlation [GJC01]
• Average distance [KCAB09]

Figure 3: Anatomical structure segmentation task: DICE coefficient results. Contrast-enhanced CT scans of the thorax and abdomen.

Only one label is considered per image: the voxel value is either zero (background) or one for voxels belonging to the segmentation. A threshold of 0.5 is applied to create binary images in case the output label is a fuzzy membership or a probability map.

For the landmark localization evaluation, the same VISCERAL tool measures the landmark-specific average error (Euclidean distance) between the submitted results and the manually located landmarks. The percentage of detected landmarks per volume (i.e. landmarks detected / landmarks present in the volume) is also computed.
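To make these two measures concrete, the following is a minimal sketch (Python/NumPy) of how a DICE overlap and a landmark error summary could be computed. It is an illustration only, not the VISCERAL evaluation tool itself; the array and dictionary inputs, the `spacing` parameter and the function names are assumptions made for the example.

```python
import numpy as np

def dice_coefficient(output, ground_truth, threshold=0.5):
    """DICE overlap between a (possibly fuzzy) output and a binary ground truth.

    Fuzzy memberships or probability maps are binarised at `threshold`,
    mirroring the 0.5 cut-off described above.
    """
    seg = np.asarray(output) >= threshold
    gt = np.asarray(ground_truth) >= threshold
    denom = seg.sum() + gt.sum()
    if denom == 0:                      # both masks empty: treat as perfect overlap
        return 1.0
    return 2.0 * np.logical_and(seg, gt).sum() / denom

def landmark_errors(detected, reference, spacing=(1.0, 1.0, 1.0)):
    """Average Euclidean error (in mm) and percentage of detected landmarks.

    `detected` and `reference` map landmark names to voxel coordinates;
    a landmark missing from `detected` counts as not found.  `spacing`
    converts voxel indices to millimetres.
    """
    spacing = np.asarray(spacing, dtype=float)
    errors = [
        np.linalg.norm((np.asarray(detected[name]) - np.asarray(pos)) * spacing)
        for name, pos in reference.items()
        if name in detected
    ]
    detection_rate = 100.0 * len(errors) / len(reference)
    average_error = float(np.mean(errors)) if errors else float("nan")
    return average_error, detection_rate
```

In the benchmark itself these numbers were produced by the evaluation software installed on each VM; the sketch only spells out the underlying definitions.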
2.4 Participation

The ISBI training and test set volumes were made available through the Azure cloud framework to all registered participants of the VISCERAL Anatomy2 benchmark. In total, 18 groups obtained access to the challenge training set and its 60 training volumes. The research groups that submitted working virtual machines had the chance to present their methods and results at the "VISCERAL Organ Segmentation and Landmark Detection Challenge" at the 2014 IEEE International Symposium on Biomedical Imaging (ISBI). A single-blind review process was applied to the initial abstract submissions. The authors of accepted abstracts were then invited to submit a short paper presenting their methods and results in the challenge. Five high-quality submissions were accepted and included in these proceedings.

Spanier et al. [SJ14] submitted segmentations for five organs in contrast-enhanced CT volumes. Their multi-step algorithm combines thresholding and region-growing techniques to segment each organ individually. It starts by locating a region of interest and identifying the largest axial cross-section slices of the selected structure. It then improves the initial segmentation with morphological operators, and a final step performs 3D region growing.

Huang et al. [HLJ14] proposed a coarse-to-fine liver segmentation using prior models for the shape, profile appearance and contextual information of the liver. An AdaBoost voxel-based classifier creates a liver probability map, which is refined in the last step with free-form deformation using a gradient appearance model.

Table 1: Anatomical structure segmentation task: DICE coefficient results. Contrast-enhanced and unenhanced CT scan submissions.

Wang et al. [WS14] segmented 10 anatomical structures in contrast-enhanced and unenhanced CT scans. Their multi-organ segmentation pipeline follows a top-down approach, starting with a model-based level-set segmentation of the ventral cavity. After dividing the cavity into the thoracic and abdominopelvic cavities, the major structures are segmented and their location information is passed on to the lower-level structures.

Jiménez del Toro et al. [JdTM14] segmented structures in CT and contrast-enhanced CT scans with a hierarchical multi-atlas approach. Based on the spatial anatomical correlations between the organs, the larger and higher-contrast organs are segmented first.
These then define the initial volume transformations for the smaller structures with less well-defined boundaries.

Goksel et al. [GGS14] submitted anatomical structure segmentations for both CT and MR, and also submitted results for the landmark localization task. For the segmentations they use a multi-atlas technique that implements Markov random fields to guide the registrations. A multi-atlas, template-based approach fuses the different estimations to detect the landmarks.

3 Results

Approximately 500 structure segmentations and 211 landmark locations were submitted to the VISCERAL ISBI challenge. Four participants submitted segmentation results for multiple organs in whole-body CT or contrast-enhanced CT scans, with results presented in Table 1 and Fig. 3. One participant contributed segmentations on both the whole-body MR scans and the contrast-enhanced MR abdomen volumes, with results presented in Table 3. Only one participant submitted landmark localization results; Table 4 shows their evaluation results.

Table 2: Anatomical structure segmentation task: Average Distance results. Contrast-enhanced and unenhanced CT scan submissions.

Table 3: Evaluation metrics for the MR scan submissions.

Table 4: Landmark results.

4 Conclusions and Future Work

The main objective of the VISCERAL project is the evaluation of algorithms on large data sets. The proposed VISCERAL infrastructure allows evaluations with private or restricted data, such as electronic health records, without giving participants access to the test data, by using a fully cloud-based approach. This infrastructure also avoids moving data, which can be difficult for very large data sets: the algorithms are brought to the data and not the data to the algorithms.

Both the gold corpus and the silver corpus will be made available as a resource to the community. The ISBI test set volumes and annotations are now available and are part of the VISCERAL Anatomy2 benchmark training set. So far, both past VISCERAL anatomy benchmarks have addressed organ segmentation and landmark localization tasks. Two more benchmarks are under development in the VISCERAL project, a retrieval benchmark and a detection benchmark. The retrieval benchmark will address the retrieval of similar cases based on both visual information and radiology reports. The detection benchmark will focus on the detection of lesions in MR and CT images.

In the future, the automation of the evaluation process is intended to reduce the need for intervention from the organizers to a minimum and to provide faster evaluation feedback to the participants. The participants will then be able to submit their algorithms through the cloud virtual machines and obtain the calculated metrics directly from the system. Such a system could also store the results from all submitted algorithms and perform an objective comparison with state-of-the-art algorithms. Through the involvement of the research community, the VISCERAL framework could produce novel tools for the clinical workflow that have a substantial impact on diagnosis quality and treatment success. Having all tools and algorithms in the same cloud environment can also help to combine tools and approaches with very little additional effort, which is expected to yield better results.
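As an illustration of the silver-corpus idea described in Section 1, where annotations are derived from the agreement between participant segmentations, the following is a minimal majority-voting label fusion sketch in Python/NumPy. It is an assumption-laden example rather than the actual VISCERAL fusion pipeline: it presumes co-registered binary masks of a single structure and a simple voxel-wise vote, whereas the project's fusion may weight algorithms differently.

```python
import numpy as np

def majority_vote_fusion(segmentations, min_agreement=0.5):
    """Fuse binary segmentations of one structure by voxel-wise voting.

    `segmentations` is a list of equally shaped binary arrays, one per
    participating algorithm, already resampled to a common reference space.
    A voxel is kept in the fused (silver-corpus style) label when at least
    `min_agreement` of the algorithms mark it as foreground.
    """
    stack = np.stack([np.asarray(s, dtype=bool) for s in segmentations])
    agreement = stack.mean(axis=0)        # fraction of algorithms that agree
    return agreement >= min_agreement

# Hypothetical usage with three synthetic "algorithm outputs" for one organ.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    masks = [rng.random((4, 4, 4)) > 0.4 for _ in range(3)]
    fused = majority_vote_fusion(masks)
    print(int(fused.sum()), "voxels retained in the fused label")
```

Voxel-wise voting of this kind also exposes where the algorithms disagree strongly, which matches the stated goal of identifying differences between participating systems on the annotated organs.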
5 Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 318068 (VISCERAL). We would also like to thank Microsoft Research for their financial and informational support in using the Azure cloud for the benchmark.

References

[Doi05] K. Doi. Current status and future potential of computer-aided diagnosis in medical imaging. British Journal of Radiology, 78:3–19, 2005.

[EILI+10] Bernice Elger, Jimison Iavindrasana, Luigi Lo Iacono, Henning Müller, Nicolas Roduit, Paul Summers, and Jessica Wright. Strategies for health data exchange for secondary, cross-institutional clinical research. Computer Methods and Programs in Biomedicine, 99(3):230–251, September 2010.

[GGS14] Orcun Goksel, Tobias Gass, and Gabor Szekely. Segmentation and landmark localization based on multiple atlases. In Orcun Goksel, editor, Proceedings of the VISCERAL Challenge at ISBI, CEUR Workshop Proceedings, pages 37–43, Beijing, China, May 2014.

[GJC01] Guido Gerig, Matthieu Jomier, and Miranda Chakos. A new validation tool for assessing and improving 3D object segmentation. In Wiro J. Niessen and Max A. Viergever, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2001, volume 2208 of Lecture Notes in Computer Science, pages 516–523. Springer Berlin Heidelberg, 2001.

[GSdHKCDF+13] Alba García Seco de Herrera, Jayashree Kalpathy-Cramer, Dina Demner Fushman, Sameer Antani, and Henning Müller. Overview of the ImageCLEF 2013 medical tasks. In Working Notes of CLEF 2013 (Cross Language Evaluation Forum), September 2013.

[HLJ14] Cheng Huang, Xuhui Li, and Fucang Jia. Automatic liver segmentation using multiple prior knowledge models and free-form deformation. In Orcun Goksel, editor, Proceedings of the VISCERAL Challenge at ISBI, CEUR Workshop Proceedings, pages 22–24, Beijing, China, May 2014.

[HMLM14] Allan Hanbury, Henning Müller, Georg Langs, and Bjoern H. Menze. Cloud-based evaluation framework for big data. In Alex Galis and Anastasius Gavras, editors, Future Internet Assembly (FIA) Book 2013, Springer LNCS, pages 104–114. Springer Berlin Heidelberg, 2014.

[JdTM14] Oscar Alfonso Jiménez del Toro and Henning Müller. Hierarchical multi-structure segmentation guided by anatomical correlations. In Orcun Goksel, editor, Proceedings of the VISCERAL Challenge at ISBI, CEUR Workshop Proceedings, pages 32–36, Beijing, China, May 2014.

[KCAB09] Hassan Khotanlou, Olivier Colliot, Jamal Atif, and Isabelle Bloch. 3D brain tumor segmentation in MRI using fuzzy classification, symmetry analysis and spatially constrained deformable models. Fuzzy Sets and Systems, 160(10):1457–1473, 2009. Special Issue: Fuzzy Sets in Interdisciplinary Perception and Intelligence.

[LMMH13] Georg Langs, Henning Müller, Bjoern H. Menze, and Allan Hanbury. VISCERAL: Towards large data in medical imaging – challenges and directions. Lecture Notes in Computer Science, 7723:92–98, 2013.

[SJ14] Assaf B. Spanier and Leo Joskowicz. Rule-based ventral cavity multi-organ automatic segmentation in CT scans. In Orcun Goksel, editor, Proceedings of the VISCERAL Challenge at ISBI, CEUR Workshop Proceedings, pages 16–21, Beijing, China, May 2014.

[VPYM11] Nagesh Vadaparthi, Suresh Varma Penumatsa, Srinivas Yarramalle, and P. S. R. Murthy. Segmentation of brain MR images based on finite skew Gaussian mixture model with fuzzy C-means clustering and EM algorithm. International Journal of Computer Applications, 28:18–26, 2011.
[WS14] Chunliang Wang and Örjan Smedby. Automatic multi-organ segmentation using fast model based level set method and hierarchical shape priors. In Orcun Goksel, editor, Proceedings of the VISCERAL Challenge at ISBI, CEUR Workshop Proceedings, pages 25–31, Beijing, China, May 2014.

[ZWB+04] Kelly H. Zou, Simon K. Warfield, Aditya Bharatha, Clare M. C. Tempany, Michael R. Kaus, Steven J. Haker, William M. Wells III, Ferenc A. Jolesz, and Ron Kikinis. Statistical validation of image segmentation quality based on a spatial overlap index. Academic Radiology, 11(2):178–189, 2004.