Overview of the VISCERAL Challenge at ISBI 2015

Orcun Goksel1, Antonio Foncubierta-Rodríguez1, Oscar Alfonso Jiménez del Toro2, Henning Müller2, Georg Langs3, Marc-André Weber4, Bjoern Menze5, Ivan Eggel2, Katharina Gruenberg4, Marianne Winterstein4, Markus Holzer3, Markus Krenn3, Georgios Kontokotsios6, Sokratis Metallidis2, Roger Schaer2, Abdel Aziz Taha6, András Jakab3, Tomàs Salas Fernandez7, Allan Hanbury6

ETH Zürich, Switzerland1; HES-SO Valais, Sierre, Switzerland2; MUW, Vienna, Austria3; University of Heidelberg, Germany4; TUM, Munich, Germany5; TU Wien, Vienna, Austria6; AQuAS, Barcelona, Spain7

Abstract

This is an overview paper describing the data and evaluation scheme of the VISCERAL Segmentation Challenge at ISBI 2015. The challenge was organized on a cloud-based virtual-machine environment, where each participant could develop and submit their algorithms. The dataset contains up to 20 anatomical structures annotated in a training and a test set consisting of CT and MR images with and without contrast enhancement. The test set is not accessible to participants; the organizers run the virtual machines with the submitted segmentation methods on the test data. The results of the evaluation are then presented to the participant, who can opt to make them public on the challenge leaderboard displaying 20 segmentation quality metrics per organ and per modality. The Dice coefficient and mean surface distance are presented herein as representative quality metrics. As a continuous evaluation platform, our segmentation challenge leaderboard will remain open beyond the duration of the VISCERAL project.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: O. Goksel (ed.): Proceedings of the VISCERAL Anatomy Grand Challenge at the 2015 IEEE International Symposium on Biomedical Imaging (ISBI), New York, NY, April 16, 2015; published at http://ceur-ws.org

1 Introduction

In this challenge, a set of annotated medical imaging data was provided to the participants, along with a powerful complimentary cloud-computing instance (8-core CPU with 16 GB RAM) on which participant algorithms could be developed and evaluated. The available data contains segmentations of several different anatomical structures in different image modalities, e.g., CT and MRI. Annotated structures in the training and testing data corpus included the segmentations of left/right kidney, spleen, liver, left/right lung, urinary bladder, rectus abdominis muscle, 1st lumbar vertebra, pancreas, left/right psoas major muscle, gallbladder, sternum, aorta, trachea, and left/right adrenal gland.

For training, 20 volumes were provided for each of four different image modalities and fields of view, with and without contrast enhancement, adding up to 80 volumes in total. In each volume, up to 20 structures were segmented. The missing annotations are due to poor visibility of the structures in certain image modalities or due to such structures being outside the field of view. Accordingly, in all 80 volumes, a total of 1295 structures are segmented. A breakdown of annotations per anatomy can be seen in Figure 1.

Figure 1: Number of annotations in the Anatomy3 training set classified by modality and organ.
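The per-organ, per-modality breakdown shown in Figure 1 amounts to a simple tally of which ground-truth label files exist for each training volume. As an informal illustration only (the directory layout and file naming below are hypothetical and do not reflect the actual VISCERAL distribution format), such a tally could be computed as follows:

```python
from collections import Counter
from pathlib import Path

def annotation_counts(root):
    """Tally available ground-truth annotations per (modality, organ) pair,
    assuming a hypothetical layout <root>/<modality>/<volume_id>/<organ>.nii.gz."""
    counts = Counter()
    for label_file in Path(root).glob("*/*/*.nii.gz"):
        modality = label_file.parts[-3]               # e.g. "CT"
        organ = label_file.name[: -len(".nii.gz")]    # e.g. "liver"
        counts[(modality, organ)] += 1
    return counts

counts = annotation_counts("Anatomy3/training")
total = sum(counts.values())   # 1295 for the training set described above
```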
Participants did not need to segment all of the structures in the data; rather, they could attempt any single anatomical structure or any combination thereof. For instance, an algorithm that could segment only some organs in some of the modalities was evaluated only in those categories for which it produced results. Accordingly, our evaluation results were presented in a per-anatomy, per-modality fashion depending on the segmentation task(s) attempted by each participating algorithm. This is in line with the VISCERAL vision of creating a single, large, multi-purpose medical image dataset on which different research groups can test their specific applications and solutions.

Participants first registered for a benchmark account at the VISCERAL registration website. Among the options during registration, they could request their choice of operating system (Linux, Windows, etc.) for the virtual machine (VM), in order to get access to the VM and the data. Having signed the data usage agreement and uploaded it to the participant dashboard, they could then access the VM for algorithm development and also use the training data accessible therein. Participants could additionally download the training dataset via FTP for offline training.

Participants accordingly developed and installed their algorithms in the VM, adapting and testing them on the training data. They then prepared their executable on the VM according to the input/output specifications announced earlier in the Anatomy3 Guidelines for Participation, and submitted their VMs (through the "Submit VM" button in the online participant dashboard) for evaluation on the test data. We subsequently ran their VM (and hence their algorithm) on the test data and computed the relevant metrics. This evaluation process could be repeated several times during the training phase; nevertheless, we limited submissions to once per week in order to prevent participants from "training on the test data". The participants received feedback from their evaluations in a private leaderboard and had the option to make their results publicly available on the online public leaderboard, which contains the results considered in our benchmark results.

2 Evaluation

For the Anatomy3 benchmark, a different evaluation approach was implemented compared to the previous Anatomy benchmarks [LMMH13, JdTGM+14]. In this benchmark, participants could submit their algorithms several times, giving them the opportunity to improve their algorithms prior to the final evaluation analysis during ISBI 2015. They could also choose to make any of their results on the test set public at any time. To allow a continuous workflow with this evaluation approach, the steps of the evaluation phase were automated to a large extent. The continuous evaluation approach included the following steps:

1. The participant registers for the challenge; fills, signs, and uploads the participant agreement.
2. The organizers provide the participant with a virtual machine (VM) from the VISCERAL cloud infrastructure.
3. The participant implements a segmentation algorithm in the VM according to the benchmark specifications.
4. The VM is submitted by the participant using the participant dashboard.
5. The organizers isolate the VM to prevent the participant from accessing it during the evaluation phase.
6. The participant executable is run in a batch script to test whether its output files correspond with those expected by the evaluation routines.
7. If the previous step is successful, the evaluation proceeds for all the volumes in the test set.
8. Each generated output segmentation file is uploaded by the batch script to the cloud storage reserved for that participant.
9. Once all the images in the test set are processed by the participant executable, the output segmentations are cleared from the VM, which is in turn returned to the participant.
10. The output segmentations uploaded to the cloud storage are then evaluated against the ground truth (manual annotations) and the results are presented in the participant dashboard.
11. The participant can then analyze and interpret the results of their submission, and choose to make them public or not on the public leaderboard.
12. The participant is allowed to submit again for testing only after a minimum of one week since their latest submission.

Figure 2: A snapshot of the VISCERAL Anatomy3 public leaderboard at the time of the ISBI 2015 challenge.

3 Benchmark results

Detailed results for all 20 metrics can be seen in the online leaderboard (accessible at http://visceral.eu:8080/register/Leaderboard.xhtml), a snapshot of which, taken at the time of the ISBI 2015 Anatomy3 challenge, is shown in Fig. 2. Participant evaluation results are summarized in Tables 1 and 2, respectively for the Dice coefficient and the mean surface distance, as commonly used segmentation evaluation metrics. The former is an overlap metric, describing how well an algorithm estimates the target anatomical region. The latter is a surface distance metric, summarizing the overall surface estimation error of a given algorithm. The participant row in the tables refers to the corresponding publication contribution in Part II of these Anatomy3 proceedings.

In the Dice results table, the highest-ranking methods per modality and organ are marked in bold. Any other method within 0.01 (1%) Dice of the best is also considered a winner (a tie), since such a difference is insignificant. Dice values below a threshold are considered unsuccessful segmentations and thus are not declared winners, even though the reader should note that, depending on the particular clinical application, such results can potentially still be useful. This threshold was selected as 0.6 Dice, coinciding with a gap in the reported participant results. The results corresponding to the same bold values in the Dice table are also marked in the mean surface distance table, in order to facilitate comparison of the segmentation surface errors for the methods that are best in terms of the Dice metric. For successfully segmented organs (defined by the empirical 0.6 Dice cutoff), both metrics agree on the results for all structures and modalities, except for the first lumbar vertebra in CT. The reader should note that the mean surface distances are presented in voxels; therefore, the values between modalities (e.g., MRce and CT) are not directly comparable in the latter table.
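As an informal reference for the two representative metrics tabulated below, the following is a minimal sketch (using NumPy/SciPy; not the challenge's own evaluation software) of how the Dice coefficient and a symmetric mean surface distance can be computed for binary masks. Leaving `spacing` unset reproduces the voxel-unit convention used in Table 2, whereas passing the voxel spacing would yield physical units.

```python
import numpy as np
from scipy import ndimage

def dice_coefficient(seg, ref):
    """Volume overlap between a binary segmentation and a binary reference."""
    seg, ref = seg.astype(bool), ref.astype(bool)
    denom = seg.sum() + ref.sum()
    return 1.0 if denom == 0 else 2.0 * (seg & ref).sum() / denom

def mean_surface_distance(seg, ref, spacing=None):
    """Symmetric mean distance between mask surfaces.

    With spacing=None the result is in voxel units (as in Table 2); passing
    the voxel spacing as `spacing` gives the distance in physical units.
    """
    seg, ref = seg.astype(bool), ref.astype(bool)
    # Surface voxels: the mask minus its morphological erosion.
    seg_surf = seg & ~ndimage.binary_erosion(seg)
    ref_surf = ref & ~ndimage.binary_erosion(ref)
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_ref = ndimage.distance_transform_edt(~ref_surf, sampling=spacing)
    dist_to_seg = ndimage.distance_transform_edt(~seg_surf, sampling=spacing)
    d = np.concatenate([dist_to_ref[seg_surf], dist_to_seg[ref_surf]])
    return d.mean()
```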
MODALITY                MRce      CTce      CTce      CTce      CT        CT        CT        CT
PARTICIPANT             Heinrich  Jiménez   He        Cid       Kahl      Jiménez   He        Cid
Left kidney             0.862     0.91      0.91      -         0.934     0.784     -         -
Right kidney            0.855     0.889     0.922     -         0.915     0.79      -         -
Spleen                  0.724     0.73      0.896     -         0.87      0.703     0.874     -
Liver                   0.837     0.887     0.933     -         0.921     0.866     0.923     -
Left lung               -         0.959     0.966     0.974     0.972     0.972     0.952     0.972
Right lung              -         0.963     0.966     0.973     0.975     0.975     0.957     0.974
Bladder                 0.494     0.679     -         -         0.763     0.698     -         -
Pancreas                -         0.423     -         -         0.383     0.408     -         -
Gallbladder             -         0.484     -         -         0.19      0.276     -         -
Thyroid                 -         0.41      -         -         0.424     0.549     -         -
Aorta                   -         0.721     -         -         0.847     0.761     -         -
Trachea                 -         0.855     -         -         0.931     0.92      -         -
Sternum                 -         0.762     -         -         0.83      0.753     -         -
1st lumbar vertebra     -         0.523     -         -         0.775     0.718     -         -
Left adrenal gland      -         0.331     -         -         0.282     0.373     -         -
Right adrenal gland     -         0.342     -         -         0.22      0.355     -         -
Left psoas major        0.801     0.794     -         -         0.861     0.806     -         -
Right psoas major       0.772     0.799     -         -         0.847     0.787     -         -
Left rectus abdominis   -         0.474     -         -         0.746     0.551     -         -
Right rectus abdominis  -         0.453     -         -         0.679     0.519     -         -

Table 1: Segmentation results in terms of the Dice coefficient, classified by modality and organ (columns list the first author of each participating method, all "et al."; ce = contrast-enhanced).

MODALITY                MRce      CTce      CTce      CTce      CT        CT        CT        CT
PARTICIPANT             Heinrich  Jiménez   He        Cid       Kahl      Jiménez   He        Cid
Left kidney             0.251     0.172     0.171     -         0.147     1.209     -         -
Right kidney            0.3       0.243     0.131     -         0.229     1.307     -         -
Spleen                  1.138     2.005     0.385     -         0.534     1.974     0.36      -
Liver                   0.935     0.514     0.203     -         0.299     0.78      0.239     -
Left lung               -         0.071     0.069     0.05      0.045     0.043     0.101     0.05
Right lung              -         0.065     0.078     0.052     0.043     0.038     0.094     0.046
Bladder                 2.632     1.879     -         -         1.057     1.457     -         -
Pancreas                -         3.804     -         -         4.478     5.521     -         -
Gallbladder             -         3.603     -         -         9.617     5.938     -         -
Thyroid                 -         3.337     -         -         2.163     1.466     -         -
Aorta                   -         0.899     -         -         0.542     0.938     -         -
Trachea                 -         0.223     -         -         0.083     0.103     -         -
Sternum                 -         1.094     -         -         0.798     1.193     -         -
1st lumbar vertebra     -         4.504     -         -         2.424     1.953     -         -
Left adrenal gland      -         3.115     -         -         3.298     2.672     -         -
Right adrenal gland     -         2.66      -         -         7.046     3.445     -         -
Left psoas major        0.493     0.742     -         -         0.443     0.595     -         -
Right psoas major       0.569     0.757     -         -         0.55      0.775     -         -
Left rectus abdominis   -         6.068     -         -         1.614     0.355     -         -
Right rectus abdominis  -         6.6       -         -         1.922     4.032     -         -

Table 2: Segmentation results in terms of mean surface distance, in voxels (which may have different physical meanings depending on the resolution of a particular modality).

According to these tables, different algorithms perform well for different anatomies. In the contrast-enhanced MR modality (MRce), we had only a single participant, Heinrich et al., potentially due to the difficulty of automatic segmentation in this modality. This group thus became the unchallenged winner of MRce for the structures they participated in. Note that the surface error results are reported in voxels, and MRce has a significantly lower resolution than the other modalities. In CTce, He et al. performed best for the six structures they participated in, with some ties with Jiménez et al. The latter group segmented all the given structures in CTce, some of them with satisfactory accuracy and others with potentially unusable results. We had the most participants for the CT modality, in which the lungs (a relatively easier segmentation problem) were segmented successfully by most participants, potentially close to the accuracy of inter-subject annotations. For most other structures for which successful segmentations were achieved in CT, Kahl et al. were the winners of the challenge. Nevertheless, for structures where only lower-fidelity segmentations (below the 0.6 Dice cutoff) were attained, Jiménez et al. are seen to provide better segmentation estimates, likely due to their segmentation approach being atlas-based.
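To make the winner-selection rule stated before Table 1 concrete, the short sketch below (our own illustration, not part of the challenge tooling) applies the 0.6 Dice cutoff and the 0.01 tie margin to one row of Table 1; entries marked "-" are simply omitted.

```python
def winners(dice_scores, cutoff=0.6, tie_margin=0.01):
    """Participants within `tie_margin` Dice of the best score, provided the
    best score clears the `cutoff` for a successful segmentation."""
    if not dice_scores:
        return []
    best = max(dice_scores.values())
    if best < cutoff:
        return []  # no winner declared for this organ/modality
    return [p for p, d in dice_scores.items() if best - d <= tie_margin]

# Left lung in CT (Table 1); the '-' columns are left out.
print(winners({"Kahl": 0.972, "Jiménez": 0.972, "He": 0.952, "Cid": 0.972}))
# -> ['Kahl', 'Jiménez', 'Cid'], i.e. a three-way tie within 0.01 Dice
```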
It is also observed that, despite the relatively good contrast of CT, several structures (prominently the pancreas, gallbladder, thyroid, and adrenal glands) are still quite challenging to segment in CT, potentially due to the lower sensitivity of CT to these structures, further complicated by their difficult-to-generalize shapes.

4 Conclusions

The VISCERAL Anatomy3 challenge had a total of 23 virtual machines allocated to participants at a time, although not all participants ultimately submitted results for the challenge. Most participants relied on atlas-based segmentation methods, although there were also techniques that use anatomy-based reasoning and locational relations. With the online leaderboard evaluation remaining open, more participants are expected to submit results for our Anatomy3 challenge in the future.

5 Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2014) under grant agreement no. 318068 (VISCERAL).

References

[JdTGM+14] Oscar Alfonso Jiménez del Toro, Orcun Goksel, Bjoern Menze, Henning Müller, Georg Langs, Marc-André Weber, Ivan Eggel, Katharina Gruenberg, Markus Holzer, Georgios Kotsios-Kontokotsios, Markus Krenn, Roger Schaer, Abdel Aziz Taha, Marianne Winterstein, and Allan Hanbury. VISCERAL – VISual Concept Extraction challenge in RAdioLogy: ISBI 2014 challenge organization. In Orcun Goksel, editor, Proceedings of the VISCERAL Challenge at ISBI, number 1194 in CEUR Workshop Proceedings, pages 6–15, Beijing, China, May 2014.

[LMMH13] Georg Langs, Henning Müller, Bjoern H. Menze, and Allan Hanbury. VISCERAL: Towards large data in medical imaging – challenges and directions. Lecture Notes in Computer Science, 7723:92–98, 2013.