    Automating Bronchoconstriction Analysis based on U-Net
                                              [Industrial and Application paper]

                                                     Christian Steinmeyer
                                                       Susann Dehmel
                                                        David Theidel
                                                        Armin Braun
                                                         Lena Wiese∗
                                         christian.steinmeyer@item.fraunhofer.de
                                            susann.dehmel@item.fraunhofer.de
                                             david.theidel@item.fraunhofer.de
                                             armin.braun@item.fraunhofer.de
                                               lena.wiese@item.fraunhofer.de
                              Fraunhofer Institute for Toxicology and Experimental Medicine
                                                    Hannover, Germany
ABSTRACT
Advances in deep learning enable the automation of a multitude of image analysis tasks. Yet, many solutions still rely on less automated, less advanced processes. To transition from an existing solution to a deep learning based one, an appropriate dataset needs to be created and preprocessed, and a model needs to be developed and trained on these data. We successfully employ this process for bronchoconstriction analysis in Precision Cut Lung Slices for pre-clinical drug research. Our automated approach uses a variant of U-net for the core task of airway segmentation and reaches a (mean) Intersection over Union of 0.9. It performs comparably to the previous semi-manual approach, but is approximately 80 times faster.

KEYWORDS
Image segmentation, Preprocessing pipeline, Medical image analysis, Bronchoconstriction, Lung tissue slices

1 INTRODUCTION
Inflammatory and allergic lung diseases, such as lung injury, pneumonia, asthma, chronic obstructive pulmonary disease, or pulmonary hypertension, are still difficult to treat. A common symptom of these diseases is the obstruction of the airways, or Airway Hyperresponsiveness (AHR). Bronchodilators aim to avoid or reduce these symptoms.

With advances in the field of deep learning, computers reach and sometimes surpass human-level performance in specific tasks like image analysis [7]. In semantic segmentation, Convolutional Neural Networks (CNNs) achieve state-of-the-art results [13]. In this work, we describe our process of changing from a conventional, semi-manual workflow to a fully automated one using Neural Networks (NNs) for the evaluation of bronchodilators in microscopy image data from Precision Cut Lung Slices (PCLSs). We focus on the underlying dataset curation, image labelling, and processing. We present an optimized data processing pipeline from raw images to model input. We transparently develop a NN model for the task of semantic segmentation of airways. We also explore image quality as a means to improve predictions. Finally, we show preliminary results.

∗ Also with Institute of Computer Science, Goethe University Frankfurt.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

2 BACKGROUND
There is a continuous need for innovative or advanced drugs to treat the symptoms of obstructive lung diseases. PCLS is one of the models developed to evaluate bronchodilatory compounds in the non-clinical phase of drug development. Bronchoconstriction analysis with this model has been applied in house since 2012. PCLS allows targeting small airways of the lung in different animals (mouse, rat, or non-human primates) or in human lung explant tissue. They can be analyzed to assess the potency of bronchodilatory drugs through concentration-response curves. Mean inhibitory concentrations (IC50 values) can be compared to reference drugs within one human donor or animal to compare the efficacy of the drugs [14, 19, 22, 24]. Bronchoconstriction can also be compared across different species (e. g., for AHR in asthma) [9]. In 13 studies, we tested bronchodilators or bronchoconstricting compounds and the mode of action of drugs as contract research with the pharmaceutical industry or within public studies (sponsored by BMBF, BMWi, DFG).

2.1 Image Acquisition

Figure 1: Experiment setup for image acquisition.

We use images that were acquired in previous studies testing the efficacy of bronchodilatory drugs. In those studies, PCLSs were prepared from tumor-free parts of lung
explants or from freshly isolated lungs after necropsy of animals, as described in [18] (see Figure 1, step 1). Lung lobes were cut into approximately 300 µm thick slices and transferred into petri dishes. Airways within PCLSs were imaged and digitized using an inverted microscope and a digital video camera (see Figure 1, step 2). Camera control and image analysis were achieved with the AxioVision software (Zeiss, Germany). Each experiment setup differs in compound(s), doses, or specimen and is repeated with at least two samples. Throughout an experiment, images are taken at regular intervals. The resulting data are image series of varying length, depending on the experiment (e. g., reaction times). In approximately 1750 such series from around 420 different experiment setups, over 55000 images are available.

2.2 Semi-Manual Airway Segmentation
In our previous studies, we used a semi-manual approach to analyze the microscopy images. The ImageJ plugin "PCLS Area Tool" was used for the airway segmentation. It is based on the IsoData approach [20], which uses different brightness levels between the object of interest and the background¹. It has an automatic option (that we call "IsoData" here), as well as one with manual intervention (that we call "IsoData+"). The automatic option does not yield satisfactory results (see Section 6), mainly because it does not include locality: if surrounding tissue has a similar luminance as the airway, a frequent result is a correctly segmented airway surrounded by incorrectly segmented other tissue. The result can be vastly improved by manually setting the threshold and defining a region of interest around the airway. But as this took time in past studies, we automate this segmentation task.

¹ For details on IsoData, see https://imagej.net/Auto_Threshold.html#IsoData (last accessed on 2021–03–18).

3 RELATED WORK
In recent years, NNs have caused big leaps forward in semantic segmentation tasks [16, 25]. In 2015, Ronneberger et al. published the U-net architecture for image segmentation in a biomedical setting [21]. It was designed to cope with few training data and high resolutions, typical for medical image analysis. It consists of an encoder and a decoder. The encoder serves as a feature detector: through a series of convolutions and max pool operations, it reduces the resolution (x and y dimensions) of the data while increasing its depth (number of feature maps). The decoder serves as an upsampler: through a series of convolutions and deconvolutions, it increases the resolution again while decreasing the depth. The network also contains skip connections from each layer before max pooling (encoder) to the corresponding layer after deconvolution (decoder). Finally, in the last layer, the output is mapped to the desired number of classes.

In the context of lungs, U-net is frequently used. It can be extended to fit specific tasks, e. g., through adding residual and recurrent connectivity [2], combining it with other architectures [4, 12], or extending it with a third dimension [12, 17]. We therefore select it as the starting point for our model. For a more generic review of deep learning techniques for semantic segmentation, see [11].

4 AUTOMATING SEGMENTATION
To automate airway segmentation with NNs, we need labelled data for supervised training. To the best of our knowledge, no public dataset for airway segmentation exists. Thus, we create our own (Section 4.1) and propose a data pipeline for preprocessing (Section 4.2). In Section 4.3, we describe the development of an ML model for semantic segmentation and analyse it in Sections 4.4 and 4.5.

4.1 Image Labelling
Because we are working towards the goal of optimal automatic segmentation, the quality of the data labels is essential: the upper limit of model performance depends on the label quality, and deviation from the ground truth means a loss in model potential. Because the existing tools yield unsatisfactory labels, the image labelling is performed manually: each label is created by a trained specialist. It is then reviewed and possibly adjusted by another one.

Figure 2: Schematic depiction of a PCLS section (grey), containing an airway (blue). (a) 3D representation; the dark plane represents the camera's focus plane. (b) 2D representation from above, similar to the camera perspective.

The main target of bronchoconstriction analysis is the relative change of airway volume. Our annotations resemble an approximation of the ground truth: we consider airways simplified as prisms (see Figure 2a). The volume V of a prism is defined as V = b ∗ h, where b is the base area and h is the height of the prism. The airway cross-section surface is thus proportional to the airway volume, and we can use it as an indicator for the volume. If an airway lies perfectly orthogonal to the camera, all cross-section surfaces overlap. If not, cross-sections of different depths are shifted on the image plane (see Figure 2b). Some cross-sections might be out of focus, and choosing an arbitrary cross-section might lead to a less stable model. To overcome these issues, we label the cross-section in the focus plane, i. e., the plane that lies parallel to the camera and is the sharpest. This includes more detail and allows more unambiguous labels. We create labels as RGB images representing class membership on pixel level, where the target class is one of background, border, or airway. Each target class is assigned a unique RGB color with maximal distance to the other classes.

The labelled dataset should resemble real-life data. In order to select a diverse dataset from as few images as possible, we choose one image per experiment setup as described in Section 2.1, yielding 420 samples.

4.2 Data Pipeline

Figure 3: Data preprocessing pipeline from a sample to model input tensor.

The unprocessed sample data is preprocessed as depicted in the pipeline in Figure 3 before being fed to a model. First,
one sample is defined as a raw image, its corresponding label created as described in Section 4.1, as well as its meta data (e. g., quality, see Section 5). The samples are collected in what we call the original dataset. From this original set, we can create a subset (e. g., by quality). This step is optional and depends on the goals of the test run.

Since working with image data is resource intensive, we aim to optimize computation time and cost. To that end, we store the dataset variants in the tfrecords format². We read those files into tensorflow.data.Dataset³ data structures for further processing. Both are part of the tensorflow framework [1] and highly optimized [8]. They support lazy evaluation, i. e., letting us read data on demand, which reduces memory needs and allows us to work with bigger images. We also use them to minimize idle times of Graphics Processing Units (GPUs). We transform the data to tensors. We do this early because, that way, we profit most from tensorflow's computation graph optimizations [8]. In most test runs, the data is then shuffled and split into train, test, and validation sets. Our model expects input tensors of a fixed shape. The tensor shape is defined by the image resolution (width, height) and the number of (color) channels (depth). Some of the available images are colored. As the color channels might store useful information, we decode all input images to three-channel RGB format. While in other domains image tiling is applied to reduce model size [21], in our case the area covered by airways varies and often takes up the majority of an image. Thus, we do not apply tiling, but bilinearly resize the images to fixed dimensions (exact dimensions vary by experiment).

² For details on tfrecords, see https://www.tensorflow.org/tutorials/load_data/tfrecord#setup (last accessed on 2021–03–18).
³ For details on tf.data.Dataset, see https://www.tensorflow.org/api_docs/python/tf/data/Dataset (last accessed on 2021–03–18).
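The reading, decoding, and resizing steps map directly onto the tf.data API. The following is a minimal sketch; the TFRecord feature keys ("image", "label") and the file name are assumptions for illustration, and our actual pipeline contains further steps (subset selection, splitting):

```python
import tensorflow as tf

# Assumed TFRecord feature layout; the keys in our dataset may differ.
FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.string),
}

def parse_sample(record, size=128):
    example = tf.io.parse_single_example(record, FEATURES)
    # Decode to three-channel RGB, even for greyscale sources.
    image = tf.io.decode_png(example["image"], channels=3)
    label = tf.io.decode_png(example["label"], channels=3)
    # No tiling: bilinearly resize to the fixed model input shape.
    image = tf.image.resize(image, (size, size), method="bilinear")
    label = tf.image.resize(label, (size, size), method="bilinear")
    return image, label

dataset = (
    tf.data.TFRecordDataset(["train.tfrecords"])  # lazy, on-demand reads
    .map(parse_sample, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .shuffle(buffer_size=512)
    .batch(8)
    .prefetch(tf.data.experimental.AUTOTUNE)      # minimize GPU idle time
)
```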
We want to use class-based metrics (see Section 4.4) and hence one-hot encode the label images. To that end, we transform them from RGB space to HSV space. We apply the same transformation to the target class representations described in Section 4.1. We empirically chose thresholds for hue (10), saturation (100), and value (100) to determine whether or not a pixel belongs to a target class. Each pixel is assigned to at most one target class that way. For each label, we use this to create binary masks for the classes. Finally, we concatenate the binary masks as channels, forming the one-hot encoded labels (see the last step in Figure 3).
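A sketch of this encoding step is given below. The class colors and the 0–255 scaling of the HSV channels are assumptions for illustration, not the exact values used in our pipeline:

```python
import tensorflow as tf

# Placeholder class colors (Section 4.1); the actual RGB values differ.
CLASS_COLORS = {"background": (0, 0, 255), "border": (0, 255, 0),
                "airway": (255, 0, 0)}

# Thresholds from the text; we assume they refer to a 0-255-like scale.
H_TOL, S_TOL, V_TOL = 10.0, 100.0, 100.0

def one_hot_from_rgb(label_rgb):
    """label_rgb: float32 tensor in [0, 255], shape (h, w, 3)."""
    hsv = tf.image.rgb_to_hsv(label_rgb / 255.0) * 255.0
    masks = []
    for color in CLASS_COLORS.values():
        ref = tf.constant([[list(color)]], tf.float32)       # shape (1, 1, 3)
        ref_hsv = tf.image.rgb_to_hsv(ref / 255.0) * 255.0
        diff = tf.abs(hsv - ref_hsv)
        # A pixel belongs to the class if all channels are within tolerance.
        close = (diff[..., 0] <= H_TOL) & (diff[..., 1] <= S_TOL) \
                & (diff[..., 2] <= V_TOL)
        masks.append(tf.cast(close, tf.float32))
    # Concatenate the binary masks as channels: the one-hot encoded label.
    return tf.stack(masks, axis=-1)
```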
4.3 Model Development Process
Within the scope of the model development process, we work towards fast exploration and experimentation. Maintaining high code quality and avoiding bugs are further meta goals. Hence, we employ the following mechanisms. Our model development process can be described as two intertwined cycles: research on one hand and development on the other (see Figure 4). We start by phrasing a question like "How does image size influence model performance?". If that question can be answered with the current code basis, we continue with the research cycle. We define the necessary test run parameters and configuration before executing it in our research environment. This also serves the purpose of documentation and increases reproducibility, as the same configuration can be run again. We use jupyter notebooks [15] to enable interaction with the results. As soon as new questions arise, we repeat the cycle.

Figure 4: Research and development cycle. VCS stands for Version Control System.

If a question cannot be answered with the current code basis, we continue with the software development cycle. Driven by reusability and modularization, we formulate generalized requirements from the question (e. g., resize images). We then implement and test the missing features using test-driven development [5]. In an effort to fail cheap and fast, we create unit and integration tests for our custom code with pytest⁴. In combination with syntactic checks, they ensure intended behavior. We integrate them in a Continuous Integration (CI) pipeline [10]. As a result, breaking changes are noticed and errors are identified quickly (see Figure 4). All code is managed in a Version Control System (VCS). Once the features work locally, they are merged into the remote repository of the VCS. This triggers our CI pipeline, ensuring proper behavior in the test environment. If any of the tests fail, we repeat the implementation and the following steps (cf. dotted arrows in Figure 4). On success, we deploy the new code and continue the research cycle.

⁴ For details on pytest, see https://docs.pytest.org/en/stable/ (last accessed on 2021–03–18).
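For the "resize images" requirement above, such a unit test might look as follows. This is a minimal sketch; resize_image is a hypothetical stand-in for the corresponding helper in our code base:

```python
import numpy as np
import pytest
import tensorflow as tf

# Hypothetical helper under test; our actual module layout differs.
def resize_image(image, size):
    return tf.image.resize(image, (size, size), method="bilinear")

@pytest.mark.parametrize("size", [32, 64, 128, 256, 512, 1024])
def test_resize_image_yields_fixed_shape(size):
    image = np.zeros((481, 373, 3), dtype=np.float32)  # arbitrary input shape
    resized = resize_image(image, size)
    # The model expects fixed-shape input tensors (Section 4.2).
    assert resized.shape == (size, size, 3)
```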
4.4 Metrics and Defaults
In order to compare the performance of different models or settings, we use numeric metrics. In our semantic segmentation
task, we use per-pixel labelling. Due to strong class imbalance (usually there is more background than foreground), simply using accuracy does not suffice. By choosing metrics that consider both the number of correct predictions and the ratio of classes, we better represent the actual prediction quality. Due to its widespread use [11] and its intuitive definition, we use the (mean) Intersection over Union (mIoU), or Jaccard index, defined as:

    mIoU = \frac{1}{n+1} \sum_{i=0}^{n} \frac{TP_i}{TP_i + FP_i + FN_i}    (1)

where n is the number of target classes, TP_i is the number of true positive, FP_i the number of false positive, and FN_i the number of false negative pixels for the target class i.

We also use the (mean) Dice Similarity Coefficient (mDSC), also called quotient of similarity, F1-score, or harmonic mean of precision and recall (cf. [2]), defined as:

    mDSC = \frac{1}{n+1} \sum_{i=0}^{n} \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}    (2)

We use Categorical Cross Entropy (CCE) as both loss and metric (lower is better), defined as:

    CCE = -\sum_{j=0}^{m} y_j \log p_j    (3)

where m is the total number of values (pixels through all channels), p_j is the model prediction for index j, and y_j is the corresponding target value.
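Equations (1)–(3) translate directly into code. A minimal numpy sketch for illustration (in our pipeline, the metrics are computed within tensorflow):

```python
import numpy as np

def _class_counts(y_true, y_pred, i):
    # Per-class true positive, false positive, and false negative pixels;
    # inputs are one-hot arrays of shape (..., n_classes) with values {0, 1}.
    t, p = y_true[..., i], y_pred[..., i]
    tp = np.sum(t * p)
    fp = np.sum((1 - t) * p)
    fn = np.sum(t * (1 - p))
    return tp, fp, fn

def mean_iou(y_true, y_pred):
    scores = []
    for i in range(y_true.shape[-1]):
        tp, fp, fn = _class_counts(y_true, y_pred, i)
        scores.append(tp / (tp + fp + fn))      # Equation (1), per class
    return np.mean(scores)

def mean_dsc(y_true, y_pred):
    scores = []
    for i in range(y_true.shape[-1]):
        tp, fp, fn = _class_counts(y_true, y_pred, i)
        scores.append(2 * tp / (2 * tp + fp + fn))  # Equation (2), per class
    return np.mean(scores)

def cce(y_true, p_pred, eps=1e-7):
    # Equation (3): sum over all pixels and channels.
    return -np.sum(y_true * np.log(np.clip(p_pred, eps, 1.0)))
```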
In the upcoming experiments, we consider different variables, like resolution, and their effect on prediction performance. We use the following default setting, unless specified otherwise: (1) U-net with two encoder and decoder blocks, (2) image dimensions of 128 by 128, (3) 64 filters in the first convolutional block, (4) spatial dropout of 0.5 in the encoder, (5) ReLU as activation function in inner layers, (6) SoftMax as activation function in the output layer, (7) CCE as loss function, (8) batch normalization between convolutional layers, and (9) training for 50 epochs with a batch size of 8. Using batch normalization and dropout reduces overfitting.
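In Keras terms, one possible reading of these defaults is sketched below; details the list does not fix (e. g., the optimizer, the exact dropout placement) are assumptions:

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (BatchNormalization, Concatenate, Conv2D,
                                     Conv2DTranspose, MaxPooling2D,
                                     SpatialDropout2D)

def conv_block(x, filters):
    # Two 3x3 convolutions with batch normalization between conv layers.
    x = Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = BatchNormalization()(x)
    x = Conv2D(filters, 3, padding="same", activation="relu")(x)
    return BatchNormalization()(x)

def build_unet(size=128, n_classes=3, filters=64, blocks=2):
    inputs = Input((size, size, 3))
    x, skips = inputs, []
    for i in range(blocks):                        # encoder
        x = conv_block(x, filters * 2 ** i)
        x = SpatialDropout2D(0.5)(x)               # spatial dropout in encoder
        skips.append(x)
        x = MaxPooling2D(2)(x)
    x = conv_block(x, filters * 2 ** blocks)       # bottleneck
    for i in reversed(range(blocks)):              # decoder
        x = Conv2DTranspose(filters * 2 ** i, 2, strides=2, padding="same")(x)
        x = Concatenate()([x, skips[i]])           # skip connection
        x = conv_block(x, filters * 2 ** i)
    outputs = Conv2D(n_classes, 1, activation="softmax")(x)
    return Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam",                    # optimizer is an assumption
              loss="categorical_crossentropy",
              metrics=["categorical_crossentropy"])
```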
Further, we use a variant of ten-fold cross-validation: first, we split the data into train, test, and validation sets. For training, we then further split the train set into ten separate folds, each missing a different tenth of the original set. We use the validation set to observe training performance. After training, we use the test set to obtain results (i. e., the ten different folds are evaluated on the same test data). We choose this variant to be able to better compare evaluations of different subsets (see Section 5). The reported metrics refer to either the average (CCE, mDSC, mIoU) over all folds or the sum (duration). All test runs are performed on an Intel(R) Xeon(R) Gold 6252 CPU and two NVIDIA Tesla T4 GPUs.
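The splitting scheme can be sketched as follows; the set sizes are illustrative (only the test set size of 30 is stated, in Section 5):

```python
import numpy as np

def ten_fold_variant(n_samples, seed=42):
    """Split into test/validation/train, then build ten training folds,
    each missing a different tenth of the train set. All folds are later
    evaluated on the same test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test, n_val = 30, 30                      # validation size assumed
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    tenths = np.array_split(train, 10)
    folds = [np.concatenate(tenths[:i] + tenths[i + 1:]) for i in range(10)]
    return folds, val, test
```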
4.5 Experiments and Results
The approach described in Sections 4.2 and 4.3 enables us to efficiently perform experiments. We can evaluate hypotheses in fast cycles and robustly implement missing functionality. We start with small images. Then, we identify and address issues that arise as we scale up. In a first experiment, the input images and labels are bilinearly resized to differently sized squares (for images of size 1024, the batch size is reduced to 4). The results are depicted in Table 1. The best results are achieved when using the smallest images, with increasing image size leading to decreasing performance.

Image Size    CCE     mDSC    mIoU    Duration
        32    0.124   0.951   0.908    0:14:55
        64    0.118   0.948   0.904    0:15:40
       128    0.128   0.942   0.893    0:26:16
       256    0.168   0.925   0.866    0:54:34
       512    0.255   0.877   0.786    3:18:38
      1024    0.313   0.852   0.749   13:56:10

Table 1: Experiment results for different image sizes, trained for 50 epochs each (duration in hours:minutes:seconds).

We conclude that the task of airway segmentation is easier if the image resolution is smaller, i. e., there are fewer details available. Also, models with smaller image sizes converge faster, which we show in another experiment: we increase the maximum number of epochs, but stop training early if the model converges. To evaluate convergence, we consider the validation loss and stop learning when there is no improvement for twelve epochs. This further reduces overfitting. The results are depicted in Table 2. When comparing the average durations in both experiments, we see: while models trained on images smaller than 128 pixels stop training earlier than 50 epochs, those trained on bigger images need more epochs before stopping. This also improves the results for the bigger image sizes.

Image Size    CCE     mDSC    mIoU    Duration
        32    0.113   0.947   0.902    0:10:00
        64    0.110   0.948   0.903    0:13:54
       128    0.119   0.945   0.898    0:26:18
       256    0.148   0.936   0.883    1:39:14
       512    0.19    0.91    0.84     7:35:53
      1024    0.220   0.893   0.811   35:19:46

Table 2: As Table 1, but up to 500 epochs with early stopping.
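This convergence criterion maps directly onto a standard Keras callback. A sketch, continuing the model and dataset variables from the earlier sketches; restoring the best weights is our assumption:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for twelve epochs.
early_stopping = EarlyStopping(monitor="val_loss", patience=12,
                               restore_best_weights=True)

model.fit(train_dataset, validation_data=val_dataset,
          epochs=500, callbacks=[early_stopping])
```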
Further, we identify the receptive field as a limiting factor for larger images. The receptive field of a filter is defined by the kernel size, stride, dilation, and padding of the previous layers [6]. In the case of U-net, the receptive field at the deepest level grows exponentially with the number of encoder blocks: it has a side length of 14 pixels for one block, 32 for two blocks, 68 for three, 140 for four, 284 for five, etc. In the next experiment, we adapt the number of U-net blocks for larger input sizes (see Table 3).

Image Size    RF           CCE     mDSC    mIoU    Duration
       256    68 (b=3)     0.149   0.934   0.881    1:27:00
       512    68 (b=3)     0.144   0.934   0.879    9:30:14
       512    140 (b=4)    0.133   0.942   0.893    8:47:55
      1024    68 (b=3)     0.192   0.907   0.835   39:21:09
      1024    140 (b=4)    0.179   0.920   0.856   22:45:53
      1024    284 (b=5)    0.194   0.915   0.847   26:12:40

Table 3: Like Table 2, but with increased receptive fields (RF); b denotes the number of U-net blocks.
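These side lengths follow from composing the per-layer receptive field rules of [6] over the U-net blocks. A short sketch, assuming a standard U-net with two 3×3 convolutions per block and 2×2 max pooling:

```python
def unet_receptive_field(blocks, convs_per_block=2, kernel=3, pool=2):
    """Receptive field side length at the deepest level of a standard U-net."""
    rf, jump = 1, 1
    for _ in range(blocks):
        rf += convs_per_block * (kernel - 1) * jump  # convolutions widen the RF
        rf += (pool - 1) * jump                      # so does max pooling
        jump *= pool                                 # pooling doubles the stride
    rf += convs_per_block * (kernel - 1) * jump      # bottleneck convolutions
    return rf

print([unet_receptive_field(b) for b in range(1, 6)])  # [14, 32, 68, 140, 284]
```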
5 QUALITY META LABEL
Apart from the model, we also reconsider our dataset in order to improve results. Our dataset is heterogeneous: it contains images from various specimens and was collected by five researchers. Due to the different experiment setups, the images also vary in visual properties like brightness (standard deviation of the per-image average normalized brightness of 0.11), hue (0.22), saturation (0.12), and value (0.11) in HSV space. This affects the overall quality of each image. We empirically identify the following crucial factors: (1) contrast between the airway and the surrounding tissue, (2) airway alignment in relation to the camera, (3) sharpness of the airway edge, and (4) amount of airway occlusion (by tissue, particles, or image distortions like flares). Because we found our own perception influenced by these factors, we hypothesize this might be the case for our model as well. Based on these criteria, we manually categorize the samples into 22 "bad", 345 "good", and 53 "great" ones and perform further experiments.

First, we inspect data subsets by quality. A test set of 30 samples is randomly selected from the original dataset. It contains two "bad", 24 "good", and four "great" samples and exhibits similar visual properties: it differs only slightly in average brightness (−0.01), contrast (+0.001), hue (+0.03), saturation (+0.02), and value (−0.01). The remaining data is split by quality label. Each subset is used as a training set and evaluated on the test set. The results are depicted in Table 4. We can explain the main difference in performance by the different subset sizes: the best outcomes result from training on 321 "good" samples, while using 49 "great" samples yielded inferior results, followed by 20 "bad" samples.

Next, we use random undersampling to balance the three subsets. When we repeat the experiment with only 20 samples per subset, the performance difference decreases (see the subscript "us" in Table 4). The prediction performance strongly correlates with our assigned quality label. We hypothesize that the order in which the samples are presented to the model might influence its performance. We test this hypothesis in an experiment in which we use up to three distinct subsets to iteratively train a model, as sketched after Table 4. We stop training early and maintain the trained state of the model between sets. Because the order matters, we average the results of ten independent iterations instead of using cross-validation as before. The results are depicted in Table 4. Overall, the differences between the settings are small, but there seems to be a slight advantage in starting with the worst and finishing with the best samples (bad→good→great).

Train Set and Order    CCE     mDSC    mIoU    Duration
bad                    1.592   0.647   0.493    0:07:55
good                   0.14    0.932   0.875    0:33:38
great                  0.42    0.856   0.757    0:09:42
bad_us                 1.404   0.651   0.498    0:07:32
good_us                1.156   0.748   0.616    0:18:28
great_us               0.553   0.836   0.726    0:08:18
bad→good→great         0.146   0.945   0.898    0:38:00
bad→great→good         0.14    0.934   0.879    0:38:53
good→bad→great         0.187   0.937   0.885    0:41:19
good→great→bad         0.216   0.933   0.878    0:37:20
great→bad→good         0.136   0.934   0.878    0:36:41
great→good→bad         0.228   0.934   0.881    0:37:32
good→great             0.168   0.941   0.891    0:32:03
great→good             0.129   0.936   0.881    0:33:59
all                    0.116   0.941   0.889    0:30:51

Table 4: Experiment results for different data subsets by quality meta label and different training orders (the subscript "us" denotes a balanced (undersampled) version; set_a→set_b denotes that set_b was used for training after set_a).
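A minimal sketch of this staged training, which keeps the model state between quality subsets; the dataset variables are placeholders, and early_stopping is the callback from Section 4.5:

```python
# Iterative training on quality subsets (here: bad -> good -> great);
# model weights carry over from one subset to the next.
for subset in [bad_dataset, good_dataset, great_dataset]:
    model.fit(subset, validation_data=val_dataset,
              epochs=500, callbacks=[early_stopping])
```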
6 APPROACH COMPARISON
Considering the lessons learned in Sections 4 and 5, we create a model that differs from our defaults in the following ways to achieve the best trade-off between image resolution and prediction quality: (1) U-net with four encoder and decoder blocks, (2) image dimensions of 512 by 512 pixels, (3) training for 500 epochs with early stopping and a batch size of 8, and (4) training by quality meta label (bad→good→great).

To compare our new approach to the semi-manual one, we use the test set defined in Section 5. The above U-net is trained on all other data before being evaluated on the test data. An exemplary result is depicted in Figure 5.

Figure 5: Sample input (left), target output (middle), and model output (right).

We let an expert use the old approach on the raw images of the test set. We then compare the semi-manual and the automated predictions with the labels created in Section 4.1 (Table 5). While our new approach yields slightly less accurate predictions, it requires only a fraction of the time (mainly because it does not require human interaction).

Approach    mDSC    mIoU    Duration
IsoData     0.538   0.392    0:12:01
IsoData+    0.966   0.935    0:58:32
U-net       0.935   0.881    0:00:44

Table 5: Evaluation results for the different approaches.

7 DISCUSSION AND CONCLUSION
In this work, we demonstrated how we transition from a semi-manual analysis of bronchodilators to an automated
workflow using deep learning. To that end, we created a heterogeneous dataset. We presented how to use Tensorflow to create an optimized data pipeline; this pipeline can be used in other projects using Tensorflow with minor adjustments. Further, we propose an iterative model development process applicable to any data science project that requires custom code. We showed the capabilities of our approach in a set of experiments for the semantic segmentation of airways. In these experiments, we trained a variant of U-net to segment airways in PCLS microscopy images with a mIoU of 0.881 and a mDSC of 0.935. We also demonstrated that image quality and training order can improve predictions. In comparison to our previous semi-manual approach, the proposed automated method yields comparable results, but does so significantly faster. Thus, our automated approach constitutes a valid alternative. These segmentation results assign individual pixels to the airway target class. We can count those pixels to derive the airway area. From this, we can determine the relative change of the area (and hence the volume, as described in Section 4.1) in a sequence of images and thus determine the bronchoconstriction.
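This last step reduces to counting class pixels per frame. A minimal sketch; the airway channel index is a placeholder depending on the channel order:

```python
import numpy as np

AIRWAY = 2  # placeholder channel index of the airway class

def airway_area(prediction):
    """Count the pixels assigned to the airway class; `prediction` is
    the softmax model output of shape (h, w, n_classes)."""
    return np.sum(np.argmax(prediction, axis=-1) == AIRWAY)

def relative_area_change(series_predictions):
    """Airway area over an image series, relative to the first image."""
    areas = np.array([airway_area(p) for p in series_predictions], dtype=float)
    return areas / areas[0]
```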
Parts of our implementation are affected by randomness. As such, small differences might be due to chance. We expect additional improvements in semantic segmentation performance once we add traditional or NN-based data augmentation as in [23]. So far, we deliberately did not inspect this option, in order to focus on the effect of other variables. The amplitude of the differences reported in this work might decrease once data augmentation is applied, and so might the amount of needed training data. Additionally, it should increase the robustness of our model. As the underlying data are image series, recurrent connections as in [2] could have the same effects. Further, we aim to extend our continuous integration pipeline with continuous deployment to further optimize our use of resources. There also exist other architectures that might be suitable for the task and require an in-depth evaluation. For example, [3] proposes a fully convolutional network using dilated convolutions. Dilations allow them to increase receptive fields without increasing the kernel size (i. e., covering a wider image area while maintaining the same number of parameters). Finally, we plan to make our labelled dataset publicly available in the future.
ACKNOWLEDGMENTS
This work was supported by Fraunhofer Internal Programs under Grant No. Attract 042-601000 and by the Central Innovation Programme for small and medium-sized enterprises (ZIM) on behalf of the Federal Ministry for Economic Affairs and Energy (BMWi) under Grant No. ZF4196502 CS7.

REFERENCES
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[2] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari. 2018. Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation. arXiv preprint arXiv:1802.06955 (2018).
[3] M. Anthimopoulos, S. Christodoulidis, L. Ebner, T. Geiser, A. Christe, and S. Mougiakakou. 2019. Semantic Segmentation of Pathological Lung Tissue With Dilated Fully Convolutional Networks. IEEE Journal of Biomedical and Health Informatics 23, 2 (3 2019), 714–722.
[4] T. Araújo, G. Aresta, A. Galdran, P. Costa, A. M. Mendonça, and A. Campilho. 2018. UOLO - Automatic object detection and segmentation in biomedical images. In Lecture Notes in Computer Science, Vol. 11045 LNCS. Springer Verlag, 165–173.
[5] K. Beck. 2003. Test Driven Development: By Example. Addison-Wesley Professional, 240.
[6] B. Behboodi, M. Fortin, C. J. Belasso, R. Brooks, and H. Rivaz. 2020. Receptive Field Size as a Key Design Parameter for Ultrasound Image Segmentation with U-Net. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS (2020), 2117–2120.
[7] A. Buetti-Dinh, V. Galli, S. Bellenberg, O. Ilie, M. Herold, et al. 2019. Deep neural networks outperform human expert's capacity in characterizing bioleaching bacterial biofilm composition. Biotechnology Reports 22 (6 2019), e00321.
[8] S. W. D. Chien, S. Markidis, V. Olshevsky, Y. Bulatov, E. Laure, and J. Vetter. 2019. TensorFlow Doing HPC. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 509–518.
[9] O. Danov, S. Jimenez Delgado, H. Drake, S. Schindler, O. Pfennig, C. Förster, A. Braun, and K. Sewald. 2014. Species comparison of interleukin-13 induced airway hyperreactivity in precision-cut lung slices. Pneumologie 68, 06 (6 2014), A1.
[10] P. M. Duvall, S. Matyas, and A. Glover. 2007. Continuous integration: improving software quality and reducing risk. Pearson Education, 336.
[11] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez. 2017. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv preprint arXiv:1704.06857 (2017).
[12] A. Garcia-Uceda Juarez, R. Selvan, Z. Saghir, and M. de Bruijne. 2019. A Joint 3D UNet-Graph Neural Network-Based Method for Airway Segmentation from Chest CTs. In Lecture Notes in Computer Science, Vol. 11861 LNCS. Springer, 583–591.
[13] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, et al. 2017. Brain tumor segmentation with Deep Neural Networks. Medical Image Analysis 35 (1 2017), 18–31.
[14] H. D. Held, C. Martin, and S. Uhlig. 1999. Characterization of airway and vascular responses in murine lungs. British Journal of Pharmacology 126, 5 (1999), 1191–1199.
[15] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, et al. 2016. Jupyter Notebooks—a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas - Proceedings of the 20th International Conference on Electronic Publishing, ELPUB 2016. 87–90.
[16] J. Long, E. Shelhamer, and T. Darrell. 2015. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
[17] S. A. Nadeem, E. A. Hoffman, and P. K. Saha. 2019. A fully automated CT-based airway segmentation algorithm using deep learning and topological leakage detection and branch augmentation approaches. In Medical Imaging 2019: Image Processing, Vol. 10949. SPIE, 11.
[18] V. Neuhaus, O. Danov, S. Konzok, H. Obernolte, S. Dehmel, et al. 2018. Assessment of the cytotoxic and immunomodulatory effects of substances in human precision-cut lung slices. Journal of Visualized Experiments 2018, 135 (5 2018), 57042.
[19] A. R. Ressmeyer, A. K. Larsson, E. Vollmer, S. E. Dahlèn, S. Uhlig, and C. Martin. 2006. Characterisation of guinea pig precision-cut lung slices: Comparison with human tissues. European Respiratory Journal 28, 3 (9 2006), 603–611.
[20] T. W. Ridler and S. Calvard. 1978. Picture Thresholding Using an Iterative Selection Method. IEEE Transactions on Systems, Man and Cybernetics SMC-8, 8 (1978), 630–632.
[21] O. Ronneberger, P. Fischer, and T. Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Lecture Notes in Computer Science, Vol. 9351. Springer, 234–241.
[22] J. Vietmeier, F. Niedorf, W. Bäumer, C. Martin, E. Deegen, B. Ohnesorge, and M. Kietzmann. 2007. Reactivity of equine airways - A study on precision-cut lung slices. Veterinary Research Communications 31, 5 (7 2007), 611–619.
[23] J. Wang, L. Perez, et al. 2017. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit 11 (2017).
[24] A. Wohlsen, C. Martin, E. Vollmer, D. Branscheid, H. Magnussen, et al. 2003. The early allergic response in small airways of human precision-cut lung slices. European Respiratory Journal 21, 6 (6 2003), 1024–1032.
[25] B. Zhao, J. Feng, X. Wu, and S. Yan. 2017. A survey on deep learning-based fine-grained object classification and semantic segmentation. International Journal of Automation and Computing 14, 2 (2017), 119–135.