Automating Bronchoconstriction Analysis based on U-Net [Industrial and Application paper]

Christian Steinmeyer, Susann Dehmel, David Theidel, Armin Braun, Lena Wiese*
christian.steinmeyer@item.fraunhofer.de, susann.dehmel@item.fraunhofer.de, david.theidel@item.fraunhofer.de, armin.braun@item.fraunhofer.de, lena.wiese@item.fraunhofer.de
Fraunhofer Institute for Toxicology and Experimental Medicine, Hannover, Germany
* Also with Institute of Computer Science, Goethe University Frankfurt.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Advances in deep learning enable the automation of a multitude of image analysis tasks. Yet, many solutions still rely on less automated, less advanced processes. To transition from an existing solution to a deep learning based one, an appropriate dataset needs to be created and preprocessed, and a model needs to be developed and trained on these data. We successfully employ this process for bronchoconstriction analysis in Precision Cut Lung Slices for pre-clinical drug research. Our automated approach uses a variant of U-net for the core task of airway segmentation and reaches a (mean) Intersection over Union of 0.9. It performs comparably to the semi-manual previous approach, but is approximately 80 times faster.

KEYWORDS

Image segmentation, Preprocessing pipeline, Medical image analysis, Bronchoconstriction, Lung tissue slices

1 INTRODUCTION

Inflammatory and allergic lung diseases, such as lung injury, pneumonia, asthma, chronic obstructive pulmonary disease, or pulmonary hypertension, are still difficult to treat. A common symptom of these diseases is the obstruction of the airways, or Airway Hyperresponsiveness (AHR). Bronchodilators aim to avoid or reduce these symptoms.

With advances in the field of deep learning, computers reach and sometimes surpass human-level performance in specific tasks like image analysis [7]. In semantic segmentation, Convolutional Neural Networks (CNNs) achieve state-of-the-art results [13]. In this work, we describe our process of changing from a conventional, semi-manual workflow to a fully automated one using Neural Networks (NNs) for the evaluation of bronchodilators in microscopy image data from Precision Cut Lung Slices (PCLSs). We focus on the underlying dataset curation, image labelling, and processing. We present an optimized data processing pipeline from raw images to model input. We transparently develop a NN model for the task of semantic segmentation of airways. We also explore image quality as a means to improve predictions. Finally, we show preliminary results.

2 BACKGROUND

There is a continuous need for innovative or advanced drugs to treat the symptoms of obstructive lung diseases. PCLS is one of the models developed to evaluate bronchodilatory compounds in the non-clinical phase of drug development. Bronchoconstriction analysis with this model has been applied in house since 2012. PCLS allows targeting small airways of the lung in different animals (mouse, rat, or non-human primates) or in human lung explant tissue. They can be analyzed to assess the potency of bronchodilatory drugs through concentration-response curves. Mean inhibitory concentrations (IC50 values) can be compared to reference drugs within one human donor or animal to compare the efficacy of the drugs [14, 19, 22, 24]. Bronchoconstriction can also be compared across different species (e. g., for AHR in asthma) [9]. In 13 studies, we tested bronchodilators or bronchoconstricting compounds and the mode of action of drugs, as contract research with the pharmaceutical industry or within public studies (sponsored by BMBF, BMWi, DFG).
2.1 Image Acquisition

Figure 1: Experiment setup for image acquisition.

We use images that were acquired in previous studies testing the efficacy of bronchodilatory drugs. In those studies, PCLSs were prepared from tumor-free parts of lung explants or from freshly isolated lungs after necropsy of animals, as described in [18] (see Figure 1, step 1). Lung lobes were cut into approximately 300 µm thick slices and transferred into petri dishes. Airways within PCLSs were imaged and digitized using an inverted microscope and a digital video camera (see Figure 1, step 2). Camera control and image analysis were achieved with the AxioVision software (Zeiss, Germany). Each experiment setup differs in compound(s), doses, or specimen and is repeated with at least two samples. Throughout an experiment, images are taken at regular intervals. The resulting data are image series of varying length, depending on the experiment (e. g., reaction times). In approximately 1750 such series from around 420 different experiment setups, over 55000 images are available.

2.2 Semi-Manual Airway Segmentation

In our previous studies, we used a semi-manual approach to analyze the microscopy images. The ImageJ plugin "PCLS Area Tool" was used for the airway segmentation. It is based on the IsoData approach [20], which exploits the different brightness levels of the object of interest and the background¹. It has an automatic option (that we call "IsoData" here), as well as one with manual intervention (that we call "IsoData+"). The automatic option does not yield satisfactory results (see Section 6), mainly because it does not include locality: if the surrounding tissue has a luminance similar to the airway, a frequent result is a correctly segmented airway surrounded by this incorrectly segmented other tissue. The results can be vastly improved by manually setting the threshold and defining a region of interest around the airway. But as this took time in past studies, we automate this segmentation task.

¹ For details on IsoData, see https://imagej.net/Auto_Threshold.html#IsoData (last accessed on 2021-03-18).
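For illustration, the following is a minimal sketch of the iterative selection method of [20] that underlies the plugin; it is our own simplified reconstruction, not the plugin's actual code:

```python
import numpy as np

def isodata_threshold(image: np.ndarray, eps: float = 0.5) -> float:
    """Iterative selection threshold after Ridler and Calvard [20].

    Starts from the global mean intensity and repeatedly sets the
    threshold to the average of the mean intensities of the two
    resulting classes, until the threshold stabilizes.
    """
    pixels = image.ravel().astype(float)
    threshold = pixels.mean()
    while True:
        below = pixels[pixels <= threshold]
        above = pixels[pixels > threshold]
        new_threshold = (below.mean() + above.mean()) / 2.0
        if abs(new_threshold - threshold) < eps:
            return new_threshold
        threshold = new_threshold

# Pixels on one side of the threshold then form the candidate
# airway region, e.g.: mask = image < isodata_threshold(image)
```

Because the threshold is global, the method carries no notion of locality, which is exactly the failure mode described above for the automatic "IsoData" option.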
3 RELATED WORK

In recent years, NNs have caused big leaps forward in semantic segmentation tasks [16, 25]. In 2015, Ronneberger et al. published the U-net architecture for image segmentation in a biomedical setting [21]. It was designed to cope with few training data and high resolutions, typical for medical image analysis. It consists of an encoder and a decoder. The encoder serves as a feature detector: through a series of convolutions and max pooling operations, it reduces the resolution (x and y dimensions) of the data while increasing its depth (number of feature maps). The decoder serves as an upsampler: through a series of convolutions and deconvolutions, it increases the resolution again while decreasing the depth. The architecture also contains skip connections from each layer before max pooling (encoder) to the corresponding layer after deconvolution (decoder). Finally, in the last layer, the output is mapped to the desired number of classes.

In the context of lungs, U-net is frequently used. It can be extended to fit specific tasks, e. g., by adding residual and recurrent connectivity [2], combining it with other architectures [4, 12], or extending it with a third dimension [12, 17]. We therefore select it as the starting point for our model. For a more generic review of deep learning techniques for semantic segmentation, see [11].

4 AUTOMATING SEGMENTATION

To automate airway segmentation with NNs, we need labelled data for supervised training. To the best of our knowledge, no public dataset for airway segmentation exists. Thus, we create our own (Section 4.1) and propose a data pipeline for preprocessing (Section 4.2). In Section 4.3, we describe the development of an ML model for semantic segmentation and analyse it in Sections 4.4 and 4.5.

4.1 Image Labelling

Because we are working towards the goal of optimal automatic segmentation, the quality of the data labels is essential: the upper limit of model performance depends on the label quality, and any deviation from the ground truth means a loss in model potential. Because the existing tools yield unsatisfactory labels, the image labelling is performed manually: each label is created by a trained specialist and is then reviewed and, if necessary, adjusted by another one.

Figure 2: Schematic depiction of a PCLS section (grey), containing an airway (blue). (a) 3D representation; the dark plane represents the camera's focus plane. (b) 2D representation from above, similar to the camera perspective.

The main target of bronchoconstriction analysis is the relative change of airway volume. Our annotations resemble an approximation of the ground truth: we consider airways, in simplified form, as prisms (see Figure 2a). The volume V of a prism is defined as V = b · h, where b is the base area and h is the height of the prism. The airway cross-section surface is therefore proportional to the airway volume, so we can use it as an indicator for the volume (see the short derivation at the end of this subsection). If an airway lies perfectly orthogonal to the camera, all cross-section surfaces overlap. If not, cross-sections of different depths are shifted on the image plane (see Figure 2b). Some cross-sections might be out of focus, and choosing an arbitrary cross-section might lead to a less stable model. To overcome these issues, we label the cross-section in the focus plane, i. e., the plane that lies parallel to the camera and is the sharpest. This includes more detail and allows more unambiguous labels. We create labels as RGB images representing class membership on pixel level, where the target class is one of background, border, or airway. Each target class is assigned a unique RGB color with maximal distance to the other classes.

The labelled dataset should resemble real-life data. In order to select a diverse dataset from as few images as possible, we choose one image per experiment setup as described in Section 2.1, yielding 420 samples.
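Spelling out the proportionality argument from the prism paragraph above (our own one-line derivation, under the stated assumption of a constant slice height h):

    \frac{V_t}{V_0} = \frac{b_t \cdot h}{b_0 \cdot h} = \frac{b_t}{b_0}

That is, the relative change of the labelled cross-section area b between a frame at time t and the first frame equals the relative change of the airway volume V, which is the quantity of interest.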
4.2 Data Pipeline

Figure 3: Data preprocessing pipeline from a sample to a model input tensor.

The unprocessed sample data is preprocessed as depicted in the pipeline in Figure 3 before being fed to a model. First, one sample is defined as a raw image, its corresponding label created as described in Section 4.1, and its meta data (e. g., quality, see Section 5). The samples are collected in what we call the original dataset. From this original set, we can create a subset (e. g., by quality). This step is optional and depends on the goals of the test run.

Since working with image data is resource intensive, we aim to optimize computation time and cost. To that end, we store the dataset variants in the tfrecords format². We read those files into tensorflow.data.Dataset³ data structures for further processing. Both are part of the tensorflow framework [1] and highly optimized [8]. They support lazy evaluation, i. e., they let us read data on demand, reducing memory needs and allowing us to work with bigger images. We also use them to minimize idle times of Graphics Processing Units (GPUs). We transform the data to tensors early, because that way we profit most from tensorflow's computation graph optimizations [8]. In most test runs, the data is then shuffled and split into train, test, and validation sets.

Our model expects input tensors of a fixed shape. The tensor shape is defined by the image resolution (width, height) and the number of (color) channels (depth). Some of the available images are colored. As the color channels might store useful information, we decode all input images to three-channel RGB format. While in other domains image tiling is applied to reduce model size [21], in our case the area covered by airways varies and often takes up the majority of an image. Thus, we do not apply tiling, but bilinearly resize the images to fixed dimensions (the exact dimensions vary by experiment).

We want to use class based metrics (see Section 4.4) and hence one-hot encode the label images. To that end, we transform them from RGB space to HSV space. We apply the same transformation to the target class representations described in Section 4.1. We empirically chose thresholds for hue (10), saturation (100), and value (100) to determine whether or not a pixel belongs to a target class. Each pixel is assigned to at most one target class that way. For each label, we use this to create binary masks for the classes. Finally, we concatenate the binary masks as channels, forming the one-hot encoded labels (see the last step in Figure 3). A condensed sketch of such a pipeline follows this subsection.

² For details on tfrecords, see https://www.tensorflow.org/tutorials/load_data/tfrecord#setup (last accessed on 2021-03-18).
³ For details on tf.data.Dataset, see https://www.tensorflow.org/api_docs/python/tf/data/Dataset (last accessed on 2021-03-18).
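The sketch below is our own condensed illustration, not the project's code. The feature keys, PNG encoding, and per-class HSV reference colors are assumptions; it also approximates the per-channel thresholding described above by a nearest-reference-color assignment, which likewise assigns each pixel to at most one class.

```python
import tensorflow as tf

# Hypothetical tfrecords schema: one serialized example per sample,
# with a PNG-encoded raw image and an RGB label mask.
FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.string),
}
# Assumed per-class HSV reference colors (background, border, airway),
# scaled to [0, 1] as returned by tf.image.rgb_to_hsv.
CLASS_COLORS_HSV = tf.constant([[0.0, 0.0, 0.0],
                                [0.33, 1.0, 1.0],
                                [0.66, 1.0, 1.0]])

def parse(serialized, size=(128, 128)):
    example = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.image.decode_png(example["image"], channels=3)
    label = tf.image.decode_png(example["label"], channels=3)
    # Bilinear resize to the fixed model input shape (no tiling);
    # the label uses nearest neighbor to keep its colors intact.
    image = tf.image.resize(image, size, method="bilinear")
    label = tf.image.resize(label, size, method="nearest")
    # One-hot encode the label via the nearest HSV reference color.
    hsv = tf.image.rgb_to_hsv(tf.cast(label, tf.float32) / 255.0)
    dist = tf.norm(hsv[..., None, :] - CLASS_COLORS_HSV, axis=-1)
    one_hot = tf.one_hot(tf.argmin(dist, axis=-1), depth=3)
    return image / 255.0, one_hot

dataset = (tf.data.TFRecordDataset(["train.tfrecords"])
           .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(512)
           .batch(8)
           .prefetch(tf.data.AUTOTUNE))  # lazy evaluation, keeps the GPU busy
```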
4.3 Model Development Process

Within the scope of the model development process, we work towards fast exploration and experimentation. Maintaining high code quality and avoiding bugs are further meta goals. Hence, we employ the following mechanisms. Our model development process can be described as two intertwined cycles of research on the one hand and development on the other (see Figure 4). We start by phrasing a question like "How does image size influence model performance?". If that question can be answered with the current code basis, we continue with the research cycle: we define the necessary test run parameters and configuration before executing the run in our research environment. This also serves the purpose of documentation and increases reproducibility, as the same configuration can be run again. We use jupyter notebooks [15] to enable interaction with the results. As soon as new questions arise, we repeat the cycle.

Figure 4: Research and development cycle. VCS stands for Version Control System.

If a question cannot be answered with the current code basis, we continue with the software development cycle. Driven by reusability and modularization, we formulate generalized requirements from the question (e. g., resize images). We then implement and test the missing features using test driven development [5]. In an effort to fail cheaply and fast, we create unit and integration tests for our custom code with pytest⁴. In combination with syntactic checks, they ensure the intended behavior. We integrate them in a Continuous Integration (CI) pipeline [10]. As a result, breaking changes are noticed and errors identified quickly (see Figure 4). All code is managed in a Version Control System (VCS). Once the features work locally, they are merged into the remote repository of the VCS. This triggers our CI pipeline, ensuring proper behavior in the test environment. If any of the tests fail, we repeat the implementation and the following steps (cf. dotted arrows in Figure 4). On success, we deploy the new code and continue the research cycle.

⁴ For details on pytest, see https://docs.pytest.org/en/stable/ (last accessed on 2021-03-18).

4.4 Metrics and Defaults

In order to compare the performance of different models or settings, we use numeric metrics. In our semantic segmentation task, we use per-pixel labelling. Due to strong class imbalance (usually there is more background than foreground), simply using accuracy does not suffice. By choosing metrics that consider both the number of correct predictions and the ratio of classes, we better represent the actual prediction quality. Due to its widespread use [11] and its intuitive definition, we use (mean) Intersection over Union (mIoU), also known as the Jaccard index, defined as:

    mIoU = \frac{1}{n+1} \sum_{i=0}^{n} \frac{TP_i}{TP_i + FP_i + FN_i}    (1)

where n is the number of target classes, TP_i is the number of true positive, FP_i the number of false positive, and FN_i the number of false negative pixels for the target class i. We also use the (mean) Dice Similarity Coefficient (mDSC), also called quotient of similarity, F1-score, or harmonic average of precision and recall (cf. [2]), defined as:

    mDSC = \frac{1}{n+1} \sum_{i=0}^{n} \frac{2 \, TP_i}{2 \, TP_i + FP_i + FN_i}    (2)

We use Categorical Cross Entropy (CCE) as both loss and metric (lower is better), defined as:

    CCE = - \sum_{j=0}^{m} y_j \cdot \log p_j    (3)

where m is the total number of values (pixels across all channels), p_j is the model prediction for index j, and y_j is the corresponding target value.
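To make the definitions concrete, a minimal numpy sketch of the three metrics as we read them (our own illustration, not the project's code):

```python
import numpy as np

def class_counts(y_true, y_pred, i):
    """True/false positive and false negative pixel counts for class i.

    y_true, y_pred: arrays of shape (height, width, classes) with one-hot
    targets and binarized predictions (e.g., argmax of the softmax output,
    re-encoded as one-hot).
    """
    tp = np.sum((y_true[..., i] == 1) & (y_pred[..., i] == 1))
    fp = np.sum((y_true[..., i] == 0) & (y_pred[..., i] == 1))
    fn = np.sum((y_true[..., i] == 1) & (y_pred[..., i] == 0))
    return tp, fp, fn

def mean_iou(y_true, y_pred):
    """Eq. (1): IoU averaged over all one-hot encoded classes."""
    scores = []
    for i in range(y_true.shape[-1]):
        tp, fp, fn = class_counts(y_true, y_pred, i)
        scores.append(tp / (tp + fp + fn))
    return float(np.mean(scores))

def mean_dsc(y_true, y_pred):
    """Eq. (2): Dice coefficient averaged over all classes."""
    scores = []
    for i in range(y_true.shape[-1]):
        tp, fp, fn = class_counts(y_true, y_pred, i)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))

def cce(y_true, p_pred, eps=1e-7):
    """Eq. (3): categorical cross entropy over all pixel values."""
    return float(-np.sum(y_true * np.log(np.clip(p_pred, eps, 1.0))))
```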
In the upcoming experiments, we consider different variables, like resolution, and their effect on prediction performance. We use the following default setting, unless specified otherwise: (1) U-net with two encoder and two decoder blocks, (2) image dimensions of 128 by 128 pixels, (3) 64 filters in the first convolutional block, (4) spatial dropout of 0.5 in the encoder, (5) ReLU as activation function in the inner layers, (6) SoftMax as activation function in the output layer, (7) CCE as loss function, (8) batch normalization between convolutional layers, and (9) training for 50 epochs with a batch size of 8. Using batch normalization and dropout reduces overfitting. A sketch of a model assembled from these defaults is shown at the end of this subsection.

Further, we use a variant of ten-fold cross validation: first, we split the data into train, test, and validation sets. For training, we then further split the train set into ten separate folds, each missing a different tenth of the original set. We use the validation set to observe training performance. After training, we use the test set to obtain results (i. e., the ten different folds are evaluated on the same test data). We choose this variant to be able to better compare the evaluation of different subsets (see Section 5). The reported metrics refer to either the average (CCE, mDSC, mIoU) over all folds, or the sum (duration). All test runs are performed on an Intel(R) Xeon(R) Gold 6252 CPU and two NVIDIA Tesla T4 GPUs.
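The following Keras sketch assembles a U-net variant from the defaults above. It is a minimal illustration under our assumptions — two 3 × 3 convolutions per block, transposed convolutions for upsampling, and adam as a placeholder optimizer (the paper does not state the optimizer) — not the exact model used:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3 convolutions, each with batch normalization and ReLU.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return x

def build_unet(input_shape=(128, 128, 3), blocks=2, base_filters=64, classes=3):
    inputs = tf.keras.Input(shape=input_shape)
    x, skips = inputs, []
    # Encoder: conv block, spatial dropout of 0.5, then downsampling.
    for i in range(blocks):
        x = conv_block(x, base_filters * 2 ** i)
        x = layers.SpatialDropout2D(0.5)(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    # Bottleneck at the lowest resolution.
    x = conv_block(x, base_filters * 2 ** blocks)
    # Decoder: upsample, concatenate the skip connection, conv block.
    for i in reversed(range(blocks)):
        x = layers.Conv2DTranspose(base_filters * 2 ** i, 2,
                                   strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[i]])
        x = conv_block(x, base_filters * 2 ** i)
    # Map to the desired number of classes with a softmax output.
    outputs = layers.Conv2D(classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam", loss="categorical_crossentropy")
```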
4.5 Experiments and Results

The approach described in Sections 4.2 and 4.3 enables us to perform experiments efficiently: we can evaluate hypotheses in fast cycles and robustly implement missing functionality. We start with small images and then identify and address the issues that arise as we scale up. In a first experiment, the input images and labels are bilinearly resized to differently sized squares (for images of size 1024, the batch size is reduced to 4). The results are depicted in Table 1. The best results are achieved when using the smallest images, with increasing image size leading to decreasing performance.

Table 1: Experiment results for different image sizes, trained for 50 epochs each (duration in hours:minutes:seconds).

Image Size   CCE     mDSC    mIoU    Duration
32           0.124   0.951   0.908    0:14:55
64           0.118   0.948   0.904    0:15:40
128          0.128   0.942   0.893    0:26:16
256          0.168   0.925   0.866    0:54:34
512          0.255   0.877   0.786    3:18:38
1024         0.313   0.852   0.749   13:56:10

We conclude that the task of airway segmentation is easier if the image resolution is smaller, i. e., if fewer details are available. Also, models with smaller image sizes converge faster, which we show in another experiment: we increase the maximum number of epochs, but stop training early if the model converges. To evaluate convergence, we consider the validation loss and stop learning when there is no improvement for twelve epochs. This further reduces overfitting. The results are depicted in Table 2. When comparing the average durations in both experiments, we see that while models trained on images smaller than 128 pixels stop training before 50 epochs, those trained on bigger images need more epochs before stopping. This also improves the results for the bigger image sizes.

Table 2: As Table 1, but trained for up to 500 epochs with early stopping.

Image Size   CCE     mDSC    mIoU    Duration
32           0.113   0.947   0.902    0:10:00
64           0.110   0.948   0.903    0:13:54
128          0.119   0.945   0.898    0:26:18
256          0.148   0.936   0.883    1:39:14
512          0.19    0.91    0.84     7:35:53
1024         0.220   0.893   0.811   35:19:46

Further, we identify the receptive field as a limiting factor for larger images. The receptive field of a filter is defined by the kernel size, stride, dilation, and padding of the previous layers [6]. In the case of U-net, the receptive field at the deepest level grows exponentially with the number of encoder blocks: it has a side length of 14 pixels for one block, 32 for two blocks, 68 for three, 140 for four, 284 for five, and so on (consistent with the recurrence r_{k+1} = 2 r_k + 4 produced by two 3 × 3 convolutions per block followed by 2 × 2 max pooling). In the next experiment, we adapt the number of U-net blocks for larger input sizes (see Table 3).

Table 3: Like Table 2, but with increased receptive fields (RF); b denotes the number of U-net blocks.

Image Size   RF         CCE     mDSC    mIoU    Duration
256           68 (b=3)  0.149   0.934   0.881    1:27:00
512           68 (b=3)  0.144   0.934   0.879    9:30:14
512          140 (b=4)  0.133   0.942   0.893    8:47:55
1024          68 (b=3)  0.192   0.907   0.835   39:21:09
1024         140 (b=4)  0.179   0.920   0.856   22:45:53
1024         284 (b=5)  0.194   0.915   0.847   26:12:40

5 QUALITY META LABEL

Apart from the model, we also reconsider our dataset in order to improve results. Our dataset is heterogeneous: it contains images from various specimens and was collected by five researchers. Due to different experiment setups, the images also vary in visual properties like brightness (standard deviation of the per-image average normalized brightness of 0.11), hue (0.22), saturation (0.12), and value (0.11) in HSV space. This affects the overall quality of each image. We empirically identify the following crucial factors: (1) contrast between the airway and the surrounding tissue, (2) airway alignment in relation to the camera, (3) sharpness of the airway edge, and (4) amount of airway occlusion (by tissue, particles, or image distortions like flares). Because we found our own perception influenced by these factors, we hypothesize the same might be the case for our model. Based on these criteria, we manually categorize the samples into 22 "bad", 345 "good", and 53 "great" ones and perform further experiments.

First, we inspect data subsets by quality. A test set of 30 samples is randomly selected from the original dataset. It contains two "bad", 24 "good", and four "great" samples and exhibits similar visual properties: it differs only slightly in average brightness (−0.01), contrast (+0.001), hue (+0.03), saturation (+0.02), and value (−0.01). The remaining data is split by quality label. Each subset is used as a training set and evaluated on the test set. The results are depicted in Table 4. We can explain the main difference in performance by the different subset sizes: the best outcomes result from training on 321 "good" samples, while using 49 "great" samples yielded inferior results, followed by 20 "bad" samples.

Next, we use random undersampling to balance the three subsets. When we repeat the experiment with only 20 samples per subset, the performance difference decreases (see the subscript us in Table 4); the prediction performance now strongly correlates with our assigned quality label. We further hypothesize that the order in which the samples are presented to the model might influence its performance. We test this hypothesis in an experiment in which we use up to three distinct subsets to iteratively train a model, training on one subset after the other (see the sketch at the end of this section). We stop training early and maintain the trained state of the model between sets. Because the order matters, we average the results of ten independent iterations instead of using cross validation as before. The results are depicted in Table 4. Overall, the differences between settings are small, but there seems to be a slight advantage in starting with the worst and finishing with the best samples (bad→good→great).

Table 4: Experiment results for different data subsets by quality meta label and different orders. The subscript us denotes a balanced (undersampled) version; set_a→set_b denotes that set_b was used for training after set_a.

Train Set and Order   CCE     mDSC    mIoU    Duration
bad                   1.592   0.647   0.493   0:07:55
good                  0.14    0.932   0.875   0:33:38
great                 0.42    0.856   0.757   0:09:42
bad_us                1.404   0.651   0.498   0:07:32
good_us               1.156   0.748   0.616   0:18:28
great_us              0.553   0.836   0.726   0:08:18
bad→good→great        0.146   0.945   0.898   0:38:00
bad→great→good        0.14    0.934   0.879   0:38:53
good→bad→great        0.187   0.937   0.885   0:41:19
good→great→bad        0.216   0.933   0.878   0:37:20
great→bad→good        0.136   0.934   0.878   0:36:41
great→good→bad        0.228   0.934   0.881   0:37:32
good→great            0.168   0.941   0.891   0:32:03
great→good            0.129   0.936   0.881   0:33:59
all                   0.116   0.941   0.889   0:30:51
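The staged training could be implemented along the following lines — a sketch under our assumptions, reusing `model` from the Section 4.4 sketch and per-quality datasets built with the Section 4.2 pipeline (all dataset names are placeholders):

```python
import tensorflow as tf

# Hypothetical batched tf.data.Dataset objects per quality subset,
# plus a shared validation set; e.g. built as in Section 4.2.
ordered_subsets = [bad_ds, good_ds, great_ds]  # bad -> good -> great

# Early stopping on the validation loss with a patience of twelve
# epochs, as in Section 4.5.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=12)

for train_ds in ordered_subsets:
    # The model keeps its weights between stages, so each stage
    # continues training from the state reached in the previous one.
    model.fit(train_ds, validation_data=val_ds,
              epochs=500, callbacks=[early_stopping])
```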
6 APPROACH COMPARISON

Considering the lessons learned in Sections 4 and 5, we create a model that differs from our defaults in the following ways, to achieve the best trade-off between image resolution and prediction quality: (1) U-net with four encoder and four decoder blocks, (2) image dimensions of 512 by 512 pixels, (3) training for up to 500 epochs with early stopping and a batch size of 8, and (4) training by quality meta label (bad→good→great).

To compare our new approach to the semi-manual one, we use the test set defined in Section 5. The above U-net is trained on all other data before being evaluated on the test data. An exemplary result is depicted in Figure 5.

Figure 5: Sample input (left), target output (middle), and model output (right).

We let an expert use the old approach on the raw images of the test set. We then compare the semi-manual and the automated predictions with the labels created in Section 4.1 (Table 5). While our new approach yields slightly less accurate results, it requires only a fraction of the time (mainly because it does not require human interaction).

Table 5: Evaluation results for the different approaches.

Approach   mDSC    mIoU    Duration
IsoData    0.538   0.392   0:12:01
IsoData+   0.966   0.935   0:58:32
U-net      0.935   0.881   0:00:44

7 DISCUSSION AND CONCLUSION

In this work, we demonstrated how we transition from a semi-manual analysis of bronchodilators to an automated workflow using deep learning. To that end, we created a heterogeneous dataset. We presented how to use Tensorflow to create an optimized data pipeline; this pipeline can be used in other projects using Tensorflow with minor adjustments. Further, we propose an iterative model development process applicable to any data science project that requires custom code. We showed the capabilities of our approach in a set of experiments for the semantic segmentation of airways. In these experiments, we trained a variant of U-net to segment airways in PCLS microscopy images with an mIoU of 0.881 and an mDSC of 0.935. We also demonstrated that image quality and training order can improve predictions. In comparison to our previous semi-manual approach, the proposed automated method yields comparable results, but does so significantly faster. Thus, our automated approach constitutes a valid alternative. The segmentation results assign individual pixels to the airway target class. We can count those pixels to derive the airway area. From this, we can determine the relative change of the area (and hence of the volume, as described in Section 4.1) in a sequence of images and thus determine the bronchoconstriction.
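As an illustration of this last step (our own sketch; the airway is assumed to be the last channel of the one-hot prediction, as in the earlier examples):

```python
import numpy as np

def airway_area(prediction: np.ndarray, airway_channel: int = 2) -> int:
    """Count the pixels assigned to the airway class in a one-hot or
    softmax prediction of shape (height, width, classes)."""
    return int(np.sum(np.argmax(prediction, axis=-1) == airway_channel))

def relative_areas(predictions: list) -> list:
    """Airway area of each frame in an image series, relative to the
    first frame; e.g., a value of 0.8 corresponds to a 20% reduction
    of the cross-section area, and hence of the airway volume."""
    baseline = airway_area(predictions[0])
    return [airway_area(p) / baseline for p in predictions]
```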
Parts of our implementation are affected by randomness; as such, small differences might be due to chance. We expect additional improvements in semantic segmentation performance once we add traditional or NN based data augmentation as in [23]. So far, we deliberately did not inspect this option in order to focus on the effect of other variables. The amplitude of the differences reported in this work might decrease once data augmentation is applied, and so might the amount of needed training data. Additionally, it should increase the robustness of our model. As the underlying data are image series, recurrent connections as in [2] could have the same effects. Further, we aim to extend our continuous integration pipeline by continuous deployment to further optimize our use of resources. There also exist other architectures that might be suitable for the task and require an in-depth evaluation. For example, [3] proposes a fully convolutional network using dilated convolutions. Dilations increase the receptive field without increasing the kernel size (i. e., they cover a wider image area while maintaining the same number of parameters). Finally, we plan to make our labelled dataset publicly available in the future.

ACKNOWLEDGMENTS

This work was supported by Fraunhofer Internal Programs under Grant No. Attract 042-601000 and by the Central Innovation Programme for small and medium-sized enterprises (ZIM) on behalf of the Federal Ministry for Economic Affairs and Energy (BMWi) under Grant No. ZF4196502 CS7.
REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[2] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari. 2018. Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation. arXiv preprint arXiv:1802.06955 (2018).
[3] M. Anthimopoulos, S. Christodoulidis, L. Ebner, T. Geiser, A. Christe, and S. Mougiakakou. 2019. Semantic Segmentation of Pathological Lung Tissue With Dilated Fully Convolutional Networks. IEEE Journal of Biomedical and Health Informatics 23, 2 (3 2019), 714–722.
[4] T. Araújo, G. Aresta, A. Galdran, P. Costa, A. M. Mendonça, and A. Campilho. 2018. Uolo - Automatic object detection and segmentation in biomedical images. In Lecture Notes in Computer Science, Vol. 11045 LNCS. Springer Verlag, 165–173.
[5] K. Beck. 2003. Test Driven Development: By Example. Addison-Wesley Professional, 240.
[6] B. Behboodi, M. Fortin, C. J. Belasso, R. Brooks, and H. Rivaz. 2020. Receptive Field Size as a Key Design Parameter for Ultrasound Image Segmentation with U-Net. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS (2020), 2117–2120.
[7] A. Buetti-Dinh, V. Galli, S. Bellenberg, O. Ilie, M. Herold, et al. 2019. Deep neural networks outperform human expert's capacity in characterizing bioleaching bacterial biofilm composition. Biotechnology Reports 22 (6 2019), e00321.
[8] S. W. D. Chien, S. Markidis, V. Olshevsky, Y. Bulatov, E. Laure, and J. Vetter. 2019. TensorFlow Doing HPC. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 509–518.
[9] O. Danov, S. Jimenez Delgado, H. Drake, S. Schindler, O. Pfennig, C. Förster, A. Braun, and K. Sewald. 2014. Species comparison of interleukin-13 induced airway hyperreactivity in precision-cut lung slices. Pneumologie 68, 06 (6 2014), A1.
[10] P. M. Duvall, S. Matyas, and A. Glover. 2007. Continuous Integration: Improving Software Quality and Reducing Risk. Pearson Education, 336.
[11] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez. 2017. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv preprint arXiv:1704.06857 (2017).
[12] A. Garcia-Uceda Juarez, R. Selvan, Z. Saghir, and M. de Bruijne. 2019. A Joint 3D UNet-Graph Neural Network-Based Method for Airway Segmentation from Chest CTs. In Lecture Notes in Computer Science, Vol. 11861 LNCS. Springer, 583–591.
[13] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, et al. 2017. Brain tumor segmentation with Deep Neural Networks. Medical Image Analysis 35 (1 2017), 18–31.
[14] H. D. Held, C. Martin, and S. Uhlig. 1999. Characterization of airway and vascular responses in murine lungs. British Journal of Pharmacology 126, 5 (1999), 1191–1199.
[15] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, et al. 2016. Jupyter Notebooks — a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas — Proceedings of the 20th International Conference on Electronic Publishing, ELPUB 2016 (2016), 87–90.
[16] J. Long, E. Shelhamer, and T. Darrell. 2015. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
[17] S. A. Nadeem, E. A. Hoffman, and P. K. Saha. 2019. A fully automated CT-based airway segmentation algorithm using deep learning and topological leakage detection and branch augmentation approaches. In Medical Imaging 2019: Image Processing, Vol. 10949. SPIE, 11.
[18] V. Neuhaus, O. Danov, S. Konzok, H. Obernolte, S. Dehmel, et al. 2018. Assessment of the cytotoxic and immunomodulatory effects of substances in human precision-cut lung slices. Journal of Visualized Experiments 2018, 135 (5 2018), 57042.
[19] A. R. Ressmeyer, A. K. Larsson, E. Vollmer, S. E. Dahlèn, S. Uhlig, and C. Martin. 2006. Characterisation of guinea pig precision-cut lung slices: Comparison with human tissues. European Respiratory Journal 28, 3 (9 2006), 603–611.
[20] T. W. Ridler and S. Calvard. 1978. Picture Thresholding Using an Iterative Selection Method. IEEE Transactions on Systems, Man and Cybernetics SMC-8, 8 (1978), 630–632.
[21] O. Ronneberger, P. Fischer, and T. Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Lecture Notes in Computer Science, Vol. 9351. Springer, 234–241.
[22] J. Vietmeier, F. Niedorf, W. Bäumer, C. Martin, E. Deegen, B. Ohnesorge, and M. Kietzmann. 2007. Reactivity of equine airways — A study on precision-cut lung slices. Veterinary Research Communications 31, 5 (7 2007), 611–619.
[23] J. Wang, L. Perez, et al. 2017. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit 11 (2017).
[24] A. Wohlsen, C. Martin, E. Vollmer, D. Branscheid, H. Magnussen, et al. 2003. The early allergic response in small airways of human precision-cut lung slices. European Respiratory Journal 21, 6 (6 2003), 1024–1032.
[25] B. Zhao, J. Feng, X. Wu, and S. Yan. 2017. A survey on deep learning-based fine-grained object classification and semantic segmentation. International Journal of Automation and Computing 14, 2 (2017), 119–135.