=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_63
|storemode=property
|title=Using 2D and 3D Convolutional Neural Networks to Predict Semen Quality
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_63.pdf
|volume=Vol-2670
|authors=Jon-Magnus Rosenblad,Steven Hicks,Håkon Kvale Stensland,Trine B. Haugen,Pål Halvorsen,Michael Riegler
|dblpUrl=https://dblp.org/rec/conf/mediaeval/RosenbladHSHHR19
}}
==Using 2D and 3D Convolutional Neural Networks to Predict Semen Quality==
Jon-Magnus Rosenblad (1), Steven Hicks (2, 3), Håkon Kvale Stensland (1), Trine B. Haugen (3), Pål Halvorsen (2, 3), Michael Riegler (2, 4)

(1) Simula, Norway; (2) SimulaMet, Norway; (3) Oslo Metropolitan University, Norway; (4) Kristiania University College, Norway

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===ABSTRACT===
In this paper, we present the approach of team Jmag to solve this year’s Medico Multimedia Task as part of the MediaEval 2019 Benchmark. This year, the task focuses on automatically determining quality characteristics of human sperm through the analysis of microscopic videos of human semen and associated patient data. Our approach is based on deep convolutional neural networks (CNNs) of varying sizes and dimensions. Here, we aim to analyze both the spatial and temporal information present in the videos. The results show that the method holds promise for predicting the motility of sperm, but predicting morphology appears to be more difficult.

===1 INTRODUCTION===
In an effort to explore how medical multimedia can be used to create performant and efficient prediction algorithms, the 2019 Multimedia for Medicine Task [6] focuses on the analysis of microscopic videos of human semen to predict certain quality characteristics of spermatozoa. The challenge presents three different tasks, of which we focus on the two that are required in order to participate in this year’s challenge, namely the prediction of motility task and the prediction of morphology task. Motility and morphology are two metrics which are commonly used to determine the quality of a semen sample. Motility is the analysis of how each spermatozoon moves and is primarily split into three different categories: progressive, non-progressive, and immotile. Morphology refers to the shape and size of the sperm and may be split into three groups: sperm with head defects, tail defects, and midpiece defects. More information about the dataset can be found in the original publication [5].

===2 APPROACH===
Motility and morphology are properties of sperm which appear differently in videos of human semen. Motility may be difficult to assess looking only at the spatial dimension, as it is heavily dependent on the temporal information present in a video. By contrast, morphology is highly dependent on the visual features of the sperm and not necessarily their movement, although there may be some correlation between the movement and the morphology, e.g., a sperm with a tail defect may move slower. Consequently, predicting these two aspects of semen requires different approaches. To preserve both the temporal and the spatial information in a video when predicting motility, we use 3D convolutional neural networks (CNNs). When predicting morphology, we discard the temporal information and make a prediction based on a single frame using a 2D CNN. For the morphology approach, we use a higher video resolution to preserve the minor details present in the sperm’s appearance. In the following two sections, we present our approach of using CNNs to solve the required sub-tasks of this year’s Medico task.

====Motility====
We present two methods for predicting the motility. First, we use a simple 3D CNN to see how well a model using just a few layers performs on this task. Second, we present a deeper and more complex 3D CNN to see how this improves over the simpler model. The simple model uses a very shallow network architecture consisting of only two convolutional layers. The two layers have 32 filters each, with kernel sizes of 4 × 4 × 4 and 5 × 5 × 5, respectively; the output is then passed to a fully-connected layer before the prediction is made. The complex model consists of three consecutive convolutional blocks, where each block is made up of three convolutional layers and a pooling layer to reduce the spatial and temporal dimensions. Following the conventions of Lin et al. [8], we add a 1 × 1 × 1 convolutional layer at the end of the block to act as a pixel-wise fully-connected layer over the filters. The architectures of the simple and complex models are shown in Figure 1.

Because of the limited amount of data, we perform data augmentation during training. First, we extract 20 random samples from a single data point, on which we perform several augmentation techniques including random crops, noise injection, and vertical/horizontal flips. To decrease training time, we first downsample the resolution of each sample to 128 × 171 pixels for both the training and the validation dataset. We then randomly select intervals of 15 consecutive frames and randomly crop the image to 128 × 128 pixels. The frame samples are then randomly flipped both horizontally and vertically, with a probability of 0.5 each. Finally, we add noise by injecting each pixel with a random value drawn from a uniform distribution over the interval [−0.01, 0.01].

For validation, we split each video into blocks consisting of 15 consecutive frames and discard the frames that remain. Each frame block is then cropped to 128 × 128 pixels at the upper-left corner of the frame. We then calculate the average prediction score over all blocks of a video for each video, and take the average of these averages to get our final prediction score. We do this to avoid weighting longer videos more than shorter ones in our final score, so that each video is weighted the same.

====Morphology====
To predict morphology, we use a relatively deep 2D CNN, while avoiding making it too deep in order to avoid vanishing gradients [2, 4]. The network consists of five convolutional layers, each with a kernel size of 4 × 4 and with strides alternating between 4 × 4 and 1 × 1, starting with 4 × 4. Each layer pads its input with zeros so that the initial size is kept before striding.
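To see how quickly the alternating strides shrink the feature maps of the morphology network, the spatial size after each of the five convolutional layers can be traced by hand. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the zero padding behaves like the common "same" padding rule, where the output length is the input length divided by the stride and rounded up, and it uses the 240 × 320 frame resolution that the paper reports for the morphology experiments. The function names are hypothetical.

```python
def same_padded_size(size, stride):
    """Output length of a zero-padded ("same") convolution: ceil(size / stride)."""
    return -(-size // stride)  # integer ceiling division


def trace_feature_map(height, width, strides):
    """Return the (height, width) of the feature map after each layer."""
    shapes = []
    for stride in strides:
        height = same_padded_size(height, stride)
        width = same_padded_size(width, stride)
        shapes.append((height, width))
    return shapes


# Strides alternate between 4 and 1, starting with 4 (five layers in total).
STRIDES = [4, 1, 4, 1, 4]

# Assumed input: a single 240 x 320 frame.
shapes = trace_feature_map(240, 320, STRIDES)

for layer, (h, w) in enumerate(shapes, start=1):
    print(f"conv layer {layer}: stride {STRIDES[layer - 1]} -> {h} x {w}")
```

Under these assumptions the three strided layers reduce a 240 × 320 frame to 60 × 80, then 15 × 20, and finally 4 × 5, while the stride-1 layers leave the size unchanged.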
The five convolutional layers have 32, 128, 128, 512, and 512 filters, respectively. After the final convolutional layer, we pass the output through three fully-connected layers: one with 1024 nodes, one with 512 nodes, and the output layer with 3 nodes.

MediaEval’19, 27-29 October 2019, Sophia Antipolis, France

Figure 1 (a): An illustration of the simple model architecture (Input layer; 32 4 × 4 × 4 Conv; 32 5 × 5 × 5 Conv; Fully-Connected 256; Prediction). Further layer labels from the figure: Conv Block 1, 32 5 × 5 × 5, /2 × 4 × 4; Conv Block 2, 128 5 × 5 × 5, /2 × 4 × 4; Conv Block 3, 256 4 × 4 × 4, /2 × 2 × 2; block layers labelled m n × n × n Conv with /i × j × k.

{| class="wikitable"
|+ Table 1: The results for the prediction of motility task.
! Method !! Fold !! Prog MAE !! Prog RMSE !! Non-Prog MAE !! Non-Prog RMSE !! Immotile MAE !! Immotile RMSE !! Mean MAE !! Mean RMSE
|-
| Simple || 1 || 11.05 || 13.38 || 7.70 || 10.46 || 14.26 || 19.55 || 11.00 || 14.95
|-
| Simple || 2 || 9.66 || 12.87 || 6.87 || 7.91 || 26.75 || 31.64 || 14.43 || 20.24
|-
| Simple || 3 || 40.93 || 44.53 || 8.44 || 10.67 || 11.81 || 16.10 || 20.39 || 28.02
|-
| Simple || Mean || 20.55 || 23.59 || 7.67 || 9.68 || 17.61 || 17.61 || 15.27 || 21.07
|-
| Complex || 1 || 9.34 || 11.54 || 8.51 || 10.29 || 10.38 || 14.13 || 9.41 || 12.09
|-
| Complex || 2 || 10.08 || 12.72 || 5.71 || 6.88 || 7.76 || 10.71 || 7.85 || 10.39
|-
| Complex || 3 || 11.05 || 13.35 || 8.01 || 10.18 || 8.62 || 11.33 || 9.23 || 11.69
|-
| Complex || Mean || 10.16 || 12.54 || 7.41 || 9.12 || 8.92 || 12.06 || 8.83 || 11.39
|-
| ZeroR || 1 || 18.01 || 21.05 || 8.03 || 9.91 || 15.59 || 22.47 || 13.88 || 18.68
|-
| ZeroR || 2 || 18.88 || 22.06 || 7.62 || 8.61 || 14.27 || 17.44 || 13.59 || 16.98
|-
| ZeroR || 3 || 15.45 || 17.74 || 9.46 || 11.61 || 11.37 || 14.04 || 12.09 || 14.68
|-
| ZeroR || Mean || 17.45 || 20.37 || 8.37 || 10.12 || 13.74 || 18.31 || 13.19 || 16.86
|}

{| class="wikitable"
|+ Table 2: The results for the prediction of morphology task.
! Method !! Fold !! Head MAE !! Head RMSE !! Midpiece MAE !! Midpiece RMSE !! Tail MAE !! Tail RMSE !! Mean MAE !! Mean RMSE
|-
| Single Frame || 1 || 2.40 || 2.74 || 8.36 || 9.44 || 8.68 || 11.20 || 6.48 || 8.60
|-
| Single Frame || 2 || 2.73 || 3.17 || 8.19 || 9.87 || 6.93 || 9.95 || 5.92 || 8.29
|-
| Single Frame || 3 || 2.88 || 3.40 || 8.45 || 10.59 || 6.55 || 8.63 || 5.96 || 8.13
|-
| Single Frame || Mean || 2.67 || 3.10 || 8.33 || 9.97 || 7.39 || 9.93 || 6.13 || 8.38
|-
| ZeroR || 1 || 1.90 || 2.36 || 8.73 || 9.86 || 7.33 || 9.33 || 5.99 || 7.95
|-
| ZeroR || 2 || 2.72 || 3.17 || 8.01 || 9.76 || 7.25 || 9.99 || 5.99 || 8.27
|-
| ZeroR || 3 || 2.22 || 2.98 || 8.47 || 10.70 || 6.76 || 8.66 || 5.82 || 8.13
|-
| ZeroR || Mean || 2.28 || 2.86 || 8.40 || 10.12 || 7.11 || 9.34 || 5.93 || 8.12
|}
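As a quick check on the reported means, the gap between each model and the ZeroR baseline can be computed directly from the Mean MAE entries in the "Mean" rows of Table 1 and Table 2. The sketch below is illustrative only: the dictionary layout and the helper name `improvement_over_zeror` are assumptions made here, while the numbers are copied from the tables.

```python
# Mean MAE per method, copied from the "Mean" rows of Table 1 and Table 2.
MOTILITY_MEAN_MAE = {"Simple": 15.27, "Complex": 8.83, "ZeroR": 13.19}
MORPHOLOGY_MEAN_MAE = {"Single Frame": 6.13, "ZeroR": 5.93}


def improvement_over_zeror(mean_mae, method):
    """Hypothetical helper: ZeroR MAE minus the method's MAE.

    A positive value means the method has a lower error than the baseline.
    """
    return round(mean_mae["ZeroR"] - mean_mae[method], 2)


for method in ("Simple", "Complex"):
    gain = improvement_over_zeror(MOTILITY_MEAN_MAE, method)
    print(f"motility   {method:<12} {gain:+.2f}")

gain = improvement_over_zeror(MORPHOLOGY_MEAN_MAE, "Single Frame")
print(f"morphology {'Single Frame':<12} {gain:+.2f}")
```

With these table values, only the complex 3D CNN comes out ahead of the baseline (by 4.36 in the units of the tables); the simple 3D CNN and the single-frame 2D CNN both fall behind it.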
Each layer in the network uses the ReLU activation function, except for the output layer, which uses a linear activation.

For both training and validation, data was prepared similarly to that of the motility experiments, the only difference being that we used a single frame to make predictions at a resolution of 240 × 320 and did not perform any cropping. We still, however, performed noise injection with noise drawn from the same distribution, and random flips using the same probabilities.

Figure 1 (b): An illustration of the convolutional block used in the complex model. Figure 1 (c): An illustration of the complex model architecture. Figure 1: The CNN architectures used for the prediction of motility task.

====Training====
All models were trained for a maximum of 200 epochs, only interrupting the training if the evaluation loss did not improve over the last 10 epochs. The models were trained using the deep learning library Keras [3] with a TensorFlow [1] back-end. The experiments were run on a machine consisting of a single Nvidia RTX 2080Ti graphics card, 128 GB of RAM, and an Intel Xeon Gold 5120 CPU clocked at 2.20 GHz. Each motility model was trained with a batch size of 64 using the Adam optimizer [7] configured as described in the original paper. The morphology model was trained using the same configuration, only with a smaller batch size of 16.

===3 RESULTS AND DISCUSSION===
Looking at the motility experiments (Table 1), we see that the complex model achieves much better results than the simple model (a mean MAE of 8.83 versus 15.27). This suggests that the complex model is able to extract more relevant information from the data to make better predictions. Comparing the complex model to the ZeroR baseline, we see a mean absolute error (MAE) improvement of 0.0436 (4.36 in the units of Table 1), which shows that the deep learning model is at the very least able to learn to associate some movement of the sperm with the related motility values. For the morphology experiments (Table 2), we see that our model fails to beat predicting the mean value of the labels (ZeroR). It fails to learn the individual shape of each sperm and to collectively predict the total for each category. For future work, we will increase the size of the network to make it more adaptable, which may bring other challenges such as making the network harder to train due to the increased risk of vanishing gradients [2, 4].

===4 CONCLUSION===
In this paper, we presented the work done as part of the Medico Multimedia Task, where we participated in two of the three available subtasks. We used deep CNNs for both tasks, achieving an average MAE of 0.0883 for the motility task and an average MAE of 0.0613 for the morphology task (8.83 and 6.13 in the units of Tables 1 and 2).

===REFERENCES===
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). https://www.tensorflow.org/ Software available from tensorflow.org.

[2] Yoshua Bengio, Patrice Simard, Paolo Frasconi, and others. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.

[3] François Chollet and others. 2015. Keras. https://keras.io. (2015).

[4] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 249–256.

[5] Trine B. Haugen, Steven A. Hicks, Jorunn M. Andersen, Oliwia Witczak, Hugo L. Hammer, Rune Borgli, Pål Halvorsen, and Michael A. Riegler. 2019. VISEM: A Multimodal Video Dataset of Human Spermatozoa. In Proceedings of the 10th ACM on Multimedia Systems Conference (MMSys’19). https://doi.org/10.1145/3304109.3325814

[6] Steven Hicks, Pål Halvorsen, Trine B Haugen, Jorunn M Andersen, Oliwia Witczak, Konstantin Pogorelov, Hugo L Hammer, Duc-Tien Dang-Nguyen, Mathias Lux, and Michael Riegler. 2019. Medico Multimedia Task at MediaEval 2019. In CEUR Workshop Proceedings - Multimedia Benchmark Workshop (MediaEval).

[7] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[8] Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).