=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_56
|storemode=property
|title=Using Deep Learning to Predict Motility and Morphology of Human Sperm
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_56.pdf
|volume=Vol-2670
|authors=Steven Hicks,Trine B. Haugen,Pål Halvorsen,Michael Riegler
|dblpUrl=https://dblp.org/rec/conf/mediaeval/HicksHHR19
}}
==Using Deep Learning to Predict Motility and Morphology of Human Sperm==
Steven Hicks¹,², Trine B. Haugen², Pål Halvorsen¹,², Michael Riegler¹,³
¹SimulaMet, Norway  ²Oslo Metropolitan University, Norway  ³Kristiania University College, Norway

ABSTRACT

In the Medico Task 2019, the main focus is to predict sperm quality based on videos and other related data. In this paper, we present the approach of team LesCats, which is based on deep convolutional neural networks, where we experiment with different data preprocessing methods to predict the morphology and motility of human sperm. The achieved results show that deep learning is a promising method for human sperm analysis. Our best method achieves a mean absolute error of 8.962 for the motility task and a mean absolute error of 5.303 for the morphology task.

1 INTRODUCTION

In an effort to explore how medical multimedia can be used to create high-performing and efficient prediction algorithms, the Multimedia for Medicine (Medico) Task presents different use cases [6] which challenge computer science researchers to explore a field with much potential for real-world impact. This year's task differs from previous years as it focuses on the analysis of microscopic videos of human semen to assess the quality of sperm. The videos are taken from the open-source VISEM dataset [4]. The challenge presents three different tasks, of which we decided to focus on the tasks that are required in order to participate in this year's challenge, i.e., the prediction of motility task and the prediction of morphology task. The tasks themselves are further described in the overview paper [5].

2 APPROACH

Our approach is based on deep learning, using deep convolutional neural networks (CNNs) to predict sperm motility and sperm morphology. All experiments aim to utilize the information in the videos to their fullest while still keeping the computational complexity low. The experiments can primarily be split into four distinct groups. Firstly (i), we combine multiple frames channel-wise using different stride values (the distance between selected frames) and feed this directly into the deep neural network. Secondly (ii), we vary the number of frames used in each sample to see how this may affect the algorithm's prediction performance. Thirdly (iii), we threshold the colors of each frame in an attempt to separate the spermatozoa's bright color from the darker background, and use this information for prediction. Lastly (iv), we add the patient data to the video analysis to see how this may help the prediction. Because morphology focuses more on the visual appearance of sperm than on their movement, we opted to perform the threshold experiments only for the motility task. Internally, we experimented with a wide variety of configurations, but only submitted the best results as the official runs. In the following sections, we give a brief explanation of our experimental setup (the training configuration and data preparation common to each model) and a more detailed description of each approach.

2.1 Experimental Setup

For each experiment, we use the Inception V3 [7] architecture for our deep learning model, which was trained for as long as it improved on the validation loss. This means that the models were trained until the mean absolute error had not improved over the last 100 epochs. Each model was trained with a batch size of 16, using Nadam [3] to optimize the weights with a learning rate of 0.001. The models were implemented using the Keras [2] deep learning library with a TensorFlow [1] back-end. Each experiment was performed on what would be considered "consumer-grade" hardware, specifically a desktop computer running Arch Linux with an Intel Core i7 processor, 16 gigabytes of RAM, and an Nvidia GTX 1080 Ti graphics card. As the videos in the provided dataset vary in length (ranging from 2 to 7 minutes), we extracted a number of clips (one clip contains a sequence of frames) from each video before training. The clips were extracted at evenly spaced intervals throughout the entire video, meaning we get a set of clips which accurately represents any given semen recording. For both the prediction of motility task and the prediction of morphology task, we use ZeroR as a baseline to measure our results.
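As a rough illustration of this shared setup, a minimal sketch in Keras with a TensorFlow back-end is given below. The regression head, number of outputs, and the commented-out training call are assumptions made for illustration, not the authors' released code.

```python
# Minimal sketch of the shared training configuration from Section 2.1.
# The regression head and data-loading step are illustrative assumptions.
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Nadam


def build_model(input_shape=(224, 224, 3), weights=None):
    # Inception V3 backbone; pass weights="imagenet" for the transfer-learning runs.
    base = InceptionV3(include_top=False, weights=weights, input_shape=input_shape)
    x = GlobalAveragePooling2D()(base.output)
    out = Dense(3, activation="linear")(x)  # number of regression targets is an assumption
    model = Model(base.input, out)
    # MAE loss, Nadam optimizer with a learning rate of 0.001.
    model.compile(optimizer=Nadam(learning_rate=0.001), loss="mean_absolute_error")
    return model


# Train until the validation MAE has not improved for 100 epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=100, restore_best_weights=True)
# x_train, y_train, x_val, y_val are assumed to be the pre-extracted clips and labels.
# model = build_model()
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=16, epochs=10_000, callbacks=[early_stop])
```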
2.2 Frame Stride Experiments

For the methods which used different stride lengths to perform prediction of sperm quality, we performed a total of 10 different experiments. Stride in this context refers to the distance between two extracted frames within a clip. For example, using a stride length of 5 would select every fifth frame within a given frame sequence. The purpose of this experiment is to exaggerate the change between two frames by increasing the distance between the points at which the two frames were sampled. Each experiment used a clip length of three frames, which are greyscaled, resized to 224 × 224 pixels and combined channel-wise. The result is that each clip has a shape of 224 × 224 × 3, making it possible to use pre-trained networks. We take advantage of this attribute and train two models for each stride value tested, i.e., one transferring the weights of an ImageNet-based model and one trained from scratch. As previously stated, we performed a total of ten different experiments, using five different stride values: 1, 5, 10, 25, and 50.
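The channel-wise stacking described above could look roughly like the following OpenCV-based sketch. The function name and frame-selection details are assumptions for illustration, not the authors' implementation.

```python
# Sketch of building one three-frame, channel-wise stacked clip for a given stride.
import cv2
import numpy as np


def make_stride_clip(video_path, start_frame, stride, clip_length=3, size=(224, 224)):
    """Read `clip_length` greyscale frames spaced `stride` frames apart and stack
    them channel-wise into an array of shape (224, 224, clip_length)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for i in range(clip_length):
        cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame + i * stride)
        ok, frame = cap.read()
        if not ok:
            raise ValueError("ran past the end of the video")
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.resize(grey, size))
    cap.release()
    # Stacking three greyscale frames gives a 224 x 224 x 3 "image",
    # so ImageNet-pretrained weights can be reused directly.
    return np.stack(frames, axis=-1)


# Example: a clip whose frames are 25 frames apart, starting at frame 0.
# clip = make_stride_clip("semen_video.avi", start_frame=0, stride=25)
```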
2.3 Clip Length Experiments

For the methods which used different clip lengths to predict sperm quality, we performed a total of 5 experiments. Each experiment increases the number of frames in a clip by 10, starting at 10 and ending at 50. Each video is captured at 50 frames per second, which means that the clips which contain 50 frames represent a whole second of a given video. In contrast to the previous method, each clip included in these experiments has a stride of 1, meaning every frame in a sequence is used for prediction. Similar to the previous method, each frame is resized to 224 × 224 and greyscaled before being combined channel-wise. The shape of each clip is then 224 × 224 × C, where C is the length of the clip.

2.4 Threshold Experiments

For the threshold approach, we greyscale each extracted frame and threshold the color at 220, meaning all color values below 220 are set to 0. The spermatozoa in the provided videos have a strong bright coloring which differentiates them from the darker background. By thresholding the color values, we aim to separate the spermatozoa from the background in order to better emphasize the movement across frames. However, by doing this, we lose some of the visual information present in each sperm cell, which is why we chose not to apply this method to predict morphology. We organize these experiments in a similar manner as those done for the stride experiments, meaning we stack three frames channel-wise using five different stride values: 1, 5, 10, 25, and 50.
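As a concrete illustration, the thresholding step could be implemented roughly as in the sketch below; the exact OpenCV call and parameters are assumptions about how this could be done, not the authors' code.

```python
# Sketch of the greyscale-and-threshold preprocessing described in Section 2.4.
import cv2


def threshold_frame(frame_bgr, threshold=220, size=(224, 224)):
    """Greyscale and resize a frame, then set every pixel value below
    `threshold` to 0 so only the bright spermatozoa remain."""
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    grey = cv2.resize(grey, size)
    # THRESH_TOZERO keeps values above the threshold and zeroes the rest.
    _, thresholded = cv2.threshold(grey, threshold, 255, cv2.THRESH_TOZERO)
    return thresholded
```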
3 RESULTS AND DISCUSSION

Each method was evaluated using three-fold cross-validation (as required by the task), and we report the mean absolute error (MAE) and root mean squared error (RMSE) for each experiment. The results for the motility experiments are shown in Table 1, and the results for the morphology experiments are shown in Table 2.

Table 1: The results for the prediction of motility task. Each entry shows the mean absolute error and root mean squared error for each fold of the three-fold cross-validation, in addition to the average error across all folds.

Method                   Fold 1 (MAE/RMSE)   Fold 2 (MAE/RMSE)   Fold 3 (MAE/RMSE)   Average (MAE/RMSE)
Stride Experiments
Stride 1                 10.436 / 14.769     11.079 / 15.155     11.581 / 14.404     11.032 / 14.776
Stride 5                  8.563 / 11.856      9.843 / 13.754     12.172 / 15.510     10.192 / 13.707
Stride 10                 9.358 / 12.711      9.477 / 15.524     11.892 / 15.065     10.242 / 14.433
Stride 25                 9.490 / 13.530      7.149 /  9.579     10.871 / 14.532      9.170 / 12.547
Stride 50                10.005 / 13.961      9.804 / 14.468     10.691 / 13.593     10.167 / 14.007
TF Stride 1               9.874 / 13.408      8.450 / 11.638     10.257 / 13.972      9.527 / 13.006
TF Stride 5              10.937 / 14.699      7.903 / 10.544     10.322 / 13.217      9.721 / 12.820
TF Stride 10              8.714 / 11.955      8.256 / 11.153      9.917 / 13.029      8.962 / 12.046
TF Stride 25              8.505 / 11.211      8.818 / 11.889     10.480 / 13.919      9.268 / 12.340
TF Stride 50              9.021 / 11.505      9.604 / 11.943     11.338 / 14.818      9.988 / 12.755
Clip Length Experiments
Clip Length 10           12.400 / 17.822     11.045 / 14.110     12.635 / 16.066     12.027 / 15.999
Clip Length 20           11.605 / 16.674     12.867 / 16.361     11.712 / 14.778     12.061 / 15.938
Clip Length 30           10.757 / 14.871     12.116 / 21.117     16.435 / 22.337     13.102 / 19.442
Clip Length 40           11.225 / 14.897      9.725 / 12.866     11.736 / 15.135     10.895 / 14.299
Clip Length 50           10.763 / 14.640      9.843 / 14.154     11.051 / 13.728     10.552 / 14.174
Threshold Experiments
Stride 1                  9.846 / 14.397      9.575 / 13.183     11.371 / 14.784     10.264 / 14.121
Stride 5                 10.424 / 14.452      9.991 / 13.368      9.912 / 12.942     10.109 / 13.587
Stride 10                 9.544 / 13.549     11.381 / 15.570     10.113 / 13.176     10.346 / 14.098
Stride 25                 9.378 / 13.536     10.055 / 13.480     11.062 / 14.481     10.165 / 13.832
Stride 50                 9.621 / 13.270      9.331 / 12.240     11.917 / 15.083     10.290 / 13.531
Baseline
ZeroR                    13.880 / 18.680     13.590 / 16.980     12.090 / 14.680     13.190 / 16.860

Table 2: The results for the prediction of morphology task. Each entry shows the mean absolute error and root mean squared error for each fold of the three-fold cross-validation, in addition to the average error across all folds.

Method                   Fold 1 (MAE/RMSE)   Fold 2 (MAE/RMSE)   Fold 3 (MAE/RMSE)   Average (MAE/RMSE)
Stride Experiments
Stride 1                  6.517 /  9.097      5.407 /  8.305      5.499 /  7.385      5.808 /  8.262
Stride 5                  6.056 /  8.425      5.706 /  8.800      5.328 /  8.114      5.697 /  8.446
Stride 10                 6.124 /  8.633      5.388 /  7.747      5.414 /  7.869      5.642 /  8.083
Stride 25                 5.983 /  8.099      5.380 /  8.294      5.476 /  7.736      5.613 /  8.043
Stride 50                 5.736 /  7.994      5.716 /  8.698      5.473 /  7.776      5.641 /  8.156
TF Stride 1               5.724 /  8.000      5.323 /  8.229      5.023 /  7.011      5.357 /  7.747
TF Stride 5               5.661 /  7.789      5.088 /  8.092      4.769 /  6.472      5.172 /  7.451
TF Stride 10              6.515 /  8.205      5.620 /  8.125      4.880 /  6.824      5.672 /  7.718
TF Stride 25              5.879 /  8.405      5.104 /  8.220      4.927 /  7.123      5.303 /  7.916
TF Stride 50              6.224 /  8.200      5.981 /  8.231      4.749 /  6.610      5.652 /  7.680
Clip Length Experiments
Clip Length 10            6.216 /  8.793      5.636 /  7.899      5.295 /  7.634      5.716 /  8.109
Clip Length 20            6.336 /  8.355      5.604 /  7.753      5.112 /  7.241      5.684 /  7.783
Clip Length 30            6.097 /  8.485      6.342 /  9.315      5.666 /  8.177      6.035 /  8.659
Clip Length 40            6.059 /  8.645      5.744 /  8.665      5.122 /  7.501      5.642 /  8.270
Clip Length 50            6.211 /  8.794      5.584 /  8.677      5.282 /  7.946      5.692 /  8.472
Baseline
ZeroR                     5.990 /  7.950      5.990 /  8.270      5.820 /  8.130      5.930 /  8.100

As we can see from the prediction of motility results (Table 1), using larger strides between the selected frames in combination with transfer learning works best. The experiments which used a lot of frames per clip seem to have an issue handling the amount of information per sample. Thresholding the color space seems to perform marginally better than the extended clip length experiments, but is still not as good as the experiments using longer strides. Despite the poor results of the thresholding approach, all methods beat the ZeroR baseline. Although the results may not be good enough to be deployed in a clinical setting, they show that deep neural networks are a promising tool within the field of automatic semen analysis.

Looking at the prediction of morphology results (Table 2), we see that almost all experiments lie around the ZeroR baseline. Most, however, beat the baseline by a small margin. It is hard to draw any strong conclusions about which methods work best, but it seems that using transfer learning for the stride experiments achieves better results than training from scratch. As for using different clip lengths, all methods seem to achieve similar results. Overall, the results show that a more specific approach to predicting sperm morphology is needed, for example, analyzing individual spermatozoa using higher image resolutions.
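To make the evaluation protocol concrete, the sketch below shows how per-fold MAE/RMSE scores and a ZeroR baseline (always predicting the training-set mean) could be computed; the array shapes and helper names are illustrative assumptions, not the task's official evaluation code.

```python
# Sketch of the per-fold evaluation used in Section 3: MAE, RMSE, and a ZeroR
# baseline that ignores the input and predicts the mean of the training labels.
import numpy as np


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def zeror_baseline(y_train, n_samples):
    # Repeat the training-set mean prediction for every test sample.
    return np.tile(y_train.mean(axis=0), (n_samples, 1))


# The reported averages are simply the mean of the three per-fold scores.
```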
4 CONCLUSION

In this paper, we presented the work done as part of the Medico Multimedia Task, where we participated in two of the three available tasks. Overall, the results are promising and show that neural networks are able to predict both motility and morphology with a relatively low error margin. For future work, we aim to apply 3D CNNs and more advanced architectures, which may show an improvement over the presented results, in addition to exploring more advanced data preprocessing methods such as optical flow.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

REFERENCES

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/
[2] François Chollet and others. 2015. Keras. https://keras.io
[3] Timothy Dozat. 2015. Incorporating Nesterov Momentum into Adam.
[4] Trine B. Haugen, Steven A. Hicks, Jorunn M. Andersen, Oliwia Witczak, Hugo L. Hammer, Rune Borgli, Pål Halvorsen, and Michael A. Riegler. 2019. VISEM: A Multimodal Video Dataset of Human Spermatozoa. In Proceedings of the ACM Multimedia Systems Conference (MMSys). https://doi.org/10.1145/3304109.3325814
[5] Steven Hicks, Pål Halvorsen, Trine B. Haugen, Jorunn M. Andersen, Oliwia Witczak, Konstantin Pogorelov, Hugo L. Hammer, Duc-Tien Dang-Nguyen, Mathias Lux, and Michael Riegler. 2019. Medico Multimedia Task at MediaEval 2019. In CEUR Workshop Proceedings - Multimedia Benchmark Workshop (MediaEval).
[6] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval 2018.
[7] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2818–2826.