                                    Using Deep Learning to Predict
                              Motility and Morphology of Human Sperm
                            Steven Hicks1,2, Trine B. Haugen2, Pål Halvorsen1,2, Michael Riegler1,3
                                           1 SimulaMet, Norway           2 Oslo Metropolitan University, Norway
                                                             3 Kristiania University College, Norway


ABSTRACT
In the Medico Task 2019, the main focus is to predict sperm quality
based on videos and other related data. In this paper, we present the
approach of team LesCats, which is based on deep convolutional neural
networks, where we experiment with different data preprocessing methods
to predict the morphology and motility of human sperm. The achieved
results show that deep learning is a promising method for human sperm
analysis. Our best method achieves a mean absolute error of 8.962 for
the motility task and a mean absolute error of 5.303 for the morphology
task.

Copyright 2019 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France

1    INTRODUCTION
In an effort to explore how medical multimedia can be used to create
high-performing and efficient prediction algorithms, the Multimedia for
Medicine (Medico) Task presents different use-cases [6] which challenge
computer science researchers to explore a field with much potential for
real-world impact. This year’s task differs from previous years as it
focuses on the analysis of microscopic videos of human semen to assess
the quality of sperm. The videos are taken from the open-source VISEM
dataset [4]. The challenge presents three different tasks, of which we
decided to focus on the two tasks required in order to participate in
this year’s challenge, i.e., the prediction of motility task and the
prediction of morphology task. The tasks themselves are further
described in the overview paper [5].

2    APPROACH
Our approach is based on deep learning, using deep convolutional neural
networks (CNNs) to predict sperm motility and sperm morphology. All
experiments aim to utilize the information in the videos to their
fullest while still keeping the computational complexity low. The
experiments can primarily be split into four distinct groups. Firstly
(i), we combine multiple frames channel-wise using different stride
values (distance between selected frames) and feed this directly into
the deep neural network. Secondly (ii), we vary the number of frames
used in each sample to see how this may affect the algorithm’s
prediction performance. Thirdly (iii), we threshold the colors of each
frame in an attempt to separate the spermatozoa’s bright color from the
darker background, and use this information for prediction. Lastly (iv),
we add the patient data to the video analysis to see how this may help
in the prediction. Because morphology focuses more on the visual
appearance of sperm than on their movement, we opted to perform the
threshold experiments only for the motility task. Internally, we
experimented with a wide variety of configurations, but only submitted
the best results as the official runs. In the following sections, we
give a brief explanation of our experimental setup (the training
configuration and data preparation common to each model) and a more
detailed description of each approach.

2.1    Experimental Setup
For each experiment, we use the Inception V3 [7] architecture for our
deep learning model, which was trained for as long as it improved on
the validation loss. This means that the models trained indefinitely
until the mean absolute error did not improve over the last 100 epochs.
Each model was trained with a batch size of 16 using Nadam [3] to
optimize the weights with a learning rate of 0.001. The models were
implemented using the Keras [2] deep learning library with a TensorFlow
[1] back-end. Each experiment was performed on what would be considered
"consumer-grade" hardware, specifically, a desktop computer running
Arch Linux with an Intel Core i7 processor, 16 gigabytes of RAM, and an
Nvidia GTX 1080 Ti graphics card. As the videos in the provided dataset
vary in length (ranging from 2 to 7 minutes), we extracted a number of
clips (one clip contains a sequence of frames) from each video before
training. The clips were extracted at evenly spaced intervals throughout
the entire video, meaning we get a set of clips which accurately
represents any given semen recording. For both the prediction of
motility and the prediction of morphology task, we use ZeroR as a
baseline to measure our results.
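As an illustration of this setup, the sketch below shows how such a
configuration could be expressed in Keras. The regression head, the
assumption of three regression targets per task, and the commented-out
placeholder data arrays are ours for the example and are not taken from
the official task code.

    # Sketch of the training configuration described above.
    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras.callbacks import EarlyStopping
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Nadam

    def build_model(input_shape=(224, 224, 3), n_outputs=3, weights=None):
        # Inception V3 backbone with a small regression head; weights=None
        # trains from scratch, weights="imagenet" transfers ImageNet weights
        # (only possible when the input has three channels).
        base = InceptionV3(include_top=False, weights=weights,
                           input_shape=input_shape)
        features = GlobalAveragePooling2D()(base.output)
        outputs = Dense(n_outputs, activation="linear")(features)
        model = Model(base.input, outputs)
        model.compile(optimizer=Nadam(learning_rate=0.001),
                      loss="mean_absolute_error")
        return model

    model = build_model()
    # Stop once the validation MAE has not improved for 100 epochs.
    early_stop = EarlyStopping(monitor="val_loss", patience=100,
                               restore_best_weights=True)
    # model.fit(x_train, y_train, batch_size=16, epochs=10000,
    #           validation_data=(x_val, y_val), callbacks=[early_stop])
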
2.2    Frame Stride Experiments
For the methods which use different stride lengths to predict sperm
quality, we performed a total of 10 different experiments. Stride in
this context refers to the distance between two extracted frames within
a clip. For example, using a stride length of 5 would select every fifth
frame within a given frame sequence. The purpose of this experiment is
to exaggerate the change between two frames by increasing the distance
between the points where the two frames were sampled. Each experiment
used a clip length of three frames, which are greyscaled, resized to
224 × 224 pixels, and combined channel-wise. The result is that each
clip has a shape of 224 × 224 × 3, making it possible to use pre-trained
networks. We take advantage of this attribute and train two models for
each stride value tested, i.e., one transferring the weights of an
ImageNet-based model and one trained from scratch. As previously stated,
we performed a total of ten different experiments, using five different
stride values: 1, 5, 10, 25, and 50.
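A minimal sketch of this sampling step is shown below. The OpenCV-based
frame decoding, the helper name, and the file path are assumptions for
illustration; only the greyscaling, resizing, and channel-wise stacking
follow the description above.

    # Sketch: build one 224 x 224 x 3 clip by stacking three greyscale
    # frames sampled `stride` frames apart.
    import cv2
    import numpy as np

    def extract_clip(video_path, start_frame, stride, clip_length=3,
                     size=(224, 224)):
        # Returns an array of shape (224, 224, clip_length).
        cap = cv2.VideoCapture(video_path)
        frames = []
        for i in range(clip_length):
            cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame + i * stride)
            ok, frame = cap.read()
            if not ok:
                break
            grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # greyscale
            grey = cv2.resize(grey, size)                   # 224 x 224
            frames.append(grey.astype(np.float32) / 255.0)
        cap.release()
        return np.stack(frames, axis=-1)  # combined channel-wise
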
2.3    Clip Length Experiments
For the methods which use different clip lengths to predict sperm
quality, we performed a total of 5 experiments. Each experiment
increases the number of frames in a clip by 10, starting at 10 and
ending at 50. Each video is captured at 50 frames per second, which
means that the clips which contain 50 frames represent a whole second
of a given video. In contrast to the previous method, each clip included
in these experiments has a stride of 1, meaning each frame in a sequence
is used for prediction. Similar to the previous method, each frame is
resized to 224 × 224 and greyscaled before being combined channel-wise.
The shape of each clip is then 224 × 224 × C, where C is the length of
the clip.
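The following sketch illustrates how such a C-channel clip could be
built and fed to the network, reusing the hypothetical extract_clip and
build_model helpers from the earlier sketches. Because the input no
longer has three channels, ImageNet weights cannot be transferred and
the backbone is trained from scratch.

    # Sketch: one 224 x 224 x C input built from C consecutive frames
    # (stride 1); "video.avi" is a placeholder path.
    clip_length = 50                 # 10, 20, 30, 40 or 50 frames
    clip = extract_clip("video.avi", start_frame=0, stride=1,
                        clip_length=clip_length)
    print(clip.shape)                # (224, 224, clip_length)

    # With C != 3 input channels the ImageNet weights no longer fit, so
    # the backbone is built with weights=None (trained from scratch).
    model = build_model(input_shape=(224, 224, clip_length), weights=None)
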
2.4    Threshold Experiments
For the threshold approach, we greyscale each extracted frame and
threshold the color at 220, meaning all color values below 220 are set
to 0. The spermatozoa in the provided videos have a strong bright
coloring which differentiates them from the darker background. By
thresholding the color values, we aim to separate the spermatozoa from
the background in order to better emphasize the movement across frames.
However, by doing this, we lose some of the visual information present
in each sperm, which is why we chose not to apply this method to predict
morphology. We organize these experiments in a similar manner as those
done for the stride experiments, meaning we stack three frames
channel-wise using five different stride values: 1, 5, 10, 25, and 50.

3    RESULTS AND DISCUSSION
Each method was evaluated using three-fold cross-validation (as required
by the task), and we report the mean absolute error (MAE) and root mean
squared error (RMSE) for each experiment. The results for the motility
experiments are shown in Table 1, and the results for the morphology
experiments are shown in Table 2.
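For reference, the two reported metrics can be computed as in the NumPy
sketch below; this is an illustration, not the official evaluation
script.

    import numpy as np

    def mae(y_true, y_pred):
        # Mean absolute error over all predicted values.
        return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

    def rmse(y_true, y_pred):
        # Root mean squared error over all predicted values.
        return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
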
    As we can see from the prediction of motility results (Table 1),
using larger strides between the selected frames in combination with
transfer learning works best. The experiments which used a large number
of frames per clip seem to have an issue handling the amount of
information per sample. Thresholding the color space seems to perform
marginally better than the extended clip length experiments, but is
still not as good as the experiments using longer strides. Despite the
poor results of the thresholding approach, all methods beat the ZeroR
baseline method. Although the results may not be good enough to be
deployed in a clinical setting, they show that deep neural networks are
a promising tool within the field of automatic semen analysis.
    Looking at the prediction of morphology results (Table 2), we see
that pretty much all experiments lie around the ZeroR baseline. Most,
however, beat the baseline by a small margin. It is hard to draw any
strong conclusions about which methods work best, but it seems like
using transfer learning for the stride experiments achieves better
results than training from scratch. As for using different clip lengths,
all methods seem to achieve similar results. Overall, the results show
that a more specific approach to predicting sperm morphology is needed,
for example, analyzing individual spermatozoa using higher image
resolutions.
Table 1: The results for the prediction of motility task. Each entry
shows the mean absolute error and root mean squared error for each fold
of the three-fold cross-validation in addition to the average error
across all folds.

                       Fold 1          Fold 2          Fold 3          Average
Method              MAE     RMSE    MAE     RMSE    MAE     RMSE    MAE     RMSE
Stride Experiments
Stride 1         10.436  14.769  11.079  15.155  11.581  14.404  11.032  14.776
Stride 5          8.563  11.856   9.843  13.754  12.172  15.510  10.192  13.707
Stride 10         9.358  12.711   9.477  15.524  11.892  15.065  10.242  14.433
Stride 25         9.490  13.530   7.149   9.579  10.871  14.532   9.170  12.547
Stride 50        10.005  13.961   9.804  14.468  10.691  13.593  10.167  14.007
TF Stride 1       9.874  13.408   8.450  11.638  10.257  13.972   9.527  13.006
TF Stride 5      10.937  14.699   7.903  10.544  10.322  13.217   9.721  12.820
TF Stride 10      8.714  11.955   8.256  11.153   9.917  13.029   8.962  12.046
TF Stride 25      8.505  11.211   8.818  11.889  10.480  13.919   9.268  12.340
TF Stride 50      9.021  11.505   9.604  11.943  11.338  14.818   9.988  12.755
Clip Length Experiments
Clip Length 10   12.400  17.822  11.045  14.110  12.635  16.066  12.027  15.999
Clip Length 20   11.605  16.674  12.867  16.361  11.712  14.778  12.061  15.938
Clip Length 30   10.757  14.871  12.116  21.117  16.435  22.337  13.102  19.442
Clip Length 40   11.225  14.897   9.725  12.866  11.736  15.135  10.895  14.299
Clip Length 50   10.763  14.640   9.843  14.154  11.051  13.728  10.552  14.174
Threshold Experiments
Stride 1          9.846  14.397   9.575  13.183  11.371  14.784  10.264  14.121
Stride 5         10.424  14.452   9.991  13.368   9.912  12.942  10.109  13.587
Stride 10         9.544  13.549  11.381  15.570  10.113  13.176  10.346  14.098
Stride 25         9.378  13.536  10.055  13.480  11.062  14.481  10.165  13.832
Stride 50         9.621  13.270   9.331  12.240  11.917  15.083  10.290  13.531
Baseline
ZeroR            13.880  18.680  13.590  16.980  12.090  14.680  13.190  16.860

Table 2: The results for the prediction of morphology task. Each entry
shows the mean absolute error and root mean squared error for each fold
of the three-fold cross-validation in addition to the average error
across all folds.

                       Fold 1          Fold 2          Fold 3          Average
Method              MAE     RMSE    MAE     RMSE    MAE     RMSE    MAE     RMSE
Stride Experiments
Stride 1          6.517   9.097   5.407   8.305   5.499   7.385   5.808   8.262
Stride 5          6.056   8.425   5.706   8.800   5.328   8.114   5.697   8.446
Stride 10         6.124   8.633   5.388   7.747   5.414   7.869   5.642   8.083
Stride 25         5.983   8.099   5.380   8.294   5.476   7.736   5.613   8.043
Stride 50         5.736   7.994   5.716   8.698   5.473   7.776   5.641   8.156
TF Stride 1       5.724   8.000   5.323   8.229   5.023   7.011   5.357   7.747
TF Stride 5       5.661   7.789   5.088   8.092   4.769   6.472   5.172   7.451
TF Stride 10      6.515   8.205   5.620   8.125   4.880   6.824   5.672   7.718
TF Stride 25      5.879   8.405   5.104   8.220   4.927   7.123   5.303   7.916
TF Stride 50      6.224   8.200   5.981   8.231   4.749   6.610   5.652   7.680
Clip Length Experiments
Clip Length 10    6.216   8.793   5.636   7.899   5.295   7.634   5.716   8.109
Clip Length 20    6.336   8.355   5.604   7.753   5.112   7.241   5.684   7.783
Clip Length 30    6.097   8.485   6.342   9.315   5.666   8.177   6.035   8.659
Clip Length 40    6.059   8.645   5.744   8.665   5.122   7.501   5.642   8.270
Clip Length 50    6.211   8.794   5.584   8.677   5.282   7.946   5.692   8.472
Baseline
ZeroR             5.990   7.950   5.990   8.270   5.820   8.130   5.930   8.100
4    CONCLUSION
In this paper, we presented the work done as part of the Medico
Multimedia Task, where we participated in two of the three available
tasks. Overall, the results are promising and show that neural networks
are able to predict both motility and morphology with a relatively low
error margin. For future work, we aim to apply 3D CNNs and more advanced
architectures which may show an improvement over the presented results,
in addition to exploring more advanced data preprocessing methods such
as optical flow.


REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
    Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean,
    Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp,
    Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz
    Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat
    Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster,
    Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul
    Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol
    Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu,
    and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learn-
    ing on Heterogeneous Systems. (2015). https://www.tensorflow.org/
    Software available from tensorflow.org.
[2] François Chollet and others. 2015. Keras. https://keras.io. (2015).
[3] Timothy Dozat. 2015. Incorporating Nesterov Momentum into Adam.
[4] Trine B. Haugen, Steven A. Hicks, Jorunn M. Andersen, Oliwia
    Witczak, Hugo L. Hammer, Rune Borgli, Pål Halvorsen, and Michael A.
    Riegler. 2019. VISEM: A Multimodal Video Dataset of Human Sper-
    matozoa. In Proceedings of the ACM on Multimedia Systems Conference
    (MMSYS). https://doi.org/10.1145/3304109.3325814
[5] Steven Hicks, Pål Halvorsen, Trine B Haugen, Jorunn M Andersen,
    Oliwia Witczak, Konstantin Pogorelov, Hugo L Hammer, Duc-Tien
    Dang-Nguyen, Mathias Lux, and Michael Riegler. 2019. Medico Mul-
    timedia Task at MediaEval 2019. In CEUR Workshop Proceedings -
    Multimedia Benchmark Workshop (MediaEval).
[6] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Thomas de
    Lange, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux,
    and Olga Ostroukhova. 2018. Medico Multimedia Task at MediaEval
    2018.
[7] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
    Zbigniew Wojna. 2016. Rethinking the inception architecture for
    computer vision. In Proceedings of the IEEE conference on computer
    vision and pattern recognition. 2818–2826.