Predicting Sperm Motility and Morphology using Deep Learning and Handcrafted Features

Steven Hicks1,3, Pål Halvorsen1,3, Trine B. Haugen3, Jorunn M. Andersen3, Oliwia Witczak3, Hugo L. Hammer3, Duc-Tien Dang-Nguyen4, Mathias Lux2, Michael Riegler1
1 SimulaMet, Norway; 2 Alpen-Adria-Universität Klagenfurt, Austria; 3 Oslo Metropolitan University, Norway; 4 University of Bergen, Norway

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
This paper presents the approach proposed by the organizer team (SimulaMet) for MediaEval 2019 Multimedia for Medicine: The Medico Task. The approach uses a data preparation method based on global features extracted from multiple frames within each video, which are then combined with information about the patient to create a compressed representation of each video. The goal is to create a less hardware-expensive data representation that still retains the temporal information of the video and the related patient data. Overall, the results need improvement before the method is a viable option for clinical use.

1 INTRODUCTION
In this paper, we detail the approach of the Medico Task organization team (SimulaMet) as part of MediaEval 2019. The Medico Task explores the challenge of using multimedia data to make the daily work of medical doctors more efficient [10]. This year's task focuses on automatically predicting sperm quality, in terms of motility and morphology, based on microscopic video recordings of human semen. The task provides a dataset consisting of 85 videos and associated patient data, which is used to make predictions on semen quality. The videos are taken from the VISEM dataset [4], an open-source dataset with barely any restrictions regarding usage. More details on the 2019 Medico Task and the provided dataset can be found in the overview paper [6] and the paper on VISEM [4].

The proposed approach is based on handcrafted features extracted from multiple frames within each video and associated patient/sensor data, which are fed into a deep learning model for the prediction. This approach is used to solve the prediction of motility task and the prediction of morphology task, which are the tasks required in order to participate in the competition. The prediction of motility task involves predicting three quality metrics (progressive, non-progressive, and immotile) tied to the movement of the sperm in a given semen sample. The prediction of morphology task focuses on predicting more visual attributes of the sperm, namely, defects which may be present in the head, tail, or midpiece.

2 APPROACH
As the organization team, our main goal for this year's Medico Task was to present a baseline approach which utilizes all the available data in the dataset to make a prediction on sperm motility and morphology (as the tasks require). Furthermore, we wanted our approach to be computationally efficient so as not to require expensive hardware or a complex setup procedure. Therefore, we decided to base our contribution on handcrafted features extracted from multiple frames within each video in the provided dataset. These handcrafted features are combined with each category of associated patient data in order to train a deep learning model to make a prediction on each of the sperm quality attributes. In the following sections, we describe the approach in more detail, including how the data was prepared, the model architecture, and how each model was trained.

2.1 Data Preparation
To prepare the data for training and evaluation, we start by extracting the first two frames of each second for 60 seconds of each video. The result is a sub-sample containing 120 frames from the first minute of each of the 85 videos in the dataset. From these frames, we extract four different types of global features, namely, Edge Histogram, Tamura, Luminance Layout, and Simple Color Histogram [9]. All image features were extracted using the LIRE library [8], a popular open-source library for image retrieval. The global image features are concatenated row-wise to represent a single frame. The extracted image feature vectors are then concatenated column-wise into a 225 × 120 matrix, where the columns represent the frames and the rows represent the extracted features.

The motivation behind this representation is that each column contains the visual features of a single frame, while the temporal information is represented through the change of feature values across the columns. To add the information about the patient, we simply concatenate the values to each frame column, so that each patient and sensor value is the same for every frame in a given training sample. We create these video representations using each of the five provided categories of patient information in the dataset, namely, sex hormones, patient-related data (age, body mass index (BMI), and days of sexual abstinence), fatty acids in serum, fatty acids in spermatozoa, and sperm analysis data. Each of the five categories was used to train two deep neural networks, one for predicting morphology and one for predicting motility. Note that for the sperm analysis data, we removed the values regarding motility and morphology when predicting these quantities in the respective task. This is important so as not to feed the ground-truth values to the network during training.
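Since LIRE is a Java library, the feature extraction itself is not shown here. Assuming the four global feature vectors have already been computed for each frame, the assembly of one training sample can be sketched in NumPy as follows; the per-feature dimensions are illustrative guesses (the paper only states that they sum to 225 rows), and the function names are our own:

```python
import numpy as np

N_FRAMES = 120  # two frames per second from the first 60 seconds

# Illustrative per-feature dimensions; the paper only states that the
# concatenated global features of one frame span 225 rows in total.
FEATURE_DIMS = {"edge_histogram": 80, "tamura": 18,
                "luminance_layout": 64, "simple_color_histogram": 63}

def frame_vector(features):
    """Concatenate the four global feature vectors of a single frame row-wise."""
    return np.concatenate([features[name] for name in FEATURE_DIMS])

def video_matrix(per_frame_features, patient_values):
    """Build one training sample: 120 frame vectors stacked column-wise into
    a 225 x 120 matrix, with the patient/sensor values repeated in every
    column so that each frame carries the same patient information."""
    cols = np.stack([frame_vector(f) for f in per_frame_features], axis=1)
    patient = np.repeat(np.asarray(patient_values, dtype=float)[:, None],
                        cols.shape[1], axis=1)
    return np.concatenate([cols, patient], axis=0)  # shape: (225 + p, 120)
```

For the patient-related category with p = 3 values (age, BMI, days of abstinence), the resulting sample has shape (228, 120); the other categories simply change the number of appended rows.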
MediaEval'19, 27-29 October 2019, Sophia Antipolis, France — Hicks et al.

2.2 Model Architecture and Training
As explained in the previous section, the data format used to train our deep neural networks is a matrix of image features extracted from 120 frames of each video, together with the associated patient and sensor data. The exact input shape varies depending on the category of patient information used in the representation (the number of variables varies per category), but the rest of the network is kept the same. We use a novel convolutional neural network architecture to perform our experiments.

The architecture is modeled to analyze the video representations using an ensemble of similar networks (shown in Figure 1: input, two convolution layers, average pooling, a series of residual modules, average pooling, concatenation across networks, two fully-connected layers, and the prediction). Each network consists of an initial convolution block, after which the output is passed through multiple residual modules. The outputs of the networks are then concatenated before being passed through two fully-connected layers that make the prediction. The network is repeated multiple times to create an ensemble of networks which each analyze the same input. The number of networks is a tunable parameter; we used three based on internal testing.

As previously mentioned, the same model architecture was used for both the motility task and the morphology task. The model was trained with a learning rate of 0.00001 using Nadam [3] to optimize the weights. We used mean absolute error (MAE) as the loss, with a batch size of 16 samples. Each experiment was trained for a maximum of 5000 epochs, but training was stopped if the loss did not improve over the last 300 epochs.
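The description above leaves several details open (filter counts, the internals of the residual modules, the widths of the fully-connected layers). The following Keras sketch fills these in with assumed values and treats the 120 frame columns as the sequence axis of 1D convolutions — a plausible reading of the architecture, not the authors' exact model:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_FRAMES = 120    # columns of the video representation
N_FEATURES = 228  # 225 image features + 3 patient values (category-dependent)
N_BRANCHES = 3    # ensemble size chosen by the authors
N_OUTPUTS = 3     # e.g. progressive, non-progressive, immotile

def residual_module(x, filters):
    """A generic residual block; the paper does not specify its internals."""
    y = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([x, y]))

def branch(inputs, filters=64, n_residual=3):
    """One ensemble member: initial convolution block, residual modules,
    and average pooling, mirroring the boxes in Figure 1."""
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    x = layers.AveragePooling1D(2)(x)
    for _ in range(n_residual):
        x = residual_module(x, filters)
    return layers.GlobalAveragePooling1D()(x)

inputs = layers.Input(shape=(N_FRAMES, N_FEATURES))  # frames = sequence axis
merged = layers.Concatenate()([branch(inputs) for _ in range(N_BRANCHES)])
x = layers.Dense(128, activation="relu")(merged)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(N_OUTPUTS)(x)  # regression outputs
model = Model(inputs, outputs)

# Training setup stated in the paper: Nadam at lr 1e-5, MAE loss, batch size
# 16, up to 5000 epochs with early stopping after 300 epochs without
# improvement, e.g.:
# model.fit(X, y, batch_size=16, epochs=5000, callbacks=[
#     tf.keras.callbacks.EarlyStopping(monitor="loss", patience=300)])
model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=1e-5),
              loss="mae")
```

Because each call to `branch` creates fresh layers, the three branches have independent weights, matching the "ensemble of similar networks which each analyze the same input" described above.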
Note that the model used for evaluation is the one which achieved the best MAE on the validation set. The hardware used to train all models was a desktop computer running Linux with an Nvidia GTX 1080 Ti, an Intel Core i7 processor running at 3.6 GHz, and 16 GB of RAM. Models were implemented using the deep learning library Keras [2] with a TensorFlow [1] back-end. Due to the small number of training samples, no experiment took longer than one hour to train, despite training for many epochs.

Figure 1: The CNN architecture used to perform the experiments for both the prediction of morphology task and the prediction of motility task. The box represents the network which is repeated.

3 RESULTS AND DISCUSSION
Looking at Table 1 and Table 2, we see the results for the Prediction of Morphology Task and the Prediction of Motility Task. Overall, the results show that the deep neural networks for both tasks are able to learn something from the data, but the performance is quite poor when compared to the ZeroR baseline. For future work, we aim to use features extracted from a deep neural network.

Table 1: The results of the experiments used to predict sperm morphology in terms of head defects, midpiece defects, and tail defects.

Method          | Fold 1        | Fold 2        | Fold 3        | Average
                | MAE    RMSE   | MAE    RMSE   | MAE    RMSE   | MAE    RMSE
None            | 13.46  17.89  | 12.39  15.18  | 13.34  17.34  | 13.06  16.80
Patient Data    | 12.72  17.52  | 12.96  16.94  | 13.15  16.73  | 12.94  17.06
Sex Hormones    | 12.32  16.91  | 12.14  15.89  | 12.53  15.67  | 12.33  16.15
FA Serum        | 12.76  17.32  | 10.53  12.78  | 11.50  15.10  | 11.60  15.07
FA Spermatozoa  | 12.04  17.78  | 10.53  13.12  | 12.36  15.42  | 11.64  15.44
Sperm Analysis  | 12.18  17.35  | 11.31  15.23  | 11.16  14.48  | 11.55  15.69
ZeroR Baseline  | 13.88  18.68  | 13.59  16.98  | 12.09  14.68  | 13.19  16.86

4 CONCLUSION
In this paper, we described the approach submitted by the 2019 Medico organization team (SimulaMet).
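The ZeroR baseline used for comparison in Tables 1 and 2 is, in its usual form (an assumption here, since the paper does not define it), a model that always predicts the per-target mean of the training fold. A minimal sketch of that baseline together with the two reported metrics:

```python
import numpy as np

def zeror_metrics(y_train, y_test):
    """ZeroR regression baseline: predict the per-target training mean for
    every test sample, then report MAE and RMSE on the held-out fold."""
    pred = np.broadcast_to(y_train.mean(axis=0), y_test.shape)
    mae = float(np.abs(y_test - pred).mean())
    rmse = float(np.sqrt(((y_test - pred) ** 2).mean()))
    return mae, rmse
```

A model only adds value where it beats this constant predictor, which is why the per-fold MAE/RMSE comparison against ZeroR is the key reading of both tables.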
The presented method used handcrafted features and associated patient data to train a deep learning model to predict sperm quality in terms of motility and morphology. Based on the results, we see that the future of automatic sperm quality prediction is promising, but more work is required before it can be used in any real-world scenario. Representing the video as a single image also allows for future experiments in which we want to use Grad-CAM methods, e.g., [5, 7], to explain the important parts of the video and data.

Table 2: The results of the experiments used to predict sperm motility in terms of progressive, non-progressive, and immotile spermatozoa.

Method          | Fold 1       | Fold 2       | Fold 3       | Average
                | MAE   RMSE   | MAE   RMSE   | MAE   RMSE   | MAE   RMSE
None            | 6.40  8.63   | 5.84  8.29   | 5.61  8.27   | 5.95  8.40
Patient Data    | 6.02  8.29   | 5.77  8.27   | 5.89  8.50   | 5.89  8.35
Sex Hormones    | 7.10  8.99   | 5.88  8.71   | 6.24  8.62   | 6.41  8.77
FA Serum        | 6.04  8.25   | 5.76  8.19   | 5.11  7.51   | 5.63  7.98
FA Spermatozoa  | 6.18  8.56   | 5.70  8.10   | 5.64  8.04   | 5.84  8.23
Sperm Analysis  | 6.26  8.42   | 5.81  8.38   | 5.96  8.37   | 6.01  8.39
ZeroR Baseline  | 5.99  7.95   | 5.99  8.27   | 5.82  8.13   | 5.93  8.12

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/. Software available from tensorflow.org.
[2] François Chollet and others. 2015. Keras. https://keras.io.
[3] Timothy Dozat. 2015. Incorporating Nesterov Momentum into Adam.
[4] Trine B. Haugen, Steven A. Hicks, Jorunn M. Andersen, Oliwia Witczak, Hugo L. Hammer, Rune Borgli, Pål Halvorsen, and Michael A. Riegler. 2019. VISEM: A Multimodal Video Dataset of Human Spermatozoa. In Proceedings of the 10th ACM Multimedia Systems Conference (MMSys'19). https://doi.org/10.1145/3304109.3325814
[5] Steven Hicks, Sigrun Eskeland, Mathias Lux, Thomas de Lange, Kristin Ranheim Randel, Mattis Jeppsson, Konstantin Pogorelov, Pål Halvorsen, and Michael Riegler. 2018. Mimir: An Automatic Reporting and Reasoning System for Deep Learning Based Analysis in the Medical Domain. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys '18). ACM, New York, NY, USA, 369–374. https://doi.org/10.1145/3204949.3208129
[6] Steven Hicks, Pål Halvorsen, Trine B. Haugen, Jorunn M. Andersen, Oliwia Witczak, Konstantin Pogorelov, Hugo L. Hammer, Duc-Tien Dang-Nguyen, Mathias Lux, and Michael Riegler. 2019. Medico Multimedia Task at MediaEval 2019. In CEUR Workshop Proceedings – Multimedia Benchmark Workshop (MediaEval).
[7] S. Hicks, M. Riegler, K. Pogorelov, K. V. Anonsen, T. de Lange, D. Johansen, M. Jeppsson, K. Ranheim Randel, S. Losada Eskeland, and P. Halvorsen. 2018. Dissecting Deep Neural Networks for Better Medical Image Classification and Classification Understanding. In 2018 IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS). 363–368. https://doi.org/10.1109/CBMS.2018.00070
[8] Mathias Lux and Savvas A. Chatzichristofis. 2008. Lire: Lucene Image Retrieval: An Extensible Java CBIR Library. In Proceedings of the 16th ACM International Conference on Multimedia (MM '08). ACM, New York, NY, USA, 1085–1088. https://doi.org/10.1145/1459359.1459577
[9] Mathias Lux, Michael Riegler, Pål Halvorsen, Konstantin Pogorelov, and Nektarios Anagnostopoulos. 2016. LIRE: Open Source Visual Information Retrieval. In Proceedings of the 7th International Conference on Multimedia Systems. ACM, 30.
[10] Michael Riegler, Mathias Lux, Carsten Griwodz, Concetto Spampinato, Thomas de Lange, Sigrun L. Eskeland, Konstantin Pogorelov, Wallapak Tavanapong, Peter T. Schmidt, Cathal Gurrin, and others. 2016. Multimedia and Medicine: Teammates for Better Disease Detection and Survival. In Proceedings of the 24th ACM International Conference on Multimedia. ACM, 968–977.