=Paper= {{Paper |id=Vol-2670/MediaEval_19_paper_62 |storemode=property |title=Extracting Temporal Features into a Spatial Domain Using Autoencoders for Sperm Video Analysis |pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_62.pdf |volume=Vol-2670 |authors=Vajira Thambawita,Pål Halvorsen,Hugo Hammer,Michael Riegler,Trine B. Haugen |dblpUrl=https://dblp.org/rec/conf/mediaeval/ThambawitaHHRH19a }} ==Extracting Temporal Features into a Spatial Domain Using Autoencoders for Sperm Video Analysis== https://ceur-ws.org/Vol-2670/MediaEval_19_paper_62.pdf
        Extracting Temporal Features into a Spatial Domain Using
                 Autoencoders for Sperm Video Analysis
       Vajira Thambawita1,2 , Pål Halvorsen1,2 , Hugo Hammer1,2 , Michael Riegler1,3 , Trine B. Haugen2
             1 SimulaMet, Norway                2 Oslo Metropolitan University, Norway    3 Kristiania University College, Norway

                                                              Contact:vajira@simula.no
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’19, 27-29 October 2019, Sophia Antipolis, France
Github: https://github.com/vlbthambawita/MedicoTask_2019_paper_2

ABSTRACT
In this paper, we present a two-step deep learning method that is used to predict sperm motility and morphology based on video recordings of human spermatozoa. First, we use an autoencoder to extract temporal features from a given semen video and plot these into image-space, which we call feature-images. Second, these feature-images are used to perform transfer learning to predict the motility and morphology values of human sperm. The presented method shows its capability to extract temporal information into spatial-domain feature-images, which can be used with traditional convolutional neural networks. Furthermore, the accuracy of the predicted motility of a given semen sample shows that a deep learning-based model can capture the temporal information of microscopic recordings of human semen.

1 INTRODUCTION
The 2019 Medico task [7] focuses on automatically predicting semen quality based on video recordings of human spermatozoa. This is a change from previous years, which mainly focused on classifying images taken from the gastrointestinal tract [10, 11]. For this year’s task, we look at predicting the morphology and motility of a given semen sample. Motility is defined by three variables, namely, the percentage of progressive, non-progressive, and immotile sperm. Morphology is determined by the percentage of sperm with tail defects, midpiece defects, and head defects. For this competition, the organizers have provided a predefined three-fold split of the VISEM dataset [5], which contains 85 videos of semen samples from different participants and a preliminary analysis of each sample, which is used as the ground truth. In the dataset paper, the authors presented baseline mean absolute error (MAE) values for motility and morphology. Furthermore, the importance of computer-aided sperm analysis is evident from the previous work carried out over the last few decades [3, 9, 12].

To solve this year’s task, we propose a deep learning-based method consisting of two steps: (i) unsupervised feature extraction using an autoencoder [1] and (ii) video regression using a standard convolutional neural network (CNN) and transfer learning. The autoencoder we use differs from the state-of-the-art autoencoders used to extract video features [2, 13], which extract feature vectors that are then used with long short-term memory models or multi-layer perceptrons (MLPs). In contrast, we use autoencoders to extract feature-images for use in CNNs.

2 APPROACH
Our method can be split into two distinct steps. First, we use an autoencoder to extract temporal features from multiple frames of a video into a feature-image. Second, we pass the extracted feature-image into a standard pre-trained CNN to predict the motility and morphology of the spermatozoa in a given video. In this paper, we present the preliminary results for four experiments based on four different input types. The first input type (I1) uses a single raw frame. Input type two (I2) is a stack of identical frames copied across the channel dimension. The third (I3) and fourth (I4) input types stack 9 and 18 consecutive frames from a video, respectively.

The first two experiments (using I1 and I2) were performed as baseline experiments. The two other experiments (using I3 and I4) were performed to see how the temporal information affects the prediction performance of the approach. For all input types, we split the extracted datasets into three folds based on the folds provided by the organizers. Then, three-fold cross-validation was conducted to evaluate our four experiments. An overview of all experiments is shown in Figure 1.

2.1 Step 1 - Unsupervised temporal feature extraction
In step 1, we trained an autoencoder that takes an input frame or frames (I1, I2, I3 or I4) from the sperm videos, as depicted in Figure 1. The encoder of the autoencoder extracts feature-images and passes them through the decoder to reconstruct the input frame or frames (R1, R2, R3 and R4). These feature-images differ from the features extracted by traditional autoencoders, which produce feature vectors instead of feature-images. The mean squared error (MSE) loss function is used to calculate the difference between the input data and the reconstructed data, and this error is backpropagated to train the autoencoder. After training for 2,000 epochs, we pass the encoder of the autoencoder model on to step 2.

2.2 Step 2 - CNN regression model
We selected the pre-trained ResNet-34 [6] as our base CNN to predict the motility and morphology values of the sperm videos. However, any pre-trained CNN could be chosen for this step, and in future work we will test and compare different ones in more detail. First, we take an input frame or frames (I1, I2, I3 or I4) and pass them through the pre-trained encoder (only the encoder section of the autoencoder model), which was trained on the same inputs in an unsupervised way. Then, the outputs of the encoder are passed through the CNN, whose last layer is modified to output three prediction values for either motility or morphology.
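The evaluation protocol described in Section 2 (three-fold cross-validation on predefined folds, scored with MAE over three regression outputs) can be sketched as follows. This is a minimal illustration and not the authors' code: the fold contents, the video ids and the example values are hypothetical, and averaging the absolute error over all three outputs and all videos is an assumption about how the MAE is aggregated.

```python
from statistics import mean


def three_fold_splits(folds):
    """Yield (held_out_name, train_ids, val_ids) for each predefined fold.

    `folds` maps a fold name to the list of video ids in that fold.
    """
    for held_out, val_ids in folds.items():
        train_ids = [vid
                     for name, ids in folds.items() if name != held_out
                     for vid in ids]
        yield held_out, train_ids, list(val_ids)


def mean_absolute_error(targets, predictions):
    """MAE over all videos and all three outputs (assumed aggregation)."""
    errors = [abs(t - p)
              for target, pred in zip(targets, predictions)
              for t, p in zip(target, pred)]
    return mean(errors)


# Hypothetical video ids; the real assignment comes from the organizers' split.
folds = {"fold_1": [1, 2], "fold_2": [3, 4], "fold_3": [5, 6]}
splits = list(three_fold_splits(folds))

# Hypothetical motility values in percent:
# (progressive, non-progressive, immotile) for two validation videos.
targets = [(50.0, 20.0, 30.0), (10.0, 30.0, 60.0)]
predictions = [(45.0, 25.0, 30.0), (20.0, 30.0, 50.0)]
fold_mae = mean_absolute_error(targets, predictions)
```

Each split trains on two folds and validates on the third, so every video is used for validation exactly once across the three runs.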




Figure 1: A big-picture overview of our two-step deep learning model. Step 1: an autoencoder architecture used to extract image features. Step 2: the pre-trained ResNet-34 CNN for predicting the regression values of motility and morphology. I1, I2, I3 and I4: input frames extracted from the video dataset. R1, R2, R3 and R4: reconstructed data corresponding to the inputs I1, I2, I3 and I4. The four sample feature frames show feature-images extracted by the autoencoder after training for 2,000 epochs (the actual resolution of a feature-image is 256x256, which is equal to the original frame size of the input data).
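To make the four input types and the reconstruction error concrete, the sketch below builds I1 to I4 from a list of frames and computes the MSE between an input stack and a reconstruction. It is an illustration under assumptions, not the authors' implementation: `build_input` and `mse` are hypothetical helper names, frames are plain nested lists rather than tensors, and the number of copies used for I2 is not stated in the text, so the default of 9 is a guess chosen to match I3.

```python
def build_input(frames, start, input_type, copies=9):
    """Return a channels-first stack (a list of frames) for I1, I2, I3 or I4."""
    if input_type == "I1":
        return [frames[start]]              # a single raw frame
    if input_type == "I2":
        return [frames[start]] * copies     # the same frame repeated (assumed count)
    n = {"I3": 9, "I4": 18}[input_type]     # number of consecutive frames
    stack = frames[start:start + n]
    if len(stack) < n:
        raise ValueError("not enough frames after the start index")
    return stack


def mse(inputs, reconstruction):
    """Mean squared error between an input stack and its reconstruction."""
    diffs = [(a - b) ** 2
             for frame_a, frame_b in zip(inputs, reconstruction)
             for row_a, row_b in zip(frame_a, frame_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


# Toy "video": 20 frames of 2x2 pixels, where frame t is filled with the value t.
video = [[[float(t)] * 2 for _ in range(2)] for t in range(20)]

i1 = build_input(video, 0, "I1")
i2 = build_input(video, 0, "I2")
i3 = build_input(video, 0, "I3")
i4 = build_input(video, 0, "I4")
```

In this toy setup I2 carries no more information than I1, while I3 and I4 contain frames that differ over time, which is the distinction the experiments rely on.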

3 RESULTS AND ANALYSIS
According to the average MAE values shown in Table 1, inputs I3 and I4 achieve lower average motility errors than the baseline inputs I1 and I2. These performance improvements imply that our model is able to encode temporal features into a spatial feature-image representation. Furthermore, input I4, which uses 18 stacked frames, yields a lower average motility MAE than input I3 (9.427 versus 10.970). This performance gain shows that, to predict sperm motility in sperm videos, it is better to analyze more frames at the same time. This might be due to the fact that the behaviour of sperm is something that needs to be observed over time and not in single frames. Moreover, the predictions for our base case inputs I1 and I2 show the same average values. This indicates that the gains for I3 and I4 come from temporal information in different video frames rather than from the larger number of input channels; otherwise, the two base case inputs I1 and I2 would show different average values.

When we consider the predicted morphology averages in Table 1, the values are almost equal to each other. This is expected because the morphology of a sperm is something that can be observed using a single frame. Although adding frames does not make the morphology predictions more accurate, this result supports the claim that our model has the capability to learn temporal data from multiple frames: only the motility predictions improve when we increase the number of frames analyzed simultaneously.

Table 1: Mean absolute error (MAE) values obtained with the proposed method for the different inputs I1, I2, I3 and I4

                      Motility              Morphology
Input   Fold       MAE      Average      MAE      Average
I1      Fold 1    13.330                 5.698
        Fold 2    12.880    13.017       5.748    5.715
        Fold 3    12.840                 5.698
I2      Fold 1    12.890                 5.573
        Fold 2    13.010    13.017       5.593    5.606
        Fold 3    13.150                 5.653
I3      Fold 1    10.850                 5.567
        Fold 2    11.310    10.970       5.748    5.632
        Fold 3    10.750                 5.580
I4      Fold 1     9.462                 5.900
        Fold 2     9.426     9.427       5.738    5.777
        Fold 3     9.393                 5.692

4 CONCLUSION AND FUTURE WORK
In this paper, we proposed a novel method to extract temporal features from videos to create feature-images, which can be used to train traditional CNN models. Furthermore, we show that the feature-images capture temporal information present in a sequence of frames, which can be used to predict the motility of the sperm videos.

This method can be improved by using different error functions to force the model to learn more temporal data. For example, researchers can experiment with variational autoencoders [8] and generative adversarial learning methods [4] to improve this technique. Additionally, it may be beneficial to embed long short-term memory units to investigate how our feature-images compare to actual extracted temporal features.
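As a quick check on Table 1, the per-input averages can be recomputed from the per-fold MAE values. The sketch below only restates the numbers printed in the table; the dictionary layout and variable names are illustrative choices, and rounding to three decimals is assumed to match the precision of the table.

```python
# Per-fold MAE values copied from Table 1 (Fold 1, Fold 2, Fold 3).
fold_mae = {
    "I1": {"motility": [13.330, 12.880, 12.840], "morphology": [5.698, 5.748, 5.698]},
    "I2": {"motility": [12.890, 13.010, 13.150], "morphology": [5.573, 5.593, 5.653]},
    "I3": {"motility": [10.850, 11.310, 10.750], "morphology": [5.567, 5.748, 5.580]},
    "I4": {"motility": [9.462, 9.426, 9.393], "morphology": [5.900, 5.738, 5.692]},
}

# Average over the three folds, rounded to the table's precision.
averages = {
    input_type: {task: round(sum(values) / len(values), 3)
                 for task, values in tasks.items()}
    for input_type, tasks in fold_mae.items()
}

# Input type with the lowest average motility error.
best_motility_input = min(averages, key=lambda k: averages[k]["motility"])

# Spread of the morphology averages across the four input types.
morphology_values = [a["morphology"] for a in averages.values()]
morphology_spread = max(morphology_values) - min(morphology_values)
```

The recomputed motility averages fall from I1 and I2 to I3 and then I4, while the morphology averages stay within a narrow band, which mirrors the discussion in Section 3.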


REFERENCES
[1] Pierre Baldi. 2012. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML workshop on unsupervised and transfer learning. 37–49.
[2] Yong Shean Chong and Yong Haur Tay. 2017. Abnormal event detection in videos using spatiotemporal autoencoder. In Proceedings of the International Symposium on Neural Networks. Springer, 189–196.
[3] Karan Dewan, Tathagato Rai Dastidar, and Maroof Ahmad. 2018. Estimation of Sperm Concentration and Total Motility From Microscopic Videos of Human Semen Samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the Advances in neural information processing systems (NIPS). 2672–2680.
[5] Trine B. Haugen, Steven A. Hicks, Jorunn M. Andersen, Oliwia Witczak, Hugo L. Hammer, Rune Borgli, Pål Halvorsen, and Michael A. Riegler. 2019. VISEM: A Multimodal Video Dataset of Human Spermatozoa. In Proceedings of the 10th ACM on Multimedia Systems Conference (MMSys) (MMSys’19). ACM, New York, NY, USA. https://doi.org/10.1145/3304109.3325814
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 770–778.
[7] Steven Hicks, Pål Halvorsen, Trine B Haugen, Jorunn M Andersen, Oliwia Witczak, Konstantin Pogorelov, Hugo L Hammer, Duc-Tien Dang-Nguyen, Mathias Lux, and Michael Riegler. 2019. Medico Multimedia Task at MediaEval 2019. In Proceedings of the CEUR Workshop on Multimedia Benchmark Workshop (MediaEval).
[8] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[9] Sharon T Mortimer, Gerhard van der Horst, and David Mortimer. 2015. The future of computer-aided sperm analysis. Asian journal of andrology 17, 4 (2015), 545.
[10] Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Steven Alexander Hicks, Kristin Ranheim Randel, Duc-Tien Dang-Nguyen, Mathias Lux, Olga Ostroukhova, and Thomas de Lange. 2018. Medico Multimedia Task at MediaEval 2018. In Proceedings of the CEUR Workshop on Multimedia Benchmark Workshop (MediaEval).
[11] Michael Riegler, Konstantin Pogorelov, Pål Halvorsen, Carsten Griwodz, Thomas Lange, Kristin Randel, Sigrun Eskeland, Dang Nguyen, Duc Tien, Mathias Lux, and others. 2017. Multimedia for medicine: the medico task at MediaEval 2017. (2017).
[12] L. F. Urbano, P. Masson, M. VerMilyea, and M. Kam. 2017. Automatic Tracking and Motility Analysis of Human Sperm in Time-Lapse Images. IEEE Transactions on Medical Imaging (T-MI) 36, 3 (March 2017), 792–801. https://doi.org/10.1109/TMI.2016.2630720
[13] Huan Yang, Baoyuan Wang, Stephen Lin, David Wipf, Minyi Guo, and Baining Guo. 2015. Unsupervised extraction of video highlights via robust recurrent auto-encoders. In Proceedings of the IEEE international conference on computer vision (ICCV). 4633–4641.