<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Humanity: Leveraging Pre-trained Human Video Classification Models for Data-Efficient Multi-species Wildlife Animal Action Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wenxin Zhao</string-name>
          <email>wenxin.zhao.gr@dartmouth.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>4th International Workshop on Camera Traps</institution>
          ,
          <addr-line>AI, and Ecology</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dartmouth College</institution>
          ,
          <addr-line>15 Thayer Dr, Hanover, NH 03755</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a transfer learning approach for data-efficient, video-based, multi-species wildlife animal action recognition, using models pre-trained on human action datasets. It bridges the gap between well-studied human-focused video classification and under-investigated animal action recognition, which is largely limited by insufficient structured, annotated data across animal species. By leveraging the SlowFast framework, a state-of-the-art architecture for video classification, and conducting experiments on a small sample of the Animal Kingdom dataset, a benchmark for animal action recognition, the paper reveals a notable improvement in the mean Average Precision (mAP) score, with far less training data, when fine-tuning a model pre-trained on Kinetics-400 as compared to training from scratch or using an image-based model pre-trained on ImageNet. This research demonstrates the promise of cross-domain transfer learning for video classification and can inform work on understanding animal behavior and on biodiversity conservation.</p>
      </abstract>
      <kwd-group>
        <kwd>Transfer Learning</kwd>
        <kwd>Video Classification</kwd>
        <kwd>Wildlife Conservation</kwd>
        <kwd>Action Recognition</kwd>
        <kwd>Multi-species</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Computer vision has become invaluable in fostering global biodiversity conservation, through
global-scale camera-trap biodiversity monitoring [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and through increasingly capable models and greater
computational power. The task of video classification, especially human action classification,
has gained significant attention in the computer vision community [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][5]. While there have
been substantial advancements in human action recognition [6][7], the same cannot be said for animal
action recognition, primarily due to the limited availability of structured, annotated data for a wide
range of species [8]. This poses a significant challenge in developing generalized models for animal
action recognition across various species [9].
      </p>
      <p>This paper tackles animal action recognition in videos, focusing on developing a model
capable of identifying actions across a wide range of animal species with limited data. Such a model can allow
wildlife researchers to focus more on analysis than on manual data collection [10], and inspire further
studies for a deeper understanding of how and why animals behave [11]. Our primary focus is to
explore whether leveraging models pre-trained on human actions is an effective transfer learning
technique that improves performance on animal action recognition, as opposed to training
from scratch. Specifically, with Facebook’s SlowFast framework [12], a state-of-the-art architecture
specializing in video classification, two pre-trained models, one trained on the Kinetics-400 human
action dataset and one on the ImageNet image dataset, are fine-tuned on wildlife animal videos with labeled actions. By using pre-trained
models, we hope to exploit the knowledge acquired from the more extensive and diverse human action
datasets, thereby mitigating the impact of limited data availability and advancing the state of the art in
multi-species action recognition.</p>
      <p>CEUR</p>
      <p>ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In the literature, numerous approaches have been developed for action recognition in videos, such as
SlowFast [12], TimeSformer [13], and VideoMAE [14]. However, these state-of-the-art models are all
trained on human datasets, such as Kinetics 400/600 [15], ActivityNet [16], and UCF101 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], largely because
these datasets are large-scale, structured, and accessible.
      </p>
      <p>Current endeavors at animal action recognition, on the other hand, are limited. Studies such as [17]
[18] [19] [20] extracted the skeletons of the animals and made predictions based on the relative motions of
the joints, a popular technique called pose estimation. However, such an approach can be limited when
applied to wildlife camera trap data, because different species can have drastically different anatomy
and movement patterns, and some actions are context-based [8]. There have not been notable
attempts to create a generalized, foundational model across species using video inputs.</p>
      <p>Furthermore, most animal datasets contain only a few types of animals, such as cows [21], mice [18],
monkeys [22], apes [20], and fish [23], or a specific animal class such as mammals [24], usually in
a controlled or lab environment. The Animal Kingdom dataset [8] stands out as the largest existing
benchmark on multi-species action recognition for wildlife animals. The dataset contains 50 hours of
video footage with annotations of 140 action classes across 850 species. On average, a video lasts 6
seconds, with a range from 1 to 117 seconds, and always contains at least one animal. These properties
make the dataset a suitable candidate for building a generalized animal action recognition model.</p>
      <p>This paper seeks to bridge the gap between the advancement of human video classification models
and animal behavior analysis, by leveraging an existing model trained on human actions to create a
generalized model for wildlife animals.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>In this paper, we compare training on the animal action dataset from scratch,
fine-tuning a model pre-trained on human actions, and fine-tuning a model pre-trained on generic
image-based object identification data. We also investigate model performance with smaller training
sets, since data size is currently the bottleneck for biodiversity AI research [9].</p>
      <p>We used the Animal Kingdom dataset as the training dataset. To limit the scope of the action
recognition task, we used videos with only one action label and one animal species per clip, as opposed to
multiple labels or species in one clip. Wildlife conservation researchers spend much of their time in the
field worldwide with limited computing power and data storage resources. Motivated by this constraint,
we filtered the training data to keep only the 9 most labeled actions in the dataset, defined in Table 3. Each
class consists of 100 randomly selected training videos, 10 validation videos, and 10 test videos.</p>
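      <p>The filtering and sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name <monospace>build_splits</monospace>, the tuple layout of <monospace>clips</monospace>, and the fixed random seed are all assumptions made for the example.</p>
      <preformat>
```python
import random
from collections import Counter, defaultdict


def build_splits(clips, num_classes=9, n_train=100, n_val=10, n_test=10, seed=0):
    """Return train/val/test lists of (video_id, action) pairs.

    `clips` is a hypothetical list of (video_id, action_labels, species)
    tuples; the real annotation format of the dataset may differ.
    """
    # Keep only clips with exactly one action label and one species.
    single = [(vid, actions[0]) for vid, actions, species in clips
              if len(actions) == 1 and len(species) == 1]

    # Select the most frequently labeled actions.
    counts = Counter(action for _, action in single)
    top_actions = [action for action, _ in counts.most_common(num_classes)]

    by_action = defaultdict(list)
    for vid, action in single:
        by_action[action].append(vid)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    needed = n_train + n_val + n_test
    splits = {"train": [], "val": [], "test": []}
    for action in top_actions:
        vids = sorted(by_action[action])
        if needed > len(vids):
            raise ValueError(f"{action!r} has only {len(vids)} clips, need {needed}")
        picked = rng.sample(vids, needed)  # sampling without replacement
        splits["train"] += [(v, action) for v in picked[:n_train]]
        splits["val"] += [(v, action) for v in picked[n_train:n_train + n_val]]
        splits["test"] += [(v, action) for v in picked[n_train + n_val:]]
    return splits
```
      </preformat>
      <p>Sampling all 120 clips of a class in one draw and then slicing keeps the three subsets disjoint, so no test video can leak into training.</p>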
      <p>Figure 1 shows the training pipeline using the SlowFast framework. Videos are extracted into
individual image frames and fed into the SlowFast architecture, where they pass through two parallel
convolutional neural networks (the Slow pathway and the Fast pathway) [12]. At the end, we add a
classifier layer (discarding the original classifier layer when using pre-trained models) that outputs the
predictions for the nine action labels. We first trained a model from scratch (random initialization of
weights) as our baseline result. Then we obtained weights of a model pre-trained on Kinetics-400 (K400,</p>
      <sec id="sec-3-2">
        <title>Table 1 (column headers: From Scratch, Pre-trained K400, Pre-trained ImageNet; surviving row: 10/class, 0.27211)</title>
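        <p>The mean Average Precision (mAP) is computed as
<disp-formula><tex-math>\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i</tex-math></disp-formula>
where N is the number of classes and AP_i is the Average Precision of class i.</p>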
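        <p>As an illustration of the metric, the sketch below averages per-class AP over the N classes. The paper does not state which AP variant was used, so the non-interpolated form here (mean precision at the rank of each positive) is an assumption, as are the function names and the nested-list input layout.</p>
        <preformat>
```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: predicted score per video; labels: 1 if the video belongs
    to the class, else 0. Returns 0.0 when the class has no positives.
    """
    # Rank videos from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """mAP = (1/N) * sum of AP_i over the N classes.

    score_matrix[j][c] is the score of video j for class c;
    label_matrix[j][c] is the matching 0/1 ground truth.
    """
    n_classes = len(score_matrix[0])
    aps = [
        average_precision([row[c] for row in score_matrix],
                          [row[c] for row in label_matrix])
        for c in range(n_classes)
    ]
    return sum(aps) / n_classes
```
        </preformat>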
        <sec id="sec-3-2-1">
          <title>3.1. Model Setup</title>
          <p>Overall, the models were trained with supervised learning on the labeled training data. In the experiment,
the videos were converted to the 30 frames per second required by the SlowFast framework and
extracted into individual image frames. For each input clip, SlowFast uses a spatial crop size
of 256, a video sampling rate of 2, and 8 frames per clip. We then performed data augmentation on the
sampled frames, specifically random horizontal flipping and Principal Component Analysis (PCA)
jittering with scales [256, 340]. The SlowFast architecture is configured so that the inverse of the channel
reduction ratio between the Slow and Fast pathways is 8, the frame rate reduction ratio between the
Slow and Fast pathways is 4, the ratio of channel dimensions between the Slow and Fast pathways is 2,
and the kernel dimension used for fusing information from the Fast pathway into the Slow pathway is 7. Weights
of both pre-trained models were obtained from SlowFast’s official GitHub repository. Each model
was then trained with a Stochastic Gradient Descent optimizer, a dropout rate of 0.5, a cross-entropy loss
function, a batch size of 8, and a Sigmoid activation on the output head. The
learning rate started at 0.00085 and warmed up linearly at each iteration until reaching 0.0375 at the
fifth epoch, then stayed constant at 0.0375 for the remaining epochs. Each model was trained for
20 epochs in total.</p>
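          <p>The learning-rate schedule described above can be sketched as a function of the fractional epoch. This is an illustrative reading rather than the framework's implementation: it assumes the warm-up spans the first five epochs, and it assumes 900 training videos (9 classes with 100 videos each) to derive the number of iterations per epoch. The names <monospace>learning_rate</monospace> and <monospace>schedule</monospace> are hypothetical.</p>
          <preformat>
```python
import math

WARMUP_START_LR = 0.00085
BASE_LR = 0.0375
WARMUP_EPOCHS = 5  # assumption: warm-up covers the first five epochs
MAX_EPOCH = 20


def learning_rate(cur_epoch):
    """Learning rate at a fractional epoch (epoch index + progress in epoch)."""
    if cur_epoch >= WARMUP_EPOCHS:
        return BASE_LR  # constant after warm-up
    # Linear interpolation from the starting rate to the base rate.
    slope = (BASE_LR - WARMUP_START_LR) / WARMUP_EPOCHS
    return WARMUP_START_LR + slope * cur_epoch


def schedule(num_train_videos=900, batch_size=8):
    """Per-iteration learning rates for the whole run."""
    iters_per_epoch = math.ceil(num_train_videos / batch_size)
    return [learning_rate(epoch + it / iters_per_epoch)
            for epoch in range(MAX_EPOCH)
            for it in range(iters_per_epoch)]
```
          </preformat>
          <p>Updating the rate per iteration rather than per epoch avoids step changes in the rate during the warm-up phase.</p>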
        </sec>
      </sec>
      <sec id="sec-3-8">
        <title>4. Experimental Results</title>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption>
            <p>Ground-truth actions and predictions of the K400 pre-trained model and the model trained from scratch on two unseen videos.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Video</th><th>Ground Truth</th><th>K400</th><th>From Scratch</th></tr>
            </thead>
            <tbody>
              <tr><td>Otter</td><td>Swimming</td><td>Swimming</td><td>Jumping</td></tr>
              <tr><td>Kangaroo</td><td>Eating</td><td>Keeping Still</td><td>Eating</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-8-0">
          <title>4.1. Quantitative Results</title>
          <p>Table 1 shows the results of the experiments, where the overall best-performing model is the one
pre-trained on K400 and fine-tuned with 100 videos per action class. First, the mAP score is higher for the K400 pre-trained
model than for the model trained from scratch, demonstrating that transfer learning from K400 is effective. On the other
hand, the mAP of the ImageNet model shows only a marginal increase over the model trained from scratch,
much smaller than that of the K400 model. This suggests that the action recognition model benefits more
from transfer learning from a model with temporal understanding than from a generic image classification
model. Furthermore, the K400 model trained with merely 10 videos per class still yields a higher mAP
than training from scratch with 100 videos per class, demonstrating its data-efficient learning.</p>
          <p>Fig. 2 shows the confusion matrix produced by each model trained with 100 videos per class. In
Fig. 2d, the K400 model’s confusion matrix exhibits a darker shade along the diagonal than the other two
matrices, indicating a higher proportion of correct predictions. This suggests that the model
predicts accurately across different classes.</p>
        </sec>
        <sec id="sec-3-8-1">
          <title>4.2. Qualitative Analysis</title>
          <p>While the K400 model performs best quantitatively, a qualitative look reveals where it
excels and where it falls short. To demonstrate, both the model trained from scratch and the K400 model were
applied to unseen videos. In Table 2, the Otter video is an example where the K400 pre-trained model
predicted correctly but the model trained from scratch predicted incorrectly. The K400 dataset contains 2588 clips
labeled as swimming [15], and the pre-trained model may have learned to identify the water and waves
in the video and associate them with the action swimming. The model trained from scratch, by contrast, had a harder
time identifying the otters’ movements (moving up and down in the water), which could
be mistaken for jumping. Fig. 2b shows that the model trained from scratch often confuses videos whose true label is “swimming”
with “jumping” and “flying”. This confusion also occurs in the K400 model, but
less frequently (0.1 compared to 0.2 for both classes; Fig. 2d).</p>
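          <p>Confusion rates such as the 0.1 and 0.2 quoted above are consistent with a confusion matrix whose rows are normalized by the number of videos in each true class. The sketch below shows one way to compute such a matrix. Whether the paper's figures use exactly this normalization is an assumption, and the function names and toy labels in the usage note are illustrative only, not the paper's data.</p>
          <preformat>
```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    return counts


def row_normalize(counts):
    """Divide each row by its total so entries are per-class rates."""
    normalized = []
    for row in counts:
        total = sum(row)
        # A class with no test videos yields an all-zero row.
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized
```
          </preformat>
          <p>For example, with ten toy videos whose true label is swimming, of which six are predicted as swimming, two as jumping, and two as flying, the swimming row of the normalized matrix is [0.6, 0.2, 0.2]. The diagonal entry is the per-class rate of correct predictions.</p>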
          <p>On the other hand, the Kangaroo video demonstrates the reverse, where knowledge of human actions
did not help. In the Kangaroo video, the animals barely moved across the video frames, and the kangaroos
eating looked nothing like humans eating. For this video, the K400 model was confused and predicted
“keeping still”. The model trained from scratch, which may focus more on animals
and their actions, predicted correctly.</p>
          <p>Figure 2: (a) Confusion matrix for the model trained from scratch. (b) Top predictions from the model trained from scratch for videos with Swimming as ground truth; the Swimming action is often confused with Flying and Jumping. (c) Confusion matrix for the model pre-trained on ImageNet. (d) Confusion matrix for the model pre-trained on Kinetics-400.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This paper demonstrates the effectiveness and data efficiency of transfer learning from the K400 human
action videos to the multi-species animal action recognition task, which outperforms both ImageNet
pre-training and training from scratch. Future work includes implementing more advanced video classification
frameworks, such as TimeSformer and VideoMAE; incorporating a wider range of action classes,
multi-action labels, and multiple animal species in the same frame; and evaluating the K400 model more
comprehensively with more test data to reveal the actions at which it excels and those it confuses most.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Acknowledgement</title>
      <p>This work was advised by Dr. SouYoung Jin from Dartmouth College and sponsored by the Department
of Computer Science at Dartmouth College.</p>
      <table-wrap id="tab-3">
        <label>Table 3</label>
        <caption>
          <p>The nine most labeled actions in the dataset.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Category</th><th>Action</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>General</td><td>Keeping still</td><td>Animal makes no or minimal movement (i.e., animals staying still and alert)</td></tr>
            <tr><td>Feeding</td><td>Eating</td><td>Includes feeding, grazing, and gnawing</td></tr>
            <tr><td>Sensing</td><td>Attending</td><td>Animal locates a stimulus of potential interest, and directs its attention (eyes, ears, face) towards it, often keeping very still to observe the situation</td></tr>
            <tr><td>Movement</td><td>Swimming</td><td>Animal swims in the water (e.g. fish), or on the surface of water (e.g. water birds)</td></tr>
            <tr><td>Movement</td><td>Jumping</td><td>Animal makes large jumping movement from one spot to another (e.g. from lower to higher grounds), or on the same spot</td></tr>
            <tr><td>Movement</td><td>Walking</td><td>Animal moves from one spot to another at a slow pace</td></tr>
            <tr><td>Movement</td><td>Running</td><td></td></tr>
            <tr><td>Movement</td><td>Flying</td><td></td></tr>
            <tr><td>Communication</td><td>Chirping</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-5-22">
        <title>References</title>
        <p>[5] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, S. Vijayanarasimhan, Youtube-8m: A large-scale video classification benchmark, CoRR abs/1609.08675 (2016). URL: http://arxiv.org/abs/1609.08675. arXiv:1609.08675.</p>
        <p>[6] B. Jiang, M. Wang, W. Gan, W. Wu, J. Yan, Stm: Spatiotemporal and motion encoding for action recognition, 2019. URL: https://arxiv.org/abs/1908.02486. arXiv:1908.02486.</p>
        <p>[7] D. Lee, J. Lee, J. Choi, Cast: Cross-attention in space and time for video action recognition, 2023. URL: https://arxiv.org/abs/2311.18825. arXiv:2311.18825.</p>
        <p>[8] X. L. Ng, K. E. Ong, Q. Zheng, Y. Ni, S. Y. Yeo, J. Liu, Animal kingdom: A large and diverse dataset for animal behavior understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19023–19034.</p>
        <p>[9] L. Ziegler, O. Sturman, J. Bohacek, Big behavior: challenges and opportunities in a new era of deep behavior profiling, Neuropsychopharmacology 46 (2020). doi:10.1038/s41386-020-0751-7.</p>
        <p>[10] E. Fazzari, D. Romano, F. Falchi, C. Stefanini, Animal behavior analysis methods using deep learning: A survey, 2024. URL: https://arxiv.org/abs/2405.14002. arXiv:2405.14002.</p>
        <p>[11] A. E. Brown, B. de Bivort, Ethology as a physical science, bioRxiv (2018). URL: https://www.biorxiv.org/content/early/2018/02/02/220855. doi:10.1101/220855.</p>
        <p>[12] C. Feichtenhofer, H. Fan, J. Malik, K. He, Slowfast networks for video recognition, CoRR abs/1812.03982 (2018). URL: http://arxiv.org/abs/1812.03982. arXiv:1812.03982.</p>
        <p>[13] G. Bertasius, H. Wang, L. Torresani, Is space-time attention all you need for video understanding?, CoRR abs/2102.05095 (2021). URL: https://arxiv.org/abs/2102.05095. arXiv:2102.05095.</p>
        <p>[14] Z. Tong, Y. Song, J. Wang, L. Wang, Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022. arXiv:2203.12602.</p>
        <p>[15] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman, The kinetics human action video dataset, CoRR abs/1705.06950 (2017). URL: http://arxiv.org/abs/1705.06950. arXiv:1705.06950.</p>
        <p>[16] F. C. Heilbron, V. Escorcia, B. Ghanem, J. C. Niebles, Activitynet: A large-scale video benchmark for human activity understanding, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 961–970. doi:10.1109/CVPR.2015.7298698.</p>
        <p>[17] L. Feng, Y. Zhao, Y. Sun, W. Zhao, J. Tang, Action recognition using a spatial-temporal network for wild felines, Animals 11 (2021). URL: https://www.mdpi.com/2076-2615/11/2/485. doi:10.3390/ani11020485.</p>
        <p>[18] C. Segalin, J. Williams, T. Karigo, M. Hui, M. Zelikowsky, J. J. Sun, P. Perona, D. J. Anderson, A. Kennedy, The mouse action recognition system (mars) software pipeline for automated analysis of social behaviors in mice, eLife 10 (2021) e63720. URL: https://doi.org/10.7554/eLife.63720. doi:10.7554/eLife.63720.</p>
        <p>[19] J. Lauer, M. Zhou, S. Ye, W. Menegas, T. Nath, M. M. Rahman, V. D. Santo, D. Soberanes, G. Feng, V. N. Murthy, G. Lauder, C. Dulac, M. W. Mathis, A. Mathis, Multianimal pose estimation and tracking with deeplabcut, bioRxiv (2021). URL: https://www.biorxiv.org/content/early/2021/04/30/2021.04.30.442096. doi:10.1101/2021.04.30.442096.</p>
        <p>[20] M. Fuchs, E. Genty, K. Zuberbühler, P. Cotofrei, Asbar: an animal skeleton-based action recognition framework. recognizing great ape behaviors in the wild using pose estimation with domain adaptation, bioRxiv (2023). doi:10.1101/2023.09.24.559236.</p>
        <p>[21] Y. Liang, F. Xue, X. Chen, Z. Wu, X. Chen, A benchmark for action recognition of large animals, in: 2018 7th International Conference on Digital Home (ICDH), 2018, pp. 64–71. doi:10.1109/ICDH.2018.00020.</p>
        <p>[22] Y. Yao, P. Bala, A. Mohan, E. Bliss-Moreau, K. Coleman, S. M. Freeman, C. J. Machado, J. Raper, J. Zimmermann, B. Y. Hayden, et al., Openmonkeychallenge: Dataset and benchmark challenges for pose estimation of non-human primates, International Journal of Computer Vision 131 (2023) 243–258.</p>
        <p>[23] J. Kay, P. Kulits, S. Stathatos, S. Deng, E. Young, S. Beery, G. V. Horn, P. Perona, The caltech fish counting dataset: A benchmark for multiple-object tracking and counting, 2022. arXiv:2207.09295.</p>
        <p>[24] H. Yu, Y. Xu, J. Zhang, W. Zhao, Z. Guan, D. Tao, AP-10K: A benchmark for animal pose estimation in the wild, CoRR abs/2108.12617 (2021). URL: https://arxiv.org/abs/2108.12617. arXiv:2108.12617.</p>
        <p>[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.</p>
        <p>[26] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, A. Gupta, Hollywood in homes: Crowdsourcing data collection for activity understanding, CoRR abs/1604.01753 (2016). URL: http://arxiv.org/abs/1604.01753. arXiv:1604.01753.</p>
        <p>[27] S. Bhardwaj, M. Srinivasan, M. M. Khapra, Efficient video classification using fewer frames, CoRR abs/1902.10640 (2019). URL: http://arxiv.org/abs/1902.10640. arXiv:1902.10640.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Iannarilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Oliver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Beery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fegraus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Flores</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahumada</surname>
          </string-name>
          , W. Jetz,
          <article-title>Wildlife insights: How camera trap data can foster global biodiversity conservation</article-title>
          (<year>2022</year>).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietrasik</surname>
          </string-name>
          , G. Natha,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ghouaiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Brizel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <article-title>Animal detection in man-made environments</article-title>
          , CoRR abs/1910.11443 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.11443. arXiv:1910.11443.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Soomro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild</article-title>
          ,
          <source>CoRR</source>
          abs/1212.0402 (
          <year>2012</year>
          ). URL: http://arxiv.org/abs/1212.0402. arXiv:1212.0402.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Ebrahimi</given-names>
            <surname>Kahou</surname>
          </string-name>
          , V. Michalski, J. Materzynska,
          <string-name>
            <given-names>S.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Haenel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fruend</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yianilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mueller-Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hoppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thurau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bax</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Memisevic</surname>
          </string-name>
          ,
          <article-title>The “something something” video database for learning and evaluating visual common sense</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>