Beyond Humanity: Leveraging Pre-trained Human Video Classification Models for Data-Efficient Multi-species Wildlife Animal Action Recognition

Wenxin Zhao1
1 Dartmouth College, 15 Thayer Dr, Hanover, NH 03755, United States

Abstract
This paper presents a transfer learning approach for data-efficient, video-based, multi-species wildlife animal action recognition using models pre-trained on human action datasets. It bridges the gap between the well-studied human-focused video classification task and the under-investigated animal action recognition task, which is largely limited by insufficient structured, annotated data across animal species. Leveraging the SlowFast framework, a state-of-the-art architecture for video classification, and conducting experiments on a small sample of the Animal Kingdom dataset, a benchmark for animal action recognition, the paper reveals a notable improvement in the mean Average Precision (mAP) score, with far less training data, when fine-tuning a model pre-trained on Kinetics-400 as compared to training from scratch or fine-tuning an image-based model pre-trained on ImageNet. This research demonstrates the promise of cross-domain transfer learning for video classification and offers substantial inspiration for advancing the understanding of animal behavior and biodiversity conservation.

Keywords
Transfer Learning, Video Classification, Wildlife Conservation, Action Recognition

1. Introduction
Computer vision has become invaluable in fostering global biodiversity conservation through global-scale camera-trap biodiversity monitoring [1][2], enabled by increasingly capable models and growing computational power. The task of video classification, especially human action classification, has gained significant attention in the computer vision community [3][4][5]. While there have been substantial advancements in human action recognition [6][7], the same cannot be said for animal action recognition, primarily due to the limited availability of structured, annotated data for a wide range of species [8]. This poses a significant challenge in developing generalized models for animal action recognition across various species [9].

This paper aims to tackle animal action recognition in videos, focusing on developing a model capable of identifying actions among a wide range of animal species with limited data. This can allow wildlife researchers to focus more on analysis than on manual data collection [10], and inspire further studies toward a deeper understanding of how and why animals behave [11]. Our primary focus is to explore whether leveraging models pre-trained on human actions is an effective transfer learning technique that improves performance when applied to animal action recognition, as opposed to training from scratch. Specifically, with Facebook's SlowFast framework [12], a state-of-the-art architecture specializing in video classification, two models, pre-trained on Kinetics-400 (human action videos) and ImageNet (generic images) respectively, will be fine-tuned on wildlife animal videos with labeled actions. By utilizing pre-trained models, we hope to transfer the knowledge acquired from the more extensive and diverse human action datasets, thereby mitigating the impact of limited data availability and advancing the state of the art in multi-species action recognition.

4th International Workshop on Camera Traps, AI, and Ecology, September 5–6, 2024, Hagenberg, Austria
wenxin.zhao.gr@dartmouth.edu (W. Zhao)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Figure 1: Training pipeline: the model takes video frames as input, trains and fine-tunes the SlowFast architecture attached to a final classifier layer with custom labels, and outputs a predicted action class with a confidence score.

2. Related Work
In the literature, numerous approaches have been developed for action recognition in videos, such as SlowFast [12], TimeSformer [13], and VideoMAE [14]. However, these state-of-the-art models are all trained on human datasets, such as Kinetics-400/600 [15], ActivityNet [16], and UCF101 [3], largely because these datasets are large-scale, structured, and accessible. Current endeavors in animal action recognition, on the other hand, are limited. Research such as [17][18][19][20] extracted skeletons of the animals and made predictions based on the relative motions of the joints, a popular technique called pose estimation. However, such an approach can be limited when applied to wildlife camera-trap data, because different species have drastically different anatomy and movement patterns, and some actions can also be context-based [8]. There have not been notable attempts to create a generalized, foundational model across species using video inputs. Furthermore, most animal datasets contain only a few types of animals, such as cows [21], mice [18], monkeys [22], apes [20], and fish [23], or a specific animal class such as mammals [24], and are usually collected in controlled or lab environments.

The Animal Kingdom dataset [8] stands out as the largest existing benchmark on multi-species action recognition for wildlife animals. The dataset contains 50 hours of video footage with annotations of 140 action classes across 850 species. On average, a video lasts 6 seconds, with a range between 1 and 117 seconds, and always contains at least one animal. This makes the dataset a suitable candidate for building a generalized animal action recognition model. This paper seeks to bridge the gap between the advancement of human video classification models and animal behavior analysis by leveraging an existing model trained on human actions to create a generalized model for wildlife animals.

3. Proposed Approach
In this paper, we present comparisons between training on the animal action dataset from scratch, fine-tuning a model pre-trained on human actions, and fine-tuning a model pre-trained on generic image-based object recognition data. We also investigate model performance with smaller training data sizes, since data availability is currently the bottleneck for biodiversity AI research [9]. We use the Animal Kingdom dataset as the training dataset. To limit the scope of the action recognition task, we use videos with only one action label and one animal species per clip, as opposed to multiple labels or species in one clip. Wildlife conservation researchers spend much of their time in the field worldwide with limited computing power and data storage resources. Inspired by this circumstance, we filter the training data to the 9 most frequently labeled actions in the dataset, defined in Table 3. Each class consists of 100 randomly selected training videos, 10 validation videos, and 10 test videos. Figure 1 shows the training pipeline using the SlowFast framework.
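To make this subsetting step concrete, the following is a minimal sketch of how such a class-balanced subset could be drawn. The annotation file name, the column layout, and the use of pandas are illustrative assumptions and not the authors' actual preprocessing code.

```python
import pandas as pd

# Hypothetical single-label annotation file: one row per clip with columns
# "video_id" and "action" (the real Animal Kingdom metadata differs in detail).
ann = pd.read_csv("animal_kingdom_single_label.csv")

# Keep only the 9 most frequently labeled actions.
top_actions = ann["action"].value_counts().nlargest(9).index
ann = ann[ann["action"].isin(top_actions)]

# 100 training, 10 validation, and 10 test clips per class.
splits = {"train": 100, "val": 10, "test": 10}
rows = []
for action, group in ann.groupby("action"):
    sampled = group.sample(n=sum(splits.values()), random_state=0)
    start = 0
    for split, count in splits.items():
        part = sampled.iloc[start:start + count].copy()
        part["split"] = split
        rows.append(part)
        start += count

subset = pd.concat(rows, ignore_index=True)
subset.to_csv("ak_top9_subset.csv", index=False)
```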
Videos are extracted into individual image frames and fed into the SlowFast architecture, where they go through two parallel convolutional neural networks (the Slow pathway and the Fast pathway) [12]. At the end, we add a classifier layer (discarding the original classifier layer when using pre-trained models) that outputs predictions for the nine action labels. We first trained a model from scratch (random initialization of weights) as our baseline. Then we obtained the weights of a model pre-trained on Kinetics-400 (K400, the human action video dataset) and used the same training dataset and configurations to fine-tune the weights and compare performance. Furthermore, to show that temporal human actions are more useful as a pre-training dataset than generic visual feature knowledge, we fine-tuned another model pre-trained on ImageNet (a large image dataset for generic object recognition) [25] and compared their performance. Lastly, to investigate performance with a limited training data size, the models were also trained with only 10 training videos and 5 test videos per class, and then compared with the ones utilizing all 900 training videos.

Table 1
The pre-trained K400 model shows the best mAP score in both cases, training on 10 and 100 videos per class.

Model                  10/class   100/class
From Scratch           0.27211    0.32320
Pre-trained K400       0.45641    0.53707
Pre-trained ImageNet   0.18941    0.33044

Following [8] and [26], mean Average Precision (mAP) is used as the evaluation metric for each model. It is computed as the unweighted mean of the per-class average precision (AP) values, bounded between 0 and 1 [27]. For each test video, the model predicts one or more action labels, each associated with a confidence score. The evaluation then takes the predictions and the confidence scores to compute the average precision across all predictions and videos. Formally, AP is calculated as follows:

AP = \sum_{i=1}^{N} p(i) \, \Delta r(i)    (1)

where N is the number of predictions, p(i) is the precision at the i-th prediction, and Δr(i) is the change in recall [5]. mAP is then calculated by taking the mean of these AP values:

\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i    (2)

where N is the number of classes.

3.1. Model Setup
Overall, the models underwent supervised learning with the labeled training data. In the experiment, the videos were converted to the 30 frames per second required by the SlowFast framework and extracted into individual image frames. For each input clip, SlowFast uses a spatial crop size of 256, a video sampling rate of 2, and 8 frames per clip. We then performed data augmentation on the sampled frames, specifically random horizontal flips and Principal Component Analysis (PCA) jittering, with jitter scales of [256, 340]. The SlowFast architecture is configured so that the inverse of the channel reduction ratio between the Slow and Fast pathways is 8, the frame rate reduction ratio between the two pathways is 4, the ratio of channel dimensions between the two pathways is 2, and the kernel dimension used for fusing information from the Fast pathway into the Slow pathway is 7. The weights of both pre-trained models were obtained from SlowFast's official GitHub repository. Each model was then trained with a Stochastic Gradient Descent optimizer, a dropout rate of 0.5, a cross-entropy loss function, a batch size of 8, and a sigmoid function on the activation layer of the output head. The learning rate started at 0.00085 and warmed up linearly at each iteration until reaching 0.0375 at the fifth epoch, then was kept constant at 0.0375 for the remaining epochs, for a total of 20 training epochs.
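As a concrete illustration of this setup, the sketch below loads a Kinetics-400 pre-trained SlowFast-R50, discards its original classifier in favor of a nine-way head, and applies the linear warm-up schedule described above. It uses the PyTorchVideo hub model as a stand-in for the weights from the SlowFast repository; the head attribute path, the SGD momentum value, and the random example clip are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 9
WARMUP_START_LR, BASE_LR, WARMUP_EPOCHS, TOTAL_EPOCHS = 0.00085, 0.0375, 5, 20

# Load a SlowFast-R50 backbone pre-trained on Kinetics-400 via the PyTorchVideo hub
# (a stand-in for the checkpoint from the SlowFast GitHub repository).
model = torch.hub.load("facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True)

# Discard the original 400-way Kinetics classifier and attach a 9-way head.
# The attribute path below matches current PyTorchVideo SlowFast models but may
# differ across versions.
head = model.blocks[-1]
head.proj = nn.Linear(head.proj.in_features, NUM_CLASSES)

# SGD optimizer and a sigmoid-based loss for the multi-label output head
# (momentum is an illustrative choice, not specified in the paper).
optimizer = torch.optim.SGD(model.parameters(), lr=WARMUP_START_LR, momentum=0.9)
criterion = nn.BCEWithLogitsLoss()

def learning_rate(epoch_progress: float) -> float:
    """Linear warm-up from 0.00085 to 0.0375 over the first 5 epochs, then constant."""
    if epoch_progress < WARMUP_EPOCHS:
        return WARMUP_START_LR + (BASE_LR - WARMUP_START_LR) * epoch_progress / WARMUP_EPOCHS
    return BASE_LR

# One illustrative training step on a random clip: the Slow pathway sees 8 frames
# and the Fast pathway 32 frames (frame rate reduction ratio 4), at a 256x256 crop.
slow = torch.randn(1, 3, 8, 256, 256)
fast = torch.randn(1, 3, 32, 256, 256)
labels = torch.zeros(1, NUM_CLASSES)
labels[0, 3] = 1.0  # the clip's single ground-truth action, as a multi-hot vector

for group in optimizer.param_groups:
    group["lr"] = learning_rate(epoch_progress=0.0)
loss = criterion(model([slow, fast]), labels)
loss.backward()
optimizer.step()
```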
Table 2
Top-1 predictions on two example videos by the K400 model and the model trained from scratch. On the video of otters swimming, the K400 model identifies the action correctly, while the model from scratch confuses the up-and-down wavy motion with jumping. In the kangaroo video, the kangaroos displayed no motion and the K400 model misinterpreted the action as keeping still.

Video          Otter video   Kangaroo video
Ground Truth   Swimming      Eating
K400           Swimming      Keeping Still
From Scratch   Jumping       Eating

4. Experimental Results
4.1. Quantitative Results
Table 1 shows the results of the experiments, where the overall best-performing model is the one pre-trained on K400 with 100 videos per action class. First, the mAP score is higher for the K400 pre-trained model than for the model trained from scratch, demonstrating that transfer learning from K400 is effective. On the other hand, the mAP of the ImageNet model shows only an insignificant increase over the model trained from scratch, much less than that of the K400 model. This suggests that the action recognition model benefits more from transfer learning from a model with temporal understanding than from a generic image classification model. Furthermore, the K400 model trained with merely 10 videos per class still yields a higher mAP than training from scratch with 100 videos per class, demonstrating its data-efficient learning nature.

Figure 2 shows the confusion matrix produced by each model trained with 100 videos per class. In Figure 2d, the K400 model's confusion matrix exhibits a darker shade along the diagonal than the other two matrices, indicating a higher number of true positives and true negatives. This suggests the model's ability to make accurate predictions across different classes.

4.2. Qualitative Analysis
While the K400 model outperforms the others quantitatively, its qualitative performance reveals areas where it excels and where it falls short. To demonstrate, both the model trained from scratch and the K400 model were applied to unseen videos. In Table 2, the otter video is an example where the K400 pre-trained model predicted correctly but the model from scratch did not. The K400 dataset contains 2588 clips labeled as swimming [15], and the pre-trained model may have learned to identify the water and waves in the video and associate them with the action swimming. The model from scratch, in contrast, had a harder time interpreting the otters' movements (moving up and down in the water), which could be mistaken for jumping. Figure 2b shows that the model from scratch often confuses videos with "swimming" as the true label with "jumping" and "flying". This confusion also occurs in the K400 model, but less frequently (0.1 compared to 0.2 for both classes; Figure 2d).

On the other hand, the kangaroo video demonstrates the reverse, where knowledge of human actions did not help. In this video, the animals barely moved across frames, and the kangaroos eating looked nothing like humans eating. As a result, the K400 model was confused and predicted "keeping still". However, the model trained from scratch, which may focus more on animals and their actions, made the correct prediction.
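To connect the evaluation protocol with these results, the sketch below shows how per-clip confidence scores, top-1 predictions (as reported in Table 2), and the mAP of Eqs. (1) and (2) could be computed from sigmoid outputs. The random placeholder scores and the use of scikit-learn's average_precision_score are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np
import torch
from sklearn.metrics import average_precision_score

ACTIONS = ["Keeping still", "Eating", "Attending", "Swimming", "Jumping",
           "Walking", "Running", "Flying", "Chirping"]

# Placeholder outputs for 90 test clips (10 per class); in practice these would
# come from the fine-tuned SlowFast model's 9-way output head.
rng = np.random.default_rng(0)
logits = torch.tensor(rng.normal(size=(90, 9)))
scores = torch.sigmoid(logits).numpy()           # per-class confidence scores
y_true = np.eye(9)[np.repeat(np.arange(9), 10)]  # one-hot ground-truth labels

# Top-1 prediction with its confidence score, as reported in Table 2.
top1 = scores.argmax(axis=1)
print("clip 0 prediction:", ACTIONS[top1[0]], "confidence:", scores[0, top1[0]])

# Per-class average precision (Eq. 1) and their unweighted mean, the mAP (Eq. 2).
per_class_ap = [average_precision_score(y_true[:, c], scores[:, c]) for c in range(9)]
print({a: round(ap, 3) for a, ap in zip(ACTIONS, per_class_ap)})
print("mAP:", np.mean(per_class_ap))
```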
Figure 2: (a) Confusion matrix for the model trained from scratch. (b) Top predictions from the model trained from scratch for videos with swimming as the ground truth; swimming is often confused with flying and jumping. (c) Confusion matrix for the model pre-trained on ImageNet. (d) Confusion matrix for the model pre-trained on Kinetics-400. In (a), (c), and (d), the K400 model displays darker shades along the diagonal than the other models, showing more true positives and true negatives; (b) shows that some action predictions may be confused with similar motions.

5. Conclusion
This paper demonstrates the effectiveness and data efficiency of transfer learning from K400 human action videos to the multi-species animal action recognition task, which outperforms both the ImageNet pre-trained model and the model trained from scratch. Future work includes implementing more advanced video classification frameworks such as TimeSformer and VideoMAE; incorporating a wider range of action classes, multi-action labels, and multiple animal species in the same frame; and evaluating the K400 model more comprehensively with more test data to reveal the actions it excels at and those it confuses most.

6. Acknowledgement
This work was advised by Dr. SouYoung Jin from Dartmouth College and sponsored by the Department of Computer Science at Dartmouth College.

Table 3
Descriptions of the 9 most labeled actions used for training, from the Animal Kingdom dataset [8].

Category        Action          Description
General         Keeping still   Animal makes no or minimal movement (i.e., animals staying still and alert)
Feeding         Eating          Includes feeding, grazing, and gnawing
Sensing         Attending       Animal locates a stimulus of potential interest and directs its attention (eyes, ears, face) towards it, often keeping very still to observe the situation
Movement        Swimming        Animal swims in the water (e.g., fish) or on the surface of the water (e.g., water birds)
Movement        Jumping         Animal makes a large jumping movement from one spot to another (e.g., from lower to higher ground), or on the same spot
Movement        Walking         Animal moves from one spot to another at a slow pace
Movement        Running
Movement        Flying
Communication   Chirping

References
[1] F. Iannarilli, R. Oliver, T. Birch, S. Beery, E. Fegraus, N. Flores, R. Kays, J. Ahumada, W. Jetz, Wildlife insights: How camera trap data can foster global biodiversity conservation (2022).
[2] A. Singh, M. Pietrasik, G. Natha, N. Ghouaiel, K. Brizel, N. Ray, Animal detection in man-made environments, CoRR abs/1910.11443 (2019). URL: http://arxiv.org/abs/1910.11443. arXiv:1910.11443.
[3] K. Soomro, A. R. Zamir, M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, CoRR abs/1212.0402 (2012). URL: http://arxiv.org/abs/1212.0402. arXiv:1212.0402.
[4] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, R. Memisevic, The "something something" video database for learning and evaluating visual common sense, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[5] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, S. Vijayanarasimhan, YouTube-8M: A large-scale video classification benchmark, CoRR abs/1609.08675 (2016). URL: http://arxiv.org/abs/1609.08675. arXiv:1609.08675.
[6] B. Jiang, M. Wang, W. Gan, W. Wu, J. Yan, STM: Spatiotemporal and motion encoding for action recognition, 2019. URL: https://arxiv.org/abs/1908.02486. arXiv:1908.02486.
[7] D. Lee, J. Lee, J. Choi, CAST: Cross-attention in space and time for video action recognition, 2023. URL: https://arxiv.org/abs/2311.18825. arXiv:2311.18825.
[8] X. L. Ng, K. E. Ong, Q. Zheng, Y. Ni, S. Y. Yeo, J. Liu, Animal Kingdom: A large and diverse dataset for animal behavior understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19023–19034.
[9] L. Ziegler, O. Sturman, J. Bohacek, Big behavior: challenges and opportunities in a new era of deep behavior profiling, Neuropsychopharmacology 46 (2020). doi:10.1038/s41386-020-0751-7.
[10] E. Fazzari, D. Romano, F. Falchi, C. Stefanini, Animal behavior analysis methods using deep learning: A survey, 2024. URL: https://arxiv.org/abs/2405.14002. arXiv:2405.14002.
[11] A. E. Brown, B. de Bivort, Ethology as a physical science, bioRxiv (2018). URL: https://www.biorxiv.org/content/early/2018/02/02/220855. doi:10.1101/220855.
[12] C. Feichtenhofer, H. Fan, J. Malik, K. He, SlowFast networks for video recognition, CoRR abs/1812.03982 (2018). URL: http://arxiv.org/abs/1812.03982. arXiv:1812.03982.
[13] G. Bertasius, H. Wang, L. Torresani, Is space-time attention all you need for video understanding?, CoRR abs/2102.05095 (2021). URL: https://arxiv.org/abs/2102.05095. arXiv:2102.05095.
[14] Z. Tong, Y. Song, J. Wang, L. Wang, VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022. arXiv:2203.12602.
[15] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman, The Kinetics human action video dataset, CoRR abs/1705.06950 (2017). URL: http://arxiv.org/abs/1705.06950. arXiv:1705.06950.
[16] F. C. Heilbron, V. Escorcia, B. Ghanem, J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 961–970. doi:10.1109/CVPR.2015.7298698.
[17] L. Feng, Y. Zhao, Y. Sun, W. Zhao, J. Tang, Action recognition using a spatial-temporal network for wild felines, Animals 11 (2021). URL: https://www.mdpi.com/2076-2615/11/2/485. doi:10.3390/ani11020485.
[18] C. Segalin, J. Williams, T. Karigo, M. Hui, M. Zelikowsky, J. J. Sun, P. Perona, D. J. Anderson, A. Kennedy, The mouse action recognition system (MARS) software pipeline for automated analysis of social behaviors in mice, eLife 10 (2021) e63720. URL: https://doi.org/10.7554/eLife.63720. doi:10.7554/eLife.63720.
[19] J. Lauer, M. Zhou, S. Ye, W. Menegas, T. Nath, M. M. Rahman, V. D. Santo, D. Soberanes, G. Feng, V. N. Murthy, G. Lauder, C. Dulac, M. W. Mathis, A. Mathis, Multi-animal pose estimation and tracking with DeepLabCut, bioRxiv (2021). URL: https://www.biorxiv.org/content/early/2021/04/30/2021.04.30.442096. doi:10.1101/2021.04.30.442096.
[20] M. Fuchs, E. Genty, K. Zuberbühler, P. Cotofrei, ASBAR: an animal skeleton-based action recognition framework. Recognizing great ape behaviors in the wild using pose estimation with domain adaptation, bioRxiv (2023). doi:10.1101/2023.09.24.559236.
[21] Y. Liang, F. Xue, X. Chen, Z. Wu, X. Chen, A benchmark for action recognition of large animals, in: 2018 7th International Conference on Digital Home (ICDH), 2018, pp. 64–71. doi:10.1109/ICDH.2018.00020.
[22] Y. Yao, P. Bala, A. Mohan, E. Bliss-Moreau, K. Coleman, S. M. Freeman, C. J. Machado, J. Raper, J. Zimmermann, B. Y. Hayden, et al., OpenMonkeyChallenge: Dataset and benchmark challenges for pose estimation of non-human primates, International Journal of Computer Vision 131 (2023) 243–258.
[23] J. Kay, P. Kulits, S. Stathatos, S. Deng, E. Young, S. Beery, G. V. Horn, P. Perona, The Caltech fish counting dataset: A benchmark for multiple-object tracking and counting, 2022. arXiv:2207.09295.
[24] H. Yu, Y. Xu, J. Zhang, W. Zhao, Z. Guan, D. Tao, AP-10K: A benchmark for animal pose estimation in the wild, CoRR abs/2108.12617 (2021). URL: https://arxiv.org/abs/2108.12617. arXiv:2108.12617.
[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[26] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, A. Gupta, Hollywood in homes: Crowdsourcing data collection for activity understanding, CoRR abs/1604.01753 (2016). URL: http://arxiv.org/abs/1604.01753. arXiv:1604.01753.
[27] S. Bhardwaj, M. Srinivasan, M. M. Khapra, Efficient video classification using fewer frames, CoRR abs/1902.10640 (2019). URL: http://arxiv.org/abs/1902.10640. arXiv:1902.10640.