Enhanced Vision-Based Human Fall Detection with Mask R-CNN and Autoencoder-LSTM Hybrid Framework

Sritama Chakraborty 1,*,†, Mili Ghosh 2,†

1 Department of Computer Science & Technology, University of North Bengal
2 Department of Computer Science & Technology, University of North Bengal

The 2024 Sixth Doctoral Symposium on Intelligence Enabled Research (DoSIER 2024), November 28–29, 2024, Jalpaiguri, India
* Corresponding author.
† These authors contributed equally.
sritama.sc@gmail.com (S. Chakraborty); ghosh.mili90@gmail.com (M. Ghosh)

Abstract

Elderly individuals who live independently face a heightened risk of serious harm from accidental falls, a leading contributor to mortality in this demographic, making fall detection a critical part of healthcare for older adults. This paper introduces a methodology for identifying falls among elderly individuals using machine learning algorithms. We propose Mask R-CNN, Autoencoder-LSTM, and a hybrid Mask R-CNN-Autoencoder-LSTM framework to detect falls in a video surveillance environment. These algorithms are trained to recognize deformations in human body shape and posture over time, enabling them to flag potential fall occurrences. Our findings suggest that the Mask R-CNN and Autoencoder-LSTM models perform well independently, with Mask R-CNN excelling in spatial feature extraction and the Autoencoder-LSTM effectively modeling temporal changes in body movement. However, the hybrid Mask R-CNN-Autoencoder-LSTM approach does not fully capitalize on the strengths of both models, yielding slightly lower performance than the individual models. Despite this, both Mask R-CNN and Autoencoder-LSTM present a reliable solution for fall detection, with the hybrid framework offering potential improvements for future research.

Keywords

Human Fall Detection, Mask R-CNN, Autoencoder LSTM, Hybrid Model

1. Introduction

In recent times, the prevalence of aging has markedly risen across the population, mirroring demographic changes and extended life expectancies. In the elderly healthcare sector, falls are a critical health issue: accidental deaths among older adults are most often caused by falls [1]. According to the 2021 census, around 6.8 percent of the population was over 65 years old, and according to Elderly in India 2021 from the National Statistical Office (NSO), India's older adult population is expected to reach 194 million by 2031, a substantial increase of 41 percent over the preceding decade. Given the growing population of elderly individuals living independently, it becomes imperative for both governmental and private entities to devise efficient intelligent surveillance systems capable of identifying potential fall risks. The increasing number of people seeking security and healthcare services has accelerated the need for fall detection. Using depth sensors, skeleton joint data, and Hidden Markov Models, the research in [2] introduces a depth-based life-logging system for senior activity detection that shows promise for tracking daily activities and supporting medical treatment. Many research projects have been conducted to create efficient fall detection algorithms. Even though fall incidents cannot be entirely prevented, the ability to precisely identify a fall and issue an emergency notice can save lives; a monitoring system should therefore distinguish fall events from normal activity.
Various methods have been proposed to identify falls using wearable devices such as accelerometers, gyroscopes, and magnetometers [3]. However, such methods have limited effectiveness, since wearing these devices for long periods is impractical [4].

Furthermore, different machine learning techniques can detect various types of falls; their adaptability across contexts and categories of falls has contributed significantly to their widespread adoption. These include Convolutional Neural Networks (CNNs); random forests; multi-layer perceptrons [5]; the Hidden Markov Model (HMM), a statistical model; K-Nearest Neighbours (KNN), a non-parametric supervised learning method; Support Vector Machines (SVM); and decision trees. Using deep convolutional neural networks to classify images and videos has shown outstanding results in computer vision and machine learning [6], [7]. CNNs that learn features from training data provide a workable automatic feature extraction approach for images. We utilize Long Short-Term Memory (LSTM) networks, a family of recurrent neural networks that can store information over extended durations. Besides its capability to comprehend and forecast patterns in time series, text, speech, and images, LSTM demonstrates remarkable effectiveness in understanding and predicting patterns within sequential data. In [8], [9], the authors introduced a visual attention-guided bidirectional LSTM together with a Mask R-CNN to address the challenges posed by complex background environments; their model integrates spatial and temporal information within intricate scenes to effectively tackle the fall detection problem. Our final proposal is a hybrid Mask R-CNN-Autoencoder-LSTM approach: the features of a video are extracted using a convolutional neural network, while the video is categorized by an LSTM neural network [10]. In [11], the researchers devised a fusion of a 3D CNN and an LSTM-based visual attention mechanism to capture spatiotemporal features from video sequences related to fall detection, effectively encoding motion information.

Our study aims to implement a video action recognition system that detects human falls. We apply Mask R-CNN and Autoencoder-LSTM to each frame of a camera video to obtain human shape-distortion features for fall detection. The remainder of this paper is structured as follows. Section 2 reviews the background. Section 3 describes the UR Fall Detection dataset and proposes fall detection methods using Mask R-CNN, Autoencoder-LSTM, and a hybrid Mask R-CNN-Autoencoder-LSTM, respectively. Section 4 presents a comparative study of these three experimental methods. Finally, Section 5 presents the conclusions and future work.

2. Background Study

Deep neural networks have garnered increasing popularity with advancements in technology.
Deep learning models are trained on vast amounts of data, allowing them to learn complex patterns and relationships within the data. This enables them to make accurate predictions, classifications, or decisions in various domains such as image recognition, natural language processing, and autonomous systems. Within the spectrum of deep learning algorithms, convolutional neural networks (CNNs) have established themselves as the foremost choice in computer vision. In a CNN architecture, different layers serve specific functions: convolutional layers perform feature extraction, pooling layers reduce dimensionality, and fully connected layers handle classification or regression. In our setting, important body parts are segmented and identified using Mask R-CNN, which also detects notable posture irregularities that may be signs of a fall. Conversely, the LSTM (Long Short-Term Memory) network, a type of RNN, functions akin to a feature pooling network but operates at the frame level, enabling it to integrate information over time much as a CNN integrates information over space. Through parameters shared across time, both architectures maintain a constant parameter count while capturing the global temporal evolution of a video [12]. The Autoencoder-LSTM analyzes key body components and records movement patterns, identifying significant temporal and postural abnormalities that could be signs of a fall. The Mask R-CNN-Autoencoder-LSTM architecture is, in theory, highly effective for fall detection due to its capacity to capture both temporal and spatial features.

3. Proposed Methodology

This paper presents an extensive analysis of Mask R-CNN, Autoencoder-LSTM, and a combination of both for fall detection. The input data consist of approximately 60 videos of falls and non-falls, which are transformed into individual frames for processing. To enhance the generalization capability of the deep learning models, the frames are rescaled using Keras' ImageDataGenerator. The fall detection process uses Mask R-CNN, Autoencoder-LSTM, and a hybrid Mask R-CNN-Autoencoder-LSTM model. Each model analyzes the input frames, identifying key body segments and temporal movement patterns to distinguish between falls and non-falls. Through this comparative analysis, the effectiveness of each model in fall detection is evaluated.

3.1. Dataset Description

This research employs two distinct datasets for fall detection: the UR Fall Detection dataset [4] and the Multiple Cameras Fall dataset [13]. The UR Fall Detection dataset consists of 70 sequences: 30 fall sequences and 40 sequences of activities of daily living. The Multiple Cameras Fall dataset contains 24 falling examples recorded with 8 IP video cameras. In addition, we include five original videos created to demonstrate falls and non-falls, and we randomly selected around 60 videos from these datasets that contain classifications of falls and non-falls. By merging these three sources, we established a comprehensive and diverse data pool for thorough analysis. This combined dataset not only increases the sample size but also enhances the diversity of the data, allowing a more comprehensive examination of fall detection, and it helps validate our findings across different sources, leading to a more robust analysis. Ultimately, this approach enhances the reliability of the study's conclusions, providing a stronger foundation for the effectiveness of the proposed fall detection methodology.

3.2. Feature Selection

Our dataset contains videos with a frame width and height of 640 × 240 pixels. Image analysis, a popular topic in computer vision, involves the extraction of important information from videos. To reduce computational cost and remove noise, we rescale each frame to a 64 × 64 pixel resolution. We extracted 15,233 features for the fall class and 10,540 for the non-fall class; however, the maximum number of images per class was capped at 10,000. This limit reflects a conscious decision to control the dataset's size and maintain computational feasibility while preserving an equitable representation of both classes and managing potential resource limitations.
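As an illustration of this preprocessing step, the following is a minimal sketch of frame extraction and rescaling, assuming OpenCV for video decoding and Keras' ImageDataGenerator for pixel normalization; the file name and batch size are hypothetical, not values taken from the paper.

```python
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def video_to_frames(path, size=(64, 64)):
    """Decode a video and rescale every frame to 64x64 RGB."""
    cap, frames = cv2.VideoCapture(path), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, size))
    cap.release()
    return np.array(frames, dtype="float32")

# Scale pixel values to [0, 1] via ImageDataGenerator, as used in the paper.
datagen = ImageDataGenerator(rescale=1.0 / 255)
frames = video_to_frames("fall-01.mp4")  # hypothetical file name
batches = datagen.flow(frames, batch_size=32, shuffle=False)
```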
3.3. Mask R-CNN

Our experiments use Mask R-CNN, a deep neural network that employs convolutional layers to extract essential features from images and segment body parts relevant to fall detection. The architecture of Mask R-CNN is designed not only to detect objects but also to provide pixel-level segmentation. It utilizes a region proposal network (RPN) to generate potential object regions, followed by a mask prediction network that generates a segmentation mask for each detected object.

The Mask R-CNN model consists of several key components. Initially, it uses a backbone network (such as ResNet or VGG) for feature extraction; this backbone comprises convolutional layers that adapt their weights and biases through learning to identify important features in the input images. The convolutional layers are followed by a Region of Interest (RoI) pooling layer that extracts fixed-size features from each region proposal. Subsequently, the network uses separate branches: one for object classification, one for bounding-box regression, and one for predicting segmentation masks.

In our experiments, we utilize a Mask R-CNN model with a ResNet-50 backbone. The model is trained to identify human body parts and detect significant posture changes indicative of falls. The segmentation masks generated by Mask R-CNN help to precisely locate and delineate body parts, improving the model's ability to recognize fall events. To enhance feature extraction, we incorporate a series of convolutional layers that detect key spatial patterns, followed by the RPN and RoIAlign layers for precise object localization. Finally, the model outputs a set of classes ("fall" and "non-fall") along with segmentation masks that outline body positions and movements. This design enables Mask R-CNN not only to detect falls but also to segment the affected areas of the body, enhancing fall detection accuracy for elderly individuals and supporting real-time safety monitoring in video surveillance scenarios. An overview of the network's configuration can be found in Table 1.

Table 1: Mask R-CNN Model Architecture

Layer Type             | Input Shape       | Kernel Size | No. of Filters | Strides | Output Size
Backbone (ResNet/FPN)  | (H, W, 3)         | Varies      | Varies         | Varies  | (H/32, W/32, N)
RPN Conv2D             | (H/32, W/32, N)   | (3, 3)      | 512            | (1, 1)  | (H/32, W/32, 18)
RoI Align              | (H/32, W/32, 512) | –           | –              | –       | (14, 14, 256)
Mask Conv2D            | (14, 14, 256)     | (3, 3)      | 256            | (1, 1)  | (14, 14, 1)
Dense (Classification) | (8192)            | –           | –              | –       | (2)
Dense (Bounding Box)   | (8192)            | –           | –              | –       | (4)
Mask Output            | (64, 64, 1)       | –           | –              | –       | (64, 64, 1)
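For orientation, the following is a simplified Keras sketch of the spatial branch, assuming a pretrained ResNet-50 backbone on 64 × 64 inputs with classification and bounding-box heads. The RPN, RoIAlign, and mask head from Table 1 are omitted for brevity, so this is a stand-in illustration rather than a full Mask R-CNN implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_spatial_branch(input_shape=(64, 64, 3)):
    # ResNet-50 backbone extracts convolutional features from each frame.
    backbone = ResNet50(include_top=False, weights="imagenet",
                        input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(backbone.output)
    x = layers.Dense(256, activation="relu")(x)
    # Two heads, mirroring the dense branches in Table 1.
    class_out = layers.Dense(2, activation="softmax", name="cls")(x)  # fall / non-fall
    bbox_out = layers.Dense(4, activation="linear", name="bbox")(x)   # box regression
    return models.Model(backbone.input, [class_out, bbox_out])

model = build_spatial_branch()
model.summary()
```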
3.4. Autoencoder LSTM

Autoencoder-LSTM networks combine Autoencoders and Long Short-Term Memory (LSTM) networks, and are designed to learn both spatial features and temporal dependencies in sequential data. The autoencoder component captures the key features of the input data by learning an efficient, compressed representation, while the LSTM models the temporal dynamics and dependencies between frames in a sequence.

The Autoencoder-LSTM architecture includes two main components: the encoder-decoder structure of the autoencoder, and the LSTM layers that capture temporal relationships between frames. The autoencoder encodes the input data into a lower-dimensional latent space and then decodes it back to the original input; this process extracts meaningful features while reducing dimensionality, which is particularly beneficial for fall detection in video sequences. The encoder compresses the input into a latent representation, typically using convolutional or fully connected layers, and the decoder reconstructs the input from this latent representation. The LSTM layers are then applied to the encoded features to capture the sequential nature of the video frames.

An LSTM cell maintains two states: the new cell state, or new long-term memory, $c(T)$, and the new hidden state, or new short-term memory, $h(T)$. Here is a brief examination of each gate.

1. Forget Gate: decides what information should be eliminated from the cell state. It applies the sigmoid activation function to the previous hidden state $h(T-1)$ and the current input $x(T)$, producing a value between 0 and 1, with 0 meaning forget and 1 meaning retain:

$f(T) = \sigma(W_{fh} \cdot h(T-1) + W_{fx} \cdot x(T) + b_f)$  (1)

2. Input Gate: updates the new long-term memory state. The sigmoid activation function is used in the input gate, while the tanh activation function is used in the input node:

$i(T) = \sigma(W_{ih} \cdot h(T-1) + W_{ix} \cdot x(T) + b_i)$  (2)

$g(T) = \tanh(W_{gh} \cdot h(T-1) + W_{gx} \cdot x(T) + b_g)$  (3)

3. Output Gate: selects which information to pass on:

$o(T) = \sigma(W_{oh} \cdot h(T-1) + W_{ox} \cdot x(T) + b_o)$  (4)

The new cell state (long-term memory) $c(T)$ is updated by combining the forget gate and the input gate. The LSTM cell state and hidden state equations take the following forms:

$c(T) = f(T) \otimes c(T-1) + i(T) \otimes g(T)$  (5)

$y(T) = h(T) = o(T) \otimes \tanh(c(T))$  (6)

To enhance the generalization capability of the Autoencoder-LSTM model and prevent overfitting, dropout layers are incorporated into the architecture; these randomly drop connections during training, helping the model generalize to unseen data. Finally, a dense layer outputs the final classification, distinguishing "fall" from "non-fall" instances. Table 2 shows the Autoencoder-LSTM model architecture.

Table 2: Autoencoder LSTM Model Architecture

Layer (Type)           | Output Size      | Parameters
lstm (Encoder)         | (None, 192, 128) | 98,816
dropout (Encoder)      | (None, 192, 128) | 0
lstm1 (Encoder)        | (None, 128)      | 131,584
repeatvector (Encoder) | (None, 192, 128) | 0
lstm2 (Decoder)        | (None, 192, 128) | 131,584
dropout1 (Decoder)     | (None, 192, 128) | 0
lstm3 (Decoder)        | (None, 128)      | 131,584
dense (Decoder)        | (None, 64)       | 8,256
dropout2 (Decoder)     | (None, 64)       | 0
dense1 (Decoder)       | (None, 2)        | 130
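As a concrete illustration, the following minimal Keras sketch reproduces the layer stack of Table 2. The input shape of 192 timesteps × 64 features is an assumption inferred from the table's output sizes and parameter counts (98,816 parameters for the first LSTM implies a 64-dimensional input), and the dropout rate is hypothetical.

```python
from tensorflow.keras import layers, models

def build_autoencoder_lstm(timesteps=192, features=64):
    """Layer stack mirroring Table 2; input shape is inferred, not stated."""
    return models.Sequential([
        layers.Input((timesteps, features)),
        layers.LSTM(128, return_sequences=True),   # encoder
        layers.Dropout(0.2),
        layers.LSTM(128),                          # latent representation
        layers.RepeatVector(timesteps),
        layers.LSTM(128, return_sequences=True),   # decoder
        layers.Dropout(0.2),
        layers.LSTM(128),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(2, activation="softmax"),     # fall / non-fall
    ])

model = build_autoencoder_lstm()
model.summary()  # parameter counts match Table 2 (98,816; 131,584; ...)
```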
3.5. Hybrid

In this research study, a hybrid Mask R-CNN-Autoencoder-LSTM algorithm was developed to identify falls in elderly individuals. The images are first analyzed by Mask R-CNN, which performs semantic segmentation and extracts key body parts from the input images. These features are then processed by the Autoencoder-LSTM network to capture temporal dependencies across the sequence of frames.

The hybrid network architecture comprises several key components: the Mask R-CNN model for spatial feature extraction and segmentation, followed by the Autoencoder-LSTM model for analyzing the temporal sequence of frames. Mask R-CNN begins with a backbone network (such as ResNet) to extract features, followed by region proposal and segmentation-mask prediction, which isolates relevant body parts and postures. The Autoencoder-LSTM component then processes these segmented features, leveraging the encoder-decoder structure of the autoencoder to compress and reconstruct spatial features, while the LSTM captures the temporal relationships in the sequence, making the combination suitable for fall detection across video sequences. To enable the Autoencoder-LSTM to process the extracted features effectively, we place Time-Distributed layers before the LSTM, allowing it to handle convolved image features over time (see the sketch following the layer listing below).

While the hybrid approach aims to combine the strengths of Mask R-CNN for spatial feature extraction and Autoencoder-LSTM for temporal sequence modeling, our findings reveal that it did not perform as well as anticipated. The overall accuracy was lower than expected, suggesting that the individual Mask R-CNN and Autoencoder-LSTM models might be more effective for this task. This underscores the challenges of combining the two architectures and the need for further optimization to fully leverage their complementary strengths. The layer configuration of the hybrid model is as follows:

Layer (type)             | Output Shape
time-distributed-layer1  | (None, 1, 14, 14, 64)
time-distributed-layer2  | (None, 1, 7, 7, 64)
time-distributed-layer3  | (None, 1, 4, 4, 128)
time-distributed-layer4  | (None, 1, 1, 1, 128)
time-distributed-layer5  | (None, 1, 1, 1, 256)
time-distributed-layer6  | (None, 1, 1, 1, 256)
time-distributed-layer7  | (None, 1, 1, 1, 512)
time-distributed-layer8  | (None, 1, 1, 1, 512)
time-distributed-layer9  | (None, 1, 512)
time-distributed-layer10 | (None, 1, 256)
time-distributed-layer11 | (None, 1, 1024)
LSTM (Encoder)           | (None, 1024)
Dense (Decoder)          | (None, 2)
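The following is a minimal sketch of this TimeDistributed-CNN-plus-LSTM arrangement, assuming a plain convolutional frame encoder as a stand-in for the Mask R-CNN branch. The layer widths loosely follow the listing above but are not the authors' exact configuration, and the frame size is hypothetical.

```python
from tensorflow.keras import layers, models

def build_hybrid(seq_len=1, h=28, w=28, c=3):
    # Per-frame encoder standing in for the Mask R-CNN feature extractor.
    frame_cnn = models.Sequential([
        layers.Input((h, w, c)),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),                      # -> (14, 14, 64)
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),                      # -> (7, 7, 128)
        layers.GlobalAveragePooling2D(),
        layers.Dense(1024, activation="relu"),      # -> 1024-d frame feature
    ])
    return models.Sequential([
        layers.Input((seq_len, h, w, c)),
        layers.TimeDistributed(frame_cnn),          # apply CNN to every frame
        layers.LSTM(1024),                          # temporal encoder
        layers.Dense(2, activation="softmax"),      # fall / non-fall
    ])

model = build_hybrid()
model.summary()
```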
4. Results & Analysis

In this section, we evaluate the effectiveness of the three methods on our dataset, which consists of 70 sequences comprising 30 fall sequences and 40 sequences of activities of daily living. Throughout these tests, the training and testing sets were built from this data together with the fall and non-fall examples from the Multiple Cameras Fall dataset. The outcome of our work is measured using three metrics: accuracy, sensitivity, and specificity.

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$  (7)

$\text{Sensitivity} = \frac{TP}{TP + FN}$  (8)

$\text{Specificity} = \frac{TN}{TN + FP}$  (9)

True Positives (TP) and True Negatives (TN) are the correctly identified falls and non-falls, respectively, while False Positives (FP) and False Negatives (FN) are the misclassified non-fall and fall instances. Sensitivity (SE) measures the ability to detect a fall event, whereas specificity (SP) measures the ability to correctly recognize non-fall activity. There is an imbalance between the fall and non-fall data used in this study. To take advantage of GPUs, every experiment was carried out in Python 3.11.0, using scikit-learn, Keras, and TensorFlow for the machine learning methods and CNNs.

In the experimental configuration, the data were divided into training and testing sets in two proportions: one split dedicated 90% of the data to training the model and the remaining 10% to testing, while a second split assigned 60% to training and the remaining 40% to an independent testing set. This approach facilitated a thorough evaluation of the models' ability to generalize to unseen data. Figure 1 and Figure 2 show the loss and accuracy on training and validation data for our CNN-based architecture.

Figure 1: Training Loss vs Validation Loss
Figure 2: Training Accuracy vs Validation Accuracy

Across all three experiments, training used batch sizes of 128 and 256 and ran for 5 or 10 epochs, chosen for optimal performance. The results, summarized in Table 3, illustrate the overall accuracy achieved by each model: both the Mask R-CNN and Autoencoder-LSTM architectures reached nearly 98% accuracy, indicating their strong performance in detecting falls. In contrast, the combined model achieved an accuracy of 93%. This performance disparity highlights the strengths of the individual models: Mask R-CNN effectively captures spatial features, while the Autoencoder-LSTM excels at modeling temporal dynamics. Despite the hybrid model's potential for integrating both spatial and temporal information, it falls short of the accuracy levels reached by the standalone models. This suggests that, for this particular application, leveraging the strengths of Mask R-CNN and Autoencoder-LSTM separately may yield better fall detection results for elderly individuals, as shown in Table 3.

Table 3: Comparison of Mask R-CNN, Autoencoder LSTM, and Hybrid Model Accuracy

No | Train Features | Test Features | Epochs | Batch | Mask R-CNN | LSTM  | Hybrid
1  | 9000           | 1000          | 5      | 128   | 0.971      | 0.982 | 0.938
2  | 9000           | 1000          | 5      | 256   | 0.986      | 0.987 | 0.943
3  | 9000           | 1000          | 10     | 128   | 0.985      | 0.976 | 0.957
4  | 6000           | 4000          | 5      | 256   | 0.995      | 0.997 | 0.948
5  | 6000           | 4000          | 10     | 256   | 0.986      | 0.983 | 0.936
6  | 6000           | 4000          | 5      | 128   | 0.989      | 0.993 | 0.954
7  | 6000           | 4000          | 10     | 128   | 0.994      | 0.989 | 0.941
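For reproducibility, the sketch below illustrates the evaluation protocol: a stratified 90/10 split, training for 5 epochs with batch size 128 (row 1 of Table 3), and the metrics of Eqs. (7)-(9) computed on the held-out set. The dummy data shapes, integer labels, and the reuse of build_autoencoder_lstm from the Section 3.4 sketch are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def detection_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity per Eqs. (7)-(9); 1 = fall, 0 = non-fall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Dummy stand-ins for the extracted frame features and labels (hypothetical shapes).
X = np.random.rand(1000, 192, 64).astype("float32")
y = np.random.randint(0, 2, size=1000)

# 90/10 split; the second protocol in the paper uses test_size=0.40.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

model = build_autoencoder_lstm()  # constructor from the Section 3.4 sketch
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=5, batch_size=128, validation_data=(X_te, y_te))

y_pred = model.predict(X_te).argmax(axis=1)
print(detection_metrics(y_te, y_pred))
```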
5. Conclusion & Future Direction

This paper presents a novel approach for automatic fall detection in video monitoring scenarios, employing Mask R-CNN, Autoencoder-LSTM, and a hybrid fusion of both algorithms. Each frame in the video is analyzed directly to segment and identify body parts and to detect temporal anomalies associated with human movement, and we conduct a comparative analysis of the accuracy of the three approaches. Our experimental results demonstrate that Mask R-CNN and Autoencoder-LSTM individually show promising potential for real-time fall detection in video streams, making them valuable for video monitoring systems, especially in contexts such as elderly care. In the future, live footage from CCTV cameras could be examined to validate our experimental findings. It is worth noting that the images used in our study predominantly originate from three sources with similar backgrounds, which could influence our approach; consequently, future work will involve training the two algorithms on images captured from various perspectives to assess their performance under different conditions.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools.

References

[1] S. Deandrea, E. Lucenteforte, F. Bravi, R. Foschi, C. La Vecchia, E. Negri, Risk factors for falls in community-dwelling older people: a systematic review and meta-analysis, Epidemiology 21 (2010) 658–668.
[2] A. Jalal, S. Kamal, D. Kim, A depth video sensor-based life-logging human activity recognition system for elderly care in smart indoor environments, Sensors 14 (2014) 11735–11759.
[3] T. Tamura, T. Yoshimura, M. Sekine, M. Uchida, O. Tanaka, A wearable airbag to prevent fall injuries, IEEE Transactions on Information Technology in Biomedicine 13 (2009) 910–914.
[4] B. Kwolek, M. Kepski, Human fall detection on embedded platform using depth maps and wireless accelerometer, Computer Methods and Programs in Biomedicine 117 (2014) 489–501.
[5] S. Münzner, P. Schmidt, A. Reiss, M. Hanselmann, R. Stiefelhagen, R. Dürichen, CNN-based sensor fusion techniques for multimodal human activity recognition, in: Proceedings of the ACM International Symposium on Wearable Computers (ISWC), 2017.
[6] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, G. Toderici, Beyond short snippets: Deep networks for video classification, in: Computer Vision and Pattern Recognition (CVPR), 2015.
[7] B. Jokanovic, M. Amin, F. Ahmad, Radar fall motion detection using deep learning, in: IEEE Radar Conference, 2016, pp. 1–6.
[8] Y. Chen, W. Li, L. Wang, J. Hu, M. Ye, Vision-based fall event detection in complex background using attention guided bi-directional LSTM, IEEE Access 8 (2020) 161337–161348.
[9] Y. Chen, W. Li, L. Wang, J. Hu, M. Ye, Vision-based fall event detection in complex background using attention guided bi-directional LSTM, IEEE Access 8 (2020) 161337–161348.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[11] N. Lu, Y. Wu, L. Feng, J. Song, Deep learning for fall detection: 3D-CNN combined with LSTM on video kinematic data, IEEE Journal of Biomedical and Health Informatics (2018).
[12] C. Ismael, O. María, E. Buemi, J. J. Berlles, CNN-LSTM architecture for action recognition in videos, in: SAIV, Simposio Argentino de Imágenes y Visión, 2020.
[13] E. Auvinet, C. Rougier, J. Meunier, A. St-Arnaud, Multiple cameras fall dataset, Technical Report 1350, DIRO, Université de Montréal, 2010.