<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhanced Vision-Based Human Fall Detection with Mask-RCNN and Autoencoder-LSTM Hybrid Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sritama Chakraborty</string-name>
          <email>sritama.sc@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mili Ghosh</string-name>
          <email>ghosh.mili90@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science &amp; Technology, University of North Bengal</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>28</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>Elderly individuals who reside independently face a heightened risk of serious harm from accidental falls, a leading contributor to mortality in this demographic, and fall detection is therefore a critical part of health care for older adults. This paper introduces a methodology for identifying falls among elderly individuals using machine learning algorithms. We propose Mask R-CNN, Autoencoder-LSTM, and a hybrid Mask R-CNN-Autoencoder-LSTM framework to detect falls in a video surveillance environment. At their core, these algorithms are trained to recognize deformations in human body shapes and postures, enabling them to flag potential fall occurrences. Our findings suggest that the Mask R-CNN and Autoencoder-LSTM models perform well independently, with Mask R-CNN excelling in spatial feature extraction and the Autoencoder-LSTM effectively modeling the temporal changes in body movements. However, the hybrid Mask R-CNN-Autoencoder-LSTM approach does not fully capitalize on the strengths of both models, leading to slightly lower performance than the individual models. Despite this, both Mask R-CNN and Autoencoder-LSTM present a reliable solution for fall detection, with the hybrid framework offering potential improvements for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>Human Fall Detection</kwd>
        <kwd>Mask R-CNN</kwd>
        <kwd>Autoencoder LSTM</kwd>
        <kwd>Hybrid Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent times, the share of older adults in the population has risen markedly, mirroring demographic
changes and extended life expectancies. In the elderly healthcare sector, falls are a critical health issue.
Accidental deaths among older adults are most often caused by falls [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. According to the census, around
6.8 percent of the population in 2021 was over 65 years old. According to Elderly in India 2021 from
the National Statistical Office (NSO), India’s older adult population is expected to reach 194 million by
2031. This reflects a substantial increase of 41 percent compared with the figures from the preceding
decade, indicating a notable growth trend over time. Given the growing population of elderly
individuals living independently, it becomes imperative for both governmental and private entities to
devise efficient intelligent surveillance systems capable of identifying potential fall risks.
      </p>
      <p>
        Due to the increasing number of people seeking security and healthcare services, the need for fall
detection has accelerated. Using depth sensors, skeleton joint data, and Hidden Markov Models, the research in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
introduces a depth-based life-logging system for senior activity detection that shows promise
for tracking daily activities and supporting medical treatment. Many research projects have been conducted to
create efficient fall detection algorithms. Even though fall incidents cannot be entirely prevented, the
ability to precisely identify a fall incident and issue an emergency notice can save lives. A monitoring
system should distinguish fall events from normal activity. Various methods have been proposed to
identify falls in real time using wearable devices such as accelerometers, gyroscopes, and magnetometers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
However, these methods may be less effective, since it is impractical to wear such devices for a long
time [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Furthermore, different machine learning techniques can detect various types of falls. The adaptability
of machine learning techniques in tackling various contexts and categories of falls has contributed
significantly to their widespread adoption and popularity. These include Convolutional Neural
Networks (CNN), random forests, multi-layer perceptrons [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], statistical models such as the Hidden Markov
Model (HMM), non-parametric supervised learners such as K-Nearest Neighbours (KNN),
Support Vector Machines (SVM), and decision trees. Using deep convolutional neural networks to classify images and videos has
shown outstanding results in computer vision and machine learning [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. CNNs that learn features
from training data provide a workable automatic feature extraction approach for images.
      </p>
      <p>
        We utilize a class of neural networks known as long short-term memory (LSTM) networks, which allow
information to be stored over extended durations and are remarkably effective at comprehending and
forecasting patterns in sequential data such as time series, text, speech, and images. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the authors combined visual attention-guided Bidirectional LSTM fall detection with a Mask R-CNN
to address the challenges posed by complex background environments. This model integrates spatial
and temporal information within intricate scenes to effectively tackle the fall detection problem. Our
final proposal is a hybrid Mask R-CNN-Autoencoder-LSTM approach, in which the features of a video
are extracted using a convolutional neural network while the video is categorized by an LSTM neural network [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the researchers
devised a fusion of a 3D CNN and an LSTM-based visual attention mechanism to capture spatiotemporal
features from video sequences related to fall detection, effectively encoding motion information.
      </p>
      <p>Our study aims to implement a video action recognition system that detects human falls.
Accordingly, we apply Mask R-CNN and Autoencoder-LSTM to each frame of a camera video to obtain
human shape distortion features for fall detection. The remainder of this paper is structured as follows.
Section 2 presents the background study; Section 3 describes the datasets and proposes fall detection
methods using Mask R-CNN, Autoencoder-LSTM, and a hybrid Mask R-CNN-Autoencoder-LSTM; and
Section 4 presents a comparative analysis of the three methods, together with conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background Study</title>
      <p>
        Deep neural networks have garnered increasing popularity with advancements in technology. Deep
learning models are trained on vast amounts of data, allowing them to learn complex patterns and
relationships within the data. This enables a system to make accurate predictions, classifications, or
decisions in domains such as image recognition, natural language processing, and autonomous
systems. Within the spectrum of deep learning algorithms, convolutional neural networks (CNNs) have
established themselves as the foremost choice in computer vision. In a CNN architecture,
different layers serve specific functions: convolutional layers for feature extraction, pooling
layers for dimensionality reduction, and fully connected layers for classification or regression tasks.
Mask R-CNN segments and identifies important body parts and detects notable
posture irregularities that may be signs of a fall. Conversely, a type of RNN known as the LSTM (Long
Short-Term Memory) network functions akin to a feature pooling network operating at the frame level,
enabling it to integrate information over time, much like a CNN. Through the utilization of parameters
shared across time, both architectures can maintain a constant parameter count while capturing the global
temporal evolution of a video [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The Autoencoder-LSTM analyzes key body components and records
movement patterns, identifying significant temporal and postural abnormalities that
could be signs of a fall. The Mask R-CNN-Autoencoder-LSTM architecture is, in theory, highly effective
for fall detection due to its capacity to capture both temporal and spatial features.
      </p>
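      <p>As a minimal illustration of these layer roles, a small Keras CNN of this kind can be sketched as follows; the framework choice, layer sizes, and activations here are illustrative assumptions, not the exact configuration used in this paper:</p>
      <preformat>
# Minimal CNN sketch: convolution for feature extraction, pooling for
# dimensionality reduction, dense layers for classification.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),               # one 64 x 64 RGB frame
    layers.Conv2D(32, (3, 3), activation="relu"),  # feature extraction
    layers.MaxPooling2D((2, 2)),                   # dimensionality reduction
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # classification head
    layers.Dense(2, activation="softmax"),         # fall / non-fall
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
      </preformat>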
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
      <p>This paper presents an extensive analysis of Mask R-CNN, Autoencoder-LSTM, and a combination of
both for fall detection. The input data consist of approximately 60 videos of falls and non-falls, which
are transformed into individual frames for processing. To enhance the generalization capability of
the deep learning models, the frames are rescaled using the Keras ImageDataGenerator utility. The
fall detection process involves Mask R-CNN, Autoencoder-LSTM, and a hybrid Mask
R-CNN-Autoencoder-LSTM model.</p>
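      <p>A minimal sketch of this preprocessing step is given below, assuming the extracted frames are stored in class-named folders; the directory layout, image format, and batch size are assumptions:</p>
      <preformat>
# Rescale pixel values and resize frames with Keras' ImageDataGenerator.
# Hypothetical layout: frames/fall/*.png and frames/nonfall/*.png
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)   # scale pixels to [0, 1]
train_gen = datagen.flow_from_directory(
    "frames",              # assumed root folder of extracted frames
    target_size=(64, 64),  # resolution used in Section 3.2
    class_mode="binary",   # fall vs. non-fall
    batch_size=32,
)
      </preformat>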
      <p>In this approach, each model analyzes the input frames, identifying key body segments and
temporal movement patterns to distinguish between falls and non-falls. Through this comparative
analysis, the effectiveness of each model in fall detection is evaluated.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Description</title>
        <p>
          This research study employs two distinct datasets for fall detection purposes: the UR Fall Detection
dataset [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and the multiple cameras fall data set [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The UR Fall Detection dataset comprises 70 sequences:
30 fall sequences and 40 sequences of activities of daily living. The multiple cameras fall dataset
contains 24 falling examples recorded with 8 IP video cameras. In this research study, we utilize
videos from multiple sources, including five original videos created to demonstrate falls and non-falls.
Additionally, we randomly selected around 60 videos from the above datasets that contain classifications
of falls and non-falls. By merging these three sources, we established a comprehensive and diverse
data pool for thorough analysis.
        </p>
        <p>This combined dataset not only increases the sample size but also enhances the diversity of the
data, allowing for a more comprehensive examination of fall detection. Moreover, it helps validate our
findings across different sources, leading to a more robust analysis. Ultimately, this approach enhances
the reliability of the study’s conclusions, providing a stronger foundation for the effectiveness of the
proposed fall detection methodology.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Feature Selection</title>
        <p>Our dataset contains videos with a frame width and height of 640 × 240 pixels. Image analysis, a popular
topic in computer vision, involves the extraction of important information from images and videos.
To reduce computational cost and remove noise, we rescale each frame to a 64 × 64 pixel resolution. We
extracted 15,233 frames for the fall class and 10,540 for the non-fall class. However, the maximum number
of images per class has been set at 10,000. This limitation reflects a conscious decision to control the
dataset’s size and maintain computational feasibility, all while preserving the equitable representation
of both classes. By imposing a limit of 10,000 images per class, the study upholds an equitable approach
to feature selection while also effectively managing potential resource limitations.</p>
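        <p>A short sketch of this per-class cap is shown below; the folder paths and file extension are hypothetical:</p>
        <preformat>
# Collect frame paths per class and truncate each class at 10,000 images.
from pathlib import Path

MAX_PER_CLASS = 10_000

def capped_frames(class_dir):
    files = sorted(Path(class_dir).glob("*.png"))
    return files[:MAX_PER_CLASS]   # keep at most 10,000 frames

fall_frames = capped_frames("frames/fall")        # 15,233 available
nonfall_frames = capped_frames("frames/nonfall")  # 10,540 available
print(len(fall_frames), len(nonfall_frames))
        </preformat>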
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Mask R-CNN</title>
        <p>Our experiments use Mask R-CNN, a deep neural network that employs convolutional layers to extract
essential features from images and segment body parts relevant to fall detection. The architecture of
Mask R-CNN is designed not only to detect objects but also to provide pixel-level segmentation. It utilizes
a region proposal network (RPN) to generate potential object regions, followed by a
mask prediction network that generates segmentation masks for each detected object. The Mask R-CNN
model consists of several key components. Initially, it uses a backbone network (such as ResNet or
VGG) for feature extraction. This backbone comprises convolutional layers that adapt their
weights and biases through learning to identify important features in input images. The convolutional
layers are followed by a Region of Interest (RoI) pooling layer that extracts fixed-size features from each
region proposal. Subsequently, the network uses separate branches: one for object classification, one for
bounding box regression, and one for predicting segmentation masks. In our experimentation, we utilize
a Mask R-CNN model with a ResNet-50 backbone. The model is trained to identify
human body parts and detect significant posture changes indicative of falls. The segmentation masks
generated by Mask R-CNN help to precisely locate and delineate body parts, improving the model’s
ability to recognize fall events. To enhance feature extraction, we incorporate a series of
convolutional layers to detect key spatial patterns, followed by the RPN and RoI Align layers for precise
object localization. Finally, the model outputs a set of classes (such as "fall" and "non-fall") along with
segmentation masks that outline body positions and movements. This design enables Mask R-CNN
not only to detect falls but also to segment the affected areas of the body, enhancing fall detection accuracy
for elderly individuals and supporting real-time safety monitoring in video surveillance scenarios. An
overview of our network’s configuration can be found in Table 1.</p>
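        <p>Our exact Mask R-CNN implementation is not reproduced here; as an illustration, a minimal person-segmentation sketch using torchvision's COCO-pretrained ResNet-50 Mask R-CNN (an assumption, not the code used in this study) could look as follows, with the fall/non-fall decision handled by a separate downstream classifier:</p>
        <preformat>
# Minimal Mask R-CNN inference sketch (assumes torchvision 0.13 or newer).
# COCO class 1 is "person"; masks give pixel-level body segmentation.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = to_tensor(Image.open("frame_0001.png").convert("RGB"))
with torch.no_grad():
    out = model([frame])[0]   # dict with boxes, labels, scores, masks

# Keep confident person detections (score above 0.5).
person = torch.logical_and(out["labels"].eq(1), out["scores"].gt(0.5))
masks = out["masks"][person]  # one soft mask per detected person
        </preformat>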
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Autoencoder LSTM</title>
        <p>Autoencoder-LSTM networks are a combination of autoencoders and Long Short-Term Memory (LSTM)
networks, designed to learn both spatial features and temporal dependencies in sequential data. The
autoencoder component captures the key features of the input data by learning an efficient representation
in a compressed form, while the LSTM part models the temporal dynamics and dependencies between
frames in a sequence.</p>
        <p>In the Autoencoder-LSTM model, the architecture includes two main components: the
encoder-decoder structure of the autoencoder and the LSTM layers that capture temporal relationships between
the frames. The autoencoder focuses on encoding the input data into a lower-dimensional latent space
and then decoding it back to the original input. This process helps to extract meaningful features
while reducing dimensionality, which is particularly beneficial for fall detection in video sequences.
The encoder part of the autoencoder compresses the input into a latent representation, typically using
convolutional layers or fully connected layers, and the decoder reconstructs the input data from this
latent representation. The LSTM layers are then applied to the encoded features to capture the sequential
nature of the video frames. Two states exist in the cell: the new cell state, or long-term memory, c(t),
and the new hidden state, or short-term memory, h(t). Here is a brief examination of each gate.</p>
        <p>1. Forget Gate: The forget gate decides what information should be eliminated from the cell state. It
uses the sigmoid activation function, which takes the previous hidden state h(t − 1) and the current
input x(t). The function outputs a value between 0 and 1, with 0 meaning forget and 1 meaning retain.</p>
        <p>f(t) = σ(W_fh · h(t − 1) + W_fx · x(t) + b_f)   (3)</p>
        <p>2. Primary Input Gate: The input gate updates the new long-term memory state. The sigmoid
activation function is contained in the input gate, while the tanh activation function is contained in the
input node.</p>
        <p>i(t) = σ(W_ih · h(t − 1) + W_ix · x(t) + b_i)   (4)</p>
        <p>g(t) = tanh(W_gh · h(t − 1) + W_gx · x(t) + b_g)   (5)</p>
        <p>3. Output Gate: The LSTM output gate is responsible for selecting which information to use further.</p>
        <p>o(t) = σ(W_oh · h(t − 1) + W_ox · x(t) + b_o)   (6)</p>
        <p>The new cell state, or long-term memory c(t), is updated by the forget gate and the primary
input gate. The LSTM cell state and hidden state equations take the following forms.</p>
        <p>c(t) = f(t) ⊗ c(t − 1) + i(t) ⊗ g(t)</p>
        <p>h(t) = o(t) ⊗ tanh(c(t))</p>
        <p>To enhance the generalization capability of the Autoencoder-LSTM model and prevent overfitting,
dropout layers are incorporated into the architecture. These layers randomly drop certain connections
during training, which helps the model generalize better to unseen data. Finally, the model includes
a dense layer that outputs the final classification, which is used to distinguish between "fall" and
"non-fall" instances. Table 2 shows the Autoencoder-LSTM model.</p>
        <p>Layer (Type)            Output Size
lstm (Encoder)          (None, 192, 128)
dropout (Encoder)       (None, 192, 128)
lstm1 (Encoder)         (None, 128)
repeatvector (Encoder)  (None, 192, 128)
lstm2 (Decoder)         (None, 192, 128)
dropout1 (Decoder)      (None, 192, 128)
lstm3 (Decoder)         (None, 128)
dense (Decoder)         (None, 64)
dropout2 (Decoder)      (None, 64)
dense1 (Decoder)        (None, 2)</p>
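        <p>As an illustration, the Table 2 stack can be sketched in Keras as follows; the sequence length (192) and layer widths are read from the table, while the per-timestep input feature size, dropout rates, and activations are assumptions:</p>
        <preformat>
# Keras sketch of the Autoencoder-LSTM in Table 2.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(192, 128)),           # 192 timesteps of features
    layers.LSTM(128, return_sequences=True),  # lstm (encoder)
    layers.Dropout(0.3),                      # dropout (encoder)
    layers.LSTM(128),                         # lstm1: latent code
    layers.RepeatVector(192),                 # repeatvector
    layers.LSTM(128, return_sequences=True),  # lstm2 (decoder)
    layers.Dropout(0.3),                      # dropout1 (decoder)
    layers.LSTM(128),                         # lstm3 (decoder)
    layers.Dense(64, activation="relu"),      # dense (decoder)
    layers.Dropout(0.3),                      # dropout2 (decoder)
    layers.Dense(2, activation="softmax"),    # dense1: fall / non-fall
])
        </preformat>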
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Hybrid</title>
        <p>In this research study, a hybrid Mask R-CNN-Autoencoder-LSTM algorithm was developed to identify
falls in elderly individuals. The images are first analyzed by Mask R-CNN, which performs semantic
segmentation and extracts key body parts from the input images. These features are then processed
by the Autoencoder-LSTM network to capture temporal dependencies in the sequence of frames.
The Mask R-CNN-Autoencoder-LSTM network architecture comprises several key components: the
Mask R-CNN model for spatial feature extraction and segmentation, followed by the
Autoencoder-LSTM model for analyzing the temporal sequence of frames. Mask R-CNN begins with a backbone
network (such as ResNet) to extract features, followed by region proposal and segmentation mask
prediction, which isolates relevant body parts and postures. The Autoencoder-LSTM component then
processes these segmented features, leveraging the encoder-decoder structure of the autoencoder to
compress and reconstruct spatial features, while the LSTM captures the temporal relationships in
the sequence. The Mask R-CNN performs feature extraction and body part segmentation, while the
Autoencoder-LSTM captures sequential patterns in the data, making the combination suitable for fall
detection across video sequences. To enable the Autoencoder-LSTM to effectively process the extracted
features, we place Time-Distributed layers before the LSTM, allowing it to handle the convolutional
image features over time. While the hybrid approach aims to combine the strengths of Mask R-CNN
for spatial feature extraction and Autoencoder-LSTM for temporal sequence modeling, our findings
revealed that it did not perform as well as anticipated. The overall accuracy was lower than expected,
suggesting that the individual Mask R-CNN and Autoencoder-LSTM models may be more effective for
this task. This underscores the challenge of combining the two architectures and the need for further
optimization to fully leverage their complementary strengths.</p>
        <p>Layer (Type)               Output Shape
time-distributed-layer1    (None, 1, 14, 14, 64)
time-distributed-layer2    (None, 1, 7, 7, 64)
time-distributed-layer3    (None, 1, 4, 4, 128)
time-distributed-layer4    (None, 1, 1, 1, 128)
time-distributed-layer5    (None, 1, 1, 1, 256)
time-distributed-layer6    (None, 1, 1, 1, 256)
time-distributed-layer7    (None, 1, 1, 1, 512)
time-distributed-layer8    (None, 1, 1, 1, 512)
time-distributed-layer9    (None, 1, 512)
time-distributed-layer10   (None, 1, 256)
time-distributed-layer11   (None, 1, 1024)
LSTM (Encoder)             (None, 1024)
Dense (Decoder)            (None, 2)</p>
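        <p>A minimal Keras sketch of this TimeDistributed-CNN-plus-LSTM pattern is given below; it follows the table above only loosely, and the input shape, filter sizes, and activations are assumptions:</p>
        <preformat>
# TimeDistributed wraps per-frame CNN layers so the LSTM sees a sequence.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(1, 64, 64, 3)),  # (timesteps, height, width, channels)
    layers.TimeDistributed(layers.Conv2D(64, (3, 3), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Conv2D(128, (3, 3), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Flatten()),   # per-frame feature vector
    layers.TimeDistributed(layers.Dense(1024, activation="relu")),
    layers.LSTM(1024),                      # encoder over the frame sequence
    layers.Dense(2, activation="softmax"),  # decoder: fall / non-fall
])
        </preformat>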
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results &amp; Analysis</title>
      <p>In this section, we evaluate the effectiveness of the three methods on our dataset, which consists of 70
sequences comprising 30 fall sequences and 40 sequences of activities of daily living. Throughout these
tests, we employed training and testing sets drawn from the combined fall and non-fall data described
in Section 3.1. The outcome of our work is measured using three metrics: accuracy, sensitivity, and
specificity.</p>
      <p>Accuracy = (TP + TN) / (TP + TN + FP + FN)   (7)</p>
      <p>Sensitivity = TP / (TP + FN)   (8)</p>
      <p>Specificity = TN / (TN + FP)   (9)</p>
      <p>True Positive (TP) and True Negative (TN) refer to the numbers of correctly identified falls and non-falls,
respectively. Instances of false positive (FP) and false negative (FN) identifications are documented in
both the fall and non-fall datasets. Sensitivity (SE) refers to the ability to detect a fall event, whereas
specificity (SP) refers to the ability to correctly recognize non-fall events. There is an imbalance between
the fall and non-fall data sets used for the study. To take advantage of GPUs, every experiment was
carried out in Python 3.11.0, using the scikit-learn, Keras, and TensorFlow machine learning libraries
for the CNNs.</p>
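      <p>For concreteness, the three metrics in equations (7)-(9) can be computed from a confusion matrix as sketched below; y_true and y_pred are placeholder labels and predictions, not our experimental data:</p>
      <preformat>
# Accuracy, sensitivity, and specificity from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = fall, 0 = non-fall (dummy data)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # ability to detect fall events
specificity = tn / (tn + fp)   # ability to recognize non-falls
print(accuracy, sensitivity, specificity)
      </preformat>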
      <p>In the experimental configuration, the datasets designated for training and testing underwent
division into distinct proportions. In particular, the dataset was divided by dedicating 90% of the data
to training the model and allocating the remaining 10% for testing. Furthermore, a separate division
was implemented, assigning 60% of the dataset to model training and preserving the remaining 40% as
an independent testing set. This approach facilitated a thorough evaluation of each model’s ability to
generalize to unseen data.</p>
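      <p>The two splits described above can be sketched with scikit-learn's train_test_split; X, y, and the stratification choice below are placeholders, not our exact pipeline:</p>
      <preformat>
# 90/10 and 60/40 train/test splits on placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 64 * 64)       # placeholder frame features
y = np.random.randint(0, 2, size=100)  # placeholder fall / non-fall labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)   # 90% / 10%
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=42)   # 60% / 40%
      </preformat>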
      <p>Figure 2 and Figure 3 show the loss and accuracy on the training and validation data using our
CNN-based architecture. Across the three experiments, the maximum number of training epochs was
set at 128 and 256, while 5 and 10 epochs were used in specific instances for optimal performance. The
results, summarized in Table 3, illustrate the overall accuracy achieved by each model: both the Mask
R-CNN and Autoencoder-LSTM architectures reached nearly 98% accuracy, indicating their strong
performance in detecting falls. In contrast, the combined model achieved an accuracy of 93%. This
performance disparity highlights the strengths of the individual models: the Mask R-CNN effectively
captures spatial features, while the Autoencoder-LSTM excels in understanding temporal dynamics.
Despite the hybrid model’s potential for integrating both spatial and temporal data, it falls short of
the accuracy levels reached by the standalone models. This suggests that, for this particular application,
leveraging the strengths of Mask R-CNN and Autoencoder-LSTM separately may yield better results in
fall detection for elderly individuals, as shown in Table 3.</p>
      <p>This paper presents a novel approach for automatic fall detection in video monitoring scenarios,
employing Mask R-CNN, Autoencoder-LSTM, and a hybrid fusion of both algorithms. Each frame in the
video is directly analyzed using these algorithms to segment and identify body parts and detect temporal
anomalies associated with human movement. We conduct a comparative analysis of the accuracy of the
three approaches. Our experimental results demonstrate that Mask R-CNN and Autoencoder-LSTM
individually show promising potential for real-time fall detection in video streaming, making them
valuable for video monitoring systems, especially in contexts like elderly care. In the future, live footage
from CCTV cameras could be examined to validate our experimental findings. It is worth noting that the
images used in our study predominantly originate from three sources with similar backgrounds, which
could influence our approach. Consequently, future work will involve training the two algorithms on
images captured from various perspectives to assess their performance under different conditions.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lucenteforte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Foschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Vecchia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <article-title>Risk factors for falls in community-dwelling older people: A systematic review and meta-analysis</article-title>
          ,
          <source>Epidemiology</source>
          <volume>21</volume>
          (
          <year>2010</year>
          )
          <fpage>658</fpage>
          -
          <lpage>668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kamal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>A depth video sensor- based life-logging human activity recognition system for elderly care in smart indoor environments</article-title>
          ,
          <source>Sensors</source>
          <volume>14</volume>
          (
          <year>2014</year>
          )
          <fpage>11735</fpage>
          -
          <lpage>11759</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yoshimura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sekine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uchida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tanaka</surname>
          </string-name>
          ,
          <article-title>A wearable airbag to prevent fall injuries</article-title>
          ,
          <source>IEEE Trans. Inf. Technol. Biomed</source>
          <volume>13</volume>
          (
          <year>2009</year>
          )
          <fpage>910</fpage>
          -
          <lpage>914</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kwolek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kepski</surname>
          </string-name>
          ,
          <article-title>Human fall detection on embedded platform using depth maps and wireless accelerometer</article-title>
          ,
          <source>Computer Methods and Programs in Biomedicine</source>
          <volume>117</volume>
          (
          <year>2014</year>
          )
          <fpage>489</fpage>
          -
          <lpage>501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Münzner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hanselmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stiefelhagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dürichen</surname>
          </string-name>
          ,
          <article-title>Cnn based sensor fusion techniques for multimodal human activity recognition</article-title>
          ,
          <source>ACM International Symposium on Wearable Computers - ISWC</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. Y.-H.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausknecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Monga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <article-title>Beyond short snippets: Deep networks for video classification</article-title>
          ,
          <source>Computer Vision and Pattern Recognition</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jokanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Amin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <article-title>Radar fall motion detection using deep learning</article-title>
          ,
          <source>Radar Conference</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <article-title>Vision-based fall event detection in complex background using attention guided bi-directional lstm</article-title>
          ,
          <source>IEEE Access 8</source>
          (
          <year>2020</year>
          )
          <fpage>161337</fpage>
          -
          <lpage>161348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <article-title>Vision-based fall event detection in complex background using attention guided bi-directional lstm</article-title>
          ,
          <source>IEEE Access 8</source>
          (
          <year>2020</year>
          )
          <fpage>161337</fpage>
          -
          <lpage>161348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>Computer Vision and Pattern Recognition</source>
          (
          <year>2016</year>
          )
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Deep learning for fall detection: 3d-cnn combined with lstm on video kinematic data</article-title>
          ,
          <source>IEEE Journal of Biomedical and Health Informatics</source>
          (
          <year>2018</year>
          )
          <fpage>2168</fpage>
          -
          <lpage>2194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C. I.</given-names>
            <surname>Orozco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Buemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Berlles</surname>
          </string-name>
          ,
          <article-title>Cnn-lstm architecture for action recognition in videos</article-title>
          ,
          <source>SAIV, Simposio Argentino de Imágenes y Visión</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Auvinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rougier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Meunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>St-Arnaud</surname>
          </string-name>
          ,
          <article-title>Multiple cameras fall dataset</article-title>
          ,
          <source>Technical report 1350</source>
          DIRO Université de Montréal (
          <year>July 2010</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>