<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Convolutional Long Short-Term Memory Model for Recognizing Postures from Wearable Sensor</article-title>
      </title-group>
      <contrib-group>
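        <contrib contrib-type="author">
          <string-name>Junqi Zhao</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esther Obonyo</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pennsylvania State University</institution>
          ,
          <country country="US">USA</country>
        </aff>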
      </contrib-group>
      <abstract>
        <p>This research investigates the feasibility of applying Deep Neural Networks (DNN) to improve posture recognition from multi-channel motion data captured by Wearable Sensors (WS). The authors use the recognition of postures that can be linked to the risk of Musculoskeletal Disorder (MSD) among construction workers as the testbed. The proposed approach uses a DNN model that integrates a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to achieve automated feature engineering and sequential pattern detection. The model performance was evaluated using datasets collected from four construction workers. The proposed model outperformed baseline CNN and LSTM models. Under the personalized modelling approach, it improved recognition performance by 3% over the benchmark Machine Learning models; the improvement was 2% under the generalized modelling approach. The proposed model achieves high-performance posture recognition, which facilitates MSD prevention in construction through the monitoring of injury-related postures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        the recognition of temporal patterns
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Plötz and Guan 2018)</xref>
        . It has been demonstrated that the
use of a deep hybrid model that integrates the CNN and RNN can result in improved
performance with respect to detecting workers’ activities from videos
        <xref ref-type="bibr" rid="ref3">(Ding et al. 2018)</xref>
        . In the
subsequent sections, this research investigates the feasibility of adapting and replicating this
approach to posture recognition that can be linked to the risk of MSD among construction
workers. The remainder of the paper is organized as follows. Section 2 briefly reviews the
research background. Section 3 describes the proposed DNN model configuration and model
test. The test results and discussion are reported in Section 4. The conclusion and further work
are summarized in Section 5.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical Background</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Posture-based MSD Assessment and Prevention</title>
      <p>
        Epidemiological studies have established that physical factors pose the highest risks for MSD
        <xref ref-type="bibr" rid="ref14">(Nunes and Bush 2012)</xref>
        . Workers in labor-intensive sectors such as construction routinely adopt
awkward working postures, which exposes them to a high risk of developing MSD. Some of
the efforts directed at addressing this problem through more proactive
safety management practices have focused on emerging wearable sensor technologies
that monitor the postures adopted by the targeted workers. Such efforts at times leverage
ergonomic rules to facilitate the monitoring of repetitive awkward postures. Common
ergonomic rules include the “Rapid Entire Body Assessment”
        <xref ref-type="bibr" rid="ref7">(Hignett and McAtamney 2000)</xref>
        ,
the Ovako Working Posture Analyzing System (OWAS) and its extension
        <xref ref-type="bibr" rid="ref9">(Kivi and Mattila 1991)</xref>
        ,
and ISO 11226:2000 (Normalización 2000). The use of such rules in construction relies on
visual assessment by superintendents, which tends to suffer from subjective bias.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Posture Capture in Construction</title>
      <p>
        Vision-based sensing and wearable sensing techniques present an opportunity for motion data
collection in a more effective and efficient way, compared to manual inspection
        <xref ref-type="bibr" rid="ref20">(Wang et al.
2015)</xref>
        . Vision-based sensing has exceptionally high accuracy and often comes with powerful
analytical tools for assessing both body posture and internal joint load
        <xref ref-type="bibr" rid="ref6">(Han and Lee 2013)</xref>
        .
However, its vulnerability to occlusion and lighting conditions, as well as its data processing
complexity
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref3">(Ding et al. 2018, Hanbin Luo et al. 2018)</xref>
        limit its application on construction sites.
Wearable sensors, particularly Inertial Measurement Units (IMUs), are better suited to
the constraints of a construction job site. IMUs capture streaming motion data in a
non-intrusive, robust, and cost-effective manner
        <xref ref-type="bibr" rid="ref20">(Wang et al. 2015)</xref>
        . The raw IMU sensor output
can be used to detect awkward postures without the need to fully reconstruct a 3D human
body model, an advantage enabled by ML-based posture recognition models. The
robustness and flexibility of IMU sensors make them appropriate for use on the construction
job site.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.3 DNN for Posture and Activity Recognition</title>
      <p>
        Conventional ML-based models suffer from three main problems when used in pattern
recognition tasks
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Plötz and Guan 2018)</xref>
        . First, heuristic feature engineering, which manually
crafts features, is time-consuming and prone to bias. Second, treating
sliding windows separately isolates consecutive samples and assumes that the
time-series motion data are static, which neglects existing sequential patterns. Third,
separating feature engineering and selection from model tuning during training
results in sub-optimal performance.
      </p>
      <p>
        DNN models can address some of these challenges. They have been used successfully in pattern
recognition tasks in Computer Vision and Speech Recognition
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Plötz and Guan 2018)</xref>
        . A
multi-layer CNN can automatically extract rich features of increasing complexity from raw
data, reducing the bias of manual feature engineering. An RNN has the advantage of capturing
sequential patterns. Integrating the CNN and RNN allows a model to learn both
complex features and sequential patterns directly from the input data, which
ultimately improves model performance.
      </p>
      <p>
        The use of Computer Vision and DNN for posture recognition in construction research is an
emerging area.
        <xref ref-type="bibr" rid="ref22">Zhang et al. (2018)</xref>
        used a multilayer CNN to extract view-invariant features from
a single video camera for awkward working posture recognition, which significantly improved
the model’s generality. Hanbin
        <xref ref-type="bibr" rid="ref10 ref11 ref13">Luo et al. (2018)</xref>
        eliminated the need to reconstruct complex
3D human body models in their posture recognition system, which was developed using a
pre-trained CNN model (VGG-16).
        <xref ref-type="bibr" rid="ref3">Ding et al. (2018)</xref>
        deployed an RNN with a pre-trained CNN
model (Inception V3) to further capture the sequential patterns in the video frames. The
resulting hybrid model outperformed state-of-the-art heuristic feature engineering
approaches in activity recognition. In other efforts, it was demonstrated that DNN
models can be used to detect the misuse of Personal Protective Equipment by multiple workers
in the same video scene
        <xref ref-type="bibr" rid="ref17 ref3 ref4 ref5">(Fang et al. 2018a, Fang et al. 2018b)</xref>
        . The CNN-based classifier
(ResNet-50) can also be used to detect construction equipment, materials, and workers, and
recognize interactive construction activities
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref17">(Xiaochun Luo et al. 2018a)</xref>
        . Xiaochun Luo et al.
(2018b) demonstrated the feasibility of using the CNN model to detect objects from
surveillance cameras. These breakthroughs have been made possible by two key
factors: i) the existence of robust CNN architectures (such as VGG-16, Inception, and
ResNet-50) for automated feature extraction from images and videos; and ii) RNNs for capturing sequential
patterns. Notwithstanding these developments, environmental constraints and insufficient
data for training deep models continue to limit the real-life deployment and scaling of DNN
models that use data from vision-based sensing
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref3">(Ding et al. 2018, Hanbin Luo et al. 2018)</xref>
        .
This paper explores possible strategies for addressing these challenges using
data captured by wearable IMU sensors. The proposed approach treats motion data from
multiple sensor channels, segmented into windows of the same dimension, as 2D
“images.” This “trick” makes it possible to apply DNN models to WS output. A simple DNN model
was first introduced into wearable sensing-based posture recognition in an effort to find
discriminative features
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Plötz and Guan 2018)</xref>
        . On-going research work has achieved promising
recognition performance
        <xref ref-type="bibr" rid="ref15 ref21">(Ordóñez and Roggen 2016, Zeng et al. 2014)</xref>
        . DNN-based
models have shown high performance in recognizing daily human activities and postures from
WS output. However, there is no universal, pre-trained, and ready-to-use deep model
architecture for different scenarios
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Plötz and Guan 2018)</xref>
        . It is, therefore, necessary to
investigate how one can configure a DNN-based model for use in the recognition of
construction workers’ posture.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. DNN-based Model Development and Test</title>
      <p>
        The DNN models were implemented using Keras (TensorFlow GPU backend). Model
development, training, and testing were done on a Windows 10 PC (Intel Core i7-7700 CPU @
2.8 GHz, 16GB RAM, NVIDIA GeForce GTX 1060 GPU @ 16GB RAM). The authors
experimented with a DNN-based posture recognition model that combines CNN
and Long Short-Term Memory (LSTM) into a Convolutional-LSTM Network (CLN). The
LSTM extends the conventional RNN’s ability to capture longer temporal patterns. The CLN
was evaluated against a non-recurrent CNN and a non-convolutional LSTM as baseline models.
The difference between the CLN and the CNN is the topology of the fully connected dense layers:
the CLN uses LSTM units in place of dense layers. The LSTM model uses raw sensor output without
features learned by a CNN. In this case, the difference in model performance can be attributed
to the architecture of the model rather than to optimization, pre-processing,
or ad hoc customization
        <xref ref-type="bibr" rid="ref15">(Ordóñez and Roggen 2016)</xref>
        .
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Components in DNN-based Model Architecture</title>
      <p>A CLN model example using a one-layer CNN and a one-layer LSTM is shown in Figure 1.</p>
      <p>Model Input: The model input is segmented by a sliding window. The raw sensor output is
normalized for each channel. After segmentation, each window can be treated as a 2D image of “S” by “D”
(e.g. 60 by 30 in Figure 1), where S is the number of timesteps and D is the number of sensor channels in a
window. All sensor channels are treated as one layer, resulting in an input depth of one.</p>
      <p>Batch and Epoch: The entire dataset for the DNN is divided into multiple non-overlapping
groups of equal size, and each group is fed into the model for training. Each group is
referred to as a batch (e.g. a batch size of 10 in Figure 1). One epoch of
training means that all batches of training data pass both forward and backward through the model
once. Multiple epochs can be used for parameter optimization when training data is limited.</p>
      <p>Convolutional Layer: A convolutional layer computes an output that is connected to a local
region of each sample in the input. The stride (e.g. 1 by 1 in Figure 1) quantifies the movement
of the convolutional kernel along the vertical or horizontal direction. Zero-padding of the
input data avoids losing information at the border of the 2D input. The number “n” of
convolutional kernels determines the number of feature maps (e.g. 20 in Figure 1). This creates
an additional depth dimension for the convolutional layers. A Flatten Layer is used to
connect to a fully-connected dense layer. It converts each sample’s feature maps into a
one-dimensional vector representing one sample to be classified.</p>
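      <p>As a minimal sketch of the Model Input step in this section, the following pure-Python example normalizes each sensor channel and wraps a window as an S-by-D image with a depth of one. The z-score normalization and the helper names `normalize_channels` and `to_image` are assumptions made for illustration; the text states only that the raw sensor output is normalized for each channel.</p>
      <preformat><![CDATA[
```python
from statistics import mean, pstdev


def normalize_channels(samples):
    """Z-score each sensor channel (column) of a list of per-timestep rows.

    Assumption: z-score normalization; the source does not name the method.
    """
    columns = list(zip(*samples))
    normalized_columns = []
    for column in columns:
        mu = mean(column)
        sigma = pstdev(column)
        # Guard against constant channels to avoid division by zero.
        scale = sigma if sigma > 0 else 1.0
        normalized_columns.append([(value - mu) / scale for value in column])
    # Transpose back to one row per timestep.
    return [list(row) for row in zip(*normalized_columns)]


def to_image(window):
    """Wrap an S-by-D window as an S x D x 1 'image' (input depth of one)."""
    return [[[value] for value in row] for row in window]


# Toy data: 3 timesteps (S) and 2 channels (D); the second channel is constant.
raw = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
norm = normalize_channels(raw)
image = to_image(norm)
```
]]></preformat>
      <p>In the setting of Figure 1, the same wrapping would turn a 60-by-30 window into a 60 x 30 x 1 input.</p>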
      <p>
        LSTM Layers: Flattening the CNN output ignores temporal dependencies between different
time steps. The LSTM can address this problem. Figure 1 shows that only the feature maps along the
depth dimension are flattened. The vertical time step dimension is preserved for capturing the
sequential pattern. Each slice over the time step has a dimension of 600 (features) × 10 (batch
size). Samples in a batch are fully connected with the 64 neurons in the LSTM layer. The LSTM neurons are
then fully connected with the softmax layer, which is used to predict the class of each sample in
a batch. The LSTM gives a prediction for every time step “t” in the sequence. However, the
memory of the LSTM units tends to become more informed as more time steps pass. This is
because the activation information in the LSTM neurons at each time step is passed on to the next.
This implies that the more time steps the LSTM neurons have “seen”, the more informative the
model can be
        <xref ref-type="bibr" rid="ref15">(Ordóñez and Roggen 2016)</xref>
        . The class probability distribution at the last time step
“T” is used as the recognition result, when the full sequence in the window has been observed.
      </p>
      <p>
Fully-Connected Layer: In the non-recurrent CNN model, all elements in the vector generated
by the Flatten Layer connect with the neurons in a subsequent fully-connected layer. The last
fully-connected layer has as many neurons as there are class labels. The softmax
function is used to deliver a class probability distribution for the samples in the batch. Each sample
is then classified by the class label with the highest probability.
      </p>
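      <p>The shapes quoted for Figure 1 can be checked with a short calculation. The sketch below assumes that zero-padding with a 1-by-1 stride preserves the S-by-D grid, so that 30 channels and 20 feature maps give the 600 features per time step mentioned in the text. The function names and the per-step scores are hypothetical and serve only to illustrate how the class is read from the last time step.</p>
      <preformat><![CDATA[
```python
import math


def conv_output_shape(timesteps, channels, n_kernels):
    """Shape after a zero-padded, stride-1 convolutional layer."""
    # Zero-padding with a 1x1 stride keeps the S x D grid; depth becomes n_kernels.
    return (timesteps, channels, n_kernels)


def lstm_input_shape(conv_shape):
    """Flatten the channel and depth axes per time step; keep the time axis."""
    timesteps, channels, depth = conv_shape
    return (timesteps, channels * depth)


def softmax(logits):
    """Turn raw class scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Figure 1 example: 60 timesteps, 30 sensor channels, 20 feature maps.
conv_shape = conv_output_shape(timesteps=60, channels=30, n_kernels=20)
seq_shape = lstm_input_shape(conv_shape)

# Hypothetical class scores at three time steps of one window (three classes).
scores_per_step = [[0.2, 0.1, 0.0], [0.5, 1.0, 0.1], [0.3, 2.0, 0.2]]
# Only the last time step "T" is used as the recognition result.
last_step_probs = softmax(scores_per_step[-1])
predicted_class = last_step_probs.index(max(last_step_probs))
```
]]></preformat>
      <p>Here `seq_shape` is the per-sample sequence that the LSTM layer would receive: 60 time steps of 600 features each.</p>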
    </sec>
    <sec id="sec-8">
      <title>3.2 Model Architecture</title>
      <p>
CLN Model: The proposed CLN model uses four convolutional layers to extract complex
features. Each layer has 64 kernels with a size of 5 by the number of sensor channels, a 1×1 stride,
and zero-padding.
        <xref ref-type="bibr" rid="ref8">Karpathy et al. (2015)</xref>
        and
        <xref ref-type="bibr" rid="ref15">Ordóñez and Roggen (2016)</xref>
        recommend the use of a
two-layer LSTM. Each LSTM layer has 128 units. The output from the LSTM is used by the softmax
layer for prediction. This model can be expressed as
        <inline-formula><tex-math>$C(64) - C(64) - C(64) - C(64) - R(128) - R(128) - Sm$</tex-math></inline-formula>
        <xref ref-type="bibr" rid="ref17">(Pigou et al. 2018)</xref>
        , where
        <inline-formula><tex-math>$C(n_c)$</tex-math></inline-formula>
        denotes a convolutional layer with
        <inline-formula><tex-math>$n_c$</tex-math></inline-formula>
        feature maps,
        <inline-formula><tex-math>$R(n_r)$</tex-math></inline-formula>
        denotes a recurrent LSTM layer with
        <inline-formula><tex-math>$n_r$</tex-math></inline-formula>
        units, and
        <inline-formula><tex-math>$Sm$</tex-math></inline-formula>
        is the softmax classification. The hyperbolic
tangent function was used in the proposed model to activate neurons in each layer of the CNN and LSTM.
      </p>
      <p>Baseline Models: The baseline CNN model was developed by substituting the LSTM layers in the CLN with two
128-unit dense layers. The CNN model can be expressed as
        <inline-formula><tex-math>$C(64) - C(64) - C(64) - C(64) - D(128) - D(128) - Sm$</tex-math></inline-formula>
        , where
        <inline-formula><tex-math>$D(n_d)$</tex-math></inline-formula>
        denotes a dense layer with
        <inline-formula><tex-math>$n_d$</tex-math></inline-formula>
        units. The baseline LSTM model can be represented as
        <inline-formula><tex-math>$R(n_r) - R(n_r) - Sm$</tex-math></inline-formula>
        . It uses the normalized sensor output in
each window as model input and does not use feature engineering.</p>
    </sec>
    <sec id="sec-9">
      <title>3.3 Data Collection and Preparation</title>
      <p>Four subjects were recruited from nearby construction projects. Five IMU sensors (MbientLab
MetaMotionC) were deployed at the hardhat, upper arm, chest center, right thigh, and right
calf (Figure 2) by attaching them to the clothing. After obtaining the workers’ consent under Institutional
Review Board (IRB)-approved protocols, each subject was asked to perform their routine tasks
for 30 minutes. Their activities were recorded for cross-referencing. The data collected are
summarized in Table 1.
The output from all sensor channels was down-sampled to 40 Hz for S2, S3, and S4. The
data for S1 were collected at 25 Hz and down-sampled to 20 Hz. Each record was labelled using the
video reference. This research used a 1-second window with 50% overlap for segmentation.
Each window was labelled using the majority label in the window.
      </p>
      <p>Table 1. Summary of the collected posture data. Two of the per-subject posture distributions read: BT (14.7%), KN (2.0%), LB (12.3%), OW (7.2%), ST (52.4%), WK (3.4%); and BT (13.6%), KN (46.7%), NULL (15.0%), SQ (3.0%), ST (22.0%). Posture labels: BT, static bending and minor movement with bending, and short-term pick-up; KN, kneeling on one leg or both legs; LB, lateral bend; NULL, transition movement; OW, overhead work; SQ, squatting; ST, standing. The posture data without a video reference (due to blocking) were not labelled, so the labelled postures do not add up to 100%.</p>
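      <p>The segmentation step in this section (a 1-second window, 50% overlap, majority label per window) can be sketched as follows. This is an illustrative pure-Python version, not the authors' implementation; the function name `segment`, its parameters, and the toy label sequence are assumptions.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def segment(samples, labels, rate_hz, window_s=1.0, overlap=0.5):
    """Cut a recording into overlapping windows and label each by majority vote."""
    size = int(rate_hz * window_s)            # samples per window
    step = max(1, int(size * (1 - overlap)))  # hop between window starts
    windows = []
    for start in range(0, len(samples) - size + 1, step):
        chunk = samples[start:start + size]
        chunk_labels = labels[start:start + size]
        # Most frequent label in the window; ties go to the label seen first.
        majority = Counter(chunk_labels).most_common(1)[0][0]
        windows.append((chunk, majority))
    return windows


# Toy recording: 2 seconds at 20 Hz with one channel, switching from ST to BT.
samples = [[float(i)] for i in range(40)]
labels = ["ST"] * 25 + ["BT"] * 15
windows = segment(samples, labels, rate_hz=20)
window_labels = [label for _, label in windows]
```
]]></preformat>
      <p>At 20 Hz, each window holds 20 samples and consecutive windows start 10 samples apart, so the 40-sample toy recording yields three windows.</p>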
    </sec>
    <sec id="sec-10">
      <title>3.4 Set-up for the Model Training and the Model Implementation</title>
      <p>
        Dataset Splitting for Model Training and Testing: A stratified random shuffle was used to split
the dataset. This ensured that the same class distribution was maintained in both the training and test
data. An 8:2 ratio is the recommended “rule of thumb” for splitting datasets in related studies
        <xref ref-type="bibr" rid="ref15">(Ordóñez and Roggen 2016)</xref>
        . To fully train the DNN models, the ratio
used for training and validation was set at 9:1 (higher than the recommended ratio). The
splitting was repeated five times using a different random state each time. This minimized bias in dataset
splitting and improved the reliability of model evaluation.
      </p>
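      <p>A stratified random shuffle split repeated with five random states can be sketched in pure Python as below. The function `stratified_split`, the seeds, and the toy label counts are assumptions for illustration; the text specifies only the 9:1 ratio, the stratification, and the five repetitions.</p>
      <preformat><![CDATA[
```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Shuffle indices within each class and hold out the same fraction of each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one held-out sample per class.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy imbalanced label set: 50 ST, 30 BT, and 20 KN windows.
labels = ["ST"] * 50 + ["BT"] * 30 + ["KN"] * 20
# Five repetitions of a 9:1 split, each with a different random state.
splits = [stratified_split(labels, test_fraction=0.1, seed=s) for s in range(5)]
test_counts = [Counter(labels[i] for i in test) for _, test in splits]
```
]]></preformat>
      <p>Each repetition holds out 5 ST, 3 BT, and 2 KN windows, so the class proportions of the held-out data match those of the full set.</p>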
      <p>Model Performance Evaluation Metric: The construction workers’ postures are highly
unbalanced between classes, which reflects natural human postures. Classification
accuracy is insufficient for measuring model performance because a naive model would achieve high
accuracy by classifying every sample as the majority class. The authors used the Macro
F1-score to account for this imbalance. The F1-score for each label is the
harmonic mean of Precision and Recall. By giving equal weights to the majority and minority
labels when evaluating overall performance for all labels, the model can be trained to achieve
high performance in recognizing all postures. The Macro F1-score is calculated in Equation
(1) using an unweighted average. “N” in the equation denotes the number of class labels. High
Macro F1-scores reflect high classification performance.</p>
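      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>$$\text{Macro F1} = \frac{1}{N}\sum_{i=1}^{N}\frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$</tex-math>
        </disp-formula>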
Checkpoint in Model Training: The model performance does not increase consistently after every
epoch during the training process. The models were trained until their performance ceased to
improve, which occurred after 300 epochs. The model training checkpoint was set to overwrite the
saved model whenever performance improved, so that the DNN model
with the best performance across all epochs was retained. A dropout operation (50%) was used before the
fully-connected layers to control model overfitting.</p>
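      <p>The Macro F1-score of Equation (1) can be computed directly from true and predicted labels. The sketch below is a plain-Python illustration with hypothetical toy labels; it also shows why accuracy alone is misleading on imbalanced data, since a naive majority-class predictor scores 90% accuracy but less than 0.5 Macro F1.</p>
      <preformat><![CDATA[
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores, as in Equation (1)."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall; zero when both are zero.
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy imbalanced case: nine standing (ST) windows and one kneeling (KN) window.
y_true = ["ST"] * 9 + ["KN"]
# Naive model that always predicts the majority class.
y_pred = ["ST"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
```
]]></preformat>
      <p>Because the minority class contributes an F1-score of zero with the same weight as the majority class, the Macro F1-score penalizes the naive model that accuracy rewards.</p>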
    </sec>
    <sec id="sec-11">
      <title>4. Result and Discussion</title>
    </sec>
    <sec id="sec-12">
      <title>4.1 Model Performance Evaluation</title>
      <p>
        The evaluation was based on the use of conventional ML-based models as the benchmark on
the same dataset. The features used in the benchmark models were the same as the authors’
previous experiments
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Zhao and Obonyo 2018)</xref>
        . The classification algorithms in the ML-based models
include Support Vector Machine (SVM), K-Nearest-Neighbour (KNN), Naive Bayes (NB),
Decision Tree (DT), and, as an ensemble model, Random Forest (RF). The DNN and ML models differ in
the features constructed and the classification models used. The ML models
were implemented using Scikit-learn
        <xref ref-type="bibr" rid="ref16">(Pedregosa et al. 2011)</xref>
        . Results are given in Table 2.
The recognition models’ performance for each subject reflects the “personalization”
capabilities of the proposed approach, which can be used to capture individual posture
idiosyncrasies
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Plötz and Guan 2018)</xref>
        . Table 2 shows that the proposed CLN model consistently
outperformed the benchmarking models by an average of 3%. However, the CNN and LSTM
models were outperformed by the benchmarking model for S4 and S1, respectively. This supports the
hypothesis that the CLN model can leverage the advantages of both the CNN and LSTM
layers. The result is consistent with previous work that integrated CNN and RNN models in
vision-based
        <xref ref-type="bibr" rid="ref3">(Ding et al. 2018)</xref>
        and WS-based
        <xref ref-type="bibr" rid="ref15">(Ordóñez and Roggen 2016)</xref>
        studies.
Confusion matrices were constructed to assess model errors (Appendix 2). The LSTM models make
most of their classification errors when distinguishing between bent, working overhead, and
standing postures (S1). The models also make errors in recognizing transitional postures (S2 and S3).
This may be due to the lack of convolutional layers, which limits the LSTM models in learning
complex features. Sequential patterns alone are not enough for effective posture
recognition. The authors contend that proper feature engineering can be used to improve the
models’ performance. The baseline CNN model’s low performance for S4 stems from the high
recognition errors associated with the walking posture. This may be caused by data imbalance: S4
contains the fewest posture types and an imbalanced posture distribution. The imbalanced dataset
may lead to overfitting for majority postures and low performance in recognizing minority postures.
Improved recognition performance contributes to the reliability of posture-based MSD risk
assessment. The frequency of awkward postures can be detected more accurately, which
enhances the validity of MSD risk level assessment using OWAS. The recognition accuracy
greatly influences the ease with which postures can be detected continuously
        <xref ref-type="bibr" rid="ref15">(Ordóñez and
Roggen 2016)</xref>
        . Improved accuracy can help avoid problems such as false alarms and misses in
real-time posture assessment. In subsequent research efforts, the authors will investigate these
potential improvements further.
      </p>
    </sec>
    <sec id="sec-13">
      <title>4.2 Analysis of Model Generalization</title>
      <p>
        The use of the deep learning model makes it possible to generalize the deployment of the
authors’ proposed posture recognition approach to new individuals
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Plötz and Guan 2018)</xref>
        .
Datasets S2 and S4 were downsampled to 20 Hz and combined with S1 as S5<sup>2</sup>. The generalized
models used were based on the architecture described in Section 3.2 (see the 4 convolutional
layers in Figure 4). As shown in Table 3, the CLN model consistently outperformed the baseline
DNN models. It also improved on the best benchmarking model’s performance by over 2%. This
supports the hypothesis that the CLN model can successfully extract common subject-invariant
features from different subjects. The observations also indicate that the performance of the
model was better than what is observed when the heuristic feature engineering approach is
used. The baseline DNN models’ performance was higher under the generalized approach than under
the personalized approach (compare Table 2 and Table 3). This can be attributed to increased training data size.
The generalized models’ confusion matrices (Figure 3) depict how the CLN model improves on
the benchmarking model (SVM). The CLN and SVM give comparable performance in recognizing
postures common to multiple subjects (BT, KN, ST, WK). The CLN model exhibited
superior performance when the recognized postures were limited to only one subject (LB, NON,
OW, SQ). The CLN outperformed the benchmarking model by 28% when detecting transitional
postures (NULL). This may be explained by the CLN’s ability to capture sequential data
patterns and identify dynamic postures. The results also show that the CLN model can be used
to effectively detect an individual’s unique postures even where there is less data.</p>
      <p>1 The detailed evaluation results of the models are provided in Appendix 3 and Appendix 4.</p>
      <p>2 S2 was not used as it did not contain sensor output from the arm.</p>
    </sec>
    <sec id="sec-14">
      <title>4.3 Analysis of Fusing Multi-Sensor Channels</title>
      <p>
        The output from different sensor channels was fused directly, based on the assumption that the channels
are similar to pixels in images. The validity of this assumption was assessed using the CLN
model’s performance when fusing different sensor channels on the S5 dataset. It was established
that modifying the convolutional kernel sizes allows direct fusion. The performance
increased by 1.9% when fusing two types of channels. These findings suggest that fusing motion
data across channels is an appropriate strategy for the proposed CLN model. Accelerometers
contribute more to the increased recognition performance: Table 4 shows that the model using data
from the accelerometers outperformed the model using data from the
gyroscopes. These findings are consistent with observations from
        <xref ref-type="bibr" rid="ref15">Ordóñez and Roggen (2016)</xref>
        .
      </p>
      <p>Table 4. Performance Evaluation of Fusing Multi-Sensor Channels (Macro F1 Score). Column headings: Convolutional Kernel; CLN Performance. Row: Accelerometers, 20 Hz × 15 channels.</p>
      <p>3 The benchmarking model is SVM with 90 features selected using Recursive Feature Elimination.</p>
    </sec>
    <sec id="sec-15">
      <title>4.4 Analysis of Hyperparameter</title>
      <p>The depth of the convolutional layers influences a DNN model’s ability to learn complex features. The authors
compared the performance of the baseline CNN model at varying depths using dataset S5. The goal was
to evaluate the influence of the convolutional layers without considering the influence of the LSTM layers.
As shown in Figure 4, increasing the depth of the convolutional layers from one to three significantly
improves the model’s performance. There is a plateau when the convolutional layer depth reaches four, after
which the model’s performance starts to decrease. Without the convolutional layers, the tested CNN
model becomes a Multilayer Perceptron (MLP) model (<inline-formula><tex-math>$D(128) - D(128) - Sm$</tex-math></inline-formula>). The CNN model’s
performance is no better than that of the MLP model when only one convolutional layer is used. These results
suggest that the convolutional layers can extract features effectively as long as the optimal
depth has been identified. The improper feature representation from one shallow convolutional layer
can have an adverse impact on the model’s performance. As shown in Figure 4, an architecture that is too deep
can also decrease the model’s performance. Besides the problem of overfitting, a deep architecture can
result in a “vanishing gradient” problem, in which the gradients propagated back through many layers become
too small to train the earlier layers effectively. Additionally, the model training time increases significantly
as the model goes deeper. It is therefore necessary to tune the depth to balance complexity and performance.</p>
    </sec>
    <sec id="sec-16">
      <title>5. Conclusion and Future Work</title>
      <p>The proposed CLN model can improve the recognition of construction workers’
postures from data obtained with wearable IMU sensors. The consistently
high performance of the CLN model supports the authors’ position that: i) the integrated CLN model can
take advantage of the power of both the CNN and LSTM models to further improve DNN-based model
performance; and ii) an automated feature extraction approach can reduce the bias in the heuristic
feature engineering process, thus resulting in improved recognition performance.</p>
      <p>
        The CLN model can capture both the “subject-invariant” common features and the unique features of a
specific subject. This outcome cannot be easily achieved with the heuristic feature
engineering approach. The CLN model can be a promising approach for balancing generalization and
personalization
        <xref ref-type="bibr" rid="ref18 ref2 ref22 ref23">(Plötz and Guan 2018)</xref>
        . This is an important goal in ML-based posture recognition. The
CLN model can also be applied directly to output from WS. The deployed CLN model can learn complex
features directly from the raw sensor output. CLN model performance tends to increase with more sensor
channels. However, it is recommended to tune the convolutional layer depth as a hyperparameter for
DNN-based models, because the learning power may not be fully leveraged in shallow or overly deep models.
The authors have extended the application of DNN-based models beyond vision-based sensing to
multi-channel motion sensing data. The proposed approach was validated on the task of accurate posture
recognition for MSD risk assessment using data derived from WS. In subsequent efforts,
the authors will refine the DNN models’ hyperparameters and deploy ensembled DNN models for
performance improvement. The developed model will be reconfigured for real-time deployment on mobile
devices. The authors will also investigate the CLN performance on “unseen” subjects’ data.
      </p>
      <p>4 The arm sensor malfunctioned during data collection, so the six channels from the arm sensor were not
considered.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qiu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2017</year>
          )
          <article-title>Construction worker's awkward posture recognition through supervised motion tensor decomposition</article-title>
          .
          <source>Automation in Construction, 77</source>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Automated detection of sewer pipe defects in closed-circuit television images using deep learning techniques</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <volume>95</volume>
          , pp.
          <fpage>155</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Love</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ouyang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>A deep hybrid learning model to detect unsafe behavior: integrating convolution neural networks and long short-term memory</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <volume>86</volume>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (2018a)
          <article-title>Computer vision aided inspection on falling prevention measures for steeplejacks in an aerial environment</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <volume>93</volume>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>An</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (2018b)
          <article-title>Detecting non-hardhat-use by a deep learning method from far-field surveillance videos</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <volume>85</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>A vision-based motion capture and recognition framework for behavior-based safety management</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <volume>35</volume>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Hignett</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McAtamney</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2000</year>
          )
          <article-title>Rapid entire body assessment (REBA)</article-title>
          .
          <source>Applied Ergonomics</source>
          ,
          <volume>31</volume>
          (
          <issue>2</issue>
          ), pp.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>Visualizing and understanding recurrent networks</article-title>
          .
          <source>arXiv preprint arXiv:1506.02078</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Kivi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mattila</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1991</year>
          )
          <article-title>Analysis and improvement of work postures in the building industry: application of the computerised OWAS method</article-title>
          .
          <source>Applied Ergonomics</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>43</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Love</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ouyang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Convolutional neural networks: Computer vision-based workforce activity assessment in construction</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <volume>94</volume>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (2018a)
          <article-title>Recognizing Diverse Construction Activities in Site Images via Relevance Networks of Construction-Related Objects Detected by Convolutional Neural Networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>Journal of Computing in Civil Engineering</source>
          ,
          <volume>32</volume>
          (
          <issue>3</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (2018b)
          <article-title>Towards efficient and objective work sampling: Recognizing workers' activities in site surveillance videos with two-stream convolutional networks</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <volume>94</volume>
          , pp.
          <fpage>360</fpage>
          -
          <lpage>370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>I. L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bush</surname>
            ,
            <given-names>P. M.</given-names>
          </string-name>
          (
          <year>2012</year>
          )
          <article-title>Work-related musculoskeletal disorders assessment and prevention</article-title>
          .
          <source>in Ergonomics-A Systems Approach: InTech.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Ordóñez</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Roggen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition</article-title>
          .
          <source>Sensors</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>115</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2011</year>
          )
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          (Oct), pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Pigou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Den Oord</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dieleman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Herreweghe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dambre</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>126</volume>
          (
          <issue>2-4</issue>
          ), pp.
          <fpage>430</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Plötz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Guan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Deep Learning for Human Activity Recognition in Mobile Computing</article-title>
          .
          <source>Computer</source>
          ,
          <volume>51</volume>
          (
          <issue>5</issue>
          ), pp.
          <fpage>50</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Ryu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jebelli</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Automated Action Recognition Using an Accelerometer-Embedded Wristband-Type Activity Tracker</article-title>
          .
          <source>Journal of Construction Engineering and Management</source>
          ,
          <volume>145</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>04018114</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ning</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>Risk Assessment of Work-Related Musculoskeletal Disorders in Construction: State-of-the-Art Review</article-title>
          .
          <source>Journal of Construction Engineering and Management</source>
          ,
          <volume>141</volume>
          (
          <issue>6</issue>
          ), pp.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>L. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mengshoel</surname>
            ,
            <given-names>O. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Convolutional neural networks for human activity recognition using mobile sensors</article-title>
          .
          <source>in 6th International Conference on Mobile Computing, Applications and Services</source>
          : IEEE, pp.
          <fpage>197</fpage>
          -
          <lpage>205</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Ergonomic posture recognition using 3D view-invariant features from single ordinary camera</article-title>
          .
          <source>Automation in Construction</source>
          ,
          <volume>94</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Obonyo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Towards a Data-Driven Approach to Injury Prevention in Construction</article-title>
          .
          <source>in Advanced Computing Strategies for Engineering. EG-ICE 2018</source>
          , Cham: Springer International Publishing, pp.
          <fpage>385</fpage>
          -
          <lpage>411</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>