<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transfer Learning of Temporal Information for Driver Action Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joseph Lemley</string-name>
          <email>j.lemley2@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shabab Bazrafkan</string-name>
          <email>S.Bazrafkan1@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Corcoran</string-name>
          <email>peter.corcoran@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University of Ireland Galway, College of Engineering and Informatics</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>123</fpage>
      <lpage>128</lpage>
      <abstract>
        <p>Correct classification of image data can depend on features learned in multiple sequential frames. We focus on the problem of learning action from video data with an emphasis on driver behavior monitoring. An insufficient quantity of high quality labeled data is a major problem in machine learning research. This is especially true when deep neural networks are used. Although some sufficiently large, general purpose image databases exist for action recognition, most of these are limited to single frames. This kind of data requires that the action recognition task is applied regardless of the temporal information (information from previous and next frames of a video sequence). In this paper, we show that temporal information is useful for accurate classification of video and that the temporal information in lower layers of a convolutional neural network can successfully be transferred from one network to another to greatly improve performance on the driver behavior monitoring task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>In recent years, deep learning has become ubiquitous for image
classification, with some results exceeding human accuracy for
certain tasks. With few exceptions, these impressive results have
been limited to a set of tasks for which a single picture provides all
the information required to easily distinguish between classes and
essentially ignore any time element for the recognition task.</p>
      <p>As we approach the limits of frame-based methods, there is a
desire to further improve deep learning algorithms by utilizing
temporal information, which is information between multiple frames
taken sequentially to give a more complete idea of what is
happening. Using a single frame, it is trivial to train a classifier to
determine if a person is holding a glass, but difficult or impossible
to train a classifier to understand if the glass is being picked up
or put down. Even distinguishing jogging from walking can be
difficult without a time component.</p>
      <p>For this more refined level of visual understanding, we need
machine learning models that can learn temporal information and
information about frame content at the same time. We also need
quality labeled data of sufficient size to prevent overfitting. Much
work has been done recently to address these issues (described in
the related work section below), but there is still a lack of quality
data for use in many practical applications. For example, there are
no publicly available databases for driver monitoring that are of
sufficient size to train a deep neural network from scratch.</p>
      <p>The largest database for driver behavior monitoring that we could
find is the “Distracted Driver Dataset”, provided as part of a Kaggle
challenge in mid-2016. Although this database is intended for single
frame classification, it is possible to identify the original frame
sequences from which movies can be created. These movies can
then be used for learning a limited amount of temporal information.</p>
      <p>In this paper, we address two questions:
“Can temporal information be used to improve accurate
classification of driver actions?” and “Can low-level information about
temporal information from an unrelated problem be successfully
used to better understand driver actions in videos?”</p>
      <p>In the next section, we provide brief background information on
recurrent neural networks and 3D CNNs for the interested reader.
In section three, we present related work. In sections 4, 5, and 6,
we present the methods, experiments, and results.</p>
    </sec>
    <sec id="sec-2">
      <title>2 BACKGROUND</title>
      <p>In this section, we provide a brief overview of the key neural
network architectures used in this work, including recurrent neural
networks, long short term memory networks, convolutional neural
networks and 3D convolutional neural networks. We also briefly
describe transfer learning. Readers with a knowledge of these topics
are encouraged to skip to the Related Work section below.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 RNN (Recurrent Neural Network)</title>
      <p>Recurrent neural networks (RNNs) [19] are a highly flexible network
architecture frequently used to model time series data, such as a
sequence of frames from a video. RNNs remember the input at
a previous stage and make a decision for the present input based
on the sequence of the previous data. The hidden state of an RNN
is calculated from the state of the network at each step of the sequence.
The hidden state can be considered the network's memory. RNNs have had
great success in speech and video recognition and are sometimes
combined with convolutional layers [11].</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 LSTM (Long Short Term Memory)</title>
      <p>LSTMs [5] are a type of RNN designed with both long and
short term memory, which permits the modeling of more complex
time series. LSTMs include at least 3 gates which control the way
information flows.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 CNN (Convolutional Neural Networks)</title>
      <p>Convolutional neural networks (CNNs) were popularized by
LeCun et al. [9], who used them successfully for handwritten digit
classification. These networks are inspired by the organization
of the visual cortex and allow spatial information to be more
efficiently learned. Convolutional neural networks can be used on
input of any number of dimensions but, due to their success on
pictures, are most popularly implemented for 2D input plus color
channels. Other popular types of CNNs include 1D CNNs, which
are commonly used for time series, and 3D CNNs, which can be used
for volumetric data or time series data where the third dimension
represents either spatial frames or temporal frames [11].</p>
    </sec>
    <sec id="sec-6">
      <title>2.4 Transfer Learning</title>
      <p>Transfer learning is the process of transferring knowledge that has
already been learned by one neural network into another one. This
is often accomplished by copying the learned weights and biases
from one or more layers of a fully trained network to a different
network. Transfer learning can be used to overcome overfitting
issues and to speed up the training process for a related task.</p>
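      <p>The weight-copying described above can be illustrated with a toy sketch (the layer lists and values below are hypothetical, not from the paper; real networks store per-layer weight tensors):

```python
def transfer_layers(source_weights, target_weights, n_layers):
    """Copy the first n_layers of learned weights from a trained source
    network into a target network; the remaining target layers keep
    their own (e.g. randomly initialized) weights."""
    return [list(src) if i < n_layers else list(tgt)
            for i, (src, tgt) in enumerate(zip(source_weights, target_weights))]

# Toy example: one weight vector per layer.
pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
fresh = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

# Copy the first two layers; the last layer stays freshly initialized.
new_net = transfer_layers(pretrained, fresh, n_layers=2)
```

The copied layers can then be tuned further or frozen, depending on the task.</p>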
    </sec>
    <sec id="sec-7">
      <title>3 RELATED WORK</title>
      <p>In 2008, before the recent wave of deep learning, Klaser et al. [8]
used 3D HOG (Histogram of Oriented Gradients) descriptors and
showed that 3D gradients were able to assist in understanding
action from videos. More modern approaches to action classification
from videos are primarily based on 3D convolutional neural
networks (CNNs) or recurrent neural networks (RNNs) with some
convolutional component, or a combination of the two. 3D CNNs
have been shown to capture short-term temporal information.</p>
      <p>For example, [18] makes a compelling case for the use of 3D
CNNs for understanding video data. Their method, which they
named C3D, compared favorably to other published results on 5 of
the 6 generic action datasets used. They also showed that their network
learns information about both motion and appearance, first learning
appearance and then motion. One problem with this design is that
only relatively short action sequences (16 frames) can be learned.</p>
      <p>Addressing this problem, Lei et al. [10] present an interesting
approach to action recognition: they combine a 3D convolutional
neural network with a hidden Markov model (HMM) and show that
their method compares favorably with other methods. This paper
adds support to the idea that 3D convolutions are important for
understanding short-term temporal information about movement,
and that another mechanism (RNN, HMM, etc.) is needed to understand
long-term context (more than a few frames).</p>
      <p>Keeping on the theme of combining 3D CNN with a network
that is able to learn long-term information, Molchanov et al. [12]
combined a 3D convolutional neural network with a recurrent
neural network to classify short video clips of hand gestures, reporting
excellent results.</p>
      <p>Donahue et al. introduced a new architecture called LRCN
(Long-term Recurrent Convolutional Networks) that combines RNNs with
CNNs [2]. Specifically, they show how to train and optimize a
long-term RNN model that can account for temporal information in
video data. They evaluate their method on UCF-101, a database
that contains over 12,000 videos with 101 human action classes [16].</p>
      <p>Another new architecture that combines LSTM with
convolutions is introduced in “A Machine Learning Approach for
Precipitation Nowcasting” [20]. Although the focus of their paper is
forecasting precipitation, their method is generally applicable to
the task of gathering long and short term time information from
video sequences.</p>
      <p>In [21], convolutional temporal pooling is used with a long short
term memory (LSTM) network to produce state of the art results.
In the same year, [2] used a convolutional LSTM to narrate videos.</p>
      <p>Because convolutions on 3D data are time intensive, there have
been attempts to combine 2D networks while retaining temporal
information. For example, in [3], Feichtenhofer et al. train separate
CNNs for motion and appearance and report very good results.</p>
      <p>Addressing the problem of insufficient data to train a neural
network, [7] introduced a large, automatically generated database
gathered from YouTube clips called the Sports-1M dataset. They
showed that a transfer learning approach is effective at gaining
accuracy on UCF-101 when a network is first trained on Sports-1M.</p>
      <p>Object proposal networks, specifically fast/faster R-CNNs
(region-based convolutional neural networks), have been used successfully
for action recognition, as in [14] and [22], and seem to work
especially well in cases where there is not enough data to train a full
network. Hoang et al. used this method for detecting cell phone
usage and hands on steering wheel detection in [4].</p>
      <p>Just as there are attempts to classify behavior without
exploiting temporal information, there are also ways of addressing the
problem of driver behavior monitoring that do not depend on
spatial information but instead use only temporal information. For
example, [6] compared LSTM, RNN, logistic regression, and deep
neural networks for detecting driver confusion. Rather than using
images, that paper uses multimodal sensor data to assess driver
confusion. They found that LSTM outperforms the other models,
likely because some long-term information is important for this
task. Similarly, [13] uses human motion, as given by cell phone
sensors, as a biometric, and an RNN was utilized to process sensor
signals.</p>
    </sec>
    <sec id="sec-8">
      <title>4 METHODS</title>
      <p>Experiments were conducted on NVIDIA Titan X GPUs with the
Pascal architecture, running Python 2.7 and using Theano [17] and
Keras [1].</p>
      <p>4.1 Data Preparation</p>
      <p>4.1.1 Driver Monitoring Dataset. The Distracted Driver Dataset
was provided as part of a Kaggle Competition in 2016. The dataset
was created by filming actors on a closed driving course engaging
in various distracted and undistracted behaviors. It should be noted
that these images were obtained in a controlled environment: the
car was not actually being driven, but was instead being pulled by
a truck. The objective of the competition was to correctly classify
still images into 10 categories.</p>
      <p>
        The training set of the distracted driver database contains frames
of 26 subjects displaying several of the following behaviors/actions:
(1) c0: safe driving
(2) c1: texting - right
(3) c2: talking on the phone - right
(4) c3: texting - left
(5) c4: talking on the phone - left
(6) c5: operating the radio
(7) c6: drinking
(8) c7: reaching behind
(9) c8: hair and makeup
(10) c9: talking to passenger
      </p>
      <p>Although all the images in the supplied training set are still
images, it is possible to reconstruct the original “movies” based on
their order in the CSV file supplied with ground truth annotations.</p>
      <p>While classifying frames for driver monitoring is an interesting
problem, we wanted to see if we could learn anything from the
temporal information in the movies. Instead of using individual
frames as required for the competition, we created short movie
clips, as we describe later. It is not possible to create such movies
from the supplied testing set because sequence information was
intentionally left out of it by the competition organizers.</p>
      <p>First, we arranged the dataset into distinct (subject, action)
segments ordered by time. The generated movies varied between 38
and 135 frames in length (the average is about 101) for a total of 100
clips. Because the 3D convolutional neural network we were
planning to experiment with requires a fixed number of frames, we further
processed the videos into fixed sized clips using the sliding time
window method. For each of these segments, we create a series
of 5, 10, 16, and 30 frame clips. Each clip is made using a sliding
window starting with the first frame of each (subject, action) segment,
with a step equal to approximately 70% of the clip length to allow
sufficient frame overlap. Appropriate action labels are applied to each clip.</p>
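      <p>The sliding-window clip construction described above can be sketched as follows (function and variable names are ours; the step is approximately 70% of the clip length, as stated):

```python
def make_clips(frames, clip_len, step_frac=0.7):
    """Slide a window of clip_len frames over a (subject, action) frame
    sequence with a step of ~70% of the clip length, so consecutive
    clips overlap by roughly 30%."""
    step = max(1, int(clip_len * step_frac))
    clips = []
    for start in range(0, len(frames) - clip_len + 1, step):
        clips.append(frames[start:start + clip_len])
    return clips

# e.g. an average-length 101-frame segment cut into 16-frame clips
# (step = 11 frames) yields 8 overlapping clips
clips = make_clips(list(range(101)), clip_len=16)
```

The same routine is run per segment for each of the 5, 10, 16, and 30 frame clip lengths.</p>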
      <p>In all experiments, 20 drivers are used for the training set and
the remaining 6 for the validation set. Due to the small size of
this database, we did not feel that further dividing the data into a
smaller test set would be reasonable.</p>
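      <p>Splitting by driver rather than by clip keeps all footage of a given subject in exactly one split, so no driver appears in both training and validation. A minimal sketch of such a subject-wise split (identifiers are hypothetical):

```python
def split_by_subject(clips, train_subjects):
    """Partition (subject_id, clip) pairs so that no driver's footage
    appears in both the training and validation sets."""
    train = [clip for subj, clip in clips if subj in train_subjects]
    val = [clip for subj, clip in clips if subj not in train_subjects]
    return train, val

# 26 drivers: the first 20 for training, the remaining 6 for validation
all_clips = [(subj, ("clip", subj, i)) for subj in range(26) for i in range(4)]
train_set, val_set = split_by_subject(all_clips, set(range(20)))
```

This prevents the network from scoring well on validation merely by recognizing a driver it has already seen.</p>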
      <p>
        In our experiments, we used 4 variations of these:
(1) Grayscale video: all videos of the 26 drivers were reduced to 60 × 80 grayscale images.
(2) Color video: all videos of the 26 drivers were reduced to 112 × 112 full color.
(3) Grayscale frames: all frames of the 26 drivers were reduced to 60 × 80 grayscale images.
(4) Color frames: all frames of the 26 drivers were reduced to 112 × 112 full color images.
      </p>
      <p>Due to copyright restrictions, we are unable to include examples
of these images in this paper, but the interested reader may view
them by visiting the relevant competition at:</p>
      <p>https://www.kaggle.com/c/state-farm-distracted-driver-detection</p>
    </sec>
    <sec id="sec-9">
      <title>4.2 Augmentation</title>
      <p>In experiments where augmentation was applied, the
ImageDataGenerator class within Keras was used. This class is used to
dynamically create augmented images during training given a set of
parameters. Since the standard implementation of
ImageDataGenerator only supports 2D data, we subclassed it to properly apply
the transformations to video data. This modification involved
ensuring that the same transformation was applied to every frame of
a clip instead of treating each frame as an individual image with a
potentially different transformation.</p>
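      <p>The key requirement, one transformation drawn per clip and applied identically to every frame, can be illustrated without Keras (a toy sketch using a horizontal pixel shift; real augmentation would subclass ImageDataGenerator as described above):

```python
import random

def shift_row(row, dx, fill=0):
    """Shift one row of pixels horizontally by dx, padding with fill."""
    if dx > 0:
        return [fill] * dx + row[:-dx]
    if dx < 0:
        return row[-dx:] + [fill] * (-dx)
    return list(row)

def shift_frame(frame, dx):
    return [shift_row(row, dx) for row in frame]

def augment_clip(clip, max_shift=2, rng=None):
    """Draw ONE random horizontal shift per clip and apply it to EVERY
    frame, so the whole video is transformed consistently (per-frame 2D
    augmentation would instead draw a new shift for each frame)."""
    rng = rng or random.Random()
    dx = rng.randint(-max_shift, max_shift)  # drawn once per clip
    return [shift_frame(frame, dx) for frame in clip]
```

Because the shift is drawn once, identical input frames always produce identical augmented frames within a clip.</p>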
    </sec>
    <sec id="sec-10">
      <title>5 EXPERIMENTS</title>
      <p>In this section, we describe 10 experiments on the distracted driver
database. The experiments were designed to allow comparison
between networks that use temporal information (LSTM, 3D CNN,
etc.) and networks that ignore it (2D CNN). The last experiments are
designed to measure the improvement that is achieved by transfer
learning.</p>
      <p>5.1</p>
      <p>We train an implementation of VGG-16 [15] from scratch on the
distracted driver dataset in Keras with a learning rate of 0.001. Our
loss function was categorical cross entropy and our output class
predictions were obtained from a final softmax layer.</p>
      <p>We then repeated our experiments on our grayscale dataset
with rotation, translation, and feature normalization. By feature
normalization, we mean that the inputs are divided by the standard
deviation of the dataset.</p>
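      <p>The feature normalization described above amounts to a single division by the dataset-wide standard deviation (no mean subtraction). A minimal pure-Python sketch (names are ours):

```python
def featurewise_normalize(images):
    """Divide every pixel by the standard deviation of the whole dataset,
    matching the description above: inputs are divided by the dataset's
    standard deviation."""
    pixels = [p for img in images for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    std = (sum((p - mean) ** 2 for p in pixels) / len(pixels)) ** 0.5
    return [[[p / std for p in row] for row in img] for img in images]

# two tiny 2x2 "images"; this dataset's std happens to be exactly 1.0
data = [[[0.0, 2.0], [0.0, 2.0]], [[0.0, 2.0], [0.0, 2.0]]]
normalized = featurewise_normalize(data)
```

In Keras this corresponds to the featurewise statistics computed by ImageDataGenerator over the training set.</p>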
    </sec>
    <sec id="sec-11">
      <title>5.2 Very Small CNNs and LSTMs</title>
      <p>In this set of experiments, we utilized several small (one or two
layer) configurations of CNN+LSTM, 3DCNN+LSTM, and 3DCNN.</p>
      <p>For these experiments, the LSTM used is the one described by
[20], implemented in Keras as ConvLSTM2D.</p>
      <p>Three networks were evaluated:
Network 1: 3DCNN followed by Softmax (Num Classes).
One 3D convolution with 16 filters and a kernel size of 3,3,3
followed by a softmax layer.</p>
      <p>Network 2: LSTM followed by 3DCNN → Softmax (Num Classes).
One LSTM with time length 16 and a 3x3x3 convolution kernel,
followed by one 3D convolutional layer with 16 filters and a kernel
size of 3,3,3, ending with a softmax layer.</p>
      <p>Network 3: 3DCNN followed by LSTM → Softmax (Num Classes).
One 3D convolutional layer with 16 filters and a kernel size
of 3,3,3, followed by one LSTM with time length 16 and a 3x3x3
convolution, ending with a softmax layer.</p>
      <p>All three experiments used a learning rate of 0.001 with Nesterov
momentum of 0.95. Categorical crossentropy was our loss function,
and stochastic gradient descent was the training method.</p>
      <p>These experiments were then repeated with the grayscale video
database.</p>
    </sec>
    <sec id="sec-12">
      <title>5.3 Experiments with small 3DCNN architectures</title>
      <p>
        Since the distracted driver dataset has a strong tendency to overfit
due to the limited number of subjects, we designed a network and
training strategy that would delay overfitting for as long as possible
and prevent fast convergence. This involved using a high learning
rate and large batch size, resulting in a training method that would
allow the network to jump over minima. We also observed that the
overfitting came from very low level information, so we restricted
the first layer to only 8 features. The network architecture was
designed as follows:
(1) Input layer with 15 frames, a width and height of 112.
(2) 3D convolutional layer (ReLU), 8 filters of size 3,3,3.
(3) 3D max pooling operator of size 2,2,2.
(4) 3D convolutional layer (ReLU), 16 filters of size 3,3,3.
(5) 3D max pooling operator of size 2,2,2.
(6) 3D convolutional layer (ReLU), 32 filters of size 3,3,3.
(7) 3D max pooling operator of size 2,2,2.
(8) 3D convolutional layer (ReLU), 64 filters of size 3,3,3.
(9) 3D convolutional layer (ReLU), 128 filters of size 3,3,3.
(10) 3D convolutional layer (ReLU), 256 filters of size 3,3,3.
(11) Fully connected layer (ReLU) with 4096 units.
(12) Dropout with a probability of 0.5.
(13) Fully connected layer (ReLU) with 4096 units.
(14) Dropout with a probability of 0.5.
(15) Softmax with 10 units (one for each of the 10 action classes
in the Distracted Driver Dataset).
A learning rate of 0.001 with Nesterov momentum of 0.95 and
categorical crossentropy as the loss function was used. Stochastic
gradient descent was used as the optimizer.
      </p>
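      <p>Assuming 'same' convolution padding (an assumption on our part; the padding scheme is not stated above), the tensor shapes through layers (1)-(10) can be traced with a small helper:

```python
def conv3d_same(shape):
    """A 3x3x3 convolution with 'same' padding leaves the (frames,
    height, width) dimensions unchanged (padding is assumed)."""
    return shape

def pool3d(shape, size=2):
    """2x2x2 max pooling halves each dimension (floor division)."""
    return tuple(d // size for d in shape)

shape = (15, 112, 112)              # layer (1): 15 frames of 112x112
for _ in range(3):                  # layers (2)-(7): conv + pool, 3 times
    shape = pool3d(conv3d_same(shape))
# Layers (8)-(10) are further 'same' convolutions, so under this
# assumption the first fully connected layer sees a (1, 14, 14)
# volume with 256 filters: 1 * 14 * 14 * 256 = 50176 inputs.
```

This kind of shape trace is a quick sanity check that the pooling stack fits the 15-frame input before training.</p>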
      <p>Variations with additional dropout and Batch Normalization
layers were also evaluated, but these only increased the overfitting
and decreased validation accuracy.</p>
      <p>These experiments were repeated with rotation, translation, and
featurewise normalization.</p>
    </sec>
    <sec id="sec-13">
      <title>5.4 C3D trained from scratch</title>
      <p>In this experiment, we trained a C3D [18] from scratch on the
distracted driver dataset. Since C3D was designed to use 16 frame
color video clips of size 112 × 112, we used the color video dataset
described previously. As with previous experiments, a learning rate
of 0.001 with Nesterov momentum of 0.95 was used with categorical
crossentropy and gradient descent.</p>
    </sec>
    <sec id="sec-14">
      <title>5.5 Transfer Learning with C3D</title>
      <p>Since other approaches to reducing the overfitting problem were of
limited success, we tried a transfer learning approach. The idea is
to use pretrained weights from an existing network, trained for a
more generic action recognition task, and then to tune them with
the Distracted Driver training set.</p>
      <p>We used a C3D pretrained on the Sports-1M dataset; the
architecture and the weights trained on Sports-1M are provided in a
GitHub repository. The authors reported very good results at
capturing temporal information using their 3D CNN, so it seemed
like a natural place to start.</p>
      <p>In this experiment, we investigate the use of transfer learning to
overcome the overfitting problem identified in the previous cases.
In the previous experiments, the first layers were identified as being
the primary source of overfitting, so we wanted to try two transfer
learning approaches.</p>
      <p>The first transfer learning approach we tried was to train the C3D
network with a very low learning rate of 0.0001, starting from these
pretrained weights without freezing any layers.</p>
      <p>We then used an alternate transfer learning approach where we
trained only the final softmax layer, freezing the earlier layers.</p>
      <p>Since we previously identified the first layers as being the
greatest source of overfitting, we repeated this experiment freezing only
the first five layers, and then repeated it again with only the first
two layers frozen.</p>
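      <p>The freezing strategy above can be illustrated as a training step that simply skips updates for frozen layers (a toy sketch with hypothetical values; in Keras, freezing is done by setting a layer's trainable attribute to False before compiling):

```python
def sgd_step(weights, grads, frozen, lr=0.0001):
    """One SGD update that leaves frozen layers untouched, mimicking
    transfer learning where the first layers keep their pretrained
    weights while later layers are tuned."""
    updated = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        if i in frozen:
            updated.append(list(w))  # pretrained weights kept as-is
        else:
            updated.append([wj - lr * gj for wj, gj in zip(w, g)])
    return updated

# Freeze the first two layers, tune the rest.
layers = [[1.0], [2.0], [3.0]]
grads = [[10.0], [10.0], [10.0]]
tuned = sgd_step(layers, grads, frozen={0, 1})
# layers 0 and 1 are unchanged; only layer 2 moves
```

Freezing both removes those parameters from optimization and prevents the low-level features from being overwritten by the small target dataset.</p>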
    </sec>
    <sec id="sec-15">
      <title>6 RESULTS</title>
      <p>In this section, we report the results of the experiments in the
previous section. The experiments with the best results are listed
in table 1. Since overfitting is found to be the primary cause of
validation error in most experiments, we also show details about
the loss and accuracy in tables 2 and 3 before and after we reach
100% on the training set.</p>
      <p>6.1</p>
      <p>The best 2D accuracy we obtain without augmentation is 30%, using
a modified VGG-16 [15].</p>
      <p>Interestingly, randomly augmenting the 2D dataset with shifts
of up to 10% in the horizontal or vertical direction is enough to
increase the accuracy from 30% to 46.3%. Adding rotation (5 or 15
degrees) lowered accuracy, and allowing shifts of greater than 10%
also decreased accuracy.</p>
    </sec>
    <sec id="sec-16">
      <title>6.2 Very Small CNNs and LSTMs</title>
      <p>In this set of experiments, we used 3D convolutional LSTMs in 3
different configurations that have been widely cited in the literature
as being effective for this task.</p>
      <p>Despite widely different architectures, these networks all quickly
resulted in an overfit network with 100% accuracy on the training
set and roughly 10% accuracy on the validation set, often within
the first 10 epochs.</p>
      <p>When trained on reduced size grayscale data, the results were
similar but slightly better, with the best accuracy on the validation
set at around 16%.</p>
      <p>The goal of this set of experiments was to identify the network
architecture which would be most promising for measuring the
impact of temporal information for the driver monitoring task, but
due to the overfitting issue, we cannot reach any conclusions from
these 3 experiments.</p>
    </sec>
    <sec id="sec-17">
      <title>6.3 Experiments with small 3DCNN architectures</title>
      <p>Here we report the results of our experiment with the small 3DCNN,
which was designed to reduce overfitting by forcing the network to
concentrate on more general features and by limiting the number
of low level features.</p>
      <p>This network achieved an accuracy of 39.57% on the validation
set, which is better than the 2D VGG-16 inspired network (which
had been shown to be effective at this task in the Distracted Driver
Competition) without manual augmentation, but not better than the
2D VGG when augmented data was provided.</p>
      <p>Variations with additional dropout and Batch Normalization
layers were also evaluated, but these only increased the overfitting
and decreased validation accuracy.</p>
      <p>None of the augmentation strategies that were found to improve
accuracy for the 2D network (translation, rotation, featurewise
normalization) increased accuracy for this small 3DCNN architecture.</p>
      <p>The first experiment was to train the pretrained C3D network at a
very low learning rate using the driver monitoring dataset. This still
resulted in overfitting and unsatisfactory results on the validation
set (13%).</p>
      <p>The next and most successful transfer learning approach we used
was to train only the last few layers, freezing the first layers.
When the first 5 layers were frozen, we achieved
60% on the validation set, which is a significant improvement over
the previous results. Further refinement (freezing just the first 2
layers) allowed us to achieve over 72% accuracy on the validation
set.</p>
      <p>*In all cases, loss is the categorical cross-entropy calculated on
an entire batch.</p>
    </sec>
    <sec id="sec-18">
      <title>7 DISCUSSION</title>
      <p>When training a CNN, information flows in a cone-shaped manner
(see figure 1). This is a property of the convolution and pooling
operations, which merge and mix the information of several pixels
(a window of pixels) into a single value. In the early layers of the
convolutional network, each row of the layer corresponds to a small
spatial portion of the input image. This means that in these early
layers, the network is restricted to the details and low-level features
of the input data.</p>
      <p>This is also the case for temporal information, where time is the
depth dimension. In the early stages of the 3D network, the kernels
are able to observe only a very short part of the input time sequence
data; i.e., the beginning layers of the network are processing a short
time sequence from a small spatial portion of the input data.</p>
      <p>This could correspond to small finger movements or eye blinking,
for example. These movements are common among a wide variety
of human activities like swimming or holding a cell phone. This is
the main reason a transfer learning approach that involved freezing
the first two layers gave significantly better results than other
methods (see figure 2). In our presented approach, the network
trained on the Sports-1M dataset (a database of YouTube videos
focused on sports) was tuned for the driver monitoring task, but
the first two layers were frozen during the tuning procedure. The
success of these features indicates that the early stage features
learned from sports activity classification are generic enough to be
used for other human activities.</p>
      <p>These features represent the most detailed activities, which are
the same for a wide variety of human behaviors. Figure 1 shows
that, going deeper in the network layers, each row of a convolutional
layer observes a wider and larger temporal and spatial region
of the input data.</p>
      <p>This suggests that the kernels in these later layers are dealing
with coarser features of the input data, which include specific
movements or behaviors of the subject in the input movie stream.
This explains why tuning later layers for our specific task converged
to a reasonable approximation of driver activities. In figure 2,
network A is the C3D network trained on the Sports-1M dataset and
network B is the presented network for driver activity classification.
As you can see, the parameters from the first two layers of
network A were transferred to network B and the rest of the
network was tuned for our task.</p>
      <p>This is our explanation of the results obtained by the transfer
learning approach.</p>
      <p>[Figure 1: information flows in a cone-shaped manner through successive convolutional layers, from the input toward the output. Figure 2: the parameters of the first layers of network A are transferred to network B.]</p>
      <p>At a very low level, the action of moving fingers and heads may
not be substantially different between different action recognition
problems for convolutional neural networks. In our experiments,
using any more than the first two layers decreased accuracy.</p>
      <p>Given the shortage of well-labeled task-specific datasets for
action recognition, this is an encouraging result.</p>
    </sec>
    <sec id="sec-19">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research is funded under the SFI Strategic Partnership Program
by Science Foundation Ireland (SFI) and FotoNation Ltd., Project
ID: 13/SPP/I2868, on Next Generation Imaging for Smartphone and
Embedded Platforms. This work is also supported by an Irish
Research Council Employment Based Programme Award, Project ID:
EBPPG/2016/280.</p>
      <p>We gratefully acknowledge the support of NVIDIA Corporation
for the donation of a Titan X GPU used for this research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] François Chollet.
          <year>2015</year>
          . Keras. https://github.com/fchollet/keras. (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Donahue</surname>
          </string-name>
          , Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Long-term recurrent convolutional networks for visual recognition and description</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>2625</fpage>
          -
          <lpage>2634</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          , Axel Pinz, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Convolutional two-stream network fusion for video action recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>1933</fpage>
          -
          <lpage>1941</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. Hoang Ngan</given-names>
            <surname>Le</surname>
          </string-name>
          , Yutong Zheng, Chenchen Zhu, Khoa Luu, and
          <string-name>
            <given-names>Marios</given-names>
            <surname>Savvides</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Multiple Scale Faster-RCNN Approach to Driver's Cell-Phone Usage and Hands on Steering Wheel Detection</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>
          .
          <fpage>46</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and Jürgen Schmidhuber.
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9</source>
          ,
          <issue>8</issue>
          (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Chiori</given-names>
            <surname>Hori</surname>
          </string-name>
          , Shinji Watanabe, Takaaki Hori, Bret A Harsham, John R Hershey, Yusuke Koji, Yoichi Fujii, and
          <string-name>
            <given-names>Yuki</given-names>
            <surname>Furumoto</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Driver confusion status detection using recurrent neural networks</article-title>
          .
          <source>In Multimedia and Expo (ICME)</source>
          ,
          <source>2016 IEEE International Conference on. IEEE</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Andrej</given-names>
            <surname>Karpathy</surname>
          </string-name>
          , George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Large-scale video classification with convolutional neural networks</article-title>
          .
          <source>In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>1725</fpage>
          -
          <lpage>1732</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Klaser</surname>
          </string-name>
          , Marcin Marszałek, and Cordelia Schmid
          .
          <year>2008</year>
          .
          <article-title>A spatiotemporal descriptor based on 3d-gradients</article-title>
          .
          <source>In BMVC 2008-19th British Machine Vision Conference. British Machine Vision Association</source>
          ,
          <fpage>275</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yann</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and
          <string-name>
            <given-names>Lawrence D</given-names>
            <surname>Jackel</surname>
          </string-name>
          .
          <year>1989</year>
          .
          <article-title>Backpropagation applied to handwritten zip code recognition</article-title>
          .
          <source>Neural computation 1</source>
          ,
          <issue>4</issue>
          (
          <year>1989</year>
          ),
          <fpage>541</fpage>
          -
          <lpage>551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jun</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guohui</given-names>
            <surname>Li</surname>
          </string-name>
          , Jun Zhang, Qiang Guo, and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Tu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Continuous action segmentation and recognition using hybrid convolutional neural network-hidden Markov model model</article-title>
          .
          <source>IET Computer Vision</source>
          <volume>10</volume>
          ,
          <issue>6</issue>
          (
          <year>2016</year>
          ),
          <fpage>537</fpage>
          -
          <lpage>544</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Joe</given-names>
            <surname>Lemley</surname>
          </string-name>
          , Shabab Bazrafkan, and Peter Corcoran
          .
          <year>2017</year>
          .
          <article-title>Deep Learning for Consumer Devices and Services: Pushing the limits for machine learning, artificial intelligence, and computer vision</article-title>
          .
          <source>IEEE Consumer Electronics Magazine</source>
          <volume>6</volume>
          ,
          <issue>2</issue>
          (
          <year>2017</year>
          ),
          <fpage>48</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Pavlo</given-names>
            <surname>Molchanov</surname>
          </string-name>
          , Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Kautz</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>4207</fpage>
          -
          <lpage>4215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Natalia</given-names>
            <surname>Neverova</surname>
          </string-name>
          , Christian Wolf, Griffin Lacey, Lex Fridman, Deepak Chandra, Brandon Barbello, and
          <string-name>
            <given-names>Graham</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Learning human identity from motion patterns</article-title>
          .
          <source>IEEE Access</source>
          <volume>4</volume>
          (
          <year>2016</year>
          ),
          <fpage>1810</fpage>
          -
          <lpage>1820</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , Kaiming He, Ross Girshick, and Jian Sun
          .
          <year>2015</year>
          .
          <article-title>Faster r-cnn: Towards real-time object detection with region proposal networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Khurram</given-names>
            <surname>Soomro</surname>
          </string-name>
          , Amir Roshan Zamir, and
          <string-name>
            <given-names>Mubarak</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild</article-title>
          .
          <source>arXiv preprint arXiv:1212.0402</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Theano Development Team</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Theano: A Python framework for fast computation of mathematical expressions</article-title>
          .
          <source>arXiv e-prints abs/1605.02688</source>
          (May
          <year>2016</year>
          ). http://arxiv.org/abs/1605.02688
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Du</surname>
            <given-names>Tran</given-names>
          </string-name>
          , Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and
          <string-name>
            <given-names>Manohar</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning spatiotemporal features with 3d convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          .
          <fpage>4489</fpage>
          -
          <lpage>4497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>David E</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ronald J</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>Learning representations by backpropagating errors</article-title>
          .
          <source>Nature</source>
          <volume>323</volume>
          ,
          <issue>6088</issue>
          (
          <year>1986</year>
          ),
          <fpage>533</fpage>
          -
          <lpage>538</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Xingjian</given-names>
            <surname>Shi</surname>
          </string-name>
          , Zhourong Chen, Hao Wang,
          <string-name>
            <given-names>Dit-Yan</given-names>
            <surname>Yeung</surname>
          </string-name>
          ,
          <string-name>
            <surname>Wai-Kin Wong</surname>
          </string-name>
          , and
          <string-name>
            <surname>Wang-chun Woo</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Convolutional LSTM network: A machine learning approach for precipitation nowcasting</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>802</fpage>
          -
          <lpage>810</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Joe Yue-Hei</given-names>
            <surname>Ng</surname>
          </string-name>
          , Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and
          <string-name>
            <given-names>George</given-names>
            <surname>Toderici</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Beyond short snippets: Deep networks for video classification</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <fpage>4694</fpage>
          -
          <lpage>4702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Liliang</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Liang Lin, Xiaodan Liang, and Kaiming He
          .
          <year>2016</year>
          .
          <article-title>Is Faster R-CNN Doing Well for Pedestrian Detection?</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          . Springer,
          <fpage>443</fpage>
          -
          <lpage>457</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>