Temporal Factorization of 3D Convolutional Kernels

Gabriëlle Ras, Luca Ambrogioni, Umut Güçlü, and Marcel A. J. van Gerven

Department of Artificial Intelligence, Radboud University, Donders Centre for Cognition, Nijmegen, the Netherlands
{g.ras,l.ambrogioni,u.guclu,marcel.vangerven}@donders.ru.nl

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. 3D convolutional neural networks are difficult to train because they are parameter-expensive and data-hungry. To address these problems we propose a simple technique for learning 3D convolutional kernels efficiently, requiring less training data. We achieve this by factorizing the 3D kernel along the temporal dimension, reducing the number of parameters and making training from data more efficient. Additionally, we introduce a novel dataset called Video-MNIST to demonstrate the performance of our method. Our method significantly outperforms the conventional 3D convolution in the low data regime (1 to 5 videos per class). Finally, our model achieves competitive results in the high data regime (> 10 videos per class) using up to 45% fewer parameters.

Keywords: 3D convolution · Convolutional neural network · Factorization

1 Introduction

Modern deep learning has celebrated tremendous success in the area of automatic feature extraction from data with a grid-like structure, such as images. This success can be largely attributed to the convolutional neural network architecture [9], specifically 2D convolutional neural networks (CNNs). These networks are successful due to the principles of sparse connectivity, parameter sharing and invariance to translation in the input space [2]. Loosely speaking, 2D CNNs efficiently find class-discriminating local features independent of where they appear in the input space. Since video is essentially a sequence of images (frames), 2D CNNs can be, and are, used to extract features from the individual frames of the sequence [6]. However, the drawback of this approach is that the temporal information between frames is discarded. Temporal information is important when we want to perform tasks on video such as gesture, action and emotion recognition or classification. One possible way to incorporate time is to stack a recurrent layer after the convolutional layers [15]. However, correlated spatiotemporal features will not be learned, because spatial and temporal features are explicitly learned in separate regions of the network. To solve this problem, [1] proposed to expand the 2D convolution into a 3D convolution, essentially treating time as a third dimension. Ref. [5] used these 3D convolutions to build a 3D CNN for action recognition without using any recurrent layers. It is important to note that the principles that govern 2D CNNs also govern 3D CNNs. Translation invariance in time is useful because the precise beginning and ending of an action are typically ill-defined [14]. Even though 3D CNNs have been shown to work for different kinds of tasks on video data, they remain difficult to train. There are roughly three main issues with 3D CNNs. First, they are parameter-expensive, requiring an abundance of GPU memory. Second, they are data-hungry, requiring much more training data than their 2D counterparts. And third, the increase in free parameters leads to a larger search space. As a result, these models can be unstable and take longer to train.
Existing literature tries to solve these problems by essentially avoiding the use of 3D convolutions altogether. The most common method is the factorization of the 3D convolution into a 2D convolution followed by a 1D convolution, either at the layer level [11, 13] or at the network level [7, 8, 12].

1.1 Contribution

We propose a simple and novel method to structure the way 3D kernels are learned during training. The method is based on the idea that nearby frames change very little in appearance. Each 3D convolutional kernel is represented as one 2D kernel together with a set of transformation parameters. The 3D kernel is then constructed by sequentially applying a spatial transformation [4] directly inside the kernel, allowing spatial manipulation of the 2D kernel values. We achieve the following benefits:

– A reduction in the size of the search space by imposing a sequential prior on the kernel values;
– a reduction in the number of parameters in the 3D convolutional kernel;
– efficient learning from fewer videos.

2 Related Work

Previously, an entire 3D convolutional neural network was factorized into separate spatial and temporal layers, called factorized spatio-temporal convolutional networks [12]. This was achieved by decomposing a stack of 3D convolutional layers into a stack of spatial 2D convolutional layers followed by a temporal 1D convolutional layer. Ref. [13] followed this line of research by factorizing the individual 3D convolutional filters into separate spatial and temporal components called R(2+1)D blocks. Both methods separate the temporal component from the spatial one, one at the network level [12] and one at the layer level [13]. To our knowledge, our approach provides the first instance of a temporal factorization at the level of a single kernel. In effect, we apply the concept of the spatial transformer network [4] to the 3D convolutional kernel to obtain a factorization along the temporal dimension.
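For contrast with the kernel-level factorization proposed in this paper, the layer-level decomposition described above [12, 13] can be sketched as follows. This is a schematic PyTorch illustration with our own module and argument names, not code from either cited work: a spatial (1, k, k) convolution followed by a nonlinearity and a temporal (k, 1, 1) convolution.

```python
# Schematic sketch of the layer-level spatial/temporal factorization of [12, 13]:
# a full 3D convolution is replaced by a spatial (1, k, k) convolution followed by
# a temporal (k, 1, 1) convolution. Channel counts and names are illustrative.
import torch
import torch.nn as nn


class SpatioTemporalFactorizedConv(nn.Module):
    """3D convolution factorized into a 2D spatial conv followed by a 1D temporal conv."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))   # filters within each frame
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))       # filters across frames
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time, height, width)
        return self.temporal(self.relu(self.spatial(x)))


# Example: a batch of 2 single-channel videos with 30 frames of 28 x 28 pixels.
x = torch.randn(2, 1, 30, 28, 28)
y = SpatioTemporalFactorizedConv(1, 8)(x)   # -> shape (2, 8, 30, 28, 28)
```

Unlike this approach, the method proposed in the next section keeps a genuinely 3D kernel and instead constrains how its temporal slices relate to each other.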
3 Methods

The proposed method uses fewer parameters than regular 3D convolutions and imposes a strong sequential dependency on the relationship between temporal kernel slices. In theory, our method should therefore allow efficient feature extraction from video data using fewer parameters and fewer data. The method is explained in Section 3.1. We demonstrate its performance on a variant of the classic MNIST dataset [10] which we call Video-MNIST; the details of this dataset are explained in Section 3.2. As models we implement 3D and 3DTTN variants of LeNet-5, LeNet-5-3D and LeNet-5-3DTTN respectively (see Section 3.3). Training and inference details are explained in Section 3.4.

3.1 Temporal factorization of the 3D convolutional kernel

Fig. 1. The architecture of the temporally factorized 3D kernel. Each temporal kernel slice is sampled using the previous temporal slice and a transformation matrix. The resulting kernel slices are concatenated to form the 3D kernel. Instead of learning the entire kernel we only learn K_0 and the set of transformations Θ.

Consider a 3D convolutional layer consisting of N 3D kernels. We focus on the inner workings of a single kernel K ∈ R^{T×W×H}, where T, W and H refer to the temporal resolution, width and height of the kernel respectively. Without loss of generality, we assume that the input has a channel dimension of one. If we slice K along the temporal dimension, we end up with T 2D kernels K_t ∈ R^{W×H}.

Let us refer to the temporal slice at t = 0 as K_0. Instead of learning the entire K directly, we only learn K_0 and Θ ∈ R^{(T−1)×2×3}. We factorize K such that K_t with t > 0 depends indirectly on K_0 via K_{t+1} = f(K_t; Θ^{(t,t+1)}), where Θ^{(t,t+1)} ∈ R^{2×3} with 0 ≤ t ≤ T−2 are the learnable parameters of the transformation function f. For every pair of slices (K_t, K_{t+1}) we have

\Theta^{(t,t+1)} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix}.  (1)

Θ can be further restricted to contain only affine transformation parameters, that is, scaling s, rotation r, translation in the horizontal direction t_x and translation in the vertical direction t_y. This yields

\Theta^{(t,t+1)} = \begin{pmatrix} s\cos r & -s\sin r & t_x s\cos r - t_y s\sin r \\ s\sin r & s\cos r & t_x s\sin r + t_y s\cos r \end{pmatrix}.  (2)

In that case, K has only W · H + 4(T−1) free parameters. We can additionally add the restriction that there is only one shared transformation per kernel, that is, Θ^{(0,1)} = Θ^{(1,2)} = ... = Θ^{(T−2,T−1)}. This results in just W · H + 4 parameters. Essentially, given Θ, f modifies K_t to become K_{t+1}, sequentially building the 3D kernel from K_0. This way we impose a strong sequential relationship between the slices along the temporal dimension of our kernel.

The nonlinear transformation f(K; Θ) is applied in two stages. First, Θ is transformed into a sampling grid G ∈ R^{W×H×2} that matches the shape of the input feature map, with an explicit dimension for each spatial coordinate. Here K_t is the input feature map and K_{t+1} is the output feature map. We should think of this Θ ↦ G transformation as an explicit spatial mapping of Θ into the input feature space. Each coordinate (x, y), with x ∈ [1, ..., W] and y ∈ [1, ..., H], is split into separate components G_x and G_y, calculated as

\begin{pmatrix} G_x \\ G_y \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}.  (3)

Now that we have the sampling grid G, we can obtain a spatially transformed output feature map K_{t+1} from our input feature map K_t. To interpolate the values of the new temporal kernel slice we use bilinear interpolation. For one particular pixel coordinate (x, y) in the output map we compute

K_{t+1,x,y} = \sum_{i=1}^{H} \sum_{j=1}^{W} K_{t,i,j} \max(0, 1 − |G_x − i|) \max(0, 1 − |G_y − j|).  (4)

Given that our method transforms temporal kernel slices, we refer to 3D kernels composed with our method as 3DTT kernels. Convolutional networks that use 3DTT kernels instead of regular 3D kernels are referred to as 3DTT convolutional networks, or 3DTTNs.
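To make the construction concrete, the following is a minimal PyTorch sketch of a single 3DTT kernel under the restricted parameterization of Eq. (2). The class name, the initialization values and the use of affine_grid/grid_sample are our own illustration rather than the authors' implementation; affine_grid and grid_sample realize the grid generation of Eq. (3) and the bilinear sampling of Eq. (4), but in normalized coordinates, so the learned translations are expressed in units of the kernel extent rather than in pixels.

```python
# Minimal sketch (not the authors' code) of a 3DTT kernel: a learned 2D slice K0 and
# per-step affine parameters (s, r, tx, ty) from which the remaining T-1 slices are
# generated, following Eqs. (1)-(4). Shapes follow the PyTorch (H, W) convention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporallyFactorizedKernel(nn.Module):
    """Builds a (T, H, W) kernel from K0 and T-1 affine steps (a single step if shared)."""

    def __init__(self, T: int, H: int, W: int, shared: bool = False):
        super().__init__()
        self.T, self.H, self.W, self.shared = T, H, W, shared
        n_steps = 1 if shared else T - 1
        self.k0 = nn.Parameter(torch.randn(1, 1, H, W) * 0.1)   # learnable 2D slice K0
        self.s = nn.Parameter(torch.ones(n_steps))               # scale, initialized to 1
        self.r = nn.Parameter(torch.zeros(n_steps))              # rotation (radians)
        self.tx = nn.Parameter(torch.zeros(n_steps))             # horizontal translation
        self.ty = nn.Parameter(torch.zeros(n_steps))             # vertical translation

    def _theta(self, step: int) -> torch.Tensor:
        """Restricted affine matrix of Eq. (2), returned with shape (1, 2, 3)."""
        i = 0 if self.shared else step
        s, r, tx, ty = self.s[i], self.r[i], self.tx[i], self.ty[i]
        row1 = torch.stack([s * torch.cos(r), -s * torch.sin(r),
                            tx * s * torch.cos(r) - ty * s * torch.sin(r)])
        row2 = torch.stack([s * torch.sin(r), s * torch.cos(r),
                            tx * s * torch.sin(r) + ty * s * torch.cos(r)])
        return torch.stack([row1, row2]).unsqueeze(0)

    def forward(self) -> torch.Tensor:
        slices = [self.k0]
        for t in range(self.T - 1):
            grid = F.affine_grid(self._theta(t), list(self.k0.shape),
                                 align_corners=False)                       # Eq. (3)
            slices.append(F.grid_sample(slices[-1], grid, mode="bilinear",
                                        align_corners=False))               # Eq. (4)
        return torch.cat(slices, dim=0).view(self.T, self.H, self.W)        # 3D kernel
```

With shared=True this module has H · W + 4 learnable scalars, matching the W · H + 4 count above; an unfactorized kernel of the same size would have T · H · W. The returned tensor could, for example, be reshaped to (1, 1, T, H, W) and used as the weight of F.conv3d, with one such module per output kernel.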
3.2 Video-MNIST

In order to test our method we constructed a dataset, referred to as Video-MNIST, in which each class has a different appearance and dynamic behavior. Video-MNIST is a novel variant of the popular MNIST dataset. It contains 70000 sequences, each containing 30 frames that show an affine transformation applied to a single original digit moving in a 28 × 28 pixel frame. The class-specific affine transformations are restricted to scale, rotation and x, y translations; see Table 1. We maintain the same train-validation-test split as in the original MNIST dataset. To make the problem more difficult and reliant on both spatial and motion cues, classes 0, 1, 5, 7 and 9 contain random variations of their specific transformations. For classes 0, 1 and 7 the initial direction (left or right) and the initial velocity at which the digit travels per frame are varied. In class 5 the direction of rotation and the radius of the circular path are varied. In addition, we allow the digits to go partially out of frame or almost vanish (classes 2 and 6). We also made sure that there are overlapping movements between classes, such as rotation or translation in the same direction (classes 3, 4 and 8). Finally, some classes can appear visually similar because of the transformation (classes 6 and 9). Figure 2 shows one example of each class.

Fig. 2. One example from each Video-MNIST class. Instead of displaying the full 30-frame sequence we display 15 frames, skipping every other frame, so that the figure fits within the margins of the page.

Table 1. Class-specific affine transformation applied to the original MNIST digits.

digit  transformation description                            parameter(s)
0      moves horizontally                                    tx
1      moves vertically                                      ty
2      scales down and then up                               s
3      rotates clockwise                                     r
4      rotates counter-clockwise                             r
5      moves along a circular path                           tx, ty
6      scales up while rotating clockwise                    s, r
7      moves horizontally while rotating counter-clockwise   tx, r
8      rotates clockwise and then counter-clockwise          r
9      random rotation and horizontal+vertical movements     r, tx, ty

3.3 Model architectures

We use the LeNet-5 architecture [10] as it is a good starting point for training a model on a variant of the MNIST dataset. The original 2D convolutions are replaced with regular 3D convolutions and 3DTT convolutions for LeNet-5-3D and LeNet-5-3DTTN respectively. The number of filters in each convolutional layer can vary, since during experimentation we noticed that better performance can be achieved by either increasing or reducing the number of filters in the convolutional layers, for both LeNet-5-3D and LeNet-5-3DTTN. LeNet-5-3D serves as the baseline model.

3.4 Training and inference

Training. All models are optimized using SGD with a momentum of 0.9. Depending on the model, the starting learning rate varies from 1e−8 to 5e−9. The models are trained for a total of 100 epochs, where every 10th epoch the learning rate decreases exponentially if the validation accuracy has not improved. We noticed that 100 epochs provides a good time window for the models to converge. Generally a batch size of 20 is used, unless we are training on only 10 videos, in which case a batch size of 10 is used. The LeNet-5-3D model weights, as well as all fully connected layers, are initialized with a Kaiming-uniform scheme [3] (the default setting in PyTorch). LeNet-5-3DTTN initializes K_0 with weights sampled from a Gaussian distribution (experimentally this gave the best results, although there was very little difference between the different types of initialization). In our main experiments we use a parameterization of Θ with the following initialization: s = 1, r = 0, t_x = 0 and t_y = 0.

Replication of video selection. Each model is run 30 times (runs) with the same initialization parameters but with different randomly initialized weights for the convolutional and fully connected layers. The training data is randomly selected across different runs by using a seed. The seed ensures that the same videos are chosen again when we execute the same run with a different model or with different initialization parameters. This way we compare only the differences between model architectures and parameters, without confounding our results with video variance. Given that we experiment with very few videos, we make sure that the classes are represented equally in the randomly selected training data.
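To illustrate this selection protocol, the sketch below draws a class-balanced subset of training videos that is identical whenever the same seed is used; the function name, the NumPy generator and the placeholder labels are our own assumptions rather than the authors' code.

```python
# Illustrative sketch of seeded, class-balanced video selection: with a fixed seed the
# same videos are drawn for every model in a run, so results differ only in architecture
# and weight initialization, not in the sampled training videos.
import numpy as np


def select_training_videos(labels: np.ndarray, videos_per_class: int, seed: int) -> np.ndarray:
    """Return indices of a class-balanced subset, reproducible for a given seed."""
    rng = np.random.default_rng(seed)
    chosen = []
    for c in np.unique(labels):
        class_idx = np.flatnonzero(labels == c)
        chosen.append(rng.choice(class_idx, size=videos_per_class, replace=False))
    return np.concatenate(chosen)


# Example: 1 video per class, i.e. 10 training videos in total, from placeholder labels.
labels = np.repeat(np.arange(10), 6000)   # stand-in for the Video-MNIST training labels
idx = select_training_videos(labels, videos_per_class=1, seed=0)
```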
Inference. Model selection is based on accuracy on the validation split. The 30 models in the run with the highest average accuracy are then evaluated on the test split. Each run is essentially the same model with the same hyperparameters but with different randomized weight initializations for the convolutional and fully connected layers. In the end, the test results of the 30 models are averaged and the standard error of the results is calculated. The final results can be seen in Figure 3.

3.5 Setup

To test whether our method can outperform conventional 3D convolutions on very few data points, we train each separate model on a different number of training videos. This way we can test how data-efficient our method is. The total number of training videos is varied from low to high: 10, 20, 30, 40, 50, 100, 500, 1000, 2000, 5000. Model selection is performed separately for each number of training videos; the models trained on 10 videos are different from the models trained on 20 videos. After model selection based on the validation split, the models from the best run perform inference on the test split.

4 Results

Table 2 and Figure 3 show that our method significantly outperforms the conventional 3D convolution in the low data regime. However, when we have ample training data the conventional 3D convolution outperforms our method, as is to be expected. It is worth mentioning that, in general, our method uses fewer parameters and still achieves reasonable results in all settings.

Table 2. Detailed results of LeNet-5-3D vs. LeNet-5-3DTTN on Video-MNIST.

                   LeNet-5-3D                        LeNet-5-3DTTN
train    mean      standard   model        mean      standard   model
videos   accuracy  error      parameters   accuracy  error      parameters
10       0.374     0.0122     375668       0.477     0.0139     444966
20       0.498     0.0131     496430       0.588     0.0108     348612
30       0.625     0.0107     496430       0.668     0.0091     444966
40       0.671     0.0095     496430       0.721     0.0079     348612
50       0.750     0.0083     496430       0.773     0.0067     444966
100      0.837     0.0059     496430       0.829     0.0073     348612
500      0.960     0.0025     496430       0.925     0.0058     222940
1000     0.976     0.0014     496430       0.968     0.0020     254058
2000     0.988     0.0007     496430       0.972     0.0022     222940
5000     0.994     0.0003     496430       0.988     0.0010     222940

Fig. 3. A comparison of the model using regular 3D convolutions with the model using our factorized 3D convolutions on the entire test split of Video-MNIST (accuracy as a function of the number of training videos). An asterisk indicates a significant difference between the models.

5 Conclusion

We propose a novel factorization method for 3D convolutional kernels. Our method factorizes the 3D kernel along the temporal dimension and provides a way to learn the 3D kernel through transformations of a 2D kernel, thereby greatly reducing the number of parameters needed. We demonstrate that our method significantly outperforms the conventional 3D convolution in the low data regime (10 to 50 training videos), yielding 0.58 vs. 0.65 average accuracy for LeNet-5-3D and LeNet-5-3DTTN respectively. Additionally, our model achieves competitive results in the high data regime (> 100 training videos), with 0.95 vs. 0.94 average accuracy for LeNet-5-3D and LeNet-5-3DTTN respectively, using up to 45% fewer parameters. Hence, 3DTTNs provide a useful building block when estimating models for video processing in the low data regime.
In future work, we will explore in which real-world problem settings 3DTTNs outperform their non-factorized counterparts.

References

1. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. In: International Workshop on Human Behavior Understanding, pp. 29–39. Springer (2011)
2. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, vol. 1. MIT Press (2016)
3. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
4. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)
5. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), 221–231 (2012)
6. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
7. Lea, C., Reiter, A., Vidal, R., Hager, G.D.: Segmental spatiotemporal CNNs for fine-grained action segmentation. In: European Conference on Computer Vision, pp. 36–52. Springer (2016)
8. Lea, C., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks: A unified approach to action segmentation. In: European Conference on Computer Vision, pp. 47–54. Springer (2016)
9. LeCun, Y.: Generalization and network design strategies. In: Connectionism in Perspective. Elsevier (1989)
10. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
11. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541 (2017)
12. Sun, L., Jia, K., Yeung, D.Y., Shi, B.E.: Human action recognition using factorized spatio-temporal convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4597–4605 (2015)
13. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459 (2018)
14. Varol, G., Laptev, I., Schmid, C.: Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), 1510–1517 (2017)
15. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702 (2015)