<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Computer Science and Technology</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CVPR.2016.90</article-id>
      <title-group>
        <article-title>Post-Train Adaptive MobileNet for Fast Anti-Spoofing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kostiantyn Khabarlak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Computing, Anti-Spoofing, Computer Vision</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dnipro University of Technology</institution>
          ,
          <addr-line>D. Yavornytskoho Av., 19, Dnipro, 49005</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Neural Network Adaptation</institution>
          ,
          <addr-line>Post-Train Adaptive, Inference Speed, Mobile Computing, Edge</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>22</volume>
      <issue>1</issue>
      <fpage>27</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>Many applications require high accuracy of neural networks, as well as low latency and user data privacy guarantees. Face anti-spoofing is one such task. However, a single model might not give the best results for different device performance categories, while training multiple models is time-consuming. In this work we present the Post-Train Adaptive (PTA) block. Such a block is simple in structure and offers a drop-in replacement for the MobileNetV2 Inverted Residual block. The PTA block has multiple branches with different computation costs. The branch to execute can be selected on-demand and at runtime, thus offering different inference times and configuration capability for multiple device tiers. Crucially, the model is trained once and can be easily reconfigured after training, even directly on a mobile device. In addition, the proposed approach shows substantially better overall performance in comparison to the original MobileNetV2, as tested on the CelebA-Spoof dataset. Different PTA block configurations are sampled at training time, which also decreases the overall wall-clock time needed to train the model.</p>
      </abstract>
      <kwd-group>
        <kwd>Neural Network Adaptation</kwd>
        <kwd>Post-Train Adaptive</kwd>
        <kwd>Inference Speed</kwd>
        <kwd>Mobile Computing</kwd>
        <kwd>Edge Computing</kwd>
        <kwd>Anti-Spoofing</kwd>
        <kwd>Computer Vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Convolutional neural networks have shown extraordinary performance in computer vision tasks. While the initial research focused purely on quality regardless of computation cost, the modern research trend is to design fast yet accurate neural networks; in significant part this trend has been motivated by the requirements of low-latency data processing, user data privacy, and reduced server load. In addition to the fact that mobile and IoT devices offer significantly less computational power, typically several generations or price categories of such devices should be considered, yet the architecture of most modern neural networks can only be configured before training and not after. This leaves us with two alternatives: 1) to train a separate network for each device category, which requires more time and effort; 2) to design a single architecture which targets either high-end devices and high quality, or compatibility with all device generations at the cost of accuracy. Both of these solutions are suboptimal.</p>
      <p>Real-time face anti-spoofing is one of the algorithms that is preferable to perform directly on a mobile device. The anti-spoofing task is to distinguish whether the user shows their real, live face, or a recording of someone else’s. The problem is complicated by the plethora of ways a spoofing attack can be performed, such as a printed face image, poster, video, face mask, etc. Anti-spoofing can be found as a component in face-based access control systems, where it is not acceptable for access to be granted to an unauthorized person holding someone’s photograph.</p>
      <p>In this work we propose the Post-Train Adaptive (PTA) block, which is simple in structure and offers a drop-in replacement for the MobileNetV2 Inverted Residual block. The PTA block has multiple branches with different computation costs. The training procedure is constructed in such a way that the branch to infer on in a fully-trained network can be selected on-demand and at runtime, thus offering a way to change network inference speed and to target multiple device tiers. The block configuration choice can be made based on device speed, system load, desired power consumption or target quality. Crucially, the model is trained once and can be easily reconfigured after training, even directly on a mobile device.</p>
      <p>To summarize, our main contributions are as follows:
1. We introduce the Post-Train Adaptive block for the MobileNetV2 network, which is capable of
switching between different performance/quality levels after being trained and at runtime, directly
on a mobile device.
2. We demonstrate the superiority of the proposed approach over the original MobileNetV2 network
both in terms of quality and in terms of inference speed on multiple mobile devices on the face
anti-spoofing problem. The qualitative metrics are provided on the CelebA-Spoof dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Overview</title>
      <p>The initial very deep convolutional network research focused on finding exact configurations for convolution blocks (including kernel size and stride), pooling type and activation functions. Each block of these networks was “plain”, i.e. contained a single branch, as in the VGG network [1], up to 19 layers deep. It has been noticed that, in general, deeper neural networks have better performance and overall generalization capability; however, it has turned out that building even deeper networks faces a vanishing gradient problem, and the training barely proceeds. In [2] a training experiment was conducted for plain networks of different depth. The first network contained 20 layers, the second was constructed by adding more layers for a total of 56 layers. It was expected that, if the newly added layers provide no additional benefit, they could learn to produce an identity mapping; hence, the deeper network should, in theory, show accuracy no worse than the shallower network. Yet the experiment showed that the accuracy of the deeper network was much worse. To counteract the vanishing gradient problem, the authors of the ResNet network [2] suggested using an extra identity connection between groups of blocks. Such a connection has been termed a skip or residual connection. They also introduced a Bottleneck block, that is, a group of 3 convolutions with kernel sizes of 1 × 1, 3 × 3, 1 × 1. To limit the required computation, the 1 × 1 convolutions reduce and then restore the number of channels, so that the heavier 3 × 3 convolution processes a smaller input (hence the name “bottleneck”). A skip connection is used in this block, so that the input to the first convolution is added to the result of the whole block.</p>
      <p>The authors of the widely used MobileNetV2 [3] architecture improve on the ideas previously proposed in the ResNet architecture. They introduce an Inverted Residual Block which, on the contrary, has a small channel count in its inputs and outputs, but more channels inside the block. The authors note that such a design is more memory efficient than that of the original ResNet. To keep the number of computations low, lightweight depthwise convolutions are used inside the block. The network has a configurable width parameter, by changing which it is possible to tune the network’s computational complexity. A more detailed description of the inverted bottleneck block is provided in the next section.</p>
      <p>Subsequent works have improved on these ideas in several directions. In the Squeeze-and-Excitation Network (SENet) [4] an attention mechanism has been applied to improve the quality of network predictions. In neural networks, attention is used to selectively gate information flowing through the network, so that only the most important components of the signal flow forward. In MnasNet [5], an approach for automated neural architecture search for mobile or embedded devices has been proposed. During the architecture selection process, the best network was chosen based on inference speed on an actual mobile device. MobileNetV3 [6] has also improved on the previous approaches by using network architecture search, attention mechanisms and a novel activation function; large and small configurations have been proposed. Mobile neural network inference is important for face-related processing [7] and many other tasks.</p>
      <p>Anti-spoofing is used to protect camera-based access control systems from unauthorized access based on someone else’s photograph; such systems can also be executed directly on mobile devices [8]. The above-described networks can also be used for anti-spoofing. In general, anti-spoofing can be performed based on the RGB signal from a conventional camera, or on infrared or depth information from special hardware. For instance, the CASIA-SURF [9] dataset has video information for all three modalities. Using them together improves overall quality, but depth or infrared information is typically not available and requires special hardware. Therefore, in this work we focus on algorithms that use the RGB signal only. Using the RGB signal, anti-spoofing can still be performed by finding color and shape distortions. This is different from image classification, where the shape (and not its distortions) is of more importance. In [10] it was proposed to replace the convolution operation with Central Difference Convolution, which better captures color gradients to improve anti-spoofing performance. In [11] the AENet network was introduced with ResNet as a backbone. The authors utilize the rich annotations of the CelebA-Spoof dataset (presented in the same work) to improve network training. Face attribute information (e.g., smile, sunglasses, etc.), photo illumination conditions, as well as depth and reflection information are used to form a single multi-task loss. Depth and reflection information is not inherently present in the dataset; hence, the authors propose to infer it from the RGB image using an auxiliary neural network. The extra information is used during training and is not required during inference. We follow [11], [12] and also use the CelebA-Spoof dataset in this work, as it is, to the best of our knowledge, the largest anti-spoofing dataset to date.</p>
      <p>While many of the above-described networks offer a capability of configuration change, either through separate small/large configurations or through a width parameter, the architecture design must be completed prior to training. No capability to change the network configuration after it has been trained is proposed in these networks. In this work we focus on improving the MobileNetV2 architecture, as it is one of the most widely used networks on mobile devices.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <p>The main building block of the MobileNetV2 network is the Inverted Residual Block. This block starts processing the input with a 1 × 1 convolution that expands the number of channels. The expansion is controlled by an expansion factor; the authors propose setting it to 6 for all hidden layers. The convolution is then followed by a Batch Normalization [13] and ReLU6 activation layer. Next, a 3 × 3 Depthwise Convolution is applied, followed again by Batch Normalization and ReLU6. In contrast to the ordinary convolution, the depthwise convolution computes each output channel based on a single input channel to reduce the required computation. Finally, a 1 × 1 convolution with Batch Normalization is applied to shrink the channel count back to the original. The final output is summed element-wise with the input (the abovementioned skip connection). The use of such a block has allowed the authors to significantly reduce the number of multiply-add operations in the network while retaining good quality. The model also has a width multiplier, by changing which it is possible to adjust the overall number of multiply-add operations. However, it is not possible to adjust the width of the model after training.</p>
      <p>The aforementioned Inverted Residual Blocks are typically repeated several times with the same number of input and output channels. In this work we propose to change the number of inverted residual blocks required for model inference based on user demand and after the model training is complete. For that we introduce the Post-Train Adaptive (PTA) block, whose architecture is depicted in Figure 1. The PTA block has 2 branches: the right (heavy) branch is more computationally expensive and is fully equivalent to a pair of Inverted Residual Blocks; the left (light) branch reduces the computation by executing only a single Inverted Residual Block. The branch to be executed is selected based on user configuration and can be changed dynamically at runtime. It is possible to execute either branch exclusively or both at the same time. If both branches are executed, their outputs are averaged element-wise, so that the feature distribution remains the same. The weights are not shared between any of the blocks. We propose to replace the three pairs of Inverted Residual Blocks with the largest number of channels with PTA blocks in the MobileNetV2 architecture, as shown in Table 1.</p>
      <p>To train such a model, at each iteration we randomly sample a configuration of PTA blocks, and perform a forward and then a backward pass, updating the weights. To avoid excessive randomness in the model, we limit the number of possible configurations to 5: all blocks execute the heavy branch; exactly one of the blocks executes the light branch, while the others execute the heavy one; all of the blocks execute the light branch. Configuration sampling is not performed uniformly; we follow the intuition that paths with a larger number of weights should be trained for longer and thus assign higher sampling probabilities to such configurations. The exact sampling probabilities are shown in Table 2. Note that we do not execute both branches at the same time during training. All configurations missing from Table 2 are assumed to be never sampled and trained on. The network is trained with the cross-entropy loss</p>
      <disp-formula id="eq-loss"><tex-math>L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \frac{\exp(z_{i,c})}{\sum_{j=1}^{C} \exp(z_{i,j})},</tex-math></disp-formula>
      <p>where N is the number of samples in a mini-batch, C = 2 is the number of classes, z<sub>i,c</sub> is the model logit output for item i and class c, and y<sub>i,c</sub> equals 1 if item i belongs to class c and 0 otherwise. The Adam [14] adaptive gradient descent method with a learning rate of 10<sup>−4</sup> is used as the optimizer.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>To train and evaluate the model we use the recent CelebA-Spoof [11] dataset. To the best of our knowledge, this is the largest anti-spoofing dataset available to date. Overall, it contains 625,537 pictures (including both spoof and live photos) of 10,177 subjects. Photos are captured with different lighting, environment conditions and different cameras. Only RGB photo information is available in the dataset. Several spoof attack types are considered in the dataset, such as printed full-frame photos, paper-cut photos, a replay attack when the picture is presented on a tablet or a phone, and what the authors call a 3D mask, when a printed image is overlaid on top of a human face. In addition to the binary spoof/non-spoof label, the dataset contains rich information about spoof type, illumination condition and environment label, as well as face attribute labels (smile, mustache, hat, eyeglasses, etc.). At this point we use only the binary spoof/non-spoof information, with a possibility of extending our model in the future. The dataset defines train/test splits and several evaluation protocols. Our results are given for the intra-test protocol, which is used for general model evaluation. We also randomly split the training subset into actual training and validation parts in an 80/20 ratio. Training is performed for 20 epochs. The best model is then selected based on the validation set. Gradient computation is performed on mini-batches of 32 images. The results are reported on the test set.</p>
      <p>We also crop images based on the face bounding boxes that are provided for each image in the CelebA-Spoof dataset. The resulting face image is then resized to a resolution of 128 × 128. We feed color (RGB) images to the model. At training time, color jitter and ISO-noise augmentations are used. Note that no ImageNet pretraining has been used; the models are trained from scratch.</p>
      <p>We follow [15], [16] and use the following metrics for model quality evaluation in our paper. Accuracy is the proportion of correctly classified images to the overall number of images:</p>
      <disp-formula id="eq1"><label>(1)</label><tex-math>\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.</tex-math></disp-formula>
      <p>Attack Presentation Classification Error Rate (APCER) is the proportion of attack images incorrectly classified as normal images:</p>
      <disp-formula id="eq2"><label>(2)</label><tex-math>\mathrm{APCER} = \frac{FN}{TP + FN}.</tex-math></disp-formula>
      <p>Bona Fide Presentation Classification Error Rate (BPCER) is the proportion of normal (bona fide) images incorrectly classified as attack images:</p>
      <disp-formula id="eq3"><label>(3)</label><tex-math>\mathrm{BPCER} = \frac{FP}{TN + FP}.</tex-math></disp-formula>
      <p>Average Classification Error Rate (ACER) is the average of APCER and BPCER:</p>
      <disp-formula id="eq4"><label>(4)</label><tex-math>\mathrm{ACER} = \frac{\mathrm{APCER} + \mathrm{BPCER}}{2},</tex-math></disp-formula>
      <p>where TP is True Positive, that is, the sample is labelled as spoof and the prediction is also spoof; TN is True Negative, meaning both the prediction and the true label are non-spoof; FP is False Positive, i.e., the prediction is spoof while the image is non-spoof; finally, FN is False Negative, the prediction is non-spoof but the actual image is spoofed.</p>
      <p>As in this paper we not only target adaptivity and quality, but also inference speed on a mobile device, we have selected a pair of Android smartphones for testing. The devices are based on the Qualcomm Snapdragon 845 and Snapdragon 800 CPUs, the flagship mobile processors from 2018 and 2013 respectively. In terms of modern-day processors, the former can be thought of as a mid-to-high-end CPU, and the latter as a low-end CPU. These processors are found in many devices; thus, our results can be easily reproduced. Also, in this way we conduct testing on the major CPU performance categories.</p>
      <p>In addition, we report training time for both MobileNetV2 and MobileNetV2 with PTA blocks on
GTX 1050Ti GPU, which is an important metric for practical applications.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>For the comparison, 2 models have been trained: the original MobileNetV2 (hereinafter No PTA) and MobileNetV2 with PTA blocks (hereinafter PTA), constructed as described in Section 3. PTA-based models can be further configured after training; thus, in all of the following tables we show the configuration for which the testing has been performed. As the proposed MobileNetV2+PTA configuration consists of 3 PTA blocks, we use a 3-letter abbreviation to denote the exact configuration used. The letters H, L and B are used to denote execution of the Heavy, Light and Both branches, respectively, for each of the PTA blocks.</p>
      <p>In Table 3 we show the qualitative results as measured on the test set. For Accuracy, the higher the better; APCER, BPCER and ACER denote error rates, thus, the lower the better. The best result in each column is shown in red, the second best in blue. As can be seen, PTA-based models dominate the original MobileNetV2 (No PTA) implementation in all of the metrics. Interestingly, the PTA-HHH configuration, which is equivalent to the original model in terms of the number of parameters and multiply-add operations, is also better than the original model.</p>
      <p>In Table 4 we present a model complexity and inference time comparison. First, we show the number of parameters in each of the models (in millions). For the PTA models we configure the model after training, and then report the number of parameters that is actively used in the corresponding configuration. Next, we measure the number of multiply-add operations (in millions of operations) executed during the forward pass of the model. Also, we measure actual performance on mobile devices with the widely popular Snapdragon 845 and Snapdragon 800 processors (hereinafter SD845 and SD800, respectively), measured in milliseconds. We have described these processors in more detail in the previous section. Finally, we show the relative inference time improvement with respect to the No PTA baseline as measured on SD845. The PTA-HHH configuration has the same number of parameters and computation as No PTA and is equivalent in terms of performance on a real device. The PTA-BBB configuration uses both Light and Heavy branches in all 3 PTA blocks and thus is slightly more computationally intensive. All other configurations, which use a mix of Heavy and Light branches, are faster.</p>
      <p>As we sample Light and Heavy PTA configurations during training, it is expected that the overall training time for the PTA-based model should decrease. Our experiments validate this assumption. In Table 5, we present the training time for each of the models, together with the best Accuracy and ACER achieved by each of them. We show the epoch training time in minutes, and the overall training time for 20 epochs in hours. Note that the model with PTA blocks is further configured after training; thus, only a single MobileNetV2+PTA has been trained for all the configurations. As can be seen, the PTA-based model is better in terms of quality, inference and training time.</p>
      <p>In Figure 2 we show the validation ACER during training for the original MobileNetV2 (solid blue line) and MobileNetV2+PTA (dashed orange line). For the PTA model we validate the PTA-HHH configuration, which is equivalent to the original MobileNetV2 in terms of the number of parameters and multiply-adds. As is clearly seen, MobileNetV2+PTA has a better validation ACER throughout the training process.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The key goal of this work is to make it possible to reconfigure a neural network after it has been trained. As has been shown, the Post-Train Adaptive block proposed in this work is an efficient way for post-train network configuration. Placing only 3 PTA blocks in MobileNetV2 has made it possible to adaptively adjust inference time from 107% to 80% of the original MobileNetV2 (see Table 4). The simplicity of the PTA block has allowed us to implement MobileNetV2 with PTA for inference on mobile devices with different CPUs: the high-end Snapdragon 845 and the low-end Snapdragon 800. On the latter, the inference speed improvement over the original MobileNetV2 is over 18 milliseconds, which is significant for a performance-limited device. The best inference speed is offered by the PTA-LLL model, where all three PTA blocks use the light branch only. Interestingly, the second-best inference time is achieved by the PTA-LHH configuration with 100.63 MOps and not by PTA-HLH with 96.50 MOps: PTA-LHH is faster than PTA-HLH by 4.72 ms on SD800.</p>
      <p>PTA-based configurations have better quality as well. The MobileNetV2 (No PTA) and PTA-HHH configurations have the same number of parameters and multiply-add operations, but PTA-HHH is better in every metric, as seen from Table 3. In this case, the only difference is in the training procedure: PTA-based models sample different network configurations during training. Consequently, we suggest that the proposed training procedure has a positive impact on overall model quality.</p>
      <p>The validation ACER comparison depicted in Figure 2 shows that PTA-HHH is better than the No PTA model throughout the training procedure. For instance, after a single training epoch these models have achieved an ACER of 6.46% and 9.3% for PTA-HHH and No PTA respectively, meaning the PTA-based model starts to train significantly faster. The final validation ACER is 0.53% for PTA-HHH and 1.0% for No PTA. We suggest that sampling different block configurations during training makes the network learn more general features and offers a regularization capability. This might explain the better results of the PTA-based model.</p>
      <p>Overall, the best accuracy and BPCER are shown by PTA-LLL at 97.85% and 1.98% respectively. This is also the fastest configuration. PTA-LHH has the lowest APCER at 0.70% (a 53% relative improvement).</p>
      <p>We also investigate the possibility of using multiple branches jointly in the PTA-BBB configuration. This is the configuration that shows the best performance in average classification error rate (ACER) at 2.13%, which is a 23.5% relative improvement over the baseline. The model is also better than MobileNetV2 (No PTA) in all other metrics. The 2-branch PTA-BBB model has more parameters than the original MobileNetV2 and is slightly more computationally intensive (see Table 4).</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this work the Post-Train Adaptive block has been introduced for the first time. Such a block is simple in structure and offers a drop-in replacement for a pair of MobileNetV2 Inverted Residual blocks. Thanks to the proposed novel block we improve over MobileNetV2 for anti-spoofing in the following ways: 1) we solve the problem of the inability to change the network architecture after it has been trained. The PTA block has light and heavy branches, each of them capable of switching on and off on-demand and at runtime. Not only can each of the branches be used exclusively, but their predictions can also be averaged, forming an in-model ensemble. Therefore, a model can be reconfigured after training to better suit the target device; 2) the lightest PTA configuration shows a 20% improvement in terms of actual inference speed on a mobile device, while also having superior quality in comparison to the original MobileNetV2 architecture; 3) the anti-spoofing performance has been substantially improved, with PTA-based configurations beating the baseline in all typical anti-spoofing metrics. During training we sample different PTA configurations with different numbers of parameters. We suggest that this results in the model learning more general features and, thus, in better overall quality. All of the aforementioned improvements have been achieved with a smaller total training time in comparison to the MobileNetV2 model.</p>
      <p>Because of the significant variation in mobile and edge device computational power, a single neural network targeting several different device categories is suboptimal. The proposed approach, in contrast, allows training the model once and then adjusting its runtime speed according to device characteristics, overall system load and desired battery consumption. This makes the obtained results practically significant.</p>
      <p>While the MobileNetV2 with PTA blocks architecture is applicable to any problem where the original MobileNetV2 is, in this work we have investigated only a single (yet important) practical application, that is, mobile face anti-spoofing. In future work we will expand our exploration to other applications and will improve PTA block performance and quality even further.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgments</title>
      <p>The work is supported by the state budget scientific research project of Dnipro University of
Technology “Development of new mobile information technologies for person identification and object
classification in the surrounding environment” (state registration number 0121U109787).</p>
    </sec>
    <sec id="sec-9">
      <title>9. References</title>
      <p>[1] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image
Recognition, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego,
CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL:
http://arxiv.org/abs/1409.1556.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>