<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Virtual</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Self-Supervised Deepfake Detection by Discovering Artifact Discrepancies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kai Hong</string-name>
          <email>kai hong@njust.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoyu Du</string-name>
          <email>duxy@njust.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nanjing University Of Science And Technology</institution>
          ,
          <addr-line>Nanjing, 210014</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>State Key Laboratory of Communication Content Cognition</institution>
          ,
          <addr-line>Beijing, 100733</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Recent works demonstrate the significance of textures for neural deepfake detection methods, yet the reason is still under exploration. In this paper, we claim that the artifact discrepancies caused by the face manipulation operations are the key difference between pristine videos and deepfakes. To imitate the discrepant situation from pristine videos, we propose an artifact-discrepant data generator that generates negative samples by adjusting the artifacts in the facial regions with conventional processing tools. We then propose the Deepfake Artifact Discrepancy Detector (DADD) to discover the discrepancies. DADD adopts a multi-task architecture, associates each sub-task with a specific artifact set, and assembles all the sub-tasks for the final prediction. We term DADD a self-supervised method since it never meets any deepfakes during the training process. The experimental results on the FaceForensics++ and Celeb-DF datasets demonstrate the effectiveness and generalizability of DADD.</p>
      </abstract>
      <kwd-group>
        <kwd>deepfake</kwd>
        <kwd>self-supervised</kwd>
        <kwd>artifact discrepancies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Videos were a natural and convincing medium to spread information due to their abundant and strongly co-associated details, including appearances, actions, sounds, etc. This situation has changed with the emergence of deepfakes, the model-synthesized media in which the face or voice may be replaced with someone else's.</p>
        <p>The synthetic videos have negative impacts on individuals and society. Moreover, with the rapid development of generative techniques, the procedures for making deepfakes have become substantially simpler, while the products look more realistic. This situation facilitates many domains, e.g., the film industry, but potentially increases the probability of social issues. Therefore, deepfake detection methods have garnered widespread attention.</p>
        <p>Recent deepfake detection methods are mainly devised from two perspectives. The first is taken by the bio-inspired methods, which build on observations and intuitive hypotheses over the datasets: Li et al. [1] focused on abnormal eye blinking; Yang et al. [2] noted the inconsistency between facial expressions and the corresponding head postures; Qi et al. [3] magnified the heart rhythm signal in videos and detected its disruption; Li et al. [4] located the blending boundaries made by facial replacement methods to perform the detection. The second perspective is to capture the forged features via neural networks, including customized deep networks [5], classic neural networks [6], etc.</p>
        <p>The neural methods achieve extremely high performance [6], but the dependence on the training datasets severely limits model generalizability, which is very important in practical applications. For instance, well-trained models may not work across datasets [7, 8], since deepfakes are made by a variety of methods.</p>
        <p>To retain model effectiveness across datasets, traditional measures including data augmentation [9] and transfer learning [7] have been introduced. However, these methods hardly reveal the inherent difference between pristine videos and deepfakes. To address this issue, the self-supervised learning scheme is introduced to produce negative samples as substitutes for true deepfakes, making the model learn specific features [10, 4].</p>
        <p>The negative samples rely on manual hypotheses about the differences between pristine videos and deepfakes, facilitating the construction of interpretable detection methods. Two typical works are FWA [10] and Face X-ray [4]: the former assumes that the artifacts are caused by the resizing and blurring operations on the facial regions, while the latter believes that deepfakes always have unseen blending boundaries. Their results demonstrate that recent neural networks mostly focus on generic visual artifacts rather than the videos themselves. Therefore, the negative samples generated with intuitive and empirical operations can facilitate the detection model and further enhance its generalizability.</p>
        <p>In addition, many works point out that videos and images have inherent signals like fingerprints, which are produced by the devices, the post-processing or the generative models [11, 12]. Inspired by these works, we make a bold hypothesis that the artifact discrepancies caused by the face manipulation operations are the key to detect deepfakes. Intuitively, all the frames in a pristine video go through the same operation flow, thus they should have consistent fingerprints (i.e., artifacts). In contrast, the replaced facial regions in deepfakes inevitably introduce discrepant artifacts. Focusing on the discrepant artifacts, we propose a self-supervised deepfake detection approach which comprises an Artifact-Discrepant Data Generator (ADDG) and a Deepfake Artifact Discrepancy Detector (DADD) to discover the discrepancy from the generated data. ADDG uses only the pristine video frames and perturbs the facial regions with conventional processing tools, e.g., blurring, scaling, rotation, replacement, etc. Although the perturbations do not change the frames in the human sense, we believe that they introduce a discrepancy at the artifact level. Thus the perturbed frames are taken as the negative samples (i.e., substitutes of deepfakes) in our approach. DADD adopts the multi-task learning scheme, associates each sub-task with a type of generated data, and assembles all the sub-tasks for the final prediction. The prediction is constrained by the ℓ2,1 norm [13, 14], a classic regularization for feature selection. The experimental results on the public datasets demonstrate that the model trained on the generated data can achieve a competitive performance, even though it never sees real deepfakes. This verifies the effectiveness and generalizability of our approach, and reveals that our hypothesis is a feasible perspective from which to detect deepfakes.</p>
        <p>The main contributions of our work are as follows:</p>
        <p>• We hypothesize that the artifact discrepancies caused by the face manipulations are the key to detect deepfakes, and thus propose a self-supervised deepfake detection approach to discover the discrepancy. The core is the Artifact-Discrepant Data Generator, which uses the pristine video frames only and perturbs the facial region with conventional processing tools to generate the negative samples.</p>
        <p>• To better address the artifact discrepancies, we propose the Deepfake Artifact Discrepancy Detector, which adopts the multi-task learning scheme, associates each sub-task with a type of generated data, and makes the final prediction by integrating the sub-tasks. To guide the task feature selection, we adopt the ℓ2,1 norm to constrain the learning process.</p>
        <p>• Extensive experiments are conducted to demonstrate the effectiveness and generalizability of our proposed self-supervised approach, though it never sees any real deepfakes during the training process.</p>
        <sec id="sec-1-1-1">
          <title>2. Related Work</title>
          <p>Bio-inspired methods. Some works have found that the actors' physiological characteristics in deepfakes are different from the real world. Li et al. [1] found that the actors in deepfakes have an abnormal blinking frequency and some even don't blink. Yang et al. [2] found that face orientation and head poses are related, but the correlation is destroyed in deepfakes. Due to the development of remote visual photoplethysmography (rPPG) technology, the heart rate of actors in videos can be detected [15]. Based on this technology, Qi et al. [3] found the irregular heart rhythm of actors in deepfakes. Similarly, Ciftci et al. [16] explored the biological signal difference between fake videos and real videos. However, the physiological signal artifacts reflected by different datasets are different, so specific data needs specific analysis.</p>
          <p>Neural methods. Since deep neural networks can automatically extract images' deep features, many DNN-based detection methods have achieved satisfactory results. Zhou et al. [17] divided the image into different patches, and proposed a two-stream network to detect the difference between patches. Afchar et al. [5] proposed a compact network structure, MesoNet, to detect fake videos. Nguyen et al. [18] proposed the use of capsule networks for deepfake detection. These methods indicate that a simple CNN can indeed capture the relevant features of fake videos. In addition to these detection methods based on single-frame images, there are also methods based on multi-frame sequences. Guera et al. [19] extracted features from each frame using a CNN, then made decisions on the feature sequence using an RNN. To better capture the correlation of different frame features, Sabir et al. [20] used a bi-directional RNN. These neural methods can detect specific deepfakes perfectly [6], but on unseen data the detection performance is greatly reduced [8].</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Cross-data methods</title>
        <p>Recently, the generalizability of detection methods has been emphasized. Xuan et al. [21] preprocessed training images to reduce obvious artifacts, forcing models to learn more intrinsic features. Cozzolino et al. [22] introduced an auto-encoder method that enabled real and fake images to be decoupled in latent space. Du et al. [7] believed that the detection model needs to focus on the forgery area rather than irrelevant regions, so they located the modified region and proposed an active learning method. Nirkin et al. [23] believed that the face and the context of a fake image carry inconsistent identity information, so they used a face recognition method to detect deepfakes. However, these methods still require the corresponding fake videos to complete the training, resulting in limited generalizability; different amounts of data are bound to produce different results [24].</p>
        <p>[Figure 1: The pipeline of ADDG. A pristine frame X is perturbed with one of the strategies (1. GaussBlur, 2. Scaling, 3. ISONoise, 4. Rotation, 5. SB-Rand, 6. SB-Sim) to produce the positive sample XP; deformed landmarks drive the mask generation (random selection among candidate masks, optionally reversed) to produce the mask M; the frames are then blended into the negative sample XN.]</p>
        <p>Another novel idea is not to use any fake image during training. FWA [10] expected to simulate face warping artifacts by adjusting the face area to different sizes and blurring it to produce similar texture artifacts. Face X-ray [4] dynamically generated images with boundary information during training. Zhao et al. [25] also used Face X-ray's method of generating training data and proposed a model for learning the consistency of different patches. Therefore, discovering the common steps of generation methods can facilitate the generalizability of the model.</p>
        <sec id="sec-1-2-2">
          <title>3. Method</title>
          <p>The images from different sources have different fingerprints, which are caused by the devices, the post-processing operations and the generative models. The fusion of two different images leads to artifact discrepancies. This would be the key feature of deepfakes, since deepfakes always have manipulated facial regions. Therefore, we propose the Artifact-Discrepant Data Generator (ADDG). In order to better address the artifact discrepancy, we propose the Deepfake Artifact Discrepancy Detector (DADD), which adopts the multi-task learning scheme to learn the features from each type of discrepancy data, respectively, and makes the final prediction by incorporating the sub-tasks. Finally, considering that the proposed perturbations have differing impacts, we introduce the ℓ2,1 regularization for feature selection.</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>3.1. Artifact-Discrepant Data Generator</title>
          <p>As shown in Figure 1, ADDG takes in the pristine image and generates the negative sample (i.e., the artifact-discrepant sample) with three modules: frame perturbation, mask generation, and negative sample synthesis. Frame perturbation uses common image processing tools to change the fingerprint of the pristine frame, like data augmentation; mask generation selects the perturbation area; and negative sample synthesis blends the pristine and perturbed frames to produce the discrepant artifacts. We introduce the modules below.</p>
          <p>Frame Perturbation. We utilize one of the conventional image processing methods to change the fingerprint of the pristine frames, while ensuring all the frames are still pristine. In this work, we use GaussBlur, Scaling, ISONoise, Rotation, SB-Rand, and SB-Sim as the perturbations:</p>
          <p>• GaussBlur is a commonly used data augmentation method in deepfake detection [21].</p>
          <p>• Scaling refers to zooming out and then zooming in the image, which will change the texture.</p>
        </sec>
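        <p>As an illustration of the Scaling perturbation above, here is a minimal NumPy sketch (our own illustration, not the authors' code; the nearest-neighbour resampling is an assumption): it zooms the frame out and back in, so the content is visually unchanged while the texture statistics shift.

```python
import numpy as np

def scale_down_up(frame: np.ndarray, factor: float = 0.5) -> np.ndarray:
    """Zoom out, then zoom back in, with nearest-neighbour sampling.

    The round trip keeps the frame size but discards high-frequency
    detail, which perturbs the frame's artifact-level fingerprint.
    """
    h, w = frame.shape[:2]
    small_h = max(1, int(h * factor))
    small_w = max(1, int(w * factor))
    # nearest-neighbour indices of the down-sampled grid
    rows = (np.arange(small_h) / factor).astype(int).clip(0, h - 1)
    cols = (np.arange(small_w) / factor).astype(int).clip(0, w - 1)
    small = frame[np.ix_(rows, cols)]
    # map back up to the original resolution
    up_rows = (np.arange(h) * factor).astype(int).clip(0, small_h - 1)
    up_cols = (np.arange(w) * factor).astype(int).clip(0, small_w - 1)
    return small[np.ix_(up_rows, up_cols)]
```

A GaussBlur variant of the same idea would instead convolve the frame with a Gaussian kernel; both leave the frame pristine in the human sense.</p>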
        <p>[Figure 3: The architecture of DADD. A shared backbone feature is projected into N private features; the sub-tasks ST_1, ST_2, ST_3, ..., ST_N predict from their own private features, and the final task FT predicts from the concatenation of the N features.]</p>
        <p>• ISONoise simulates the inherent noise signal generated when the sensor captures photos; we obtain it via the Albumentations library.</p>
        <p>• Rotation's purpose is to slightly adjust the face to produce artifacts in the form of a boundary.</p>
        <p>Negative Sample Synthesis. This module produces the negative samples by synthesizing the pristine and perturbed frames according to the mask. Let X be the input pristine frame, XP be the perturbed yet pristine frame, M be the mask, and XN be the generated negative sample. The negative sample is generated by

XN = XP ⊙ M + X ⊙ (1 − M),  (1)

where ⊙ indicates the element-wise product.</p>
        <p>Finally, we list our nine categories of negative samples. We term Inner- the synthesis with the common mask, which leaves the perturbation in the foreground. Similarly, we term Outer- the synthesis with the reversed mask, which leaves the perturbation in the background. The categories without these prefixes also use the common mask. Specifically, the categories are Inner-GaussBlur, Outer-GaussBlur, Inner-Scaling, Outer-Scaling, Inner-ISONoise, Outer-ISONoise, Rotation, SB-Rand, and SB-Sim. Some generated examples are shown in Figure 2; some samples show no visible difference, which is due to the small degree of modification.</p>
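        <p>Eq. (1) above is a pixel-wise convex blend. A small NumPy sketch (our own illustration; the hard 0-1 box mask is a hypothetical example) covering both the common mask and the reversed ('Outer-') mask:

```python
import numpy as np

def synthesize_negative(x: np.ndarray, x_p: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Eq. (1): XN = XP * M + X * (1 - M), element-wise.

    m may be a hard 0-1 mask or a soft mask whose values are smooth
    near the boundary; 1 - m plays the role of the reversed mask.
    """
    return x_p * m + x * (1.0 - m)

# toy 4x4 frames and a hard box mask over the lower-right region
x = np.zeros((4, 4))        # pristine frame X
x_p = np.ones((4, 4))       # perturbed yet pristine frame XP
m = np.zeros((4, 4))
m[2:, 2:] = 1.0             # mask M

x_n = synthesize_negative(x, x_p, m)              # 'Inner-' synthesis
x_n_outer = synthesize_negative(x, x_p, 1.0 - m)  # 'Outer-' synthesis
```

With a soft mask, the same function produces the hard samples with smooth boundaries described in the mask generation module.</p>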
        <sec id="sec-1-2-1">
          <title>3.2. Deepfake Artifact Discrepancy Detector</title>
          <p>• SB-Rand and SB-Sim refer to using somebody else's frames as the perturbation. '-Rand' indicates the frame is randomly selected, while '-Sim' indicates the frame has a face similar to the pristine frame, i.e., the landmarks of the faces in the two frames are close. This operation introduces more diverse texture information.</p>
        </sec>
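        <p>The '-Sim' selection can be viewed as a nearest-neighbour search over face landmarks. A minimal sketch (our own illustration; the 68x2 landmark arrays, as produced by Dlib, and the mean-L2 criterion are assumptions, since the text only says the landmarks of the two faces are close):

```python
import numpy as np

def pick_similar_frame(target_landmarks: np.ndarray,
                       candidate_landmarks: list) -> int:
    """Return the index of the candidate frame whose face landmarks
    are closest (mean L2 distance) to the target's, i.e. the SB-Sim
    donor; SB-Rand would instead pick a candidate at random."""
    dists = [np.linalg.norm(target_landmarks - c, axis=1).mean()
             for c in candidate_landmarks]
    return int(np.argmin(dists))

# toy landmarks: candidate 1 is only a small jitter of the target
target = np.random.default_rng(0).uniform(0.0, 100.0, size=(68, 2))
cands = [target + 30.0, target + 1.0, target - 25.0]
donor = pick_similar_frame(target, cands)
```
</p>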
      </sec>
      <sec id="sec-1-3">
        <p>Note that all the operations in this module operate on the whole input, and the generated results are still pristine. Thus we denote the perturbed frame as XP.</p>
        <p>Mask Generation. This module decides the to-be-modified region in the frame. First, we locate the face landmarks in the pristine frame using Dlib to guarantee the mask is associated with the given input. Then, since analyzing the usual operations of deepfakes shows that the modified area usually occurs on the face or the mouth, we empirically select some key points and preset five candidate masks and their reverses. The shapes of the masks are presented in Figure 1. We randomly select a mask from the candidates. Because the keypoint detection may be inaccurate, and considering the generalization performance, we apply a slight random deformation to the mask. Since the key is the discrepancy between regions rather than the perturbed facial region itself, we also use the reversed region to perturb the corresponding background.</p>
        <p>The final mask is denoted as M, a matrix with the same shape as the input. We set two versions of it: the basic M is a 0-1 matrix, which has solid boundaries and may be easily recognized by the model; to generate hard samples, we also generate M with a soft boundary, where the values of M are smooth near the boundaries.</p>
        <p>The simplest yet effective measure to detect the artifact discrepancy is to train a model on the artifact-discrepant data. However, the fused dataset contains too much information, so it is hard to force the model to learn effective features. To address this problem, we propose the Deepfake Artifact Discrepancy Detector (DADD), which adopts the multi-task learning scheme to learn the characteristics of each category, and then summarizes the features to make the final prediction. The structure is shown in Figure 3. In DADD, we first extract a common feature with a CNN (e.g., Xception [26] in this work). Then, to suit the requirements of each sub-task, we project the common feature into private features. The predictions of the sub-tasks are based on these private features. We then devise a final task based on the concatenation of all the private features.</p>
        <p>To train DADD, we first train the sub-tasks in turn. When training sub-task ST_1, the data associated with ST_1 is fed. We train the sub-tasks for k iterations and then the final task for t iterations, alternately. Then the common and private features can both retain the significant features for the prediction. Eventually, in the test process, the prediction is the output of the final task.</p>
        <sec id="sec-1-3-1">
          <title>3.3. Training</title>
          <p>For all sub-tasks and the final task, we adopt the cross-entropy loss as the learning target. Let ℒst be the sub-task loss and ℒft be the final-task loss; they are defined as

ℒst = ℒft = −(1/n) ∑ᵢ₌₁ⁿ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ],  (2)

where yᵢ indicates the ground truth, ŷᵢ indicates the output of the model, and n indicates the number of samples.</p>
          <p>In addition, the purpose of DADD is to use the most suitable features; this is a feature selection task. Therefore, we introduce the feature selection regularization ℓ2,1 norm [27, 28, 29]. Formally, the ℓ2,1 regularization is

ℓ2,1(W) = ‖W‖2,1 = ∑ᵢ₌₁ᵐ √( ∑ⱼ₌₁ⁿ |Wᵢⱼ|² ),  (3)

where W represents the parameter matrix, n represents the number of columns of the matrix, and m represents the number of rows. The function of the ℓ2,1 regularization is to sparsify the rows of the parameter matrix. In our task, each row's parameters are the weights corresponding to the feature vector extracted by one sub-task. We add the ℓ2,1 regularization to the final-task training process, and the overall loss function is defined in Eq. (4).</p>
          <p>Datasets. To evaluate our approach, we leverage two datasets, FaceForensics++ [6] and Celeb-DF [8]. FaceForensics++ [6] comprises a set of pristine videos (P) and four categories of fake videos, including DeepFakes (DF), Face2Face (FF), FaceSwap (FS) and NeuralTextures (NT); each category contains 1,000 videos. The dataset publisher gives an official splitting list: 720, 140, and 140 videos of each category are used for training, validation and test, respectively. In our experiments, we extract 20 frames per video. We then adopt the training set of pristine videos only to train our model, choose the parameters according to the validation set, and evaluate the model on the test set. Celeb-DF [8] is a challenging dataset, mostly used for cross-dataset tests. There are 38 real videos and 62 fake videos in this test set, and we extract all frames from these videos. We select the model via the validation set of FaceForensics++ and evaluate the model on Celeb-DF. Note that the test data never appears in the training data, especially the deepfakes. Moreover, Celeb-DF is an independent and hard dataset. Thus the test results can demonstrate the generalizability of our method across datasets.</p>
        </sec>
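        <p>The losses of Eqs. (2)-(3) and the combined objective of Eq. (4) can be sketched in NumPy as follows (our own illustration; the epsilon clipping is a numerical-safety assumption, not part of the formulas, and W stands for the final-task weights over the concatenated private features):

```python
import numpy as np

def bce(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Eq. (2): binary cross-entropy averaged over n samples."""
    eps = 1e-12  # numerical safety only
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def l21_norm(w: np.ndarray) -> float:
    """Eq. (3): sum over rows of the row-wise L2 norms. Penalizing it
    drives whole rows (one row per sub-task feature block) to zero,
    which acts as a feature selection over the sub-tasks."""
    return float(np.sqrt((w ** 2).sum(axis=1)).sum())

def total_loss(y, y_hat, w, lam=0.1) -> float:
    """Eq. (4): final-task loss plus the weighted l2,1 regularizer."""
    return bce(y, y_hat) + lam * l21_norm(w)
```
</p>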
      </sec>
      <sec id="sec-1-4">
        <title>Algorithm 1: Multi-Task Learning Framework</title>
        <p>Input: training images X;
1 repeat
2   for i = 0 to k do
3     for j = 1 to N do
4       Generate XP(j), XN(j);</p>
      </sec>
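      <p>As we read the recovered fragment of Algorithm 1, training alternates between the sub-tasks and the final task. A schematic Python skeleton of that loop (our own reconstruction; generate_pair, train_subtask and train_final_task are hypothetical placeholders for ADDG and the two kinds of update steps):

```python
def multitask_training(categories, generate_pair, train_subtask,
                       train_final_task, k=1, t=1, epochs=1):
    """Alternate k passes over the N sub-tasks (each fed only with its
    own category of generated data) with t passes of the final task,
    repeating for a fixed number of epochs."""
    log = []
    for _ in range(epochs):                      # repeat
        for _ in range(k):                       # for i = 0 to k
            for j, cat in enumerate(categories): # for j = 1 to N
                x_p, x_n = generate_pair(cat)    # Generate XP(j), XN(j)
                train_subtask(j, x_p, x_n)
                log.append(("sub", j))
        for _ in range(t):
            train_final_task()                   # concatenated features
            log.append(("final", None))
    return log
```
</p>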
      <sec id="sec-1-5">
        <p>ℒ = ℒft + λ · ℓ2,1(W),  (4)

where ℓ2,1(W) indicates the regularization on the concatenated private features, and λ is a hyper-parameter. During training, we perform regular data augmentation on all types of data. The detailed training procedure is listed in Algorithm 1.</p>
        <p>Methods. To make a fair comparison, we introduce two recent self-supervised deepfake detection methods, FWA [10] and Face X-ray [4], which also use real frames to dynamically generate training data during training. FWA believes that GaussBlur can construct warped faces, so it uses different degrees of GaussBlur to construct negative samples. Face X-ray dynamically generates images with boundary information. In our experiments, we use '-FWA' to denote the data generated by FWA, '-BI' to denote the data generated by Face X-ray, and '-ADDG' to denote the data generated by our proposed method. We also use 'Xcep-' to denote the Xception model, 'Xray-' to denote the X-ray model, and 'DADD-' to denote our method.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiment</title>
      <sec id="sec-2-1">
        <p>In this section, we conduct extensive experiments to demonstrate the effectiveness of our approach.</p>
        <sec id="sec-2-1-1">
          <title>4.2. Performances</title>
          <p>Table 1 demonstrates the results on DF and Celeb-DF.</p>
          <p>The results marked with references are taken from the original papers. Two-stream [17] was trained on the SwapMe dataset [17]. Meso4 [5] was trained on an internal DeepFake dataset collected by the authors. HeadPose [2] was trained on the UADFV dataset [2]. For FWA, the dataset was collected from the Internet. The supervised methods perform badly when tested across data.</p>
          <p>In contrast, the self-supervised methods in the second part of Table 1 mostly do well, which reveals the significance of exploring self-supervised methods. Since FWA only considers GaussBlur to simulate the face warping during deepfake generation, its generalizability is limited: as can be seen from Xcep-FWA, only DF and FF perform slightly higher. For Xcep-BI, the results differ because the specific settings of our experiments differ from the original paper. Xray-BI and our method DADD-ADDG (ℓ2,1) perform evenly on DF, FF, FS, and NT. DADD-ADDG (ℓ2,1) has an average improvement of 0.17% on FaceForensics++, but on the more difficult Celeb-DF our method improves by 8.17%. This verifies our hypothesis on the artifact discrepancy. Since our task is to improve generalization performance, i.e., test results on completely unrelated datasets, a slight decrease in performance on FS and NT is acceptable.</p>
          <p>[Figure 4: Visual result of the feature selection implemented by the ℓ2,1 regularization (λ = 0.1). The left shows the weights without ℓ2,1; the right shows the weights with ℓ2,1.]</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>4.3. The Impact of Perturbations</title>
          <p>We present the impact of different perturbations in Figure 5. We finetune Xception on the different categories of perturbed frames, respectively.</p>
          <p>From Figure 5, we have the following observations. For DF, all methods have good performance except Outer-Scaling. For FF, the Rotation we proposed reaches the best response, indicating that FF artifacts show more edge information. For FS, the two methods that use the texture of other images to perturb the original image have the best response, indicating that it is meaningful to introduce various textures. This also explains why the blending boundary constructed by rotating does not perform as well as replacing the image. For NT, Inner-GaussBlur, Inner-Scaling, and Rotation have a high response. Compared with the other types of data, it is difficult for Celeb-DF to get a good response with a single perturbation method.</p>
          <p>The impacts of GaussBlur and Scaling are similar. When the face's interior is disturbed, they respond very well on DF, FF, and NT, while they are deficient on FS. However, when the modified area is the background, the result is the opposite: the effect is better on FS. This verifies that our model does not merely detect specific texture features but captures the difference between internal and external textures. Since the dataset is heavily compressed, a lot of information is lost. The results on the five test sets demonstrate that different perturbations benefit different types of deepfakes.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.4. The Impact of DADD</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4.5. The Impact of ℓ2,1 Regularization</title>
      <p>Table 2 demonstrates the impact of the ℓ2,1 regularization. It improves the final result by 1.51% on the cross-dataset Celeb-DF test. This indicates that the feature selection benefits the model performance. We also test the feature selection hyper-parameter λ, and log its impact in Figure 7. When λ = 0.1, the model achieves the best performance; lower λ improves the performance only by a small ratio.</p>
      <sec id="sec-3-1">
        <p>We also visualize the layer with regularization in Figure 4. The features from ST_3, ST_7, and ST_9 contribute most; the corresponding perturbations are Inner-Scaling, Rotation, and SB-Sim. This means they could be the delegates in the final prediction. Note that this doesn't mean only these three sub-tasks are necessary: their performance is based on the shared features, which are learned from all the sub-tasks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <sec id="sec-4-1">
        <p>In this paper, we made the hypothesis that the discrepant artifacts caused by the face manipulations are the key differences between pristine videos and deepfakes. To address the discrepancy, we proposed a self-supervised deepfake detection approach.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>Special thanks are given to the SSDL2021 organizing committee, and we are also very grateful to the reviewers for their valuable comments on this paper. This research was supported by the Open Funding Project of the State Key Laboratory of Communication Content Cognition (No. 20K03). The completion of this paper could not be separated from the help of the Intelligent Media Analysis Group (IMAG). The authors would also like to thank Cheng Zhuang, Jiangnan Dai and Shaocong Yang in IMAG for their valuable discussions.</p>
        <p>[26] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251-1258.</p>
        <p>[27] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, X. Zhou, ℓ2,1-norm regularized discriminative feature selection for unsupervised learning, in: IJCAI International Joint Conference on Artificial Intelligence, 2011, pp. 1589-1594.</p>
        <p>[28] L. Fu, Z. Li, Q. Ye, H. Yin, Q. Liu, X. Chen, X. Fan, W. Yang, G. Yang, Learning robust discriminant subspace based on joint ℓ2,p- and ℓ2,s-norm distance metrics, IEEE Transactions on Neural Networks and Learning Systems (2020).</p>
        <p>[29] Q. Ye, Z. Li, L. Fu, Z. Zhang, W. Yang, G. Yang, Nonpeaked discriminant analysis for data representation, IEEE Transactions on Neural Networks and Learning Systems 30 (2019) 3818-3832.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>