<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approaches for Deepfake Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Alessandro Coccomini</string-name>
          <email>davidealessandro.coccomini@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Caldelli</string-name>
          <email>roberto.caldelli@unifi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <email>andrea.esuli@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <email>fabrizio.falchi@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Gennaro</string-name>
          <email>claudio.gennaro@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Messina</string-name>
          <email>nicola.messina@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <email>giuseppe.amato@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>via G. Moruzzi, 1, 56100, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mercatorum University</institution>
          ,
          <addr-line>00186, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Inter-University Consortium for Telecommunications (CNIT)</institution>
          ,
          <addr-line>viale Morgagni, 65, 50134, Florence</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>The creation of highly realistic media known as deepfakes has been facilitated by the rapid development of artificial intelligence technologies, including deep learning algorithms, in recent years. Concerns about the increasing ease of creation and the credibility of deepfakes have been growing steadily, prompting researchers around the world to concentrate their efforts on the field of deepfake detection. In this context, researchers at ISTI-CNR's AIMH Lab have conducted numerous studies, investigations and proposals to make their own contribution to combating this worrying phenomenon. In this paper, we present the main work carried out in the field of deepfake detection and synthetic content detection, conducted by our researchers and in collaboration with external organizations.</p>
      </abstract>
      <kwd-group>
        <kwd>Deepfake Detection</kwd>
        <kwd>Synthetic Content Detection</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, there has been a rapid increase in the
development of artificial intelligence technologies,
including deep learning algorithms, that have led to the
creation of highly realistic media manipulations known
as deepfakes. Deepfakes refer to synthetic media
generated using machine learning techniques designed to
mimic the appearance and behaviour of real individuals
in videos or images, manipulating what they do and
what they say.</p>
      <p>While deepfakes have some potential positive
applications, such as in the entertainment industry, they pose
severe risks to society, including political, social, and
economic threats. For instance, deepfakes can be used
to spread disinformation, manipulate public opinion and
damage personal reputations.</p>
      <p>Given the potential harm caused by deepfakes, it is
crucial to develop effective methods for detecting and
mitigating them. In recent years, there has been a surge
in research on deepfake detection techniques, and several
deepfake detection tools have been developed. However,
reliably generalizing to manipulation techniques unseen
at training time remains an open problem.</p>
    </sec>
    <sec id="sec-4">
      <title>2. Research Works in Deepfake Detection</title>
      <sec id="sec-4-1">
        <p>In this section we present our works in deepfake detection and related fields, highlighting the contributions and discoveries made.</p>
        <sec id="sec-4-1-2">
          <title>2.1. Convolutional Cross Vision Transformer</title>
          <p>When we started to take our first steps in this field, we noticed a shortage of Vision Transformer-based deepfake detectors and, even more so, an almost total absence of hybrid architectures used for this purpose, certainly owing to their very recent advent. We therefore wished to explore this untrodden field in the paper [<xref ref-type="bibr" rid="ref1">1</xref>], in which we realised a hybrid architecture composed of a convolutional network, in particular EfficientNet-B0, and a Cross Vision Transformer. In our proposal, the latter's internal attention mechanism, instead of working on the patches extracted from the images, acts on the features obtained from the EfficientNet. The advantage lies in the fact that these features are obtained from a learnable process that is refined in the training phase, which has therefore allowed us to obtain a state-of-the-art model on different datasets. In particular, the model we named Convolutional Cross Vision Transformer proved to be state-of-the-art in terms of accuracy and AUC on both the DFDC [2] dataset (shown in Table 1) and FaceForensics++ [3], two of the main datasets used in this field, comparing with previous works like [4, 5], all with significantly fewer parameters than other solutions, and therefore lighter, thanks to the exploitation of a hybrid architecture. The article also investigated the impact of certain implementation choices, such as the number of frames considered per video and the management of multiple identities in the same scene, which then formed the basis for some subsequent work.</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption>
              <p>Video-level results of our models and other previous works on the DFDC test set. The symbol † indicates that the model uses an ensemble of 6 networks.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Model</th><th>AUC</th><th>F1-score</th><th># params</th></tr>
              </thead>
              <tbody>
                <tr><td>ViT with distillation [5]</td><td>0.978</td><td>91.9%</td><td>373M</td></tr>
                <tr><td>Selim EfficientNet B7 [6]†</td><td>0.972</td><td>90.6%</td><td>462M</td></tr>
                <tr><td>Convolutional ViT [4]</td><td>0.843</td><td>77.0%</td><td>89M</td></tr>
                <tr><td>Efficient ViT (our)</td><td>0.919</td><td>83.8%</td><td>109M</td></tr>
                <tr><td>Conv. Cross ViT Wodajo CNN (our)</td><td>0.925</td><td>84.5%</td><td>142M</td></tr>
                <tr><td>Conv. Cross ViT Ef.Net B0 - Avg (our)</td><td>0.947</td><td>85.6%</td><td>101M</td></tr>
                <tr><td>Conv. Cross ViT Ef.Net B0 - Voting (our)</td><td>0.951</td><td>88.0%</td><td>101M</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <sec id="sec-4-1-2-1">
            <title>2.2. Participation in the ICIAP 2021 competition</title>
            <p>The previous paper, presented at ICIAP 2021, was also used as the basis for participation in the Face Deepfake Detection challenge organised at that conference. During the competition, the ability of the participants' solutions to identify deepfakes "in the wild" was assessed and their generalisation capability was investigated. The latter is one of the main problems in deepfake detection: normally, deepfake detectors are very good at identifying deepfakes generated with the methods used for the creation of the training set, but practically useless when confronted with content manipulated with novel techniques. Our method, based on what was presented in the previous section, placed fourth in the ranking, and we subsequently produced a paper together with the competition organisers and other participants, published in the Journal of Imaging and entitled "The Face Deepfake Detection Challenge" [7].</p>
          </sec>
          <sec id="sec-4-1-2-2">
            <title>2.3. On the generalization of Deepfake Detectors</title>
            <p>Building on the experience of previous work, we focused on investigating the generalization ability of deepfake detectors, primarily aiming to validate various deep learning architectures in a cross-forgery context. The first work done in this regard was published in the Workshop on Multimedia AI against Disinformation (MAD '22) at ICMR 2022, under the title "Cross-Forgery Analysis of Vision Transformers and CNNs for Image Deepfake Detection" [8], and consists of a comparison between a convolutional network, the EfficientNet-V2-M, and a classical Vision Transformer, namely ViT-Base. These two models are based on very different concepts and structures, and it is in these differences that their peculiarities reside, which are reflected in different behaviour in deepfake detection. Our experiments were conducted on the ForgeryNet dataset [9], consisting of images manipulated with 15 different techniques. The models were trained on images manipulated with a specific method, or a group of methods, and then tested on images manipulated with all available methods. According to our results, the Vision Transformer turns out to have a significantly higher generalization ability than EfficientNet, which instead tends to store more of the specific artifacts introduced by deepfake generation algorithms and thus detects few images generated by other methods. This result is more pronounced in the presence of large masses of data, which allow the Vision Transformer to generalize even better to the deepfake concept.</p>
            <p>This work conducted on images was then recently extended to the video portion of ForgeryNet and submitted to the Journal of Imaging under the title "On the generalization of Deep Learning techniques for Video Deepfake Detection" [10]. In fact, what has been discovered on images does not necessarily transfer to videos, since the anomalies that can be found in videos can be both spatial and temporal in nature, unlike images. In this case, the 9 manipulation techniques used on videos are divided into two macro-categories, ID-Replaced and ID-Remained. The architectures explored are the same as in the previous work, but with the addition of a Swin Transformer. The latter was interesting to validate since it is a Transformer based on a hierarchical attention mechanism inspired by convolutional networks, and therefore a sort of middle ground between the two architectures considered. According to our experiments, conducted similarly to the previous work, the EfficientNet performs better in a lower data regime, while again the Vision Transformer, even on video, is more capable of generalization and less tied to the methods used to create the training set. The Swin Transformer, on the other hand, proves to be a good middle ground between the two architectures, hardly excelling over the others but achieving satisfactory performance on average.</p>
            <p>To summarize, in light of the many experiments conducted in this work, the Vision Transformer and its variants are more suitable to be used as the basis for deepfake detectors to be applied in the real world. The continuing emergence of new deepfake generation techniques forces detectors to untether themselves from any attempt to memorize the specific anomalies of a technique, and to abstract the concept of anomaly, so that they can recognize deepfakes regardless of the manipulation technique. Table 2, referenced in Section 2.4, compares MINTIME-EF and MINTIME-XC with SlowFast R-50 [12] and X3D-M [13].</p>
          </sec>
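          <p>The cross-forgery protocol used in these generalization studies (train on one manipulation method, then test on all of them) can be sketched as follows. This is a minimal illustration: the threshold "detector" is a stand-in stub, not the EfficientNet, ViT or Swin models actually evaluated in our papers, and the feature values and method names are invented for the example.</p>

```python
from typing import Callable, Dict, List, Tuple

# A sample is (features, label); label 1 = fake. Samples are grouped by the
# manipulation method that produced the fakes in that split.
Sample = Tuple[List[float], int]
Detector = Callable[[List[float]], int]

def cross_forgery_matrix(
    data_by_method: Dict[str, List[Sample]],
    train_fn: Callable[[List[Sample]], Detector],
) -> Dict[str, Dict[str, float]]:
    """Train one detector per manipulation method, then test it on every
    method's data, yielding an accuracy matrix (row = training method)."""
    matrix: Dict[str, Dict[str, float]] = {}
    for train_method, train_samples in data_by_method.items():
        predict = train_fn(train_samples)
        matrix[train_method] = {
            test_method: sum(1 for x, y in test_samples if predict(x) == y)
            / len(test_samples)
            for test_method, test_samples in data_by_method.items()
        }
    return matrix

def train_threshold_detector(samples: List[Sample]) -> Detector:
    """Stand-in detector: thresholds the first feature halfway between the
    mean of the fake samples and the mean of the real ones."""
    fakes = [x[0] for x, y in samples if y == 1]
    reals = [x[0] for x, y in samples if y == 0]
    thr = (sum(fakes) / len(fakes) + sum(reals) / len(reals)) / 2
    return lambda x: 1 if x[0] > thr else 0

if __name__ == "__main__":
    data = {
        "face_swap": [([0.9], 1), ([0.8], 1), ([0.1], 0), ([0.2], 0)],
        "reenactment": [([0.6], 1), ([0.7], 1), ([0.3], 0), ([0.2], 0)],
    }
    m = cross_forgery_matrix(data, train_threshold_detector)
    print(m["face_swap"]["reenactment"])
```

          <p>Diagonal entries of the matrix give in-distribution accuracy, while off-diagonal entries measure exactly the cross-forgery generalization discussed above.</p>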
        </sec>
        <sec id="sec-4-1-4">
          <title>2.4. MINTIME: Multi-Identity Size Invariant TimeSformer</title>
          <p>Our approach achieved state-of-the-art results when tested on multi-identity videos only, as can be seen in Table 3, which compares MINTIME-EF and MINTIME-XC with SlowFast R-50 [12]. Our approach also exposed outstanding results in terms of generalization, which can be seen in the cross-forgery evaluation presented in Table 4. From the interpretation of the attention maps it is also possible to trace which of the multiple identities, if any, has been manipulated, making the approach more usable in the real world.</p>
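          <p>The size-embedding idea, which makes the detector aware of the real scale of a face within the frame, can be illustrated with a small sketch: the face-to-frame area ratio is quantized into a discrete index that would select a learned embedding vector. The log-scale bucketing and the bucket count here are our own illustrative assumptions, not the actual MINTIME design choices.</p>

```python
import math

def size_embedding_index(face_area: float, frame_area: float,
                         num_buckets: int = 8) -> int:
    """Quantize the face-to-frame area ratio into one of `num_buckets`
    bins; the index would select a learned size embedding, encoding the
    original scale of the face with respect to the entire scene."""
    if not 0 < face_area <= frame_area:
        raise ValueError("face_area must be positive and at most frame_area")
    ratio = face_area / frame_area      # in (0, 1]
    # Log-scale bucketing: small faces get finer resolution than large ones.
    idx = int(-math.log2(ratio))        # ratio 1 -> 0, 1/2 -> 1, 1/4 -> 2, ...
    return min(idx, num_buckets - 1)

# A tightly-cropped face versus a tiny face in a wide establishing shot.
print(size_embedding_index(100.0, 100.0))   # -> 0
print(size_embedding_index(10.0, 1000.0))   # -> 6
```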
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>The challenges encountered during our previous research resulted in a paper conducted in collaboration with CERTH Thessaloniki, to be submitted to an international journal, entitled "MINTIME: Multi-Identity Size Invariant Deepfake Detector" [11]. In this work, we identified some frequent problems in the real world and tried to develop a detector capable of handling them effectively. The main novelties introduced by this work are:</p>
        <list list-type="bullet">
          <list-item>
            <p>the development of a new deepfake detection model capable of capturing both spatial and temporal anomalies;</p>
          </list-item>
          <list-item>
            <p>the capability of managing multiple identities within the same scene, through the introduction of new techniques of attention, positional embedding and input sequence generation;</p>
          </list-item>
          <list-item>
            <p>robustness to changes in the ratio between the area of the face and that of the entire frame, through the introduction of a size embedding capable of encoding the real original size of the face with respect to the entire scene.</p>
          </list-item>
        </list>
        <sec id="sec-4-2-1">
          <title>2.5. Synthetic Media Detection</title>
          <p>Deepfakes are among the applications of deep learning that have caused most concern to date; recently, however, not only people's faces have been subject to manipulation. In fact, many techniques are emerging that make it possible to generate images of any subject, for instance from a text describing it. This poses major challenges, as it will be increasingly difficult to distinguish between synthetic and real content. The ability to differentiate between synthetic and real images is essential for preserving the integrity of information and safeguarding individuals against the malicious use of synthetic media. As text-to-image methods become increasingly prevalent and accessible to the general public, society is moving toward a point where a significant amount of online content is synthetic, blurring the line between reality and fiction. In our recent work titled "Detecting Images Generated by Diffusers" [18], we present an initial attempt to distinguish between generated and real images based on the image itself and the associated text used to describe and generate it. Additionally, we analyze the image and text peculiarities that may result in a more or less credible image that is difficult to identify. Our analysis focused on detecting content generated by text-to-image systems, specifically Stable Diffusion and GLIDE, testing various classifiers, including MLPs and Convolutional Neural Networks.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <p>This is also the first deepfake detection work based on TimeSformer, and the first time this architecture has been combined in the realisation of a transformer-convolutional hybrid model. Our approach achieved state-of-the-art results on several datasets, outperforming other methods like [14, 15, 16, 17], as shown in Table 2.</p>
        <p>In our experiments on diffuser-generated images, we found that traditional deep learning models can easily distinguish images generated with these systems once they have seen examples in the training set. However, when tested for generalization ability, they were rarely able to identify images generated by methods other than those used in the training set, highlighting a significant issue for these systems' real-world adoption. We also conducted an analysis of the correlation between the credibility of generated images and their category, as well as the composition of their associated captions. Our experiments found that images generated by both generators are more credible when they depict inanimate objects, resulting in greater classifier error. In contrast, images depicting people, animals, or animate subjects in general are easier to identify. Moreover, there appears to be no strong correlation between a caption's linguistic composition and the models' classification ability.</p>
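        <p>A minimal sketch of the image-plus-text fusion idea: concatenated image and caption features feed a single-layer logistic classifier trained by gradient descent. This is a toy stand-in for the MLP and CNN classifiers tested in the paper; the two-dimensional features below are invented for illustration, whereas a real system would use CNN image embeddings and text-encoder caption embeddings.</p>

```python
import math
from typing import Callable, List, Tuple

Sample = Tuple[List[float], int]  # (image features + caption features, label)

def train_fusion_classifier(samples: List[Sample], epochs: int = 200,
                            lr: float = 0.1) -> Callable[[List[float]], int]:
    """Train a single-layer logistic classifier on concatenated
    image + caption feature vectors (label 1 = generated image)."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = y - pred                      # gradient of log-loss
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    # Decision rule: positive logit -> generated, else real.
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

if __name__ == "__main__":
    # First feature: a fictitious "image artifact" score; second: a
    # fictitious caption-based score. Generated images score high on the
    # first, real images on the second.
    train = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 0.9], 0), ([0.2, 1.0], 0)]
    predict = train_fusion_classifier(train)
    print(predict([0.95, 0.15]), predict([0.15, 0.95]))
```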
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. Related Projects</title>
      <sec id="sec-5-1">
        <title>3.1. AI4Media</title>
        <p>The AI4Media project is a European Union-funded initiative that aims to advance the state of the art in artificial intelligence (AI) and machine learning (ML) technologies and their application in the media industry. The project focuses on developing innovative tools and techniques for improving media production, distribution, and consumption, with the ultimate goal of enhancing the quality, diversity, and accessibility of media content while maintaining high ethical and social standards. The project brings together leading research institutions, media organizations, and technology companies across Europe, including universities, broadcasters, publishers, and startups. Through collaborative research and development efforts, the project aims to address various challenges and opportunities in the media industry, such as the creation of personalized content, the detection and prevention of fake news and disinformation, and the improvement of accessibility and user experience. Partners in this project can join the AI4Media Fellowship programme, in which young researchers can visit the sites of other members and collaborate on joint research projects. In particular, Davide Alessandro Coccomini, PhD student at the University of Pisa and research associate at ISTI-CNR, spent a two-month exchange period at CERTH in Thessaloniki. During this period and the following months, research was conducted on the realisation of a deepfake detector capable of handling various real-world challenging situations.</p>
      </sec>
      <sec id="sec-5-1-1">
        <title>3.2. SERICS</title>
        <p>The SERICS Foundation, focused on Security and Rights in Cyberspace, has been established in compliance with the principles and legal framework of the participation foundation, which is part of the broader category of foundations governed by the Civil Code and related laws. Its primary objective is scientific and technological research and, as such, it has been designated as the implementing party for the "SERICS - Security and Rights in CyberSpace" Partnership, which is funded under the Public Notice for the presentation of proposals for the creation of "Partnerships extended to universities, research centers, companies for the funding of basic research projects", as part of the National Recovery and Resilience Plan, Mission 4 "Education and Research", Component 2 "From Research to Enterprise", Investment 1.3, funded by the European Union - NextGenerationEU - Notice no. 341 of 15.3.2022.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Conclusions and Future Work</title>
      <p>In this paper, we illustrated the work conducted by ISTI-CNR's AIMH laboratory in the field of deepfake detection. The first steps taken in this field are relatively recent, but within a short period of time the team of researchers has achieved excellent results in various contexts by exploring and discovering peculiarities, solving open problems and posing new ones, all in collaboration with the fervent group of researchers actively working in this field. In the future, the aim is to extend the work that has been started and to continue the collaboration with other institutions, in order to achieve an increasingly precise, robust and real-world-friendly deepfake detector, but also to investigate the field of synthetic content detection in greater depth, so as to counter misinformation and contribute to the maintenance of network security for all users.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partially supported by project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NGEU and by FAIR (PE00000013) funded by the European Commission under NextGenerationEU. It was also partially supported by the AI4Media project, funded by the EC (H2020 - Contract n. 951911).</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[2] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. C. Ferrer, The deepfake detection challenge (dfdc) dataset, arXiv preprint arXiv:2006.07397 (2020).</p>
      <p>[3] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, Faceforensics++: Learning to detect manipulated facial images, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1-11.</p>
      <p>[4] D. Wodajo, S. Atnafu, Deepfake video detection using convolutional vision transformer, arXiv preprint arXiv:2102.11126 (2021).</p>
      <p>[5] Y.-J. Heo, Y.-J. Choi, Y.-W. Lee, B.-G. Kim, Deepfake detection scheme based on vision transformer and distillation, arXiv preprint arXiv:2104.01353 (2021).</p>
      <p>[6] S. Seferbekov, Dfdc 1st place solution, 2020. URL: https://github.com/selimsef/dfdc_deepfake_challenge.</p>
      <p>[7] L. Guarnera, O. Giudice, F. Guarnera, A. Ortis, G. Puglisi, A. Paratore, L. M. Q. Bui, M. Fontani, D. A. Coccomini, R. Caldelli, F. Falchi, C. Gennaro, N. Messina, G. Amato, G. Perelli, S. Concas, C. Cuccu, G. Orrù, G. L. Marcialis, S. Battiato, The face deepfake detection challenge, Journal of Imaging 8 (2022). URL: https://www.mdpi.com/2313-433X/8/10/263. doi:10.3390/jimaging8100263.</p>
      <p>[8] D. A. Coccomini, R. Caldelli, F. Falchi, C. Gennaro, G. Amato, Cross-forgery analysis of Vision Transformers and CNNs for deepfake image detection, in: Proceedings of the 1st International Workshop on Multimedia AI against Disinformation, MAD '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 52-58. URL: https://doi.org/10.1145/3512732.3533582. doi:10.1145/3512732.3533582.</p>
      <p>[9] Y. He, B. Gan, S. Chen, Y. Zhou, G. Yin, L. Song, L. Sheng, J. Shao, Z. Liu, Forgerynet: A versatile benchmark for comprehensive forgery analysis, in: CVPR, 2021, pp. 4358-4367. doi:10.1109/CVPR46437.2021.00434.</p>
      <p>[10] D. A. Coccomini, R. Caldelli, F. Falchi, C. Gennaro, On the generalization of deep learning models in video deepfake detection (2023). URL: https://www.preprints.org/manuscript/202303.0161/v1. doi:10.20944/preprints202303.0161.v1.</p>
      <p>[11] D. A. Coccomini, G. K. Zilos, G. Amato, R. Caldelli, F. Falchi, S. Papadopoulos, C. Gennaro, Mintime: Multi-identity size-invariant video deepfake detection, 2022. URL: https://arxiv.org/abs/2211.10996. doi:10.48550/ARXIV.2211.10996.</p>
      <p>[12] C. Feichtenhofer, H. Fan, J. Malik, K. He, Slowfast networks for video recognition, in: ICCV, 2019.</p>
      <p>[13] C. Feichtenhofer, X3d: Expanding architectures for efficient video recognition, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 200-210. doi:10.1109/CVPR42600.2020.00028.</p>
      <p>[14] Y. Zheng, J. Bao, D. Chen, M. Zeng, F. Wen, Exploring temporal coherence for more general video face forgery detection, in: ICCV, 2021, pp. 15024-15034. doi:10.1109/ICCV48922.2021.01477.</p>
      <p>[15] A. Haliassos, K. Vougioukas, S. Petridis, M. Pantic, Lips don't lie: A generalisable and robust approach to face forgery detection, in: CVPR, 2021, pp. 5037-5047. doi:10.1109/CVPR46437.2021.00500.</p>
      <p>[16] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, P. Natarajan, Recurrent convolutional strategies for face manipulation detection in videos, in: CVPR Workshops, 2019.</p>
      <p>[17] H. H. Nguyen, F. Fang, J. Yamagishi, I. Echizen, Multi-task learning for detecting and segmenting manipulated facial images and videos, in: 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), 2019, pp. 1-8. doi:10.1109/BTAS46853.2019.9185974.</p>
      <p>[18] D. A. Coccomini, A. Esuli, F. Falchi, C. Gennaro, G. Amato, Detecting images generated by diffusers, 2023. URL: https://arxiv.org/abs/2303.05275. doi:10.48550/ARXIV.2303.05275.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. A. Coccomini, N. Messina, C. Gennaro, F. Falchi, Combining EfficientNet and Vision Transformers for video deepfake detection, in: Image Analysis and Processing (ICIAP 2022) - Part III, Springer, 2022, pp. 219-229. URL: https://doi.org/10.1007/978-3-031-06433-3_19. doi:10.1007/978-3-031-06433-3_19.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>