<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approaches for Deepfake Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Alessandro Coccomini</string-name>
          <email>davidealessandro.coccomini@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Caldelli</string-name>
          <email>roberto.caldelli@unifi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
          <email>andrea.esuli@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <email>fabrizio.falchi@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Gennaro</string-name>
          <email>claudio.gennaro@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Messina</string-name>
          <email>nicola.messina@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Amato</string-name>
          <email>giuseppe.amato@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>via G. Moruzzi, 1, 56100, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mercatorum University</institution>
          ,
          <addr-line>00186, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Inter-University Consortium for Telecommunications (CNIT)</institution>
          ,
          <addr-line>viale Morgagni, 65, 50134, Florence</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>The creation of highly realistic media known as deepfakes has been facilitated by the rapid development of artificial intelligence technologies, including deep learning algorithms, in recent years. Concerns about the increasing ease of creation and the credibility of deepfakes have been growing steadily, prompting researchers around the world to concentrate their efforts on the field of deepfake detection. In this context, researchers at ISTI-CNR's AIMH Lab have conducted numerous studies, investigations and proposals to make their own contribution to combating this worrying phenomenon. In this paper, we present the main work carried out in the field of deepfake detection and synthetic content detection, conducted by our researchers and in collaboration with external organizations.</p>
      </abstract>
      <kwd-group>
        <kwd>Deepfake Detection</kwd>
        <kwd>Synthetic Content Detection</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, there has been a rapid increase in the
development of artificial intelligence technologies,
including deep learning algorithms, that have led to the
creation of highly realistic media manipulations known
as deepfakes. Deepfakes refer to synthetic media
generated using machine learning techniques designed to
mimic the appearance and behaviour of real individuals
in videos or images, manipulating what they do and
what they say.</p>
      <p>While deepfakes have some potential positive
applications, such as in the entertainment industry, they pose
severe risks to society, including political, social, and
economic threats. For instance, deepfakes can be used
to spread disinformation, manipulate public opinion and
damage personal reputations.</p>
      <p>Given the potential harm caused by deepfakes, it is
crucial to develop effective methods for detecting and
mitigating them. In recent years, there has been a surge
in research on deepfake detection techniques, and several
deepfake detection tools have been developed. However,
reliably generalizing to manipulation techniques unseen
at training time remains an open problem.</p>
    </sec>
    <sec id="sec-4">
      <title>2. Research Works in Deepfake Detection</title>
      <sec id="sec-4-1">
        <p>In this section we present our works in deepfake detection and related fields, highlighting the contributions and discoveries made.</p>
        <sec id="sec-4-1-2">
          <title>2.1. Convolutional Cross Vision Transformer</title>
          <p>When we started to take our first steps in this field, we noticed a shortage of Vision Transformer-based deepfake detectors and, even more so, an almost total absence of hybrid architectures used for this purpose, certainly owing to their very recent advent. We therefore wished to explore this untrodden field in the paper [<xref ref-type="bibr" rid="ref1">1</xref>], in which we realised a hybrid architecture composed of a convolutional network, in particular EfficientNet-B0, and a Cross Vision Transformer. In our proposal, the latter's internal attention mechanism, instead of working on the patches extracted from the images, acts on the features obtained from the EfficientNet. The advantage lies in the fact that these features are obtained from a learnable process that is refined in the training phase, which has therefore allowed us to obtain a state-of-the-art model on different datasets. In particular, the model we named Convolutional Cross Vision Transformer proved to be state-of-the-art in terms of accuracy and AUC on both the DFDC [2] dataset (shown in Table 1) and FaceForensics++ [3], two of the main datasets used in this field, comparing with previous works like [4, 5], all with significantly fewer parameters than other solutions, and therefore lighter, thanks to the exploitation of a hybrid architecture. The article also investigated the impact of certain implementation choices, such as the number of frames considered per video and the management of multiple identities in the same scene, which then formed the basis for some subsequent work.</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption>
              <p>Video-level results of our models and other previous works on the DFDC test set. The symbol † indicates that the model uses an ensemble of 6 networks.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Model</th><th>AUC</th><th>F1-score</th><th># params</th></tr>
              </thead>
              <tbody>
                <tr><td>ViT with distillation [5]</td><td>0.978</td><td>91.9%</td><td>373M</td></tr>
                <tr><td>Selim EfficientNet B7 [6]†</td><td>0.972</td><td>90.6%</td><td>462M</td></tr>
                <tr><td>Convolutional ViT [4]</td><td>0.843</td><td>77.0%</td><td>89M</td></tr>
                <tr><td>Efficient ViT (our)</td><td>0.919</td><td>83.8%</td><td>109M</td></tr>
                <tr><td>Conv. Cross ViT Wodajo CNN (our)</td><td>0.925</td><td>84.5%</td><td>142M</td></tr>
                <tr><td>Conv. Cross ViT Ef.Net B0 - Avg (our)</td><td>0.947</td><td>85.6%</td><td>101M</td></tr>
                <tr><td>Conv. Cross ViT Ef.Net B0 - Voting (our)</td><td>0.951</td><td>88.0%</td><td>101M</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <sec id="sec-4-1-2-1">
            <title>2.2. Participation in the ICIAP 2021 competition</title>
            <p>The previous paper, presented at ICIAP 2021, was also used as the basis for participation in the Face Deepfake Detection challenge organised at that conference. During the competition, the ability of the participants' solutions to identify deepfakes "in the wild" was assessed and their generalisation capability was investigated. The latter is one of the main problems in deepfake detection: normally, deepfake detectors are very good at identifying deepfakes generated with the methods used for the creation of the training set, but practically useless when confronted with content manipulated with novel techniques. Our method, based on what was presented in the previous section, placed fourth in the ranking, and we subsequently produced a paper together with the competition organisers and other participants, published in the Journal of Imaging and entitled "The Face Deepfake Detection Challenge" [7].</p>
          </sec>
          <sec id="sec-4-1-2-2">
            <title>2.3. On the generalization of Deepfake Detectors</title>
            <p>Building on the experience of previous work, we focused on investigating the generalization ability of deepfake detectors, primarily aiming to validate various deep learning architectures in a cross-forgery context. The first work done in this regard was published in the Workshop on Multimedia AI against Disinformation (MAD '22) at ICMR 2022, under the title "Cross-Forgery Analysis of Vision Transformers and CNNs for Image Deepfake Detection" [8], and consists of a comparison between a convolutional network, the EfficientNet-V2-M, and a classical Vision Transformer, namely ViT-Base. These two models are based on very different concepts and structures, and it is in these differences that their peculiarities reside, which are reflected in different behaviour in deepfake detection. Our experiments were conducted on the ForgeryNet dataset [9], consisting of images manipulated with 15 different techniques. The models were trained on images manipulated with a specific method, or a group of methods, and then tested on images manipulated with all available methods. According to our results, the Vision Transformer turns out to have a significantly higher generalization ability than EfficientNet, which instead tends to store more of the specific artifacts introduced by deepfake generation algorithms and thus detects few images generated by other methods. This result is more pronounced in the presence of large masses of data, which allow the Vision Transformer to generalize even better to the deepfake concept.</p>
            <p>This work conducted on images was then recently extended to the video portion of ForgeryNet and submitted to the Journal of Imaging under the title "On the generalization of Deep Learning techniques for Video Deepfake Detection" [10]. In fact, what has been discovered on images does not necessarily transfer to videos, since the anomalies that can be found in videos can be both spatial and temporal in nature, unlike images. In this case, the 9 manipulation techniques used on videos are divided into two macro-categories, ID-Replaced and ID-Remained. The architectures explored are the same as in the previous work, but with the addition of a Swin Transformer. The latter was interesting to validate since it is a Transformer based on a hierarchical attention mechanism inspired by convolutional networks, and therefore a sort of middle ground between the two architectures considered. According to our experiments, conducted similarly to the previous work, the EfficientNet performs better in a lower data regime, while again the Vision Transformer, even on video, is more capable of generalization and less tied to the methods used to create the training set. The Swin Transformer, on the other hand, proves to be a good middle ground between the two architectures, hardly excelling over the others but achieving satisfactory performance on average.</p>
            <p>To summarize, in light of the many experiments conducted in this work, the Vision Transformer and its variants are more suitable to be used as the basis for deepfake detectors to be applied in the real world. The continuing emergence of new deepfake generation techniques forces detectors to untether themselves from any attempt to memorize the specific anomalies of a technique, and to abstract the concept of anomaly, so that they can recognize deepfakes regardless of the manipulation technique. Table 2, referenced in Section 2.4, compares MINTIME-EF and MINTIME-XC with SlowFast R-50 [12] and X3D-M [13].</p>
          </sec>
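          <p>The cross-forgery protocol used in these generalization studies (train on one manipulation method, then test on all of them) can be sketched as follows. This is a minimal illustration: the threshold "detector" is a stand-in stub, not the EfficientNet, ViT or Swin models actually evaluated in our papers, and the feature values and method names are invented for the example.</p>

```python
from typing import Callable, Dict, List, Tuple

# A sample is (features, label); label 1 = fake. Samples are grouped by the
# manipulation method that produced the fakes in that split.
Sample = Tuple[List[float], int]
Detector = Callable[[List[float]], int]

def cross_forgery_matrix(
    data_by_method: Dict[str, List[Sample]],
    train_fn: Callable[[List[Sample]], Detector],
) -> Dict[str, Dict[str, float]]:
    """Train one detector per manipulation method, then test it on every
    method's data, yielding an accuracy matrix (row = training method)."""
    matrix: Dict[str, Dict[str, float]] = {}
    for train_method, train_samples in data_by_method.items():
        predict = train_fn(train_samples)
        matrix[train_method] = {
            test_method: sum(1 for x, y in test_samples if predict(x) == y)
            / len(test_samples)
            for test_method, test_samples in data_by_method.items()
        }
    return matrix

def train_threshold_detector(samples: List[Sample]) -> Detector:
    """Stand-in detector: thresholds the first feature halfway between the
    mean of the fake samples and the mean of the real ones."""
    fakes = [x[0] for x, y in samples if y == 1]
    reals = [x[0] for x, y in samples if y == 0]
    thr = (sum(fakes) / len(fakes) + sum(reals) / len(reals)) / 2
    return lambda x: 1 if x[0] > thr else 0

if __name__ == "__main__":
    data = {
        "face_swap": [([0.9], 1), ([0.8], 1), ([0.1], 0), ([0.2], 0)],
        "reenactment": [([0.6], 1), ([0.7], 1), ([0.3], 0), ([0.2], 0)],
    }
    m = cross_forgery_matrix(data, train_threshold_detector)
    print(m["face_swap"]["reenactment"])
```

          <p>Diagonal entries of the matrix give in-distribution accuracy, while off-diagonal entries measure exactly the cross-forgery generalization discussed above.</p>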
        </sec>
        <sec id="sec-4-1-4">
          <title>2.4. MINTIME: Multi-Identity Size Invariant TimeSformer</title>
          <p>Our approach achieved state-of-the-art results when tested on multi-identity videos only, as can be seen in Table 3, which compares MINTIME-EF and MINTIME-XC with SlowFast R-50 [12]. Our approach also exposed outstanding results in terms of generalization, which can be seen in the cross-forgery evaluation presented in Table 4. From the interpretation of the attention maps it is also possible to trace which of the multiple identities, if any, has been manipulated, making the approach more usable in the real world.</p>
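          <p>The size-embedding idea, which makes the detector aware of the real scale of a face within the frame, can be illustrated with a small sketch: the face-to-frame area ratio is quantized into a discrete index that would select a learned embedding vector. The log-scale bucketing and the bucket count here are our own illustrative assumptions, not the actual MINTIME design choices.</p>

```python
import math

def size_embedding_index(face_area: float, frame_area: float,
                         num_buckets: int = 8) -> int:
    """Quantize the face-to-frame area ratio into one of `num_buckets`
    bins; the index would select a learned size embedding, encoding the
    original scale of the face with respect to the entire scene."""
    if not 0 < face_area <= frame_area:
        raise ValueError("face_area must be positive and at most frame_area")
    ratio = face_area / frame_area      # in (0, 1]
    # Log-scale bucketing: small faces get finer resolution than large ones.
    idx = int(-math.log2(ratio))        # ratio 1 -> 0, 1/2 -> 1, 1/4 -> 2, ...
    return min(idx, num_buckets - 1)

# A tightly-cropped face versus a tiny face in a wide establishing shot.
print(size_embedding_index(100.0, 100.0))   # -> 0
print(size_embedding_index(10.0, 1000.0))   # -> 6
```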
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>The challenges encountered during our previous research resulted in a paper conducted in collaboration with CERTH Thessaloniki, to be submitted to an international journal, entitled "MINTIME: Multi-Identity Size Invariant Deepfake Detector" [11]. In this work, we identified some frequent problems in the real world and tried to develop a detector capable of handling them effectively. The main novelties introduced by this work are:</p>
        <list list-type="bullet">
          <list-item>
            <p>the development of a new deepfake detection model capable of capturing both spatial and temporal anomalies;</p>
          </list-item>
          <list-item>
            <p>the capability of managing multiple identities within the same scene, through the introduction of new techniques of attention, positional embedding and input sequence generation;</p>
          </list-item>
          <list-item>
            <p>robustness to changes in the ratio between the area of the face and that of the entire frame, through the introduction of a size embedding capable of encoding the real original size of the face with respect to the entire scene.</p>
          </list-item>
        </list>
        <sec id="sec-4-2-1">
          <title>2.5. Synthetic Media Detection</title>
          <p>Deepfakes are among the applications of deep learning that have caused most concern to date; recently, however, not only people's faces have been subject to manipulation. In fact, many techniques are emerging that make it possible to generate images of any subject, for instance from a text describing it. This poses major challenges, as it will be increasingly difficult to distinguish between synthetic and real content. The ability to differentiate between synthetic and real images is essential for preserving the integrity of information and safeguarding individuals against the malicious use of synthetic media. As text-to-image methods become increasingly prevalent and accessible to the general public, society is moving toward a point where a significant amount of online content is synthetic, blurring the line between reality and fiction. In our recent work titled "Detecting Images Generated by Diffusers" [18], we present an initial attempt to distinguish between generated and real images based on the image itself and the associated text used to describe and generate it. Additionally, we analyze the image and text peculiarities that may result in a more or less credible image that is difficult to identify. Our analysis focused on detecting content generated by text-to-image systems, specifically Stable Diffusion and GLIDE, testing various classifiers, including MLPs and Convolutional Neural Networks.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <p>This is also the first deepfake detection work based on TimeSformer, and the first time this architecture has been combined in the realisation of a transformer-convolutional hybrid model. Our approach achieved state-of-the-art results on several datasets, outperforming other methods like [14, 15, 16, 17], as shown in Table 2.</p>
        <p>In our experiments on diffuser-generated images, we found that traditional deep learning models can easily distinguish images generated with these systems once they have seen examples in the training set. However, when tested for generalization ability, they were rarely able to identify images generated by methods other than those used in the training set, highlighting a significant issue for these systems' real-world adoption. We also conducted an analysis of the correlation between the credibility of generated images and their category, as well as the composition of their associated captions. Our experiments found that images generated by both generators are more credible when they depict inanimate objects, resulting in greater classifier error. In contrast, images depicting people, animals, or animate subjects in general are easier to identify. Moreover, there appears to be no strong correlation between a caption's linguistic composition and the models' classification ability.</p>
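        <p>A minimal sketch of the image-plus-text fusion idea: concatenated image and caption features feed a single-layer logistic classifier trained by gradient descent. This is a toy stand-in for the MLP and CNN classifiers tested in the paper; the two-dimensional features below are invented for illustration, whereas a real system would use CNN image embeddings and text-encoder caption embeddings.</p>

```python
import math
from typing import Callable, List, Tuple

Sample = Tuple[List[float], int]  # (image features + caption features, label)

def train_fusion_classifier(samples: List[Sample], epochs: int = 200,
                            lr: float = 0.1) -> Callable[[List[float]], int]:
    """Train a single-layer logistic classifier on concatenated
    image + caption feature vectors (label 1 = generated image)."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = y - pred                      # gradient of log-loss
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    # Decision rule: positive logit -> generated, else real.
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

if __name__ == "__main__":
    # First feature: a fictitious "image artifact" score; second: a
    # fictitious caption-based score. Generated images score high on the
    # first, real images on the second.
    train = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 0.9], 0), ([0.2, 1.0], 0)]
    predict = train_fusion_classifier(train)
    print(predict([0.95, 0.15]), predict([0.15, 0.95]))
```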
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. Related Projects</title>
      <sec id="sec-5-1">
        <title>3.1. AI4Media</title>
        <p>The AI4Media project is a European Union-funded initiative that aims to advance the state of the art in artificial intelligence (AI) and machine learning (ML) technologies and their application in the media industry. The project focuses on developing innovative tools and techniques for improving media production, distribution, and consumption, with the ultimate goal of enhancing the quality, diversity, and accessibility of media content while maintaining high ethical and social standards. The project brings together leading research institutions, media organizations, and technology companies across Europe, including universities, broadcasters, publishers, and startups. Through collaborative research and development efforts, the project aims to address various challenges and opportunities in the media industry, such as the creation of personalized content, the detection and prevention of fake news and disinformation, and the improvement of accessibility and user experience. Partners in this project can join the AI4Media Fellowship programme, in which young researchers can visit the sites of other members and collaborate on joint research projects. In particular, Davide Alessandro Coccomini, PhD student at the University of Pisa and research associate at ISTI-CNR, spent a two-month exchange period at CERTH in Thessaloniki. During this period and the following months, research was conducted on the realisation of a deepfake detector capable of handling various real-world challenging situations.</p>
      </sec>
      <sec id="sec-5-1-1">
        <title>3.2. SERICS</title>
        <p>The SERICS Foundation, focused on Security and Rights in Cyberspace, has been established in compliance with the principles and legal framework of the participation foundation, which is part of the broader category of foundations governed by the Civil Code and related laws. Its primary objective is scientific and technological research and, as such, it has been designated as the implementing party for the "SERICS - Security and Rights in CyberSpace" Partnership, which is funded under the Public Notice for the presentation of proposals for the creation of "Partnerships extended to universities, research centers, companies for the funding of basic research projects", as part of the National Recovery and Resilience Plan, Mission 4 "Education and Research", Component 2 "From Research to Enterprise", Investment 1.3, funded by the European Union - NextGenerationEU - Notice no. 341 of 15.3.2022.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Conclusions and Future Work</title>
      <p>In this paper, we illustrated the work conducted by ISTI-CNR's AIMH laboratory in the field of deepfake detection. The first steps taken in this field are relatively recent, but within a short period of time the team of researchers has achieved excellent results in various contexts by exploring and discovering peculiarities, solving open problems and posing new ones, all in collaboration with the fervent group of researchers actively working in this field. In the future, the aim is to extend the work that has been started and to continue the collaboration with other institutions, in order to achieve an increasingly precise, robust and real-world-friendly deepfake detector, but also to investigate the field of synthetic content detection in greater depth, so as to counter misinformation and contribute to the maintenance of network security for all users.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partially supported by project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NGEU and by FAIR (PE00000013) funded by the European Commission under NextGenerationEU. It was also partially supported by the AI4Media project, funded by the EC (H2020 - Contract n. 951911).</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[2] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. C. Ferrer, The deepfake detection challenge (dfdc) dataset, arXiv preprint arXiv:2006.07397 (2020).</p>
      <p>[3] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, Faceforensics++: Learning to detect manipulated facial images, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1-11.</p>
      <p>[4] D. Wodajo, S. Atnafu, Deepfake video detection using convolutional vision transformer, arXiv preprint arXiv:2102.11126 (2021).</p>
      <p>[5] Y.-J. Heo, Y.-J. Choi, Y.-W. Lee, B.-G. Kim, Deepfake detection scheme based on vision transformer and distillation, arXiv preprint arXiv:2104.01353 (2021).</p>
      <p>[6] S. Seferbekov, Dfdc 1st place solution, 2020. URL: https://github.com/selimsef/dfdc_deepfake_challenge.</p>
      <p>[7] L. Guarnera, O. Giudice, F. Guarnera, A. Ortis, G. Puglisi, A. Paratore, L. M. Q. Bui, M. Fontani, D. A. Coccomini, R. Caldelli, F. Falchi, C. Gennaro, N. Messina, G. Amato, G. Perelli, S. Concas, C. Cuccu, G. Orrù, G. L. Marcialis, S. Battiato, The face deepfake detection challenge, Journal of Imaging 8 (2022). URL: https://www.mdpi.com/2313-433X/8/10/263. doi:10.3390/jimaging8100263.</p>
      <p>[8] D. A. Coccomini, R. Caldelli, F. Falchi, C. Gennaro, G. Amato, Cross-forgery analysis of Vision Transformers and CNNs for deepfake image detection, in: Proceedings of the 1st International Workshop on Multimedia AI against Disinformation, MAD '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 52-58. URL: https://doi.org/10.1145/3512732.3533582. doi:10.1145/3512732.3533582.</p>
      <p>[9] Y. He, B. Gan, S. Chen, Y. Zhou, G. Yin, L. Song, L. Sheng, J. Shao, Z. Liu, Forgerynet: A versatile benchmark for comprehensive forgery analysis, in: CVPR, 2021, pp. 4358-4367. doi:10.1109/CVPR46437.2021.00434.</p>
      <p>[10] D. A. Coccomini, R. Caldelli, F. Falchi, C. Gennaro, On the generalization of deep learning models in video deepfake detection (2023). URL: https://www.preprints.org/manuscript/202303.0161/v1. doi:10.20944/preprints202303.0161.v1.</p>
      <p>[11] D. A. Coccomini, G. K. Zilos, G. Amato, R. Caldelli, F. Falchi, S. Papadopoulos, C. Gennaro, Mintime: Multi-identity size-invariant video deepfake detection, 2022. URL: https://arxiv.org/abs/2211.10996. doi:10.48550/ARXIV.2211.10996.</p>
      <p>[12] C. Feichtenhofer, H. Fan, J. Malik, K. He, Slowfast networks for video recognition, in: ICCV, 2019.</p>
      <p>[13] C. Feichtenhofer, X3d: Expanding architectures for efficient video recognition, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 200-210. doi:10.1109/CVPR42600.2020.00028.</p>
      <p>[14] Y. Zheng, J. Bao, D. Chen, M. Zeng, F. Wen, Exploring temporal coherence for more general video face forgery detection, in: ICCV, 2021, pp. 15024-15034. doi:10.1109/ICCV48922.2021.01477.</p>
      <p>[15] A. Haliassos, K. Vougioukas, S. Petridis, M. Pantic, Lips don't lie: A generalisable and robust approach to face forgery detection, in: CVPR, 2021, pp. 5037-5047. doi:10.1109/CVPR46437.2021.00500.</p>
      <p>[16] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, P. Natarajan, Recurrent convolutional strategies for face manipulation detection in videos, in: CVPR Workshops, 2019.</p>
      <p>[17] H. H. Nguyen, F. Fang, J. Yamagishi, I. Echizen, Multi-task learning for detecting and segmenting manipulated facial images and videos, in: 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), 2019, pp. 1-8. doi:10.1109/BTAS46853.2019.9185974.</p>
      <p>[18] D. A. Coccomini, A. Esuli, F. Falchi, C. Gennaro, G. Amato, Detecting images generated by diffusers, 2023. URL: https://arxiv.org/abs/2303.05275. doi:10.48550/ARXIV.2303.05275.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. A. Coccomini, N. Messina, C. Gennaro, F. Falchi, Combining EfficientNet and Vision Transformers for video deepfake detection, in: Image Analysis and Processing (ICIAP 2022) - Part III, Springer, 2022, pp. 219-229. URL: https://doi.org/10.1007/978-3-031-06433-3_19. doi:10.1007/978-3-031-06433-3_19.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>