Overview of the ImageCLEFsecurity 2019: File Forgery Detection Tasks*

Konstantinos Karampidis1, Nikos Vasillopoulos1, Carlos Cuevas2, Carlos Roberto del Blanco2, Ergina Kavallieratou1 and Narciso García2

1 AIlab, Department of Information & Communication Systems Engineering, University of the Aegean, Greece
2 Grupo de Tratamiento de Imágenes, Universidad Politécnica de Madrid, Spain
Imageclefsecurity@aegean.gr

* Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

Abstract. The File Forgery Detection task is in its first edition in 2019. This year it is composed of three subtasks: a) forged file discovery, b) stego image discovery, and c) secret message discovery. The data set contained 6,400 images and pdf files, divided into three sets. There were 61 participants, and the majority of them took part in all the subtasks, which highlights the attention the scientific community pays to security issues and the importance of each subtask. The subtasks received a) 8, b) 31, and c) 14 submissions, respectively. Although the data sets were small, most of the participants used deep learning techniques, especially in subtasks 2 and 3. The results obtained in subtask 3, which was the most difficult one, showed that there is room for improvement, as more advanced techniques are needed to achieve better results. The deep learning techniques adopted by many researchers are a step in that direction and showed that they may provide a promising steganalysis tool to a digital forensics examiner.

Keywords: File Forgery Detection, Digital Forensics, Forged Image, Stego Image.

1 Introduction

The File Forgery Detection tasks described in this paper are part of the ImageCLEF benchmarking campaign [1–4], a framework where researchers can share their expertise and compare their methods on exactly the same data and evaluation methodology in an annual rhythm. ImageCLEF is part of CLEF (Cross Language Evaluation Forum). More details about the 2019 campaign are given in Ionescu et al. [5]. Since 2003, ImageCLEF has aimed at building tasks that benchmark the challenging problem of image annotation for a wide range of source images and annotation objectives.

The File Forgery Detection task started in 2019 as a new task. It addresses an important and serious issue for digital forensics examiners. Fraud or counterfeiting are common motives for altering files. Another example is a child predator who hides pornographic images by altering the image extension and, in some cases, by changing the image signature. Many proposals have been made to solve this problem, and the most promising ones concentrate on the image content. It is also common that someone who wants to hide information in plain sight without being noticed uses steganography. Steganography is the practice of concealing a file, message, image, or video within another file, message, image, or video. Among cover media, images are the most usual choice for hiding data. Thus, the File Forgery Detection task is composed of three different subtasks, namely:

• Forged File Discovery
• Stego Image Discovery
• Secret Message Discovery

This paper presents an overview of the ImageCLEF 2019 File Forgery Detection subtasks: the subtask descriptions are given in Section 2, the data set in Section 3, and the evaluation framework in Section 4. The participant approaches are described in Section 5, followed by the discussion and conclusions in Section 6.
2 Subtasks

The specific objectives of these tasks are, first, to examine whether an image has been forged and, second, whether it could hide a text message. The last objective is to retrieve the potentially hidden message from the forged steganographic images. Subtask 1 focuses on file forgery. A file can be considered forged if its extension or its signature (also known as magic bytes) has been altered. If only the extension or only the signature has been altered, it is rather simple to identify the file as forged. The problem arises when both the extension and the signature have been altered at the same time. In this case, even the most widely used digital forensic software cannot identify the file as forged. Subtask 2 concerns the discovery of stego images. Images are the most widespread cover media for steganographic content. Steganography concerns hiding information in a cover medium that remains in plain sight, while steganalysis (our main objective here) tries to detect its existence (subtask 2) and, ideally, retrieve the hidden message (subtask 3) [6].

The participant takes the role of a professional digital forensic examiner collaborating with the police, who suspect that there is an ongoing fraud in the Central Bank. After obtaining a court order, the police gain access to a suspect's computer in the bank with the purpose of looking for images that prove the suspect guilty. However, the police suspect that the suspect has managed to change the file extensions and signatures of some images, so that they look like PDF (Portable Document Format) files or other file types. It is probable that the suspect has used steganography software to hide messages within the forged images that could reveal valuable information. The considered subtasks are defined as follows:

• Subtask 1: detect the altered (forged) images (both extension and signature altered) and predict the actual type of the forged file.
• Subtask 2: identify the altered images that hide steganographic content.
• Subtask 3: retrieve the hidden messages (text) from the forged steganographic images.

3 Dataset

The data set consists of 6,400 images and pdf files, divided into three groups as shown in Table 1. Each group of files was used for a specific subtask.

Table 1. Number of files per subtask in the data set.

               Subtask 1   Subtask 2   Subtask 3
Training set      2400        1000        1000
Test set          1000         500         500

All participants had access to the training sets along with their respective ground truth. The test sets were distributed without the ground truth.

The training set for forged file discovery (subtask 1) consisted of 2400 files: 1200 of them were true pdf files, while the rest appeared to be pdf files but were actually images (equally distributed among the jpg, png, and gif types). The conversion to pdf was made by changing the file extension to pdf and the signature (the first four bytes) to 25 50 44 46, i.e., "%PDF"; the first sketch at the end of this section illustrates this mechanism.

The training set for stego image discovery (subtask 2) consisted of 1000 jpg images: 500 of them were clean, while the rest were stego images (Figures 1 and 2).

Fig. 1. A clean image.

Fig. 2. A stego image.

The training set for secret message discovery (subtask 3) contained 1000 jpg images: 500 of them were clean, while the rest contained different text messages (the same message for every 100 images). A Least Significant Bit (LSB) insertion technique was used to embed the text messages, which concern the presumed dialogue the suspect had with his abettor.
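To make the forgery mechanism concrete, the following sketch reproduces the kind of alteration described above for subtask 1 (renaming an image to pdf and overwriting its first four bytes) and shows why a signature-only check is defeated by it. This is an illustrative reconstruction, not the script used to generate the data set; the paths and helper functions are hypothetical.

```python
# Illustrative reconstruction of the subtask 1 forgery: disguise an image as a PDF
# by renaming it and overwriting its first four bytes with 25 50 44 46 ("%PDF").
# Not the organizers' generation script; paths and helpers are hypothetical.

PDF_MAGIC = bytes.fromhex("25504446")  # b"%PDF"

def forge_to_pdf(src_path, dst_path):
    """Copy an image and disguise it as a PDF (the new extension comes from dst_path)."""
    with open(src_path, "rb") as f:
        data = bytearray(f.read())
    data[:4] = PDF_MAGIC                  # overwrite the original signature
    with open(dst_path, "wb") as f:       # e.g. "holiday.jpg" -> "report.pdf"
        f.write(data)

def extension_matches_signature(path):
    """Signature-only check: True when the declared pdf extension agrees with the
    magic bytes. Files forged as above still pass this check, which is why
    subtask 1 calls for content-based detection."""
    with open(path, "rb") as f:
        header = f.read(4)
    return path.lower().endswith(".pdf") and header == PDF_MAGIC
```

Because both the extension and the magic bytes agree after such an alteration, these files can only be exposed by analysing the actual file content, which is precisely what subtask 1 asks for.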
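The overview does not specify the embedding tool or parameters used to build the stego sets, so the second sketch is only a minimal illustration of the general LSB-insertion idea on raw pixel values. It assumes the Pillow and NumPy libraries; the sequential one-bit-per-channel layout and the 32-bit length header are conventions chosen here for illustration, not details taken from the task.

```python
# Minimal LSB insertion/extraction sketch on raw pixel values (illustration only).
# Assumed conventions: bits are written sequentially into the least significant bit
# of each RGB channel value, prefixed by a 32-bit big-endian payload length.

import numpy as np
from PIL import Image

def embed(cover_path, stego_path, message):
    pixels = np.array(Image.open(cover_path).convert("RGB"))
    payload = message.encode("utf-8")
    header = len(payload).to_bytes(4, "big")
    bits = np.unpackbits(np.frombuffer(header + payload, dtype=np.uint8))
    flat = pixels.flatten()                               # flatten() returns a copy
    if bits.size > flat.size:
        raise ValueError("message too long for this cover image")
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits   # overwrite the LSBs
    Image.fromarray(flat.reshape(pixels.shape)).save(stego_path, "PNG")

def extract(stego_path):
    flat = np.array(Image.open(stego_path).convert("RGB")).flatten()
    length = int.from_bytes(np.packbits(flat[:32] & 1).tobytes(), "big")
    bits = flat[32:32 + 8 * length] & 1
    return np.packbits(bits).tobytes().decode("utf-8")
```

Note that this sketch saves the stego image losslessly; re-encoding as JPEG would destroy spatial-domain LSBs, which is why practical JPEG steganography usually embeds in the quantized DCT coefficients instead.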
4 Evaluation Framework

For assessing the performance, classic metrics were used:

a) Precision, Recall, and F-measure for subtasks 1 and 2.
b) Edit distance for subtask 3.

In pattern recognition, information retrieval, and binary classification, Precision is the fraction of relevant instances among the retrieved instances. For subtask 1, Precision is defined as the fraction of correctly detected altered images among all the files detected as altered:

\mathrm{Precision} = \frac{\text{number of correctly detected altered images}}{\text{total number of files detected as altered}}

For subtask 2, Precision is defined as the fraction of correctly detected images with hidden messages among all the images detected as containing a hidden message:

\mathrm{Precision} = \frac{\text{number of correctly detected images with hidden messages}}{\text{total number of images detected as containing hidden messages}}

Recall is the fraction of relevant instances that have been retrieved over the total number of relevant instances. For subtask 1, Recall is defined as the fraction of correctly detected altered images among all the altered images:

\mathrm{Recall} = \frac{\text{number of correctly detected altered images}}{\text{total number of altered images}}

For subtask 2, Recall is defined as the fraction of correctly detected images with hidden messages among all the images with hidden messages:

\mathrm{Recall} = \frac{\text{number of correctly detected images with hidden messages}}{\text{total number of images with hidden messages}}

F-measure is the harmonic mean of Precision and Recall, mathematically expressed as

F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

For subtask 3, the edit distance is adopted, which is defined as follows. Given two strings a and b over an alphabet Σ (e.g., the set of ASCII characters), the edit distance d(a, b) is the minimum-weight sequence of edit operations (insertion, deletion, substitution) that transforms a into b.
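As a concrete reference, the short sketch below computes Precision, Recall, and F-measure directly from the definitions above, assuming a run and the ground truth are given as sets of file identifiers; it is an illustration, not the official evaluation code.

```python
# Precision, Recall and F-measure computed from the definitions above.
# Illustration only; `predicted` and `actual` are hypothetical sets of file ids.

def precision_recall_f(predicted, actual):
    """predicted: files a run flags as altered/stego; actual: ground-truth positives."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Example: a run flags 400 files, 380 of which are truly stego, out of 500 stego
# files in total: precision = 380/400 = 0.95, recall = 380/500 = 0.76.
```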
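For subtask 3, a character-level Levenshtein distance of this kind is what the definition above describes. The sketch below is a standard two-row dynamic-programming implementation with unit costs; again, it is an illustration rather than the official scoring script.

```python
# Levenshtein (edit) distance with unit costs for insertion, deletion, substitution.
# Standard dynamic programming over two rows; illustration only.

def edit_distance(a, b):
    previous = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

# edit_distance("kitten", "sitting") == 3
```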
5 Challenge Submissions

This section presents the results achieved by the participants in the three subtasks. Table 2 contains the results of subtask 1, Table 3 the results of subtask 2, and Table 4 the results of subtask 3.

5.1 Results for subtask 1

Six runs were submitted by four groups to this subtask. Table 2 shows the details of the results, together with the correspondence between run IDs and participant names, while Figure 3 summarizes the F-measure, Precision, and Recall per run.

Table 2. Runs summary table for subtask 1.

Rank  runID  Participant           F-measure  Precision  Recall
1     26850  UA.PT_Bioinformatics  1.000      1.000      1.000
2     26738  nattochaduke          1.000      1.000      1.000
3     26737  nattochaduke          1.000      1.000      1.000
4     26735  agentili              1.000      1.000      1.000
5     26994  abcrowdai             0.748      0.798      0.703
6     26954  abcrowdai             0.538      0.756      0.417

Figure 3. F-measure, Precision and Recall per submitted runID for subtask 1.

5.2 Results for subtask 2

Twenty-six runs were submitted by six groups to this subtask. Table 3 shows the details of the results, together with the correspondence between run IDs and participant names, while Figure 4 summarizes the F-measure, Precision, and Recall per run.

Table 3. Runs summary table for subtask 2.

Rank  runID  Participant           F-measure  Precision  Recall
1     26934  UA.PT_Bioinformatics  1.000      1.000      1.000
2     26929  UA.PT_Bioinformatics  0.986      1.000      0.972
3     26932  UA.PT_Bioinformatics  0.980      0.980      0.980
4     26930  UA.PT_Bioinformatics  0.965      0.939      0.992
5     26867  UA.PT_Bioinformatics  0.945      0.996      0.900
6     26871  UA.PT_Bioinformatics  0.933      0.891      0.980
7     26864  UA.PT_Bioinformatics  0.933      0.874      1.000
8     26868  UA.PT_Bioinformatics  0.932      1.000      0.872
9     26816  agentili              0.888      0.908      0.868
10    26830  nattochaduke          0.660      0.508      0.944
11    26844  Yasser                0.626      0.524      0.776
12    26876  Yasser                0.625      0.537      0.748
13    26825  Yasser                0.614      0.529      0.732
14    26842  Yasser                0.613      0.518      0.752
15    26817  nattochaduke          0.613      0.473      0.872
16    26771  nattochaduke          0.613      0.479      0.852
17    26951  Yasser                0.599      0.542      0.668
18    26950  Yasser                0.599      0.542      0.668
19    26948  Yasser                0.587      0.538      0.644
20    26949  Yasser                0.585      0.525      0.660
21    26885  Yasser                0.576      0.506      0.668
22    26952  Yasser                0.574      0.508      0.660
23    26787  nattochaduke          0.529      0.542      0.516
24    26910  abcrowdai             0.525      0.467      0.600
25    27454  cen_amrita            0.438      0.422      0.456
26    26770  nattochaduke          0.243      0.673      0.148

Figure 4. F-measure, Precision and Recall per submitted runID for subtask 2.

5.3 Results for subtask 3

Eleven runs were submitted by two groups to this subtask. Table 4 shows the details of the results, together with the correspondence between run IDs and participant names, while Figure 5 summarizes the edit (Levenshtein) distance per run.

Table 4. Runs summary table for subtask 3.

Rank  runID  Participant           Edit distance
1     27447  UA.PT_Bioinformatics  0.59782861
2     26933  UA.PT_Bioinformatics  0.59558861
3     27162  UA.PT_Bioinformatics  0.588343826
4     27438  UA.PT_Bioinformatics  0.587247762
5     26904  UA.PT_Bioinformatics  0.586426775
6     26898  UA.PT_Bioinformatics  0.571236169
7     26896  João Rafael Almeida   0.563379028
8     26899  UA.PT_Bioinformatics  0.529075304
9     27446  UA.PT_Bioinformatics  0.293547989
10    27445  UA.PT_Bioinformatics  0.27119247
11    26869  João Rafael Almeida   0.083585804

Figure 5. Edit distance per submitted runID for subtask 3.

6 Discussion and Conclusions

The security task was introduced in ImageCLEF 2019. The number of registered teams/individuals and the number of submitted runs showed that the security challenges receive significant attention and that they are interesting and challenging. Most participants registered for all three subtasks, although this was not mandatory, which highlights the importance of each subtask. The majority of the approaches exploited and combined deep learning techniques, achieving very good results. The third subtask, in which the participants had to retrieve hidden messages from the images, was the most challenging one. Its results also showed that there is room for improvement, as more advanced techniques are needed to achieve better results. The analysis of the results of this subtask indicates that the training set was small for the specific problem, i.e., the extraction of the hidden messages. To leverage the power of advanced deep learning algorithms towards improving the state of the art in steganalysis, we plan to increase the data set. We also plan to narrow down the scope of the challenges, e.g., to focus on steganalysis, possibly in another domain.

References
1. Ionescu, B., Müller, H., Péteri, R., Cid, Y.D., Liauchuk, V., Kovalev, V., Klimuk, D., Tarasau, A., Abacha, A.B., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Garcia, N., Kavallieratou, E., del Blanco, C.R., Rodríguez, C.C., Vasillopoulos, N., Karampidis, K., Chamberlain, J., Clark, A., Campello, A.: ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), Lecture Notes in Computer Science (LNCS), Springer, Lugano, Switzerland (September 9-12, 2019).
2. Kalpathy-Cramer, J., García Seco de Herrera, A., Demner-Fushman, D., Antani, S., Berick, S., Müller, H.: Evaluating performance of biomedical image retrieval systems: Overview of the medical image retrieval task at ImageCLEF 2004–2014. Computerized Medical Imaging and Graphics 39(0), 55–61 (2015).
3. Clough, P., Müller, H., Sanderson, M.: The CLEF 2004 cross-language image retrieval track. In: Peters, C., Clough, P., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B. (eds.): Multilingual Information Access for Text, Speech and Images: Results of the Fifth CLEF Evaluation Campaign. Lecture Notes in Computer Science (LNCS), vol. 3491, pp. 597–613. Springer, Bath, UK (2005).
4. Caputo, B., Müller, H., Thomee, B., Villegas, M., Paredes, R., Zellhofer, D., Goeau, H., Joly, A., Bonnet, P., Gomez, J.M., et al.: ImageCLEF 2013: the vision, the data and the open challenges. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 250–268. Springer (2013).
5. Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C., Cid, Y.D., Liauchuk, V., Kovalev, V., Ben Abacha, A., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Pelka, O., Friedrich, C.M., Chamberlain, J., Clark, A., García, A., García, N., Kavallieratou, E., del Blanco, C.R., Cuevas, C., Vasillopoulos, N., Karampidis, K.: ImageCLEF 2019: Multimedia retrieval in lifelogging, medical, nature, and security applications. In: 41st European Conference on IR Research (ECIR 2019), Cologne, Germany, pp. 301–308 (April 14-18, 2019).
6. Karampidis, K., Kavallieratou, E., Papadourakis, G.: A review of image steganalysis techniques for digital forensics. J. Inf. Secur. Appl. 40, 217–235 (2018).