A Weighted Rule-Based Model for File Forgery Detection: UA.PT Bioinformatics at ImageCLEF 2019

João Rafael Almeida¹,² [0000-0003-0729-2264], Pedro Freire¹ [0000-0001-5663-3403], Olga Fajarda¹ [0000-0003-1957-4947], and José Luís Oliveira¹ [0000-0002-6672-6176]

¹ DETI / IEETA, University of Aveiro, Aveiro, Portugal
{joao.rafael.almeida, pedrofreire, olga.oliveira, jlo}@ua.pt
² Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain

Abstract. With today's digital technology, disparate kinds of data can be easily manipulated. Forgery commonly hides information by altering files' extensions, files' signatures, or by using steganography. Consequently, digital forensic examiners are faced with new problems in the detection of these forged files. The lack of automated approaches to discover these infractions encourages researchers to explore new computational solutions that can help identify them. This paper describes the methodologies used in the ImageCLEFsecurity 2019 challenge, which were mainly rule-based models. The rules, and all the underlying mechanisms created for each task, are described. For the third task, a random forest algorithm was used due to the poor performance of these rules.

Keywords: ImageCLEF · File Forgery Detection · Rule-based models

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

The ImageCLEF [7] initiative launched a new security challenge, called ImageCLEFsecurity, addressing the problem of automatically identifying forged files and stego images [10]. This challenge is divided into three sub-tasks: 1) forged file discovery, 2) stego image discovery, and 3) secret message discovery. The forged file discovery sub-task is the first task of the challenge and is independent from the remaining two tasks. The goal of this task is to automatically detect files whose extension and signature have been altered; more specifically, to identify files with a PDF extension that are actually image files (JPG, PNG, or GIF). The objective of the second sub-task is to identify the images that hide steganographic content, and the goal of the third task is to retrieve these hidden messages.

In this paper, we present the several approaches that we used to address this challenge. The main solution is based on an orchestration of specialised rule-based models. For each model, a set of rules was defined with the purpose of identifying a specific file or message. Additionally, when there were insufficient rules to provide a good result, other complementary strategies were combined, namely a random forest classifier.

2 Materials and Methods

For each task, a training set and a test set were provided. The training set of the first task is composed of 2400 files, 1200 of which are PDF files. The remaining files, despite having a PDF extension, belong to one of three classes: JPEG, PNG, and GIF, each with 400 files. In the second and third tasks, the training sets include 1000 JPEG images, 500 of which are stego images and the others are clean images. In the case of the third task, the stego images contain five different text messages. Regarding the test sets, the first task comprises 1000 files, and the second and third tasks are composed of 500 images each.

In this section, we present the five methods that were used to solve each task of the challenge.

2.1 Rule-Based Approach

A typical rule-based system is constructed through a set of if-then rules [14] which help identify conditions and patterns in the problem domain. However, the use of simple conditions may not be enough to obtain the best results. Sometimes, to accomplish a more accurate outcome, those rules need to be balanced with weights. The subject of rule weights in fuzzy rule-based models for classification is not new, and its positive effect has already been proven [8].

We propose a weighted rule-based system with a set of models (figure 1), each specialised in classifying a specific entry. Each model generates a confidence score regarding the match of the received input with its conditions. The orchestrator collects all the results and chooses the model that gives the best score. When more than one model gives similarly good confidence scores for different classes, the weights of the rules are readjusted and a new classification cycle is performed to help separate the classes' scores. These readjustments will, hopefully, allow the right output to stand out.

Fig. 1: Proposed rule-based system.

The rules and the weights of the rules are specific to each problem and scenario. Therefore, we used this approach as our base method for all the tasks. The rules and the methods to classify the rules are specified in section 3.
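The paper does not include code for this orchestration, so the following is only a minimal sketch of the idea in Python. The class names (Rule, Model, Orchestrator), the tie margin, and the weight-readjustment policy are our own placeholders, since these details are not specified in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A single condition with an associated weight (both hypothetical here)."""
    condition: Callable[[bytes], bool]
    weight: float


class Model:
    """Scores how well an input matches one target class (e.g. one file type)."""

    def __init__(self, label: str, rules: List[Rule]):
        self.label = label
        self.rules = rules

    def score(self, data: bytes) -> float:
        # Confidence is the sum of the weights of the rules that fire.
        return sum(r.weight for r in self.rules if r.condition(data))


class Orchestrator:
    """Collects the scores of all models and picks the best one."""

    def __init__(self, models: List[Model], margin: float = 0.05):
        self.models = models
        self.margin = margin  # scores closer than this are treated as a tie

    def classify(self, data: bytes, max_cycles: int = 3) -> str:
        best = (self.models[0].label, 0.0)
        for _ in range(max_cycles):
            scores: Dict[str, float] = {m.label: m.score(data) for m in self.models}
            ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
            best = ranked[0]
            if len(ranked) == 1 or best[1] - ranked[1][1] > self.margin:
                return best[0]
            # Tie: readjust the weights slightly and run another cycle, hoping
            # the right class stands out. The actual readjustment policy is not
            # described in the paper; this multiplication is only a placeholder.
            for model in self.models:
                if model.label == best[0]:
                    for rule in model.rules:
                        rule.weight *= 1.05
        return best[0]


# Hypothetical usage: each specialised model would hold the task-specific
# rules and weights described in section 3.
# orchestrator = Orchestrator([pdf_model, jpg_model, png_model, gif_model])
# label = orchestrator.classify(open("suspect.pdf", "rb").read())
```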
2.2 Image Distortion Pattern Matching

Steganographic techniques permit hiding, within an image, information that should be perceptually and statistically undetectable [2,11]. However, some of the tools that implement these techniques, such as Jsteg, Outguess, F5, JPHide, and Steghide, may not respect these two principles entirely. These tools use the least significant bit (LSB) insertion technique and distort the fidelity of the cover image by choosing the quantized DCT coefficients as their concealment locations [11]. Our approach aims to exploit the flaws of the method used by searching for a common pattern among all the stego images.

While scanning the training set of the second task for common patterns, we observed that several stego images had a distortion pattern of 8x8 pixels that could easily be identified with the naked eye, as shown in figures 2 and 3.

Fig. 2: Distortion patterns observed on a training image of task 2 (with a zoom of 670%).
Fig. 3: Pixel block of 8x8 pixels with the distortion pattern.

Taking figure 3 as reference, the identified pattern can be described by the following relations between pixels, where P(x, y) represents the mean value of R, G, and B at position (x, y):

P(x3, y3) = P(x3, y4) = P(x4, y3) = P(x4, y4)
P(x2, y3) = P(x2, y4) = P(x5, y3) = P(x5, y4) = P(x3, y2) = P(x4, y2) = P(x3, y5) = P(x4, y5)
P(x2, y2) = P(x2, y5) = P(x5, y2) = P(x5, y5)
P(x3, y3) > P(x3, y2)
P(x3, y2) > P(x2, y2)

We created a function that scans an image for this pattern and counts the number of occurrences. The function determines that a given image carries a message if the number of patterns found is greater than a specified threshold. Its output was used as a parameter of our weighted rule-based model.
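As an illustration of this detector, the sketch below checks the relations above on each 8x8 block and counts the matching blocks. It is our own reconstruction, assuming the image is loaded with Pillow and NumPy and that x2..x5 and y2..y5 map to indices 2..5 of the block; the paper later reports flagging an image when at least one block matches.

```python
import numpy as np
from PIL import Image  # assumption: images are loaded with Pillow


def block_has_pattern(block: np.ndarray) -> bool:
    """Check the distortion relations on one 8x8 block of per-pixel RGB means."""
    def p(x, y):
        return block[y, x]

    inner = [p(3, 3), p(3, 4), p(4, 3), p(4, 4)]
    edge = [p(2, 3), p(2, 4), p(5, 3), p(5, 4),
            p(3, 2), p(4, 2), p(3, 5), p(4, 5)]
    corner = [p(2, 2), p(2, 5), p(5, 2), p(5, 5)]
    return (len(set(inner)) == 1 and
            len(set(edge)) == 1 and
            len(set(corner)) == 1 and
            inner[0] > edge[0] > corner[0])


def count_pattern_blocks(path: str) -> int:
    """Slide over the image in 8x8 steps and count blocks matching the pattern."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64)
    means = rgb.mean(axis=2)  # per-pixel mean of R, G and B
    h, w = means.shape
    count = 0
    for y in range(0, h - 7, 8):
        for x in range(0, w - 7, 8):
            if block_has_pattern(means[y:y + 8, x:x + 8]):
                count += 1
    return count


# Hypothetical usage: flag the image when at least one block matches, which is
# the threshold reported later in the paper for this detector.
# is_suspect = count_pattern_blocks("image.jpg") >= 1
```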
2.3 Image Metadata Pattern Searcher

Another approach used to assist the creation of rules for our rule-based model was a pre-analysis of the files' metadata. This analysis aimed to discover patterns that could be used as rules in the model.

For instance, in the JPEG images of the training set of the second task, a set of bits was detected which identified the images with a hidden message. The pattern search was mainly done on the metadata, ignoring the image bitstream. The rationale was that the altered files could have been signed in the metadata to quickly identify which files are of interest. Such a simple signature would go unnoticed and would speed up the decoding procedure.

2.4 Random Forests for Rule Definition

Random forest is a supervised learning algorithm developed by Breiman [3], who was influenced by the work of Amit and Geman [1]. This algorithm constructs a large number of decision trees, considering a random subset of features for each split, and makes a prediction by combining the predictions of the different decision trees. Caruana and Niculescu-Mizil [4] performed an empirical comparison of supervised learning algorithms and concluded that random forest was one of the algorithms that gave the best average performance.

The random forest algorithm has several characteristics that make it attractive for this challenge [5]: it can be used for high-dimensional problems and it gives an estimation of the importance of variables. Moreover, it needs only two parameters: the number of trees in the forest (ntree) and the number of features in the random subset used at each split (mtry). We used the random forest algorithm to help solve the third task, and our implementation is described in section 3.

2.5 Manual Tuning

In large data sets, manual classification is unrealistic. However, since the training and test sets are small, we decided to try a manual validation. This approach consists mainly of a verification of the rule-based output, followed by manual adjustments considered relevant.

This method was essentially used in the second task. When analysing the training set, the 8x8 pixel distortion pattern described in section 2.2 was identified. Using the rule-based model in the early stages made it possible to define rules that reached a precision of 1. However, the recall was low. Therefore, we isolated the images not detected as forged and tried to identify these distortions in the images manually. This procedure increased our recall significantly.

3 Results

The described methods were combined and led to several submissions in the different tasks. The performance of the submissions was evaluated using the traditional metrics (precision, recall, and F1) in the first two tasks, and the edit distance in the third task.

In the first task, precision was defined as the quotient between the number of altered images correctly detected and the total number of files identified as changed. In turn, recall was the quotient between the number of altered images correctly detected and the total number of modified files.

For the second task, the definitions of precision and recall were similar. Precision was the quotient between the number of images with hidden messages correctly detected and the total number of images identified as having hidden messages. Recall was the quotient between the number of images with hidden messages correctly detected and the total number of images with hidden messages.

Finally, the third task used the edit distance to measure how accurately the message was recovered. This distance is the minimum-weight count of edit operations that transforms one string into another.
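Stated compactly, these verbal definitions correspond to the standard formulas, where TP, FP, and FN denote, respectively, the true positives, false positives, and false negatives with respect to the forged (or stego) class:

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```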
3.1 Task 1: Identify Forged Images

Detecting the type of a file is a process that can be done using three different file characteristics: the file's extension, the magic number, and the file's content [9]. The most straightforward technique to hide a file is to change the file's extension and the magic number, which is a set of standard bytes that signs the file. With this technique, the operating system is unable to open the file.

Therefore, four models are built, each one specialised in identifying a file type (PDF, PNG, JPG, or GIF). Each model produces a score reflecting the confidence that the analysed file is of the given type. These scores are sent to the orchestrator, which classifies the type of file based on the scores received.

The initial approach considered standard flags in the file structure, such as the last bytes or the number of occurrences of a set of bytes. For instance, a JPEG file has the hexadecimal value 0xFFDA at least once in its structure, because this is the flag that indicates where the image binary starts. Table 1 presents the end-of-file flags for each file type. For this first task, we used the rule of identifying the end-of-file flag and obtained an F1 measure of 1.0.

Table 1: Rule used in task one.

File format   End of file (hexadecimal values)
GIF           0x3B
JPG           0xFFD9
PNG           0x49454E44AE426082
PDF           0x454F46
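This rule amounts to comparing the trailing bytes of each file with the signatures of Table 1. The snippet below is a minimal sketch of that comparison, not the authors' implementation; the trailer values are taken directly from Table 1, and the stripping of trailing whitespace is our own assumption about files that end with a newline.

```python
# Trailer signatures from Table 1 (end-of-file flags per format).
EOF_FLAGS = {
    "GIF": bytes.fromhex("3B"),
    "JPG": bytes.fromhex("FFD9"),
    "PNG": bytes.fromhex("49454E44AE426082"),
    "PDF": bytes.fromhex("454F46"),
}


def detect_real_type(path: str) -> str:
    """Return the format whose end-of-file flag matches the file's last bytes."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Ignore trailing whitespace, which some writers append after the flag.
    data = data.rstrip(b"\r\n\t ")
    for file_type, flag in EOF_FLAGS.items():
        if data.endswith(flag):
            return file_type
    return "UNKNOWN"
```

In the paper's system, this check corresponds to the end-of-file rule evaluated by each of the four specialised models before the orchestrator picks the winning type.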
3.2 Task 2: Identify Stego Images

In JPEG images there are two different stages of compression: lossy and lossless. Embedding steganographic content in the lossy stage increases the possibility of that content being partly lost, which means it is not feasible to hide a message there. Therefore, in the case of JPEG files, the steganography should take place in the lossless stage. The DCT and the quantization phase form part of the lossy stage, while the Huffman encoding used to further compress the data is lossless [13].

After scanning the training set of the second task for common parts that distinguish the binary data of stego images from that of non-stego images, it was possible to observe some patterns and specify the weights of the rules that identify stego images. The patterns and their weights are described in table 2. We could observe that all the patterns appeared in the Huffman table sequences, i.e. after the marker DHT. The sequences in a JPEG file are identified by a two-byte code, described in table 3.

Table 2: Weighted rules used in Task 2 for the first submission.

Rule  Searched on  Hex matching value                                                                 Weight
1     DHT          0x3435363738393a434445464748494a535455565758595a636465666768696a737475767778797a  0.35
2     DHT          0x35363738393a434445464748494a535455565758595a636465666768696a737475767778797a    0.35
3     DHT          0xf2f3f4f5f6f7f8f9fa                                                               0.30

Table 3: Common JPEG file markers [6].

Marker id  Short value  Description
SOI        0xFFD8       start of image
APPn       0xFFEn       application data
DQT        0xFFDB       quantization tables
DHT        0xFFC4       Huffman tables
SOF        0xFFCn       start of frame
SOS        0xFFDA       start of scan
DRI        0xFFDD       restart interval
RSTn       0xFFDn       restart
COM        0xFFFE       comment
EOI        0xFFD9       end of image

This task was also solved using the rule-based model, where the images with a score equal to or greater than 0.70 were considered altered. However, in this case, different strategies and ways to extract information from the images were combined. Initially, a metadata pattern searcher was developed to compare the content of the metadata fields. From this analysis, in all the images with a hidden message, the set of bits represented in the rules displayed in table 2 was found.

However, these rules produced several false positives, achieving, on the training set, a recall of 1.0 and a precision of 0.75. Due to the lack of precision, we attempted to identify the distortion pattern in the images, using the method described in section 2.2. Without the rules used in the first approach, and using a threshold of at least one pixel block with the distortion, this method produced a precision of 0.53 and a recall of 0.60 on the training set. This was also the best score obtained from all runs using this approach in isolation. Then, to increase the precision, the decision was made to combine the rules of the first approach with this analyser. The image analyser was only used when an image was classified by the first approach as having a hidden message. This decreased the recall to 0.604, while the precision remained at 0.75, the best precision result so far. The decrease in recall and the poor results in isolated scenarios led to the abandonment of this approach and to a focus on ways to increase the quality of the rules. At this stage, a submission was made, obtaining an F1 measure of 0.933 and a precision of 0.874.

In the next attempts, some manual rectifications were made to the output retrieved from the rule-based system, by observing the images classified as having a message. Some submissions were made following only the rule-based approach mixed with the manual validation, and the best result was 0.98, both for F1 and precision.

As a last attempt, the decision was made to re-run the metadata pattern searcher in a more thorough way. This time, it analysed all the metadata as binary, ignoring the fields and their content. With these changes, the method found a new pattern, which produced a new rule, represented in table 4, and made it possible to achieve an F1 measure of 1 on the training and test sets.

Table 4: Final weighted rules used in Task 2.

Rule  Searched on  Hex matching value                                                                 Weight
1     DHT          0x3435363738393a434445464748494a535455565758595a636465666768696a737475767778797a  0.35
2     DHT          0x35363738393a434445464748494a535455565758595a636465666768696a737475767778797a    0.15
3     DHT          0xf2f3f4f5f6f7f8f9fa                                                               0.15
4     APP0         0x00600060                                                                         0.35
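The scoring described above can be sketched as follows: collect the DHT and APP0 segments of the JPEG, test each hex pattern of Table 4 against the corresponding segments, add up the weights of the matching rules, and flag the image when the total reaches the 0.70 threshold. The code below is our own reconstruction with a deliberately small JPEG segment walker; only the rule table itself comes from the paper (Table 4).

```python
from typing import Dict, List

# Final weighted rules from Table 4: (marker name, hex pattern, weight).
RULES = [
    ("DHT", "3435363738393a434445464748494a535455565758595a"
            "636465666768696a737475767778797a", 0.35),
    ("DHT", "35363738393a434445464748494a535455565758595a"
            "636465666768696a737475767778797a", 0.15),
    ("DHT", "f2f3f4f5f6f7f8f9fa", 0.15),
    ("APP0", "00600060", 0.35),
]

MARKERS = {"DHT": 0xC4, "APP0": 0xE0}  # second byte of the 0xFFxx marker code
THRESHOLD = 0.70


def segments(data: bytes) -> Dict[str, List[bytes]]:
    """Very small JPEG walker: collect the payloads of DHT and APP0 segments."""
    found: Dict[str, List[bytes]] = {name: [] for name in MARKERS}
    i = 2  # skip the SOI marker (0xFFD8)
    while i + 4 <= len(data) and data[i] == 0xFF:
        marker = data[i + 1]
        length = int.from_bytes(data[i + 2:i + 4], "big")
        if marker == 0xDA:  # start of scan: entropy-coded data follows, stop
            break
        for name, code in MARKERS.items():
            if marker == code:
                found[name].append(data[i + 4:i + 2 + length])
        i += 2 + length
    return found


def stego_score(path: str) -> float:
    """Sum the weights of the rules whose pattern occurs in the right segment."""
    with open(path, "rb") as handle:
        segs = segments(handle.read())
    score = 0.0
    for marker, pattern, weight in RULES:
        needle = bytes.fromhex(pattern)
        if any(needle in payload for payload in segs[marker]):
            score += weight
    return score


# Hypothetical usage: an image is reported as a stego image when
# stego_score("image.jpg") >= THRESHOLD.
```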
3.3 Task 3: Retrieve the Message

The goal of the third task was to retrieve the hidden text messages from the stego images. To address this task, freely available steganographic tools were initially used, namely Hide'N'Send, Image Steganography³, QuickStego, SSuite Picsel⁴, Steghide⁵, and SteganPEG [15]. However, none of these tools was able to retrieve the hidden text messages.

³ https://incoherency.co.uk/image-steganography/
⁴ https://www.ssuiteoffice.com/software/ssuitepicselsecurity.htm
⁵ http://steghide.sourceforge.net/

In our second approach, the RGB matrix was analysed. First, each colour component was individually inspected, examining the least and the two least significant bits, in order to detect whether they could compose ASCII codes of letters of the alphabet, more precisely, ASCII characters in the ranges 65 to 90 and 97 to 122. As these procedures did not reveal a pattern for the stored messages, the decision was made to look at the pixel as a whole, inspecting the three colour components combined. We also tried to use the two least significant bits of the four pixels at the centre of each 8x8 pixel block.

Since the second approach could not retrieve the hidden messages from the image files, an attempt was made to find a pattern using the DCT matrix, by inspecting the least and the two least significant bits of the value in the first cell of each 8x8 block. Changing the least or the two least significant bits of these values would create a small change in the block's brightness, which would explain the distortion identified in the 8x8 pixel blocks.

None of the procedures described so far could find a pattern in the images of the training set with the same hidden text message. Therefore, the random forest algorithm was used in an attempt to find a pattern in the binary of the image files. The 500 stego images of the training set have, as hidden message, one of five messages. Consequently, the next step was to consider a multiclass classification problem, which consists in classifying each stego image into one of the five messages.

Initially, for our first model, we used as features the frequency, in percentage, with which each ASCII character appears in the binary of the image files. For the second model, in addition to the features used in the first model, the percentages of 0s and 1s in the binary of the image files were used. To train the models, we used the R package caret [12], and cross-validation was used to choose the optimal values of the parameters ntree and mtry. The performance of the models was evaluated using 10-fold cross-validation. Table 5 presents the parameters used and the accuracy of each model.

Table 5: Random forest models' parameters and results.

Model          ntree  mtry  Accuracy
First model    247    82    31.57%
Second model   374    59    31.33%

From these results, we conclude that the random forest models were unable to find a pattern in the binary of the image files. In the absence of alternatives, the two random forest models were used to classify the image files of the test set, despite the fact that this task was not a classification problem. The first step was to use the rule-based model defined in the second task to identify the stego images of the test set and, subsequently, to use the random forest models to classify the images identified as containing a hidden message. For the submissions, all the images in the test set had to have a string assigned, and therefore an empty string was assigned to the images identified as having no hidden message. Using the first model, an edit distance of 0.588 was obtained, and using the second model an edit distance of 0.587. Our best edit distance (0.598) was achieved by assigning the string "name John Fraud" to all images we identified as stego images and an empty string to the other images of the test set. These results reflect the fact that we could correctly identify the images with no hidden messages.
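The models above were trained with the R caret package; purely as an illustration, the sketch below reproduces the same feature construction (per-byte frequencies, optionally plus the share of 0 and 1 bits) and a random forest fit in Python with scikit-learn. The folder layout is hypothetical, ntree and mtry are mapped approximately onto n_estimators and max_features, and reading "each ASCII character" as all 256 byte values is our assumption.

```python
import glob

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def byte_features(path: str, with_bit_share: bool = False) -> np.ndarray:
    """Frequency (%) of each byte value, optionally plus the share of 0/1 bits.

    The paper's first feature set is the percentage of each ASCII character in
    the file binary; counting all 256 byte values is our reading of that.
    """
    data = np.fromfile(path, dtype=np.uint8)
    counts = np.bincount(data, minlength=256).astype(np.float64)
    features = 100.0 * counts / counts.sum()
    if with_bit_share:  # extra features of the second model
        ones = np.unpackbits(data).mean()  # fraction of 1 bits
        features = np.append(features, [100.0 * (1.0 - ones), 100.0 * ones])
    return features


# Hypothetical training layout: one folder per hidden message (five classes).
paths, labels = [], []
for label, folder in enumerate(sorted(glob.glob("train/message_*"))):
    for path in glob.glob(f"{folder}/*.jpg"):
        paths.append(path)
        labels.append(label)

X = np.vstack([byte_features(p) for p in paths])  # first model's features
y = np.array(labels)

# ntree/mtry roughly map to n_estimators/max_features; 247 and 82 are the
# values reported for the first model in Table 5 (tuned there with caret).
model = RandomForestClassifier(n_estimators=247, max_features=82, random_state=0)
print(cross_val_score(model, X, y, cv=10).mean())  # 10-fold CV accuracy
```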
4 Conclusion

This paper presents the methodologies used in the ImageCLEFsecurity challenge to identify forged files and steganographic images. These methods were developed specifically for the tasks of this challenge, which does not prevent them from being used on other data sets. In the first task, an F1 measure of 1 was obtained. This excellent result was accomplished mainly because the changes made to the files were only the traditional ones, and with simple rules it was possible to identify each type. The second task also had a submission with an F1 measure of 1. In this case, we could identify a signature in the altered images. On the other hand, in the third task, the best submission had an edit distance of 0.598, mainly due to the success in identifying empty strings, i.e., images without a message.

The proposed methodology works if it is possible to define the right rules. The problem in this task was the difficulty of finding the steganographic algorithm used. This challenge allowed for the identification of problems in the developed approach and, most importantly, of ways to improve some of these issues. Future work originating from this year's participation could be the creation of a rule generator to feed the rule-based models. The message identification task may be improved by creating a database of strategies used by steganographic attackers, combined with machine learning approaches that look into neighbouring pixel colours.

Acknowledgements

This work was supported by the projects NETDIAMOND (POCI-01-0145-FEDER-016385) and SOCA (CENTRO-01-0145-FEDER-000010), co-funded by the Centro 2020 programme, Portugal 2020, and the European Union.

References

1. Amit, Y., Geman, D.: Shape quantization and recognition with randomized trees. Neural Computation 9(7), 1545–1588 (1997)
2. Attaby, A.A., Ahmed, M.F.M., Alsammak, A.K.: Data hiding inside JPEG images with high resistance to steganalysis using a novel technique: DCT-M3. Ain Shams Engineering Journal (2017)
3. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
4. Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 161–168. ACM (2006)
5. Cutler, A., Cutler, D.R., Stevens, J.R.: Random forests. In: Ensemble Machine Learning, pp. 157–175. Springer (2012)
6. Gloe, T.: Forensic analysis of ordered data structures on the example of JPEG files. In: 2012 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 139–144. IEEE (2012)
7. Ionescu, B., Müller, H., Péteri, R., Cid, Y.D., Liauchuk, V., Kovalev, V., Klimuk, D., Tarasau, A., Abacha, A.B., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Garcia, N., Kavallieratou, E., del Blanco, C.R., Rodríguez, C.C., Vasillopoulos, N., Karampidis, K., Chamberlain, J., Clark, A., Campello, A.: ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science, Springer, Lugano, Switzerland (September 9-12, 2019)
8. Ishibuchi, H., Nakashima, T.: Effect of rule weights in fuzzy rule-based classification systems. IEEE Transactions on Fuzzy Systems 9(4), 506–515 (2001)
9. Karampidis, K., Papadourakis, G.: File type identification for digital forensics. In: International Conference on Advanced Information Systems Engineering, pp. 266–274. Springer (2016)
10. Karampidis, K., Vasillopoulos, N., Cuevas Rodríguez, C., del Blanco, C.R., Kavallieratou, E., Garcia, N.: Overview of the ImageCLEFsecurity 2019 task. In: CLEF 2019 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-2380/ (2019)
11. Khalid, S.K.A., Deris, M.M., Mohamad, K.M.: A steganographic technique for highly compressed JPEG images. In: The Second International Conference on Informatics Engineering & Information Science (ICIEIS2013), pp. 107–118 (2013)
12. Kuhn, M., et al.: Building predictive models in R using the caret package. Journal of Statistical Software 28(5), 1–26 (2008)
13. Kumari, M., Khare, A., Khare, P.: JPEG compression steganography & cryptography using image-adaptation technique. Journal of Advances in Information Technology 1(3), 141–145 (2010)
14. Liu, H., Gegov, A., Cocea, M.: Rule-based systems: a granular computing perspective. Granular Computing 1(4), 259–274 (2016)
15. Reddy, V.L., Subramanyam, A., Reddy, P.C.: SteganPEG: Steganography + JPEG. In: 2011 International Conference on Ubiquitous Computing and Multimedia Applications, pp. 42–48. IEEE (2011)