Forged File Detection and Steganographic content Identification (FFDASCI) using Deep Learning Techniques Dr. M. Srinivas, Akshay Nayak, and Abhishek Bhatt National Institute of Technology Warangal Telangana, India msv@nitw.ac.in,anayak@student.nitw.ac.in and abhishekbhatt900@gmail.com Abstract. This paper presents our contribution in the identification and detection of Forged files and Steganographic content using Deep Neural Networks like Convolutional Neural Network and 3D-RESNET. We have used CNN in our research as CNN’s are inspired by visual cor- tex. In other words, they are designed to extract consequential features which are relevant in classification i.e. the ones which minimizes the loss function. In this the kernel weights are learned by Gradient Descent so as to generate the perceptive features from images fed to the network which in result supplemented to fully connected layer that performs the final classification task. In our proposed approach we mainly consider the two different tasks. Firstly, Identification of Forged Images has been carried out in which detection of altered images which includes both extension and signature has been performed. In addition to this, we have predicted the original epitome of forged file by using convolutional neural network model which automatically classify them and are useful for large-scale image classification as it has increased ConvNet depth. Secondly, we have recognized the Steganographic content by applying 3D-RESNET. Here, we have given preference to Residual Networks in place of VGG16 as increasing the depth should increase the accuracy of network, as long as over-fitting is taken care of. In VGG16 increased depth is increasing the effect of vanishing gradient and degradation problem. In this work, ImageCLEF 2019 data set is used for identification of Forged Images and recognized the Steganographic content. Keywords: Transfer Learning · Optimizer · Activation Function · Loss Function · Adadelta · Categorical cross entropy · Down sampling · Mem- ory footprint. 1 Introduction Since the advancement of Internet, one of the important concerns has been the security of information. The creation of Cryptography has been made for secur- ing the secrecy of communication and many methods have been identified for encrypting and decrypting data in order to keep the message secret. In addition 2 M. Srinivas et al. of keeping the contents of the message secret, it may also be necessary to keep the existence of the message secret. Here, comes the Steganography which is being considered as the art and science of invisible communication. It has been very easy to conceal confidential information inside files. Steganography [1] is the practice of concealing files, messages, images and videos within another files, messages, images or videos. The word Steganography combines the Greek words stego meaning ”covered” and graphics meaning ”writing”. Images are one of the most usual and efficient cover media for hiding the data. There are various problems associated with file forgery which we are discussing in this paper are as follows. Firstly, the digital forensics are skipping many important and useful content during investigations. They are unknowingly treating various image files as pdf files due to the modification of those files by changing their extension to pdf format which is the major fallback in their investigation. Secondly, there is a chance of sharing of illegal information by hiding the criminal action from the plain sight and invisible those files in front of investigators. Traditional based features extraction techniques [2] are more complicated, not optimized one and lack of discriminative capacity for stego images. By using deep learning based features give high level semantic information and more discriminate [3]. This paper, reconciles the above-mentioned problems by using deep neural networks. Few years back, due to the less availability of required efficient data set, Machine Learning algorithms were very efficient with those sized data sets. The problem with it was of defining our own features that were to be learned by our model. It was Supervised Learning which made it even more complicated and complex. As the data set grows on and we have to implement on the sufficiently large amount of data set, all these models were not be able to perform well resulting in the emergence of Deep Learning field where the whole network is not fully connected type. Only the last few layers are fully connected layers as they are connected to every element of the input volume by reducing the extra number of Hyper-parameters. Deep Learning algorithms are working very well with the large data set applying supervised learning [4] where the model itself generates the weights and its feature vectors by training the fully connected layer. This is accomplished by making the kernel smaller than the input which means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. In this work, we use ImageCLEF 2019 challenging [5] data sets. In this data set contains various kind of task related to real time applications such as ImageCLEFCoral, ImageCLEFlifelog, ImageCLEFmedical and ImageCLEFsecurity. In this work, we select the ImageCLEFsecurity related task [6] for identification of Forged Images and have recognized the Steganographic content. 1.1 Previous Research Work There are few research works that have been performed and is going on for forged file detection and identification of steganographic content. Title Suppressed Due to Excessive Length 3 Forged File Detection Few research works have been performed in which the simplest and the naive method of looking only at file extensions. As there are various ways in which type of a files can be detected without even opening them. This is not very useful when one should have to detect large number of files since volume of files determine the detection speed. The use of extensions in file type detection is not efficient as extensions are easily spoofed and altered. It is easy to change the extension by several mouse clicks in various operating system. It is not necessary to open a file during classifying the files based on their extensions, in similar manner it is not necessary for to mislead these classification techniques. In open source OS like Linux, extensions are not required for the extension- based file type detection. This OS is allowing optional extensions of any string regardless of file type which results in hiding these files from an inexperienced administrator. File Type Detection using Magic Bytes is one of the most sophisticated method for file type detection. Magic Bytes are specific to binary files and rely on matching signatures which are varying in length in file headers or tails. Due to inadequate standards for the content in files, the new file type creators will include headers for uniquely identifying files of their type. For example, letters PK has been present at the beginning of every .zip file in order to identify the Zip format files as ZIP file format had been invented by PKWARE. This method is usually slower in checking the file extension as files are being opened for reading the small number of bytes for deciding the file format. If it matches the expected result then the given file is in specific format, else it is treated as suspicious. It works only on binary file types having magic bytes associated with them. This is the cons of magic bytes type detection when the person has to consider the risk associated with ignoring detection of ASCII based files. Steganographic Content Identification Active approach is performed for determining the steganographic images. Digital Images require pre-processing such as watermarking or signatures like fingerprinting are being generated during image creation. But this is not very efficient or useful in authenticating the image when the internet is not having the large number of water marked and digital signatured images. Passive approach is the most expedient forgery detection technique called as blind forgery detection. The blind name factor ascends as it uses the received image for the originality check without any modification in image at the time of creation and capture. Copy-Move Detection Technique and Slicing are another passive approach where in the CMDT a part of an image is copied and pasted in some other location within the same image and in Slicing one or more images get combined together to form a new image. But copying a part of an image or slicing an image are not useful in finding whether an image is hiding some text behind it. Some sample of stego and non stego images are show in Fig. 1 4 M. Srinivas et al. Fig. 1. Some of the stego and non-stego images samples. 1.2 Proposed Approach For detecting a forged file, a pdf file has been given as an input to the network for reading each and every byte [7] and the frequency of each byte is being counted. After that a histogram has been plotted using the frequency distribution data table [8]. In the second stage of our work, the resulting histogram is being supplemented to VGG16 [16] which have been trained on the custom data for classifying the file types. The resulting output of the model has been predicted the actual type of file supplemented to it. The files have been predicted as images and are being stored as dataset for the next stage of proposed model resulting in classifying the image as stego or non-stego after boosting the ResNet50 [13] with image classified files. Generally, ASCII values are varying in the range of [0, 255] resulting in the availability of 256 numbers of bins in range [0, 255]. The x-axis represents byte value and the y-axis represents the frequency of each byte value. The histogram could be generated in two formats. 1. Grey scale format [9] and 2. RGB format [9]. We have decided to go with RGB formatted histogram as it is convenient to amplify RGB image in VGG16 network as it takes the input images in three channels. Else, we have to convert the grey scaled images to an image having 3 channels containing the same pixels for each channel. The resulting histogram has been used in the proposed model for the file type identification can be analyzed by seeing the Fig 2, in which each file type has their own histogram representation. The histogram of each with varying file type, either may be of JPG, GIF, PNG or PDF format is different from each other. For classifying the actual file type, VGG16 [10] model plays an important role as an image classifier. We have re-trained the above mentioned VGG16 model with obtained histogram images for GIFs, JPGs, PDFs and PNGs using Transfer Learning which enables us to use pre-trained model by changing the output classifying layer. The neural net- work is generating ”weights” by training it on very large dataset. These extracted Title Suppressed Due to Excessive Length 5 Fig. 2. Different type of files corresponding Histogram images. (a) gif file (b) jpg file (c)pdf file (d) png file related histogram images weights then transferred to any other network which protect us from training the full network from scratch by transfering the conversant features. VGG16 is a deep convolutional neural network having different convolutional and pooling layers. After series of passage through convolutional and max pooling layer, the immediate output then supplemented to the next two fully connected layer hav- ing 4096 classes and the appearing output finally be augmented to the dense fully connected layer of four classes having SoftMax as an activation function. There are some layers present in between the convolutional and pooling layer which is called as ReLU playing an important role by only keeping the posi- tive values. The Dense layer which is being worked as fully connected layer has SoftMax as an activation function [11] having 4 classes at last. For file forgery detection histogram based features and the fine-tuned VGG16 model is in our approach method and have shown in Fig. 3 stage-I. After the classification of files as images using CNN based classification method, next we have crammed another network called ResNet50 [13] for to check the images are stego or non stego. We are supplementing our network with large dataset and for deep analysis we have considered ResNet50 as an important model having 50 layers for identifying the altered images hiding the steganographic content. ResNet50 has 25,583,592 trainable parameters and hav- 6 M. Srinivas et al. Stage -I 1. Gif 2. Jpg 3. Pdf 4. Png Pdf/File Histogram image database Classification results Convolutional Neural Network [12] 1. Stego images Images 2. Non stego images Classification Classification results Stage -II Convolutional Neural Network [12] Fig. 3. Block diagram of FFDASCI model ing 53,120 non-trainable parameters. We have trained the model with our custom dataset consisting of two types of image, one is stego images and the another is non-stego images. This time the network has been trained by the images itself instead of supplementing with the histograms of each type of images classifying the images into stego and non-stego. Fig. 3 is showing the complete architecture of the proposed model. 1.3 Dataset In this work, we used the dataset provided by ImageCLEFsecurity 2019 [6] con- taining of 9,000 images for three different kind of tasks. We have supplemented our Network model with a total of 2400 histogram images of first task. We have processed the images and have used RGB format in our model. We have trained the network with the customized dataset. The root folder named data contains four sub folders named GIF, JPG, PDF and PNG containing 400, 400, 1200 and 400 histogram images respectively which are representing each of the file type. We have divided the dataset in training and testing sets and 80% of the total data from dataset is used for training our model and the remaining 20% have been used for testing purpose. In second task, dataset for stego image classifi- cation consists of 1000 images dividing into 500 stego and 500 non-stego images indexed from 0001 02 to 1000 02. 1.4 Experimental Results In this work, the performance of the proposed system is evaluated by measuring classification accuracy. In our FFDASCI model we have considered “categorical crossentropy” [14] as the loss function, “adadelta” as an optimiser, “metrices” as Title Suppressed Due to Excessive Length 7 an accuracy standard and “SoftMax” as an activation function.For forgery detec- tion task we have trained the VGG16 network on our custom forgery detection dataset for 12 epochs with 32 as batch size. We have achieved 99.93% validation accuracy and have also tested with Support Vector Machine (SVM) classifier on same forgery detection task dataset. With SVM classifier we achieved 99.73% of validation accuracy and the accuracy results are shown in Table 1.4. Table I: Performance comparison of the proposed method with SVM classifier on forgery detection validation dataset. Method Accuracy (%) Proposed Method (VGG16) 99.93 SVM 99.73 By using SVM classifier we calculate the class wise classification performance. Table 1.4 shows the class wise accuracy for the file forgery detection using SVM classifier with histogram features. Table II: Class wise performance results of the SVM classifier on forgery detection validation dataset. Classes Precision Recall F1-score Support gif 1.00 1.00 1.00 69 jpg 0.99 1.00 0.99 83 pdf 1.00 1.00 1.00 138 png 1.00 0.99 0.99 79 For stego image classification task, we have trained the ResNet50 network for the same number of epochs and having batch size as similar of VGG16. We have achieved 99.9% validation accuracy with the proposed method. In this task, we have also used SVM classifier to classify the stego image and 93.5% classification results are being achieved. Table 1.4 shows the classification results of proposed method and SVM classification results On stego image data. Table III: Performance comparison of the proposed method with SVM classifier on stego image validation dataset. Method Accuracy (%) Proposed Method (ResNet50) 99.9 SVM 93.5 Table 1.4 shows the class wise accuracy for stego image discovery. We have been secured the testing accuracy of 99.90% accuracy from our proposed model. 8 M. Srinivas et al. Table IV: Class wise performance results of the SVM classifier on stego image validation dataset. Classes Precision Recall F1-score Support Non-Stego image 0.51 0.70 0.59 93 Stego image 0.62 0.42 0.50 107 1.5 Conclusion The histograms produced by frequency distribution of bytes of each file type are being separate from each other for File Type Detection. We have proposed a model which is useful in classifying the forged files and have been able to classify whether the files are stego or not. We have decreased the computational time complexity in detecting the forged files and steganographic content. There is no need of opening the file for file type detection which relieves from magic bytes strategy. As if human cortex can detect the forged file images, certainly network which would be specifically designed for this is more powerful and speeding up the detection time. Our model is efficient in identifying the steganographic con- tent which were earlier skipped by the digital investigators. Using this each and every hidden and forged files are bring able to identify and helps in investigation. References 1. Altaay, Alaa A. Jabbar et al. An Introduction to Image Steganography Techniques. 2012 International Conference on Advanced Computer Science Applications and Technologies (ACSAT) (2012): 122-126. 2. Srinivas, M., and C. Krishna Mohan. ”Classification of medical images using edge- based features and sparse representation.” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016. 3. Srinivas, Mettu, Yen-Yu Lin, and Hong-Yuan Mark Liao. ”Deep dictionary learning for fine-grained image classification.” 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017. 4. Srinivas, M., Yen-Yu Lin, and Hong-Yuan Mark Liao. ”Learning deep and sparse feature representation for fine-grained object recognition.” 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017. 5. Bogdan Ionescu and Henning Muller and Renaud Peteri and Yashin Dicente Cid and Vitali Liauchuk and Vassili Kovalev and Dzmitri Klimuk and Aleh Tarasau and Asma Ben Abacha and Sadid A. Hasan and Vivek Datla and Joey Liu and Dina Demner-Fushman and Duc-Tien Dang-Nguyen and Luca Piras and Michael Riegler and Minh-Triet Tran and Mathias Lux and Cathal Gurrin and Obioma Pelka and Christoph M.Friedrich and Alba Garcia Seco de Herrera and Narciso Garcia and Ergina Kavallier-atou and Carlos Roberto del Blanco and Carlos Cuevas Rodriguez and Nikos Vasil-lopoulos and Konstantinos Karampidis and Jon Chamberlain and Adrian Clark and An-tonio Campello, ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Se-curity and Nature, Experimental IR Meets Multilinguality, Multimodality, and Interac-tion., Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science, Springer, September 9-12, Lugano, Switzerland. Title Suppressed Due to Excessive Length 9 6. Konstantinos Karampidis, Nikos Vasillopoulos, Carlos Cuevas Rodrguez, Carlos Roberto del Blanco, Ergina Kavallieratou and Narciso Garcia. Overview of the Im- ageCLEFsecurity 2019 Task., CLEF working notes, CEUR, 2019. 7. Karampidis, Konstantinos, Ergina Kavallieratou, and Giorgos Papadourakis. ”Com- parison of Classification Algorithms for File Type Detection A Digital Forensics Perspective.”POLIBITS, vol.56, 2017,pp.1520. 8. Aedla, Raju and G. S. Dwarakish and Reddy Venkat. (2013). A Comparative Anal- ysis of Histogram Equalization based Techniques for Contrast Enhancement and Brightness Preserving. International Journal of Signal Processing, Image Process- ing and Pattern Recognition. vol. 6. 2013, pp. 353-366. 9. Tarun Kumar, Karun Verma, A Theory Based on Conversion of RGB image to Gray image. International Journal of Computer Application, 7(2), pp. 7-10. 10. Simonyan, Karen, and Andrew Zisserman. ”Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014). 11. Nwankpa, Chigozie, et al. ”Activation Functions: Comparison of trends in Practice and Research for Deep Learning.” arXiv preprint arXiv:1811.03378 (2018). 12. https://www.mathworks.com/videos/introduction to deep learning what are con- volutional neural networks 1489512765771.html 13. He, Kaiming et al. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016): 770-778. 14. Dufourq, Emmanuel, and Bruce A. Bassett. ”Automated problem identification: Re-gression vs classification via evolutionary deep networks.” Proceedings of the South African Institute of Computer Scientists and Information Technologists. ACM, 2017. 15. Hara, Kensho et al. Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) (2017): 3154-3160. 16. Shelhamer, Evan et al. Fully Convolutional Networks for Semantic Segmenta- tion. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015): 3431-3440.