Modeling of Pseudo-Random Sequences Generated by Data
Encryption and Compression Algorithms
Alexander V. Kozachok 1, Vasiliy I. Kozachok, Andrey A. Spirin1, Andrey K. Trofimenkov2
1
  Academy of the Federal Guard Service of the Russian Federation, 35 Priborostroitelnaya ul., Orel, 302015,
Russia
2
  Orel Branch of the Federal Research Center "Informatics and Management" of the Russian Academy of Sciences
(OF FIC IU RAS), 137 Moskovskoe shosse., Orel, 302025, Russia


                   Abstract
                   Reports of information and analytical agencies indicate a high proportion of internal violators
                   as sources of confidential data leaks in Russia. One of the possible channels of data leaks can
                   be their transmission in encrypted form. Modern data analysis tools are not able to reliably
                   detect the transfer of information after its processing by cryptographic algorithms. In addition,
                   an attacker can embed digital signatures specific to compressed data in encrypted data, thereby
                   disguising them as legitimate file types. The paper presents an approach to the classification of
                   encrypted and compressed data based on the developed model of pseudorandom sequences and
                   the algorithm for their classification. The accuracy of the proposed method was 0.97.

                   Keywords 1
                   Statistical data analysis, machine learning, classification of encrypted and compressed data,
                   binary sequence classification

1. Introduction
    According to the report of the expert and analytical center of the Infowatch group of companies, the
share of internal violators as sources of confidential data leaks in Russia increased in 2020 [1], in more
than 79% of cases, confidential data leaks were caused by an internal violator.
    To protect information from leaks, software tools for detecting and preventing leaks of confidential
data and various systems for deep traffic analysis are used.
    Methods of traffic analysis, depending on the feature spaces used in its classification, can be divided
into several groups: calculation of the entropy of all or part of the data [2-6]; service information of data
transmission protocols [7–10], statistical characteristics and byte distribution [11–14]. However, there
are ways to circumvent these security methods, such as using encryption, data compression, or
encapsulation in other protocols [15,16]. In papers [17, 18] the authors conclude that the entropy
approach reduces the information about the distribution to a single number, thereby reducing the
features available for analysis. To overcome this problem, deep neural networks were used to classify
data of two classes: encrypted by the aes algorithm and compressed by the rar, zip/gzip, jpeg, png, mp3,
and pdf algorithms. This mechanism, according to the authors, will automatically determine the most
significant features inherent in the analyzed sequences and improve the accuracy of their classification.
    In [19], a combined approach based on recurrent networks and extremely randomized trees is used
to detect potentially malicious actions and intrusions into information systems. Service network
information was used as attributes.
    In the paper [20] notes that the byte distributions in encrypted and compressed sequences tend to be
evenly distributed. This fact is explained by the fact that encryption algorithms disperse the original
statistics in messages and compression algorithms tend to reduce the redundancy of the original


BIT-2021: XI International Scientific and Technical Conference on Secure Information Technologies, April 6-7, 2021, Moscow, Russia
EMAIL: a.kozachok@academ.msk.rsnet.ru (A. 1); kosachok@list.ru(A. 2); spirin_aa@bk.ru (A. 3), andjeytrf@rambler.ru (A. 4)
ORCID: 0000-0002-6501-2008 (A. 1); 0000-0001-5384-2269 (A. 2); 0000-0002-7231-5728 (A. 3); 0000-0002-5576-875X (A. 4)
                © 2021 Copyright for this paper by its authors.
                Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                      98
messages. The article [21] describes the use of a feature space based on byte allocation for detecting
potentially malicious software.
    In the reviewed studies, the following machine learning algorithms were used: decision tree (DT),
support vector machine (SVM), random forest (RF), Markov chain (MC), hidden Markov chain (HMC),
boosting (BG), convolutional neural networks (CNN), recurrent neural networks (RNN), determination
of autocorrelation of distributions (AC). The considered methods use container and file headers for data
analysis, which contain "magic" bytes – digital signatures that uniquely identify the transmission
protocol or the compressed data container. At the same time, software and hardware protection against
information leaks do not have mechanisms for analyzing encrypted or compressed data, in the absence
of information about the compression algorithm [3]. An overview of the methods considered is
presented in Table 1.

Table 1
Analysis of the subject area of research
       The authors                  Features                 Algorithms                 Accuracy
    Mamun M.S.I. [2]                Entropy                      DT                      0,981
        Tang Z. [3]                 Entropy                   SVM, RF                    0,979
                                 service traffic
       Shen M. [7]                                               MC                       0,912
                                  information
                                 service traffic
       Chen Y. [8]                                               BG                       0,987
                                  information
                                 service traffic
      Obasi T.C. [9]                                         CNN, RF, DT                   0,96
                                  information
                                 service traffic
        Yao Z. [10]                                             HMC                        0,99
                                  information
   Choudhury P. [11]           Bytes distribution                AC                        0,99
     Baldini G. [13]           Bytes distribution               SVM                        0,8
      Shen M. [14]             Bytes distribution                RF                       0,882
   De Gaspari F. [17]           Auto extraction                 DNN                     0,85-0,99
                                 service traffic
  Kasongo, S. M. [19]                                     Extra trees, RNN                 0,99
                                  information

    Machine learning algorithms are also used in related areas of information security. So in the study,
the authors used the XGBoost algorithm to search for abnormal behavior of players on the stock
exchange who own insider information and use it improperly [22]. The signs were various specific signs
and financial indicators inherent in conducting transactions on the exchange. Despite some differences
in the subject area, the authors solve a similar problem of optimizing machine learning algorithms to
improve the accuracy of classifying illegitimate actions of agents.
    In a number of studies, an algorithm of extremely randomized trees is used to detect intrusions into
corporate networks. This approach also applies to the ensemble method as well as the random forest,
but differs in the way of selecting the partition threshold. Instead of searching for the most optimal
thresholds, as happens in the random forest algorithm, the thresholds are selected randomly for each
possible feature, and the best one is selected as a rule for dividing the node. The use of extremely
randomized trees slightly reduces the variance of the model due to a slightly larger increase in the bias
[23-25].
    Gradient boosting refers to ensemble machine learning algorithms that can be used for classification
or regression tasks. Ensembles are built on the basis of decision tree models. Trees are added one by
one to the ensemble and trained to correct prediction/classification errors made by previous models.
The models are trained using any arbitrary differentiable loss function and a gradient descent
optimization algorithm. This explains the name of the algorithm "gradient boosting", since the loss
gradient is minimized as the model is trained, like a neural network, and the addition of trees causes the
term boosting [23].


                                                     99
    Special attention in Work 21 is paid to the definition of hyperparameters of the classifier. The
maximum depth of the trees limits the growth of the trees used in depth. A higher value increases the
complexity of the model, but at the same time can lead to overfitting. The learning rate controls the
weight of an individual model in the ensemble prediction. Smaller speed values may require more
decision trees in the ensemble.
    In work [24], the authors search for hyperparameters of the classifier based on the lightgbm
algorithm to increase its accuracy, the main ones are: maximum depth of trees – this parameter
determines the complexity of the model, the higher it is, the more accurately classifies the data, but can
lead to overfitting and a decrease in accuracy on test data; learning rate – the parameter assigns the
weight of each tree in the ensemble, lower values indicate a low weight of each tree in the ensemble.
The special role of the feature selection and normalization process is also noted, since erroneous data
and omissions in the feature values can significantly reduce the accuracy of the classification algorithms
used. The statistical characteristics of the obtained distributions of the values of the feature space were
used as features: the minimum and maximum values, the arithmetic mean, and the variance.
    The analysis of the works considered in the study allows us to conclude that machine learning
algorithms are widely used in the field of information security and their high accuracy in solving
problems of data classification of various classes. In many works, the entropy approach, the distribution
of bytes and subsequences of different lengths, is used to form the feature space. Based on the analysis,
it was suggested that it is possible to form a feature space based on byte distributions and subsequences
of limited bit length.

2. Pseudo-random sequences model
    The analysis of sources in the subject area of research revealed the frequent use of the byte
distribution as a feature space, its statistical characteristics, and the results of counting the frequencies
of subsequences of different lengths with different steps.
    It was suggested that the feature space based on combining the frequency of occurrence of
independent bit subsequences of different length N (in bits) without taking into account the complete
overlap of each subsequence and the byte distribution can be used to solve the problem of improving
the accuracy of the classification of PRS and improving existing solutions in the field of information
protection from leaks. For example, for sequence = 100101111010, the frequency of occurrence of
subsequences with length N = 3 bits is shown in Table 2.

Table 2
Example of bit subsequences frequency count
                 Subsequences               Number                                                   Frequency
                      000                     0                                                          0
                      001                     1                                                         0,1
                      010                     2                                                         0,2
                      011                     1                                                         0,1
                      100                     1                                                         0,1
                      101                     2                                                         0,2
                      110                     1                                                         0,1
                      111                     1                                                         0,1

   In order to prevent the use of digital signatures and correctly classify encrypted and compressed
sequences that have a uniform byte distribution and are called pseudorandom (PSP), a PSP model based
on statistical features was developed. When it is formed, the header part of the file with a length of 10
KB is discarded, the rest is subject to statistical analysis. In its formal form, the PSP model is defined
by the expression 1.

                𝑉𝑉 = �𝐹𝐹�𝐵𝐵𝑖𝑖∈[0,…,255] �, 𝐵𝐵𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 , 𝐵𝐵𝑠𝑠𝑠𝑠𝑠𝑠 , 𝐵𝐵𝑚𝑚𝑚𝑚𝑚𝑚 , 𝐵𝐵𝑚𝑚𝑚𝑚𝑚𝑚 , 𝐹𝐹�𝑆𝑆𝑗𝑗∈[0,…,511] ��,   (1)


                                                                    100
where 𝐵𝐵𝑖𝑖∈[0,…,255] – byte distribution in the analyzed PRS; 𝑆𝑆𝑗𝑗∈[0,…,511] – distribution of frequencies of
occurrence of subsequences of length 9 bits in the analyzed PRS; 𝐵𝐵𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 – mean of the byte frequency
in PRS; 𝐵𝐵𝑠𝑠𝑠𝑠𝑠𝑠 – standard deviation of the byte distribution; 𝐵𝐵𝑚𝑚𝑚𝑚𝑚𝑚 , 𝐵𝐵𝑚𝑚𝑚𝑚𝑚𝑚 – minimum and maximum byte
frequency values.
   Mean of the byte frequency in PRS and standard deviation of the byte distribution is calculated
according to expression 2.
                                                                                                         (2)
                                    ∑255
                                     𝑖𝑖=0 𝑛𝑛(𝑏𝑏𝑖𝑖 )               ∑255
                                                                   𝑖𝑖=0 (𝑛𝑛(𝑏𝑏𝑖𝑖 ) − 𝐵𝐵𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 )
                                                                                                 2
                       𝐵𝐵𝑚𝑚𝑚𝑚𝑚𝑚𝑚𝑚 =                 , 𝐵𝐵𝑠𝑠𝑠𝑠𝑠𝑠 = �
                                        256                                   256

where 𝑛𝑛(𝑏𝑏𝑖𝑖 ) – the number of occurrences of byte i in the analyzed PRS.
  The frequency of occurrence of subsequences in the PRS is calculated according to expression 3.
                                                         𝑛𝑛𝑗𝑗                                    (3)
                                𝐹𝐹�𝑆𝑆𝑗𝑗∈[0,…,511] � = 𝑓𝑓𝑗𝑗 =     ,
                                                     𝑀𝑀 − 𝑁𝑁 + 1
where 𝑛𝑛𝑗𝑗 – the number of occurrences of the subsequence j in PRS; M – PRS length in bits; N – length
of the subsequence j in bits.
    To test the adequacy of the model, experiments were conducted to classify the generated set of
encrypted and compressed data with a size of 600 Kbytes. 2000 files containing meaningful text in
Russian were generated, then they were processed by encryption algorithms (AES, DES, RC4,
Camellia, GOST34.12 "Kuznechik") and compression algorithms (RAR, ZIP, 7Z, GZ, BZ2, XZ). Thus
a data sample consisting of two classes and containing 22,000 files was obtained.

3. Practical application of the developed PRS model
    In order to select the most suitable machine learning algorithm, experiments were conducted to
evaluate the accuracy of the PRS classification. In view of the dependence of the classification accuracy
on the number of features used in the PRS model, experiments were conducted using a different number
of features ranked by discriminating ability [26].
    The results of evaluating the accuracy of the classification of PRS by machine learning algorithms
depending on the number of features are shown in the figure 1.


Figure 1: Evaluation of the accuracy of the classification of PRS depending on the number of used
features of the modeled PRS.
   Almost all machine learning algorithms demonstrate an increase in classification accuracy with an
increase in the dimension of the feature space, which meets the expected results, since the features are
ranked according to their classifying ability. The most accurate classification according to the precision
metric was shown by the random forest algorithm and the extra trees algorithm.


                                                          101
    Machine learning algorithms have different time complexity, and to assess their applicability in real
systems, experiments were conducted to determine the training time of a model based on the
corresponding machine learning algorithm. The results of the experiments are shown in Figure 2.
    The proposed PRS model based on the developed PRS classification algorithm allows classifying
encrypted and compressed data with an accuracy of 0.97 in 0.5 seconds for a sequence of 600 Kbytes
in length. The classification does not take into account digital signatures, file extensions, and other
service information, but only statistical features of byte distributions and 9-bit subsequences [27].
    The developed solution can be implemented in existing DLP systems and perform statistical analysis
of data transmitted beyond the controlled perimeter of the corporate data network, the flow diagram is
shown in Figure 2.


Figure 2: Estimating the time complexity of building machine learning models.
    The gradient boosting algorithm was excluded from the experiment because it has the longest
training time of 250 seconds on a given sample size. The classifier based on adaptive boosting has the
maximum time to build the model. A classifier based on a random forest and extremely randomized
trees have similar time complexity on this task.
    Since the classifier based on a random forest is superior to the classifier based on extremely
randomized trees with the most optimal values of the number of parameters equal to 200 features, it
was chosen to build a prototype for the classification of encrypted and compressed data in information
leak prevention systems. The first 10 features with the maximum discriminating ability are shown in
Figure 3.


                                                    102
   Figure 3: First 10 features with maximal discriminative power.
   Thus, the feature space is a vector of statistical properties with a length of 110 elements that have
the maximum discriminating ability. Figure 4 shows the result of a multidimensional data visualization
algorithm for the 5 most discriminating features. Statistical features are located around the circle, the
points are the values of the features for each class of PSP, located independently of each other. The
points inside the circle are arranged so that its values are normalized along the axes from the center to
each arc.


   Figure 4. The top 10 most discriminating features.
   The total values of the hyperparameters of the generated PRS classifier are presented in the table 3.

Table 3
Hyperparameters of the generated PRS classifier
                           Hyperparameter                       Value


                                                    103
                            Number of estimators                   100
                             Number of features                    200
                           Max depth of the trees                   38
                                Length of the
                                                                  9 bits
                                subsequences
                                     100                             1
                                     101                             2
                                     110                             1
                                     111                             1
    The developed PRS model together with the PRS classification algorithm can be implemented in the
statistical data analysis module of existing DLP systems. The greatest threat to modern enterprise
systems is cloud storage. If necessary, the internal intruder is able to use encryption and data masking
tools, for example, by introducing digital signatures of known file formats: for ZIP format - "50 4B";
RAR - "52 61 72 21 1A"; pdf - "25 50 44 46 2D 31 2E". The developed solution can be implemented
in existing DLP systems and perform statistical analysis of data transmitted outside the controlled
perimeter of the corporate data network, the block diagram of which is shown in figure 5.


Figure 5: The pipeline of implementation of the developed approach of classification of PRS in the
existing DLP systems.


4. Conclusion
    The study examines existing approaches to the formation of feature spaces and machine learning
algorithms used to build classifiers of encrypted and compressed data. Existing approaches show high
classification accuracy, but all of them use the headers of the analyzed files or data packets, which
contain digital signatures that uniquely determine the type of information transmitted. An attacker can
take advantage of this flaw and transmit the information in encrypted form by changing the header part
of the file or using encapsulation methods for traffic.
    To improve the accuracy of the classification of PRS generated by data encryption and compression
algorithms, a model of PRS was developed that uses the byte distribution and the frequency of
occurrence of 9-bit subsequences in the analyzed PRS. The choice of the mathematical apparatus based
on the random forest construction algorithm was justified. During the experiments, the adequacy of the
model and the possibility of its application for the classification of the data types specified in the work
with a precision of 0.98 were confirmed. The place of implementation of the developed PRS model in
the means of detecting and preventing information leaks was proposed.

                                                     104
5. Acknowledgements
  The reported study was funded by Russian Ministry of Science (information security), project
number 18/2020.

6. References
[1] InfoWatch analytics report, 2021. URL: https://www.infowatch.ru/analytics/reports/30708.
[2] Mamun M.S.I., Ghorbani A.A., Stakhanova N. An Entropy Based Encrypted Traffic Classifier. //
     In: Qing S., Okamoto E., Kim K., Liu D. (eds) Information and Communications Security. ICICS
     2015. Lecture Notes in Computer Science, vol. 9543. Springer, Cham. DOI: 10.1007/978-3-319-
     29814-6_23.
[3] Tang Z., Zeng X., Sheng Y. Entropy-based feature extraction algorithm for encrypted and non-
     encrypted compressed traffic classification. International Journal of ICIC, 2019, vol. 15, no. 3,
     pp. 845–860. DOI: 10.24507/ijicic.15.03.845.
[4] Belyaev S. and etc. Development of a Pseudo-Random Sequence Generation Function Based on
     the “Kuznechik” Cryptographic Algorithm. Voprosy kiberbezopasnosti [Cybersecurity issues],
     2021, No 4 (44), pp. 25-34. DOI: 10.21681/2311-3456-2021-4-25-34. DOI: 10.21681/2311-3456-
     2017-5-30-41. (In Russ.)
[5] Livshitz I., Neklydov A. Assessment of Entropy of Information Security Systems. Voprosy
     kiberbezopasnosti [Cybersecurity issues], 2017. No 5(24), pp. 30-41. DOI: 10.21681/2311-3456-
     2017-5-30-41. (In Russ.)
[6] Zhou, K., Wang, W., Wu, C., & Hu, T. (2020). Practical evaluation of encrypted traffic
     classification based on a combined method of entropy estimation and neural networks. ETRI
     Journal, 42(3), 311-323. DOI: 10.4218/etrij.2019-0190.
[7] Shen M., Wei M., Zhu L. & Wang M. Classification of encrypted traffic with second-order markov
     chains and application attribute bigrams // IEEE Transactions on Information Forensics and
     Security, 2017, vol. 12, no. 8, pp. 1830-1843. DOI: 10.1109/TIFS.2017.2692682.
[8] Chen Y., Zang T., Zhang Y., Zhouz Y., & Wang Y. Rethinking Encrypted Traffic Classification:
     A Multi-Attribute Associated Fingerprint Approach. // 2019 IEEE 27th International Conference
     on     Network      Protocols     (ICNP),     Chicago,    IL,   USA,      2019,     pp.    1–11.
     DOI: 10.1109/ICNP.2019.8888043.
[9] Obasi T.C. Encrypted Network Traffic Classification using Ensemble Learning Techniques //
     Doctoral dissertation, Carleton University, 2020. DOI: 10.22215/etd/2020-14171.
[10] Yao Z., Ge J., Wu Y., Lin X., He R., Ma Y. Encrypted traffic classification based on Gaussian
     mixture models and Hidden Markov Models. // Journal of Network and Computer Applications,
     2020, vol. 166, p. 102711. DOI: 10.1016/j.jnca.2020.102711.
[11] Wang F., Quach T.T., Wheeler J., Aimone J.B., James, C.D. Sparse coding for n-gram feature
     extraction and training for file fragment classification. // IEEE Transactions on Information
     Forensics and Security, 2018, vol. 13, no.10, pp. 2553–2562. DOI: 10.1109/TIFS.2018.2823697.
[12] Choudhury P., Kumar K. P., Nandi S., Athithan G. An empirical approach towards characterization
     of encrypted and unencrypted VoIP traffic // Multimedia Tools and Applications, 2020, vol. 79,
     no. 1–2, pp. 603–631. DOI: 10.1007/s11042-019-08088-w.
[13] Baldini G., Hernandez-Ramos J. L., Nowak S., Neisse R., Nowak M. Mitigation of Privacy Threats
     due to Encrypted Traffic Analysis through a Policy-Based Framework and MUD Profiles. //
     Symmetry, 2020, vol. 12, no.9, p. 1576. DOI: 10.3390/sym12091576.
[14] Shen M., Liu Y., Zhu L., Xu K., Du X., Guizani N. Optimizing Feature Selection for Efficient
     Encrypted Traffic Classification: A Systematic Approach // IEEE Network, 2020, vol. 34, no. 4,
     pp. 20–27. DOI: 10.1109/MNET.011.1900366.
[15] Huang X. et al. A novel mechanism for fast detection of transformed data leakage //IEEE Access,
     2018, vol. 6, pp. 35926–35936. DOI: 10.1109/ACCESS.2018.2851228.
[16] Cheng L., Liu F., Yao D. D. Enterprise data breach: causes, challenges, prevention, and future
     directions //Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2017, vol.
     7, no. 5. DOI: 10.1002/widm.1211.

                                                  105
[17] De Gaspari F., Hitaj D., Pagnotta G., De Carli L., Mancini L. V. EnCoD: Distinguishing
     Compressed and Encrypted File Fragments. In International Conference on Network and System
     Security (pp. 42-62). 2020. Springer, Cham. DOI: 10.1007/978-3-030-65745-1_3.
[18] Lee K., Lee S. Y., Yim K. Machine learning based file entropy analysis for ransomware detection
     in     backup       systems. IEEE      Access, 7,      pp.   110205-110215.        2019.     DOI:
     10.1109/ACCESS.2019.2931136.
[19] Kasongo S. M., Sun Y. A deep gated recurrent unit based model for wireless intrusion detection
     system. ICT Express, 7(1), 81-87. 2021. DOI: 10.1016/j.icte.2020.03.002.
[20] Raff E., Zak R., Cox R., Sylvester J., Yacci P., Ward R., Nicholas C. (2018). An investigation of
     byte n-gram features for malware classification. Journal of Computer Virology and Hacking
     Techniques, 14(1), pp. 1-20. DOI: 10.1007/s11416-016-0283-1.
[21] De Gaspari, F., Hitaj, D., Pagnotta, G., De Carli, L., & Mancini, L. V. (2021). Reliable Detection
     of Compressed and Encrypted Data. arXiv preprint arXiv:2103.17059.
[22] Deng S., Wang C., Li J., Yu H., Tian H., Zhang Y., Yang T. Identification of Insider Trading Using
     Extreme Gradient Boosting and Multi-Objective Optimization. Information, 10(12), 367. 2019.
     DOI: 10.3390/info10120367.
[23] Ke G., Meng Q., Finley T., Wang T., Chen W., Ma W., Liu T. Y. Lightgbm: A highly efficient
     gradient boosting decision tree. Advances in neural information processing systems, 30, 3146-
     3154. 2017. DOI: 10.5555/3294996.3295074.
[24] Minastireanu E. A., Mesnita G. Light gbm machine learning algorithm to online click fraud
     detection. J. Inform. Assur. Cybersecur, 2019. DOI: 10.5171/2019.263928.
[25] Ge D., Gu J., Chang S., Cai J. Credit Card Fraud Detection Using Lightgbm Model. In 2020
     International Conference on E-Commerce and Internet Technology (ECIT) (pp. 232-236). IEEE.
     2020. DOI: 10.1109/ECIT50008.2020.00060.
[26] Kozachok A.V., Spirin A.A., Golembiovskaya O.M. Algoritm classificazii psevdosluchainuh
     posledovatelnostei na osnove postroenija sluchainogo lesa. Dokladj Tomskogo gosudarstvennogo
     universiteta sistem upravlenija I radioelektroniki. – 2020. – V. 23. – no. 3. – P. 55–60.
     DOI: 10.21293/1818-0442-2020-23-3-55-60. (In Russ.)
[27] Kozachok A.V., Spirin A.A. Algoritm classificazii psevdosluchainuh posledovatelnostei.
     Sistemnij analyz i informazionnie tehnologii. – 2020. – no.1, pp. 87 – 98. DOI:
     10.17308/sait.2020.1/2595.


                                                   106