Searching for similar code sequences in executable
          files using siamese neural network
                                                            Alexander Yumaganov
                                              Geoinformatics and Information Security Department
                                                     Samara National Research University
                                                               Samara, Russia
                                                             yumagan@gmail.com

    Abstract—This work is dedicated to solving the problem of                similarity rate is estimated using a neural network. The
finding similar code sequences (functions) in executable files.              methods of similar code sequences search presented in [6, 7,
The description of functions obtained using the proposed                     8] are based on comparing subgraphs of function’s CFG. The
method for solving this problem is based on the mutual spatial               authors of [9] presented an intrusion detection system for
position of processor instructions and the corresponding                     distributed data processing systems based on a comparison of
operands in the function body. The word embedding model                      process signatures obtained from their control flow graphs.
word2vec is used to form an intermediate description of the                  This group of methods also has disadvantages: high
executable file functions. The final description of the functions            sensitivity to structural changes in functions, inapplicability
is formed using the siamese long short-term memory network
                                                                             to functions with a small number of basic blocks CFG. Thus,
(siamese-LSTM). Then it description directly used to search
for similar functions. The results of experimental studies of the
                                                                             despite the large number of works in the field of similar code
developed method are presented in comparison with some                       sequences search in executable files, there is no universal
previously known methods.                                                    method or approach to solve this problem, which is devoid of
                                                                             all the shortcomings.
   Keywords—searching,         code    sequence,     siamese    neural           This paper presents a method of similar code sequences
network                                                                      search in which the initial description of functions is formed
                        I. INTRODUCTION                                      on the basis of the spatial position of processor instructions
                                                                             and their corresponding operands. The siamese neural
    Nowadays, developers of new software often use code                      network is used to obtain the final description of the
that was developed earlier to solve other problems. Reusing                  function.
the code can improve the development time of new software,
but it can also cause errors and vulnerabilities in the created                  The paper is structured as follows. The first section
products. According to studies presented in [1], 69                          presents the basic definitions and a brief description of the
vulnerable fragments of C ++ code found from                                 proposed method. In the second section, the process of
StackOverflow.com were used in 2859 projects on the                          obtaining the initial description of functions is considered.
GitHub service. In addition, using someone else's code may                   The third section presents the descriptions of intermediate
be illegal. According to the Synopsys report “Open source                    function representation. The fourth section describes the
security and risk analysis” [2] for 2019, it was found that out              siamese neural network, which is used to obtain the final
of more than 1200 audited software products, 67% contained                   representation of functions. The fifth section is devoted to
components with license conflicts.                                           the description of the similar functions search algorithm. The
                                                                             sixth section provides a method for evaluating the
    Most software developers do not share their source code,                 effectiveness of the method and the results of the
so it is difficult to perform the analysis of their software. In             experiments. In the final part of the paper, the conclusions
this case, the analysis of executable files is the only way to               and a list of references are presented.
analyze the code of that kind of software.
                                                                                  II. BASIC CONCEPTS AND PRINCIPLE OF OPERATION
    Thus, solving the problem of finding similar code
sequences in executable files allow us to solve such problems                    The following definitions are used in this paper:
such as finding known vulnerabilities and plagiarism
                                                                                  current library - the set of the investigated executable
detection without using the source code of the analyzed
                                                                                   functions;
software.
    There are a lot of methods of similar code sequences                          archived data - the set of the known functions.
search in executable files. In [3], a method finding similar                     Taking into account the above definitions, the problem
code sequences of code was presented. This method is based                   solved by the proposed method is formulated as follows: for
on comparing limited sequences of processor instructions (k-                 a given (or each) function of the current library, find the
gram). To identify library functions in the IDA disassembler,                most similar function from the archive data.
the Fast Track Identification and Recognition Technology
(FLIRT) algorithm [4] is used. This algorithm is based on a                      The method of similar code sequences search proposed in
comparison of function templates. The methods mentioned                      this work includes several stages. At the first stage, the
above are sensitive to syntactic changes of the program code                 process of training the siamese neural network is carried out.
(for example, replacing a command with an equivalent                         At the second stage a description of the archive data
command or group of commands). There are also methods                        functions is formed using a trained neural network. At the
based on the analysis of function’s control flow graph                       third stage, a description of the current library functions is
(CFG). The authors of [5] proposed a search method in                        formed in a similar way. At the final stage, the search for
which the vector representation of function is formed by                     similar functions is performed.
taking into account the structure of its CFG, and the


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

   III. CONSTRUCTION OF THE INITIAL AND INTERMEDIATE                                                                                   The input of each siamese network's LSTM network is a
                                DESCRIPTIONS OF FUNCTIONS                                                                           function in the form of previously obtained intermediate
                                                                                                                                    description. Then the outputs of the last layers of each
A. Construction of the initial descriptions of functions                                                                            network are sent to the input of a function that evaluates the
    After disassembly analysis of the executable code                                                                               similarity between these outputs. In this paper, the following
performed by disassembler, we can get an assembler of every                                                                         metric is used as such a function:
function in the executable file. Each function consists of a
sequence of processor instructions and corresponding                                                                                                                                   J 1

operands.                                                                                                                                                D ( a , b )  e x p (   a i  b i ),                        
                                                                                                                                                                                       i0

    All processor instructions are divided into K functional
groups according to the type of operations they perform (for                                                                        where a i is the value of the i-th element of the last layer of
example, a group of logical operations, a group of arithmetic
operations). All operands are also divided into N groups                                                                            the first LSTM network, b i is the value of the i-th element of
according to the types of operands presented in the IDA                                                                             the last layer of the second LSTM network. Two functions
disassembler [10] (for example, base register, FPP register,                                                                        are considered to be same if D  1 . In the contrary case, if
the value itself).                                                                                                                   D  0 two functions are considered to be totally different.

   Each command and its corresponding operands are                                                                                     Structural chart of siamese neural network is shown in
converted into lexemes (words) as follows:                                                                                          Fig. 1.

L e x e m e ( k , n 0 , n 1 )  C o n c a te n a te ( k , n 0 , n 1 ), k  [0 , K  1], n 0  [0 , N  1], n 1  [0 , N  1],   

where k is a index number of the functional group of the
considered processor instruction, n 0 and n1 are index
numbers of the operands groups to which the first and second
operands belong respectively.
   For example, let us consider the command "mov rax, 1":
k  0   (Data Transfer Instructions), n 0  1 (General
Register), n1  5 (Immediate):
                                                                                                                                    Fig. 1. Structural chart of siamese neural network
                                   L e x e m e ( 0 ,1, 5 )  ' 0 0 0 1 0 5 ' 
                                                                                                                                        Here, D is the metric (1), I 0A , I 1A , I 2A are the components
    Thus, each function can be represented as an ordered set                                                                        of the intermediate description vector of the first function,
of lexemes.                                                                                                                         corresponding to its lexemes, I 0B , I 1B , I 2B , I 3B are the
                                                                                                                                    components of the intermediate description vector of the
B. Construction of the intermediate description of the                                                                              second function, corresponding to its lexemes, y is the output
    function                                                                                                                        value of the siamese neural network, taking the value 0 if the
    The word2vec model [11] is used to obtain a vector                                                                              functions are different, otherwise 1.
representation of the function's initial description. The vector
representations obtained by this model are based on the                                                                                Contrastive loss [13] is used as a loss function, which is
contextual proximity of lexemes inside functions. About a                                                                           given by:
dozen different binary files were used to train this model. As
a result of the training process, we have a dictionary in which                                                                                               1                        1
                                                                                                                                               L( y, D )        (1  y ) D
                                                                                                                                                                               2
                                                                                                                                                                                          y { m a x ( 0 ,1  D )} . 
                                                                                                                                                                                                                  2
                                                                                                                                                                                                                                
a vector of a given dimension is assigned to each lexeme                                                                                                      2                        2
(obtained from training binary files).
    Thus, a function consisting of m commands describes as                                                                              This loss function maximizes and minimizes the value of
a two-dimensional vector I of dimension m  s . Let us call                                                                         (1) between similar ( y  1 ) and different ( y  0 ) functions,
this vector as an intermediate description.                                                                                         respectively.

   IV. CONSTRUCTION OF THE FINAL DESCRIPTION OF THE                                                                                     Two “archives” of functions were used to train siamese
                                                                                                                                    neural network : the first one contained the library functions
                                                   FUNCTION
                                                                                                                                    libtiff 4.0.3 [14] and proj 4.9.1 [15], and the second one
   It is proposed to use the siamese neural network [12] to                                                                         contained the library functions libtiff 4.0.8 and proj 5.0.1.
construct the final description of the functions. This network
consists of two identical long short-term memory (LSTM)                                                                                 In the learning process, for each function of the first
neural networks with the same weights.                                                                                              archive, two pairs were formed: current function and the
                                                                                                                                    function having the same name from the second archive;
    The LSTM network is a special type of recurrent neural                                                                          current function, and a random function from the second
network (RNN). In contrast to RNN, it is able to work with                                                                          archive, which has a name different from the first function.
long-term dependencies. This is achieved due to its ability to                                                                      The resulting pairs of vectors are fed to the input of the
transfer the state of the cell from the previous time step to the                                                                   neural network, the output value for the first pair is 1
next step, as well as to control the information flow in the                                                                        (similar), and for the second pair is 0 (different).
LSTM cell.


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                                                                                                   204
Data Science

    The trained model of the siamese neural network is used                       The average precision of the list:
to form the final description of the functions by passing an                                           L

intermediate description of the analyzed function to the input                               AveP     Pk ( R k  R k  1 ) , R 0  0
of one of the LSTM networks. The output vector of this
                                                                                                  k 1
                                                                                                                          
LSTM network, in this case, is the final description of the                   The average precision for all functions included in the
function. The dimension of the output layer is J  3 2 .                  current library is calculated by the formula:
                                                                                                        1 S 1
    Thus, any function of the analyzed binary file can be                                         P       AveP s
represented as a vector of dimension J.                                                                S s0                    
               V. SEARCH FOR SIMILAR FUNCTIONS                            where S is a number of functions in the current library.
                                                                              The archive data was represented by the curl 7.6.3 [16],
    The final stage of the proposed method is the search for              and the current library functions was represented by other
similar functions using the final description of the functions            versions of the curl library.
of archive data and the current library.
                                                                              The average precision of search of the proposed method
    Let a be the final description vector of the current                  was compared with some previously known methods: a
library function, b be the final description vector of the                method based on the analysis of the spatial position of
archive data library. For feature vector comparison we use                processor functional groups [19] and a method based on k-
the Euclidean metric (distance):                                          gram comparison [3].As a comparison object for the first
                                                                          method, the comparison object recommended by the authors
                                      J 1
                                                                          was used (the spatial distribution of instructions in the
                      d (a , b )      (a  b ) ,
                                              i       i
                                                          2
                                                                       function body in the integral form). For the second method,
                                      i0
                                                                          the value of the parameter k = 5 was also chosen based on
where a i is the value of the i-th component of the current               the recommendations of the authors. The results are
library function feature vector, b i is the value of the i-               presented in table 1.
component of the archive data library function feature vector.
                                                                            TABLE I.         COMPARISON OF SIMILAR FUNCTIONS SEARCH METHODS
If d  0 the functions are considered equal.
    The final description of the given function of the current                  Current
                                                                                                           Average precision of search P
library is compared with each function of archive data using                    library      Using proposed        Using k-gram       Using method
Euclidian metric. Then, the obtained results are sorted by                                       method            based method      presented in [15]
increasing the distance (2). As a result, we get the list of                    curl 7.5.4       0.7830               0.8208               0.7828
archival functions, sorted by similarity in descending order,                   curl 7.5.6       0.8631               0.8543               0.8550
for the analyzed current library function.
                                                                                curl 7.5.9       0.8886               0.8768               0.8851
                       VI. EXPERIMENTS
                                                                                curl 7.6.0       0.8960               0.8830               0.9036
    To evaluate the efficiency of the proposed method of
similar code sequences search in executable files, the                        The analysis of the obtained results shows that the
functions of one dynamic library are used as archive data,                method of similar code sequences search presented in this
and the functions of the same library, but of a different                 work is highly competitive with known methods, and in
version, are used as the current library. It was considered that          some cases surpasses them. Since the training of the siamese
in the process of switching from one version of the dynamic               neural network was carried out on a small data set, the results
library to another, the names of the functions did not change             obtained for this method can be further improved by
and there are no functions with the same name among the                   increasing the amount of data for training the neural network.
functions of the archive data.
                                                                                                    VII. CONCLUSION
    Using the search algorithm described in the fifth section,
for a given function of the current library, we obtain an                     The paper presents a method of similar code sequences
ordered list of archive data functions. Let us assign a binary            search in executable files based on the siamese neural
                                                                          network. The results of experimental studies demonstrating
sequence   (  1 ,  2 , ...,  L ) to this list, the i-th element      the efficiency of the developed method (average precision
of which is equal to one, if the name of the function at the i-           0.78 - 0.89). Further research will be aimed at tuning the
th position of the list is identical to the name of the function          parameters of the siamese neural network in order to increase
being checked, and the i-th element of which is equal to zero             the average precision of the search for this method.
otherwise. The following criteria to evaluate the quality of
information retrieval [17,18] are used:                                                             ACKNOWLEDGMENT
     Precision for the k-th position of the list:                           This work was supported by RFBR research project №
                                                                          18-01-00748 A.
                                      k

                                      l                                                                  REFERENCES
                                     l 1
                           Pk                                            [1]     M. Verdi, A. Sami, J. Akhondali, F. Khomh, G. Uddin and A.K.
                                 k                                               Motlagh, “An Empirical Study of C++ Vulnerabilities in Crowd-
                                                                                  Sourced Code Examples,” arXiv: 1910.01321, 2019.
     Recall for the k-th position of the list:
                                      k                                   [2]     Synopsys Cybersecurity Research Center: Open source security and
                                      l                                         risk analysis, 2020 [Online]. URL: https://www.synopsys.com/
                                     l 1                                         content/dam/synopsys/sig-assets/reports/2020-ossra-report.pdf
                            Rk 
                                          K       


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                              205
Data Science

[3]  G. Myles and C. Collberg, “K-gram based software birthmarks,”          [12] J. Bromley, I. Guyon, Y. LeCun, E. Siickinger and R. Shah,
     Proceedings of the ACM symposium on Applied computing, 2005.                “Signature verification using a “siamese” time delay neural network,”
[4] IDA Pro. F.L.I.R.T, 2019 [Online]. URL: https://www.hex-                     Proceedings of the 6th International Conference on Neural
     rays.com/products/ida/tech/flirt/in_depth.shtml.                            Information Processing Systems, pp. 737-744, 1994.
[5] N. Shalev and N. Partush, “Binary Similarity Detection Using            [13] R. Hadsell, S. Chopra and Y. LeCun, “Dimensionality Reduction by
     Machine Learning,” Proceedings of the 13th Workshop on                      Learning an Invariant Mapping,” IEEE Computer Society Conference
     Programming Languages and Analysis for Security - PLAS, 2018.               on Computer Vision and Pattern Recognition, vol. 2, 2006.
[6] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna,              [14] TIFF      Library      and     Utilities,   2019     [Online].    URL:
     “Polymorphic Worm Detection Using Structural Information of                 http://www.libtiff.org/.
     Executables,” Lecture Notes in Computer Science, Springer Berlin,      [15] PROJ coordinate transformation software library, 2019 [Online].
     Heidelberg, pp. 207-226, 2006.                                              URL: https://proj.org/about.html.
[7] Y. David and E. Yahav, “Tracelet-based code search in executables,”     [16] Libcurl - the multiprotocol file transfer library, 2019 [Online]. URL:
     ACM SIGPLAN Notices, vol. 49, no. 6, pp. 349-360, 2014.                     https://curl.haxx.se/libcurl/.
[8] S. Eschweiler, K. Yakdan and E. Gerhards-Padilla, “DiscovRE:            [17] M. Buckland and F. Gey, “The relationship between Recall and
     Efficient Cross-Architecture Identification of Bugs in Binary Code,”        Precision,” Journal of the American Society for Information Science,
     Proceedings Network and Distributed System Security Symposium,              vol. 45, no. 1, pp. 12-19, 1994.
     2016.                                                                  [18] D.M.W. Powers, “Evaluation: From Precision, Recall and F-Measure
[9] S. Aditham and N. Ranganathan, “A Novel Control-flow based                   to ROC, Informedness, Markedness & Correlation,” Journal of
     Intrusion Detection Technique for Big Data Systems,” arXiv:                 Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011.
     1611.07649, 2016.                                                      [19] A.S. Yumaganov and V.V. Myasnikov, “A method of searching for
[10] Hex-Rays IDA: About, 2019 [Online]. URL: http://hex-rays.com/               similar code sequences in executable binary files using a featureless
     products/ida/.                                                              approach,” Computer Optics, vol. 41, no. 5, pp. 756-764, 2017. DOI:
[11] T. Mikolov, K. Chen, G. Corrado and J. Dean, "Efficient estimation          10.18287/2412-6179-2017-41-5-756-764.
     of word representations in vector space", Proc. of ICLR Workshop,
     2013.
                                                                                                                 .


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                            206