Searching for similar code sequences in executable files using siamese neural network Alexander Yumaganov Geoinformatics and Information Security Department Samara National Research University Samara, Russia yumagan@gmail.com Abstract—This work is dedicated to solving the problem of similarity rate is estimated using a neural network. The finding similar code sequences (functions) in executable files. methods of similar code sequences search presented in [6, 7, The description of functions obtained using the proposed 8] are based on comparing subgraphs of function’s CFG. The method for solving this problem is based on the mutual spatial authors of [9] presented an intrusion detection system for position of processor instructions and the corresponding distributed data processing systems based on a comparison of operands in the function body. The word embedding model process signatures obtained from their control flow graphs. word2vec is used to form an intermediate description of the This group of methods also has disadvantages: high executable file functions. The final description of the functions sensitivity to structural changes in functions, inapplicability is formed using the siamese long short-term memory network to functions with a small number of basic blocks CFG. Thus, (siamese-LSTM). Then it description directly used to search for similar functions. The results of experimental studies of the despite the large number of works in the field of similar code developed method are presented in comparison with some sequences search in executable files, there is no universal previously known methods. method or approach to solve this problem, which is devoid of all the shortcomings. Keywords—searching, code sequence, siamese neural This paper presents a method of similar code sequences network search in which the initial description of functions is formed I. INTRODUCTION on the basis of the spatial position of processor instructions and their corresponding operands. The siamese neural Nowadays, developers of new software often use code network is used to obtain the final description of the that was developed earlier to solve other problems. Reusing function. the code can improve the development time of new software, but it can also cause errors and vulnerabilities in the created The paper is structured as follows. The first section products. According to studies presented in [1], 69 presents the basic definitions and a brief description of the vulnerable fragments of C ++ code found from proposed method. In the second section, the process of StackOverflow.com were used in 2859 projects on the obtaining the initial description of functions is considered. GitHub service. In addition, using someone else's code may The third section presents the descriptions of intermediate be illegal. According to the Synopsys report “Open source function representation. The fourth section describes the security and risk analysis” [2] for 2019, it was found that out siamese neural network, which is used to obtain the final of more than 1200 audited software products, 67% contained representation of functions. The fifth section is devoted to components with license conflicts. the description of the similar functions search algorithm. The sixth section provides a method for evaluating the Most software developers do not share their source code, effectiveness of the method and the results of the so it is difficult to perform the analysis of their software. In experiments. In the final part of the paper, the conclusions this case, the analysis of executable files is the only way to and a list of references are presented. analyze the code of that kind of software. II. BASIC CONCEPTS AND PRINCIPLE OF OPERATION Thus, solving the problem of finding similar code sequences in executable files allow us to solve such problems The following definitions are used in this paper: such as finding known vulnerabilities and plagiarism  current library - the set of the investigated executable detection without using the source code of the analyzed functions; software. There are a lot of methods of similar code sequences  archived data - the set of the known functions. search in executable files. In [3], a method finding similar Taking into account the above definitions, the problem code sequences of code was presented. This method is based solved by the proposed method is formulated as follows: for on comparing limited sequences of processor instructions (k- a given (or each) function of the current library, find the gram). To identify library functions in the IDA disassembler, most similar function from the archive data. the Fast Track Identification and Recognition Technology (FLIRT) algorithm [4] is used. This algorithm is based on a The method of similar code sequences search proposed in comparison of function templates. The methods mentioned this work includes several stages. At the first stage, the above are sensitive to syntactic changes of the program code process of training the siamese neural network is carried out. (for example, replacing a command with an equivalent At the second stage a description of the archive data command or group of commands). There are also methods functions is formed using a trained neural network. At the based on the analysis of function’s control flow graph third stage, a description of the current library functions is (CFG). The authors of [5] proposed a search method in formed in a similar way. At the final stage, the search for which the vector representation of function is formed by similar functions is performed. taking into account the structure of its CFG, and the Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) Data Science III. CONSTRUCTION OF THE INITIAL AND INTERMEDIATE The input of each siamese network's LSTM network is a DESCRIPTIONS OF FUNCTIONS function in the form of previously obtained intermediate description. Then the outputs of the last layers of each A. Construction of the initial descriptions of functions network are sent to the input of a function that evaluates the After disassembly analysis of the executable code similarity between these outputs. In this paper, the following performed by disassembler, we can get an assembler of every metric is used as such a function: function in the executable file. Each function consists of a sequence of processor instructions and corresponding J 1 operands.  D ( a , b )  e x p (   a i  b i ),   i0 All processor instructions are divided into K functional groups according to the type of operations they perform (for where a i is the value of the i-th element of the last layer of example, a group of logical operations, a group of arithmetic operations). All operands are also divided into N groups the first LSTM network, b i is the value of the i-th element of according to the types of operands presented in the IDA the last layer of the second LSTM network. Two functions disassembler [10] (for example, base register, FPP register, are considered to be same if D  1 . In the contrary case, if the value itself). D  0 two functions are considered to be totally different. Each command and its corresponding operands are Structural chart of siamese neural network is shown in converted into lexemes (words) as follows: Fig. 1. L e x e m e ( k , n 0 , n 1 )  C o n c a te n a te ( k , n 0 , n 1 ), k  [0 , K  1], n 0  [0 , N  1], n 1  [0 , N  1],  where k is a index number of the functional group of the considered processor instruction, n 0 and n1 are index numbers of the operands groups to which the first and second operands belong respectively. For example, let us consider the command "mov rax, 1": k  0 (Data Transfer Instructions), n 0  1 (General Register), n1  5 (Immediate): Fig. 1. Structural chart of siamese neural network L e x e m e ( 0 ,1, 5 )  ' 0 0 0 1 0 5 '  Here, D is the metric (1), I 0A , I 1A , I 2A are the components Thus, each function can be represented as an ordered set of the intermediate description vector of the first function, of lexemes. corresponding to its lexemes, I 0B , I 1B , I 2B , I 3B are the components of the intermediate description vector of the B. Construction of the intermediate description of the second function, corresponding to its lexemes, y is the output function value of the siamese neural network, taking the value 0 if the The word2vec model [11] is used to obtain a vector functions are different, otherwise 1. representation of the function's initial description. The vector representations obtained by this model are based on the Contrastive loss [13] is used as a loss function, which is contextual proximity of lexemes inside functions. About a given by: dozen different binary files were used to train this model. As a result of the training process, we have a dictionary in which 1 1  L( y, D )  (1  y ) D 2  y { m a x ( 0 ,1  D )} .  2  a vector of a given dimension is assigned to each lexeme 2 2 (obtained from training binary files). Thus, a function consisting of m commands describes as This loss function maximizes and minimizes the value of a two-dimensional vector I of dimension m  s . Let us call (1) between similar ( y  1 ) and different ( y  0 ) functions, this vector as an intermediate description. respectively. IV. CONSTRUCTION OF THE FINAL DESCRIPTION OF THE Two “archives” of functions were used to train siamese neural network : the first one contained the library functions FUNCTION libtiff 4.0.3 [14] and proj 4.9.1 [15], and the second one It is proposed to use the siamese neural network [12] to contained the library functions libtiff 4.0.8 and proj 5.0.1. construct the final description of the functions. This network consists of two identical long short-term memory (LSTM) In the learning process, for each function of the first neural networks with the same weights. archive, two pairs were formed: current function and the function having the same name from the second archive; The LSTM network is a special type of recurrent neural current function, and a random function from the second network (RNN). In contrast to RNN, it is able to work with archive, which has a name different from the first function. long-term dependencies. This is achieved due to its ability to The resulting pairs of vectors are fed to the input of the transfer the state of the cell from the previous time step to the neural network, the output value for the first pair is 1 next step, as well as to control the information flow in the (similar), and for the second pair is 0 (different). LSTM cell. VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 204 Data Science The trained model of the siamese neural network is used  The average precision of the list: to form the final description of the functions by passing an L intermediate description of the analyzed function to the input AveP   Pk ( R k  R k  1 ) , R 0  0 of one of the LSTM networks. The output vector of this k 1  LSTM network, in this case, is the final description of the The average precision for all functions included in the function. The dimension of the output layer is J  3 2 . current library is calculated by the formula: 1 S 1 Thus, any function of the analyzed binary file can be P   AveP s represented as a vector of dimension J.  S s0   V. SEARCH FOR SIMILAR FUNCTIONS where S is a number of functions in the current library. The archive data was represented by the curl 7.6.3 [16], The final stage of the proposed method is the search for and the current library functions was represented by other similar functions using the final description of the functions versions of the curl library. of archive data and the current library. The average precision of search of the proposed method Let a be the final description vector of the current was compared with some previously known methods: a library function, b be the final description vector of the method based on the analysis of the spatial position of archive data library. For feature vector comparison we use processor functional groups [19] and a method based on k- the Euclidean metric (distance): gram comparison [3].As a comparison object for the first method, the comparison object recommended by the authors J 1 was used (the spatial distribution of instructions in the d (a , b )   (a  b ) , i i 2  function body in the integral form). For the second method, i0 the value of the parameter k = 5 was also chosen based on where a i is the value of the i-th component of the current the recommendations of the authors. The results are library function feature vector, b i is the value of the i- presented in table 1. component of the archive data library function feature vector. TABLE I. COMPARISON OF SIMILAR FUNCTIONS SEARCH METHODS If d  0 the functions are considered equal. The final description of the given function of the current Current Average precision of search P library is compared with each function of archive data using library Using proposed Using k-gram Using method Euclidian metric. Then, the obtained results are sorted by method based method presented in [15] increasing the distance (2). As a result, we get the list of curl 7.5.4 0.7830 0.8208 0.7828 archival functions, sorted by similarity in descending order, curl 7.5.6 0.8631 0.8543 0.8550 for the analyzed current library function. curl 7.5.9 0.8886 0.8768 0.8851 VI. EXPERIMENTS curl 7.6.0 0.8960 0.8830 0.9036 To evaluate the efficiency of the proposed method of similar code sequences search in executable files, the The analysis of the obtained results shows that the functions of one dynamic library are used as archive data, method of similar code sequences search presented in this and the functions of the same library, but of a different work is highly competitive with known methods, and in version, are used as the current library. It was considered that some cases surpasses them. Since the training of the siamese in the process of switching from one version of the dynamic neural network was carried out on a small data set, the results library to another, the names of the functions did not change obtained for this method can be further improved by and there are no functions with the same name among the increasing the amount of data for training the neural network. functions of the archive data. VII. CONCLUSION Using the search algorithm described in the fifth section, for a given function of the current library, we obtain an The paper presents a method of similar code sequences ordered list of archive data functions. Let us assign a binary search in executable files based on the siamese neural network. The results of experimental studies demonstrating sequence   (  1 ,  2 , ...,  L ) to this list, the i-th element the efficiency of the developed method (average precision of which is equal to one, if the name of the function at the i- 0.78 - 0.89). Further research will be aimed at tuning the th position of the list is identical to the name of the function parameters of the siamese neural network in order to increase being checked, and the i-th element of which is equal to zero the average precision of the search for this method. otherwise. The following criteria to evaluate the quality of information retrieval [17,18] are used: ACKNOWLEDGMENT  Precision for the k-th position of the list: This work was supported by RFBR research project № 18-01-00748 A. k  l REFERENCES l 1 Pk  [1] M. Verdi, A. Sami, J. Akhondali, F. Khomh, G. Uddin and A.K. k  Motlagh, “An Empirical Study of C++ Vulnerabilities in Crowd- Sourced Code Examples,” arXiv: 1910.01321, 2019.  Recall for the k-th position of the list: k [2] Synopsys Cybersecurity Research Center: Open source security and  l risk analysis, 2020 [Online]. URL: https://www.synopsys.com/ l 1 content/dam/synopsys/sig-assets/reports/2020-ossra-report.pdf Rk  K  VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 205 Data Science [3] G. Myles and C. Collberg, “K-gram based software birthmarks,” [12] J. Bromley, I. Guyon, Y. LeCun, E. Siickinger and R. Shah, Proceedings of the ACM symposium on Applied computing, 2005. “Signature verification using a “siamese” time delay neural network,” [4] IDA Pro. F.L.I.R.T, 2019 [Online]. URL: https://www.hex- Proceedings of the 6th International Conference on Neural rays.com/products/ida/tech/flirt/in_depth.shtml. Information Processing Systems, pp. 737-744, 1994. [5] N. Shalev and N. Partush, “Binary Similarity Detection Using [13] R. Hadsell, S. Chopra and Y. LeCun, “Dimensionality Reduction by Machine Learning,” Proceedings of the 13th Workshop on Learning an Invariant Mapping,” IEEE Computer Society Conference Programming Languages and Analysis for Security - PLAS, 2018. on Computer Vision and Pattern Recognition, vol. 2, 2006. [6] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, [14] TIFF Library and Utilities, 2019 [Online]. URL: “Polymorphic Worm Detection Using Structural Information of http://www.libtiff.org/. Executables,” Lecture Notes in Computer Science, Springer Berlin, [15] PROJ coordinate transformation software library, 2019 [Online]. Heidelberg, pp. 207-226, 2006. URL: https://proj.org/about.html. [7] Y. David and E. Yahav, “Tracelet-based code search in executables,” [16] Libcurl - the multiprotocol file transfer library, 2019 [Online]. URL: ACM SIGPLAN Notices, vol. 49, no. 6, pp. 349-360, 2014. https://curl.haxx.se/libcurl/. [8] S. Eschweiler, K. Yakdan and E. Gerhards-Padilla, “DiscovRE: [17] M. Buckland and F. Gey, “The relationship between Recall and Efficient Cross-Architecture Identification of Bugs in Binary Code,” Precision,” Journal of the American Society for Information Science, Proceedings Network and Distributed System Security Symposium, vol. 45, no. 1, pp. 12-19, 1994. 2016. [18] D.M.W. Powers, “Evaluation: From Precision, Recall and F-Measure [9] S. Aditham and N. Ranganathan, “A Novel Control-flow based to ROC, Informedness, Markedness & Correlation,” Journal of Intrusion Detection Technique for Big Data Systems,” arXiv: Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011. 1611.07649, 2016. [19] A.S. Yumaganov and V.V. Myasnikov, “A method of searching for [10] Hex-Rays IDA: About, 2019 [Online]. URL: http://hex-rays.com/ similar code sequences in executable binary files using a featureless products/ida/. approach,” Computer Optics, vol. 41, no. 5, pp. 756-764, 2017. DOI: [11] T. Mikolov, K. Chen, G. Corrado and J. Dean, "Efficient estimation 10.18287/2412-6179-2017-41-5-756-764. of word representations in vector space", Proc. of ICLR Workshop, 2013. . VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020) 206