A combined method of similar code sequences search in executable files A S Yumaganov1 1Samara National Research University, Moskovskoye shosse, 34, Samara, Russia, 443086 e-mail: yumagan@gmail.com Abstract. The article is devoted to the development of a method of similar code sequences search in executable files, which is based on both syntax analysis of the code and function’s control flow graphs analysis. The syntax analysis method used in this paper is based on a comparison of the spatial distribution of processor instructions in the function body. The analysis of function control flow graph is used a structural description of fixed-order subgraphs of the function control flow graph. The results of experimental studies, including the comparison of the proposed method and previously known methods of searching for similar code sequences, are presented. 1. Introduction The problem of finding similar code sequences in executable files is very relevant today. According to the studies presented in [1, 2], developers often use previously created program code during the process of the new software development. This approach to software development is called code reuse. Despite the obvious advantages of this approach, it can also cause errors and vulnerabilities in the software being developed. In addition, third-party code reuse may be illegal. This approach is also used in the development of various malicious programs [3]. Thus, solving the problem of finding similar code sequences in executable files allows us to solve several problems: finding known vulnerabilities, finding plagiarism in software, and searching for malware. Currently, there are a large number of methods of similar code sequences search in executable files. All known algorithms and methods for solving the above problem are usually based either on the syntactic analysis of the assembler program code or on the analysis of the structure of control flow graphs of the executable file functions. Let us consider the methods based on the syntax analysis of the code. In [4, 5], similar methods of similar code sequences search was presented. These methods are based on comparing sequences or permutations of processor instructions of a fixed length (k-grams or n-perms) in functions. The IDA disassembler uses the IDA FLIRT algorithm [6] to identify library functions, which is based on a comparison of function’s patterns. The author of [7] presented a method for detecting malware, based on the analysis of the frequency of processor instructions occurrence inside the examined file. Significant impact on the quality of the similar code sequences search for this group of methods is made by syntactic code changes: replacing processor instructions with equivalent ones, rearranging instructions, inserting new ones, and deleting old instructions. The methods based on the analysis of the functions control flow graph allow to overcome this V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) Data Science A S Yumaganov drawback. The authors of the method presented in [8] used the information of basic blocks (the vertices of the control flow graph) to detect and classify malware. The authors of [9] used a method based on determining the isomorphism of control flow graphs to identify malware. A similar approach was described in [10], where the detection of malware was based on determining the isomorphism of fixed length function’s subgraphs and comparing the signatures (fingerprints) of their basic blocks. However, this group of methods also has several disadvantages: low search accuracy for functions with a small number of basic blocks, high sensitivity to structural changes in functions. In this paper, a combined method for searching functions of an executable file, which are similar to the known functions from some software ”archive” is proposed. The basis of the proposed method is to use of both the syntactic analysis of the code and the analysis of functions control flow graph. The description of the function in the presented method is formed by its similarity with the functions of the basis library. The paper is structured as follows. The first section presents the basic definitions and a brief description of the proposed method. The second section discusses the process of getting a syntactic description; the third one describes the process of getting the structural description of functions. The fourth section is devoted to the description of the similar functions search algorithm. The fifth section provides the effectiveness evaluation technique of the proposed method and the results of the experiments. In the final part of the paper, the conclusions and a list of references are presented. 2. Basic concepts and principle of operation The following definitions are used in this paper: • current library - the set of the investigated executable file functions; • archival data - the set of the known functions; • basis library - an auxiliary set of functions used to compare the functions of archived data and the current library. Taking into account the above definitions, the problem solved by the proposed method is formulated as follows: for a given (or each) function of the current library, find the most similar function from the archive data. In this paper, we use two definitions of the functions similarity measure. The first one is based on the position of the functional groups of processor instructions in the body of functions. The second one is based on the analysis of the structure of the functions control flow graph. In the previously published works of the author [11, 12], two methods of similar code sequences search were presented, each of which is based on one of the above definitions of the function similarity measure, respectively. This paper presents a combined method of similar code sequences search. In this method, the description of function is formed through its similarity with the functions of the basis library, using two different definitions of the functions similarity measure described in [11, 12]. The proposed method of similar code sequences search includes several stages. At the first stage, the description of the archive data functions is formed through the library of basis functions. In addition, for each function, two descriptions are obtained (syntactic and structural) corresponding to different measures of functions similarity. At the second stage, the description of the current library functions is formed similarly. At the final stage, the search for similar functions is performed. The algorithm of search is described in detail in the fifth section. 3. The syntactic description of the function Using the IDA [13] disassembler, we can obtain a partition of the assembler code of the analysed executable file into functions. The assembler code consists of a sequence of processor instructions and associated operands. All processor instructions can be divided into K functional groups V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 395 Data Science A S Yumaganov according to the type of operations they perform. Examples of such groups are: a group of arithmetic instructions, a group of logical instructions, a group of data transfer instructions. For each of the K functional groups for a given function, we obtain a list of offsets relative to the beginning of the function, on which the instructions of this group are located. Let us determine the spatial distribution of the k instruction type as the absolute frequency of the instructions of this type entering in a relative normalized i-th interval (I = 100): k −1 NX nkj ! f˜ik = I · 100 ∈ (i − 1, i] , i = 1, I, (1) j=0 N where nk0 , ..., nkNk −1 are absolute offsets relative to the beginning of the function of instructions of group k, Nk is a total number of instructions of this group in this function, N is the length of the function, I(.) is the event indicator, which takes the values ”0” or ”1” depending on the truth of the corresponding argument. To obtain the spatial distribution of instructions in the integral form, we use the following formula: i f˜yk P y=0 fˆik = I , i = 1, I. (2) P ˜k f j j=0 Then, the spatial position of the k-th group of processor instructions in the body of the function is described by the vector:  T āk = fˆ1k , fˆ2k , . . . , fˆIk , k = 0, K − 1. (3) As a result, the description of the considered function has the following form: A = (ā0 , ā1 , . . . , āK−1 ) . (4) The matrix B, which represents the description of the basis library functions, is formed in a similar way. The measure of functions similarity, which description is given by the matrices A and B, has the following form: K−1 X K−1 X µ (A, B) = αk µcos (āk , b̄k ), αk = 1, (5) k=0 k=0 where µcos is the cosine distance. If two functions are identical, the measure of similarity takes the value ”1”, otherwise ”0”. Let a library of basis functions contains J functions, each of which has a description in the form of a matrix Bj . Then the description of the considered function through the library of basis functions will have the following form: x̄A = (µ (A, 0 ) , µ (A, 1 ) , . . . , µ (A, J−1 ))T . (6) Further, using the obtained intermediate description of the function, its final description is formed using the PCA (principal component analysis) method for reducing the dimensionality of the data. The process of forming the final description is described in detail in [11]. The resulting description of the function is stored in the corresponding database (archive or current). V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 396 Data Science A S Yumaganov 4. The structural description of the function IDA disassembler allows to obtain a control flow graph of the analysed executable file functions. The control flow graph of a function is a directed graph which vertices are the basic blocks of function. The basic block of function is a sequence of processor instructions. The first instruction of the basic block receives the control from some processor instruction, and the last one is an instruction, which passes the control to another basic block. The edges of the control flow graph determine the order of the basic blocks in the control flow of the function. The control flow graph of the analysed function is divided into subgraphs of fixed order k (k-subgraphs) as follows: each basic block is chosen as a starting node and all edges beginning from that node are traversed until k nodes are encountered. In this paper, k = 3 is used. The description of each of the k-subgraphs of the function consists of a pair of vectors: ā and b̄. The ā vector is a binary vector obtained by combining the rows of the adjacency matrix of a given k-subgraph. The b̄ vector characterizes the presence or absence of reading or writing operations in operands of various types in this k-subgraph. A detailed description of the vector b̄ is presented in [12]. Thus, the initial description of a function consists of a set of pairs of vectors ā and b̄ describing each k-subgraph of the function. The intermediate description of the function is formed on the basis of its similarity with the functions of the basis library. As a measure of similarity, we use the generalized Jaccard index [14]: P min(xi , yi ) i J(x, y) = P , (7) max(xi , yi ) i where x is a set of vector pairs which described the first function, y is a set of vector pairs which described the second function,xi is a number of pairs i in set x,yi is aSnumber of pairs i in set y, i passed through all unique pairs of vectors in the combined set x y. In the case of complete similarity between functions the value of similarity measure is ”1”, in case of complete dissimilarity it is ”0”. Let x be a set of vector pairs described the analysed function, yi is a set of vector pairs described the i-th function of the basis library, I is a number of functions in the basis library, then the intermediate description of the function under investigation has the following form: z̄ = (J(x, y0 ), J(x, y1 ), ..., J(x, yI−1 ))T . (8) The final description of the function is obtained after applying the PCA dimension reduction method and is stored in the appropriate database (archive or current). 5. Search for similar functions The final stage of the proposed method is the search for similar functions based on the previously obtained function description vectors. In the previous works of the author, a search algorithm with the following assumption was used: the size of the changed functions differs from the original by no more than 30% [11]. This assumption was used to increase the quality of the search. Thus, at the first stage of the search, the archive functions were filtered by their size, then the Euclidean distance to each function was calculated from the filtered list of archive functions, and the result was sorted by increasing the Euclidean distance. This paper presents a combined method of similar functions search, which uses a syntactic and structural description of functions. However, if the number of basic blocks of the function being studied is small (bbmin ≤ 5), only the syntactic description of the function and the search algorithm described above are used. This condition allows to increase the search accuracy since V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 397 Data Science A S Yumaganov the structural description of small functions can be very similar to the structural description of the similar size functions due to their small size. The search method presented in this paper uses the following algorithm for similar functions search (if bbmin > 5): • At the first stage, preliminary filtering of archive functions is performed. For the function under study and all the functions of the archival data, the Euclidean distance between the vectors of the syntactic (or structural) description of the functions is calculated and sorted by ascending distance. The first topth = 30 elements of the list of the most similar archive functions are used in the second stage of the search. • At the second stage of the search, the Euclidean distance from the function under investigation to the list of functions obtained above is calculated. However, in this case, we use another type of vectors as the description vectors. In other words, if at the first stage the functions were compared by the vectors corresponding to their syntactic description, then at the second stage functions will be compared by the vectors corresponding to their structural description and vice versa. Then, the obtained result is sorted by increasing the Euclidean distance. As a result, for the analysed function, a list of archive data functions is obtained, sorted by similarity in descending order. 6. Experiments To evaluate the efficiency of the proposed method of similar code sequences search in executable files, the functions of one dynamic library are used as archive data, and the functions of the same library, but of a different version, are used as the current library. It was considered that in the process of switching from one version of the dynamic library to another, the names of the functions did not change and there are no functions with the same name among the functions of the archive data. Using the search algorithm described in the fifth section, for a given function of the current library, we obtain an ordered list of archive data functions. Let us assign a binary sequence β = (β1 , β2 , ..., βL ) to this list, the i-th element of which is equal to one, if the name of the function at the i-th position of the list is identical to the name of the function being checked, and the i-th element of which is equal to zero otherwise. The following criteria to evaluate the quality of information retrieval [15, 16] are used: k P βl • Precision for the k-th position of the list: Pk = l=1k k P βl • Recall for the k-th position of the list: Rk = l=1K L • The average precision of the list: AveP = Pk (Rk − Rk−1 ), R0 = 0 P k=1 The average precision for all functions included in the current library is calculated by the formula: 1 S−1 X P = AvePs , (9) S s=0 where S is a number of functions in the current library. Several versions of the libtiff library [17] were used for experiments. The functions of the library libtiff 4.0.8 were used as functions of the archived data. These libraries were compiled with the optimization flag /Od (optimization disabled). V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 398 Data Science A S Yumaganov The results of comparing two methods of preliminary filtering of the archival data functions are presented in table 1. Table 1. Comparing preliminary filtering methods of the archival data functions. Current library Average precision of search P, Average precision of search P, filtering by syntactic description filtering by structural description libtiff 3.9.2 0.7424 0.7592 libtiff 3.9.7 0.7551 0.7707 libtiff 4.0.3 0.7835 0.8290 libtiff 4.0.5 0.7943 0.8432 The average precision of the search for the considered libraries is higher when preliminary filtering of the archival data functions by structural description is used. Therefore, in further experiments, this method of preliminary filtering of archive functions will be used. In next experiment, the average precision of search of the proposed method was compared with some previously known methods: a method based on the analysis of the spatial position of processor functional groups [11] and a method based on k-gram comparison [4].As a comparison object for the first method, the comparison object recommended by the authors was used (the spatial distribution of instructions in the function body in integral form). For the second method, the value of the parameter k = 5 was also chosen based on the recommendations of the authors. The results are presented in table 2. Table 2. Comparison of similar functions search methods. Current library Average precision Average precision Average precision of search P,using of search P,using of search P,using proposed method k-gramm based method method presented in [11] libtiff 3.9.2 0.7592 0.7528 0.7370 libtiff 3.9.7 0.7707 0.7681 0.7524 libtiff 4.0.3 0.8290 0.7970 0.8257 libtiff 4.0.5 0.8432 0.8119 0.8427 The analysis of the obtained results shows that the method of similar code sequences search presented in this paper is superior to the previously known methods in each of the used current libraries. Moreover, the ”closer” the version of the current library to the version of the archive data library, the less advantage the presented method has over the method [11]. This is explained by the fact that for libraries used in experimental studies, some functions of older versions of the current library have significant syntactic changes with respect to archive functions. And preliminary filtering of archival data by structural description can significantly improve the precision of the search. 7. Conclusion The paper presents a combined method of similar code sequences search in executable files using both syntactic and structural descriptions of functions. The results of experiments demonstrating the superiority of the developed method over some previously known methods have been presented. Further studies will be carried out in improving the accuracy of similar functions search and studying the efficiency of the proposed method using executable files compiled with different compilation settings. V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 399 Data Science A S Yumaganov 8. References [1] Abdalkareem R, Shihab E and Rilling J 2017 On code reuse from StackOverow: An exploratory study on Android apps Information and Software Technology 88 148-158 [2] Gharehyazie M, Ray B and Filkov V 2017 Some from here, some from there: cross-project code reuse in GitHub Proc. of the 14th International Conference on Mining Software Repositories 1 291-301 [3] Examining Code Reuse Reveals Undiscovered Links Among North Koreas Malware Families URL:https://securingtomorrow.mcafee.com/mcafee-labs/examining-code-reuse-reveals-undiscovered -links-among-north-koreas-malware-families/ (05.11.2018) [4] Myles G and Collberg C 2005 K-gram based software birthmarks Proc. of the ACM symposium on Applied computing 1 314-318 [5] Karim M, Walenstein A, Lakhotia A and Parida L 2005 Malware phylogeny generation using permutations of code Journal in Computer Virology 1 13-23 [6] IDA F.L.I.R.T Technology: In-Depth URL: https://www.hex-rays.com/products/ida/tech/irt/in depth.shtml (05.11.2018) [7] Bilar D 2007 Opcodes as predictor for malware International Journal of Electronic Security and Digital Forensics 1 156-168 [8] Gheorghescu M 2005 An automated virus classi_cation system Virus Bulletin Conference 1 294-300 [9] Bruschi D, Martignoni L and Monga M 2006 Detecting selfmutating malware using control-ow graph matching Proc. of the Third international conference on Detection of Intrusions and Malware & Vulnerability Assessment 1 129-143 [10] Kruegel C, Kirda E, Mutz D, Robertson W and Vigna G 2005 Polymorphic worm detection using structural information of executables Recent Advances in Intrusion Detection 1 207-226 [11] Yumaganov A and Myasnikov V 2017 A method of searching for similar code sequences in executable binary files using a featureless approach Computer Optics 41(5) 756-764 DOI: 10.18287/ 2412-6179-2017-41-5-756-764 [12] Yumaganov A and Myasnikov V 2018 Searching for similar code sequences in executable files based on the structural analysis of functions J. Phys.: Conf. Ser. 1096 012093 [13] Hex-Rays IDA: About URL: http://hex-rays.com/products/ida/ (05.11.2018) [14] Spath H 1981 The minisum location problem for the Jaccard metric Operations-Research- Spektrum 3 91-94 [15] Buckland M and Gey F 1994 The relationship between recall and precision JASIS 45 12-19 [16] Powers D 2011 Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation Journal of Machine Learning Technologies 2 37-63 [17] TIFF Library and Utilities URL: http://www.libti_.org/ (05.11.2018) V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 400