<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18287/1613-0073-2016</article-id>
      <title-group>
        <article-title>SIMILARITY SEARCH OVER PROGRAM CODE SEQUENCES USING FEATURELESS PATTERN RECOGNITION TECHNIQUES</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A.S. Yumaganov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V.V. Myasnikov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1638</volume>
      <fpage>437</fpage>
      <lpage>443</lpage>
      <abstract>
        <p>In this paper, we describe a problem of searching similar code sequences over binary executable program files. A proposed method is based on function opcode partitioning into functional groups, forming the function descriptions by comparing the executable code with the code of "base" library functions, and the dimensionality reduction of the resulting indirect description, which is used to run a search at the final step. We conducted an experimental study to show, that the proposed method is applicable for the problem solution, and to demonstrate its operability and efficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>search</kwd>
        <kwd>code sequence</kwd>
        <kwd>featureless pattern recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        With each passing day the amount of malicious software increases rapidly [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ]. So,
the malware detecting method, based solely on comprasion of the signatures for
verifiable executable module, seems to be insufficient [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Due to the fact that most
new viruses are a modification of the old ones, the effective method of analysis is the
comparison of executable code with previously met [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Unfortunately, the direct
comparison of code sequences turned out to be ineffective due to various reasons
(various code compilation/optimization options, the virus code alteration and
improvement and etc.). Thus, the development of the high-level code analysis
methods is becoming an urgent problem, and the detection means, based on
recognition techniques and data analysis, become more popular.
      </p>
      <p>In this paper, the method of binary executable file functions search, similar to the
known functions from some software "archive", is proposed. This method can be used
directly for malware detection, by using the set of existing programs as software
"archive". Due to the extreme complexity of obtaining the primary comprehensive
executable code (graph) description, the proposed search method is based on the
principles of featureless pattern recognition: the function, we are interested in, is
represented by its relationships (similarity/proximity) with some base library
functions. Thus, the obtained redundant description is reduced using Principal
Component Analysis (PCA) for the next search implementation.</p>
      <p>The paper is organized as follows. In the first section we give a brief description of
the proposed method. The second section is devoted to the method of constructing the
function description via set of auxiliary functions (base library). The third section
briefly describes the dimensionality reduction method, used to derive an ultimate
function representation. In the fourth section we provide a similar function search
algorithm based on the derived description. In the fifth section we provide the method
effectiveness evaluation technique and the results of the experiments. The final part of
the paper comprises the findings and the list of references.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Basic concepts and the proposed method</title>
      <p>In this paper, the following definitions are used:



current library – the set of the analyzed executable file functions;
archival data – the set of known functions and their descriptions via base library;
base library – an auxiliary set of functions, used to compare archival data
functions with current library functions.</p>
      <p>The problem can be informally stated as follows: to find the most similar archival
data function for a given (or each) current library function. In general, the concrete
definition of similarity (measures) can vary depending on different problems and
solutions. In this papper, the proposed similarity measure is specified in the second
section.</p>
      <p>The process of solving the described problem can be divided into several stages. At
the first stage, we represent the current library function(s) via (similarity)
relationships with functions of a particular base library. At the second stage, the
resulting description is reduced via PCA method. A reduced description is entered
into the description database, which also stores archival data function descriptions. At
the third stage, the search is actually implemented (via database tools), performing
ordering of archival data function descriptions by the Euclidean distance criterion.
The presented method, consisting of the three stages, has a number of tunable
parameters, which can be used to enhance the effectiveness of the proposed solution.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Base library function representation</title>
      <p>
        After disassembly analysis of the executable code, we can get an assembler code
partition into functions. Each function consist of set of processor instructions.
For convenience, we divide all processor instructions into K=45 functional groups [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Each group contains a set of instructions that perform one type of action. A set of data
transfer instructions, a group of control transfer instructions are the examples of such
groups. Such partition lays down the requirements for the used base library: each
functional group of instructions must have its representative in at least one base
library function.
      </p>
      <p>Let us consider the processor instructions inside of the given function. For each group
of instructions of this function we build a (spatial) distribution of instructions,
belonging to this group, via function length, as follows.</p>
      <p>Let n0k ,..., nNkk 1 be the absolute offsets (positions) with respect to the start point of a
function, belonging to the k group, N k be the total number of instructions, be-longing
K 1
to the group, inside the given function, and N   N k be the total number of
k0
instructions inside the function, called the function’s length. We define the spatial
distribution of the k instruction type as absolute frequency of this type instruction
occurrences in a (relative) normalized ith interval I  100  :</p>
      <p>Nk 1  nk 
~fik   I  Nj 100 i  1, i, i  1, I
j0  
(1)
where I(.) is an event indicator, taking values "0" or "1" depending on whether the
corresponding argument is true or false.</p>
      <p>
        To eliminate the effect of the small offsets of instructions in the code on the result of
the search, we additionally use a kernel [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Then code instruction distribution
estimation has the following form:
fik  jN1 f jk 1h K (
i  j
h
      </p>
      <p>), i  0, I ,
K (r) 
1
2
exp( 1 r 2 ) .</p>
      <p>2
where K (r) is a kernel function, h – window height. For example, we can use the
Gaussian kernel as a kernel function:
As a result, we get K mutually characterizing a particular function vectors of the
following form:
ak   f1k , f 2k ,, f Ik T , k  0, K  1 .</p>
      <p>The set of vectors mutually forms the description matrix:
A  a0 , a1 ,, aK 1  .</p>
      <p>A similar matrix, denoted as С, can be constructed for each function, including the
base library functions.
The similarity between two functions, defined by matrices A and C, is measured as
follows:</p>
      <p>K 1
 A, С    k  cos (ak , сk ) ,</p>
      <p>k 0
where
 cos (a , с ) 
K 1
 k  1 ,
k 0</p>
      <p>I
 ai сi
i1
I
 ai 2
i1</p>
      <p>I
 сi 2
i1</p>
      <p>,
and the values  k  0 , satisfying a condition
are the method parameters.</p>
      <p>In the case of complete similarity between functions the value of similarity measure is
"1", in case of complete dissimilarity it is "0".</p>
      <p>Let J be the number of the base library functions, each of which has a description in a
form of a matrix Bj. Thus, we compare the investigated function of the current library
with each base library function to get the following description vector of the function:
x A   A, С0 ,  A, С1 ,,  A, С J 1 T .
4</p>
    </sec>
    <sec id="sec-4">
      <title>The description dimensionality reduction method</title>
      <p>
        Since J  1 , we apply the dimensionality reduction method, namely, Principle
Component Analysis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This method include calculation of eigenvectors and
eigenvalues of the covariance matrix of the corresponding vectors:
B  {( x  (x))( x  (x))T } .
      </p>
      <p>After sorting the eigenvalues of the covariance matrix in descending order, we
arrange the corresponding eigenvectors in a similar manner. If we select only the
I  J largest eigenvalues, we will get the transition matrix Y, consisting of I
eigenvectors. Thus, to obtain feature vector of size I  J we use the following
expression (an index, containing a reference to the original description, hereinafter
omitted):
z  Yx .
(2)
d z, z * </p>
      <p>
        I 1
 (zi  zi* ) 2 ,
i0
(3)
Vector z serves as the final representation of the investigated function via the set of
the base library functions and does not contain any elements of the primary
description. This allows us to associate the proposed approach with featureless pattern
recognition techniques [
        <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
        ].
      </p>
      <p>The described method of the data description via the base library functions is applied
for archival data function description, as well as for current library function
description. The transition matrix Y is precalculated and remains unchanged.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Search for similar functions</title>
      <p>The final stage of the proposed method is the search for similar functions via derived
descriptions, represented as vectors (2).</p>
      <p>Let z be the vector, describing the current library function representation via the base
library, z * be the vector, describing archival data function representation via the base
library. For feature vector comparison we use the Euclidean metric (distance):
where zi is the i th component of the first function feature vector, zi* is the i th
component of the second function feature vector. If d = 0, the functions are
considered equal.</p>
      <p>Assume that, when function is modified, its size varies by no more than 25%. In this
case, we use the following functions search algorithm, similar to the one under
consideration:
1. First we determine the function length N.
2. Then we get the list of archival data functions, which size differs from the
investigated function size by no more than 25%.
3. For each function from the list, obtained on the second step, we find the Euclidean
distance to the investigated function.
4. Finally, we sort the results, obtained in the previous step, by ascending the d value.
As a result, we get the list of archival data functions, sorted by similarity in
descending order (by distance in ascending order (3)), for the investigated function.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>
        To test the effectiveness of the proposed method of similar functions search we take
functions of one library as archival data and functions of another version of the same
library as a current library. To determine a priori similar (identical) functions we
assume, that, when the library code is being modified, function names do not change,
and there are no functions with the same name among archival data functions.
Under these assumptions, it is possible to evaluate the quality of recognition by
getting the list of similar functions {Fl }l1,..., L , sorted by similarity in descending
order. To do this, we associate this list of functions with binary sequence
  ( 1 ,  2 ,...,  L ) , such that if there is a function with the same name on the l th
position, then  l  1 , else  l  0 . Let us introduce the following measures, which
are commonly used as quality criteria of information retrieval [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
1) Precision Pk for the k th position of the list :
k
  l
Pk  l1
k
k
  l
Rk  l1
      </p>
      <p>K
is the number of functions with the same name as the investigated function on the first
k positions of the ordered list as a proportion of the total number of functions in the
list.</p>
      <p>2) Recall (completeness) Rk for the k th position of the list:
is the number of functions with the same name as the investigated function on the first
positions of the ordered list as a proportion of the total number of functions with the
same name among archival data functions.</p>
      <p>3) The average precision for the list:</p>
      <p>L
AveP   Pk (Rk  Rk 1 ), R0  0 .</p>
      <p>k 1
P 
1 S 1</p>
      <p> AvePs ,</p>
      <p>S s0
Thus, the average precision for all current library functions is calculated by the
formula:
where S is the number of functions in the current library.</p>
      <p>To carry out research we put parameters (1) of the
 k  1 k  0, K  1.</p>
      <p>
        The archival data are represented by the libtiff 4.0.3 library [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and current library is
represented by one of the following versions of the libtiff library: libtiff 4.0.4, libtiff
4.0.5, libtiff 4.0.6. Each library contains about 1200 functions in total; the code size is
498 Kb.
      </p>
      <p>The results of the experimental study on the evaluation of the proposed method
quality are shown in Table 1.
method equal to "1":
On the basis of these results it can be concluded that the proposed method of search
similar code sequences over executable files is efficient, as well as quite effective.
In this paper, the method of search similar code sequences over binary executable
program files is proposed. The method is based on function description construction
by comparing the executable code with the code of a “base library” functions,
dimensionality reduction of the resulting indirect description, which is used in the
final step to perform the search. The results of the experimental study demonstrate the
efficiency and effectiveness of the proposed method.</p>
      <p>Further research will be carried out in:
─
─
parametric optimization of the proposed method;
development of proximity measures, based on structural joint analysis of function
performance graphs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. 2015
          <string-name>
            <given-names>Internet</given-names>
            <surname>Security</surname>
          </string-name>
          <article-title>Threat Report: Attackers are bigger, bolder, and faster</article-title>
          . URL: http://www.symantec.com/connect/blogs/2015-internet
          <article-title>-security-threat-report-attackersare-bigger-bolder-and-faster.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. PandaLabs identifies
          <volume>227</volume>
          ,
          <article-title>000 malware samples per day in the first quarter of 2016</article-title>
          . URL: http://www.pandasecurity.com/mediacenter/news/pandalabs-study-q1/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Umakant</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <article-title>Methods of Virus Detection and Their Limitations</article-title>
          . URL: http://ssrn.com/abstract=1916708.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>4. IT threat evolution in Q1 2016</article-title>
          . URL: https://securelist.com/analysis/quarterly-malwarereports/74640/it-threat
          <article-title>-evolution-in-q1-</article-title>
          <year>2016</year>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>5. x86 Assembly Language Reference Manual</article-title>
          . URL: https://docs.oracle.com/cd/E19253- 01/
          <fpage>817</fpage>
          -5477/
          <fpage>817</fpage>
          -
          <lpage>5477</lpage>
          .pdf .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fukunaga</surname>
            <given-names>K.</given-names>
          </string-name>
          <article-title>Introduction to statistical pattern recognition</article-title>
          . New York and London: Academic Press,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Alpaydin</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>Introduction to Machine Learning</article-title>
          . MIT, 2nd edn.,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vapnik</surname>
            <given-names>V.</given-names>
          </string-name>
          <article-title>An Overview of Statistical Learning Theory</article-title>
          .
          <source>Neural Networks, IEEE Transactionson</source>
          ,
          <year>1999</year>
          ;
          <volume>10</volume>
          (
          <issue>5</issue>
          ):
          <fpage>988</fpage>
          -
          <lpage>999</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kuznetsov</surname>
            <given-names>AV</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Myasnikov</surname>
            <given-names>VV</given-names>
          </string-name>
          .
          <article-title>A comparison of algorithms for supervised classification using hyperspectral data</article-title>
          .
          <source>Computer Optics</source>
          ,
          <year>2014</year>
          ;
          <volume>38</volume>
          (
          <issue>3</issue>
          ):
          <fpage>494</fpage>
          -
          <lpage>502</lpage>
          . [in Russian]
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Buckland</surname>
            <given-names>MK</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gey</surname>
            <given-names>FC</given-names>
          </string-name>
          .
          <article-title>The relationship between recall and precision</article-title>
          .
          <source>JASIS</source>
          ,
          <year>1994</year>
          ;
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>12</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Powers</surname>
            <given-names>DMW</given-names>
          </string-name>
          .
          <article-title>Evaluation: From Precision, Recall and</article-title>
          <string-name>
            <surname>F-Measure to</surname>
            <given-names>ROC</given-names>
          </string-name>
          , Informedness,
          <source>Markedness &amp; Correlation. Journal of Machine Learning Technologies</source>
          ,
          <year>2011</year>
          ;
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>37</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>LibTIFF - TIFF Library</surname>
          </string-name>
          and
          <article-title>Utilities</article-title>
          . URL: http://www.libtiff.org/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>