<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3533767.3534367</article-id>
      <title-group>
        <article-title>FASER: Binary Code Similarity Search through the use of Intermediate Representations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Josh Collyer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Watson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iain Phillips</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Loughborough University</institution>
          ,
          <addr-line>Epinal Way, Loughborough, LE11 3TU</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Alan Turing Institute</institution>
          ,
          <addr-line>British Library, 96 Euston Rd., London NW1 2DB</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>13</lpage>
      <abstract>
<p>Being able to identify functions of interest in cross-architecture software is useful whether you are analysing malware, securing the software supply chain or conducting vulnerability research. Cross-Architecture Binary Code Similarity Search has been explored in numerous studies and has used a wide range of different data sources to achieve its goals. The data sources typically used draw on common structures derived from binaries such as function control flow graphs or binary level call graphs, the output of the disassembly process or the outputs of a dynamic analysis approach. One data source which has received less attention is binary intermediate representations. Binary intermediate representations possess two interesting properties: they are cross-architecture by their very nature and encode the semantics of a function explicitly to support downstream usage. Within this paper we propose Function as a String Encoded Representation (FASER) which combines long document transformers with the use of intermediate representations to create a model capable of cross-architecture function search without the need for manual feature engineering, pre-training or a dynamic analysis step. We compare our approach against a series of baseline approaches for two tasks: a general function search task and a targeted vulnerability search task. Our approach demonstrates strong performance across both tasks, performing better than all baseline approaches.</p>
      </abstract>
      <kwd-group>
<kwd>Binary Code Similarity Search</kwd>
        <kwd>Intermediate Representations</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Binary Code Similarity Search aims to provide a means of finding compiled functions which are similar
to a given query function. Being able to achieve this is useful when wanting to identify similarities
between malware functionality, identify function re-use or to understand whether a piece of software
contains known vulnerabilities. This is a complex undertaking. Factors ranging from the diversity of
toolchains to compiler optimization options mean that functions can be represented differently across
binaries. The diversity of ISAs is vast when viewing the problem through an embedded computing
lens where software can be used within systems ranging from a MIPS-based embedded 5G modem to a
1750A-based subsystem in a US Apache Helicopter.</p>
      <p>
        This problem is not new, however, and Binary Code Similarity Search has been tackled using a
range of different methods. In particular, natural language processing (NLP) approaches have been
transitioned from other domains and applied to binary analysis tasks. Early approaches such as
SAFE [14], asm2vec [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and InnerEye [
        <xref ref-type="bibr" rid="ref9">24</xref>
        ] explored using NLP for binary code search utilising
state-of-the-art approaches. The literature then developed and moved on to explore the advances in
Transformer architectures in approaches such as jTrans [20], PalmTree [11] and Trex [15], all of which
use a similar pre-training methodology as BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with the addition of domain specific tasks or binary
analysis specific data sources.
      </p>
      <p>The aforementioned approaches all suffer from challenges with constructing a vocabulary which
is able to cover the range of possible inputs. This is referred to as the Out of Vocabulary (OOV) problem [11].</p>
      <p>Figure 1: An overview of the FASER pipeline. A collection of binaries is disassembled and lifted to IR (using radare2 and bin2ml), function-as-string representations are generated, normalized and deduplicated, and then used for model training; the trained model supports firmware vulnerability search and zero-shot function search.</p>
      <p>
        The OOV problem stems from the use of assembly instructions as input, which can include a broad range
of possible values such as memory addresses and opcodes. Even after normalization the number of
possible inputs continues to increase with the number of supported architectures due to
implementation-specific nuances. In order to overcome this challenge, some approaches have instead sought to use an
intermediate representation (IR) as the input data format. For example, XLIR [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] uses the LLVM IR to
conduct binary-to-source function search and Pewny [16] uses the VEX IR alongside a form of concolic
execution to create bug signatures for known bugs to conduct vulnerability search. Neither of these
approaches however tackles binary function search directly using only the IR without any additional
inputs.
      </p>
      <p>
        Within this paper, we propose Function As a String Encoded Representation (FASER)1 which combines
the long document transformer architecture, Longformer [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with the use of radare2’s [19] Evaluable
String Intermediate Language (ESIL) to create a cross architecture model which is capable of binary
function search across multiple different architectures. Through using an IR as the input data type,
we sidestep the issue of having to normalize for each assembly language and instead normalize once
across a single common representation.
      </p>
      <p>The key contributions of this paper are:
1. A binary function representation, IR Functions as Strings, which requires no additional data
processing effort other than normalization.
2. A cross-architecture model which combines the usage of IRs alongside longer context transformers,
and a demonstration of its usefulness for cross-architecture function search and known vulnerability
detection.
3. We demonstrate that it’s possible to get strong cross-architecture binary search performance
using a transformer architecture without the need for pre-training and instead using deep metric
learning to train directly for the binary function search objective.
4. We conduct, as far as the authors are aware, the first experiment for cross-architecture function
search using RISC-V architecture as part of the experimental methodology.</p>
      <p>This paper is structured in the following way. Section 2 describes the research methodology used for
this research before then moving onto Section 3 which presents the experimental results derived from
a series of experiments conducted. This paper then concludes with Section 4,
where we discuss our findings, detail the implications and propose potential future research avenues.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this section we provide an overview of the methodology used to create our experimental dataset
and details related to how we train and evaluate our proposed solution. We begin by describing our
chosen IR before describing the dataset used. We then move onto describe the process of going from raw
binaries to pre-processed, training-ready data. This section then continues to describe the model design,
training configuration and evaluation design before detailing the metrics and baseline approaches used
to compare against.
1 https://github.com/br0kej/FASER</p>
      <sec id="sec-2-1">
        <title>2.1. Chosen Intermediate Representation</title>
<p>The chosen IR is radare2’s Evaluable Strings Intermediate Language (ESIL). radare2 converts assembly
language into a semantic equivalent, ESIL, which represents the architecture-specific instructions
using a combination of symbols and numbers. Figure 2 provides several examples of x86-64 assembly
instructions and the corresponding ESIL representations. The primary reason for choosing ESIL over
other IRs such as VEX, LLVM or PCode was compactness. Any given assembly instruction corresponds
to a single ESIL string. Whilst some assembly instructions create very large ESIL string representations,
in our experimentation the ESIL IR is typically shorter and more succinct than
the alternatives.</p>
        <p>
          disasm: push rbp                 esil: rbp,8,rsp,0,=[8],8,rsp,-=
disasm: call sym.imp.printf      esil: 4176,rip,8,rsp,-=,rsp,=[8],rip,=
disasm: mov dword [rbp - 8], 0   esil: 0,0x8,rbp,-,=[4]
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dataset</title>
<p>In order to evaluate FASER and compare against comparative baselines, we use two of the datasets
detailed within [13]. The first dataset is Dataset-1, created by [13]. Dataset-1 contains seven popular
open source projects: ClamAV, curl, nmap, OpenSSL, UnRAR, z3 and zlib. Each of these is compiled
for ARM32, ARM64, MIPS32, MIPS64, x86 and x86-64 using four different versions of Clang and GCC
alongside 5 different optimization levels. This results in each of the 7 projects having 24 unique compiler,
architecture and optimization combinations for each binary within the library. Within [13], the authors
formulate 6 tasks using this dataset which increase in difficulty. For the purposes of this paper, we have
chosen the most difficult, denoted as XM. The XM task imposes no constraints on which functions can
be sampled from the corpus during test time and includes all possible compiler, architecture and bitness
combinations. This task is representative of conducting binary function search against real binaries.</p>
        <p>The second dataset we use is the Dataset-Vulnerability dataset also part of [13]. This dataset consists
of two firmware images that include several OpenSSL CVE’s, specifically within the libcrypto library
included as part of the firmware. The first firmware image is from a Netgear R7000 router, which is ARM32;
the second is from a TP-Link Deco-M4 mesh router, which is MIPS32. The dataset also includes the same vulnerable
library compiled for ARM32, MIPS32, x86 and x86-64. The goal is to use the vulnerable functions from
our compiled libraries as a query function and then identify the corresponding vulnerable function
within the firmware image. In addition to the two tasks above, we augment the second dataset with
libcrypto compiled for RISC-V 32-bit and then re-run the firmware search. This task has been introduced
to explore whether an IR model is capable of transferring its learning to architectures it has not seen
before and can be considered a research first.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Data Generation</title>
        <p>
          In order to generate the training data, bin2ml [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] was used for both data extraction and pre-processing.
bin2ml uses radare2 to disassemble the binaries and lift functions into ESIL IR. Once this lifting process
is complete, the data is then processed further to create ESIL function-as-string representations
by concatenating all ESIL instructions for a given function into a single, long string. Any strings that
were longer than our model’s input dimension were truncated. This could potentially cause a loss of
key information but is mitigated by the large input dimension chosen. Each function string is then
normalized using a series of heuristics before the entire corpus is deduplicated. This process was
repeated for all binaries within our datasets. The specifics of the normalization and deduplication
process are presented below.
        </p>
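        <p>The function-as-string step can be sketched as follows; the instruction data and the token-counting scheme here are illustrative assumptions, not bin2ml's exact behaviour.
```python
def function_as_string(esil_instructions, max_tokens=4096):
    """Concatenate a function's ESIL instruction strings into one long string.

    Each ESIL instruction is itself a comma-separated string; instructions
    are joined with commas and truncated to the model's input dimension
    (token counting is simplified to comma-separated elements).
    """
    tokens = []
    for ins in esil_instructions:
        tokens.extend(ins.split(","))
    return ",".join(tokens[:max_tokens])

# Hypothetical example: three ESIL instructions from a small function
func = ["rbp,8,rsp,0,=[8],8,rsp,-=", "rsp,rbp,=", "rbp,rsp,="]
print(function_as_string(func))
```
        </p>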
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Normalization</title>
        <p>Normalization is a fundamental step to ensure that the vocabulary size is manageable, and all possible
inputs can be encoded. In order to facilitate this, a series of heuristics were applied to replace parts of the
ESIL strings. The normalization approach draws on common approaches outlined within the literature
such as those used within SAFE [14], jTrans [20] and PalmTree [11]. Firstly, any hexadecimal value
which starts with 0xfffff or is one to three characters long (such as 0x023 or 0x02) are considered
immediate constants and replaced with the IMM token. Secondly, any hexadecimal value which starts
with 0x followed by 4 or more hexadecimal characters is considered a memory address and replaced with
MEM. Thirdly, due to the way radare2 represents function calls and data accesses, these values are
typically represented as integers within ESIL representations. For this reason, if the opcode the ESIL
representation was derived from is a call opcode, the integer is replaced with the FUNC token and
otherwise DATA. And lastly, general purpose registers are replaced with tokens based on their size,
32-bit registers are replaced with reg32 and 64-bit registers are replaced with reg64. As part of the
experimentation, two versions of FASER were trained, one without register normalization and one with.
This allows us to understand the impact register normalization would have on an IR based model.</p>
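        <p>A minimal sketch of these heuristics in Python; the exact patterns and the register table used by FASER are assumptions here, and the register map is a small illustrative subset.
```python
import re

# Illustrative subset of the register-to-size mapping
REG64 = {"rax", "rbx", "rcx", "rdx", "rsp", "rbp", "rip"}
REG32 = {"eax", "ebx", "ecx", "edx", "esp", "ebp"}

def normalize_esil(esil, is_call=False):
    out = []
    for tok in esil.split(","):
        if re.fullmatch(r"0xf{5,}[0-9a-f]*", tok):
            out.append("IMM")          # values starting 0xfffff: immediates
        elif re.fullmatch(r"0x[0-9a-f]{1,3}", tok):
            out.append("IMM")          # short hex values: immediates
        elif re.fullmatch(r"0x[0-9a-f]{4,}", tok):
            out.append("MEM")          # long hex values: memory addresses
        elif tok.isdigit():
            out.append("FUNC" if is_call else "DATA")
        elif tok in REG64:
            out.append("reg64")
        elif tok in REG32:
            out.append("reg32")
        else:
            out.append(tok)
    return ",".join(out)

print(normalize_esil("4176,rip,8,rsp,-=,rsp,=[8],rip,=", is_call=True))
```
        </p>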
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Deduplication</title>
        <p>After normalization, deduplication takes place. Deduplication is critical because even after changing
factors such as optimization level and compiler, it is still possible for binaries generated from the same
source code to produce identical functions. This step is often overlooked in existing literature; our
process is comparable to the approach presented in [13]. For each of the normalised ESIL strings, the ESIL string
plus the function name are concatenated into a single string before being hashed. These hashes are
then compared with each other to identify where there are duplicate functions. For any matches found,
only one was kept ensuring that the dataset used for training contains only unique function strings.
There is then a subsequent step which looks through the entire dataset and eliminates functions
which are only present once, essentially removing functions which, regardless of architecture
or optimization, are invariant after disassembly and lifting to IR. Across our dataset, the deduplication
process eliminated on average 20-25% of the functions from a given library.</p>
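        <p>The deduplication process can be sketched as follows; the hash function and data layout are illustrative assumptions rather than the bin2ml implementation.
```python
import hashlib
from collections import Counter

def deduplicate(functions):
    """functions: list of (function_name, normalized_esil_string) pairs.

    Hash name plus string, keep one copy per hash, then drop functions
    whose name appears only once across the deduplicated dataset.
    """
    seen, unique = set(), []
    for name, esil in functions:
        digest = hashlib.sha256((name + esil).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, esil))
    counts = Counter(name for name, _ in unique)
    return [(n, e) for n, e in unique if counts[n] > 1]

funcs = [
    ("f", "IMM,reg64,="),   # duplicate of the next entry
    ("f", "IMM,reg64,="),
    ("f", "MEM,reg64,="),   # same function, different compilation
    ("g", "DATA,reg32,="),  # only one variant survives: removed
]
print(deduplicate(funcs))
```
        </p>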
        <p>Figure: (a) Overview of the model architecture used: a tokenised and encoded ESIL string (4096) passes through a transformer trunk of 8 LongFormer blocks (intermediate dimension 2048), then a dense embedding head (dense layers of 768x384 and 384x128) to produce a 128-dimension output embedding. (b) Siamese training formulation: two FASER models generate embeddings whose cosine similarity gives the similarity score.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Model Design</title>
        <p>
          The chosen model used in FASER is a LongFormer. The LongFormer model was proposed by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to tackle
the quadratic computational scaling of self-attention used in models such as BERT. The LongFormer
instead uses a combination of local, sliding window attention, with a global attention mechanism.
This formulation instead scales linearly as the input size increases, providing a mechanism to train
transformers with larger input sequences. Furthermore, this combination of local and global attention
is viewed favorably for the binary function search task. An assembly instruction is not executed in
isolation but instead executed as part of a series of instructions. This local attention window provides
a means for a single instruction to include the context of the instructions before and after it but in a
manner which is bounded. The global attention can then look at the function holistically whilst being
informed by the local contexts provided by the sliding window attention.
        </p>
        <p>
          The model parameters used for FASER were an input dimension of 4096, followed by 8 LongFormer
blocks with an intermediate dimension of 2048, followed by two dense layers which map the 768
dimension transformer output to a 128 dimension embedding. The local attention window is set to 512
tokens. We utilize the implementation provided by transformers, and all other parameters are kept at
their defaults, which can be viewed in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
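        <p>The similarity calculation between two output embeddings is plain cosine similarity, which can be sketched as follows; toy 4-dimension vectors stand in for the 128-dimension embeddings.
```python
import math

def cosine_similarity(a, b):
    """Similarity score between two FASER output embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings; parallel vectors score 1 (up to float rounding)
query = [1.0, 0.0, 2.0, 0.0]
candidate = [2.0, 0.0, 4.0, 0.0]
print(cosine_similarity(query, candidate))
```
        </p>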
      </sec>
      <sec id="sec-2-7">
        <title>2.7. Training Configuration</title>
        <p>Previous works such as Trex [15], jTrans [20] and PalmTree [11] conduct a pre-training step prior
to then fine-tuning for the function search objective. Whilst this makes sense if you want to train
a model which can be used for various different downstream tasks, it is potentially suboptimal if the
only downstream use case is going to be function search. To this end, we forgo any pre-training steps
and train FASER directly for the function search objective using deep metric learning. We construct
a pair-based training methodology by using a Siamese formulation in combination with Circle Loss
[18]. Circle Loss was chosen due to its ability to place emphasis on large deviations in between-class
similarity in a manner not possible with other losses such as triplet loss. Both Cosine Embedding Loss
and Triplet Loss were experimented with and resulted in unstable training and in some cases, complete
model collapse.</p>
        <p>We also formulate a sampling strategy that ensures a fixed number of examples for a given label (function)
are present within a batch. We then apply the online batch-hard pair mining method [7] to dynamically
create both positive and negative pairs for each example from batched inputs throughout training. This
works by embedding all examples within a batch before using the associated labels for each example to
create the hardest positive and negative pairs for each example. What determines hardest is the output
of a distance function, which in our case was Cosine Similarity. The strength of this approach when
compared to previous research which uses static pre-computed pairs is that the weaknesses of the model
are consistently challenged. For example, if during training the model quickly learns to search across
ARM and MIPS but is performing badly when searching across X86-64, this training formulation would
automatically begin to target this weakness by generating pairs including X86-64 examples and using
them to calculate the loss.</p>
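        <p>The mining step can be sketched as follows; this is a simplified, unbatched illustration of batch-hard mining with toy 2-dimension embeddings, not the implementation from [7].
```python
import math

def cos(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def batch_hard_pairs(embeddings, labels):
    """For each example, pick the least-similar same-label example (hardest
    positive) and the most-similar different-label example (hardest negative)."""
    pairs = []
    for i, (e, lab) in enumerate(zip(embeddings, labels)):
        pos = [j for j, l in enumerate(labels) if l == lab and j != i]
        neg = [j for j, l in enumerate(labels) if l != lab]
        hardest_pos = min(pos, key=lambda j: cos(e, embeddings[j]))
        hardest_neg = max(neg, key=lambda j: cos(e, embeddings[j]))
        pairs.append((i, hardest_pos, hardest_neg))
    return pairs

embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["f1", "f1", "f2", "f2"]
print(batch_hard_pairs(embs, labels))
```
        </p>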
        <p>For training, we use the whole of Dataset-1 and sample 100K functions per epoch for 18 epochs
(approximately 3 days of training). We set the per-label example count to 2 to ensure each batch has 2 of each
sampled function, the batch size is set to 8, and we use gradient accumulation to artificially set the batch size to 512. The
Adam [9] optimizer is used with a fixed learning rate of 0.0005.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.8. Comparison Approaches</title>
        <p>For the first task, we draw upon the top performing approaches reported within Marcelli (2022)[13]
which are the Graph Matching Networks (GMN) and Graph Neural Network (GNN) approach from
Li et al (2019)[12] and Gemini [22]. All of these are Graph Neural Network (GNN) approaches which
take advantage of the structural aspects of functions, typically through using the control flow graph
(CFG) with node level feature vectors as an input. Approaches using natural language processing
and transformer model architectures such as PalmTree [11] and jTrans [20] would have been ideal
candidates but are mono-architecture and were therefore deemed unsuitable for comparison.</p>
        <p>For the vulnerability search task, we use the same three approaches outlined above but also compare
against Trex [15]. Trex provides an interesting comparison because it too uses a transformer architecture
but has one significant difference. The model is pre-trained on what the authors describe as micro-traces.
These micro-traces are generated in a dynamic manner using an emulator. Once trained, the model is
capable of being used with solely static data and forgoes the emulation aspect. The emulator used to
generate the micro-traces provided by the authors does not support the full breadth of architectures and
bitness of Dataset-1; a comparison there would be unfair, and Trex is therefore not used in the first task.</p>
      </sec>
      <sec id="sec-2-9">
        <title>2.9. Evaluation Configuration</title>
        <p>For task 1, we again use Dataset-1 and implement a sampling approach which dynamically creates
search pools which for a given function, contain 1 positive example and 100 negative examples. This
formulation is the same as [13]. We also adopt the same methodology as [13] for the Vulnerability
search task whereby we have a query function of a given architecture and search across all possible
functions in the firmware’s libcrypto library. This means that the search pool size for task 2 is 10 times
bigger at approximately 1000 functions.</p>
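        <p>The search pool construction can be sketched as follows; the corpus field names are hypothetical and the sampling details are an assumption based on the description in [13].
```python
import random

def build_search_pool(query, corpus, pool_size=100, seed=0):
    """Build a search pool for one query function: 1 positive example
    (same source function, different compilation) plus pool_size negatives."""
    rng = random.Random(seed)
    positives = [f for f in corpus if f["name"] == query["name"] and f is not query]
    negatives = [f for f in corpus if f["name"] != query["name"]]
    pool = [rng.choice(positives)] + rng.sample(negatives, min(pool_size, len(negatives)))
    rng.shuffle(pool)
    return pool

# Hypothetical corpus: one function compiled twice, five distractors
corpus = [{"name": "f0", "arch": a} for a in ("arm32", "mips32")]
corpus += [{"name": f"f{i}", "arch": "x86"} for i in range(1, 6)]
query = corpus[0]
pool = build_search_pool(query, corpus, pool_size=3)
print(len(pool), sum(1 for f in pool if f["name"] == "f0"))  # 4 1
```
        </p>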
      </sec>
      <sec id="sec-2-10">
        <title>2.10. Metrics</title>
        <p>We re-use the metrics used in previous studies, Recall@1 and MRR@10, for task 1 in order to present
a reliable comparison. For the vulnerability search task, similar to other studies, we report the rank at
which the vulnerable function appeared after the search was conducted. Alongside this, we also
calculate the mean and median ranks across all architecture searches, primarily to aid result
analysis.</p>
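        <p>Given a list of ranks at which each query's true match appeared, the two metrics can be computed as in this short sketch.
```python
def recall_at_1(ranks):
    """Fraction of queries whose true match is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

def mrr_at_10(ranks):
    """Mean reciprocal rank, counting only matches ranked in the top 10."""
    return sum(1.0 / r for r in ranks if r in range(1, 11)) / len(ranks)

# Hypothetical ranks of the true match for four queries
ranks = [1, 2, 1, 12]
print(recall_at_1(ranks), mrr_at_10(ranks))  # 0.5 0.625
```
        </p>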
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>Our evaluation aims to answer the following research questions:
1. RQ1 - How does FASER perform when compared against other baseline approaches for the binary
function search task?
2. RQ2 - How effective is FASER at searching real firmware images for known vulnerabilities?
3. RQ3 - Does using intermediate representations as the input data enable the model to zero-shot
architectures not previously seen as part of the training data?</p>
      <sec id="sec-3-1">
        <title>3.1. RQ1 - Binary Function Similarity Search</title>
        <p>The results of the experimentation to gather data for RQ1 can be seen in Table 1. Both of the FASER
models trained outperform all the baseline approaches across both of the chosen metrics. Looking first
at Recall@1, the model without register normalized training data (denoted as FASER NRM) performs
significantly better than the register normalized model, achieving a Recall@1 increase of 13% when
compared against the best performing baseline, GMN. FASER RN performs comparably to the GMN
model without needing direct comparison between all possible function combinations within a given
search pool.</p>
        <p>Moving onto the MRR@10 results, FASER NRM again performs significantly better than all
baselines with a 7.5% increase. FASER RN is again comparable to the GMN approach with an identical
MRR@10.</p>
        <p>RQ1 Summary: The results presented above show that our proposed approach, FASER, performs as
well as, if not better than, the best baseline approach, with FASER NRM performing better across all
metrics.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. RQ2 - Binary Function Vulnerability Search</title>
        <p>The results of the experimentation to gather data for RQ2 can be seen in Table 2. The results show the
ranks of search results when searching the Netgear R7000 router which is ARM32.</p>
        <p>In addition to the three baseline approaches used in the previous task, our proposed approach was
compared against Trex, a comparable transformer-based approach which has a more complicated and
computationally expensive training process.</p>
        <p>Table 1: MRR@10 results for the binary function search task. FASER NRM (ESIL Function String): 0.57; FASER RN (ESIL Function String): 0.53; GMN [12] (CFG + BoW opc 200): 0.53; GNN [12] (CFG + BoW opc 200): 0.52; GNN (s2v) [22] (CFG + BoW opc 200): 0.36.</p>
        <p>The results presented in Table 2 show that our proposed
approach performs well across all the architectures. Interestingly, whilst the non-register normalized model
performed the strongest in the Binary Function Similarity Search task, in this task the register normalized
model performs significantly better. This is shown by both mean and median rank descriptive statistics
being lower. The best performing FASER model is highly comparable to the GMN method but again
without the aforementioned limitations. Comparing specifically to Trex, the register normalized model
performs consistently better across all the architectures. This suggests that our training methodology
of training for function similarity directly, forgoing the elaborate pre-training steps usually
adopted, and using IRs as our data input has merit.</p>
        <p>RQ2 Summary: The results presented above show that FASER RN performs well when searching
real firmware images for known vulnerabilities.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. RQ3 - Zero Shot Architecture Binary Function Search</title>
        <p>Table 3 shows the results from the experimentation undertaken to answer RQ3. The question posed here
is whether the FASER models, because we are using an IR as the input data, can perform zero-shot vulnerability
search for a new architecture by transferring previously learnt knowledge. Fundamentally, the answer to this
is no. The vulnerability search performance for a new instruction set architecture (in this case RISC-V)
is significantly worse. This is clearly demonstrated by the mean and median rank descriptive statistics.</p>
        <p>Nevertheless, these results do demonstrate something interesting. Across both FASER models, the
performance is significantly better when searching MIPS functions using a RISC-V query as opposed to
searching ARM functions using a RISC-V query. This suggests that the semantic representation created
when MIPS and RISC-V instructions are lifted to ESIL may be more similar than ARM and RISC-V.
Given that recent research has suggested that ARM and X86/X86-64 instructions are closer in statistical
similarity than when compared to MIPS [8], these results may suggest that introducing RISC-V binaries
into a training dataset may level out any data imbalances.</p>
        <p>Table 3: Vulnerability search ranks for RISC-V queries against the firmware images. FASER NRM: 48; 30; 22; 546; 170; 251; 14; 155. FASER RN: 673; 33; 4; 292; 76; 136; 15; 147.</p>
        <p>RQ3 Summary: The results presented above show that the use of an IR input representation does not
provide a means of conducting zero-shot search across unseen architectures.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Conclusion</title>
      <p>The results presented above demonstrate that the combination of the ESIL IR and the Longformer
transformer architecture performs well compared to the baseline approaches with minimal requirement for
either manual feature engineering or dynamic analysis. The FASER RN model performs particularly
well at the vulnerability search task across all architectures tested and performs comparably to GMN,
without requiring direct comparison between all possible combinations within a given search pool.
Whilst demonstrating the effectiveness of IRs and longer context transformers, these results also add
weight to our argument that the pre-training step seen within previous work may be unwarranted for
binary function search, and that training for the binary function similarity objective directly may be more
optimal.</p>
      <p>The results presented to answer RQ3 also suggest something interesting. While the search rank
results were significantly worse in terms of mean and median rank, suggesting that our proposed
approach is unable to reliably transfer to unseen architectures, there is a large difference between
the reported ranks for RISC-V → ARM when compared to RISC-V → MIPS ranks. Prior research [8]
observed similar phenomena whereby x86 → ARM functions were statistically more similar than x86
→ MIPS functions. This suggests that RISC-V functions may be more similar to MIPS functions in terms
of semantics when represented in ESIL than ARM. Given that most datasets used in prior research
only include X86, ARM and MIPS, this similarity could potentially be leveraged and experimented with
further. An example of this experimentation could be to explore whether including RISC-V functions
within a cross-architecture binary function search dataset balances out the x86 → ARM similarities
by providing function examples that are similar to, but different from, the MIPS architecture.</p>
      <p>Turning now to implications of this research. Firstly, this work demonstrates that IRs derived from
binaries can be used to train models for binary function search and perform well. Secondly, the use
of longer input sequences also works well. The performance results, especially related to the use of
the LongFormer architecture suggest that changing the type of transformer architecture used and
increasing the input dimension for approaches such as jTrans [20], Trex [15] or PalmTree [11] may
increase their overall performance. The results also demonstrate that if the only downstream target
task is binary function search, it may be worth amending the standard training methodology, which
involves a pre-training step, and instead train for the objective directly.</p>
      <p>
In future work, there are several avenues that could be explored. There are a number of different
IRs that could be incorporated into similar approaches such as VEX [17], LLVM [10] and PCode [1].
There is also an emerging sub-field of binary function search focused on adding heuristic pre- and
post-filtering steps to increase performance by reducing the number of functions searched such as those
described in Asteria-Pro [
        <xref ref-type="bibr" rid="ref8">23</xref>
        ] and BinUSE [21]. And finally, this approach could be enhanced through
the integration of supporting models such as those that use decompiled source code, recovered type
information or structural aspects at a control flow graph or call graph level.
      </p>
      <p>[8] Dongkwan Kim et al. “Revisiting binary code similarity analysis using interpretable feature
engineering and lessons learned”. In: IEEE Transactions on Software Engineering (2022).
[9] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: CoRR
abs/1412.6980 (2014). url: https://api.semanticscholar.org/CorpusID:6628106.
[10] Chris Lattner and Vikram S. Adve. “LLVM: a compilation framework for lifelong program analysis
&amp; transformation”. In: International Symposium on Code Generation and Optimization (CGO
2004). 2004, pp. 75–86. url: https://api.semanticscholar.org/CorpusID:978769.
[11] Xuezixiang Li, Yu Qu, and Heng Yin. “PalmTree: learning an assembly language model for
instruction embedding”. In: Proceedings of the 2021 ACM SIGSAC Conference on Computer and
Communications Security. 2021, pp. 3236–3251.
[12] Yujia Li et al. “Graph matching networks for learning the similarity of graph structured objects”.
In: International Conference on Machine Learning. PMLR. 2019, pp. 3835–3845.
[13] Andrea Marcelli et al. “How Machine Learning Is Solving the Binary Function Similarity Problem”.
In: 31st USENIX Security Symposium (USENIX Security 22). 2022, pp. 2099–2116. isbn: 978-1-939133-31-1.
url: https://www.usenix.org/conference/usenixsecurity22/presentation/marcelli.
[14] Luca Massarelli et al. “SAFE: Self-attentive function embeddings for binary similarity”. In:
International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer.
2019, pp. 309–329.
[15] Kexin Pei et al. “Trex: Learning execution semantics from micro-traces for binary similarity”. In:
arXiv preprint arXiv:2012.08680 (2020).
[16] Jannik Pewny et al. “Cross-architecture bug search in binary executables”. In: 2015 IEEE
Symposium on Security and Privacy. IEEE. 2015, pp. 709–724.
[17] Yan Shoshitaishvili et al. “Firmalice - Automatic Detection of Authentication Bypass
Vulnerabilities in Binary Firmware”. In: Network and Distributed System Security Symposium. 2015. url:
https://api.semanticscholar.org/CorpusID:17298209.
[18] Yifan Sun et al. “Circle loss: A unified perspective of pair similarity optimization”. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 6398–6407.
[19] Radare2 Team. Radare2 GitHub repository. https://github.com/radare/radare2. 2017.
[20] Hao Wang et al. “jTrans: Jump-Aware Transformer for Binary Code Similarity”. In: Proceedings
of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 2022.
[21] Huaijin Wang et al. “Enhancing DNN-Based Binary Code Function Search With Low-Cost
Equivalence Checking”. In: IEEE Transactions on Software Engineering 49.1 (2022), pp. 226–250.
[22] Xiaojun Xu et al. “Neural network-based graph embedding for cross-platform binary code
similarity detection”. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security. 2017, pp. 363–376.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] National Security Agency. ghidra. https://github.com/NationalSecurityAgency/ghidra/tree/master.
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Iz</given-names>
            <surname>Beltagy</surname>
          </string-name>
          , Matthew E Peters, and Arman Cohan. “
          <article-title>Longformer: The long-document transformer”</article-title>
          . In: arXiv preprint arXiv:2004.05150 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Josh</given-names>
            <surname>Collyer</surname>
          </string-name>
          . bin2ml. https://github.com/br0kej/bin2ml/.
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          et al. “
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”</article-title>
          . In:
          <article-title>North American Chapter of the Association for Computational Linguistics</article-title>
          .
          <year>2019</year>
          . url: https://api.semanticscholar.org/CorpusID:52967399.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Steven H. H.</given-names>
            <surname>Ding</surname>
          </string-name>
          , Benjamin C. M. Fung, and Philippe Charland. “
          <article-title>Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization”</article-title>
          .
          <source>In: 2019 IEEE Symposium on Security and Privacy (SP)</source>
          .
          <source>IEEE</source>
          .
          <year>2019</year>
          , pp.
          <fpage>472</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Yi</given-names>
            <surname>Gui</surname>
          </string-name>
          et al. “
          <article-title>Cross-Language Binary-Source Code Matching with Intermediate Representations”</article-title>
          .
          <source>In: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)</source>
          .
          <year>2022</year>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>612</lpage>
          . doi: 10.1109/SANER53432.2022.00077.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Hermans</surname>
          </string-name>
          , Lucas Beyer, and Bastian Leibe.
          <article-title>“In defense of the triplet loss for person re-identification”</article-title>
          .
          <source>In: arXiv preprint arXiv:1703.07737</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Shouguo</given-names>
            <surname>Yang</surname>
          </string-name>
          et al. “
          <article-title>Asteria-Pro: Enhancing Deep-Learning Based Binary Code Similarity Detection by Incorporating Domain Knowledge”</article-title>
          .
          <source>In: ACM Transactions on Software Engineering and Methodology</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Fei</given-names>
            <surname>Zuo</surname>
          </string-name>
          et al. “
          <article-title>Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs”</article-title>
          .
          <source>In: 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019</source>
          .
          <source>The Internet Society</source>
          ,
          <year>2019</year>
          . url: https://www.ndss-symposium.org/ndss-paper/neural-machine-translation-inspired-binary-code-similarity-comparison-beyond-function-pairs/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>