<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>TSE.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/TSE.2022.3187811</article-id>
      <title-group>
        <article-title>Exposing the Cracks: A Case Study on the Quality of Public Linux Malware Data Sets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Sanna</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Canavese</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Regano</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Maiorca</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Giacinto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consorzio Interuniversitario Nazionale per l'Informatica</institution>
          ,
          <addr-line>CINI</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical and Electronic Engineering (DIEE), Università degli Studi di Cagliari</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institut de Recherche en Informatique de Toulouse</institution>
          ,
          <addr-line>IRIT</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>3187811</volume>
      <fpage>03</fpage>
      <lpage>08</lpage>
      <abstract>
        <p>Machine learning (ML) is extensively used for malware detection due to its accuracy, scalability, and adaptability. However, the effectiveness of ML models heavily depends on the quality of the datasets used for training and testing. This study evaluates popular public datasets of malicious ARM ELF binaries from MalwareBazaar and VirusShare, complemented with benign binaries from Debian repositories. Using mnemonic frequency analysis, we found that these datasets lack the diversity found in Android or Windows. Using only the frequency of a single assembly mnemonic, we can distinguish the malware from the goodware with a balanced accuracy of 78%, and using three mnemonics, we achieved a balanced accuracy of 99%. We finally derive conclusions on the current state of publicly available Linux malware.</p>
      </abstract>
      <kwd-group>
        <kwd>Malware Analysis</kwd>
        <kwd>Data Set Quality</kwd>
        <kwd>Binary Analysis</kwd>
        <kwd>Linux Malware</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Assembly Mnemonics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine Learning (ML) is frequently used for malware detection [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] because it can achieve high
accuracy, scale to handle large volumes of data, continuously improve its precision, and potentially
adapt to new threats. Many flavours of ML models and techniques have been applied, from simple
logistic regression to advanced neural networks.
      </p>
      <p>Regardless of the ML model and methodology chosen to identify the malware (and the goodware),
however, the performance of a machine learning system is strongly correlated with the quality of the
data set used in the training phase. Without a good data set for training and testing an ML model, the
obtained results will be biased. Data set quality refers to various properties, such as being error-free
and having enough observations with enough diversity to represent the real world.</p>
      <p>We decided to analyze some popular public data sets of malicious ARM ELF (Linux) binaries offered
by MalwareBazaar and VirusShare. These websites offer freely downloadable labelled binaries that can
be further analysed and provide an open way to train a batch of ML malware classifiers without a costly
subscription. The main purpose of these data sets is to publish malicious binaries, so we complemented
them with some realistic benign binaries from the Debian repositories.</p>
      <p>
        A common approach to analyzing these malicious binaries is disassembling them and performing
some assembly analysis. In our study, we decided to opt for one of the simplest approaches: counting
the number of instructions, split by their mnemonic (e.g., add, bx). Similar approaches have already been
proposed in the literature, such as analyzing mnemonics and opcodes with neural networks, support
vector machines, decision trees, and a variety of other machine learning models [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>To determine the diversity of these data sets, we analyzed the collected mnemonic frequencies and
tested our findings by training a batch of classifiers. Although our results are described in detail in
Section 5, we anticipate that we found a lack of diversity in these public data sets.</p>
      <p>This may be due to one of two reasons:
        <list list-type="order">
          <list-item><p>The datasets do not represent the real-world scenario and overrepresent a specific kind of malware.</p></list-item>
          <list-item><p>The current state of Linux malware is such that the lack of diversity is rooted in the real world (i.e., malicious actors focus on certain malware strains).</p></list-item>
        </list>
      </p>
      <p>To increase the reproducibility of our experiments and the transparency of our approach, we have
published all the source code and the data sets online<sup>1</sup>.</p>
      <p>This paper is structured as follows. Section 2 discusses related works. Section 3 introduces
our approach and describes which binaries we picked and how we analyzed them. Sections 4 and 5
contain our findings on the data set and its features, respectively; in the latter, we also show the
classification metrics of some very simple yet highly accurate machine-learning models trained on this
data. Finally, Section 6 closes this paper by presenting our conclusions and plans for the future.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Opcodes and mnemonics are well-explored features in the state of the art. Although the Linux landscape
has received less attention than the leading mobile and desktop platforms, i.e., Android and Windows,
respectively, a fair amount of work on malware targeting Linux can be found in the literature.
Specifically, Ngo et al. [<xref ref-type="bibr" rid="ref5">5</xref>] compiled a survey of other works tackling
the malware analysis problem on Linux. We take this survey as a starting point to compare our approach with
opcode-based approaches in the state of the art. To make a fair comparison, we compare only with
opcode-based approaches and forgo more complex ones, such as graph-based features, because we
focus not on feature engineering but on data set quality. That is, if data set deficiencies can be shown
with an encoding as simple as opcode counts, we claim that this property will carry over regardless of the
encoding, as it is rooted in the data. We report the sizes of the data sets employed in previous work in
Table 1.
      </p>
      <p>
        Cozzi et al. [<xref ref-type="bibr" rid="ref6">6</xref>] published a seminal work on Linux malware, conducting a large-scale empirical
study to assess the complexity of existing Linux malware and how state-of-the-art malware analysis
techniques and tools adapt to its characteristics, and providing an overview of the
ad hoc evasion techniques employed by such malware. They studied a dataset comprising 10548 malicious
samples targeting various architectures, including X86-64, MIPS, and PowerPC. They concluded that, at
the time of writing (2018), Linux malware complexity was still not on par with its Windows counterpart.
Still, malware authors were starting to evolve simple malware by adapting evasion techniques employed
by Windows malware.
      </p>
      <p>
        More recently, Carrillo-Mondéjar et al. [<xref ref-type="bibr" rid="ref7">7</xref>] performed a study to characterize Linux malware
specifically tailored to target IoT devices. Their approach combines static metrics (Cyclomatic Complexity,
number of basic blocks), dynamic features (including unique syscalls and number of created processes),
and an n-gram representation of mnemonics. They employ such features to train various classifiers
(Random Forest, k-nearest neighbours, Decision Tree, Support Vector Machines, and linear kernels) on a dataset of
more than 10000 malicious samples to identify the family to which the analyzed malware belongs.
Notably, unlike our study, they do not classify malware against benign software.
      </p>
      <p>
        Phu et al. [<xref ref-type="bibr" rid="ref8">8</xref>] propose CFDVex, a feature selection method for cross-architecture malware detection.
Utilizing the Vex intermediate representation and the CFD method developed in their previous work [<xref ref-type="bibr" rid="ref9">9</xref>],
they extract opcode-based features from a dataset of Intel 80386 IoT malware. Notably, they found that
their method could precisely detect MIPS malware (95.72% accuracy; 2.81% FPR), although the system
was never exposed to it. Unlike our work, however, they consider a benign population mainly focused
on router devices, which reduces the variety of behaviors considered.
      </p>
      <p>
        Azmoodeh et al. [<xref ref-type="bibr" rid="ref10">10</xref>] consider an opcode-based representation that relies on a compressed form of a
Control Flow Graph in matrix form, from which they extract the eigenvalues and eigenvectors to select
features for a classifier. They find that this approach obtains good accuracy, although the performance
degrades with a growing percentage of inserted junk code. Notably, they consider a limited dataset
composed of fewer than 1500 samples, only 128 of which compose the malware class.
      </p>
      <sec id="sec-2-1">
        <title>1 https://github.com/alessandro-sanna/Linux_Mnemonic_Analysis</title>
        <p>
          Furthermore, Dovom et al. [<xref ref-type="bibr" rid="ref11">11</xref>] expand on the latter work by considering the potential of Fuzzy Pattern
Trees, defined as a "tree-like structure in which the inner nodes are fuzzy logic arithmetic operators and
the leaf nodes are associated with fuzzy predicates on input attribute," for malware detection. They
extract the opcodes from their sample set and select the most beneficial features using a Class-wise
Information Gain criterion [<xref ref-type="bibr" rid="ref12">12</xref>]. Finally, they adopt the Eigenspace approach of
Azmoodeh et al. and consider the eigenvectors of the filtered Control Flow Graph. This work considers
the largest dataset in the explored state of the art. However, they rely mainly on Vx-Heaven,
an unmaintained source that is no longer available. Furthermore, their sources primarily comprise Windows
malware, which behaves differently from Linux malware. For example, the behaviour variability in the
Windows landscape is far greater, whereas most Linux malware is composed of botnet strains.
        </p>
        <p>
          Darabian et al. [<xref ref-type="bibr" rid="ref4">4</xref>] trained various machine learning models (e.g., k-nearest neighbours, support
vector machines, neural networks). However, instead of using single mnemonic frequencies as inputs (as
in our work), they computed the frequency of the most common sequences of opcodes. They computed
these sequences using MG-FSM (Mind the Gap: Frequent Sequence Mining), an efficient algorithm
to compute the most common sub-sequences in a data set. They obtained high accuracy for all their
models, ranging from 97% to 99%. Although their models were tested on a different data set, it is interesting that our
approach can obtain similar results with much simpler models (see Section 5).
        </p>
        <p>
          Haddadpajouh et al. [<xref ref-type="bibr" rid="ref3">3</xref>] tested the performance of recurrent neural networks on a relatively small
data set of ARM IoT binaries consisting of 281 malicious applications and 270 benign ones.
They disassembled the ARM binaries, extracted the list of mnemonics, and removed all the consecutive
occurrences of the same mnemonic to reduce the length of these sequences. They then performed a word
embedding to transform each mnemonic into a vector of numbers and fed a list of embeddings (i.e.,
an encoded list of mnemonics) to several recurrent neural networks. The authors tested unidirectional
and bidirectional LSTM (Long Short-Term Memory) neural networks. They experimentally found that
the best results were obtained with a two-layer LSTM with only 192 neurons, achieving an accuracy of about
98%.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Our Approach</title>
      <p>Our approach considers an intuitive feature necessarily present in all valid ELF samples: assembly
mnemonics. These are defined as strings that "tell the assembler which machine instructions to assemble."<sup>4</sup> We
can consider them a string equivalent of the byte representation of "opcodes." We use a Python 3
environment to interact with Ghidra 11<sup>5</sup>. We then execute a custom Ghidra script on each sample
to extract the disassembled code of each function. From this, we extract the first word of each
line of code: this is the mnemonic of the assembly instruction on that line.</p>
      <p>[Figure: an example of disassembled ARM code and the mnemonics extracted from it.]</p>
      <preformat>mov fp, 0
mov lr, 0
pop {r1}
mov r2, sp
str r2, [sp, -4]!
str r0, [sp, -4]!
ldr ip, sym._fini
str ip, [sp, -4]!
ldr r0, main
ldr r3, sym._init
b sym.__uClibc_main</preformat>
      <p>Extracted mnemonics: mov, pop, str, ldr, b.</p>
      <p><sup>2</sup> This work is focused on family classification, so the benign class is absent.</p>
      <p><sup>3</sup> The distribution of benign and malicious samples is unclear, as the authors do not clarify this distinction and their main
source, Vx-Heaven, is no longer available.</p>
      <p><sup>4</sup> https://onlinedocs.microchip.com/oxy/GUID-91118B1F-52AE-4822-A75E-B65FA4C48896-en-US-1/GUID-6E3B15DB-5560-44A3-93F1-91CA7F254F0A.html</p>
      <p>Having counted the occurrences of each mnemonic in each sample, we then construct a single sparse matrix
containing a column for each mnemonic present in any of the considered samples, be they malicious
or benign. Each row then holds the counts of every mnemonic for the n-th sample. A
visualisation of this matrix, constituting our final feature set, can be seen in Table 2.</p>
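      <p>The counting step and the construction of the feature matrix can be sketched as follows. This is a minimal illustration and not the Ghidra script used in this study: it assumes the disassembly is already available as one instruction per line, the names <monospace>count_mnemonics</monospace> and <monospace>build_count_matrix</monospace> are hypothetical, and it builds a dense list of lists rather than a sparse matrix to stay within the Python standard library.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def count_mnemonics(disassembly_lines):
    """Count mnemonics by taking the first word of each non-empty line."""
    counts = Counter()
    for line in disassembly_lines:
        words = line.split()
        if words:
            counts[words[0]] += 1
    return counts


def build_count_matrix(per_sample_counts):
    """Return (columns, rows) with one column per mnemonic seen in any sample."""
    columns = sorted(set().union(*per_sample_counts))
    rows = [[counts.get(m, 0) for m in columns] for counts in per_sample_counts]
    return columns, rows


# Hypothetical inputs: two tiny "samples" given as lists of disassembly lines.
sample_a = ["mov fp, 0", "mov lr, 0", "pop {r1}", "str r2, [sp, -4]!", "b main"]
sample_b = ["add r0, r0, r1", "bx lr"]

columns, rows = build_count_matrix(
    [count_mnemonics(sample_a), count_mnemonics(sample_b)]
)
print(columns)
for row in rows:
    print(row)
```
]]></preformat>
      <p>Mnemonics that never occur in a sample receive a zero in that sample's row, which is why most entries of the full matrix are expected to be zero and a sparse representation is a natural choice.</p>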
    </sec>
    <sec id="sec-4">
      <title>4. Data Set Analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Malware Section</title>
        <p>Our dataset of choice has been obtained by integrating ELF files available in both MalwareBazaar<sup>7</sup> and
VirusShare<sup>8</sup>.</p>
        <p>All of our samples were detected by at least one of the 75 antivirus solutions of VirusTotal<sup>9</sup>.</p>
        <p>
          We collect informative JSON reports for each sample using the VirusTotal APIs. For the scope of this study,
we use Linux’s readelf to obtain the architecture information and further restrict the dataset to
32-bit ARM samples. This leads to a total of 30304 malicious samples. The dataset
composition is shown in Table 3.
        </p>
        <p><sup>5</sup> https://ghidra-sre.org</p>
        <p><sup>6</sup> We report the MD5 hash instead of the SHA-256 one for visualisation purposes.</p>
        <p><sup>7</sup> https://bazaar.abuse.ch</p>
        <p><sup>8</sup> https://virusshare.com</p>
        <p><sup>9</sup> https://www.virustotal.com</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Temporal Distribution</title>
          <p>
            To account for temporal shift in our dataset, we collect the year of submission of each sample.
The set contains samples spanning 12 years, from 2010 to 2022. This wide range allows us to better
evaluate the evolution of malicious code over time, and it is needed to conduct a realistic analysis of the
system’s performance [<xref ref-type="bibr" rid="ref13">13</xref>].
          </p>
          <p>Figure 2 illustrates the temporal distribution of the malicious set on a yearly basis.</p>
          <p>[Figure 2: number of malicious samples per year of submission, 2010 to 2022, on a logarithmic scale. Bar labels, left to right: 6, 26, 25, 61, 163, 154, 476, 1236, 10683, 3883, 4936, 7859, 796.]</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Family Distribution</title>
          <p>
            We use AVClass2 [<xref ref-type="bibr" rid="ref14">14</xref>] to compute the malware family of each sample and provide more detail on the
malware we consider. Table 4 shows the top 10 families in our set. For brevity, we do not report information
about all 91 families in our set. Briefly, the greatest portion of our set is made of botnets. The
most notable, Mirai [<xref ref-type="bibr" rid="ref15">15</xref>] and Gafgyt [<xref ref-type="bibr" rid="ref16">16</xref>], are two related botnets used in Distributed Denial of Service
attacks around the world. Predictably, the overwhelming majority of samples is related to these families. This
reflects recent trends [<xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>] that highlight how Mirai is one of the most widespread malware families in the
ever-growing IoT landscape.
          </p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Goodware Section</title>
        <p>We generate our goodware dataset as follows: we select all available indexed packages from Debian 12
(Bookworm)<sup>10</sup>, from both the armhf and the armel distributions. Then, we decompress the obtained
DEB packages via dpkg. Finally, we check the magic number of the obtained files and keep only
the ELF files (magic number: 7f 45 4c 46).</p>
        <p><sup>10</sup> https://packages.debian.org/bookworm/allpackages</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Temporal Distribution</title>
          <p>Our set contains samples ranging from 2010 to 2023. To make a fair comparison with the
malicious set, we restrict this set to goodware dated up to and including 2022. This results in a set of 65112
benign samples. Figure 3 illustrates the temporal distribution of the benign set on a yearly basis.</p>
          <p>[Figure 3: number of benign samples per year, on a logarithmic scale.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <p>In this section, we report our findings about the data set we built and some metrics on various
machine-learning models we trained.</p>
      <sec id="sec-5-1">
        <title>5.1. Mnemonic analysis</title>
        <p>The data set presented in this paper (see Section 4) contains various benign and malicious samples from
widely used public repositories. When we started to analyse it, we noticed that the malware samples
presented substantial differences when compared with the benign ones. Table 5 shows some statistics
relative to the 10 features (i.e., assembly mnemonics) with the highest separability index and the 10
with the lowest.</p>
        <p>In this context, we define the separability index of a feature as a value stating how easy it is to split
the observations into benign and malicious samples by performing a univariate analysis of that feature. In our
case, the separability index is a percentage, and the higher the value, the higher the ability of a certain
mnemonic to tell the malware and the goodware apart. To compute this value, we used k-means (with
k = 2) to split the samples into two clusters using a single feature, and then we computed the accuracy
of such unsupervised classification against the ground truth.</p>
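        <p>A minimal sketch of this computation for a single feature is given below. It is not the code used in this study: the one-dimensional Lloyd iteration, its min/max initialisation, and the choice of scoring the better of the two possible cluster-to-class mappings are assumptions made for illustration, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def two_means_1d(values, iterations=100):
    """Lloyd's algorithm with k = 2 on one feature; returns a cluster id per value."""
    # Assumption: start from the smallest and largest observed values.
    centers = [float(min(values)), float(max(values))]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Update step: each centre moves to the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def separability_index(values, is_malware):
    """Accuracy, in percent, of the 2-means clustering against the ground truth."""
    labels = two_means_1d(values)
    agree = sum(1 for lab, y in zip(labels, is_malware) if lab == int(y))
    # Cluster ids are arbitrary, so keep the better of the two mappings (assumption).
    return 100.0 * max(agree, len(values) - agree) / len(values)


# Hypothetical counts of one mnemonic in four benign and four malicious samples.
counts = [0, 0, 1, 0, 9, 12, 10, 0]
truth = [False, False, False, False, True, True, True, True]
print(separability_index(counts, truth))

# A feature with the same value everywhere cannot separate the classes.
print(separability_index([3, 3, 3, 3], [False, False, True, True]))
```
]]></preformat>
        <p>With the best-of-two mapping and balanced classes, the index cannot fall below 50%, which matches the intuition that a feature carrying no information behaves like a coin toss.</p>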
        <p>In addition to the separability indexes, Table 5 also shows the average value of the various mnemonic
counts and their standard deviations.</p>
        <p>From Table 5a we can see a peculiar trend. All these instructions tend to be used only in the malicious
binaries (i.e., the means are not zero), while the goodware eschews them (i.e., the means are close to
zero). By only looking at the presence of these instructions, we can immediately have a good indicator
of a suspicious binary.</p>
        <p>On the other hand, Table 5b shows that the mnemonics with the lowest separability index have
similar means (close to zero), making them poorly suited as malware indicators. These behaviors are
consistent with our definition of the separability index.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Classification results</title>
        <p>
          We aggregate the 65112 benign and the 30304 malicious feature vectors obtained via the featurization process,
obtaining a feature matrix of 95416 rows. We then pair each sample with its timing information at
day-level granularity and sort the matrix rows by date. Finally, we split the matrix, using the oldest 75% of
the data set to train various machine learning models and the remaining, most recent 25% for testing.
We purposefully avoid random-split cross-validation so as not to introduce temporal biases in
our evaluation [<xref ref-type="bibr" rid="ref27">27</xref>].
        </p>
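        <p>The date-ordered split described above can be sketched as follows. The record layout (day, feature vector, label), the function name <monospace>temporal_split</monospace>, and the toy data are assumptions made for illustration; how samples sharing the same day are ordered around the cut is not specified in the text and is left to the stable sort here.</p>
        <preformat><![CDATA[
```python
from datetime import date


def temporal_split(samples, train_fraction=0.75):
    """Sort records by day; the oldest part trains, the most recent part tests."""
    ordered = sorted(samples, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records: (submission day, mnemonic counts, label), 1 = malware.
samples = [
    (date(2021, 5, 1), [0, 2], 1),
    (date(2012, 3, 9), [4, 0], 0),
    (date(2018, 7, 21), [1, 7], 1),
    (date(2015, 1, 2), [3, 0], 0),
    (date(2022, 2, 14), [0, 5], 1),
    (date(2019, 11, 30), [2, 1], 0),
    (date(2016, 6, 6), [0, 9], 1),
    (date(2020, 8, 18), [5, 0], 0),
]

train, test = temporal_split(samples)
print(len(train), len(test))
print(max(r[0] for r in train), min(r[0] for r in test))
```
]]></preformat>
        <p>Because every training record is at least as old as every test record, the models never see samples from the period on which they are evaluated.</p>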
        <p>Figures 4 and 5 show the balanced accuracy we obtained by testing various machine learning models.
In particular, we decided to train three categories of models: logistic regression models, Support Vector
Machines (SVMs) with a linear kernel, and some decision trees. We deliberately picked these three
model categories since they are among the simplest ones in order to highlight the clear separation
between the goodware and malware classes. We then implemented our experiments using the popular
scikit-learn Python package. We used the default hyperparameters for every model, thus avoiding
any hyperparameter optimization. The full source code can be found in our repository.</p>
        <p>Figure 4 shows the results of these model categories when trained with the top features
(mnemonic counts), ranked by highest separability index. From the plots, we can see that by using only
the mlaeq counts, all the models have a balanced accuracy of about 78%. Using only the first three
features (mlaeq, cmpcc, and swi), they reach a score of about 98%.</p>
        <p>We then repeated a similar experiment by picking the features with the lowest separability indexes.
Figure 5 shows our findings. The bottom 11 mnemonics (see also Table 5b) are inconclusive, and all
the models show a balanced accuracy of 50%. This is the worst-case scenario since the classifiers are
unsure if a binary is malware or goodware. However, if we add more features, the accuracy increases.
By picking the bottom 20 features, all the models become very precise. In particular, they achieve
a balanced accuracy of about 90%, 93%, and 97%, respectively, for the logistic regression, SVM, and
decision tree.</p>
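        <p>To make the 50% figure concrete, the sketch below computes the balanced accuracy as the mean of the per-class recalls, which is the usual definition of this metric. The function and the toy labels are illustrative assumptions, not the evaluation code of this study, and the function assumes that both classes occur in the ground truth.</p>
        <preformat><![CDATA[
```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls for binary labels (1 = malware, 0 = goodware)."""
    recalls = []
    for cls in (0, 1):
        total = sum(1 for y in y_true if y == cls)
        hits = sum(1 for y, p in zip(y_true, y_pred) if y == cls and p == cls)
        recalls.append(hits / total)  # assumes each class occurs at least once
    return sum(recalls) / len(recalls)


# Hypothetical test set: six goodware and two malware samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]

# A classifier that always answers "goodware" is right 75% of the time,
# yet its balanced accuracy is only 50%.
always_goodware = [0] * len(y_true)
plain_accuracy = sum(1 for y, p in zip(y_true, always_goodware) if y == p) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, always_goodware))
```
]]></preformat>
        <p>Unlike plain accuracy, this metric is not inflated by the imbalance between the goodware and malware classes, so a score of 50% indicates a classifier with no discriminative power.</p>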
        <p>[Figure 4: balanced accuracy versus mnemonic count for (a) logistic regression, (b) linear SVM, and (c) decision tree.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this study, we demonstrate the low variability of existing free datasets, which severely hinders
research on detection methodologies for malware. Considering the increasing spread of malware
targeting Linux machines, coupled with the widespread adoption of IoT solutions based on this OS, we
believe that the community should undertake a serious effort toward the creation of more extensive
free datasets to bolster the research on malware analysis and detection methodologies in the Linux
landscape.</p>
      <p>[Figure 5: balanced accuracy versus mnemonic count for (a) logistic regression, (b) linear SVM, and (c) decision tree.]</p>
      <p>
        Notably, one could consider that focusing on ARM architectures is an ill-posed effort because the
process can be trivialised. However, we point out that ARM is the most used architecture in the IoT
domain [<xref ref-type="bibr" rid="ref28">28</xref>], and the market share of this domain is projected to grow exponentially in the future<sup>11</sup>.
Given this, we argue that a proactive effort to strengthen the current classification paradigms is
necessary.
      </p>
      <p>Considering the results presented in this paper, we plan to extend our study to assess the quality of
Linux malware datasets for other hardware platforms, e.g., Intel and MIPS. We also plan to increase
the granularity of our research to precisely evaluate the effect of the variability of datasets when
targeting the detection of specific malware families. Finally, we plan to establish a precise methodology to
assess the quality and variability of datasets to provide a standardized benchmarking procedure to test
ML/AI-based malware detection approaches, reducing the bias introduced by the chosen datasets for
training and testing.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partially supported by project SERICS (PE00000014) and by project SETA (PNRR M4.C2.1.1
PRIN 2022 PNRR, Cod. P202233M9Z, CUP F53D23009120001, Avviso D.D 1409 14.09.2022). Both these
projects are under the MUR National Recovery and Resilience Plan funded by the European Union
NextGenerationEU. This work was also partially supported by the European research projects H2020
LeADS (GA 956562), Horizon Europe DUCA (GA 101086308), ARN TrustInClouds, and CNRS IRN
EU-CHECK.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p><sup>11</sup> https://www.fortunebusinessinsights.com/industry-reports/internet-of-things-iot-market-100307</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>P.</given-names> <surname>Maniriho</surname></string-name>,
          <string-name><given-names>A. N.</given-names> <surname>Mahmood</surname></string-name>,
          <string-name><given-names>M. J. M.</given-names> <surname>Chowdhury</surname></string-name>,
          <article-title>A survey of recent advances in deep learning models for detecting malware in desktop and mobile platforms</article-title>,
          <volume>56</volume>
          (<year>2024</year>).
          URL: https://doi.org/10.1145/3638240.
          doi:<pub-id pub-id-type="doi">10.1145/3638240</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Singh</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Singh</surname></string-name>,
          <article-title>A survey on machine learning-based malware detection in executable files</article-title>,
          <source>Journal of Systems Architecture</source>
          <volume>112</volume>
          (<year>2021</year>).
          doi:<pub-id pub-id-type="doi">10.1016/j.sysarc.2020.101861</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>H.</given-names> <surname>HaddadPajouh</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Dehghantanha</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Khayami</surname></string-name>,
          <string-name><given-names>K.-K. R.</given-names> <surname>Choo</surname></string-name>,
          <article-title>A deep recurrent neural network based approach for internet of things malware threat hunting</article-title>,
          <source>Future Generation Computer Systems</source>
          <volume>85</volume>
          (<year>2018</year>)
          <fpage>88</fpage>-<lpage>96</lpage>.
          URL: https://www.sciencedirect.com/science/article/pii/S0167739X1732486X.
          doi:<pub-id pub-id-type="doi">10.1016/j.future.2018.03.007</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>H.</given-names> <surname>Darabian</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Dehghantanha</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hashemi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Homayoun</surname></string-name>,
          <string-name><given-names>K.-K. R.</given-names> <surname>Choo</surname></string-name>,
          <article-title>An opcode-based technique for polymorphic internet of things malware detection</article-title>,
          <source>Concurrency and Computation: Practice and Experience</source>
          <volume>32</volume>
          (<year>2020</year>)
          <elocation-id>e5173</elocation-id>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Q.-D.</given-names> <surname>Ngo</surname></string-name>,
          <string-name><given-names>H.-T.</given-names> <surname>Nguyen</surname></string-name>,
          <string-name><given-names>V.-H.</given-names> <surname>Le</surname></string-name>,
          <string-name><given-names>D.-H.</given-names> <surname>Nguyen</surname></string-name>,
          <article-title>A survey of iot malware and detection methods based on static features</article-title>,
          <source>ICT Express</source>
          <volume>6</volume>
          (<year>2020</year>)
          <fpage>280</fpage>-<lpage>286</lpage>.
          URL: https://www.sciencedirect.com/science/article/pii/S2405959520300503.
          doi:<pub-id pub-id-type="doi">10.1016/j.icte.2020.04.005</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Graziano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fratantonio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Balzarotti</surname>
          </string-name>
          ,
          <article-title>Understanding Linux malware</article-title>
          ,
          in:
          <source>2018 IEEE Symposium on Security and Privacy (SP)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>175</lpage>
          . doi:10.1109/SP.2018.00054.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-Mondéjar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Suarez-Tangil</surname>
          </string-name>
          ,
          <article-title>Characterizing Linux-based malware: Findings and recent trends</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>110</volume>
          (
          <year>2020</year>
          )
          <fpage>267</fpage>
          -
          <lpage>281</lpage>
          . doi:10.1016/j.future.2020.04.031.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Phu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Hoang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Toan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Tho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Binh</surname>
          </string-name>
          ,
          <article-title>CFDVex: A novel feature extraction method for detecting cross-architecture IoT malware</article-title>
          ,
          in:
          <source>Proceedings of the 10th International Symposium on Information and Communication Technology</source>
          , SoICT '19,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2019</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>254</lpage>
          . URL: https://doi.org/10.1145/3368926.3369702. doi:10.1145/3368926.3369702.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Phu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hoang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Toan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Dai</given-names>
            <surname>Tho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Binh</surname>
          </string-name>
          ,
          <article-title>C500-CFG: A novel algorithm to extract control flow-based features for IoT malware detection</article-title>
          ,
          in:
          <source>2019 19th International Symposium on Communications and Information Technologies (ISCIT)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>568</fpage>
          -
          <lpage>573</lpage>
          . doi:10.1109/ISCIT.2019.8905120
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Azmoodeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dehghantanha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-K. R.</given-names>
            <surname>Choo</surname>
          </string-name>
          ,
          <article-title>Robust malware detection for Internet of (Battlefield) Things devices using deep eigenspace learning</article-title>
          ,
          <source>IEEE Transactions on Sustainable Computing</source>
          <volume>4</volume>
          (
          <year>2018</year>
          )
          <fpage>88</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Dovom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Azmoodeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dehghantanha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Newton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Parizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Karimipour</surname>
          </string-name>
          ,
          <article-title>Fuzzy pattern tree for edge malware detection and categorization in IoT</article-title>
          ,
          <source>Journal of Systems Architecture</source>
          <volume>97</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1383762118305265. doi:10.1016/j.sysarc.2019.01.017.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <chapter-title>Artificial Immune System</chapter-title>
          ,
          <publisher-name>John Wiley &amp; Sons, Ltd</publisher-name>
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119076582.ch1. doi:10.1002/9781119076582.ch1.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pendlebury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pierazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jordaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kinder</surname>
          </string-name>
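          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavallaro</surname>
          </string-name>
          ,
          <article-title>TESSERACT: Eliminating experimental bias in malware classification across space and time</article-title>
          ,
          in:
          <source>28th USENIX Security Symposium (USENIX Security 19)</source>
          ,
          <publisher-name>USENIX Association</publisher-name>
          ,
          <publisher-loc>Santa Clara, CA</publisher-loc>
          ,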
          <year>2019</year>
          , pp.
          <fpage>729</fpage>
          -
          <lpage>746</lpage>
          . URL: https://www.usenix.org/conference/usenixsecurity19/presentation/pendlebury.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sebastián</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Caballero</surname>
          </string-name>
          ,
          <article-title>AVClass2: Massive malware tag extraction from AV labels</article-title>
          , in:
          <source>Annual Computer Security Applications Conference</source>
          , ACSAC '20,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2020</year>
          , pp.
          <fpage>42</fpage>
          -
          <lpage>53</lpage>
          . doi:10.1145/3427228.3427261.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <collab>Fraunhofer FKIE</collab>
          , https://malpedia.caad.fkie.fraunhofer.de/details/elf.mirai,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <collab>Fraunhofer FKIE</collab>
          , https://malpedia.caad.fkie.fraunhofer.de/details/elf.bashlite,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alexander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Fonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fazzion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hoepers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Steding-Jessen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H. P. C.</given-names>
            <surname>Chaves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cunha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guedes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meira</surname>
          </string-name>
          ,
          <article-title>The evolution of Bashlite and Mirai IoT botnets</article-title>
          ,
          in:
          <source>2018 IEEE Symposium on Computers and Communications (ISCC)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>00813</fpage>
          -
          <lpage>00818</lpage>
          . doi:10.1109/ISCC.2018.8538636
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <collab>CrowdStrike</collab>
          ,
          <year>2022</year>
          . URL: https://www.crowdstrike.com/blog/linux-targeted-malware-increased-by-35-percent-in-2021.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <collab>Fraunhofer FKIE</collab>
          , https://malpedia.caad.fkie.fraunhofer.de/details/elf.tsunami,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <collab>Fraunhofer FKIE</collab>
          , https://malpedia.caad.fkie.fraunhofer.de/details/elf.dofloo,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <collab>Fraunhofer FKIE</collab>
          , https://malpedia.caad.fkie.fraunhofer.de/details/elf.ngioweb,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <collab>FortiGuard Labs</collab>
          , https://www.fortiguard.com/encyclopedia/virus/10077710,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <collab>Fraunhofer FKIE</collab>
          , https://malpedia.caad.fkie.fraunhofer.de/details/elf.kaiji,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <collab>FortiGuard Labs</collab>
          , https://www.fortiguard.com/encyclopedia/virus/6570964,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <collab>Fraunhofer FKIE</collab>
          , https://malpedia.caad.fkie.fraunhofer.de/details/elf.hajime,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <collab>FortiGuard Labs</collab>
          , https://www.fortiguard.com/encyclopedia/virus/8065787,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Arp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Quiring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pendlebury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warnecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pierazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wressnegger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavallaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rieck</surname>
          </string-name>
          ,
          <article-title>Dos and don'ts of machine learning in computer security</article-title>
          ,
          in:
          <source>Proc. of USENIX Security Symposium</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>A comprehensive study on ARM disassembly tools</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>49</volume>
          (
          <year>2023</year>
          )
          <fpage>1683</fpage>
          -
          <lpage>1703</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>