<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Robust Machine Learning for Malware Detection over Time</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Angioni</string-name>
          <email>daniele.angioni@unica.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Demetrio</string-name>
          <email>luca.demetrio93@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maura Pintor</string-name>
          <email>maura.pintor@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Battista Biggio</string-name>
          <email>battista.biggio@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pluribus One S.r.l.</institution>
          ,
          <addr-line>Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Cagliari</institution>
          ,
          <addr-line>Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The presence and persistence of Android malware is an ongoing threat that plagues this information era, and machine learning technologies are now extensively used to deploy more effective detectors that can block the majority of these malicious programs. However, these algorithms have not been developed to pursue the natural evolution of malware, and their performance significantly degrades over time because of such concept drift. Currently, state-of-the-art techniques only focus on detecting the presence of such drift, or they address it by relying on frequent updates of models. Hence, there is a lack of knowledge regarding the causes of the concept drift, and ad-hoc solutions that can counter the passing of time are still under-investigated. In this work, we commence to address these issues as we propose (i) a drift-analysis framework to identify which characteristics of the data are causing the drift, and (ii) SVM-CB, a time-aware classifier that leverages the drift-analysis information to slow down the performance drop. We highlight the efficacy of our contribution by comparing its degradation over time with that of a state-of-the-art classifier, and we show that SVM-CB better withstands the distribution changes that naturally characterize the malware domain. We conclude by discussing the limitations of our approach and how our contribution can be taken as a first step towards more time-resistant classifiers that not only tackle, but also understand, the concept drift that affects data.</p>
      </abstract>
      <kwd-group>
        <kwd>android malware</kwd>
        <kwd>machine learning</kwd>
        <kwd>concept drift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
In this information era, we are experiencing tremendous growth in mobile technology, both
in its efficacy and pervasiveness. One of the most common operating systems for mobile
devices is Android and, because of its popularity, it has become particularly attractive to
cyberattackers, who exploit Android vulnerabilities by creating malicious applications, also known
as malware, targeted specifically at these systems. Luckily, the technological development
of this era brings enough power to machine learning algorithms, considered the standard for
many domains, including cyber-security and, specifically, malware detection, where they have shown
to be very effective also against never-seen malware families [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ].
      </p>
      <p>However, real-world data experience a phenomenon known as concept drift, i.e. their temporal
evolution [7]. In particular, Android applications naturally change over time, since attackers keep
adjusting malware to bypass detection, and legitimate applications embrace new frameworks
and programming patterns while abandoning deprecated technologies. Recent work highlighted
how concept drift worryingly affects the performance of state-of-the-art Android malware
detectors, revealing how much it drops over time and contradicting the results achieved by their
original analyses, which were inflated by wrong evaluation settings [8]. On top of this
issue, the only proposals to counter the concept drift rely on continuous updates or retraining of
machine learning models [9, 10, 11, 12, 13], instead of tracking which characteristics of the
data mainly change over time.</p>
      <p>Hence, we start bridging the gaps left in the state of the art by proposing novel techniques
that understand the concept drift and take advantage of it. The contributions of this work are
summarized as follows: (i) we propose a drift-analysis framework that investigates the reasons
causing the concept drift inside data, highlighting which features are more prone to have a
negative contribution to the performance decay; and (ii) we propose SVM-CB, a novel classifier
that leverages our drift-analysis information to bound the selected unstable features, reducing
the overall performance drop.</p>
      <p>
        We show the effectiveness of SVM-CB by comparing its performance over time with
Drebin [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a state-of-the-art linear classifier. To obtain a fair comparison, we train both
classifiers on the same dataset, and we show how SVM-CB better withstands the passing of time,
thanks to the domain knowledge acquired through the results of our drift-analysis framework,
which allows SVM-CB to be updated less often than Drebin.
      </p>
      <p>We conclude by discussing future directions of this work considering fewer heuristic rules to
tune SVM-CB, and extensions of our methodology to non-linear classifiers.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Android Malware Detection over Time</title>
      <p>Before delving into the details of the proposed methods, we first describe the structure of
Android applications, to lay a foundation for understanding the classifier that we consider in
this work, and we then discuss the concept drift problem and the solutions proposed so far.
Android Applications. These are programs that run on the Android Operating System. They
are distributed as an Android Application Package (APK), an archive file with the .apk extension.
An APK contains different files: (i) the AndroidManifest.xml, which stores all the
information needed by the operating system to correctly handle the application at run-time;
(ii) the classes.dex, which stores the compiled source code and all user-implemented methods
and classes; and (iii) additional .xml and resource files that are used to define the application
user interface, along with additional functionality or multimedia content.</p>
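As a concrete illustration of this three-part layout, the grouping above can be sketched with Python's standard zipfile module (an APK is a ZIP archive). The helper name and categories are ours, used purely for illustration:

```python
import io
import zipfile

def summarize_apk(apk_bytes: bytes) -> dict:
    """Group the entries of an APK (a ZIP archive) into the three
    categories described above: manifest, bytecode, and resources."""
    summary = {"manifest": [], "bytecode": [], "resources": []}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                summary["manifest"].append(name)
            elif name.endswith(".dex"):  # classes.dex, classes2.dex, ...
                summary["bytecode"].append(name)
            else:  # .xml layouts, assets, media, ...
                summary["resources"].append(name)
    return summary

# Build a toy in-memory "APK" to exercise the function.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("AndroidManifest.xml", "<manifest/>")
    z.writestr("classes.dex", b"dex\n035")
    z.writestr("res/layout/main.xml", "<LinearLayout/>")

print(summarize_apk(buf.getvalue()))
```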
      <p>
        Malware Detection with Machine Learning. We select a popular binary detector named
Drebin [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as a baseline for our proposals, whose
architecture is described in Fig. 1. This classifier relies on a Support Vector Machine (SVM) [14]
trained on top of hand-crafted features extracted from the APKs provided at training time,
which consider: (i) features extracted from the AndroidManifest.xml
(see https://developer.android.com/guide/topics/manifest/manifest-intro), like hardware
components, requested permissions, app components, and filtered intents; and (ii) features extracted from
classes.dex, including restricted API calls, used permissions, suspicious API calls, and
network addresses. All this knowledge is encoded inside $d$-dimensional feature vectors, whose
entries are 0 or 1 depending on the absence or presence of a particular characteristic. Since
Drebin relies on a linear SVM, its decision-making process can be investigated, since each
feature is correlated with a weight that describes its orientation toward one of the two
prediction classes, namely legitimate and malicious.
      </p>
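The 0/1 encoding described above can be sketched as follows; the toy three-entry vocabulary is ours (the real Drebin vocabulary is extracted from the training set and has hundreds of thousands of entries):

```python
import numpy as np

# Hypothetical feature vocabulary: in a Drebin-style embedding, each dimension
# corresponds to one string (a permission, API call, URL, ...) seen in training.
vocabulary = {
    "app_permissions::android.permission.INTERNET": 0,
    "api_calls::android/telephony/TelephonyManager;->getDeviceId": 1,
    "urls::http://www.google.com": 2,
}

def embed(app_features, vocab):
    """Map the set of strings extracted from an APK to a d-dimensional
    0/1 vector: 1 if the feature is present, 0 otherwise."""
    x = np.zeros(len(vocab), dtype=np.int8)
    for f in app_features:
        idx = vocab.get(f)
        if idx is not None:  # features unseen at training time are ignored
            x[idx] = 1
    return x

x = embed({"urls::http://www.google.com", "some::unseen_feature"}, vocabulary)
print(x)  # -> [0 0 1]
```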
      <p>Performance over Time. Even though Drebin registered impressive performance in detecting
malware, it was not properly tested inside a time-aware environment. Its training relies on the
independent and identically distributed (i.i.d.) assumption, which takes for granted that both
training and testing samples are drawn from the same distribution. While this property might
hold for the image classification domain, it cannot be satisfied for the rapidly-growing domain
of programs, where training samples differ from future test data as new updates, frameworks,
and techniques are introduced while others are deprecated. The classic evaluation setting injects
artifacts inside the learning process, like the presence of samples coming from mixed periods,
allowing the classifier to rely on future knowledge at training time. This has been demonstrated
by Pendlebury et al. [8], who show how selected state-of-the-art detectors are characterized by
worrying performance drops when evaluated with a more realistic time-aware approach.</p>
    </sec>
    <sec id="sec-2b">
      <title>3. Analysing and Improving Robustness to Time</title>
      <p>We now introduce the two contributions of our work: (i) the drift-analysis framework, which aims to
understand the causes of the concept drift by inspecting the features extracted from data at
different time intervals and by quantifying their contribution to the overall performance drop;
and (ii) the time-aware learning algorithm SVM-CB (i.e. SVM with Custom Bounds), which uses
the drift-analysis information to select and bound the weights of a chosen number of features
considered unstable, reducing their contribution to the performance decay caused by time.</p>
      <p>Drift-analysis framework. Our first contribution tackles the open problem of explaining the
concept drift: we propose the temporal feature stability (T-stability), a novel metric, designed for
linear classifiers, that measures the contribution of each single feature to the performance decay. This
metric captures two distinct characteristics of each feature when dealing with time: its
relevance in the classifier prediction and its temporal evolution. These are quantified by the
product between (i) the weight $w_j$ corresponding to the $j$-th feature, learned at training time
by the classifier; and (ii) the slope $\gamma_j$ that approximates the temporal evolution of the values of
the feature.</p>
      <p>To compute our metric, we start from the hypothesis that a decrement in the detection rate of
malware is strictly related to a decaying score assigned to malware samples as time passes. Such
behavior corresponds to a shift of the malware class distribution towards the decision boundary
learned at training time, thus increasing the number of misclassified samples. To quantify
this intuition, we analyze the variation of the malware score over time, and we compute the
conditional expectation of the score over all malware samples (identified with the label $y = 1$)
at time $t$ as:</p>
      <p>$\mathbb{E}\left[\mathbf{w}^\top\mathbf{x} + b \,\middle|\, y = 1, t\right] = \sum_{j=1}^{d} w_j\,\mathbb{E}\left[x_j \mid y = 1, t\right] + b \qquad (1)$
where the score is computed as the scalar product between $\mathbf{w}$, the vector containing the weights
of the linear classifier with bias $b$, and $\mathbf{x}$, the $d$-dimensional feature vector representation of an
input Android application.</p>
      <p>Since we want to quantify how the features contribute to the evolution of the score expectation, we
consider the derivative of Eq. 1, which is the summation over the products between the weights and
the derivatives of the feature expectations w.r.t. time:</p>
      <p>$\frac{\partial}{\partial t}\,\mathbb{E}\left[\mathbf{w}^\top\mathbf{x} + b \,\middle|\, y = 1, t\right] = \sum_{j=1}^{d} w_j\,\frac{\partial}{\partial t}\,\mathbb{E}\left[x_j \mid y = 1, t\right] \approx \sum_{j=1}^{d} w_j\gamma_j = \sum_{j=1}^{d} t_j \qquad (2)$
Since we are interested in capturing the overall trend of the score decay, we approximate the
derivative of the $j$-th feature expectation with the slope $\gamma_j$ of the regression line that best fits the
feature expectation over time. We then compress the product $w_j \cdot \gamma_j$ into the single value $t_j$,
which is how we compute the T-stability of feature $j$. Intuitively, the larger and more negative
the T-stability of a feature, the more that feature accelerates the degradation of the
classifier.</p>
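The slope $\gamma_j$ of Eq. 2 can be obtained with an ordinary least-squares line fit per feature. A minimal sketch (function name and toy data are ours):

```python
import numpy as np

def feature_slopes(mean_over_time):
    """Given a (d, N) matrix whose j-th row holds the mean value of feature j
    in each of N time slots, return the slope of the best-fitting regression
    line for every feature (the gamma_j of Eq. 2)."""
    d, N = mean_over_time.shape
    t = np.arange(N)
    # polyfit with deg=1 fits a line a*t + c per column of y; row [0] holds
    # the slopes a, one per feature.
    return np.polyfit(t, mean_over_time.T, deg=1)[0]

# One feature climbing linearly over 9 slots, one that stays flat.
M = np.vstack([np.linspace(0.1, 0.9, 9), np.full(9, 0.5)])
print(feature_slopes(M))  # first slope is 0.1, second is (close to) 0
```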
      <p>Since expectations are not computable for a specific time instant $t$, we quantize the time
variable considering time slots of length $\Delta t$, where the $k$-th slot identifies the subset $D_k$ of
malware samples registered at time $\tau \in [k\Delta t, (k+1)\Delta t]$, with $k$ an integer. Thus,
we use Alg. 1 to obtain the vector $\mathbf{t}$ containing the T-stability of each feature. After having
computed the number of available time slots $N$ based on the timestamps in $D$ and the chosen
time window $\Delta t$ (line 1), we initialize a utility matrix $M$ that will contain the mean feature
values (line 2). Then we iterate through the time slots (line 3) and select, for each one, the subset
$D_k$ (line 4) needed to compute the mean feature value at time $k\Delta t$, storing it in the $k$-th column
of $M$ (line 5). After this step, we loop over the number of features (line 7) to compute the
slope $\gamma_j$ of the $j$-th feature over time from the $j$-th row of $M$ (line 8), to eventually return the
Hadamard product between the classifier's trained weights $\mathbf{w}$ and the feature slopes $\boldsymbol{\gamma}$ (line 9),
i.e. the T-stability vector $\mathbf{t}$.</p>
      <p>Algorithm 1: Drift Analysis.
Input: the timestamped and labeled dataset $D = \{\mathbf{x}_i, y_i, \tau_i\}_{i=1}^{n}$; the time window $\Delta t$;
the weights $\mathbf{w}$ of the reference classifier.
Output: the T-stability vector $\mathbf{t}$.
1: $N \leftarrow \lceil (\tau_{\max} - \tau_{\min})/\Delta t \rceil$ ◁ compute number of time slots
2: $M \leftarrow zeros(d, N)$ ◁ initialize utility matrix
3: for $k \in [0, N-1]$ do
4: &#160;&#160; $D_k \leftarrow \{(\mathbf{x}, y, \tau) \in D : y = 1, \tau \in [k\Delta t, (k+1)\Delta t]\}$ ◁ obtain data in time slot $k$
5: &#160;&#160; $M_{*,k} \leftarrow \frac{1}{|D_k|}\sum_{\mathbf{x} \in D_k} \mathbf{x}$ ◁ compute mean feature value in time slot $k$
6: $\boldsymbol{\gamma} \leftarrow zeros(d)$
7: for $j \in [0, d-1]$ do
8: &#160;&#160; $\gamma_j \leftarrow slope(M_{j,*})$ ◁ compute slope of the regression line
9: $\mathbf{t} \leftarrow \mathbf{w} \circ \boldsymbol{\gamma}$ ◁ compute the T-stability vector
10: return $\mathbf{t}$ ◁ return the T-stability vector</p>
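Alg. 1 can be sketched in a few lines of NumPy. This is a minimal sketch under our reconstruction of the algorithm; the function name and the synthetic demo data are ours, not from the paper:

```python
import numpy as np

def drift_analysis(X, y, tau, w, delta):
    """Minimal sketch of Alg. 1: return the T-stability vector t = w * gamma.

    X: (n, d) binary feature matrix; y: (n,) labels (1 = malware);
    tau: (n,) timestamps; w: (d,) weights of the reference linear classifier;
    delta: length of a time slot (same unit as tau)."""
    n_slots = int(np.ceil((tau.max() - tau.min()) / delta))
    d = X.shape[1]
    M = np.zeros((d, n_slots))  # mean feature value per (feature, time slot)
    start = tau.min()
    for k in range(n_slots):
        in_slot = (y == 1) & (tau >= start + k * delta) & (tau < start + (k + 1) * delta)
        if in_slot.any():  # empty slots keep their zero column in this sketch
            M[:, k] = X[in_slot].mean(axis=0)
    # Slope of the regression line fitted to each feature's mean over time.
    gamma = np.polyfit(np.arange(n_slots), M.T, deg=1)[0]
    return w * gamma  # Hadamard product (line 9 of Alg. 1)

# Synthetic demo: feature 0 appears more often in malware as time passes,
# feature 1 fades away, feature 2 stays stable.
rng = np.random.default_rng(0)
n, d = 400, 3
tau = rng.uniform(0, 9.9, n)
y = np.ones(n)
p = np.stack([tau / 10, 1 - tau / 10, np.full(n, 0.5)], axis=1)
X = (rng.uniform(size=(n, d)) < p).astype(int)
w = np.array([-1.0, 1.0, 1.0])  # goodware-oriented, malware-, malware-oriented
t_stab = drift_analysis(X, y, tau, w, delta=2.0)
print(t_stab)  # features 0 and 1 get negative T-stability, feature 2 near zero
```

Note how the two unstable patterns discussed later both yield negative T-stability: a goodware-oriented weight ($w_j < 0$) with a rising feature, and a malware-oriented weight ($w_j > 0$) with a fading one.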
      <p>Robustness to Future Changes. As our second contribution, we show how to exploit the
information obtained with the drift analysis inside the optimization process to train SVM-CB, an
SVM classifier hardened against the passing of time. To train SVM-CB, we consider a reference,
temporally unstable classifier to compute the T-stability of each feature. Then, we select the
unstable features, i.e. the $k$ features with the most negative T-stability values $t_j$. Our goal is to
train a new classifier that relies less on these unstable features; thus we bound the absolute
value of the corresponding weights to directly reduce their contribution in Eq. 2. This can be
formalized as the constrained optimization problem in Eq. 3, where the hinge loss is minimized
subject to a constraint on the subset of weights $\mathbf{w}_K$, i.e. the $k$ weights corresponding to the
unstable features, which are forced to be lower than a specific bound $r$ in their absolute value:</p>
      <p>$\arg\min_{\mathbf{w}, b} \; \sum_{i=1}^{n} \max\left(0,\; 1 - y_i\,(\mathbf{w}^\top\mathbf{x}_i + b)\right), \qquad (3)$
$\text{s.t.} \quad |w_j| &lt; r, \quad \forall j \in K. \qquad (4)$</p>
      <p>We show in Alg. 2 the time-aware training algorithm for SVM-CB, which minimizes this objective
through a gradient descent procedure. The algorithm is initialized by first identifying the
subset $\mathbf{w}_K$ of weights corresponding to the $k$ unstable features (lines 1-3). Then, at each
iteration, we first modulate the learning rate with the function $g(t)$ to improve convergence
(line 6), we update the parameters of the classifier by applying gradient descent (lines
7-8), and we eventually clip the weights contained in $\mathbf{w}_K$ to the bound $r$ if their absolute value
exceeds it (line 9), as described in Eq. 4. After $N$ iterations, the algorithm returns the learned
parameters $\mathbf{w}$ and $b$ (line 10).</p>
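The projected-gradient procedure just described can be sketched in Python. This is a minimal sketch under our reconstruction of Alg. 2: the hyper-parameter defaults mirror the values reported later in Sect. 4, while the function name and the synthetic demo data are ours:

```python
import numpy as np

def train_svm_cb(X, y, t_stab, k, r, n_iter=2000, lr0=7e-5):
    """Minimal sketch of Alg. 2: hinge-loss subgradient descent where the k
    most unstable weights (most negative T-stability) are clipped to [-r, r]
    after every step, enforcing the constraint of Eq. 4."""
    n, d = X.shape
    unstable = np.argsort(t_stab)[:k]  # indexes of the k most negative values
    w, b = np.zeros(d), 0.0
    for it in range(1, n_iter + 1):
        lr = lr0 * 0.5 * (1 + np.cos(np.pi * it / n_iter))  # cosine annealing
        margin = y * (X @ w + b)
        active = margin < 1  # samples inside the hinge
        grad_w = -(y[active, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
        w[unstable] = np.clip(w[unstable], -r, r)  # project onto |w_j| <= r
    return w, b

# Synthetic demo: feature 0 and feature 3 are both informative, but we pretend
# drift analysis flagged feature 3 as unstable, so its weight gets bounded.
rng = np.random.default_rng(1)
y = np.where(rng.uniform(size=200) < 0.5, 1.0, -1.0)
X = rng.normal(scale=0.3, size=(200, 4))
X[:, 0] += y
X[:, 3] += y
t_stab = np.array([0.0, 0.0, 0.0, -1.0])
w, b = train_svm_cb(X, y, t_stab, k=1, r=0.1)
acc = np.mean(np.sign(X @ w + b) == y)
print(w, acc)  # |w[3]| stays within 0.1; accuracy stays high via feature 0
```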
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <p>Algorithm 2: SVM-CB learning algorithm.
Input: the training data $D = \{\mathbf{x}_i, y_i\}_{i=1}^{n}$; the absolute value $r$ of the bound that must be
applied to the weights; the T-stability vector $\mathbf{t}$; the number $k$ of weights that must
be bounded; the number of iterations $N$; the initial gradient step size $\eta(0)$; a
decaying function $g(t)$.
Output: $\mathbf{w}, b$, the trained classifier's parameters.
1: $idx \leftarrow argsort(\mathbf{t})$ ◁ initialize feature indexes ordered w.r.t. $\mathbf{t}$
2: $K \leftarrow \{idx_j : j = 0, \ldots, k\}$ ◁ select first $k$ indexes
3: initialize $\mathbf{w}_K = \{w_j : j \in K\}$ ◁ select corresponding $k$ weights
4: $(\mathbf{w}(0), b(0)) \leftarrow (\mathbf{0}, 0)$ ◁ initialize parameters
5: for $t \in [1, N]$ do
6: &#160;&#160; $\eta(t) \leftarrow \eta(0)\,g(t)$ ◁ update learning rate
7: &#160;&#160; $\mathbf{w}(t) \leftarrow \mathbf{w}(t-1) - \eta(t)\nabla_{\mathbf{w}}\mathcal{L}$ ◁ update weights
8: &#160;&#160; $b(t) \leftarrow b(t-1) - \eta(t)\nabla_{b}\mathcal{L}$ ◁ update bias
9: &#160;&#160; $\mathbf{w}_K(t) \leftarrow clip(\mathbf{w}_K(t); -r, r)$ ◁ clip weights based on Eq. 4
10: return $\mathbf{w}(N), b(N)$ ◁ return the learned parameters</p>
      <p>
We now apply our methodology to quantify how it explains and hardens a classifier against the
performance decay compared with the time-agnostic classifier Drebin [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
Dataset. We leverage the dataset provided by Pendlebury et al. [8], composed of 116,993
legitimate and 12,735 malicious Android applications sampled from the AndroZoo dataset [15],
spanning from January 2014 to December 2016. We replicate their temporal train-test split as
shown in Fig. 2, dividing the samples between December 2014 and January 2015, and we set the
time slot $\Delta t$ equal to 1 month to ensure sufficient statistics for each. We hence extract 465,608
features from the training set to match the original formulation of Drebin [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
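The time-aware split described above can be sketched as follows; the tuple layout, helper name, and toy dates are hypothetical, used only to illustrate that no post-split sample leaks into training:

```python
from datetime import datetime

def temporal_split(samples, split_date):
    """Time-aware evaluation sketch: everything observed before the split
    date forms the training set; later samples, kept in chronological order,
    form the monthly test periods. `samples` is a list of
    (timestamp, features, label) tuples (hypothetical layout)."""
    train = [s for s in samples if s[0] < split_date]
    test = sorted((s for s in samples if s[0] >= split_date), key=lambda s: s[0])
    return train, test

samples = [
    (datetime(2014, 3, 1), "x1", 0),
    (datetime(2014, 11, 5), "x2", 1),
    (datetime(2015, 2, 20), "x3", 1),
    (datetime(2016, 7, 9), "x4", 0),
]
train, test = temporal_split(samples, datetime(2015, 1, 1))
print(len(train), len(test))  # -> 2 2
```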
      <p>Models. We consider Drebin as the baseline classifier, trained with the $C$ parameter set to 1, and
we compare it with two versions of SVM-CB, obtained by considering different bounds on the unstable
features detected by the drift-analysis framework. We will refer to the baseline classifier as
SVM, since the underlying feature extractor and the feature embedding module are the same for
all the classifiers under analysis.</p>
      <p>Drift Analysis Results. To identify the features responsible for the performance decay over
time in our baseline SVM, we first show in Fig. 3 the trend of the mean score assigned
respectively to malicious (Fig. 3a) and benign samples (Fig. 3b) over all the testing periods.
While the classifier assigns, on average, an almost constant negative score to the goodware
class, the mean score assigned to malware gradually approaches zero, eventually becoming
negative after 10 months, thus validating the hypothesis stated in Sect. 3.</p>
      <p>We compute the T-stability vector $\mathbf{t}$ through Alg. 1 for the learned weights of the SVM
w.r.t. the timestamped training set, and we show the first 104 T-stability values in increasing
order along with the corresponding features in Fig. 3c. The latter highlights that most of the
contribution to the performance decay is caused by roughly 100 features among the whole feature
set, while all the remaining ones do not substantially compromise the detection rate over time,
since their T-stability is very close to zero.</p>
      <p>We report a subset of the selected unstable features (i.e. features presenting large negative
T-stability values) in Table 1. The first 10 rows show features that the SVM associates with the
goodware class and that are becoming more likely to be found in malware ($w_j &lt; 0$, $\gamma_j &gt; 0$), while
the last 10 rows show features that the SVM associates with the malware class but that are
disappearing from the data ($w_j &gt; 0$, $\gamma_j &lt; 0$). For simplicity, we will refer to the features in the
first and second half of the table, respectively, as the first and the second group of features.</p>
      <p>We can recognize in the first group features mostly related to commonly-used URLs. For
instance, among them, we find "www.google.com", "www.youtube.com", and websites under
the "facebook.com" domain, which are all legitimate URLs to browse, and the classifier links
them to the goodware class by assigning them a positive weight.</p>
      <p>Table 1: A subset of the selected unstable features, with their T-stability values $w_j \gamma_j$.
urls::https://graph.facebook.com/%1$s?...&amp;accessToken=%2$s : -0.008753
intents::android_intent_action_VIEW : -0.010168
urls::http://www.google.com : -0.021320
activities::com_revmob_ads_fullscreen_FullscreenActivity : -0.006204
activities::com_feiwo_view_IA : -0.004435
urls::http://i.ytimg.com/vi/ : -0.005245
api_calls::android/content/ContentResolver;→openInputStream : -0.003749
urls::https://m.facebook.com/dialog/ : -0.004955
urls::http://market.android.com/details?id= : -0.004041
urls::http://www.youtube.com/embed/ : -0.004289
api_calls::android/net/wifi/WifiManager;→getConnectionInfo : -0.003469
app_permissions::name='android_permission_MOUNT_UNMOUNT_FILESYSTEMS' : -0.004508
urls::http://e.admob.com/clk?... : -0.006713
activities::com_feiwothree_coverscreen_SA : -0.003564
interesting_calls::Cipher(DES) : -0.008910
intents::android_intent_action_PACKAGE_ADDED : -0.022435
activities::com_fivefeiwo_coverscreen_SA : -0.003813
intents::android_intent_action_CREATE_SHORTCUT : -0.012456
intents::android_intent_action_USER_PRESENT : -0.021155
activities::com_feiwoone_coverscreen_SA : -0.010022</p>
      <p>The second group is mostly characterized by features related to intents and activities. For
instance, we find the presence of a cipher algorithm ("interesting_calls::Cipher(DES)"), reported
to be used to obfuscate and encrypt part of malicious applications (see
https://www.virusbulletin.com/virusbulletin/2014/07/obfuscation-android-malware-and-how-fight-back).
However, this feature has a decreasing trend ($\gamma_j &lt; 0$),
meaning that malware relies less on this method as time passes, probably because it would ease
the detection of the malware under manual inspection.</p>
      <p>From this analysis, we can deduce that the unstable features can be grouped into two types:
(i) goodware-related features that malware creators are starting to inject into their
malicious code to increase the probability of it being recognized as goodware, and (ii)
malware-related features that malware creators are starting to deprecate to reduce the probability of it
being recognized as malware.</p>
      <p>Improving Robustness. We now leverage the results of our drift-analysis framework
performed on the SVM by training SVM-CB using Alg. 2, running it for 2000 iterations with
the initial learning rate $\eta(0)$ set to $7 \cdot 10^{-5}$, and we use the cosine annealing function as $g(t)$
to modulate it over the iterations. We heuristically choose the number of features to bound,
$k = 102$, since these are the ones that contribute the most to the performance decay (Fig. 3c). We
train two versions of SVM-CB, referred to as (i) SVM-CB(H), the classifier with $r = 0.8$, and (ii)
SVM-CB(L), the classifier with $r = 0.2$. These two different bounds allow us to better understand
how the robustness against the concept drift changes when we apply softer ($r = 0.8$) or harder
($r = 0.2$) constraints to the corresponding weights.</p>
      <p>We report the performance analysis of these classifiers in Fig. 4, where we show the evolution over the testing periods of the recall
(red) and the precision (blue) for the SVM (Fig. 4a) and SVM-CB (L and H) (Fig. 4c and 4b). We will
focus mainly on the discussion of the recall curves, as our primary concern is the detection
rate of the malware samples over time, which is exactly what the recall measures. Also, we will not
discuss the results concerning the last two months, as the number of samples is not sufficiently
large for a proper evaluation (as highlighted by Fig. 2).</p>
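The per-period evaluation just described can be sketched as follows; the helper name and the synthetic decaying scores are ours, used only to illustrate how a monthly recall curve is computed from a linear classifier's scores:

```python
import numpy as np

def monthly_recall(scores, y, month):
    """Sketch of the evaluation in Fig. 4: recall (malware detection rate)
    computed separately for each monthly test period. A positive SVM score
    means the sample is predicted as malware."""
    out = {}
    for m in np.unique(month):
        sel = (month == m) & (y == 1)
        if sel.any():
            out[int(m)] = float(np.mean(scores[sel] > 0))
    return out

# Synthetic scores whose mean decays month after month, mimicking the drift
# of the malware class towards (and past) the decision boundary.
rng = np.random.default_rng(2)
month = np.repeat(np.arange(3), 100)
y = np.ones(300)
scores = rng.normal(loc=1.5 - month, scale=1.0)
print(monthly_recall(scores, y, month))  # recall drops month after month
```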
      <p>We correctly replicated the results obtained by Pendlebury et al. [8] for the SVM, which
presents the highest recall among the tested classifiers in the first testing periods, starting from
76.4%, dropping fast towards a 28.8% recall at the 16th month, to eventually rise to 45.3% at the
21st month. Although the initial detection rate of SVM-CB(L) is lower than 70%, it fluctuates less
w.r.t. the baseline, maintaining the performance around 50-60% with a final drop to 35.8%
at the third-to-last month. SVM-CB(H) presents an initial recall of 69.4%, which decays to
43.2% once it reaches the 22nd month. Coherently with the results obtained by Pendlebury et
al. [8], we observe that the baseline SVM is characterized by the fastest performance decay,
while the other classifiers start between 60% and 70% recall. The peak of temporal robustness is
reached by SVM-CB(L), whose recall curve is almost flattened, while SVM-CB(H)
has a slower decay w.r.t. the SVM but a faster one than SVM-CB(L). Lastly, Fig. 4d shows the
Area Under the Receiver Operating Characteristic (ROC) curve for each testing period, computed up
to 5% FPR. Here we indirectly discuss the correlation between precision and recall by considering
the performance when we fix a constant percentage of goodware misclassified as malware for
each month, in order to better measure and compare the data separation capabilities of the
three classifiers. The AUC curves reflect what we have discussed for the recall: the SVM starts
with the highest AUC and decays rapidly below all the other AUC curves, while the
other classifiers start with a lower AUC that reveals to be higher than the SVM's when approaching
the 10th month. SVM-CB(L) is confirmed to be the most stable classifier even in this
constrained evaluation setting with low FPR.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Related Work</title>
      <p>We now offer an overview of state-of-the-art techniques similar to our proposal. Pendlebury
et al. [8] propose Tesseract, a test-time evaluation framework to determine the faultiness of
classifiers in the presence of concept drift. The authors show that evaluations are affected by
misleading biases that inject artifacts inside the trained machine learning model, thus causing a
performance decay once the model faces real-world data. Tesseract highlights how different
proposed models do not cope with the concept drift of Android applications and that faulty
training settings inflated their original evaluations. While Tesseract is a consistent method to
include concept drift in the evaluation, it is not designed to either fix or mitigate its presence.</p>
      <p>Jordaney et al. [10] propose Transcend, a framework that signals the premature aging of
classifiers before their performance starts to degrade consistently, by analyzing the difference
between samples observed at training and at test time. On top of this methodology, Barbero et
al. [11] propose Transcendent, which improves Transcend to include the rejection of
out-of-distribution samples that cause the performance drops. However, they do not propose methods
to harden a classifier against concept drift; rather, they focus on protection systems exploiting
samples encountered during deployment, such as a notification when data start differing from
the training ones [10], or directly rejecting a sample coming from a drifted data distribution [11].</p>
      <p>In contrast to previous work, we consider the presence of faulty evaluations, and we extend
it with a methodology that quantifies which features of the data distributions are changing and
how. Such a contribution not only explains the performance decay, but also helps understand
the reasons behind the concept drift. Instead of rejecting samples or just signaling the worsening
of the performance of a model, we build a time-aware classifier that takes into account the
acquired knowledge of the data distribution changes, and we show how our methodology can
better withstand the passing of time.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions and Future Work</title>
      <p>In this work, we propose a preliminary methodology that understands and provides an initial
hardening against the concept drift that plagues the performance of Android malware detectors.
In particular, we develop a drift-analysis framework that highlights which features contribute
more to the performance decay of a classifier over time, and we leverage these results to propose
SVM-CB, a linear classifier hardened against the passing of time.</p>
      <p>We show the efficacy of our proposals by applying our drift-analysis framework to Drebin,
a linear Android malware detector, and we compare its performance over time against that of its
hardened version computed through our proposed methodology. From our experimental
analysis, we can precisely detect which features worsen the detection rate of Drebin and how the
trained SVM-CB better withstands the passing of time. In particular, we highlight the efficacy of
bounding these unstable features, reducing the performance drop of SVM-CB w.r.t. the
baseline Drebin.</p>
      <p>
Although the obtained results are promising, this work presents the following limitations.
First, the experimental setup does not guarantee that the provided solution against performance
decay can be applied to other types of detectors, as this work addresses the problem of analyzing
the effect of the concept drift only for linear classifiers that work on static features [
        <xref ref-type="bibr" rid="ref1">1, 16</xref>
        ].
Also, the T-stability might not reflect the actual concept drift that affects Android applications,
as it is computed on a classifier trained on a specific dataset, which only approximates the real
data distribution. Hence, we should also study the Android malware domain further to provide
sufficient and reliable evidence of why the features chosen by the drift-analysis framework are
actually causing the decay. Lastly, we heuristically tuned the bounds for the selected weights of
SVM-CB, but these choices could be improved with an automatic algorithm that computes the
values leading to better robustness against time.
      </p>
      <p>We nonetheless believe that our work suggests a promising research direction
that will provide more insight on the usage of each contribution. We first intend to explore
more advanced methods based on the drift-analysis framework, including an automatic bound
selection for the weights inside the learning algorithm, by adopting a regularization term tailored
specifically for temporal performance stability. Secondly, we intend to generalize this method
to address deep learning algorithms, where the feature extractor and the feature representation
of the last linear layer evolve during training.</p>
      <p>Moreover, we will explore other research directions, such as (i) quantifying and preventing
the forgetting of old threats by machine learning malware detectors when they are updated with new
data, and (ii) the inclusion of research fields such as Continual Learning, which models data
as a continuous stream, thus enabling the development of techniques for updating classifiers
constantly and effortlessly.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partly supported by the PRIN 2017 project RexLearn, funded by the Italian
Ministry of Education, University and Research (grant no. 2017TWNMH2); and by the project
TESTABLE (grant no. 101019206), under the EU’s H2020 research and innovation programme.</p>
      <p>[7] G. I. Webb, R. Hyde, H. Cao, H. L. Nguyen, F. Petitjean, Characterizing concept drift, Data
Mining and Knowledge Discovery 30 (2016) 964–994.
[8] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, L. Cavallaro, TESSERACT: Eliminating
experimental bias in malware classification across space and time, in: 28th USENIX
Security Symposium (USENIX Sec. 19), 2019, pp. 729–746.
[9] A. Singh, A. Walenstein, A. Lakhotia, Tracking concept drift in malware families, in:
Proceedings of the 5th ACM workshop on Security and artificial intelligence, 2012, pp.
81–92.
[10] R. Jordaney, K. Sharad, S. K. Dash, Z. Wang, D. Papini, I. Nouretdinov, L. Cavallaro,
Transcend: Detecting concept drift in malware classification models, in: 26th USENIX
Security Symposium (USENIX Sec. 17), 2017, pp. 625–642.
[11] F. Barbero, F. Pendlebury, F. Pierazzi, L. Cavallaro, Transcending transcend: Revisiting
malware classification in the presence of concept drift, arXiv preprint arXiv:2010.03856
(2020).
[12] D. Hu, Z. Ma, X. Zhang, P. Li, D. Ye, B. Ling, The concept drift problem in android malware
detection and its solution, Security and Communication Networks 2017 (2017).
[13] A. Narayanan, L. Yang, L. Chen, L. Jinliang, Adaptive and scalable android malware
detection through online learning, in: 2016 International Joint Conference on Neural
Networks (IJCNN), IEEE, 2016, pp. 2484–2491.
[14] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (1995) 273–297.
[15] K. Allix, T. F. Bissyandé, J. Klein, Y. Le Traon, Androzoo: Collecting millions of android
apps for the research community, in: 2016 IEEE/ACM 13th Working Conference on Mining
Software Repositories (MSR), IEEE, 2016, pp. 468–471.
[16] A. Demontis, M. Melis, B. Biggio, D. Maiorca, D. Arp, K. Rieck, I. Corona, G. Giacinto,
F. Roli, Yes, machine learning can be more secure! a case study on android malware
detection, IEEE Transactions on Dependable and Secure Computing 16 (2017) 711–724.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Arp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spreitzenbarth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hubner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gascon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rieck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Siemens</surname>
          </string-name>
          ,
          <article-title>Drebin: Effective and explainable detection of android malware in your pocket</article-title>
          ,
          <source>in: NDSS</source>
          , volume
          <volume>14</volume>
          ,
          <year>2014</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Mariconti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Onwuzurike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Andriotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cristofaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stringhini</surname>
          </string-name>
          ,
          <article-title>Mamadroid: Detecting android malware by building markov chains of behavioral models</article-title>
          ,
          <year>2017</year>
          . arXiv:1612.04433.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Papernot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Manoharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Backes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McDaniel</surname>
          </string-name>
          ,
          <article-title>Adversarial examples for malware detection</article-title>
          ,
          <source>in: European symposium on research in computer security</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ahvanooey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Rajput</surname>
          </string-name>
          ,
          <article-title>A survey on smartphones security: software vulnerabilities, malware, and attacks</article-title>
          , arXiv preprint arXiv:2001.09406 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Souri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <article-title>A state-of-the-art survey of malware detection approaches using data mining techniques</article-title>
          ,
          <source>Human-centric Computing and Information Sciences</source>
          <volume>8</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Amamra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Talhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Robert</surname>
          </string-name>
          ,
          <article-title>Smartphone malware detection: From a survey towards taxonomy</article-title>
          ,
          <source>in: 2012 7th International Conference on Malicious and Unwanted Software</source>
          , IEEE,
          <year>2012</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>