<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Experimental Analysis of Semi-supervised Learning for Malware Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Minnei</string-name>
          <email>luca.minnei@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Piras</string-name>
          <email>giorgio.piras@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Sotgiu</string-name>
          <email>angelo.sotgiu@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maura Pintor</string-name>
          <email>maura.pintor@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ambra Demontis</string-name>
          <email>ambra.demontis@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Maiorca</string-name>
          <email>davide.maiorca@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Battista Biggio</string-name>
          <email>battista.biggio@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, the wide use of the Android operating system for mobile devices has encouraged a likewise increasing number of cyber-attackers, who exploit related vulnerabilities to create Android malware. While these represent a major threat in the security landscape, it has been shown that machine learning algorithms, trained over a collection of goodware and malware data, can effectively detect their presence. However, the domain in which such data lies changes over time due to the evolution of applications, such as software updates or deprecation of API calls, and the amounts of malware and goodware examples are typically imbalanced. Hence, while machine-learning detectors are effective solutions, their performance must keep up with domain evolution and class imbalance, which can, however, result in frequent, expensive retraining. In this work, we perform a preliminary experimental investigation of semi-supervised learning to retrain machine learning-based malware detectors using pseudo-labels along with a small pool of labeled samples. In detail, we account for class imbalance by considering self-training with class-specific thresholds. Our results show that we improve the classification performance by using approximately 10% of pseudo-labels in each re-training round.</p>
      </abstract>
      <kwd-group>
        <kwd>Semi-Supervised Learning</kwd>
        <kwd>Android Malware Detection</kwd>
        <kwd>Concept Drift</kwd>
        <kwd>Active Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Android is one of the most common operating systems for mobile devices, which, however, has often
been associated with numerous vulnerabilities.1 As in many areas of cybersecurity, attackers exploit
these vulnerabilities to target users, and Android is no exception. In fact, over the years, attackers have
developed and deployed multiple forms of malware, i.e., malicious applications or software.2 To address
this challenge, Machine Learning (ML) models have proven to be highly effective tools for distinguishing
Android malware from legitimate applications (goodware) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and as Android mobile devices have
become increasingly popular, these detectors now play a crucial role in safeguarding users’
security and privacy. In detail, the ML models used in such systems are trained over a collection of
goodware and malware samples, and are then used “in the wild” to detect possible threats.
      </p>
      <p>Despite their wide use, recent work has shown that the detection performance of such ML models
is, however, mostly limited to controlled environments (in vitro) where the data distribution remains stable,
while it can drop dramatically when the detectors operate in real-world settings (in vivo) [2]. In fact,
the constantly evolving nature of applications and the corresponding adaptation of malware threats
lead to frequent changes in the data distribution, which create substantial challenges for detection. This
phenomenon, known as “concept drift”, causes models to lose their effectiveness as the malware
landscape evolves and the data distribution changes over time [3]. To keep up with such changes, an
ordinary solution involves frequently retraining the models on large, manually labeled
datasets, which, however, incurs extremely high costs for both retraining and labeling.</p>
      <p>In contrast, state-of-the-art approaches adopt different strategies to adapt the model to concept drift,
each accompanied by its own set of challenges. For instance, Continual Learning allows models to
adapt to new malware patterns by relying on labeled data to update the model incrementally and ensure
accurate detection of evolving threats. However, especially on imbalanced datasets, it risks catastrophic
forgetting, a phenomenon where the model loses previously acquired knowledge as it learns from
new data [2]. Semi-Supervised Learning approaches instead, such as self-training, which uses model
predictions as pseudo-labels, reduce reliance on labeled data by leveraging abundant unlabeled apps.
However, these require careful management to avoid propagating errors from incorrect pseudo-labels [4].
Finally, Active Learning approaches prioritize labeling the most informative samples to reduce labeling
costs. For instance, using uncertainty sampling to select the samples to label improves efficiency, although it
requires precise sample identification to maximize impact and minimize additional labeling costs [2].
Hence, while each of these methods proposes interesting approaches, each has its own set of drawbacks
that limit the efficacy of a standalone implementation as a solution.</p>
      <p>In this work, we thus investigate the use of semi-supervised learning (SSL) to tackle the challenges of
updating malware detection models in ever-evolving environments. In detail, we use Self-Training (ST),
a subset of SSL, where we leverage asymmetric thresholds to prioritize malware samples and increase
the effectiveness of the pseudo-labeling process. In addition, we evaluate an extension of ST to Active
Learning (AL) techniques to improve the re-training process by incorporating small batches of labeled
samples, and analyze its effectiveness. Our results show that the asymmetric thresholds self-training
method outperforms traditional SSL techniques, achieving robust F1 scores even without additional
labeled data from AL. By using approximately 10% of pseudo-labels per retraining phase, the method
significantly reduces dependence on manual labeling while maintaining high accuracy, demonstrating
its efficiency for scalable Android malware detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>We start our discussion by providing the background on Android applications and malware in Sect. 2.1,
and then present the related machine-learning techniques for detecting Android malware in Sect. 2.2.
We finally conclude by discussing the issues of concept drift in Sect. 2.3.</p>
      <sec id="sec-2-1">
        <title>2.1. Android OS Applications</title>
        <p>Applications for the Android Operating System are packaged and installed using Android Application
Package (APK) files. These files, with a .apk extension, serve as archives containing all the components
required for the application, including source code, resources, assets, and metadata.3 A key element
of an APK is the AndroidManifest.xml file, which provides essential information for the operating
system to install and manage the app, such as the app’s name, version, required permissions, and main
components. The APK may also include one or more classes.dex files containing the app’s Java
or Kotlin source code compiled into Dalvik bytecode; a res directory housing various .xml resource
files for user interfaces, constant strings, multimedia, and more; a lib directory with
architecture-specific compiled code, such as native C/C++ libraries; an assets folder for general-purpose files; and
a META-INF directory that stores metadata like certificates and signatures.</p>
        <p>Android Malware. The open and flexible nature of the Android operating system makes it an attractive
target for malicious actors. Android malware often appears as a legitimate .apk file, taking advantage
of known vulnerabilities, weak permission controls, and unauthorized app stores. Once inside, it can
manipulate core components such as AndroidManifest.xml, request unwarranted permissions, or
alter classes.dex files to insert harmful functionality. After installation, such malware may collect
personal information, encrypt files for ransom, display intrusive advertisements, or masquerade as
trusted applications. By exploiting the system’s packaging, installation, and execution model, Android
malware effectively turns the platform’s strengths into its own opportunities for malicious activity.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Machine Learning-based Android Malware Detection</title>
        <p>
          The vast number of applications and malware samples released each year necessitates a high level of
analysis and detection efficiency. Traditional signature-based methods are ineffective
against sophisticated malware that employs techniques such as obfuscation, app
repackaging, dynamic code loading, encryption, and malware dropping, and that can easily change its signature
without affecting the malware functionality [5]. Moreover, these methods struggle with previously
unknown threats and require constant updates to remain effective. To address these limitations, machine
learning has emerged as a promising approach for Android malware detection, leveraging features
extracted through static [
          <xref ref-type="bibr" rid="ref1">1, 6</xref>
          ], dynamic [7, 8], or hybrid analysis [9]. While dynamic analysis provides
insights into application behaviour and can prevent evasion tactics, static analysis remains important
due to its efficiency and low computational overhead.
        </p>
        <p>
          Drebin. In this study, we build upon the well-known Drebin malware detection framework [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which
utilizes features extracted statically from Android application files. Drebin defines eight feature sets
derived from two key sources: the AndroidManifest.xml and classes.dex files. These features
are summarized in Table 1.
        </p>
        <sec id="sec-2-2-1">
          <p>3: https://developer.android.com/guide/topics/manifest/manifest-intro</p>
          <p>The feature sets 1, 2, 3, and 4 are extracted from the AndroidManifest.xml file, while 5,
6, 7, and 8 are obtained from the classes.dex files. In our approach, these features, represented
as text, are transformed into numerical data using a Term Frequency-Inverse Document Frequency
(TF-IDF) vectorizer. The TF-IDF vectorizer converts the features extracted from each sample into
a vector whose dimensionality equals the number of distinct features in the dataset, where each index
corresponds to a specific feature. The value at each index
represents the TF-IDF weight of the corresponding feature, reflecting its importance within the dataset.
This method considers not only the frequency of a term within a specific document (term frequency,
TF) but also its rarity across the entire dataset (inverse document frequency, IDF). By assigning higher
importance to terms that are frequent in a given document but rare across the dataset, the TF-IDF
approach highlights the most informative features for analysis.</p>
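          <p>As an illustration, the TF-IDF weighting described above can be sketched with Scikit-learn’s TfidfVectorizer; the feature strings below are hypothetical stand-ins for Drebin features, not taken from a real APK.</p>

```python
# Minimal sketch: vectorizing Drebin-style string features with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

# Each app is represented by its extracted features, joined into one string.
apps = [
    "permission::SEND_SMS api_call::getDeviceId intent::BOOT_COMPLETED",
    "permission::INTERNET api_call::getDeviceId",
    "permission::INTERNET activity::MainActivity",
]

# token_pattern \S+ keeps each whole feature string as a single token.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(apps)  # sparse matrix: n_apps x n_distinct_features
```

          <p>Features that appear in a single app receive a higher IDF weight than features shared by many apps, matching the intuition described above.</p>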
          <p>Detector Performance Over Time. Although Drebin and other machine learning-based detectors
have demonstrated impressive performance, subsequent studies have revealed that their evaluations
often neglected the temporal evolution observed in both legitimate and malicious applications [10]. Each
new Android OS version introduces new functionalities and deprecates or removes
others. This leads to changes in the previously defined features, rendering some features obsolete so
that they no longer affect the classification outcome.</p>
          <p>These shifts violate the so-called independent and identically distributed (i.i.d.) assumption between
training and test datasets in traditional learning-based approaches, necessitating temporally-aware
evaluations. Rather than using random splits, data should be partitioned chronologically, training
models on older samples and testing on newer ones. Previous research shows that model performance
under such temporally-aware evaluations declines significantly over time [10]. Therefore, it is essential
to develop new techniques that improve classifiers’ robustness to data drift or detect when they become
obsolete.</p>
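          <p>The chronological partitioning described above can be sketched as follows; the timestamps and cutoff date are illustrative.</p>

```python
# Minimal sketch: train on older samples, test on newer ones (no random split).
from datetime import date

samples = [
    {"app": "a.apk", "ts": date(2017, 3, 1)},
    {"app": "b.apk", "ts": date(2018, 7, 1)},
    {"app": "c.apk", "ts": date(2020, 2, 1)},
    {"app": "d.apk", "ts": date(2021, 5, 1)},
]

cutoff = date(2020, 1, 1)  # train on everything observed before the cutoff
train = [s for s in samples if s["ts"] < cutoff]
test = [s for s in samples if s["ts"] >= cutoff]

# Every training sample predates every test sample, respecting temporal order.
assert max(s["ts"] for s in train) < min(s["ts"] for s in test)
```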
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Countering Concept Drift</title>
        <p>While the detectors’ performance has been shown to deteriorate over time, multiple solutions have
been proposed to address such issues, as described in this section.</p>
        <p>Semi-Supervised Learning. Semi-supervised learning (SSL) is used to address the challenges posed
by data drift, specifically the scarcity of labeled data. SSL leverages the large amounts of unlabeled data
available in malware detection by generating pseudo-labels for unlabeled samples. These pseudo-labels,
created using the model’s predictions, are then incorporated into the training process, expanding the
dataset without additional labeling costs. A commonly used SSL approach is self-training, where the
model iteratively retrains itself on the most confidently pseudo-labeled samples. Enhancements, such
as the use of asymmetric thresholds to prioritize the inclusion of malware samples, further improve
SSL’s effectiveness by ensuring the retraining process targets the most critical and variable aspects
of the malware distribution [11]. While recent work has shown the effectiveness of SSL for malware
detectors, these approaches often involve costly updates relying on ensembles of models [4] or are
limited to neural network architectures [12]. We instead investigate model-agnostic SSL techniques,
focusing on the popular Drebin approach features.</p>
        <p>Active Learning. Active learning (AL) is a technique designed to optimize the labeling process by
selectively querying the most informative samples from the dataset. In the context of malware detection,
AL is particularly useful for addressing data drift, as it allows the model to be updated incrementally
with minimal manual labeling effort. Common strategies in AL include uncertainty sampling, which
prioritizes samples where the model has the least confidence in its predictions, and random sampling,
which serves as a baseline by selecting a random subset of samples. These approaches ensure that
labeling resources are focused on the most impactful data points, thereby enhancing the model’s ability
to adapt to changes in the data distribution over time [13].</p>
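        <p>Uncertainty sampling can be sketched as follows: given predicted malware probabilities for the unlabeled pool, select the samples whose prediction is closest to 0.5. The function name and budget value are illustrative.</p>

```python
# Minimal sketch of uncertainty sampling for active learning.
import numpy as np

def uncertainty_sample(probs, budget):
    """probs: predicted malware probabilities for the unlabeled pool."""
    uncertainty = -np.abs(probs - 0.5)        # higher value = less confident
    return np.argsort(uncertainty)[-budget:]  # indices of least confident samples

probs = np.array([0.95, 0.51, 0.10, 0.48, 0.99])
queried = uncertainty_sample(probs, budget=2)  # samples nearest 0.5 (indices 1, 3)
```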
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Self Training with Asymmetric Thresholds</title>
      <p>In this section, we first describe, in Sect. 3.1, the semi-supervised learning method employed to efficiently
re-train the model, focusing on addressing the challenges posed by concept drift. Then, in Sect. 3.2, we
describe the approach used to address the class imbalance between malware and goodware.</p>
      <sec id="sec-3-1">
        <title>3.1. Yarowsky Algorithm</title>
        <p>The base algorithm, called Self-Training (ST) [14], operates by leveraging the most confidently predicted
samples as pseudo-labels, which are then used to update the model during the re-training phase. This
approach eliminates the need for additional labeled samples. However, it often proves insufficient for
re-training the classifier, as it tends to favour the majority class (goodware), thereby exacerbating class
imbalance. This limitation arises because the highest-confidence predictions are more likely to belong
to the majority class, reducing the focus on the minority class (malware). To address these issues, we
experiment with an alternative method that uses asymmetric thresholds to prioritize the minority class
during the selection of pseudo-labels.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Asymmetric thresholds</title>
        <p>We address class imbalance through asymmetric thresholds, prioritizing the selection of malware
samples over goodware. In fact, unlike traditional approaches that use a uniform threshold for all classes,
this method assigns distinct thresholds to goodware and malware. Specifically, the algorithm calculates
two separate lists of predicted scores: one for samples classified as malware by the model and another
for those classified as goodware. For the malware list, a lower threshold is set, allowing a greater
number of malware samples to be selected. Conversely, for the goodware list, a higher threshold is
applied to limit the number of goodware samples included.</p>
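        <p>A minimal sketch of this selection step, assuming calibrated malware probabilities; the threshold values and function name are purely illustrative.</p>

```python
# Pseudo-label selection with asymmetric thresholds: a lower confidence bar
# for malware than for goodware, so more malware samples enter the pool.
import numpy as np

def select_pseudo_labels(probs, t_malware=0.8, t_goodware=0.99):
    """probs: calibrated malware probabilities for the unlabeled samples.
    Returns indices and pseudo-labels of the samples confident enough to keep."""
    malware_idx = np.where(probs >= t_malware)[0]          # lower bar: more malware kept
    goodware_idx = np.where(1.0 - probs >= t_goodware)[0]  # higher bar: fewer goodware kept
    idx = np.concatenate([malware_idx, goodware_idx])
    labels = np.concatenate([np.ones_like(malware_idx), np.zeros_like(goodware_idx)])
    return idx, labels

probs = np.array([0.85, 0.60, 0.005, 0.95, 0.02])
idx, labels = select_pseudo_labels(probs)  # keeps two malware and one goodware sample
```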
        <p>This strategy ensures a higher proportion of malware samples are incorporated, while minimizing the
inclusion of goodware samples. Since the initial training dataset already contains a substantial amount
of goodware, and recent studies suggest goodware tends to exhibit fewer changes over time, adding
further goodware samples contributes little to improving the model [15]. The rationale for this approach is
that, while malware samples are less frequent, they hold significantly greater value in enhancing the
classifier’s performance. By lowering the threshold for malware and raising it for goodware, the model
focuses on malware instances, improving its ability to detect emerging threats while avoiding overfitting
to the abundant but less informative goodware samples. To counteract possible inaccuracies of the
labeling process, which are common for SSL algorithms, we additionally evaluate the integration of active
learning strategies and analyze their performance with respect to SSL approaches.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setting</title>
        <p>In this section, we describe the experimental setup designed to evaluate the proposed methodology,
and then present the results obtained from the experiments along with insights gained from the analysis.
The setup simulates a real-world scenario, concentrating on the challenges posed by data drift and
imbalanced datasets.</p>
        <p>Dataset. The datasets are taken from the Android malware detection competition hosted in the ELSA
Cybersecurity Use Case (https://github.com/pralab/elsa-cybersecurity). The applications are sampled from the AndroZoo [16] collection of Android
applications, which contains (at the time of writing) over 24 million samples collected from different
sources. The sampling is performed based on analysis reports from VirusTotal, from which a timestamp
(from the first_submission_date field) and a binary label are extracted. A negative label is assigned
to samples that have no detections from the VirusTotal antimalware engines, whereas a positive label
is assigned to samples that are detected by at least 10 engines and can be uniquely assigned to a
malware family by the avclass tool5, effectively discarding any grayware application that has fewer than
10 detections or an uncertain label. Moreover, a proportion of 9:1 between legitimate and malware
samples is kept, as suggested in previous works [10]. We rely on the provided training set and the test
sets of the deployed “Track 3: Temporal Robustness to Data Drift”, consisting of a total of 137,500
samples, of which 123,750 are goodware and 13,750 malware, sampled between January 2017 and June
2022. Specifically, the initial training set is composed of 75,000 samples collected between January 2017
and December 2019. The subsequent test sets, each spanning a three-month period (one quarter), are
sampled starting from January 2020 until June 2022, resulting in a total of 10 distinct test sets.</p>
        <p>Models. The experiments in this study were conducted using a Calibrated Linear Support Vector
Machine (SVM) model. To enable probability calibration, we employed the CalibratedClassifierCV
class from Scikit-learn. Calibration is particularly important when dealing with imbalanced data,
as models trained on such datasets often produce biased or poorly calibrated probability estimates,
favoring the majority class. By calibrating the model, we aim to improve the reliability of the predicted
probabilities, ensuring they better reflect the actual likelihoods and enhancing decision-making in
downstream tasks. After extensive testing, we selected isotonic regression as the optimal calibration
method, setting the number of folds in the inner cross-validation process to 10. This configuration
yielded the best performance in terms of calibration accuracy and overall model effectiveness. For the
Linear SVC model, we utilized the Hinge Loss function with a regularization parameter of 0.1.
This specific configuration provided a favorable balance between model complexity and generalization,
as determined by our experimental results. Additionally, we used the TF-IDF vectorizer for
feature tokenization, described in Sect. 2, which is implemented in the Scikit-learn library.</p>
        <p>Testing over Time. To support the claim of this study, we trained the model using the dataset from
the first three years (referred to as TR). The remaining data was then divided into consecutive sets
covering 3-month periods (quarters). The subsequent quarters were used as test sets TS<sub>i</sub>. This method allows
us to track changes in the data across each test set and assess the model’s performance in the context of
temporal drift.</p>
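        <p>The model configuration described above can be sketched with Scikit-learn as follows; this is a schematic reconstruction, and the variable names and the commented training calls are illustrative.</p>

```python
# Linear SVM with hinge loss (regularization parameter 0.1), calibrated with
# isotonic regression over 10 inner folds, on top of TF-IDF features.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

svm = LinearSVC(loss="hinge", C=0.1)
clf = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+"),
    CalibratedClassifierCV(svm, method="isotonic", cv=10),
)
# clf.fit(train_texts, train_labels)           # texts: joined feature strings
# probs = clf.predict_proba(test_texts)[:, 1]  # calibrated malware probabilities
```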
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>In this subsection, we present the results of our experiments on the proposed methodology. The
results are analyzed in the context of temporal drift, showing the efficacy of semi-supervised learning
techniques and the results of the integration with active learning.</p>
        <p>Evaluation. The baseline plots in this study serve as reference points for evaluating the effectiveness
of the implemented methods. Baseline plots are calculated for each different setting. The worst-case
scenario, or lower bound, represents a situation where the model is never retrained over time. This
acts as the lower bound because if a method yields results below this curve, it is deemed ineffective; in
such cases, doing nothing would be a better option. On the other hand, the best-case scenario, or
upper bound, represents the optimal outcome where the model is continuously retrained with the
correct labels. This upper bound reflects a scenario where the model is regularly updated with new,
accurate labels every quarter. Ideally, an effective experiment should yield results that fall between the
lower and upper bounds, with a tendency toward the upper bound.</p>
        <p>Self-Training Phase. The results from the self-training phase are summarized in a plot that showcases
the F1 scores on the y-axis, reflecting the performance of various semi-supervised learning (SSL)
techniques, with the quarters displayed on the x-axis. Each curve in the plot corresponds to a specific
SSL method: the green curve indicates the asymmetric thresholds SSL method, while the red, purple,
and brown curves represent the Scikit-learn SSL techniques. The figures allow for a comparison of these
methods under two scenarios: one with no labeled samples from active learning (0 labeled samples) and
another that includes 200 labeled samples obtained through active learning with random sampling. We
integrate with Active Learning to simulate a more realistic scenario where the cost of labeling
is high, and the model’s performance is expected to improve. Visual comparisons for the TF-IDF
vectorizer are illustrated in Figure 2 (for 0 labeled samples) and Figure 3 (for 200 labeled samples).</p>
        <p>From our analysis of the plots, we can highlight two key results. First, the asymmetric thresholds
SSL method, represented by the green curve, consistently outperforms the Scikit-learn SSL techniques,
regardless of the presence of active learning samples. Second, the performance of the asymmetric
method remains stable when transitioning from 0 to 200 labeled samples added in the re-training phase,
indicating that the addition of labeled samples through active learning does not enhance its effectiveness.
Conversely, the Scikit-learn SSL methods, illustrated by the red, purple, and brown curves, show a
slight improvement in performance as we move from 0 to 200 labeled samples. Nevertheless, even with
these improvements, these methods still have worse performance than the asymmetric thresholds SSL
method in both scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>This study explores the challenges of retraining classifiers to manage evolving threats while minimizing
the use of labeled samples. It particularly focuses on semi-supervised learning (SSL) methods combined
with active learning (AL) techniques. Among the various approaches evaluated, the asymmetric
thresholds SSL method demonstrated the most robust performance over time. This method not only
achieved better performance consistently but also utilized only 10% of the samples during the retraining
phases, which included both pseudo-labels and selected labels from the AL process. It effectively
prioritized malware samples and avoided selecting redundant, high-confidence goodware samples.</p>
      <p>The findings suggest that the high performance of the asymmetric technique is closely linked to the
prioritization of malware samples. Although malware instances are less frequent, they have a greater
impact during the model update phase. This likely stems from the fact that malware evolves more
rapidly [15], enabling it to contribute more effectively to updates in the classifier compared to goodware
samples. The second insight is that employing random selection within active learning did not improve
the performance of semi-supervised learning (SSL), particularly when using fewer than 200 labeled
samples. This limitation arises from the imbalanced dataset, which often resulted in the selection of
majority-class goodware samples that contributed little to the classification process.</p>
      <p>These preliminary results appear to be a promising avenue for further exploration. Future directions
include trying this approach with different model retraining techniques. It may also be beneficial to
consider alternative approaches beyond SSL, such as contrastive learning techniques or neural networks,
which may offer better performance. Additionally, experimenting with more advanced active learning
techniques, such as using uncertain samples for labeling, could help identify the most informative data
points, while requiring fewer labels. This strategy can potentially enhance model performance and
resilience, especially in imbalanced datasets.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by: project SERICS (PE00000014), FAIR (PE00000013, CUP:
J23C24000090007) and SETA (PNRR M4.C2.1.1 PRIN 2022 PNRR, Cod. P202233M9Z, CUP
F53D23009120001, Avviso D.D. 1409 14.09.2022) under the MUR National Recovery and Resilience
Plan funded by the European Union - NextGenerationEU.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <sec id="sec-7-1">
        <title>References</title>
        <p>[2] Y. Chen, Z. Ding, D. Wagner, Continuous learning for android malware detection, in: 32nd USENIX
Security Symposium (USENIX Security 23), USENIX Association, Anaheim, CA, 2023, pp. 1127–
1144. URL: https://www.usenix.org/conference/usenixsecurity23/presentation/chen-yizheng.
[3] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, G. Weiss, Learning under concept drift: A review, IEEE
Transactions on Knowledge and Data Engineering 31 (2020) 2346–2363. URL: https://ieeexplore.
ieee.org/document/8496795. doi:10.1109/TKDE.2018.2876857.
[4] Z. Kan, F. Pendlebury, F. Pierazzi, L. Cavallaro, Investigating labelless drift adaptation for malware
detection, in: Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security,
AISec ’21, Association for Computing Machinery, New York, NY, USA, 2021, p. 123–134. URL:
https://doi.org/10.1145/3474369.3486873. doi:10.1145/3474369.3486873.
[5] V. Rastogi, Y. Chen, X. Jiang, Droidchameleon: Evaluating android anti-malware against
transformation attacks, in: Proceedings of the 8th ACM SIGSAC Symposium on
Information, Computer and Communications Security, ACM, New York, NY, USA, 2013, pp. 329–334.
doi:10.1145/2484313.2484355.
[6] L. Onwuzurike, E. Mariconti, P. Andriotis, E. D. Cristofaro, G. Ross, G. Stringhini, Mamadroid:
Detecting android malware by building markov chains of behavioral models (extended version),
ACM Trans. Priv. Secur. 22 (2019). URL: https://doi.org/10.1145/3313391. doi:10.1145/3313391.
[7] A. Saracino, D. Sgandurra, G. Dini, F. Martinelli, Madam: Effective and efficient behavior-based
android malware detection and prevention, IEEE Transactions on Dependable and Secure Computing
15 (2018) 83–97. doi:10.1109/TDSC.2016.2536605.
[8] H. Cai, N. Meng, B. Ryder, D. Yao, Droidcat: Effective android malware detection and categorization
via app-level profiling, IEEE Transactions on Information Forensics and Security 14 (2019) 1455–
1470. doi:10.1109/TIFS.2018.2879302.
[9] M. Spreitzenbarth, T. Schreck, F. Echtler, D. Arp, J. Hoffmann, Mobile-sandbox: combining static
and dynamic analysis with machine-learning techniques, Int. J. Inf. Secur. 14 (2015) 141–153. URL:
https://doi.org/10.1007/s10207-014-0250-0. doi:10.1007/s10207-014-0250-0.
[10] F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, L. Cavallaro, {TESSERACT}: Eliminating
experimental bias in malware classification across space and time, in: 28th USENIX security
symposium (USENIX Security 19), 2019, pp. 729–746.
[11] J. E. van Engelen, H. H. Hoos, A survey on semi-supervised learning, Machine
Learning 109 (2020) 373–440. URL: https://doi.org/10.1007/s10994-019-05855-6. doi:10.1007/
s10994-019-05855-6.
[12] M. T. Alam, R. Fieblinger, A. Mahara, N. Rastogi, Morph: Towards automated concept drift
adaptation for malware detection, 2024. URL: https://arxiv.org/abs/2401.12790. arXiv:2401.12790.
[13] F. Cacciarelli, M. Kulahci, Active learning for data streams: A survey, Machine Learning 113
(2024) 45–72. URL: https://link.springer.com/article/10.1007/s10994-023-06454-2. doi:10.1007/
s10994-023-06454-2.
[14] D. Yarowsky, Unsupervised word sense disambiguation rivaling supervised methods, in:
Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, ACL ’95,
Association for Computational Linguistics, USA, 1995, p. 189–196. URL: https://doi.org/10.3115/
981658.981684. doi:10.3115/981658.981684.
[15] L. Minnei, H. Eddoubi, A. Sotgiu, M. Pintor, A. Demontis, B. Biggio, Data drift in android malware
detection, in: International Conference on Machine Learning and Cybernetics, ICMLC, IEEE, 2024.
[16] K. Allix, T. F. Bissyandé, J. Klein, Y. Le Traon, Androzoo: Collecting millions of android apps for
the research community, in: Proceedings of the 13th International Conference on Mining Software
Repositories, MSR ’16, ACM, New York, NY, USA, 2016, pp. 468–471. URL: http://doi.acm.org/10.
1145/2901739.2903508. doi:10.1145/2901739.2903508.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Arp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spreitzenbarth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hubner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gascon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rieck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Siemens</surname>
          </string-name>
          ,
          <article-title>Drebin: Effective and explainable detection of android malware in your pocket</article-title>
          .,
          <source>in: Ndss</source>
          , volume
          <volume>14</volume>
          ,
          <year>2014</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>