<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Conference on Cybersecurity, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Explaining Machine Learning DGA Detectors from DNS Traffic Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Piras</string-name>
          <email>giorgio.piras@unica.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maura Pintor</string-name>
          <email>maura.pintor@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Demetrio</string-name>
          <email>luca.demetrio93@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Battista Biggio</string-name>
          <email>battista.biggio@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pluribus One S.r.l.</institution>
          ,
          <addr-line>Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Cagliari</institution>
          ,
          <addr-line>Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>0</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>One of the most common causes of lack of continuity in online systems is a widely popular cyber attack known as Distributed Denial of Service (DDoS), in which a network of infected devices (a botnet) is exploited to flood the computational capacity of services through the commands of an attacker. The attack leverages the Domain Name System (DNS) through Domain Generation Algorithms (DGAs), a stealthy connection strategy that nonetheless leaves suspicious data patterns. To detect such threats, advances in their analysis have been made. Most of them rely on Machine Learning (ML), which can be highly effective in analyzing and classifying massive amounts of data. Although strongly performing, ML models have a certain degree of obscurity in their decision-making process. To cope with this problem, a branch of ML known as Explainable ML tries to break down the black-box nature of classifiers and make them interpretable and human-readable. This work addresses the problem of Explainable ML in the context of botnet and DGA detection and, to the best of our knowledge, is the first to concretely break down the decisions of ML classifiers when devised for botnet/DGA detection, therefore providing global and local explanations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        During the last few decades, our day-to-day lives have become tightly connected to the usage of devices
and online services, making their efficiency and continuity play a crucial role in
the technological transformation we are witnessing. Likewise, the economic loss derived from
cyberthreats has increased exponentially in recent years [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as technologies continually evolve and
attackers hone their skills. One of the most common ways cybercriminals try to jeopardize
the continuity of systems, and thus cause economic damage, is Denial of Service (DoS), which
aims to drain the computing capabilities of the target system in ways ranging from sophisticated to rudimentary.
A variant of this attack is the Distributed Denial of Service (DDoS), where a network of infected
devices (bots) is commanded by an attacker (botmaster) through a Command&amp;Control Server
(C&amp;C) [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. What happens to be erratic, and thus detectable by a Machine Learning (ML)
model, in this kind of attack is the DNS traffic, carrying the domain names through which bots
connect to the C&amp;C server. This stealthy connection strategy is commonly known as Domain
Fluxing, and the algorithms used by the infected bots to generate the domains are known as
Domain Generation Algorithms (DGAs).
      </p>
      <p>
        Although employing ML models to detect the presence of botnets within network traffic has
been demonstrated to be successful, almost all relevant works have followed a
common baseline and workflow, presenting a partially novel feature set on which to train a
classifier and obtain relevant results [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 7, 8</xref>
        ]. The proposed approaches lack interpretability
and contextualization. First, depending on the context from which the DNS traffic data is extracted
and in which the model is deployed, potential attackers might have control over some features. Second,
the model's prioritization and general usage of the features in the decision process are not known
beforehand, making the process challenging to debug and protect.
      </p>
      <p>
        To address these problems, we first analyze the techniques used to detect botnets/DGAs
from DNS data (Section 2); we then analyze which explainability techniques can provide insight
into how the model makes its decisions (Section 3). Upon a re-implementation of the EXPOSURE
system [
        <xref ref-type="bibr" rid="ref5 ref9">9, 5</xref>
        ] (Section 4), we provide the following contributions: (i) we build and test the
EXPOSURE system on a newly collected dataset; (ii) we observe statistics on the features used
by the system; (iii) we train different classifiers and compare their performances; (iv) we obtain
explanations from such classifiers; and (v) given the explanations, we develop and discuss an
analysis of the features used by the systems mentioned above. Finally, we conclude the work
by presenting related works (Section 5), limitations, and future directions (Section 6).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background: DNS System and ML Techniques</title>
      <p>
        From DNS to DGA. The Domain Name System (DNS) is a distributed database responsible for mapping
domain names to IP addresses, answering queries issued by clients for a domain
name with the corresponding IP addresses. This action is commonly known as resolution [10]. The DNS
organizes domain names into a hierarchy (through dot-separated levels), as the whole technology
itself builds a hierarchical database structure. The information stored and carried by DNS
records includes A records, returning IPv4 addresses; NS records, returning authoritative name
servers; and finally PTR (pointer) records, which return a domain name in
the reverse query format (i.e., the question starts from an IP rather than a domain name). It
is also worth citing specific information carried by DNS packets, such as the Time-to-Live
(TTL), which indicates how long a resolver may cache that record [11, 10]. Being of paramount
importance for the correct functioning of basic internet activities, the DNS is the perfect target
for malicious activities with a high impact on unaware users. That is why this technology
gets exploited by attackers (botmasters) who aim to command and control a network of infected
machines, i.e., a botnet. To remain as covert as possible, a botnet cannot rely on a single domain
name to contact, since vigilant authorities would quickly take it down. That is why bots
generate massive DNS traffic while trying to connect to a much more concealed C&amp;C server. The
generation of such a significant amount of domain names happens through Domain Generation
Algorithms that, given a random seed, create a string that will possibly establish a connection.
Botnet Detection with ML: The EXPOSURE system. Seizing the chance to detect malicious
patterns, the research community has driven its efforts towards analyzing DNS data,
extracting features, and eventually training a ML model capable of distinguishing malicious and
benign DNS behaviors. The EXPOSURE system [
        <xref ref-type="bibr" rid="ref5 ref9">9, 5</xref>
        ] is among the most prominent works for
its completeness in the feature set and its reproducibility (in terms of feature extraction). For this
reason, we use it as a base for our explainability analysis. Table 1 shows the feature set, listing
the features extracted by the EXPOSURE system (whose extensive description can be found
in the original work [
        <xref ref-type="bibr" rid="ref5 ref9">9, 5</xref>
        ]). The entire set is subdivided into Time-Based Features (collecting
temporal patterns from the queries to the domains), DNS-Answer-Based Features (patterns from
the answer records), TTL-Value-Based Features (statistical patterns from the TTL values), and,
finally, Name-Based Features (statistical patterns from the domain name). Given a collection of
DNS packets, we can compose a training set of benign and malicious samples to train the
classifier, as Bilge et al. did in EXPOSURE. Our work will focus on reproducing the experiment with
our newly-collected traffic and applying explainability techniques to understand the patterns
employed by the model for detecting malicious activities.
      </p>
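      <p>To make the mechanism concrete, the following is a minimal, purely illustrative sketch of a seeded DGA. It does not reproduce any real malware family's algorithm; the hashing scheme, parameters, and fixed .com suffix are our own assumptions. Because the algorithm is deterministic, bots and botmaster can independently derive the same candidate rendezvous domains for a given day.

```python
import hashlib

def toy_dga(seed: str, day: str, n: int = 5, length: int = 12):
    """Illustrative seeded DGA: derive pseudo-random domains from a shared seed.

    Bots and botmaster run the same deterministic algorithm, so both
    sides can compute the day's candidate rendezvous domains offline.
    """
    domains = []
    for i in range(n):
        digest = hashlib.sha256(f"{seed}|{day}|{i}".encode()).hexdigest()
        # map hex digits onto lowercase letters to obtain a domain label
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:length])
        domains.append(label + ".com")
    return domains
```

The botmaster registers only a few of these candidates, while the bots query all of them, which produces exactly the suspicious bulk DNS traffic discussed above.</p>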
    </sec>
    <sec id="sec-3">
      <title>3. Explaining Predictions of ML-based DNS Analysis</title>
      <p>As pointed out by Miller et al. [12], explanations increase transparency and interpretability,
so that users and system designers can jointly benefit from the resulting gain of trust. In
security-relevant scenarios, like the one we are considering, understanding the data and the
model provides the added benefit of helping to spot problems in the system, for
example a model assigning high relevance to spurious features that should not influence it to that
extent [13].</p>
      <p>Analyzing the dataset’s statistics provides further insights into the separability of the features
into the two different classes. Additionally, in the case under investigation, many features
come from similar sources and elaborations, which makes statistical analysis handy for
highlighting correlations and redundancies.</p>
      <p>On top of that, we will use a ML model to analyze such features and categorize the samples
into the two output classes of benign and malicious domains. Model explanations can help
understand how the model is making such decisions. An explanation is said to be local if it
is made on single samples and wants to describe how a model emphasizes the features of a
specific single sample in its classification. On the other hand, global explanations are made over
entire datasets or relevant collections of samples to describe how the model prioritizes features
over those samples [14]. This work will focus on both local and global explanations.</p>
      <p>In [15], Lundberg and Lee proposed SHAP (SHapley Additive exPlanations), where feature
importance is computed with an additive approach, representing a unified measure of feature
importance. The basic concept behind SHAP comes from Shapley values and a game-theoretic
setting, where the features act as players cooperating in a coalitional game (i.e., the prediction
task) to receive a profit (i.e., a gain, which is the actual prediction). Shapley values
assign payouts to players depending on their contribution to the total payout [16]. Thus,
the contribution of each feature to the prediction task is computed as the sum of its expected
marginal contributions over all feature value combinations. Given the computational burden of
finding all the possible feature combinations, Lundberg and Lee proposed a
Shapley kernel that produces estimates instead of exact values.</p>
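      <p>For a small number of features, the Shapley values described above can be computed exactly. The sketch below does so for an arbitrary prediction function against a baseline sample; it is a toy from-scratch illustration of the formula, not the kernel-based estimator that SHAP actually uses.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for a prediction f(x) against a baseline.

    Each feature i is credited with its average marginal contribution
    over all coalitions S of the remaining features, weighted by
    |S|!(n-|S|-1)!/n! as in the Shapley formula.
    """
    n = len(x)

    def point(coalition):
        # features in the coalition take their value from x, the rest
        # are replaced by the baseline (the "absent player" convention)
        return [x[j] if j in coalition else baseline[j] for j in range(n)]

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for S in combinations(others, size):
                S = set(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += weight * (f(point(S | {i})) - f(point(S)))
        phis.append(phi)
    return phis
```

By the efficiency property, the values sum to the difference between the prediction on the sample and the prediction on the baseline; the exponential number of coalitions is precisely why SHAP resorts to kernel-based estimates.</p>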
      <sec id="sec-3-1">
        <title>3.1. Data Analysis and Explainability Techniques</title>
        <p>This section will briefly explain the details of, and differences between, the data analysis and
explainability tools that will play a central role in the experiments section.</p>
        <p>Feature Statistical Analysis. These plots show the marginal distributions of every pair of
features as density plots, describing how the distributions behave for the two classes. Through the
scatter plots, instead, we can assess where both benign and malicious samples lie in their
ad-hoc feature space, thus making us capable of understanding to what extent pairs of features
separate the data. Analyzing the scatter plots lets us observe the feature distributions
and get a rough idea of how, and to what extent, they will discriminate between the classes.
Partial Dependence Plot. This plot shows the marginal effect that a single feature has on
the prediction made by the model, thus providing global explanations. Taking as input the
model, the feature, and a background distribution on which to estimate the
feature's importance, Partial Dependence Plots (PDPs) depict the feature values on the x-axis,
whilst the y-axis represents the expected prediction contribution given the feature value. In
the background, a histogram shows the underlying data distribution of the feature values.
A horizontal line represents the expected contribution to the prediction, and a vertical one
represents the expected value of the feature. By reading this plot, we can measure how the
observed feature contributes to the classification of the samples.</p>
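        <p>The partial dependence computation just described can be sketched as follows, assuming the model is exposed as a plain callable over feature lists; rendering of the axes and the background histogram is omitted.

```python
def partial_dependence(model, background, j, grid):
    """Partial dependence of feature j.

    For each grid value v, fix feature j to v across the whole
    background set and average the model's predictions; the resulting
    curve is the marginal effect plotted on a PDP's y-axis.
    """
    curve = []
    for v in grid:
        preds = [model(x[:j] + [v] + x[j + 1:]) for x in background]
        curve.append(sum(preds) / len(preds))
    return curve
```

Averaging over the background is what makes the result a global explanation: it marginalizes out all the other features instead of describing a single sample.</p>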
        <p>Summary Plot. The SHAP summary plot, which like the PDP is a global explanation technique,
shows how the model prioritizes the features and how they contribute to steering the
classification towards each class. This plot comprises a list of features ordered from the one giving the
highest contribution to the least influential as interpreted by the model, showing the magnitude
for benign (in blue) and malicious (in red) samples.</p>
        <p>Force Plot. This technique is one of the local explainability methods provided by SHAP. It
explains why a specific sample has been assigned a particular label. This can be useful for
understanding why samples are misclassified and to what extent the classifier misunderstands
them. Force plots, showing the magnitude of the feature contributions on single samples, are
rendered as blue arrows indicating magnitudes pushing towards the benign class, and red arrows
pushing towards the malicious class.</p>
        <p>In the next section, we will use the presented techniques to explain predictions of our
re-implementation of the EXPOSURE system.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In the experimental section of this work, we will first describe our re-implementation of the
EXPOSURE system. Then we will discuss the DNS traffic data we used for feature
extraction, followed by a brief model selection made to improve the system’s performance.
Eventually, we will delve into the results to show how explanations applied in this
context can bring the analysis to the next level.</p>
      <sec id="sec-4-1">
        <title>4.1. Re-implementation of the EXPOSURE system</title>
        <p>Dataset. The DNS traffic was collected from recursive servers on which, through sniffers,
we were able to save the data as .pcap files for the entire month of January 2021. Given the
massive amount of traffic, summing up to 15 GB of data per day, we filtered out packets whose
label was not known to either blacklists or whitelists, as well as domain names that did not resolve
(NXDOMAIN as response code). We used the list of the most popular suffixes from the Alexa website
to label benign domains, and the list from DGArchive [17] to flag the malicious samples. The
remaining packets (203,034 domains, of which 25,882 benign and 177,152 malicious - note that
benign domains re-appear much more frequently than the malicious ones, which are almost
always unique) have then been passed through the feature extractor we implemented, and are
distributed across days as shown in Figure 1.</p>
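        <p>The labeling logic can be sketched as follows; the function signature and the set-based list representations are our own simplifications of the filtering pipeline described above.

```python
def label_domain(domain: str, rcode: str, benign_suffixes, dga_blacklist):
    """Assign a training label to a DNS record, or None to discard it.

    Mirrors the filtering described above: unresolved queries (NXDOMAIN)
    and domains unknown to both lists are dropped from the dataset.
    """
    if rcode == "NXDOMAIN":
        return None
    if domain in dga_blacklist:
        return "malicious"
    if any(domain == s or domain.endswith("." + s) for s in benign_suffixes):
        return "benign"
    return None
```

Returning None for unknown domains is what keeps the training set restricted to samples with a trustworthy ground truth.</p>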
        <p>Model Selection. The authors of the original EXPOSURE work used a J48 Decision Tree,
obtaining overall good performances. We additionally benchmarked several models: Decision
Tree (DT), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), AdaBoost (ADA), and
Random Forest (RF). After estimating the hyperparameters through a grid search (the best
overall values are reported in Section 7), we compared the best models with the best parameters
on two different days (i.e., two different sample balances). The first ROC curve in Figure 2a
was obtained using a more balanced day of data (mid-January). The second set of
curves in Figure 2b was obtained from a day of data with very few malicious samples, showing
how the performances of the classifiers dropped consistently. Overall, throughout the
most balanced days, RF and ADA have shown to be the most consistent classifiers. For this
reason, they have been selected as classifiers for the rest of the experiments. We deem the
mid-capture days also more suitable for the rest of the analysis, and they have
thereby been used for all of the following experiments.</p>
        <p>Figure 2: (a) Classifiers trained on day 14, whose class distribution is unbalanced, but less
markedly than on the other days in the dataset. (b) The same classifiers trained on day 5, which
presents a highly unbalanced distribution of the two classes.</p>
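        <p>The grid search step can be sketched model-agnostically as follows; the scoring callable and the parameter grid in the test are illustrative placeholders, not the actual grids used in our experiments.

```python
from itertools import product

def grid_search(train_and_score, param_grid):
    """Exhaustively evaluate every hyperparameter combination.

    `train_and_score` is any callable mapping a parameter dict to a
    validation score (e.g., ROC AUC); the best-scoring configuration
    and its score are returned.
    """
    keys = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice, the score would come from training each candidate model and evaluating it on held-out data, exactly as done for the ROC comparison above.</p>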
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature and Explainability Analysis on EXPOSURE</title>
        <p>We now present our results on the analysis of the feature statistics and the insight obtained
by applying explainability techniques to interpret the decisions taken by the
machine-learning models used in our DGA detector. The proposed plots have been implemented
with the Python libraries Seaborn [18] and SHAP [19].</p>
        <p>Statistical Analysis. The statistical analysis gives an overview of the correlation and
distribution of the features. As shown in Figure 3a, some features, like %of_lms and num_chars%,
do not perfectly separate the data collection into the two classes when joined. In particular,
the %of_lms reaches a plateau for malicious domains once over 0.8 (bottom-right plot, depicting
the distribution of the feature), which describes how algorithmically-generated domains tend
not to have a single meaningful word covering their entire name in most cases, although
there are exceptions in both directions. This might be brought on by the diversity of malware
families: some, like the “Gameover” DGA, used to mix up numbers and characters, whereas
others, such as “Gozi”, used to mix up words from openly accessible documents, such as the US
Constitution [20]. In the plot of Figure 3b, depicting time-based features, malicious domains
show a more volatile behavior, which is reasonable considering the diversity of applications
in which they can be used. In Figure 8, shown in the Appendix, we can observe interesting TTL
behaviors characterizing the domains. Contrary to the now old-fashioned belief that a low TTL
is only typical of malicious domains [21, 22], as it makes malicious records persist less in caches,
we show that benign domains can also present this behavior depending on the application in
which they are used, e.g., to handle critical resources [23] or for load balancing purposes [22].
Figure 3: (a) Correlation between name-based features. (b) Correlation between time-based features.</p>
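        <p>The two name-based features discussed here can be approximated as follows; the tiny dictionary in the test stands in for a real word list, and the exact EXPOSURE definitions may differ slightly.

```python
def num_chars_ratio(domain: str) -> float:
    """num_chars%: fraction of numerical characters in the domain label."""
    label = domain.split(".")[0]
    return sum(c.isdigit() for c in label) / len(label)

def longest_meaningful_ratio(domain: str, dictionary) -> float:
    """%of_lms: length of the longest dictionary word contained in the
    label, normalized by the label length (brute-force substring scan)."""
    label = domain.split(".")[0]
    best = 0
    for i in range(len(label)):
        for j in range(i + 1, len(label) + 1):
            if label[i:j] in dictionary and j - i > best:
                best = j - i
    return best / len(label)
```

Dictionary-word DGAs such as “Gozi” would push %of_lms high, while digit-mixing families such as “Gameover” would raise num_chars% instead, matching the class overlap seen in the plots.</p>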
        <p>Interpreting Global Explanations. As pointed out in Section 3, different models can use
features in different ways. Global explanations can uncover these behaviors and make the analyst
aware of the feature prioritization a model applies. Decision-tree-based classifiers
such as ADA and RF, shown respectively in the summary plots of Figure 4a and Figure 4b, share four
out of the five top important features, which is likely a consequence of their “similar”
tree-based intrinsic nature. In both of them, unique_ips notably brings the highest contribution.
In Section 7, we show how other classifiers obtain a low magnitude from the unique_ips
feature, whilst prioritizing a diverse subset of TTL features. This further shows that it is not
possible to rely solely on statistical analysis to foresee the utilization of the features, as class
separations that at first glance look either weak or strong can be subverted.</p>
        <p>Figure 4: (a) SHAP summary plot of feature contributions for the ADA BOOST classifier.
(b) SHAP summary plot of feature contributions for the RANDOM FOREST classifier.</p>
        <p>Partial Dependence Plots show the marginal effect of a single feature globally on the
predictions. Considering a trained classifier (RF in the case of this analysis) and a background
distribution, through SHAP we can assess how the considered feature contributes to classifying
the background samples across its values. An advisable security-related use of these plots is
to employ a background distribution of malicious samples, thus analyzing to what extent
the feature values contribute to classifying the sample as malicious. The following plots (after
normalization) have been made using a background distribution of 1000 malicious samples on
the RF classifier. Figure 5a shows the PDP of the strongest feature of the model. The plot tends
to be a “gentle” step, producing the highest contribution at very low feature values and the
lowest at values going just subtly over the threshold. The number of changes in the TTL,
depicted in Figure 7 in Section 7, shows behavior very similar to that of the unique_ips feature.
These plots help, for example, in understanding the extent to which features contribute
to the classification of the domain as malicious, and possibly in setting policies and restrictions
based on easily tweakable features, such as the num_chars% in Figure 5b. In this plot, we
can see how a high rate of numerical characters leads to a solid contribution to the
prediction of the domain as malicious. Likewise, it is surprising that a 20% rate of numerical
characters in the domain string leads to an even bigger magnitude, which can be caused by the
relevant presence of some malware families not having numbers in their “regex”. In this case, a
usage of the proposed security policies for a system hosting an EXPOSURE-like detector would
be to allow domain names with a rate of numerical characters within the 20%-40% range.</p>
        <p>Figure 5: (a) SHAP PDP on the unique_ips feature. (b) SHAP PDP on the num_chars% feature.</p>
        <p>
Magnifying the Behavior with Local Explanations. Force plots and local explanations
can be considered the local version of summary plots, showing how features contribute on a
single sample. As Figure 6a shows, the features correctly lead the RF model to classify
the domain spring.io as benign. It is not surprising to see a 0 value of unique_ips pushing
the classification towards malicious, but the rest of the features gently move the prediction
towards the benign class. For malicious samples, instead, the latter feature plays a major role in
incorrectly classifying some malicious samples like mobile.de and qcx.nl, as additionally shown in the
Appendix. In general, the magnitude of the features sticks to what is shown by the summary
plot of Figure 4b for correctly classified samples. To understand, instead, which features
lead the model to misclassify a sample, Figure 6b shows the force plot for the domain fgc.es,
blacklisted yet misclassified by our system as benign. The unique_ips and %of_lms features
correctly move the prediction towards the malicious class, but the values of the TTL-based
features deceive the classifier. Figure 6c, on the other hand, shows how unique_ips and TTL
values deviate the prediction of the benign domain towards the malicious class.</p>
        <p>Figure 6: (a) Force plot of the benign domain sample spring.io, correctly classified as benign.
(b) Force plot of the malicious domain sample fgc.es, misclassified as benign. (c) Force plot of
the benign domain sample topeleven.com, misclassified as malicious.</p>
        <p>Summary of the Results. As a result of the presented experiments, we can reflect on the
issues of feature management and on hypothetical counteractions. From the statistical analysis,
we understand the feature distributions, their correlations, and how the DGAs in our traffic tend to behave.
However, global explanations can turn the tables and quantify how the model perceives the
features. Finally, we can see how features drive each sample's prediction via local explanations.
This ensemble of analyses shows that the overall feature prioritization depends on
both the model used and the considered data, which further proves how context-dependent
such systems' behavior can be. Hence, an explainability analysis should always be used to
better portray the big picture of both the system and the employed data. In our case, the big picture
has led to a more prominent analysis of the EXPOSURE system. The TTL features, the subject
of this analysis, sum up to 37.5% of the entire feature set, being 9 out of 24. It turns out from our
analysis of the explanations that they contribute massively to the misclassification of several
samples (such as in Figure 6b and Figure 6c), as they play high-magnitude roles in the summary
plots of Figure 4a and Figure 4b (which are the best classifiers overall). This makes this feature
set highly powerful for the whole system, yet in its power also lies a crucial problem. Namely,
attackers can manipulate these features, being completely free to tune the TTL and balance their
caching time (i.e., the likelihood of being detected) against the chance of evading the classification
of the system. Such a relevant portion of the feature set being reserved for values that can
be directly crafted by the attackers can serve as a significant stepping stone for
them. Furthermore, if the detector is deployed on a system devised to manage critical resources, besides
being evadable, part of the feature set can be overridden by the context. Some works like [23]
point out how security-sensitive systems, e.g., banking applications, should indeed carefully set
their DNS TTL to a low value. The upshot of these observations is that a botnet/DGA detector
cannot rely solely on accuracy metrics to establish its efficiency: analysts also need to be
aware of the model, the data, and the context therein. Explanations can lend a crucial
hand in this regard, helping prevent major issues and allowing debugging
of the model. Considering our system, through explanations we have seen how dangerously
influential TTL-based features are in most of the models. Considering their extensive use
in the feature set, appropriate security measures should be taken (e.g., reducing their number,
as for the Name-Based features, which are just as easily adjustable but sum up to only
8% of the set). We firmly believe that through explanations we can rapidly enhance the usage of and trust in
AI, as companies can look at such security systems from a human-readable perspective, and
model biases can be analyzed and studied.</p>
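        <p>A sketch of how TTL-based statistics of this kind can be computed per domain; the exact EXPOSURE feature definitions are given in the original work, so the ones below are illustrative approximations of attacker-tunable quantities.

```python
from statistics import mean, stdev

def ttl_features(ttls):
    """Illustrative TTL-based statistics for one domain's observed answers.

    `ttls` is the sequence of TTL values seen across a domain's DNS
    answers; the number of changes counts transitions between
    consecutive distinct values.
    """
    changes = sum(1 for a, b in zip(ttls, ttls[1:]) if a != b)
    return {
        "avg_ttl": mean(ttls),
        "std_ttl": stdev(ttls) if len(ttls) > 1 else 0.0,
        "num_ttl_changes": changes,
    }
```

Since every input to this computation is a value the botmaster sets directly in the authoritative zone, these features are exactly the attacker-controlled portion of the feature set discussed above.</p>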
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>
        DNS Analysis. Several promising works have striven to tackle botnet/DGA detection during the
last decade, often introducing innovative passive DNS features and methodologies. Among the
notable works besides EXPOSURE is Notos [24], in which Antonakakis et al. created the first
relevant and efficient reputation system for domains built from various data sources. In Pleiades [25],
the authors focused on NXDOMAIN records to both cluster domains and classify DGAs
by looking at string associations. Finally, in FANCI [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Schueppen et al. developed a
detector based on a small feature set akin to the EXPOSURE one, though focusing only on
passive NXDOMAIN data. All of these works have reached comparable performances in different
settings. None of them, though, has focused on the explainability of such a critical
application. The only works to have addressed such problems focused on multiclass classification
with deep learning approaches, thus classifying the malicious domains by malware
family. In [26], Becker et al. proposed a visual analytics system for Deep Learning (DL) models,
providing graphical insights into statistical properties of the domain name string. Drichel et al.,
in [27], briefly highlighted some string-wise interpretations for DL models starting from the
misclassified samples. In contrast, in [28], Drichel et al. proposed feature-based classifiers built
on string features for multiclass classification, with the purpose of improving explainability.
Firstly, none of these three works focused on passive DNS data, choosing string-based features
to ease the computational burden. Secondly, none developed an explanatory analysis, focusing
rather on how the model could be made more explainable or, at most, on how to visualize a
few string patterns from a DL model. Our work focuses on passive DNS traffic data, analyzing
features from a comprehensive viewpoint and not limiting them to human-readable string
features. Additionally, we propose both local and global explanations, concretely enhancing the
awareness of how a model behaves in such a context.
      </p>
      <p>Explainability Techniques. In [29], Ribeiro et al. proposed LIME (Local Interpretable
Model-Agnostic Explanations), an explainability method conceived as learning a local model
that approximates the classifier around the prediction. Despite its wide use, several concerns about stability
and consistency have been raised about LIME [15, 30]. Considering SHAP the more
reliable tool, we have chosen it for our explainability work.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>In this work, we proposed an explanatory analysis of ML classifiers devised for botnet/DGA
detection. Starting from the implementation of the EXPOSURE feature set on our traffic data, we
have shown how, regardless of prior statistical assumptions on the malware behavior within the network,
a model can interpret features in its own way globally, prioritizing certain features over
others in ways the prior analysis did not anticipate, and also demonstrating how different models
can conceive the features differently, to which we analysts should eventually adapt and debug accordingly. Locally, we
have seen how individual features can contribute, and how explanations can make analysts
and users aware of single decisions and of the motivations behind misclassified samples. Through
these analyses, we raised concerns about how the features and the model can be biased by the
context in which the systems are both trained and deployed. Our analysis advances the
comprehension of such contexts, favoring the employment of these
systems, as they can first be interpreted and adapted and subsequently accepted. In this regard,
several extensions of this work can be developed, aiming at fairness and legal regulation of
the detectors through explanations and, if possible, bringing them into debugging/pipelining
processes to obtain an efficient and explainable system. Additionally, explanations can be instrumental
when humans want to be involved in the decision-making. All in all, this work demonstrated
how powerful explanations can be and how security, debugging, interpretability, and fairness
can be brought to the next level in the application of ML to detection, where security has to be
assessed and interpreted throughout the process chain.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
<p>This work has been partly supported by the PRIN 2017 project RexLearn, funded by the Italian
Ministry of Education, University and Research (grant no. 2017TWNMH2); and by the project
TESTABLE (grant no. 101019206), under the EU’s H2020 research and innovation programme.</p>
<p>[9] L. Bilge, S. Sen, D. Balzarotti, E. Kirda, C. Kruegel, Exposure: A passive DNS analysis
service to detect and report malicious domains, ACM Trans. Inf. Syst. Secur. 16 (2014).
URL: https://doi.org/10.1145/2584679. doi:10.1145/2584679.
[10] P. V. Mockapetris, Domain names - implementation and specification, RFC 1035, 1987.
URL: https://tools.ietf.org/html/rfc1035.
[11] P. V. Mockapetris, Domain names - concepts and facilities, RFC 1034, 1987.
URL: https://tools.ietf.org/html/rfc1034.
[12] T. Miller, Explanation in Artificial Intelligence: Insights from the Social Sciences,
arXiv:1706.07269 [cs] (2018). URL: http://arxiv.org/abs/1706.07269.
[13] D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro,
K. Rieck, Dos and don’ts of machine learning in computer security, in: Proceedings of the
31st USENIX Security Symposium, 2022.
[14] C. Molnar, Properties of Explanations, in: Interpretable Machine Learning, 2019.
URL: https://christophm.github.io/interpretable-ml-book/properties.html.
[15] S. Lundberg, S.-I. Lee, A Unified Approach to Interpreting Model Predictions,
arXiv:1705.07874 [cs, stat] (2017). URL: http://arxiv.org/abs/1705.07874.
[16] S. Hart, Shapley Value, in: The New Palgrave Dictionary of Economics, Palgrave Macmillan
UK, London, 2017, pp. 1–5. URL: https://doi.org/10.1057/978-1-349-95121-5_1369-2.
doi:10.1057/978-1-349-95121-5_1369-2.
[17] D. Plohmann, K. Yakdan, M. Klatt, J. Bader, E. Gerhards-Padilla, A comprehensive
measurement study of domain generating malware, in: 25th USENIX Security Symposium
(USENIX Security 16), 2016, pp. 263–278.
[18] M. L. Waskom, seaborn: statistical data visualization, Journal of Open Source Software 6
(2021) 3021. URL: https://doi.org/10.21105/joss.03021. doi:10.21105/joss.03021.
[19] An introduction to explainable AI with Shapley values, SHAP documentation. URL: https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html.
[20] D. Plohmann, DGArchive, Fraunhofer FKIE. URL: https://dgarchive.caad.fkie.fraunhofer.de.
[21] R. Villamarin-Salomon, J. C. Brustoloni, Identifying botnets using anomaly detection
techniques applied to DNS traffic, in: 2008 5th IEEE Consumer Communications and
Networking Conference, 2008, pp. 476–481. doi:10.1109/ccnc08.2007.112.
[22] K. Alieyan, A. ALmomani, A. Manasrah, M. M. Kadhum, A survey of botnet detection
based on DNS, Neural Computing and Applications 28 (2017) 1541–1558.
URL: http://link.springer.com/10.1007/s00521-015-2128-0. doi:10.1007/s00521-015-2128-0.
[23] N. Vlajic, M. Andrade, U. T. Nguyen, The Role of DNS TTL Values in Potential DDoS Attacks:
What Do the Major Banks Know About It?, Procedia Computer Science 10 (2012) 466–473.
URL: https://www.sciencedirect.com/science/article/pii/S1877050912004176.
doi:10.1016/j.procs.2012.06.060.
[24] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, N. Feamster, Building a dynamic reputation
system for DNS, in: USENIX Security Symposium, 2010, pp. 273–290.
[25] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh, W. Lee, D. Dagon,
From throw-away traffic to bots: detecting the rise of DGA-based malware, in: 21st USENIX
Security Symposium (USENIX Security 12), 2012, pp. 491–506.
[26] F. Becker, A. Drichel, C. Müller, T. Ertl, Interpretable visualizations of deep neural networks
for domain generation algorithm detection, in: 2020 IEEE Symposium on Visualization for
Cyber Security (VizSec), 2020, pp. 25–29. doi:10.1109/VizSec51108.2020.00010.
[27] A. Drichel, U. Meyer, S. Schüppen, D. Teubert, Analyzing the real-world applicability
of DGA classifiers, in: Proceedings of the 15th International Conference on Availability,
Reliability and Security, 2020, pp. 1–11. doi:10.1145/3407023.3407030.
[28] A. Drichel, N. Faerber, U. Meyer, First step towards explainable DGA multiclass classification,
in: Proceedings of the 16th International Conference on Availability, Reliability and
Security, 2021, pp. 1–13. doi:10.1145/3465481.3465749.
[29] M. T. Ribeiro, S. Singh, C. Guestrin, ”Why Should I Trust You?”: Explaining the Predictions
of Any Classifier, arXiv:1602.04938 [cs, stat] (2016). URL: http://arxiv.org/abs/1602.04938.
[30] Z. Zhou, G. Hooker, F. Wang, S-LIME: Stabilized-LIME for Model Explanation, in: Proceedings
of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining, 2021,
pp. 2429–2438. URL: http://arxiv.org/abs/2106.07875. doi:10.1145/3447548.3467274.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Appendix</title>
<p>In this additional section, we collect several plots referenced in the previous sections,
which we believe support the comprehension of the work.</p>
<p>Grid Search Results. Using the Scikit-Learn Python suite, we optimized the hyperparameters
through the GridSearchCV API. The results of the optimization are reported for completeness in
Listing 1.</p>
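<p>What GridSearchCV does can be sketched in a few lines: it exhaustively evaluates every combination of the parameter grid and keeps the best-scoring one. The grid and the scoring function below are hypothetical stand-ins, not the actual cross-validated scores from our experiments.</p>

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustive search over a parameter grid, in the spirit of
    scikit-learn's GridSearchCV; returns the best parameters."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # e.g. mean cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid for a Random Forest-like model.
grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5]}
toy_score = lambda p: p["n_estimators"] / 200 - abs(p["max_depth"] - 5) / 10
best, score = grid_search(grid, toy_score)
print(best)  # → {'max_depth': 5, 'n_estimators': 200}
```

<p>GridSearchCV additionally refits the estimator on the whole training set with the winning parameters; the sketch keeps only the search itself.</p>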
<p>TTL Features plots. Having focused the discussion of the explainability analysis almost
entirely on the TTL features, we report some additional plots that point out interesting
behaviors: Figure 8 shows the statistical analysis of the first four TTL-based features, while
Figure 7 shows how small changes in the TTL values yield a low contribution to classifying the
sample as malicious, and vice versa.</p>
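<p>As a rough sketch of how TTL-based features in the spirit of EXPOSURE [5] can be derived from the TTL values observed in successive DNS responses for a domain (the feature names, the 300-second threshold, and the input format are our own illustrative assumptions, not the exact implementation):</p>

```python
from statistics import mean, stdev

def ttl_features(ttls):
    """Illustrative TTL-based features for one domain, computed from
    the sequence of TTLs seen in its successive DNS responses."""
    changes = sum(1 for a, b in zip(ttls, ttls[1:]) if a != b)
    return {
        "avg_ttl": mean(ttls),
        "std_ttl": stdev(ttls) if len(ttls) > 1 else 0.0,
        "num_ttl_changes": changes,  # frequent changes are suspicious
        "ratio_low_ttl": sum(300 > t for t in ttls) / len(ttls),  # short TTLs
    }

# A domain whose TTL keeps changing and stays low resembles the
# fast-flux/DGA behavior discussed in the TTL plots.
f = ttl_features([60, 60, 120, 30, 30])
print(f["num_ttl_changes"], f["ratio_low_ttl"])  # → 2 1.0
```

<p>Under these assumptions, a persistently low and frequently changing TTL pushes the sample toward the malicious class, matching the behavior the TTL plots describe.</p>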
<p>Additional Summary Plots. Figure 9 shows that the KNN classifier weighs unique_ips far less
than the TTL-based features, again illustrating how diverse models can be. The same holds for
the SVC classifier in Figure 10, which, unlike the decision-tree-based classifiers, makes little
use of the unique_ips feature.</p>
<p>Additional Force Plots. The plots in Figure 11 show a variety of samples either correctly
classified or misclassified by the RF model, demonstrating in practice how the most relevant
features can play a major role in any classification scenario, whether in the wrong or the
correct direction.</p>
      <p>Listing 1: Grid Search Results</p>
<p>Figure 7: Contribution of changes in the TTL values to the classification of a sample as malicious.</p>
      <p>Figure 8: Correlation between the first four TTL-based features.</p>
      <p>Figure 9: SHAP summary plot of feature contributions on KNN classifier.</p>
      <p>(a) Force plot of the benign domain sample mobile.de, correctly classified as benign.
(b) Force plot of the malicious domain sample qcx.nl, correctly classified as malicious.</p>
      <p>(c) Force plot of the malicious domain sample szx.pw, misclassified as benign.</p>
<p>Figure 11: Local explanations of three other domains from the dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lostri</surname>
          </string-name>
          ,
McAfee (Firm),
          <source>The Hidden Costs of Cybercrime, McAfee</source>
          ,
          <year>2020</year>
          . URL: https://books.google.it/books?id=mG0jzgEACAAJ.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Puri</surname>
          </string-name>
,
          <article-title>Bots &amp; Botnet: An Overview</article-title>
          , Elsevier (
          <year>2003</year>
          )
          <fpage>17</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Salusky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Danford</surname>
          </string-name>
          , Know Your Enemy:
          <article-title>Fast-Flux Service Networks</article-title>
          ,
          <source>in: The Honeypot Project</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Stone-Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavallaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gilbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szydlowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kemmerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kruegel</surname>
          </string-name>
          , G. Vigna,
          <article-title>Your botnet is my botnet: analysis of a botnet takeover</article-title>
          ,
          <source>in: Proceedings of the 16th ACM conference on Computer and communications security - CCS '09</source>
          , ACM Press, Chicago, Illinois, USA,
          <year>2009</year>
          , p.
          <fpage>635</fpage>
          . URL: http://portal.acm.org/citation.cfm?doid=1653662.1653738. doi:10.1145/1653662.1653738.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bilge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kirda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kruegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balduzzi</surname>
          </string-name>
          , Exposure:
          <article-title>Finding malicious domains using passive dns analysis</article-title>
,
          <source>in: Network and Distributed System Security Symposium (NDSS)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schüppen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrmann</surname>
          </string-name>
, U. Meyer, FANCI:
          <article-title>Feature-based automated nxdomain classification and intelligence</article-title>
          ,
          <source>in: 27th USENIX Security Symposium (USENIX Security 18)</source>
          , USENIX Association, Baltimore, MD,
          <year>2018</year>
          , pp.
          <fpage>1165</fpage>
          -
          <lpage>1181</lpage>
          . URL: https://www.usenix.org/conference/usenixsecurity18/presentation/schuppen.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Themis: A Novel Detection Approach for Detecting Mixed Algorithmically Generated Domains</article-title>
          ,
          <source>in: 2019 15th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>264</lpage>
          . doi:10.1109/MSN48538.2019.00057.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schiavoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Maggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavallaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zanero</surname>
          </string-name>
          , Phoenix:
          <article-title>DGA-based botnet tracking and intelligence</article-title>
          , in: Detection of Intrusions and Malware, and Vulnerability Assessment, Springer,
          <year>2014</year>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bilge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Balzarotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kirda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kruegel</surname>
          </string-name>
          ,
          <article-title>Exposure: A passive DNS analysis service to detect and report malicious domains</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>