=Paper=
{{Paper
|id=Vol-2327/ESIDA2
|storemode=property
|title=eX2: a framework for interactive anomaly detection
|pdfUrl=https://ceur-ws.org/Vol-2327/IUI19WS-ESIDA-2.pdf
|volume=Vol-2327
|authors=Ignacio Arnaldo,Kalyan Veeramachaneni,Mei Lam
|dblpUrl=https://dblp.org/rec/conf/iui/ArnaldoVL19
}}
==eX2: a framework for interactive anomaly detection==
eX2: a framework for interactive anomaly detection
Ignacio Arnaldo (iarnaldo@patternex.com), PatternEx, San Jose, CA, USA
Mei Lam (mei@patternex.com), PatternEx, San Jose, CA, USA
Kalyan Veeramachaneni (kalyanv@mit.edu), LIDS, MIT, Cambridge, MA, USA
ABSTRACT

We introduce eX2 (coined after "explain" and "explore"), a framework based on explainable outlier analysis and interactive recommendations that enables cybersecurity researchers to efficiently search for new attacks. We demonstrate the framework with both publicly available and real-world cybersecurity datasets, showing that eX2 improves the detection capability of stand-alone outlier analysis methods, thereby improving the efficiency of so-called threat hunting activities.

CCS CONCEPTS

• Security and privacy → Intrusion/anomaly detection and malware mitigation; • Human-centered computing → User interface management systems; • Information systems → Recommender systems;

KEYWORDS

Anomaly detection; interactive machine learning; explainable machine learning; cybersecurity; recommender systems

ACM Reference Format:
Ignacio Arnaldo, Mei Lam, and Kalyan Veeramachaneni. 2019. eX2: a framework for interactive anomaly detection. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 5 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA
Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION

The cybersecurity community is embracing machine learning to transition from a reactive to a predictive strategy for threat detection. At the same time, most research works at the intersection of cybersecurity and machine learning focus on building complex models for a specific detection problem [11], but rarely translate into real-world solutions. Arguably, one of the biggest weak spots of these works is the use of datasets that lack generality, realism, and representativeness [3].

To break out of this situation, the first step is to devise efficient strategies to obtain representative datasets. To that end, intelligent tools and interfaces are needed to enable security researchers to carry out threat hunting activities, i.e., to search for attacks in real-world cybersecurity datasets. Threat hunting solutions remain vastly unexplored in the research community, and pose open challenges in combining the fields of outlier analysis, explainable machine learning, and recommender systems.

In this paper, we introduce eX2, a threat hunting framework based on interactive anomaly detection. The detection relies on outlier analysis, given that new attacks are expected to be rare and to exhibit distinctive features. At the same time, special attention is dedicated to providing interpretable, actionable results for analyst consumption. Finally, the framework exploits human-data interactions to recommend the exploration of regions of the data deemed problematic by the analyst.

2 RELATED WORK

Anomaly detection methods have been extensively studied in the machine learning community [1, 6, 10]. The strategy based on Principal Component Analysis used in this work is inspired by [14], while the method introduced to retrieve feature contributions based on the analysis of feature projections into the principal components is closely related to [7].

Given the changing nature of cyber-attacks, many researchers resort to anomaly detection for threat detection. The majority of these works focus on building sophisticated models [13, 15], but do not exploit analyst interactions with the data to improve detection rates. Recent works explore a human-in-the-loop detection paradigm by leveraging a combination of outlier analysis, used to identify new threats, and supervised learning to improve detection rates over time [2, 8, 16]. However, these works do not consider two critical aspects of cybersecurity. First, they do not provide explanations for the anomalies (note that [2] provides predefined visualizations based on prior attack knowledge, but it does not account for new attacks exhibiting unique patterns). Second, none of these works exploits interactive strategies upon the confirmation of a new attack by an analyst, therefore missing an opportunity to improve the detection recall and the label acquisition process.

3 FINDING ANOMALIES

We leverage Principal Component Analysis (PCA) to find cases that violate the correlation structure of the main bulk of the data. To detect these rare cases, we analyze the projection from the original variables to the principal components' space, followed by the inverse projection (or reconstruction) from the principal components to the original variables. If only the first principal components (the components that explain most of the variance in the data) are used for projection and reconstruction, we ensure that the reconstruction error will be low for the majority of the examples, while remaining high for outliers. This is because the first principal components explain the variance of normal cases, while the last principal components explain outlier variance [1].

Let X be a p-dimensional dataset. Its covariance matrix Σ can be decomposed as Σ = P × D × P^T, where P is an orthonormal matrix whose columns are the eigenvectors of Σ, and D is the diagonal matrix containing the corresponding eigenvalues λ_1 ... λ_p, where the eigenvectors and their corresponding eigenvalues are sorted in
[Plots omitted: panels (a) Credit card, (b) KDDCup, (c) ATO show score distributions; panels (d) Credit card, (e) KDDCup, (f) ATO show normalized scores.]
Figure 1: Score distributions (a, b, c) and normalized scores (d, e, f) for three datasets obtained with the PCA method and the corresponding fitted distributions.
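The covariance decomposition Σ = P × D × P^T described in Section 3 can be sketched with numpy; a minimal illustration under our own naming, not the authors' implementation:

```python
import numpy as np

def sorted_pca(X):
    """Eigendecomposition of the covariance matrix, Sigma = P D P^T,
    with eigenvectors sorted by decreasing eigenvalue (most variance first)."""
    Sigma = np.cov(X, rowvar=False)           # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort: largest variance first
    return eigvecs[:, order], eigvals[order]  # P (columns = eigenvectors), lambda_1..lambda_p
```

Since Σ is symmetric, `numpy.linalg.eigh` is the appropriate routine; it returns eigenvalues in ascending order, hence the explicit re-sorting to match the paper's convention.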
decreasing order of significance (the first eigenvector accounts for the most variance, etc.).

The projection of the dataset into the principal component space is given by Y = XP. Note that this projection can be performed with a reduced number of principal components. Let Y^j be the projected dataset using the top j principal components: Y^j = X × P^j. In the same way, the reverse projection (from the principal component space to the original space) is given by R^j = (P^j × (Y^j)^T)^T, where R^j is the reconstructed dataset using the top j principal components.

We define the outlier score of point X_i = [x_{i1} ... x_{ip}] as:

$$\mathrm{score}(X_i) = \sum_{j=1}^{p} \left| X_i - R_i^j \right| \times ev(j), \qquad ev(j) = \frac{\sum_{k=1}^{j} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \qquad (1)$$

Note that ev(j) represents the cumulative percentage of variance explained with the top j principal components. This means that the higher j is, the more variance is accounted for within components 1 to j. With this score definition, large deviations in the top principal components are not heavily weighted, while deviations in the last principal components are. Outliers present large deviations in the last principal components, and thus will receive high scores.

Normalizing outlier scores: As shown in Figure 1, the outlier detection method assigns a low score to most examples, and the distribution presents a long right tail. At the same time, the range of the scores depends on the dataset, which limits the method's interpretability. To overcome this situation, we project all scores into a common space, in such a way that scores can be interpreted as probabilities. To that end, we model PCA-based outlier scores with a Weibull distribution (overlaid in the figures in red). Note that the Weibull distribution is flexible and can model a wide variety of shapes. For a given score S, its outlier probability corresponds to the cumulative distribution function evaluated at S: F(S) = P(X ≤ S). Figure 1 shows the final scores F for each of the analyzed datasets. We can see that, with this technique, the final scores approximately follow a long right-tailed distribution in the [0, 1] domain. Note that these scores can be interpreted as the probability that a randomly picked example will present a lower or equal score.

4 EXPLAINING AND EXPLORING ANOMALIES

Interpretability in machine learning can be achieved by explaining the model that generates the results, or by explaining each model outcome [9]. In this paper, we focus on the latter, given that the goal is to provide explanations for each individual anomaly. More formally, we consider an anomaly detection strategy given by b(X^p) = S, where b is a black-box detector, X^p is a dataset with p features, and S is the space of scores generated by the detector. The goal is to find an explanation e ∈ ε for each x ∈ X^p, where ε represents the domain of interpretable explanations. We approach this problem as finding a function f such that, for each vector x ∈ X^p, the corresponding explanation is given by e = f(x, b).

In this paper, we introduce a procedure f tailored to PCA that generates explanations e = {C, V}, where C contains the contribution of each feature to the score, and V is a set of visualizations that highlight the difference between the analyzed example and the bulk of the population.

Retrieving feature contributions: In this first step, we retrieve the contribution of each feature of the dataset to the final outlier score via model inspection. Note that we leverage matrix operations to simultaneously retrieve the feature contributions for all the examples; we proceed as follows:

(1) Project one feature at a time using all principal components. For feature i, the projected data is given by Y_i = X_i × P, where the matrix P contains all p eigenvectors.

(2) Compute the feature contribution C_i of feature i as:

$$C_i = \sum_{j=1}^{p} Y_i^j \times ev(j) \qquad (2)$$

where Y_i^j is the projected value of the i-th feature on the j-th principal component, and ev(j) is the cumulative percentage of variance explained with the top j principal components given in Equation 1. In other words, the higher the absolute values projected with the last principal components, the higher the contribution of the feature to the outlier score.

(3) In a last step, we normalize the feature contributions to obtain a unit vector C for each sample:

$$C_i = \frac{C_i}{\sum_{j=1}^{p} C_j} \qquad (3)$$
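The scoring and explanation pipeline of Sections 3 and 4 (Equations 1-3, plus the Weibull normalization) can be sketched in Python with numpy and scipy. This is a minimal illustration, not the authors' implementation: the norm used in |X_i − R_i^j| and the exact form of the single-feature projection Y_i = X_i × P are left implicit in the text, so the L2 norm and the per-entry products below are our assumptions.

```python
import numpy as np
from scipy.stats import weibull_min

def pca_outlier_scores(X):
    """Outlier scores (Eq. 1), per-feature contributions (Eqs. 2-3), and
    Weibull-normalized probabilities for a numeric dataset X (n x p)."""
    Xc = X - X.mean(axis=0)                       # center the data
    lam, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(lam)[::-1]                 # sort by decreasing eigenvalue
    lam, P = lam[order], P[:, order]
    ev = np.cumsum(lam) / lam.sum()               # ev(j): cumulative explained variance

    # Eq. 1: residual of the top-j reconstruction, weighted by ev(j)
    # (assumption: |X_i - R_i^j| is read as the L2 norm of the residual row)
    n, p = Xc.shape
    scores = np.zeros(n)
    for j in range(1, p + 1):
        Pj = P[:, :j]
        Rj = Xc @ Pj @ Pj.T                       # project and reconstruct with top j comps
        scores += np.linalg.norm(Xc - Rj, axis=1) * ev[j - 1]

    # Eq. 2: per-feature contributions; entry [s, i, j] is feature i of sample s
    # projected on component j (assumption: our single-feature reading of Y_i = X_i x P)
    C = np.abs(Xc[:, :, None] * P[None, :, :]) @ ev
    C = C / C.sum(axis=1, keepdims=True)          # Eq. 3: normalize to sum to 1 per sample

    # Fit a Weibull distribution to the scores and map them to [0, 1] via its CDF
    shape, loc, scale = weibull_min.fit(scores, floc=0)
    probs = weibull_min.cdf(scores, shape, loc=loc, scale=scale)
    return scores, probs, C
```

Because ev(j) grows with j, a residual that survives even when almost all components are used (i.e., a deviation along the last components) is counted in nearly every term of the sum, which is exactly the weighting behavior the score definition asks for.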
[Plots omitted: (a) pie chart of feature contributions (num_cred_cards 58.29%, num_addresschg 17.94%, addr_verify_fail 11.40%, new_ip 7.09%, rest 5.13%); (b, c) scatter plots of num_addresschg vs num_cred_cards.]
Figure 2: Explanation of an outlier (in red) of the account takeover dataset (ATO): (a) feature contributions; (b) distribution of
the population in the subspace formed by the top 2 contributing features; (c) nearest neighbors (green) in the 2D subspace.
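The recommendation step illustrated in Figure 2c, retrieving the nearest neighbors of a confirmed outlier within the 2D subspace of its top contributing features, can be sketched as follows (the distance metric is not specified in the paper; Euclidean distance and the function name are our assumptions):

```python
import numpy as np

def recommend_in_subspace(X, outlier_idx, feature_pair, k=5):
    """Top-k nearest neighbors of a confirmed outlier, with distances computed
    only in the 2D subspace of the given feature pair (not over all features)."""
    sub = X[:, list(feature_pair)]                 # restrict to the 2D subspace
    d = np.linalg.norm(sub - sub[outlier_idx], axis=1)
    d[outlier_idx] = np.inf                        # exclude the outlier itself
    return np.argsort(d)[:k]                       # indices of the k closest entities
```

Restricting the distance computation to the visualized subspace is the point of the strategy: entities that look similar in the discriminant features confirmed by the analyst are surfaced, even if they differ elsewhere.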
This way, for each outlier, we obtain a contribution score in the [0, 1] domain for each feature in the dataset. To illustrate this step, we show in Figure 2a the feature contributions to the score of an outlier of the ATO dataset; we can see that num_cred_cards contributed the most to the example's score (58.29%), followed by num_addresschg and addr_verify_fail (17.94% and 11.40% respectively).

Visualizing anomalies: Once the feature contributions are extracted, the system generates a series of visualizations to show each outlier in relation to the rest of the population. For ease of interpretation, these visualizations are generated in low-dimensional feature subspaces as follows:

(1) Retrieve the top-m features ranked by contribution score.

(2) For each pair of features (x_i, x_j) in the top-m, display the joint distribution of the population in a 2D scatter plot as shown in Figure 2b. Note that in the example m = 2 and that the analyzed outlier is highlighted in red. In the case of large datasets, the visualizations are limited to 10K randomly picked samples.

With this approach, we obtain intuitive visualizations in low-dimensional subspaces of the original features, in such a way that outliers are likely to stand out with respect to the rest of the population.

Exploring via recommendations in feature subspaces: As the analyst interacts with the visualizations and confirms relevant findings, the framework recommends the investigation of entities with similar characteristics. These recommendations are interactive and correspond to searching for the top-k nearest neighbors in the feature subspaces used to visualize the data (as opposed to using all the features for distance computation). As shown in Figure 2c, the recommendations highlighted in green help narrow down the search for further anomalies.

This strategy, recommending based on similarities computed in feature subsets, exploits user interactions with the data. The intuition is that, upon confirming the relevance of an outlier with the provided visualizations, the user identifies discriminant feature sets that are not known a priori. Thus, points close to the identified anomaly in the resulting subspaces are likely to be relevant in turn.

5 EXPERIMENTAL WORK

Datasets: We evaluate the framework's capability to find, explain, and explore anomalies with four outlier detection datasets, of which three are publicly available (WDBC, KDDCup, and Credit Card) and one is a real-world dataset built with logs generated by an online application:

- WDBC dataset: this dataset is composed of 367 rows and 30 numerical features, and includes 10 anomalies. We consider the version available at [5] introduced by Campos et al. [4]. Note that this is not a cybersecurity dataset, but it has been included to cover a wider range of scenarios.

- KDDCup 99 dataset (KDD): We consider the pre-processed version introduced in [4], in which categorical values are one-hot encoded and duplicates are eliminated. This version is composed of 48113 rows and 79 features, and counts 200 malicious anomalies.

- Credit card dataset (CC): used in a Kaggle competition [12], the dataset is composed of 284807 rows and 29 numerical features, and counts 492 anomalies.

- Account takeover dataset (ATO): this real-world dataset was built using web logs from an online application over three months. Each row corresponds to the summarized activity of a user during a 24-hour time window (midnight to midnight). It is composed of 317163 rows and 25 numerical features, and counts 318 identified anomalies.¹

Detection rates and analysis of top outliers: Table 1 shows the detection metrics of the PCA-based method and Local Outlier Factor (LOF), a standard outlier analysis baseline, on each of the datasets. The detection performance of LOF is superior for the smaller dataset, WDBC. However, PCA-based outlier analysis outperforms LOF on the three cybersecurity datasets (KDD, CC, and ATO). This observation validates the choice of PCA, given that it not only outperforms LOF, but also provides interpretability as explained in Section 4.

Despite improving on the results of LOF on the cybersecurity datasets, we can see that the precision and recall metrics of the PCA-based method remain low. For instance, when looking at the top 100 outliers, the precision of our method (noted P@100 in the table) is

¹ Like most real-world datasets, ATO is not fully labeled; therefore, the metrics presented in the following need to be interpreted accordingly.
Dataset           Method  AUROC  AUPR   P@10   R@10   P@50   R@50   P@100  R@100  P@200  R@200  P@500  R@500
WDBC              LOF     0.982  0.834  0.800  0.800  0.180  0.900  0.100  1.000  0.050  1.000  0.020  1.000
WDBC              PCA     0.899  0.219  0.300  0.300  0.160  0.800  0.090  0.900  0.050  1.000  0.020  1.000
KDDCup            LOF     0.606  0.029  0.000  0.000  0.240  0.060  0.170  0.085  0.105  0.105  0.054  0.135
KDDCup            PCA     0.977  0.136  0.300  0.015  0.260  0.065  0.210  0.105  0.220  0.220  0.138  0.345
Credit card       LOF     0.654  0.015  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
Credit card       PCA     0.954  0.255  0.400  0.008  0.620  0.063  0.480  0.098  0.500  0.203  0.282  0.287
Account takeover  LOF     0.568  0.004  0.000  0.000  0.020  0.003  0.010  0.003  0.005  0.003  0.004  0.006
Account takeover  PCA     0.861  0.010  0.100  0.003  0.020  0.003  0.020  0.006  0.020  0.013  0.014  0.022

Table 1: Anomaly detection metrics of Local Outlier Factor (LOF) and the method based on PCA used in our framework.
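The P@k and R@k columns in Table 1 follow the standard top-k definitions: the fraction of the k highest-scored examples that are true anomalies, and the fraction of all anomalies captured among them. A minimal sketch (function name is ours):

```python
import numpy as np

def precision_recall_at_k(scores, labels, k):
    """P@k and R@k: precision and recall over the k highest-scored examples.
    `labels` is a 0/1 array marking the true anomalies."""
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k largest scores
    hits = labels[top_k].sum()             # true anomalies among the top k
    return hits / k, hits / labels.sum()
```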
[Plots omitted: 2D scatter plots of num_addresschg vs num_cred_cards, num_addresschg vs addr_verify_fail, and num_cred_cards vs addr_verify_fail.]
Figure 3: Visualization of the top ATO outlier with respect to the bulk of the population in 2D feature subspaces of interest.
The recommendations performed by the system are shown in green.
0.210, 0.480, and 0.020 for KDDCup, CC, and ATO respectively. This observation indicates that not all outliers are malicious, and justifies the effort dedicated to providing interactive exploration of the data to increase anomaly detection rates.

Explain and explore: We show in Figure 3 the visualizations and recommendations generated for the top ATO outlier. The framework appropriately selects feature subsets such that the analyzed outlier (shown in red) stands out with respect to the population (blue), i.e., outliers fall in sparse regions of the selected subspaces. The top 3 contributing features retrieved by the framework are the number of address changes (num_addresschg), the number of credit cards used (num_cred_cards), and whether the user failed the address verification (addr_verify_fail). In the first plot (num_addresschg vs num_cred_cards), we can clearly see why the highlighted user is suspicious: he/she used four credit cards, and changed the delivery address more than 90 times. The plot also shows five additional users recommended by the system upon confirmation of the threat by an analyst. The recommended users present an elevated number of address changes, and used one or more credit cards.

To further evaluate the exploratory strategy based on recommendations, Figure 4 shows the detection rate obtained with PCA alone versus the metrics obtained with the combination of PCA and recommendations. To obtain the latter metrics, we simulate investigations of the top-m (m ∈ {10, 25, 50, 100, 200, 500}) outliers (i.e., we reveal the ground truth) and consider the top-10 recommended entries for the confirmed threats. In all cases, interactive anomaly detection improves the precision. In particular, we can see a significant precision improvement for the KDD and CC datasets for investigation budgets in the 50-200 range.

[Plot omitted: precision versus investigation budget curves for AD and IAD on the WDBC, KDD, CC, and ATO datasets.]
Figure 4: Precision versus investigation budget of anomaly detection alone based on PCA (AD) and interactive anomaly detection combining both PCA and recommendations (IAD).

6 CONCLUSION

We have introduced the eX2 framework for threat hunting activities. The framework leverages principal component analysis to generate interpretable anomalies, and exploits analyst-data interactions to recommend the exploration of problematic regions of the data. The results presented in this work with three cybersecurity datasets show that eX2 outperforms detection strategies based on stand-alone outlier analysis.
REFERENCES
[1] Charu C. Aggarwal. 2013. Outlier Analysis. Springer. https://doi.org/10.1007/
978-1-4614-6396-2
[2] Anaël Beaugnon, Pierre Chifflier, and Francis Bach. 2017. ILAB: An Interactive
Labelling Strategy for Intrusion Detection. In RAID 2017: Research in Attacks,
Intrusions and Defenses. Atlanta, United States. https://hal.archives-ouvertes.fr/
hal-01636299
[3] E. Biglar Beigi, H. Hadian Jazi, N. Stakhanova, and A. A. Ghorbani. 2014. Towards
effective feature selection in machine learning-based botnet detection approaches.
In 2014 IEEE Conference on Communications and Network Security. 247–255.
[4] Guilherme O. Campos, Arthur Zimek, Jörg Sander, Ricardo J. G. B. Campello,
Barbora Micenková, Erich Schubert, Ira Assent, and Michael E. Houle. 2016.
On the evaluation of unsupervised outlier detection: measures, datasets, and
an empirical study. Data Mining and Knowledge Discovery 30, 4 (01 Jul 2016),
891–927.
[5] Guilherme O. Campos, Arthur Zimek, Jörg Sander, Ricardo J. G. B. Campello,
Barbora Micenková, Erich Schubert, Ira Assent, and Michael E. Houle. 2018.
Datasets for the evaluation of unsupervised outlier detection. www.dbs.ifi.lmu.
de/research/outlier-evaluation/DAMI/
[6] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly Detection:
A Survey. ACM Comput. Surv. 41, 3, Article 15 (July 2009), 58 pages. https:
//doi.org/10.1145/1541880.1541882
[7] XuanHong Dang, Barbora Micenková, Ira Assent, and Raymond T. Ng. 2013. Local Outlier Detection with Interpretation. In Machine Learning and Knowledge Discovery in Databases, Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Železný (Eds.). Lecture Notes in Computer Science, Vol. 8190. Springer Berlin Heidelberg, 304–320. https://doi.org/10.1007/978-3-642-40994-3_20
[8] S. Das, W. Wong, T. Dietterich, A. Fern, and A. Emmott. 2016. Incorporating
Expert Feedback into Active Anomaly Discovery. In 2016 IEEE 16th International
Conference on Data Mining (ICDM). 853–858. https://doi.org/10.1109/ICDM.2016.
0102
[9] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca
Giannotti, and Dino Pedreschi. 2018. A Survey of Methods for Explaining Black
Box Models. ACM Comput. Surv. 51, 5, Article 93 (Aug. 2018), 42 pages. https:
//doi.org/10.1145/3236009
[10] Victoria Hodge and Jim Austin. 2004. A Survey of Outlier Detection Method-
ologies. Artif. Intell. Rev. 22, 2 (Oct. 2004), 85–126. https://doi.org/10.1023/B:
AIRE.0000045502.10941.a9
[11] Heju Jiang, Jasvir Nagra, and Parvez Ahammad. 2016. SoK: Applying Machine
Learning in Security-A Survey. arXiv preprint arXiv:1611.03186 (2016).
[12] Kaggle. 2018. Credit Card Fraud Detection Dataset. www.kaggle.com/isaikumar/
creditcardfraud
[13] Benjamin J. Radford, Leonardo M. Apolonio, Antonio J. Trias, and Jim A. Simpson.
2018. Network Traffic Anomaly Detection Using Recurrent Neural Networks.
CoRR abs/1803.10769 (2018). arXiv:1803.10769 http://arxiv.org/abs/1803.10769
[14] Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. 2003. A novel anomaly detection scheme based on principal component classifier. In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03). 172–179.
[15] Aaron Tuor, Samuel Kaplan, Brian Hutchinson, Nicole Nichols, and Sean Robin-
son. 2017. Deep Learning for Unsupervised Insider Threat Detection in Struc-
tured Cybersecurity Data Streams. CoRR abs/1710.00811 (2017). arXiv:1710.00811
http://arxiv.org/abs/1710.00811
[16] K. Veeramachaneni, I. Arnaldo, V. Korrapati, C. Bassias, and K. Li. 2016. AI 2 :
Training a Big Data Machine to Defend. In 2016 IEEE 2nd International Conference
on Big Data Security on Cloud (BigDataSecurity). 49–54.