=Paper=
{{Paper
|id=Vol-1569/paper5
|storemode=property
|title=ADOVA: Anomaly Detection in Online and Virtual spAces
|pdfUrl=https://ceur-ws.org/Vol-1569/paper5.pdf
|volume=Vol-1569
|authors=C. David Emele,Vitalij Spakov,Wei Pang,Jone Bone,George Coghill
|dblpUrl=https://dblp.org/rec/conf/atal/EmeleSPBC15
}}
==ADOVA: Anomaly Detection in Online and Virtual spAces==
ADOVA: Anomaly Detection in Online and Virtual spAces
C. David Emele Vitalij Spakov Wei Pang
RCUK dot.rural DE Hub Computer Science Dept. Computer Science Dept.
University of Aberdeen, UK University of Aberdeen, UK University of Aberdeen, UK
c.emele@abdn.ac.uk v.spakov@abdn.ac.uk w.pang@abdn.ac.uk
Jone Bone George Coghill
Sociology Department Computer Science Dept.
University of Aberdeen, UK University of Aberdeen, UK
j.bone@abdn.ac.uk g.coghil@abdn.ac.uk
ABSTRACT players in a short time interval. Many social networking
Online and virtual spaces comprise a myriad of ad-hoc net- platforms have been developed which enable users to engage
works and online communities. Such communities are com- socially with their environment and communities. Exam-
posed of smart devices, agents, systems and people who seek ples of such platforms include Twitter, Facebook, Instagram,
to interact in one way or another. We argue that the task and LinkedIn. Through these platforms, users connect with
of detecting anomalies in such settings is non-trivial. The other users to form ad-hoc organisations. Such organisations
complexity is further compounded since there is no clear cut often operate with little or no rules to guide their interac-
definition/specification of what normal behaviour is, and tions with one another. Because of the transient nature of
how far out an outlier should be before it is detected as these ad-hoc communities, detecting abnormal behaviours
an anomaly. This is often the case with online and virtual (often referred to as anomaly or outlier) is a difficult task.
spaces as there is little or no regulation of the interactions Anomaly is a common phenomenon in our everyday lives,
between the various players in online communities. Hence, and could present grievious repercussions such as poten-
detecting anomalous behaviour in such settings poses a huge tial security threat to lives and property. There is strong
challenge. In this paper, we investigate how evolutionary evidence that detecting anomaly is a challenge in real-life
clustering could be exploited to support decision makers, real-time interactions, particularly in social media. This is
designers and data scientists in the autonomous detection of largely as a result of the amount of data produced in on-
anomalies in online and virtual spaces. We present prelimi- line social interactions, the time and intelligence required
nary ideas in tackling this issue using a freeform online social to trawl through massive quantities of data produced ev-
media community (Twitter) and explore how emerging pat- ery second and the heterogeneity of the network. An even
terns and trends could help identify clusters of players (or further complication is the fact that communities and organ-
normal behaviour) and, conversely, anomalies. isations formed on social networks are often open, informal
and lack rigid rules for conformity - in order words, little
or no control. In such scenarios, no-one defines what an
Categories and Subject Descriptors acceptable behaviour is and so it is difficult to detect abnor-
H.4 [Information Systems Applications]: Miscellaneous mal behaviour when normal behaviour is not clearly defined.
Nevertheless, it is important to detect anomalous behaviour
General Terms and possibly respond to it early enough so as to mitigate
imminent threats.
Algorithms, Experimentation, Human Factors We acknowledge that anomalies could be accidental or de-
liberate and may be innocuous or harmful. Whatever the
Keywords case, it is important to develop tools that could help to de-
evolutionary clustering, anomaly, outliers, twitter, social me- tect anomalies when they occur, especially in online and
dia, online communities, agent technology, virtual organisa- virtual spaces, where there is neither regulation nor supervi-
tions sion. To date, many anomaly detection tools and algorithms
have been developed, however they are often developed for
offline settings and so would struggle to detect anomalies in
1. INTRODUCTION online and virtual spaces. Some of the work that has been
Recent advancements in technology and the Internet present done in detecting anomaly in online settings often fail to
new spaces for communities to interact. Such spaces (re- deal with scenarios where normal behaviour is not explicitly
garded here as online and virtual spaces) offer new opportu- known, and often do not consider the ad-hoc nature of vir-
nities to engage in pervasive communication between many tual spaces (where players can join and leave a community
as they wish).
In this paper, we propose the use of evolutionary clus-
tering mechanisms to support data analysts and decision
makers in detecting anomalies in ad-hoc and unregulated
communities. The rest of this paper is formatted as follows:
Section 2 discusses as background some of the existing ap- 3. EVOLUTIONARY CLUSTERING IN ON-
proaches that have been used in anomaly detection in the LINE AND VIRTUAL SETTINGS
literature. Section 3 describes our evolutionary clustering
approach and how it can aid anomaly detection in online Evolutionary clustering presents benefits that make it a
and virtual communities (e.g. social networks). Section 4 robust framework for analysing interactions in online and
presents preliminary results of our evaluation, while Section virtual spaces. For example, evolutionary clustering enables
5 concludes with a brief discussion of the results as well as clusters to have smoother and more natural transitions over
future directions. time since new information is added to existing data, which
could cause cluster centres to shift. Secondly, evolutionary
algorithm enables a new cluster to resemble the cluster in
2. BACKGROUND the previous timestep, if at all possible, which we refer to
This section presents a brief background on anomaly de- as consistency. On th other hand, if the new data is not
tection in general. It often involves a set of observations representative of the cluster population then it registers the
over a time period such that X = {x1 , x2 , x3 , · · · xn }. Social new data as an anomaly. Regarding noise, evolutionary al-
media data is largely noisy and so differentiating noise and gorithm is robust against noise because previous data points
anomaly could be a daunting task. Being able to automat- are considered in generating new clusters.
ically detect anomalies in such noisy, ad-hoc environment We proceed by presenting our evolutionary clustering al-
presents huge research challenges. Although anomaly detec- gorithm, which has strong roots in the algorithm proposed
tion has been extensively researched in the literature, exist- by Chakrabarti et al. [3]. Let t denote a timestep and j be
ing approaches do not adequately detect anomalies in online the index of a cluster such that ctj denotes a cluster centre j
social networks due to the uniqueness of the environment in at time t. A matching cluster is represented as ct−1f (j) , where
terms of noise, little or no regulation, and ”ad-hocness”. t − 1 depicts a previous timestep and f (j) denotes a clus-
Anomaly detection is a well-studied problem in the liter- ter matching function that matches a given cluster with the
ature (see [11]) and several techniques including statistical most similar cluster in the previous timestep. We calculate
([13]), graph-based ([8]), network learning ([9, 10]) and clus- the relative cluster size, denoted by γ, as follows:
tering ([5, 12]) have been applied. However, to the best of
our knowledge, evolutionary clustering has not been con-
sidered/applied in anomaly detection. Furthermore, it has ntj
γ= (1)
not been considered in online social networks where net- (ntj + nt−1
f (j) )
work configurations and composition involve heterogenous
connections that are constantly changing. The ad-hoc and The relative cluster size is an important statistic for cal-
“unregulated” nature of social interactions in online commu- culating a new centre for a given cluster. Another impor-
nities pose interesting challenges which evolutionary cluster- tant variable is the change parameter (cp, for short), which
ing could help to resolve (as presented in Section 3). As a specifies the trade off between two cluster centres. Change
background, we discuss related research that has looked into parameter is user defined and ranges from 0 to 1 (that is,
some of the temporal issues involved in anomaly detection 0 < cp < 1). Putting it all together, we define a centre
including online topic detection, classification, tracking and recalculation function, denoted as g(c, t, j 0 ) as follows:
load balancing. For example, some work in machine learn-
ing allow models learned in classification settings to evolve
over time but with penalties for errors and shifts introduced g(c, t, j 0 ) = (1 − γ) × cp × ct−1 t
f (j) + γ × (1 − cp) × cj (2)
at each timestep ([7]). It is, nevertheless, unclear how such
algorithms could be employed in ad-hoc, fast changing, het- The centre recalculation function (depicted in Equation
erogenous, unregulated and unsupervised learning settings. 2) calculates a new center location that has to be between a
Some online document clustering research has explored current cluster center retrieved from non-evolutionary clus-
temporal aspects, in which a time series of documents are tering, and the one matched from a previous clustering while
examined sequentially to detect novelty regarding certain taking into account the relative sizes of both clusters and
features of the network ([2]). Research in machine learn- change parameter values defined by the user.
ing, data mining, statistics, and cloud computing has ex- Having defined the variables that our evolutionary clus-
plored a number of approaches for clustering time-series tering mechanism depends on, we now turn our attention
data. Temporal correlation is perhaps the best-known ap- to describing the algorithm. Firstly, we present the evolu-
proach to time-series similarity computation [4, 6]. Statisti- tionary clustering algorithm (see Algorithm 1) which would
cal approaches have utilised probabilistic models for online require the centre recalculation algorithm (see Algorithm 2).
document clustering ([13]). Clustering has been applied to The centre recalculation function is the most important part
automatic discovery and retrieval of topically related mate- of the evolutionary clustering algorithm. It is responsible for
rial in data streams, and to detect novel events from a tem- relocating cluster centres, and reassigning the current input
porally ordered collection of news stories ([1]). However, the points to new clusters respectively. In line 8 of Algorithm 2
primary objective of topic detection and tracking is to detect we use greedy comparison to find a previous cluster instance
new events in a timeline using approaches such as clustering, to match with a given cluster; hence, a given cluster will be
and not to produce clusterings that incorporate history into matched with the closest cluster from a previous clustering
the clustering objective. One clustering mechanism that is step. Our evolutionary clustering mechanism is robust to
known to incorporate history into the clustering objective is handle multidimensional data as long as the cluster centres
evolutionary mechanism. In this light, our research explores contain numeric coordinates, and this is typical of social me-
evolutionary clustering to detect anomalous behaviours in dia data (as will be shown in the preliminary results of our
social network settings. evaluation).
Algorithm 1 Evolutionary Clustering Algorithm.
1: INPUT : InputData (that is, time series data)
2: OUTPUT : Output (that is, cluster centre locations
and cluster sizes)
3: Initialise Output ← ∅
4: Split InputData by timesteps
5: for each timestep in InputData do
6: CurrentData ← data in first timestep of InputData
7: InputData ← Remove CurrentData
8: Perform k-means clustering on CurrentData
9: if timestep > 1 then
10: Recalculate clustering centres by using historical
data as reference (see Algorithm 2)
11: else
12: Do Nothing
13: end if
14: end for
15: Generate Output using modified clustering information
16: Save CurrentData clustering centre locations and cluster
sizes for historical reference
17: return Output
Algorithm 2 Centre Recalculation Algorithm.
1: INPUT : previousTimeClusters
2: INPUT : currentTimeClusters
Figure 1: The plot at the third timestep.
3: INPUT : dataPoints
4: OUTPUT : clusterCentres (that is, a list of cluster
centre locations)
5: Initialise clusterCentres ← ∅
6: dimensionCount ← size of centre dimensions of current-
TimeClusters
7: for each center in currentTimeClusters do
8: find previous instance of centre from previous-
TimeClusters
9: apply centre recalculation function (see Equation 2)
10: save modified centre in currentTimeClusters
11: add centre to the list of cluster centres (that is,
clusterCentres)
12: end for
13: for each point in dataPoints do
14: find the shortest distance to a cluster centre in cur-
rentTimeClusters
15: assign point to the closest cluster found
16: end for
17: return Output
4. EVALUATION
In evaluating our work, we developed a social network
analysis tool which used Twitter social media platform as
an online and virtual space that enabled the creation of ad-
hoc online communities around a given topic or event (e.g.,
the 2014 scottish referendum - #indyref). We downloaded Figure 2: The plot at the ninth timestep.
10,000 timestamped tweets relating to the scottish referen-
dum and applied evolutionary clustering mechanism to the
collection of tweets. The tweets gathered covered the period
from 9th to 25th March 2015. Though the Scottish Refer- Count). One of our objectives was to see how the clusters
endum took place in September 2014, “#indyref” is still s evolved over time. Results show that in timestep 3 (illus-
fairly popular hashtag. trated in Figure 1, some patterns were already visible from
In our preliminary analysis, we attempted to cluster the the data, and by timestep 9 (depicted in Figure 2) 4 clusters
interactions in the online community based on the number had emerged. In both graphs we could spot the anomalies
of retweets of the various tweets gathered (that is, retweet- clearly as they do not fit into any of the clusters.
5. CONCLUSIONS Web, pages 201–210. ACM, 2007.
This paper presented a prototype evolutionary clustering [10] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina.
approach for detecting anomaly in online and virtual spaces. Web graph similarity for anomaly detection. Journal
We discussed how our approach could support decision mak- of Internet Service Applications, 1:19–30, 2001.
ers, designers and data scientists in the detection of anoma- [11] D. Savage, X. Zhang, X. Yu, P. Chou, and Q. Wang.
lies in online and virtual spaces. We used a freeform online Anomaly detection in online social networks. Social
social media community (Twitter) to explore how emerg- Networks, 39(0):62 – 70, 2014.
ing patterns within a dataset could help identify clusters of [12] I. H. Witten and E. Frank. Data Mining: Practical
players (or normal behaviour) and, conversely, anomalies. machine learning tools and techniques. Morgan
Preliminary results of our work show that our evolutionary Kaufmann, 2nd edition, 2005.
clustering approach is robust to handle multidimensional- [13] J. Zhang, Z. Ghahramani, and Y. Yang. A
ity in ad-hoc and unregulated settings presented by social probabilistic model for online document clustering
media communities. We are currently engaging in more de- with applications to novelty detection. In Proceedings
tailed analyses of the results of our experiments. In the of Advances in Neural Information Processing
future, we plan to investigate additional features that will Systems, 2005.
enable faster anomaly detection in online and virtual spaces.
Acknowledgments
The research described here is supported by the award made
by the Research Councils UK Digital Economy programme
to the dot.rural digital economy research hub, at University
of Aberdeen, UK (award reference: EP/G066051/1).
REFERENCES
[1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and
Y. Yang. Topic detection and tracking pilot study:
Final report. In Proc. of the DARPA Broadcast News
Transcription and Understanding Workshop, 1998.
[2] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum.
Hierarchical topic models and the nested chinese
restaurant process. In S. Thrun, L. Saul, and
B. Scholkopf, editors, The Semantic Web, ISWC 2002,
volume 16 of Advances in Neural Information
Processing Systems. 2004.
[3] D. Chakrabarti, R. Kumar, and A. Tomkins.
Evolutionary clustering. In Proceedings of the 12th
ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 554–560. ACM,
2006.
[4] C. Chatfield. The Analysis of Time Series. Chapman
and Hall, New York, NY, USA, 1984.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern
Classification. Wiley-Interscience, New York, NY,
USA, 2000.
[6] M. Gupta, J. Gao, Y. Sun, and J. Han. Community
trend outlier detection using soft temporal pattern
mining. In P. Flach, T. Bie, and C. Nello, editors,
Machine Learning and Knowledge Discovery in
Databases, volume 16, pages 692–708. Springer, Berlin
Heidelberg, 2012.
[7] M. Herbster and M. K. Warmuth. Tracking the best
linear predictor. Journal of Machine Learning
Research, 1:281–309, 2001.
[8] C. C. Noble and D. J. Cook. Graph-based anomaly
detection. In Proceedings of the Ninth ACM SIGKDD
International Conference on Knowledge Discovery and
Data Mining, pages 631–636. ACM, 2003.
[9] S. Pandit, D. Chau, S. Wang, and C. Faloutsos.
Netprobe: a fast and scalable system for fraud
detection in online auction networks. In Proceedings of
the 16th International Conference on World Wide