=Paper= {{Paper |id=Vol-1569/paper5 |storemode=property |title=ADOVA: Anomaly Detection in Online and Virtual spAces |pdfUrl=https://ceur-ws.org/Vol-1569/paper5.pdf |volume=Vol-1569 |authors=C. David Emele,Vitalij Spakov,Wei Pang,Jone Bone,George Coghill |dblpUrl=https://dblp.org/rec/conf/atal/EmeleSPBC15 }} ==ADOVA: Anomaly Detection in Online and Virtual spAces== https://ceur-ws.org/Vol-1569/paper5.pdf
     ADOVA: Anomaly Detection in Online and Virtual spAces

                 C. David Emele                          Vitalij Spakov                        Wei Pang
             RCUK dot.rural DE Hub                  Computer Science Dept.             Computer Science Dept.
            University of Aberdeen, UK             University of Aberdeen, UK         University of Aberdeen, UK
             c.emele@abdn.ac.uk     v.spakov@abdn.ac.uk     w.pang@abdn.ac.uk
                             Jone Bone             George Coghill
                                 Sociology Department               Computer Science Dept.
                               University of Aberdeen, UK          University of Aberdeen, UK
                                 j.bone@abdn.ac.uk                  g.coghil@abdn.ac.uk

ABSTRACT                                                           players in a short time interval. Many social networking
Online and virtual spaces comprise a myriad of ad-hoc net-         platforms have been developed which enable users to engage
works and online communities. Such communities are com-            socially with their environment and communities. Exam-
posed of smart devices, agents, systems and people who seek        ples of such platforms include Twitter, Facebook, Instagram,
to interact in one way or another. We argue that the task          and LinkedIn. Through these platforms, users connect with
of detecting anomalies in such settings is non-trivial. The        other users to form ad-hoc organisations. Such organisations
complexity is further compounded since there is no clear cut       often operate with little or no rules to guide their interac-
definition/specification of what normal behaviour is, and          tions with one another. Because of the transient nature of
how far out an outlier should be before it is detected as          these ad-hoc communities, detecting abnormal behaviours
an anomaly. This is often the case with online and virtual         (often referred to as anomaly or outlier) is a difficult task.
spaces as there is little or no regulation of the interactions        Anomaly is a common phenomenon in our everyday lives,
between the various players in online communities. Hence,          and could present grievious repercussions such as poten-
detecting anomalous behaviour in such settings poses a huge        tial security threat to lives and property. There is strong
challenge. In this paper, we investigate how evolutionary          evidence that detecting anomaly is a challenge in real-life
clustering could be exploited to support decision makers,          real-time interactions, particularly in social media. This is
designers and data scientists in the autonomous detection of       largely as a result of the amount of data produced in on-
anomalies in online and virtual spaces. We present prelimi-        line social interactions, the time and intelligence required
nary ideas in tackling this issue using a freeform online social   to trawl through massive quantities of data produced ev-
media community (Twitter) and explore how emerging pat-            ery second and the heterogeneity of the network. An even
terns and trends could help identify clusters of players (or       further complication is the fact that communities and organ-
normal behaviour) and, conversely, anomalies.                      isations formed on social networks are often open, informal
                                                                   and lack rigid rules for conformity - in order words, little
                                                                   or no control. In such scenarios, no-one defines what an
Categories and Subject Descriptors                                 acceptable behaviour is and so it is difficult to detect abnor-
H.4 [Information Systems Applications]: Miscellaneous              mal behaviour when normal behaviour is not clearly defined.
                                                                   Nevertheless, it is important to detect anomalous behaviour
General Terms                                                      and possibly respond to it early enough so as to mitigate
                                                                   imminent threats.
Algorithms, Experimentation, Human Factors                            We acknowledge that anomalies could be accidental or de-
                                                                   liberate and may be innocuous or harmful. Whatever the
Keywords                                                           case, it is important to develop tools that could help to de-
evolutionary clustering, anomaly, outliers, twitter, social me-    tect anomalies when they occur, especially in online and
dia, online communities, agent technology, virtual organisa-       virtual spaces, where there is neither regulation nor supervi-
tions                                                              sion. To date, many anomaly detection tools and algorithms
                                                                   have been developed, however they are often developed for
                                                                   offline settings and so would struggle to detect anomalies in
1.    INTRODUCTION                                                 online and virtual spaces. Some of the work that has been
  Recent advancements in technology and the Internet present       done in detecting anomaly in online settings often fail to
new spaces for communities to interact. Such spaces (re-           deal with scenarios where normal behaviour is not explicitly
garded here as online and virtual spaces) offer new opportu-       known, and often do not consider the ad-hoc nature of vir-
nities to engage in pervasive communication between many           tual spaces (where players can join and leave a community
                                                                   as they wish).
                                                                      In this paper, we propose the use of evolutionary clus-
                                                                   tering mechanisms to support data analysts and decision
                                                                   makers in detecting anomalies in ad-hoc and unregulated
                                                                   communities. The rest of this paper is formatted as follows:
Section 2 discusses as background some of the existing ap-            3.     EVOLUTIONARY CLUSTERING IN ON-
proaches that have been used in anomaly detection in the                     LINE AND VIRTUAL SETTINGS
literature. Section 3 describes our evolutionary clustering
approach and how it can aid anomaly detection in online                 Evolutionary clustering presents benefits that make it a
and virtual communities (e.g. social networks). Section 4             robust framework for analysing interactions in online and
presents preliminary results of our evaluation, while Section         virtual spaces. For example, evolutionary clustering enables
5 concludes with a brief discussion of the results as well as         clusters to have smoother and more natural transitions over
future directions.                                                    time since new information is added to existing data, which
                                                                      could cause cluster centres to shift. Secondly, evolutionary
                                                                      algorithm enables a new cluster to resemble the cluster in
2.    BACKGROUND                                                      the previous timestep, if at all possible, which we refer to
   This section presents a brief background on anomaly de-            as consistency. On th other hand, if the new data is not
tection in general. It often involves a set of observations           representative of the cluster population then it registers the
over a time period such that X = {x1 , x2 , x3 , · · · xn }. Social   new data as an anomaly. Regarding noise, evolutionary al-
media data is largely noisy and so differentiating noise and          gorithm is robust against noise because previous data points
anomaly could be a daunting task. Being able to automat-              are considered in generating new clusters.
ically detect anomalies in such noisy, ad-hoc environment               We proceed by presenting our evolutionary clustering al-
presents huge research challenges. Although anomaly detec-            gorithm, which has strong roots in the algorithm proposed
tion has been extensively researched in the literature, exist-        by Chakrabarti et al. [3]. Let t denote a timestep and j be
ing approaches do not adequately detect anomalies in online           the index of a cluster such that ctj denotes a cluster centre j
social networks due to the uniqueness of the environment in           at time t. A matching cluster is represented as ct−1f (j) , where
terms of noise, little or no regulation, and ”ad-hocness”.            t − 1 depicts a previous timestep and f (j) denotes a clus-
   Anomaly detection is a well-studied problem in the liter-          ter matching function that matches a given cluster with the
ature (see [11]) and several techniques including statistical         most similar cluster in the previous timestep. We calculate
([13]), graph-based ([8]), network learning ([9, 10]) and clus-       the relative cluster size, denoted by γ, as follows:
tering ([5, 12]) have been applied. However, to the best of
our knowledge, evolutionary clustering has not been con-
sidered/applied in anomaly detection. Furthermore, it has                                                  ntj
                                                                                               γ=                                      (1)
not been considered in online social networks where net-                                             (ntj + nt−1
                                                                                                               f (j) )
work configurations and composition involve heterogenous
connections that are constantly changing. The ad-hoc and                The relative cluster size is an important statistic for cal-
“unregulated” nature of social interactions in online commu-          culating a new centre for a given cluster. Another impor-
nities pose interesting challenges which evolutionary cluster-        tant variable is the change parameter (cp, for short), which
ing could help to resolve (as presented in Section 3). As a           specifies the trade off between two cluster centres. Change
background, we discuss related research that has looked into          parameter is user defined and ranges from 0 to 1 (that is,
some of the temporal issues involved in anomaly detection             0 < cp < 1). Putting it all together, we define a centre
including online topic detection, classification, tracking and        recalculation function, denoted as g(c, t, j 0 ) as follows:
load balancing. For example, some work in machine learn-
ing allow models learned in classification settings to evolve
over time but with penalties for errors and shifts introduced              g(c, t, j 0 ) = (1 − γ) × cp × ct−1                     t
                                                                                                           f (j) + γ × (1 − cp) × cj   (2)
at each timestep ([7]). It is, nevertheless, unclear how such
algorithms could be employed in ad-hoc, fast changing, het-              The centre recalculation function (depicted in Equation
erogenous, unregulated and unsupervised learning settings.            2) calculates a new center location that has to be between a
   Some online document clustering research has explored              current cluster center retrieved from non-evolutionary clus-
temporal aspects, in which a time series of documents are             tering, and the one matched from a previous clustering while
examined sequentially to detect novelty regarding certain             taking into account the relative sizes of both clusters and
features of the network ([2]). Research in machine learn-             change parameter values defined by the user.
ing, data mining, statistics, and cloud computing has ex-                Having defined the variables that our evolutionary clus-
plored a number of approaches for clustering time-series              tering mechanism depends on, we now turn our attention
data. Temporal correlation is perhaps the best-known ap-              to describing the algorithm. Firstly, we present the evolu-
proach to time-series similarity computation [4, 6]. Statisti-        tionary clustering algorithm (see Algorithm 1) which would
cal approaches have utilised probabilistic models for online          require the centre recalculation algorithm (see Algorithm 2).
document clustering ([13]). Clustering has been applied to            The centre recalculation function is the most important part
automatic discovery and retrieval of topically related mate-          of the evolutionary clustering algorithm. It is responsible for
rial in data streams, and to detect novel events from a tem-          relocating cluster centres, and reassigning the current input
porally ordered collection of news stories ([1]). However, the        points to new clusters respectively. In line 8 of Algorithm 2
primary objective of topic detection and tracking is to detect        we use greedy comparison to find a previous cluster instance
new events in a timeline using approaches such as clustering,         to match with a given cluster; hence, a given cluster will be
and not to produce clusterings that incorporate history into          matched with the closest cluster from a previous clustering
the clustering objective. One clustering mechanism that is            step. Our evolutionary clustering mechanism is robust to
known to incorporate history into the clustering objective is         handle multidimensional data as long as the cluster centres
evolutionary mechanism. In this light, our research explores          contain numeric coordinates, and this is typical of social me-
evolutionary clustering to detect anomalous behaviours in             dia data (as will be shown in the preliminary results of our
social network settings.                                              evaluation).
Algorithm 1 Evolutionary Clustering Algorithm.
 1: INPUT : InputData (that is, time series data)
 2: OUTPUT : Output (that is, cluster centre locations
    and cluster sizes)
 3: Initialise Output ← ∅
 4: Split InputData by timesteps
 5: for each timestep in InputData do
 6:    CurrentData ← data in first timestep of InputData
 7:    InputData ← Remove CurrentData
 8:    Perform k-means clustering on CurrentData
 9:    if timestep > 1 then
10:       Recalculate clustering centres by using historical
          data as reference (see Algorithm 2)
11:    else
12:       Do Nothing
13:    end if
14: end for
15: Generate Output using modified clustering information
16: Save CurrentData clustering centre locations and cluster
    sizes for historical reference
17: return Output


Algorithm 2 Centre Recalculation Algorithm.
 1: INPUT : previousTimeClusters
 2: INPUT : currentTimeClusters
                                                                      Figure 1: The plot at the third timestep.
 3: INPUT : dataPoints
 4: OUTPUT : clusterCentres (that is, a list of cluster
    centre locations)
 5: Initialise clusterCentres ← ∅
 6: dimensionCount ← size of centre dimensions of current-
    TimeClusters
 7: for each center in currentTimeClusters do
 8:   find previous instance of centre from previous-
      TimeClusters
 9:   apply centre recalculation function (see Equation 2)
10:    save modified centre in currentTimeClusters
11:    add centre to the list of cluster centres (that is,
      clusterCentres)
12: end for
13: for each point in dataPoints do
14:    find the shortest distance to a cluster centre in cur-
      rentTimeClusters
15:    assign point to the closest cluster found
16: end for
17: return Output


4.   EVALUATION
   In evaluating our work, we developed a social network
analysis tool which used Twitter social media platform as
an online and virtual space that enabled the creation of ad-
hoc online communities around a given topic or event (e.g.,
the 2014 scottish referendum - #indyref). We downloaded               Figure 2: The plot at the ninth timestep.
10,000 timestamped tweets relating to the scottish referen-
dum and applied evolutionary clustering mechanism to the
collection of tweets. The tweets gathered covered the period
from 9th to 25th March 2015. Though the Scottish Refer-         Count). One of our objectives was to see how the clusters
endum took place in September 2014, “#indyref” is still s       evolved over time. Results show that in timestep 3 (illus-
fairly popular hashtag.                                         trated in Figure 1, some patterns were already visible from
   In our preliminary analysis, we attempted to cluster the     the data, and by timestep 9 (depicted in Figure 2) 4 clusters
interactions in the online community based on the number        had emerged. In both graphs we could spot the anomalies
of retweets of the various tweets gathered (that is, retweet-   clearly as they do not fit into any of the clusters.
5.   CONCLUSIONS                                                     Web, pages 201–210. ACM, 2007.
   This paper presented a prototype evolutionary clustering     [10] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina.
approach for detecting anomaly in online and virtual spaces.         Web graph similarity for anomaly detection. Journal
We discussed how our approach could support decision mak-            of Internet Service Applications, 1:19–30, 2001.
ers, designers and data scientists in the detection of anoma-   [11] D. Savage, X. Zhang, X. Yu, P. Chou, and Q. Wang.
lies in online and virtual spaces. We used a freeform online         Anomaly detection in online social networks. Social
social media community (Twitter) to explore how emerg-               Networks, 39(0):62 – 70, 2014.
ing patterns within a dataset could help identify clusters of   [12] I. H. Witten and E. Frank. Data Mining: Practical
players (or normal behaviour) and, conversely, anomalies.            machine learning tools and techniques. Morgan
Preliminary results of our work show that our evolutionary           Kaufmann, 2nd edition, 2005.
clustering approach is robust to handle multidimensional-       [13] J. Zhang, Z. Ghahramani, and Y. Yang. A
ity in ad-hoc and unregulated settings presented by social           probabilistic model for online document clustering
media communities. We are currently engaging in more de-             with applications to novelty detection. In Proceedings
tailed analyses of the results of our experiments. In the            of Advances in Neural Information Processing
future, we plan to investigate additional features that will         Systems, 2005.
enable faster anomaly detection in online and virtual spaces.

Acknowledgments
The research described here is supported by the award made
by the Research Councils UK Digital Economy programme
to the dot.rural digital economy research hub, at University
of Aberdeen, UK (award reference: EP/G066051/1).

REFERENCES
 [1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and
     Y. Yang. Topic detection and tracking pilot study:
     Final report. In Proc. of the DARPA Broadcast News
     Transcription and Understanding Workshop, 1998.
 [2] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum.
     Hierarchical topic models and the nested chinese
     restaurant process. In S. Thrun, L. Saul, and
     B. Scholkopf, editors, The Semantic Web, ISWC 2002,
     volume 16 of Advances in Neural Information
     Processing Systems. 2004.
 [3] D. Chakrabarti, R. Kumar, and A. Tomkins.
     Evolutionary clustering. In Proceedings of the 12th
     ACM SIGKDD international conference on Knowledge
     discovery and data mining, pages 554–560. ACM,
     2006.
 [4] C. Chatfield. The Analysis of Time Series. Chapman
     and Hall, New York, NY, USA, 1984.
 [5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern
     Classification. Wiley-Interscience, New York, NY,
     USA, 2000.
 [6] M. Gupta, J. Gao, Y. Sun, and J. Han. Community
     trend outlier detection using soft temporal pattern
     mining. In P. Flach, T. Bie, and C. Nello, editors,
     Machine Learning and Knowledge Discovery in
     Databases, volume 16, pages 692–708. Springer, Berlin
     Heidelberg, 2012.
 [7] M. Herbster and M. K. Warmuth. Tracking the best
     linear predictor. Journal of Machine Learning
     Research, 1:281–309, 2001.
 [8] C. C. Noble and D. J. Cook. Graph-based anomaly
     detection. In Proceedings of the Ninth ACM SIGKDD
     International Conference on Knowledge Discovery and
     Data Mining, pages 631–636. ACM, 2003.
 [9] S. Pandit, D. Chau, S. Wang, and C. Faloutsos.
     Netprobe: a fast and scalable system for fraud
     detection in online auction networks. In Proceedings of
     the 16th International Conference on World Wide