<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Anomaly and Event Detection for Unsupervised Athlete Performance Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jim O' Donoghue</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Roantree</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bryan Cullen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niall Moyna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Conor O Sullivan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew McCarren</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight: Centre for Data Analytics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Health and Human Performance, Dublin City University</institution>
          ,
          <addr-line>Glasnevin, Dublin 9</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <fpage>205</fpage>
      <lpage>217</lpage>
      <abstract>
        <p>There are many projects today where data is collected automatically to provide input for various data mining algorithms. A problem with freshly generated datasets is their unsupervised nature, leading to di culty in tting predictive algorithms without substantial manual effort. One of the rst steps in dataset preparation and mining is anomaly detection, where clear anomalies and outliers as well as events or changes in the pattern of the data are identi ed as a precursor to subsequent steps in data mining. In the research presented here, we provide a multi-step anomaly detection process which utilises di erent combinations of algorithms for the most accurate identi cation of outliers and events.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Anomaly detection is an important component in data science. In many
situations, researchers are confronted with datasets which possibly contain a large
number of features and more often than not, incorporates outliers and missing
data. Implementing dimensionality reduction and incorporating cluster
analysis techniques such as K -means are commonly used in performing unsupervised
learning tasks in such data. In fact, principal components are the continuous
solutions to the discrete cluster membership indicators for K -means clustering
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Anomalies are generally de ned as unusual events which occur within a
dataset, where a subset of these events are outliers. Outliers are occurrences
that make either no physical sense, or appear so extreme they are considered
probabilistically infeasible. The identi cation of outliers and anomalies is critical
in avoiding poorly- tting models for many machine learning algorithms [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
1.1
      </p>
      <sec id="sec-1-1">
        <title>Problem Background</title>
        <p>
          The physiological demands of any sport are determined largely by the activity
patterns of the game. Similar to other team sports, Gaelic football [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] involves
repeated, short duration, high intensity bouts of anaerobic exercise interspersed
with sustained light to moderate aerobic activity. The duration of these intervals
of high intensity are largely unpredictable, due to the fact that they are imposed
by the pattern of play, and can vary greatly from player to player and from one
game to another. On average, senior players perform 96 bursts of high intensity
activity lasting 6 seconds followed by an average recovery of 37 seconds. These
players typically work at 80% of their maximum heart rate (HRmax) and cover
an average distance of 8.5 km during competitive games. Superimposed on the
physiological demands of match play are key technical activities such as winning
possession of the ball, evading opponents and breaking tackles which involve
high running velocities, agility, strength and power.
        </p>
        <p>
          As part of the process for collecting data on each athlete, global
positioning software (GPS) has become increasingly popular among sport scientists as
a method of tracking movement patterns in many eld based sports [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
Modern GPS devices are portable, robust and lightweight making them particularly
suited to eld based sports. From a sports science perspective, the initial aim is
to evaluate the characteristics and tness levels of Gaelic football players and
compare the physical and tness characteristics relative to each playing position.
The subsequent goal is to predict when these players are approaching or have
reached optimal performance level. This requires the generation of a su ciently
rich dataset to build an initial model before it can be used in a real time
environment. At each of 17 competitive games, 10 out of 15 players in the team
were tted with appropriate sensor devices to record heart rate, speed, distance
covered, GPS location and accelerometer values, recording at multiple times per
second. The resulting dataset contained in excess of 200 million values. Simple
detection methods [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] can be useful for more obvious outliers, but encounter
limitations in discerning more subtle anomalies. Due to the nature of contact
sport, the devices incur a number of blows during each game, introducing many
potential anomalies. The work presented here focuses on anomaly detection in
unsupervised data.
        </p>
        <p>
          Contribution. If one uses unsupervised clustering techniques such as K
means to determine an anomaly, then K (proposed number of clusters) can be
calculated as part of the X-means cluster estimation technique [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. However,
such algorithms rely on the choice of good initial starting points [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to nd
workable solutions. Our contribution is the development of an unsupervised
outlier detection algorithm for time series data which employs both univariate and
multivariate approaches for a more accurate detection rate and further our
previously developed learning framework [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to incorporate anomaly detection as
well as classi cation. The univariate method is based on the approach taken in
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] but extends this work to manage time series data, while the multivariate
approach builds upon the work of [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and introduces a secondary decision
statistic which detects when variables are unusually static. In dynamic environments
such as GAA matches extremely static measurements are equally as anomalous
as those which are extremely varying.
        </p>
        <p>Paper Structure. The paper is structured as follows: in Section 2, we discuss
related work in the area; in Section 3 we provide a description of our approach
and detail how we identify and classify anomalies; we describe our experimental
setup together with an evaluation of the results in Section 4; and nally, we
present our conclusions in Section 5.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related Research</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the authors present a novel approach for anomaly detection incorporating
both density and grid-based clustering algorithms. Their primary focus is high
dimensional data and they test their algorithm on the KDD Cup 1999 network
dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The approach taken was to optimise the pMa a algorithm, using a
Frequency-Pattern tree in an intermediate step in order to improve the detection
rate. Similar to our approach, they provide an unsupervised anomaly detection
algorithm. However, in their evaluation it was shown that the improvement in
detection rate had a negative side e ect in generating a higher number of false
positives. By their own admission, the algorithm works best for datasets with
certain characteristics i.e. data points sought will be statistically di erent from
normal data. This means that if there is an entire window of anomalous data,
this may a ect the performance of the detection method.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors present an algorithm for anomaly detection in
multivariate time series data. Their goal is to capture relationships across variables and
by doing so, identify di erent types of anomalies that occur in the time series
dataset. As we use a real-world dataset, our comments concern their
evaluation with the several time series datasets from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and not the experiments with
synthetic data. The evaluation used a sliding window of length 6 and clearly
demonstrates that for time series, a subsequence of data points outperforms
the basic data point approach, which is similar to our ndings. Apart from the
fact that our research is based fully on a real-world dataset using unsupervised
learning, we also employ both univariate and multivariate algorithms to deliver
a higher performance in anomaly and outlier detection.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] developed the Robust Support Vector Machine that
demonstrates its ability to identify images (bullet holes) when outliers exist. This
algorithm is an improvement on the standard support vector machine (SVM)
algorithm as the incorporation of the averaging technique to an SVM makes the
decision function less susceptible to outliers and thus, avoids over tting. The
process could be used to identify outliers however it requires a supervised training
dataset which, as with our work, is not always available.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the authors propose an extension to K -means algorithm called X
means to identify outliers in Gaussian datasets without specifying the initial
number of suspected clusters. The algorithm performed exceptionally well with
regard to identifying the exact number of clusters and functioned
commensurately against the K -means algorithm. However, the X -means algorithm is
vulnerable to initial estimates and may attain a sub-optimal minima.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors demonstrate mathematically that principal components
are the continuous solutions to the discrete cluster membership indicators for
K -means clustering. This idea is extended in our work by using principal
components as the basis for a decision based system to detect outliers in truly
unsupervised data. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the authors propose a novel Principal Components classi er
in order to detect anomalies in the case of network intrusion identi cation on the
KDD Cup 1999 network dataset, whose aim was to detect attacks on network
access data. While they produced a false hit rate of only 1% and their PCC
remained robust to false positives, all the other metrics degraded signi cantly
in terms of quality. Our approach uses the PCC but extends it to detect highly
static variables with a secondary chi-squared decision statistic. Furthermore, our
training data is real-world, containing anomalous examples, whereas in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] their
classi er was trained on completely clean, non-anomalous data.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Outline Approach</title>
      <p>Our anomaly detection algorithm has 4 primary components: Boundary
Detection, Univariate Outlier Detection, Principal Component Transformation and
Principal Component Classi cation. The role of the algorithm is to detect
anomalies and classify these as outliers (data points which are far outside the expected
norm) or events (samples which demonstrate a clear shift in the pattern of the
data). Each of the algorithm's components detects anomalies within the dataset
with the exception of Principal Component Transformation which transforms
the data only. We now provide a brief overview of each component.
3.1</p>
      <sec id="sec-3-1">
        <title>Boundary Detection</title>
        <p>This represents a pre-processing stage where clearly erroneous data points are
removed. This can only take place for those features where a domain expert has
clearly speci ed boundaries, outside which data values make no sense. For
example, if a player had a heart-rate below 40bpm or above 250bpm, the hardware
has clearly malfunctioned. The process eliminates obvious errors so that later
calculations are not a ected.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Univariate Outlier Detection</title>
        <p>
          This stage is based on Chauvnet's method, an approach for univariate outlier
detection found in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Euclidean distance [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] has shown to be of little value
with this type of time series data as it detects far too many outliers and thus,
excludes large amounts of data. Early experiments with Euclidean distance with
sport scientists for manual evaluation con rmed this assumption. A rst step
marks a sample as an anomaly low or anomaly high, while a second classi es
these anomalies as either: outlier, event or untrue (not an anomaly). Before
detecting anomalous values, the algorithm rst calculates summary statistics of
mean, variance and standard deviation, denoted by x; 2, and respectively, for
each feature xi in the dataset X. Unlike [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], where anomalies are detected with
standard deviation alone, our univariate approach incorporates time di erencing
and compares the current time di erence against previous time di erences. If it
is signi cantly di erent, we then compare with the future time di erence to
con rm the data point is an anomaly or outlier. This is achieved by iterating
through each feature and examining every time point with a t -distribution
coe cient for natural con dence intervals on the di erenced data (which removes
any non-stationary components of the data), a crucial factor for time series data
as it is non-stationary.
        </p>
        <p>The rst steps of the algorithm presented get the dimensions of the dataset
jXj = (T x n), where T is the number of time-points or samples and n is the
number of features where 8xt;i 2 X; t 2 (1; 2; : : : ; T 1; T ) and i 2 (1; 2; : : : ; n 1; n).
Subsequently a t-distribution coe cient of 3 was chosen to give approximately
a 99.9% con dence interval in order to detect anomalies with an of 0.001 for
each feature, where is the probability of a false alarm. In concrete terms,
this coe cient is multiplied by the standard deviation both positively and
negatively ( 3 xi) for each feature in order to calculate upper and lower bounds
for anomalies.
3.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>Principal Component Transformation</title>
        <p>The aim of this step is to compute key characteristics and to transform the data
for the nal stage in the algorithm.
1. Recalculate the summary statistics from the previous stage. This is necessary
due to removed outliers.
2. Impute the missing values.
3. Calculate the correlation matrix in order to provide input to the Principal</p>
        <p>Components Analysis.
4. Calculate eigenvectors and eigenvalues. The eigenvectors enable the creation
of an orthogonal representation of the dataset, used to derive the principal
components. Eigenvalues measure the energy contribution of each of the
principal components, as well as providing input into the principal components
classi er.
5. Standardise each feature to have unit variance. For each feature x, this
involves subtracting the mean, x from x and dividing the result by the standard
deviation, 2x.
6. Compute the transformed dataset. Principal Component Analysis (PCA)
provides an orthogonal representation of the data, describing it in terms of
the axes of most variation for each component.</p>
        <p>
          Before transforming the data into its principal components, it is di erenced
at a one second time lag and transformed into a 5 second moving average, centred
on the value being transformed, essentially ltering noise from the data. Missing
values are imputed with the R Amelia [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] package. Amelia is a multi-variate
imputation mechanism which infers missing data in a single-cross section from
times series and is the only R component in a Python application. The use
of R was necessary as Amelia was not available in Python and was evaluated
to best suit our imputation needs. Employing bootstrapping and Expectation
Maximisation, it allows for imputation from the posterior distribution of the
data.
        </p>
        <p>After imputation, it is necessary to determine how much the features change
together by calculating the correlation matrix. The dimensions of this matrix
are (p x p) where p is the number of features in the dataset.</p>
        <p>Algorithm 1 Data Transform
1: function DataTransform(X)
2: Xlagl moving avrg(X) . transform data into moving average at a lag of l
3: Ximp amelia( Xlagl) . impute missing data with Amelia on the di erences
(xitm;i p xiimp)(yti;mi p yiimp)</p>
        <p>T 1 . calculate covariance
4: cov(Ximp; Yimp) = PtT=1
5: corr(Ximp; Y imp) = cov(XXiimmppY;Yi mimpp)
6: E = (e1 : : : ep) and evals = 1 : : : p
7: for xt;i in xi where t 2 0 : : : T do
8: Zt;i = xi;t2xixi
9: end for
10: P = (E&gt;Z&gt;)&gt;</p>
        <p>return Ximp; E; evals, P
11: end function
. calculate correlation
. calculate the eigenvalues and vectors</p>
        <p>. standardise the data
. calculate principal components</p>
        <p>The eigenvectors E are then calculated on the non-standardised data using
the correlation matrix (whose calculation e ectively standardises the data) where
each eigenvector ei 2 e1; e2; : : : ; ep. Eigenvalues 1; : : : ; p are also calculated.
Once the data is standardised as Z, the result is multiplied with the eigenvectors
which gives the principal components of the original dimensions (T x n).
3.4</p>
      </sec>
      <sec id="sec-3-4">
        <title>Principal Component Classi cation</title>
        <p>The nal stage has three main steps.
1. Calculate the number of major and minor components to use by calculating
the cumulative percentage of the total eigenvalue energy for each.
2. Calculate the classi cation value or test statistic for each sample.
(a) Sum the major components divided by their eigenvalues, giving you the
chi-squared test-statistic.</p>
        <p>(b) Calculate the same value for minor components.
3. Classify anomalies using signi cance values calculated with the chi-squared
distribution
(a) Use the test statistic generated from step 2, compute the decision
statistics with the chi-squared cumulative distribution function and compare
this to a chi-squared distribution with num components degrees of
freedom and all false alarm rates providing the con dence interval. The
chi-squared distribution was employed as we observed that the
distribution of di erenced, standardised variables at the univariate stage
demonstrated a normal distribution.
(b) If con dence interval exceeds either of the chosen decision statistic
thresholds (e.g. &gt; 95% or &lt; 0.001%), for either the major or the minor
components test statistics (who have the same false alarm pairs), the data
instance is classi ed as anomalous.</p>
        <p>In algorithm 3, we rst calculate the number of principal components involved
and determine how much variation in both the major and minor components is
to be included in the classi er. The percentage variation thresholds are set and
subsequently the number of components to use are calculated with Algorithm
2 which takes the eigenvalues, type of components being summed (string of
'major' or 'minor') and the desired percentage eigenvalue energy (variation) as
parameters. Eigenvalues are initially summed to calculate the total variance
and then the parameter num components and current sum (running total) are
initialised to zero before being calculated.</p>
        <p>In lines 5 to 12 of algorithm 2, for each eigenvalue there is a check to see
if the current percent variance sum is less than the desired variance. If this is
the case, the number of components is incremented and added to variance sum
is the current eigenvalue divided by the eigenvalue total sum, as this gives the
percentage variance. It is worth noting that if it is the major components sum,
we begin at i = 1 to start with the major components but with the minor
components, we begin with the last eigenvalue i = p 1 where p is the total
number of eigenvalues.</p>
        <p>Once the number of major components q and the number of minor
components r are found, the chi-squared test statistics are then calculated for both
component types by summing the value for the component pi at time-point t
divided by the appropriate eigenvalue i up until the number of components is
exhausted. In the case of the major components, the process begins at 1 and
stops at component q whereas in the case of the minor components, it begins at
the last component p and sum to p r as shown in lines 7 to 12.</p>
        <p>In lines 13 and 14, the decision statistic is calculated by 1 the chi-squared
cumulative distribution function with q degrees of freedom for the major
components and r for the minor components. Given that our data follows a multivariate
normal distribution the overall false alarm rate is given by Equation 1.
total =
major +
minor
major minor
(1)</p>
        <p>
          We then chose the varying and static false alarm rates large and static
(line 15). If a particular decision statistic is greater than 1 static or less than
1 large i.e. a certain signi cance threshold, the row is classi ed as anomalous
Algorithm 2 Calculate Number of Components
14:
15: clrg = 1 large . false alarm rates lmaragje = lmaringe
16: cstat = 1 static . smtaatjic = smtaintic
17: for all major;t 2 Decisionmaj and minor;t 2 Desicisionmin do
18: if major;t &lt; clarge j minor;t &lt; clrg j major;t &gt; cstat j minor;t &gt; cstat then
19: Xt is anomalous
20: Anomaliest T rue
21: end if
22: end for
23: return Anomalies
24: end function
. For each sample
as in line 18 of Algorithm 3. This implies a very large or very small degree of
variation at this time-point. Our extension to the PCC captures where there is
very little or no variation from sample to sample. Our features should be
nonstatic and should be constantly changing, this minor variation parameter was a
very important factor and was not incorporated by [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Finally, as our aim was
to identify anomalous time-periods as opposed to particular time-points (as a
time-point itself is not anomalous) and due to our rolling average transformation,
we incorporated time-points within an 11 point centred window as anomalous.
4
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Analysis</title>
      <p>In this section, we brie y describe our dataset, approach to evaluation and
provide a detailed analysis of experiments and results.
4.1</p>
      <sec id="sec-4-1">
        <title>Experiment Setup</title>
        <p>
          Experimental Set-up. Experiments were performed on a Dell Optiplex 790
running 64-bit Windows 7 Home Premium SP1 with an Intel Core i7-2600
quadcore 3.40 GHz CPU and 16.0GB of RAM. The code for the experiments was
developed in Python using the Enthought Canopy (1.5.4.3105) distribution of
64-bit Python 2.7.6 and developed in PyCharm 4.5 IDE, making use of NumPy
1.8.1-1[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], Pandas 0.16.0[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and SciPy 0.15.1[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for mathematical and statistical
operations and data manipulation. The imputations were performed with R and
Amelia, package version 1.0 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>Dataset. For our evaluation, we used the results of one match with 81,165
instances in the dataset. The hardware devices generate at least 10 sets of
measures per second, which were averaged giving one set of measures per second and
providing just under 8,000 instances. As only heart rate and distance covered
were used in this experiment and only 9 players generated data, each instance
had 18 features per second, namely 9 sets of heart rates and distances. We
selected the data beginning at the pre-match warm-up, until 4 minutes and 30
seconds after the end of the match which left a total 6,211 instances.</p>
        <p>Evaluation Design. The design of the outlier detection algorithm allows for
evaluation of di erent processes in combination. The univariate (P2) and
multivariate steps (P3 &amp; P4) were evaluated in isolation and with bounds detection
(P1) added. We also evaluated after the univariate step (P1 &amp; P2), without the
univariate step (P1, P3, &amp; P4) and with both (P1, P2, P3, &amp; P4).</p>
        <p>Before evaluating combinations of processes, we rst tested various numbers
of major components in isolation, at various false alarm rates before performing
the same evaluation for the minor components in the PCC. The secondary alpha
measure added to P4 to detect unusually static time-points was kept at 0.01%
for all experiments, keeping the measure sensitive. Window-size was also not
varied and kept at 11, to account for the rolling average transformation.</p>
        <p>Each anomaly detection algorithm was trained on the dataset in its entirety
without partition. For testing, a random subset of the dataset was presented
to the sports scientist involved in the original data collection for classi
cation. He classi ed the subset as 69.89% anomalous with the remainder
being non-anomalous. We then examined and cross-referenced the relevant
subset of each result (from each con guration) and calculated an accuracy score
for each. The evaluation metric use for all experiments was accuracy where
accuracy = T PP ++TNN .
4.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Evaluation: Results and Analysis</title>
        <p>Tuning the PCC (P3 &amp; P4) Table 1 shows the results of our empirical search
for the parameters to use in our PCC (P3 &amp; P4). We rst vary the percentage
of total contribution to the classi er by the major components only (r = 0), in
a range from 30% to 70% (columns 1 to 4) and then repeat the process with the
minor components only (q = 0), in a range of 10% to 20%. We evaluate both at
various false alarm rates , in increments from 1% to 10%.
The results in Table 1 furnishes us with interesting ndings. First, the
highest accuracy achieved with the major components (MajC) alone was just over
50% with an = 10%. This was found at both the 30% and 70% contributions
respectively. Increasing the number of major components actually decreased the
accuracy for higher false alarm rates but slightly improved the values for lower
false alarm rates. Our second nding is that when only the minor components
were tested (MinC), a higher accuracy was immediately achieved. This is
surprising as the major components explain the major variation in the dataset leading
us to the conclusion that, at least for the dataset in question the components
that contribute less to the dataset as a whole actually have greater capacity for
modelling anomalies, suggesting that anomalous examples in this data are subtle
and contribute to noise in the dataset as minor components generally contain a
relatively high degree of the noise in a dataset.</p>
        <p>After the analysis of Table 1 we chose the percentage contribution for the
Major components to be 70% and those of the minor to be 20%. The ROC
curve of this can be seen in Figure 1. Our rst observation was the results
were not optimal, at 66.67% accuracy this is just a 7% increase to using the
minor components in isolation. When we again analysed the results of Table 1
we decided to test a con guration where an even greater emphasis was given
to the minor components and less to the major components, as we realised
that increasing the major components from 30% to 70% gave no great gains in
accuracy. We chose to double the contribution from the minor components to
40%. The results again improved with an accuracy increase of just under 11% to
70.97% compared to using either of the major or minor components in isolation,
and can also be seen in Figure 1.
Comparing Processes. Table 2 shows the comparison of the events found
in P2 to the anomalies found with P3/P4, when taken as components in isolation,
to those where P1 was coupled with P2 and with P3/P4. We chose to exclude
the results from P1-P2-P3-P4 as only six clear outliers were found in P2 to be
excluded from the P3/P4 processes and therefore did not have a great impact
on the model. The results of the process in its entirety is also shown in Figure
1, demonstrating a clear gain in accuracy.</p>
        <p>Measure P2 Chvnts. Univariate P3/P4 PCC P1+P2 P1+P3/P4
Accuracy 0.5376 0.7097 0.8495 0.8925</p>
        <p>As we can see from 2, the PCC multivariate anomaly detector is far more
accurate than the univariate outlier process, but once P1 is added, the results
approach in accuracy. This is primarily due to zero values now being classed as
anomalies from the added boundary detection step. A number of samples given
to the sports scientist were found to contain zeros which he immediately classed
as anomalies. Further evaluation could perhaps exclude samples containing zeros
as these classi cations do not accurately test the algorithm as this fundamental
step is easy to compute and from the sports scientist's perspective, not di cult
to identify via manual inspection. Given this caveat, once P1 was added we still
achieved a nal classi cation accuracy of 0.8925 once our algorithmic
components were combined, showing the performance of the process as a whole was
greater than any constituent process in isolation.</p>
        <p>Some general ndings include that anomalies (when the rolling window was
not used), often occurred together in a sequential series. This gives credence to
the hypothesis that in our time-series dataset, anomalies occurred in windows
rather than at particular time-points. Furthermore, a subset of the events found
from P2 and the anomalies found with P3/P4 overlapped at certain points,
indicating strong events at these points. Finally, in an examination of the false
positives, we found a number of the distance variables actually remained static,
an unusual event for a GAA match and similarly for other sports events. Further
evaluation will involve testing this hypothesis with the sports scientists, as we
posit certain events even in the relatively small evaluation subset could have
been missed upon manual inspection, due to fatigue or other factors incurred by
examining vast swathes of data by eye. This secondary analysis could provide
further motivation to this work, that is identifying anomalies in data that might
have been missed by manual inspection.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>Newly generated datasets often prove di cult for data mining as they can
contain erroneous data-points and are often unsupervised (without classi cations);
datasets produced in areas such as sports science exemplify this. The rst step in
addressing these problems is anomaly detection, both to remove or adjust those
values which are clear outliers, and to detect patterns which are signs of an event
or change in the data. In this paper, we presented a novel anomaly detection
algorithm which utilises both univariate and multivariate steps enabling us to
determine which approach works best for a unsupervised time series dataset.
Our results demonstrated the e ectiveness of a combined approach when
compared to even an improved univariate approach and that a multivariate approach
outperforms it's univariate counterpart when used in isolation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially funded by FP7 project Grant Agreement Number 304979
and also by Science Foundation Ireland under grant number SFI/12/RC/2289.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Rules of gaa football,
          <year>2015</year>
          . Online; last accessed:
          <volume>09</volume>
          /07/2015;
          <article-title>www.gaa.ie/ about-the-gaa/our-games/football/rules.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Yanping</given-names>
            <surname>Chen</surname>
          </string-name>
          , Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and
          <string-name>
            <given-names>Gustavo</given-names>
            <surname>Batista</surname>
          </string-name>
          .
          <article-title>The ucr time series classi cation archive</article-title>
          ,
          <year>July 2015</year>
          . www.cs.ucr.edu/~eamonn/time_series_data/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Haibin Cheng,
          <string-name>
            <surname>Pang-Ning</surname>
            <given-names>Tan</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Potter</surname>
          </string-name>
          , and
          <article-title>Steven A Klooster. Detection and characterization of anomalies in multivariate time series</article-title>
          .
          <source>In SDM</source>
          , pages
          <volume>413</volume>
          {
          <fpage>424</fpage>
          .
          <string-name>
            <surname>SIAM</surname>
          </string-name>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Chris</given-names>
            <surname>Ding</surname>
          </string-name>
          and
          <string-name>
            <given-names>Xiaofeng</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>K-means clustering via principal component analysis</article-title>
          .
          <source>In Proceedings of the twenty- rst international conference on Machine learning, page 29. ACM</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>AM</given-names>
            <surname>Fahim</surname>
          </string-name>
          ,
          <source>AM Salem</source>
          , FA Torkey,
          <string-name>
            <given-names>G</given-names>
            <surname>Saake</surname>
          </string-name>
          , and MA Ramadan.
          <article-title>An e cient kmeans with good initial starting points</article-title>
          .
          <source>Georgian Electronic Scienti c Journal: Computer Science and Telecommunications</source>
          ,
          <volume>2</volume>
          (
          <issue>19</issue>
          ):
          <volume>47</volume>
          {
          <fpage>57</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Blackwell James Honaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gary</given-names>
            <surname>King</surname>
          </string-name>
          .
          <article-title>Amelia II: A Program for Missing Data</article-title>
          ,
          <year>2014</year>
          .
          <source>R package version 1</source>
          .
          <article-title>0 | For new features, see the 'Changelog' le (in the package source</article-title>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Eric</given-names>
            <surname>Jones</surname>
          </string-name>
          , Travis Oliphant,
          <string-name>
            <given-names>Pearu</given-names>
            <surname>Peterson</surname>
          </string-name>
          , et al.
          <article-title>SciPy: Open source scienti c tools for Python,</article-title>
          <year>2001</year>
          {. Online; accessed 2015-
          <volume>06</volume>
          -29; Available on: http://www. scipy.org/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Kingsly</given-names>
            <surname>Leung</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Leckie</surname>
          </string-name>
          .
          <article-title>Unsupervised anomaly detection in network intrusion detection using clusters</article-title>
          .
          <source>In Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume</source>
          <volume>38</volume>
          , pages
          <fpage>333</fpage>
          {
          <fpage>342</fpage>
          . Australian Computer Society, Inc.,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Moshe</given-names>
            <surname>Lichman</surname>
          </string-name>
          .
          <article-title>1999 kdd cup dataset</article-title>
          ,
          <source>UCI machine learning repository</source>
          ,
          <year>2013</year>
          . Dataset; Available on: https://archive.ics.uci.edu/ml/datasets/KDD+ Cup+1999+Data.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wes McKinney</surname>
          </string-name>
          .
          <article-title>Data structures for statistical computing in python</article-title>
          . In Stefan van der Walt and Jarrod Millman, editors,
          <source>Proceedings of the 9th Python in Science Conference</source>
          , pages
          <volume>51</volume>
          {
          <fpage>56</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jim</surname>
            <given-names>ODonoghue</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Roantree</surname>
          </string-name>
          .
          <article-title>A framework for selecting deep learning hyper-parameters</article-title>
          .
          <source>In Data Science</source>
          , pages
          <volume>120</volume>
          {
          <fpage>132</fpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Dan</surname>
            <given-names>Pelleg</given-names>
          </string-name>
          , Andrew W Moore, et al.
          <article-title>X-means: Extending k-means with e cient estimation of the number of clusters</article-title>
          . In ICML, pages
          <volume>727</volume>
          {
          <fpage>734</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mark</surname>
            <given-names>Roantree</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donall McCann</surname>
            ,
            <given-names>and Niall</given-names>
          </string-name>
          <string-name>
            <surname>Moyna</surname>
          </string-name>
          .
          <article-title>Integrating sensor streams in phealth networks</article-title>
          .
          <source>In Parallel and Distributed Systems</source>
          ,
          <year>2008</year>
          . ICPADS'
          <volume>08</volume>
          . 14th IEEE International Conference on, pages
          <volume>320</volume>
          {
          <fpage>327</fpage>
          . IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Mark</surname>
            <given-names>Roantree</given-names>
          </string-name>
          , Jie Shi, Paolo Cappellari,
          <string-name>
            <surname>Martin F. OConnor</surname>
            , Michael Whelan, and
            <given-names>Niall</given-names>
          </string-name>
          <string-name>
            <surname>Moyna</surname>
          </string-name>
          .
          <article-title>Data transformation and query management in personal health sensor networks</article-title>
          .
          <source>Journal of Network and Computer Applications</source>
          ,
          <volume>35</volume>
          (
          <issue>4</issue>
          ):
          <volume>1191</volume>
          {
          <fpage>1202</fpage>
          ,
          <year>2012</year>
          .
          <article-title>Intelligent Algorithms for Data-Centric Sensor Networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Stephen M Ross.</surname>
          </string-name>
          <article-title>Peirce's criterion for the elimination of suspect experimental data</article-title>
          .
          <source>Journal of Engineering Technology</source>
          ,
          <volume>20</volume>
          (
          <issue>2</issue>
          ):
          <volume>38</volume>
          {
          <fpage>41</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mei-Ling</surname>
            <given-names>Shyu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shu-Ching</surname>
            <given-names>Chen</given-names>
          </string-name>
          , Kanoksri Sarinnapakorn, and
          <string-name>
            <given-names>LiWu</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>A novel anomaly detection scheme based on principal component classi er</article-title>
          .
          <source>Technical report, DTIC Document</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Qing</surname>
            <given-names>Song</given-names>
          </string-name>
          , Wenjie Hu, and
          <string-name>
            <given-names>Wenfang</given-names>
            <surname>Xie</surname>
          </string-name>
          .
          <article-title>Robust support vector machine with bullet hole image classi cation</article-title>
          .
          <source>Systems, Man, and Cybernetics</source>
          , Part C:
          <article-title>Applications</article-title>
          and Reviews, IEEE Transactions on,
          <volume>32</volume>
          (
          <issue>4</issue>
          ):
          <volume>440</volume>
          {
          <fpage>448</fpage>
          ,
          <string-name>
            <surname>Nov</surname>
          </string-name>
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Stefan Van Der Walt</surname>
            ,
            <given-names>S Chris</given-names>
          </string-name>
          <string-name>
            <surname>Colbert</surname>
            , and
            <given-names>Gael</given-names>
          </string-name>
          <string-name>
            <surname>Varoquaux</surname>
          </string-name>
          .
          <article-title>The numpy array: a structure for e cient numerical computation</article-title>
          .
          <source>Computing in Science &amp; Engineering</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):
          <volume>22</volume>
          {
          <fpage>30</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>