Finding Similar Movements in Positional Data Streams

Jens Haase† and Ulf Brefeld‡
Knowledge Mining & Assessment Group, Technische Universität Darmstadt, Germany
† je.haase@gmail.com   ‡ brefeld@kma.informatik.tu-darmstadt.de

Abstract. In this paper, we study the problem of efficiently finding similar movements in positional data streams, given a query trajectory. Our approach is based on a translation-, rotation-, and scale-invariant representation of movements. Near-neighbours of a given query trajectory are then efficiently computed using dynamic time warping and locality sensitive hashing. Empirically, we show the efficiency and accuracy of our approach on positional data streams recorded from a real soccer game.

1 Introduction

Team sports have become a major business in many parts of the world. Clubs spend a great deal of money on players, training facilities, and other ways to further improve their play. For instance, the German Bundesliga records all games with special cameras, capturing bird's eye views of the pitch, to better analyse player movements and tactics. The recordings capture the positions of the players and the ball for every fraction of a second. While simple analyses, such as the overall distance a player covered, heat maps of player positions, etc., can be computed (semi-)automatically, more complex analyses involving tactics and counter-strategies rely on human experts. However, the sheer existence of such data paves the way for automatic analyses using intelligent mining techniques.

In this paper, we study efficient techniques for detecting similar movements in positional data streams to provide a basis for the analysis of frequent movements and tactical patterns. For a soccer game taking about 90 minutes, the recorded data translate into a positional data stream. A standard recording rate of 25 frames per second leads to a representation of about 135,000 snapshots. Every snapshot consists of the 23 positions of the players of the two teams and the ball. In sum, the game is described by more than three million coordinates. As player movements are sequences of such coordinates, it is clear that a great many comparisons are necessary to account for the different lengths of such sequences across players. Thus, there is a real need for efficient techniques to further process and analyse the data.
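As a quick back-of-the-envelope check of these figures, assuming exactly 90 minutes recorded at 25 Hz:

$$90\,\text{min} \times 60\,\tfrac{\text{s}}{\text{min}} \times 25\,\tfrac{\text{frames}}{\text{s}} = 135{,}000 \ \text{snapshots}, \qquad 135{,}000 \times 23 \approx 3.1 \ \text{million positions},$$

where each of the 23 tracked positions per snapshot (22 players and the ball) is a two-dimensional pitch coordinate.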
In this paper, we study the problem of finding similar movements of players in such positional data streams. The problem is challenging for two reasons. Firstly, it is not clear how to define an appropriate similarity measure on player trajectories; secondly, the sheer number of coordinates renders exact approaches infeasible, as we will show in the experiments. We first propose a translation-, rotation-, and scale-invariant representation of movements using Angle/Arc-Lengths [11]. Second, we investigate efficient near-neighbour routines based on dynamic time warping [9] and locality sensitive hashing [2]. Our empirical results on positional data recorded from a real soccer game show the efficiency and accuracy of our approach compared to exact baseline methods.

The remainder is structured as follows. Section 2 briefly reviews related work. Our main contribution is presented in Section 3 and we report on empirical results in Section 4. Section 5 concludes.

2 Related Work

Prior work on mining positional data streams mostly focuses on the performance of individual players. Kang et al. [4] present an approach that uses positional data to assess player positions in particular areas of the pitch, such as catchable, safe or competing zones. Grunz et al. [3] analyse groups of players and their behaviour using self-organising maps on positional data. Every neuron of the network represents a certain area of the pitch. Thus, whenever a player moves into such an area, the respective neuron is activated. Perše et al. [8] use positional data of basketball games to compare movements of players with respect to tactical patterns, e.g., a player blocking space for his teammate. The presented approach, however, does not detect novel movements that deviate from the already known patterns. By contrast, we study a purely data-driven approach to find similar movements in positional data for a given query trajectory, without making any assumptions on zones, tasks, or movements.

3 Contribution

3.1 Preliminaries

For each player, we are given a positional data stream $P = \langle x_1, x_2, \ldots \rangle$, where $x_t = (x_1, x_2)^\top$ denotes the coordinates of the player's position on the pitch at time $t$. A trajectory or movement of the player is a subset $p \subset P$ of the stream, e.g., $p = \langle x_t, x_{t+1}, \ldots, x_{t+m} \rangle$, where $m$ is the length of the trajectory. A game $D$ is thus given by the union of all trajectories of length $m$ of the two teams. For simplicity, we omit the time index $t$ and simply index elements of a trajectory by their offset $1, \ldots, m$ in the remainder.

The goal of this paper is to accurately and efficiently compute similarities between trajectories in $D$. That is, given a query trajectory $q$, we aim at finding the $N$ most similar trajectories in $D$.

3.2 Representation

We aim to exploit the symmetries of the pitch and use Angle/Arc-Length (AAL) [11] transformations to guarantee translation-, rotation-, and scale-invariant representations of trajectories. The main idea of AAL is to represent a movement $p = \langle x_1, \ldots, x_m \rangle$ in terms of distances and angles,

$$p \mapsto \bar{p} = \langle (\alpha_1, \|v_1\|), \ldots, (\alpha_m, \|v_m\|) \rangle, \qquad (1)$$

where $v_i = x_i - x_{i-1}$. The difference $v_i$ is called the movement vector at time $i$, and the corresponding angle with respect to a reference vector $v_{\mathrm{ref}} = (1, 0)^\top$ is defined as

$$\alpha_i = \mathrm{sign}(v_i, v_{\mathrm{ref}}) \, \cos^{-1}\!\left( \frac{v_i^\top v_{\mathrm{ref}}}{\|v_i\| \, \|v_{\mathrm{ref}}\|} \right),$$

where the sign function computes the direction (clockwise or counterclockwise) of the movement with respect to the reference. In the remainder, we discard the norms in Equation (1) and represent trajectories by their sequences of angles, $p \mapsto \tilde{p} = \langle \alpha_1, \ldots, \alpha_m \rangle$.

3.3 Dynamic Time Warping

In this section, we propose a distance measure for trajectories. The proposed representation of the previous section fulfils the required invariance in terms of translation, rotation, and scaling [11]. However, some movements may start slow and end fast, while others start fast and then slow down at the end. Thus, we additionally need to compensate for phase shifts of trajectories. A remedy comes from the area of speech recognition and is called dynamic time warping (DTW) [9]. Given two sequences $s = \langle s_1, \ldots, s_m \rangle$ and $q = \langle q_1, \ldots, q_m \rangle$ and an element-wise distance function $\mathrm{dist}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ (e.g., the Euclidean distance), we define the DTW function $g$ recursively as follows:

$$g(\emptyset, \emptyset) = 0, \qquad g(s, \emptyset) = g(\emptyset, q) = \infty,$$

$$g(s, q) = \mathrm{dist}(s_1, q_1) + \min \begin{cases} g(s, \langle q_2, \ldots, q_m \rangle) \\ g(\langle s_2, \ldots, s_m \rangle, q) \\ g(\langle s_2, \ldots, s_m \rangle, \langle q_2, \ldots, q_m \rangle). \end{cases}$$

The time complexity of DTW is $O(|s|\,|q|)$, which is clearly intractable for computing similarities of thousands of trajectories.
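The following minimal Python sketch illustrates the angle representation of Section 3.2 and the plain quadratic-time DTW recursion above (illustrative only; the function names to_angles and dtw are our own and this is not the authors' implementation):

import math

def to_angles(traj):
    """Angle-only AAL representation: trajectory -> signed angles of the
    movement vectors v_i = x_i - x_{i-1} with respect to (1, 0)."""
    angles = []
    for (x0, y0), (x1, y1) in zip(traj[:-1], traj[1:]):
        vx, vy = x1 - x0, y1 - y0
        # atan2 gives the signed angle in (-pi, pi], which matches
        # sign(v, v_ref) * arccos(<v, v_ref> / (|v| |v_ref|)) for v_ref = (1, 0)
        # up to the paper's sign convention.
        angles.append(math.atan2(vy, vx))
    return angles

def dtw(s, q, dist=lambda a, b: abs(a - b)):
    """Plain O(|s||q|) dynamic time warping distance g."""
    n, m = len(s), len(q)
    INF = float("inf")
    g = [[INF] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = dist(s[i - 1], q[j - 1]) + min(g[i - 1][j],      # advance s
                                                     g[i][j - 1],      # advance q
                                                     g[i - 1][j - 1])  # advance both
    return g[n][m]

if __name__ == "__main__":
    # Two toy trajectories: the same shape, translated and uniformly scaled.
    p1 = [(0, 0), (1, 0), (2, 1), (3, 3)]
    p2 = [(5, 5), (7, 5), (9, 7), (11, 11)]
    print(dtw(to_angles(p1), to_angles(p2)))  # 0.0

Since translating or uniformly scaling a trajectory leaves the directions of the movement vectors unchanged, the two toy trajectories map to identical angle sequences and their DTW distance is zero.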
However, recall that we aim at finding the $N$ best matches for a given query. This allows for pruning some DTW computations using lower bounds $f$, i.e., $f(s, q) \leq g(s, q)$, with an appropriate function $f$ that can be computed more efficiently than $g$ [10]. We use two different lower bound functions, $f_{\mathrm{kim}}$ [6] and $f_{\mathrm{keogh}}$ [5], which are defined as follows.

$f_{\mathrm{kim}}$ focuses on the first, last, greatest, and smallest values of the two sequences [6],

$$f_{\mathrm{kim}}(s, q) = \max\bigl\{\, |s_1 - q_1|,\ |s_m - q_m|,\ |\max(s) - \max(q)|,\ |\min(s) - \min(q)| \,\bigr\},$$

and can be computed in $O(m)$. However, the greatest (or smallest) entry in the transformed paths is always close or identical to $\pi$ (or $-\pi$) and can thus be ignored. Consequently, the time complexity reduces to $O(1)$ [10].

The second lower bound $f_{\mathrm{keogh}}$ [5] uses minimum values $\ell_i$ and maximum values $u_i$ of subsequences of the query $q$, given by $\ell_i = \min(q_{i-r}, \ldots, q_{i+r})$ and $u_i = \max(q_{i-r}, \ldots, q_{i+r})$, where $r$ is a user-defined threshold. Trivially, $u_i \geq q_i \geq \ell_i$ holds for all $i$, and the lower bound $f_{\mathrm{keogh}}$ is given by

$$f_{\mathrm{keogh}}(q, s) = \sqrt{\sum_{i=1}^{m} c_i} \quad \text{with} \quad c_i = \begin{cases} (s_i - u_i)^2 & \text{if } s_i > u_i \\ (s_i - \ell_i)^2 & \text{if } s_i < \ell_i \\ 0 & \text{otherwise,} \end{cases}$$

which can also be computed in $O(m)$ (see [7] for details).

Algorithm 1, shown below, extends [10] to compute the $N$ most similar trajectories for a given query $q$. Lines 2–9 compute the DTW distances of the first $N$ entries in the database and store the entry with the highest distance to $q$. Lines 10–21 loop over all subsequent trajectories in $D$ by first applying the lower bound functions $f_{\mathrm{kim}}$ and $f_{\mathrm{keogh}}$ to efficiently filter irrelevant movements before using the exact DTW distance for the remaining candidates. Every trajectory realising a smaller DTW distance than the current maximum replaces its peer, and the variables maxdist and maxind are updated accordingly.

Algorithm 1 TOP-N(N, q, D)
Input: number of near-neighbour movements N, query trajectory q, game D
Output: the N most similar trajectories to q in D
 1: output = ∅ ∧ maxdist = 0 ∧ maxind = −1
 2: for i = 1, . . . , N do
 3:     dist = g(q, D[i])
 4:     output[i] = D[i]
 5:     if dist > maxdist then
 6:         maxdist = dist
 7:         maxind = i
 8:     end if
 9: end for
10: for i = N + 1, . . . , |D| do
11:     if f_kim(q, D[i]) < maxdist then
12:         if f_keogh(q, D[i]) < maxdist then
13:             dist = g(q, D[i])
14:             if dist < maxdist then
15:                 output[maxind] = D[i]
16:                 maxdist = max{g(q, output[j]) : 1 ≤ j ≤ N}
17:                 maxind = arg max_{1 ≤ j ≤ N} g(q, output[j])
18:             end if
19:         end if
20:     end if
21: end for

Note that the complexity of Algorithm 1 is linear in the number of trajectories in $D$. In the worst case, the sequences are sorted in descending order by the DTW distance, which requires computing all DTW distances. In practice, however, we observe much lower run-times. An important factor is the tightness of the lower bound functions: the better the approximation of the DTW distance, the better the pruning. The parameter $N$ also plays a crucial part in the effectiveness of the algorithm. If we set $N = 1$, the maximum value drops quickly towards the lowest value in the dataset. By contrast, setting $N = |D|$ requires computing the DTW distances for all entries in the database. Hence, in most cases, $N \ll |D|$ is an appropriate choice to reduce the overall computation time.
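The sketch below illustrates the two lower bounds and the pruning loop of Algorithm 1, following the formulas as stated above (a sketch under our own naming and parameter choices, e.g. the band width r and the toy data; it also caches the DTW distances of the current top-N instead of recomputing them as in lines 16–17):

import math
import random

def dtw(s, q, dist=lambda a, b: abs(a - b)):
    # Exact DTW distance g (same as in the previous sketch).
    n, m = len(s), len(q)
    INF = float("inf")
    g = [[INF] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = dist(s[i - 1], q[j - 1]) + min(
                g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]

def f_kim(s, q):
    # Lower bound on the first, last, greatest, and smallest values.
    return max(abs(s[0] - q[0]), abs(s[-1] - q[-1]),
               abs(max(s) - max(q)), abs(min(s) - min(q)))

def f_keogh(q, s, r=1):
    # Lower bound based on the band envelope [l_i, u_i] of the query q.
    m, total = len(q), 0.0
    for i in range(m):
        lo, hi = max(0, i - r), min(m, i + r + 1)
        l_i, u_i = min(q[lo:hi]), max(q[lo:hi])
        if s[i] > u_i:
            total += (s[i] - u_i) ** 2
        elif s[i] < l_i:
            total += (s[i] - l_i) ** 2
    return math.sqrt(total)

def top_n(N, q, D, r=1):
    # Algorithm 1: the N trajectories of D most similar to q under DTW,
    # pruning candidates with f_kim and f_keogh before the exact computation.
    output = list(D[:N])
    dists = [dtw(q, s) for s in output]          # lines 2-9
    maxind = max(range(N), key=dists.__getitem__)
    maxdist = dists[maxind]
    for s in D[N:]:                              # lines 10-21
        if f_kim(s, q) < maxdist and f_keogh(q, s, r) < maxdist:
            d = dtw(q, s)
            if d < maxdist:
                output[maxind], dists[maxind] = s, d
                maxind = max(range(N), key=dists.__getitem__)
                maxdist = dists[maxind]
    return output

if __name__ == "__main__":
    random.seed(0)
    # Toy database of 200 random angle sequences of length 10.
    D = [[random.uniform(-math.pi, math.pi) for _ in range(10)] for _ in range(200)]
    q = D[0]
    print(sorted(round(dtw(q, s), 2) for s in top_n(5, q, D[1:])))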
3.4 Locality Sensitive Hashing

To further improve the efficiency of our algorithm, we use locality sensitive hashing (LSH) [2] to remove a great deal of trajectories before processing them with Algorithm 1. The idea of LSH is to hash similar objects to the same bucket, so that all objects of a bucket are considered candidates for being near-neighbours. An interesting class of LSH functions is given by distance-based hashes (DBH) [1], which can be applied together with arbitrary (e.g., non-metric) distance measures.

To define a hash family for our purposes, we first need a function $h: D \to \mathbb{R}$ that maps a trajectory $s \in D$ to the real numbers. Choosing two randomly drawn members $s_1, s_2 \in D$, we define the function $h$ as follows:

$$h_{s_1,s_2}(s) = \frac{\mathrm{dist}(s, s_1)^2 + \mathrm{dist}(s_1, s_2)^2 - \mathrm{dist}(s, s_2)^2}{2\,\mathrm{dist}(s_1, s_2)}.$$

In the remainder, we use $\mathrm{dist}(s_1, s_2) = f_{\mathrm{kim}}(s_1, s_2)$ for simplicity. To compute a discrete hash value for $s$, we check whether $h(s)$ lies in a certain interval $[t_1, t_2]$,

$$h^{[t_1,t_2]}_{s_1,s_2}(s) = \begin{cases} 1 & \text{if } h_{s_1,s_2}(s) \in [t_1, t_2] \\ 0 & \text{otherwise.} \end{cases}$$

Optimally, the interval boundaries $t_1$ and $t_2$ are chosen such that a randomly drawn $s \in D$ lies within the interval with 50% probability and outside of it with 50% probability. The set $T$ defines the set of admissible intervals,

$$T(s_1, s_2) = \left\{ [t_1, t_2] : \Pr_D\!\left(h^{[t_1,t_2]}_{s_1,s_2}(s) = 0\right) = \Pr_D\!\left(h^{[t_1,t_2]}_{s_1,s_2}(s) = 1\right) \right\}.$$

Given $h$ and $T$, we can now define the DBH hash family that can be directly integrated into standard LSH algorithms:

$$\mathcal{H}_{\mathrm{DBH}} = \left\{ h^{[t_1,t_2]}_{s_1,s_2} : s_1, s_2 \in D \,\wedge\, [t_1, t_2] \in T(s_1, s_2) \right\}.$$

Using random draws from $\mathcal{H}_{\mathrm{DBH}}$, we construct several hash functions by AND- and OR-concatenation [2]. Given a query trajectory $q \in D$, the retrieval process first identifies candidate objects that are hashed to the same bucket as $q$ for at least one of the hash functions and then computes the exact distances of the remaining candidates using Algorithm 1.
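The following sketch shows one way such a DBH-based candidate generation could look in Python (our own illustration; the quantile-based choice of $[t_1, t_2]$, the parameters k and L, and names such as build_lsh_index are assumptions rather than the paper's implementation). The returned candidates would then be re-ranked exactly with Algorithm 1:

import random

def f_kim(s, q):
    # Distance used inside the DBH projection, as in Section 3.4.
    return max(abs(s[0] - q[0]), abs(s[-1] - q[-1]),
               abs(max(s) - max(q)), abs(min(s) - min(q)))

def dbh_projection(s, s1, s2, dist=f_kim):
    # Real-valued line projection h_{s1,s2}(s).
    d12 = dist(s1, s2)
    if d12 == 0.0:
        return 0.0  # degenerate anchor pair; a full implementation would redraw
    return (dist(s, s1) ** 2 + d12 ** 2 - dist(s, s2) ** 2) / (2.0 * d12)

def make_binary_hash(D, rng):
    # Draw an anchor pair (s1, s2) and pick [t1, t2] so that roughly half of the
    # projected database falls inside the interval (approximating T(s1, s2)).
    s1, s2 = rng.sample(D, 2)
    proj = sorted(dbh_projection(s, s1, s2) for s in D)
    t1, t2 = proj[len(proj) // 4], proj[3 * len(proj) // 4]
    return lambda s: 1 if t1 <= dbh_projection(s, s1, s2) <= t2 else 0

def build_lsh_index(D, k=4, L=8, rng=random):
    # OR-concatenation of L tables, each keyed by an AND-concatenation of k bits.
    tables = []
    for _ in range(L):
        hashes = [make_binary_hash(D, rng) for _ in range(k)]
        buckets = {}
        for idx, s in enumerate(D):
            buckets.setdefault(tuple(h(s) for h in hashes), []).append(idx)
        tables.append((hashes, buckets))
    return tables

def lsh_candidates(q, tables):
    # Indices of all trajectories sharing a bucket with q in at least one table.
    cand = set()
    for hashes, buckets in tables:
        cand.update(buckets.get(tuple(h(q) for h in hashes), []))
    return cand

if __name__ == "__main__":
    random.seed(1)
    D = [[random.uniform(-3.14, 3.14) for _ in range(10)] for _ in range(500)]
    tables = build_lsh_index(D)
    print(len(lsh_candidates(D[0], tables)), "candidates out of", len(D))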
4 Evaluation

In this section, we evaluate our approach in terms of run-time, pruning efficiency, and precision. For our experiments, we use positional data published by the DEBS Grand Challenge 2013 (http://www.orgs.ttu.edu/debs2013/index.php?goto=cfchallengedetails). There are eight players in each team, and every player is equipped with two sensors, one for each shoe. We average these two values to obtain a single measurement for every player at a time. Discarding additional data that is not useful in our context leaves us with a stream of (sensor/player id, timestamp, player coordinates) triplets. Additionally, we discard all data points that have been recorded before or after the game as well as data points that occurred outside of the pitch. We also remove the effects of changing sides after half time by appropriate transformations. Finally, we average the positional information of each player over 100ms to reduce the overall amount of data and use trajectories of size 10.

In our first experiment, we focus on 15,000 consecutive positions of one player, so that we are still able to compare performance to the exact baseline using the DTW distance from Section 3.3. We compute the N most similar trajectories using Algorithm 1, where N = 1,000, and study the run-times of the different approaches. Figure 1 shows the results. The computation time of the baseline grows exponentially in the size of the data D. Algorithm 1 performs slightly super-linearly and clearly outperforms the baseline. Pre-filtering trajectories using DBH results in only a small additional speed-up. Adding more CPUs improves the run-time of the algorithms significantly and indicates that parallelisation of the approach allows for computing near-neighbours for large data sets in only a couple of minutes.

Fig. 1. Run-time (in seconds) of the baseline, Top 1000, and Top 1000 with DBH, each with one and four CPUs, as a function of the number of sliding windows |w|.

The observed improvements in run-time are the result of a highly efficient pruning strategy. Table 1 shows the proportion of trajectories that are pruned for different amounts of data. Note that the DBH pruning depends on the data and not on the ratio N/|D|. The effectiveness of pruning using f_kim and f_keogh increases with increasing amounts of data for constant N.

Table 1. Pruning efficiency

trajectories   f_kim    f_keogh   DBH      Total
1000           0%       0%        11.42%   11.42%
5000           0.28%    34.00%    16.33%   50.61%
10000          9.79%    41.51%    17.80%   60.10%
15000          17.50%   46.25%    11.82%   75.57%

We now investigate the accuracy of the proposed approach. We compute the 1,000 most similar trajectories for all 35,248 player movements and measure the effect of DBH in terms of the precision@N. For every query q, we computed the performance for N ∈ {100, 200, . . . , 1000} and averaged the results, which are shown in Figure 2. For completeness, we also include the worst cases.

Fig. 2. Accuracy of DBH: average and worst-case precision (P@n) for n = 100, . . . , 1000.

The quality of the candidates covers a broad range, and the worst cases are clearly inappropriate for accurate computations of similarity. Nevertheless, on average, DBH performs well and its precision decreases only slightly with the size of N. Figure 3 shows an exemplary query trajectory (top left) as well as five similar trajectories found by DBH, where the axes denote the coordinates on the pitch of the respective movement. The retrieved near-duplicates are very close to the query and well suited for further processing.

Fig. 3. Exemplary results: the query trajectory is shown in the top-left figure; the other figures show five similar trajectories found by our approach.

5 Conclusion

In this paper, we presented an approach to efficiently compute similar movements in positional data streams. Our solution is based on dynamic time warping and distance-based hashing. Empirically, we showed the efficiency and accuracy of our approach. Future work will deal with detecting frequent movements across players.

References

1. V. Athitsos, M. Potamias, P. Papapetrou, and G. Kollios. Nearest neighbor retrieval using distance-based hashing. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 327–336, 2008.
2. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, pages 518–529, 1999.
3. A. Grunz, D. Memmert, and J. Perl. Tactical pattern recognition in soccer games by means of special self-organizing maps. Human Movement Science, 31(2):334–343, 2012. Special issue on Network approaches in complex environments.
4. C.-H. Kang, J.-R. Hwang, and K.-J. Li. Trajectory analysis for soccer players. In Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops, pages 377–381, 2006.
5. E. Keogh. Exact indexing of dynamic time warping. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 406–417, 2002.
6. S.-W. Kim, S. Park, and W. W. Chu. An index-based approach for similarity search supporting time warping in large sequence databases. In Proceedings of the 17th International Conference on Data Engineering, pages 607–614, 2001.
7. D. Lemire. Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recognition, 42(9):2169–2180, September 2009.
8. M. Perše, M. Kristan, S. Kovačič, G. Vučkovič, and J. Perš. A trajectory-based analysis of coordinated team activity in a basketball game. Computer Vision and Image Understanding, 113(5):612–621, 2009.
9. L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice-Hall, Upper Saddle River, NJ, USA, 1993.
10. T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 262–270, 2012.
11. M. Vlachos, D. Gunopulos, and G. Das. Rotation invariant distance measures for trajectories. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), pages 707–712, 2004.