Missing Value Imputation in Time Series Using Top-k Case Matching

Kevin Wellenzohn, Hannes Mitterer, Johann Gamper
Free University of Bozen-Bolzano
kevin.wellenzohn@unibz.it, hannes.mitterer@unibz.it, gamper@inf.unibz.it

M. H. Böhlen, Mourad Khayati
University of Zurich
boehlen@ifi.uzh.ch, mkhayati@ifi.uzh.ch

ABSTRACT

In this paper, we present a simple yet effective algorithm, called the Top-k Case Matching algorithm, for the imputation of missing values in streams of time series data that are similar to each other. The key idea of the algorithm is to look for the k situations in the historical data that are most similar to the current situation and to derive the missing value from the measured values at these k time points. To efficiently identify the top-k most similar historical situations, we adopt Fagin's Threshold Algorithm, yielding an algorithm with sub-linear runtime complexity with high probability and linear complexity in the worst case (excluding the initial sorting of the data, which is done only once). We provide the results of a first experimental evaluation using real-world meteorological data. Our algorithm achieves a high accuracy and is more accurate and efficient than two more complex state-of-the-art solutions.

Keywords

Time series, imputation of missing values, Threshold Algorithm

1. INTRODUCTION

Time series data is ubiquitous, e.g., in the financial stock market or in meteorology. In many applications time series data is incomplete, that is, some values are missing for various reasons, e.g., sensor failures or transmission errors. However, many applications assume complete data and hence need to recover missing values before further data processing is possible.

In this paper, we focus on the imputation of missing values in long streams of meteorological time series data. As a case study, we use real-world meteorological data collected by the Südtiroler Beratungsring¹ (SBR), an organization that provides professional and independent consultancy to the local wine and apple farmers, e.g., to determine the optimal harvesting time or to warn about potential threats, such as apple scab, fire blight, or frost. Especially frost is dangerous, as it can destroy the harvest within a few minutes unless the farmers react immediately. The SBR operates more than 120 weather stations spread all over South Tyrol, each of which collects up to 20 measurements every five minutes, including temperature, humidity, etc. The weather stations frequently suffer outages due to sensor failures or errors in the transmission of the data. However, the continuous monitoring of the current weather condition is crucial to immediately warn about imminent threats such as frost, and therefore the need arises to recover missing values as soon as they are detected.

In this paper, we propose an accurate and efficient method to automatically recover missing values. The need for continuous monitoring of the weather condition at the SBR has two important implications for our solution. Firstly, the proposed algorithm has to be efficient enough to complete the imputation before the next set of measurements arrives a few minutes later. Secondly, the algorithm cannot use future measurements, which would facilitate the imputation, since they are not yet available.

The key idea of our Top-k Case Matching algorithm is to seek the k time points in the historical data when the measured values at a set of reference stations were most similar to the measured values at the current time point (i.e., the time point when a value is missing). The missing value is then derived from the values at these k past time points. While a naïve solution to identify the top-k most similar historical situations would have to scan the entire data set, we adopt Fagin's Threshold Algorithm, which efficiently answers top-k queries by scanning, on average, only a small portion of the data. The runtime complexity of our solution is derived from the Threshold Algorithm and is sub-linear with high probability and linear in the worst case, when all data need to be scanned. We provide the results of a first experimental evaluation using real-world meteorological data from the SBR. The results are promising both in terms of efficiency and accuracy. Our algorithm achieves a high accuracy and is more accurate than two state-of-the-art solutions.

The rest of the paper is organized as follows. In Section 2, we review the existing literature on imputation methods for missing values. In Section 3, we introduce the basic notation and a running example. In Section 4, we present our Top-k Case Matching algorithm for the imputation of missing values, followed by the results of an experimental evaluation in Section 5. Section 6 concludes the paper and outlines ideas for future work.

¹ http://www.beratungsring.org/

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

2. RELATED WORK
Khayati et al. [4] present an algorithm, called REBOM, which recovers blocks of missing values in irregular (i.e., with non-repeating trends) time series data. The algorithm is based on an iterated truncated matrix decomposition technique. It builds a matrix which stores the time series containing the missing values and its k most correlated time series according to the Pearson correlation coefficient [7]. The missing values are first initialized using a simple interpolation technique, e.g., linear interpolation. Then, the matrix is iteratively decomposed using the truncated Singular Value Decomposition (SVD). By multiplying the three matrices obtained from the decomposition, the algorithm is able to accurately approximate the missing values. Due to its quadratic runtime complexity, REBOM does not scale to long time series data.

Khayati et al. [5] further investigate the use of matrix decomposition techniques for the imputation of missing values. They propose an algorithm with linear space complexity based on the Centroid Decomposition, which is an approximation of SVD. Due to the memory-efficient implementation, the algorithm scales to long time series. The imputation follows a similar strategy as the one used in REBOM.

The above techniques are designed to handle missing values in static time series. Therefore, they are not applicable in our scenario, as we have to continuously impute missing values as soon as they appear. The naïve approach of re-running these algorithms each time a missing value occurs is not feasible due to their relatively high runtime complexity.

There are numerous statistical approaches for the imputation of missing values, ranging from simple ones, such as linear or spline interpolation, all the way up to more complex models, such as the ARIMA model. The ARIMA model [1] is frequently used for forecasting future values, but can be used for backcasting missing values as well, although this is a less common use case. A recent comparison of statistical imputation techniques for meteorological data is presented in [9]. The paper covers several simple techniques, such as the (weighted) average of concurrent measurements at nearby reference stations, but also computationally more intensive algorithms, such as neural networks.

3. BACKGROUND

Let S = {s1, . . . , sn} be a set of time series. Each time series, s ∈ S, has an associated set of reference time series Rs, Rs ⊆ S \ {s}. The value of a time series s ∈ S at time t is denoted as s(t). A sliding window of a time series s is denoted as s([t1, t2]) and represents all values between t1 and t2.

    t ∈ w    s(t)     r1(t)    r2(t)    r3(t)
      1      16.1°    15.0°    15.9°    14.1°
      2      15.8°    15.2°    15.7°    13.9°
      3      15.9°    15.2°    15.8°    14.1°
      4      16.2°    15.0°    15.9°    14.2°
      5      16.5°    15.3°    15.7°    14.5°
      6      16.1°    15.2°    16.0°    14.1°
      7        ?      15.0°    16.0°    14.3°

Table 1: Four time series in a window w = [1, 7].

EXAMPLE 1. Table 1 shows four temperature time series in a time window w = [1, 7], which in our application corresponds to seven timestamps in a range of 30 minutes. s is the base time series from the weather station in Schlanders, and Rs = {r1, r2, r3} is the associated set of reference time series containing the stations of Kortsch, Göflan, and Laas, respectively. The temperature value s(7) is missing. Figure 1 visualizes this example graphically.

[Figure 1: Visualization of the time series data. The plot shows the temperature in degree Celsius of s (Schlanders), r1 (Kortsch), r2 (Göflan), and r3 (Laas) over the timestamps 1 to 7.]

The Top-k Case Matching algorithm we propose assumes that the time series data is aligned, which generally is not the case for our data. Each weather station collects new measurements roughly every 5 minutes and transmits them to a central server. Since the stations are not perfectly synchronized, the timestamps of the measurements typically differ, e.g., one station collects measurements at 09:02, 09:07, . . . , while another station collects them at 09:04, 09:09, . . . . Therefore, in a pre-processing step we align the time series data using linear interpolation, which yields measurement values every 5 minutes (e.g., 00:00, 00:05, 00:10, . . . ). If we observe a gap of more than 10 minutes in the measurements, we assume that the value is missing.
For the imputation of missing values we assign to each time series s a set Rs of reference time series, which are similar to s. The notion of similarity between two time series is tricky, though. Intuitively, we want time series to be similar when they have similar values and behave similarly, i.e., the values increase and decrease roughly at the same time and by the same amount.

As a simple heuristic for time series similarity, we use the spatial proximity between the stations that record the respective time series. The underlying assumption is that, if the weather stations are nearby (say within a radius of 5 kilometers), the measured values should be similar, too. Based on this assumption, we manually compiled a list of 3–5 reference time series for each time series. This heuristic turned out to work well in most cases, though there are situations where the assumption simply does not hold. One reason for the generally good results is most likely that in our data set the over 100 weather stations cover a relatively small area, and hence the stations are very close to each other.

4. TOP-K CASE MATCHING

Weather phenomena are often recurrent, meaning that, for example, during a hot summer day in 2014 the temperatures measured at the various weather stations are about the same as those measured during an equally hot summer day in 2011. We use this observation for the imputation of missing values. Let s be a time series where the current measurement at time θ, s(θ), is missing. The assumption on which we base the imputation is the following: if we find historical situations in the reference time series Rs in which the past values are very close to the current values at time θ, then also the past measurements in s should be very similar to the missing value s(θ). Based on this assumption, the algorithm searches for similar climatic situations in the historical measurements, thereby leveraging the vast history of weather records collected by the SBR.

More formally, given a base time series s with reference time series Rs, we are looking for the k timestamps (i.e., historical situations), D = {t1, . . . , tk} with ti < θ, which minimize the error function
    δ(t) = Σ_{r ∈ Rs} |r(θ) − r(t)|.

That is, δ(t) ≤ δ(t′) for all t ∈ D and t′ ∉ D ∪ {θ}. The error function δ(t) is the accumulated absolute difference between the current temperature r(θ) and the temperature at time t, r(t), over all reference time series r ∈ Rs. Once D is determined, the missing value is recovered using some aggregation function g({s(t) | t ∈ D}) over the measured values of the time series s at the timestamps in D. In our experiments we tested the average and the median as aggregation functions (cf. Section 5).

EXAMPLE 2. We show the imputation of the missing value s(7) in Table 1 using the average as aggregation function g. For the imputation, we seek the k = 2 most similar historical situations. The two timestamps D = {4, 1} minimize δ(t), with δ(4) = |15.0° − 15.0°| + |16.0° − 15.9°| + |14.3° − 14.2°| = 0.2° and δ(1) = 0.3°. The imputed value is then simply the average of the base station measurements at times t = 4 and t = 1, i.e., s(7) = avg(16.2°, 16.1°) = ½(16.2° + 16.1°) = 16.15°.

A naïve implementation of this algorithm would have to scan the entire database of historical data to find the k timestamps that minimize δ(t). This is, however, not scalable for huge time series data, hence a more efficient technique is needed.
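As a baseline for what follows, here is a minimal sketch of this naïve approach, under an assumed data layout (series[x][t] holds the value of time series x at timestamp t, with a complete aligned history); the names are illustrative only, not taken from the paper.

    import heapq
    from statistics import median

    def delta(series, refs, t, theta):
        # Accumulated absolute difference over all reference time series.
        return sum(abs(series[r][theta] - series[r][t]) for r in refs)

    def impute_naive(series, s, refs, theta, k, aggregate=median):
        # Scan all historical timestamps and rank them by delta(t).
        past = [t for t in series[s] if t < theta]
        top = heapq.nsmallest(k, past, key=lambda t: delta(series, refs, t, theta))
        return aggregate([series[s][t] for t in top])

This scan touches every historical timestamp once, which is exactly the linear cost that the Threshold Algorithm discussed next avoids with high probability.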
4.1 Fagin's Threshold Algorithm

What we are actually trying to do is to answer a top-k query for the k timestamps which minimize δ(t). There exist efficient algorithms for top-k queries. For example, Fagin's algorithm [2] solves this problem by looking only at a small fraction of the data. Since the first presentation of Fagin's algorithm there have been two noteworthy improvements, namely the Threshold Algorithm (TA) by Fagin et al. [3] and a probabilistic extension by Theobald et al. [8]. The latter approach speeds up TA by relaxing the requirement to find the exact top-k answers and providing approximations with probabilistic guarantees.

Our Top-k Case Matching algorithm is a variation of TA with slightly different settings. Fagin et al. assume objects with m attributes, a grade for each attribute, and a monotone aggregation function f : R^m → R, which aggregates the m grades of an object into an overall grade. The monotonicity property is defined as follows.

DEFINITION 1 (MONOTONICITY). Let x1, . . . , xm and x′1, . . . , x′m be the m grades of objects X and X′, respectively. The aggregation function f is monotone if f(x1, . . . , xm) ≤ f(x′1, . . . , x′m) whenever xi ≤ x′i for each 1 ≤ i ≤ m.

TA finds the k objects that maximize the function f. To do so it requires two modes of accessing the data: sorted access and random access. Sorted access is ensured by maintaining a sorted list Li for each attribute mi, ordered by grade in descending order. TA keeps a bounded buffer of size k and scans each list Li in parallel until the buffer contains k objects and the lowest ranked object in the buffer has an aggregated grade that is greater than or equal to some threshold τ. The threshold τ is computed using the aggregation function f over the grades last seen under sorted access in each list Li.

EXAMPLE 3. Table 2 shows four objects {A, B, C, D} and their grades for the two attributes interestingness and popularity. Let us assume that k = 2 and the aggregation function is f(x1, x2) = x1 + x2. Further, assume that the bounded buffer currently contains {(C, 18), (A, 16)} and the algorithm has read the data up to the entries marked with an asterisk in Table 2. At this point the algorithm computes the threshold using the interestingness grade of object B and the popularity grade of object C, yielding τ = f(5, 9) = 5 + 9 = 14. Since the lowest ranked object in the buffer, object A, has an aggregated grade that is greater than τ, we can conclude that C and A are the top-2 objects. Note that the algorithm never read object D, yet it can conclude that D cannot be part of the top-k list.

      interestingness         popularity
      Object    grade         Object    grade
        A        10             B        10
        C         9             C         9 *
        B         5 *           D         8
        D         4             A         6

Table 2: Threshold Algorithm example. An asterisk (*) marks the grade last seen under sorted access in each list (the gray boxes in the original figure).
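To make TA's stopping rule concrete, here is a compact, lock-step sketch of the classic algorithm run on the data of Table 2. It is our own illustration: it advances both lists to the same depth in each round (the paper's example stops with the lists at slightly different depths), but it returns the same top-2 objects.

    import heapq

    def threshold_algorithm(lists, grades, f, k):
        """lists: per attribute, object ids sorted by grade (best first);
        grades: per attribute, a dict object -> grade; f: monotone aggregation."""
        best = {}                    # object -> aggregated grade (random access)
        last = [None] * len(lists)   # grade last seen under sorted access
        for depth in range(max(len(l) for l in lists)):
            for i, l in enumerate(lists):
                if depth < len(l):
                    obj = l[depth]
                    last[i] = grades[i][obj]
                    best[obj] = f(*(g[obj] for g in grades))
            top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
            if len(top) == k and top[-1][1] >= f(*last):
                return top           # k-th best already beats the threshold
        return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])

    interestingness = {"A": 10, "C": 9, "B": 5, "D": 4}
    popularity = {"B": 10, "C": 9, "D": 8, "A": 6}
    lists = [sorted(d, key=d.get, reverse=True) for d in (interestingness, popularity)]
    print(threshold_algorithm(lists, [interestingness, popularity],
                              lambda x1, x2: x1 + x2, k=2))  # [('C', 18), ('A', 16)]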
4.2 Adapting the Threshold Algorithm

In order to use the Threshold Algorithm for the imputation of missing values in time series data, we have to adapt it. Instead of looking for the top-k objects that maximize the aggregation function f, we want to find the top-k timestamps that minimize the error function δ(t) over the reference time series Rs. Similar to TA, we need sorted access to the data. Therefore, for each time series r ∈ Rs we define Lr to be the time series r ordered first by value and then by timestamp in ascending order. Table 3 shows the sorted data for the three reference time series of our running example (ignore the asterisks and the bracketed numbers for the moment).

         Lr1                 Lr2                 Lr3
     t     r1(t)         t     r2(t)         t     r3(t)
     1     15.0° [4]     2     15.7°         2     13.9°
     4     15.0° [1]     5     15.7°         1     14.1°
     7     15.0° *       3     15.8°         3     14.1°
     2     15.2°         1     15.9°         6     14.1°
     3     15.2°         4     15.9° [5]     4     14.2° [3]
     6     15.2°         6     16.0° [2]     7     14.3° *
     5     15.3°         7     16.0° *       5     14.5° [6]

Table 3: Time series sorted by temperature. An asterisk (*) marks the entry holding the current value r(θ), where both pointers start (the gray boxes in the original figure); a bracketed number gives the iteration of Example 4 in which the entry is reached.

The general idea of our modified TA algorithm is the following. The scan of each sorted list starts at the current element, i.e., the element with timestamp t = θ. Instead of scanning the lists Lr only in one direction as TA does, we scan each list sequentially in two directions. Hence, as an initialization step, the algorithm places two pointers, pos+r and pos−r, at the current value r(θ) of time series r (the entries marked with an asterisk in Table 3). During the execution of the algorithm, pointer pos+r is only incremented (i.e., moved down the list), whereas pos−r is only decremented (i.e., moved up the list). To maintain the k highest ranking timestamps, the algorithm uses a bounded buffer of size k. A new timestamp t′ is added only if the buffer is either not yet full or δ(t′) < δ(t̲), where t̲ is the last (i.e., lowest ranking) timestamp in the buffer. In the latter case the timestamp t̲ is removed from the buffer.
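The bounded buffer just described can be kept, for instance, as a sorted list of (error, timestamp) pairs; the following is a minimal sketch of these semantics, under illustrative names.

    import bisect

    def buffer_add(buffer, t, d, k):
        """Insert timestamp t with error d = delta(t) into the bounded buffer,
        a list of (error, timestamp) pairs in ascending error order. If the
        buffer overflows, the lowest-ranking entry (largest error) is dropped;
        ties are broken by timestamp, a detail the paper leaves open."""
        bisect.insort(buffer, (d, t))
        if len(buffer) > k:
            buffer.pop()  # drop the entry with the largest error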
After this initialization, the algorithm iterates over the lists Lr in round-robin fashion, i.e., once the last list is reached, the algorithm wraps around and continues again with the first list. In each iteration, exactly one list Lr is processed, and either pointer pos+r or pos−r is advanced, depending on which of the two values they point to has the smaller absolute difference to the current value r(θ). This process grows a neighborhood around the element r(θ) in each list. Whenever a pointer is advanced by one position, the timestamp t at the new position is processed. At this point, the algorithm needs random access to the values r(t) in each list to compute the error function δ(t). Time t is added to the bounded buffer using the semantics described above.

The algorithm terminates once the error at the lowest ranking timestamp t̲ among the k timestamps in the buffer is less than or equal to the threshold, i.e., δ(t̲) ≤ τ. The threshold τ is defined as

    τ = Σ_{r ∈ Rs} |r(θ) − r(pos_r)|,

where pos_r is either pos+r or pos−r, depending on which pointer was advanced last. That is, τ is the sum over all lists Lr of the absolute differences between r(θ) and the value under pos+r or pos−r.

EXAMPLE 4. We illustrate the Top-k Case Matching algorithm for k = 2 and θ = 7. Table 4 shows the state of the algorithm in each iteration i. The first column shows the iteration counter i, the second the buffer with the k current best timestamps, and the last column the threshold τ. The buffer entries are tuples of the form (t, δ(t)). In iteration i = 1, the algorithm moves the pointer to t = 4 in list Lr1 and adds (t = 4, δ(4) = 0.2°) to the buffer. Since δ(4) = 0.2° > 0.0° = τ, the algorithm continues. The pointer in Lr2 is moved to t = 6, and (6, 0.4°) is added to the buffer. In iteration i = 4, timestamp 6 is replaced by timestamp 1. Finally, in iteration i = 6, the error at timestamp t = 1 is smaller than or equal to τ, i.e., δ(1) = 0.3° ≤ τ6 = 0.3°. The algorithm terminates and returns the two timestamps D = {4, 1}.

    Iteration i    Buffer                   Threshold τi
        1          (4, 0.2°)                    0.0°
        2          (4, 0.2°), (6, 0.4°)         0.0°
        3          (4, 0.2°), (6, 0.4°)         0.1°
        4          (4, 0.2°), (1, 0.3°)         0.1°
        5          (4, 0.2°), (1, 0.3°)         0.2°
        6          (4, 0.2°), (1, 0.3°)         0.3°

Table 4: Finding the k = 2 most similar historical situations.
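As a small illustration of this termination test, the sketch below recomputes τ for iteration i = 6 of Example 4; the names and data layout are assumptions carried over from the earlier snippets.

    def compute_threshold(current, frontier):
        """current: r -> r(theta); frontier: r -> value under the pointer of
        L^r that was advanced last. Returns tau as defined above."""
        return sum(abs(current[r] - frontier[r]) for r in current)

    # Iteration i = 6 of Example 4: Lr1 at t=1, Lr2 at t=4, Lr3 at t=5.
    current = {"r1": 15.0, "r2": 16.0, "r3": 14.3}
    frontier = {"r1": 15.0, "r2": 15.9, "r3": 14.5}
    print(compute_threshold(current, frontier))  # ~0.3 (up to floating-point noise)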
4.3 Implementation

Algorithm 1 shows the pseudo code of the Top-k Case Matching algorithm. The algorithm has three input parameters: a set of reference time series Rs, the current timestamp θ, and the parameter k. It returns the top-k most similar timestamps with respect to the current timestamp θ. In line 2 the algorithm initializes the bounded buffer of size k, and in line 4 the pointers pos+r and pos−r are initialized for each reference time series r ∈ Rs. In each iteration of the loop in line 7, the algorithm advances either pos+r or pos−r (by calling Algorithm 2) and reads a new timestamp t. The timestamp t is added to the bounded buffer using the semantics described before. In line 15, the algorithm computes the threshold τ. If the buffer contains k timestamps and we have δ(t̲) ≤ τ, the top-k most similar timestamps were found and the algorithm terminates.

Algorithm 1: Top-k Case Matching
    Data: Reference time series Rs, current time θ, and k
    Result: k timestamps that minimize δ(t)
 1  L ← {Lr | r ∈ Rs}
 2  buffer ← boundedBuffer(k)
 3  for r ∈ Rs do
 4      pos−r, pos+r ← position of r(θ) in Lr
 5  end
 6  while L ≠ ∅ do
 7      for Lr ∈ L do
 8          t ← AdvancePointer(Lr)
 9          if t = NIL then
10              L ← L \ {Lr}
11          else
12              if t ∉ buffer then
13                  buffer.addWithPriority(t, δ(t))
14              end
15              τ ← ComputeThreshold(L)
16              if buffer.size() = k and buffer.largestError() ≤ τ then
17                  return buffer
18              end
19          end
20      end
21  end
22  return buffer

Algorithm 2 is responsible for moving the pointers pos+r and pos−r for each list Lr. The algorithm uses three utility functions. The first is next(), which takes a pointer as input and returns the next position by either incrementing or decrementing it, depending on the direction of the pointer. If next() reaches the end of a list, it returns NIL. The utility functions timestamp() and value() return the timestamp and the value of a list Lr at a given position, respectively. There are four cases which the algorithm has to distinguish:

1. None of the two pointers has reached the beginning or the end of the list. In this case, the algorithm checks which pointer to advance (line 5). The pointer that is closer to r(θ) after advancing is moved by one position. In case of a tie, we arbitrarily decided to advance pos+r.

2. Only pos−r has reached the beginning of the list: the algorithm increments pos+r (line 11).

3. Only pos+r has reached the end of the list: the algorithm decrements pos−r (line 13).

4. The two pointers have reached the beginning respectively the end of the list: no pointer is moved.

In the first three cases, the algorithm returns the timestamp that was discovered after advancing the pointer. In the last case, NIL is returned.
Algorithm 2: AdvancePointer
    Data: List Lr in which to advance a pointer
    Result: Next timestamp to look at, or NIL
 1  pos ← NIL
 2  if next(pos+r) ≠ NIL and next(pos−r) ≠ NIL then
 3      ∆+ ← |r(θ) − value(Lr[next(pos+r)])|
 4      ∆− ← |r(θ) − value(Lr[next(pos−r)])|
 5      if ∆+ ≤ ∆− then
 6          pos, pos+r ← next(pos+r)
 7      else
 8          pos, pos−r ← next(pos−r)
 9      end
10  else if next(pos+r) ≠ NIL and next(pos−r) = NIL then
11      pos, pos+r ← next(pos+r)
12  else if next(pos+r) = NIL and next(pos−r) ≠ NIL then
13      pos, pos−r ← next(pos−r)
14  end
15  if pos ≠ NIL then
16      return timestamp(Lr[pos])
17  else
18      return NIL
19  end

At the moment we use an in-memory implementation of the algorithm, which loads the whole data set into main memory. More specifically, we keep two copies of the data in memory: the data sorted by timestamp for fast random access, and the data sorted by value and timestamp for fast sorted access.

Note that we did not normalize the raw data using some standard technique like z-score normalization, as we cannot compute that efficiently for streams of data without increasing the complexity of our algorithm.
                                                                                                                                     Parameter k

4.4 Proof of Correctness

The correctness of the Top-k Case Matching algorithm follows directly from the correctness of the Threshold Algorithm. What remains to be shown, however, is that the aggregation function δ(t) is monotone.

THEOREM 4.1. The aggregation function δ(t) is a monotonically increasing function.

PROOF. Let t1 and t2 be two timestamps such that |r(θ) − r(t1)| ≤ |r(θ) − r(t2)| for each r ∈ Rs. Then it trivially follows that δ(t1) ≤ δ(t2), as the aggregation function δ is the sum of |r(θ) − r(t1)| over all r ∈ Rs and, by definition, each component of δ(t1) is less than or equal to the corresponding component of δ(t2).

4.5 Theoretical Bounds

The space and runtime bounds of the algorithm follow directly from the probabilistic guarantees of TA, which has sub-linear cost with high probability and linear cost in the worst case. Note that sorting the raw data to build the lists Lr is a one-time pre-processing step with complexity O(n log n). After that, the system can insert new measurements efficiently into the sorted lists with logarithmic cost.
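A minimal sketch of this maintenance step, assuming the third-party sortedcontainers package (our illustrative choice; any balanced-tree-like structure would serve the same purpose):

    from sortedcontainers import SortedList

    # One sorted access list per reference time series, ordered by (value, time).
    L_r = SortedList()

    def on_new_measurement(t, value):
        # add() keeps the list ordered; its cost is roughly logarithmic, so the
        # lists can be maintained as new measurements stream in every 5 minutes.
        L_r.add((value, t))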
5. EXPERIMENTAL EVALUATION

In this section, we present preliminary results of an experimental evaluation of the proposed Top-k Case Matching algorithm. First, we study the impact of the parameter k on the Top-k Case Matching and a baseline algorithm. The baseline algorithm, referred to as "Simple Average", imputes the missing value s(θ) with the average of the values in the reference time series at time θ, i.e., s(θ) = (1/|Rs|) Σ_{r ∈ Rs} r(θ). Second, we compare our solution with two state-of-the-art competitors, REBOM [4] and CD [5].

5.1 Varying k

In this experiment, we study the impact of the parameter k on the accuracy and the runtime of our algorithm. We picked five base stations distributed all over South Tyrol, each having two to five reference stations. We simulated a failure of the base stations during a time interval, w, of 8 days in the month of April 2013. This amounts to a total of 11452 missing values. We then used the Top-k Case Matching (using both the average and the median as aggregation function g) and Simple Average algorithms to impute the missing values. As a measure of accuracy we use the average absolute difference between the real value s(θ) and the imputed value s∗(θ), i.e., ∆ = (1/|w|) Σ_{θ ∈ w} |s(θ) − s∗(θ)|.
Figure 2 shows how the accuracy of the algorithms changes with varying k. Interestingly and somewhat unexpectedly, ∆ decreases as k increases. This is contrary to what we expected, since with an increasing k the error function δ(t) also grows, and therefore less similar historical situations are used for the imputation. However, after a careful analysis of the results it turned out that for low values of k the algorithm is more sensitive to outliers, and due to the often low quality of the raw data the imputation is flawed.

[Figure 2: Impact of k on accuracy. The plot shows the average difference ∆ in °C (roughly 0.5–0.8) for k between 0 and 100, for Top-k (Average), Top-k (Median), and Simple Average.]

Table 5 shows an example of flawed raw data. The first row is the current situation, and we assume that the value of the base station s in this row is missing and needs to be recovered. The search for the k = 3 most similar situations using our algorithm yields the three rows at the bottom. Notice that one base station value is 39.9° around midnight of a day in August, which is obviously very unlikely. By increasing k, the impact of such outliers is reduced and hence ∆ decreases. Furthermore, using the median as aggregation function reduces the impact of outliers and therefore yields better results than the average.

    Timestamp           s          r1         r2         r3
    2013-04-16 19:35    18.399°    17.100°    19.293°    18.043°
    2012-08-24 01:40    18.276°    17.111°    19.300°    18.017°
    2004-09-29 15:50    19.644°    17.114°    19.259°    18.072°
    2003-08-02 01:10    39.900°    17.100°    19.365°    18.065°

Table 5: Example of flawed raw data.

Figure 3 shows the runtime, which for the Top-k Case Matching algorithm increases linearly with k. Notice that, although the imputation of missing values for 8 days takes several minutes, the algorithm is fast enough to continuously impute missing values in our application at the SBR. The experiment essentially corresponds to a scenario in which an error occurs in 11452 base stations at the same time. With 120 weather stations operated by the SBR, the number of missing values at each point in time is only a tiny fraction of the missing values that we simulated in this experiment.

[Figure 3: Impact of k on runtime. The plot shows the runtime in seconds (up to about 800) for k between 0 and 100, for Top-k (Average), Top-k (Median), and Simple Average.]

5.2 Comparison with CD and REBOM

In this experiment, we compare the Top-k Case Matching algorithm with two state-of-the-art algorithms, REBOM [4] and CD [5]. We used four time series, each containing 50,000 measurements, which corresponds roughly to half a year of temperature measurements. We simulated a week of missing values (i.e., 2017 measurements) in one time series and used the other three as reference time series for the imputation.
The box plot in Figure 4 shows how the imputation error |s(θ) − s∗(θ)| is distributed for each of the four algorithms. The left and right borders of each box are the first and third quartile, respectively. The line inside the box denotes the median, and the left and right whiskers are the 2.5% and 97.5% percentiles, which means that the plot covers 95% of the values and omits statistical outliers. The experiment clearly shows that the Top-k Case Matching algorithm is able to impute the missing values more accurately than CD and REBOM. Although not visualized, the maximum observed error for our algorithm is, with 2.29° (Average) and 2.21° (Median), also considerably lower than the 3.71° for CD and the 3.6° for REBOM.

[Figure 4: Comparison with REBOM and CD. Box plots of the absolute difference in °C (0 to 2) for Top-k (Median), Top-k (Average), CD, and REBOM.]

In terms of runtime, the Top-k Case Matching algorithm needed 16 seconds for the imputation of the 2017 missing measurements, whereas CD and REBOM needed roughly 10 minutes each. Note, however, that this large difference in runtime is also due to the fact that CD and REBOM need to compute the Pearson correlation coefficient, which is a time-intensive operation.

6. CONCLUSION AND FUTURE WORK

In this paper, we presented a simple yet efficient and accurate algorithm, termed Top-k Case Matching, for the imputation of missing values in time series data, where the time series are similar to each other. The basic idea of the algorithm is to look for the k situations in the historical data that are most similar to the current situation and to derive the missing values from the data at these time points. Our Top-k Case Matching algorithm is based on Fagin's Threshold Algorithm. We presented the results of a first experimental evaluation. The Top-k Case Matching algorithm achieves a high accuracy and outperforms two state-of-the-art solutions both in terms of accuracy and runtime.

As next steps we will continue with the evaluation of the algorithm, taking into account also model-based techniques such as DynaMMo [6] and other statistical approaches outlined in [9]. We will further study the impact of complex weather phenomena that we observed in our data, such as the foehn. The foehn induces shifting effects in the time series data, as the warm wind causes the temperature to increase rapidly by up to 15° as soon as the foehn reaches another station.

There are several possibilities to further improve the algorithm. First, we would like to explore whether the algorithm can dynamically determine an optimal value for the parameter k, which is currently given by the user. Second, we would like to make the algorithm more robust against outliers. For example, the algorithm could consider only historical situations that occur at roughly the same time of the day. Moreover, we can extend the definition of "current situation" to consider not only the current timestamp, but rather a small window of consecutive timestamps. This should make the ranking more robust against anomalies in the raw data and weather phenomena such as the foehn. Third, right now the similarity between time series is based solely on temperature data. We would like to include the other time series data collected by the weather stations, such as humidity, precipitation, wind, etc. Finally, the algorithm should be able to automatically choose the currently hand-picked reference time series based on some similarity measure, such as the Pearson correlation coefficient.

7. ACKNOWLEDGEMENTS

This work has been done as part of the DASA project, which is funded by the Foundation of the Free University of Bozen-Bolzano. We wish to thank our partners at the Südtiroler Beratungsring and the Research Centre for Agriculture and Forestry Laimburg for the good collaboration and the helpful domain insights they provided, in particular Armin Hofer, Martin Thalheimer, and Robert Wiedmer.

8. REFERENCES

[1] G. E. P. Box and G. Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990.
[2] R. Fagin. Combining fuzzy information from multiple systems (extended abstract). In PODS '96, pages 216–226, New York, NY, USA, 1996. ACM.
[3] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS '01, pages 102–113, New York, NY, USA, 2001. ACM.
[4] M. Khayati and M. H. Böhlen. REBOM: Recovery of blocks of missing values in time series. In COMAD '12, pages 44–55, 2012.
[5] M. Khayati, M. H. Böhlen, and J. Gamper. Memory-efficient centroid decomposition for long time series. In ICDE '14, pages 100–111, 2014.
[6] L. Li, J. McCann, N. S. Pollard, and C. Faloutsos. DynaMMo: Mining and summarization of coevolving sequences with missing values. In KDD '09, pages 507–516, New York, NY, USA, 2009. ACM.
[7] A. Mueen, S. Nath, and J. Liu. Fast approximate correlation for massive time-series data. In SIGMOD '10, pages 171–182, New York, NY, USA, 2010. ACM.
[8] M. Theobald, G. Weikum, and R. Schenkel. Top-k query evaluation with probabilistic guarantees. In VLDB '04, pages 648–659. VLDB Endowment, 2004.
[9] C. Yozgatligil, S. Aslan, C. Iyigun, and I. Batmaz. Comparison of missing value imputation methods in time series: The case of Turkish meteorological data. Theoretical and Applied Climatology, 112(1-2):143–167, 2013.