Proceedings of the 1st International Workshop on Advanced Analytics and Learning on Temporal Data (AALTD 2015)

Temporal Density Extrapolation

Georg Krempl
Knowledge Management & Discovery, Otto-von-Guericke University Magdeburg
Universitätsplatz 2, 39106 Magdeburg, Germany
https://kmd.cs.ovgu.de/res/driftmining

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

Abstract. Mining evolving data streams raises the question of how to extrapolate trends in the evolution of densities over time. While approaches for change diagnosis work well for interpolating spatio-temporal densities, they are not designed for extrapolation tasks. This work studies the temporal density extrapolation problem and sketches two approaches that address it. Both use a set of pseudo-points in combination with spatio-temporal kernel density estimation. The first, weight-extrapolating approach uses regression on the weights of stationary pseudo-points. The second, location-extrapolating approach extrapolates the trajectories of uniformly weighted pseudo-points within the feature space.

Keywords: kernel density estimation, density extrapolation, density forecasting, spatio-temporal density, evolving data streams, nonstationary environments, concept drift, drift mining

1 Introduction

Density estimation methods, like kernel density estimation [14, 13], allow one to learn a model from instances observed at different positions in feature space, and to use this model to estimate the density at any position within this feature space. While the original work in [14, 13] is limited to spatial densities of a stationary distribution, the approach was extended to spatio-temporal density estimation of non-stationary distributions in [1, 2]. This so-called velocity density estimation allows one to estimate and visualise trends in densities. However, these existing approaches are not directly applicable to predicting the density at future time points, for example to extrapolate trends in the evolution of densities and to build classifiers that work with delayed label information [10, 5]. Such temporal density extrapolation should predict the densities at spatio-temporal coordinates in the future, given a sample of (historic) instances observed at different positions in feature space and at different times in the past.

We propose and study two approaches to address this problem. Both use extrapolation of pseudo-points in combination with spatio-temporal kernel density estimation. The first approach extrapolates the weights of stationary pseudo-points, while the second extrapolates the paths of moving pseudo-points of fixed weight. Subsequently, these weight- or position-extrapolated pseudo-points are used in a spatio-temporal kernel density estimation.

In the following Section 2, we review the related work, before sketching the two approaches in Section 3 and concluding in Section 4.

2 Related Work

The task of estimating the probability density based on a sample of independently and identically distributed (iid) observations has been studied intensively. Density estimation methods, like kernel density estimation [14, 13], allow one to learn a model from instances observed at different positions in feature space, and to use this model to estimate the density at any position within this feature space.
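For a univariate sample x_1, ..., x_n, the classical Rosenblatt-Parzen estimator [14, 13] with kernel K and bandwidth h > 0 takes the form

\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right),

where K is typically a symmetric probability density such as the Gaussian, so that \hat{f} itself integrates to one.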
The difficulties of the early kernel and nearest-neighbour density estimation techniques in multivariate settings were addressed by approaches like projection pursuit density estimation, proposed in [4]. All these density estimation approaches, as well as related curve regression approaches, require an iid sample from a stationary distribution [11].

In the case of a nonstationary distribution, one might be interested in estimating the density at different points in time and space. In [1, 2], this problem of spatio-temporal density estimation is addressed by combining spatial kernel density estimation with a temporal weighting of instances. A framework for so-called change diagnosis in evolving data streams is proposed, which estimates the rate of change in each region by using a user-specified temporal window to calculate forward and reverse time-slice density estimates. This velocity density estimation technique is applicable to spatio-temporal density interpolation, i.e. to monitoring and visualising the change of densities in a (past) time window. However, it is not designed for extrapolating the density to (future) time points outside the window of observed historical data.

Related to change diagnosis is change mining [3], which aims to understand the changes in data distributions themselves. Within this paradigm, the idea of so-called drift-mining approaches [8, 9, 5] is to model the evolution of distributions in order to extrapolate them to future time points, thereby addressing problems of verification latency or label delay. The algorithm APT proposed in [8] uses matching between labelled old and unlabelled new instances to infer the labels of the latter, thus indirectly estimating the class-conditional distributions of the new instances. Likewise, an expectation-maximisation approach is used in [9] to track components of a Gaussian mixture model over time. In [5], a mixture model is learned on old labelled data and compared to density estimates on current unlabelled data, thereby inferring changes such as a change of the class prior. However, these approaches are again not designed for directly extrapolating densities.

Density forecasting approaches [15, 16, 7], on the other hand, focus on the prediction of a single variable's (mostly unimodal) density at a particular future time point, based on an observed time series of this variable's values. In the simplest case, as discussed for example in [16], this is done by providing a confidence interval for a point estimate, obtained by assuming a particular distribution of this variable. More sophisticated approaches return (potentially multi-modal) density estimates by combining several predictions, obtained for example from different macroeconomic models, experts, or simulation outcomes, into a distribution estimate by kernel methods. Nevertheless, their multi-modal character originates from the different modes of the combined unimodal models. In addition, most works consider only a single variable. One exception is [7], where univariate forecasts of two explanatory variables are converted, using conditional kernel density estimation, into forecasts of the dependent variable.

In contrast to the density forecasting approaches above, we are concerned with temporal density extrapolation of a potentially multi-modal density distribution. Furthermore, instead of having a time series of single observations at any one time, our input data consist of multiple observations at any one time.
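To make this setting concrete: the input is a sample of pairs (x_i, t_i) of feature values and observation times, with many instances per time point. A temporally weighted spatio-temporal kernel density estimate, in the spirit of [1, 2] though not their exact estimator, can be written as

\hat{f}(x, t) = \frac{\sum_{i} w(t - t_i) \, K_h(x - x_i)}{\sum_{i} w(t - t_i)},

with spatial kernel K_h and a temporal weighting function such as w(\Delta) = \exp(-\lambda |\Delta|). For t inside the observed window this interpolates; for t beyond it, however, the estimate merely reweights past instances towards the most recent ones and cannot continue a trend, which is precisely the gap addressed in this work.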
This temporal density extrapolation is related to spatial density extrapolation [17, 6], which addresses the extrapolation of densities to feature values that have not yet been seen in historical instances. In [17], the authors suggest a Taylor series expansion about the point of interest to estimate the density, while in [6] a statistical test is provided to examine whether the data distribution is distinct from a uniform distribution at the extrapolation position. While modelling time as a feature is possible, there is an important difference between extrapolation in time and in feature space: one expects the density to diminish towards unpopulated (and thus unseen) positions in feature space. However, there is no a priori reason to assume that densities decrease towards yet unseen moments in time. On the contrary, it is reasonable to assume that at each point in time (whether future, current, or past) the density integrates to one over the feature space.

3 Temporal Density Extrapolation

To address the problem of extrapolating the observed, potentially multi-modal density distribution of instances to future time points, we propose an approach based on pseudo-points. These pseudo-points are used in the spatio-temporal kernel density estimation in lieu of the originally observed instances. The resulting kernel density estimation model can be interpreted as a mixture model, where each pseudo-point itself constitutes a component. The pseudo-points evolve over time, either by changing their weight (their component's mixing proportion) or by changing their position (their component's location). Therefore, the learning task is to fit a trend function to the evolution of each pseudo-point. We present the two variants in Subsections 3.1 and 3.2, before discussing their potential difficulties and limitations in Section 3.3.

3.1 Weight-Extrapolated, Stationary Pseudo-Points

Given a set of stationary pseudo-points, the first approach models their weights as functions of time. These functions are then fitted on a window of historical data, such that the distribution therein is modelled with maximum likelihood. The approach is illustrated for a one-dimensional feature space in Figure 1. At the first time point in the past (time = 0), a density estimate is calculated using historical data collected at that time (solid blue line). Then, a set of pseudo-points (here numbered 1, 2, ..., 4) is generated, either by placing them equidistantly on a grid or by drawing them at random. Next, the weights (w1, w2, ..., w4) of all pseudo-points are calculated such that the divergence between the kernel density estimate over the weighted pseudo-points and the kernel density estimate over the original data instances at that time point is minimised. The pseudo-points' weights are estimated in the same way for subsequent time points (e.g. time = 1), as soon as instances become available for them. This results, for each pseudo-point, in a time series of weight values, for which a polynomial trend function (red curves) is learned by regression. Finally, for a future time point (e.g. time = 2), the trend functions' values are predicted (w1, w2, ..., w4 in red at time = 2). Using these weighted pseudo-points in a kernel density estimate at time = 2, one obtains the extrapolated density (red dotted line), which is later evaluated against the observed density (solid blue-grey line).
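As a concrete illustration of this first variant, consider the following minimal Python sketch. It makes simplifying assumptions: the divergence minimisation is replaced by a non-negative least-squares fit between the pseudo-point mixture and a per-slice kernel density estimate, Gaussian kernels with a fixed bandwidth are used, and the feature space is one-dimensional. All function and variable names are illustrative, not taken from the paper.

import numpy as np
from scipy.optimize import nnls
from scipy.stats import gaussian_kde, norm

def fit_weight_trends(slices, grid, h, deg=2):
    """slices: dict mapping time t -> 1-d array of instances observed at t.
    grid: fixed pseudo-point locations; h: kernel bandwidth.
    Returns one polynomial (coefficient array) per pseudo-point weight."""
    eval_x = np.linspace(grid.min() - 3 * h, grid.max() + 3 * h, 200)
    # Column j holds the kernel of pseudo-point j evaluated on eval_x.
    K = norm.pdf(eval_x[:, None], loc=grid[None, :], scale=h)
    times, weights = [], []
    for t in sorted(slices):
        target = gaussian_kde(slices[t])(eval_x)  # density of this time slice
        w, _ = nnls(K, target)                    # non-negative least squares
        times.append(t)
        weights.append(w / w.sum())
    W = np.array(weights)                         # shape: (n_times, n_pseudo)
    return [np.polyfit(times, W[:, j], deg) for j in range(len(grid))]

def extrapolate_density(trends, grid, h, t_future, x):
    """Predict each weight at t_future, renormalise, and evaluate the KDE."""
    w = np.array([np.polyval(c, t_future) for c in trends])
    w = np.clip(w, 0.0, None)
    w /= w.sum()                                  # non-negative, sums to one
    return norm.pdf(x[:, None], loc=grid[None, :], scale=h) @ w

The clipping and rescaling of the predicted weights anticipates the normalisation requirement discussed in Section 3.3, namely that the extrapolated weights be non-negative and sum to one.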
3.2 Position-Extrapolated, Uniformly-Weighted Pseudo-Points

The second approach to this problem uses uniformly weighted, but flexibly located pseudo-points. Thus, the pseudo-points' weights are uniform and constant, but their positions are functions of time, fitted such that the divergence on the available historical data is minimised.

In analogy to the previous figure, this approach is illustrated for a one-dimensional feature space in Figure 2. Given a set of historical instances and a specified number of pseudo-points, density estimates (solid blue lines) are made for historical time points (time = 0 and time = 1). Then, a mixture model with each pseudo-point as a single Gaussian component is formulated. Assuming polynomial trajectories (red solid lines) for the pseudo-points, the parameters of this model are the coefficients of the pseudo-points' polynomial trajectories, which are learned using expectation-maximisation. Finally, for a future time point (time = 2), the pseudo-points' positions are predicted using the polynomial functions, and the density (red dotted line) at this time point is estimated using kernel density estimation over the pseudo-points placed at their extrapolated positions.

Fig. 1. Temporal Density Extrapolation Using Weight-Extrapolated Pseudopoints. (Axes: feature X, density, time; shown are the observed densities at times 0 to 2, the extrapolated pseudo-point weights w1, ..., w4, and the extrapolated density at time = 2.)

Fig. 2. Temporal Density Extrapolation Using Position-Extrapolated Pseudopoints. (Axes: feature X, density, time; shown are the observed densities, the extrapolated pseudo-point trajectories p1, ..., p4, and the extrapolated density at time = 2.)

3.3 Discussion

Both approaches rely on a regression over time for extrapolating trends in the development of either weights or positions. In order to make this extrapolation more robust, we recommend using regularised trend functions that penalise model complexity. The choice of the type of regression function depends on the type of drift: for example, polynomial functions require gradual drift, while trigonometric functions seem to be interesting candidates for modelling recurring contexts.

The weight extrapolation in the first approach requires a normalisation, such that the extrapolated weights are all non-negative and sum to one. An important question concerns the choice of the pseudo-points' locations in this approach, as it influences the precision of the extrapolated values: in regions with sparse pseudo-point populations, the model is less flexible than in densely populated ones. Therefore, this approach seems better suited to constricted (bounded) feature spaces. A simple equidistant placement of pseudo-points distributes the precision over the whole feature space. Alternatively, the pseudo-points might be placed at the coordinates of a subsample of the observed instances, thus concentrating the precision on areas with previously high density. However, if densities change greatly over time, these areas might become less relevant.

In contrast, the second, position-extrapolating approach determines the position of each pseudo-point automatically. It aims to adjust the future locations of the pseudo-points such that they are densely placed in regions with a high expected density. However, if polynomial regression functions are used, a potential drawback is that their trajectories diverge in the long run. Thus, in contrast to the first approach, the second one seems better suited to infinite (unbounded) feature spaces.
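For comparison with the sketch in Section 3.1, the following illustrative Python sketch approximates the position-extrapolating variant. Instead of the expectation-maximisation fit over trajectory coefficients described above, it uses a simpler stand-in: at each historical time slice, the pseudo-points are placed at the slice's quantiles, and a polynomial trajectory is then fitted to each pseudo-point's positions over time. All names are again illustrative.

import numpy as np
from scipy.stats import norm

def fit_trajectories(slices, n_pseudo, deg=2):
    """Illustrative stand-in for the EM fit over trajectory coefficients:
    at each historical time, place the pseudo-points at the slice's
    quantiles, then fit a polynomial through each pseudo-point's
    positions over time."""
    qs = (np.arange(n_pseudo) + 0.5) / n_pseudo  # evenly spaced quantile levels
    times = sorted(slices)
    # positions[i, j]: location of pseudo-point j at time times[i]
    positions = np.array([np.quantile(slices[t], qs) for t in times])
    return [np.polyfit(times, positions[:, j], deg) for j in range(n_pseudo)]

def extrapolated_density(trajectories, h, t_future, x):
    """Evaluate a uniformly weighted Gaussian KDE over the pseudo-points
    placed at their extrapolated positions."""
    p = np.array([np.polyval(c, t_future) for c in trajectories])
    return norm.pdf(x[:, None], loc=p[None, :], scale=h).mean(axis=1)

In one dimension, tracking quantiles keeps the pseudo-points' identities aligned across time slices and thereby avoids the label-switching problem of matching mixture components; the EM formulation sketched above addresses the fitting jointly instead.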
Related to the choice of the pseudo-points' placements is the question of optimal bandwidth selection, which for kernel density estimation has already been reviewed in [18]. In short, we expect that with an increasing number of pseudo-points the optimal bandwidth decreases, while the extrapolation's precision increases. Furthermore, the number of pseudo-points is also an upper bound on the number of modes that either approach is able to model.

4 Conclusion

In this paper, we have addressed the problem of temporal density extrapolation, where the objective is the prediction of a (potentially multi-modal) density distribution at future time points, given a sample of historical instances observed at different positions in feature space and at different times in the past. Two approaches based on pseudo-points were sketched: the first extrapolates the time-varying weights of stationary pseudo-points, while the second extrapolates the trajectories of the time-varying locations of uniformly weighted pseudo-points. Subsequently, these extrapolated pseudo-points are used in a kernel density estimation at future time points.

Having sketched the idea of the two temporal density extrapolation approaches, a more detailed specification and evaluation of these methods remains future work. Furthermore, the conjectures in the discussion above, in particular the suitability of each approach for bounded and unbounded feature spaces, need to be verified. Finally, a known challenge for kernel-based approaches is the curse of dimensionality on multi-dimensional data. A naive approach is to combine multiple univariate temporal density extrapolations. However, an optimisation for multivariate problems using either projection pursuit [4] or copula [12] techniques seems worth investigating.

Acknowledgments. We thank Vera Hofer from Karl-Franzens-University Graz, Stephan Möhring, and Andy Koch for insightful discussions.

References

1. Aggarwal, C.C.: A framework for diagnosing changes in evolving data streams. In: Proceedings of the ACM SIGMOD Conference (2003)
2. Aggarwal, C.C.: On change diagnosis in evolving data streams. IEEE Transactions on Knowledge and Data Engineering 17(5), 587–600 (2005)
3. Böttcher, M., Höppner, F., Spiliopoulou, M.: On exploiting the power of time in data mining. ACM SIGKDD Explorations Newsletter 10(2), 3–11 (2008)
4. Friedman, J.H., Stuetzle, W., Schroeder, A.: Projection pursuit density estimation. Journal of the American Statistical Association 79(387) (1984)
5. Hofer, V., Krempl, G.: Drift mining in data: A framework for addressing drift in classification. Computational Statistics and Data Analysis 57(1), 377–391 (2013)
6. Hooker, G.: Diagnosing extrapolation: Tree-based density estimation. In: Knowledge Discovery in Databases (KDD) (2004)
7. Jeon, J., Taylor, J.W.: Using conditional kernel density estimation for wind power density forecasting. Journal of the American Statistical Association 107 (2012)
8. Krempl, G.: The algorithm APT to classify in concurrence of latency and drift. In: Gama, J., Bradley, E., Hollmén, J. (eds.) Advances in Intelligent Data Analysis X, Lecture Notes in Computer Science, vol. 7014, pp. 222–233. Springer (2011)
9. Krempl, G., Hofer, V.: Classification in presence of drift and latency.
In: Spiliopoulou, M., Wang, H., Cook, D., Pei, J., Wang, W., Zaïane, O., Wu, X. (eds.) Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDMW 2011). IEEE (2011)
10. Krempl, G., Žliobaitė, I., Brzeziński, D., Hüllermeier, E., Last, M., Lemaire, V., Noack, T., Shaker, A., Sievi, S., Spiliopoulou, M., Stefanowski, J.: Open challenges for data stream mining research. SIGKDD Explorations 16(1), 1–10 (2014), special issue on Big Data
11. Nadaraya, E.A.: Nonparametric Estimation of Probability Densities and Regression Curves. Kluwer (1989), originally published in Russian by Tbilisi University Press, translated by S. Kotz
12. Nelsen, R.B.: An Introduction to Copulas. Springer (1999)
13. Parzen, E.: On estimation of a probability density function and mode. Annals of Mathematical Statistics 33(3), 1065–1076 (1962)
14. Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics 27(3), 832–837 (1956)
15. Skouras, K., Dawid, A.P.: On efficient probability forecasting systems. Biometrika 86(4), 765–784 (1999)
16. Tay, A.S., Wallis, K.F.: Density forecasting: A survey. In: A Companion to Economic Forecasting, pp. 45–68 (2002)
17. Terrell, G.R.: Tail probabilities by density extrapolation. In: Proceedings of the Annual Meeting of the American Statistical Association (2001)
18. Turlach, B.A.: Bandwidth selection in kernel density estimation: A review. Tech. Rep. 9307, Humboldt University, Statistik und Ökonometrie (1991)