=Paper=
{{Paper
|id=Vol-1712/p02
|storemode=property
|title=Reconstruction of Missing Data in Synthetic Time Series Using EMD
|pdfUrl=https://ceur-ws.org/Vol-1712/p02.pdf
|volume=Vol-1712
|authors=Tatjana Sidekerskiene,Robertas Damaševičius
}}
==Reconstruction of Missing Data in Synthetic Time Series Using EMD==
Tatjana Sidekerskiene, Department of Applied Mathematics, Kaunas University of Technology, Kaunas, Lithuania. Email: tatjana.sidekerskiene@ktu.lt

Robertas Damasevicius, Department of Software Engineering, Kaunas University of Technology, Kaunas, Lithuania. Email: robertas.damasevicius@ktu.lt
Abstract—The paper presents a novel method for the reconstruction of missing data in time series. The method is based on the decomposition of the known parts of the time series into mono-components (Intrinsic Mode Functions, IMFs) using Empirical Mode Decomposition (EMD), the construction of a prediction model for each IMF from the known parts of the time series, and their composition using a weighted average. We demonstrate the efficiency of the proposed approach using synthetic time series data.

Copyright © 2016 held by the authors.

I. INTRODUCTION

Knowledge-based decision-making processes (such as business decisions) depend heavily on the availability of data from which information can be extracted. These processes often use predictive models or other computational intelligence techniques that take observed data as inputs. However, in some cases, due to various reasons (power failure, sensor failure, maintenance, human error, etc.), data can be lost, corrupted, or recorded incompletely, which negatively affects the quality of the data. Most decision-making and machine learning tools, such as the commonly used Artificial Neural Networks (ANN), Support Vector Machines (SVM), Principal Component Analysis (PCA), and many other computational intelligence techniques, cannot be used for decision making if the data are incomplete [1]. Missing values in the data can severely affect the interpretation and hinder downstream analysis such as supervised or unsupervised classification or clustering. Since the decision output should still be maintained despite the missing data, the problem of missing data has to be dealt with. Therefore, in the case of incomplete or missing data, the first step in data processing is to estimate the missing values.

The task, also known as missing value imputation [2] or gap filling, is an important one in cases where it is crucial to use all available data and not discard records with missing values. It is especially relevant in many real-world time series, such as those obtained by remote sensing observations made without direct physical contact with the observed object [13], in which raw data can be corrupted, obstructed, and hindered in multiple, often unforeseen, ways. Such time series may contain multiple gaps, which must be dealt with before applying other signal processing techniques such as spectral analysis. Consequently, the quality and completeness of these time series are essential. Researchers have previously demonstrated [4] that missing value imputation can significantly improve the overall prediction or data analysis when information about the data is incorporated into the imputation. Thus, the problem of gap filling (or data reconstruction) is a fundamental one in computational intelligence.

The applications of gap filling (also known as missing data imputation) are numerous and include bioinformatics (gene microarray data) [5] and remote sensing observations [13]. The existing gap-filling techniques depend on the size of the gaps and the nature of the data. If gaps are small, the solutions are simple and well researched. For example, mean imputation [4] substitutes every missing value with the mean of the observations. Because of its simplicity, mean imputation is commonly used in the social sciences. Regression imputation replaces missing values with predictions made by a regression (curve fitting, interpolation) model of the missing value on the variables observed for the data vector. If there is an abundance of complete data, hot-deck imputation can be used to substitute missing values according to data vectors with similar values. However, if gaps extend over several cycles of the data in a time series, specific methods should be developed.

The specific gap-filling methods can be categorized, according to the type of information used, as global and local. In the global approach, the algorithms estimate missing values based on global correlation information obtained from the entire data matrix. An example of such algorithms is SVD imputation [6]. Peng and Zhu [7] used Independent Component Analysis (ICA) and Self-Organizing Maps (SOM) to handle missing values. Gaussian mixture models are used in [8]. Ssali and Marwala [9] combined decision trees, a Principal Component Analysis Neural Network (PCA-NN), and an Auto-Associative Neural Network (AANN) to estimate the missing values. The power of evolutionary computing in the imputation process is emphasized in [10] by combining an ANN with either a Genetic Algorithm (GA) or Particle Swarm Optimization (PSO). In [11], [12], a composite neuro-wavelet reconstruction system, composed of two separately trained neural networks, was used to obtain the reconstruction. In the local approach, the algorithms exploit only the local similarity structure in the data sets to perform missing value imputation.
The signal decomposition methods split the signal into additive components. The time series is then reconstructed using only the components of interest, usually by removing the high-frequency components considered as noise. Because the decomposition is performed over a limited temporal window, only a limited amount of information is used when filling gaps.

The Iterative Caterpillar Singular Spectrum Analysis (ICSSA) method [13] is a modification of the CSSA method [14] developed to describe time series and fill missing data by decomposing the time series into empirical orthogonal functions (EOFs). This method allows filling gaps and forecasting data at the extremities of the time series.

The Empirical Mode Decomposition (EMD) method [15] consists of decomposing the time series into a small number of intrinsic mode functions (IMFs) derived directly from the time series itself, using an adaptive iterative process based on the local characteristics of the data in the time domain. The first IMF, mostly affected by noise, can be discarded or thresholded to remove the high-frequency fluctuations [16]. Note that the EMD method requires the time series to be continuous. EMD has been used for gap filling before in [17]. In that method, gap filling is based on adding artificial local extrema in the missing part of the data, under the assumptions that the behavior of the missing data is similar to the neighbourhood of the gap and that, for each IMF, every two consecutive extrema in the gap are equally distant. The modified EMD produces IMFs with gaps, and the gap data are reconstructed by summing up the IMFs. However, the assumption that the behavior of the data does not change in the gap may not hold true. Furthermore, the method involves the manual selection of extrema points based on the appearance of each gap-filled IMF, which is time-consuming. Kandasamy et al. [13] filled the missing data by linear interpolation and then applied signal decomposition by EMD. However, as linear interpolation generally performs poorly over long periods without observations, the EMD-based method fails when there is a significant fraction of gaps (more than 20%).

The aim of this paper is to propose and evaluate a novel method for the reconstruction of missing data in time series. The method is based on the decomposition of the known parts of the time series into mono-components (IMFs) using EMD, the construction of a prediction model for each IMF from the known parts of the time series, and their composition using a weighted average.

The structure of the remaining parts of the paper is as follows. The proposed method is described in Section II. The metrics for evaluating the accuracy of the solution are discussed in Section III. The experiment is described in Section IV. The results of the experiment are presented in Section V. Finally, conclusions and a discussion of future work are presented in Section VI.

II. DESCRIPTION OF METHOD

A. Task

Consider a time series $X \in \mathbb{R}^{T}$ with $m \le T-2$ data points missing. The missing data are indicated by a vector $\hat{X} \in \mathbb{R}^{T}$ with components $\hat{x}_t = 1$ for all $t \in \{1, 2, \ldots, T\}$ where $x_t$ is defined, and $\hat{x}_t = -1$ for all $t \in \{1, 2, \ldots, T\}$ where $x_t$ is not defined. Let the zero-crossing rate of $\hat{X}$ be

$$\mathrm{zcr}(\hat{X}) = \sum_{t=1}^{T} I\{\hat{x}_t \hat{x}_{t-1} < 0\}, \qquad (1)$$

where $I\{A\}$ is the Iverson bracket: $I\{A\}$ is 1 if its argument $A$ is true, and 0 otherwise.

Hereinafter we introduce a simplification and consider only such $X$ for which $\mathrm{zcr}(\hat{X}) = 2$ and $\hat{x}_0 = 1$. That is, the time series starts with known data, then all the missing data follow, and the time series ends with known data. Hereinafter, we denote the known data at the beginning of the time series $X$, before the missing data, as the left series $X_l$; the known data at the end of the time series $X$, after the missing data, as the right series $X_r$; and the missing series as $X_m$.
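To make the notation concrete, the following Python sketch (our illustration, not the authors' code) builds the indicator vector, evaluates the zero-crossing rate of Eq. (1), and splits a series with a single interior gap into $X_l$ and $X_r$; the function and variable names are assumptions.

```python
import numpy as np

def gap_setup(x):
    """Split a series with one contiguous NaN gap into its left/right known parts.

    x : 1-D array in which missing samples are encoded as NaN.
    Returns (x_left, x_right, gap_slice)."""
    known = ~np.isnan(x)
    ind = np.where(known, 1, -1)              # indicator vector: +1 known, -1 missing
    zcr = np.sum(ind[1:] * ind[:-1] < 0)      # zero-crossing rate, Eq. (1)
    if zcr != 2 or ind[0] != 1 or ind[-1] != 1:
        raise ValueError("expected a single interior gap (zcr == 2)")
    missing = np.flatnonzero(~known)
    start, stop = missing[0], missing[-1] + 1
    return x[:start], x[stop:], slice(start, stop)

# usage
t = np.arange(200, dtype=float)
x = np.sin(2 * np.pi * t / 40)
x[80:120] = np.nan                            # introduce a 40-sample gap
x_left, x_right, gap = gap_setup(x)
```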
B. Outline

The proposed method consists of the following steps (a compact pipeline sketch follows the list):

1) Decompose $X_l$ into a set of IMFs $IMF_l$, and decompose $X_r$ into a set of IMFs $IMF_r$, using EMD.
2) Predict $IMF_l$ into the future by $m$ steps using the polynomial prediction model. Let the predicted set of series be $IMF'$.
3) Predict $IMF_r$ into the past by $m$ steps using the polynomial prediction model. Let the predicted set of series be $IMF''$.
4) Combine the predicted series $IMF'$ and $IMF''$ using the linearly weighted average. Let the result be $IMF$.
5) Sum $IMF$ to derive the prediction of $X_m$, denoted $Y$.
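The fragment below is a rough, hedged reading of steps 1)-5), not the authors' implementation: it assumes the third-party PyEMD package (EMD-signal) for the decomposition step, uses a simple polynomial extrapolation as a stand-in for the prediction model detailed in Section II-D, and pairs the left and right IMFs by index, which is a simplification.

```python
import numpy as np
from PyEMD import EMD   # third-party "EMD-signal" package (an assumption, not the authors' toolchain)

def extrapolate(component, m, degree=3, window=50):
    """Crude stand-in for the per-IMF prediction model of Section II-D:
    least-squares polynomial fit to the last `window` samples, pushed m steps ahead."""
    w = min(window, len(component))
    t = np.arange(w)
    coef = np.polyfit(t, component[-w:], min(degree, w - 1))
    return np.polyval(coef, np.arange(w, w + m))

def fill_gap(x_left, x_right, m):
    """Steps 1)-5) of the outline (illustrative sketch)."""
    imfs_left = EMD().emd(np.asarray(x_left, float))     # step 1: IMFs of the left series
    imfs_right = EMD().emd(np.asarray(x_right, float))   #         IMFs of the right series
    w = np.linspace(0.0, 1.0, m)                         # linear weights, cf. Eq. (4)
    y = np.zeros(m)
    # pairing components by index is a simplification: the two sides may
    # produce different numbers of IMFs
    for cl, cr in zip(imfs_left, imfs_right):
        forward = extrapolate(cl, m)                     # step 2: predict left IMF forward
        backward = extrapolate(cr[::-1], m)[::-1]        # step 3: predict right IMF backward
        # step 4: weighted average; our reading of Eq. (4): the forward prediction
        # dominates at the left edge of the gap, the backward one at the right edge
        y += forward * (1.0 - w) + backward * w
    return y                                             # step 5: sum over components
```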
C. EMD for decomposition

The steps comprising the EMD method [15] are as follows (a from-scratch sketch is given after Eq. (2) below):

1) Identify the local maxima and minima of the signal $x(t)$, where $t$ is time.
2) Perform cubic spline interpolation between the maxima and between the minima to obtain the envelopes $E_{max}(t)$ and $E_{min}(t)$.
3) Calculate the mean of the envelopes as $M(t) = (E_{max}(t) + E_{min}(t))/2$.
4) Calculate the difference between the signal and the mean of its envelopes as $C_1(t) = x(t) - M(t)$.
5) IF the number of local extrema of $C_1(t)$ is equal to the number of zero crossings, or differs from it by at most one, and the average of $C_1(t)$ is close to zero, THEN $IMF_1 = C_1(t)$; ELSE repeat steps 1-4 on $C_1(t)$ instead of $x(t)$, until the new $C_1(t)$ satisfies the conditions of an IMF in step 5.
6) Calculate the residue $R_1(t) = x(t) - C_1(t)$.
7) If the residue $R_1(t)$ is above a threshold value, repeat steps 1-6 on $R_1(t)$ to obtain the next IMF and a new residue.
As a result, $n$ orthogonal IMFs are obtained, from which the original signal may be reconstructed as follows:

$$x(t) = \sum_{i} IMF_i(t) + R(t). \qquad (2)$$
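For illustration, a from-scratch sifting sketch of steps 1)-7) is given below; the stopping thresholds and iteration limits are our own assumptions (the paper prescribes a threshold on the residue), not values fixed by the method.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift(x, max_sift=50):
    """Extract one IMF from x by repeated sifting (steps 1-5)."""
    c = x.copy()
    t = np.arange(len(x))
    for _ in range(max_sift):
        imax = argrelextrema(c, np.greater)[0]          # step 1: local maxima
        imin = argrelextrema(c, np.less)[0]             #         local minima
        if len(imax) < 4 or len(imin) < 4:
            break                                        # too few extrema for stable envelopes
        upper = CubicSpline(imax, c[imax])(t)            # step 2: upper envelope
        lower = CubicSpline(imin, c[imin])(t)            #         lower envelope
        mean = 0.5 * (upper + lower)                     # step 3: envelope mean
        c = c - mean                                     # step 4: candidate IMF
        if np.abs(mean).mean() < 1e-3 * np.abs(c).mean():
            break                                        # step 5: mean close to zero -> accept
    return c

def emd_decompose(x, max_imf=8):
    """Steps 6-7: peel off IMFs until the residue is (nearly) monotonic.
    (The paper stops on a residue threshold instead; this is an assumption.)"""
    imfs, r = [], x.copy()
    for _ in range(max_imf):
        c = sift(r)
        imfs.append(c)
        r = r - c                                        # step 6: residue
        if len(argrelextrema(r, np.greater)[0]) < 2:     # step 7: stop on a trend-like residue
            break
    return np.array(imfs), r

# usage: reconstruction identity of Eq. (2) holds by construction
t = np.arange(1000, dtype=float)
x = np.sin(2 * np.pi * t / 25) + 0.5 * np.sin(2 * np.pi * t / 200)
imfs, residue = emd_decompose(x)
assert np.allclose(imfs.sum(axis=0) + residue, x)
```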
D. Prediction using a polynomial model

Prediction is performed using a polynomial model that finds the coefficients of a polynomial $p(x)$ of degree $n$ that best fits the data in the least-squares sense, using the $M$ previous values of $x$. The linear case of the model is shown here:

$$p(x_i) = a_0 x_{i-1} + a_1 x_{i-2} + \ldots + a_{M-1} x_{i-M} + a_M. \qquad (3)$$
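A minimal least-squares sketch of the linear case in Eq. (3) is shown below; the window length M and the helper names are illustrative assumptions.

```python
import numpy as np

def fit_linear_model(x, M=10):
    """Least-squares fit of the linear case, Eq. (3):
    x_i ~ a_0*x_{i-1} + a_1*x_{i-2} + ... + a_{M-1}*x_{i-M} + a_M."""
    x = np.asarray(x, float)
    A = np.array([np.r_[x[i - M:i][::-1], 1.0] for i in range(M, len(x))])
    b = x[M:]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

def predict_forward(x, coef, m):
    """Iterate the fitted model m steps past the end of x."""
    M = len(coef) - 1
    history = list(np.asarray(x, float)[-M:])            # most recent M samples, oldest first
    out = []
    for _ in range(m):
        nxt = float(np.dot(coef[:-1], history[::-1]) + coef[-1])
        out.append(nxt)
        history = history[1:] + [nxt]                    # slide the window forward
    return np.asarray(out)

# usage: forward prediction of a slowly varying component by m = 50 steps
t = np.arange(300, dtype=float)
imf = np.sin(2 * np.pi * t / 120)
coef = fit_linear_model(imf, M=10)
forecast = predict_forward(imf, coef, m=50)
```

Predicting the right series "into the past" can reuse the same functions on the time-reversed series, with the resulting forecast reversed back into forward time order.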
E. Weighted averaging

The predictions are averaged using the weighted average as follows:

$$X = X' w(t) + X'' (1 - w(t)), \qquad (4)$$

where $w(t)$ is a linear, monotonically increasing function that rises from 0 to 1 as $t$ runs from $0$ to $m-1$.
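A possible implementation of Eq. (4) is sketched below; the direction of the weighting (the forward, left-based prediction dominating at the left edge of the gap) is our reading of the equation, not something the paper states explicitly.

```python
import numpy as np

def blend(forward, backward):
    """Linearly weighted average of the two predictions over the gap, cf. Eq. (4)."""
    m = len(forward)
    w = np.linspace(0.0, 1.0, m)              # w(t) rises linearly from 0 to 1 over the gap
    # our reading: forward prediction dominates near the left edge,
    # backward prediction near the right edge
    return forward * (1.0 - w) + backward * w

# usage
fwd = np.ones(50)            # e.g. forward-predicted IMF values inside the gap
bwd = np.zeros(50)           # backward-predicted values
filled = blend(fwd, bwd)     # ramps from 1 towards 0 across the gap
```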
III. PERFORMANCE EVALUATION USING STATISTICAL MEASURES

Evaluation of the gap-filling results is a crucial step in demonstrating the reliability and accuracy of the proposed method. In validation, the performance indices are computed between the reconstructed and the known original values. The quality of the reconstructed data can also be evaluated independently of the original data using a smoothness criterion.

We evaluate the accuracy of the results using the following metrics.

The Root Mean Square Error (RMSE) of a model prediction with respect to the estimated variable $y$ is defined as the square root of the mean squared error:

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}(x_i - y_i)^2}{n}}, \qquad (5)$$

where $x_1, x_2, \ldots, x_n$ are the $n$ observed values and $y_1, y_2, \ldots, y_n$ are the corresponding predicted values.

The Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of forecasts, without considering their direction. It is used to measure how close forecasts or predictions are to the eventual outcomes:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|x_i - y_i|. \qquad (6)$$

Mean Square Error (MSE): if $y_i$ is a vector of $n$ predictions and $x_i$ is the vector of observed values corresponding to the inputs of the function which generated the predictions, then the MSE of the predictor can be estimated by:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2. \qquad (7)$$

The squared Pearson correlation coefficient $R^2$ measures the strength of the linear relationship between two variables and is defined as:

$$R^2 = \frac{\left(\sum_i (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}. \qquad (8)$$
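The four metrics of Eqs. (5)-(8) can be computed directly with NumPy; the sketch below is illustrative and is not tied to any particular implementation used by the authors.

```python
import numpy as np

def evaluation_metrics(x, y):
    """RMSE, MAE, MSE and squared Pearson correlation, Eqs. (5)-(8).

    x : observed values in the gap, y : reconstructed values."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    err = x - y
    mse = np.mean(err ** 2)
    return {
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MSE": mse,
        "R2": np.corrcoef(x, y)[0, 1] ** 2,   # equals Eq. (8)
    }
```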
IV. EXPERIMENT

A. Data and motivating example

We use the following synthetic time series (also analyzed in [18]):

$$x(t) = \sin\!\left(\frac{2\pi}{300}t\right)\cos\!\left(\frac{2\pi}{40}t\right) + \frac{\pi}{2}\sin\!\left(\frac{2\pi}{120}t\right). \qquad (9)$$
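A sketch that generates the series of Eq. (9) and opens a 50-sample gap in its middle (as in Fig. 1) is given below; the $\pi/2$ amplitude of the last term follows the printed formula and should be treated as our reading of it.

```python
import numpy as np

def synthetic_series(T=10000):
    """Synthetic test signal of Eq. (9), as read from the printed formula."""
    t = np.arange(T, dtype=float)
    return (np.sin(2 * np.pi * t / 300) * np.cos(2 * np.pi * t / 40)
            + (np.pi / 2) * np.sin(2 * np.pi * t / 120))

x = synthetic_series()
x_gapped = x.copy()
mid = len(x_gapped) // 2
x_gapped[mid - 25: mid + 25] = np.nan     # a 50-sample gap in the middle of the series
```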
The known data to the left and to the right of the gap were decomposed using EMD and the IMFs were derived. The polynomial-model-based prediction method described in Section II-D was used to derive the forward prediction from the IMF data to the left of the gap and the backward prediction from the IMF data to the right of the gap.

Fig. 1 shows a fragment of the original time series with missing data (the gap is in the middle of the time series, gap length = 50). Fig. 2 shows the IMFs derived for the known continuous fragments of data before and after the gap. Fig. 3 shows the adjusted IMFs with the IMF values in the gap reconstructed by combining the forward and backward predictions with linearly weighted averaging. Finally, Fig. 4 shows the true data and the reconstructed data (only the gap of the time series is shown). For comparison, the results of reconstruction without signal decomposition are also shown. Even by visual inspection of the figures, we can see that the proposed method performs better.

Fig. 1. A fragment of the example synthetic time series.
Fig. 2. IMFs from the known data of the synthetic time series.
Fig. 3. Gap filling on the IMF level.
Fig. 4. Reconstructed missing data in the gap.

B. Methodology

To obtain a statistically valid evaluation of the efficiency of the method, we first generated a synthetic series of large length, $N = 10000$. Then we extracted $M = 20$ different subseries of selected lengths 500, 1000, and 2000 at random locations of the original large time series. The gap in the data is always located in the middle of the series and has length $K$ ($K = 10, 20, \ldots, 100$). Gap filling is performed using the proposed method and the accuracy is calculated. The results of the computational experiments are presented in Section V.
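The evaluation protocol of Section IV-B can be sketched as a simple loop; in the fragment below, a linear-interpolation placeholder stands in for the proposed EMD-based filler, and all names and parameter values are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_experiment(x, n_trials=20, sub_len=500, gap_len=50):
    """Random subseries, a centred gap, a fill, and its scores (Section IV-B protocol)."""
    scores = []
    for _ in range(n_trials):
        start = rng.integers(0, len(x) - sub_len)
        sub = x[start:start + sub_len].copy()
        lo = (sub_len - gap_len) // 2
        hi = lo + gap_len
        truth = sub[lo:hi].copy()
        sub[lo:hi] = np.nan                     # open the gap in the middle
        # placeholder gap filler (NOT the proposed EMD-based method):
        known_idx = np.r_[np.arange(lo), np.arange(hi, sub_len)]
        known_val = np.r_[sub[:lo], sub[hi:]]
        filled = np.interp(np.arange(lo, hi), known_idx, known_val)
        err = truth - filled
        scores.append({"RMSE": np.sqrt(np.mean(err ** 2)),
                       "MAE": np.mean(np.abs(err)),
                       "MSE": np.mean(err ** 2),
                       "R2": np.corrcoef(truth, filled)[0, 1] ** 2})
    return scores

# usage with a simple test signal standing in for Eq. (9)
x = np.sin(2 * np.pi * np.arange(10000) / 120)
results = run_experiment(x)
```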
V. RESULTS

To evaluate the performance of the gap filling, we repeated the gap-filling experiment 20 times with different subseries at random locations of the original time series. The numerical simulation was performed using MATLAB 8.1 (R2013a). The experimental results are presented in Figs. 5-8 as the values of RMSE, MAE, MSE, and the Pearson correlation versus the gap ratio (the length of the gap divided by the length of the subseries under consideration).

Fig. 5. Average RMSE with respect to the length of the gap.
Fig. 6. Average MAE with respect to the length of the gap.
Fig. 7. Average MSE with respect to the length of the gap.
Fig. 8. Average correlation with respect to the length of the gap.

We can see that the EMD decomposition-based method fails to improve the accuracy of missing data reconstruction for large gaps. Yet, when the gap is small (10 points or fewer), the method achieves some improvement compared with the direct prediction of the missing data from the known time series data. The results are summarized in Tables I-III.
TABLE I
RESULTS WITHOUT AND WITH DECOMPOSITION FOR SUBSERIES OF LENGTH 500

Gap length | Decomposition | RMSE   | MAE    | MSE    | R2
10         | yes           | 0.0316 | 0.0262 | 0.0014 | 0.9890
10         | no            | 0.0304 | 0.0257 | 0.0014 | 0.9887
20         | yes           | 0.0912 | 0.0764 | 0.0136 | 0.9499
20         | no            | 0.0964 | 0.0813 | 0.0148 | 0.9440
30         | yes           | 0.2393 | 0.1823 | 0.0815 | 0.9301
30         | no            | 0.2034 | 0.1543 | 0.0561 | 0.9418
40         | yes           | 0.4062 | 0.3085 | 0.1810 | 0.7010
40         | no            | 0.3632 | 0.2775 | 0.1514 | 0.7449

TABLE II
RESULTS WITHOUT AND WITH DECOMPOSITION FOR SUBSERIES OF LENGTH 1000

Gap length | Decomposition | RMSE   | MAE    | MSE    | R2
10         | yes           | 0.0449 | 0.0382 | 0.0026 | 0.9870
10         | no            | 0.0445 | 0.0380 | 0.0025 | 0.9855
20         | yes           | 0.0870 | 0.0714 | 0.0120 | 0.9808
20         | no            | 0.0732 | 0.0609 | 0.0093 | 0.9829
30         | yes           | 0.2760 | 0.2166 | 0.1004 | 0.7968
30         | no            | 0.2684 | 0.2123 | 0.0974 | 0.7913
40         | yes           | 0.2893 | 0.2220 | 0.0979 | 0.8109
40         | no            | 0.2902 | 0.2247 | 0.0985 | 0.7828

TABLE III
RESULTS WITHOUT AND WITH DECOMPOSITION FOR SUBSERIES OF LENGTH 2000

Gap length | Decomposition | RMSE   | MAE    | MSE    | R2
10         | yes           | 0.0415 | 0.0354 | 0.0025 | 0.9719
10         | no            | 0.0391 | 0.0333 | 0.0023 | 0.9800
20         | yes           | 0.1095 | 0.0920 | 0.0242 | 0.8538
20         | no            | 0.1157 | 0.0973 | 0.0250 | 0.8499
30         | yes           | 0.1813 | 0.1409 | 0.0482 | 0.8952
30         | no            | 0.1809 | 0.1402 | 0.0466 | 0.9023
40         | yes           | 0.2973 | 0.2278 | 0.1143 | 0.7743
40         | no            | 0.2615 | 0.2004 | 0.0922 | 0.8273

VI. CONCLUSION

The paper has presented a novel method for the reconstruction of missing data in time series. The proposed method is model-free, fully data-driven, and does not impose any technical assumptions. The results of the experiments with the synthetic data show that better results have not been achieved for the gap-filling task using the EMD decomposition when the gap length is large (more than 10 missing data points). The reasons for that may be the inherent deficiencies of the EMD method, also noted by other researchers, such as the boundary problem and the mode-mixing problem, which can significantly affect the accuracy of the analysis results. Correct prediction of the locations of the minima and maxima in the data gap could allow a better estimate of the envelope of the data. The main reasoning for using EMD is that the IMFs would be less complex than the original data. However, the higher-frequency IMFs do not always have a simple waveform that would allow more accurate prediction of their values; thus, no gain is achieved by predicting a simpler signal. Additionally, the method requires more computation, which also negatively affects the accuracy.

Future work will focus on performing more extensive experiments with synthetic time series, on the application of the proposed method to real-world time series, and on the examination of other signal decomposition and prediction methods in order to overcome the shortcomings of EMD and to improve the proposed method.

REFERENCES

[1] A. W. Liew, N. F. Law, and H. Yan, Missing value imputation for gene expression data: computational techniques to recover missing data from available information, Briefings in Bioinformatics, vol. 12, pp. 498-513, 2011.
[2] D. B. Rubin, Multiple Imputation After 18+ Years, Journal of the American Statistical Association, vol. 91, no. 434, pp. 473-489, 1996.
[3] S. Kandasamy, F. Baret, A. Verger, P. Neveux, and M. Weiss, A comparison of methods for smoothing and gap filling time series of remote sensing observations: application to MODIS LAI products, Biogeosciences, vol. 10, pp. 4055-4071, 2013.
[4] P. D. Allison, Missing Data, Thousand Oaks, CA: Sage Publications, Inc., 2001.
[5] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics, vol. 17, pp. 520-525, 2001.
[6] T. Hastie, R. Tibshirani, G. Sherlock, M. Eisen, P. Brown, and D. Botstein, Imputing Missing Data for Gene Expression Arrays, Technical report, Stanford University Statistics Department, 1999.
[7] H. Peng and S. Zhu, Handling of incomplete data sets using ICA and SOM in data mining, Neural Computing & Applications, vol. 16, no. 2, pp. 167-172, 2007.
[8] D. Yoon, E. K. Lee, and T. Park, Robust imputation method for missing values in microarray data, BMC Bioinformatics, vol. 8, Suppl. 2, S6, 2007.
[9] G. Ssali and T. Marwala, Computational Intelligence and Decision Trees for Missing Data Estimation, in International Joint Conference on Neural Networks (IJCNN 2008), Hong Kong, 1-8 June 2008.
[10] S. M. Dhlamini, F. V. Nelwamondo, and T. Marwala, Condition Monitoring of HV Bushings in the Presence of Missing Data Using Evolutionary Computing, WSEAS Transactions on Power Systems, vol. 1, no. 2, pp. 296-302, 2006.
[11] G. Capizzi, C. Napoli, and L. Paterno, An Innovative Hybrid Neuro-wavelet Method for Reconstruction of Missing Data in Astronomical Photometric Surveys, in Proc. of the International Conference on Artificial Intelligence and Soft Computing, ICAISC (1), pp. 21-29, 2012.
[12] C. Napoli and E. Tramontana, Massively parallel WRNN reconstructors for spectrum recovery in astronomical photometrical surveys, Neural Networks, vol. 83, pp. 42-50, 2016.
[13] S. Kandasamy, P. Neveux, A. Verger, S. Buis, M. Weiss, and F. Baret, Improving the consistency and continuity of MODIS 8 day leaf area index products, International Journal of Electronics and Telecommunications, vol. 58, pp. 141-146, 2012.
[14] N. Golyandina and E. Osipov, The Caterpillar-SSA method for analysis of time series with missing values, Journal of Statistical Planning and Inference, vol. 137, pp. 2642-2653, 2007.
[15] N. E. Huang, Z. Shen, and S. R. Long, A new view of nonlinear water waves: the Hilbert spectrum, Annual Review of Fluid Mechanics, vol. 31, pp. 417-457, 1999.
[16] R. Damasevicius, M. Vasiljevas, I. Martisius, V. Jusas, D. Birvinskas, and M. Wozniak, BoostEMD: An Extension of EMD Method and Its Application for Denoising of EMG Signals, Electronics and Electrical Engineering, vol. 21, no. 6, pp. 57-61, 2015.
[17] A. Moghtaderi, P. Borgnat, and P. Flandrin, Gap-filling by the empirical mode decomposition, in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3821-3824, 2012.
[18] D. Kondrashov and M. Ghil, Spatio-temporal filling of missing points in geophysical data sets, Nonlinear Processes in Geophysics, vol. 13, pp. 151-159, 2006.