=Paper=
{{Paper
|id=Vol-2277/paper39
|storemode=property
|title=Sensor Data Preprocessing, Feature Engineering and Equipment Remaining Lifetime Forecasting for Predictive Maintenance
|pdfUrl=https://ceur-ws.org/Vol-2277/paper39.pdf
|volume=Vol-2277
|authors=Evgeniy Latyshev
|dblpUrl=https://dblp.org/rec/conf/rcdl/Latyshev18
}}
==
Sensor Data Preprocessing, Feature Engineering and Equipment Remaining Lifetime Forecasting for Predictive Maintenance
==
© Evgeniy Latyshev
Lomonosov Moscow State University, Moscow, Russia
e.latishev@gmail.com

Abstract. Analytics based on sensor data is gradually becoming an industry standard in equipment maintenance. However, it involves several challenges, such as sensor data preprocessing, feature engineering, and forecasting model development. As this is work in progress, the paper focuses mainly on sensor data preprocessing, which plays a crucial role in predictive maintenance because real-world sensing equipment usually provides data with missing values and a considerable amount of noise. Poor data quality can render all subsequent steps of data analysis practically useless, and many missing data imputation, outlier filtering, and noise reduction algorithms have therefore been introduced in the literature. Streaming sensor data can be represented as univariate time series. This paper provides an overview of common univariate time series preprocessing steps and the most appropriate methods, with consideration of the field of application. Sensor data from different sources comes in different scales and should be normalized, so a comparison of univariate time series normalization techniques is given. Conventional quality metrics for each of the preprocessing steps are described, a basic sensor data quality assessment approach is suggested, and the architecture of a sensor data preprocessing module is proposed. An overview of time series-specific feature engineering techniques and a brief enumeration of the considered forecasting approaches conclude the paper.

Keywords: predictive maintenance, preprocessing, univariate time series, data cleaning, missing data imputation, noise reduction, outlier filtering, data quality assessment, feature engineering, time series forecasting

1 Introduction

Maintenance costs are a major part of the total operating costs of any business involving complex equipment. Surveys of maintenance management effectiveness indicate that one-third of all maintenance costs is wasted as the result of unnecessary or improperly carried out maintenance [18]. With the spread of the Internet of Things concept, sensor data can be collected from a huge number of devices and pieces of equipment. This data can be used for real-time health monitoring and effective maintenance. However, this approach to maintenance, also known as predictive maintenance, involves several challenges.

First of all, the collected data is often of poor quality, which can lead to unreliable analysis and ineffective maintenance. Consequently, data from sensing equipment needs to be preprocessed before it can be used for any analysis. Poor data quality means non-compliance with requirements on at least one of the data quality assessment metrics. The root causes vary: connection issues, sensor malfunction, transmitting hardware failure, data processing server downtime, software crashes, measuring equipment inaccuracy, and many more. Common cases of poor data quality involve an unacceptable amount of missing values, outliers, sudden spikes, etc. Simply ignoring these issues can be critical for several reasons: some analysis tools, including popular machine learning algorithms, cannot handle missing values; the absence of outlier filtering can dramatically skew the results; and the standard error of the measuring equipment can be mistaken for an actual pattern in the data. As a result, time series preprocessing involves several independent steps: missing data imputation, noise reduction, and data normalization. After these steps, the data quality can be evaluated and the data passed further for analysis. Preprocessing should clearly be done in near real time to minimize the delay between data measurement and decision making. Thus, there is a need for a fast and scalable independent module that can preprocess constantly incoming sensor data. This paper proposes the design of such a module, keeping in mind the subsequent integration with the existing architecture of a predictive maintenance system introduced in [14].

Secondly, it can be difficult to distinguish the patterns and relationships in the initial data. The process of extracting and generating new characteristics and features out of the available data, commonly referred to as feature engineering, has two main objectives. The first is to represent the data in a form that makes it easier to establish simple yet strong connections between the input and output variables of the forecasting model, increasing the quality of the forecasts. The second is to pick the most useful features out of all the available ones, reducing the amount of computation in the forecasting model.

Finally, a proper forecasting model has to be chosen and implemented. There are various approaches to time series forecasting, from straightforward techniques like the naive method to far more sophisticated ones like long short-term memory recurrent neural networks. The main complication here is the trade-off between forecast quality and the ease of model implementation and deployment.

The remaining part of the paper is organized as follows. The preprocessing module architecture is described in Section 2. Section 3 reviews missing data imputation methods. Section 4 is devoted to time series noise reduction. Section 5 briefly overviews data normalization techniques. Section 6 collects some thoughts on data quality assessment. Section 7 is devoted to time series feature engineering. A brief overview of time series forecasting approaches is given in Section 8. Finally, the future directions of the presented work are given in Section 9.

2 Preprocessing Module Architecture

The preprocessing module is a part of a system for predictive maintenance, deployed to a Hadoop [29] cluster in a cloud manner. The module is wrapped in a Docker [17] container and runs on a standalone node of the cluster. One of the key requirements for the module is seamless integration into the existing architecture: the data is retrieved from an Apache Kafka message queue [2], transformed by the preprocessing module, and passed in parallel to OpenTSDB [25] and Apache Hive [12] for storage. To satisfy the requirements on speed and scalability, the transformations are conducted on the Apache Spark Streaming engine [27].

There are many stream data processing frameworks, including but not limited to Apache Storm [5], Apache Flink [1], Apache Samza [4], and Kafka Streams [3]. Although Spark Streaming has latency issues, and sliding window processing may be tricky due to Spark's inherent batch-based streaming model, it has several advantages which make it a safer choice. First of all, Spark Streaming is a mature framework with thorough documentation and a huge community. As a result of its long-term popularity, there are plenty of open-source tools for Spark Streaming, including solutions for relatively painless integration with Kafka and the database management systems mentioned earlier [28, 11, 21]. Another advantage is the existence of pySpark [23], an API for Python, one of the most popular programming languages at the moment. All the other frameworks listed require Scala, Clojure, or Java knowledge, which makes them less accessible. One of the biggest downsides of Spark Streaming is performance degradation on sudden bursts of input data. However, in the case of sensor data processing the intensity of the input data flow remains nearly constant over time, which mitigates this downside.

The data flow and module components are shown in Figure 1. The whole module consists of four transformation steps and the data quality assessment step.

Figure 1 Components and data flow within the preprocessing module
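For illustration, the following is a minimal sketch of the module's data path, assuming Spark 2.x with the spark-streaming-kafka-0-8 integration package (matching the guide cited in [28]); the topic name, broker address, and the "sensor_id,timestamp,value" record format are hypothetical placeholders, and the four transformation steps are stubbed out.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="sensor-preprocessing")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Direct stream from Kafka; messages arrive as (key, value) pairs.
raw = KafkaUtils.createDirectStream(
    ssc, ["sensor-readings"], {"metadata.broker.list": "broker:9092"})

def parse(message):
    """Split a hypothetical CSV payload into typed fields."""
    sensor_id, ts, value = message[1].split(",")
    return sensor_id, int(ts), float(value)

parsed = raw.map(parse)

# Each micro-batch would pass through the four transformation steps
# (imputation, noise reduction, normalization, quality assessment)
# before being written to OpenTSDB and Hive; stubbed out here.
parsed.foreachRDD(lambda rdd: rdd.foreach(print))

ssc.start()
ssc.awaitTermination()
```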
3 Missing Data Imputation

Sometimes, due to a sensor malfunction, an unstable internet connection, or other technical difficulties, the data for some points in time is missing. Simply ignoring those gaps may not be the best strategy, because it can lead to a loss of efficiency and unreliable results of the analysis. The alternative is to try to impute the missing values based on the available information.

3.1 Methods

A detailed overview of basic imputation methods and their implementations can be found in the imputeTS R package documentation [19].

Some simple methods are applicable not only to time series: median imputation, mode imputation, mean imputation, and random imputation. These methods are fast and very straightforward, but lack accuracy.

Simple time series-specific methods include LOCF (last observation carried forward), NOCB (next observation carried backward), interpolation (linear, polynomial, Stineman), and moving averages (simple, weighted, exponential). All of them are rather fast and can work in specific cases, but fall short when there is seasonality in the data or large missing sub-sequences. More sophisticated approaches, such as a structural model with Kalman smoothing or an ARIMA state space representation with Kalman smoothing [10], can be used for seasonal data with complex patterns.

However, sensor data has one unfortunate characteristic: the gaps of missing data can be too long for conventional methods to work properly. In this case, the method proposed in [22] can be the appropriate choice. The idea of Dynamic Time Warping Based Imputation is to find the sub-sequence most similar to the sub-sequence preceding the missing values, and then to fill the gap with the sub-sequence that follows the most similar one. The result is a very plausible gap imputation, at the cost of a huge computational burden.
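A minimal sketch of the simple methods listed above, using pandas; the series values and the regular one-minute sampling are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=8, freq="min")
y = pd.Series([1.0, 1.2, np.nan, np.nan, 1.9, np.nan, 2.4, 2.6], index=idx)

locf = y.ffill()                                  # last observation carried forward
nocb = y.bfill()                                  # next observation carried backward
linear = y.interpolate(method="linear")           # linear interpolation
poly = y.interpolate(method="polynomial", order=2)  # polynomial (requires SciPy)
mean_imp = y.fillna(y.mean())                     # plain mean imputation

print(pd.DataFrame({"raw": y, "locf": locf, "linear": linear}))
```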
3.2 Metrics

Missing data imputation involves two types of quality metrics, depending on the pattern of imputation.

For single value imputations, the metrics coincide with the ones commonly used in time series forecasting: RMSE (Root Mean Square Error) and MAPE (Mean Absolute Percentage Error),

$RMSE = \sqrt{\frac{\sum_i (\hat{y}_i - y_i)^2}{n}}$,

$MAPE = \frac{100\%}{n} \sum_i \left| \frac{y_i - \hat{y}_i}{y_i} \right|$,

where $y_i$ is the real value, $\hat{y}_i$ is the forecasted value, and $n$ is the number of forecasts.

However, different metrics are used for long gap imputation. The most popular of them are similarity and Dynamic Time Warping distance,

$Similarity = \frac{1}{n} \sum_i \frac{1}{1 + |y_i - \hat{y}_i| \,/\, (\max(\hat{y}) - \min(\hat{y}))}$.

The DTW calculation algorithm can be found in [24]. It is worth mentioning that modern implementations often include adjustments to speed up the calculations (for example, DDTW [13]).
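A sketch of the three imputation quality metrics defined above, implemented with NumPy; the two example arrays are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def similarity(y_true, y_pred):
    spread = y_pred.max() - y_pred.min()  # max(ŷ) − min(ŷ)
    return np.mean(1.0 / (1.0 + np.abs(y_true - y_pred) / spread))

y_true = np.array([1.0, 1.4, 1.9, 2.4])
y_pred = np.array([1.1, 1.3, 2.0, 2.3])
print(rmse(y_true, y_pred), mape(y_true, y_pred), similarity(y_true, y_pred))
```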
4 Noise Reduction

Similar to missing data points, sensor data is usually contaminated with noise, which can be mistaken for an actual data pattern and yet again leads to a loss of efficiency and unreliable results of the analysis. The task of noise reduction is to remove the maximum amount of noise from the initial data while leaving the maximum amount of useful signal.

4.1 Methods

According to Chen et al. [6], noise reduction methods can be divided into two categories: frequency domain approaches and time domain approaches. Frequency domain approaches are based on decomposing the signal into frequency components; the most common ones involve the discrete/fast/short-time Fourier transform or the wavelet transform. Most time domain approaches smooth the signal at each given data point based on the values of its neighbors.

A comparison of the basic noise reduction methods can be found in the work of Köhler et al. [15]. Their experiment compares a moving average filter, an exponential smoothing filter, linear Fourier smoothing, nonlinear wavelet shrinkage, and simple nonlinear noise reduction under different conditions.

The downside of the approaches listed above is that they modify almost all the data values, most of which are initially correct. Song et al. [26] proposed the first constraint-based approach for cleaning stream data. The idea is to sanity check the changes of values over time against subject area constraints. This method allows detecting and repairing large spike errors in data, and its biggest advantage is the support of online cleaning over streaming data. However, it can be used only for large outlier detection; in some cases even small errors matter, and repairing only spike errors is insufficient. Zhang et al. [30] proposed a novel statistical cleaning method, introducing repair likelihoods with respect to speed changes. Several effective and computationally efficient heuristics are also introduced in that work.

4.2 Metrics

Most papers use RMSE, defined earlier, as the denoising quality metric. However, there are several less popular ones, including the Symmetrical Visual Error Measure proposed in [16].
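A minimal sketch of the two simplest time domain filters from the comparison above, using pandas; the noisy sine signal, window width, and smoothing factor are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

t = np.linspace(0, 4 * np.pi, 200)
noisy = pd.Series(np.sin(t) + np.random.normal(scale=0.3, size=t.size))

# Moving average filter: each point is replaced by the mean of its window.
moving_avg = noisy.rolling(window=7, center=True, min_periods=1).mean()

# Exponential smoothing filter: exponentially decaying weights over the past.
exp_smooth = noisy.ewm(alpha=0.2).mean()

print(pd.DataFrame({"noisy": noisy, "ma": moving_avg, "ewm": exp_smooth}).head())
```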
5 Data Normalization

Making sure that the data is of uniform scale is key for many methods, including k-NN, linear models, artificial neural networks, and many more. Even univariate time series data should be normalized, because it might later be used in combination with data of a different scale from other sources.

The most well-known and widely used techniques are min-max normalization and z-score normalization,

$\hat{y}_{minmax} = \frac{y - \min(Y)}{\max(Y) - \min(Y)}$,

$\hat{y}_{zscore} = \frac{y - \mathrm{mean}(Y)}{\mathrm{std}(Y)}$,

where $\hat{y}$ is the value after normalization, $y$ is the value prior to normalization, and $Y$ is the set of values being normalized. Min-max normalization implies that the minimum and the maximum values of the dataset are known beforehand, which is often not the case. Z-score normalization is more robust, but performs poorly on non-stationary time series.

Some less popular methods are decimal scaling normalization, which shares all the drawbacks of min-max normalization; sigmoid normalization, which is actively used in neural networks; and the tanh estimator, which can roughly be described as a hyperbolic tangent of the z-score normalization:

$\hat{y}_{decimal} = \frac{y}{10^d}$, where $d$ is the order of magnitude of the values in the set,

$\hat{y}_{sigmoid} = \frac{1}{1 + e^{-0.01\,(y - \mathrm{mean}(Y))}}$,

$\hat{y}_{tanh} = 0.5 \left( \tanh\left( \frac{0.01\,(y - \mathrm{mean}(Y))}{\mathrm{std}(Y)} \right) + 1 \right)$.

According to the experiment conducted in [20], there is no single optimal time series normalization method, and one should choose the appropriate method based on the data patterns. For sensor data, the mean and standard deviation remain approximately the same over time, which makes z-score normalization a reasonable choice.
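A sketch of the normalization formulas above in NumPy, assuming the reconstructed sign conventions for the sigmoid and tanh estimators; `values` is a made-up batch of sensor readings.

```python
import numpy as np

values = np.array([12.1, 14.8, 13.5, 15.2, 12.9])

minmax = (values - values.min()) / (values.max() - values.min())
zscore = (values - values.mean()) / values.std()

d = np.ceil(np.log10(np.abs(values).max()))  # order of magnitude of the set
decimal = values / 10 ** d
sigmoid = 1.0 / (1.0 + np.exp(-0.01 * (values - values.mean())))
tanh_est = 0.5 * (np.tanh(0.01 * (values - values.mean()) / values.std()) + 1)

print(minmax, zscore, decimal, sigmoid, tanh_est, sep="\n")
```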
6 Data Quality Assessment

Data Quality Assessment (DQA) is the scientific and statistical evaluation of data to determine whether data obtained from environmental data operations are of the right type, quality, and quantity to support their intended use [9]. A comprehensive work on time series data quality assessment is done in [8], which shows that there are dozens of different metrics that can be used to measure the quality of data. Obviously, using all of them is excessive and computationally inefficient, so only a few are to be chosen. However, there is no common view on which metrics are better. A simple yet effective strategy might be to look at the most popular ones:
• event data loss (gaps in the data);
• values out of range (values outside a sane interval for the domain);
• value spikes (improbable sudden changes);
• wrong timestamps;
• rounded measurement values (an undesirable level of detail);
• signal noise (slightly inaccurate measurements).

The assessment is to be done both before and after preprocessing, to evaluate the effectiveness of the preprocessing module, as illustrated in the sketch below. It is also worth keeping in mind that initially clean data differs from data that was made "clean" during preprocessing, due to the approximations and inevitable errors of the methods involved at each step.
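A sketch of three of the quality checks listed above (gaps, out-of-range values, value spikes); the thresholds and the expected 60-second sampling period are hypothetical, domain-dependent constants.

```python
import pandas as pd

def quality_report(ts, lo=-50.0, hi=150.0, max_step=10.0, period_s=60):
    """ts: pandas Series of sensor values indexed by timestamp."""
    gaps = (ts.index.to_series().diff().dt.total_seconds() > period_s).sum()
    out_of_range = ((ts < lo) | (ts > hi)).sum()
    spikes = (ts.diff().abs() > max_step).sum()
    return {"event data loss": int(gaps),
            "values out of range": int(out_of_range),
            "value spikes": int(spikes)}

idx = pd.to_datetime(["2018-01-01 00:00", "2018-01-01 00:01",
                      "2018-01-01 00:05", "2018-01-01 00:06"])
print(quality_report(pd.Series([20.0, 21.0, 200.0, 21.5], index=idx)))
```

Running the report on the raw stream and again on the preprocessed stream gives a simple before/after view of the module's effectiveness.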
7 Feature Engineering

Feature engineering is probably the most peculiar step of data processing, as it depends on the initial data type, its origin, quantity, quality, the desired output of the forecasting model, and even the nature of the model itself. As already mentioned, sensor data can be represented as univariate time series. The conventional approaches to time series feature engineering can be divided into three categories: timestamp features, statistical features, and spectral features. The feature extraction step is usually followed by a dimensionality reduction step. It is worth mentioning that there are automatic time series feature engineering tools, such as tsfresh [7], which achieve decent results with almost no effort required.

7.1 Timestamp Features

The idea of this approach is to extract features from the timestamp of each observation. The most commonly used features are:
• minutes elapsed in the day;
• hour of the day;
• day of the month;
• weekend or not;
• season of the year;
• public holiday or not.
For sensor data in particular, some examples of useful timestamp features are:
• time since the last maintenance;
• age of the equipment;
• time since the last failure;
• operating time of the equipment.
Using just these features alone for predictions will likely result in a poor model. However, in combination with other features, they can boost the quality of the forecasts.

7.2 Statistical Features

This approach involves sliding a window of a given width through the time series and calculating statistics at each iteration. The most common statistical features are the mean of the previous few values, the median, the mode, the minimum value, the maximum value, the standard deviation, and many more. In addition to the calculated statistics, the lagged values of the time series themselves can be used as features.

The biggest challenge of this approach is that the window can be of any width, and there is no general algorithm to choose it. Usually, researchers simply try out several widths and choose the one that performs best. However, if there is a seasonal pattern in the data, it is worth making the window no narrower than the period of the seasons.

7.3 Spectral Features

Different variations of the Fourier transform and the wavelet transform are used to extract spectral features from a nonstationary signal. The basic idea behind these methods is to decompose a given time series into a sum of several basic functions, providing a different representation of the initial signal. Their biggest drawback is that they are relatively computationally expensive.

7.4 Dimensionality Reduction

Feature extraction provides many features, some of which can be useless or strongly correlated with each other. Excessive features not only add unnecessary computation but can also decrease the quality of the model. Thus, several dimensionality reduction methods were introduced to minimize the number of features while keeping the maximum amount of information. The most common ones are principal component analysis, independent component analysis, and partial least squares regression. A combined sketch of window-based feature extraction and dimensionality reduction is given below.
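A sketch of the statistical features from Section 7.2 followed by PCA-based dimensionality reduction from Section 7.4, using pandas and scikit-learn; the synthetic signal, the window width of 12, and the 95% variance threshold are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

signal = pd.Series(np.sin(np.linspace(0, 20, 300)) +
                   np.random.normal(scale=0.1, size=300))

window = signal.rolling(12)
features = pd.DataFrame({
    "mean": window.mean(),
    "std": window.std(),
    "min": window.min(),
    "max": window.max(),
    "median": window.median(),
    "lag_1": signal.shift(1),    # lagged values used directly as features
    "lag_12": signal.shift(12),
}).dropna()

# Keep the principal components explaining 95% of the variance.
reduced = PCA(n_components=0.95).fit_transform(features)
print(features.shape, "->", reduced.shape)
```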
8 Remaining Lifetime Forecasting

There is a variety of methods that can be used for time series forecasting. Each of them has advantages and drawbacks and can be viable in certain circumstances.

First of all, there are basic methods such as the average method, the naive method, the seasonal naive method, and the drift method, which are very simple yet can be effective when the data pattern is easy.

Secondly, linear regression models can be used for forecasting. In the simplest case, the regression model allows for a linear relationship between the forecast variable and some predictor variables. The biggest downside is the inherent linearity, while real-world data is mostly non-linear.

The most common approach is to use stochastic models: ARMA, ARIMA, SARIMAX, etc. One of the biggest drawbacks of these models is that they require fine-tuning of several hyperparameters, which is computationally expensive and not intuitive.

One of the recently popular approaches is to use decision trees. Random forests and gradient boosting methods, which are widely used in machine learning competitions, can also be applied to time series forecasting.

The artificial neural network approach to time series forecasting has gained immense popularity in the last few years. Its modifications, recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), are especially effective for this task due to their "memory" component. Although neural networks tend to be the most accurate forecasting method when tuned properly and given enough data, they might be computationally too costly for the considered conditions.

9 Conclusion

In this study, an overview of sensor data preprocessing steps, methods, and common metrics is given. Some thoughts on sensor data quality assessment are offered. The architecture of a fast, scalable preprocessing module is proposed. A brief overview of time series feature engineering techniques and forecasting methods is given. The future goals of the ongoing work are to implement the designed preprocessing module on the Spark Streaming engine, integrate it into the existing predictive maintenance pipeline, implement the feature engineering step, and develop a remaining lifetime forecasting model.

Acknowledgments. This work is supervised by Dmitriy Kovalev, Institute of Informatics Problems, Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences. The research is financially supported by the Ministry of Education and Science of the Russian Federation (project's unique identifier RFMEFI60717X0176).

References

[1] Apache Flink. https://flink.apache.org/
[2] Apache Kafka. https://kafka.apache.org/
[3] Apache Kafka Streams Documentation. https://kafka.apache.org/documentation/streams/
[4] Apache Samza. http://samza.apache.org/
[5] Apache Storm. https://storm.apache.org/
[6] Chen, Mithal, Vangala, Brugere, Boriah, Kumar: A Study of Time Series Noise Reduction Techniques in the Context of Land Cover Change Detection. NASA Conference on Intelligent Data Understanding (2011)
[7] Christ, M., Kempa-Liehr, A., Feindt, M.: Distributed and Parallel Time Series Feature Extraction for Industrial Big Data Applications. ACML Workshop on Learning on Big Data (2016)
[8] Gitzel, R.: Data Quality in Time Series Data: An Experience Report. CBI Industrial Track (2016)
[9] Guidance for Data Quality Assessment: Practical Methods for Data Analysis. EPA (2000)
[10] Harvey, A.: Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press (1990)
[11] Hive on Spark: Getting Started. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
[12] Huai, Y., Chauhan, A., Gates, A., Hagleitner, G., Hanson, E.N., O'Malley, O., Pandey, J., Yuan, Y., Lee, R., Zhang, X.: Major Technical Advancements in Apache Hive. ACM SIGMOD International Conference on Management of Data (2014)
[13] Keogh, E., Pazzani, M.: Derivative Dynamic Time Warping. First SIAM International Conference on Data Mining (2001)
[14] Kovalev, D., Shanin, I., Stupnikov, S., Zakharov, V.: Data Mining Methods and Techniques for Fault Detection and Predictive Maintenance in Housing and Utility Infrastructure. Engineering Technologies and Computer Science (2018)
[15] Köhler, T., Lorenz, D.: A Comparison of Denoising Methods for One Dimensional Time Series. Zentrum für Technomathematik (2005)
[16] Marron, J.S., Tsybakov, A.B.: Visual Error Criteria for Qualitative Smoothing. Journal of the American Statistical Association (1995)
[17] Merkel, D.: Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J., vol. 2014 (2014)
[18] Mobley, K.: An Introduction to Predictive Maintenance, 2nd edition (2002)
[19] Moritz, S., Sardá, A., Bartz-Beielstein, T., Zaefferer, M., Stork, J.: Comparison of Different Methods for Univariate Time Series Imputation in R. CoRR abs/1510.03924 (2015)
[20] Nayak, S., Misra, B., Behera, H.: Impact of Data Normalization on Stock Index Forecasting. International Journal of Computer Information Systems and Industrial Management Applications (2014)
[21] OpenTSDB 2.3 Documentation | HTTP API. http://opentsdb.net/docs/build/html/api_http/put.html
[22] Phan, T., Poisson Caillault, E., Lefebvre, A., Bigand, A.: Dynamic Time Warping-Based Imputation for Univariate Time Series Data. Pattern Recognition Letters (2017)
[23] pySpark Package Documentation. http://spark.apache.org/docs/2.1.0/api/python/pyspark.html
[24] Sakoe, Chiba: Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing (1978)
[25] Sigoure, B.: OpenTSDB: The Distributed, Scalable Time Series Database. OSCON, vol. 11 (2010)
[26] Song, S., Zhang, A., Wang, J., Yu, P.: SCREEN: Stream Data Cleaning under Speed Constraints. ACM SIGMOD International Conference on Management of Data (2015)
[27] Spark Streaming Programming Guide. https://spark.apache.org/docs/latest/streaming-programming-guide.html
[28] Spark Streaming + Kafka Integration Guide. https://spark.apache.org/docs/2.2.0/streaming-kafka-integration.html
[29] White, T.: Hadoop: The Definitive Guide. O'Reilly Media, Fourth Edition (2012)
[30] Zhang, A., Song, S., Wang, J.: Sequential Data Cleaning: A Statistical Approach. ACM SIGMOD International Conference on Management of Data (2016)