Integrating Heterogeneous Contextual Data for Enhanced Time Series Analysis

Saifullah Burero
Supervised by Anton Dignös and Johann Gamper
Free University of Bozen-Bolzano, Italy

Abstract
In the rapidly evolving industrial landscape, sensors are integral to automation applications. Capturing and analyzing the vast amount of time series data they produce is crucial for optimizing processes. However, analyzing this sensor data in isolation presents challenges, particularly in time series analysis, because various external contextual factors influence the data and are not always apparent in it. Integrating these contextual factors with time series data is therefore essential for time series analysis. These contextual factors are often heterogeneous in the time dimension due to the diverse nature of the data, which makes integration challenging. Therefore, as part of this PhD research, which is currently at the beginning of its second year, we aim to introduce a systematic approach for integrating contextual factors with heterogeneous time dimensions. This integration transforms data with heterogeneous time dimensions into a format that can be effectively processed by machine learning and deep learning models for time series analysis. We use Water Distribution Systems (WDSs) as a representative use case and aim to demonstrate how this integration enhances the accuracy and reliability of time series analysis.

Keywords
time series analysis, heterogeneous time dimensions, time series forecasting, anomaly detection

Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 Joint Conference (March 25-28, 2025), Barcelona, Spain
saifullah.burero@student.unibz.it (S. Burero)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

In this work, we use Water Distribution Systems (WDSs) as a representative application use case, where sensors are employed for monitoring the consumption of water. These observations serve as crucial inputs for, e.g., detecting water losses or estimating the water demand. WDSs are equipped with networks of sensors and control units, such as flow and pressure sensors, to ensure efficient resource management [1]. The resulting consumption patterns can be analyzed and used to forecast the water consumption, helping to fulfill water demand, and to detect possible losses. However, analyzing these patterns solely on the basis of sensor measurements can be complex, as various contextual factors influence consumption patterns and/or measurements, such as maintenance, sensor calibration, weather, yearly seasons, tourist trends, temperature, etc. These factors pose challenges for machine learning and deep learning models for time series analysis, as they may not be apparent in the data.

Figure 1 highlights the importance of contextual information when analyzing time series data. It reports water consumption data from a water distribution system based in Trentino, Italy. The data is segmented and color coded to highlight different consumption patterns. Green segments illustrate "regular" consumption, i.e., low consumption during the night with peaks in the morning, afternoon, and evening. Red segments indicate the impact of weekends, a pattern that is also observed during bank holidays, reflecting altered consumption behavior during these periods as compared to regular days. Yellow segments highlight deviations likely caused by scheduled maintenance, which occurs monthly from the 7th to the 10th. Additionally, the lower part of the figure highlights in orange an extended period where the consumption behavior diverges from the normal behavior, potentially due to external factors like school vacations, tourist trends, and/or seasonal temperature changes. Although these external factors are not directly visible, their influences can be inferred, making it essential to consider such factors to accurately interpret patterns in time series data.

Figure 1: Importance of Context

In the domain of WDSs, many methods have been proposed that rely on historical consumption patterns for analysis. The work by Zanfei et al. [2] emphasizes the importance of integrating external factors such as temperature, humidity, radiation, and rainfall, since these factors strongly correlate with consumption patterns. While such studies focus on contextual data that are themselves time series, such as meteorological factors, which can be integrated with sensory information by aligning them to the same sampling frequency, our approach addresses a broader range of contextual factors. These factors are heterogeneous in the time dimension and cannot simply be treated as time series data. For instance, consider a period of drought that spans several months, where the water consumption pattern at the beginning of this period might still be regular, but water consumption tends to increase the longer the drought persists.

In WDSs and in many industrial domains, such contextual factors are characterized by heterogeneity in the time dimension. We have identified four distinct types: static contextual data that does not change with time, such as sensor location, sensor sensitivity, and other specifications; interval contextual data that holds over a period of time, such as drought, periods of high tourism, or vacations; event contextual data that occurs at possibly irregular time points, such as bank holidays, sensor replacements, calibration, or cleaning; and secondary time series contextual data that is recorded at a regular frequency, such as temperature, humidity, or pressure.

These external contextual factors play a crucial role in shaping how resources are utilized and can significantly impact the accuracy of forecasting and anomaly detection in systems like WDSs. An effective integration of this diverse mixture of data enhances the decision making process by providing a comprehensive understanding of processes. The main objective of this PhD is the systematic integration of contextual information with heterogeneous time dimensions (static, interval, event, and secondary time series) for the effective analysis of time series, i.e., missing value imputation, anomaly detection, and forecasting.

2. Related Work

According to a recent survey on anomaly detection in the IoT and IIoT domain by Rodríguez et al. [3], which reviewed 99 articles, only 8% consider context-aware information. Recent studies [4, 5, 6, 7] on contextual anomaly detection explore combining contextual and behavioral features to improve detection. These approaches categorize features into contextual and behavioral ones based on domain knowledge, with clustering used to establish context and separate models built for each group.

Araya et al. [8] developed a three-stage model incorporating temporal features and a sliding window technique for feature representation. Yasaei et al. [9] proposed an RNN-based model for clustering sensor behaviors, using a consensus algorithm for anomaly localization. Kosek [10] focused on detecting malicious voltage control actions in the power grid with a neural network approach.

Most of the existing works consider internal contexts while analyzing sensory data, such as day, time, yearly seasons, months, and years. These contexts are typically uniform and consistent and are easily handled, as they are derived from the time dimension of the time series data itself. On the other hand, integrating external context is more challenging and requires a transformation, because the information might be represented using a different time dimension and often lacks uniformity.

3. Objective and Research Questions

The main objective of this thesis is to enhance the performance of machine learning and deep learning models for time series analysis by integrating contextual information with heterogeneous time dimensions together with sensory readings. By acknowledging the crucial role of context in time series analysis, this research aims to develop novel methodologies that leverage contextual information to improve the accuracy and robustness of these models. Through experimentation and analysis, this study aims to establish a deeper understanding of how contextual information can be effectively integrated into existing frameworks, thereby advancing the state of the art in time series analysis within diverse domains. The main research questions for the PhD are as follows.

RQ1 How to combine relevant contextual information with heterogeneous time dimensions and time series data to build data driven models for time series analysis, such as anomaly detection and forecasting?

RQ2 How to detect patterns in time series data that deviate in specific contexts using data driven approaches?

RQ3 How to build a data driven model that effectively handles heterogeneity and adapts to time series dynamics?

4. Methodology

Figure 2 provides an overview of the proposed approach in combination with an anomaly detection task. Each stage is responsible for performing a different task. Initially, the time series data and the contextual data with heterogeneous time dimensions from various sources are provided as input to the representation stage. There, the data undergoes a first transformation step into a homogeneous format understandable to machine learning and deep learning models. After data representation, the data undergoes a feature extraction stage. In this phase, time series features are extracted from the homogeneous format, along with other relevant features depending on the time dimension of the context. The feature extraction process yields many features, necessitating a subsequent feature selection stage. Once the important features are obtained, a model based on machine learning or deep learning is constructed to perform anomaly detection.

Figure 2: Analyzing time series with contextual data

4.1. Data with Heterogeneous Time Dimensions

Machine learning and deep learning techniques for time series analysis, such as Random Forest, Support Vector Machine (SVM), and Long Short Term Memory (LSTM), require input data in a uniform or homogeneous format, i.e., numerical values (vectors) recorded over regular time steps. However, contextual information with heterogeneous time dimensions lacks this uniformity. The effective integration of data with heterogeneous time dimensions involves distinct strategies for each type, which we illustrate using the example of WDSs.

4.1.1. Time Series Data

WDSs utilize networks of sensors, such as flow sensors, to monitor the consumption of water. These sensors generate time series data recorded either regularly or irregularly, with varying sampling frequencies. Along with the consumption patterns (primary time series) to be analyzed, secondary time series may also be considered, such as temperature or rainfall. For example, water consumption is highly correlated with temperature, which is also time series data. Tables 1 and 2 exemplify time series data collected from a water flow sensor and a temperature sensor. This data can be aligned with specific timestamps to ensure consistency, typically through interpolation or resampling to establish regular time steps.

Table 1
Time series data from sensors
Sensor ID   DateTime            Water Flow
1           01-01-2023 01:30    9.01
1           01-01-2023 02:30    8.85
2           01-01-2023 01:30    6
2           01-06-2023 12:30    7.5
1           01-01-2023 03:30    7.90

Table 2
Time series data from locations
Location    DateTime            Temp
Bolzano     01-01-2023 01:00    4
Bolzano     01-01-2023 02:00    5
Bolzano     01-01-2023 02:30    5
Bronzolo    01-01-2023 02:30    6
Bronzolo    01-01-2023 03:30    7

4.1.2. Static Data

Static data, as illustrated in Table 3, remains constant over time and may be associated with every time point, maintaining its validity throughout the period of interest. In WDSs, static data such as sensor location and characteristics plays a crucial role in cross-sensor analysis. For instance, the location of sensors can provide valuable insights into the population distribution of different regions, which can directly influence factors such as water consumption. Areas with higher population density are likely to exhibit different consumption patterns compared to sparsely populated regions. By considering sensor locations, models can better capture and interpret variations in water consumption data, leading to more accurate and informed analyses in decision making.

Table 3
Static data
Sensor ID   Location    min flow   max flow
1           Bozen       5000       10000
2           Bronzolo    1000       5000

4.1.3. Event Data

Event data, as illustrated in Table 4, captures the specific time points at which an event occurs. Such data may be integrated by marking the relevant time points with event indicators, offering a comprehensive view of discrete events over time. In WDSs, event data captures irregular occurrences such as sensor calibration, cleaning, replacement, or bank holidays. Some of these events can directly influence sensor readings, potentially leading to inaccuracies in interpreting water consumption data. Therefore, it is crucial to include this type of information for an accurate analysis of water consumption.

Table 4
Process events
Sensor ID   DateTime               Maintenance
1           01-08-2023 02:00:00    calibration
2           01-09-2023 01:30:00    replacement

4.1.4. Interval Data

Interval data, as illustrated in Table 5, is characterized by start and end time points between which a particular piece of contextual information is valid. In the domain of WDSs, interval data represents periods, such as tourism seasons, that may or may not recur annually. It is crucial to consider tourism related information, as it significantly influences water consumption. During the peak of a tourism season, water consumption increases substantially compared to off-season periods. Incorporating tourism seasonality into the analysis is therefore essential for accurately understanding the water consumption patterns.

Table 5
Interval data
Location    Period                      Tourist Trend
Bolzano     01-06-2023 to 30-10-2023    high
Bronzolo    01-07-2023 to 30-10-2023    high

4.2. Homogeneous Data

During the data representation stage, data with heterogeneous time dimensions is transformed into a uniform or homogeneous format suitable for machine learning models that require numerical vectors over regular time steps. Following the sampling rate of the main time series or a user-defined parametrization, this stage extracts different values from the data with heterogeneous time dimensions using simple value extraction, different aggregation functions, or one-hot encoding (similar to resampling). For instance, at each time step, for static data it may just extract a value or a one-hot encoding; for event data it may extract a one-hot encoding and/or a count of events; for interval data it may extract a count and/or an indicator of whether a period just started or ended; and for secondary time series it may apply a resampling based on the new frequency. This initial transformation ensures that all data has the same homogeneous representation, as shown for instance in Table 6 (the last two columns can be ignored for now). Each row in the table corresponds to a specific time step, while each column represents values extracted from the heterogeneous data. Figure 3 illustrates this transformation into homogeneous data graphically. This step can be achieved through a combination of densification and aggregation, and some of the values extracted at each time step will be used later to extract features that relate to previous (cross) time steps.

Table 6
Uniform data representation
Sensor ID   Location    flow   Temp   DateTime            Tourism   Maintenance   Last Cal.   Dur. high tourism
1           Bolzano     9.01   4      01-01-2023 01:00    low       calibration   30          0
1           Bolzano     8.05   5      01-01-2023 02:00    low       no            0           0
1           Bolzano     7.90   5      01-01-2023 03:00    high      cleaning      0           2
1           Bolzano     7      5      01-01-2023 04:00    high      calibration   3           2
2           Bronzolo    6.01   6      01-01-2023 01:00    low       no            0           0
2           Bronzolo    9.01   7      01-01-2023 02:00    high      replacement   0           1

Figure 3: Homogeneous data representation

4.3. Feature Extraction

After data representation, the subsequent step is feature extraction, where features are extracted from the generated values and the primary time series. Similarly to data representation, this stage is sensitive to the type of time dimension from which the values were extracted. For instance, for event data, the number of events in the past or the elapsed time since a specific event last occurred may be generated, e.g., see "Last Cal." in Table 6, which records the number of hours since the last calibration event. For interval data, the time since a period last ended or the duration so far may be generated, e.g., see "Dur. high tourism" in Table 6, which records the duration of high tourism in days. These additional features provide information across time to the model and help to set a given time step into context, ultimately improving model performance. For instance, sensor readings with very high values for the time since the last calibration may be subject to higher variability or outliers, or water consumption deep in the tourism season may be generally very high. While the previous stage was achieved through a combination of densification and aggregation, this step can be realized through window functions aggregating over previously generated values. The key challenge is to identify the required sequence of dependencies.

4.4. Feature Selection

Feature extraction may yield hundreds of features, and at this stage it becomes imperative to select the features with high importance. Feature selection is crucial for several reasons, including reducing computational complexity and mitigating the risk of overfitting. Features are chosen based on the model's performance: depending on the model's output, features are selected iteratively, and this process is repeated recursively. This iterative approach enhances the explainability of the algorithm by identifying which features are indeed valuable for decision making.

4.5. Anomaly Detection Model

After feature extraction and selection, the final step involves the development of machine learning and deep learning models for multivariate time series analysis, using the example of anomaly detection. To validate the impact of the contextual information integrated in the previous steps, the performance evaluation is twofold. First, the model is trained without contextual data to establish a baseline performance based solely on the available features or signals. Second, contextual features are integrated to observe their impact on the model's predictive capabilities. Comparing the model's performance in both settings shows how contextual information affects predictive accuracy and robustness.

5. Conclusion

Sensor readings often display complex properties influenced by numerous contextual factors. Integrating contextual factors in time series analysis is crucial for accurately interpreting data patterns and identifying the factors behind variations. By incorporating contextual factors, we aim to improve the performance of machine learning and deep learning models for time series analysis. In this PhD research, currently at the beginning of the second year, we have identified four categories of contextual information with heterogeneous time dimensions. Based on these, we will develop a systematic data management approach for the effective and efficient integration of contextual features for time series analysis.

References

[1] M. Mutchek, E. Williams, Moving towards sustainable and resilient smart water grids, Challenges 5 (2014) 123–137. doi:10.3390/challe5010123.
[2] A. Zanfei, B. M. Brentan, A. Menapace, M. Righetti, A short-term water demand forecasting model using multivariate long short-term memory with meteorological data, Journal of Hydroinformatics 24 (2022) 1053–1065. doi:10.2166/hydro.2022.055.
[3] M. Rodríguez, D. P. Tobón, D. Múnera, Anomaly classification in industrial internet of things: A review, Intell. Syst. Appl. 18 (2023) 200232. doi:10.1016/J.ISWA.2023.200232.
[4] E. Calikus, S. Nowaczyk, M. Bouguelia, O. Dikmen, Wisdom of the contexts: active ensemble learning for contextual anomaly detection, Data Min. Knowl. Discov. 36 (2022) 2410–2458. doi:10.1007/S10618-022-00868-7.
[5] M. A. Hayes, M. A. M. Capretz, Contextual anomaly detection framework for big sensor data, J. Big Data 2 (2015) 2. doi:10.1186/S40537-014-0011-Y.
[6] Z. Li, M. van Leeuwen, Explainable contextual anomaly detection using quantile regression forests, Data Min. Knowl. Discov. 37 (2023) 2517–2563. doi:10.1007/S10618-023-00967-Z.
[7] Y. Shulman, Unsupervised contextual anomaly detection using joint deep variational generative models, CoRR abs/1904.00548 (2019). arXiv:1904.00548.
[8] D. B. Araya, K. Grolinger, H. F. ElYamany, M. A. M. Capretz, G. T. Bitsuamlak, Collective contextual anomaly detection framework for smart buildings, in: IJCNN, IEEE, 2016, pp. 511–518. doi:10.1109/IJCNN.2016.7727242.
[9] R. Yasaei, F. Hernandez, M. A. A. Faruque, Iot-cad: Context-aware adaptive anomaly detection in iot systems through sensor association, in: IEEE/ACM ICCAD, IEEE, 2020, pp. 9:1–9:9. doi:10.1145/3400302.3415672.
[10] A. M. Kosek, Contextual anomaly detection for cyber-physical security in smart grids based on an artificial neural network model, in: CPSR-SG, 2016, pp. 1–6. doi:10.1109/CPSRSG.2016.7684103.
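
As an illustration of the data representation and feature extraction stages described in Sections 4.2 and 4.3, the following is a minimal Python sketch and not the implementation used in this work. It assumes an hourly grid, mean aggregation for the primary time series, and binary indicators for event and interval data. The function names `represent` and `add_cross_step_features`, the field names, and all input values are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def ts(text):
    """Parse a 'dd-mm-YYYY HH:MM' timestamp (format used in the tables)."""
    return datetime.strptime(text, "%d-%m-%Y %H:%M")


# Toy inputs, one per time dimension (illustrative values, not real data).
static = {1: {"location": "Bolzano"}}                        # static data
flow = [(ts("01-01-2023 01:30"), 9.01),                      # primary series
        (ts("01-01-2023 02:30"), 8.85),
        (ts("01-01-2023 03:30"), 7.90)]
events = [(ts("01-01-2023 02:00"), "calibration")]           # event data
intervals = [(ts("01-01-2023 03:00"),                        # interval data
              ts("01-01-2023 06:00"), "high")]


def represent(sensor_id, start, steps, step=timedelta(hours=1)):
    """Densify onto a regular grid; extract one value per source and step."""
    rows = []
    for i in range(steps):
        lo, hi = start + i * step, start + (i + 1) * step
        in_bin = [v for t, v in flow if lo <= t < hi]
        rows.append({
            "time": lo,
            # static data: simple value extraction
            "location": static[sensor_id]["location"],
            # primary time series: mean of readings in the step (None if empty)
            "flow": sum(in_bin) / len(in_bin) if in_bin else None,
            # event data: indicator of a calibration within the step
            "calibration": int(any(lo <= t < hi and kind == "calibration"
                                   for t, kind in events)),
            # interval data: indicator that the step starts inside a period
            "high_tourism": int(any(s <= lo < e and level == "high"
                                    for s, e, level in intervals)),
        })
    return rows


def add_cross_step_features(rows):
    """Window-style features aggregating over previously generated values."""
    since_cal = None  # steps since the last calibration; None until one occurs
    run = 0           # consecutive steps spent in a high-tourism period
    for row in rows:
        if row["calibration"]:
            since_cal = 0
        elif since_cal is not None:
            since_cal += 1
        run = run + 1 if row["high_tourism"] else 0
        row["steps_since_cal"] = since_cal
        row["dur_high_tourism"] = run
    return rows


rows = add_cross_step_features(represent(1, ts("01-01-2023 01:00"), steps=4))
for row in rows:
    print(row)
```

The first function corresponds to densification and aggregation, the second to a running computation over the already homogeneous rows. In a database setting the second step would more naturally be expressed with window functions, which this sketch only imitates with a single pass over the rows.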