=Paper=
{{Paper
|id=Vol-3163/paper4
|storemode=property
|title=Detecting Temporal Dependencies in Data
|pdfUrl=https://ceur-ws.org/Vol-3163/BICOD21_paper_5.pdf
|volume=Vol-3163
|authors=Joaquin Cuomo,Hajar Homayouni,Indrakshi Ray,Sudipto Ghosh
|dblpUrl=https://dblp.org/rec/conf/bncod/CuomoHRG21
}}
==Detecting Temporal Dependencies in Data==
Joaquin Cuomo¹, Hajar Homayouni², Indrakshi Ray¹ and Sudipto Ghosh¹
¹ Department of Computer Science, Colorado State University
² Department of Computer Science, San Diego State University
Abstract

Organizations collect data from various sources, and these datasets may have characteristics that are unknown. Selecting the appropriate statistical or machine learning algorithm for data analysis benefits from understanding these characteristics, such as whether the data contains temporal attributes. This paper presents a theoretical basis for automatically determining the presence of temporal data in a dataset given no prior knowledge about its attributes. We use a method to classify an attribute as temporal, non-temporal, or hidden temporal. A hidden (grouping) temporal attribute can only be treated as temporal if its values are categorized into groups. Our method uses a Ljung-Box test for autocorrelation as well as a set of proposed metrics based on the classification statistics. Our approach detects all temporal and hidden temporal attributes in 15 datasets from various domains.

Keywords

Dataset Management Systems, Statistics, Temporal attribute detection, Autocorrelation.
1. Introduction

Datasets can be temporal or non-temporal. A dataset is temporal if one or more attributes is a time sequence [1]. An example of a temporal dataset is a stock market dataset, in which each value of an attribute corresponds to the daily stock price. Time series normally present a time-dependency, meaning that a value depends on its past values. Time-series analysis has applications ranging from stock market prediction to digital signal processing and has been studied in statistics [1], econometrics [2], and communications [3].

Data analysis techniques depend on the type of data. Techniques for non-temporal data, such as Support Vector Machine (SVM) [4] and Isolation Forest (IF) [5], only discover associations among attributes of individual data records and cannot be used for analyzing time-series data, because associations may exist among multiple records in a time series [6]. Other approaches, such as Autoregressive Integrated Moving Average (ARIMA) [7] and Long Short-Term Memory (LSTM) [8], are more suitable for prediction or optimization in temporal data analysis [9].

It is critical to understand the existence of temporal dependencies in a dataset in advance in order to choose the best analysis approach. Existing analysis approaches rely on domain experts to identify the type of data and to choose appropriate techniques to model it. However, in big datasets, there can be a large number of attributes to be analyzed by the experts. Moreover, even domain experts may not be aware of temporal dependencies among a subset of attributes in a big dataset. An example is a health data warehouse into which temporal and non-temporal data is automatically loaded from multiple source hospitals through an automated Extract, Transform, Load (ETL) process [10]. Every patient can have a set of temporally dependent records, such as records related to their lab tests. Explicit temporal information, such as a timestamp that identifies when data is captured, as well as attribute names that indicate temporal characteristics, may change through the ETL transformation. For example, the name of the Patient_Height attribute may change into a random name through the transformation process. This data modification can make the temporal nature of the target attribute unknown to the researchers who are using the data for making critical decisions on diseases, treatments, and medications.

To the best of our knowledge, there is no prior attempt at the detection of temporal dependencies in datasets. Such dependencies are presumed to be known beforehand, which works only for well-understood datasets. However, where domain experts lack adequate knowledge about the data characteristics, there is a need to automatically detect whether or not a dataset is temporal in order to choose the right technique and have a fully automated process. Our work fills this gap.

We developed a method to determine whether or not a dataset contains temporal attributes. Moreover, our approach automatically identifies grouping attributes. A grouping attribute is one by which we can group the dataset records so that attributes behave as temporal within each group, although not over the dataset as a whole. A dataset may have one or more grouping attributes. The proposed algorithm is based on a portmanteau test [1] for autocorrelation to determine the presence of temporal data.

BICOD21: British International Conference on Databases, December 9–10, 2021, London, UK
jcuomo@colostate.edu (J. Cuomo); hhomayouni@sdsu.edu (H. Homayouni); iray@colostate.edu (I. Ray); ghosh@colostate.edu (S. Ghosh)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
To find grouping attributes that yield temporal sequences, we use a brute-force approach, testing each unique value as a possible grouping attribute. Finally, we propose metrics that help determine whether the result of the portmanteau test should be accepted or rejected based on an integrated perspective of the dataset. We evaluated the proposed method on fifteen datasets, where each attribute was given an a priori classification by domain experts. We demonstrated that our approach was able to discover all the temporal attributes.

The paper is organized as follows. Section 2 presents the theoretical background that forms the basis of our work. Section 3 discusses our proposed approach to detect temporal dependencies in datasets. Sections 4 and 5 describe our experiments and results using 15 different datasets. Section 6 discusses limitations and future work, and Section 7 concludes the paper.

2. Background

In this section we provide background on time series analysis and autocorrelation theory, which is needed to understand the proposed method.

2.1. Time Series

A time series is a sequence of observations equally spaced and ordered by time [11]. Normally, these observations are not independent from each other because their relative order is important. This non-independence means that there is a temporal dependence: future values are influenced by past values. The classical approach for analyzing temporal series is to consider them as a combination of four components. This combination can be additive or multiplicative:

X_t = Trend + Seasonal + Cyclical + Irregular   (1)

X_t = Trend · Seasonal · Cyclical · Irregular   (2)

where X_t is a temporal series. A secular trend (Trend in Eqs. 1 and 2) describes the consistent tendency of the data over a long period. A seasonal variation (Seasonal) describes the periodic fluctuation within cycles. The cyclical component (Cyclical) describes longer periodic fluctuations. The irregular component (Irregular) describes small changes that are unpredictable.

A time series is said to be stationary if its statistical properties do not change over time, that is, if it has constant mean and variance, and its covariance is independent of time.

Finally, autocorrelation is a measure of the similarity of the observations at a certain lag, that is, the correlation of the series with a delayed copy of itself. It gives critical information on whether a value in the series can be used to infer information about another value. A common way to analyze temporal data is to create a model that fits the data, and the most widespread technique is regression analysis, which uses autocorrelation [12]. Therefore, autocorrelation is the most important metric for determining whether a dataset has temporal dependence.

2.2. Testing Autocorrelation

Most of the literature on autocorrelation of time series is about evaluating the fitness of an autoregressive model, which is done by analyzing the autocorrelation of the model's residuals. However, because we do not have prior knowledge about the data, we are unable to apply these methods, which require certain assumptions [13]. The most popular methods are Ljung-Box [14] and Box-Pierce [15]; others, such as Breusch–Godfrey [16], Peña–Rodríguez [17], and Monte Carlo tests [18], overcome some of the limitations of the first two [14, 15] but are more focused on the residuals of time-series models. Both the Ljung-Box and Box-Pierce methods are portmanteau tests, which allow testing the autocorrelation of a time series at multiple lags at the same time. The null hypothesis of the test is that the data is independently distributed, while the alternative hypothesis is that the data exhibits serial correlation up to some lag. The distribution of the test statistic is asymptotically χ², and rejection of the null hypothesis indicates that there is autocorrelation in the data.

The method that we use is Ljung-Box, a modification of Box-Pierce whose statistic is better approximated by a χ² distribution [14]. The formula is:

Q(m) = n(n + 2) · Σ_{k=1}^{m} r_k² / (n − k)   (3)

where n is the number of samples, m is the maximum lag to test for autocorrelation, and r_k is the autocorrelation at lag k.

The degrees of freedom of the χ², when there is no other information about the data, should be equal to the number of lags up to which the autocorrelation is being tested. The choice of lag is difficult when no information about the data is known: the higher the lag, the lower the power of the test, and the lag should be a fraction of the sequence length. For example, the Stata implementation [19] uses the rule m = min(n/2, 40), Box et al. [20] suggest m = 20, and Tsay [21] suggests m = ln(n), warning that when seasonal behavior is expected it needs to be taken into consideration, as lag values at multiples of the seasonality are more important. Escanciano and Lobato [22] present a portmanteau test that automatically chooses the lag.
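As a concrete illustration, Eq. 3 can be implemented directly with NumPy and SciPy. This is a minimal sketch, not the authors' code; the function and variable names are our own.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, m):
    """Ljung-Box portmanteau statistic Q(m) (Eq. 3) and its p-value
    under a chi-squared distribution with m degrees of freedom."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    # Sample autocorrelation r_k for lags k = 1..m
    r = np.array([np.sum(xc[k:] * xc[:-k]) / denom for k in range(1, m + 1)])
    q = n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, m + 1)))
    p_value = chi2.sf(q, df=m)
    return q, p_value

# A strongly autocorrelated series (a linear ramp) rejects the
# null hypothesis of independence.
q, p = ljung_box(np.arange(100), m=10)
```

For production use, `statsmodels.stats.diagnostic.acorr_ljungbox` provides an equivalent, more thoroughly tested implementation.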
3. Our Approach

Based on possible temporal characteristics, we categorize datasets into three types:

• No temporal dependence: given a dataset with no temporal information, no autocorrelation is expected.
• A continuous, evenly-sampled, time-ordered dataset: given a dataset that corresponds to a single time window, we can detect the temporal dependence by computing the autocorrelation of each attribute over the entire dataset.
• Temporal dependence within a grouping attribute: there is no observable temporal dependence when the dataset is considered as a whole, but the temporal dependence becomes apparent when the dataset is grouped by some attribute. In such a case, we can detect the temporal dependence by computing the autocorrelation of each attribute within each group. Finding the proper grouping attribute is the main challenge in this case.

Figure 1 exemplifies the third case, where a dataset may have hidden temporal dependencies that are uncovered once the proper attribute is used to form groups. On the left, the entire dataset does not exhibit any autocorrelation for any of the attributes. On the right, after grouping by the attribute 'county', the attributes 'deaths' and 'cases' correspond to temporal series.

Figure 1: Example of groups created by filtering by the attribute county's values. Left shows the entire dataset; right shows the groups. This dataset has no obvious temporally dependent data until we group by 'county'. Only then do 'deaths' and 'cases' have temporal dependency.

Our Algorithm

We propose an algorithm that detects data with temporal dependency. To do so, we split the algorithm into two stages, A and B, as shown in Figure 2.

Figure 2: High-level overview of the proposed method.

In stage A, we do nested iterations over all the numeric attributes and all their unique values. We group the dataset by those values and classify all other attributes as time-dependent or not. As an example, using the dataset from Figure 1: while iterating over the attribute 'county', we group by 'county', and for each group we classify the other attributes ('date', 'cases', and 'deaths') as temporal or not. The following pseudocode describes the process, which computes a set of metrics that we analyze in stage B using a decision tree to determine the temporal attributes:

for each attribute A do
    for each unique value x of A do
        smallDB = SELECT * WHERE A = x;
        classification(smallDB);
    end
end

The classification part of stage A is diagrammed in Figure 3. It consists of analyzing a single attribute and determining whether it has autocorrelation. We run a Ljung-Box test to detect statistically significant autocorrelation. In parallel, we apply a threshold (0.5 in our examples) to determine whether the autocorrelation is also quantitatively significant for the specific posterior use of the dataset. If both tests pass, we consider the sequence to have temporal dependency.

The metrics output by stage A consist of a table showing statistics of all the classifications when grouping the dataset by each attribute. The rows are the attributes of the dataset and the columns are the metrics described in Table 1. To address the first and second types of dataset described at the beginning of this section, we add a row corresponding to no grouping at all, where we show the classification of the attributes when no grouping is done. As an example of how the metrics are computed, consider the dataset from Figure 1. First, we group by 'date' and classify each attribute as temporal or not; in this case, in none of the groups are the attributes classified as temporal. Next, we group by 'county' and classify 'cases' and 'deaths' as temporal ('date' is not considered, as it is not a numerical attribute). As some particular counties might have only 'cases' classified as temporal, the average number of attributes detected as temporal when grouping by 'county' is less than 2; Figure 4 shows that the average is 1.93 in the resulting table.
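The stage-A loop maps naturally onto pandas. The sketch below is our own illustration under stated assumptions: `has_autocorrelation` is a simplified stand-in (lag-1 autocorrelation against the 0.5 threshold) for the paper's Ljung-Box classification, and all names are ours, not the authors'.

```python
import pandas as pd

def has_autocorrelation(series: pd.Series, threshold: float = 0.5) -> bool:
    """Simplified stand-in for the paper's classifier: requires a
    quantitatively significant lag-1 autocorrelation."""
    if len(series) < 3 or series.nunique() <= 1:
        return False
    r1 = series.autocorr(lag=1)
    return pd.notna(r1) and abs(r1) >= threshold

def stage_a(df: pd.DataFrame) -> dict:
    """For each candidate grouping attribute, compute the average
    number of other numeric attributes that look temporal per group."""
    results = {}
    numeric = df.select_dtypes("number").columns
    for attr in df.columns:                   # candidate grouping attribute
        counts = []
        for _, group in df.groupby(attr):     # one group per unique value
            temporal = [c for c in numeric
                        if c != attr and has_autocorrelation(group[c])]
            counts.append(len(temporal))
        results[attr] = sum(counts) / len(counts) if counts else 0.0
    return results

# Toy example: 'value' is a ramp within each county, so grouping
# by 'county' reveals temporal behavior that single-row groups cannot.
df = pd.DataFrame({
    "county": ["a"] * 20 + ["b"] * 20,
    "value": list(range(20)) + list(range(100, 120)),
})
avg = stage_a(df)
```

Grouping by 'county' yields an average of one temporal attribute per group, while grouping by 'value' yields none, mirroring how the avg_temp_att metric separates useful grouping attributes from useless ones.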
Figure 3: Diagram of the proposed classification algorithm.

Table 1: Metrics

% data: Percentage of records, over the entire dataset, belonging to groups with at least one attribute classified as temporal.
groups: Count of groups with at least one attribute classified as temporal.
avg_temp_att: Average count of attributes classified as temporal over the groups.
std: Standard deviation of avg_temp_att.
avg_corr: Average of the maximum autocorrelations over all groups. Maximum values are calculated within a group, over all attributes classified as temporal.
max_corr: Maximum autocorrelation over all attributes and groups.

Figure 4: Example of the analysis of the metrics using the example from Figure 1.

Stage B consists of analyzing the metrics from the resulting table to determine whether grouping by an attribute generates temporal sequences. We designed a decision tree, shown in Figure 5, to guide the analysis of the table. The tree first discards attributes with a small percentage of data used, as they are considered not representative. Similarly, attributes that produced only one group (or none) are discarded, as they do not produce multiple groups with temporal dependence. By the definition of the metrics, only groups with some autocorrelation are counted. For example, in Figure 6, when grouping by the 'date' attribute, none of the resulting groups presents autocorrelation; as a result, the group count metric for 'date' is equal to 0. Next, the average count of attributes with detected autocorrelation is used to discard attributes, where larger values are preferred (as long as the standard deviation is small). This condition is the primary metric for analyzing the attributes, as it favors groupings that on average yield more attributes with autocorrelation. Additionally, the average autocorrelation of each group is evaluated, and those with the highest values are considered.

Figure 5: Decision tree to analyze results.

The final result consists of the attributes with temporal dependence, along with the percentage of groups in which each was detected as temporal. If the percentage is lower than 50%, we do not consider that attribute temporal for further analysis. Figure 6 shows an example of this result when grouping by the 'county' attribute. In this example, both the 'cases' and 'deaths' attributes have autocorrelation and were detected as temporal in more than 50% of the groups. As a result, both attributes are considered temporal if we group the dataset by 'county'.
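The 50% occurrence rule of the final-result step can be sketched as follows. This is a simplified illustration of our own; the per-group boolean classifications would come from the stage-A classifier, and the numbers below only echo the covid2 example from Section 5.

```python
def temporal_attributes(per_group: dict, threshold: float = 0.5) -> dict:
    """per_group maps each attribute to a list of booleans, one per
    group, saying whether the attribute was classified as temporal in
    that group. Keep attributes detected in >= threshold of groups."""
    result = {}
    for attr, flags in per_group.items():
        occurrence = sum(flags) / len(flags) if flags else 0.0
        if occurrence >= threshold:
            result[attr] = occurrence
    return result

# Echoing the covid2 example: 'fips' is temporal in only ~22% of
# county groups, so it falls below the 50% threshold and is discarded.
detected = temporal_attributes({
    "cases":  [True] * 9 + [False],       # 90% of groups
    "deaths": [True] * 8 + [False] * 2,   # 80% of groups
    "fips":   [True] * 2 + [False] * 7,   # ~22% of groups
})
```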
4. Experiments

We conduct different experiments to show how the metrics we defined can help determine whether there is temporal dependence in a dataset. We run the algorithm against 15 datasets, five for each of the three categories mentioned in Section 3. For each category, we explain one dataset in detail as an example. In the following, we describe the research questions that we answer through our experiments.

Q1: Can the proposed approach correctly identify attributes with temporal dependence in the datasets?

We answer this question using domain knowledge. Domain experts label an attribute as positive or negative depending on whether or not the attribute has temporal dependency. We construct a contingency matrix and calculate the accuracy (Eq. 5) and the F1-score (Eq. 4) for each dataset. We include the accuracy because the F1-score is not applicable in the cases where there are no temporal attributes.

F1 = TP / (TP + (1/2)(FP + FN))   (4)

ACC = (TP + TN) / (TP + TN + FP + FN)   (5)

Q2: Can the approach correctly identify grouping attributes to form multiple temporally dependent sequences?

Using the same evaluation metrics, we analyzed whether the approach can correctly identify attributes by which we can group the dataset records into multiple temporally dependent sequences. For each dataset, we first identify whether it has such an attribute. Typically, there are multiple possible grouping attributes. For example, in a dataset containing information about suicides all over the world, a grouping attribute can be the country, but at the same time there could be trends related to other attributes, such as gender. Therefore, we do not do a specific analysis of each possible grouping attribute, but limit the analysis to the existence or not of any. The F1-score is calculated for all the datasets.

All the datasets used in this study are publicly available and were picked to exemplify the various categories (Appendix B).

5. Results

In this section, we first present a summary of the results for each dataset. Then, we explore in more detail some examples of each specific case.

Figure 7 shows the scores for each dataset. Each row corresponds to a dataset, where the first five (election, incomes, countries, biomechanical, and crime) do not have temporal dependence. The next ten have temporal dependence, but only the last five (covid2, wage, market, avocado, and suicides) have grouping attributes that produce temporal sequences, while the five in the middle (covid1, energy1, yahoo, traffic, india) do not. Each case is indicated by the column 'case', where the numbers 0, 1, and 2 correspond to the order of the categories explained in Section 3. The columns 'FP', 'TP', 'FN', and 'TN' count the number of attributes classified as temporal or not that are false positives, true positives, false negatives, and true negatives, respectively. The columns 'ACC' and 'F1' are the accuracy and the F1-score of the classification, respectively. Next, '# temp att detected' is the ratio of the attributes that were correctly detected over the total number of temporal attributes. The next three columns refer to the detection of grouping attributes: 'grouping' indicates whether the dataset has one or more grouping attributes, 'grouping detected' whether the algorithm found any, and 'contingency' specifies the type of error or success. Finally, the bottom part of the table summarizes the 'contingency' column and shows the F1-score for the detection of grouping attributes.

For case 0, in two of the five datasets, attributes were detected as temporal. As there is no temporal attribute, the F1-score is not applicable for this case, so only the accuracy should be taken into account. In the entire table, these are the only two cases with accuracy and F1-score (when applicable) lower than 1. In none of the datasets was an attribute falsely considered a grouping attribute. Moreover, in all datasets with grouping attributes, those attributes were successfully found, yielding an F1-score of 1.

Figure 6: An example of attributes with temporal dependence.

5.1. No Temporal Dependencies Datasets

To exemplify this case, we use the 'elections' dataset, which consists of reported votes by county in the governor races of the 2020 US elections (Figure 8). It has 1025 entries, 2 non-numeric attributes, and 3 numerical attributes, none of which has temporal dependence. Figure 9 shows that no autocorrelation was found, as expected.

One of the limitations of using autocorrelation, as we discuss in Section 6.1, is that other types of relationships can also produce correlation. To illustrate this, we used the 'biomechanical' dataset (Figure 10), for which there are two false positives in the result table of Figure 7. The dataset consists of six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine of 310 patients. Despite the lack of temporal dependence in the data, the results shown in Figure 11 indicate the presence of autocorrelation in 2 of the 5 numerical attributes, namely 'lumbar lordosis angle' and 'degree spondylolisthesis'.
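The per-dataset scores in Figure 7 follow directly from the contingency counts of Eqs. 4 and 5. A minimal sketch (our own helpers, not the authors' evaluation code):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-score as in Eq. 4; undefined when there are no positives."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else float("nan")

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy as in Eq. 5."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical dataset with 2 true positives, 1 false positive,
# 0 false negatives, and 3 true negatives.
f1 = f1_score(tp=2, fp=1, fn=0)         # 2 / 2.5 = 0.8
acc = accuracy(tp=2, tn=3, fp=1, fn=0)  # 5 / 6
```

The NaN case makes explicit why the paper reports accuracy alongside F1: with no temporal attributes (TP = FP = FN = 0), the F1-score is simply not defined.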
Figure 7: Results summary for all datasets. Cases 0, 1, and 2 correspond to no temporal information, no-grouping temporal information, and grouping temporal information.

Figure 8: US elections dataset.
Figure 9: US elections dataset grouping attribute results.
Figure 10: Biomechanical dataset.
Figure 11: Biomechanical dataset temporal attribute results.

5.2. No-grouping Temporal Datasets

The 'covid1' dataset, shown in Figure 12, has daily information about positive cases and deaths caused by COVID-19 in the United States. The results in Figure 13 show that no grouping is the best option, which is correct, as none of the attributes allows forming groups. The detected temporal attributes are, as expected, the numbers of cases and deaths, as shown in Figure 14.

Figure 12: Covid-19 in the US dataset.
Figure 13: Covid-19 in the US dataset grouping attribute results.
Figure 14: Covid-19 in the US dataset temporal attribute results.

5.3. Temporal Dependencies within Grouping Attributes

To illustrate this case, we used the 'covid2' dataset (Figure 15), which consists of daily deaths and positive cases of COVID-19 by county in the United States. There are two differences between the 'covid1' and 'covid2' datasets. First, there are two non-numeric attributes corresponding to counties and states, which can be used to establish a geographical relation between the records. Second, there is a numerical attribute 'FIPS', a code that identifies counties and states. Therefore, we expect this attribute not to have autocorrelation, but to be a potential grouping attribute.

The results in Figure 16 show that if we do not split the dataset into groups, none of the attributes can be considered to have temporal dependence. Instead, if we group by 'county' or 'fips', there are on average 2 attributes with autocorrelation. Although grouping by the attributes 'date', 'state', 'cases', and 'fips' yields non-zero values in the resulting table, the percentage of used data is low, so we ignore grouping by these attributes. Figure 17 shows that, when grouping by 'county', the attributes 'cases', 'deaths', and 'fips' could be considered temporal sequences. Nevertheless, we discard 'fips', as it has an occurrence of approximately 22%, which is lower than our defined threshold of 50%.

Figure 15: Covid-19 in the US by county and state dataset.
Figure 16: Covid-19 in the US by county and state grouping attribute results.
Figure 17: Covid-19 in the US by county and state temporal attribute results.

6. Discussion and Future Research
We proposed an approach to classify datasets based on whether or not they contain temporally dependent data. The core of our algorithm uses autocorrelation to determine whether there is a temporal dependence in a section of the data, and it relies on a set of proposed metrics to classify the dataset as a whole. Among these metrics, the percentage of data used in the analysis, the number of groups, and the average number of autocorrelated sequences found were the three that provided the most relevant information for making a decision. The other metrics were not used in any of the examples, but we believe they could come in handy in larger datasets. For example, the standard deviation should not be too large, as a large value would mean that a particular grouping attribute value has more temporal sequences than the rest, which is probably the result of an outlier and should be handled carefully to avoid a false positive. Both the average autocorrelation and the maximum autocorrelation are used as tiebreakers when the other metrics have the same values.

Our approach could identify temporal sequences both when the sequence corresponds to the entire dataset and when grouping by attributes was needed. Typical datasets fall into both cases, meaning that an attribute can present autocorrelation as a whole sequence and as multiple grouped subsequences. The latter case is important because it allows improving the data analysis. For example, if we have an outlier detection algorithm for temporal data, we may apply it to a single sequence as well as to different subsequences constructed from the same data, to increase the chance of detecting more outliers. Another use case is when the algorithm has high time complexity; in such a case, it may be better to explore the outliers only in the smaller subsequences rather than in the entire sequence.

6.1. Limitations

We identify the following scenarios where our approach might fail to detect temporal dependency in the attributes.

• Small sample size: when the number of samples is small, no statistical test will have enough significance.
• Unevenly-sampled data: when there is no constant time-spacing between samples. If the uneven sampling is due to missing data points and the sample size is large enough, the approach should converge to the same values as if all data points were present. However, if there is no pattern in the sampling rate, different methods should be used to calculate the autocorrelation indirectly, such as estimating it with statistical approaches [23].
• Missing values: when there are null values in the data. There are many methods [24] to overcome missing values in time-series data, including some specifically for the Ljung-Box test [25]. However, under the assumption that we have no prior information about the data, none of these methods can be used.
• Cross-sectional data: when there are dependencies other than temporal ones between attribute values. Even though autocorrelation is a necessary condition to exploit temporal data information, it is not a sufficient condition to determine whether the data is temporal. For example, our method will fail when a dataset has correlations that are not temporal but spatial [26].
• Non-stationarity: when the statistical properties of a time series vary over time. In such cases, the autocorrelation cannot be calculated using the mean and the variance but needs to be estimated. Methods similar to those used for missing values could be applied [27].

The decision tree used to analyze the metrics is currently not automated, as we require a higher volume of use cases to generalize the rules. Similarly, tuning the hyperparameters, such as the autocorrelation threshold we used to determine whether an autocorrelation was significant, requires more extensive analysis and cases.

6.2. Future Work

Statistical exploration and optimization. We will investigate whether different types of correlations, such as Pearson, Kendall, Spearman, and estimates from the power spectral density, can be used within the Ljung-Box or Box-Pierce test [28]. We will conduct a deeper analysis of which autocorrelation function to use when no prior information about the data is known. Currently, the algorithm goes over all numeric attributes searching for autocorrelation; this is time consuming and should, if possible, be improved.

Working with categorical attributes. Datasets may contain categorical attributes, such as boolean labels, names, IDs, and dates. These attributes may be temporal as well. For example, a positive value for a patient with a non-curable disease is unlikely to become negative in the future. Thus, finding a way to process such attributes is important. We will use one-hot encoding to pre-process the categorical attributes.

7. Conclusions

In this paper, we have presented a technique that uses autocorrelation to determine the presence of temporal data within a dataset's attributes without any prior knowledge about the database. The algorithm was tested on different databases, including those with and without temporally dependent data, and focused specifically on databases containing hidden temporal groups. For these cases, we proposed metrics to find the grouping attributes that unveil such hidden groups.
The results show that we were able to successfully classify attributes as temporal or not, and also to find grouping attributes that form temporal groups. Finally, we discussed the limitations of the approach and potential paths for improvement.

Acknowledgments

This work was supported in part by funding from NSF under Award Numbers CNS 1822118, IIS 2027750, and OAC 1931363, and by Statnett, ARL, AMI, Cyber Risk Research, and NIST.

References

[1] P. J. Brockwell, R. A. Davis, Introduction to Time Series and Forecasting, Springer, 2008.
[2] H. Luetkepohl, M. Krätzig, Applied Time Series Econometrics, 2004.
[3] M. Allen, The SAGE Encyclopedia of Communication Research Methods, SAGE Publications, 2017.
[4] Y. Chen, W. Wu, Application of One-class Support Vector Machine to Quickly Identify Multivariate Anomalies from Geochemical Exploration Data, Geochemistry: Exploration, Environment, Analysis 17 (2017) 231–238.
[5] Z. Cheng, C. Zou, J. Dong, Outlier Detection Using Isolation Forest and Local Outlier Factor, in: Conference on Research in Adaptive and Convergent Systems, Association for Computing Machinery, 2019, pp. 161–168.
[6] H. Lu, Y. Liu, Z. Fei, C. Guan, An Outlier Detection Algorithm based on Cross-Correlation Analysis for Time Series Dataset, IEEE Access 6 (2018) 53593–53610.
[7] P. M. Maçaira, A. M. T. Thomé, F. L. C. Oliveira, A. L. C. Ferrer, Time Series Analysis with Explanatory Variables: A Systematic Literature Review, Environmental Modelling & Software 107 (2018) 199–209.
[8] Y. Yu, X. Si, C. Hu, J. Zhang, A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures, Neural Computation 31 (2019) 1235–1270.
[9] W. Lin, M. Orgun, G. Williams, An Overview of Temporal Data Mining, 2019.
[10] H. Homayouni, S. Ghosh, I. Ray, An Approach
[12] A. Dotis-Georgiou, Autocorrelation in Time-series Data, 2019. URL: https://www.influxdata.com/blog/autocorrelation-in-time-series-data/, InfluxData article, accessed 25 July 2021.
[13] G. Maddala, Introduction to Econometrics, Wiley, 2001.
[14] G. Ljung, G. Box, On a Measure of Lack of Fit in Time Series Models, Biometrika 65 (1978).
[15] G. Box, D. Pierce, Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models, Journal of the American Statistical Association 72 (1970) 397–402.
[16] D. Scott, Review of Applied Econometrics with R by Christian Kleiber and Achim Zeileis, International Statistical Review 77 (2009) 164.
[17] D. Peña, J. Rodríguez, A Powerful Portmanteau Test of Lack of Fit for Time Series, Journal of the American Statistical Association 97 (2002) 601–610.
[18] J.-M. Dufour, L. Khalaf, Monte Carlo Test Methods in Econometrics, 2007.
[19] Stata Documentation, wntestq: Portmanteau (Q) Test Description, 2019. URL: http://www.stata.com/manuals13/tswntestq.pdf, accessed 24 July 2021.
[20] E. Ziegel, G. Box, G. Jenkins, G. Reinsel, Time Series Analysis, Forecasting, and Control, Technometrics 37 (1995) 238.
[21] R. Tsay, Analysis of Financial Time Series, Financial Econometrics, 2002.
[22] J. C. Escanciano, I. N. Lobato, An Automatic Portmanteau Test for Serial Correlation, Journal of Econometrics 151 (2009) 140–149.
[23] K. Rehfeld, N. Marwan, J. Heitzig, J. Kurths, Comparison of Correlation Analysis Techniques for Irregularly Sampled Time Series, Nonlinear Processes in Geophysics 18 (2011) 389–404.
[24] I. Pratama, A. E. Permanasari, I. Ardiyanto, R. Indrayani, A Review of Missing Values Handling Methods on Time-series Data, in: 2016 International Conference on Information Technology Systems and Innovation (ICITSI), 2016, pp. 1–6.
[25] D. Stoffer, C. Toloi, A Note on the Ljung–Box–Pierce Portmanteau Statistic with Missing Data, Statistics & Probability Letters 13 (1992) 391–396.
[26] A. Zuur, Spatial Correlation, 2019. URL: http://userwww.sfsu.edu/efc/classes/biol710/spatial/spat-auto.htm, San Francisco State University
for Testing the Extract-Transform-Load Process in sity article, accessed on 24th June 2021.
Data Warehouse Systems, in: Proceedings of the [27] G. P. Nason, R. Von Sachs, G. Kroisandt, Wavelet
22nd International Database Engineering and Ap- processes and adaptive estimation of the evolution-
plications Symposium, IDEAS 2018, Association for ary wavelet spectrum, Journal of the Royal Statisti-
Computing Machinery, 2018, p. 236–245. cal Society Series B 62 (2000) 271–292.
[11] W. Wei, Time Series Analysis: Univariate and Mul- [28] C. Chatfield, The Analysis of Time Series: An In-
tivariate Methods, volume 33, 1989. troduction, Fourth Edition, Chapman & Hall/CRC
Texts in Statistical Science, CRC Press, 1989. • india
https://www.kaggle.com/muralimunna18/
india-population
A. Code Population of india by year.
• exchange
The code used for this paper is available in GitHub: https://www.kaggle.com/rohithbollareddy/
https://github.com/JCuomo/TemporalDependenceDB foreign-exchange-in-india-yearlysource-rbi
Exchange currencies by year.
• covid2
B. Datasets https://raw.githubusercontent.com/nytimes/
covid-19-data/master/us-counties.csv
• elections
Covid cases and death by County in the USA.
https://www.kaggle.com/unanimad/
• wage
us-election-2020
https://kaggle.com/lislejoem/
”governors county” file.
us-minimum-wage-by-state-from-1968-to-2017
General information about reporting votes to
USA minimum wage by State from 1968 to 2020.
governor race by county.
• incomes • market
https://www.kaggle.com/jonavery/ https://raw.githubusercontent.com/selva86/
incomes-by-career-and-gender datasets/master/MarketArrivals.csv
American citizens incomes from 2015 broken Indian markets quantity and price per year.
into male and female statistics. • avocado
• countries https://www.kaggle.com/neuromusic/
https://www.kaggle.com/fernandol/ avocado-prices
countries-of-the-world Avocado weekly 2018 retail scan data for National
Information on population, region, area size, retail volume (units) and price.
infant mortality and more. • suicides
• biomechanical https://www.kaggle.com/russellyates88/
https://www.kaggle.com/uciml/ suicide-rates-overview-1985-to-2016
biomechanical-features-of-orthopedic-patients Worldwide suicide statistics per year.
Patient data of six biomechanical attributes
derived from the shape and orientation of the
pelvis and lumbar spine.
• crime
https://www.kaggle.com/mascotinme/
population-against-crime
FBI crime statistics for 2012 on population less
than 250,000.
• covid1
https://raw.githubusercontent.com/nytimes/
covid-19-data/master/us.csv
Covid cases and death statistics for USA.
• energy
This dataset is proprietary and cannot be dis-
tributed.
Daily energy delivery by Fort Collins power fa-
cility.
• yahoo
https://webscope.sandbox.yahoo.com/
”A3Benchmark all” file
Real and synthetic time-series. The synthetic
dataset consists of time-series with varying trend,
noise and seasonality. The real dataset consists
of time-series representing the metrics of various
Yahoo services.
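As a companion to the repository linked in Appendix A, the core autocorrelation check used to flag an attribute as temporal, the Ljung-Box test [14], can be sketched in Python. This is an illustrative approximation from the test's standard definition, not the repository's exact implementation; the helper names `ljung_box_pvalue` and `looks_temporal` are ours.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box_pvalue(values, lags=10):
    """p-value of the Ljung-Box portmanteau test up to `lags`.

    Q = n(n+2) * sum_{k=1..h} r_k^2 / (n-k), compared against a
    chi-squared distribution with h degrees of freedom.
    """
    x = np.asarray(values, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    # Sample autocorrelations r_k for k = 1..lags.
    r = np.array([np.dot(xc[:-k], xc[k:]) / denom for k in range(1, lags + 1)])
    q = n * (n + 2) * np.sum(r**2 / (n - np.arange(1, lags + 1)))
    return chi2.sf(q, df=lags)

def looks_temporal(values, lags=10, alpha=0.05):
    """Classify an attribute as temporal when autocorrelation is significant."""
    return bool(ljung_box_pvalue(values, lags) < alpha)

# A strongly trending attribute is highly autocorrelated and gets flagged.
print(looks_temporal(list(range(200))))  # prints True
```

In the paper's setting this check runs per attribute; hidden (grouping) temporal attributes additionally require partitioning the data by candidate grouping values before testing each group.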
Table 2
Description of Datasets

DB | Link | Description
elections | https://www.kaggle.com/unanimad/us-election-2020 ("governors county" file) | General information about reporting votes to governor race by county.
incomes | https://www.kaggle.com/jonavery/incomes-by-career-and-gender | American citizens incomes from 2015 broken into male and female statistics.
countries | https://www.kaggle.com/fernandol/countries-of-the-world | Information on population, region, area size, infant mortality and more.
biomechanical | https://www.kaggle.com/uciml/biomechanical-features-of-orthopedic-patients | Patient data of six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine.
crime | https://www.kaggle.com/mascotinme/population-against-crime | FBI crime statistics for 2012 on populations below 250,000.
covid1 | https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv | Covid cases and death statistics for the USA.
energy | This dataset is proprietary and cannot be distributed. | Daily energy delivery by the Fort Collins power facility.
yahoo | https://webscope.sandbox.yahoo.com/ ("A3Benchmark all" file) | Real and synthetic time-series. The synthetic dataset consists of time-series with varying trend, noise and seasonality. The real dataset consists of time-series representing the metrics of various Yahoo services.
india | https://www.kaggle.com/muralimunna18/india-population | Population of India by year.
exchange | https://www.kaggle.com/rohithbollareddy/foreign-exchange-in-india-yearlysource-rbi | Exchange currencies by year.
covid2 | https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv | Covid cases and deaths by county in the USA.
wage | https://kaggle.com/lislejoem/us-minimum-wage-by-state-from-1968-to-2017 | USA minimum wage by state from 1968 to 2020.
market | https://raw.githubusercontent.com/selva86/datasets/master/MarketArrivals.csv | Indian markets quantity and price per year.
avocado | https://www.kaggle.com/neuromusic/avocado-prices | Avocado weekly 2018 retail scan data for national retail volume (units) and price.
suicides | https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016 | Worldwide suicide statistics per year.