=Paper= {{Paper |id=Vol-3163/paper4 |storemode=property |title=Detecting Temporal Dependencies in Data |pdfUrl=https://ceur-ws.org/Vol-3163/BICOD21_paper_5.pdf |volume=Vol-3163 |authors=Joaquin Cuomo,Hajar Homayouni,Indrakshi Ray,Sudipto Ghosh |dblpUrl=https://dblp.org/rec/conf/bncod/CuomoHRG21 }} ==Detecting Temporal Dependencies in Data== https://ceur-ws.org/Vol-3163/BICOD21_paper_5.pdf
Detecting Temporal Dependencies in Data
Joaquin Cuomo¹, Hajar Homayouni², Indrakshi Ray¹ and Sudipto Ghosh¹
¹ Department of Computer Science, Colorado State University
² Department of Computer Science, San Diego State University


                                             Abstract
                                             Organizations collect data from various sources, and these datasets may have characteristics that are unknown. Selecting
                                             the appropriate statistical and machine learning algorithm for data analytical purposes benefits from understanding these
                                             characteristics, such as whether a dataset contains temporal attributes. This paper presents a theoretical basis for automatically
                                             determining the presence of temporal data in a dataset given no prior knowledge about its attributes. We use a method to
                                             classify an attribute as temporal, non-temporal, or hidden temporal. A hidden (grouping) temporal attribute can only be
                                             treated as temporal if its values are categorized in groups. Our method uses a Ljung-Box test for autocorrelation as well as
                                             a set of metrics we proposed based on the classification statistics. Our approach detects all temporal and hidden temporal
                                             attributes in 15 datasets from various domains.

                                             Keywords
                                             Dataset Management Systems, Statistics, Temporal attribute detection, Autocorrelation.



1. Introduction                                                                                                       tributes to be analyzed by the experts. Moreover, even
                                                                                                                      domain experts may not be aware of temporal dependen-
Datasets can be temporal or non-temporal. A dataset                                                                   cies among a subset of attributes in a big dataset. An ex-
is temporal if one or more attributes is a time sequence                                                              ample is a health data warehouse to which temporal and
[1]. An example of a temporal dataset is a stock market                                                               non-temporal data is automatically loaded from multiple
dataset, in which each value of an attribute corresponds                                                              source hospitals through an automated Extract, Trans-
to the daily stock price. Time series normally present                                                                form, Load (ETL) process [10]. Every patient can have
a time-dependency, meaning that a value is dependent                                                                  a set of temporally dependent records, such as records
on its past values. Time-series analysis has applications                                                             related to their lab tests. Explicit temporal information,
ranging from stock market prediction to digital signal                                                                such as a timestamp that identifies when data is captured
processing and has been studied in statistics [1], econo-                                                             as well as attribute names that indicate temporal char-
metrics [2], and in communications [3].                                                                               acteristics may change through the ETL transformation.
   Data analysis techniques depend on the type of data.                                                               For example, the name of the Patient_Height attribute
Techniques for non-temporal data, such as Support Vec-                                                                may change into a random name through the transfor-
tor Machine (SVM) [4] and Isolation Forest (IF) [5] only                                                              mation process. This data modification can make the
discover associations among attributes of individual data                                                             temporal nature of the target attribute unknown to the
records and cannot be used for analyzing time-series data                                                             researchers who are using the data for making critical
because associations may exist among multiple records in                                                              decisions on disease, treatments, and medications.
a time series [6]. Other approaches, such as Autoregres-                                                                 To the best of our knowledge, there is no prior attempt
sive Integrated Moving Average (ARIMA) [7] and Long Short-Term                                                        on the detection of temporal dependencies in datasets.
Memory (LSTM) [8], are more suitable for either pre-                                                                  Such dependencies are presumed to be known before-
diction or optimization for temporal data analysis [9]                                                                hand, which works only for well-understood datasets.
techniques.                                                                                                           However, where domain experts lack adequate knowl-
   It is critical to understand the existence of temporal                                                             edge about the data characteristics, there is a need to
dependencies in a dataset in advance in order to choose                                                               automatically detect whether or not a dataset is temporal
the best analysis approach. Existing analysis approaches                                                              in order to choose the right technique and have a fully
rely on domain experts to identify the type of data and to                                                            automated process. Our work fills this gap.
choose appropriate techniques to model the data. How-                                                                    We developed a method to determine whether or not
ever, in big datasets, there can be a large number of at-                                                             a dataset contains temporal attributes. Moreover, our
                                                                                                                      approach automatically identifies grouping attributes.
BICOD21: British International Conference on Databases, December 9–10, 2021, London, UK
Email: jcuomo@colostate.edu (J. Cuomo); hhomayouni@sdsu.edu (H. Homayouni); iray@colostate.edu (I. Ray); ghosh@colostate.edu (S. Ghosh)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

A grouping attribute is an attribute by which we can group the dataset records so that temporal attributes emerge within each group, even when they are not temporal across the ungrouped dataset. A dataset may have one or more grouping attributes. The proposed algorithm is based on a portmanteau test [1] for autocorrelation to
determine the presence of temporal data. To find group-        of the series with a delayed copy of itself. It gives critical
ing attributes that yield temporal sequences we use a          information on whether a value in the series can be used
brute force approach by testing each unique value as           to infer information about another value. A common way
a possible grouping attribute. Finally, we propose met-        to analyze temporal data is to create a model that fits the
rics that help determine whether the result of the port-       data, and the most widespread technique is regression
manteau test should be accepted or rejected based on           analysis, which uses autocorrelation [12]. Therefore, au-
an integrated perspective of the dataset. We evaluated         tocorrelation is going to be the most important metric
the proposed method on fifteen datasets, where each            to determine if a dataset has or does not have temporal
attribute was given a priori classification by domain ex-      dependence.
perts. We demonstrated that our approach was able to
discover all the temporal attributes.                          2.2. Testing Autocorrelation
   The paper is organized as follows. Section 2 presents
the theoretical background that forms the basis of our     Most of the literature on autocorrelation of time-series is
work. Section 3 discusses our proposed approach to de-     about evaluating the fitness of an autoregressive model,
tect temporal dependencies in datasets. Sections 4 and 5   which is done by analyzing the autocorrelation of the
describe our experiments and results using 15 different    model’s residuals. However, because we do not have
datasets. Section 7 concludes the paper.                   prior knowledge about the data we are unable to ap-
                                                           ply these methods which require certain assumptions
                                                           [13]. The most popular methods are Ljung-Box [14], Box-
2. Background                                              Pierce [15] and others like Breusch–Godfrey [16], Daniel-
                                                           Peña [17] and Monte-Carlo [18] which overcomes some
In this section we provide some background on time se- of the limitations of the first two [14, 15] but are more fo-
ries analysis and autocorrelation theory, which is needed cused on time-series model’s residuals. Both Ljung-Box
to understand the proposed method.                         and Box-Pierce methods are portmanteau tests which
allow testing the autocorrelation of a time series at multiple lags at the same time. The null hypothesis of the test is that the data is independently distributed, while the alternative hypothesis is that the data exhibits serial correlation at some lag up to the maximum tested lag. The distribution of the test statistic asymptotically approaches a $\chi^2$ distribution, and rejecting the null hypothesis indicates that there is autocorrelation in our data.

2.1. Time Series

A time series is a sequence of observations equally spaced and ordered by time [11]. Normally, these observations are not independent from each other because their relative order is important. This non-independence means that there is a temporal dependence implying that future values are influenced by past values. The classical approach for analyzing temporal series is to consider them as a combination of four components. This combination can be additive or multiplicative:

$$X_t = \mathit{Trend} + \mathit{Seasonal} + \mathit{Cyclical} + \mathit{Irregular} \qquad (1)$$

$$X_t = \mathit{Trend} \cdot \mathit{Seasonal} \cdot \mathit{Cyclical} \cdot \mathit{Irregular} \qquad (2)$$

where $X_t$ is a temporal series.

   The method that we used is Ljung-Box, which is a modification of Box-Pierce whose statistic is better approximated by a $\chi^2$ distribution [14]. The formula is:

$$Q(m) = n(n+2) \sum_{k=1}^{m} \frac{r_k^2}{n-k} \qquad (3)$$

where $n$ is the number of samples, $m$ is the maximum lag to test for autocorrelation, and $r_k$ is the sample autocorrelation at lag $k$.
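As an illustration of the Ljung-Box statistic of Eq. 3, the following minimal sketch computes Q(m) with the Python standard library only. The function names `autocorr` and `ljung_box_q`, the synthetic series, and the choice m = 10 are assumptions made for this example and are not taken from the paper. The rejection threshold 18.307 is the 95% quantile of a $\chi^2$ distribution with 10 degrees of freedom.

```python
import random

def autocorr(x, k):
    """Sample autocorrelation r_k of the sequence x at lag k."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:          # constant sequence: treat r_k as 0
        return 0.0
    num = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    return num / denom

def ljung_box_q(x, m):
    """Ljung-Box statistic Q(m) = n(n+2) * sum_{k=1..m} r_k^2 / (n-k)."""
    n = len(x)
    return n * (n + 2) * sum(autocorr(x, k) ** 2 / (n - k) for k in range(1, m + 1))

# 95% quantile of the chi-square distribution with 10 degrees of freedom.
CHI2_95_DF10 = 18.307

random.seed(0)
# Hypothetical data: white noise versus a noisy linear trend.
noise = [random.gauss(0, 1) for _ in range(200)]
trend = [0.1 * t + random.gauss(0, 1) for t in range(200)]

q_noise = ljung_box_q(noise, 10)
q_trend = ljung_box_q(trend, 10)
print(f"Q(noise) = {q_noise:.1f}, Q(trend) = {q_trend:.1f}")
print("reject independence for trend:", q_trend > CHI2_95_DF10)
```

The trended series yields a statistic far above the critical value, so the null hypothesis of independence is rejected for it. A library implementation would normally also report a p-value; the fixed critical value is used here only to keep the sketch self-contained.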
   A secular trend (Trend in Eqs. 1 and 2) describes the          The degree of freedom of the 𝜒 2 , when there is no
consistent tendency of the data over a long period. A          other information about the data, should be equal to
seasonal variation (Seasonal in Eqs. 1 and 2) describes the    the number of lags up to where the autocorrelation
periodic fluctuation within cycles. The cyclical component     is being tested. The choice of lag is difficult when
(referred to as Cyclical in Eqs. 1 and 2) describes            no information about the data is known. The higher
longer periodic fluctuations. The irregular component          the lag the lower the performance of the test. Also,
(Irregular in Eqs. 1 and 2) describes small changes that       the lag should be a fraction of the sequence length.
are unpredictable.                                             For example, the Stata implementation [19] uses the
   A time series is said to be stationary if its statistical   rule of m=min(n/2,40), while Box et al. [20] suggest
properties do not change over time, that is, if it has con-    m=20, and Tsay [21] suggests m=ln(n) warning that
stant mean and variance, and covariance is independent         when seasonal behavior is expected, this behavior
of time.                                                       needs to be taken into consideration and lag values
   Finally, autocorrelation is a measure of the similarity     at multiples of the seasonality are more important.
of the observations at certain lag, that is, the correlation   Escanciano and Lobato [22] present a portmanteau test
                                                                   We proposed an algorithm that aims to detect the data
                                                                with temporal dependency. In order to do this, we split
                                                                the algorithm in two stages, A and B, as shown in Figure
                                                                2.




                                                                Figure 2: High-level overview of the proposed method.

Figure 1: Example of groups created by filtering by the at-
tribute county’s values. Left shows the entire dataset. Right      In stage A, we do nested iterations over all the numeric
shows the groups. This dataset has no obvious temporal de-      attributes and all their unique values. We group the
pendent data until we do the grouping by ‘county’. Only then,   dataset by those values and classify all other attributes
‘deaths’ and ‘cases’ have temporal dependency.                  as time-dependent or not. As an example, using dataset
                                                                from Figure 1, while we are at the iteration of the attribute
                                                                ‘county’, we group by ‘county’, and for each group, we
that automatically chooses the lag.                             classify the other attributes (‘date’, ‘cases’, and ‘deaths’)
                                                                as temporal or not. The following pseudo code describes
                                                                the process, which computes a set of metrics we analyze
                                                                in stage B using a decision tree to determine the temporal
attributes.

  for each attribute A do
      for each unique value x of A do
          smallDB = SELECT * WHERE A = x;
          classification(smallDB);
      end
  end

3. Our Approach
Based on possible temporal characteristics, we categorized datasets into three types.

     • No temporal dependence: Given a dataset with no temporal information, no autocorrelation is expected.
     • A continuous evenly-sampled time-ordered
       dataset: Given a dataset that corresponds to a
       single time window, we can detect the temporal           The classification part of stage A is diagrammed in
       dependence by computing the autocorrelation of        Figure 3. It consists of analyzing a single attribute and
       each attribute over the entire dataset.               determining if it has autocorrelation. We do a Ljung-Box
     • Temporal dependence within a grouping at-             test to detect statistically significant autocorrelation. In
       tribute: There is no observable temporal depen-       parallel, we apply a threshold (0.5 in our examples) to
       dence when the dataset is considered as a whole,      determine if the autocorrelation is also quantitatively
       but the temporal dependence becomes appar-            significant for the specific posterior use of the dataset. If
       ent when grouped by some attribute. In such           both tests pass, we consider the sequence to have tempo-
       a case, we can detect the temporal dependence by      ral dependency.
       computing the autocorrelation of each attribute          The metrics output by stage A consist of a table
       within each group. Finding the proper grouping        showing statistics of all the classifications when grouping
       attribute is the main challenge in this case.         the dataset by each attribute. The rows are the attributes
                                                             of the dataset and the columns are the metrics described
   Figure 1 exemplifies the third case, where a dataset in Table 1. To address the first and second types of dataset
may have hidden temporal dependencies that are uncov- described at the beginning of this section, we add a row
ered once the proper attribute is used to form groups. On consisting of no-grouping-by-any-attribute, where we
the left, the entire dataset does not exhibit any autocorre- show the classification of the attributes if no grouping is
lation for any of the attributes. On the other hand, on the done. As an example of how the metrics are computed,
right, after grouping by attribute ‘county’ the attributes let us consider the dataset from Figure 1. First, we group
‘deaths’ and ‘cases’ correspond to temporal series.          by ‘date’ and classify each attribute as temporal or not.
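The nested iteration of stage A and the two-part classification (Ljung-Box significance plus an autocorrelation threshold of 0.5) can be sketched as follows. This is a minimal illustration under several assumptions that the paper does not state: rows are already in time order, the maximum lag is fixed at 10, the thresholded quantity is the largest absolute autocorrelation over the tested lags, and groups too short to test are labelled non-temporal. The names `is_temporal` and `stage_a` and the toy dataset are hypothetical.

```python
from collections import defaultdict

M = 10                 # maximum lag tested (an assumption)
CHI2_CRIT = 18.307     # 95% chi-square quantile for 10 degrees of freedom
CORR_THRESHOLD = 0.5   # threshold value used in the text's examples

def autocorr(x, k):
    """Sample autocorrelation of x at lag k (0 for a constant sequence)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return 0.0
    return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / denom

def is_temporal(x, m=M):
    """Both checks must pass: Ljung-Box significance and autocorrelation size."""
    n = len(x)
    if n <= m + 1:                      # too short to test m lags
        return False
    r = [autocorr(x, k) for k in range(1, m + 1)]
    q = n * (n + 2) * sum(rk ** 2 / (n - k) for k, rk in enumerate(r, start=1))
    return q > CHI2_CRIT and max(abs(rk) for rk in r) >= CORR_THRESHOLD

def stage_a(rows, numeric_attrs):
    """For each attribute A and each unique value x of A, classify the other attributes."""
    results = {}
    for a in rows[0]:
        groups = defaultdict(list)
        for row in rows:                # smallDB = SELECT * WHERE A = x
            groups[row[a]].append(row)
        for x, small_db in groups.items():
            results[(a, x)] = {
                b: is_temporal([r[b] for r in small_db])
                for b in numeric_attrs if b != a
            }
    return results

# Toy dataset: two counties whose daily records are interleaved.
rows = []
for day in range(40):
    rows.append({"county": "A", "fips": 1, "cases": day})
    rows.append({"county": "B", "fips": 2, "cases": 1000 - day})

results = stage_a(rows, numeric_attrs=["fips", "cases"])
print(results[("county", "A")])
```

In this toy dataset, grouping by `county` or by `fips` exposes `cases` as temporal, while grouping by the values of `cases` produces single-record groups that cannot be tested.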
Our Algorithm                                                In this case, in none of the groups the attributes were
                                                             classified as temporal. Next, we group by ‘county’ and
                                                          used, as they are considered not representative. Similarly,
                                                          the attributes that produced only one group (or none) are
                                                          discarded as they do not produce multiple groups with
                                                          temporal dependence. Based on the definition of the met-
                                                          rics, only groups with some autocorrelation are being
                                                          counted. For example, in Figure 6, when grouping by
                                                          the ‘date’ attribute, none of the resulting groups presents
                                                          autocorrelation. As a result, the group count metric for
Figure 3: Diagram of proposed classification algorithm.   ‘date’ is equal to 0. Next, the average count of attributes
                                                          with detected autocorrelation is used to discard attributes,
                                                          where the larger is preferred (as far as the standard de-
classify ‘cases’ and ‘deaths’ as temporal (‘date’ is not viation is small). This condition is the primary metric
considered as it is not a numerical attribute). As some to analyze the attributes, as it values more the groups
particular counties might have only ‘cases’ being clas- that in average have more attributes with autocorrelation.
sified as temporal, the average of attributes detected as Additionally, the average of the autocorrelation of each
temporal when grouping by ‘county’ is less than 2. Figure group is evaluated and those with the highest values are
4 shows the average is 1.93 in the resulting table.       considered.
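The filtering and ranking steps of stage B can be sketched as a small function over the metrics table. The decision tree of Figure 5 is not reproduced here, so the cut-off values `min_pct_data` and `max_std`, the tie-breaking order, and all numbers in the example table are assumptions chosen for illustration (only the average of 1.93 for ‘county’ appears in the text).

```python
def stage_b(table, min_pct_data=50.0, max_std=0.5):
    """Rank candidate grouping attributes from a stage A metrics table.

    table maps an attribute name to a dict with the keys
    pct_data, groups, avg_temp_att, std and avg_corr.
    """
    # Discard attributes that cover little data or yield at most one group.
    kept = {a: m for a, m in table.items()
            if m["pct_data"] >= min_pct_data and m["groups"] > 1}
    # Keep attributes whose count of temporal attributes is stable across groups.
    kept = {a: m for a, m in kept.items() if m["std"] <= max_std}
    # Prefer a larger average count, then a larger average autocorrelation.
    return sorted(kept,
                  key=lambda a: (kept[a]["avg_temp_att"], kept[a]["avg_corr"]),
                  reverse=True)

# Hypothetical metrics table for a dataset shaped like the one in Figure 1.
table = {
    "date":   {"pct_data": 0.0,  "groups": 0,  "avg_temp_att": 0.0,  "std": 0.0,  "avg_corr": 0.0},
    "county": {"pct_data": 95.0, "groups": 50, "avg_temp_att": 1.93, "std": 0.25, "avg_corr": 0.90},
    "state":  {"pct_data": 98.0, "groups": 20, "avg_temp_att": 1.50, "std": 0.70, "avg_corr": 0.85},
    "fips":   {"pct_data": 95.0, "groups": 50, "avg_temp_att": 1.93, "std": 0.25, "avg_corr": 0.88},
}

ranking = stage_b(table)
print(ranking)
```

With these illustrative values, ‘date’ is discarded because it yields no groups with autocorrelation, ‘state’ because of its large standard deviation, and the remaining attributes are ranked by average autocorrelation.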


Table 1
Metrics
 Name           Description
 % data         Percentage of records from groups with at least one attribute classified as temporal over the entire dataset.
 groups         Count of groups with at least one attribute classified as temporal.
 avg_temp_att   Average of the count of attributes classified as temporal over the groups.
 std            Standard deviation of avg_temp_att.
 avg_corr       Average of maximum autocorrelations over all groups. Maximum values are calculated within a group, over all attributes classified as temporal.
 max_corr       Maximum autocorrelation over all attributes and groups.
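To make the definitions in Table 1 concrete, the sketch below computes the six metrics for one candidate grouping attribute from per-group classification results. The input layout, the function name `group_metrics`, and the example values are assumptions. In particular, the averages are taken over groups that have at least one temporal attribute, which is one reading of the statement that only groups with some autocorrelation are counted, and the population standard deviation is used.

```python
from statistics import mean, pstdev

def group_metrics(groups):
    """Compute the Table 1 metrics for one grouping attribute.

    groups: non-empty list of (n_records, {attribute: (is_temporal, autocorrelation)}).
    """
    total = sum(n for n, _ in groups)
    max_corr = max(c for _, res in groups for _, c in res.values())
    # Groups with at least one attribute classified as temporal.
    hits = [(n, res) for n, res in groups if any(t for t, _ in res.values())]
    if not hits:
        return {"pct_data": 0.0, "groups": 0, "avg_temp_att": 0.0,
                "std": 0.0, "avg_corr": 0.0, "max_corr": max_corr}
    counts = [sum(t for t, _ in res.values()) for _, res in hits]
    best = [max(c for t, c in res.values() if t) for _, res in hits]
    return {
        "pct_data": 100.0 * sum(n for n, _ in hits) / total,
        "groups": len(hits),
        "avg_temp_att": mean(counts),
        "std": pstdev(counts),
        "avg_corr": mean(best),
        "max_corr": max_corr,
    }

# Hypothetical classification results for three groups.
example = [
    (10, {"cases": (True, 0.9),  "deaths": (True, 0.8)}),
    (10, {"cases": (True, 0.7),  "deaths": (False, 0.2)}),
    (5,  {"cases": (False, 0.1), "deaths": (False, 0.3)}),
]
metrics = group_metrics(example)
print(metrics)
```

Here two of the three groups contain a temporal attribute, so 20 of the 25 records (80%) count toward % data and the average number of temporal attributes is 1.5.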




                                                                         Figure 5: Decision tree to analyze results.


   The final result consists of the attributes with temporal dependence, along with the percentage of groups in which each attribute was detected as temporal. If the percentage is lower than 50%, we do not consider that attribute as temporal for further analysis. Figure 6 shows an example of this result when grouping by the ‘county’ attribute. In this example, both the ‘cases’ and ‘deaths’ attributes have autocorrelation and were detected as temporal in more than 50% of the groups. As a result, both attributes are considered temporal if we group the dataset by
Figure 4: Example of the analysis of the metrics using exam- ‘county’.
ple from Figure 1.
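The 50% rule applied to the final result can be sketched in a few lines. The function name `temporal_attributes`, the input layout (one dictionary of labels per group), and the example labels are hypothetical.

```python
def temporal_attributes(group_labels, min_share=0.5):
    """Return the attributes labelled temporal in at least min_share of the groups.

    group_labels: list with one {attribute: bool} dict per group.
    """
    n_groups = len(group_labels)
    counts = {}
    for labels in group_labels:
        for attr, flag in labels.items():
            counts[attr] = counts.get(attr, 0) + (1 if flag else 0)
    shares = {attr: c / n_groups for attr, c in counts.items()}
    # Attributes below the threshold are not considered temporal.
    return {attr: s for attr, s in shares.items() if s >= min_share}

# Hypothetical labels for four groups obtained by grouping on 'county'.
labels = [
    {"cases": True, "deaths": True,  "fips": False},
    {"cases": True, "deaths": True,  "fips": False},
    {"cases": True, "deaths": False, "fips": False},
    {"cases": True, "deaths": True,  "fips": False},
]
selected = temporal_attributes(labels)
print(selected)
```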


   Stage B consists of analyzing the metrics from the
                                                                         4. Experiments
resulting table to determine if grouping by attributes gen- We conduct different experiments to show how the met-
erates temporal sequences. We designed a decision tree, rics we defined can help determine if there is temporal
shown in Figure 5 to guide the analysis of the table. The dependence in the dataset. We run the algorithm against
tree first discards attributes with small percentage of data
Figure 6: An example of attributes with temporal dependence.

   Figure 7 shows the scores for each dataset. Each row corresponds to a dataset, where the first five (election, incomes, countries, biomechanical, and crime) do not have temporal dependence. The next ten have temporal dependence, but only the last five have grouping attributes that produce temporal sequences (covid2, wage, market, avocado, and suicides), while the five in the middle do not (covid1, energy1, yahoo, traffic, india). Each of these cases is indicated by column ‘case’ and the numbers 0, 1, 2
15 datasets, five for each of the three categories men-        correspond to the same order of categories explained in
tioned in Section 3. For each of these categories, we          Section 3. The columns ‘FP’, ‘TP’, ‘FN’, and ‘TN’ count the
explain one dataset in detail as an example. In the follow-    number of attributes that have been classified as temporal
ing, we describe the research questions that we answer         or not and are false positive, true positive, false negative,
through our experiments.                                       and true negative respectively. The columns ‘ACC’ and
   Q1: Can the proposed approach correctly identify attributes   ‘F1’ are the accuracy and the F1-score of the classification
with temporal dependence in the datasets?                      respectively. Next, ‘# temp att detected’ is the ratio of
   We answer this question using domain knowledge.             the attributes that were correctly detected over the total
Domain experts label an attribute as positive or negative      number of temporal attributes. The next three columns
depending on whether or not the attribute has temporal         refer to the detection of the grouping attributes, where
dependency. We construct a contingency matrix and              ‘grouping’ indicates if the dataset has one or more group-
calculate the accuracy (eq. 5) and the F1-score (eq. 4)        ing attributes, ‘grouping detected’ if the algorithm found
for each dataset. We include the accuracy because the          any, and ‘contingency’ specifies the type of error or suc-
F1-score is not applicable for the cases where there are       cess. Finally, the bottom part of the table summarizes
no temporal attributes.                                        the ‘contingency’ column and shows the F1-score for the
detection of grouping attributes.
   For case 0, in two of the five datasets, there were attributes that were detected as temporal. As there is no temporal attribute, the F1-score is not applicable for this case, so only the accuracy should be taken into account.

$$F1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \qquad (4)$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$
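Eqs. 4 and 5 translate directly into code. In the sketch below, the function names are hypothetical, the F1-score is returned as `None` when a dataset has no temporal attributes (the case the text describes as not applicable), and the counts in the example are illustrative rather than taken from Figure 7.

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN), as in Eq. 5."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """F1 = TP / (TP + (FP + FN) / 2), as in Eq. 4.

    Returns None when there are no temporal attributes (TP + FN == 0),
    since the score is not applicable in that case.
    """
    if tp + fn == 0:
        return None
    return tp / (tp + 0.5 * (fp + fn))

# Illustrative counts: a dataset without temporal attributes in which
# two of five numerical attributes are wrongly flagged as temporal.
print(accuracy(tp=0, tn=3, fp=2, fn=0))
print(f1_score(tp=0, fp=2, fn=0))
```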
    Q2: Can the approach correctly identify grouping at-       In the entire table, these are the only two cases with ac-
tributes to form multiple temporally dependent sequences?      curacy and F1-score (when applicable) lower than 1. In
    Using same evaluation metrics, we analyzed if the ap-      none of the datasets, an attribute was falsely considered
proach can correctly identify attributes by which we           as a grouping attribute. Moreover, in all datasets with
can group the dataset records into multiple temporally-        grouping attributes, those attributes were successfully
dependent sequences. For each dataset, we first identify       found, yielding a F1-score of 1.
if it has such an attribute. Typically, there are multiple
possible grouping attributes. For example, in a dataset        5.1. No Temporal Dependencies Datasets
containing information about suicides all over the world,
a grouping attribute can be each country, but at the same      To exemplify this case, we have the ‘elections’ dataset,
time, there could be trends related to other attributes,       which consists of reported votes by county in the gov-
such as gender. Therefore, we do not do a specific anal-       ernor race in the US elections 2020 (Figure 8). It has
ysis on each of the possible grouping attributes, but we       1025 entries, 2 non-numeric attributes, and 3 numerical
limit the analysis to the existence or not of any. The         attributes, none of which has temporal dependence.
F1-score is calculated for all the datasets.                      Figure 9 shows that no autocorrelation was found, as
    All the datasets used in this study are publicly avail-    expected.
able and were picked to exemplify various categories              One of the limitations of using autocorrelation, as we
(Appendix B).                                                  will discuss in Section 6.1, is that other types of relation-
                                                               ships can also produce correlation. To illustrate this, we
                                                               used the ‘biomechanical’ dataset (Figure 10), for which
5. Results                                                     there are two false positives based on the result table
                                                               of Figure 7. The dataset consist of six biomechanical at-
In this section, we first present the summary of the results   tributes derived from the shape and orientation of the
for each dataset. Then, we explore with more details           pelvis and lumbar spine of 310 patients. Despite the lack
some examples for each specific case.                          of temporal dependence in the data, the results, as shown
                                                               in Figure 11, indicates the presence of autocorrelation
Figure 7: Results summary for all datasets. Cases 0, 1, and 2 correspond to no temporal information, temporal information without grouping, and temporal information within grouping attributes, respectively.



                                                              as expected are both the number of cases and deaths, as
                                                              shown in Figure 14.

                                                              5.3. Temporal Dependencies within Grouping Attributes
Figure 8: US elections dataset                            To illustrate this case, we used the ‘covid2’ dataset from
                                                          Figure 15, which consists of daily deaths and positive
                                                          cases of COVID-19 by county in the United States. There
                                                          are two differences between ‘covid1’ and ‘covid2’ datasets.
                                                          First, there are two non-numeric attributes corresponding
                                                          to counties and states, which can be used to establish a
                                                          geographical relation between the records. Second, there
                                                          is a numerical attribute ‘FIPS’, which is a code to identify
                                                          counties and states. Therefore, we expect this attribute
                                                          not to have autocorrelation, but to be a potential grouping
Figure 9: US elections dataset grouping attribute results attribute.
                                                             The results in Figure 16 show that if we do not split the
                                                          dataset in groups, none of the attributes can be consid-
                                                          ered to have temporal dependence. Instead, if we group
in 2 out of the 5 numerical attributes, namely, ‘lumbar
                                                          by ‘county’ or ‘fips’ there are 2 attributes in average with
lordosis angle’ and ‘degree spondylolisthesis’.
                                                          autocorrelation. Despite that grouping by the attributes
                                                          ‘date’, ‘state’, ‘cases’, ‘fips’ have non-zero values in the re-
5.2. No-grouping Temporal Datasets                        sulting table, the percentage of used data is low. Thus, we
The ‘covid1’ dataset, shown in Figure 12 has daily infor- ignore grouping by these attributes. Figure 17 shows, that
mation about positive cases and deaths caused by COVID- when grouping by ‘county’ the attributes ‘cases’, ‘deaths’,
19 in the United States.                                  and ‘fips’ could be considered as temporal sequences.
   The results in Figure 13 shows that no-grouping is the Nevertheless, we discard ‘fips’ as it has an occurrence of
best option, which is correct as none of the attributes approx. 22%, which is lower than our defined threshold
allows to form groups. The detected temporal attributes, of 50%.
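
The grouping analysis described in this section can be illustrated with a short Python sketch. It
is a minimal example under stated assumptions and is not the implementation referenced in
Appendix A: the function names, the use of 10 lags, the minimum group length of 30 samples, and
the reading of "occurrence" as the percentage of tested groups whose Ljung-Box statistic exceeds
the 5% chi-square critical value (18.307 for 10 degrees of freedom) are all choices made for this
example. The input rows are synthetic.

```python
from collections import defaultdict

# 95th percentile of the chi-square distribution with 10 degrees of freedom.
# Assumption: it must be changed together with the number of lags.
CHI2_CRIT_DF10 = 18.307


def ljung_box_q(series, lags=10):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..lags} r_k^2 / (n-k)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0  # constant series: no evidence of autocorrelation
    q = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation at lag k.
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q


def grouped_autocorrelation(rows, group_key, value_key,
                            lags=10, crit=CHI2_CRIT_DF10, min_len=30):
    """Split rows by group_key and test value_key within each group.

    Hypothetical helper: returns three illustrative metrics for one
    (grouping attribute, numeric attribute) pair.
    """
    groups = defaultdict(list)
    for row in rows:  # rows are assumed to be in temporal order
        groups[row[group_key]].append(row[value_key])
    # Groups that are too short for the test are left out of the analysis.
    tested = {g: v for g, v in groups.items() if len(v) >= min_len}
    flagged = [g for g, v in tested.items() if ljung_box_q(v, lags) > crit]
    used = sum(len(v) for v in tested.values())
    return {
        "groups_tested": len(tested),
        "pct_data_used": 100.0 * used / len(rows),
        "occurrence_pct": 100.0 * len(flagged) / len(tested) if tested else 0.0,
    }


# Synthetic stand-in for a county-level dataset: 'cases' grows over time
# within each county, while 'fips' is constant within a county.
rows = []
for t in range(40):
    rows.append({"county": "A", "fips": 1001, "cases": t * t})
    rows.append({"county": "B", "fips": 1003, "cases": 3 * t})

cases_metrics = grouped_autocorrelation(rows, "county", "cases")
fips_metrics = grouped_autocorrelation(rows, "county", "fips")
print(cases_metrics)
print(fips_metrics)
```

With real data, an attribute would be kept as temporal for a given grouping only if its
occurrence reaches the chosen threshold (50% in the ‘covid2’ example), and groupings with a low
percentage of used data would be ignored.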
Figure 10: Biomechanical dataset

Figure 11: Biomechanical dataset temporal attribute results

Figure 12: Covid-19 in the US dataset

Figure 13: Covid-19 in the US dataset grouping attribute results

Figure 14: Covid-19 in the US dataset temporal attribute results

Figure 15: Covid-19 in the US by county and state dataset

Figure 16: Covid-19 in the US by county and state grouping attribute results

Figure 17: Covid-19 in the US by county and state temporal attribute results

6. Discussion and Future Research

We proposed an approach to classify datasets based on whether or not they contain temporally
dependent data. The core of our algorithm is based on the autocorrelation as the method to
determine if there is a temporal dependence in a section of the data. Our algorithm relies on a
set of proposed metrics to classify the dataset as a whole. Among these metrics, the percentage of
data used in the analysis, the number of groups, and the average number of autocorrelated sequences
found were the three that provided the most relevant information for making a decision. The other
metrics were not used in any of the examples, but we believe that they could come in handy for
larger datasets. For example, the standard deviation should not be too large, as that would mean
that there is a particular grouping attribute value with more temporal sequences than the rest,
which is probably the result of an outlier and should be handled carefully to avoid a false
positive. Both the average autocorrelation and the maximum autocorrelation are used as tiebreakers
when the other metrics have the same values.
   Our approach could identify temporal sequences both when the sequence corresponds to the entire
dataset and when grouping by attributes was needed. Typical datasets fall into both cases, meaning
that an attribute can present autocorrelation as a whole sequence and as multiple grouped
subsequences. The latter case is important because it allows the data analysis to be improved. For
example, if we have an outlier detection algorithm for temporal data, we may apply it to a single
sequence as well as to different subsequences constructed from the same data, to increase the
chance of detecting more outliers. Another use case is when the algorithm has high time
complexity. In such a case, it may be better to explore the outliers only in the smaller
subsequences rather than in the entire sequence.

6.1. Limitations

We identify the following scenarios where our approach might fail to detect temporal dependency in
the attributes.
     • Small sample size: when the number of samples is small, no statistical test will have
        enough significance.
     • Unevenly-sampled data: when there is no constant time-spacing between samples. If the
        uneven sampling is due to missing data points and the sample size is large enough, the
        approach should converge to the same values as if all data points were present. However,
        if there is no pattern in the sampling rate, different methods should be used to calculate
        the autocorrelation indirectly, such as estimating the autocorrelation using statistical
        approaches [23].
     • Missing values: when there are null values in the data. There are many methods [24] to
        overcome missing values in time-series data, and specifically for the Ljung-Box test [25].
        However, under the assumption that we do not have prior information on the data, none of
        these methods can be used.
     • Cross-sectional data: when there are dependencies other than temporal ones between
        attribute values. Even though autocorrelation is a necessary condition to exploit temporal
        data information, it is not a sufficient condition to determine if the data is temporal.
        For example, our method will fail when a dataset has correlations that are not temporal
        but spatial [26].
     • Non-stationarity: when time-series statistical properties vary over time. In such cases,
        the autocorrelation cannot be calculated using the mean and the variance but needs to be
        estimated. Methods similar to those used for missing values could be applied [27].

   The decision tree to analyze the metrics is currently not automated, as we require a higher
volume of use cases to generalize the rules. Similarly, tuning the hyperparameters, such as the
autocorrelation threshold we used to determine if an autocorrelation was significant, requires
more extensive analysis and more cases.

6.2. Future work

Statistical exploration and optimization. We will investigate whether different types of
correlations, such as Pearson, Kendall, Spearman, and estimation from the power spectral density,
can be used within the Ljung-Box or Box-Pierce test [28]. We will conduct an in-depth analysis of
which autocorrelation function to use when no prior information on the data is known. Currently,
the algorithm goes over all numeric attributes searching for autocorrelation. This is
time-consuming and should be improved if possible.

Working with categorical attributes. Datasets may consist of categorical attributes, such as
boolean labels, names, IDs, and dates. These attributes may be temporal as well. For example, a
positive value for a patient with a non-curable disease is unlikely to become negative in the
future. Thus, finding a way to process such attributes is important. We will use one-hot encoding
to pre-process the categorical attributes.

7. Conclusions

In this paper, we have presented a technique that uses autocorrelation to determine the presence
of temporal data within a database's attributes without any prior knowledge about the database.
The algorithm was tested on different databases, including those with and without temporally
dependent data, with a specific focus on databases containing hidden temporal groups. For these
cases, we proposed metrics to find the grouping attributes that
unveil such hidden groups. The results show that we were able to successfully classify attributes
as temporal or not, and also to find grouping attributes that form temporal groups. Finally, we
discussed the limitations of the approach and potential improvement paths.

Acknowledgments

This work was supported in part by funding from NSF under Award Numbers CNS 1822118, IIS 2027750,
OAC 1931363, Statnett, ARL, AMI, Cyber Risk Research, and NIST.

References

 [1] P. J. Brockwell, R. A. Davis, Introduction to Time Series and Forecasting, Springer, 2008.
 [2] H. Luetkepohl, M. Krätzig, In Applied Time Series Econometrics, Applied Time Series
     Econometrics (2004).
 [3] M. Allen, The SAGE Encyclopedia of Communication Research Methods, SAGE Publications, 2017.
 [4] Y. Chen, W. Wu, Application of One-class Support Vector Machine to Quickly Identify
     Multivariate Anomalies from Geochemical Exploration Data, Geochemistry: Exploration,
     Environment, Analysis 17 (2017) 231–238.
 [5] Z. Cheng, C. Zou, J. Dong, Outlier Detection Using Isolation Forest and Local Outlier Factor,
     in: Conference on Research in Adaptive and Convergent Systems, Association for Computing
     Machinery, 2019, pp. 161–168.
 [6] H. Lu, Y. Liu, Z. Fei, C. Guan, An Outlier Detection Algorithm based on Cross-Correlation
     Analysis for Time Series Dataset, IEEE Access 6 (2018) 53593–53610.
 [7] P. M. Maçaira, A. M. T. Thomé, F. L. C. Oliveira, A. L. C. Ferrer, Time Series Analysis with
     Explanatory Variables: A Systematic Literature Review, Environmental Modelling & Software
     107 (2018) 199–209.
 [8] Y. Yu, X. Si, C. Hu, J. Zhang, A Review of Recurrent Neural Networks: LSTM Cells and Network
     Architectures, Neural Computation 31 (2019) 1235–1270.
 [9] W. Lin, M. Orgun, G. Williams, An overview of temporal data mining (2019).
[10] H. Homayouni, S. Ghosh, I. Ray, An Approach for Testing the Extract-Transform-Load Process in
     Data Warehouse Systems, in: Proceedings of the 22nd International Database Engineering and
     Applications Symposium, IDEAS 2018, Association for Computing Machinery, 2018, pp. 236–245.
[11] W. Wei, Time Series Analysis: Univariate and Multivariate Methods, volume 33, 1989.
[12] A. Dotis-Georgiou, Autocorrelation in Time-series Data, 2019. URL:
     https://www.influxdata.com/blog/autocorrelation-in-time-series-data/, InfluxData article,
     accessed 25th July 2021.
[13] G. Maddala, Introduction to Econometrics, Wiley, 2001.
[14] G. Ljung, G. Box, On a Measure of Lack of Fit in Time Series Models, Biometrika 65 (1978).
[15] G. Box, D. Pierce, Distribution of Residual Autocorrelations in Autoregressive-Integrated
     Moving Average Time Series Models, Journal of the American Statistical Association 72 (1970)
     397–402.
[16] D. Scott, Applied Econometrics with R by Christian Kleiber, Achim Zeileis, International
     Statistical Review 77 (2009) 164–164.
[17] D. Peña, J. Rodríguez, A Powerful Portmanteau Test of Lack of Fit for Time Series, Journal of
     the American Statistical Association 97 (2002) 601–610.
[18] J.-M. Dufour, L. Khalaf, Monte Carlo Test Methods in Econometrics, 2007.
[19] S. Documentation, wntestq Portmanteau (Q) Test Description, 2019. URL:
     http://www.stata.com/manuals13/tswntestq.pdf, accessed on 24th July 2021.
[20] E. Ziegel, G. Box, G. Jenkins, G. Reinsel, Time series analysis, forecasting, and control,
     Technometrics 37 (1995) 238.
[21] R. Tsay, Analysis of Financial Time Series. Financial Econometrics, 2002.
[22] J. C. Escanciano, I. N. Lobato, An automatic Portmanteau Test for Serial Correlation, Journal
     of Econometrics 151 (2009) 140–149.
[23] K. Rehfeld, N. Marwan, J. Heitzig, J. Kurths, Comparison of Correlation Analysis Techniques
     for Irregularly Sampled Time Series, Nonlinear Processes in Geophysics 18 (2011) 389–404.
[24] I. Pratama, A. E. Permanasari, I. Ardiyanto, R. Indrayani, A Review of Missing Values
     Handling Methods on Time-series Data, in: 2016 International Conference on Information
     Technology Systems and Innovation (ICITSI), 2016, pp. 1–6.
[25] D. Stoffer, C. Toloi, A Note on the Ljung–Box–Pierce Portmanteau Statistic with Missing Data,
     Statistics & Probability Letters 13 (1992) 391–396.
[26] A. Zuur, Spatial Correlation, 2019. URL:
     http://userwww.sfsu.edu/efc/classes/biol710/spatial/spat-auto.htm, San Francisco State
     University article, accessed on 24th June 2021.
[27] G. P. Nason, R. Von Sachs, G. Kroisandt, Wavelet processes and adaptive estimation of the
     evolutionary wavelet spectrum, Journal of the Royal Statistical Society Series B 62 (2000)
     271–292.
[28] C. Chatfield, The Analysis of Time Series: An Introduction, Fourth Edition, Chapman & Hall/CRC
     Texts in Statistical Science, CRC Press, 1989.

A. Code

The code used for this paper is available on GitHub:
https://github.com/JCuomo/TemporalDependenceDB

B. Datasets

     • elections
       https://www.kaggle.com/unanimad/us-election-2020
       “governors county” file.
       General information about reporting votes to governor race by county.
     • incomes
       https://www.kaggle.com/jonavery/incomes-by-career-and-gender
       American citizens' incomes from 2015 broken into male and female statistics.
     • countries
       https://www.kaggle.com/fernandol/countries-of-the-world
       Information on population, region, area size, infant mortality and more.
     • biomechanical
       https://www.kaggle.com/uciml/biomechanical-features-of-orthopedic-patients
       Patient data of six biomechanical attributes derived from the shape and orientation of the
       pelvis and lumbar spine.
     • crime
       https://www.kaggle.com/mascotinme/population-against-crime
       FBI crime statistics for 2012 on population less than 250,000.
     • covid1
       https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv
       Covid cases and death statistics for USA.
     • energy
       This dataset is proprietary and cannot be distributed.
       Daily energy delivery by Fort Collins power facility.
     • yahoo
       https://webscope.sandbox.yahoo.com/
       “A3Benchmark all” file
       Real and synthetic time-series. The synthetic dataset consists of time-series with varying
       trend, noise and seasonality. The real dataset consists of time-series representing the
       metrics of various Yahoo services.
     • india
       https://www.kaggle.com/muralimunna18/india-population
       Population of India by year.
     • exchange
       https://www.kaggle.com/rohithbollareddy/foreign-exchange-in-india-yearlysource-rbi
       Exchange currencies by year.
     • covid2
       https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
       Covid cases and death by County in the USA.
     • wage
       https://kaggle.com/lislejoem/us-minimum-wage-by-state-from-1968-to-2017
       USA minimum wage by State from 1968 to 2020.
     • market
       https://raw.githubusercontent.com/selva86/datasets/master/MarketArrivals.csv
       Indian markets quantity and price per year.
     • avocado
       https://www.kaggle.com/neuromusic/avocado-prices
       Avocado weekly 2018 retail scan data for National retail volume (units) and price.
     • suicides
       https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
       Worldwide suicide statistics per year.