<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Template-based Time Series Generation with Loom</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lars Kegel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Hahmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang Lehner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Dresden, 01062 Dresden</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Time series analysis and forecasting are important techniques for decision-making in many domains. They are typically evaluated on given sets of time series that have a constant size and specified characteristics. Synthetic datasets are relevant because they are flexible in both size and characteristics. In this demo, we present our prototype Loom, which generates datasets with respect to the user's configuration of categorical information and time series characteristics. The prototype allows for the comparison of different analysis techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>Time series analysis</kwd>
        <kwd>Data generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Time series describe the dynamic behavior of a monitored object, parameter, or process over time and are one of the most popular and useful data types. They can be found in a multitude of application domains, e.g. as item sales in commerce, as various sensor readings in manufacturing processes, or as demand and production figures in the energy domain. Obviously, this makes them a valuable source for diverse data analysis techniques, such as forecasting [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This holds especially in the domain of renewable energy production, where the fluctuating character of renewable energy sources makes accurate forecasts vital in order to match electricity production and demand. Further applications on time series data include querying, classification, efficient storage, and much more. The ubiquity of this data type and the ongoing trend for data collection, storage, and analysis have led to a substantial amount of research that is dedicated to the handling and processing of large sets of time series data. While all these research endeavors can differ greatly with respect to their individual goals and application scenarios, they have one thing in common: they require large amounts of time series data in order to evaluate, verify, and optimize their findings. Although there are many stakeholders that have a substantial interest in using and exploiting time series, acquiring sophisticated data is not easy. Basically, there are two sources: first, public open repositories or single datasets [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which are tailored to specific applications and only offer a small selection; second, "real" data owned by companies or organizations that is sometimes made available to partners in the context of closed research projects but rarely to the general public. Moreover, obtaining real data can be tedious due to the time and cost necessary to collect it [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Based on our own experience, we can state that some data allowing basic evaluations is always available. The situation normally becomes problematic when scalability, versatility, and robustness have to be examined. These require a more versatile selection of data, containing datasets with varying size, time series length, trends, seasonality, or just a different blend of time series characteristics. In general, such data is not available, which often leads researchers to use workarounds to create more data, e.g. duplication to increase the number of time series or their length.
      </p>
      <p>To cope with this problem, we demonstrate Loom, a user-friendly and flexible approach to generating sets of time series for the evaluation of arbitrary analysis techniques or the benchmarking of time series management systems. Loom stands for the process of weaving datasets of arbitrary size from different time series generators. In addition, users can generate categorical information to structure the time series hierarchically. Thus, they form a data cube that can be explored by usual OLAP queries, such as roll up or drill down. Generated datasets can be directly exported to relational databases or flat file formats in order to easily utilize them in different applications. The usage of Loom is template-driven at its core. This means that given datasets can be analyzed in order to extract a template containing their defining characteristics. These templates are then used to create different variants of datasets that are still similar to the template. This approach eases the application of our tool as users do not have to specify a completely synthetic time series model. In addition, this mechanism offers a certain degree of anonymization for otherwise closed data.</p>
      <p>In the remainder of this article, we present a general system overview in Section 2, before we describe our demonstration in detail in Section 3. Previous work related to dataset generation and time series models is presented in Section 4, before concluding remarks and pointers to future work are given in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
      <p>
        The main workflow of Loom is depicted in Figure 1 and shows all steps necessary to create a set of time series. In this section, we give an outline of the idea behind each step before we describe its implementation in Section 3.
Template creation. This optional step at the beginning of the workflow allows the user to upload and analyze given time series data in order to create templates that can be used during the later steps of the data generation process. Currently, Loom employs three types of template creation: (1) If present, existing hierarchies of categorical information are extracted and stored; in order to anonymize the data, the original attributes are replaced with synthetic ids. (2) Given time series are extracted and taken as samples for time series generation. (3) The whole set of time series is analyzed to create a template that represents its characteristics and can be used to create multiple datasets that are similar to the original. While the first two types are straightforward, the third one is more complex. In order to create the described template, Loom uses an approach based on the hierarchical divisive analysis clustering (DIANA) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. With this method, the dataset is partitioned into groups of similar time series. From each of these clusters, the time series with the lowest average distance to the remaining members is selected as a prototype. By fitting an analytical time series model, e.g. ARIMA, to this time series, a generator that represents the characteristics of its underlying cluster is created. This generator can be used to create multiple variations of the original time series. To complete the template, the size of each group in relation to the whole dataset is stored. With the collected information it is possible to create multiple datasets that are different but still share the characteristic time series of the original and their distribution.
      </p>
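      <p>As an illustration of this step, the following Python sketch clusters a dataset, selects each cluster's medoid as a prototype, and fits an ARIMA generator to it. It is a minimal sketch under stated assumptions: scikit-learn's agglomerative clustering stands in for DIANA, statsmodels provides the ARIMA fit, and all names are our own, not Loom's API.</p>
      <preformat>
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering
from statsmodels.tsa.arima.model import ARIMA

def build_template(series, n_clusters, order=(1, 0, 1)):
    """Cluster the dataset, pick each cluster's medoid as prototype,
    and fit one ARIMA generator per cluster (sketch, not Loom's API)."""
    X = np.asarray(series)
    dist = squareform(pdist(X))                      # pairwise distances
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    ).fit_predict(dist)
    template = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        sub = dist[np.ix_(members, members)]
        medoid = members[sub.mean(axis=1).argmin()]  # lowest avg. distance
        model = ARIMA(X[medoid], order=order).fit()  # cluster generator
        template.append((model, len(members) / len(X)))  # keep group share
    return template
      </preformat>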
      <p>
        Data cube modelling. Usually, a set of time series does not only contain sequences of measured values, but also categorical information, e.g. geography, purpose, or color. Sets of these attributes are organized in hierarchies, called dimensions, while sets of these dimensions form the skeleton of a data cube. The goal of this step is to allow users the configuration and creation of such cubes. In short, data cubes can be formally described as follows [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]:
      </p>
      <p>A data cube skeleton consists of a set of dimensions. A dimension is a lattice of levels $L = \{l_1, l_2, \ldots, l_m\}$. The constraint of this lattice states that the values of a level, called category attributes, functionally determine the values of its parent level, e.g., $l' \to l''$. More formally, for each level $l$, $l'$, $l''$ of the same dimension:
$$l \to l \quad \text{(reflexivity)}$$
$$l \to l' \wedge l' \to l \;\Rightarrow\; l = l' \quad \text{(antisymmetry)}$$
$$l \to l' \wedge l' \to l'' \;\Rightarrow\; l \to l'' \quad \text{(transitivity)}$$
For the sake of simplicity, we implemented totally ordered dimensions in Loom; this will be discussed in Section 5. A total order satisfies the following additional condition:
$$l \to l' \vee l' \to l \quad \text{(totality)}$$
As an example, Figure 2 shows the totally ordered dimension Geography of Australia with the two levels State and Region. The category attributes of Region functionally determine the category attributes of State, e.g. Melbourne and Ballarat determine Victoria.</p>
      <p>Three parameters are necessary to configure a data cube skeleton: the number of dimensions, the number of levels per dimension, and the outdegree of a category attribute. While the first two parameters are straightforward, the last one needs more explanation. The outdegree of a category attribute defines the number of subcategories within a category and thus describes the branching between the levels of a dimension. To illustrate this, consider the example from Figure 2, where we observe an outdegree of 2. This means a category attribute on the State level, e.g. New South Wales, is related to two category attributes on the lower Region level, e.g. Sydney and Blue Mountains. Category attributes that have no further subcategories are called base category attributes and form the leaves of a dimension's hierarchy. Thus, a data cube skeleton is the cross-product of all base category attributes of all dimensions.</p>
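      <p>A minimal sketch of how a skeleton could be generated from these three parameters; the function names and the nested-list representation are our own illustration, not Loom's implementation:</p>
      <preformat>
from itertools import product

def build_dimension(levels, outdegree, prefix="cat"):
    """One totally ordered dimension as a list of levels; each category
    attribute branches into `outdegree` children on the next level."""
    hierarchy = [[prefix + "0"]]                       # single root attribute
    for _ in range(levels):
        parents = hierarchy[-1]
        hierarchy.append([p + "." + str(i)
                          for p in parents for i in range(outdegree)])
    return hierarchy

def skeleton(dimensions, levels, outdegree):
    """Cross-product of all base category attributes of all dimensions."""
    dims = [build_dimension(levels, outdegree, prefix="d" + str(d))
            for d in range(dimensions)]
    return list(product(*(dim[-1] for dim in dims)))   # base attributes only

cells = skeleton(dimensions=2, levels=2, outdegree=2)  # 4 x 4 = 16 base cells
      </preformat>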
      <p>
        In the energy domain, categorical information is also beneficial because forecast models that are built for categories may lead to more accurate and robust forecast results. For instance, the Irish Smart Metering Project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] gathers time series from smart meters in over 5,000 Irish households and businesses. In a survey, the owners provide additional information, such as social class, house type, and age of the house. We identify 8 dimensions, each consisting of one dimension level. Thus, forecast models can be created for individual time series or for time series aggregated along one or more dimensions.
      </p>
      <p>Time series generation. The data cube skeleton that was configured and created in the previous step must now be filled with facts, which in the case of our system are time series. In the following, we again give a quick formalization of time series and explain what is necessary to configure their creation. A time series is a sequence of successive observations $x_t$ ($1 \le t \le T$) recorded at specified time instances $t$. For this demonstration, we assume that observations are complete and equidistant, i.e. there exists an observation for every time instance and all time instances have the same distance. For configuration, a time frame must be defined that consists of a start and an end time instance as well as the distance between time instances. In addition, one or more measure columns must be defined, depending on whether univariate or multivariate time series should be generated. To fill these measures with actual values, our system uses time series models and samples.</p>
      <p>As an example, the user can create a synthetic model with a base component $b_t$, a season component $s_t$, and an error component $e_t$:
$$x_t = b_t \cdot s(t \bmod L) + e_t$$
The season length is given by $L$. The component $s_t$ is a seasonal mask of length $L$; thus, the weights repeat every $L$ time instances. The error component $e_t$ is normally distributed, $e_t \sim N(0, \sigma^2)$, and overlays the "perfect" model $x_t^* = b_t \cdot s(t \bmod L)$. The standard deviation $\sigma$ depends on the user's accuracy expectation, expressed as the mean absolute percentage error (MAPE):
$$\mathrm{MAPE} = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{x_t - x_t^*}{x_t^*} \right| = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{e_t}{b_t \cdot s(t \bmod L)} \right|$$
The error distribution is calculated such that the user-given MAPE holds on average over the whole time series.</p>
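      <p>To make the calibration concrete, here is a minimal Python sketch of this synthetic model; the function names and the linear-trend example are our own illustration, not Loom's API. It chooses $\sigma$ so that the expected MAPE of the generated series matches the user's target.</p>
      <preformat>
import numpy as np

def generate_series(T, base, seasonal_mask, target_mape, rng=None):
    """x_t = b_t * s(t mod L) + e_t, with sigma calibrated such that the
    expected MAPE equals target_mape (sketch, not Loom's API)."""
    rng = rng or np.random.default_rng()
    L = len(seasonal_mask)
    t = np.arange(T)
    perfect = base(t) * seasonal_mask[t % L]   # x*_t = b_t * s(t mod L)
    # For e_t ~ N(0, sigma^2), E|e_t| = sigma * sqrt(2/pi); hence calibrate
    # sigma so that the expected mean of |e_t| / |x*_t| hits the target MAPE.
    sigma = target_mape / (np.sqrt(2.0 / np.pi) * np.mean(1.0 / np.abs(perfect)))
    return perfect + rng.normal(0.0, sigma, size=T)

# Example: linear base, sinusoidal weekly mask, 5% expected MAPE.
series = generate_series(
    T=365,
    base=lambda t: 100.0 + 0.1 * t,
    seasonal_mask=1.0 + 0.3 * np.sin(2 * np.pi * np.arange(7) / 7),
    target_mape=0.05,
)
      </preformat>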
      <p>
        Datasets may also be created from a template. During template creation, a set of ARIMA models is created based on user-given data. Synthetic time series are generated by those ARIMA models, incorporating normally distributed errors or errors sampled from the template [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Time series mapping. The mapping links time series to the data cube, i.e. it populates the skeleton with facts. As manual insertion of thousands of time series into a large data cube would not be feasible, Loom features an automatic mapping that randomly distributes the generated time series over the data cube skeleton. The user may set the model frequency by weighting each time series model. If templates are used, their distribution is used as the default.
      </p>
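      <p>A minimal sketch of such a weighted random mapping, assuming hypothetical names of our own rather than Loom's internals:</p>
      <preformat>
import random

def map_series_to_cube(base_cells, generators, weights, rng=None):
    """Assign each base cell of the cube skeleton one synthetic series,
    drawing the generator by its user-set weight (sketch only)."""
    rng = rng or random.Random()
    mapping = {}
    for cell in base_cells:
        generator = rng.choices(generators, weights=weights, k=1)[0]
        mapping[cell] = generator()   # generator() yields one series
    return mapping
      </preformat>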
      <p>
        Dataset export. After the configuration and mapping steps are done, the actual data is created and can be exported in a suitable format for further use. Our application offers two different export destinations: file or database. File export offers general formats like CSV and SQL script, allowing the use of our generated data in almost every application. In addition, we offer export as RData containing a data.table for the popular statistical workbench R [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The database export transforms the created data into fact tables that can be imported into any RDBMS. These tables have one time column and at least one measure column; categorical information may be stored in fact tables as well as in dimension tables. As we deal with a high amount of structured data, it is necessary to bring the dataset into an appropriate schema depending on the chosen export option.
      </p>
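      <p>For the universal schema, a minimal sketch of the file export could look as follows; the helper name and column layout are our own illustration of the redundancy described above, not Loom's exact output format:</p>
      <preformat>
import csv

def export_universal_csv(path, mapping, dimension_names):
    """Write every observation together with its full categorical context
    (universal schema: redundant, but loadable almost anywhere)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([*dimension_names, "time", "measure"])
        for cell, series in mapping.items():
            for t, value in enumerate(series):
                writer.writerow([*cell, t, value])
      </preformat>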
    </sec>
    <sec id="sec-3">
      <title>3. DEMONSTRATION</title>
      <p>This section demonstrates the usage of Loom. Upon starting the application, the user sets a workspace directory that is used for configuration files and generated datasets. Below, we describe the configuration of a template, a dataset, and further generation steps. These steps correspond to the workflow shown in Figure 1.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Template configuration</title>
      <p>The optional template configuration annotates a user-given dataset with information that is needed later for the template creation. The user inputs a CSV file and annotates each column with the corresponding semantics (time, measure, or category). Moreover, the time column needs information about the time type (integer, date, time) and the respective format. As shown in Figure 3, the user configures dimensions by drag and drop of the category names. Optionally, the user may indicate the primary attribute, which is the lowest-level category in every dimension. Once this configuration is finished, the template may be used for time series and/or data cube generation.</p>
      <p>After login, the user is greeted by an overview window that displays all datasets that he/she has already generated and those that are queued for generation, see Figure 4. Clicking "Create new" opens a wizard dialog that guides the user through the configuration. The first input by the user is a dataset name, which is needed to reference and handle the configuration and its results.</p>
      <sec id="sec-4-1">
        <title>3.2.1 Data cube configuration</title>
        <p>The next dialog, Figure 5, allows the configuration of the data cube. Loom offers two ways of configuration: template and synthetic.</p>
        <p>Template configuration: The user can select a template as the basis for the data cube skeleton. Templates are derived from real-world datasets or from existing data cubes and are ready-made configurations featuring default values for the number of dimensions, number of levels, and outdegrees. Thus, this type of configuration is more user-friendly than the synthetic one. Users can still customize the configuration by making selective adjustments to the template. It is possible to create different variants, e.g. smaller, larger, highly branched, etc., of an existing data cube.</p>
        <p>Synthetic configuration: Alternatively, this type of configuration allows users the full manual customization of the data cube skeleton. Users specify the number of dimensions, the number of levels per dimension, and the outdegree per level. These parameters can be provided fixed for each element or for the whole cube. In addition, Loom offers a random parameter distribution, allowing a probabilistic setting with randomly structured dimensions.</p>
        <p>Both template and synthetic configurations employ a random distribution of outdegrees to a certain extent. This means the specific number of facts that can be accommodated by the data cube skeleton is not known during configuration. As the user must know the actual structure of the data cube in order to configure an appropriate number of time series, our system offers a preview of the data cube skeleton directly after configuration. This preview is depicted in Figure 6 and shows all category attributes ordered by their size. The user may restart the modeling or accept the generated result.</p>
        <p>
          In many benchmarks such as TPC-H [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], it is common to set a scale factor SF for the database size, such as 10, 30, or 100. Loom adopts this functionality by using a scale dimension that consists of exactly one level with SF category attributes and can be used to adapt the size of the data cube as desired. Thus, the resulting data cube is inflated by this factor.
        </p>
        <p>While the parameters and configuration types described in this section provide users with a versatile and comfortable way of modeling the categorical information of a data cube, its configuration is not mandatory. If a user wishes for a plain set of unlabeled data, only a primary attribute is generated to identify the created time series.</p>
        <p>Time series configuration is split into three separate dialogs: time attribute, measure attributes, and time series models. The time attribute needs parameters such as the time type and the timeframe, i.e. start time, granularity, and end time.</p>
        <p>In the measure dialog, the user sets the number of measure columns of the dataset. For each measure attribute, the user sets a data type. Usually, a measure is of double-precision floating-point format, but in order to decrease the dataset size in a database, the user can also set single-precision floating-point or integer format.</p>
        <p>Most importantly, time series models have to be added to the configuration. For this, Loom offers four options:
Sampled from Template: All time series data is generated "as is" by taking values from given time series of existing datasets uploaded by users. According to the timeframe configuration, data is extracted from the original time series and added to the generated one. Mismatches with the timeframe or data cube size are resolved using duplication, cutting, granularity conversion, etc.</p>
        <p>
          Recombined from Template: Time series are created
via decomposition and recombination. Classical
decomposition strategies like decompose [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and stl [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
are used to extract the defining components trend, seasonality, and noise from an existing set of time series. By recombining these components, new time series are created and can be used to add volume and variety to a dataset.
        </p>
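        <p>A minimal sketch of such a recombination, assuming statsmodels' STL implementation and our own function name: it weaves the trend of one series with the seasonality and noise of another to obtain a new but similar series.</p>
        <preformat>
import numpy as np
from statsmodels.tsa.seasonal import STL

def recombine(series_a, series_b, period):
    """Decompose two series with STL and recombine their components
    (trend of A, seasonality and noise of B) into a new series."""
    dec_a = STL(np.asarray(series_a), period=period).fit()
    dec_b = STL(np.asarray(series_b), period=period).fit()
    return dec_a.trend + dec_b.seasonal + dec_b.resid
        </preformat>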
        <p>Modeled from Template: This option uses the third type of template we described in Section 2. Users can load the time series generators of an existing dataset along with their distribution. Customization is possible by changing the number of time series an individual generator creates or by removing/adding certain generators.</p>
        <p>Synthetic time series: The user configures a time series model from scratch, without relying on any given measures. Time series properties are defined freely and are synthetically generated, e.g. with a linearly rising trend, a regular Gauss-shaped seasonal component, and a normally distributed error series.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2.3 Export configuration</title>
        <p>In the last dialog, the user sets the export configuration. The export destination has different options. CSV, SQL, and RData create flat files within the workspace. Alternatively, time series are exported to a database via a JDBC driver; in this case, the user sets up a database connection with a database location and login credentials. Finally, the user sets the schema. Loom supports different schemas: (1) Basic unnormalized export in a Universal schema, which creates high redundancy as it stores the categorical information with each value of a time series. (2) The partly resp. fully normalized Star and Snowflake schemas, which allow more compact exports and are common in database design. (3) The Parent-Child schema, which stores each functional dependency as a pair of parent attribute and child attribute in the respective dimension table.</p>
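        <p>For illustration, the Parent-Child schema can be sketched as follows; the helper and the Geography example from Figure 2 are our own, hypothetical rendering of the idea:</p>
        <preformat>
def parent_child_rows(hierarchy):
    """Emit one (parent, child) row per functional dependency in a
    dimension, given a dict mapping each attribute to its children."""
    for parent, children in hierarchy.items():
        for child in children:
            yield (parent, child)

# Example: the Geography dimension from Figure 2.
geography = {
    "New South Wales": ["Sydney", "Blue Mountains"],
    "Victoria": ["Melbourne", "Ballarat"],
}
rows = list(parent_child_rows(geography))
# [("New South Wales", "Sydney"), ("New South Wales", "Blue Mountains"), ...]
        </preformat>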
        <p>Particular attention has to be paid to the export when dimensions are unbalanced, i.e. their base categories are not on the same level. Figure 7 shows an example of such a dimension, where base category attributes are either on the region level (Darwin, Alice Springs) or on the state level (Australian Capital Territory). While both the universal schema and the parent-child schema support unbalanced dimensions, the star and snowflake schemas only support unbalanced dimensions when referential integrity can be guaranteed. To achieve this, the user may add a primary attribute column and a minimum outdegree. If the user did not set a data cube skeleton, then a primary attribute is automatically generated.</p>
        <p>Closing the configuration window brings the user back to the initial overview where a new entry has been added, see Figure 4. Now, the generation process can be invoked by clicking "Start Selected". The state switches to "In Progress" and indicates the current amount of data that has been generated.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. RELATED WORK</title>
      <p>Workload generation has been studied in many papers, each of which focuses either on the generation process or on time series modelling. To our knowledge, Loom is the first application that integrates both techniques and allows for flexible data cube and time series characteristics. In the following, we present selected sources that relate to our prototype.</p>
      <p>
        The IDAS dataset generator [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] offers data generation based on statistical distributions. Attributes form dependency graphs that are not necessarily lattices. The goal is the creation of a synthetic dataset for testing data mining algorithms. The workflow is similar to Loom's since the user creates a dataset by specifying the number of tables, setting the attributes, and initiating the data generation. Moreover, the authors experienced similar shortcomings with real data, such as privacy issues, a lack of training data, or unsatisfying categorical information. Still, measures do not depend on time; thus, time series generation is not supported.
      </p>
      <p>
        Schaffner and Januschowski [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] focus on benchmarking of databases under varying request rates. Request rates can be seen as time series of aggregate tenant traces. Since there is not enough real data available, they provide two methodologies for generating synthetic tenant traces. (1) The modelling approach fits a function as a model for a given aggregate tenant trace; the function's shape has been determined empirically. By adding an error term, they create diversity among the synthetic tenant traces. (2) Another way is the decomposition of time series by bootstrapping: a given trace is split into windows that are randomly shifted, resulting in synthetic traces. Both approaches are similar to Loom's template creation in that synthetic time series are either modelled or recombined from a template.
      </p>
      <p>
        The F-TPC-H benchmark [
        <xref ref-type="bibr" rid="ref7">7</xref>
          ] is a modified TPC-H benchmark for time series generation. This work reuses the given TPC-H schema in that customers submit orders of products for a certain quantity. While this quantity does not depend on time in TPC-H, the modified F-TPC-H adds this dependency in order to represent trend and seasonal effects via ARIMA models. Thus, this work represents a subset of Loom because it consists of a given schema and allows for synthetic time series in the sales domain. Loom additionally supports schema flexibility and allows for composing different time series generators.
      </p>
      <p>
        A specific use case for managing energy data is given by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This work proposes a unified Data Warehouse schema for storing workloads, given as information about actors like producers and consumers, offers, and time series about past measures. Their time series schema involves measures of different types such as energy, power, and price. Categorical information is necessary in order to store special annotations such as the aggregation level of a time series or additional information for each time series type. A time series is represented by several tables: (1) a time series table stores the primary attribute that identifies the time series and that links to each category, (2) another dimension table stores time frames with an identifier and the respective time frame information, (3) the fact table itself consists of the primary attribute, a measure column, and a foreign key to the time frame. Thus, this schema is not a traditional star or snowflake schema and cannot directly be covered by Loom. Moreover, Loom keeps time and measure together as a fact. This may increase redundancy, but we opt for this solution for several reasons: (1) there is no need for an additional description of time frames, (2) a time frame is encoded either as an integer or a short string, so the storage space is still affordable, (3) no join operation is needed in order to retrieve a time series. After all, time series from the energy domain may be generated by models integrated in Loom.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5. CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper, we introduced Loom as a tool for generating large sets of synthetic time series data. Our prototype utilizes different time series generators to create multiple time series that share certain characteristics. In addition, our prototype allows the creation of dimensional categorical information for the description of time series. Besides the fully manual definition of a dataset, Loom features a template-driven approach that analyzes given datasets and allows the creation of synthetic variants of this template data.</p>
      <p>
        Currently, our approach only generates complete time series with equidistant time stamps. This is because forecast methods like exponential smoothing [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ARIMA [
        <xref ref-type="bibr" rid="ref5">5</xref>
          ] rely on these properties, with few exceptions such as [
        <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Part of our future work will be the integration of functions for generating incomplete time series with configurable gap patterns.
      </p>
      <p>Regarding data cube modelling, we assume that a dimension is a totally ordered set of levels, which is the case in most real-world datasets. However, there are exceptions, such as the modelling of a time dimension with the levels day, week, and month. There, a day functionally determines a week and a month, but a week does not determine the month. Such lattices are not supported by our prototype.</p>
      <p>Further future work will focus on the time series mapping to the data cube. Right now, we use a very simple approach for this and randomly distribute our time series over the data cube skeleton. This approach will be replaced with a more sophisticated method that allows the configuration of the distribution, such that, for example, time series from a certain generator only occur in a specified subset of the data cube skeleton.</p>
    </sec>
    <sec id="sec-7">
      <title>REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] CER Smart Metering Project. http://www.ucd.ie/issda/data, 2010.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] TPC Benchmark H. http://www.tpc.org/TPC_Documents_Current_Versions/pdf/tpch2.17.1.pdf, 2014.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] R data.table package. https://cran.r-project.org/web/packages/data.table/index.html, 2015.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] R forecast package. https://cran.r-project.org/web/packages/forecast/index.html, 2015.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. E. P.</given-names>
            <surname>Box</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Jenkins</surname>
          </string-name>
          .
          <article-title>Time series analysis forecasting and control</article-title>
          .
          <source>Holden-Day</source>
          , San Francisco,
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Cleveland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Cleveland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>McRae</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Terpenning.</surname>
          </string-name>
          <article-title>STL: A Seasonal-Trend Decomposition Procedure Based on Loess</article-title>
          .
          <source>Journal of O cial Statistics</source>
          ,
          <volume>6</volume>
          :3{
          <fpage>73</fpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>U.</given-names>
            <surname>Fischer</surname>
          </string-name>
          .
          <source>Forecasting in database systems</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>U.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Lehner</surname>
          </string-name>
          .
          <article-title>F2DB: The Flash-Forward Database System</article-title>
          .
          <source>In ICDE</source>
          , pages
          <volume>1245</volume>
          {
          <fpage>1248</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Holt</surname>
          </string-name>
          .
          <article-title>Forecasting trends and seasonal by exponentially weighted averages</article-title>
          .
          <source>O ce of Naval Research Memorandum</source>
          ,
          <volume>52</volume>
          ,
          <year>1957</year>
          . Reprinted in:
          <source>International Journal of Forecasting</source>
          ,
          <volume>20</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hyndman</surname>
          </string-name>
          .
          <article-title>Time series data library</article-title>
          . http://data.is/TSDLdemo. Accessed on 9-24-15.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Jeske</surname>
          </string-name>
          et al.
          <article-title>Generation of synthetic data sets for evaluating the accuracy of knowledge discovery systems</article-title>
          .
          <source>In Proc. of KDD</source>
          , pages
          <volume>756</volume>
          {
          <fpage>762</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaufman</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          . Finding Groups in Data. Wiley,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kendall</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Stuart</surname>
          </string-name>
          .
          <source>The Advanced Theory of Statistics</source>
          , volume
          <volume>3</volume>
          .
          <string-name>
            <surname>Gri</surname>
            <given-names>n</given-names>
          </string-name>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Scha</surname>
          </string-name>
          ner and
          <string-name>
            <given-names>T.</given-names>
            <surname>Januschowski</surname>
          </string-name>
          .
          <article-title>Realistic tenant traces for enterprise DBaaS</article-title>
          .
          <source>In Workshops Proc. of ICDE</source>
          , pages
          <volume>29</volume>
          {
          <fpage>35</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Shumway</surname>
          </string-name>
          and
          <string-name>
            <surname>D. S.</surname>
          </string-name>
          <article-title>Sto er</article-title>
          .
          <source>Time Series Analysis and Its Applications</source>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Siksnys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomsen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen. MIRABEL DW</surname>
          </string-name>
          <article-title>: managing complex energy data in a smart grid</article-title>
          .
          <source>In Proc. of DaWaK</source>
          , pages
          <volume>443</volume>
          {
          <fpage>457</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <article-title>Modeling multidimensional databases, cubes and cube operations</article-title>
          .
          <source>In Proc. of SSDBM</source>
          , pages
          <volume>53</volume>
          {
          <fpage>62</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>