=Paper=
{{Paper
|id=Vol-1558/paper18
|storemode=property
|title=Template-based Time Series Generation with Loom
|pdfUrl=https://ceur-ws.org/Vol-1558/paper18.pdf
|volume=Vol-1558
|authors=Lars Kegel,Martin Hahmann,Wolfgang Lehner
|dblpUrl=https://dblp.org/rec/conf/edbt/KegelHL16
}}
==Template-based Time Series Generation with Loom==
Template-based Time Series Generation with Loom

Lars Kegel, Martin Hahmann, Wolfgang Lehner
Technische Universität Dresden, 01062 Dresden, Germany
{firstname.lastname}@tu-dresden.de

ABSTRACT

Time series analysis and forecasting are important techniques for decision-making in many domains. They are typically evaluated on given sets of time series that have a constant size and specified characteristics. Synthetic datasets are relevant because they are flexible in both size and characteristics. In this demo, we present our prototype Loom, which generates datasets according to the user's configuration of categorical information and time series characteristics. The prototype allows for the comparison of different analysis techniques.

Categories and Subject Descriptors

I.6.7 [Simulation and modeling]: Simulation Support Systems; H.2.8 [Database management]: Database Applications—Statistical databases

Keywords

Time series analysis, Data generation

1. INTRODUCTION

Time series describe the dynamic behavior of a monitored object, parameter, or process over time and are one of the most popular and useful data types. They can be found in a multitude of application domains, e.g. as item sales in commerce, various sensor readings in manufacturing processes, or as demand and production in the energy domain. Obviously, this makes them a valuable source for diverse data analysis techniques, such as forecasting [8], especially in the domain of renewable energy production, where the fluctuating character of renewable energy sources makes accurate forecasts vital in order to match electricity production and demand. Further applications on time series data include querying, classification, efficient storage and much more. The ubiquity of this data type and the ongoing trend for data collection, storage and analysis have led to a substantial amount of research that is dedicated to the handling and processing of large sets of time series data. While all these research endeavors can differ greatly with respect to their individual goals and application scenarios, they have one thing in common: they require large amounts of time series data in order to evaluate, verify, and optimize their findings.

Although there are many stakeholders that have a substantial interest in using and exploiting time series, acquiring sophisticated data is not easy. Basically, there are two sources: first, public open repositories or single datasets [10], which are tailored to specific applications and only offer a small selection; second, "real" data owned by companies and organizations that is sometimes made available to partners in the context of closed research projects but rarely to the general public. Moreover, obtaining real data can be tedious due to the time and cost that is necessary to collect it [11]. Based on our own experience, we can state that some data allowing basic evaluations is always available. This situation normally becomes problematic when scalability, versatility, and robustness have to be examined. These require a more versatile selection of data, containing datasets with varying size, time series length, trends, seasonality, or just a different blend of time series characteristics. In general, this is not available, which often leads to researchers using workarounds to create more data, e.g. duplication to increase the number of time series or their length.

To cope with this problem, we demonstrate Loom, a user-friendly and flexible approach to generate sets of time series for the evaluation of arbitrary analysis techniques or the benchmarking of time series management systems. Loom stands for the process of weaving datasets of arbitrary size from different time series generators.
In addition, users can generate categorical information to structure the time series hierarchically. Thus, they form a data cube that can be explored by the usual OLAP queries, such as roll-up or drill-down. Generated datasets can be directly exported to relational databases or flat file formats in order to easily utilize them in different applications. The usage of Loom is template-driven at its core. This means that given datasets can be analyzed in order to extract a template containing their defining characteristics. These templates are then used to create different variants of datasets that are still similar to the template. This approach eases the application of our tool as users do not have to specify a completely synthetic time series model. In addition, this mechanism offers a certain degree of anonymization for otherwise closed data.

© 2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

In the remainder of this article, we present a general system overview in Section 2, before we describe our demonstration in detail in Section 3. Previous work related to dataset generation and time series models is presented in Section 4, before the concluding remarks and pointers to future work are given in Section 5.

Figure 1: Workflow overview (template creation → data cube modelling → time series generation → time series mapping → dataset export, with the respective parameters: dataset, scale factor; dimensions; time frame, time series models; model frequency; destination, schema)

2. SYSTEM OVERVIEW

The main workflow of Loom is depicted in Figure 1 and shows all steps necessary to create a set of time series. In this section, we give an outline of the idea behind each step before we describe its implementation in Section 3.

Template creation. This optional step at the beginning of the workflow allows the user to upload and analyze given time series data in order to create templates that can be used during the latter steps of the data generation process. Currently, Loom employs three types of template creation: (1) If present, existing hierarchies of categorical information are extracted and stored. In order to anonymize the data, the original attributes are replaced with synthetic ids. (2) Given time series are extracted and taken as samples for time series generation. (3) The whole set of time series is analyzed to create a template that represents its characteristics and can be used to create multiple datasets that are similar to the original. While the first two types are straightforward, the third one is more complex. In order to create the described template, Loom uses an approach based on hierarchical divisive analysis clustering (DIANA) [12]. With this method, the dataset is partitioned into groups of similar time series. From each of these clusters, the time series with the lowest average distance to the remaining members is selected as a prototype. By fitting an analytical time series model, e.g. ARIMA, to this time series, a generator that represents the characteristics of its underlying cluster is created. This generator can be used to create multiple variations of the original time series. To complete the template, the size of each group in relation to the whole dataset is stored. With the collected information, it is possible to create multiple datasets that are different but still share the characteristic time series of the original and their distribution.
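To make the third template type concrete, the following Python sketch mirrors the described pipeline under stated assumptions: equal-length series, SciPy's agglomerative clustering standing in for DIANA (SciPy ships no divisive variant), and statsmodels' ARIMA as the analytical model. The function name extract_template and its parameters are ours for illustration, not Loom's API.

    # Sketch of Loom's third template type: cluster the dataset, pick a
    # prototype per cluster, fit an ARIMA model to it, and remember each
    # cluster's share of the dataset.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist, squareform
    from statsmodels.tsa.arima.model import ARIMA

    def extract_template(series, n_clusters=3, order=(1, 1, 1)):
        """series: 2-D array with one equal-length time series per row."""
        condensed = pdist(series)              # pairwise Euclidean distances
        labels = fcluster(linkage(condensed, method="average"),
                          t=n_clusters, criterion="maxclust")
        dist = squareform(condensed)
        template = []
        for c in np.unique(labels):
            members = np.where(labels == c)[0]
            # prototype: the member with the lowest average distance
            # to the remaining members of its cluster
            proto = members[np.argmin(dist[np.ix_(members, members)].mean(axis=1))]
            template.append({
                "generator": ARIMA(series[proto], order=order).fit(),
                "share": len(members) / len(series),  # cluster size ratio
            })
        return template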
Data cube modelling. Usually, a set of time series does not only contain sequences of measured values, but also categorical information, e.g. geography, purpose, or color. Sets of these attributes are organized in hierarchies, called dimensions, while sets of these dimensions form the skeleton of a data cube. The goal of this step is to allow users the configuration and creation of such cubes. In short, data cubes can be formally described as follows [17]: A data cube skeleton consists of a set of dimensions. A dimension is a lattice of levels L = {l_1, l_2, ..., l_m}. The constraint of this lattice states that the values of a level, called category attributes, functionally determine the values of its parent level, e.g. l' → l''. More formally, for each level l, l', l'' of the same dimension:

l → l (reflexivity)
l → l' ∧ l' → l ⇒ l = l' (antisymmetry)
l → l' ∧ l' → l'' ⇒ l → l'' (transitivity)

For the sake of simplicity, we implemented totally ordered dimensions in Loom; this will be discussed in Section 5. A total order has the following additional condition:

l → l' ∨ l' → l (totality)

As an example, Figure 2 shows the totally ordered dimension Geography of Australia with the two levels State and Region. The category attributes of Region functionally determine the category attributes of State, e.g. Melbourne and Ballarat determine Victoria.

Figure 2: Dimension with levels State and Region (Top → New South Wales, Victoria; New South Wales → Sydney, Blue Mountains; Victoria → Melbourne, Ballarat)

Three parameters are necessary to configure a data cube skeleton: the number of dimensions, the number of levels per dimension, and the outdegree of a category attribute. While the first two parameters are straightforward, the last one needs more explanation. The outdegree of a category attribute defines the number of subcategories within a category and thus describes the branching between the levels of a dimension. To illustrate this, we regard the example from Figure 2, where we observe an outdegree of 2. This means a category attribute on the State level, e.g. New South Wales, is related to two category attributes on the lower Region level, e.g. Sydney and Blue Mountains. Category attributes that have no further subcategories are called base category attributes and form the leaves of a dimension's hierarchy. Thus, a data cube skeleton is the cross-product of all base category attributes of all dimensions.

In the energy domain, categorical information is also beneficial because forecast models that are built for categories may lead to more accurate and robust forecast results. For instance, the Irish Smart Metering Project [1] gathers time series of smart meters in over 5,000 Irish households and businesses. In a survey, the owners give additional information, such as social class, house type and age of the house. We identify 8 dimensions, each consisting of one dimension level. Thus, forecast models can be created for individual time series or for aggregated time series along one or more dimensions.
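As an illustration of the skeleton construction described above, here is a minimal Python sketch that grows a totally ordered dimension top-down with a fixed outdegree and forms the skeleton as the cross-product of all base category attributes; the scale dimension discussed in Section 3.2.1 is included as an optional factor. All names are hypothetical.

    # Sketch of a data cube skeleton: each dimension is a totally ordered
    # list of levels, grown top-down with a fixed outdegree; the skeleton
    # is the cross-product of all base category attributes, optionally
    # inflated by a scale dimension with SF members.
    from itertools import product

    def build_dimension(name, n_levels, outdegree):
        # grow the dimension top-down; synthetic ids double as anonymization
        level = [name]                        # artificial "Top" attribute
        for _ in range(n_levels):
            level = [f"{parent}/{i}" for parent in level
                     for i in range(outdegree)]
        return level                          # base category attributes

    def cube_skeleton(dimensions, scale_factor=1):
        bases = [build_dimension(*dim) for dim in dimensions]
        if scale_factor > 1:                  # scale dimension: one level
            bases.append([f"SF{i}" for i in range(scale_factor)])
        return list(product(*bases))          # cross-product of base attrs

    # Geography with levels State and Region (outdegree 2) plus a second
    # one-level dimension: 4 * 2 * 3 = 24 base cells with scale factor 3
    cells = cube_skeleton([("Geo", 2, 2), ("Purpose", 1, 2)], scale_factor=3)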
Time series generation. The data cube skeleton that was configured and created in the previous step must now be filled with facts, which in the case of our system are time series. In the following, we again give a quick formalization of time series and explain what is necessary to configure their creation. A time series is a sequence of successive observations x_t (1 ≤ t ≤ T) recorded at specified time instances t. For this demonstration, we assume that observations are complete and equidistant, i.e. there exists an observation for every time instance and all time instances have the same distance. For configuration, a time frame must be defined that consists of the start and end time instance as well as the distance between time instances. In addition, one or more measure columns must be defined, depending on whether univariate or multivariate time series should be generated. To fill these measures with actual values, our system uses time series models and samples.

As an example, the user can create a synthetic model with a base b_t, season s_t and error component e_t:

x_t = b_t · s_(t mod L) + e_t

The season length is given by L. The component s_t is a seasonal mask of length L; thus, the weights repeat every L time instances. The error component e_t is normally distributed N(0, σ²) and overlays the "perfect" model x*_t = b_t · s_(t mod L). The standard deviation σ depends on the user's accuracy expectation, expressed as the mean absolute percentage error (MAPE):

MAPE = (1/T) Σ_{t=1..T} |(x*_t − x_t) / x*_t| = (1/T) Σ_{t=1..T} |e_t / (b_t · s_(t mod L))|

The error distribution is calculated such that the user-given MAPE holds on average over the whole time series.
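The paper does not spell out the calibration of σ; a minimal sketch, assuming normally distributed errors (for which E|e_t| = σ·sqrt(2/π)) and a strictly positive perfect model, is our reading of the "holds on average" requirement. The function name and parameters are hypothetical.

    # Sketch of the synthetic generator x_t = b_t * s_(t mod L) + e_t with
    # the error's standard deviation calibrated from a target MAPE:
    # E|e_t| = sigma * sqrt(2/pi), so solving
    # MAPE = sigma * sqrt(2/pi) * mean(1/x*_t) for sigma gives the line below.
    import numpy as np

    def synthetic_series(base, seasonal_mask, target_mape, rng=None):
        """base: positive trend values b_t; seasonal_mask: weights of length L."""
        rng = rng or np.random.default_rng()
        T, L = len(base), len(seasonal_mask)
        perfect = np.asarray(base) * np.array([seasonal_mask[t % L]
                                               for t in range(T)])
        sigma = target_mape / (np.sqrt(2 / np.pi) * np.mean(1.0 / perfect))
        return perfect + rng.normal(0.0, sigma, size=T)

    # linearly rising base, weekly seasonal mask, 5 % expected MAPE
    x = synthetic_series(np.linspace(100, 150, 364),
                         [0.8, 1.0, 1.1, 1.2, 1.1, 0.9, 0.9],
                         target_mape=0.05)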
Datasets may also be created from a template. During template creation, a set of ARIMA models is created that is based on the user-given data. Synthetic time series are then generated by those ARIMA models, incorporating normally distributed errors or errors sampled from the template [4].

Time series mapping. The mapping links time series to the data cube, i.e. it populates the skeleton with facts. As the manual insertion of thousands of time series into a large data cube would not be feasible, Loom features an automatic mapping that randomly distributes the generated time series over the data cube skeleton. The user may set the model frequency by weighting each time series model. If templates are used, their distribution is used as the default.
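One possible reading of this automatic mapping step as code, with hypothetical names: each cell of the skeleton draws its generator at random, weighted by the configured model frequencies (a template's cluster shares would serve as the default weights).

    # Sketch of the mapping step: assign one weighted-random generator
    # to every cell of the cube skeleton.
    import random

    def map_generators(cells, generators, weights, seed=0):
        rng = random.Random(seed)
        return {cell: rng.choices(generators, weights=weights, k=1)[0]
                for cell in cells}

    # roughly 70 % of the cells served by generator "g1", 30 % by "g2"
    cells = [("Sydney", "SF0"), ("Sydney", "SF1"),
             ("Melbourne", "SF0"), ("Melbourne", "SF1")]
    mapping = map_generators(cells, ["g1", "g2"], weights=[0.7, 0.3])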
Dataset export. After the configuration and mapping steps are done, the actual data is created and can be exported in a suitable format for further use. Our application offers two different export destinations: file or database. File export offers general formats like CSV and SQL script, allowing the use of our generated data in almost every application. In addition, we offer export as RData in the data.table format for the popular statistical workbench R [3]. The database export transforms the created data into fact tables that can be imported into any RDBMS. These tables have one time column and at least one measure column; categorical information may be stored in fact tables as well as in dimensional tables. As we deal with a high amount of structured data, it is necessary to bring the dataset into an appropriate schema depending on the chosen export option.

3. DEMONSTRATION

This section demonstrates the usage of Loom. Starting the application, the user sets a workspace directory that is used for configuration files and generated datasets. Below, we describe the configuration of a template, a dataset and the further generation steps. These steps correspond to the workflow shown in Figure 1.

3.1 Template configuration

The optional template configuration annotates a user-given dataset with information that is needed for the template creation. The user inputs a CSV file and annotates each column with the corresponding semantics (time, measure or category). Moreover, the time column needs information about the time type (integer, date, time) and the respective format. As shown in Figure 3, the user configures dimensions by drag and drop of the category names. Optionally, the user may indicate the primary attribute, which is the lowest-level category in every dimension. Once this configuration is finished, the template may be used for time series and/or data cube generation.

Figure 3: Template configuration

3.2 Dataset configuration

After login, the user is greeted by an overview window that displays all datasets that he/she has already generated and those that are queued for generation (Figure 4). Clicking "Create new" opens a wizard dialog that guides the user through the configuration. The first input by the user is a dataset name, which is needed to reference and handle the configuration and its results.

Figure 4: Dataset overview

3.2.1 Data cube configuration

The next dialog (Figure 5) allows the configuration of the data cube. Loom offers two ways of configuration: template and synthetic.

• Template configuration: The user can select a template as the basis for the data cube skeleton. Templates are derived from real-world datasets or from existing data cubes and are ready-made configurations featuring default values for the number of dimensions, the number of levels and the outdegrees. Thus, this type of configuration is more user-friendly than the synthetic one. Users can still customize the configuration by making selective adjustments to the template. It is possible to create different variants, e.g. smaller, larger, highly branched, etc., of an existing data cube.

• Synthetic configuration: Alternatively, this type of configuration allows users the full manual customization of the data cube skeleton. Users specify the number of dimensions, the number of levels per dimension and the outdegree per level. These parameters can be provided fixed for each element or for the whole cube. In addition, Loom offers a random parameter distribution, allowing a probabilistic setting with randomly structured dimensions.

Figure 5: Data cube configuration

Both template and synthetic configurations employ a random distribution of outdegrees to a certain extent. This means the specific number of facts that can be accommodated by the data cube skeleton is not known during configuration. As the user must know the actual structure of the data cube in order to configure an appropriate number of time series, our system offers a preview of the data cube skeleton directly after configuration. This preview is depicted in Figure 6 and shows all category attributes ordered by their size. The user may restart the modeling or accept the generated result.

Figure 6: Preview of generated dimension

In many benchmarks such as TPC-H [2], it is common to set a scale factor SF for the database size, such as 10, 30, 100. Loom adopts this functionality by using a scale dimension that consists of exactly one level with SF ≥ 1 category attributes and can be used to adapt the size of the data cube as desired. Thus, the resulting data cube is inflated by this factor.

While the parameters and configuration types described in this section provide users with a versatile and comfortable way of modeling the categorical information of a data cube, its configuration is not mandatory. If a user wishes for a plain set of unlabeled data, only a primary attribute is generated to identify the created time series.

3.2.2 Time series configuration

Time series configuration is split into three separate dialogs: time attribute, measure attributes, and time series models. The time attribute needs parameters such as the time type and the timeframe, i.e. start time, granularity, end time.

In the measure dialog, the user sets the number of measure columns of the dataset. For each measure attribute, the user sets a data type. Usually, a measure is of double-precision floating-point format, but in order to decrease the dataset size in a database, the user can also choose single-precision floating-point or integer format.

Most importantly, time series models have to be added to the configuration. For this, Loom offers four options:

• Sampled from Template: All time series data is generated "as is" by taking values from given time series from existing datasets uploaded by users. According to the timeframe configuration, data is extracted from the original time series and added to the generated one. Mismatches with the timeframe or data cube size are resolved using duplication, cutting, granularity conversion etc.

• Recombined from Template: Time series are created via decomposition and recombination. Classical decomposition strategies like decompose [13] and stl [6] are used to extract the defining components trend, seasonality, and noise from an existing set of time series. By recombining these components, new time series are created and can be used to add volume and variety to a dataset (see the sketch after this list).

• Modeled from Template: This option uses the third type of templates we described in Section 2. Users can load the time series generators and their distributions from an existing dataset. Customization is possible by changing the number of time series an individual generator creates or by removing/adding certain generators.

• Synthetic time series: The user configures a time series model from scratch, without relying on any given measures. Time series properties are defined freely and are synthetically generated, e.g. with a linearly rising trend, a regular gauss-shaped seasonal component, and a normally distributed error series.
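The recombination option can be sketched as follows, assuming equal-length series with a common period; we use statsmodels' seasonal_decompose as the classical decompose step (STL would work analogously), and the random donor selection is a simplification of whatever recombination strategy Loom actually applies.

    # Sketch of "Recombined from Template": decompose each source series
    # into trend, seasonal and residual components, then weave a new
    # series from components of randomly chosen donor series.
    import numpy as np
    from statsmodels.tsa.seasonal import seasonal_decompose

    def recombine(series_set, period, rng=None):
        rng = rng or np.random.default_rng()
        parts = [seasonal_decompose(s, model="additive", period=period)
                 for s in series_set]
        t, s, r = rng.integers(len(parts), size=3)  # component donor indices
        new = parts[t].trend + parts[s].seasonal + parts[r].resid
        return new[~np.isnan(new)]  # moving-average trend is NaN at the edges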
3.2.3 Export configuration

In the last dialog, the user sets the export configuration. The export destination has different options: CSV, SQL and RData create flat files within the workspace; alternatively, time series are exported to a database via a JDBC driver. For the latter, the user sets up a database connection with a database location and login credentials. Finally, the user sets the schema. Loom supports different schemas: (1) Basic unnormalized export in a Universal schema, which creates high redundancy as it stores the categorical information with each value of a time series. (2) The partly resp. fully normalized Star and Snowflake schemas, which allow more compact exports and are common in database design. (3) The Parent-Child schema, which stores each functional dependency as a pair of parent attribute and child attribute in the respective dimension table.

Particular attention has to be paid to the export when dimensions are unbalanced, i.e. their base categories are not on the same level. Figure 7 shows an example of such a dimension, where base category attributes are either on the region level (Darwin, Alice Springs) or on the state level (Australian Capital Territory). While both the universal schema and the parent-child schema support unbalanced dimensions, the star and snowflake schemas only support unbalanced dimensions when referential integrity can be guaranteed. To achieve this, the user may add a primary attribute column and a minimum outdegree. If the user did not set a data cube skeleton, then a primary attribute is automatically generated.

Figure 7: Unbalanced dimension with two levels (Top → Northern Territory, Australian Capital Territory; Northern Territory → Darwin, Alice Springs)
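For illustration, a minimal sketch of a Universal-schema file export as described above: one flat fact table that repeats the categorical attributes next to every observation, redundant but join-free. The column layout and the function name are our assumptions, not Loom's actual output format.

    # Sketch of the Universal schema: category attributes repeated
    # for every (time, value) observation of a series.
    import csv

    def export_universal(path, series_by_cell, dim_names):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([*dim_names, "time", "value"])
            for cell, values in series_by_cell.items():
                for t, x in enumerate(values):
                    writer.writerow([*cell, t, x])

    export_universal("facts.csv",
                     {("Sydney", "SF0"): [1.0, 1.2, 0.9]},
                     dim_names=["Region", "Scale"])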
3.3 Table generation

Closing the configuration window brings the user back to the initial overview, where a new entry has been added, see Figure 4. Now, the generation process can be invoked by clicking "Start Selected". The state switches to "In Progress" and indicates the current amount of data that has been generated.

4. RELATED WORK

Workload generation has been studied in many papers, each of which focuses either on the generation process or on time series modelling. To our knowledge, Loom is the first application that integrates both techniques and allows for flexible data cube and time series characteristics. In the following, we present selected sources that relate to our prototype.

The IDAS dataset generator [11] offers data generation based on statistical distributions. Attributes form dependency graphs that are not necessarily lattices. The goal is the creation of a synthetic dataset for testing data mining algorithms. The workflow is similar to Loom's since the user creates a dataset by specifying the number of tables, setting the attributes and initiating the data generation. Moreover, the authors experienced similar shortcomings with real data, such as privacy issues, a lack of training data or unsatisfying categorical information. Still, measures do not depend on time, thus time series generation is not supported.

Schaffner and Januschowski [14] focus on the benchmarking of databases under varying request rates. Request rates can be seen as time series of aggregate tenant traces. Since there is not enough real data available, they provide two methodologies for generating synthetic tenant traces. (1) The modelling approach fits a function as a model for a given aggregate tenant trace. The function's shape has been determined empirically. By adding an error term, they create diversity among the synthetic tenant traces. (2) Another way is the decomposition of time series by bootstrapping. Here, a given trace is split into windows that are randomly shifted and result in synthetic traces. Both approaches are similar to Loom's template creation in that synthetic time series are either modelled or recombined from a template.

The F-TPC-H benchmark [7] is a modified TPC-H benchmark for time series generation. This work reuses the given TPC-H schema in that customers submit orders of products for a certain quantity. While this quantity does not depend on time in TPC-H, the modified F-TPC-H adds this dependency for representing trend and seasonal effects via ARIMA models. Thus, this work represents a subset of Loom because it consists of a given schema and allows for synthetic time series in the sales domain. Loom additionally supports schema flexibility and allows for composing different time series generators.

A specific use case for managing energy data is given by [16]. This work proposes a unified data warehouse schema for storing workloads, given as information about actors like producers and consumers, offers, and time series about past measures. Their time series schema involves measures of different types such as energy, power and price. Categorical information is necessary in order to store special annotations such as the aggregation level of a time series or additional information for each time series type. A time series is represented by several tables: (1) a time series table stores the primary attribute that identifies the time series and that links to each category, (2) another dimensional table stores time frames with an identifier and the respective time frame information, (3) the fact table itself consists of the primary attribute, a measure column and a foreign key to the time frame. Thus, this schema is not a traditional star or snowflake schema and cannot directly be covered by Loom. Moreover, Loom keeps time and measure together as a fact. This may increase redundancy, but we opt for this solution for several reasons: (1) there is no need for an additional description of time frames, (2) a time frame is encoded either as an integer or a short string, thus the space for storage is still affordable, (3) there is no join operation needed in order to retrieve a time series. After all, time series from the energy domain may be generated by models integrated in Loom.
In addition, traces for enterprise DBaaS. In Workshops Proc. of our prototype allows the creation of dimensional categorical ICDE, pages 29–35, 2013. information for the description of time series. Besides the [15] R. H. Shumway and D. S. Stoffer. Time Series full manual definition of a dataset, Loom features a template Analysis and Its Applications. Springer, 2011. driven approach that analyses given datasets and allows the [16] L. Siksnys, C. Thomsen, and T. B. Pedersen. creation of synthetic variants of this template data. MIRABEL DW: managing complex energy data in a Currently, our approach only generates complete time se- smart grid. In Proc. of DaWaK, pages 443–457, 2012. ries with equidistant time stamps. This is done, as forecast [17] P. Vassiliadis. Modeling multidimensional databases, methods like Exponential smoothing [9] and ARIMA [5] rely cubes and cube operations. In Proc. of SSDBM, pages on these properties, except few models like [15]. Part of our 53–62, 1998. future work will be the integration of functions for generat- ing incomplete time series with configurable gap patterns. Regarding data cube modelling, we assume that a dimen- sion is a totally ordered set of levels, which is the case in most real-world datasets. However, there are exceptions, such as the modelling of a time dimension with levels: day, week and month. There, a day functionally determines a week and a month, but a week does not determine the month. Such lattices are not supported by our prototype. Further future work will be focused on time series mapping to the data cube. Right now, we use a very simple approach for this and randomly distribute our time series over the data cube skeleton. This approach will be replaced with a more sophisticated method that allows the configuration of the distribution. For example, time series from a certain generator only occur in a specified subset of the data cube skeleton.