First Principle Models Based Dataset Generation for Multi-Target Regression and Multi-Label Classification Evaluation

Ricardo Sousa 1 and João Gama 1,2

1 LIAAD/INESC TEC, Universidade do Porto, Portugal, rtsousa@inesctec.pt
2 Faculdade de Economia, Universidade do Porto, Portugal, jgama@fep.up.pt

Abstract. Machine Learning and Data Mining research strongly depend on the quality and quantity of real world datasets for the evaluation stages of method development. In the context of the emerging Online Multi-Target Regression and Multi-Label Classification methodologies, datasets present new characteristics that require specific testing and represent new challenges. The first difficulty found in evaluation is the reduced number of examples caused by data damage, privacy preservation or high cost of acquisition. Secondly, data events of interest, such as data changes, are difficult to find in the datasets of specific domains, since these events are naturally scarce. For those reasons, this work suggests a method of producing synthetic datasets with desired properties (number of examples, data change events, ...) for the evaluation of Multi-Target Regression and Multi-Label Classification methods. These datasets are produced using First Principle Models, which give more realistic and representative properties, such as real world meaning (physical, financial, ...), to the output and input variables. This type of dataset generation can be used to produce infinite streams and to evaluate incremental methods such as online anomaly and change detection. This paper illustrates the use of synthetic data generation through two showcases of data change evaluation.

Keywords: Synthetic data, data streams, Multi-Label Classification, Multi-Target Regression, First Principle Models

1 Introduction

In the areas of Machine Learning and Data Mining, dataset quality and quantity are crucial for the evaluation stage of method development [1]. Controlled evaluation environments with specified challenge problems are required to understand the behaviour of the methods [2]. Methods of Multi-Target Regression (MTR) and Multi-Label Classification (MLC) on online data streams are fair examples that imply these evaluation requirements. The importance of these methodologies has been growing due to their reasonable modelling and prediction capabilities [3, 4].

Formally, let an unbounded data stream be represented by $D = \{\ldots, (\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_i, \mathbf{y}_i), \ldots\}$, where $\mathbf{x}_i = [x_{i,1} \cdots x_{i,j} \cdots x_{i,M}]$ is an $M$-dimensional vector of real values containing the descriptive variables $x_{i,j}$ (input variables) of the $i$th example ($i$ being the reference index of the example). For Multi-Target Regression, $\mathbf{y}_i = [y_{i,1} \cdots y_{i,j} \cdots y_{i,N}]$ denotes a vector of real-valued responses $y_{i,j}$ (output variables) of the $i$th example. For Multi-Label Classification, $\mathbf{y}_i$ corresponds to a subset of nominal labels $\lambda_k$ such that $\mathbf{y}_i \subseteq \{\lambda_1, \ldots, \lambda_k, \ldots, \lambda_L\}$, where $L$ is the number of possible labels. Typically, the output set of labels $\mathbf{y}_i$ is transformed into a vector of output variables $[y_{i,1} \cdots y_{i,k} \cdots y_{i,L}]$, where the $y_{i,k} \in \{0, 1\}$ are binary. If label $\lambda_k$ is assigned to the $i$th example then $y_{i,k} = 1$, otherwise $y_{i,k} = 0$. The output variables are then redefined as $\mathbf{y}_i = [y_{i,1} \cdots y_{i,k} \cdots y_{i,L}]$. Finally, the objective of both MTR and MLC methods is to learn a function $f(\mathbf{x}_i) \rightarrow \mathbf{y}_i$ that maps the input values of $\mathbf{x}_i$ into the output values of $\mathbf{y}_i$.
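To make the notation above concrete, the following minimal Python sketch (not part of the original paper; all names and values are purely illustrative) shows the transformation of an MLC label set into a binary output vector and the representation of a single MTR stream example.

```python
# Minimal sketch of the MTR / MLC example representation described above.
# All names and values are illustrative only.

def labels_to_binary(label_set, all_labels):
    """Transform a set of nominal labels into a binary output vector:
    y_k = 1 if label lambda_k is assigned to the example, 0 otherwise."""
    return [1 if lbl in label_set else 0 for lbl in all_labels]

# Multi-Label Classification example: L = 4 possible labels, 2 assigned.
all_labels = ["lambda_1", "lambda_2", "lambda_3", "lambda_4"]
y_set = {"lambda_1", "lambda_3"}
print(labels_to_binary(y_set, all_labels))   # [1, 0, 1, 0]

# Multi-Target Regression example: M input values, N real-valued outputs.
x_i = [0.42, 1.7, -0.3]      # descriptive (input) variables
y_i = [10.5, 0.08]           # response (output) variables
example = (x_i, y_i)         # one element of the data stream D
```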
In the evaluation of the methods, the number of examples is insufficient in many cases, due to sensitive data, data damage or high cost of acquisition [2]. Data change scenarios are another prominent challenge for MTR and MLC methods [5]. Changes in the probability distributions, in the trends of the input variables or rapid shifts of the input variables are events that strongly influence a method's performance [5]. Similarly to the scarcity of examples, real world datasets that gather all desired data change properties are also lacking [2]. Moreover, data change events are often not annotated, since annotation is time consuming.

As an attempt to solve this problem, researchers produce synthetic datasets to create evaluation challenges with desired properties. This alternative allows producing a significant number of examples, or even creating a reasonable approximation of an infinite data stream. Moreover, few resources for storage and transmission are required. However, despite their high complexity, the models behind such datasets do not reflect their real world counterparts. In fact, the latent models are based on abstract mathematical concepts.

Alternatively, datasets can be constructed through the employment of First Principle Models (FPM), which are described by established laws without making assumptions (empirical or fitted parameters) [6]. FPM are used to create synthetic data in a wide range of areas, such as Chemical Engineering (industrial chemical processes) [7] and Mechanical Engineering (mechanical systems diagnosis) [8]. In the area of Control Systems, Proportional-Integral-Derivative (PID) system modelling uses FPM extensively in several application contexts [9]. For instance, FPM based software simulators (parametrized with input and output variables) that mimic those systems are created to reduce the cost of industrial trials [7]. The abundance of free software simulators and models of PID systems justified the focus on PID systems.

Thus, this work suggests a method to produce reproducible synthetic datasets for MTR and MLC evaluation. This method applies FPM to produce more realistic and representative models.

Section 2 briefly reviews some existing methods of dataset generation. Section 3 describes the FPM approach and the selected FPM that is used to produce the synthetic datasets. Section 4 shows the production of the MTR and MLC synthetic datasets and their application through MTR and MLC method evaluation showcases, under data change scenarios. Finally, the results are presented and discussed in Section 5 and the main conclusions are reported in Section 6.

2 Related Work

In the literature, most dataset generators produce Single-Label Classification datasets (where the output variable $y_i$ is a single label), since MLC is an emerging methodology [1]. Sánchez-Monedero et al. [2], Frasch et al. [10] and Narasimhamurthy and Kuncheva [5] are representative examples of Single-Label Classification (SLC) dataset generators.

One possible strategy is to produce several Binary Classification outputs (where $y_{i,k} \in \{0, 1\}$), one for each label, and combine them into MLC datasets; a minimal sketch of this strategy is given below. However, such datasets do not represent correlations between outputs.

Read et al. propose a dataset generator that uses single-label generators and combines them according to a configured label imbalance and probabilities of simultaneous label occurrence [11]. This dataset generator attempts to create more realistic datasets in terms of label imbalance and concept drifts through an empirical method.
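As a rough, hypothetical sketch of the naive combination strategy mentioned above (and not of Read et al.'s generator), the code below draws each label from an independent binary generator; because labels are sampled independently, the resulting stream cannot exhibit output correlations, which is precisely the limitation noted above.

```python
import random

def independent_mlc_stream(n_examples, label_probs, n_inputs=5, seed=0):
    """Naively combine L independent binary 'single-label' generators into
    an MLC dataset. Because each label is drawn independently, the produced
    labels are uncorrelated by construction."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]       # inputs
        y = [1 if rng.random() < p else 0 for p in label_probs]  # labels
        yield x, y

# Hypothetical usage: 3 labels with different (independent) prevalences.
for x, y in independent_mlc_stream(3, label_probs=[0.1, 0.5, 0.9]):
    print(y)
```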
Tomás et al. propose a highly configurable dataset generator for MLC based on the creation of hyper-spheres [1]. However, the generator is not designed to produce data changes. In addition, despite its high flexibility, this dataset generator produces an abstract mathematical challenge.

Similarly, the majority of the existing dataset generators produce Single-Target Regression (STR, also known as Multivariate Regression) datasets with simple but highly non-linear models. Friedman produced STR datasets from very simple and non-linear models to test the Multivariate Adaptive Regression Splines method [12]. The same strategy used in MLC can be applied to produce MTR datasets from STR ones. To the best of our knowledge, no MTR dataset generators were found in the literature. In fact, MLC methodologies have received more attention from researchers.

3 First Principle Model: Tennessee Eastman Process

In order to illustrate the application of FPM to generate MTR and MLC datasets, the Tennessee Eastman Process (TEP) was chosen. TEP is an industrial process of continuous chemical production. The process is unstable, non-linear and controlled by a PID system. Basically, PID systems are control mechanisms based on loop feedback that are used for process stabilization [9]. Figure 1 shows the model of a generic PID system. These systems essentially involve a Plant and a PID controller.

Fig. 1. Model of a generic PID system.

The Plant consists of a set of equations that represent the behaviour of the controlled process. The process is driven by the manipulable variables $u_i$ and observed by the measurement variables $y_i$. Some processes present disturbance variables $d_i$ (binary variables) to simulate process impairments. The error $e_i$ between the desired set points $r_i$ and the process measurements is computed with the purpose of being minimized. The PID controller consists of a weighted sum of proportional (present values), integral (past values) and derivative (possible future values) terms, which are calibrated in order to produce stable responses [9]. This component receives the error and computes new manipulable variable values that stabilize the Plant process.

Figure 2 represents the diagram of the TEP Plant. Most process details were originally omitted for simplification and for protection of intellectual property.

Fig. 2. Model of the TEP Plant. The $u_j = u_{i,j}$ are the individual manipulable variables and the $y_j = y_{i,j}$ are the individual measurement variables.

Two products G and H (Product) are produced from four gaseous reactants (A, C, D, E). An inert product B is present but does not intervene in the chemical reaction. A by-product F (Purge) results from the whole reaction. The chemical reactions are irreversible and exothermic. The process model comprises five interacting major units: a reactor, a condenser, a vapour-liquid separator, a stripper for the product stream and a centrifugal compressor for the recycle stream. The model has 41 measurement variables, 12 manipulated variables and 12 set points. There are also 20 variables that simulate disturbances. The physical quantities that the variables represent are explained in detail in Bathelt et al. [7]. Tables 3 to 6 in the Appendix give the names of the variables and their respective physical meaning.

4 Methods

This section describes the procedures for the generation of the MTR and MLC datasets. Two evaluation showcases, an MTR regressor and an MLC classifier predicting over the generated datasets, are also described.
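Before detailing the generation procedure, the following minimal sketch illustrates the discrete-time PID feedback loop described in Section 3; the toy plant equation and the gain values are hypothetical choices for illustration and are not taken from the TEP simulator.

```python
def pid_step(error, state, kp, ki, kd, dt):
    """One update of a discrete PID controller.
    state = (integral of the error, previous error)."""
    integral, prev_error = state
    integral += error * dt
    derivative = (error - prev_error) / dt
    u = kp * error + ki * integral + kd * derivative
    return u, (integral, error)

# Toy first-order plant driven towards a set point r (all values hypothetical).
r, y, dt = 1.0, 0.0, 0.1
state = (0.0, 0.0)
for k in range(100):
    e = r - y                                    # error between set point and measurement
    u, state = pid_step(e, state, kp=2.0, ki=0.5, kd=0.1, dt=dt)
    y += dt * (-y + u)                           # simple plant dynamics: dy/dt = -y + u
print(round(y, 3))                               # y ends up close to the set point r
```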
The generation of the datasets was focused on data change robustness tests. Data changes in streams can be analysed in two aspects: nature and rate. The nature reflects which variable statistics changed, such as the mean and the variance [13]. The rate of change is an important aspect that influences the performance of most MTR and MLC methods. Abrupt changes (concept drifts) are identified when the change occurs from one example to the next (no transition phase). Gradual changes (concept shifts) present a transition phase in which the changing statistic varies continuously [13]. This work focuses on abrupt changes (concept drifts) of the mean statistic, which is one of the most studied topics in data streams [14].

In this work, a simulator that implements the model described in Section 3 was used. The files of this simulator can be found at http://depts.washington.edu/control/LARRY/TE and it was originally developed by Ricker [15]. The simulator was developed partly in C and partly in Matlab (Simulink). The TEP simulator is composed essentially of a plant function developed in the C language and subsequently built into a Matlab MEX file. The PID controller (a set of small PID controllers) function was developed in Simulink.

To produce data change events, the set point variables are manipulated. Some set points, such as the Production Set Point, allow producing gradual changes (concept shifts) and abrupt changes (concept drifts), since these variables are related to the convergence of the measurement variables to stable values. The rate can be calibrated using a rate limiter in the Simulink environment. Figure 3 shows the parameters of a data change event. The curve reflects the variation of a statistical parameter from a value a to a value b. The parameter t is the example index where the change starts and W is the number of examples in the transition phase. A change in the set point value creates similar changes in the Plant parameters. The change can be either a decrease or an increase of a parameter.

Fig. 3. Model of a data change reflecting a statistical parameter variation.

The generated datasets were produced with 100000 examples each, for both MTR and MLC evaluation. In these experiments, the changes were created with a periodicity of 5000, 10000 and 20000 examples. W was set to 0 in order to simulate concept drifts. Two special cases were also produced: one dataset does not present any drift (baseline) and the other presents a constant decreasing change.

Regarding the MTR dataset production, the dataset joins the measurement variables $y_i$ and the manipulable variables $u_i$ with the purpose of predicting $u_i$ from $y_i$. Each data example is defined as $e_i = (y_i, u_i)$. As performance measures, the error was used for the error evolution and the RMSE was used for global evaluation and comparison between algorithms. The error curve was smoothed with a median sliding window due to its spiky form. The window length is 1000 examples, without overlapping.

For the MLC dataset generation, the disturbance variables $d_i$ and the measurement variables $y_i$ are joined. The purpose is to predict the disturbance variables $d_i$ from $y_i$. The disturbance variables $d_i$ are already in the form used in MLC problem transformation [11]. Each data example is defined as $e_i = (y_i, d_i)$. A Poisson process was used to choose the instant of each disturbance occurrence, with a user defined duration. The duration of the disturbances was 5000 examples.
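The two generation mechanisms just described can be sketched as follows: the set-point change of Figure 3 (from value a to value b, starting at example t, with a transition of W examples, where W = 0 yields a concept drift) and the Poisson-process scheduling of disturbance windows. The function names and parameter values are illustrative assumptions, not the simulator's actual interface.

```python
import random

def set_point_profile(n, a, b, t, W):
    """Set-point value for each example index: constant a before t, a linear
    transition of W examples (W = 0 gives an abrupt drift), then constant b."""
    values = []
    for i in range(n):
        if i < t:
            values.append(a)
        elif W > 0 and i < t + W:
            values.append(a + (b - a) * (i - t) / W)
        else:
            values.append(b)
    return values

def disturbance_windows(n, rate, duration, seed=0):
    """Sample disturbance activation windows: onsets follow a Poisson process
    (exponential inter-arrival times), each active for `duration` examples."""
    rng = random.Random(seed)
    active = [0] * n
    i = int(rng.expovariate(rate))
    while i < n:
        for j in range(i, min(i + duration, n)):
            active[j] = 1
        i += duration + int(rng.expovariate(rate))
    return active

# Hypothetical usage: abrupt drift at example 5000; disturbances of 5000 examples.
profile = set_point_profile(n=20000, a=22.9, b=20.0, t=5000, W=0)
d1 = disturbance_windows(n=100000, rate=1 / 20000, duration=5000)
```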
In this evaluation, the input variables were smoothed with a low-pass filter (sliding windows of 1000 examples) in order to obtain stability. The F-Measure, Precision, Recall, Accuracy and Exact Match were used for the classification scenario.

The Multi-Target regressor MT-AMRules and the Multi-Label classifier ML-AMRules were used to exemplify the application of the synthetic datasets in MTR and MLC method evaluation, respectively. Both the MTR regressor and the MLC classifier were tested in a prequential mode.

5 Results

In this section, two simple showcases with MTR and MLC are demonstrated. The types of data change events involved are also visualized. A comparison between a scenario without data changes and several scenarios where different types of data changes occur is performed. Figure 4 depicts four data changes in the mean statistic with different rates, using the Reactor Coolant variable. The purpose is to show the flexibility and the diversity of the adapted software in creating several types of data changes.

Fig. 4. Examples of data changes with different rates of variation observed in the Reactor Coolant variable. These gradual and abrupt changes are created using rate limiters in the Simulink implementation. The thinner curve represents an abrupt change (concept drift), which is the event intended to be evaluated in the MTR and MLC showcases.

In the following paragraphs, the MTR and MLC evaluation showcases are presented. Figure 5 shows the evolution of the error for several MTR datasets.

Fig. 5. Error evolution showing the effect of data changes. Plots a) to c) represent the smoothed error curve for 5000 (MTR 5k dataset), 10000 (MTR 10k dataset) and 20000 (MTR 20k dataset) examples of concept drift periodicity, respectively. Plot d) represents, in solid line, the scenario where the dataset presents no data change (MTR NoChange dataset) and, in dotted line, the scenario where the data change is constant (MTR Const dataset).

The error curves show the effect of the concept drifts. Figure 5 shows descending (dashed) and ascending (solid) concept drift events. The descending variations cause a larger error increase than the ascending variations. The impact of an ascending variation can be observed in plot c). Table 1 shows the results of the MTR regression using several datasets with different challenges.

Table 1. RMSE of the MTR evaluation.
Dataset  MTR 5k  MTR 10k  MTR 20k  MTR NoChange  MTR Const
RMSE     0.158   0.151    0.146    0.101         0.119

Table 1 shows that the more often the data changes, the higher the RMSE. This fact is expected, since the drifts lead to relearning of the model. Interestingly, plot d) of Figure 5 shows that the constant and gradual change leads to a gradual increase of the error when compared to the scenario where no drifts occur.

Table 2 shows the results of the MLC classification. It presents the performance measures of the MLC evaluation for datasets with 5000 (MLC 5k dataset), 10000 (MLC 10k dataset) and 20000 (MLC 20k dataset) example periodicity. It also presents the performance measures for the scenarios where no data change occurred (MLC NoChange dataset) and where a constant change occurred (MLC Const dataset).

Table 2. Performance measures of the MLC evaluation.
Measure      MLC 5k  MLC 10k  MLC 20k  MLC NoChange  MLC Const
Accuracy     0.80    0.79     0.80     0.83          0.79
Exact Match  0.62    0.61     0.62     0.66          0.60
Precision    0.65    0.65     0.64     0.62          0.65
Recall       0.68    0.66     0.68     0.63          0.67
F-Measure    0.65    0.64     0.64     0.61          0.65

Table 2 shows that the drifts and the gradual change produce little effect on the performance measures.
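For reference, the prequential (test-then-train) protocol and the median sliding-window smoothing used in the MTR showcase can be sketched as below; the model interface (predict/learn) is a hypothetical stand-in, not the actual MT-AMRules API, and the toy model and stream are illustrative only.

```python
import math
from statistics import median

def prequential_rmse(stream, model, window=1000):
    """Prequential evaluation: each example is used for testing first and for
    training afterwards. Returns the global RMSE and a median-smoothed error
    curve computed over non-overlapping windows of `window` examples."""
    sq_sum, count, buffer, smoothed = 0.0, 0, [], []
    for x, y in stream:
        y_hat = model.predict(x)                           # test first ...
        se = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
        sq_sum, count = sq_sum + se, count + 1
        buffer.append(math.sqrt(se))
        if len(buffer) == window:                          # non-overlapping window
            smoothed.append(median(buffer))
            buffer = []
        model.learn(x, y)                                  # ... then train
    return math.sqrt(sq_sum / count), smoothed

# Hypothetical usage with a trivial mean-predicting model (stand-in only).
class MeanModel:
    def __init__(self, n_outputs):
        self.sums, self.count = [0.0] * n_outputs, 0
    def predict(self, x):
        return [s / self.count if self.count else 0.0 for s in self.sums]
    def learn(self, x, y):
        self.sums = [s + v for s, v in zip(self.sums, y)]
        self.count += 1

stream = [([float(i)], [float(i % 7)]) for i in range(5000)]
rmse, curve = prequential_rmse(stream, MeanModel(1), window=1000)
```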
6 Conclusion

This work presented a framework for dataset generation for MTR and MLC evaluation. Realistic and representative datasets for MTR and MLC were obtained for method evaluation. However, the main limitation is the restricted number of input and output variables. As future work, the main goal is to implement an MTR and MLC data streamer that produces data through a configurable FPM in a standalone and portable application.

7 Acknowledgements

This work was partly supported by the European Commission through MAESTRA (ICT-2013-612944) and by the Project TEC4Growth - Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020, which is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

References

1. Jimena Torres Tomás, Newton Spolaôr, Everton Alvares Cherman, and Maria Carolina Monard. A framework to generate synthetic multi-label datasets. Electronic Notes in Theoretical Computer Science, 302:155-176, 2014. Proceedings of the XXXIX Latin American Computing Conference (CLEI 2013).
2. Javier Sánchez-Monedero, Pedro Antonio Gutiérrez, María Pérez-Ortiz, and César Hervás-Martínez. An n-spheres based synthetic data generator for supervised classification. In Advances in Computational Intelligence - 12th International Work-Conference on Artificial Neural Networks, IWANN 2013, Puerto de la Cruz, Tenerife, Spain, June 12-14, 2013, Proceedings, Part I, pages 613-621, 2013.
3. Changsheng Li, Weishan Dong, Qingshan Liu, and Xin Zhang. MORES: online incremental multiple-output regression for data streams. CoRR, abs/1412.5732, 2014.
4. Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Streaming multi-label classification. In Proceedings of the Second Workshop on Applications of Pattern Analysis, WAPA 2011, Castro Urdiales, Spain, October 19-21, 2011, pages 19-25, 2011.
5. Anand Narasimhamurthy and Ludmila I. Kuncheva. A framework for generating data to simulate changing environments. In Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, AIAP'07, pages 384-389, Anaheim, CA, USA, 2007. ACTA Press.
6. Manuel Rodríguez and David Pérez. First principles model based control. In Luis Puigjaner and Antonio Espuña, editors, European Symposium on Computer-Aided Process Engineering-15, 38th European Symposium of the Working Party on Computer Aided Process Engineering, volume 20 of Computer Aided Chemical Engineering, pages 1285-1290. Elsevier, 2005.
7. Andreas Bathelt, N. Lawrence Ricker, and Mohieddine Jelali. Revision of the Tennessee Eastman process model. IFAC-PapersOnLine, 48(8):309-314, 2015.
8. Jonathan Cagan and Alice Agogino. Innovative design of mechanical structures from first principles. AI EDAM, 1(3):169-189, 1987.
9. K. J. Åström and T. Hägglund. PID Controllers: Theory, Design, and Tuning. Instrument Society of America, Research Triangle Park, NC, 2nd edition, 1995.
10. Janick V. Frasch, Aleksander Lodwich, Faisal Shafait, and Thomas M. Breuel. A Bayes-true data generator for evaluation of supervised and unsupervised learning methods. Pattern Recognition Letters, 32(11):1523-1531, August 2011.
11. Jesse Read, Bernhard Pfahringer, and Geoff Holmes. Generating synthetic multi-label data streams. In ECML/PKDD 2009 Workshop on Learning from Multi-label Data (MLD'09), 2009.
12. Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-67, 1991.
13. João Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edition, 2010.
14. Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 139-148, New York, NY, USA, 2009. ACM.
15. N. Lawrence Ricker. Decentralized control of the Tennessee Eastman challenge process. Journal of Process Control, 6(4):205-221, 1996.

Appendix: Tables of TEP variables

Table 3. Manipulable variables.
Number  Variable Name
1   D feed flow (stream 2)
2   E feed flow (stream 3)
3   A feed flow (stream 1)
4   A and C feed flow (stream 4)
5   Compressor recycle valve
6   Purge valve (stream 9)
7   Separator pot liquid flow (stream 10)
8   Stripper liquid product flow (stream 11)
9   Stripper steam valve
10  Reactor cooling water flow
11  Condenser cooling water flow
12  Agitator speed

Table 4. Set point variables.
Number  Variable Name
1   Production Set Point
2   Stripper Level Set Point
3   Separator Set Point
4   Reactor Level Set Point
5   Reactor Pressure Set Point
6   Mole % G Set Point
7   A Set Point
8   C Set Point
9   Recycle Valve Position
10  Steam Valve Position
11  Stripper steam valve
12  Agitator Setting

Table 5. Disturbance variables.
Number  Variable Name                                              Type
1   A/C feed ratio, B composition constant (stream 4)             Step
2   B composition, A/C ratio constant (stream 4)                  Step
3   D feed temperature (stream 2)                                  Step
4   Reactor cooling water inlet temperature                        Step
5   Condenser cooling water inlet temperature                      Step
6   A feed loss (stream 1)                                         Step
7   C header pressure loss, reduced availability (stream 4)        Step
8   A, B, C feed composition (stream 4)                            Random
9   D feed temperature (stream 2)                                  Random
10  C feed temperature (stream 4)                                  Random
11  Reactor cooling water inlet temperature                        Random
12  Condenser cooling water inlet temperature                      Random
13  Reaction kinetics                                              Slow drift
14  Reactor cooling water valve                                    Sticking
15  Condenser cooling water valve                                  Sticking
16  Unknown                                                        Unknown
17  Unknown                                                        Unknown
18  Unknown                                                        Unknown
19  Unknown                                                        Unknown
20  Unknown                                                        Unknown

Table 6. Measurement variables.
Number  Variable Name                                              Unit
1   A feed (stream 1)                                              kscmh
2   D feed (stream 2)                                              kg/h
3   E feed (stream 3)                                              kg/h
4   A and C feed (stream 4)                                        kscmh
5   Recycle flow (stream 8)                                        kscmh
6   Reactor feed rate (stream 6)                                   kscmh
7   Reactor pressure                                               kPa gauge
8   Reactor level                                                  %
9   Reactor temperature                                            °C
10  Purge rate (stream 9)                                          kscmh
11  Separator temperature                                          °C
12  Product separator level                                        %
13  Separator pressure                                             kPa gauge
14  Separator underflow (stream 10)                                m3/h
15  Stripper level                                                 %
16  Stripper pressure                                              kPa gauge
17  Stripper underflow (stream 11)                                 m3/h
18  Stripper temperature                                           °C
19  Stripper steam flow                                            kg/h
20  Compressor work                                                kW
21  Reactor cooling water outlet temperature                       °C
22  Separator cooling water outlet temperature                     °C
23  Concentration of A in Reactor feed (stream 6)                  mol %
24  Concentration of B in Reactor feed (stream 6)                  mol %
25  Concentration of C in Reactor feed (stream 6)                  mol %
26  Concentration of D in Reactor feed (stream 6)                  mol %
27  Concentration of E in Reactor feed (stream 6)                  mol %
28  Concentration of F in Reactor feed (stream 6)                  mol %
29  Concentration of A in Purge (stream 9)                         mol %
30  Concentration of B in Purge (stream 9)                         mol %
31  Concentration of C in Purge (stream 9)                         mol %
32  Concentration of D in Purge (stream 9)                         mol %
33  Concentration of E in Purge (stream 9)                         mol %
34  Concentration of F in Purge (stream 9)                         mol %
35  Concentration of G in Purge (stream 9)                         mol %
36  Concentration of H in Purge (stream 9)                         mol %
37  Concentration of D in stripper underflow (stream 11)           mol %
38  Concentration of E in stripper underflow (stream 11)           mol %
39  Concentration of F in stripper underflow (stream 11)           mol %
40  Concentration of G in stripper underflow (stream 11)           mol %
41  Concentration of H in stripper underflow (stream 11)           mol %