=Paper= {{Paper |id=None |storemode=property |title=Task Graphs of Stream Mining Algorithms |pdfUrl=https://ceur-ws.org/Vol-1018/paper10.pdf |volume=Vol-1018 |dblpUrl=https://dblp.org/rec/conf/vldb/Akioka13 }} ==Task Graphs of Stream Mining Algorithms== https://ceur-ws.org/Vol-1018/paper10.pdf
                    Task Graphs of Stream Mining Algorithms
                                                             Sayaka Akioka
                                                             Meiji University
                                                       4-21-1 Nakano, Nakano-ku
                                                        Tokyo, 164-8525, Japan
                                                           +81-3-5343-8305
                                                          akioka@meiji.ac.jp


ABSTRACT
Acceleration of huge data analysis, especially the analysis of huge and fast streaming data, is one of the major issues in recent computer science. Proper modeling and understanding of streaming data analysis are indispensable for the speed-up, scale-out, and faster response time of streaming data analysis. Especially for research on scheduling or load balancing algorithms, the model of the target application truly impacts the performance of those algorithms; however, there has been no study on realistic models or the actual behaviors of streaming data analysis yet. This paper proposes task graphs for stream mining algorithms, with some examples from actual applications. A task graph represents the workload of the target application with its data dependencies and control flows. This is the first proposal of task graphs for stream mining algorithms, and the task graphs play an important role as a benchmarking tool for the development of scheduling or load balancing algorithms targeting stream mining algorithms.

1. INTRODUCTION
Applications that process a massive amount of data, so-called "big data", are one of the recent hot research topics. Big data applications are sometimes considered to be quite similar to data intensive applications in high performance computing (HPC); however, the behaviors of applications in these two domains are quite different [9].

Big data applications often utilize stream mining algorithms, while data intensive applications process huge data in a batch. That is, a big data application often tries to analyze a data stream, which is a sequence of data arriving in chronological order, on the fly. As the data stream flows very fast, stream mining algorithms are developed with the goal of a complete analysis over such fast data flows. Once the analysis is delayed and fails to keep up with the data arrival, the whole process is forced to drop some of the arriving data. As many streaming analysis processes place emphasis on real-time analysis in chronological order, a drop of arriving data is highly critical.

As big data applications scale up under such a severe requirement for extremely low latency, they have come to run on parallel and distributed computing environments such as the computing cloud. In order to exploit parallelism and speed up the applications, scheduling is indispensable. Scheduling algorithms for parallel and distributed computing environments have been studied intensively for a long time, especially in HPC, and these studies often validate and compare scheduling algorithms with task graphs. A task graph represents the workload of a target application, which is often a synthetic workload generated randomly. As the quality of task graphs heavily impacts the validation of scheduling algorithms, methodologies to generate task graphs have been studied as well, with a strong focus on data intensive applications in HPC.

This paper proposes task graphs generated from actual implementations of stream mining algorithms in order to contribute to the development of effective and practical scheduling algorithms for stream mining algorithms. The contributions of this paper are 1) the first proposal of task graphs for stream mining algorithms, 2) practical and realistic workloads extracted from existing implementations, 3) task graphs as representations of the behaviors of stream mining algorithms, opening up unexplored problems for conventional scheduling algorithms, and 4) task graphs as a benchmarking tool to accelerate the development of scheduling algorithms for stream mining algorithms.

The rest of this paper is organized as follows. Section 2 gives a generic model of stream mining algorithms in order to clarify the data dependencies of the process. Section 3 describes the procedure of task graph generation, and proposes a format of task graphs for stream mining algorithms. Section 4 overviews the actual stream mining algorithms analyzed in this paper, and presents the corresponding task graphs. Section 5 briefly introduces related work, and Section 6 concludes this paper.

2. STREAM MINING ALGORITHMS
A stream mining algorithm is an algorithm specialized for data analysis over data streams on the fly. There are many variations of stream mining algorithms; however, general stream mining algorithms share a fundamental structure and data access pattern, as shown in Figure 1 [1].

A stream mining algorithm consists of two parts: a stream processing part, and a query processing part. First, the stream processing module picks the target data unit, which is a chunk of data that arrived in a limited time frame, and executes a quick analysis over the data unit. The quick analysis here can be a preconditioning process such as a morphological analysis or a word counting. Second, the stream processing module updates the data cached in one or more sketches with the latest results of the quick analysis. That is, the sketches keep the intermediate analysis, and the stream processing module updates the analysis incrementally as more data units are processed. Third, the analysis module reads the intermediate analysis from the sketches, and extracts the essence of the data in order to complete the quick analysis in the stream processing part. Finally, the query
processing part receives this essence for further analysis, and the whole process for the target data unit is closed.

Figure 1. A model of stream mining algorithms.

Based on the model above, we can conclude that the major responsibility of the stream processing part is to process each data unit for further analysis, and that the stream processing part has a huge impact on the latency of the whole process. The stream processing part needs to finish preconditioning the current data unit before the next data unit arrives; otherwise, the next data unit will be lost, as there is no storage for buffering the incoming data in a stream mining algorithm. On the other hand, the query processing part takes care of detailed analysis, such as a frequent pattern analysis or a hot topic extraction, based on the intermediate data passed by the stream processing part. The output of the query processing part is usually pushed into a database system, and there is no urgent demand for an instantaneous response. Therefore, only the stream processing part needs to run on a real-time basis, and the successful analysis of all the incoming data simply relies on the speed of the stream processing part.

The model of a stream mining algorithm shown in Figure 1 also indicates that the data access pattern of stream mining algorithms is totally different from the data access pattern of so-called data intensive applications, which are intensively investigated in HPC. The data access pattern of data intensive applications is write-once-read-many [9]. That is, the application refers to the necessary data many times during the computation; therefore, the key to the speedup of the application is to place the necessary data close to the computational nodes for faster data accesses. In a stream mining algorithm, on the other hand, a process refers to its data unit only once, which is a read-once-write-once style. Therefore, a scheduling algorithm for data intensive applications is not simply applicable for the purpose of speeding up a stream mining algorithm.

Figure 2 illustrates the data dependencies between two processes analyzing successive data units, and the data dependencies inside one process. The top left flow represents the stream processing part of the preceding process, and the bottom right flow represents the stream processing part of the successive process. Each flow consists of six stages: read from sketches, read from input, stream processing, update sketches, read from sketches, and analysis. An arrow represents a control flow, and a dashed arrow represents a data dependency.

In Figure 2, there are three data dependencies in total, as follows, and all three are essential to keep the analysis results consistent and correct.

1. The processing module in the preceding process should finish updating the sketches before the processing module in the successive process starts reading the sketches (Dep.1 in Figure 2).
2. The processing module should finish updating the sketches before the analysis module in the same process starts reading the sketches (Dep.2 in Figure 2).
3. The analysis module should finish reading the sketches before the processing module in the successive process starts updating the sketches (Dep.3 in Figure 2).

Figure 2. Data dependencies of the stream processing parts in two processes in line.

3. TASK GRAPH DEFINITIONS
As discussed in Section 2, a model of a stream mining algorithm has data dependencies both across processes and inside one process. Therefore, a task graph of a stream mining algorithm should consist of a data dependency graph and a control flow graph. We already modeled both the data dependencies and the control flows of stream mining algorithms in Section 2; however, a task graph is a finer grained model of a specific algorithm and implementation.

A data dependency graph is drawn via an analysis of the actual implementation of the target algorithm. Figure 3(a) is an example of a data dependency graph of the training stage of the Naïve Bayes classifier [8] implemented by the MOA project [2]. Figure 4 shows the pseudo code corresponding to the data dependency graph in Figure 3(a). A data dependency graph is a directed acyclic graph (DAG). In a data dependency graph, each node represents a basic block, or
Figure 3. The data dependency graph (a), and the control flow graph (b), for the Naïve Bayes implementation of MOA.
for all training data do
    (1) Fetch one training data v
    for all attributes of v do
        (2-1) Update the weight sum of this attribute.
        (2-2) Update the mean value of this attribute.
    end for
end for

Figure 4. The training stage of the Naïve Bayes algorithm.
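To make the structure of Figure 4 concrete, the training stage can be sketched in Java roughly as follows. The class and field names are illustrative assumptions of this rewrite, not taken from the MOA code base, and the incremental weighted-mean update is one common way to realize step (2-2):

```java
// Illustrative sketch of the training stage in Figure 4; names are
// hypothetical and do not come from the MOA implementation.
public class NaiveBayesTrainer {
    private final double[] weightSums;   // per-attribute weight sums
    private final double[] means;        // per-attribute running means

    public NaiveBayesTrainer(int numAttributes) {
        this.weightSums = new double[numAttributes];
        this.means = new double[numAttributes];
    }

    // (1) Fetch one training data v, then update every attribute.
    public void train(double[] v, double weight) {
        for (int i = 0; i < v.length; i++) {
            // (2-1) Update the weight sum of this attribute.
            weightSums[i] += weight;
            // (2-2) Update the weighted running mean of this attribute.
            means[i] += (v[i] - means[i]) * (weight / weightSums[i]);
        }
    }

    public double mean(int attribute) { return means[attribute]; }
    public double weightSum(int attribute) { return weightSums[attribute]; }
}
```

Note that the loop body touches only the entries of the attribute it is processing, which is why the basic block labeled node 2 in Figure 3(a) is data parallel.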
                                                                                  
an equivalent chunk of code in the actual implementation, and each arrow indicates a data dependency. If an arrow goes from node A to node B, the arrow indicates that there is a data dependency between node A and node B, and that the process represented by node B relies on the data generated by the process represented by node A for the consistency of the analysis.

The data dependency graph in Figure 3(a) actually consists of two DAGs: a DAG with nodes in white, and a DAG with nodes in gray. Each DAG represents one process in Figure 2. That is, the DAG with white nodes in Figure 3(a) indicates the preceding process in Figure 2, and the DAG with gray nodes in Figure 3(a) indicates the successive process in Figure 2. The arrows between the two DAGs represent the data dependencies between the two processes. In the case of stream mining applications, and this is the most significant difference from conventional applications, the application continues running as long as new data units arrive. A DAG with nodes of the same color represents one process for one data unit; therefore, DAGs should be lined up, one per data unit the corresponding application processes. In this case, two DAGs are sufficient to represent the minimum unit of the repeated pattern in the application, and the data dependency graph does not contain any more redundant DAGs, for a simple but sufficient representation.

In a data dependency graph, each node has a number, and the number indicates which basic block in the pseudo code the node represents, as shown in Figure 4. In this example, node 1 represents the line starting with "(1)" in the pseudo code in Figure 4, and node 2 represents the lines starting

Figure 5. XML schema for a task graph.
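The node-and-arrow structure described in this section can be captured by a small data structure. The following Java sketch is illustrative only; the class, field, and method names are assumptions of this rewrite, not the format defined by the XML schema in Figure 5:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of a task graph node as described in Section 3:
// a basic block with a numeric label, an average measured cost, data
// dependency edges (dashed arrows), and control flow edges (solid
// arrows). Names are hypothetical, not from the paper's XML schema.
public class TaskNode {
    public final int block;            // basic block number, e.g. 1 or 2
    public final double avgCost;       // average measured computational cost
    public final List<TaskNode> dataDeps = new ArrayList<>();    // dashed arrows
    public final List<TaskNode> controlDeps = new ArrayList<>(); // solid arrows

    public TaskNode(int block, double avgCost) {
        this.block = block;
        this.avgCost = avgCost;
    }

    // Node B depends on node A: A must finish before B may read its data.
    public void addDataDep(TaskNode producer) { dataDeps.add(producer); }
    public void addControlDep(TaskNode predecessor) { controlDeps.add(predecessor); }

    // A node is ready once all of its control flow predecessors are done.
    public boolean ready(Set<TaskNode> done) {
        return done.containsAll(controlDeps);
    }
}
```

A scheduler simulator could repeatedly collect the ready nodes, pick among them, and mark them done, which is exactly the ordering freedom the control flow graph is meant to expose.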
for all input data items do
    (1) fetch one input data v
    for all distinct items appeared do
        (2) create or update a border point for v
        (3) update summary
        (4) update frequency
        (5) delete obsolete border points
    end for
    (6) update pruning threshold
end for

Figure 7. A pseudo-code of top-k (min summary).
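The shape of the loop in Figure 7 can be sketched in Java as follows. This is a heavily simplified stand-in: the real min summary algorithm of Lam et al. [7] maintains border points over a flexible sliding window, while here a plain counter per item plays that role, so only the loop structure (update the summary, prune obsolete entries, then raise the pruning threshold) is illustrated. All names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the stream processing loop in Figure 7.
// A plain per-item counter stands in for the border points of the
// actual min summary algorithm; names are illustrative assumptions.
public class TopKSketch {
    private final Map<String, Long> summary = new HashMap<>();
    private long pruningThreshold = 0;
    private final int capacity;

    public TopKSketch(int capacity) { this.capacity = capacity; }

    // One iteration of "for all input data items do" in Figure 7.
    public void offer(String item) {
        // (2)-(4) create or update the entry for the item, update frequency
        summary.merge(item, 1L, Long::sum);
        // (5) delete obsolete entries at or below the pruning threshold
        summary.values().removeIf(c -> c <= pruningThreshold);
        // (6) update the pruning threshold when the summary grows too large
        if (summary.size() > capacity) {
            pruningThreshold = summary.values().stream().min(Long::compare).orElse(0L);
        }
    }

    public Long count(String item) { return summary.get(item); }
    public int size() { return summary.size(); }
}
```

Even in this reduced form, step (6) must scan the whole summary to find the least frequent entry, which mirrors the observation below that node 6 is the most expensive node and a consumer of node 5.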

         
with "(2-1)" and "(2-2)" in the pseudo code in Figure 4. Here, as determined from the pseudo code in Figure 4, the basic block indicated by node 2 is data parallel. Therefore, in the data dependency graph, several node 2s are located in the same level of the DAG. Logically, there is no limit on the number of node 2s in this case; therefore, a user can put in as many node 2s as desired.

A control flow graph is also drawn via an analysis of the actual implementation of the target algorithm, and the basic definitions are almost the same as in the case of a data dependency graph. Figure 3(b) is the control flow graph for the Naïve Bayes classifier, and the corresponding pseudo code is shown in Figure 4. Each node again represents a basic block; however, each arrow in a control flow graph represents the order in which the basic blocks are processed. That is, in Figure 3(b), node 2 always has to be processed after node 1 is completed. On the other hand, nodes without arrows in between do not have any ordering restriction. Therefore, these nodes can be executed in a shuffled order, or at the same stage. As with the data dependency graph, a control flow graph consists of the minimum but sufficient DAGs for simplicity.

A control flow graph has a computational cost for each node. The computational cost shown in a control flow graph is the average of the actual computational costs measured in actual computations; however, this version of the control flow graphs does not contain communication costs. As the control flow graphs here are fine-grained, it is not beneficial to scatter one control flow graph over a distributed computing environment. That is, pipelining control flow graphs according to the speed of the input data is a more realistic and practical solution. Communication costs for pipelining in a distributed computing environment require further discussion, and we reserve this topic for future work.

Figure 5 is the XML schema for a task graph, and Figure 6 is an example XML representation for Figure 3. Task graphs should also be represented in XML according to this schema, and a designer of scheduling simulators can easily employ a task graph as a benchmark by reading this XML.

Figure 6. XML representation for the task graph in Figure 3.

4. ACTUAL TASK GRAPHS
This section introduces task graphs extracted from actual popular methodologies. One is top-k, implemented as a Java 1.7 application. The other is the Hoeffding tree algorithm [6], which is one of the decision tree algorithms, implemented as a MOA module [2].

We implemented top-k based on the min summary algorithm proposed by Lam et al. [7], and the base proposal by Calders et al. [3]. Figure 7 is the pseudo-code of the corresponding algorithm. Figure 8(a) represents the extracted data dependency graph, and Figure 8(b) represents the extracted control flow graph.

As we already saw in the generic model of stream mining algorithms in Section 2, each node processing one data unit basically depends on the results of the previous node. That is, each node is a consumer of the previous node. The exceptions are node 1 (data fetching) and node 6 (update of the pruning threshold). In particular, node 6 updates the pruning threshold based on the length of the summary, and node 6 has to wait for the elimination of the obsolete border points, which is node 5. On the other hand, the process represented by nodes 2 to 5 is independent across the distinct items appearing during the observation, and this part is capable of parallel execution.

When we focus on the dependency between the preceding process and the successive process, node 2 in the successive process depends on node 5 in the preceding process. Node 5 in the preceding process deletes the obsolete border points, while node 2 adds a new border point, or increments the counter of an existing border point according to the input. There is no dependency when node 2 adds a new border point; however, node 2 needs to decide which border point should be updated when it increments the count of an existing border point. This is the reason why node 2 in the successive process behaves as a consumer of node 5 in the preceding process.

One more thing we would note here is that the computational cost of node 6 is relatively huge compared to the computational costs of the other nodes. The major reason for the heavy load of node 6 is that node 6 needs to calculate the maximum relative frequency of the least appearing item during the observation. Because of this process, node 6 is a consumer of node 5, needs to sweep all the data in the summary, and consumes more time for completion.

This tendency of the computational costs implies that execution in a pipeline is really effective for the min summary algorithm. In fact, the computational cost of node 6 is almost equivalent to the total cost of the single path of nodes 1-5, and node 6 is independent from these nodes. Therefore, nodes 1-5 and node 6 are capable of running in a pipeline, and consume almost the same computational time. That is, there is a chance to hide almost half of the execution time for one data unit, and to improve the throughput by pipelining.

Figure 8. The data dependency graph (a), and the control flow graph (b), for top-k (min summary).

We also extracted a task graph of the Hoeffding tree algorithm from the MOA implementation. Figure 9 is the pseudo-code of the algorithm, and Figure 10 represents the extracted data dependency graph. The control flow graph is omitted due to the page limitation. We also skip the detailed discussion due to the page limitation; however, we can observe a similar tendency of the application as we saw in the generic model in Section 2, Naïve Bayes in Section 3, and the min summary algorithm in this section. One major difference from the previous cases is that node 1 depends on node 9; therefore, the effect of pipelining is not huge compared to the other cases.

Let HT be a tree with a single leaf (the root)
for all training data do
    (1) Fetch one training data v, and sort v into leaf l using HT
    for all attributes of v do
        (2) Update sufficient statistics in l
    end for
    (3) Increment nl, the number of examples seen at l
    if nl mod nmin = 0 and data seen at l not all of same class then
        (4) Compute Gl(Xi) for each attribute
        (5) Let Xa be the attribute with highest Gl
        (6) Let Xb be the attribute with second-highest Gl
        (7) Compute Hoeffding bound ε
        if Xa != Xb and (Gl(Xa) − Gl(Xb) > ε or ε < τ) then
            (8) Replace l with an internal node that splits on Xa
            for all branches of the split do
                (9) Add a new leaf with initialized sufficient statistics
            end for
        end if
    end if
end for

Figure 9. A pseudo-code of a training tree of the Hoeffding tree algorithm.

Figure 10. A data dependency graph for the Hoeffding tree algorithm.

Here, we need to discuss the computational costs in the control flow graph. This version of the task graph represents a computational cost as the average over actual executions. This is a somewhat simplified model, as the computational cost of a stream mining algorithm easily varies depending on the input data. We need to develop a better methodology for the computational model; however, we reserve this issue as future work.

5. RELATED WORK
There are several studies on task graph generation, mainly focusing on random task generation. A few projects reported task graphs generated based on actual well-known applications; however, those applications are numerical applications such as Fast Fourier Transformation, or applications familiar to the HPC community for a long time.

Task Graphs for Free (TGFF) provides pseudo-random task graphs [5,11]. TGFF allows users to control several parameters; however, it generates only directed acyclic graphs (DAGs) with one or multiple start nodes, and one or multiple sink nodes. Each task graph is assigned a period and a deadline based on the length of the maximum path in the graph and the user-specified parameters.

GGen is another random task graph generator, proposed by Cordeiro et al. [4]. GGen generates random task graphs according to well-known random task generation algorithms. In addition to the graph generator, GGen provides a graph analyzer, which characterizes randomly generated task graphs based on the longest path, the distribution of the out-degree, and the number of edges.

Task Graph Generator provides both random task graphs and task graphs from applications such as Fast Fourier Transformation, Gaussian Elimination, and LU Decomposition [12]. The random task graph generator supports a variety of network topologies, including star and ring. Task Graph Generator also provides scheduling algorithms as well.

Tobita et al. proposed the Standard Task Graph Set (STG), evaluated several scheduling algorithms, and published the optimal schedules for STG [10,13]. STG is a set of random task graphs, which are ready to download. Tobita et al. also provide task graphs from numerical applications such as a robot control program, a sparse matrix solver, and SPEC fpppp.

Besides the studies on task graph generation, Cordeiro et al. pointed out that randomly generated task graphs can create biased results, and that the biased results can mislead the analysis of scheduling algorithms [4]. According to the experiments by Cordeiro et al., the same scheduling algorithm may obtain a speedup of 3.5 times only by changing the graph generation algorithm used for the performance evaluation.

Random task graphs contribute positively to the evaluation of scheduling algorithms; however, they do not perfectly cover all the domains of parallel and distributed applications, as Cordeiro et al. figured out in their work. Especially for stream mining applications, which this paper focuses on, the characteristics of the application behaviors are quite different from the characteristics of the applications familiar to the conventional HPC community, as we discussed in Section 2. Task graphs generated from actual stream mining applications therefore have profound significance for the better optimization of such applications in parallel computing environments.

6. CONCLUSION
This paper proposed task graphs for stream mining algorithms. This is the first clear proposal of task graphs modeling stream mining algorithms, and the task graphs are extracted from actual implementations of popular existing methodologies. The task graphs proposed in this paper play an important role as a benchmarking tool to evaluate scheduling or load balancing algorithms, which is indispensable for research on scheduling or load balancing algorithms truly effective for stream mining algorithms. In fact, the task graphs proposed in this paper exhibit apparently different characteristics and dependencies compared to data intensive applications in HPC, and this fact points out that we need to consider scheduling methodologies focusing on stream mining algorithms. For a better set of task graphs, we are working on more stream mining algorithms.

7. REFERENCES
[1] Akioka, S., Muraoka, Y., Yamana, H., Data access pattern analysis on stream mining algorithms for cloud computation. In Proceedings of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA2011) (Las Vegas, USA, July 18-21, 2011), 2011, 36-42.
[2] Bifet, A., Holmes, G., Pfahringer, B., Kranen, P., Kremer, H., Jansen, T., Seidl, T., MOA: Massive online analysis, a framework for stream classification and clustering. Journal of Machine Learning Research (JMLR), Workshop and Conference Proceedings Vol. 11: Workshop on Applications of Pattern Analysis, 2010.
[3] Calders, T., Dexters, N., Goethals, B., Mining frequent itemsets in a stream. In Proceedings of the 2007 IEEE International Conference on Data Mining (ICDM2007) (Omaha, USA, October 28-31, 2007), 2007.
[4] Cordeiro, D., Mounié, G., Perarnau, S., Trystram, D., Vincent, J. M., Wagner, F., Random graph generation for scheduling simulations. In Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques (SIMUTools'10) (Torremolinos, Spain, March 15-19, 2010), 2010.
[5] Dick, R. P., Rhodes, D. L., Wolf, W., TGFF: Task graphs for free. In Proceedings of the International Workshop on Hardware/Software Codesign (Seattle, USA, March 15-18, 1998), 1998, 97-101.
[6] Domingos, P., Hulten, G., Mining high-speed data streams. In Proceedings of the 6th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'00) (Boston, USA, August 20-23, 2000), 2000, 71-80.
[7] Lam, H. T., Calders, T., Mining top-k frequent items in a data stream with flexible sliding window. In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'10) (Washington DC, USA, July 25-28, 2010), 2010.
[8] McCallum, A., Nigam, K., A comparison of event models for Naïve Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization (Madison, USA, July 26-27, 1998), 1998.
[9] Raicu, I., Foster, I. T., Zhao, Y., Little, P., Moretti, C. M., Chaudhary, A., Thain, D., The quest for scalable support of data intensive workloads in distributed systems. In Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC2009) (Munich, Germany, June 11-13, 2009), 2009, 207-216.
[10] STG, Standard task graph set. http://www.kasahara.elec.waseda.ac.jp/schedule/index.html.
[11] TGFF. http://ziyang.eecs.umich.edu/~dickrp/tgff/.
[12] TGG, Task graph generator. http://taskgraphgen.sourceforge.net/.
[13] Tobita, T., Kasahara, H., A standard task graph set for fair evaluation of multiprocessor scheduling algorithms. Journal of Scheduling, Volume 5, Issue 5, 2002, 379-394.