=Paper= {{Paper |id=None |storemode=property |title=Task Graphs of Stream Mining Algorithms |pdfUrl=https://ceur-ws.org/Vol-1018/paper10.pdf |volume=Vol-1018 |dblpUrl=https://dblp.org/rec/conf/vldb/Akioka13 }} ==Task Graphs of Stream Mining Algorithms== https://ceur-ws.org/Vol-1018/paper10.pdf
                    Task Graphs of Stream Mining Algorithms
                                                             Sayaka Akioka
                                                             Meiji University
                                                       4-21-1 Nakano, Nakano-ku
                                                        Tokyo, 164-8525, Japan
                                                           +81-3-5343-8305
                                                          akioka@meiji.ac.jp


ABSTRACT
Acceleration of huge data analysis, especially the analysis of huge and fast streaming data, is one of the major issues in recent computer science. Proper modeling and understanding of streaming data analysis are indispensable for the speed-up, scale-out, and faster response time of streaming data analysis. Especially for research on scheduling or load balancing algorithms, the model of the target application truly impacts the performance of those algorithms; however, there has been no study on realistic models or the actual behaviors of streaming data analysis yet. This paper proposes task graphs for stream mining algorithms, with some examples from actual applications. A task graph represents the workload of the target application with its data dependencies and control flows. This is the first proposal of task graphs for stream mining algorithms, and the task graphs play an important role as a benchmarking tool for the development of scheduling or load balancing algorithms targeting stream mining algorithms.

1. INTRODUCTION
Applications that process a massive amount of data, so-called "big data", are one of the recent hot research topics. Big data applications are sometimes considered to be quite similar to data intensive applications in high performance computing (HPC); however, the behaviors of applications in these two domains are quite different [9].

Big data applications often utilize stream mining algorithms, while data intensive applications process huge data in a batch. That is, a big data application often tries to analyze a data stream, which is a sequence of data arriving in chronological order, on the fly. As the data stream flows very fast, stream mining algorithms are developed with the goal of a complete analysis over such fast data flows. Once the analysis is delayed and fails to keep up with the data arrival, the whole process is forced to drop some of the arriving data. As many streaming analysis processes place emphasis on real-time analysis in chronological order, a drop of arriving data is highly critical.

As big data applications scale up under such a severe requirement for extremely low latency, they have come to run on parallel and distributed computing environments such as the computing cloud. In order to exploit parallelism and speed up the applications, scheduling is indispensable. Scheduling algorithms for parallel and distributed computing environments have been studied intensively for a long time, especially in HPC, and these studies often validate and compare scheduling algorithms with task graphs. A task graph represents the workload of a target application, which is often a synthetic workload generated randomly. As the quality of task graphs heavily impacts the validation of scheduling algorithms, methodologies to generate task graphs have been studied as well, with a strong focus on data intensive applications in HPC.

This paper proposes task graphs generated from actual implementations of stream mining algorithms in order to contribute to the development of effective and practical scheduling algorithms for stream mining algorithms. The contributions of this paper are 1) the first proposal of task graphs for stream mining algorithms, 2) practical and realistic workloads extracted from existing implementations, 3) task graphs as representations of the behaviors of stream mining algorithms, opening up unexplored problems for conventional scheduling algorithms, and 4) task graphs as a benchmarking tool to accelerate the development of scheduling algorithms for stream mining algorithms.

The rest of this paper is organized as follows. Section 2 gives a generic model of stream mining algorithms in order to clarify the data dependencies of the process. Section 3 describes the procedure of task graph generation, and proposes a format of task graphs for stream mining algorithms. Section 4 overviews the actual stream mining algorithms analyzed in this paper, and presents the corresponding task graphs. Section 5 briefly introduces related work, and Section 6 concludes this paper.

2. STREAM MINING ALGORITHMS
A stream mining algorithm is an algorithm specialized for data analysis over data streams on the fly. There are many variations of stream mining algorithms; however, general stream mining algorithms share a fundamental structure and data access pattern, as shown in Figure 1 [1].

A stream mining algorithm consists of two parts: a stream processing part, and a query processing part. First, the stream processing module picks the target data unit, which is a chunk of data that arrived in a limited time frame, and executes a quick analysis over the data unit. The quick analysis here can be a preconditioning process such as a morphological analysis or a word counting. Second, the stream processing module updates the data cached in one or more sketches with the latest results of the quick analysis. That is, the sketches keep the intermediate analysis, and the stream processing module updates the analysis incrementally as more data units are processed. Third, the analysis module reads the intermediate analysis from the sketches, and extracts the essence of the data in order to complete the quick analysis in the stream processing part. Finally, the query
processing part receives this essence for further analysis, and the whole process for the target data unit is closed.

Figure 1. A model of stream mining algorithms.

Based on the model above, we can conclude that the major responsibility of the stream processing part is to process each data unit for further analysis, and that the stream processing part has a huge impact on the latency of the whole process. The stream processing part needs to finish preconditioning the current data unit before the next data unit arrives; otherwise, the next data unit will be lost, as there is no storage for buffering the incoming data in a stream mining algorithm. On the other hand, the query processing part takes care of detailed analysis, such as a frequent pattern analysis or a hot topic extraction, based on the intermediate data passed by the stream processing part. The output of the query processing part is usually pushed into a database system, and there is no urgent demand for an instantaneous response. Therefore, only the stream processing part needs to run on a real-time basis, and the successful analysis of all the incoming data simply relies on the speed of the stream processing part.

The model of a stream mining algorithm shown in Figure 1 also indicates that the data access pattern of stream mining algorithms is totally different from the data access pattern of so-called data intensive applications, which are intensively investigated in HPC. The data access pattern of data intensive applications is write-once-read-many [9]. That is, the application refers to the necessary data many times during the computation; therefore, the key to the speedup of the application is to place the necessary data close to the computational nodes for faster data accesses. In a stream mining algorithm, on the other hand, a process refers to its data unit only once, which is a read-once-write-once style. Therefore, a scheduling algorithm for data intensive applications is not simply applicable for the purpose of speeding up a stream mining algorithm.

Figure 2 illustrates the data dependencies between two processes analyzing successive data units, and the data dependencies inside one process. The top left flow represents the stream processing part of the preceding process, and the bottom right flow represents the stream processing part of the successive process. Each flow consists of six stages: read from sketches, read from input, stream processing, update sketches, read from sketches, and analysis. An arrow represents a control flow, and a dashed arrow represents a data dependency.

In Figure 2, there are three data dependencies in total, as follows, and all three are essential to keep the analysis results consistent and correct.

1. The processing module in the preceding process should finish updating the sketches before the processing module in the successive process starts reading the sketches (Dep.1 in Figure 2).
2. The processing module should finish updating the sketches before the analysis module in the same process starts reading the sketches (Dep.2 in Figure 2).
3. The analysis module should finish reading the sketches before the processing module in the successive process starts updating the sketches (Dep.3 in Figure 2).

Figure 2. Data dependencies of the stream processing parts in two processes in line.

3. TASK GRAPH DEFINITIONS
As discussed in Section 2, a model of a stream mining algorithm has data dependencies both across processes and inside one process. Therefore, a task graph of a stream mining algorithm should consist of a data dependency graph and a control flow graph. We already modeled both the data dependencies and the control flows of stream mining algorithms in Section 2; however, a task graph is a finer grained model of a specific algorithm and implementation.

A data dependency graph is drawn via an analysis of the actual implementation of the target algorithm. Figure 3(a) is an example of a data dependency graph of the training stage of the Naïve Bayes classifier [8] implemented by the MOA project [2]. Figure 4 shows the pseudo code corresponding to the data dependency graph in Figure 3(a). A data dependency graph is a directed acyclic graph (DAG). In a data dependency graph, each node represents a basic block, or
Figure 3. The data dependency graph (a), and the control flow graph (b), for the Naïve Bayes implementation of MOA.
for all training data do
    (1) Fetch one training data v
    for all attributes of v do
        (2-1) Update the weight sum of this attribute.
        (2-2) Update the mean value of this attribute.
    end for
end for

Figure 4. The training stage of the Naïve Bayes algorithm.
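To make the structure of Figure 4 concrete, the training stage can be sketched in Java roughly as follows. The class and field names are illustrative assumptions of this rewrite, not taken from the MOA code base, and the incremental weighted-mean update is one common way to realize step (2-2):

```java
// Illustrative sketch of the training stage in Figure 4; names are
// hypothetical and do not come from the MOA implementation.
public class NaiveBayesTrainer {
    private final double[] weightSums;   // per-attribute weight sums
    private final double[] means;        // per-attribute running means

    public NaiveBayesTrainer(int numAttributes) {
        this.weightSums = new double[numAttributes];
        this.means = new double[numAttributes];
    }

    // (1) Fetch one training data v, then update every attribute.
    public void train(double[] v, double weight) {
        for (int i = 0; i < v.length; i++) {
            // (2-1) Update the weight sum of this attribute.
            weightSums[i] += weight;
            // (2-2) Update the weighted running mean of this attribute.
            means[i] += (v[i] - means[i]) * (weight / weightSums[i]);
        }
    }

    public double mean(int attribute) { return means[attribute]; }
    public double weightSum(int attribute) { return weightSums[attribute]; }
}
```

Note that the loop body touches only the entries of the attribute it is processing, which is why the basic block labeled node 2 in Figure 3(a) is data parallel.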
                                                                                  
an equivalent chunk of code in the actual implementation, and each arrow indicates a data dependency. If an arrow goes from node A to node B, the arrow indicates that there is a data dependency between node A and node B, and that the process represented by node B relies on the data generated by the process represented by node A for the consistency of the analysis.

The data dependency graph in Figure 3(a) actually consists of two DAGs: a DAG with nodes in white, and a DAG with nodes in gray. Each DAG represents one process in Figure 2. That is, the DAG with white nodes in Figure 3(a) indicates the preceding process in Figure 2, and the DAG with gray nodes in Figure 3(a) indicates the successive process in Figure 2. The arrows between the two DAGs represent the data dependencies between the two processes. In the case of stream mining applications, and this is the most significant difference from conventional applications, the application continues running as long as new data units arrive. A DAG with nodes of the same color represents one process for one data unit; therefore, DAGs should be lined up, one per data unit the corresponding application processes. In this case, two DAGs are sufficient to represent the minimum unit of the repeated pattern in the application, and the data dependency graph does not contain any more redundant DAGs, for a simple but sufficient representation.

In a data dependency graph, each node has a number, and the number indicates which basic block in the pseudo code the node represents, as shown in Figure 4. In this example, node 1 represents the line starting with "(1)" in the pseudo code in Figure 4, and node 2 represents the lines starting

Figure 5. XML schema for a task graph.
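The node-and-arrow structure described in this section can be captured by a small data structure. The following Java sketch is illustrative only; the class, field, and method names are assumptions of this rewrite, not the format defined by the XML schema in Figure 5:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of a task graph node as described in Section 3:
// a basic block with a numeric label, an average measured cost, data
// dependency edges (dashed arrows), and control flow edges (solid
// arrows). Names are hypothetical, not from the paper's XML schema.
public class TaskNode {
    public final int block;            // basic block number, e.g. 1 or 2
    public final double avgCost;       // average measured computational cost
    public final List<TaskNode> dataDeps = new ArrayList<>();    // dashed arrows
    public final List<TaskNode> controlDeps = new ArrayList<>(); // solid arrows

    public TaskNode(int block, double avgCost) {
        this.block = block;
        this.avgCost = avgCost;
    }

    // Node B depends on node A: A must finish before B may read its data.
    public void addDataDep(TaskNode producer) { dataDeps.add(producer); }
    public void addControlDep(TaskNode predecessor) { controlDeps.add(predecessor); }

    // A node is ready once all of its control flow predecessors are done.
    public boolean ready(Set<TaskNode> done) {
        return done.containsAll(controlDeps);
    }
}
```

A scheduler simulator could repeatedly collect the ready nodes, pick among them, and mark them done, which is exactly the ordering freedom the control flow graph is meant to expose.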
for all input data items do
    (1) fetch one input data v
    for all distinct items appeared do
        (2) create or update a border point for v
        (3) update summary
        (4) update frequency
        (5) delete obsolete border points
    end for
    (6) update pruning threshold
end for

Figure 7. A pseudo-code of top-k (min summary).
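The shape of the loop in Figure 7 can be sketched in Java as follows. This is a heavily simplified stand-in: the real min summary algorithm of Lam et al. [7] maintains border points over a flexible sliding window, while here a plain counter per item plays that role, so only the loop structure (update the summary, prune obsolete entries, then raise the pruning threshold) is illustrated. All names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the stream processing loop in Figure 7.
// A plain per-item counter stands in for the border points of the
// actual min summary algorithm; names are illustrative assumptions.
public class TopKSketch {
    private final Map<String, Long> summary = new HashMap<>();
    private long pruningThreshold = 0;
    private final int capacity;

    public TopKSketch(int capacity) { this.capacity = capacity; }

    // One iteration of "for all input data items do" in Figure 7.
    public void offer(String item) {
        // (2)-(4) create or update the entry for the item, update frequency
        summary.merge(item, 1L, Long::sum);
        // (5) delete obsolete entries at or below the pruning threshold
        summary.values().removeIf(c -> c <= pruningThreshold);
        // (6) update the pruning threshold when the summary grows too large
        if (summary.size() > capacity) {
            pruningThreshold = summary.values().stream().min(Long::compare).orElse(0L);
        }
    }

    public Long count(String item) { return summary.get(item); }
    public int size() { return summary.size(); }
}
```

Even in this reduced form, step (6) must scan the whole summary to find the least frequent entry, which mirrors the observation below that node 6 is the most expensive node and a consumer of node 5.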

         
with "(2-1)" and "(2-2)" in the pseudo code in Figure 4. Here, as determined from the pseudo code in Figure 4, the basic block indicated by node 2 is data parallel. Therefore, in the data dependency graph, several node 2s are located in the same level of the DAG. Logically, there is no limit on the number of node 2s in this case; therefore, a user can put in as many node 2s as desired.

A control flow graph is also drawn via an analysis of the actual implementation of the target algorithm, and the basic definitions are almost the same as in the case of a data dependency graph. Figure 3(b) is the control flow graph for the Naïve Bayes classifier, and the corresponding pseudo code is shown in Figure 4. Each node again represents a basic block; however, each arrow in a control flow graph represents the order in which the basic blocks are processed. That is, in Figure 3(b), node 2 always has to be processed after node 1 is completed. On the other hand, nodes without arrows in between do not have any ordering restriction. Therefore, these nodes can be executed in a shuffled order, or at the same stage. As with the data dependency graph, a control flow graph consists of the minimum but sufficient DAGs for simplicity.

A control flow graph has a computational cost for each node. The computational cost shown in a control flow graph is the average of the actual computational costs measured in actual computations; however, this version of the control flow graphs does not contain communication costs. As the control flow graphs here are fine-grained, it is not beneficial to scatter one control flow graph over a distributed computing environment. That is, pipelining control flow graphs according to the speed of the input data is a more realistic and practical solution. Communication costs for pipelining in a distributed computing environment require further discussion, and we reserve this topic for future work.

Figure 5 is the XML schema for a task graph, and Figure 6 is an example XML representation for Figure 3. Task graphs should also be represented in XML according to this schema, and a designer of scheduling simulators can easily employ a task graph as a benchmark by reading this XML.

Figure 6. XML representation for the task graph in Figure 3.

4. ACTUAL TASK GRAPHS
This section introduces task graphs extracted from actual popular methodologies. One is top-k, implemented as a Java 1.7 application. The other is the Hoeffding tree algorithm [6], which is one of the decision tree algorithms, implemented as a MOA module [2].

We implemented top-k based on the min summary algorithm proposed by Lam et al. [7], and the base proposal by Calders et al. [3]. Figure 7 is the pseudo-code of the corresponding algorithm. Figure 8(a) represents the extracted data dependency graph, and Figure 8(b) represents the extracted control flow graph.

As we already saw in the generic model of stream mining algorithms in Section 2, each node processing one data unit basically depends on the results of the previous node. That is, each node is a consumer of the previous node. The exceptions are node 1 (data fetching) and node 6 (update of the pruning threshold). In particular, node 6 updates the pruning threshold based on the length of the summary, and node 6 has to wait for the elimination of the obsolete border points, which is node 5. On the other hand, the process represented by nodes 2 to 5 is independent across the distinct items appearing during the observation, and this part is capable of parallel execution.

When we focus on the dependency between the preceding process and the successive process, node 2 in the successive process depends on node 5 in the preceding process. Node 5 in the preceding process deletes the obsolete border points, while node 2 adds a new border point, or increments the counter of an existing border point according to the input. There is no dependency when node 2 adds a new border point; however, node 2 needs to decide which border point should be updated when it increments the count of an existing border point. This is the reason why node 2 in the successive process behaves as a consumer of node 5 in the preceding process.

One more thing we would note here is that the computational cost of node 6 is relatively huge compared to the computational costs of the other nodes. The major reason for the heavy load of node 6 is that node 6 needs to calculate the maximum relative frequency of the least appearing item during the observation. Because of this process, node 6 is a consumer of node 5, needs to sweep all the data in the summary, and consumes more time for completion.

This tendency of the computational costs implies that execution in a pipeline is really effective for the min summary algorithm. In fact, the computational cost of node 6 is almost equivalent to the total cost of the single path of nodes 1-5, and node 6 is independent from these nodes. Therefore, nodes 1-5 and node 6 are capable of running in a pipeline, and consume almost the same computational time. That is, there is a chance to hide almost half of the execution time for one data unit, and to improve the throughput by pipelining.

Figure 8. The data dependency graph (a), and the control flow graph (b), for top-k (min summary).

We also extracted a task graph of the Hoeffding tree algorithm from the MOA implementation. Figure 9 is the pseudo-code of the algorithm, and Figure 10 represents the extracted data dependency graph. The control flow graph is omitted due to the page limitation. We also skip the detailed discussion due to the page limitation; however, we can observe a similar tendency of the application as we saw in the generic model in Section 2, Naïve Bayes in Section 3, and the min summary algorithm in this section. One major difference from the previous cases is that node 1 depends on node 9; therefore, the effect of pipelining is not huge compared to the other cases.

Let HT be a tree with a single leaf (the root)
for all training data do
    (1) Fetch one training data v, and sort v into leaf l using HT
    for all attributes of v do
        (2) Update sufficient statistics in l
    end for
    (3) Increment nl, the number of examples seen at l
    if nl mod nmin = 0 and data seen at l not all of same class then
        (4) Compute Gl(Xi) for each attribute
        (5) Let Xa be the attribute with highest Gl
        (6) Let Xb be the attribute with second-highest Gl
        (7) Compute Hoeffding bound ε
        if Xa != Xb and (Gl(Xa) − Gl(Xb) > ε or ε < τ) then
            (8) Replace l with an internal node that splits on Xa
            for all branches of the split do
                (9) Add a new leaf with initialized sufficient statistics
            end for
        end if
    end if
end for

Figure 9. A pseudo-code of a training tree of the Hoeffding tree algorithm.

Figure 10. A data dependency graph for the Hoeffding tree algorithm.

Here, we need to discuss the computational costs in the control flow graph. This version of the task graph represents a computational cost as the average over actual executions. This is a somewhat simplified model, as the computational cost of a stream mining algorithm easily varies depending on the input data. We need to develop a better methodology for the computational model; however, we reserve this issue as future work.

5. RELATED WORK
There are several studies on task graph generation, mainly focusing on random task generation. A few projects reported task graphs generated based on actual well-known applications; however, those applications are numerical applications such as Fast Fourier Transformation, or applications familiar to the HPC community for a long time.

Task Graphs for Free (TGFF) provides pseudo-random task graphs [5,11]. TGFF allows users to control several parameters; however, it generates only directed acyclic graphs (DAGs) with one or multiple start nodes, and one or multiple sink nodes. Each task graph is assigned a period and a deadline based on the length of the maximum path in the graph and the user-specified parameters.

GGen is another random task graph generator, proposed by Cordeiro et al. [4]. GGen generates random task graphs according to well-known random task generation algorithms. In addition to the graph generator, GGen provides a graph analyzer, which characterizes randomly generated task graphs based on the longest path, the distribution of the out-degree, and the number of edges.

Task Graph Generator provides both random task graphs and task graphs from applications such as Fast Fourier Transformation, Gaussian Elimination, and LU Decomposition [12]. The random task graph generator supports a variety of network topologies, including star and ring. Task Graph Generator also provides scheduling algorithms as well.

Tobita et al. proposed the Standard Task Graph Set (STG), evaluated several scheduling algorithms, and published the optimal schedules for STG [10,13]. STG is a set of random task graphs, which are ready to download. Tobita et al. also provide task graphs from numerical applications such as a robot control program, a sparse matrix solver, and SPEC fpppp.

Besides the studies on task graph generation, Cordeiro et al. pointed out that randomly generated task graphs can create biased results, and that the biased results can mislead the analysis of scheduling algorithms [4]. According to the experiments by Cordeiro et al., the same scheduling algorithm may obtain a speedup of 3.5 times only by changing the graph generation algorithm used for the performance evaluation.

Random task graphs contribute positively to the evaluation of scheduling algorithms; however, they do not perfectly cover all the domains of parallel and distributed applications, as Cordeiro et al. figured out in their work. Especially for stream mining applications, which this paper focuses on, the characteristics of the application behaviors are quite different from the characteristics of the applications familiar to the conventional HPC community, as we discussed in Section 2. Task graphs generated from actual stream mining applications therefore have profound significance for the better optimization of such applications in parallel computing environments.

6. CONCLUSION
This paper proposed task graphs for stream mining algorithms. This is the first clear proposal of task graphs modeling stream mining algorithms, and the task graphs are extracted from actual implementations of popular existing methodologies. The task graphs proposed in this paper play an important role as a benchmarking tool to evaluate scheduling or load balancing algorithms, which is indispensable for research on scheduling or load balancing algorithms truly effective for stream mining algorithms. In fact, the task graphs proposed in this paper exhibit apparently different characteristics and dependencies compared to data intensive applications in HPC, and this fact points out that we need to consider scheduling methodologies focusing on stream mining algorithms. For a better set of task graphs, we are working on more stream mining algorithms.

7. REFERENCES
[1] Akioka, S., Muraoka, Y., Yamana, H., Data access pattern analysis on stream mining algorithms for cloud computation. In Proceedings of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA2011) (Las Vegas, USA, July 18-21, 2011), 2011, 36-42.
[2] Bifet, A., Holmes, G., Pfahringer, B., Kranen, P., Kremer, H., Jansen, T., Seidl, T., MOA: Massive online analysis, a framework for stream classification and clustering. Journal of Machine Learning Research (JMLR), Workshop and Conference Proceedings Vol. 11: Workshop on Applications of Pattern Analysis, 2010.
[3] Calders, T., Dexters, N., Goethals, B., Mining frequent itemsets in a stream. In Proceedings of the 2007 IEEE International Conference on Data Mining (ICDM2007) (Omaha, USA, October 28-31, 2007), 2007.
[4] Cordeiro, D., Mounié, G., Perarnau, S., Trystram, D., Vincent, J. M., Wagner, F., Random graph generation for scheduling simulations. In Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques (SIMUTools'10) (Torremolinos, Spain, March 15-19, 2010), 2010.
[5] Dick, R. P., Rhodes, D. L., Wolf, W., TGFF: Task graphs for free. In Proceedings of the International Workshop on Hardware/Software Codesign (Seattle, USA, March 15-18, 1998), 1998, 97-101.
[6] Domingos, P., Hulten, G., Mining high-speed data streams. In Proceedings of the 6th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'00) (Boston, USA, August 20-23, 2000), 2000, 71-80.
[7] Lam, H. T., Calders, T., Mining top-k frequent items in a data stream with flexible sliding window. In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'10) (Washington DC, USA, July 25-28, 2010), 2010.
[8] McCallum, A., Nigam, K., A comparison of event models for Naïve Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization (Madison, USA, July 26-27, 1998), 1998.
[9] Raicu, I., Foster, I. T., Zhao, Y., Little, P., Moretti, C. M., Chaudhary, A., Thain, D., The quest for scalable support of data intensive workloads in distributed systems. In Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC2009) (Munich, Germany, June 11-13, 2009), 2009, 207-216.
[10] STG, Standard task graph set. http://www.kasahara.elec.waseda.ac.jp/schedule/index.html.
[11] TGFF. http://ziyang.eecs.umich.edu/~dickrp/tgff/.
[12] TGG, Task graph generator. http://taskgraphgen.sourceforge.net/.
[13] Tobita, T., Kasahara, H., A standard task graph set for fair evaluation of multiprocessor scheduling algorithms. Journal of Scheduling, Volume 5, Issue 5, 2002, 379-394.