<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TTaasskk SScchheedduulliinngg iinn DDaattaa SSttrreeaamm PPrroocceessssiinngg?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zbynekˇ Falt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakub Yaghob ⋆ Zbynek Falt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakub Yaghob</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software Engineering, Charles University</institution>
          ,
          <addr-line>DMepaalrotsmtreanntskeo ́ fnaS ́mo.ft2w5a,r1e1E8n0g0inPerearginuge,, CChzaecrhlesreUpnuibvleicrsity, Malostran</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <abstract>
        <p>One possible technique of data processing is its transformation into a data stream and the execution of particular operations on the data tuples. These operations can be usually processed concurrently especially when the plan of operations is branched. Therefore, this way of data processing is suitable for evaluation in parallel environment. On the other hand, the ordering of the execution of tasks associated with the operations is closely related to the utilization of the hardware components. For that reason, the task scheduler is very important in these systems. In this paper, we introduce our implementation of a task scheduler which deals well with various hardware factors such as caches and NUMA1 factor.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>As the amount of data which are intended for processing becomes larger, new
approaches for their processing are researched. Since the performance of the
systems has been recently increased mainly through the addition of
computational units, the parallel processing of data is a natural evolution in this research
area. New algorithms, new approaches, and new technologies, which utilize this
direction of progress, are developed.</p>
      <p>On the other hand, the development of parallel data processing is more
difficult than single threaded development. The developer must solve issues such as
the decomposition of the problem to independent parts which may be processed
in parallel, thread management, their synchronization and communication.</p>
      <p>Of course, there exist many frameworks which try to make these things easier.
One of the approaches, that makes especially the processing of data queries
easier, is the transformation of input data into data streams and the execution
of particular operations on that streams. The execution plan looks like directed
graph where nodes are the operations and edges determine dataflow among the
operations. The biggest advantage of this approach is that developer of this plan
does not have to take parallelization into account since this is done automatically
⋆ This paper was supported by the grant SVV-2011-263312 and by the grant agency
of Charles University (GAUK), project no. 277911.</p>
      <p>1 Non-Uniform Memory Access
during the evaluation. It is possible, because operations on different parts of the
streams or whole branches of the plan may be processed concurrently.</p>
      <p>
        One of the systems that implements this idea is Bobox [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The main aim of
the Bobox project is to process semantic and semistructured data effectively [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Currently, support for SPARQL [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] query language is under development and
partial support for XQuery [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Tri Query [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] language is already developed.
      </p>
      <p>The contribution of this paper is a presentation of the scalable scheduling
techniques for systems which process data streams. These techniques deal well
with different hardware factors such as size of caches and NUMA factor.
Therefore they enable to evaluate data queries faster on modern systems. Although
they are analyzed and tested on the Bobox system, they can be easily adopted
by other similar systems.</p>
      <p>The remainder of this paper is as follows. In the Section 3.1, there is the
description of execution plan evaluation and tasks creation. In the Section 4, we
measure and analyze the influence of various hardware factors on the efficiency
of the system. The Section 5, contains detailed description of the implementation
of our scheduler and in the Section 6, we present and analyze an experimental
comparison between our old simple and new implementation of the scheduler.
The Section 7 summarizes the results and introduces our future plans.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Parallel processing of data stream is a very actual research topic and many
systems similar to the Bobox are being developed, for example Borealis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
STREAM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Nile [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or NiagarsST [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, the aim of all the systems
is mainly the processing of an infinite data streams or real-time data streams.
This is one of the biggest differences between them and the Bobox, because the
main scope of the Bobox is efficient processing of the offline finite data streams.
      </p>
      <p>
        Of course, these differences determine different requirements on the scheduler
and the scheduling strategies. The common requirements for all systems are
throughput and efficient utilization of available resources. Additionally, in the
processing of infinite or real-time data streams, memory usage and response time
are also very important and some papers address this problems [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>In the Bobox-like systems the factors of response time do not have to be
taken into account, as they are not meant to be the frameworks for processing
real-time data. The finiteness of the input data ensures that the memory usage
will not grow beyond control.</p>
      <p>The other major difference is that the Bobox supposes there is no parallelism
in the operator evaluation. This means that in contrary to other systems each
operator is allowed to run only on at most one thread. This restriction makes
operators implementation easier. On the other hand, the only way of achieving
concurrency in the operator implementation is their decomposition to several
atomic operations. For example, sorting can be decomposed to several single
threaded sorting operators, which might be evaluated in parallel. Their results
can be then merged together by the net of merge operators.</p>
      <p>Because one operation can be composed of many partial operations, the
scheduler must be optimized according to this fact. The main reason is that
the communication among these suboperations is more intensive than among
the operators. To deal with it, our scheduler optimizes the work with the CPU
cache and tries to keep this communication in the caches as much as possible.</p>
      <p>
        The strategy implemented in the paper [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] also optimizes the utilization
of the CPU caches. However, in contrast to our work, it tries to optimize the
accesses to caches during operator evaluation. We focused on the utilization of
the caches to speed up the communication among operators.
      </p>
      <p>
        Problems of task scheduling are also a very important part of frameworks
or environment which are not related to data streams processing, as shown for
instance in TBB [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] or OpenMP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. As these papers take into account
cache related problems and solve them via increasing the data locality, they are
not concerned with factors such as NUMA factor.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation of an execution plan</title>
      <p>As mentioned briefly in Section 1, the execution plan looks like a directed graph.
The nodes of this graph represent operators and the edges determine the dataflow
in the plan. Operators can have an arbitrary number of input edges through
which they receive data tuples, and an arbitrary number of output edges along
which they send the resulting tuples. Since tuples are typically small and some
overhead is related to their transfer, they are grouped together into packets.
This grouping helps to reduce the communication overhead. The other
important thing is, that the operators are allowed to communicate with each other
only through these packets. Thus, they should not share any variables or other
resources.</p>
      <p>We denote executions plans as requests in the rest of the paper, because each
plan is typically related to some data query request.
3.1</p>
      <sec id="sec-3-1">
        <title>Tasks creation and their types</title>
        <p>The fact that each operator can run on at most one thread determines the set
of states of each operator. These states are:
– Nothing – The operator has nothing to do.
– Scheduled – The operator has been scheduled to perform its operation.
– Running – The operator is processing input data or performing another
operation. After that it will be switched to the Nothing state.
– Running and scheduled – The operator has been scheduled during the
Running state. When the operator finishes its operation it will be switched to
the state Scheduled. The scheduling requests are serialized in this way and
therefore the operator cannot be run twice at a time even if it is scheduled
multiple times.</p>
        <p>When a packet is received by the operator, it is switched to state Scheduled or
Running scheduled. Every time the operator is switched to the state Scheduled,
new task is created. The objective of the scheduler is to select tasks and run
them on the threads which it has chosen.</p>
        <p>The receipt of the packet is not the only event when the task is created. The
other situation is, for example, when the operator which generates new data
sends data packet, but it still has many other packets remaining. In this case,
the operator schedule itself again after the packet is sent and the next time it is
run, it sends the next packet etc.</p>
        <p>It is obvious that the tasks associated with packet reception require a bit
different handling. The fact that the packet was received probably means that
the content of the packet is hot in cache. Thus, the earlier the associated task will
be performed, the faster this performance probably will be. Other types of tasks
are not directly associated with an packet, therefore it does not matter when or
even where these tasks will be run. The first task type is called immediate and
the second task type is called deferred.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Factors influencing efficiency</title>
      <p>This section contains a list of the most important hardware factors which the
scheduler should take care of. The importance of these factors is experimentally
verified on the Bobox system.
4.1</p>
      <sec id="sec-4-1">
        <title>Cache</title>
        <p>One of the goals we wanted to reach was an effective utilization of caches. One
way of doing it is to try to keep the content of packets in the cache. Therefore,
the packet transfer does not cause many cache misses. If we consider only this
factor, we conclude that the smaller packets the better the performance is since
they fit better into the cache. On the other side, there is some overhead with this
transfer such as synchronization of data structures, scheduler decision algorithms
etc. If we consider only this overhead, then bigger packets decrease this overhead
and increase the efficiency.</p>
        <p>Our task was to find a reasonable compromise between these two extremes.
We measured the relation between the size of packets and the efficiency of the
system. We executed the same request with different packet sizes. The results are
shown in Figure 1 and confirm the hypothesis. With smaller packets the overhead
grows up and with larger packets there is an overhead with cache misses.</p>
        <p>According to the plot, the ideal size of packets is slightly lower than the
size of L2 cache, which is 256KB (see Section 6 for more detailed hardware
specification). It can be easily explained since not only the packets are stored
in the cache, but there are also data structures of scheduler, memory allocator
or operators. These data structures are heavily accessed during the evaluation,
so keeping them also in the cache improves efficiency. Therefore the packet size
must be lowered by the size of these structures in order that they all fit into the
cache. Unfortunately, the size of these structures is hard to estimate.</p>
        <p>800 1000 1200 1400 1600 1800 2000</p>
        <p>Size of envelope [kB]
The number of processors in a SMP system cannot grow infinitely. As the
number of processors in the system is increasing, shared resources are turning into
bottlenecks. The most critical shared resource is the main memory. Therefore
NUMA systems are being developed as a solution to this problem.</p>
        <p>Basically, the NUMA system consists of several nodes. Each node acts as
SMP system, i.e. it has its own local main memory. These nodes are connected
together and the whole system shares the physical address space. Thus, one
node can access local memory of another node, but this access is slower than
an access to its local memory. This slow down is known as NUMA factor. The
NUMA factor tells us how many times slower is access to non-local memory in
the comparison to access to local memory.</p>
        <p>It is obvious that processor should access preferably its local memory. In
the opposite case, the computation is slowed down by the NUMA factor. Of
course, this problem is related more to memory allocation than to scheduling, but
NUMA non-aware scheduler can cause performance penalties, too. For example
when the scheduler move a computation, which has allocated some objects in its
local memory, from one node to another.</p>
        <p>To measure the influence of the NUMA factor on the Bobox system, we did
the following test. Firstly we performed a request on one NUMA node and all
allocations return its local memory. Secondly, we performed the same request on
the same NUMA node, but all the memory was allocated on another node. The
results (see 2) proves that the influence of NUMA factor cannot be ignored.</p>
        <p>For a comparison, we performed synthetic test which measures the NUMA
factor of our testing system. The test measured time needed for reversion of the
order of items in large array of integers. The test were run for all combinations
of nodes where the algorithm was performed and nodes where the memory was
179.9 190.2</p>
        <p>Same node
Different nodes
allocated. The results are shown in the Table 1, where all numbers are
recalculated relatively to the lowest times (which are naturally on the diagonal). Nodes
are numbered from 0 to 3. The factor is quite low in the system, so the difference
between both methods in the Figure 2 is not so significant, but the higher the
factor is, the bigger difference will be.
Data structures of the scheduler reflect the hardware structure of the host system
and the fact that system may be evaluating many requests at a time. Therefore
each NUMA node contains the list of all requests it is currently evaluating, each
request contains the list of all deferred tasks and each thread contains the list
of immediate tasks which were created by this thread. When the host system
is not NUMA system but ordinary SMP, it is still considered a NUMA system
with only one node.</p>
        <p>The algorithm of what to do when a new task is created is quite simple,
because the real work is done by the second part of the algorithm, which is
described below. A new immediate task is inserted into the list of immediate
tasks of the thread which created the task. A new deferred task is inserted into
the list of deferred tasks of the request to which the task belongs.</p>
        <p>When a worker thread finishes its current task, it starts to find another
task which it will execute. The algorithm which selects this task is a bit more
sophisticated and its goal is to maximize the usage of the principle of locality,
which increases the efficiency of the system. Our strategy is to find the first task
in this order:
1. The newest task in the list of immediate tasks that belongs to the thread.</p>
        <p>Since this task is immediate, it is associated with some data packet which was
recently created. The content of the packet was probably in the cache after
the creation. The fact that the task is the newest increases the probability
that no other packet moves the packet out of the cache.
2. The oldest task in the list of deferred tasks that belongs to request, that was
evaluated by the thread for the last time. Each request has probably some
data loaded in the cache and switching to another request would cause more
cache misses than the continuation with the same request. There are two
reasons, why we decided to choose the oldest task from the list.
The first is, that if we chose the newest task, then the execution plan would
be evaluated in the DFS2 manner instead of BFS3. But BFS manner tends
to generate more concurrent tasks than the DFS, so the level of parallelism
is higher.</p>
        <p>The other reason is, that each operator has its own input buffers for incoming
packets and deferred tasks are typically used for deferred packet creation (see
Section 3). Thus, the later this task will be executed, the more probable is,
that the input buffer of the successive operator will be empty. This increases
probability, that the newly created packet will not be rejected by the operator
and also probability, that the oldest packets in the input buffer (i.e. packets
which the successive operator will process) are hot in cache.
3. The oldest existing task of the oldest request which the current node is
evaluating. Therefore old requests are processed primarily. Each request allocates
many resources during its evaluation – memory, files etc. These resources
are usually allocated successively during the request evaluation. It is
obvious that if all request were processed with the same priority, then the total
amount of allocated resources would be increasing faster than if one request
is processed preferably and others are suppressed. Additionally, when the
oldest request is finished, its allocated resources will be freed and will be
available for other requests. Therefore, this strategy tries to minimize the
number of allocated resources during requests evaluation.
4. The oldest task in the list of immediate tasks of another thread from the
same node. This stealing of the task violates the principle of locality, on the
other hand it is better to utilize idle thread than let it sleep.
5. If no such task exists, then the seeking thread is suspended.
2 Depth-first search
3 Breadth-first search</p>
        <p>This architecture causes, that one request can be evaluated only on one
node at a time. This is a suitable behavior for NUMA systems because during
the evaluation threads work only with its local memory. On the other hand,
this might be a problem for request with high degree of parallelism. In that
case, only a part of the system performance is used for its evaluation. Another
problem is that system might become unbalanced, which means, that some nodes
are overloaded, while others are idle.</p>
        <p>The second problem is partially solved by the algorithm, which chooses a
node that has the most free resources available when a new request is created
and passes the request to that node. The algorithm assumes free resources to
be the free memory and the number of idle threads. This solution is partial
because when the request is created, it is unknown how many resources it will
be spending. Therefore the initial decision may be wrong in the global context.
This is solved by a dynamic load balancing described below.</p>
        <p>The first mentioned problem is solved by the load balancing as well.
5.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Other algorithms</title>
        <p>Load balancing When there is at least one idle logical processor in the system,
a special load balancing thread is running. This thread evaluates every 100ms
load of the system. When there is at least one overloaded and at least one
partially idle node, then load balancing algorithm tries to solve this imbalance.
The algorithm selects the newest request of the overloaded node and moves it
to the idle node. The idea is that the newest request has probably the smallest
amount of allocated memory, therefore this transfer should be the most painless.</p>
        <p>If there is no request to move (i.e. node is overloaded and is processing
only one request), then the request becomes shared with the idle node. At this
moment, the request starts to be evaluated by both nodes. Of course, it is possible
that one queue is shared among more than two nodes. This happens when the
request has a high level of parallelism and is able to overload more nodes.
Sleeping threads A very important attribute of task scheduler is whether it
can put to sleep the threads which have nothing to do. An active waiting for a
next task is really inefficient. There are three main reasons. The first one is that
an active waiting keeps the CPU units busy. For cores with Hyper-Threading
technology, where these units are shared between two logical processors, this
behavior is very undesirable since the waiting threads slow down the other threads.</p>
        <p>The second reason is, that an active waiting means a repeated checking
whether there is a new task. Because of the synchronization of the shared
structures this checking loads the bus, memory or cache coherency protocol. Thus,
this approach decreases performance even of non-waiting threads.</p>
        <p>And the last one is, that almost all modern systems have some kind of power
saving technology which switches idle processors to low-power state. Therefore,
thread sleeping reduces the power consumption of the host system.</p>
        <p>However, the thread sleeping brings one problem we have to cope with. When
the scheduler allows thread sleeping, there must be a mechanism for deadlock
prevention. This is related to the situation when there are more tasks than
running threads and though idle threads exist. This situation must be avoided.</p>
        <p>Our solution is quite simple. Each NUMA node keeps the number of the
tasks for which it is responsible, i.e. a node must ensure that all these tasks will
be processed. The thread must not be suspended when the number of tasks is
greater than the number of running tasks. Therefore all tasks for which the node
is responsible will be processed before all threads will be put to sleep. When any
task or request is created and there is an idle thread in the corresponding node,
then this thread is woken up.</p>
        <p>The implementation must also prevent the situation when some request is
shared but some nodes are idle because they are not responsible for any task of
this request. We have decided for a simple but effective solution – the
responsibilities for new deferred tasks of the request are distributed in the round-robin
way among all sharing nodes.</p>
        <p>The implementation of thread sleeping itself can also influence the
performance of the system. The straightforward solution is to created semaphore for
each node. The value of the semaphore would be equal to the number of tasks
for which is the node responsible. New task would increase the value and each
thread would decrease the value before it would try to find another task.</p>
        <p>Unfortunately each change of the semaphore value yields to a system call,
which can be quite slow. Our implementation tries to avoid this. Each thread
has its own binary semaphore. Moreover the number of running threads and the
number of tasks are counted in an atomic variable. Before a thread is suspended,
the number of running thread is increased and the semaphore of the thread is
acquired. When the number of running threads is lower than the number of
tasks, then the number of running threads is increased and one semaphore of
some selected thread is released.</p>
        <p>This implementation ensures, that when the node is fully loaded, i.e. all
threads are running, then no operation is called on semaphores. Furthermore
there is no global semaphore which could become a bottleneck.</p>
        <p>NUMA support NUMA system support is already integrated in the
architecture of the scheduler. Each node has its own task queue and the scheduler
tries to keep all the requests only on one node. Only in the case when one node
is overloaded it tries to share computations among more than one node. This
strategy prevents it from redundant movement of requests among the nodes.</p>
        <p>The next advantage follows from the fact that only the queues of deferred
tasks are shared. Deferred tasks are not related so tightly to data flow as the
immediate tasks. If an packet is created (and its memory allocated) and sent to
other operator, the task associated with this packet will be processed on the same
node (and the thread will work with its local memory). The resulting packet of
the operation will be processed on the same node as well for the same reason.</p>
        <p>On the other hand, access to non-local memory cannot be fully avoided,
because operators such as for example merge or join can have more inputs which
are processed together.
CPU cache support It follows from the measurement in the Section 4 that
performance of the system depends on the size of the packets and the size of the
CPU caches. The optimal size according to the results is the same as size of L2
cache lowered about the size of data structures of scheduler, allocator etc.</p>
        <p>The problem is that this size is hard to estimate, since it depends on many
factors, such as behavior of the operators or the structure of the input data. We
have decided to solve this problem in a very simple way. As the optimal size of
an packet, we consider one half of the L2 cache size. For this size, all experiments
gave the best results. But we plan to solve this problem better in future (see
Section 7).
6</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>To measure the benefits of the new scheduler, we used system with 4 Intel Xeon
E7540 running at 2GHz CPUs. Each processor has 6 cores with the
HyperThreading support. Altogether the whole system has 48 logical processors. Each
core has its own 32kB data and 32kB instruction L1 cache and 256kB L2 cache.
All cores in the one processor share the 18MB L3 cache. The system has 4 NUMA
nodes; each node has 32GB operating memory. The operating system is the Red
Hat Enterprise Linux.</p>
      <p>
        We used XQuery [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] compiler to generate sufficiently big execution plan which
consists of around 170 operators. As input data we used 10MB XML file. The
results do not include the time needed to read the input data from hard drive.
Each time in the chart is calculated as the median of 5 measurement of the
same test to eliminate inaccuracy. To measure scalability, we were increasing
the number of concurrent queries exponentially. The label N × M in the figure
means that we ran N times M queries concurrently.
      </p>
      <p>For comparison, we used very simple scheduler, which does not distinguish
between immediate and deferred tasks. Each thread keeps its own list of tasks
and processes this list in the FIFO order. If there is no task in its own list, it tries
to steal a task from other threads in round-robin order. For threads suspending,
one global semaphore is used. The value of the semaphore is kept the same as
the number of tasks. The size of packets is constant without respecting the cache
size.</p>
      <p>The result is shown in the Figure 3. Our new implementation of the scheduler
improves the performance about 8 – 12% according to the load of the system.
The experiments show that the described ideas are correct and help improve the
efficiency of the data processing.
7</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we proposed the architecture of task scheduler for the systems,
which process data streams in parallel environment. We also implemented this
scheduler and measured its behavior in the Bobox system. The new scheduler is
about 8 – 12 percent more efficient and is NUMA aware. Moreover the techniques,
]
s
[
e
iTm 100
50
0
80.2
71.0</p>
      <p>New scheduler
we described, can be used not only in our system, but it is possible to adopt
them easily by other similar systems, which contain some kind of task scheduling
mechanism.</p>
      <p>On the other hand, many challenges still remain. One of them is the analysis
of runtime characteristics of the request and dynamic changes of some
parameters according to the characteristics to achieve better results. This is related
mainly to the size of packets, which is hard to estimate statically without the
knowledge of exact behavior of operators.</p>
      <p>We not only plan to do optimizations for the size of cache but also for its
hierarchy. This means that we want to take cache sharing into account as well. This
brings some issues because it is hard to say which threads share caches and which
do not since the operating system can change the affinity of the threads
automatically. The straightforward solution of manual setting the affinities of threads
directly to logical processor does not work well, primarily when processors have
the Hyper-Threading technology. The reason is that the thread scheduler of the
operating system cannot perform any load balancing among logical processors,
which might yield to performance degradation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.J.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Cetintemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherniack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lindner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.S.</given-names>
            <surname>Maskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ryvkina</surname>
          </string-name>
          , et al.
          <article-title>The design of the borealis stream processing engine</article-title>
          .
          <source>In Second Biennial Conference on Innovative Data Systems Research (CIDR</source>
          <year>2005</year>
          ), Asilomar, CA. Citeseer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Arasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Babcock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Datar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ito</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Nishizawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rosenstein</surname>
          </string-name>
          , and
          <string-name>
            <surname>J. Widom.</surname>
          </string-name>
          <article-title>STREAM: the stanford stream data manager (demonstration description)</article-title>
          .
          <source>In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, page 665. ACM</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>B.</given-names>
            <surname>Babcock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Datar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Thomas</surname>
          </string-name>
          .
          <article-title>Operator scheduling in data stream systems</article-title>
          .
          <source>The International Journal on Very Large Data Bases</source>
          ,
          <volume>13</volume>
          (
          <issue>4</issue>
          ):
          <fpage>333</fpage>
          -
          <lpage>353</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Bedna´rek. Bulk Evaluation of User-Defined Functions in XQuery</article-title>
          .
          <source>PhD thesis</source>
          ,
          <source>Faculty of Mathematics and Physics</source>
          , Czech Republic,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Bedna´rek and</article-title>
          <string-name>
            <given-names>J.</given-names>
            <surname>Dokulil</surname>
          </string-name>
          . Tri Query:
          <article-title>Modifying XQuery for RDF and Relational Data</article-title>
          .
          <source>In 2010 Workshops on Database and Expert Systems Applications</source>
          , pages
          <fpage>342</fpage>
          -
          <lpage>346</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. D. Bedna´rek, J. Dokulil,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yaghob</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Zavoral. The Bobox Project - A Parallel Native</surname>
          </string-name>
          <article-title>Repository for Semi-structured Data and the Semantic Web</article-title>
          .
          <article-title>TAT 2009 - IX. Informcanˇe´ technool´gie - apliak´cie a toe´ria</article-title>
          , pages
          <fpage>44</fpage>
          -
          <lpage>59</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>David</given-names>
            <surname>Bednarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jiri</given-names>
            <surname>Dokulil</surname>
          </string-name>
          , Jakub Yaghob, and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Zavoral</surname>
          </string-name>
          .
          <article-title>Using methods of parallel semi-structured data processing for semantic web</article-title>
          .
          <source>Advances in Semantic Processing</source>
          , International Conference on,
          <volume>0</volume>
          :
          <fpage>44</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Boag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chamberlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.F.</given-names>
            <surname>Fernan</surname>
          </string-name>
          ´dez,
          <string-name>
            <given-names>D.</given-names>
            <surname>Florescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Robie</surname>
          </string-name>
          , J. Simeo´n, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Stefanescu</surname>
          </string-name>
          .
          <source>XQuery 1</source>
          .0:
          <string-name>
            <surname>An</surname>
            <given-names>XML</given-names>
          </string-name>
          <article-title>query language</article-title>
          .
          <source>W3C working draft</source>
          ,
          <volume>15</volume>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>R.</given-names>
            <surname>Chandra</surname>
          </string-name>
          .
          <article-title>Parallel programming in OpenMP</article-title>
          . Morgan Kaufmann,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>J. Cieslewicz</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Mee</surname>
            , and
            <given-names>K.A.</given-names>
          </string-name>
          <string-name>
            <surname>Ross</surname>
          </string-name>
          .
          <article-title>Cache-conscious buffering for database operators with state</article-title>
          .
          <source>In Proceedings of the Fifth International Workshop on Data Management on New Hardware</source>
          , pages
          <fpage>43</fpage>
          -
          <lpage>51</lpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.</given-names>
            <surname>Duran</surname>
          </string-name>
          , J. Corbaal´n, and
          <string-name>
            <surname>E. Ayguade.</surname>
          </string-name>
          ´
          <article-title>Evaluation of OpenMP task scheduling strategies</article-title>
          .
          <source>In Proceedings of the 4th international conference on OpenMP in a new era of parallelism</source>
          , pages
          <fpage>100</fpage>
          -
          <lpage>110</lpage>
          . Springer-Verlag,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>M.A. Hammad</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          <string-name>
            <surname>Mokbel</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>W.G.</given-names>
          </string-name>
          <string-name>
            <surname>Aref</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          <string-name>
            <surname>Catlin</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Eltabakh</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          <string-name>
            <surname>Elfeky</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          <string-name>
            <surname>Ghanem</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Gwadera</surname>
          </string-name>
          , et al.
          <article-title>Nile: A query processing engine for data streams</article-title>
          .
          <source>In Data Engineering</source>
          ,
          <year>2004</year>
          .
          <source>Proceedings. 20th International Conference on, page 851. IEEE</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakravarthy</surname>
          </string-name>
          .
          <article-title>Scheduling strategies for processing continuous queries over streams</article-title>
          .
          <source>Key Technologies for Data Management</source>
          , pages
          <fpage>16</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>D.</surname>
          </string-name>
          Maier et al.
          <source>NiagaraST</source>
          . http://datalab.cs.pdx.edu/niagara/,
          <year>2010</year>
          . [Online; accessed 21-December-2010].
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. E.
          <string-name>
            <surname>Prud'Hommeaux</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Seaborne</surname>
          </string-name>
          , et al.
          <article-title>SPARQL query language for RDF</article-title>
          .
          <source>W3C working draft, 4</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>J.</given-names>
            <surname>Reinders</surname>
          </string-name>
          .
          <article-title>Intel threading building blocks</article-title>
          .
          <source>O'Reilly</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ali</surname>
          </string-name>
          <article-title>A. Safaei and Mostafa S. Haghjoo. Parallel processing of continuous queries over data streams</article-title>
          .
          <source>Distrib. Parallel Databases</source>
          ,
          <volume>28</volume>
          :
          <fpage>93</fpage>
          -
          <lpage>118</lpage>
          ,
          <year>December 2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>