Eviction Strategies for Semantic Flow Processing⋆

Minh Khoa Nguyen, Thomas Scharrenbach, Abraham Bernstein
University of Zurich, Switzerland
{nguyen,scharrenbach,bernstein}@ifi.uzh.ch

Abstract. In order to cope with the ever-increasing data volume, continuous processing of incoming data via Semantic Flow Processing systems has been proposed. These systems answer queries on streams of RDF triples. To achieve this goal they match (triple) patterns against the incoming stream and generate/update variable bindings. Yet, given the continuous nature of the stream, the number of bindings can explode and exceed memory, in particular when computing aggregates. To make information processing practical, Semantic Flow Processing systems therefore typically limit the considered data to a (moving) window. Whilst this technique is simple, it may not be able to find patterns spread further than the window, or it may still cause memory overruns when data is highly bursty. In this paper we propose to maintain bindings (and thus memory) based not on recency (i.e., a window) but on the likelihood of contributing to a complete match. We propose to base the eviction decision on the matching likelihood rather than on creation time (fifo) or at random. Furthermore, we propose to drop variable bindings instead of data, as load shedding approaches do. Specifically, we systematically investigate deterministic and matching-likelihood-based probabilistic eviction strategies for dropping variable bindings in terms of recall. We find that a matching-likelihood-based eviction can outperform fifo and random eviction strategies on synthetic as well as real-world data.

Keywords: Semantic Flow Processing, Linked Data stream processing, eviction strategies, load shedding, matching likelihood estimation on streams

1 Introduction

Information processing increasingly incorporates query matching on dynamically changing information (i.e., information flows) [5].
Based on these, Information Flow Processing (IFP) systems such as C-SPARQL [4], EP-SPARQL [1], or CQELS [10] emerged that are capable of processing flows of semantically annotated data. All of these Semantic Flow Processing (SFP) systems allow defining queries in a language that extends SPARQL with algebra adjustments for matching queries on flows of time-annotated RDF triples.

⋆ The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2011 under grant agreement no 296126.

Each SFP system works with limited memory resources. They constrain the number of valid partial variable bindings (i.e., bindings that are not yet complete solutions to a query) by applying a window on the stream. Such a window limits the scope of query matching to, for example, a period of time or to a certain number of triples. On the one hand, this limits the scope of answers the system can find: if a match requires two data items that do not fall in the same scope, the system will never be able to perform such a match. On the other hand, even the limited scope cannot guarantee that the system will never exceed its memory resources. Consider the situation when a popular TV broadcast, such as a football match, starts and many users switch to it at the same time. In this case a sudden burst in the input data could cause the system to exceed its resources. The situation gets worse when we are not only interested in the number of viewers for a show but want to track the users – the literature refers to the latter case as non-shrinking semantics [3]. To our knowledge no current SFP approach has a solution for freeing memory in case it exceeds the available resources. While proposals have been made for load shedding—dropping or ignoring incoming data—in IFP systems, only one of them investigated load shedding according to a statistical model of the matching history [6].
We believe that instead of ignoring incoming data one should drop variable bindings based on the likelihood that they lead to a result. We refer to this as eviction strategies, in contrast to load shedding. Eviction strategies are more flexible than load shedding: if we drop a data item we may drop information too early in the matching process, whereas eviction allows us to define at which point of the matching process we want to delete items. While there exist deterministic approaches for dropping variable bindings, such as first-in-first-out (fifo), we propose to use the matching likelihood as the dropping criterion. In this paper we hence investigate the performance of different eviction strategies.

Note that an eviction strategy is not the same as garbage collection. While the latter removes those variable bindings that will not contribute to a solution, the former removes variable bindings that have the potential to do so but cannot be kept due to memory limitations. As a consequence, applying an eviction strategy could lead to incomplete results for the sake of feasibility.

In this paper we propose a probability-based eviction strategy that estimates the likelihood of a binding to contribute to a result, in contrast to using a fixed time window for load shedding. Note that this paper intends to show that a probability-based model can outperform fifo and random load shedding but does not provide a model for estimating those probabilities on the go. We investigate the effect our approach has on the completeness of results compared with deterministic approaches. We systematically investigated the challenges arising from the limited memory resources of regular joins for SFP systems. If the partial bindings for the potential join partners exceed the memory we start dropping partial results. Our findings enable an SFP system to trade off completeness versus resources.
To ensure methodological correctness we followed the PCKS approach [11] for defining stress tests for SFP systems. As the relevant property of an SFP system we identify completeness as a Quality of Service constraint. The corresponding challenge we want to tackle is the ability to handle bursts in the data. We therefore define recall as the key performance indicator for the evaluations of the tests we performed in the scope of this paper. In order to test the system we check how it behaves under restricted and under unrestricted resources.

In our experiments we compared our probabilistic method based on the matching likelihood of partial bindings with the deterministic first-in-first-out (fifo) method as well as random load shedding. We show that our statistical approach can outperform fifo and random load shedding both on simulated data as well as on a real-world dataset of viewership behavior of IPTV users. The improvement ranges from 14% to 50% on synthetic and real-world data.

The remainder of this paper is structured as follows: next we formally specify the context of SFP and then introduce our load-shedding approach. We then evaluate our approach on synthetic and real-world data in Section 3. After a discussion of the results and limitations in Section 4 we present related work in Section 5 and conclude in Section 6.

2 Method: Likelihood-based Eviction of Partial Matches

This section first lays the formal groundwork for defining what eviction strategies are. Then it introduces our likelihood-based eviction strategy as well as the baseline first-in-first-out (fifo) and random strategies.

2.1 Semantic Flow Processing

SFP systems rely on an algebra that modifies the SPARQL algebra for query matching on flows of data. The result of an algebra operation is a set of variable bindings consisting of a finite number of pairs ?var/val, where ?var is a variable name and val is an RDF term.
A query is a tree of algebra expressions where results (i.e., bindings) are propagated from the leaves to the root. An SFP system, hence, consumes time-stamped RDF triples <s, p, o>[time] that are then consumed (or evaluated) by the algebra operations defined in the query tree. Hereby the system updates the bindings for all affected levels of the query tree, and bindings at the root level are emitted as results. In our discussions we rely upon the definitions of SPARQL and RDF as given in the SPARQL query specification [8].

Whilst we limit our discussions to joins and aggregates, our approach is by no means limited to these operations. A semantic flow F is a finite set of time-stamped RDF triples. We denote algebra expressions by α. The evaluation of an algebra operation α and a finite set of variable bindings Ω on a semantic flow again produces a (possibly new/updated) set of variable bindings eval(α, Ω, F) = Ω′. In a query tree all algebra expressions consume variable bindings (and a semantic flow), perform the corresponding operations, and then emit variable bindings.¹

SPARQL query processing systems usually implement an iterator-based approach: since they know all relevant data beforehand, they can successively perform the algebra operations recursively and, hence, know when a binding was completely processed by an algebra expression. An SFP system, in contrast, has to keep its variable bindings to guarantee completeness. Consuming data in a streaming fashion turns the execution model upside-down: instead of recursively processing all available data for each algebra expression we have to continuously evaluate newly arriving data for all algebra expressions. Consequently, the algebra expressions in an SFP system never finish evaluating all available data and have to continuously maintain a list of corresponding variable bindings. Consider the simple join of two graph patterns ?a ab ?b and ?b bc ?c, for example.
Assuming that the SFP system matched the first pattern with the variable binding µ = {?a/a1, ?b/b1}, it is not possible to say if and when a match for the second graph pattern might happen.

Keeping all variable bindings, however, will eventually exceed the system's memory resources and make processing unfeasible. Therefore, SFP systems typically limit the amount of data they process by windowing, which limits the data considered for computing partial matches to a given period of time or number of triples. This allows marking bindings with an expiration time-stamp. A garbage collection process can then remove those variable bindings that become out-of-scope.

Unfortunately, windowing has its distinct problems. For example, queries that require matching patterns that unfold over periods longer than a practical window cannot be matched. In those cases there are two options: First, one can apply load shedding – an approach that drops data from the input stream, for example by only considering every x-th data item for processing. Load shedding is a well-established approach and we discuss its difference to eviction in Section 5 along with related work. Second, we may delete variable bindings instead of input data. In this paper we investigate this alternative approach, which we refer to as eviction: it leaves the input stream untouched but deletes intermediate variable bindings to make room for new ones. In the following we discuss this approach in detail.

2.2 Eviction: Dropping Variable Bindings

In order to limit the memory usage of an SFP system we propose to use eviction, which deletes partial bindings from the processing tree. When deleting partial matches (or shedding load in the input stream) we potentially sacrifice completeness. As a consequence, it is imperative to understand what consequence eviction has on completeness as a Quality of Service (QoS) criterion.
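To make the bindings-cache problem concrete, the following is a minimal, hypothetical sketch (in Python; all function and data names are our own illustration, not part of any SFP engine) of an incremental join of the two patterns ?a ab ?b and ?b bc ?c, which must retain every partial binding it has seen:

```python
# Illustrative sketch of an incremental two-pattern stream join.
# On an unbounded stream, both caches grow without bound.

def stream_join(triples):
    """Consume (s, p, o) triples; emit complete bindings for ?a ab ?b . ?b bc ?c."""
    left = []    # partial bindings from pattern ?a ab ?b
    right = []   # partial bindings from pattern ?b bc ?c
    results = []
    for s, p, o in triples:
        if p == "ab":
            left.append({"?a": s, "?b": o})
            for mu in right:                      # probe the other side on ?b
                if mu["?b"] == o:
                    results.append({"?a": s, "?b": o, "?c": mu["?c"]})
        elif p == "bc":
            right.append({"?b": s, "?c": o})
            for mu in left:
                if mu["?b"] == s:
                    results.append({"?a": mu["?a"], "?b": s, "?c": o})
    return results, left, right
```

Both `left` and `right` grow with the stream, which is exactly the memory pressure that windowing, load shedding, and eviction all try to limit.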
In this study we, therefore, compare the performance of the following eviction functions:

– random deletion,
– time-based deletion (e.g., delete the oldest bindings first), and
– data-driven deletion of bindings.

While the first is self-explanatory, the latter two require further explanation.

¹ Note that the leaves of the tree typically consume F and that the root may emit RDF data to a new flow F′.

Time-Based Deletion Time-based deletion is based on the assumption that the probability that a binding contributes to a result decreases with the amount of time it stays in the bindings cache. Whilst we know of no empirical evidence that this assumption holds for typical SFP benchmarks, it is used in systems such as SASE+ [7], where it is referred to as least-recently-used (LRU).² LRU can be implemented as a priority queue, which can in turn be implemented via a heap. Heap organization has a complexity of O(log n), where n is the number of elements in the bindings cache. Note that the actual performance of LRU can depend on the time model for the bindings. A time model is based on system time, on data time, or on both. Furthermore, an LRU eviction function must specify which time-stamp of a binding it uses: the creation time of the binding or the time it was last updated. In the scope of this work we use data time and the time-stamp of the last update of the binding.

Data-Driven Eviction Data-driven eviction—the core contribution of this paper—bases the eviction decision for a binding µ on the probability that µ contributes to a result. This requires an estimation of the matching probability for µ. To that end we introduce the notion of the transactional closure for a binding µ in Section 2.3, which permits counting the number of results it contributed to for a certain part of the stream data. Probability estimation can be performed either online or offline.
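The heap-backed priority queue just described can be sketched as follows (an illustrative Python fragment; the class and method names are our own assumptions, not the SASE+ or Katts API):

```python
import heapq

class TimeBasedCache:
    """Keep at most `limit` bindings; evict the least recently updated first."""

    def __init__(self, limit):
        self.limit = limit
        self.heap = []   # entries: (last_update_ts, seq, binding)
        self.seq = 0     # insertion counter so the heap never compares bindings

    def add(self, binding, ts):
        heapq.heappush(self.heap, (ts, self.seq, binding))  # O(log n)
        self.seq += 1
        while len(self.heap) > self.limit:
            heapq.heappop(self.heap)                        # oldest out first

    def bindings(self):
        return [b for _, _, b in self.heap]
```

Swapping the comparison attribute from the time-stamp to an estimated matching probability turns this same structure into the data-driven variant.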
In the online case the contribution probability has to be estimated along with the actual query processing, which is a prerequisite in cases where the data cannot be stored for offline analysis. In this paper we assume that we can compute the probabilities offline and describe the method for estimating them in Section 2.4.

When applying data-driven eviction on each operator's bindings cache locally, it requires the same computational effort as LRU.³ Akin to the LRU case, data-driven eviction can be implemented using a priority queue. The only difference is the comparison attribute, which is now the matching probability rather than the time-stamp. Global data-driven eviction strategies may require additional effort (cf. Section 2.5).

² Note that in our case LRU gives the same result as fifo since we only forward fresh bindings to the next operator instead of performing updates on the current binding.
³ Assuming that the probabilities, as in our case, are computed offline and do not influence the online processing effort.

2.3 The Transactional Closure

In order to evaluate the performance of any eviction or load shedding operator we have to know whether deleting a data item or a variable binding impacts the results. A query result emerges from applying a set of algebra operators. In order to determine whether the deletion of data items or variable bindings has an impact on the results of a query, we hence need to know all sequences in which the application of a set of data items to the algebra operators of a query leads to a result.

We now define the basics for computing all possible paths to all query results for a flow F. Any result of a query is a variable binding µroot at the root of the tree Tαroot of the query's algebra expressions. As such, any of these root variable bindings emerges from a sequence of applications of algebra expressions from Tαroot. As a result, for each µroot we have a finite set of sets of variable bindings.
We refer to this set as the transactional closure (µ)+α,F of the binding µ. Indeed, we can define such a transactional closure recursively for all bindings of all algebra expressions in Tαroot.

Having the definition of the transactional closure at hand, we now investigate what happens if we delete some of the variable bindings from the transactional closure. If we delete all entries from (µ)+α,F, then the variable binding µ will never be created. In this study we want to know what impact any variable binding µ′ ∈ (µ)+α,F has on the transactional closure of µ. We assume that each algebra expression α maintains a bindings cache: the finite collection of bindings, which we denote by Ωα. If we now delete a variable binding µ′ from Ωα we would like to know which potential impact this deletion has on the query results. We hence define the transactional closure with respect to α, Ω, and F by (µ)+α,Ω,F. It inductively defines those variable bindings that emerge from µ by applying α to ⟨F, Ω⟩. Let Rα,Ω,F be the result set of α for Ω and F; then we call Rµα,Ω,F = Rα,Ω,F ∩ (µ)+α,Ω,F the result set of µ.

The bindings cache of an algebra expression contains different sets of variable bindings. A join with n join partners, for example, has one set Ωα,n for each of the n join partners. In this study we assume that each of these sets can hold a limited number of variable bindings, i.e., |Ωα,n| ≤ ωα,n, where ωα,n is the memory available to store bindings. When the application of α on the flow F causes one of these sets to exceed its limit ωα,n, we apply an eviction strategy E that removes variable bindings from Ωα,n such that the above condition is fulfilled again.

2.4 Estimating Matching Probabilities

We estimate the selectivity of a variable binding µ by the number of results that depend on µ.
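Turning this count into a probability estimate amounts to a simple ratio; a hypothetical sketch, where `results_of` stands in for the offline oracle over the transactional closure (both helper names are our own):

```python
# results_of(mu) is assumed to return the results depending on the binding mu,
# as supplied by an offline oracle; all_results is the operator's full result set R.

def matching_probability(mu, all_results, results_of):
    """Estimate Pr(result set of mu is non-empty) as |R^mu| / |R|."""
    if not all_results:
        return 0.0
    return len(results_of(mu)) / len(all_results)
```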
We define the probability that the result set of a variable binding µ is non-empty as the above count normalized by the total number of results with respect to the bindings cache:

    Pr(Rµα,Ω,F ≠ ∅) = |Rµα,Ω,F| / |Rα,Ω,F|

Note that the choice of α designates for which bindings cache ωα,n this probability is computed. The above formula, hence, computes the probability that there exists a binding µ′ that is based on µ. If α is a query, then the above formula defines the probability that µ contributes to a query's root result.

In some cases it is possible that a binding µ′ originates from two different upstream bindings. Consider, for example, the following operation joining two identical basic graph patterns BGP(?x P Y). Here, any binding µ = {?x/X} could originate from the first or the second triple pattern matching:

    JOIN(BGP(?x P Y), BGP(?x P Y))

For the computation of the probabilities we count both original variable bindings.

2.5 Eviction Strategies: Local vs Global Approaches

A local eviction strategy operates locally on the bindings cache of each operator. A global eviction strategy, in contrast, operates on all bindings caches together, attempting to optimize the overall global performance. In other words, for a local eviction strategy all ωα,n are fixed, whereas a global eviction strategy has the additional constraint that Σα Σn ωα,n ≤ ω for all ωα,n. A global eviction strategy hence also requires solving (or approximating) a constraint optimization problem. Since it is our goal to show that data-driven eviction can lead to superior results over the other strategies, we only have to show that one of the two approaches offers advantages over the other, non-data-driven ones. We hence restrict ourselves to the simpler, local eviction strategy shown in Algorithm 1 and leave the global eviction strategy for future work.

Algorithm 1 Local eviction strategy for an algebra expression α = α1, . . .
, αI , a flow F, a set of sets of variable bindings Ωαi,ni , and limits ωαi,ni on these:

    Apply F to ⟨α, Ω⟩
    for all Ωαi,n do
        if |Ωαi,n| > ωαi,n then
            eviction(Ωαi,n)
        end if
    end for

3 Evaluation

The main working hypothesis of our paper is that eviction based on the matching likelihood can outperform the fifo and random dropping approaches for variable bindings. We therefore have to estimate the matching likelihood. This in turn required implementing the transactional closure in order to be able to compute the counts on which the matching likelihood is defined. In order to calculate the matching likelihood, we implemented an "oracle". This oracle allows retrieving the cardinality of complete matches in the future. The oracle assumes that required bindings have not been evicted by other join operators. With this oracle we can compute the probabilities as proposed in Section 2.4. We will test probabilistic eviction with probabilities learned from past data in future studies; please find the discussion of this limitation in Section 4.

In this section we evaluate our approach governed by the above hypothesis. We therefore compared the performance of our likelihood eviction with the fifo and random strategies with respect to recall as the key performance indicator (KPI).

3.1 Experimental Setup

We performed our experiments on a synthetic and a real-world data set. We used the same query, which performs the following join:

    SELECT ?a ?b ?c ?d
    WHERE {
      ?a ab ?b .
      ?b bc ?c .
      ?c cd ?d .
    }

We considered joins without sequential or temporal constraints because of memory constraints. Our goal was to compute the recall for all potential bindings. Computing the transactional closure for joins requires tracing an exponential number of paths recording which matches originated from which variable bindings. We discuss this limitation in Section 4. We performed the experiments considering local eviction.
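The local strategy of Algorithm 1 amounts to a per-cache size check after each batch of input has been applied; a minimal Python transcription (the cache layout and the `evict` callback are our own illustrative choices, not the Katts implementation):

```python
def local_eviction_step(caches, limits, evict):
    """caches: operator name -> list of bindings; limits: name -> max cache size.
    After the flow has been applied, trim every over-full bindings cache."""
    for name, cache in caches.items():
        while len(cache) > limits[name]:
            evict(cache)  # removes one binding according to the chosen strategy

# Example with fifo eviction: drop the oldest (first-inserted) binding.
caches = {"left": [1, 2, 3, 4], "right": [5]}
limits = {"left": 2, "right": 2}
local_eviction_step(caches, limits, lambda c: c.pop(0))
```

Plugging in a different `evict` callback (random choice, or lowest matching probability) yields the other strategies without changing the loop.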
We tested the three different eviction strategies with bindings caches whose sizes we increased exponentially, choosing sizes as multiples of 10. The reason for this is that calculating the matching statistics is currently a time-consuming process, and we therefore chose an exponentially increasing cache size for the evaluation.

For the comparison we assumed the same time for determining whether an element was subject to eviction or not. This holds true for fifo and statistical eviction, as both keep a cache from which the elements to be evicted have to be sorted out. We can achieve this by applying heap sort. We assume the number of elements to be dropped by eviction to be constant in relation to the size of the bindings cache. This way each eviction step is bounded by O(log n), where n is the size of the bindings cache.

All experiments were conducted using the UZH Katts Semantic Flow Processing engine.⁴ While we simulated a uniform distribution of results for the synthetic dataset, as real-world data we used anonymous IPTV viewership logs joined with Electronic Program Guide (EPG) data. The data was gathered in the scope of the ViSTA-TV project.⁵ For our experiments we used a data sample from Aug 1st, 2012. The sample comprised about 300'000 events in total for the user log and the EPG data. We thereby only considered those properties that also occur in the query. Because of the above-mentioned exponential explosion for tracing the matching we limited ourselves to a subset comprising 10'000 events.

To retrieve the exact matching probabilities for the bindings we stored the statistics in a sqlite3 in-memory database. However, updating the matching probability for each binding every time unit (every time unit can change the likelihood of a match for a binding) is a very time-expensive process. Therefore we updated the probability for each binding only once, namely when we added it to the bindings cache.
The following configuration was used to run all our experiments: 2 quad-core Intel(R) Xeon(R) X5570 CPUs @ 2.93 GHz, 72 GB of RAM, 4 TB of disk space, running Fedora Core 12, kernel 2.6.32.14, 64 bit.

3.2 Results

As Figures 1-3 show, our likelihood-based eviction approach always outperforms fifo and random eviction. The difference for the synthetic data set decreases with increasing size of the bindings cache from 50% to 10%. In the real-world data set the difference in recall increases from 14% to 29% for the 10'000 events data set, and for the 1'000 events data set we observe a decrease from 32% to 14%. For the extreme size of the bindings cache we always obtain 100% recall for all approaches. We evaluate the overall performance of our approach by the recall measured for the bindings cache size that can hold up to 10% of the bindings of the cache for which we get 100% recall. While the matching-likelihood-based eviction performs very well on the synthetic data set (up to 91% recall with a bindings cache at 10% of the 100% performance limit), it performs not as well on the real-world data set (up to 60% recall with a bindings cache at 10% of the 100% performance limit).

Fig. 1: Recall on the generated dataset with 145'000 events. (a) plots recall over fringe size for the prob, fifo, and random strategies; (b) lists the number of results (recall) per fringe size:

    fringe size   100          1'000        10'000       100'000
    fifo          0 (0%)       616 (9%)     3065 (46%)   6718 (100%)
    random        41 (1%)      468 (7%)     2578 (38%)   6718 (100%)
    prob          3425 (51%)   5642 (84%)   6144 (91%)   6718 (100%)
    ground        6718 (100%)  6718 (100%)  6718 (100%)  6718 (100%)

4 Discussion

Experimental Results The results of our study indicate that likelihood-based eviction can outperform fifo and random eviction. This was the case for all experiments we conducted.

⁴ KATTS is a recursive acronym for Katts is A Triple Torrent Sieve and is accessible via GitHub at https://github.com/lorenzfischer/katts
⁵ http://vista-tv.eu
The relative difference between the strategies is larger for the synthetic data than for the real-world data set. We think that our approach was able to better predict the matching likelihood for the synthetic dataset as it was generated following uniform distributions. Nevertheless, an eviction strategy based on the matching likelihood seems to be a promising approach.

We also found that there is a relation between the size of the bindings caches and the recall: the larger the bindings cache, the higher the recall. While this result is not very surprising, it indicates that future research will have to determine the exact size of the bindings cache at which the recall becomes 100%. Such research would also investigate at which cost an improvement of the recall using eviction strategies comes.

Shortcomings The most important shortcomings of our study are the limited number of data items we could process and the restriction to a query consisting of a regular join. The limitation to the small number of data items is caused by the exponential number of possible traces from variable bindings to results. Research on more complicated queries and larger data sets hence has to be performed online. This, in turn, makes an exact calculation of the matching statistics infeasible; therefore one has to learn the matching statistics, which we tackle in our future work. In this study we hence concentrated on investigating how well likelihood-based eviction works in principle, compared to fifo and random eviction.

Fig. 2: Recall on the ViSTA-TV dataset with 1'000 events. (a) plots recall over fringe size for the prob, fifo, and random strategies; (b) lists the number of results (recall) per fringe size:

    fringe size   10          100         1'000
    fifo          0 (0%)      3 (14%)     22 (100%)
    random        0 (0%)      6 (27%)     22 (100%)
    prob          7 (32%)     9 (41%)     22 (100%)
    ground        22 (100%)   22 (100%)   22 (100%)

We are currently investigating how to best estimate the matching likelihood on the go, i.e., along with the processing of incoming data. Future experiments will
hence systematically investigate different types of algebra expressions. These include aggregates but also joins with sequential and temporal constraints, as these are more typical for SFP systems. We believe that likelihood-based eviction can help most in cases where a lot of variable bindings are produced, e.g., for aggregates, in particular under non-shrinking semantics.

Our simulation did not investigate response time or throughput but recall. While we showed that matching-likelihood-based eviction outperforms deterministic approaches like fifo and random eviction in terms of recall, we will investigate at which cost this comes for the former two KPIs. As in the previous case we will then have to estimate the statistics online rather than offline. Following the PCKS paradigm [11] we are working on implementing tests for determining how well our approach works for bursts in the data.

In this study we considered a purely local eviction strategy. It would be interesting to see the effect of global constraints. We are currently investigating how to add these to statistical eviction. This requires a well-defined optimization strategy carefully balancing the relation between local and global constraints.

Fig. 3: Recall on the ViSTA-TV dataset with 10'000 events. (a) plots recall over fringe size for the prob, fifo, and random strategies; (b) lists the number of results (recall) per fringe size:

    fringe size   100         1'000       10'000
    fifo          39 (4%)     290 (31%)   943 (100%)
    random        27 (3%)     297 (31%)   943 (100%)
    prob          164 (17%)   567 (60%)   943 (100%)
    ground        943 (100%)  943 (100%)  943 (100%)

5 Related Work

Research related to this paper roughly covers three areas: Semantic Flow Processing, selectivity estimates on flow data, and load shedding for IFP/SFP systems. In the sequel we discuss how our work relates to each of these.

5.1 Semantic Flow Processing

C-SPARQL [4] performs query matching on subsets of the information flow defined by windows.
The decidability of SPARQL query processing on such finite sets of RDF triples causes the number of variable bindings produced to be finite. Their number may still become prohibitively large, for example when using non-shrinking semantics for aggregates [3]. Since the execution is performed on traditional SPARQL engines, we may apply our approach, for example, as a filter for iterator-based implementations such as Jena.⁶

EP-SPARQL [1] is a complex event processing system which extends the ETALIS system by a flow-ready extension of SPARQL for query processing [1]. It provides a garbage collection facility which can "prune outdated events" or "expired events by using periodic events, generated by the system". ETALIS is a Prolog engine, which seriously hampers the applicability of our approach to it. Note, however, that EP-SPARQL may be implemented on other engines, e.g., CQELS, too.

CQELS [10] "implements the required query operators natively to avoid the overhead and limitations of closed system regimes". It optimizes the execution by dynamically re-ordering operators, because "the earlier we prune the triples that will not make it to the final output, the better, since operators will then process fewer triples". This pruning does, however, not make any guarantees about the number of variable bindings created by the processors. Our method should be directly applicable to CQELS as it provides a native implementation of the operators and these operators maintain a list of active variable bindings.

⁶ https://jena.apache.org/

5.2 Selectivity Estimates

Selectivity estimates for optimizing the execution of SPARQL queries were first investigated by [12]. These are not directly applicable to our approach: selectivity estimates for SPARQL query optimization are based on the selectivity of predicates, whereas we are interested in the matching likelihood of variable bindings. More closely related to our work is the operator scheduling of CQELS [10].
They propose to choose the next operator to execute according to the likelihood that the operator leads to a positive result. They, however, base the decision on heuristics. It would be very interesting to see what effect applying our matching likelihood would have on the performance of CQELS. To the best of our knowledge, no approach estimates the matching likelihood of bindings to determine their selectivity.

5.3 Load Shedding

Load shedding has been applied to information flow processing. Approaches like [6, 2, 13] perform load shedding by dropping tuples from the stream, i.e., dropping data instead of variable bindings. In contrast to a Complex Event Processing (CEP) system, they assume a Data Stream Management System, i.e., a set of tuples with a relational database execution engine.

An approach we found that follows a similar idea is SASE+ [7]. SASE+ has an automaton-based matching approach whose partial matches can be considered similar to variable bindings in our case. They do apply an eviction strategy, yet they base their eviction on a deterministic approach that is similar to our implementation of fifo. Another approach that resembles our work is proposed in [6]. They perform a simple equi-join on two incoming streams and evict tuples which are less likely to find a join partner. However, the approach in [6] works with a sliding window and with a single equi-join of two streams.

Note that applying a window to the flow of RDF triples can be considered dropping tuples, too. As such, all approaches that implement a window on the incoming data effectively offer the window as a load shedding strategy.

6 Conclusion and Outlook

In this paper we proposed to perform load shedding by eviction, i.e., by dropping variable bindings rather than by dropping tuples.
While the latter is a well-established technique for Data Stream Management Systems, for CEP our approach is the first that (i) applies load shedding by eviction and (ii) bases the decision on the matching likelihood of a variable binding rather than on a heuristic. Our experiments show that eviction is a promising strategy for regular joins in event-driven Semantic Flow Processing systems. We outperform the commonly used deterministic first-in-first-out approach by up to 51% recall. While we currently investigate recall as the only key performance indicator (KPI), we are confident that we will be able to show that the matching likelihood also performs better than deterministic approaches when different Quality of Service (QoS) constraints require testing a combination of recall with other KPIs such as response time.

Future work will concentrate on extending the algebra expressions to joins with sequential and temporal constraints. We believe that the solution for load shedding will have to incorporate statistical information from the data flow and the query log. We will also investigate how the matching likelihood can be learned from the stream on-the-go. We also plan to evaluate our approach with standardized queries and data as proposed in [14] or [9]. We are convinced that only systematic investigation of the relation between using windows and likelihood-based eviction for load shedding will reveal such a solution from the data.

References

1. Anicic, D., Fodor, P., Rudolph, S., Stojanovic, N.: EP-SPARQL: a unified language for event processing and stream reasoning. In: Proceedings of the 20th International Conference on World Wide Web. pp. 635–644. ACM (2011)
2. Babcock, B., Datar, M., Motwani, R.: Load Shedding for Aggregation Queries over Data Streams. In: Proceedings of the 20th International Conference on Data Engineering, ICDE '04. IEEE Computer Society (2004)
3.
Barbieri, D., Braga, D., Ceri, S.: Incremental reasoning on streams and rich background knowledge. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) The Semantic Web: Research and Applications: 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, May 30 – June 2, 2010. LNCS, vol. 6088, pp. 1–15 (2010)
4. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: C-SPARQL: A Continuous Query Language for RDF Data Streams. International Journal of Semantic Computing 4(1), 3–25 (2010)
5. Cugola, G., Margara, A.: Processing flows of information. ACM Computing Surveys 44(3), 1–62 (Jun 2012)
6. Das, A., Gehrke, J., Riedewald, M.: Approximate join processing over data streams. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03. p. 40. ACM Press (2003)
7. Diao, Y., Immerman, N., Gyllstrom, D.: SASE+: An agile language for Kleene closure over event streams. Tech. rep., University of Massachusetts Amherst, Department of Computer Science (2008)
8. Harris, S., Seaborne, A.: SPARQL 1.1 Query Language. W3C (2011)
9. Le-Phuoc, D., Dao-Tran, M., Pham, M.: Linked stream data processing engines: facts and figures. The Semantic Web – ..., 1–12 (2012)
10. Le-Phuoc, D., Dao-Tran, M., Parreira, J.X., Hauswirth, M.: A Native and Adaptive Approach for Unified Processing of Linked Streams and Linked Data. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) The Semantic Web – ISWC 2011: 10th International Semantic Web Conference, Bonn, Germany, October 23–27, 2011, Proceedings, Part I. Lecture Notes in Computer Science, vol. 7031, pp. 370–388. Springer Berlin / Heidelberg (2011)
11. Scharrenbach, T., Urbani, J., Margara, A., Valle, E.D., Bernstein, A.: Seven Commandments for Benchmarking Semantic Flow Processing Systems. In: Proc. ESWC 2013 (to appear). pp. 1–15. No.
296126 (2013)
12. Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., Reynolds, D.: SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation. In: Proceedings of the 17th International World Wide Web Conference (WWW). ACM (Apr 2008)
13. Tatbul, N., Çetintemel, U., Zdonik, S., Cherniack, M., Stonebraker, M.: Load Shedding in a Data Stream Manager. In: Proceedings of the 29th International Conference on Very Large Data Bases (VLDB). pp. 309–320. VLDB Endowment (2003)
14. Zhang, Y., Duc, P.M., Corcho, O., Calbimonte, J.P.: SRBench: A Streaming RDF/SPARQL Benchmark. In: The Semantic Web – ISWC 2012: 11th International Semantic Web Conference, ISWC 2012, Boston, MA, USA, Nov 12–14, 2012, Proceedings (2012)