Ontology-based top-k query answering over massive, heterogeneous, and dynamic data?

Daniele Dell'Aglio
Dipartimento di Elettronica, Informazione e Bioingegneria – Politecnico di Milano, P.za L. Da Vinci, 32. I-20133 Milano - Italy
daniele.dellaglio@polimi.it

Abstract. A relevant kind of task, which has been getting more and more attention in recent years, is the selection of the most relevant elements in a number of data collections, e.g., the opinion leaders given a set of topics, the best hotel offers, or the best cities to live in. At the moment the problem is addressed through ad-hoc solutions tailored to the target scenario, setting up infrastructures able to manage the expected data loads. Even if these solutions work, they could start to have problems when data volume, velocity and variety increase. In my research activity I will study the problem of computing the top-k relevant items given a collection of data sets with both streaming and static data, an ontology describing them, and a set of top-k queries where each scoring function describes relevance as a combination of several criteria.

1 Introduction

Relevancy. Big Data are characterized by the so-called three Vs [1]: volume (high amount of data), velocity (highly dynamic data) and variety (data with structural and semantic heterogeneity). When such data are analysed and queried, it often happens that there is a huge number of answers, but only a small part of them is relevant (on the basis of some criteria). Consider, for example, Expedia1: it processes travel solutions and provides as search result the top k results (i.e., the first result page) ordered by a set of user-provided criteria. In most cases, users find the solution they are looking for in the first page, without moving to the next ones. The input data used by Expedia can be described through the Big Data dimensions2.
Variety is given by the fact that data are related to different domains (flight companies, hotels, and car rental services), are gathered from several sources and have to be integrated. Velocity is a critical dimension for the Expedia service: the availability and price of flight tickets and hotel rooms are dynamic, and the quality of the returned results strictly depends on them. Finally, data about flights, hotels and car rentals have relevant volumes. Even if Expedia does not have Big Data problems (its infrastructure is able to cope with the data it processes), a system like Expedia that pushes further the aforementioned input data dimensions (e.g., uses more domains and more dynamic data) could have problems in maintaining the quality of the supplied services.

? This research is developed under the supervision of Professor Emanuele Della Valle.
1 Cf. http://www.expedia.com
2 Even if the whole data computation process is not performed by Expedia and there are intermediate data providers (e.g., Amadeus), I consider this process as executed by a black-box system to highlight the features of the data.

The main goal of my research is the development of methodologies and algorithms for computing top-k results (ordered by a given criterion) in a Big Data scenario. In the last decade, research activities in this domain have addressed different sub-problems: finding the most relevant elements can be expressed through top-k queries, i.e., queries that ask for the top k tuples from a dataset, given an order expressed through a scoring function [2]; velocity is addressed by stream computation and on-line streaming algorithms to process data in real time [3]; data variety and data access can be addressed through ontologies to obtain a holistic view over heterogeneous data sets, exploiting the Ontology-Based Data Access (OBDA) approach [4].
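To fix intuitions on the first of these sub-problems, a top-k query with a scoring function combining several criteria can be sketched as follows (a toy example with hypothetical hotel-offer data, not tied to any specific system):

```python
import heapq

# Hypothetical hotel-offer tuples: (name, price in EUR, rating in [0, 5]).
offers = [
    ("A", 120.0, 4.5),
    ("B", 80.0, 3.9),
    ("C", 200.0, 4.9),
    ("D", 95.0, 4.2),
]

def score(offer, w_price=0.5, w_rating=0.5):
    """Scoring function combining two criteria: cheaper and
    better-rated offers score higher."""
    _, price, rating = offer
    return w_price * (1.0 / price) + w_rating * (rating / 5.0)

def top_k(tuples, k, scoring):
    """Return the k tuples with the highest score, best first."""
    return heapq.nlargest(k, tuples, key=scoring)

print([name for name, _, _ in top_k(offers, 2, score)])  # ['C', 'A']
```

The scoring function here is a weighted sum, but any user-provided combination of the criteria plays the same role.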
How to combine these methods and techniques is an open research issue: RDF stream engines [5], top-k ontological query answering [6] and top-k computation over data streams [7] are examples of novel research trends that have been gathering more and more attention in recent years. In my activity I will study how those elements can be combined in order to efficiently perform data analyses in a Big Data context.

Problem Statement. Figure 1 depicts the framework I will consider to probe the problem presented above. Data are gathered from collections of data sets D and data streams S. The data model is described by an ontology M, and the information needs are defined through a set Q of continuous queries, containing a subset K of top-k continuous queries. The problem I investigate in my activity is how to optimize query answering in this setting. It is worth noting that the existence of a set Q of queries is not a stretch: in stream processing applications it is common to develop networks of queries [8], where each query produces streams and consumes the outputs of other queries.

Fig. 1. Elements involved in the problem.

The remainder of the paper is structured in the following way: Section 2 describes the related work; Section 3 presents the research questions and the relative hypotheses that will drive my research activity; Section 4 describes the approach I follow to test the hypotheses; and Section 5 ends with some final considerations.

2 Related work

The rise of data stream sources introduced new problems about how to manage, process and query infinite sequences of data arriving at a high frequency rate. Two proposed approaches are Data Stream Management Systems (DSMSs) and Complex Event Processors (CEPs) [3]: the former transform data streams into timestamped relations (usually through the window operator) to be processed with
well-known techniques such as algebras [8]; the latter look for patterns in the streams to identify when complex events occur [9]. Recently, those paradigms have been studied by the Semantic Web community, which has come out with different relevant results. On the one hand, the DSMS model inspired the design and development of C-SPARQL [10], SPARQLstream [11] and CQELS [12]. On the other hand, the CEP model inspired the EP-SPARQL [13] system. RDF stream engines are at the basis of my research: they are the first step towards combining data streams and ontologies3.

3 In the following, with RDF stream engines I indicate the RDF stream engines following the DSMS paradigm.

A parallel topic that joins Semantic Web and Stream Processing is the execution of reasoning tasks over data streams. The work in [14] proposes a method to solve reasoning tasks over an ontology stream (i.e., a sequence of timestamped ontologies). The work in [15] focuses explicitly on ontological query answering in RDF stream processors: it proposes an algorithm, inspired by DRed [16], to incrementally maintain the ontological entailment of the content of a window. The approach is data-driven, and the entailment is updated when the window slides.

The top-k query answering problem has been widely studied in the Data Base Management System area [2]: the general idea is to extend the algebras by introducing top-k related operators as first-class citizens, and to provide physical operators able to determine the top k tuples, ordered by a given criterion, without scanning the whole dataset.

Top-k queries over data streams and top-k ontological query answering are research trends at an initial stage. [7] is one of the first works focusing on how to maintain a top-k answer over a stream. Regarding top-k ontological query answering, SPARQL-RANK [17] proposes an extension of SPARQL to optimize top-k query answering; however, the approach works only under the RDF entailment regime.
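To make the streaming side of this picture concrete, the semantics studied in [7] keep the top-k answer aligned with a sliding window. A naive Python sketch of these semantics (recomputing the answer from scratch at every arrival, which the actual algorithms in [7] are designed to avoid) is:

```python
import heapq
from collections import deque

class ContinuousTopK:
    """Naive continuous top-k over a count-based sliding window:
    the answer is recomputed from scratch at every slide. This sketch
    only fixes the semantics; efficient algorithms avoid the rescan."""

    def __init__(self, k, window_size, scoring):
        self.k, self.scoring = k, scoring
        self.window = deque(maxlen=window_size)  # expired items drop out

    def push(self, item):
        """Insert a new stream item and return the current top-k answer."""
        self.window.append(item)
        return heapq.nlargest(self.k, self.window, key=self.scoring)

topk = ContinuousTopK(k=2, window_size=3, scoring=lambda x: x)
for value in [5, 1, 9, 7, 2]:
    answer = topk.push(value)
print(answer)  # window now holds [9, 7, 2] -> [9, 7]
```

The difficulty addressed by [7] is exactly that an expiring tuple may evict a current top-k member, so the answer cannot be maintained from the top-k set alone.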
SoftFacts [18] is a top-k retrieval system that uses an ontology layer to manage the conceptual model (using OWL-QL as the ontological language), and relational database systems to store and query the data.

3 Assumptions, research questions and hypotheses

The operational semantics of RDF stream engines such as C-SPARQL, CQELS and SPARQLstream can be described by the model proposed by STREAM and CQL [8], depicted in Figure 2.

Fig. 2. General model of a continuous SPARQL query

The query answering process can be represented in three logical steps. First, the input RDF streams are transformed into sets of mappings (using SPARQL algebra terminology) through the S2R operators4, usually sliding windows. The resulting sets of mappings and data from static data sets are transformed into a new set of mappings through an R2R operator (the boolean expression part of the query, compliant with SPARQL 1.1/1.0). Finally, an R2S operator converts the mappings into the output stream.

4 I maintain the operators' names as defined by CQL: S2R (stream-to-relation), R2R (relation-to-relation) and R2S (relation-to-stream).

The reasoning technique in [15] works on the window operator: it uses the ontology M, the window content and the static data to compute the materialization. The latter is then used as input by the R2R operator. This materialization method works under some assumptions:
– TBox assertions are not in the input stream;
– the input of the query is one window over one stream (and optionally static data sets);
– the same statement cannot be in both the input stream and the static data.

At the beginning of my research these assumptions hold, and I plan to investigate if they can be relaxed (in particular the second and the third ones). Additionally, I make an assumption on the static data sets: they are available in the primary or secondary memory of the machine that executes the query.
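Taken together, the three-step pipeline and a window-level materialization can be illustrated by a toy Python sketch (the triples, class names and single subclass map below are hypothetical; moreover, unlike [15], the entailment is naively recomputed at every slide rather than incrementally maintained):

```python
def s2r(stream, t_now, width):
    """S2R: time-based sliding window over a timestamped triple stream."""
    return {triple for triple, t in stream if t_now - width < t <= t_now}

def materialize(triples, subclass_of):
    """Re-derive rdf:type entailments from the window content.
    `subclass_of` maps a class to its (single) direct superclass."""
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(derived):
            if p == "rdf:type" and o in subclass_of:
                new = (s, "rdf:type", subclass_of[o])
                if new not in derived:
                    derived.add(new)
                    changed = True
    return derived

def r2r(triples, wanted_type):
    """R2R: the pattern-matching part of the query (one triple pattern)."""
    return {s for s, p, o in triples if p == "rdf:type" and o == wanted_type}

def r2s(mappings):
    """R2S: RStream, i.e., emit the whole current answer at each step."""
    return sorted(mappings)

stream = [(("alice", "rdf:type", "Student"), 1),
          (("bob", "rdf:type", "Professor"), 3)]
tbox = {"Student": "Person", "Professor": "Person"}
window = s2r(stream, t_now=3, width=5)
print(r2s(r2r(materialize(window, tbox), "Person")))  # ['alice', 'bob']
```

The three assumptions above surface directly in the sketch: the TBox (`tbox`) is fixed outside the stream, there is a single window, and stream and static data never share statements.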
Top-k algorithms need fast access to the data sources, and the problem of designing algorithms to execute top-k queries over remote repositories (exposed via SPARQL endpoints) is out of the scope of my activity.

Research questions. Starting from the problem statement and exploiting the RDF stream engine paradigm presented above, I define the following research questions:

Q.1 In RDF stream processors the queries are registered before the arrival of the data, and they provide sequences of time-ordered results depending on the content of the windows. How can these facts be exploited to: 1) design more efficient algorithms and 2) improve the expressiveness of the ontological language used to define the ontology M?

Q.2 The queries in Q are translated into a set of logical plans (algebras) and then into physical plans to be executed. How can the topology of the query network, the model M, and the scoring functions defined by the top-k queries in K be used to optimize the plans?

The two questions aim at probing the optimization of the query answering process from two different points of view. Question Q.1 focuses on how stream processing engines manage the queries: the queries in Q are registered in the system before the data arrive, and each query is evaluated multiple times on different portions of the input streams. Question Q.2 takes into account query plan optimization, and in particular how the ontology M and the scoring functions in the top-k queries K can be used to improve the logical and physical plans of the queries.

Hypotheses. In the attempt to investigate the answers to these research questions, I formulated a set of hypotheses that will guide my activity. Regarding question Q.1, the hypotheses are:

H.1.1 The available stream reasoning techniques can be extended to work with queries with multiple windows.
H.1.2 The available stream reasoning techniques can be optimized when there are multiple queries.
H.1.3 Due to the fact that both the conceptual model and the queries are fixed, it is possible to improve the expressiveness of the ontological language used to define M, while keeping the query answering problem over data streams tractable.
H.1.4 It is possible to move from a purely data-driven approach (i.e., materialization) to a hybrid query-driven and data-driven approach (i.e., query rewriting and materialization) to improve memory consumption and response time.

The four hypotheses are related to the query answering process and how it can be improved. The stream reasoning technique in [15] assumes the existence of one window (and optionally a static data source): with hypothesis H.1.1 I want to verify if and under which conditions it is possible to relax this constraint. Hypothesis H.1.2 tests whether, given the set Q of queries, it is possible to exploit synergies instead of maintaining |Q| materializations separately. Hypothesis H.1.3 is related to the expressiveness of the ontological language: the technique presented in [15] works on RDFS+, but it could be possible to extend it to support more expressive languages. Finally, through hypothesis H.1.4, I want to test whether in the stream context it is possible to exploit the query registration process to rewrite the query using M, reducing the memory consumption of the materialization and the time required to maintain it.

In parallel, I aim to probe question Q.2 through the study of the following hypotheses:

H.2.1 Exploiting the network of queries, it is possible to optimize the query plans in RDF stream engines.
H.2.2 The presence of M and K makes it possible to optimize the query plans at the logical level.
H.2.3 It is possible to exploit M and K to design physical operators that perform faster.

In this set of hypotheses I focus on the optimization of the query plans.
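A classical example of the kind of rank-aware physical operator hypothesis H.2.3 alludes to, among those surveyed in [2], is the threshold algorithm. The sketch below (hypothetical integer scores, sum as the aggregation function) returns the top k without reading the per-criterion lists to the end:

```python
import heapq

def threshold_algorithm(sorted_lists, score_maps, k):
    """Threshold-algorithm-style top-k: scan the per-criterion lists in
    parallel (each sorted by descending score), keep the k best
    aggregated scores, and stop as soon as no unseen item can beat
    the k-th best (the aggregation here is the sum of the scores)."""
    seen, best = set(), []  # best: min-heap of (total_score, item)
    for depth in range(max(len(lst) for lst in sorted_lists)):
        threshold = 0  # best total an unseen item could still reach
        for lst, score in zip(sorted_lists, score_maps):
            if depth >= len(lst):
                continue
            item = lst[depth]
            threshold += score[item]
            if item not in seen:
                seen.add(item)
                total = sum(s[item] for s in score_maps)  # random access
                heapq.heappush(best, (total, item))
                if len(best) > k:
                    heapq.heappop(best)
        if len(best) == k and best[0][0] >= threshold:
            break  # early stop: the rest of the lists is never read
    return sorted(best, reverse=True)

# Hypothetical per-criterion scores for four items.
price_score = {"a": 9, "b": 8, "c": 3, "d": 2}
rating_score = {"a": 2, "b": 9, "c": 7, "d": 1}
sorted_lists = [["a", "b", "c", "d"], ["b", "c", "a", "d"]]
print(threshold_algorithm(sorted_lists, [price_score, rating_score], 2))
# [(17, 'b'), (11, 'a')]
```

In this run the algorithm stops after three list positions instead of four, and the gain grows with the size of the dataset; exploiting M and K to build operators with this kind of early-stop behaviour is what H.2.3 aims to test.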
In stream processing engines, such as Aurora [19], groups of query plans are optimized; in hypothesis H.2.1 I want to test if optimization techniques for stream processing engines can be applied and extended to RDF stream engines. Hypotheses H.2.2 and H.2.3 aim at testing whether the ontology M and the top-k queries K enable optimizations in the query plans (at the logical and physical levels, respectively).

Reflections. The problem I am going to investigate is the optimization of continuous ontology-based top-k query answering. As explained above, even if its sub-problems have been addressed, to the best of my knowledge there are no results that address this problem as a whole. I think that the main reason is that the methods and instruments at the basis of this activity were lacking until recently. For example, RDF stream processing engines and stream reasoning are novel and open research trends, and the research groups in these fields have put most of their effort into principles and foundational definitions. I believe that these results compose a solid basis on which to build my research activity in the next years.

4 Research Plan

Approach. My research starts from a deep state-of-the-art analysis of the data management and description logic fields; some relevant results are presented in Section 2. In parallel, there will be the identification of a set of use cases, to determine a set of real problems from which to elicit the requirements that will guide my activities. At the moment I am considering a social listening scenario: the idea is to analyse social network data and mobile phone records to extract knowledge.

The next step of my activity is the design of the evaluation framework and the evaluation metrics. Even if some hypotheses will be investigated through a theoretical approach, others will require an empirical approach.
As a consequence, it is important to define this tool from the beginning: it allows me to measure the progress of the activity and to evaluate the experiments. As a result of the state-of-the-art analysis, the requirements elicitation and the evaluation framework definition, there will be an environment to support the core activity of my research, the testing of the hypotheses. I will follow a three-step plan. In the first step, I will focus on the hypotheses related to question Q.1, and consequently on continuous ontological query answering over multiple streams. In this phase the top-k element is missing: at the moment the research on top-k query answering is more mature than the one on inference over data streams, so I believe it is necessary to work on the latter before studying the synergies between them. In the second step, I will target Q.2, through the testing of the relative hypotheses. Finally, in the third step I will bring together the results obtained. The output of each step will be a set of approaches, supported by prototypes to prove their feasibility and to evaluate them.

Evaluation Plan. As explained in the previous section, an evaluation framework is required from the first steps of my activity. The metrics I aim at defining are related to the following dimensions: response time (the time required to compute the answers of the query), memory consumption, and correctness of the results. In the past months I started to work on the design of the evaluation framework. The starting point was the analysis of benchmarks for RDF stream processors. At the moment, to the best of my knowledge, there are two available benchmarks: SRBench [20] and LSBench [21]. The two works address different features of RDF stream processors, such as the degree of SPARQL 1.1 adoption, the time performance and the throughput. A feature that has not been investigated yet is the correctness of the results provided by RDF stream engines.
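The three dimensions can be combined in a single measurement harness. The sketch below shows the idea for an in-process prototype (all names are hypothetical: `engine` stands for any callable answering a query, `reference` for the expected answer; a real engine would need external monitoring rather than tracemalloc):

```python
import time
import tracemalloc

def evaluate(engine, query, reference):
    """Measure response time, peak memory and correctness of one
    query evaluation. `engine` is any callable, `reference` the
    expected answer (hypothetical names for this sketch)."""
    tracemalloc.start()
    start = time.perf_counter()
    answer = engine(query)
    response_time = time.perf_counter() - start
    _, peak_memory = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"response_time_s": response_time,
            "peak_memory_bytes": peak_memory,
            "correct": set(answer) == set(reference)}

report = evaluate(lambda q: [x * x for x in range(1000)],
                  query=None, reference=[x * x for x in range(1000)])
print(report["correct"])  # True
```

Note that the correctness check above simply compares answers as sets; deciding when two streams of answers produced by different engines should count as equivalent is the harder open issue.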
I started to work on this aspect, and [22] reports some initial considerations. The most relevant one is that the available operational semantics of the RDF stream engines are not enough to model the different behaviours shown by the systems. As a consequence, I am contributing to defining a framework to validate the results provided by the systems.

5 Conclusion

In the next years, the Big Data market will grow [23] and, as a consequence, approaches and methodologies to manage and process data and to extract knowledge from them will become more and more important. The problem I am going to investigate is the optimization of the top-k element retrieval task in a context characterized by heterogeneous and massive data streams. I presented the research questions and the hypotheses that will guide the activity, and the plan I will follow to address them.

References

1. Laney, D.: 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group (February 2001)
2. Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. 40(4) (2008)
3. Cugola, G., Margara, A.: Processing flows of information: From data stream to complex event processing. ACM Comput. Surv. 44(3) (2012) 15
4. Lenzerini, M.: Data integration: A theoretical perspective. In: PODS. (2002) 233–246
5. Della Valle, E., Ceri, S., van Harmelen, F., Fensel, D.: It's a streaming world! Reasoning upon rapidly changing information. IEEE Intelligent Systems 24(6) (2009) 83–89
6. Schlobach, S.: Top-k reasoning for the semantic web. In: ISWC. (2011) 55–59
7. Mouratidis, K., Bakiras, S., Papadias, D.: Continuous monitoring of top-k queries over sliding windows. In: SIGMOD Conference. (2006) 635–646
8. Arasu, A., Babu, S., Widom, J.: The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2) (2006) 121–142
9.
Luckham, D.C.: The power of events - an introduction to complex event processing in distributed enterprise systems. ACM (2005)
10. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: C-SPARQL: A continuous query language for RDF data streams. IJSC 4(1) (2010) 3–25
11. Calbimonte, J.P., Corcho, O., Gray, A.J.G.: Enabling ontology-based access to streaming data sources. In: ISWC. (2010) 96–111
12. Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and adaptive approach for unified processing of linked streams and linked data. In: ISWC. (2011) 370–388
13. Anicic, D., Fodor, P., Rudolph, S., Stojanovic, N.: EP-SPARQL: a unified language for event processing and stream reasoning. In: WWW. (2011) 635–644
14. Ren, Y., Pan, J.Z.: Optimising ontology stream reasoning with truth maintenance system. In: CIKM. (2011)
15. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: Incremental reasoning on streams and rich background knowledge. In: ESWC (1). (2010) 1–15
16. Volz, R., Staab, S., Motik, B.: Incrementally maintaining materializations of ontologies stored in logic databases. J. Data Semantics 2 (2005) 1–34
17. Magliacane, S., Bozzon, A., Della Valle, E.: Efficient execution of top-k SPARQL queries. In: ISWC. (2012) 344–360
18. Straccia, U.: SoftFacts: A top-k retrieval engine for ontology mediated access to relational databases. In: SMC. (2010) 4115–4122
19. Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.B.: Aurora: a new model and architecture for data stream management. VLDB J. 12(2) (2003) 120–139
20. Zhang, Y., Duc, P., Corcho, O., Calbimonte, J.P.: SRBench: A streaming RDF/SPARQL benchmark. In: ISWC. (2012) 641–657
21. Le-Phuoc, D., Dao-Tran, M., Pham, M.D., Boncz, P., Eiter, T., Fink, M.: Linked stream data processing engines: Facts and figures. In: ISWC. (2012) 300–312
22.
Dell'Aglio, D., Balduini, M., Della Valle, E.: On the need to include functional testing in RDF stream engine benchmarks. In: BeRSys 2013. (2013)
23. Gens, F.: IDC Predictions 2012: Competing for 2020 (2012)