Addressing Streaming and Historical Data in OBDA
               Systems: Optique’s Approach
                   (Statement of Interest)

     Ian Horrocks4 , Thomas Hubauer3 , Ernesto Jimenez-Ruiz2 , Evgeny Kharlamov2 ,
     Manolis Koubarakis4 , Ralf Möller1 , Konstantina Bereta4 , Christian Neuenstadt1 ,
     Özgür Özçep1 , Mikhail Roshchin3 , Panayiotis Smeros4 , Dmitriy Zheleznyakov2
                  1 Hamburg University of Technology, Germany
                  2 Oxford University, UK
                  3 Siemens Corporate Technology, Germany
                  4 University of Athens, Greece


         Abstract. In large companies such as Siemens and Statoil, monitoring tasks are
         of great importance: for example, Siemens monitors turbines and Statoil monitors
         oil behaviour in wells. These tasks highlight the importance of both streaming
         and historical (temporal) data in the Big Data challenge for industry. We present
         the Optique project, which addresses this problem by developing an Ontology Based
         Data Access (OBDA) system that incorporates novel tools and methodologies for
         the processing and analysis of temporal and streaming data. In particular, we
         advocate modelling time-aware data with temporal RDF and reducing monitoring
         tasks to knowledge discovery and data mining.


1     Introduction
A typical problem that end-users face when dealing with Big Data is that of data access,
which arises due to the three dimensions (the so-called “3V”) of Big Data: volume, since
massive amounts of data have been accumulated over the decades; velocity, since the
amounts may be rapidly increasing; and variety, since the data are spread over different
formats. In this context, accessing the relevant information is an increasingly difficult
task, and the Optique project [8] aims at providing solutions for it.
    The project is focused around two demanding use cases that provide it with mo-
tivation, guidance, and realistic evaluation settings. The first use case is provided by
Siemens5 and encompasses several terabytes of temporal data coming from sensors,
with a growth rate of about 30 gigabytes per day. Users need to query this data in com-
bination with many gigabytes of other relational data that describe events. The second
use case is provided by Statoil6 and concerns more than one petabyte of geological
data. The data are stored in multiple databases which have different schemata, and the
users have to manually combine information from many databases, including temporal
ones, in order to get the results for a single query. In general, in the oil and gas
industry, IT-experts spend 30–70% of their time gathering and assessing the quality of
data [7]. This is clearly very expensive in terms of both time and money. The Optique
project aims at solutions that reduce the cost of data access dramatically. More
precisely, Optique aims at automating the process of going from an information
requirement to the retrieval of the relevant data, and at reducing the time needed for
this process from days to hours, or even to minutes. A broader goal of the project is to
provide a platform with a generic architecture that can be adapted to any domain that
requires scalable data access and efficient query execution.

 5 http://www.siemens.com
 6 http://www.statoil.com

            Fig. 1. Left: existing approaches to data access; Right: OBDA approach

     The main bottleneck in Optique’s use cases is that data access is limited to a
restricted set of predefined queries (cf. Figure 1, left, top). Thus, if an end-user needs
data that current applications cannot provide, the help of an IT-expert is required to
translate the information need of the end-user into specialised queries and optimise them
for efficient execution (cf. Figure 1, left, bottom). This process can take several days,
and given that in data-intensive industries engineers spend up to 80% of their time
on data access problems [7], this incurs considerable cost.
     The semantic approach known as “Ontology-Based Data Access” (OBDA) [16,
6] has the potential to address the data access problem by automating the translation
from the information needs of users (cf. Figure 1, right) to data queries. The key
idea is to use an ontology that presents to users a semantically rich conceptual model
of the problem domain. Users formulate their information requirements (that is,
queries) in terms of the ontology, and then receive the answers in the same intelligible
form. These requests should be executed over the data automatically, without an
IT-expert’s intervention. To this end, a set of mappings is maintained which describes the
relationship between the terms in the ontology and the corresponding data source fields.
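
To illustrate the role of mappings, the following minimal sketch unfolds a query
phrased over ontology terms into a single SQL query. The ontology terms (Turbine,
hasSensor), the table and column names, and the helper unfold are all hypothetical
and serve illustration only; an actual OBDA system would additionally rewrite the
query with respect to the ontology axioms, a step that is omitted here.

```python
# Hypothetical mappings: each ontology term is defined by an SQL query
# over an (equally hypothetical) relational source schema.
MAPPINGS = {
    "Turbine": "SELECT id AS x FROM turbines",
    "hasSensor": "SELECT turbine_id AS x, sensor_id AS y FROM sensors",
}


def unfold(atoms):
    """Unfold a conjunctive query, given as a list of (term, variables)
    atoms, into one SQL query that joins the mapped sub-queries on their
    shared variables."""
    from_parts, conditions, seen = [], [], {}
    for i, (term, variables) in enumerate(atoms):
        alias = f"t{i}"
        from_parts.append(f"({MAPPINGS[term]}) AS {alias}")
        # By convention here, a mapping exposes its arguments as columns x, y.
        for column, var in zip("xy", variables):
            ref = f"{alias}.{column}"
            if var in seen:
                # A repeated variable becomes a join condition.
                conditions.append(f"{seen[var]} = {ref}")
            else:
                seen[var] = ref
    select = ", ".join(f"{ref} AS {var}" for var, ref in seen.items())
    sql = f"SELECT {select} FROM " + ", ".join(from_parts)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# "Find all turbines together with their sensors", phrased over the ontology.
query = [("Turbine", ["t"]), ("hasSensor", ["t", "s"])]
print(unfold(query))
```

The end-user only sees the terms Turbine and hasSensor; the join over the source
tables is derived from the mappings.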
     As discussed above, in the Siemens use case one has to deal with large amounts of
streaming data, e.g., sensor and event data from many turbines and diagnostic centres
arriving in many different streams with a size of up to two kilobytes, in combination
with historical, that is, temporal relational data sources. Thus, one has to provide
semantic technologies that enable both modelling, e.g., with temporal RDF, and processing
of historical and streaming data, including data mining and complex event processing.
The data mining (and time series analysis) aspect is crucial for the Siemens use case as
it sets the foundation for preventive diagnostics. More concretely, one of the diagnostic
engineers’ requirements is a means to find correlations between events, such as a
can flame failure in a turbine, and patterns in the turbine’s relevant time-stamped sensor
data measuring temperature (or pressure, etc.). If some correlation between an event
and sensor data is detected in the historical data by some data mining procedure, then
this correlation should be expressible by a continuous query that can be used on
real-time data for preventive diagnostics. For example, an error event in a turbine T may
be identified within the sensor data by a decrease of at least five percent in the measured
value TC255-Measurement in sensor TC255 w.r.t. the average over one hour, followed by a
statistically significant increase (here: more than two times the measured average
over one hour) of the measured value TC256-Measurements. The continuous query may refer
to a measurement ontology and a (rough) model of the turbine structure within the
diagnostic engineer’s ontology, which the engineer has to set up in order to localize
failures (up to some precision).

                Fig. 2. Left: classical OBDA approach. Right: the Optique OBDA system

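
The error-event pattern described above (a drop in TC255 followed by a surge in
TC256) can be sketched as a continuous query over a one-hour sliding window. This
is only an illustration under stated assumptions: the stream is taken to deliver
(timestamp in seconds, sensor, value) tuples in timestamp order, "followed by" is
interpreted as "within one hour of the drop", and the names SlidingAverage and
detect_error_events are hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 3600  # one-hour sliding window, as in the example above


class SlidingAverage:
    """Keeps the readings of the last WINDOW_SECONDS and their running sum."""

    def __init__(self):
        self.readings = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def average_before(self, now):
        """Average of the stored readings inside the window ending at `now`,
        or None if the window is empty."""
        while self.readings and self.readings[0][0] <= now - WINDOW_SECONDS:
            _, old = self.readings.popleft()
            self.total -= old
        return self.total / len(self.readings) if self.readings else None

    def add(self, timestamp, value):
        self.readings.append((timestamp, value))
        self.total += value


def detect_error_events(stream):
    """Yield the timestamps at which the pattern 'TC255 drops by at least 5%
    w.r.t. its hourly average, then TC256 exceeds twice its hourly average'
    is completed."""
    windows = {"TC255": SlidingAverage(), "TC256": SlidingAverage()}
    drop_seen_at = None
    for timestamp, sensor, value in stream:
        if sensor not in windows:
            continue
        avg = windows[sensor].average_before(timestamp)
        if avg is not None:
            if sensor == "TC255" and value <= 0.95 * avg:
                drop_seen_at = timestamp
            elif (sensor == "TC256" and drop_seen_at is not None
                  and timestamp - drop_seen_at <= WINDOW_SECONDS  # assumption
                  and value > 2 * avg):
                yield timestamp
                drop_seen_at = None
        windows[sensor].add(timestamp, value)


readings = [
    (0, "TC255", 100.0), (0, "TC256", 10.0),
    (600, "TC255", 100.0), (600, "TC256", 10.0),
    (1200, "TC255", 90.0),   # 10% below the hourly average of 100
    (1800, "TC256", 25.0),   # more than twice the hourly average of 10
    (2400, "TC256", 30.0),   # no pending drop, so no event
]
print(list(detect_error_events(readings)))  # prints [1800]
```

The same function can be run over historical data to validate a mined correlation
and over the live stream for preventive diagnostics, which is the combination of
historical and streaming data that the use case calls for.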
     Classical OBDA systems (cf. Figure 2, left, for a conceptual architecture) fail
to provide support for these tasks. In the Optique project, we aim at developing a
next generation OBDA system (cf. Figure 2, right) that overcomes these limitations.
More precisely, the project aims at a cost-effective approach that includes the
development of tools and methodologies for processing and analytics of streaming and
temporal data. This requires revising existing OBDA components and developing new ones,
in particular to support novel (i) ontology and mapping management,
(ii) user-friendly query formulation interface(s), (iii) automated query translation, and
(iv) distributed query optimisation and execution in the Cloud. In this paper we
give a short overview of the challenges that we encounter on the way to this goal.
    The remainder of the paper is organised as follows. In Section 2 we discuss Optique’s
challenges in handling streaming and historical data and present the general
architecture of Optique’s OBDA solution. Finally, we discuss related work (Section 3) and
conclude (Section 4).
2   Stream Processing and Analytics in Optique’s OBDA
A general requirement, especially motivated by the Siemens use case, is to support
a combination of data, ontology, mapping, and query languages that is expressive
enough for modelling machines, symptoms, and diagnoses, and that guarantees
complete, correct, and feasible query answering over temporal and streaming data. We
now give an overview of the challenges to be solved in achieving this goal: we discuss
ontologies, query languages, query processing, and visualisation of answers.
    We plan to model temporal data with some extension of RDF. This could be achieved,
for example, by adding to RDF triples an extra fourth component: validity time. Thus,
the first challenge to address is an understanding of the right temporal RDF data model.
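
As a minimal sketch of the fourth-component idea, a triple can be paired with a
validity interval and an ordinary RDF snapshot recovered for any time point. The
half-open integer interval, the class TemporalTriple, the helper snapshot, and
the ex: names are assumptions made for illustration; the choice of the actual
temporal data model is precisely the open challenge named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalTriple:
    """An RDF triple extended with a fourth component: its validity time,
    represented here as a half-open interval [valid_from, valid_to)."""
    subject: str
    predicate: str
    obj: str
    valid_from: int
    valid_to: int

    def holds_at(self, t):
        return self.valid_from <= t < self.valid_to


def snapshot(graph, t):
    """Return the ordinary (non-temporal) triples that are valid at time t."""
    return {(q.subject, q.predicate, q.obj) for q in graph if q.holds_at(t)}


# Hypothetical data: sensor TC255 is moved from one turbine to another at t = 50.
graph = [
    TemporalTriple("ex:TC255", "ex:mountedOn", "ex:Turbine1", 0, 50),
    TemporalTriple("ex:TC255", "ex:mountedOn", "ex:Turbine2", 50, 100),
]
print(snapshot(graph, 10))
```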
    The key component in the Optique OBDA solution is the domain ontology, since it
enables users to understand the data at hand and formulate queries. Thus, the next
challenge to address is how to model both data streams and temporal data via ontologies.
In particular, this will require modelling time (at least in the query language). On the
level of mappings, a homogeneous mapping language for static and streaming data has to be
provided.
    The query language that the system provides to end users should combine
  (i) temporal operators, which address the time dimension of data and allow users to
      retrieve data that was true “always” in the past or “sometimes” in the last X
      months, etc.,
 (ii) time series analysis operators, such as mean, variance, confidence intervals,
      standard deviation, as well as trends, regression, correlation, etc., and
(iii) stream oriented operators, such as sliding windows.
Besides, the query language should provide some means for intelligent answering
of queries on complex patterns in the data, e.g., by telling the user in the negative
case which similar patterns exist (query relaxation) and in the positive case how the
multitude of patterns can be further restricted (query refinement). Finally, the query
language should support formulating queries based on results of explorative data analysis.
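
The three operator families listed above can be illustrated on a plain list of
(time, value) samples. This is a sketch only: the function names, the use of
population variance, and the convention that "always" is vacuously true on an
empty interval are assumptions, not part of any fixed Optique query language.

```python
from statistics import mean, pstdev, pvariance


def always(samples, predicate, start, end):
    """Temporal operator: predicate holds for every sample in [start, end]
    (vacuously true if there is no sample in that interval)."""
    return all(predicate(v) for t, v in samples if start <= t <= end)


def sometimes(samples, predicate, start, end):
    """Temporal operator: predicate holds for at least one sample in [start, end]."""
    return any(predicate(v) for t, v in samples if start <= t <= end)


def summarise(samples, start, end):
    """Time series operators over the samples in [start, end]."""
    values = [v for t, v in samples if start <= t <= end]
    return {"mean": mean(values),
            "variance": pvariance(values),
            "stdev": pstdev(values)}


def sliding_windows(samples, width, slide):
    """Stream operator: yield (window_start, values) for windows
    [s, s + width) advancing by `slide`; samples must be sorted and non-empty."""
    start, last = samples[0][0], samples[-1][0]
    while start <= last:
        yield start, [v for t, v in samples if start <= t < start + width]
        start += slide


# Hypothetical temperature readings at times 0, 10, 20, 30.
samples = [(0, 500.0), (10, 510.0), (20, 470.0), (30, 520.0)]
print(always(samples, lambda v: v > 450, 0, 30))     # prints True
print(sometimes(samples, lambda v: v < 480, 0, 10))  # prints False
print(summarise(samples, 0, 30)["mean"])             # prints 500.0
```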
     Given the query, mapping languages, and ontology, the Optique system should be
able to translate queries into highly optimised executable code over the underlying tem-
poral and streaming data. This requires techniques for automated query translation of
one-time, continuous, temporal queries, and their combinations. Existing translation
techniques are limited and they do not address query optimisation and distributed query
processing. Thus, novel approaches should be developed.
     Another set of challenges in Optique is related to handling answers to queries. One
issue is the visualisation of massive volumes of data formatted according to the domain
ontology. To address this issue, Data Mining and Pattern Learning techniques
over ontological data should be developed to enable automatic identification of
interesting patterns in the data. Another challenge is how to manage “dirty” data. The
Optique system must provide basic means for data cleaning, such as automated
identification, mapping, and alignment of data types, including dates; time synchronisation;
and handling of data quality issues (outlier detection, noise, missing values, etc.).
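
Two of the data quality issues mentioned above, missing values and outliers, can
be sketched as follows. The concrete techniques are our choice for illustration
and are not prescribed by the project: gaps are closed by forward fill, and
outliers are flagged with the modified z-score based on the median absolute
deviation, using the commonly quoted constant 0.6745 and threshold 3.5.

```python
from statistics import median


def fill_missing(values):
    """Replace None by the most recent known value (forward fill);
    leading gaps stay None because nothing is known yet."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score, computed from
    the median absolute deviation (MAD), exceeds `threshold`."""
    known = [v for v in values if v is not None]
    med = median(known)
    mad = median(abs(v - med) for v in known)
    if mad == 0:
        return []  # no spread in the data, so nothing can be flagged
    return [i for i, v in enumerate(values)
            if v is not None and 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sensor readings with one gap and one spike.
raw = [500.0, 502.0, None, 499.0, 900.0, 501.0]
print(flag_outliers(raw))  # prints [4]
print(fill_missing(raw))   # prints [500.0, 502.0, 502.0, 499.0, 900.0, 501.0]
```

Outliers are flagged on the raw readings here, so that filled-in values do not
influence the median.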
     To conclude this section, we present the general architecture of the Optique
OBDA system.

                          Fig. 3. The general architecture of the Optique OBDA system


General architecture of Optique’s OBDA solution. Figure 3 gives an overview of the
architecture of Optique’s OBDA solution and its components. The architecture
follows the three-tier approach and has three layers:

 – The presentation layer consists of three main user interfaces: to compose queries,
   to visualise answers to queries, and to maintain the system by managing ontologies
   and mappings. The first two interfaces are for both end-users and IT-experts, while
   the third one is meant for IT-experts only.
 – The application layer consists of several (main) components of the Optique
   system, supports its machinery, and provides the following functionality:
        • query formulation,
        • ontology and mapping management,
        • query answering, and
        • processing and analytics of streaming and temporal data.
 – The data and resource layer consists of the data sources that the system provides
   access to, that is, relational, semistructured, temporal databases and data streams.
   It also includes a cloud that provides a virtual resource pool.
The entire Optique system will be integrated via the Information Workbench
platform [11]7.


3     Related Work

Each of the approaches mentioned in the following deals with only one of the aspects
that are relevant for the envisioned software component of the Optique query answering
system: either query answering over temporal data that can be described as historical,
or query answering over streamed data. Adhering to the requirements of the (Siemens)
use case, Optique favors a more integrative approach that combines query answering
over historical data with query answering over regularly updated temporal data
stemming from many different streams.
    There exist several approaches that address the problem of representing, reasoning
with, and querying temporal data within the general context of ontologies. As the
Optique project will follow a weak temporalization of the OBDA paradigm, which will
guarantee the preservation of so-called FOL rewritability (which essentially means the
possibility of translating ontological queries into SQL queries over data sources), work
on modal-style temporal ontology languages formalised via Description Logics [13]
is of minor relevance; because of their poor complexity properties, this is true even for
temporalized lightweight logics [2].
    The approach in [10] introduces temporal RDF graphs, details a sound and complete
inference system, and sketches a possible temporal query language. A similar
representation of temporal RDF graphs is adopted within the spatio-temporal RDF engine
Strabon [12, 4]8, which also defines stSPARQL, a spatio-temporal extension of the W3C
recommendation query language SPARQL 1.1. Strabon is currently the only fully
implemented spatio-temporal RDF store with rich functionality and very good performance,
as shown by the comparison in [12, 4]. For a similar temporal version of RDF, which is
oriented towards the temporal database language TSQL2, see [9]. The authors of [17]
favor a more conservative strategy, modeling time directly with language constructs
within RDF and SPARQL, so that the resulting extensions of RDF and SPARQL are mere
syntactic sugar. The logical approach of [14] follows ideas of [10] but shifts the
discussion to the level of genuine ontology languages such as OWL; the semantics of the
temporalized RDF and OWL languages is given by a translation to (a fragment of) first
order logic. The temporalized SPARQL query language uses a careful separation of the
time component and the thematic component that guarantees feasibility of query answering.
    The concept of streaming relational data as well as the concepts underlying complex
event processing are well understood and form the theoretical underpinnings of highly
developed streaming engines used in industrial applications. The picture for stream
processing within the OBDA paradigm is quite different: the few implemented streaming
engines [5, 3, 15] are still under development and have been shown to lack one or another
basic functionality [18]. Though all of the systems are intended to be used within the
OBDA paradigm, only C-SPARQL [3] seems to have (minimal) capabilities for
reasoning/query answering over ontologies. There is no agreement yet on how to extend
SPARQL to work over streams, and so each of the mentioned systems has its own
streamified version of SPARQL. However, the core of all extensions seems to be the
addition of (sliding) window operators over streams, which are adapted from query
languages over relational streams [1].

 7 www.fluidops.com/information-workbench/
 8 www.strabon.di.uoa.gr


4   Conclusions
We presented motivations and challenges for the use of Ontology Based Data Access
systems as a solution to the Big Data access problem in industry. An important
challenge in industry is knowledge discovery and data mining over temporal and streaming
data, while the state-of-the-art semantic technologies that OBDA systems rely on fail
to address it adequately. The Optique project aims at providing this much-needed
technology, which will be validated in two industrial use cases: Statoil and Siemens. In
particular, we plan to (i) explore the possibility of modelling streaming data with
temporal RDF and (ii) understand how knowledge discovery and data mining techniques
developed for Linked Open Data could be adapted and extended in our setting.


Acknowledgements.
The research presented in this paper was financed by the Seventh Framework Pro-
gram (FP7) of the European Commission under Grant Agreement 318338, the Optique
project.


References
 1. Arasu, A., Babu, S., Widom, J.: The CQL continuous query language: semantic foundations
    and query execution. The VLDB Journal 15, 121–142 (2006), 10.1007/s00778-004-0147-z
 2. Artale, A., Kontchakov, R., Ryzhikov, V., Zakharyaschev, M.: Past and future of DL-Lite. In:
    Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10).
    AAAI Press (2010)
 3. Barbieri, D.F., Braga, D., Ceri, S., Valle, E.D., Grossniklaus, M.: C-SPARQL: a continuous
    query language for RDF data streams. Int. J. Semantic Computing 4(1), 3–25 (2010)
 4. Bereta, K., Smeros, P., Koubarakis, M.: Representation and querying of valid time of triples
    in linked geospatial data. In: ESWC 2013 (2013)
 5. Calbimonte, J.P., Corcho, O., Gray, A.J.G.: Enabling ontology-based access to streaming
    data sources. In: Proceedings of the 9th international semantic web conference on The se-
    mantic web - Volume Part I. pp. 96–111. ISWC’10, Springer-Verlag, Berlin, Heidelberg
    (2010)
 6. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Poggi, A., Rodriguez-Muro, M.,
    Rosati, R., Ruzzi, M., Savo, D.F.: The MASTRO System for Ontology-Based Data Access.
    Semantic Web 2(1), 43–53 (2011)
 7. Crompton, J.: Keynote talk at the W3C Workshop on Semantic Web in Oil & Gas Indus-
    try: Houston, TX, USA, 9–10 December (2008), available from http://www.w3.org/
    2008/12/ogws-slides/Crompton.pdf
 8. Giese, M., Calvanese, D., Haase, P., Horrocks, I., Ioannidis, Y., Kllapi, H., Koubarakis, M.,
    Lenzerini, M., Möller, R., Özçep, O., Rodriguez Muro, M., Rosati, R., Schlatte, R., Schmidt,
    M., Soylu, A., Waaler, A.: Scalable End-user Access to Big Data. In: Rajendra Akerkar: Big
    Data Computing. Florida: Chapman and Hall/CRC. To appear. (2013)
 9. Grandi, F.: T-SPARQL: a TSQL2-like temporal query language for RDF. In: International
    Workshop on Querying Graph Structured Data. pp. 21–30 (2010)
10. Gutierrez, C., Hurtado, C., Vaisman, R.: Temporal RDF. In: European Conference on the
    Semantic Web (ESWC ’05). pp. 93–107 (2005)
11. Haase, P., Schmidt, M., Schwarte, A.: The Information Workbench as a Self-Service Platform
    for Linked Data Applications. In: COLD (2011)
12. Kyzirakos, K., Karpathiotakis, M., Koubarakis, M.: Strabon: A Semantic Geospatial DBMS.
    In: International Semantic Web Conference. Boston, USA (Nov 2012)
13. Lutz, C., Wolter, F., Zakharyaschev, M.: Temporal description logics: A survey. In: Demri,
    S., Jensen, C.S. (eds.) 15th International Symposium on Temporal Representation and Rea-
    soning (TIME-08). pp. 3–14 (2008)
14. Motik, B.: Representing and querying validity time in RDF and OWL: a logic-based ap-
    proach. In: Proceedings of the 9th international semantic web conference on The semantic
    web - Volume Part I. pp. 550–565. ISWC’10, Springer-Verlag, Berlin, Heidelberg (2010)
15. Phuoc, D.L., Dao-Tran, M., Parreira, J.X., Hauswirth, M.: A native and adaptive approach
    for unified processing of linked streams and linked data. In: Aroyo, L., Welty, C., Alani,
    H., Taylor, J., Bernstein, A., Kagal, L., Noy, N.F., Blomqvist, E. (eds.) 10th International
    Semantic Web Conference (ISWC 2011). pp. 370–388 (2011)
16. Rodriguez-Muro, M., Calvanese, D.: High Performance Query Answering over DL-Lite On-
    tologies. In: KR (2012)
17. Tappolet, J., Bernstein, A.: Applied temporal RDF: Efficient temporal querying of RDF data
    with SPARQL. In: Proceedings of the 6th European Semantic Web Conference on The Semantic
    Web: Research and Applications. pp. 308–322. ESWC 2009 Heraklion, Springer-Verlag,
    Berlin, Heidelberg (2009)
18. Zhang, Y., Minh Duc, P., Corcho, O., Calbimonte, J.P.: SRBench: A Streaming RDF/SPARQL
    Benchmark. In: Proceedings of International Semantic Web Conference 2012 (November
    2012)