<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How to Feed the Squerall with RDF and Other Data Nuts?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed Nadjib Mami</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damien Graux</string-name>
          <email>damien.graux@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Scerri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hajira Jabeen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soren Auer</string-name>
          <email>auer@l3s.de</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <email>jens.lehmann@cs.uni-bonn.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre, Trinity College of Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Enterprise Information Systems</institution>
          ,
          <addr-line>Fraunhofer IAIS</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Smart Data Analytics (SDA) Group, Bonn University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>TIB &amp; L3S Research Center, Hannover University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Advances in data management methods have resulted in a wide array of storage solutions with varying query capabilities and support for different data formats. Traditionally, heterogeneous data was transformed off-line into a unique format and migrated to a unique data management system before being uniformly queried. However, with the increasing number of heterogeneous data sources, many of which are dynamic, modern applications prefer to access the original, fresh data directly. Addressing this requirement, we designed and developed Squerall, a software framework that enables querying original, large and heterogeneous data on-the-fly without prior data transformation. Squerall is built from the ground up with extensibility in mind, e.g., to support more data sources. Here, we explain Squerall's extensibility aspect and demonstrate step by step how to add support for RDF data, a new extension to the previously supported range of data sources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The term Data Lake [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] denotes a repository of schema-less data stored in its
original form and format without prior transformations. We have built
Squerall [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a software framework implementing the so-called Semantic Data Lake
concept, which enables querying Data Lakes in a uniform manner using Semantic
Web techniques. In essence, a Semantic Data Lake places a 'virtual' schema
over the schema-less data repository by mapping data schemata to high-level
ontologies, which can then be queried in a uniform manner using SPARQL.
      </p>
      <p>The value of a Data Lake-accessing system lies in its ability to query as much
data as possible. To this end, Squerall was built from the ground up with
extensibility in mind, so as to allow and facilitate supporting more data sources.
As we recognize the burden of creating a wrapper for every needed data source,
we resort to leveraging the wrappers that data source providers themselves offer
for many state-of-the-art processing engines. For example, Squerall uses Apache
Spark and Presto as underlying query engines, both of which benefit from a wide
range of connectors accessing the most popular data sources.</p>
      <p>
        In this demonstration, we complement the published<sup>5</sup> work about Squerall [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
by (1) providing more details on the data source extensibility aspect, and (2)
demonstrating extensibility by supporting a new data source, RDF.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Squerall and its Extensibility</title>
      <sec id="sec-2-1">
        <title>Squerall: a Semantic Data Lake</title>
        <p>
          Squerall is an implementation of the Semantic Data Lake concept, i.e.,
querying original large and heterogeneous data using established Semantic Web
techniques and technologies. It is built following the Ontology-Based Data Access
principles [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], where elements from the data schema (entities/attributes) are
associated with elements from an ontology (classes/properties) by means of a mapping
language, forming a virtual schema against which SPARQL queries can be posed.
        </p>
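        <p>To make the notion of a virtual schema more concrete, the following minimal Scala sketch models mappings as plain key-value pairs and selects the data entities whose mapped properties cover those used in a query. It does not reproduce Squerall's mapping language or code: the names VirtualSchema, EntityMapping and relevantEntities, the sample attributes and the schema: terms are hypothetical, and the coverage rule (an entity is relevant when it maps every queried property) is a simplifying assumption.</p>
        <preformat>
```scala
object VirtualSchema {
  // Hypothetical mapping of one data entity: the ontology class it is mapped to,
  // and a map from source attribute to ontology property.
  final case class EntityMapping(
      entity: String,
      ontologyClass: String,
      attributes: Map[String, String])

  // Assumption: an entity is relevant when every queried property is the
  // target of one of its attribute mappings.
  def relevantEntities(mappings: Seq[EntityMapping], queryProperties: Set[String]): Seq[String] =
    mappings
      .filter(m => queryProperties.subsetOf(m.attributes.values.toSet))
      .map(_.entity)

  def main(args: Array[String]): Unit = {
    // Illustrative mappings only; entity and attribute names are made up.
    val mappings = Seq(
      EntityMapping("product", "schema:Product",
        Map("title" -> "schema:name", "price" -> "schema:price")),
      EntityMapping("producer", "schema:Organization",
        Map("name" -> "schema:name")))

    // Only "product" maps both queried properties.
    println(relevantEntities(mappings, Set("schema:name", "schema:price")))
  }
}
```
        </preformat>
        <p>Keeping the mappings separate from the data is what lets the same SPARQL property resolve to differently named attributes in different sources.</p>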
      </sec>
      <sec id="sec-2-2">
        <title>Squerall Extensibility</title>
        <p>As we recognize the burden of creating wrappers for the variety of data sources,
we chose not to reinvent the wheel and instead rely on the wrappers often offered by the
developers of the data sources themselves or by specialized experts. The way a
connector is used depends on the query engine:</p>
        <list list-type="bullet">
          <list-item>
            <p>Spark: the connector's role is to load a specific data entity into a DataFrame
using the Spark SQL API. Its usage is simple: it only requires providing access
values to a predefined list of options inside a simple connection template:
<preformat>spark.read.format(sourceType).options(options).load</preformat>
where sourceType designates the data source type to access, and options is
a simple key-value list storing, e.g., username, password, host, cluster settings,
etc. The template is similar for most data source types. There are dozens of
connectors<sup>6</sup> already available for a multitude of data sources.</p>
          </list-item>
          <list-item>
            <p>Presto: access options are stored in a plain text file in a key-value fashion.
Presto directly uses an SQL interface to query heterogeneous data, e.g.,
<preformat>SELECT ... FROM cassandra.cdb.product C JOIN mongo.mdb.producer M ON C.producerID = M.ID</preformat>
so there is no direct interaction with the connectors. Presto internally
and transparently uses the access options to load the necessary data at
query time. Similarly, there are already several ready-to-use connectors for Presto<sup>7</sup>.</p>
          </list-item>
        </list>
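        <p>Both engines thus consume access options as simple key-value pairs. As a minimal sketch, the Scala object below reads such pairs from plain text into a Map. The object AccessOptions and its parse method are hypothetical helpers written for this illustration and are not part of Squerall, Spark or Presto; the two sample keys follow the naming used in Presto's Cassandra catalog files, and the host value is a placeholder.</p>
        <preformat>
```scala
object AccessOptions {
  // Parse "key=value" lines into a Map. Blank lines, '#' comments and
  // lines without '=' are ignored.
  def parse(text: String): Map[String, String] =
    text.split("\n").toList
      .map(_.trim)
      .filter(_.nonEmpty)
      .filterNot(_.startsWith("#"))
      .flatMap { line =>
        line.split("=", 2) match {
          case Array(key, value) => Some(key.trim -> value.trim)
          case _                 => None
        }
      }
      .toMap

  def main(args: Array[String]): Unit = {
    // Hypothetical content of a key-value access file for a Cassandra source.
    val catalog =
      """# access options
        |connector.name=cassandra
        |cassandra.contact-points=localhost
        |""".stripMargin

    parse(catalog).foreach { case (key, value) => println(s"$key -> $value") }
  }
}
```
        </preformat>
        <p>A map of this shape is also what the options(...) call of the Spark template expects, although the option keys themselves differ from one connector to another.</p>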
        <p>Hence, while Squerall supports MongoDB, Cassandra, Parquet,
CSV and various JDBC sources by default, interested users can easily provide access to
other data sources by leveraging Spark and Presto connectors<sup>8</sup>.</p>
        <p><sup>5</sup> At the ISWC Resources track.</p>
        <p><sup>6</sup> https://spark-packages.org/</p>
        <p><sup>7</sup> https://prestosql.io/docs/current/connector.html</p>
        <p><sup>8</sup> Tutorial: https://github.com/EIS-Bonn/Squerall/wiki/Extending-Squerall</p>
        <p><bold>Supporting a New Data Source: Case of RDF Data</bold></p>
        <p>
          In case no connector is found for a given data source type, we show in this section
the principles of supporting a new data source. The procedure concerns Spark
as the query engine, where the connector's role is to generate a DataFrame from an
underlying data entity. Squerall did not previously have a wrapper for RDF data.
With the wealth of RDF data available today as part of the Linked Data and
Knowledge Graph movements, supporting RDF data is paramount. Contrary to
the previously supported data sources, RDF does not require a schema, neither
fixed nor flexible. As a result, lots of RDF data is generated without a schema.
In this case, the schema has to be exhaustively extracted from the data
on-the-fly during query execution. Also, as per the Data Lake requirements, it
is necessary not to apply any pre-processing and to directly access the original
data. If an entity inside an RDF dataset is detected as relevant to (part of) a query, a
set of transformations is applied to flatten the (subject, property, object) triples
and extract the schema elements needed to generate the DataFrame(s). The full
procedure is shown in Figure 1 and is described as follows:
        </p>
        <list list-type="order">
          <list-item>
            <p>First, triples are loaded into a Spark distributed dataset<sup>9</sup> with the schema
(subject: String, property: String, object: String).</p>
          </list-item>
          <list-item>
            <p>Using Spark transformations, we generate a new dataset. We map (s,p,o)
triples to pairs (s,(p,o)), then group the pairs by subject: (s,(p,o)+). We then find the
class from p (where p = rdf:type), map the pairs to new pairs (class,(s,(p,o)+)),
and group them by class: (class,(s,(p,o)+)+). Each class has one or more
instances identified by 's', each containing one or more (p,o) pairs.</p>
          </list-item>
          <list-item>
            <p>The new dataset is partitioned into a set of class-based DataFrames, whose columns
are the properties and whose tuples contain the objects. This corresponds to
the so-called property table partitioning [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].</p>
          </list-item>
          <list-item>
            <p>The XSD data types, if present as part of the object, are detected and used
to type the DataFrame attributes; otherwise, string is used.</p>
          </list-item>
          <list-item>
            <p>Only the relevant entities, detected using the mappings by matching their
attributes against the query properties, are retained; the rest are discarded.</p>
          </list-item>
        </list>
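        <p>The grouping steps above can be sketched with plain Scala collections standing in for Spark's distributed datasets. This is a minimal illustration under stated assumptions, not the SeBiDa or Squerall implementation: the object PropertyTables, the type aliases and the function names are hypothetical, triples are assumed to be already parsed into string tuples, and a multi-valued property keeps only one of its objects.</p>
        <preformat>
```scala
object PropertyTables {
  type Triple = (String, String, String)         // (subject, property, object)
  type Table  = Map[String, Map[String, String]] // subject -> (property -> object)

  val RdfType = "rdf:type"

  // Steps 1-3: (s,p,o) -> (s,(p,o)+) -> (class,(s,(p,o)+)+).
  def flatten(triples: Seq[Triple]): Map[String, Table] = {
    // Group the (p,o) pairs by subject.
    val bySubject: Map[String, Seq[(String, String)]] =
      triples.groupBy(_._1).map { case (s, ts) => s -> ts.map(t => (t._2, t._3)) }

    // Attach the class found via rdf:type; subjects without a type are skipped.
    // Simplification: for a multi-valued property only one object is kept.
    val withClass: Seq[(String, (String, Map[String, String]))] =
      bySubject.toSeq.flatMap { case (s, pos) =>
        pos.collectFirst { case (RdfType, cls) => cls }
          .map(cls => cls -> (s -> pos.filterNot(_._1 == RdfType).toMap))
      }

    // Group the instances by class: one property table per class.
    withClass.groupBy(_._1).map { case (cls, rows) => cls -> rows.map(_._2).toMap }
  }

  // Columns of a class table: the union of the properties of its instances.
  def columns(table: Table): Seq[String] =
    table.values.flatMap(_.keys).toSeq.distinct.sorted

  def main(args: Array[String]): Unit = {
    val triples = Seq(
      ("s1", "rdf:type", "A"), ("s1", "p1", "o1"), ("s1", "p2", "o2"),
      ("s2", "rdf:type", "B"), ("s2", "pn", "on"))

    val tables = flatten(triples)
    println(columns(tables("A")))
    println(tables("A")("s1"))
  }
}
```
        </preformat>
        <p>In a Spark setting the same chain would be expressed with distributed map and group-by transformations, so that the exhaustive schema extraction is spread over the cluster.</p>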
        <p>
          This procedure generates a (typed) DataFrame that can be joined with DataFrames
generated using other data connectors from other data sources. The procedure
is part of our previously published effort, SeBiDa [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We made the usage of the
new RDF connector as simple as that of the other Spark connectors:
<preformat>val rdf = new NTtoDF()
val df = rdf.options(options).read(filePath, sparkURI).toDF</preformat>
        </p>
        <p>Here, rdf is an instance of the NTtoDF connector, and options holds the access information,
including the RDF file path and the specific RDF class to load into the DataFrame.</p>
        <p><sup>9</sup> Called RDD: Resilient Distributed Dataset, a distributed tabular data structure.</p>
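        <p>The typing step of the procedure (step 4) can be illustrated in isolation. The sketch below splits an object into its lexical value and datatype and picks a column type, falling back to string. It is only an assumption-laden illustration: the object XsdTyping and its functions are hypothetical, datatypes are written as prefixed names such as xsd:integer rather than the full IRIs found in N-Triples files, and the returned type labels are plain strings rather than Spark types.</p>
        <preformat>
```scala
object XsdTyping {
  // Remove the surrounding double quotes of a literal, if any.
  private def unquote(s: String): String = s.stripPrefix("\"").stripSuffix("\"")

  // Split an object such as "42"^^xsd:integer into (lexical value, datatype).
  // Objects without a datatype suffix fall back to xsd:string.
  def splitLiteral(obj: String): (String, String) =
    obj.split("\\^\\^", 2) match {
      case Array(value, datatype) => (unquote(value), datatype)
      case _                      => (unquote(obj), "xsd:string")
    }

  // Map a datatype to a column type label; unknown datatypes default to string.
  def columnType(datatype: String): String = datatype match {
    case "xsd:integer" | "xsd:int"   => "integer"
    case "xsd:double" | "xsd:float"  => "double"
    case "xsd:boolean"               => "boolean"
    case _                           => "string"
  }

  def main(args: Array[String]): Unit = {
    val (value, datatype) = splitLiteral("\"42\"^^xsd:integer")
    println(s"$value is stored in a column of type ${columnType(datatype)}")
  }
}
```
        </preformat>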
        <p>[Figure 1: RDF connector. RDF triples are loaded into a triples dataset, which is split into class-based DataFrames (Type A, relevant; Type B, irrelevant) with typed property columns; the relevant DataFrame is joined with the DataFrames produced by the connectors of other data sources (DS1, DS2).]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this demonstration article<sup>10</sup>, we have described in more depth the
extensibility aspect of Squerall in supporting more data sources. We have demonstrated
the extensibility principles by adding support for RDF data. In the common
absence of a schema, RDF triples have to be exhaustively parsed and reformatted
into a tabular representation at query time, and only then can they be queried.
In the future, in order to alleviate the reformatting cost and thus accelerate
query processing, we intend to implement a light-weight caching technique
that saves the results of the flattening phase across different queries.
Beyond the Squerall context, we will investigate making the newly created connector
(currently supporting N-Triples RDF) available in the Spark Packages (connectors)
hub, so that the public can process large RDF data using Apache Spark.</p>
      <p><sup>10</sup> Screencasts are publicly available from: https://git.io/fjyOO</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dixon</surname>
          </string-name>
          , J.:
          <article-title>Pentaho, Hadoop, and Data Lakes</article-title>
          (
          <year>2010</year>
          ), https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes, online; accessed 27-January-2019
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mami</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scerri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jabeen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Querying Data Lakes using Spark and Presto</article-title>
          .
          <source>In: The World Wide Web Conference</source>
          . pp.
          <fpage>3574</fpage>
          –
          <lpage>3578</lpage>
          . WWW '19,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Mami</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scerri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jabeen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
          </string-name>
          , J.: Squerall:
          <article-title>Virtual ontology-based access to heterogeneous and large data sources</article-title>
          .
          <source>Proceedings of 18th International Semantic Web Conference</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mami</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scerri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          :
          <article-title>Towards semantification of big data technology</article-title>
          .
          <source>In: International Conference on Big Data Analytics and Knowledge Discovery</source>
          . pp.
          <fpage>376</fpage>
          –
          <lpage>390</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Poggi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lembo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Giacomo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
          </string-name>
          , R.:
          <article-title>Linking data to ontologies</article-title>
          .
          <source>In: Journal on Data Semantics X</source>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sayers</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
          </string-name>
          , D.:
          <article-title>Efficient RDF storage and retrieval in Jena2</article-title>
          .
          <source>In: Proceedings of the First International Conference on Semantic Web and Databases</source>
          . pp.
          <fpage>120</fpage>
          –
          <lpage>139</lpage>
          .
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>