<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Dataflow Language for Big RDF Data Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fadi Maali</string-name>
          <email>fadi.maali@insight-centre.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics, National University of Ireland Galway</institution>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>When analysing large RDF datasets, users are left with two main options: using SPARQL or using an existing non-RDF-specific big data language, each with its own limitations. The purely declarative nature of SPARQL and its high evaluation cost can be limiting in some scenarios. On the other hand, existing big data languages are designed mainly for tabular data and, therefore, applying them to RDF data results in verbose, unreadable, and sometimes inefficient scripts. My PhD work aims at enhancing the programmability of big RDF data. The goal is to augment the existing tools with a declarative dataflow language that focuses on the analysis of large-scale RDF data. Similar to other big data processing languages, I aim at identifying a set of basic operators that are amenable to parallelisation and at supporting extensibility via user-defined custom code. In addition, a graph-based data model and support for pattern matching as in SPARQL are to be adopted. Given the focus on large-scale data, scalability and efficiency are critical requirements. In this paper, I report on my research plan and describe some preliminary results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Petabyte- and terabyte-scale datasets are becoming commonplace, especially in
industries such as telecom, health care, retail, pharmaceuticals and financial
services. To process these huge amounts of data, a number of distributed
computational frameworks have been suggested recently [7, 13, 31]. Furthermore,
there has been a surge of activity on layering declarative languages on top
of these platforms. Examples include Pig Latin from Yahoo [16], DryadLINQ
from Microsoft [30], Jaql from IBM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], HiveQL from Facebook [27] and
Meteor/Sopremo [11].
      </p>
      <p>In the Semantic Web realm, this surge of analytics languages was not reflected,
despite the significant growth in the available RDF data. To analyse large RDF
datasets, users are left mainly with two options: using SPARQL [10] or using an
existing non-RDF-specific big data language. I argue that each of these options
has its own limitations.</p>
      <p>SPARQL is a graph pattern matching language that provides rich capabilities
for slicing and dicing RDF data. The latest version, SPARQL 1.1, also supports
aggregation and nested queries. Nevertheless, the purely declarative nature
of SPARQL obliges users to express their needs in a single query. This can
be unnatural for some programmers and challenging for complex needs [15, 9].
Furthermore, SPARQL evaluation is known to be costly [18, 23] and requires all
data to be transformed into RDF beforehand.</p>
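      <p>To make SPARQL-style pattern matching concrete, the following is a minimal, self-contained Python sketch of evaluating a basic graph pattern over a set of triples; the data and all function names are hypothetical illustrations, not part of SPARQL or any engine.</p>

```python
# Minimal sketch of SPARQL-style basic graph pattern (BGP) matching.
# Hypothetical illustration only; names are placeholders.

def is_var(term):
    """Terms starting with '?' are variables."""
    return isinstance(term, str) and term.startswith("?")

def match_triple(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if p_term in binding and binding[p_term] != t_term:
                return None  # conflicting binding for a shared variable
            binding[p_term] = t_term
        elif p_term != t_term:
            return None  # constant term does not match
    return binding

def match_bgp(patterns, triples):
    """Return all bindings satisfying every pattern (a join on shared vars)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_triple(pattern, t, b)) is not None]
    return bindings

triples = [
    (":alice", ":knows", ":bob"),
    (":bob",   ":knows", ":carol"),
    (":alice", ":age",   "30"),
]
# Whom does someone that :alice knows, know?
result = match_bgp([(":alice", ":knows", "?x"), ("?x", ":knows", "?y")], triples)
# result == [{"?x": ":bob", "?y": ":carol"}]
```

      <p>Each additional triple pattern acts as a join over the variables it shares with earlier patterns.</p>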
      <p>The other alternative, using an existing big data language such as Pig
Latin or HiveQL, also has its own limitations. These languages were designed
mainly for tabular data and, consequently, using them with RDF data commonly
results in verbose, unreadable, and sometimes inefficient scripts [21].</p>
      <p>My PhD work aims at enhancing the programmability of big RDF data.
The goal is to augment the existing tools with a declarative dataflow language
that focuses on the analysis of large-scale RDF data. Similar to other big data
processing languages, I aim at identifying a small set of basic operators that
are amenable to parallelisation and at supporting extensibility via user-defined
custom code. In addition, a graph-based data model and support for
pattern matching as in SPARQL are to be adopted. Given the focus on
large-scale data, scalability and efficiency are critical requirements. Moreover, I intend
to work towards relaxing the prerequisite of full transformation of non-RDF data
and facilitating processing of RDF and non-RDF data together.</p>
    </sec>
    <sec id="sec-2">
      <title>Relevancy</title>
      <p>Data is playing a crucial role in societies, governments and enterprises. For
instance, data science is increasingly utilised in supporting data-driven decisions
and in delivering data products [14, 20]. Furthermore, scientific fields such as
bioinformatics, astronomy and oceanography are going through a shift from
"querying the world" to "querying the data" in what is commonly referred to as
e-science [12]. The main challenge nowadays is analysing the data and extracting
useful insights from it.</p>
      <p>
        Declarative languages simplify programming and reduce the cost of creation,
maintenance, and modification of software. They also help bring the
non-professional user into effective communication with a database [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In 2008, the
Claremont Report on Database Research identified declarative programming as
one of the main research opportunities in the data management field [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>My PhD work intends to facilitate analysing large amounts of RDF data by
designing a declarative language. The fast pace at which data is growing
and the expected shortage of people with data analytics skills [6] make users'
productivity of paramount importance. Moreover, by embracing the processing of
RDF and non-RDF data together, my hope is to increase the utilisation of the
constantly growing body of RDF data.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        A large number of declarative languages have been introduced recently as part of the
big data movement. These languages vary in their programming paradigm and in
their underlying data model. Pig Latin [16] is a dataflow language with a tabular
data model that also supports nesting. Jaql [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a declarative scripting language
that blends in a number of constructs from functional programming languages
and uses JSON for its data model. HiveQL [27] adopts a declarative syntax
similar to SQL and its underlying data model is a set of tables. Other examples
of languages include Impala1, Cascalog2, Meteor [11] and DryadLINQ [30]. [26]
presented a performance as well as language comparison of HiveQL, Pig Latin
and Jaql. [22] also compared a number of big data languages but focused on their
compilation into a series of MapReduce jobs.
      </p>
      <p>
        In the Semantic Web field, SPARQL is the W3C-recommended query
language for RDF. A number of extensions to SPARQL have been proposed in the
literature to support search for semantic associations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and to add nested regular
expressions [19], for instance. However, these extensions do not change the purely
declarative nature of SPARQL. There are also a number of non-declarative
languages that can be integrated into common programming languages to provide
support for RDF data manipulation [17, 25]. In the more general context of
graph processing languages, [29] provides a good survey.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Research Questions</title>
      <p>A core part of a declarative language is its underlying data model. A data model
consists of a notation to describe data and a set of operations used to manipulate
that data [28]. SPARQL Algebra [18] is the data model underlying SPARQL.
SPARQL Algebra cannot be used as the underlying model for the declarative
language I am working on, for the following reasons:
- It is not fully composable. The current SPARQL Algebra transitions from
graphs (i.e. the initial inputs) to sets of bindings (which are basically tables
resulting from pattern matching). Subsequently, further operators such as
Join, Filter, and Union are applied to sets of bindings. In other words,
the flow is partly "hard-coded" in the SPARQL Algebra and a user cannot,
for instance, apply pattern matching to the results of another pattern
matching or "join" two graphs. In a dataflow language, the flow is guided
by the user and cannot be limited to what SPARQL Algebra imposes.
- It assumes all data is in RDF.
- The expressivity of SPARQL comes at the cost of high evaluation
complexity [18, 23].</p>
      <p>Therefore, the main challenge is to define an adequate data model that
embraces RDF and non-RDF data and strikes a balance between expressivity and
complexity. Accordingly, my research questions are:
RQ1: What is the appropriate data model to adopt?
RQ2: How do we achieve efficient, scalable performance?
RQ3: How do we enable processing of RDF and non-RDF data together?</p>
      <sec id="sec-4-1">
        <title>1 https://github.com/cloudera/impala 2 http://cascalog.org/</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Hypotheses</title>
      <p>We introduce several hypotheses that we would like to test in our research.
H1: A new data model can be defined to underlie a dataflow language for RDF
data. The expressivity and complexity of this data model can be determined.
H2: Algebraic properties of the new data model can be exploited to enhance
performance.</p>
      <p>H3: Scalable, efficient performance can be achieved by utilising state-of-the-art
distributed computational frameworks.</p>
      <p>H4: Integrating transformation to RDF as part of the data processing enables
processing RDF and non-RDF data together and can eliminate the need for
full transformation to RDF.</p>
    </sec>
    <sec id="sec-6">
      <title>Approach &amp; Preliminary Results</title>
      <p>We have an initial proposal for a data model and a dataflow language. Our goal
is to iteratively refine the model (H1, H2) and our implementation (H3) and
then extend it to include non-RDF data and data transformation (H4). The
next two subsections summarise our preliminary results.</p>
      <sec id="sec-6-1">
        <title>RDF Algebra</title>
        <p>RDF Algebra is our proposed data model. This algebra defines operators similar
to those defined in SPARQL Algebra, but ones that are fully composable. To achieve
such composability, each algebra operator's input and output is always a pair
of a graph and a corresponding table (H1). The core set of expressions in this
algebra is: atomic, projection, extending, triple pattern matching, filtering,
cross product and aggregation. The syntax and the semantics of these expressions
have been formally defined and their expressivity in comparison to SPARQL is
captured by the following lemma.</p>
        <p>Lemma 1. RDF Algebra expressions can express SPARQL 1.1 basic graph
patterns with filters, aggregations and assignments.</p>
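        <p>As an illustration of the composability idea, here is a small Python sketch, hypothetical and mirroring only the flavour of the algebra rather than its formal definition, in which every operator maps a (graph, table) pair to another (graph, table) pair, so operators can be chained freely:</p>

```python
# Sketch of a fully composable algebra: every operator maps a
# (graph, table) pair to another (graph, table) pair. Hypothetical;
# the operator set here only loosely mirrors the paper's algebra.

def atomic(graph):
    """Wrap an input graph with a one-row empty table."""
    return (frozenset(graph), [{}])

def triple_pattern(s, p, o):
    """Extend each table row with matches of a single triple pattern."""
    def op(pair):
        graph, table = pair
        is_var = lambda t: t.startswith("?")
        rows = []
        for row in table:
            for (ts, tp, to) in graph:
                new, ok = dict(row), True
                for term, val in ((s, ts), (p, tp), (o, to)):
                    if is_var(term):
                        if new.get(term, val) != val:
                            ok = False  # conflicts with an earlier binding
                        else:
                            new[term] = val
                    elif term != val:
                        ok = False
                if ok:
                    rows.append(new)
        return (graph, rows)
    return op

def project(*vars_):
    """Keep only the given variables in each row."""
    def op(pair):
        graph, table = pair
        return (graph, [{v: r[v] for v in vars_} for r in table])
    return op

def pipeline(pair, *ops):
    for op in ops:
        pair = op(pair)
    return pair

g = {(":a", ":knows", ":b"), (":b", ":knows", ":c")}
_, table = pipeline(atomic(g),
                    triple_pattern("?x", ":knows", "?y"),
                    triple_pattern("?y", ":knows", "?z"),
                    project("?x", "?z"))
# table == [{"?x": ":a", "?z": ":c"}]
```

        <p>Applying one pattern matching to the output of another, as above, is exactly what the hard-coded flow of SPARQL Algebra disallows.</p>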
        <p>We have also started to study some unique algebraic properties of our data
model (H2). Cascading triple patterns and joins in the RDF Algebra results in
some unique optimisation opportunities. Therefore, we defined a partial
ordering relationship between triple patterns to capture subsumption among results.
Consequently, evaluation plans can be optimised and intermediary results can
be reused in order to enhance evaluation performance (H3).</p>
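        <p>The subsumption idea can be illustrated with a small hypothetical check (sound only for patterns without repeated variables): pattern A subsumes pattern B when every triple matching B also matches A, so cached results for A can be filtered instead of rescanning the data for B.</p>

```python
# Hypothetical illustration of a subsumption ordering between triple
# patterns (terms starting with "?" are variables). This positional
# check is only sound for patterns without repeated variables.

def is_var(term):
    return term.startswith("?")

def subsumes(general, specific):
    """True if every triple matching `specific` also matches `general`."""
    return all(is_var(g) or g == s for g, s in zip(general, specific))

# (?s :knows ?o) is more general than (?s :knows :bob) ...
assert subsumes(("?s", ":knows", "?o"), ("?s", ":knows", ":bob"))
# ... but not the other way around.
assert not subsumes(("?s", ":knows", ":bob"), ("?s", ":knows", "?o"))
```

        <p>Under such an ordering, the results of a more general pattern can be cached and reused when evaluating more specific patterns.</p>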
        <p>The innovative part of this model is the pairing of graphs and tables, which,
to the best of our knowledge, has not been reported in the literature before. This
ensures full composability and can potentially accommodate tabular data (with
an empty graph component that can be populated by transforming the tabular
data only when necessary) (H4).</p>
      </sec>
      <sec id="sec-6-2">
        <title>SYRql, A Dataflow Language</title>
        <p>Our current dataflow language, grounded in the algebra defined above, is
called SYRql. A SYRql script is a sequence of statements, and each statement
is either an assignment or an expression. The syntax of SYRql borrows the
"-&gt;" syntax from Jaql to explicitly show the dataflow, whereas pattern
matching in SYRql uses syntax identical to basic graph patterns in SPARQL.
SPARQL syntax for patterns is intuitive, concise and well known to many users
in the Semantic Web field. We hope that this facilitates learning SYRql for many
users.</p>
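        <p>As a purely hypothetical illustration (the keywords below are invented for the example; only the "-&gt;" pipe and the SPARQL-style basic graph pattern follow the description above), a SYRql script might look like:</p>

```
products = load('products.jsonld');
result = products
  -> match { ?p a :Product . ?p :price ?price }
  -> filter (?price > 100)
  -> project (?p, ?price);
```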
        <p>The current implementation3 uses JSON4 for the internal representation of the
data. In particular, we use JSON arrays for bindings and JSON-LD [24] to represent
graphs. SYRql scripts are parsed and then translated into a directed acyclic
graph (DAG) of MapReduce jobs (H3). Sequences of expressions that can be
evaluated together are grouped into a single MapReduce job. Finally, the graph
is topologically sorted and the MapReduce jobs are scheduled to execute on the
cluster. Our initial performance evaluation showed performance comparable to
well-established languages such as Pig Latin and Jaql (Figure 1).</p>
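        <p>The scheduling step described above can be sketched with Python's standard library; the job names and dependency edges below are hypothetical, not SYRql's actual compiled plan:</p>

```python
# Sketch of scheduling a DAG of MapReduce jobs: the DAG is
# topologically sorted so that every job runs only after the jobs
# whose output it consumes. Job names are hypothetical.
from graphlib import TopologicalSorter

# job -> set of jobs whose output it consumes
dag = {
    "match":     set(),
    "filter":    {"match"},
    "aggregate": {"filter"},
    "join":      {"match", "filter"},
}

order = list(TopologicalSorter(dag).static_order())
# every job appears only after all of its dependencies
for job, deps in dag.items():
    assert set(order[: order.index(job)]) >= deps
```

        <p>In the actual implementation, expressions that can be evaluated together would first be merged into a single job before the sorted schedule is submitted to the cluster.</p>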
        <sec id="sec-6-2-1">
          <title>3 https://gitlab.deri.ie/Maali/syrql-jsonld-imp/wikis/home 4 http://json.org</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Evaluation Plan</title>
      <p>We are currently conducting a formal study of the data model, its algebraic
properties, its complexity and expressivity. We plan to compare it to First-Order
Logic languages (H1, H2).</p>
      <p>
        For performance evaluation, we have started comparing the response time
that our implementation provides to that of SPARQL and existing big data languages
(H3). Figure 1 shows initial results. The benchmark we used is based on the
Berlin SPARQL Benchmark (BSBM) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that defines an e-commerce use case.
Specifically, we translated a number of queries in the BSBM Business Intelligence
use case (BSBM BI)5 into equivalent programs in HiveQL, Pig Latin and Jaql.
To the best of our knowledge, this is the first benchmark that uses existing big
data languages with RDF data.
      </p>
      <p>Furthermore, we plan to use some data manipulation scenarios from
bioinformatics research to guide requirements collection for processing RDF and non-RDF
data (H4). We plan to conduct a performance evaluation and a user study to
evaluate our work in this regard.</p>
    </sec>
    <sec id="sec-8">
      <title>Reflections</title>
      <p>We base our work on a good understanding of Semantic Web technologies as
well as existing Big Data techniques and languages. The initial results we have
collected are promising. Nevertheless, the current implementation leaves room
for improvement. We plan to use RDF compression techniques such as HDT [8]
and to experiment with distributed frameworks other than MapReduce, such as
Spark. Finally, we believe that our data model and its algebraic properties can
yield fruitful results that can further be applied in tasks like caching RDF query
results, views management and query result reuse.
5 http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BusinessIntelligenceUseCase/index.html</p>
      <p>6. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Technical report, McKinsey Global Institute, 2011.
7. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
8. Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias. Binary RDF Representation for Publication and Exchange (HDT). Web Semantics: Science, Services and Agents on the World Wide Web, 19:22-41, 2013.
9. Stefan Hagedorn and Kai-Uwe Sattler. Efficient Parallel Processing of Analytical Queries on Linked Data. In OTM, 2013.
10. Steve Harris and Andy Seaborne. SPARQL 1.1 Query Language. W3C Recommendation, 21 March 2013. http://www.w3.org/TR/sparql11-query/.
11. Arvid Heise, Astrid Rheinländer, Marcus Leich, Ulf Leser, and Felix Naumann. Meteor/Sopremo: An Extensible Query Language and Operator Model. In BigData, 2012.
12. Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
13. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In EuroSys, 2007.
14. Mike Loukides. What is Data Science? O'Reilly Radar, June 2010.
15. Fadi Maali and Stefan Decker. Towards an RDF Analytics Language: Learning from Successful Experiences. In COLD, 2013.
16. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In SIGMOD, 2008.
17. Eyal Oren, Renaud Delbru, Sebastian Gerke, Armin Haller, and Stefan Decker. ActiveRDF: Object-oriented Semantic Web Programming. In WWW, 2007.
18. Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. Semantics and Complexity of SPARQL. In ISWC, 2006.
19. Jorge Pérez, Marcelo Arenas, and Claudio Gutiérrez. nSPARQL: A Navigational Language for RDF. Web Semantics: Science, Services and Agents on the World Wide Web, 2010.
20. Foster Provost and Tom Fawcett. Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data, 1(1), March 2013.
21. Padmashree Ravindra, HyeongSik Kim, and Kemafor Anyanwu. An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce. In ESWC, 2011.
22. Caetano Sauer and Theo Härder. Compilation of Query Languages into MapReduce. Datenbank-Spektrum, 2013.
23. Michael Schmidt, Michael Meier, and Georg Lausen. Foundations of SPARQL Query Optimization. In ICDT, 2010.
24. Manu Sporny, Dave Longley, Gregg Kellogg, Markus Lanthaler, and Niklas Lindström. JSON-LD 1.0. W3C Recommendation, 16 January 2014.
25. Steffen Staab. LITEQ: Language Integrated Types, Extensions and Queries for RDF Graphs. Interoperation in Complex Information Ecosystems, 2013.
26. Robert J. Stewart, Phil W. Trinder, and Hans-Wolfgang Loidl. Comparing High Level MapReduce Query Languages. In APPT, 2011.
27. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. Hive - A Petabyte Scale Data Warehouse Using Hadoop. In ICDE, 2010.
28. J. D. Ullman. Principles of Database and Knowledge-base Systems, chapter 2. Computer Science Press, Rockville, 1988.
29. Peter T. Wood. Query Languages for Graph Databases. SIGMOD Record, 2012.
30. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A System for General-purpose Distributed Data-parallel Computing Using a High-level Language. In OSDI, 2008.
31. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. In NSDI, 2012.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Rakesh</given-names>
            <surname>Agrawal</surname>
          </string-name>
          et al.
          <source>The Claremont Report on Database Research</source>
          . SIGMOD Rec.,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Kemafor</given-names>
            <surname>Anyanwu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amit</given-names>
            <surname>Sheth</surname>
          </string-name>
          .
          <article-title>P-queries: enabling querying for semantic associations on the semantic web</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Kevin S.</given-names>
            <surname>Beyer</surname>
          </string-name>
          , Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Y. Eltabakh,
          <string-name>
            <given-names>Carl-Christian</given-names>
            <surname>Kanne</surname>
          </string-name>
          , Fatma Ozcan, and
          <string-name>
            <given-names>Eugene J.</given-names>
            <surname>Shekita</surname>
          </string-name>
          .
          <article-title>Jaql: A Scripting Language for Large Scale Semistructured Data Analysis</article-title>
          .
          <source>PVLDB</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Schultz</surname>
          </string-name>
          .
          <article-title>The Berlin SPARQL Benchmark</article-title>
          .
          <source>IJSWIS</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Donald D.</given-names>
            <surname>Chamberlin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Raymond F.</given-names>
            <surname>Boyce</surname>
          </string-name>
          .
          <article-title>SEQUEL: A Structured English Query Language</article-title>
          .
          <source>In SIGFIDET</source>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>