A Data Mashup Language for the Data Web

Mustafa Jarrar (University of Cyprus, mjarrar@cs.ucy.ac.cy)
Marios D. Dikaiakos (University of Cyprus, mdd@cs.ucy.ac.cy)

ABSTRACT
This paper is motivated by the massively increasing structured data on the Web (the Data Web), and by the need for novel methods to exploit these data to their full potential. Building on the remarkable success of Web 2.0 mashups, this paper regards the internet as a database, where each web data source is seen as a table, and a mashup is seen as a query over these sources. We propose a data mashup language, which allows people to intuitively query and mash up structured and linked data on the web. Unlike existing query methods, the novelty of MashQL is that it allows people to navigate, query, and mash up data sources without any prior knowledge about their schema, vocabulary, or technical details. We do not even assume that a data source has an online or inline schema. Furthermore, MashQL supports query pipes as a built-in concept, rather than merely a visualization of links between modules.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright is held by the author/owner(s). LDOW2009, April 20, 2009, Madrid, Spain.

1. INTRODUCTION AND MOTIVATION
In this short article we propose a data mashup approach in a graphical, Yahoo Pipes style. This research is still a work in progress; please refer to [13] for the latest findings.

In parallel to the continuous development of the hypertext web, we are witnessing a rapid emergence of the Data Web. Not only is the amount of social metadata increasing, but many companies (e.g., Google Base, Upcoming, Flickr, eBay, Amazon, and others) have also started to make their content freely accessible through APIs. Many others (see linkeddata.org) are also making their content directly accessible in RDF and in a linked manner [3]. We are also witnessing the launch of RDFa, which allows people to access and consume HTML pages as structured data sources.

This trend of structured and linked data is shifting the focus of web technologies towards new paradigms of structured-data retrieval. Traditional search engines cannot serve such data, because their core design is based on keyword search over unstructured data. For example, imagine what the results would be when using Google to search a database of job vacancies for, say, "well-paid research-oriented job in Europe". The results will not be precise or clean, because the query itself is still ambiguous, although the underlying data is structured. People demand not only to retrieve job links, but also to know the starting date, salary, and location, and perhaps to render the results on a map.

Web 2.0 mashups are a first step in this direction. A mashup is a web application that consumes data originating from third parties and retrieved via APIs. For example, one can build a mashup that retrieves only well-paid vacancies from Google Base and mixes them with similar vacancies from LinkedIn. The problem is that building mashups is an art that is limited to skilled programmers. Although some mashup editors have been proposed by the Web 2.0 community to simplify this art (such as Google Mashups, Microsoft's Popfly, IBM's sMash, and Yahoo Pipes), what can be achieved by these editors is limited. They focus only on providing encapsulated access to some APIs, and still require programming skills. In other words, these mashup methods are motivating for, rather than solving, the problem of structured-data retrieval. To exploit the massive amount of structured data to its full potential, people should be able to query and mash up this data easily and effectively.

Position: To build on the success of Web 2.0 mashups and overcome their limitations, we propose to regard the web as a database, where each data source is seen as a table, and a mashup is seen as a query over one or multiple sources. In other words, instead of developing a mashup as an application that accesses structured data through APIs, this art can be simplified by regarding a mashup as a query. For example, instead of developing a "program" to retrieve and fuse certain jobs from Google Base and Jobs.ac.uk, this program should be seen as a query over two remote sources. Query formulation (i.e., mashup development or data fusion) should be fast and should not require any programming skills.

Challenges: Before a user formulates a query on a data source, she needs to know how the data is structured and what the labels of the data elements are, i.e., the schema. Web users cannot be expected to investigate "what is the schema" each time they search or filter structured information. This issue is particularly difficult in the case of RDF and linked data. RDF data may come without a schema or ontology, and if one exists, the schema is mixed up with the data. In addition, as RDF data is a graph, one has to manually navigate this graph in order to formulate a query about it. Imagine large and multiple linked data sources, with diverse content and vocabularies: how would you manage to understand the data structure, inter-relationships, namespaces, and the unwieldy labels of the data elements? In short, formulating queries in open environments, where data structures and vocabularies are unknown in advance, is a hard challenge, and may hamper the building of data mashups by non-IT people.

To allow people to query and mash up data sources intuitively, we propose a data mashup language, called MashQL. The main novelty of MashQL is that it allows non-IT-skilled people to query and explore one or multiple RDF sources without any prior knowledge about the schema, structure, vocabulary, or any technical details of these sources. To be more robust and cover most cases in practice, we do not even assume that a data source has an offline or online schema or ontology at all. In the background, MashQL queries are translated into and executed as SPARQL queries.

Paper organization: Before presenting MashQL, in the next section we overview the art of query formulation, which has been studied by different research communities. We present MashQL in Section 3, and in Section 4 we introduce the notion of query pipes. The implementation of MashQL and three use cases are presented in Sections 5 and 6, respectively. The coverage and limitations of MashQL and its future directions are discussed in Section 7.

2. RELATED WORK
Several approaches have been proposed by the database community to query structured data sources, such as query-by-example [23] and conceptual queries [4,6,17]. However, none of these approaches has been adopted by casual users, because they still assume knowledge about the relational or conceptual schema. Among them, we found that ConQuer [4] has some nice features, especially the tree structure of its queries, but it also assumes that one starts from the schema. In the natural language processing community, it has been proposed to let people write queries as natural language sentences, and then translate these sentences into a formal language (SQL [16] or XQuery [15]). However, these approaches are challenged by language ambiguity and the "free mapping" between sentences and data schemes.

This topic has started to receive high importance within the Semantic Web community. Several approaches (GRQL [1], iSPARQL [11], NITELIGHT [19], and RDFAuthor [18]) propose to represent triple patterns graphically as ellipses connected with arrows. However, these approaches assume advanced knowledge of RDF and SPARQL. Other approaches use visual scripting languages (e.g., SPARQLMotion [21] and DERI Pipes [22]), visualizing links between query modules; but a query module is merely a window containing a SPARQL script in textual form. These approaches are inspired by industrial mashup editors such as Popfly, sMash, and Yahoo Pipes. These industry editors provide a nice visualization of APIs' interfaces and some operators between them. However, when a user needs to express a query over structured data, she needs to use the formal language of that editor, such as YQL for Yahoo Pipes. MashQL visualizes links between query modules similarly to Yahoo Pipes and other mashup editors, but the main purpose of MashQL is to help people formulate what is inside these query modules.

Differently from the above Web 2.0 mashup editors, a more sophisticated editor, called MashMaker, has been proposed in [8]. It is a functional programming environment that allows one to mash up web content in a spreadsheet-style user interface. Like a spreadsheet, MashMaker stores every value that is computed in a single, central data structure. MashMaker is not comparable with MashQL, since it cannot serve as a query language on its own.

In XML databases, the Lore query language [9] has been proposed to allow people to query XML data graphically, without prior knowledge about the data. Lore assumes that data is represented as a graph, called OEM, which is close to RDF. The difference between Lore and MashQL is not only intuitiveness and expressivity; essentially, MashQL does not assume the data graph to have a certain schema, whereas Lore assumes that a data graph has a DataGuide, a computed summary of the data that plays the role of a schema.

More about query formulation scenarios, and about which scenario is more intuitive for the casual user, can be found in a recent usability study [14]. It concluded that a query language should be close to natural language and graphically intuitive, and that it should not assume knowledge about the data source.

3. THE MASHQL LANGUAGE
The main goal of MashQL is to allow people to mash up and fuse data sources easily. In the background, MashQL queries are automatically translated into and executed as SPARQL queries. Without prior knowledge about a data source, one can navigate this source and fuse it with another source easily. To allow people to build on each other's results, MashQL supports query pipes as a built-in concept. The example below shows two web data sources and a SPARQL query to retrieve "the book titles authored by Lara and published after 2007". The same query in MashQL is shown in Figure 2.

  http://Site1.com/RDF:
    :a1 :Title "Web 2.0"
    :a1 :Author "Hacker B."
    :a1 :Year 2007
    :a1 :Publisher "Springer"
    :a2 :Title "Web 3.0"
    :a2 :Author "Smith B."
  http://Site2.com/RDF:
    :4 :Title "Semantic Web"
    :4 :Author "Tom Lara"
    :4 :PubYear 2005
    :5 :Title "Web services"
    :5 :Author "Bob Hacker"

  Query:
    PREFIX S1: <http://Site1.com/RDF#>
    PREFIX S2: <http://Site2.com/RDF#>
    SELECT ?ArticleTitle
    FROM <http://Site1.com/RDF>
    FROM <http://Site2.com/RDF>
    WHERE {
      {{?X S1:Title ?ArticleTitle} UNION {?X S2:Title ?ArticleTitle}}
      {{?X S1:Author ?X1} UNION {?X S2:Author ?X1}}
      {{?X S1:Year ?X2} UNION {?X S2:PubYear ?X2}}
      FILTER regex(?X1, "^Hacker")
      FILTER (?X2 > 2000)}

  Results:
    ArticleTitle
    Web 2.0

Figure 1. An example of a SPARQL query.

The first module specifies the query input, and the second module specifies the query body. The output can be piped into a third module (not shown here), which renders the results into a certain format (such as HTML, XML, or CSV), or as RDF input to other queries. Notice that in this way one can easily build a query that fuses the content of two sources in a linked manner [3].

Figure 2. An example of a MashQL query.

The intuition of MashQL is as follows. Each query Q is seen as a tree. The root of this tree is called the query subject (e.g., Article), denoted Q(S); it is the subject matter being inquired about. Each branch of the tree is called a restriction R and is used to restrict a certain property of the query subject: Q(S) ≔ R1 AND … AND Rn. Branches can be expanded into subtrees (called query paths), which enable one to navigate the underlying dataset; in this case, the object in a restriction is considered the subject of its subquery. A projection symbol can be used before a variable to indicate that it will be returned in the results¹. While interacting with the editor, the editor queries the dataset in the background in order to generate the next list depending on the previous selections; in this way, people can navigate a graph without prior knowledge about it.

Similar to SPARQL, all restrictions in MashQL are considered necessary when evaluating a query. However, if a restriction is prefixed with "maybe", it is considered optional; and if it is prefixed with "without", it is considered unbound (see Figure 4). MashQL also supports union (denoted "\") between objects, predicates, subjects, and queries, as well as a type operator ("Any"), inverse predicates, datatype and language tags, and many object filters.

  PREFIX a: <…>
  PREFIX S1: <…>
  SELECT ?SongTitle, ?AlbumName
  FROM <…>
  WHERE {
    ?Song S1:Title ?SongTitle.
    {{?Song S1:Duration ?X1} UNION {?Song a:Length ?X1}} FILTER (?X1 > 3).
    {{?Song S1:Artist S1:Shakira} UNION {?Song S1:Artist S1:AxelleRed}}
    OPTIONAL {?Song S1:Album ?AlbumName}.
    OPTIONAL {?Song S1:Copyright ?X2}. FILTER (!bound(?X2)).}

Figure 4. A query involving optional and negative restrictions.
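The translation of "maybe" and "without" restrictions into OPTIONAL and unbound filters can be sketched in a few lines of code. The tree model below (a subject plus a list of restrictions, each marked "must", "maybe", or "without") follows the description above; the function name, variable names, and emitted SPARQL layout are illustrative only, not MashQL's actual implementation:

```python
# Sketch: serializing a MashQL-style restriction list to a SPARQL string.
# "maybe" becomes OPTIONAL; "without" becomes OPTIONAL plus FILTER(!bound).
def to_sparql(subject, restrictions):
    """restrictions: list of (mode, property, var), mode in {'must','maybe','without'}."""
    patterns, filters = [], []
    for mode, prop, var in restrictions:
        triple = f"?{subject} {prop} ?{var}."
        if mode == "must":
            patterns.append(triple)
        elif mode == "maybe":                    # optional restriction
            patterns.append(f"OPTIONAL{{{triple}}}")
        elif mode == "without":                  # unbound restriction
            patterns.append(f"OPTIONAL{{{triple}}}")
            filters.append(f"FILTER (!bound(?{var}))")
    body = " ".join(patterns + filters)
    return f"SELECT ?{subject} WHERE {{ {body} }}"

q = to_sparql("Song", [
    ("must", "s1:Title", "Title"),
    ("maybe", "s1:Album", "Album"),
    ("without", "s1:Copyright", "X1"),
])
```

The generated string mirrors the shape of Figure 4: a plain pattern for the necessary restriction, an OPTIONAL block for "maybe", and an OPTIONAL block plus an unbound filter for "without".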
As Figure 3 shows, a query can retrieve the title of every article published after 2005 and written by an author who has an address, where this address has a country called Cyprus.

  PREFIX S1: <…>
  SELECT ?ArticleTitle
  FROM <http://www.example.com>
  WHERE {
    ?X1 rdf:type :Article.
    ?X1 S1:Title ?ArticleTitle.
    ?X1 S1:Year ?X2. FILTER (?X2 > 2005).
    ?X1 S1:Author ?X3.
    ?X3 S1:Address ?X4.
    ?X4 S1:Country ?X5. FILTER regex(?X5, "Cyprus")}

Figure 3. A query involving paths, and its mapping into SPARQL.

Formulating queries in MashQL is designed to be an interactive process, by which the complexity of understanding data structures is moved to the query editor; users express their queries using only drop-down lists. The query subject is selected from a list generated dynamically from either (1) the set of subject types in the dataset, or (2) the union of all subject and object identifiers in the dataset; users can also choose to (3) introduce their own label, in which case the label is seen as a variable and displayed in italics. The default subject is the variable "Anything". To add a restriction, a list of properties (e.g., Title, Author) is generated, depending on the chosen subject. Users may then select a filter (e.g., Equals, Contains, Between, etc.), or select an object identifier from a list, which is in turn generated from the set of possible object identifiers, given the previous selections. Furthermore, users can choose to expand the tree to declare a query path.

¹ Some issues are too lengthy to illustrate here. For example, when a user moves the mouse over a restriction, it enters editing mode and all other restrictions enter verbalize mode (i.e., all boxes and lists are made invisible, and a verbalization of their content is generated and displayed instead). This is not only to make the readability of the queries closer to natural language, but also to allow users to validate whether what they did is what they intended. The editor also detects and normalizes namespaces: it finds similar URLs and hides them when necessary. For example, when two properties originating from different data sources have the same URL, their namespaces are found and hidden.

4. THE NOTION OF QUERY PIPES
Deploying MashQL in an open world faces some challenges. This section overviews these challenges (from a query formulation viewpoint) and introduces the notion of query pipes. As discussed earlier, one may create a mashup and redirect its output to another mashup. We call a chain of queries that connect to each other in this way a pipe. Allowing people to formulate query pipes is not merely a visualization of links between query modules: when compiling a pipe (i.e., translating it into SPARQL), several issues have to be considered.

First: Translating MashQL into SPARQL SELECT statements is not enough, because a SELECT statement produces the results in a tabular form. To allow queries to input each other (especially for producing linked data), the results of a query should instead be formed as a graph. In SPARQL, the CONSTRUCT statement produces a graph, but one then needs to manually specify how this graph should be produced. To overcome this, we propose the construct CONSTRUCT *. This is not part of the standard SPARQL, but it has also been proposed by others for inclusion in the next version of the standard [20].
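The difference between tabular and graph-shaped results can be illustrated with a toy evaluator. This is plain Python with invented predicate names, purely illustrative; it only shows why a pipe needs the second form:

```python
# Sketch: SELECT-style evaluation returns rows, which cannot feed another
# RDF query; returning the matched triples keeps the output consumable
# by the next query in a pipe.
data = [
    (":b1", ":Title", "Linked Data"),
    (":b1", ":Year", 2007),
    (":b2", ":Title", "Old Report"),
    (":b2", ":Year", 1999),
]

def select_titles(triples, after):
    """Tabular result: one row per matching title binding."""
    recent = {s for (s, p, o) in triples if p == ":Year" and o > after}
    return [(o,) for (s, p, o) in triples if p == ":Title" and s in recent]

def construct_titles(triples, after):
    """Graph result: the triples involved in the matched conditions."""
    recent = {s for (s, p, o) in triples if p == ":Year" and o > after}
    return [t for t in triples if t[0] in recent]

rows = select_titles(data, 2000)      # rows of values, a dead end for piping
graph = construct_titles(data, 2000)  # triples, valid input to a next query
```

The `graph` value is itself a set of triples about `:b1`, so a downstream query can consume it exactly like a base source, which is the intent behind CONSTRUCT *.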
In MashQL, CONSTRUCT * means: retrieve all the triples that are involved in the query conditions and satisfy them. For example, suppose the query in Figure 2 is piped into another query; its CONSTRUCT * translation will retrieve {<:b1 :Title "Linked Data">, <:b1 :Author "Lara T.">, <:b1 :Year 2007>}. When compiling a pipe of queries, if the output of a query is directed as input to another query, a CONSTRUCT * statement is generated; otherwise, a SELECT statement is generated.

Second: When executing a SPARQL query, all query engines assume that the queried data is stored locally; otherwise, the data must be downloaded and stored at the engine side before execution starts. The time complexity of executing a query on local data is usually small²; the bottleneck is rather the downloading time. When the input of a query is the output of another query (i.e., in the case of query pipes), the problem becomes even more difficult, as queries will be calling each other. Furthermore, it is also possible that users (intentionally or by mistake) end up with query loops (e.g., Q1→Q2→Q3→Q1), which may cause computational overheads.
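Detecting such loops amounts to a cycle check over the query-dependency graph. The helper below is a hypothetical sketch of such a check, not MashQL's code; an editor could run it before accepting a new query input:

```python
# Sketch: reject pipes that close a loop (e.g. Q1 -> Q2 -> Q3 -> Q1).
def has_cycle(inputs):
    """inputs: query -> list of queries it reads from (base sources omitted)."""
    visiting, done = set(), set()
    def visit(q):
        if q in done:
            return False
        if q in visiting:
            return True              # back-edge: a loop was closed
        visiting.add(q)
        cyclic = any(visit(d) for d in inputs.get(q, []))
        visiting.discard(q)
        done.add(q)
        return cyclic
    return any(visit(q) for q in inputs)

looping = has_cycle({"Q1": ["Q2"], "Q2": ["Q3"], "Q3": ["Q1"]})   # True
acyclic = has_cycle({"Q1": ["Q2"], "Q2": ["Q3"], "Q3": []})       # False
```

A pipe, as defined below, is required to be acyclic, so a check of this kind is enough to keep query chains well-founded.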
To face these challenges, MashQL allows users to materialize the results of their queries/pipes and to decide their refreshing strategies, as follows. The results of a query (called a derived source) are stored physically and deployed as a concrete RDF source. Primal input sources (called base sources) are also cached for performance purposes. Given a query Q over a set of base or derived sources {D1,..,Dm}, the result of this query is denoted D = Q(D1,..,Dm), with D ∉ {D1,..,Dm}. We define a pipe as an acyclic chain of queries, where the result of a query is an input to the next. The chain of queries that derives D is denoted as the pipe P(D).

We call the problem of keeping a pipe up to date the pipe's consistency. Let D be the result of a query Q(D1,..,Dm), and T the latest time at which the set {D1,..,Dm} has changed. Then D is consistent at T if D = Q(D1,..,Dm). To maintain pipe consistency, two updating strategies are used: query auto-refresh and pipe auto-refresh. MashQL maintains, for each base or derived source D, a timestamp of its last update (RDT) and an auto-refresh time interval (RDA); and, for each query Q, a timestamp of its previous successful execution (RQT) and an auto-refresh interval (RQA).

Query auto-refresh: Each query is automatically re-executed if its auto-refresh interval has expired and one of its inputs has been updated. Formally, let Qi be a query over a set of sources {D1,..,Dm}, and T a given time. Qi is re-executed if (RQiT + RQiA) ≤ T and (RQiT < RDjT) for some 1 ≤ j ≤ m.

Pipe auto-refresh: Each pipe P(D) is automatically refreshed if RDA expires, which implies re-executing the chain of queries in this pipe. Let P(D) be a pipe with D = Qn(D1,..,Dm), and T a given time. If (RDT + RDA) ≤ T, then each ith query Qi in P(D) is executed if (RQiT < RDjT), where 1 ≤ j ≤ m for Qi, and 1 ≤ i ≤ n. Queries in P(D) are executed from the bottommost to the topmost, or recursively as P(P(D1),…,P(Dm)).

As argued in the data warehousing literature [2,24], an efficient refreshing strategy is incremental updating, which suggests that if a base source receives new transactions, only these transactions are transformed and the affected queries refreshed. This strategy is still an open research issue for RDF in an open world [7], because RDF data and queries are developed and maintained autonomously by different people.

5. IMPLEMENTATION
First: We have developed an online mashup editor, which will be publicly available next month. Similar to creating feed mashups in Yahoo Pipes, MashQL users can query and fuse data sources, and the output of their queries can be redirected as input to other queries/pipes. In the background, Oracle 11g is used for storing and querying RDF. When a user specifies a data source as input, it is bulk-loaded into Oracle's semantic technology tables. MashQL queries are also translated into Oracle's SPARQL. While a user interacts with the editor to formulate a query, the editor performs background queries through AJAX. Each published query is given a URL; calling this URL means executing the query and getting its results back.

Second: We have also started to develop a Firefox add-on that allows people to develop mashups at the client side. The pages opened in the browser tabs are automatically selected as input sources, and a mashup can be created in a left-side panel; the results are rendered by the browser in a new tab. The idea is to allow web pages that embed RDF triples (as RDFa or microformats) to be queried and mashed up. For example, one will be able to compose his publication list from Google Scholar, DBLP, ACM, and CiteSeer; or filter all video lectures given by Berners-Lee from YouTube and VideoLectures. Because the mentioned web sites do not support RDFa yet, one can mine/distil the RDF triples using third-party services such as triplr.org, buzzword.org.uk, wandora.org, or Dapper.

² A query of medium complexity over a large dataset takes one or a few seconds [5].

6. USE CASES
In this section we present three hypothetical use cases to illustrate the use of MashQL for developing data mashups.

6.1 Use case: Retailer
Fnac is a large retailer of cultural and consumer electronics products. When a new product arrives at Fnac, it has to be entered into the inventory database. This is usually done by scanning the barcode on each product and then manually filling in the product specifications. Furthermore, as Fnac trades in many countries, product specifications have to be translated into several languages. To save the time of entering and translating this information manually, Fnac decided to reuse the product specifications (and their translations) that are produced at the factory side. For example, suppose Fnac received three packages from Canon, Alfred, and IMDB.

Figure 6. A mashup of product titles from different sources.
Fnac would like to scan the barcodes of the received products and then get their specifications directly from the online catalogues of those suppliers. Figure 5 shows samples of the online product catalogues of the three suppliers (we assume they are published in RDFa). Figure 6 illustrates a query that Fnac built to look up the multilingual titles of three products. This query is a mashup of three RDF data sources with a user input of three barcode numbers. The query takes each of these barcodes and finds the English and French titles. Notice that Fnac assumed that the short titles provided by Canon are in English; thus, they are joined with the other titles that are tagged with "@en". The retrieved results are shown in Figure 8. In the same way, a barcode reader could be connected to the user-input module, to retrieve the specifications (which could be stored at the supplier side) each time a product is scanned.

  http://www.cannon/products/rdf:
    _:P1 :ShortName "CanScan 4400F"
    _:P1 :FullName "Canon CanoScan 4400F Color Image Scanner"
    _:P1 :Producer "Canon"
    _:P1 :ShippingWeight "4 pounds"
    _:P1 :Barcode 9780133557022
    _:P2 :ShortName "PowerShot SD100"
    _:P2 :FullName "Canon PowerShot SD1000 7.1MP Camera 3x Zoom"
    _:P2 :Producer "Canon"
    _:P2 :ShippingWeight "2 pounds"
    _:P2 :Barcode 9781143557532
  http://www.alfred.com/books:
    <:B1> :Type <:Book>
    <:B1> :Title "The Prophet"@en
    <:B1> :Title "Le prophète"@fr
    <:B1> :BCode 8765422097653
    <:B1> :Authors "Kahlil Gibran"
    <:B1> :ISBN-10 0394404289
    <:B3> :Type <:Book>
    <:B3> :Title "Alfred Nobel"@en
    <:B3> :Title "Alfred Nobel"@fr
    <:B3> :BCode 75639898123
    <:B3> :Authors "Kenne Fant"
    <:B3> :ISBN-10 0531123286
  http://www.imdb.com/movies:
    _:1 rdf:Type <:Movie>
    _:1 :Title "All about my mother"@en
    _:1 :Title "Tout sur ma mère"@fr
    _:1 :ProdCode 3248765355133
    _:1 :NumberOfDiscs 1
    _:2 rdf:Type <:Movie>
    _:2 :Title "Lords of the rings"@en
    _:2 :Title "Seigneur des anneaux"@fr
    _:2 :ProdCode 4852834058083
    _:2 :NumberOfDiscs 3

Figure 5. Samples of RDF data about products.

  PREFIX s1: <http://www.cannon/products/rdf#>
  PREFIX s2: <http://www.alfred.com/books#>
  PREFIX s3: <http://www.imdb.com/movies#>
  SELECT ?Barcode ?EnglishTitle ?FrenchTitle
  FROM <http://www.cannon/products/rdf>
  FROM <http://www.alfred.com/books>
  FROM <http://www.imdb.com/movies>
  WHERE {
    {{?x s1:Barcode ?Barcode} UNION {?x s2:BCode ?Barcode}
     UNION {?x s3:ProdCode ?Barcode}}
    FILTER (regex(?Barcode, "9781143557532") ||
            regex(?Barcode, "8765422097653") ||
            regex(?Barcode, "3248765355133")).
    {{OPTIONAL {?x s1:ShortName ?EnglishTitle}} UNION
     {{{OPTIONAL {?x s2:Title ?EnglishTitle}} UNION
       {OPTIONAL {?x s3:Title ?EnglishTitle}}}
      FILTER (lang(?EnglishTitle) = "en")}}
    {{{OPTIONAL {?x s2:Title ?FrenchTitle}} UNION
      {OPTIONAL {?x s3:Title ?FrenchTitle}}}
     FILTER (lang(?FrenchTitle) = "fr")}}

Figure 7. The SPARQL equivalent of Figure 6.

  Barcode        EnglishTitle         FrenchTitle
  9781143557532  PowerShot SD100
  8765422097653  The Prophet          Le prophète
  3248765355133  All about my mother  Tout sur ma mère

Figure 8. Retrieved product titles.

6.2 Use case: Citations List
Bob would like to compile the list of articles that cited his articles (excluding self-citations). He built a mashup using MashQL to mix his citations retrieved from both Google Scholar and CiteSeer, and then filter out the self-citations. First, he performed a keyword search ("Bob Hacker") on both Google Scholar and CiteSeer³. Figure 9 shows a sample of the extracted RDF triples. Bob's MashQL query is shown in Figure 10, and its SPARQL equivalent in Figure 11. In this query, Bob wrote: retrieve every article that has a title (call it CitingArticle) and has an author that does not contain "Bob Hacker" or "Hacker B.", and that cites another article that has a title (call it CitedArticle) and has an author that contains "Bob Hacker" or "Hacker B.". Figure 12 shows the result of this query.

³ Similar to the previous use case, we assume that both Google Scholar and CiteSeer render their search results in RDFa (i.e., the RDF triples are embedded in the HTML), as many companies have started to do nowadays. Alternatively, Bob can use a third-party service (e.g., triplify.org) to extract triples from the HTML pages.

  http://scholar.google.com/scholar?q=bob+Hacker:
    :Title "Prostate Cancer"
    :Author "Hacker B., Hacker A."
    :Title "Best and Worst Lifestyles"
    :Author "Bob Hacker"
    :Cites
    :Title "Protein Categories"
    :Author "Bob Smith"
    :Cites
    :Cites
    :Title "Cancer Vaccines"
    :Author "Alice Hacker"
    :Cites
  http://www.citeseer.com/search?s="Bob Hacker":
    _:1 :Title "Prostate Cancer"
    _:1 :Author "Hacker B., Hacker A."
    _:2 :Title "Protocols in Molecular Biology"
    _:2 :Author "Bob Hacker"
    _:2 :ArticleCited _:1
    _:3 :Title "Cancer Vaccines"
    _:3 :Author "Eve Lee, Bob Hacker"
    _:4 :Title "Overview about Systems Biology"
    _:4 :Author "Tom Lara"
    _:4 :ArticleCited _:1
    _:4 :ArticleCited _:2

Figure 9. Sample of RDF data about Bob's articles.

Figure 10. A mashup of citations from different sites.

6.3 Use case: Job Seeking
Bob has a PhD in bioinformatics. He is looking for a full-time, well-paid, research-oriented job in certain European countries. He spent an enormous amount of time searching different job portals, trying many keywords and filters each time. Instead, Bob used MashQL to find the job that meets his specific preferences. Figure 13 shows Bob's queries on Google Base and on Jobs.ac.uk. First, he visited Google Base and performed a keyword search (bioinformatics OR "computational biology" OR "systems biology" OR e-health); he copied the link of the retrieved results (which are rendered in RDFa) into the RDFInput module, and then created a MashQL query on these results. He performed a similar task to query Jobs.ac.uk. The third MashQL module in Figure 13 mixes the results of the above two queries and filters them based on location preferences (provided in the UserInput module). The SPARQL equivalent of Bob's MashQL query is shown in Figure 14.
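Under the simplifying assumption that the two portals' results have already been extracted into records, Bob's pipe boils down to two per-source filters and a location filter over their union. The sketch below uses invented field names and sample records; it only mirrors the shape of the pipe, not the RDF machinery:

```python
# Sketch of the Section 6.3 mashup: filter each portal's results, merge,
# then filter the union by location. Records and field names are invented.
google_base = [
    {"title": "Research Fellow", "industry": "HealthCare", "type": "Full-Time",
     "currency": "Euro", "salary": 80000, "location": "Belgium"},
    {"title": "Sales Rep", "industry": "Retail", "type": "Full-Time",
     "currency": "Euro", "salary": 90000, "location": "UK"},
]
jobs_ac_uk = [
    {"title": "Lecturer", "category": "BioSciences", "role": "Research\\Academic",
     "currency": "UKP", "salary": 52000, "location": "UK"},
]

# First module: Google Base preferences (industry, type, salary range).
q1 = [j for j in google_base
      if j["industry"] in ("Education", "HealthCare")
      and j["type"] in ("Full-Time", "Fulltime", "Contract")
      and j["currency"] == "Euro" and 75000 <= j["salary"] <= 120000]

# Second module: Jobs.ac.uk preferences (category, role, minimum salary).
q2 = [j for j in jobs_ac_uk
      if j["category"] in ("Health", "BioSciences")
      and j["role"] == "Research\\Academic"
      and j["currency"] == "UKP" and j["salary"] > 50000]

# Third module: location filter over the merged results.
preferred = ("UK", "Belgium", "Germany", "Austria", "Holland")
jobs = [j["title"] for j in q1 + q2 if j["location"] in preferred]
```

The three list comprehensions correspond to the three MashQL modules of Figure 13; in the real pipe, the first two would be CONSTRUCT * queries whose graphs feed the final SELECT.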
  PREFIX s1: <http://scholar.google.com/scholar?q=bob+Hacker>
  PREFIX s2: <http://www.citeseer.com/search?s="Bob Hacker">
  SELECT ?CitingArticle ?CitedArticle
  FROM <http://scholar.google.com/scholar?q=bob+Hacker>
  FROM <http://www.citeseer.com/search?s="Bob Hacker">
  WHERE {
    {{?X1 s1:Title ?CitingArticle} UNION {?X1 s2:Title ?CitingArticle}}
    {{?X1 s1:Author ?X2} UNION {?X1 s2:Author ?X2}}
    {{?X1 s1:Cites ?X3} UNION {?X1 s2:ArticleCited ?X3}}
    {{?X3 s1:Title ?CitedArticle} UNION {?X3 s2:Title ?CitedArticle}}
    {{?X3 s1:Author ?X4} UNION {?X3 s2:Author ?X4}}
    FILTER (regex(?X4, "^Bob Hacker") || regex(?X4, "^Hacker B."))
    FILTER (!(regex(?X2, "^Bob Hacker") || regex(?X2, "^Hacker B.")))}

Figure 11. The SPARQL equivalent of Figure 10.

  CitingArticle                   CitedArticle
  Protein Categories              Prostate Cancer
  Protein Categories              Best and Worst Lifestyles
  Cancer Vaccines                 Prostate Cancer
  Overview about Systems Biology  Prostate Cancer
  Overview about Systems Biology  Protocols in Molecular Biology

Figure 12. The query results.

Figure 13. Bob's mashup of jobs.

  CONSTRUCT *
  WHERE {
    ?Job :JobIndustry ?X1;
         :Type ?X2;
         :Currency ?X3;
         :Salary ?X4.
    FILTER (?X1 = "Education" || ?X1 = "HealthCare")
    FILTER (?X2 = "Full-Time" || ?X2 = "Fulltime" || ?X2 = "Contract")
    FILTER (?X3 = "^Euro" || ?X3 = "^€")
    FILTER (?X4 >= 75000 && ?X4 <= 120000)}

  CONSTRUCT *
  WHERE {
    ?Job :Category ?X1;
         :Role ?X2;
         :SalaryCurrency ?X3;
         :SalaryLower ?X4.
    FILTER (?X1 = "Health" || ?X1 = "BioSciences")
    FILTER (?X2 = "Research\Academic")
    FILTER (?X3 = "UKP")
    FILTER (?X4 > 50000)}

  SELECT ?Job
  WHERE {
    ?Job :Location ?X1
    FILTER (?X1 = "^UK" || ?X1 = "^Belgium" || ?X1 = "^Germany" ||
            ?X1 = "^Austria" || ?X1 = "^Holland")}

Figure 14. The SPARQL equivalent of Figure 13.

7. DISCUSSION AND FUTURE DIRECTIONS
This article proposed a language that allows people to query and mash up structured data without any prior knowledge about the schema, structure, vocabulary, or technical details of this data. Not only can non-IT experts use MashQL; professionals can also use it to build advanced queries.

MashQL supports all constructs of the W3C standard SPARQL, except the NAMED GRAPH construct, which is introduced for advanced use, i.e., switching between different graphs within the same query. To be close to users' needs and intuition, we defined new constructs (e.g., OneOf, union "\", Without, Any, reverse "~", and others). These constructs are not directly supported in SPARQL, but they are emulated. We plan to include aggregation and grouping functions, especially as they are supported by Oracle's SPARQL.

Yet, MashQL does not support inferencing constructs (such as SubClass or SubProperty), which are indeed useful for data fusion. As these constructs are expensive to compute (and would thus hurt the interactivity of MashQL), we plan to replace the Oracle semantic technology that we currently use as an RDF store with an RDF index that we are developing, for speedy OWL inferencing.

We have downloaded most of the public RDF sources, on which our MashQL editor will be deployed online next month. Not only will people benefit from this, but we will also have the opportunity to better evaluate the usability of MashQL and its contribution to linking and fusing more data bottom-up.

Acknowledgement
We are indebted to Dr. George Pallis, Dr. Demetris Zeinalipour, and other colleagues for their valuable comments and feedback on early drafts of this paper. This research is partially supported by the SEARCHiN project (FP6-042467, Marie Curie Actions).

REFERENCES
[1] Athanasis N, Christophides V, Kotzinos D: Generating On the Fly Queries for the Semantic Web. ISWC (2004)
[2] Abiteboul S, Duschka O: Complexity of Answering Queries Using Materialized Views. ACM SIGACT-SIGMOD-SIGART (1998)
[3] Bizer C, Heath T, Berners-Lee T: Linked Data: Principles and State of the Art. WWW (2008)
[4] Bloesch A, Halpin T: Conceptual Queries using ConQuer-II. (1997)
[5] Chong E, Das S, Eadon G, Srinivasan J: An Efficient SQL-based RDF Querying Scheme. VLDB (2005)
[6] Czejdo B, Elmasri R, Rusinkiewicz M, Embley D: An Algebraic Language for Graphical Query Formulation Using an EER Model. ACM Computer Science Conference (1987)
[7] Deng Y, Hung E, Subrahmanian VS: Maintaining RDF Views. Tech. Rep. CS-TR-4612, University of Maryland (2004)
[8] Ennals R, Garofalakis M: MashMaker: Mashups for the Masses. SIGMOD (2007)
[9] Goldman R, Widom J: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB (1997)
[10] Hofstede A, Proper H, Weide T: Computer Supported Query Formulation in an Evolving Context. Australasian DB Conference (1995)
[11] http://demo.openlinksw.com/isparql (Feb. 2009)
[12] Jarrar M, Dikaiakos M: MashQL: A Query-by-Diagram Topping SPARQL. Proceedings of the ONISW'08 workshop (2008)
[13] Jarrar M, Dikaiakos M: A Query-by-Diagram Language (MashQL). Technical Article TAR200805, University of Cyprus (2008). http://www.cs.ucy.ac.cy/~mjarrar/JD08.pdf
[14] Kaufmann E, Bernstein A: How Useful Are Natural Language Interfaces to the Semantic Web for Casual End-Users? ISWC (2007)
[15] Li Y, Yang H, Jagadish H: NaLIX: An Interactive Natural Language Interface for Querying XML. SIGMOD (2005)
[16] Popescu A, Etzioni O, Kautz H: Towards a Theory of Natural Language Interfaces to Databases. 8th Conference on Intelligent User Interfaces (2003)
[17] Parent C, Spaccapietra S: About Complex Entities, Complex Objects and Object-Oriented Data Models. Information System Concepts (1989)
[18] http://rdfweb.org/people/damian/RDFAuthor (Jan. 2009)
[19] Russell A, Smart R, Braines D, Shadbolt N: NITELIGHT: A Graphical Tool for Semantic Query Construction. Semantic Web User Interaction Workshop (2008)
[20] http://esw.w3.org/topic/SPARQL/Extensions (Feb. 2009)
[21] http://www.topquadrant.com/sparqlmotion (Feb. 2009)
[22] Tummarello G, Polleres A, Morbidoni C: Who the FOAF Knows Alice? A Needed Step Toward Semantic Web Pipes. ISWC Workshops (2007)
[23] Zloof M: Query-by-Example: A Data Base Language. IBM Systems Journal, 16(4) (1977)
[24] Zhuge Y, Garcia-Molina H, Hammer J, Widom J: View Maintenance in a Warehousing Environment. SIGMOD (1995)