=Paper=
{{Paper
|id=Vol-538/paper-14
|storemode=property
|title=A Data Mashup Language for the Data Web
|pdfUrl=https://ceur-ws.org/Vol-538/ldow2009_paper14.pdf
|volume=Vol-538
|dblpUrl=https://dblp.org/rec/conf/www/JarrarD09
}}
==A Data Mashup Language for the Data Web==

Mustafa Jarrar
University of Cyprus
mjarrar@cs.ucy.ac.cy

Marios D. Dikaiakos
University of Cyprus
mdd@cs.ucy.ac.cy
ABSTRACT

This paper is motivated by the massively increasing structured data on the Web (the Data Web), and the need for novel methods to exploit these data to their full potential. Building on the remarkable success of Web 2.0 mashups, this paper regards the internet as a database, where each web data source is seen as a table, and a mashup is seen as a query over these sources. We propose a data mashup language, which allows people to intuitively query and mash up structured and linked data on the web. Unlike existing query methods, the novelty of MashQL is that it allows people to navigate, query, and mash up a data source (or several) without any prior knowledge about its schema, vocabulary, or technical details. We do not even assume that a data source has an online or inline schema. Furthermore, MashQL supports query pipes as a built-in concept, rather than only a visualization of links between modules.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright is held by the author/owner(s).
LDOW2009, April 20, 2009, Madrid, Spain.

1. INTRODUCTION AND MOTIVATION

In this short article we propose a data mashup approach in a graphical, Yahoo Pipes style. This research is still work in progress; please refer to [13] for the latest findings.

In parallel to the continuous development of the hypertext web, we are witnessing a rapid emergence of the Data Web. Not only is the amount of social metadata increasing, but many companies (e.g., Google Base, Upcoming, Flickr, eBay, Amazon, and others) have started to make their content freely accessible through APIs. Many others (see linkeddata.org) are also making their content directly accessible in RDF and in a linked manner [3]. We are also witnessing the launch of RDFa, which allows people to access and consume HTML pages as structured data sources.

This trend of structured and linked data is shifting the focus of web technologies towards new paradigms of structured-data retrieval. Traditional search engines cannot serve such data, because their core design is based on keyword search over unstructured data. For example, imagine the results of using Google to search a database of job vacancies for, say, a "well-paid research-oriented job in Europe". The results will not be precise or clean, because the query itself is still ambiguous, even though the underlying data is structured. People demand not only to retrieve job links, but also to know the starting date, salary, and location, and perhaps to render the results on a map.

Web 2.0 mashups are a first step in this direction. A mashup is a web application that consumes data originating from third parties and retrieved via APIs. For example, one can build a mashup that retrieves only well-paid vacancies from Google Base and mixes them with similar vacancies from LinkedIn. The problem is that building mashups is an art limited to skilled programmers. Although some mashup editors have been proposed by the Web 2.0 community to simplify this art (such as Google Mashups, Microsoft's Popfly, IBM's sMash, and Yahoo Pipes), what can be achieved with these editors is limited. They only focus on providing encapsulated access to some APIs, and they still require programming skills. In other words, these mashup methods motivate, rather than solve, the problem of structured-data retrieval. To expose the massive amount of structured data to its full potential, people should be able to query and mash up these data easily and effectively.

Position: To build on the success of Web 2.0 mashups and overcome their limitations, we propose to regard the web as a database, where each data source is seen as a table and a mashup is seen as a query over one or multiple sources. In other words, instead of developing a mashup as an application that accesses structured data through APIs, this art can be simplified by regarding a mashup as a query. For example, instead of developing a "program" to retrieve and fuse certain jobs from Google Base and Jobs.ac.uk, this program should be seen as a data query over two remote sources. Query formulation (i.e., mashup development or data fusion) should be fast and should not require any programming skills.

Challenges: Before a user formulates a query on a data source, she needs to know how the data is structured and what the labels of the data elements are, i.e., the schema. Web users cannot be expected to investigate "what is the schema" each time they search or filter structured information. This issue is particularly difficult in the case of RDF and linked data. RDF data may come without a schema or ontology, and if one exists, the schema is mixed up with the data. In addition, as RDF data is a graph, one has to manually navigate this graph in order to formulate a query about it. Imagine large and multiple linked data sources, with diverse content and vocabularies: how would one manage to
understand the data structure, the inter-relationships, the namespaces, and the unwieldy labels of the data elements? In short, formulating queries in open environments, where data structures and vocabularies are unknown in advance, is a hard challenge, and may hamper the building of data mashups by non-IT people.

To allow people to query and mash up data sources intuitively, we propose a data mashup language, called MashQL. The main novelty of MashQL is that it allows non-IT-skilled people to query and explore one or multiple RDF sources without any prior knowledge about the schema, structure, vocabulary, or any technical details of these sources. To be more robust and cover most cases in practice, we do not even assume that a data source has an offline or online schema or ontology at all. In the background, MashQL queries are translated into and executed as SPARQL queries.

Paper organization: Before presenting MashQL, in the next section we overview the art of query formulation, which has been studied by different research communities. We present MashQL in Section 3, and in Section 4 we introduce the notion of query pipes. The implementation of MashQL and three use cases are presented in Sections 5 and 6, respectively. The coverage and limitations of MashQL and its future directions are discussed in Section 7.

2. RELATED WORK

Several approaches have been proposed by the database community to query structured data sources, such as query-by-example [23] and conceptual queries [4,6,17]. However, none of these approaches has been adopted by casual users, because they still assume knowledge of the relational or conceptual schema. Among them, ConQuer [4] has some nice features, especially the tree structure of its queries, but it also assumes that one starts from the schema. In the natural language processing community, it has been proposed to allow people to write queries as natural language sentences, and then to translate these sentences into a formal language (SQL [15] or XQuery [16]). However, these approaches are challenged by language ambiguity and the "free mapping" between sentences and data schemes.

This topic has started to receive high importance within the Semantic Web community. Several approaches (GRQL [1], iSPARQL [11], NITELIGHT [19], and RDFAuthor [18]) propose to represent triple patterns graphically as ellipses connected with arrows. However, these approaches assume advanced knowledge of RDF and SPARQL. Other approaches use visual scripting languages (e.g., SPARQLMotion [21] and DERI Pipes [22]), visualizing the links between query modules; but a query module is merely a window containing a SPARQL script in textual form. These approaches are inspired by industrial mashup editors such as Popfly, sMash, and Yahoo Pipes. These industry editors provide a nice visualization of APIs' interfaces and some operators between them. However, when a user needs to express a query over structured data, she needs to use the formal language of that editor, such as YQL for Yahoo Pipes. Although MashQL visualizes links between query modules, similar to Yahoo Pipes and other mashup editors, the main purpose of MashQL is to help people formulate what is inside these query modules.

Differently from the above Web 2.0 mashup editors, a more sophisticated editor, called MashMaker, has been proposed in [8]. It is a functional programming environment that allows one to mash up web content in a spreadsheet-style user interface. Like a spreadsheet, MashMaker stores every value that is computed in a single, central data structure. MashMaker is not comparable with MashQL, since it cannot serve as a query language on its own.

In XML databases, the Lore query language [9] has been proposed to allow people to query XML data graphically, without prior knowledge about the data. Lore assumes that data is represented as a graph, called OEM, which is close to RDF. The difference between Lore and MashQL lies not only in intuitiveness and expressivity; essentially, MashQL does not assume the data graph to have a certain schema, whereas Lore assumes that a data graph has a DataGuide, a computed summary of the data that plays the role of a schema.

More about query formulation scenarios (and which scenario is more intuitive for the casual user) can be found in a recent usability study [14]. It concluded that a query language should be close to natural language and graphically intuitive, and that it should not assume knowledge about the data source.

3. THE MASHQL LANGUAGE

The main goal of MashQL is to allow people to mash up and fuse data sources easily. In the background, MashQL queries are automatically translated into and executed as SPARQL queries. Without prior knowledge about a data source, one can navigate this source and fuse it with another source easily. To allow people to build on each other's results, MashQL supports query pipes as a built-in concept. The example below shows two web data sources and a SPARQL query to retrieve "the book titles authored by Lara and published after 2007". The same query in MashQL is shown in Figure 2. The first module specifies the query input, and the second module specifies the query body. The output can be piped into a third module (not shown here), which renders the results into a certain format (such as HTML, XML, or CSV), or as RDF input to other queries. Notice that, in this way, one can easily build a query to fuse the content of two sources in a linked manner [3].

http://Site1.com/RDF
  :a1 :Title "Web 2.0"
  :a1 :Author "Hacker B."
  :a1 :Year 2007
  :a1 :Publisher "Springer"
  :a2 :Title "Web 3.0"
  :a2 :Author "Smith B."

http://Site2.com/RDF
  :4 :Title "Semantic Web"
  :4 :Author "Tom Lara"
  :4 :PubYear 2005
  :5 :Title "Web services"
  :5 :Author "Bob Hacker"

Query:
  PREFIX S1:
  PREFIX S2:
  SELECT ?ArticleTitle
  FROM
  FROM
  WHERE {
    {{?X S1:Title ?ArticleTitle} UNION {?X S2:Title ?ArticleTitle}}
    {?X S1:Author ?X1} UNION {?X S2:Author ?X1}
    {?X S1:PubYear ?X2} UNION {?X S2:Year ?X2}
    FILTER regex(?X1, "^Hacker")
    FILTER (?X2 > 2000)}

Results:
  ArticleTitle
  Web 2.0

Figure 1. An example of a SPARQL query.
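For readers who want to trace Figure 1 by hand, the query's semantics over the two sample sources can be emulated with a small Python sketch (illustrative only; MashQL itself compiles to SPARQL, and the helper names below are ours, not part of MashQL):

```python
import re

# Triples from the two sample sources of Figure 1, as (subject, predicate, object).
SITE1 = [
    (":a1", "Title", "Web 2.0"), (":a1", "Author", "Hacker B."),
    (":a1", "Year", 2007), (":a1", "Publisher", "Springer"),
    (":a2", "Title", "Web 3.0"), (":a2", "Author", "Smith B."),
]
SITE2 = [
    (":4", "Title", "Semantic Web"), (":4", "Author", "Tom Lara"),
    (":4", "PubYear", 2005),
    (":5", "Title", "Web services"), (":5", "Author", "Bob Hacker"),
]

def objects(triples, subj, *preds):
    """Objects of any of the given predicates for subj (emulates a UNION of patterns)."""
    return [o for s, p, o in triples if s == subj and p in preds]

def run_query():
    data = SITE1 + SITE2
    titles = []
    for subj in sorted({s for s, _, _ in data}):
        ts = objects(data, subj, "Title")               # S1:Title UNION S2:Title
        authors = objects(data, subj, "Author")         # S1:Author UNION S2:Author
        years = objects(data, subj, "PubYear", "Year")  # S1:PubYear UNION S2:Year
        # FILTER regex(?X1, "^Hacker") and FILTER (?X2 > 2000)
        if any(re.match("^Hacker", a) for a in authors) and any(y > 2000 for y in years):
            titles.extend(ts)
    return titles

print(run_query())  # ['Web 2.0']
```

Only :a1 satisfies both filters (:a2 and :5 have no year, and :4's author does not start with "Hacker"), so only "Web 2.0" is returned, matching the Results table of Figure 1.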
Figure 2. An example of a MashQL query.

The intuition of MashQL is as follows. Each query Q is seen as a tree. The root of this tree is called the query subject (e.g., Article), denoted Q(S), which is the subject matter being inquired. Each branch of the tree is called a restriction R and is used to restrict a certain property of the query subject: Q(S) ⇔ R1 AND … AND Rn. Branches can be expanded into sub-trees (called query paths), which enable one to navigate the underlying dataset; in this case, the object in a restriction is considered the subject of its sub-query. As Figure 3 shows, the query retrieves the title of every article published after 2005 and written by an author who has an address, where this address has a country called Cyprus.

  PREFIX S1:
  SELECT ?ArticleTitle
  FROM <http://www.example.com>
  WHERE { ?X1 rdf:type :Article.
    ?X1 S1:Title ?ArticleTitle.
    ?X1 S1:Year ?X2.
    FILTER (?X2 > 2005).
    ?X1 S1:Author ?X3.
    ?X3 S1:Address ?X4.
    ?X4 S1:Country ?X5.
    FILTER regex(?X5, "Cyprus")}

Figure 3. A query involving paths, and its mapping into SPARQL.

Formulating queries in MashQL is designed to be an interactive process, by which the complexity of understanding data structures is moved to the query editor; users only use drop-down lists to express their queries. The query subject is selected from a list generated dynamically from either (1) the set of subject-types in the dataset, or (2) the union of all subject and object identifiers in the dataset; users can also choose to (3) introduce their own label, in which case the label is seen as a variable and displayed in italics. The default subject is the variable "Anything". To add a restriction, the list of properties (e.g., Title, Author) is generated depending on the chosen subject. Users may then select a filter (e.g., Equals, Contains, Between, etc.), or select an object identifier from a list, which is in turn generated from the set of possible object identifiers, depending on the previous selections. Furthermore, users can expand the tree to declare a query path. A projection symbol can be placed before a variable to indicate that it will be returned in the results (1). In short, while the user interacts with the editor, the editor queries the dataset in the background in order to generate the next list depending on the previous selections. In this way, people can navigate a graph without prior knowledge about it.

(1) Some issues are too lengthy to illustrate here. For example, when a user moves the mouse over a restriction, it enters editing mode and all other restrictions enter verbalize mode (i.e., all boxes and lists are made invisible, but a verbalization of their content is generated and displayed instead). This is done not only to make the readability of queries closer to natural language, but also to allow users to validate that what they did is what they intended. The editor also detects and normalizes namespaces: it finds similar URLs and hides them when necessary. For example, when two properties originating from different data sources have the same URL, their namespaces are found and hidden.

Similar to SPARQL, all restrictions in MashQL are considered necessary when evaluating a query. However, if a restriction is prefixed with "maybe", it is considered optional; and if it is prefixed with "without", it is considered unbound (see Figure 4). MashQL also supports union (denoted "\") between objects, predicates, subjects, and queries, as well as a type operator ("Any"), inverse predicates, datatype and language tags, and multiple object filters.

  PREFIX a:
  PREFIX S1:
  SELECT ?SongTitle, ?AlbumName
  FROM
  WHERE { ?Song S1:Title ?SongTitle.
    {{?Song S1:Duration ?X1} UNION {?Song a:Length ?X1}}
    FILTER (?X1 > 3).
    {{?Song S1:Artist S1:Shakira} UNION {?Song S1:Artist S1:AxelleRed}}
    OPTIONAL {?Song S1:Album ?AlbumName}.
    OPTIONAL {?Song S1:Copyright ?X2}.
    FILTER (!Bound(?X2)).}

Figure 4. A query involving optional and negative restrictions.

4. THE NOTION OF QUERY PIPES

Deploying MashQL in an open world faces several challenges. This section overviews these challenges (from a query-formulation viewpoint) and introduces the notion of query pipes.

As discussed earlier, one may create a mashup and redirect its output to another mashup. We call a chain of queries that connect to each other in this way a pipe. Allowing people to formulate query pipes is not merely a matter of visualizing links between query modules; when compiling a pipe (i.e., translating it into SPARQL), several issues must be considered.

First: Translating MashQL into SPARQL SELECT statements is not enough, because the SELECT statement produces the results
in a tabular form. To allow queries to input each other (especially for producing linked data), the results of a query should be formed as a graph. In SPARQL, the CONSTRUCT statement produces a graph, but one then needs to manually specify how this graph should be produced. To overcome this, we propose the construct CONSTRUCT *. It is not part of standard SPARQL, but it has also been proposed by others for inclusion in the next version of the standard [20]. In MashQL, CONSTRUCT * retrieves all triples that are involved in the query conditions and satisfy them. For example, suppose the query in Figure 2 is piped into another query; its CONSTRUCT * translation will retrieve {<:b1 :Title "Linked Data">, <:b1 :Author "Lara T.">, <:b1 :Year 2007>}. When compiling a pipe of queries, if the output of a query is directed as input to another query, a CONSTRUCT * statement is generated; otherwise, a SELECT statement is generated.

Second: When executing a SPARQL query, all query engines assume that the queried data is stored locally; otherwise, the data must be downloaded and stored at the engine side before execution starts. The time of executing a query on local data is usually short (2); the bottleneck is the downloading time. In case the input of a query is the output of another query (i.e., in query pipes), the problem becomes even more difficult, as queries call each other. Furthermore, it is also possible that users (intentionally or by mistake) end up with query loops (e.g., Q1→Q2→Q3→Q1), which may cause computational overheads. To face this challenge, MashQL allows users to materialize the results of their queries/pipes and to decide their refreshing strategies, as follows:

The results of a query (called a derived source) are stored physically and deployed as a concrete RDF source. Primal input sources (called base sources) are also cached for performance purposes. Given a query Q over a set of base or derived sources {D1,..,Dm}, the result of this query is denoted D = Q(D1,..,Dm), with D ∉ {D1,..,Dm}. We define a pipe as an acyclic chain of queries, where the result of a query is an input to the next. The chain of queries that derives D is denoted as the pipe P(D).

We call the problem of keeping a pipe up to date the pipe consistency problem. Let D be the result of a query Q(D1,..,Dm), and T the latest time at which the set {D1,..,Dm} changed. Then D is consistent at T if D = Q(D1,..,Dm). To maintain pipe consistency, two updating strategies are used: query auto-refresh and pipe auto-refresh. MashQL maintains, for each base or derived source D, a timestamp of its last update RDT and an auto-refresh time interval RDA; and, for each query Q, a timestamp of its previous successful execution RQT and an auto-refresh interval RQA.

Query auto-refresh: Each query is automatically executed if its auto-refresh interval has expired and one of its inputs has been updated. Let Qi be a query over a set of sources {D1,..,Dm}, and T a given time. Qi will be re-executed if (RQiT + RQiA) ≤ T and (RQiT < RDjT), where 1 ≤ j ≤ m.

Pipe auto-refresh: Each pipe P(D) is automatically refreshed if RDA expires. This implies re-executing the chain of queries in this pipe. Let P(D) be a pipe, D = Qn(D1,..,Dm), and T a given time. If (RDT + RDA) ≤ T, then each ith query in P(D) is executed if (RQiT < RDjT), where 1 ≤ j ≤ m for Qi, and 1 ≤ i ≤ n. Queries in P(D) are executed from the bottom to the topmost, or recursively as P(P(D1),…,P(Dm)).

As argued in the data warehousing literature [2,24], an efficient refreshing strategy is incremental updating, which suggests that when a base source receives new transactions, only these transactions are transformed and only the affected queries are refreshed. This strategy is still an open research issue for RDF in an open world [7], because RDF data and queries are developed and maintained autonomously by different people.

(2) A query of medium complexity over a large dataset takes one or a few seconds [5].

5. IMPLEMENTATION

First: We have developed an online mashup editor, which will be publicly available next month. Similar to creating feed mashups in Yahoo Pipes, MashQL users can query and fuse data sources, and the output of their queries can be redirected as input to other queries. In the background, Oracle 11g is used for storing and querying RDF. When a user specifies data sources as input, they are bulk-loaded into Oracle's semantic technology tables. MashQL queries are also translated into Oracle's SPARQL. While the user interacts with the editor to formulate a query, the editor performs background queries through AJAX. Each published query is given a URL; calling this URL means executing the query and getting its results back.

Second: We have also started to develop a Firefox add-on that allows people to develop mashups at the client side. The pages opened in the browser tabs are automatically selected as input sources, and a mashup can be created in the left-side panel. The results are rendered by the browser in a new tab. The idea is to allow web pages that embed RDF triples (i.e., RDFa or microformats) to be queried and mashed up. For example, one will be able to compose one's publication list from Google Scholar, DBLP, ACM, and CiteSeer, or to filter all video lectures given by Berners-Lee from YouTube and VideoLectures. Because the mentioned web sites do not support RDFa yet, one can mine/distil the RDF triples using third-party services such as triplr.org, buzzword.org.uk, wandora.org, or Dapper.
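Under the notation of Section 4, the two refresh rules reduce to simple timestamp arithmetic. The following is a minimal Python sketch of that bookkeeping (the class and field names are ours for illustration; MashQL's actual state lives in its Oracle backend):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Source:
    last_update: int        # R_DT: timestamp of the source's last update
    refresh_interval: int   # R_DA: auto-refresh interval

@dataclass
class Query:
    inputs: List[Source]
    last_run: int           # R_QT: timestamp of the previous successful execution
    refresh_interval: int   # R_QA: auto-refresh interval

def query_needs_refresh(q: Query, now: int) -> bool:
    """Query auto-refresh: (R_QT + R_QA) <= T and some input changed since the last run."""
    interval_expired = q.last_run + q.refresh_interval <= now
    input_changed = any(q.last_run < d.last_update for d in q.inputs)
    return interval_expired and input_changed

def pipe_needs_refresh(derived: Source, now: int) -> bool:
    """Pipe auto-refresh: re-run the whole chain once R_DA of the derived source expires."""
    return derived.last_update + derived.refresh_interval <= now

# Example: a query over one source that changed after the query last ran.
src = Source(last_update=5, refresh_interval=100)
q = Query(inputs=[src], last_run=3, refresh_interval=10)
print(query_needs_refresh(q, now=14))  # True: 3 + 10 <= 14 and 3 < 5
print(query_needs_refresh(q, now=12))  # False: interval not yet expired (3 + 10 > 12)
```

When a pipe is refreshed, `query_needs_refresh` would be evaluated per query while walking the chain bottom-up, as the paper's recursive formulation P(P(D1),…,P(Dm)) suggests.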
6. USE CASES

In this section we present three hypothetical use cases that illustrate the use of MashQL for developing data mashups.

6.1 Use case: Retailer

Fnac is a large retailer of cultural and consumer-electronics products. When a new product arrives at Fnac, it has to be entered into the inventory database. This is usually done by scanning the barcode on each product and then manually filling in the product specifications. Furthermore, as Fnac trades in many countries, its product specifications have to be translated into several languages. To save the time spent entering and translating this information manually, Fnac decided to reuse the product specifications (and their translations) that are produced at the factory side. For example, suppose Fnac received three packages from Cannon, Alfred, and IMDB. Fnac would like to scan the barcodes of the received products and then get their specifications directly from the online catalogues of those suppliers. Figure 5 shows samples of the online product catalogues of the three suppliers (we assume they are published in RDFa). Figure 6 illustrates a query that Fnac built to look up the multilingual titles of three products. This query is a mashup of three RDF data sources with a user input of three barcode numbers. The query takes each of these barcodes and finds the English and French titles. Notice that Fnac assumed that the short titles provided by Cannon are in English; thus, they are joined with the other titles that are tagged with "@en". The retrieved results are shown in Figure 8. In the same way, a barcode reader could be connected to the user-input module, to retrieve the specifications (which could be stored at the supplier side) each time a product is scanned.

http://www.cannon.com/products/rdf
  _:P1 :ShortName "CanScan 4400F"
  _:P1 :FullName "Canon CanoScan 4400F Color Image Scanner"
  _:P1 :Producer "Canon"
  _:P1 :ShippingWeight "4 pounds"
  _:P1 :Barcode 9780133557022
  _:P2 :ShortName "PowerShot SD100"
  _:P2 :FullName "Canon PowerShot SD100 7.1MP Camera 3x Zoom"
  _:P2 :Producer "Canon"
  _:P2 :ShippingWeight "2 pounds"
  _:P2 :Barcode 9781143557532

http://www.alfred.com/books
  <:B1> :Type <:Book>
  <:B1> :Title "The Prophet"@en
  <:B1> :Title "Le prophète"@fr
  <:B1> :BCode 8765422097653
  <:B1> :Authors "Kahlil Gibran"
  <:B1> :ISBN-10 0394404289
  <:B3> :Type <:Book>
  <:B3> :Title "Alfred Nobel"@en
  <:B3> :Title "Alfred Nobel"@fr
  <:B3> :BCode 75639898123
  <:B3> :Authors "Kenne Fant"
  <:B3> :ISBN-10 0531123286

http://www.imdb.com/movies
  _:1 rdf:Type <:Movie>
  _:1 :Title "All about my mother"@en
  _:1 :Title "Tout sur ma mère"@fr
  _:1 :ProdCode 3248765355133
  _:1 :NumberOfDiscs 1
  _:2 rdf:Type <:Movie>
  _:2 :Title "Lords of the rings"@en
  _:2 :Title "Seigneur des anneaux"@fr
  _:2 :ProdCode 4852834058083
  _:2 :NumberOfDiscs 3

Figure 5. Sample of RDF data about products.

Figure 6. A mashup of product titles from different resources.

  PREFIX s1:
  PREFIX s2:
  PREFIX s3:
  SELECT ?Barcode ?EnglishTitle ?FrenchTitle
  FROM
  FROM
  FROM
  WHERE {
    {{?x s1:Barcode ?Barcode} UNION {?x s2:BCode ?Barcode}
      UNION {?x s3:ProdCode ?Barcode}}
    FILTER (regex(?Barcode, "9781143557532") ||
            regex(?Barcode, "8765422097653") ||
            regex(?Barcode, "3248765355133")).
    {{OPTIONAL {?x s1:ShortName ?EnglishTitle}} UNION
     {{OPTIONAL {?x s2:Title ?EnglishTitle}} UNION
      {OPTIONAL {?x s3:Title ?EnglishTitle}}
      FILTER (lang(?EnglishTitle) = "en")}}
    {{OPTIONAL {?x s2:Title ?FrenchTitle}} UNION
     {OPTIONAL {?x s3:Title ?FrenchTitle}}
     FILTER (lang(?FrenchTitle) = "fr")}}

Figure 7. The SPARQL equivalent of Figure 6.

  Barcode         EnglishTitle          FrenchTitle
  9781143557532   CanScan 4400F
  8765422097653   The Prophet           Le prophète
  3248765355133   All about my mother   Tout sur ma mère

Figure 8. Retrieved product titles.

6.2 Use case: Citations List

Bob would like to compile the list of articles that cited his articles (excluding what he cited himself). He built a mashup using MashQL to mix his citations retrieved from both Google Scholar and CiteSeer, and then filtered out the self-citations. First, he performed a keyword search ("Bob Hacker") on both Google Scholar and CiteSeer (3). Figure 9 shows a sample of the extracted RDF triples. Bob's MashQL query is shown in Figure 10, and its SPARQL equivalent in Figure 11. In this query, Bob wrote: retrieve every article that has a title (call it CitingArticle), has an

(3) Similar to the previous use case, we assume that both Google Scholar and CiteSeer render their search results in RDFa (i.e., the RDF triples are embedded in HTML), as many companies have started to do nowadays. Alternatively, Bob can use a third-party service (e.g., triplify.org) to extract triples from HTML pages.
author that does not contain "Bob Hacker" or "Hacker B.", and that cites another article that has a title (call it CitedArticle) and has an author that contains "Bob Hacker" or "Hacker B.". Figure 12 shows the result of this query.

http://scholar.google.com/scholar?q=bob+Hacker
  :Title "Prostate Cancer"
  :Author "Hacker B., Hacker A."
  :Title "Best and Worst Lifestyles"
  :Author "Bob Hacker"
  :Cites
  :Title "Protein Categories"
  :Author "Bob Smith"
  :Cites
  :Cites
  :Title "Cancer Vaccines"
  :Author "Alice Hacker"
  :Cites

http://www.citeseer.com/search?s="Bob Hacker"
  _:1 :Title "Prostate Cancer"
  _:1 :Author "Hacker B., Hacker A."
  _:2 :Title "Protocols in Molecular Biology"
  _:2 :Author "Bob Hacker"
  _:2 :ArticleCited _:1
  _:3 :Title "Cancer Vaccines"
  _:3 :Author "Eve Lee, Bob Hacker"
  _:4 :Title "Overview about Systems Biology"
  _:4 :Author "Tom Lara"
  _:4 :ArticleCited _:1
  _:4 :ArticleCited _:2

Figure 9. Sample of RDF data about Bob's articles.

Figure 10. A mashup of citations from different sites.

  PREFIX s1: <http://scholar.google.com/scholar?q=bob+Hacker>
  PREFIX s2: <http://www.citeseer.com/search?s="Bob Hacker">
  SELECT ?CitingArticle ?CitedArticle
  FROM
  FROM
  WHERE {
    {{?X1 s1:Title ?CitingArticle} UNION {?X1 s2:Title ?CitingArticle}}
    {{?X1 s1:Author ?X2} UNION {?X1 s2:Author ?X2}}
    {{?X1 s1:Cites ?X3} UNION {?X1 s2:ArticleCited ?X3}}
    {{?X3 s1:Title ?CitedArticle} UNION {?X3 s2:Title ?CitedArticle}}
    {{?X3 s1:Author ?X4} UNION {?X3 s2:Author ?X4}}
    FILTER (regex(?X4, "^Bob Hacker") || regex(?X4, "^Hacker B."))
    FILTER (!(regex(?X2, "^Bob Hacker") || regex(?X2, "^Hacker B.")))}

Figure 11. The SPARQL equivalent of Figure 10.

  CitingArticle                    CitedArticle
  Protein Categories               Prostate Cancer
  Protein Categories               Best and Worst Lifestyles
  Cancer Vaccines                  Prostate Cancer
  Overview about Systems Biology   Prostate Cancer
  Overview about Systems Biology   Protocols in Molecular Biology

Figure 12. The query results.

6.3 Use case: Job Seeking

Bob has a PhD in bioinformatics. He is looking for a full-time, well-paid, research-oriented job in certain European countries. He has spent an enormous amount of time searching different job portals, each time trying many keywords and filters. Instead, Bob used MashQL to find the jobs that meet his specific preferences. Figure 13 shows Bob's queries on Google Base and on Jobs.ac.uk. First, he visited Google Base and performed a keyword search (bioinformatics OR "computational biology" OR "systems biology" OR e-health); he copied the link of the retrieved results from Google (which are rendered in RDFa) into the RDFInput module, and then created a MashQL query on these results. He performed a similar task to query Jobs.ac.uk. The third MashQL module in Figure 13 mixes the results of the above two queries and filters them based on location preferences (provided in the UserInput module). The SPARQL equivalent of Bob's MashQL query is shown in Figure 14.

Figure 13. Bob's mashup of jobs.

  CONSTRUCT *
  WHERE { ?Job :JobIndustry ?X1;
          :Type ?X2;
          :Currency ?X3;
          :Salary ?X4.
    FILTER (?X1 = "Education" || ?X1 = "HealthCare")
    FILTER (?X2 = "Full-Time" || ?X2 = "Fulltime" || ?X2 = "Contract")
    FILTER (?X3 = "^Euro" || ?X3 = "^€")
    FILTER (?X4 >= 75000 && ?X4 <= 120000)}

  CONSTRUCT *
  WHERE { ?Job :Category ?X1;
          :Role ?X2;
          :SalaryCurrency ?X3;
          :SalaryLower ?X4.
    FILTER (?X1 = "Health" || ?X1 = "BioSciences")
    FILTER (?X2 = "Research\Academic")
    FILTER (?X3 = "UKP")
    FILTER (?X4 > 50000)}

  SELECT ?Job
  WHERE { ?Job :Location ?X1
    FILTER (?X1 = "^UK" || ?X1 = "^Belgium" || ?X1 = "^Germany" ||
            ?X1 = "^Austria" || ?X1 = "^Holland")}

Figure 14. The SPARQL equivalent of Figure 13.
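The self-citation filtering of use case 6.2 can also be traced in plain Python over the CiteSeer sample of Figure 9 (an illustrative sketch with our own helper names; the real mashup executes as the SPARQL of Figure 11):

```python
# CiteSeer sample from Figure 9: titles, authors, and citation links.
TITLE = {"_:1": "Prostate Cancer", "_:2": "Protocols in Molecular Biology",
         "_:3": "Cancer Vaccines", "_:4": "Overview about Systems Biology"}
AUTHOR = {"_:1": "Hacker B., Hacker A.", "_:2": "Bob Hacker",
          "_:3": "Eve Lee, Bob Hacker", "_:4": "Tom Lara"}
CITES = [("_:2", "_:1"), ("_:4", "_:1"), ("_:4", "_:2")]  # (citing, cited)

def is_bob(authors: str) -> bool:
    """Does the author string mention Bob under either spelling?"""
    return "Bob Hacker" in authors or "Hacker B." in authors

def citations_of_bob():
    """Pairs (CitingArticle, CitedArticle) where the cited article is Bob's
    and the citing article is not, i.e., self-citations are filtered out."""
    return [(TITLE[citing], TITLE[cited])
            for citing, cited in CITES
            if is_bob(AUTHOR[cited]) and not is_bob(AUTHOR[citing])]

print(citations_of_bob())
# [('Overview about Systems Biology', 'Prostate Cancer'),
#  ('Overview about Systems Biology', 'Protocols in Molecular Biology')]
```

The pair ("Protocols in Molecular Biology", "Prostate Cancer") is dropped because its citing author is "Bob Hacker", exactly the self-citation case Bob wanted to exclude; the two surviving pairs appear among the rows of Figure 12.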
7. DISCUSSION AND FUTURE DIRECTIONS

This article proposed a language that allows people to query and mash up structured data without any prior knowledge about the schema, structure, vocabulary, or technical details of these data. Not only can non-IT experts use MashQL; professionals can also use it to build advanced queries.

MashQL supports all constructs of the W3C standard SPARQL, except the NAMED GRAPH construct, which is intended for advanced use, i.e., switching between different graphs within the same query. To be close to users' needs and intuition, we defined new constructs (e.g., OneOf, union "\", Without, Any, reverse "~", and others). These constructs are not directly supported in SPARQL, but are emulated. We plan to include aggregation and grouping functions, especially as they are supported by Oracle's SPARQL.

As yet, MashQL does not support inferencing constructs (such as SubClass or SubProperty), which are indeed useful for data fusion. As these constructs are expensive to compute (and thus degrade the interactivity of MashQL), we plan to replace the Oracle semantic technology that we currently use as an RDF store with an RDF index that we are developing, for speedy OWL inferencing.

We have downloaded most of the public RDF sources, on which our MashQL editor will be deployed online next month. Not only will people benefit from this; we will also have the opportunity to better evaluate the usability of MashQL and its contribution to linking and fusing more data bottom-up.

Acknowledgement
We are indebted to Dr. George Pallis, Dr. Demetris Zeinalipour, and other colleagues for their valuable comments and feedback on early drafts of this paper. This research is partially supported by the SEARCHiN project (FP6-042467, Marie Curie Actions).

REFERENCES
1 Athanasis N, Christophides V, Kotzinos D: Generating On the Fly Queries for the Semantic Web. ISWC (2004)
2 Abiteboul S, Duschka O: Complexity of Answering Queries Using Materialized Views. ACM SIGACT-SIGMOD-SIGART (1998)
3 Bizer C, Heath T, Berners-Lee T: Linked Data: Principles and State of the Art. WWW (2008)
4 Bloesch A, Halpin T: Conceptual Queries using ConQuer-II. (1997)
5 Chong E, Das S, Eadon G, Srinivasan J: An efficient SQL-based RDF querying scheme. VLDB (2005)
6 Czejdo B, Elmasri R, Rusinkiewicz M, Embley D: An algebraic language for graphical query formulation using an EER model. ACM Computer Science Conference (1987)
7 Deng Y, Hung E, Subrahmanian VS: Maintaining RDF views. Tech. Rep. CS-TR-4612, University of Maryland (2004)
8 Ennals R, Garofalakis M: MashMaker: mashups for the masses. SIGMOD Conference (2007)
9 Goldman R, Widom J: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB (1997)
10 Hofstede A, Proper H, Weide T: Computer Supported Query Formulation in an Evolving Context. Australasian DB Conf. (1995)
11 http://demo.openlinksw.com/isparql (Feb. 2009)
12 Jarrar M, Dikaiakos M: MashQL: A Query-by-Diagram Topping SPARQL. Proceedings of the ONISW'08 workshop (2008)
13 Jarrar M, Dikaiakos M: A query-by-diagram language (MashQL). Technical Article TAR200805, University of Cyprus (2008). http://www.cs.ucy.ac.cy/~mjarrar/JD08.pdf
14 Kaufmann E, Bernstein A: How Useful Are Natural Language Interfaces to the Semantic Web for Casual End-Users? ISWC (2007)
15 Li Y, Yang H, Jagadish H: NaLIX: An interactive natural language interface for querying XML. SIGMOD (2005)
16 Popescu A, Etzioni O, Kautz H: Towards a theory of natural language interfaces to databases. 8th Conference on Intelligent User Interfaces (2003)
17 Parent C, Spaccapietra S: About Complex Entities, Complex Objects and Object-Oriented Data Models. Information System Concepts (1989)
18 http://rdfweb.org/people/damian/RDFAuthor (Jan. 2009)
19 Russell A, Smart R, Braines D, Shadbolt N: NITELIGHT: A Graphical Tool for Semantic Query Construction. Semantic Web User Interaction Workshop (2008)
20 http://esw.w3.org/topic/SPARQL/Extensions (Feb. 2009)
21 http://www.topquadrant.com/sparqlmotion (Feb. 2009)
22 Tummarello G, Polleres A, Morbidoni C: Who the FOAF knows Alice? A needed step toward Semantic Web Pipes. ISWC Workshop (2007)
23 Zloof M: Query-by-Example: a Data Base Language. IBM Systems Journal, 16(4) (1977)
24 Zhuge Y, Garcia-Molina H, Hammer J, Widom J: View Maintenance in a Warehousing Environment. SIGMOD (1995)