=Paper=
{{Paper
|id=Vol-1184/paper7
|storemode=property
|title=Programmable Analytics for Linked Open Data
|pdfUrl=https://ceur-ws.org/Vol-1184/ldow2014_paper_07.pdf
|volume=Vol-1184
|dblpUrl=https://dblp.org/rec/conf/www/HuR14
}}
==Programmable Analytics for Linked Open Data==
Bo Hu (Fujitsu Laboratories of Europe, Middlesex, UK, bo.hu@uk.fujitsu.com), Eduarda Mendes Rodrigues (Fujitsu Laboratories of Europe, Middlesex, UK), Emeric Viel (Fujitsu Laboratories Ltd, Kawasaki, Japan, emeric.viel@jp.fujitsu.com)

Copyright held by the author/owner(s). LDOW2014, April 8, 2014, Seoul, Korea.

===Abstract===
The LOD initiative has made a major impact on data provision. Thus far, more than 800 datasets have been published, containing tens of billions of RDF triples. The sheer size of the data has not, however, resulted in a significant increase of data consumption. We contend that a new programming paradigm is necessary to simplify the utilisation of LOD data. This paper reports an early-phase development towards a programmable web of LOD data. We propose to tap into the distributed computing environment underpinning the popular statistical toolkit R. Where possible, native R operators and functions are used in our approach so as to lower the learning curve. The crux of our future work lies in the full implementation and evaluation.

'''Categories and Subject Descriptors:''' H.4 [Information Systems Applications]: Miscellaneous; D.2.12 [Interoperability]: Data mapping

'''Keywords:''' Linked Open Data, RDF, R, Programmability

===1. Introduction===
As of mid 2013, a total of 870 datasets had been published as part of the Linked Open Data (LOD) cloud, exposing nearly 62 billion RDF triples in a computer-readable representation format (http://stats.lod2.eu, accessed January 2014). These numbers are still growing rapidly, largely attributable to open governmental data initiatives and "online" high-throughput scientific instruments. As greater amounts of data become available through the LOD cloud, however, the expected virtuous cycle (more data leading to more consumption and thus encouraging further data publication) has not been clearly witnessed. On the contrary, it has been observed that, on many occasions, after an initial spark of interest and test applications, data use at many linked data hosting sites declined significantly [3]. Some critics believe that the massive amounts of semantic-rich data accumulated so far have actually driven away potential users. On the one hand, the semantic layer adds extra abstraction/conceptualisation to the data, making it unsuitable for toolkits tuned for data represented in tabular format. On the other hand, the sheer volume of data renders many semantic web tools less productive. We contend that a major obstacle preventing ordinary users from tapping into the LOD cloud is the lack of a mechanism that allows people to make "sense" out of the overwhelming amount of data. More specifically, in order to facilitate the general uptake of LOD by research communities and practitioners, simply making the data available is not sufficient. It is essential to offer, alongside the data, a means of utilising such resources in a way that is comprehensible to users with a wide range of backgrounds and potentially limited knowledge of semantic technologies.

In this paper, we propose a solution that tightly integrates linked data computing with the popular statistical programming platform R. This brings together two well-established efforts and thus two large user bases: R offers a declarative and well-formed programming language for mining and analysing LOD datasets, while the LOD cloud paves the way to instantaneous access to a large amount of structured data on which existing R functions and packages can be applied.

===1.1 Programmability of LOD===
Thus far, data available through the LOD cloud are accessed primarily using SPARQL. Typically, this is conducted by submitting query scripts to a SPARQL endpoint and, based on the query results, filtering/joining/aggregating (available from SPARQL 1.1) candidate results either on the server side or at the local client. SPARQL is based on set algebra. This is both an advantage and a disadvantage. It resembles the prevailing SQL for RDBs; people familiar with the latter can therefore enjoy a fast learning curve when making the paradigm shift. On the other hand, SPARQL is mainly a query language and thus does not stand out for post-query data processing. In many cases, the results of SPARQL queries are extracted and converted into the native data structures of other programming languages (e.g. Java) for further manipulation.
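To make this workflow concrete, the following minimal sketch queries a public endpoint with the SPARQL package for R [8], discussed further in section 1.2; the endpoint, query and post-processing are illustrative assumptions, not code from the original paper.

 library(SPARQL)
 # status quo: submit a query script to a SPARQL endpoint ...
 endpoint <- "http://dbpedia.org/sparql"
 q <- paste(
   "PREFIX dbo: <http://dbpedia.org/ontology/>",
   "SELECT ?person ?birth",
   "WHERE { ?person a dbo:Scientist ; dbo:birthDate ?birth }",
   "LIMIT 50")
 res <- SPARQL(endpoint, q)$results  # results arrive as an R data frame
 # ... then filter/aggregate the extracted results locally, outside SPARQL
 head(res[order(res$birth), ])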
Equipping and/or enhancing LOD with programmability beyond SPARQL has been investigated previously. The (dis)similarity between RDF, as the underlying data structure of LOD, and the general object-oriented methodology inspired ActiveRDF [5], where semantic data are exposed through a declarative scripting language. In the same direction, RDFReactor [9] packs RDF resources as Java objects, where instances become objects and properties are accessed through Java methods.

Unfortunately, the above integrations have not lowered the threshold to fully exploiting the LOD cloud. Among other reasons, the most prominent ones are the following. It is very difficult for such approaches to deal with missing values and sparse structures, which abound in uncurated or automatically produced collections. The size and quality of the LOD cloud lends itself to statistical data analysis, yet performing such analysis using SPARQL queries can become cumbersome, in many cases requiring recursive SPARQL queries and multiple join operations. Moreover, neither SPARQL nor the integrated frameworks enjoy native support for matrix operations and solving linear equations, characteristics that become increasingly critical in processing large amounts of data.

R, as a dynamic and functional language, offers good capacity to enhance the programmability of LOD and remedy the shortcomings of existing approaches.

===1.2 Why R?===
R is a programming language and a software toolkit for data science. Though not outspoken about it, R is designed for domain experts rather than conventional software programmers. It focuses on transactions that are more familiar to the former, e.g. organising data, manipulating spreadsheets and visualising data. R is open source, with over 2,000 packages/libraries covering a wide variety of data analytics (http://www.r-project.org, accessed January 2014). The most distinctive feature of R is its native support for vector arithmetic. In addition, versatile graphics and data visualisation packages, as well as easy access to a large number of specialist machine learning and predictive algorithms, have made R a widely adopted computing environment in scientific communities (cf. [2]). R is essentially single threaded; scaling R for Big Data analysis can be achieved with RHadoop (https://github.com/RevolutionAnalytics/RHadoop/wiki). In this paper, we focus on adapting R to the LOD data structure.

Integrating R and LOD has been inspected previously. The SPARQL R library [8] aims to expose RDF data and wrap SPARQL endpoints with a black-box-style connector library. Largely in the same vein, the most recent effort, the rrdf library [10], allows loading and updating RDF files through manually crafted RDF-R mappings; the in-memory RDF models can then be queried using SPARQL. We see the following issues with SPARQL-based integration. Firstly, SPARQL queries and the target RDF datasets are not transparent to R users, making it difficult to validate and optimise the processes; arbitrary SPARQL queries can incur global scans that drastically impede system performance. Secondly, the R environment loses regulatory control over SPARQL queries; such blindness subjects the system to safety and security concerns. Finally, domain experts and statisticians are required to manually compose the SPARQL queries, which means learning the fundamentals of RDF and a new query language.
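For comparison, a sketch of the rrdf route [10]; the function names (load.rdf, sparql.rdf) are assumptions based on the package's documented interface, and the file and query are illustrative only.

 library(rrdf)
 # manually crafted RDF-R mapping: load a local RDF file into memory ...
 model <- load.rdf("patients.ttl", format = "TURTLE")
 # ... then query the in-memory model with hand-written SPARQL; the query
 # stays opaque to R, which is the transparency issue noted above
 rows <- sparql.rdf(model,
   "SELECT ?p ?age WHERE { ?p <http://example.org/has_age> ?age }")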
===2. Programmable LOD===
The LOD cloud provides a framework to access and navigate through apparently unrelated data, with conceptual models capable of explicating hidden knowledge. The logic-based axioms underpinning RDF are in many cases not powerful enough to capture all the regularities in the data. We envision that a programming language aiming to utilise and interact with the LOD cloud (and the datasets therein) should preferably present the following characteristics.

Native support for the LOD data structure. The underlying data structure of LOD is RDF triples, which essentially compose a directed, labelled graph. SPARQL, the standard RDF query language, transforms data into tabular form for better alignment with RDB conventions. This extra formatting layer is not always necessary when the underlying data structure can be accessed with native graph operators.

Native support for data analysis. The better data accessibility inherent to LOD presents itself as both an opportunity and a challenge. With better access, an LOD data consumer is exposed to data linked in through semantic relations, most of which he or she may not be aware of. More data is not always a merit: the consumer is likely to be overwhelmed by data with different formats and different semantics, making analysis a struggle. A programming platform capable of dynamically handling different formats becomes desirable.

Ready for distributed processing. Applications accessing the LOD cloud can easily be exposed to billions of triples, tantamount to terabyte-grade data transactions. Single-machine and single-threaded statistical offerings will find themselves struggling in such situations. The programming platform should offer parallelisation capacity for good scalability.

Inspecting R within the scope of the above requirements, we can make the following observations. Firstly, R is a functional language with lazy evaluation, wherein functions are lifted to become first-class citizens. Also, R has a dynamic type system. These fit well with RDF's idiosyncrasies. Secondly, R is designed for statistical computing: missing-value support and sparse matrix handling permeate all R functions and operations. Finally, though R is single-threaded, for many machine learning tasks it is possible to distribute the underlying R data structures and facilitate process distribution over a layer of data abstraction.
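The second observation can be illustrated with a minimal sketch (ours, not the paper's): missing values and sparse structures, which pervade RDF-derived data, are first-class in R.

 library(Matrix)
 ages <- c(5, NA, 6)        # an absent property becomes NA, not an error
 mean(ages, na.rm = TRUE)   # most R functions handle NA natively -> 5.5
 # a sparse subject-object adjacency matrix for a single predicate:
 adj <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = 1, dims = c(3, 3))
 rowSums(adj)               # out-degree per subject, no dense conversion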
===3. System Architecture===
The concept of programmable LOD is experimented with on the BigGraph platform, denoted BGR. BigGraph aims at generic distributed graph storage with a RESTful interface. Figure 1 illustrates the main building blocks of BGR. At the top is the user interface: a BGR user programs using R primaries, with dedicated functions that facilitate the RDF-to-R data type mapping. BGR programs are submitted to a master node, the main entry point through which the user interacts with the system. The runtime at the master is responsible for the following tasks: 1) interpreting BGR programs; 2) interacting with the in-memory graph model for graph transactions; and 3) deciding which data server/worker it should directly query.

[Figure 1: System architecture. Programs in extended R enter through the master node, which hosts the in-memory graph model; each data server pairs an R system with a storage driver on top of the physical storage nodes N0 ... Nn.]

The runtime on each data server mainly consists of two key components: an R environment and a storage driver. Each local R installation executes statistical analyses directly or exposes such analytical capacity through the in-memory graph model. The storage driver is responsible for I/O with the underlying storage unit.

===3.1 Mapping RDF resources to R variables===
The fundamental data structure for storing data in R is the vector; a single integer, for example, is seen as a vector of length one. Variations and extensions of the vector data type include matrices, arrays and data frames. Though RDF graphs can easily be stored as adjacency matrices or adjacency lists, we opt against a full conversion of the LOD cloud, which would add extra computing expense. Rather, a direct one-to-one mapping between RDF resources (classes and instances) and R variables can provide a seamless and smooth integration while at the same time ensuring the integrity of the original data. For instance, an RDF instance becomes an R dataframe consisting of single-element vectors. Similarly, an RDF class can be assigned to a two-dimensional dataframe with rows corresponding to instances and columns to properties. Instance values can be loaded either column-wise or row-wise depending on the analytical and performance requirements. In the following example, column-based initialisation is conducted:

 > s <- data.frame(name=av, age=bv, email=cv)
 > s
     name age        email
 P1   foo   5  foo@bar.com
 P2  john   6 john@bar.com
 ...

Note that in this example, a class resource is extensionally represented by the set of its instances at the snapshot of data loading.
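Row-wise loading, the alternative mentioned above, can be sketched as follows. This is our illustration, reusing the graph_traverse primitive introduced in section 3.4.1; the instance identifiers and property names are assumptions.

 # row-wise initialisation: append one row per instance as it is
 # fetched from the graph; instance ids and properties are illustrative
 s <- data.frame()
 for (v in c("P1", "P2")) {
   ins <- graph_traverse(vertex = v, out_edge = "*")
   s <- rbind(s, data.frame(name = ins$name, age = ins$age,
                            email = ins$email, row.names = v))
 }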
===3.2 Mapping to the underlying storage===
In order to accommodate the sheer size of the LOD cloud and leverage parallel data loading, distributed storage is necessary. We opt for an edge-based storage solution that fits nicely with the principles of a Key-Value Store (KVS) [4]. The KVS plays a key role in our approach to scaling out RDF graphs. RDF triples are, however, not KVS-ready. The first and foremost step is therefore to define the key-value tuples that a standard KVS can conveniently consume. In BGR, the components of a triple are concatenated together and encoded as a UUID, which is then treated as the key, while the value parts of the KVS are reserved for other purposes, e.g. named graph, provenance, and access control.

Each RDF triple is indexed three times. Even though this presents a replication factor of at least three, our approach is justified by query performance and fault recovery considerations. Loading RDF data into R variables normally takes the form of localised range queries, fixing either the subject or the object of the triples and replacing the rest with wildcards. For instance, graph.find(s, null, null) retrieves all the triples of a resource, while graph.find(null, p, o) presents an inverse traversal from object o. By replicating triples, data can be sorted according to not only subjects but also predicates and objects. This improves query execution.

===3.3 Loading graph===
For performance, LOD datasets are treated in the following ways. For datasets with a RESTful API (e.g. DBpedia), the RDF resource to R variable mapping can be realised straightforwardly. Some datasets expose only SPARQL endpoints; SPARQL queries then become necessary, with the restriction that only local scans (e.g. ⟨s, ∗, ∗⟩ or ⟨∗, ∗, o⟩) are permitted. Ideally, the results of a scan are used to construct the local data graph. In the long run, on-demand data crawling can maintain local copies of frequently used datasets, helping to ensure data quality and manage mappings through local data curation.
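A minimal sketch of such restricted, scan-only endpoint access (our illustration, again using the SPARQL package [8]; the endpoint and the local-graph construction step are assumptions):

 library(SPARQL)
 # emulate a local scan <s, *, *>: the subject is fixed, predicate and
 # object are wildcards, so no global scan is incurred at the endpoint
 scan_subject <- function(endpoint, s) {
   q <- sprintf("SELECT ?p ?o WHERE { <%s> ?p ?o }", s)
   SPARQL(endpoint, q)$results
 }
 triples <- scan_subject("http://dbpedia.org/sparql",
                         "http://dbpedia.org/resource/Kawasaki")
 # the scan results would then feed the construction of the local graph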
===3.4 Processing data===
R is inherently a single-threaded application, though parallelisation has been implemented by the snow and snowfall packages [6]. The use of the LOD cloud falls into the following categories, for which we propose solutions to achieve good scalability.

====3.4.1 Bulky processing====
This OLAP-like data processing aims to surface patterns (such as hidden semantic relationships and semantic data clusters) out of data held in the LOD cloud. Such a process is normally performed on preloaded data and is not time-critical. While a plethora of R packages can be leveraged for data mining, the main difficulty lies in populating R dataframes with LOD data in a form that suits R functions. By encoding each RDF resource as one R variable, it is easy to construct matrices that fit special purposes. For many predictive machine learning tasks, voting-based aggregation (e.g. bagging [1]) can distribute the overall learning task to carefully sampled subsets of the target datasets. This can easily be achieved and managed by traversing the graph to the selected subsets of concept instances.

Example. Given a dataset with patient data, the following code fragment splits the set of patient instances into 10 subsets (based on the literature, bagging should take a fraction between 1/2 and 1/50, depending on the size of the sample data). Traversal with named vertices and edges can be carried out along both inbound and outbound directions.

 patient_v <- graph_get_vertex("Patient")
 all_patients <- graph_get(patient_v, edge="rdf:type")
 for(i in 1:10) {
   vname <- paste("p_set", i, sep="")
   assign(vname, sample(all_patients, length(all_patients)/10))
   saveRDS(get(vname), file="...")
 }

Here, we assume the entire set of patient instances will be loaded into memory. Alternatively, a partial loading can be executed to lower the demand for computing resources and reduce latency. In the following example, the edges of the patient resource are indexed, sampling is conducted against the index, and only the selected instances are loaded.

 patient_v <- graph_get_vertex("Patient")
 patient_size <- graph_get_edge_count(patient_v, edge="rdf:type")
 patient_index <- graph_edge_index(patient_v, edge="rdf:type")
 n <- c(1:patient_size)
 ns <- sample(n, patient_size/10)
 for (i in ns) {
   ins <- graph_traverse(patient_v, edge=patient_index[i])
   saveRDS(ins, file="...")
 }

The following code fragment constructs a random-forest-based prognosis model (line 11) for a certain disease based on a patient's gender and age. The patient data are loaded with a graph traversal transaction over the given patient instance vertices and the given outgoing edges (lines 3-6, where the wildcard indicates all outgoing edges). Missing values are set to a default (i.e. age = 75) for simplicity (line 9).

 1: p_partition <- readRDS(file="...")
 2: patients <- data.frame()
 3: for(i in p_partition) {
 4:   p_data <- graph_traverse(vertex=i, out_edge="*")
 5:   patients <- rbind(patients, p_data)
 6: }
 7: size <- length(patients)
 8: training_set <- data.frame(age=patients$has_age,
                               gender=patients$has_gender, ...)
 9: training_set$age[is.na(training_set$age)] <- 75
 10: labels <- as.factor(patients$status)
 11: rfp <- randomForest(training_set, labels)

In this example, we assume that the patient data partitions are passed via a data file residing on disk (line 1). This is for illustrative purposes only and does not exclude shared-memory or message-passing based solutions.
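Because the partitions are materialised as files, the per-partition training above parallelises naturally with snowfall [6]. A minimal sketch under that assumption; the file names, column names and the train_one helper are illustrative, not part of the BGR implementation:

 library(snowfall)
 library(randomForest)
 sfInit(parallel = TRUE, cpus = 4)   # spawn 4 worker processes
 sfLibrary(randomForest)             # load the learner on every worker
 train_one <- function(f) {
   part <- readRDS(f)                # one saved patient partition
   randomForest(part[, c("age", "gender")], as.factor(part$status))
 }
 # one random forest per partition; bagging aggregates their votes
 forests <- sfLapply(paste0("p_set", 1:10, ".rds"), train_one)
 sfStop()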
====3.4.2 Incremental processing====
OLTP-like realtime data processing is supported through an event-driven mechanism that applies classifiers (obtained as in the previous section) to data in an incremental fashion. This incremental characteristic is two-fold. Firstly, the system should detect the difference between existing classified data and new inputs, so as to isolate the changes and restrict reclassification to those differences only. Secondly, the system should update only those classifiers whose input data have changed since the most recent retraining. BGR accommodates both requirements through distributed logging of graph structural changes and localised event propagation observing graph structures. For instance, an "OutEdgeCreatedEvent" is issued by the storage listener if an edge is inserted. This event instance carries information such as the edge (in triple form) and the vertex (vs) on which the edge is created. Events propagate along paths that originate from vs, to avoid global scans. As a result, affected classifiers along the propagation routes are scheduled for update. Note that some machine learning algorithms can be easily adapted to fulfil these requirements (cf. random forest [?]).

Versioning resources. An RDF resource normally consists of multiple triples jointly stating the constraints on the resource. Therefore, the event-driven incremental processing, which only has visibility of individual triples, requires a mechanism to obtain complete statements of the resource. We use versioning to ensure consistency when data are classified and when classifiers are retrained. Version information is stored in the value part of the key-value tuples, and version updates are treated as atomic operations.

Multiple threads. Multi-threaded R is not likely to be available in the near future. As spawning threads is not possible, BGR runs multiple processes communicating through sockets. For instance, one R process listens to the underlying storage driver, fetching graph structural events through a dedicated socket address. The events are then parsed to extract the event types, the triples that raised the events, and the versions of those triples. Other R processes handle the events and dispatch them for further actions when necessary, again by writing to a socket address. Socket-based communication may not provide ideal performance; in many cases it becomes the main performance bottleneck. It does, however, offer the most cost-effective way to increase parallelism without dismantling R.
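The listener process described above can be sketched with base R socket primitives; the ports, the line-oriented message format and the dispatch target are assumptions for illustration only:

 # one R process fetching graph structural events from the storage
 # driver and dispatching them to a handler process over sockets
 con <- socketConnection(port = 6311, server = TRUE,
                         blocking = TRUE, open = "r")
 out <- socketConnection(host = "localhost", port = 6312, open = "w")
 repeat {
   line <- readLines(con, n = 1)
   if (length(line) == 0) break          # driver closed the connection
   ev <- strsplit(line, "\t")[[1]]       # -> c(type, triple, version)
   if (ev[1] == "OutEdgeCreatedEvent")
     writeLines(line, out)               # schedule classifier update
 }
 close(con); close(out)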
===3.5 Resource local processors===
We advocate and practise a declarative and resource-centric approach in BGR. More specifically, the expected analytics are constructed at the resource level and associated with the target resource through RDF property declarations. For instance, the following RDF triples assign an R random-forest classifier (defined in section 3.4.1) to a resource (i.e. the "Patient" class).

 :Patient a owl:Class ;
     rdfs:subClassOf
       [ a owl:Restriction ;
         owl:onProperty :has_behaviour ;
         owl:someValuesFrom
           [ a owl:Class ;
             owl:oneOf (:new_patient_behaviour
                        :update_patient_behaviour) ] ] .
 ...
 :new_patient_behaviour
     a :Behaviour , owl:NamedIndividual ;
     :event :onNewInstanceAdded ;
     :has_handler "R:rfp" .

This essentially defines how a resource (e.g. Patient) reacts to (or behaves against) events (e.g. the onNewInstanceAdded event), realised using the attached process (e.g. R:rfp). At the ontology class level, enumeration (owl:oneOf) is used to establish the conceptual relationship between the Patient class and the desired functionalities w.r.t. the corresponding events. The actual implementation of behaviour instances can be realised, for example, in R. Depending on the size of the compiled code, the implementation can be stored either entirely in the value part of the KV tuple ⟨:new_patient_behaviour, :has_handler, "R:rfp"⟩ or separately, with a pointer from the value part of the tuple. When a new patient instance is asserted, an event is raised which triggers the embedded R function to react to the change in the storage.

Several advantages are evident from assigning behaviour to, and storing its implementation close to, a resource. Firstly, for distributed data storage, this implies close proximity of data and process localities. Secondly, behaviour supports the reactive programming principle by packing small process units against very specific data units. Thirdly, data behaviours and their implementations are conceptualised with well-formed RDFS constructs. This facilitates ontological inference when necessary, though with caveats: i) increased inference complexity and ii) anonymous resources complicating RDF query handling.

===4. Preliminary Results===
BGR is still under development. This section reports the system design considered so far and lists potential future work.

The underlying graph storage is a distributed KVS based on HBase. HBase also handles data partitioning, locality, replication and fault tolerance. A Jena graph introduces the necessary abstraction layer for indexing and retrieving triples in the KVS. A simple graph programming interface is responsible for graph traversal and scan operations. It follows the Tinkerpop Blueprints convention (https://github.com/tinkerpop/blueprints/wiki) and currently talks to the Jena graph so as to construct resource subgraphs from the edge-based storage data structure. The use of Jena is mainly for the convenience of leveraging Jena models when in-memory ontology inference becomes necessary. In the future, direct communication between the storage and the graph API is expected to improve overall system performance, at the price of reduced ontological inference capacity.

Both the storage and graph modules are implemented in Java. R communicates with the storage driver through an R-Java interfacing library, the rJava package [7]. Calling Java methods is straightforward, as illustrated in the following example:

 .jinit()
 # do something before loading the graph
 g.obj <- .jnew("Graph")
 # do something else
 graph.find <- function(x, y) {
   .jcall(g.obj, "S", "find", x, y)
 }

We intend to minimise the effort of extending R, i.e. to avoid introducing compiled R packages, mainly out of practical considerations. It lowers the learning curve for people already familiar with R, as essentially no extra operators need to be learned. It also increases the visibility of data management with respect to the underlying data structure.

===5. Conclusions===
This paper calls for user-friendly and programmable LOD, achieved by leveraging and enhancing R, a free software toolkit for statistical computing and graphics.

Note that there are a few R packages (e.g. bigmemoRy) that aim in particular at Big Data computing. There are also R packages (e.g. foreach, ff, etc.) for strengthening R parallelism. Our proposal does not compete with such existing solutions; rather, it advocates a collaboration of two independent efforts and provides solutions that fit the visions and requirements of the linked data paradigm.

Nor do we see competition with the RESTful movement, such as the Linked Data Platform (LDP, [?]), which has already gained momentum in the LOD community. LDP works at a layer lower than the proposed LOD/R integration, assisting data exposure so that the data can be consumed by the BGR functions and operators.

===6. References===
[1] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, July 1999.

[2] B. Everitt and T. Hothorn. A Handbook of Statistical Analyses Using R. CRC Press, Boca Raton, Fla., 2010.

[3] N. C. Helbig, A. M. Cresswell, B. Burke, and L. Luna-Reyes. The dynamics of opening government data. Technical report, Nov. 2012.

[4] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, Apr. 2010.

[5] E. Oren, B. Heitmann, and S. Decker. ActiveRDF: Embedding semantic web data into object-oriented languages. Web Semant., 6(3):191–202, Sept. 2008.

[6] L. Tierney, A. J. Rossini, and N. Li. Snow: A parallel computing framework for the R system. International Journal of Parallel Programming, 37(1):78–90, 2009.

[7] S. Urbanek. rJava: Low-Level R to Java Interface, 2009. R package version 0.8-1.

[8] W. R. van Hage and T. Kauppinen. SPARQL package for R, 2011. Available at http://linkedscience.org/tools/sparql-package-for-r.

[9] M. Völkel. RDFReactor – from ontologies to programmatic data access. In Proc. of the Jena User Conference 2006. HP Bristol, May 2006.

[10] E. Willighagen. Accessing biological data with semantic web technologies. http://dx.doi.org/10.7287/peerj.preprints.185v1, 2013.