=Paper=
{{Paper
|id=Vol-1184/paper7
|storemode=property
|title=Programmable Analytics for Linked Open Data
|pdfUrl=https://ceur-ws.org/Vol-1184/ldow2014_paper_07.pdf
|volume=Vol-1184
|dblpUrl=https://dblp.org/rec/conf/www/HuR14
}}
==Programmable Analytics for Linked Open Data==
Bo Hu (Fujitsu Laboratories of Europe, Middlesex, UK, bo.hu@uk.fujitsu.com), Eduarda Mendes Rodrigues (Fujitsu Laboratories of Europe, Middlesex, UK), Emeric Viel (Fujitsu Laboratories Ltd, Kawasaki, Japan, emeric.viel@jp.fujitsu.com)

Copyright held by the author/owner(s). LDOW2014, April 8, 2014, Seoul, Korea.

===Abstract===
The LOD initiative has made a major impact on data provision. Thus far, more than 800 datasets have been published, containing tens of billions of RDF triples. The sheer size of the data has not, however, resulted in a significant increase of data consumption. We contend that a new programming paradigm is necessary to simplify the utilisation of LOD data. This paper reports an early-phase development towards a programmable web of LOD data. We propose to tap into the distributed computing environment underpinning the popular statistical toolkit R. Where possible, native R operators and functions are used in our approach so as to lower the learning curve. The crux of our future work lies in the full implementation and evaluation.

'''Categories and Subject Descriptors:''' H.4 [Information Systems Applications]: Miscellaneous; D.2.12 [Interoperability]: Data mapping

'''Keywords:''' Linked Open Data, RDF, R, Programmability

===1. Introduction===
As of mid 2013, a total of 870 datasets had been published as part of the Linked Open Data (LOD) cloud, exposing nearly 62 billion RDF triples in a computer-readable representation format (http://stats.lod2.eu, accessed January 2014). These numbers are still growing rapidly, largely attributable to open governmental data initiatives and "online" high-throughput scientific instruments. As greater amounts of data become available through the LOD cloud, however, the expected virtuous cycle (more data leading to more consumption and thus encouraging further data publication) has not been clearly witnessed. On the contrary, it has been observed that, on many occasions, after an initial spark of interest and test applications, data use at many linked data hosting sites declined significantly [3]. Some critics believe that the massive amounts of semantic-rich data accumulated so far have actually driven away potential users. On the one hand, the semantic layer adds extra abstraction/conceptualisation to the data, making it unsuitable for toolkits tuned for data represented in tabular format. On the other hand, the sheer volume of data renders many semantic web tools less productive. We contend that a major obstacle preventing ordinary users from tapping into the LOD cloud is the lack of a mechanism that allows people to make "sense" out of the overwhelming amount of data. More specifically, in order to facilitate the general uptake of LOD by research communities and practitioners, simply making the data available is not sufficient. It is essential to offer, alongside the data, a means of utilising such resources in a way that is comprehensible to users with a wide range of backgrounds and potentially limited knowledge of semantic technologies.

In this paper, we propose a solution that tightly integrates linked data computing with the popular statistical programming platform R. This brings together two well-established efforts and thus two large user bases: R offers a declarative and well-formed programming language for mining and analysing LOD datasets, while the LOD cloud paves the way to instantaneous access to a large amount of structured data on which existing R functions and packages can be applied.

===1.1 Programmability of LOD===
Thus far, data available through the LOD cloud are accessed primarily using SPARQL. Typically, this is conducted by submitting query scripts to a SPARQL endpoint and, based on the query results, filtering/joining/aggregating (available from SPARQL 1.1) candidate results either on the server side or at the local client. SPARQL is based on set algebra. This is both an advantage and a disadvantage. It resembles the prevailing SQL for RDBs; people familiar with the latter can therefore enjoy a fast learning curve when making the paradigm shift. On the other hand, SPARQL is mainly a query language and thus does not stand out for post-query data processing. In many cases, the results of SPARQL queries are extracted and converted into the native data structures of other programming languages (e.g. Java) for further manipulation.
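To make this workflow concrete, the following minimal sketch queries a public endpoint with the SPARQL package for R [8], discussed further in section 1.2; the endpoint, query and post-processing are illustrative assumptions, not code from the original paper.

 library(SPARQL)
 # status quo: submit a query script to a SPARQL endpoint ...
 endpoint <- "http://dbpedia.org/sparql"
 q <- paste(
   "PREFIX dbo: <http://dbpedia.org/ontology/>",
   "SELECT ?person ?birth",
   "WHERE { ?person a dbo:Scientist ; dbo:birthDate ?birth }",
   "LIMIT 50")
 res <- SPARQL(endpoint, q)$results  # results arrive as an R data frame
 # ... then filter/aggregate the extracted results locally, outside SPARQL
 head(res[order(res$birth), ])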
Equipping and/or enhancing LOD with programmability beyond SPARQL has been investigated previously. The (dis)similarity between RDF, as the underlying data structure of LOD, and the general object-oriented methodology inspired ActiveRDF [5], where semantic data are exposed through a declarative scripting language. In the same direction, RDFReactor [9] packs RDF resources as Java objects, where instances become objects and properties are accessed through Java methods.

Unfortunately, the above integrations have not lowered the threshold to fully exploiting the LOD cloud. Among other reasons, the most prominent ones are the following. It is very difficult for such approaches to deal with missing values and sparse structures, which abound in uncurated or automatically produced collections. The size and quality of the LOD cloud lends itself to statistical data analysis, yet performing such analysis using SPARQL queries can become cumbersome, in many cases requiring recursive SPARQL queries and multiple join operations. Moreover, neither SPARQL nor the integrated frameworks enjoy native support for matrix operations and solving linear equations, characteristics that become increasingly critical in processing large amounts of data.

R, as a dynamic and functional language, offers good capacity to enhance the programmability of LOD and remedy the shortcomings of existing approaches.

===1.2 Why R?===
R is a programming language and a software toolkit for data science. Though not outspoken about it, R is designed for domain experts rather than conventional software programmers. It focuses on transactions that are more familiar to the former, e.g. organising data, manipulating spreadsheets and visualising data. R is open source, with over 2,000 packages/libraries covering a wide variety of data analytics (http://www.r-project.org, accessed January 2014). The most distinctive feature of R is its native support for vector arithmetic. In addition, versatile graphics and data visualisation packages, as well as easy access to a large number of specialist machine learning and predictive algorithms, have made R a widely adopted computing environment in scientific communities (cf. [2]). R is essentially single threaded; scaling R for Big Data analysis can be achieved with RHadoop (https://github.com/RevolutionAnalytics/RHadoop/wiki). In this paper, we focus on adapting R to the LOD data structure.

Integrating R and LOD has been inspected previously. The SPARQL R library [8] aims to expose RDF data and wrap SPARQL endpoints with a black-box-style connector library. Largely in the same vein, the most recent effort, the rrdf library [10], allows loading and updating RDF files through manually crafted RDF-R mappings; the in-memory RDF models can then be queried using SPARQL. We see the following issues with SPARQL-based integration. Firstly, SPARQL queries and the target RDF datasets are not transparent to R users, making it difficult to validate and optimise the processes; arbitrary SPARQL queries can incur global scans that drastically impede system performance. Secondly, the R environment loses regulatory control over SPARQL queries; such blindness subjects the system to safety and security concerns. Finally, domain experts and statisticians are required to manually compose the SPARQL queries, which means learning the fundamentals of RDF and a new query language.
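For comparison, a sketch of the rrdf route [10]; the function names (load.rdf, sparql.rdf) are assumptions based on the package's documented interface, and the file and query are illustrative only.

 library(rrdf)
 # manually crafted RDF-R mapping: load a local RDF file into memory ...
 model <- load.rdf("patients.ttl", format = "TURTLE")
 # ... then query the in-memory model with hand-written SPARQL; the query
 # stays opaque to R, which is the transparency issue noted above
 rows <- sparql.rdf(model,
   "SELECT ?p ?age WHERE { ?p <http://example.org/has_age> ?age }")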
===2. Programmable LOD===
The LOD cloud provides a framework to access and navigate through apparently unrelated data, with conceptual models capable of explicating hidden knowledge. The logic-based axioms underpinning RDF are in many cases not powerful enough to capture all the regularities in the data. We envision that a programming language aiming to utilise and interact with the LOD cloud (and the datasets therein) should preferably present the following characteristics.

Native support for the LOD data structure. The underlying data structure of LOD is RDF triples, which essentially compose a directed, labelled graph. SPARQL, the standard RDF query language, transforms data into tabular form for better alignment with RDB conventions. This extra formatting layer is not always necessary when the underlying data structure can be accessed with native graph operators.

Native support for data analysis. The better data accessibility inherent to LOD presents itself as both an opportunity and a challenge. With better access, an LOD data consumer is exposed to data linked in through semantic relations, most of which he or she may not be aware of. More data is not always a merit: the consumer is likely to be overwhelmed by data with different formats and different semantics, making analysis a struggle. A programming platform capable of dynamically handling different formats becomes desirable.

Ready for distributed processing. Applications accessing the LOD cloud can easily be exposed to billions of triples, tantamount to terabyte-grade data transactions. Single-machine and single-threaded statistical offerings will find themselves struggling in such situations. The programming platform should offer parallelisation capacity for good scalability.

Inspecting R within the scope of the above requirements, we can make the following observations. Firstly, R is a functional language with lazy evaluation, wherein functions are lifted to become first-class citizens. Also, R has a dynamic type system. These fit well with RDF's idiosyncrasies. Secondly, R is designed for statistical computing: missing-value support and sparse matrix handling permeate all R functions and operations. Finally, though R is single-threaded, for many machine learning tasks it is possible to distribute the underlying R data structures and facilitate process distribution over a layer of data abstraction.
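The second observation can be illustrated with a minimal sketch (ours, not the paper's): missing values and sparse structures, which pervade RDF-derived data, are first-class in R.

 library(Matrix)
 ages <- c(5, NA, 6)        # an absent property becomes NA, not an error
 mean(ages, na.rm = TRUE)   # most R functions handle NA natively -> 5.5
 # a sparse subject-object adjacency matrix for a single predicate:
 adj <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = 1, dims = c(3, 3))
 rowSums(adj)               # out-degree per subject, no dense conversion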
===3. System Architecture===
The concept of programmable LOD is experimented with on the BigGraph platform, denoted BGR. BigGraph aims at generic distributed graph storage with a RESTful interface. Figure 1 illustrates the main building blocks of BGR. At the top is the user interface: a BGR user programs using R primaries, with dedicated functions that facilitate the RDF-to-R data type mapping. BGR programs are submitted to a master node, the main entry point through which the user interacts with the system. The runtime at the master is responsible for the following tasks: 1) interpreting BGR programs; 2) interacting with the in-memory graph model for graph transactions; and 3) deciding which data server/worker it should directly query.

[Figure 1: System architecture. Programs in extended R enter through the master node, which hosts the in-memory graph model; each data server pairs an R system with a storage driver on top of the physical storage nodes N0 ... Nn.]

The runtime on each data server mainly consists of two key components: an R environment and a storage driver. Each local R installation executes statistical analyses directly or exposes such analytical capacity through the in-memory graph model. The storage driver is responsible for I/O with the underlying storage unit.

===3.1 Mapping RDF resources to R variables===
The fundamental data structure for storing data in R is the vector; a single integer, for example, is seen as a vector of length one. Variations and extensions of the vector data type include matrices, arrays and data frames. Though RDF graphs can easily be stored as adjacency matrices or adjacency lists, we opt against a full conversion of the LOD cloud, which would add extra computing expense. Rather, a direct one-to-one mapping between RDF resources (classes and instances) and R variables can provide a seamless and smooth integration while at the same time ensuring the integrity of the original data. For instance, an RDF instance becomes an R dataframe consisting of single-element vectors. Similarly, an RDF class can be assigned to a two-dimensional dataframe with rows corresponding to instances and columns to properties. Instance values can be loaded either column-wise or row-wise depending on the analytical and performance requirements. In the following example, column-based initialisation is conducted:

 > s <- data.frame(name=av, age=bv, email=cv)
 > s
     name age        email
 P1   foo   5  foo@bar.com
 P2  john   6 john@bar.com
 ...

Note that in this example, a class resource is extensionally represented by the set of its instances at the snapshot of data loading.
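Row-wise loading, the alternative mentioned above, can be sketched as follows. This is our illustration, reusing the graph_traverse primitive introduced in section 3.4.1; the instance identifiers and property names are assumptions.

 # row-wise initialisation: append one row per instance as it is
 # fetched from the graph; instance ids and properties are illustrative
 s <- data.frame()
 for (v in c("P1", "P2")) {
   ins <- graph_traverse(vertex = v, out_edge = "*")
   s <- rbind(s, data.frame(name = ins$name, age = ins$age,
                            email = ins$email, row.names = v))
 }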
===3.2 Mapping to the underlying storage===
In order to accommodate the sheer size of the LOD cloud and leverage parallel data loading, distributed storage is necessary. We opt for an edge-based storage solution that fits nicely with the principles of a Key-Value Store (KVS) [4]. The KVS plays a key role in our approach to scaling out RDF graphs. RDF triples are, however, not KVS-ready. The first and foremost step is therefore to define the key-value tuples that a standard KVS can conveniently consume. In BGR, the components of a triple are concatenated together and encoded as a UUID, which is then treated as the key, while the value parts of the KVS are reserved for other purposes, e.g. named graph, provenance, and access control.

Each RDF triple is indexed three times. Even though this presents a replication factor of at least three, our approach is justified by query performance and fault recovery considerations. Loading RDF data into R variables normally takes the form of localised range queries, fixing either the subject or the object of the triples and replacing the rest with wildcards. For instance, graph.find(s, null, null) retrieves all the triples of a resource, while graph.find(null, p, o) presents an inverse traversal from object o. By replicating triples, data can be sorted according to not only subjects but also predicates and objects. This improves query execution.

===3.3 Loading graph===
For performance, LOD datasets are treated in the following ways. For datasets with a RESTful API (e.g. DBpedia), the RDF resource to R variable mapping can be realised straightforwardly. Some datasets expose only SPARQL endpoints; SPARQL queries then become necessary, with the restriction that only local scans (e.g. ⟨s, ∗, ∗⟩ or ⟨∗, ∗, o⟩) are permitted. Ideally, the results of a scan are used to construct the local data graph. In the long run, on-demand data crawling can maintain local copies of frequently used datasets, helping to ensure data quality and manage mappings through local data curation.
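A minimal sketch of such restricted, scan-only endpoint access (our illustration, again using the SPARQL package [8]; the endpoint and the local-graph construction step are assumptions):

 library(SPARQL)
 # emulate a local scan <s, *, *>: the subject is fixed, predicate and
 # object are wildcards, so no global scan is incurred at the endpoint
 scan_subject <- function(endpoint, s) {
   q <- sprintf("SELECT ?p ?o WHERE { <%s> ?p ?o }", s)
   SPARQL(endpoint, q)$results
 }
 triples <- scan_subject("http://dbpedia.org/sparql",
                         "http://dbpedia.org/resource/Kawasaki")
 # the scan results would then feed the construction of the local graph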
===3.4 Processing data===
R is inherently a single-threaded application, though parallelisation has been implemented by the snow and snowfall packages [6]. The use of the LOD cloud falls into the following categories, for which we propose solutions to achieve good scalability.

====3.4.1 Bulky processing====
This OLAP-like data processing aims to surface patterns (such as hidden semantic relationships and semantic data clusters) out of data held in the LOD cloud. Such a process is normally performed on preloaded data and is not time-critical. While a plethora of R packages can be leveraged for data mining, the main difficulty lies in populating R dataframes with LOD data in a form that suits R functions. By encoding each RDF resource as one R variable, it is easy to construct matrices that fit special purposes. For many predictive machine learning tasks, voting-based aggregation (e.g. bagging [1]) can distribute the overall learning task to carefully sampled subsets of the target datasets. This can easily be achieved and managed by traversing the graph to the selected subsets of concept instances.

Example. Given a dataset with patient data, the following code fragment splits the set of patient instances into 10 subsets (based on the literature, bagging should take a fraction between 1/2 and 1/50, depending on the size of the sample data). Traversal with named vertices and edges can be carried out along both inbound and outbound directions.

 patient_v <- graph_get_vertex("Patient")
 all_patients <- graph_get(patient_v, edge="rdf:type")
 for(i in 1:10) {
   vname <- paste("p_set", i, sep="")
   assign(vname, sample(all_patients, length(all_patients)/10))
   saveRDS(get(vname), file="...")
 }

Here, we assume the entire set of patient instances will be loaded into memory. Alternatively, a partial loading can be executed to lower the demand for computing resources and reduce latency. In the following example, the edges of the patient resource are indexed, sampling is conducted against the index, and only the selected instances are loaded.

 patient_v <- graph_get_vertex("Patient")
 patient_size <- graph_get_edge_count(patient_v, edge="rdf:type")
 patient_index <- graph_edge_index(patient_v, edge="rdf:type")
 n <- c(1:patient_size)
 ns <- sample(n, patient_size/10)
 for (i in ns) {
   ins <- graph_traverse(patient_v, edge=patient_index[i])
   saveRDS(ins, file="...")
 }

The following code fragment constructs a random-forest-based prognosis model (line 11) for a certain disease based on a patient's gender and age. The patient data are loaded with a graph traversal transaction over the given patient instance vertices and the given outgoing edges (lines 3-6, where the wildcard indicates all outgoing edges). Missing values are set to a default (i.e. age = 75) for simplicity (line 9).

 1: p_partition <- readRDS(file="...")
 2: patients <- data.frame()
 3: for(i in p_partition) {
 4:   p_data <- graph_traverse(vertex=i, out_edge="*")
 5:   patients <- rbind(patients, p_data)
 6: }
 7: size <- length(patients)
 8: training_set <- data.frame(age=patients$has_age,
                               gender=patients$has_gender, ...)
 9: training_set$age[is.na(training_set$age)] <- 75
 10: labels <- as.factor(patients$status)
 11: rfp <- randomForest(training_set, labels)

In this example, we assume that the patient data partitions are passed via a data file residing on disk (line 1). This is for illustrative purposes only and does not exclude shared-memory or message-passing based solutions.
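Because the partitions are materialised as files, the per-partition training above parallelises naturally with snowfall [6]. A minimal sketch under that assumption; the file names, column names and the train_one helper are illustrative, not part of the BGR implementation:

 library(snowfall)
 library(randomForest)
 sfInit(parallel = TRUE, cpus = 4)   # spawn 4 worker processes
 sfLibrary(randomForest)             # load the learner on every worker
 train_one <- function(f) {
   part <- readRDS(f)                # one saved patient partition
   randomForest(part[, c("age", "gender")], as.factor(part$status))
 }
 # one random forest per partition; bagging aggregates their votes
 forests <- sfLapply(paste0("p_set", 1:10, ".rds"), train_one)
 sfStop()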
====3.4.2 Incremental processing====
OLTP-like realtime data processing is supported through an event-driven mechanism that applies classifiers (obtained as in the previous section) to data in an incremental fashion. This incremental characteristic is two-fold. Firstly, the system should detect the difference between existing classified data and new inputs, so as to isolate the changes and restrict reclassification to those differences only. Secondly, the system should update only those classifiers whose input data have changed since the most recent retraining. BGR accommodates both requirements through distributed logging of graph structural changes and localised event propagation observing graph structures. For instance, an "OutEdgeCreatedEvent" is issued by the storage listener if an edge is inserted. This event instance carries information such as the edge (in triple form) and the vertex (vs) on which the edge is created. Events propagate along paths that originate from vs, to avoid global scans. As a result, affected classifiers along the propagation routes are scheduled for update. Note that some machine learning algorithms can be easily adapted to fulfil these requirements (cf. random forest [?]).

Versioning resources. An RDF resource normally consists of multiple triples jointly stating the constraints on the resource. Therefore, the event-driven incremental processing, which only has visibility of individual triples, requires a mechanism to obtain complete statements of the resource. We use versioning to ensure consistency when data are classified and when classifiers are retrained. Version information is stored in the value part of the key-value tuples, and version updates are treated as atomic operations.

Multiple threads. Multi-threaded R is not likely to be available in the near future. As spawning threads is not possible, BGR runs multiple processes communicating through sockets. For instance, one R process listens to the underlying storage driver, fetching graph structural events through a dedicated socket address. The events are then parsed to extract the event types, the triples that raised the events, and the versions of those triples. Other R processes handle the events and dispatch them for further actions when necessary, again by writing to a socket address. Socket-based communication may not provide ideal performance; in many cases it becomes the main performance bottleneck. It does, however, offer the most cost-effective way to increase parallelism without dismantling R.
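The listener process described above can be sketched with base R socket primitives; the ports, the line-oriented message format and the dispatch target are assumptions for illustration only:

 # one R process fetching graph structural events from the storage
 # driver and dispatching them to a handler process over sockets
 con <- socketConnection(port = 6311, server = TRUE,
                         blocking = TRUE, open = "r")
 out <- socketConnection(host = "localhost", port = 6312, open = "w")
 repeat {
   line <- readLines(con, n = 1)
   if (length(line) == 0) break          # driver closed the connection
   ev <- strsplit(line, "\t")[[1]]       # -> c(type, triple, version)
   if (ev[1] == "OutEdgeCreatedEvent")
     writeLines(line, out)               # schedule classifier update
 }
 close(con); close(out)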
===3.5 Resource local processors===
We advocate and practise a declarative and resource-centric approach in BGR. More specifically, the expected analytics are constructed at the resource level and associated with the target resource through RDF property declarations. For instance, the following RDF triples assign an R random-forest classifier (defined in section 3.4.1) to a resource (i.e. the "Patient" class).

 :Patient a owl:Class ;
     rdfs:subClassOf
       [ a owl:Restriction ;
         owl:onProperty :has_behaviour ;
         owl:someValuesFrom
           [ a owl:Class ;
             owl:oneOf (:new_patient_behaviour
                        :update_patient_behaviour) ] ] .
 ...
 :new_patient_behaviour
     a :Behaviour , owl:NamedIndividual ;
     :event :onNewInstanceAdded ;
     :has_handler "R:rfp" .

This essentially defines how a resource (e.g. Patient) reacts to (or behaves against) events (e.g. the onNewInstanceAdded event), realised using the attached process (e.g. R:rfp). At the ontology class level, enumeration (owl:oneOf) is used to establish the conceptual relationship between the Patient class and the desired functionalities w.r.t. the corresponding events. The actual implementation of behaviour instances can be realised, for example, in R. Depending on the size of the compiled code, the implementation can be stored either entirely in the value part of the KV tuple ⟨:new_patient_behaviour, :has_handler, "R:rfp"⟩ or separately, with a pointer from the value part of the tuple. When a new patient instance is asserted, an event is raised which triggers the embedded R function to react to the change in the storage.

Several advantages are evident from assigning behaviour to, and storing its implementation close to, a resource. Firstly, for distributed data storage, this implies close proximity of data and process localities. Secondly, behaviour supports the reactive programming principle by packing small process units against very specific data units. Thirdly, data behaviours and their implementations are conceptualised with well-formed RDFS constructs. This facilitates ontological inference when necessary, though with caveats: i) increased inference complexity and ii) anonymous resources complicating RDF query handling.

===4. Preliminary Results===
BGR is still under development. This section reports the system design considered so far and lists potential future work.

The underlying graph storage is a distributed KVS based on HBase. HBase also handles data partitioning, locality, replication and fault tolerance. A Jena graph introduces the necessary abstraction layer for indexing and retrieving triples in the KVS. A simple graph programming interface is responsible for graph traversal and scan operations. It follows the Tinkerpop Blueprints convention (https://github.com/tinkerpop/blueprints/wiki) and currently talks to the Jena graph so as to construct resource subgraphs from the edge-based storage data structure. The use of Jena is mainly for the convenience of leveraging Jena models when in-memory ontology inference becomes necessary. In the future, direct communication between the storage and the graph API is expected to improve overall system performance, at the price of reduced ontological inference capacity.

Both the storage and graph modules are implemented in Java. R communicates with the storage driver through an R-Java interfacing library, the rJava package [7]. Calling Java methods is straightforward, as illustrated in the following example:

 .jinit()
 # do something before loading the graph
 g.obj <- .jnew("Graph")
 # do something else
 graph.find <- function(x, y) {
   .jcall(g.obj, "S", "find", x, y)
 }

We intend to minimise the effort of extending R, i.e. to avoid introducing compiled R packages, mainly out of practical considerations. It lowers the learning curve for people already familiar with R, as essentially no extra operators need to be learned. It also increases the visibility of data management with respect to the underlying data structure.

===5. Conclusions===
This paper calls for user-friendly and programmable LOD, achieved by leveraging and enhancing R, a free software toolkit for statistical computing and graphics.

Note that there are a few R packages (e.g. bigmemoRy) that aim in particular at Big Data computing. There are also R packages (e.g. foreach, ff, etc.) for strengthening R parallelism. Our proposal does not compete with such existing solutions; rather, it advocates a collaboration of two independent efforts and provides solutions that fit the visions and requirements of the linked data paradigm.

Nor do we see competition with the RESTful movement, such as the Linked Data Platform (LDP, [?]), which has already gained momentum in the LOD community. LDP works at a layer lower than the proposed LOD/R integration, assisting data exposure so that the data can be consumed by the BGR functions and operators.

===6. References===
[1] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, July 1999.

[2] B. Everitt and T. Hothorn. A Handbook of Statistical Analyses Using R. CRC Press, Boca Raton, Fla., 2010.

[3] N. C. Helbig, A. M. Cresswell, B. Burke, and L. Luna-Reyes. The dynamics of opening government data. Technical report, Nov. 2012.

[4] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, Apr. 2010.

[5] E. Oren, B. Heitmann, and S. Decker. ActiveRDF: Embedding semantic web data into object-oriented languages. Web Semant., 6(3):191–202, Sept. 2008.

[6] L. Tierney, A. J. Rossini, and N. Li. Snow: A parallel computing framework for the R system. International Journal of Parallel Programming, 37(1):78–90, 2009.

[7] S. Urbanek. rJava: Low-Level R to Java Interface, 2009. R package version 0.8-1.

[8] W. R. van Hage and T. Kauppinen. SPARQL package for R, 2011. Available at http://linkedscience.org/tools/sparql-package-for-r.

[9] M. Völkel. RDFReactor – from ontologies to programmatic data access. In Proc. of the Jena User Conference 2006. HP Bristol, May 2006.

[10] E. Willighagen. Accessing biological data with semantic web technologies. http://dx.doi.org/10.7287/peerj.preprints.185v1, 2013.