Beazley: a New Storage Systems Evaluation

Mikalai Yatskevich (mikalai.yatskevich@comlab.ox.ac.uk), Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, UK

Ian Horrocks (ian.horrocks@comlab.ox.ac.uk), Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford OX1 3QD, UK

Graham Klyne (graham.klyne@zoo.ox.ac.uk), Zoology Department, Oxford University, South Parks Road, Oxford OX1 3PS, UK

Evaluation is a major issue in the development of systems, sometimes as important as the implementation of a system itself. In the Semantic Web area, and especially in the area of storage systems that provide a persistence layer for ontologies and instance data, evaluation efforts have been intermittent and area specific. In this paper we propose a new dataset for storage systems evaluation, called the Beazley dataset. The complete dataset version includes more than 16 million triples and 35 queries. We evaluate the dataset using several storage models of state-of-the-art storage systems.

Introduction

Evaluation is a systematic assessment of system properties against a set of predefined criteria. Evaluation is a major issue in the development of systems, sometimes even as important as the implementation of a system itself. It has been shown in the past that performance evaluation can help implementers to better understand the sources of intractability and/or inefficiency in their systems, and to propose novel optimization techniques in an effort to make their systems more scalable in specific application scenarios.

Storage systems, often called RDF stores, provide a persistence layer for ontologies and instance data. They provide basic reasoning services such as computing the transitive closure of subsumption hierarchies. Storage systems differ from description logic (DL) reasoners, which provide more complex reasoning services but do not provide storage facilities. The main inference services in DL reasoners can be performed as concept satisfiability checks. For RDF stores, the main inference service is query answering.
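As an illustration of the basic reasoning service mentioned above, the following is a minimal sketch of computing the transitive closure of a subsumption hierarchy. It is not taken from any of the evaluated systems; the function name and the class names in the example are hypothetical, and the hierarchy is assumed to be given as direct (subclass, superclass) pairs.

```python
from collections import defaultdict

def subsumption_closure(direct_subclass_pairs):
    """Return all (subclass, superclass) pairs implied by transitivity."""
    parents = defaultdict(set)
    for sub, sup in direct_subclass_pairs:
        parents[sub].add(sup)

    closure = set()
    for start in list(parents):
        # Depth-first walk up the hierarchy from each class.
        stack = list(parents[start])
        seen = set()
        while stack:
            sup = stack.pop()
            if sup in seen:
                continue
            seen.add(sup)
            closure.add((start, sup))
            stack.extend(parents.get(sup, ()))
    return closure

# Hypothetical class names, for illustration only.
hierarchy = [("Vase", "Artifact"), ("Artifact", "PhysicalObject")]
inferred = subsumption_closure(hierarchy)
```

A store that materializes this closure at load time trades loading time and space for faster query answering; a store that computes it at query time makes the opposite trade.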

In the Semantic Web area, and especially in the area of storage systems, evaluation efforts have been intermittent and area specific. There is no agreed standard or methodology for systems evaluation. In the evaluation of DL systems, artificially generated datasets played an important role [13] until a large number of real-world ontologies had been developed. These real-world ontologies have since been used in large-scale evaluation efforts [9]. The evaluation of storage systems, by contrast, has focused on artificially generated datasets [12,17,10]. Thus, the evaluation of storage systems would benefit from real-world datasets that overcome the limitations of state-of-the-art generation methods. The most common instance generator and evaluation suite used by the Semantic Web community for storage systems evaluation is the Lehigh University Benchmark (LUBM) [12]. Although LUBM is used mainly for testing instance retrieval and query answering algorithms, it has many shortcomings. First, the $\mathcal{ALEHI}_{R^+}$ DL it uses is significantly less expressive than the DL underpinning OWL. Moreover, the data created for each university are completely independent. Consequently, if one applies a clustering method during loading, queries can be answered over each university independently.

In this paper we propose a new dataset for storage systems evaluation, called the Beazley dataset. The dataset comprises real-world archeological data gathered in the CLAROS initiative [15] and a set of queries used in the CLAROS web site application. The dataset presents information about archeological artifacts. It instantiates the CIDOC-CRM OWL DL ontology [8]. The complete dataset version includes more than 16 million triples and 35 queries. We evaluate the dataset using both memory-based and disk-based storage models of two state-of-the-art storage systems.

The paper is structured as follows. Section 2 provides a brief introduction to the storage systems along with the datasets used for their evaluation. Section 3 provides a detailed description of the Beazley dataset. Section 4 provides a detailed description of the dataset evaluation set up. Section 5 describes the evaluation results. Section 6 concludes the paper.

Related Work

The majority of the evaluation efforts in the storage systems area have focused on artificially generated datasets, which provide a mechanism to cover a class of inputs in a scalable manner. The most prominent example is the Lehigh University Benchmark (LUBM) [12]. LUBM consists of a small ontology, with 43 classes, 25 roles, 85 TBox axioms and 8 RBox axioms, and several Java classes that can be used to create instance assertions (ABox) for this specific TBox and RBox. The ontology describes universities, i.e. courses, students, departments and publications, as well as their interrelations. For example, a student is enrolled in some courses that are taught by some academic staff, while academic staff are associated with publications, are affiliated with other universities, lead research teams or are heads of departments. The DL of LUBM is $\mathcal{ALEHI}_{R^+}$; nevertheless, it does not make heavy use of the constructors, since there are just one transitive role, 5 sub-role axioms and 2 inverse role axioms. The ABox is created following the method described in [4]. Finally, the benchmarking suite also comes with 14 queries that are proposed for testing a system against the generated ABoxes.

Although LUBM is used mainly for testing instance retrieval and query answering algorithms, it has many shortcomings. First, the DL used is significantly less expressive than the DL underpinning OWL. Moreover, the data created for each university are completely independent. Consequently, if one applies a clustering method during loading, queries can be answered over each university independently. The University Ontology Benchmark (UOBM) [17] is an extension of LUBM intended to remedy these problems. UOBM extends LUBM by adding more concepts and roles that are intended to connect individuals from different universities. Although UOBM is still not large when compared to ontologies such as NCI [11] or GALEN [21], it uses a relatively expressive ontology language, $\mathcal{SHIN}(D)$. A set of test queries is also offered. Unfortunately, although UOBM does make use of more complex constructors and is more structurally complex, it has not been widely accepted by the Semantic Web community.

The Berlin SPARQL benchmark [5] focuses on the integration and visualization of data from various data sources. It is built around a scenario that does not require heavyweight reasoning. The class hierarchy is generated randomly. The query mix includes 25 queries that represent navigation patterns in an e-commerce use case. SP2Bench [22] uses the DBLP [16] bibliographic scenario. The ontology used has 9 classes and 77 properties. The query mix includes 11 queries utilizing various SPARQL language constructs. The Billion Triple Challenge [3] aims at evaluating the ability of Semantic Web applications to process large quantities of RDF data represented by various schemata.

The Beazley dataset

The Beazley dataset [15] presents information about archeological artifacts. The RDF data instantiates the CIDOC-CRM ontology [8]. The complete dataset version includes more than 16 million triples. The frequency of triples in the dataset as a function of the predicate value, $f_p = \mathrm{freq}(D, p)$, is depicted in Figure 1a. The frequency of triples with a given subject value frequency, $f_{s_n} = \mathrm{freq}(D, s_n)$, depicted in Figure 1b, varies depending on the predicate value $p$. The frequencies $f_{s_n} = \mathrm{freq}(D, s_n, \textit{is represented})$ and $f_{s_n} = \mathrm{freq}(D, s_n, \textit{is referred})$ are depicted in Figures 1c and 2a. This makes the Beazley archive dataset different from RDF datasets produced using automatic generation procedures [12,17], which assume uniform distributions and, hence, uniform frequencies. The frequency of triples with a given object value frequency, $f_{o_n} = \mathrm{freq}(D, o_n)$, depicted in Figure 2b, varies depending on the predicate value $p$ and the subject value frequency $s_n$. The frequency $f_{o_n} = \mathrm{freq}(D, o_n, s_n, \textit{has time span})$, with $f_{o_n} \in [1, 30]$, is depicted in Figure 2c. The frequencies $f_{o_n} = \mathrm{freq}(D, o_n, \textit{took place})$ and $f_{o_n} = \mathrm{freq}(D, o_n, \textit{not after})$ are depicted in Figures 3a and 3b.

The query set used in the CLAROS web site application [15] is composed of 35 queries of various size and complexity. Different queries are executed a different number of times during the web application life cycle. They can be classified into two large groups. The first query group, QG1, comprises Q1-Q18, presented in Table 1. The queries from QG1 are executed at
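The frequency measures defined in this section can be sketched in a few lines of Python. This is only one possible reading of the definitions, not the procedure used to produce the figures: the function names are hypothetical, the dataset D is assumed to be an in-memory list of (subject, predicate, object) tuples, and the subject value frequency is interpreted as a histogram of how many subjects occur in exactly n triples.

```python
from collections import Counter

def predicate_frequency(triples):
    """freq(D, p): number of triples in D for each predicate p."""
    return Counter(p for _, p, _ in triples)

def subject_frequency_histogram(triples, predicate=None):
    """Map n to the number of subjects that occur in exactly n triples.

    With a predicate given, only triples using that predicate are counted,
    which corresponds to the assumed reading of freq(D, s_n, p).
    """
    per_subject = Counter(
        s for s, p, _ in triples if predicate is None or p == predicate
    )
    return Counter(per_subject.values())

# Tiny hypothetical sample; identifiers are illustrative only.
sample = [
    ("vase1", "is_represented", "img1"),
    ("vase1", "is_represented", "img2"),
    ("vase2", "is_represented", "img3"),
    ("vase1", "has_time_span", "span1"),
]
by_predicate = predicate_frequency(sample)
overall = subject_frequency_histogram(sample)
represented = subject_frequency_histogram(sample, "is_represented")
```

For a dataset of 16 million triples an in-memory list would likely be replaced by a streaming pass over the serialized data, but the counting logic would stay the same.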

The evaluation set up

We evaluated the dataset using both disk-based and memory-based storage models of two state-of-the-art storage systems, giving four configurations: Jena TDB, Jena ARQ, Sesame-memory and Sesame-native. Jena [18] is a Java framework for building Semantic Web applications. It includes OWL [19] and RDF [14] APIs, in-memory and persistent storage models, and a SPARQL [20] query engine. ARQ [1] is a general-purpose query engine, supporting SPARQL and other query languages, that can utilize several Jena storage models. In our experiments we used ARQ with the in-memory storage model. TDB [2] is a high-performance native storage engine that exploits a custom indexing strategy. Sesame [7] is an open source Java framework for storing and querying RDF data. Sesame supports the SPARQL and SeRQL [6] query languages, and both memory-based and disk-based storage. We evaluated the systems through their user interfaces.

The evaluation was performed on an AMD Phenom II 2600 MHz processor with 8 GB of main memory.

The data used in the evaluation included the Beazley dataset with 16 million triples and a reduced version with 10 million triples. The dataset loading time, the query execution time and the total query set time were measured. The queries per hour (QpH) and seconds per query (SpQ) measures were calculated assuming that each query in QG10 is executed 10 times while each query in QG1 is executed once. This setting represents the CLAROS application query mix.
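The two aggregate measures can be sketched as follows. This is an assumed reading rather than a formula given in the paper: seconds per query is taken as the mean execution time over the weighted query mix, and queries per hour as 3600 divided by that mean. This reading appears consistent with the aggregate rows of Table 3. The function name and the sample timings are hypothetical.

```python
def query_mix_measures(times_qg1, times_qg10, repeats_qg10=10):
    """Return (seconds_per_query, queries_per_hour) for the weighted mix.

    Each QG1 query counts once; each QG10 query counts repeats_qg10 times.
    Assumes at least one query and a non-zero total time.
    """
    executions = len(times_qg1) + repeats_qg10 * len(times_qg10)
    total_seconds = sum(times_qg1) + repeats_qg10 * sum(times_qg10)
    spq = total_seconds / executions
    qph = 3600.0 / spq
    return spq, qph

# Hypothetical timings in seconds: two QG1 queries and one QG10 query.
spq, qph = query_mix_measures([2.0, 4.0], [1.0])
```

Because the QG10 queries are weighted ten times, a single slow query in that group dominates both measures far more than a slow query in QG1.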

The system performance

The data loading times are presented in Table 2. The memory-based models, ARQ and Sesame-memory, were not able to load the complete Beazley dataset. There was insufficient memory for ARQ. The loading into Sesame-memory was terminated after 5 days. The reduced Beazley dataset version was loaded in less than 1 hour by all the systems. The Jena memory-based and disk-based models were more efficient in data loading than the Sesame models.

The query answering times and the other query performance measures are presented in Table 3.

The systems showed performance ranging from 1 millisecond to 74.8 hours per query. The native Sesame storage model was more than 5 times more efficient than its memory storage model. ARQ was 2 orders of magnitude less efficient than TDB overall. However, it was more than an order of magnitude more efficient than TDB when query Q33 was excluded from the query set. This query took ARQ 74.8 hours to execute and thus dominated the total result; the other 34 queries were completed in 270 seconds. None of the systems was able to execute the complete query mix on the complete dataset in less than 10 minutes, which makes the CLAROS application and, therefore, the Beazley dataset challenging for state-of-the-art storage systems.

The quality of the query answering results is affected by the quality of the original data input. Making improvements to the incoming data (which are obtained by extraction from existing databases) is an ongoing activity, which the Beazley Archive team are addressing by (a) improving the data extraction processes, (b) applying heuristics to clean up some of the data values (e.g. dates), (c) highlighting inconsistencies that are detected by the extraction processes and passing these back to the data originators for correction, and (d) using thesauri and authority lists to map terminology variations to common terms.

Conclusion

The query set tested in the paper was used in the initial development of the CLAROS application. Naively constructed, it was designed mainly to provide functionality rather than performance. The new version of the CLAROS application will include an updated query set designed with partners from the Jena team to identify bottlenecks and improve the queries. The goal of these efforts is to redesign the queries to achieve sub-second response times. The strategies for the dataset improvement are (1) pre-calculation of certain path queries to reduce run-time joins (roughly equivalent to "materialized views" in relational data), and (2) use of additional indexes associated with "virtual properties" that can reduce the need for in-memory sorting of results when processing SPARQL queries (analogous to schema-defined indexes in relational databases). Essentially, 4 techniques have been used:

1. reordering of queries so that more selective elements are evaluated earlier (this can also be performed automatically by the ARQ query processor in Jena);
2. "materialization" of property paths and UNIONs in queries: adding "short cut" properties to the triple store, and using these properties in queries;
3. customized indexes for finding the earliest and latest occurrences of a given object type, and also for providing consistent ordering in other keyword-based object access queries. These new indexes are not Lucene-based, as originally intended, because Lucene's handling of result sorting is less scalable than had been anticipated. Instead, a simple arrangement of flat files named by keywords, with contents sorted by the ordering key, is used;
4. pre-calculation of object counts by various categories, so that counting queries can run without having to access every matching object.
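The second technique in the list above, materialization of a property path as a "short cut" property, can be sketched as follows. This is an illustrative in-memory version, not the CLAROS implementation: the function name and the shortcut property name `latest_date` are hypothetical, and the triple store is assumed to be a plain set of (subject, predicate, object) tuples.

```python
from collections import defaultdict

def materialize_path(triples, first, second, shortcut):
    """Add (s, shortcut, o) for every path s -first-> m -second-> o."""
    # Index the second step of the path by its subject.
    second_step = defaultdict(set)
    for s, p, o in triples:
        if p == second:
            second_step[s].add(o)

    # Join each first-step triple with the matching second-step objects.
    derived = {
        (s, shortcut, o)
        for s, p, m in triples
        if p == first
        for o in second_step.get(m, ())
    }
    return set(triples) | derived

# Hypothetical data; "latest_date" is an invented shortcut property.
store = {
    ("vase1", "has_time_span", "span1"),
    ("span1", "not_after", "-450"),
}
store = materialize_path(store, "has_time_span", "not_after", "latest_date")
```

After materialization, a query that previously needed a join over the two original properties can match the single shortcut property instead, at the cost of extra triples that must be kept in sync when the source data change.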

Our hope is that this kind of ad-hoc optimization work can suggest ways forward for more principled ontology-based optimization of triple store access. We intend that this revised system will be the basis of a public version of the CLAROS application developed by academic groups who are focused on application of the technologies rather than technology research.

Fig. 1. a) The frequencies of predicates; b) the frequencies of the triples with a given subject value frequency; c) the frequencies of the triples with a given subject value frequency for predicate = is represented.

Table 1. The Beazley query set.

Columns: Variables, Joins, Text search, Ordering, Comparisons, OPTIONAL, BOUND

Q1 20
Q2 1115XX
Q3 1115XX
Q4 914XX
Q5 67XX
Q6 1710XX
Q7 12
Q8 12
Q9 52
Q10 9142X
Q11 20312XXX
Q12 11192XX
Q13 712XX
Q14 611X
Q15 1832XXX
Q16 1120XX
Q17 11202X
Q18 19322XXX
Q19 1014X
Q20 1422X2X
Q21 2032XX2X
Q22 1219X2X
Q23 1322X2X
Q24 21322XX2X
Q25 17282X2X
Q26 1220X2X
Q27 8122X
Q28 1424X2X
Q29 11202X
Q30 1119X2X
Q31 1832XX2X
Q32 1628XX2X
Q33 1120X2X2X
Q34 1525XX2XX2X
Q35 1323X2X

Table 2. Loading times, seconds.

| Dataset | ARQ | TDB | Sesame-native | Sesame-memory |
|---|---|---|---|---|
| Beazley-16Mt | N/A | 1445.68 | 1878.43 | N/A |
| Beazley-10Mt | 360.19 | 434.07 | 1087.14 | 2991.44 |

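Table 3. Query answering times and aggregate performance measures on the Beazley dataset. Times are in seconds unless a unit is given (m = minutes, h = hours).

| Query | Sesame-native (16Mt) | TDB (16Mt) | Sesame-memory (10Mt) | ARQ (10Mt) | Sesame-native (10Mt) | TDB (10Mt) |
|---|---|---|---|---|---|---|
| Q1 | 0.03 | 0.05 | 0.02 | 0.006 | 0.03 | 0.12 |
| Q2 | 39.82 | 119.14 | 210.34 | 5.96 | 38.29 | 171.7 |
| Q3 | 38.39 | 125.16 | 217.98 | 4.93 | 37.33 | 115.91 |
| Q4 | 26.01 | 94.86 | 137.12 | 6.89 | 22.2 | 89.25 |
| Q5 | 39.9 | 97.32 | 128.05 | 5.11 | 38.62 | 90.35 |
| Q6 | 57.04 | 139.98 | 343.98 | 25.95 | 56.24 | 135.24 |
| Q7 | 0.001 | 0.09 | 0.001 | 0.01 | 0.001 | 0.08 |
| Q8 | 0.002 | 0.09 | 0.001 | 0.01 | 0.001 | 0.08 |
| Q9 | 19.66 | 70.91 | 39.02 | 37.22 | 11.2 | 38.52 |
| Q10 | 16.34 | 91.5 | 58.87 | 25.2 | 13.08 | 88.78 |
| Q11 | 21.57 | 186.85 | 93.36 | 4.22 | 18.3 | 197.23 |
| Q12 | 19.52 | 105.94 | 80.26 | 1.24 | 16.01 | 102.27 |
| Q13 | 18.15 | 6.99 | 35.63 | 1.23 | 8.39 | 3.36 |
| Q14 | 49.66 | 111.32 | 209.43 | 7.17 | 48.31 | 110.1 |
| Q15 | 33.7 | 105.46 | 222.81 | 6.32 | 29.69 | 99.14 |
| Q16 | 31.48 | 122.49 | 194.79 | 5.73 | 27.13 | 119.2 |
| Q17 | 19.69 | 106.55 | 81.16 | 5.84 | 15.47 | 99.42 |
| Q18 | 22.63 | 173.64 | 106.07 | 5.53 | 18.52 | 168.48 |
| Q19 | 32 | 114 | 140.64 | 5.36 | 31.06 | 112.16 |
| Q20 | 25.99 | 122.97 | 125.51 | 5.72 | 21.55 | 116.97 |
| Q21 | 31.61 | 147.94 | 182.62 | 25.64 | 26.75 | 149.24 |
| Q22 | 35.99 | 133.69 | 206.71 | 5.53 | 38.02 | 126.08 |
| Q23 | 0.43 | 109.18 | 0.12 | 5.37 | 0.4 | 103.41 |
| Q24 | 25.71 | 154.82 | 116.25 | 5.09 | 20.7 | 142.33 |
| Q25 | 24.47 | 139.02 | 114.79 | 5.13 | 18.31 | 132.38 |
| Q26 | 18.21 | 5.75 | 57.63 | 9.56 | 12.31 | 5.91 |
| Q27 | 13.77 | 5.56 | 46.6 | 7.48 | 13.41 | 5.04 |
| Q28 | 20.51 | 12.46 | 90.98 | 5.39 | 14.83 | 11.92 |
| Q29 | 19.1 | 6.38 | 69.27 | 5.01 | 12.66 | 5.73 |
| Q30 | 36.75 | 131.25 | 207.13 | 4.78 | 38.63 | 123.26 |
| Q31 | 32.77 | 148.84 | 193.96 | 5.65 | 27.18 | 142.03 |
| Q32 | 30.22 | 135.05 | 162.23 | 4.76 | 24.15 | 137.57 |
| Q33 | 2.39 | 8.52 | 0.005 | 74.8h | 0.007 | 2.41 |
| Q34 | 3.33 | 12.91 | 0.005 | 4.95 | 0.008 | 4.86 |
| Q35 | 0.48 | 111.27 | 0.12 | 5.01 | 0.4 | 100.23 |
| Total | 13.45m | 52.63m | 64.55m | 74.87h | 11.65m | 50.84m |
| QpH | 169.54 | 40.63 | 35.05 | 0.25 | 198.86 | 42.71 |
| SpQ | 21.23 | 88.59 | 102.68 | 14330.4 | 18.1 | 84.28 |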

Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST 2010). Shanghai, China. November 8, 2010.

We described a new dataset for storage systems evaluation, called the Beazley dataset. The dataset proved to be challenging for state-of-the-art storage systems. In fact, none of the systems evaluated was able to demonstrate the level of performance needed for the real-world application that uses the dataset. The work suggests that Semantic Web technologies applied indiscriminately (or naively) may not always yield acceptable performance, but that significant performance improvements are possible through judicious optimizations to the stored data and the queries used, without distorting the semantic coherence of the original data. Performance improvement work to date has been ad hoc, but suggests some strategies that might be considered for automated query optimization.

References

[1] ARQ - A SPARQL processor for Jena.
[2] TDB - A SPARQL database for Jena.
[3] The Billion Triple Challenge.
[4] D. Bitton, D. DeWitt, C. Turbyfill. Benchmarking database systems: a systematic approach. In VLDB '83: Proceedings of the 9th International Conference on Very Large Data Bases, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc., 1983.
[5] C. Bizer, A. Schultz. The Berlin SPARQL benchmark. Int. J. Semantic Web Inf. Syst., 5(2), 2009.
[6] J. Broekstra, A. Kampman. SeRQL: A Second Generation RDF Query Language. In SWAD-Europe Workshop on Semantic Web Storage and Retrieval, 2003.
[7] J. Broekstra, A. Kampman, F. van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. In I. Horrocks, J. Hendler (eds.), Proceedings of the first Int'l Semantic Web Conference (ISWC 2002), Sardinia, Italy. Lecture Notes in Computer Science 2342. Springer Verlag, May 2002.
[8] M. Doerr, D. Iorizzo. The dream of a global knowledge network - a new approach. J. Comput. Cult. Herit., 1(1), June 2008.
[9] T. Gardiner, I. Horrocks, D. Tsarkov. Automated benchmarking of description logic reasoners. In Proc. of the 2006 Description Logic Workshop, volume 189, 2006.
[10] G. Stoilos, B. Cuenca Grau, I. Horrocks. How incomplete is your semantic web reasoner? In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2010), 2010. To appear.
[11] J. Golbeck, G. Fragoso, F. Hartel, J. Hendler, J. Oberthaler, B. Parsia. The national cancer institute's thesaurus and ontology. J. Web Sem., 1(1), 2003.
[12] Y. Guo, Z. Pan, J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2-3), 2005.
[13] I. Horrocks, P. F. Patel-Schneider, R. Sebastiani. An analysis of empirical testing for modal decision procedures. Logic Journal of the IGPL, 8(3), 2000.
[14] G. Klyne, J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation, 10 February 2004.
[15] D. Kurtz, G. Parker, D. Shotton, G. Klyne, F. Schroff, A. Zisserman, Y. Wilks. CLAROS - Bringing Classical Art to a Global Public. In International Conference on e-Science and Grid Computing, 2009.
[16] M. Ley. DBLP database.
[17] L. Ma, Y. Yang, Z. Qiu, G. T. Xie, Y. Pan, S. Liu. Towards a complete OWL ontology benchmark. In ESWC, 2006.
[18] B. McBride. Jena: Implementing the RDF Model and Syntax Specification. In SemWeb, 2001.
[19] P. Patel-Schneider, P. Hayes, I. Horrocks. OWL Web Ontology Language semantics and abstract syntax. W3C Recommendation, 10 February 2004.
[20] E. Prud'hommeaux, A. Seaborne. SPARQL Query Language for RDF. W3C Recommendation, 15 January 2008.
[21] A. Rector, W. Nowlan, S. Kay. Foundations for an electronic medical record. Methods Inf. Med., 30(3), 1991.
[22] M. Schmidt, T. Hornung, N. Küchlin, G. Lausen, C. Pinkel. An experimental comparison of RDF data management approaches in a SPARQL benchmark scenario. In ISWC '08: Proceedings of the 7th International Conference on The Semantic Web. Springer-Verlag, Berlin, Heidelberg, 2008.