Evaluating semantic data infrastructure components for small devices

Andriy Nikolov, Ning Li, Mathieu d'Aquin, Enrico Motta

Knowledge Media Institute, The Open University, Milton Keynes, UK
{a.nikolov, n.li, m.daquin, e.motta}@open.ac.uk

Abstract. Existing benchmarks for semantic data management tools primarily focus on evaluating the capability of tools to process large-scale data. In use case scenarios involving small-scale and mobile devices, the data volumes are typically smaller; however, the ability to operate with limited computational resources becomes important. In this paper, we describe the experiments we performed to evaluate components of a semantic data infrastructure for devices with limited computational power.

1 Introduction

Performance is an important criterion for the selection of semantic infrastructure components such as data stores, rule engines, and query engines. So far, development effort has primarily focused on large-scale data processing scenarios. In such scenarios, the tools are assumed to have access to considerable computational resources (powerful servers and even clusters of machines) and are expected to process large volumes of data. Because of this, evaluation benchmarking tests (e.g., the Berlin SPARQL benchmark [1] for triple stores or OpenRuleBench [2] for rule engines) are usually conducted on powerful machines and involve large datasets. With the widespread use of small-scale and mobile devices, new scenarios appear in which computational resource limitations become important. For example, in the scope of the SmartProducts project (http://www.smartproducts-project.eu/), the focus is on "smart products" (e.g., cars and kitchen appliances), which are able to communicate with the user and between themselves and to apply embedded "proactive knowledge" to assist the users with their tasks. In this scenario, it is important to choose semantic data processing tools which not only require less time to process data, but also require fewer computational resources (in particular, memory) to produce results.

Semantic data processing infrastructure for smart products involves two main components: a triple store which stores RDF data and a rule engine which is able to make inferences based on these data and solve user tasks. In this paper we discuss the experiments conducted to select appropriate components to support our use case scenario. We especially focus on the particularities of evaluating such components when targeting small, resource-limited devices. This includes, in particular, assessing not only the response time of the tools but also their average resource consumption for different amounts of data.

The rest of the paper is organised in the following way. In section 2, we discuss the requirements imposed by our use case scenario and the reasons for our preliminary choice of tools. Based on these requirements, we build test datasets to perform comparative tests. Section 3 summarises our earlier experiments comparing the performance of RDF triple stores on small- and medium-scale datasets. Section 4 describes the experiments we performed with two selected rule engines, Jess (http://www.jessrules.com/) and BaseVISor (http://vistology.com/basevisor/basevisor.html), and discusses their results. Finally, section 5 discusses directions for future work.
2 Requirements

The SmartProducts project envisages a scenario where user appliances (e.g., a cooking guide, a car computer) make use of embedded proactive knowledge in order to communicate and cooperate with humans, other products, and the environment. Proactive knowledge includes knowledge about the product itself (features, functions, dependencies, e.g., the oven's maximum temperature), its environment (e.g., other devices in the kitchen, food items stored in the fridge), its users (e.g., health profile), and the usage context (e.g., a cake is being prepared for a birthday party). This knowledge is formally described using ontologies, contained in a semantic data store, and utilised using formal reasoning mechanisms. The data management infrastructure must support these functionalities. When selecting existing tools to reuse within this infrastructure, two kinds of requirements have to be taken into account:

– Functional requirements related to the desired capabilities of the tools.
– Pragmatic requirements related to the computational resources needed by the tools.

In order to support its required tasks, a smart product needs to possess several reasoning capabilities. First, in order to make use of semantic data represented in RDF, its inference engine must support inferencing based on ontological axioms expressed in RDFS or OWL. Second, and even more importantly, in order to exhibit situation awareness and react to changes in the environment, the inference engine has to support reasoning with custom-defined rules. As a result of rule firing, not only can the knowledge base be updated with new facts, but an arbitrary action can also be triggered, e.g., starting a user interaction (a minimal example is sketched at the end of this section). Thus, for our scenario we considered general-purpose rule engines (in particular, Jess and BaseVISor) rather than OWL-DL reasoners based on description logic.

The pragmatic requirements are imposed by the choice of the hardware platform. Smart products are assumed to be implemented on devices with reasonable computational capabilities, high-speed networking, and Linux operating system support (e.g., Gumstix boards, http://www.gumstix.com/, or smartphones). The actual hardware parameters may vary within each device class: for instance, the newest Overo Gumstix models use a 600 MHz processor and 256 MB of RAM, while newer smartphones (such as the Sony Ericsson Xperia X10) reach a 1 GHz CPU and 1 GB of memory.
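To make the second functional requirement concrete, the sketch below shows one way a custom situation-awareness rule could be registered with an embedded Jess engine so that its firing triggers an action rather than only asserting new facts. This is a minimal illustration assuming Jess 7's Java API (Rete.eval, assertString, run); the triple template anticipates the representation discussed in section 4, and the property and resource names (hasStockLevel, Milk) as well as the printout action are hypothetical placeholders rather than part of the SmartProducts knowledge base.

import jess.JessException;
import jess.Rete;

// Minimal sketch (not the actual SmartProducts code): a custom rule whose
// right-hand side performs an action when a situation of interest is detected.
public class SituationRuleSketch {
    public static void main(String[] args) throws JessException {
        Rete engine = new Rete();

        // RDF statements are represented as 'triple' facts (cf. section 4 and [5]).
        engine.eval("(deftemplate triple (slot subject) (slot predicate) (slot object))");

        // Hypothetical rule: if an ingredient's stock level drops to zero, notify
        // the user. Here the action is a simple printout; in a deployed smart
        // product it would call back into the application (e.g., to start a user
        // interaction) via Jess's Java integration.
        engine.eval(
            "(defrule notify-missing-ingredient" +
            "  (triple (subject ?item) (predicate hasStockLevel) (object 0))" +
            "  =>" +
            "  (printout t \"Out of stock: \" ?item \" - asking the user to restock\" crlf))");

        engine.reset();
        engine.assertString("(triple (subject Milk) (predicate hasStockLevel) (object 0))");
        engine.run(); // fires the rule and executes the action on its right-hand side
    }
}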
3 Comparison of triple stores

As possible candidates, we considered three widely used triple stores: Jena (http://jena.sourceforge.net/), Sesame (http://www.openrdf.org/), and Mulgara (http://www.mulgara.org/); some others targeted at large-scale data were excluded, e.g., Virtuoso (http://virtuoso.openlinksw.com/) and 4store (http://4store.org/). To compare the triple stores, we constructed a benchmark dataset consisting of small- to medium-sized ontologies retrieved from the Watson search server (http://watson.kmi.open.ac.uk/). The sets of ontologies to be used for testing semantic tools have been built following two main requirements. First, they had to cover ontology sizes from very small (just a few triples) to medium-scale ones (hundreds of thousands of triples). These medium-scale ontologies represent the limit of what the best performing tools can handle on small devices. Second, they had to take into account that, especially at small scale, size is not the only parameter that might affect the performance of semantic tools. It is therefore important that, within each set, the ontologies vary in size but stay relatively homogeneous with respect to these other characteristics.

We first devised a script to build sets of ontologies from Watson, grouping together ontologies having similar characteristics and therefore building homogeneous sets with respect to these characteristics. The parameters employed for building these groups are the ratios of properties, individuals, and classes, and the complexity of the ontological description, as expressed by the underlying description logic (e.g., ALH). As a result of this automatic process, we obtained 99 different sets of ontologies. We then manually selected among these sets the ones to be used for our benchmarks, considering only the sets containing ontologies with appropriate ranges of sizes.

To test the triple stores, we applied the following procedure (a simplified sketch of the measurement loop is given at the end of this section):

1. Loading an ontology into the data store.
2. Running test queries.
3. Measuring the disk space taken by the data store.

We developed 8 test SPARQL queries applicable to a wide range of ontologies (e.g., selecting all rdfs:label values or selecting all properties of all instances of all classes). The metrics we used included average loading time, memory, disk space, and query response time per single triple. The results for one of our test sets, containing the widest range of dataset sizes, are provided in Table 1. This set contains 21 different ontologies ranging between 3208 and 658808 triples. Our benchmark and tests for triple stores are described in more detail in [3].

Table 1. Triple store test results (test set 53): Jena (TDB), Sesame (native), Mulgara

                                  Jena    Sesame   Mulgara
  Avg. loading time/triple (ms)      1      1          2
  Avg. memory/triple (KB)            1      0.16       0.43
  Avg. disk space/triple (KB)        0.17   0.1       32
  Avg. query time/Ktriple (ms)     232     43       1291

Based on the tests, we could make several observations concerning the performance of the tools. In general, Sesame was found to outperform both Jena and Mulgara for small-size datasets, although its advantage tends to decrease as the dataset size grows. It is interesting to note that in the large-scale benchmarking tests [1] Sesame generally performed worse than other tools. One of the causes for this is that Jena and Mulgara allocate larger amounts of resources straight from the start (especially Mulgara), while an "empty" Sesame is lightweight. In other terms, the "fixed cost" associated with processing ontologies with Sesame is significantly lower than that of Jena and Mulgara, whereas the "variable cost" appears to be higher.

This clearly demonstrates that benchmarks targeting different scales, and considering different dimensions as available resources (i.e., not only time), are crucial. Indeed, in a resource-limited environment, the best tool at a given scale may not be the right one at another scale. Here, it appears clearly that Sesame is a strong candidate to be used on a small device, due to its low fixed cost, while, for the same reason, it might become inadequate as devices, and the datasets they handle, get bigger.
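To illustrate the procedure above, the following is a minimal sketch of the per-ontology measurement loop using Sesame's native store (Sesame 2.x API). It is not the benchmarking code described in [3]: the store directory, base URI, assumed RDF/XML serialisation, the single test query, and the JVM-heap-based memory estimate are all simplifying assumptions.

import java.io.File;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.rio.RDFFormat;
import org.openrdf.sail.nativerdf.NativeStore;

// Minimal sketch of the per-ontology measurement loop (Sesame 2.x native store).
public class TripleStoreBenchmarkSketch {

    public static void main(String[] args) throws Exception {
        File dataDir = new File("stores/sesame-native"); // placeholder store location
        File ontology = new File(args[0]);               // ontology file from the test set (assumed RDF/XML)

        Repository repo = new SailRepository(new NativeStore(dataDir));
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        try {
            Runtime rt = Runtime.getRuntime();
            long memBefore = rt.totalMemory() - rt.freeMemory();

            // 1. Load the ontology and measure loading time per triple.
            long t0 = System.currentTimeMillis();
            con.add(ontology, "http://example.org/", RDFFormat.RDFXML);
            long loadingTime = System.currentTimeMillis() - t0;
            long triples = con.size();
            long memAfter = rt.totalMemory() - rt.freeMemory();

            // 2. Run a test query (here: one of the 8 queries, selecting all rdfs:label values).
            String query = "SELECT ?s ?label WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label }";
            t0 = System.currentTimeMillis();
            TupleQueryResult result = con.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
            while (result.hasNext()) {
                result.next(); // iterate to force full evaluation of the query
            }
            result.close();
            long queryTime = System.currentTimeMillis() - t0;

            System.out.println("Triples: " + triples);
            System.out.println("Avg. loading time/triple (ms): " + (double) loadingTime / triples);
            System.out.println("Avg. memory/triple (KB): " + (memAfter - memBefore) / 1024.0 / triples);
            System.out.println("Avg. query time/Ktriple (ms): " + queryTime * 1000.0 / triples);
        } finally {
            con.close();
            repo.shutDown();
            // 3. Disk space is measured separately as the size of dataDir on disk.
        }
    }
}

An analogous loop would be run for each ontology in a test set and for each store under comparison.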
4 Comparison of rule engines

Based on the functional requirements, we focused on rule engines implementing the Rete algorithm [4]. We considered two initial options: a general-purpose rule engine (Jess), which we could then adapt for the usage of RDF data, and a rule engine designed for dealing with RDF data (BaseVISor). In the case of a general-purpose engine, the advantages were the possibility to use both forward- and backward-chaining rule processing and to define arbitrary facts (n-ary relations) rather than only triples. The latter feature was convenient for defining intermediate facts, which were used during reasoning and removed from the working memory afterwards. To represent RDF data, we used the standard approach initially proposed in [5]: representing each RDF triple with a Jess fact (triple (subject ?s) (predicate ?p) (object ?o)) and implementing OWL inferencing with Jess rules. For Jess, we used two rule processing strategies: forward-chaining and hybrid (forward- and backward-chaining). With the forward-chaining option, all data potentially relevant for the task are loaded into the working memory at the start. This option maximises processing speed at the expense of memory usage. With the hybrid approach, only the minimal amount of data needed to trigger the reasoning is loaded into the working memory, and backward-chaining rules are used to load information from the triple store into the rule engine's working memory when needed. For data loading, we used an approach similar to [6]: calls to Java functions which executed SPARQL queries on a triple store and translated the results into Jess facts. This approach saved working memory but consumed more time.

For testing, we used the recipe selection task from one of our use cases. In this task, the RDF dataset included a set of recipes (with their ingredients), a set of user preferences, and information about available ingredients. Based on this information, the task of the rule engine was to use rules to produce a ranking of recipes which could be proposed to the user. In an example scenario, we used a dataset which, after calculating the OWL-DL inferences, contained about 280000 triples. The rule base included rules which evaluated whether a recipe in the dataset satisfied a specific type of constraint imposed on one of the recipe parameters, e.g., ingredients, cooking time, or nutritional value. For example: "IF the user has a preference P for ingredient X AND recipe A contains X, add a fact declaring that A satisfies X" (a sketch of such a rule is given below). As a result, the rule base produced a ranking of recipes based on the number of satisfied constraints. The test set of 21 rules was originally composed for Jess in the forward-chaining mode. These original rules were then adapted for the two other cases. For the Jess hybrid mode, rules for loading data from the triple store were added. For BaseVISor, the Jess forward-chaining rules were translated into the BaseVISor native format, with auxiliary n-ary relations created during inferencing represented by several triple facts.
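As an illustration, the sketch below shows how one such constraint rule might look when expressed over triple facts and loaded into Jess through its Java API (again assuming Rete.eval and assertString). The property names (likesIngredient, containsIngredient) and the shape of the auxiliary satisfies fact are hypothetical; the actual 21-rule base is not reproduced here.

import jess.JessException;
import jess.Rete;

// Illustrative sketch (not the actual rule base): one ingredient-preference
// constraint rule over triple facts, asserting an auxiliary n-ary fact.
public class RecipeConstraintRuleSketch {
    public static void main(String[] args) throws JessException {
        Rete engine = new Rete();

        // Each RDF statement becomes one 'triple' fact, following [5].
        engine.eval("(deftemplate triple (slot subject) (slot predicate) (slot object))");

        // "IF the user has a preference for ingredient X AND recipe A contains X,
        //  add a fact declaring that A satisfies this constraint."
        engine.eval(
            "(defrule ingredient-preference-satisfied" +
            "  (triple (subject ?user)   (predicate likesIngredient)    (object ?ingr))" +
            "  (triple (subject ?recipe) (predicate containsIngredient) (object ?ingr))" +
            "  =>" +
            "  (assert (satisfies ?recipe ingredient-preference ?user)))");

        engine.reset();
        engine.assertString("(triple (subject user1)   (predicate likesIngredient)    (object Basil))");
        engine.assertString("(triple (subject recipe7) (predicate containsIngredient) (object Basil))");
        engine.run(); // derives the intermediate fact (satisfies recipe7 ingredient-preference user1)

        // The final ranking would then count, per recipe, the number of 'satisfies' facts.
    }
}

In the hybrid configuration, additional backward-chaining rules and Java calls would fetch the relevant triples from the store on demand instead of asserting them all up front.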
We tested the performance of the three configurations on a PC laptop with a 791 MHz Intel Core 2 Duo CPU and 2 GB of memory. The results are shown in Table 2.

Table 2. Reasoner test results: Jess and BaseVISor

                        Jess (forward)   Jess (hybrid)   BaseVISor
  Engine run-time (s)         81              302            4.7
  Memory usage (MB)          290               60            160

It can be seen that BaseVISor outperformed Jess in the forward-chaining mode in terms of both runtime and memory consumption, apparently because of its specific support for triple facts. The hybrid mode required substantially less memory, because unnecessary data was not loaded. However, loading data at runtime made the hybrid run take roughly three times as long as the forward-chaining one. Although in the forward-chaining case more data had to be loaded from the start, this took less time because asserting new facts did not require checking rule activations. In general, BaseVISor was found to provide a good trade-off between time and memory cost.

5 Conclusion and future work

In this paper, we briefly overviewed two benchmarking tests realised as part of a specific scenario related to the realisation of smart products: products embedding knowledge and smart behaviours. The major particularity of these two studies is that they target small devices, i.e., hardware and software environments where strong limitations may apply to the available resources. Through these two tests, we demonstrated the importance of benchmarks specifically designed for resource-limited environments, mainly for two reasons: (i) because, on small devices, dimensions other than response time and scalability should be assessed, including, for example, storage space and memory, and (ii) because semantic tools can achieve different levels of performance at different scales. In our tests, we showed in particular how some of the tools that would be judged best at a large scale can actually perform rather badly when working with smaller datasets under strong hardware limitations.

More importantly, these two elements highlight the need for a new, more flexible type of benchmark. Indeed, in almost all cases, the selection of a given tool is not only a matter of performance, but represents a trade-off between performance and the availability of resources. For this reason, the comparison of semantic technologies should not be an absolute measure, a ranking of tools independent of the environment in which they are applied, but should rather work as a guide for elaborating this trade-off for particular scenarios and environments. In other words, benchmarks such as the ones presented here should evolve to become guidelines supporting developers in selecting the right set of tools, depending on their requirements and constraints.

6 Acknowledgements

Part of this research has been funded under the EC 7th Framework Programme, in the context of the SmartProducts project (231204).

References

1. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. IJSWIS 5(2) (2009) 1–24
2. Liang, S., Fodor, P., Wan, H., Kifer, M.: OpenRuleBench: An analysis of the performance of rule engines. In: WWW 2009, Madrid, Spain (2009) 601–610
3. d'Aquin, M., Nikolov, A., Motta, E.: How much semantic data on small devices? In: EKAW 2010, Lisbon, Portugal (2010)
4. Forgy, C.L.: Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence 19 (1982) 17–37
5. Mai, J., Paslaru-Bontas, E., Lin, Z.: OWL2Jess: A transformational implementation of the OWL semantics. In: Parallel and Distributed Processing and Applications - ISPA 2005 Workshops (2005) 599–608
6. Bak, J., Jedrzejek, C., Falkowski, M.: Usage of the Jess engine, rules and ontology to query a relational database. In: International Symposium on Rule Interchange and Applications (2009) 216–230