=Paper=
{{Paper
|id=Vol-1690/paper51
|storemode=property
|title=USE-RB : Benchmarking how Reasoners Work in Harmony with Modern Hardware
|pdfUrl=https://ceur-ws.org/Vol-1690/paper51.pdf
|volume=Vol-1690
|authors=Christophe Gravier,Julien Subercaze
|dblpUrl=https://dblp.org/rec/conf/semweb/GravierS16
}}
==USE-RB : Benchmarking how Reasoners Work in Harmony with Modern Hardware==
Christophe Gravier* and Julien Subercaze

Univ Lyon, UJM-Saint-Etienne, CNRS, Laboratoire Hubert Curien UMR 5516, F-42023 Saint-Etienne, France
{christophe.gravier,julien.subercaze}@univ-st-etienne.fr
http://laboratoirehubertcurien.fr/

Abstract. As our computers embed more cores, efficient reasoners are designed with parallelization, but also CPU and memory friendliness, in mind. The latter help make reasoners tractable in practice despite the computational complexity of logical fragments. However, creating benchmarks to monitor this CPU-friendliness across many reasoners, datasets, and logical fragments is a tedious task. In this paper, we present the Université Saint-Etienne Reasoners Benchmark (USE-RB), which automates the setup and execution of reasoner benchmarks, with particular attention to monitoring how reasoners work in harmony with the CPU.

Keywords: Reasoning, performance, benchmark, caches, memory.

1 Introduction

The number of Web applications relying on a triplestore and a reasoner has grown rapidly in recent years. This has resulted in new Web frontends for browsing, searching, and expressing complex queries over online data. As online data has never been as interlinked as it is today, the emerging challenge is to process these data in a computationally efficient manner, especially when reasoning.

However, reasoning is a computationally expensive task, and many logical fragments exist, yet no single fragment fits all applications. For example, subsumption in the SHOIN description logic is decidable, but with NEXPTIME complexity. While such an order of complexity may deter the working ontologist from including a reasoner in their application, despite the benefits of leveraging implicit triples, the situation is not so desperate from a practical point of view.
Actually, even when the theoretical complexity seems intractable, optimized reasoners are available [2, 4] that are usable in practical, real-world cases. Historically, reasoning scalability has been tackled from a distributed-computing point of view, with practical systems such as WebPIE [1]. While these systems provided unprecedented scalability, they fall short of scaling linearly with the number of nodes in the cluster. One can observe that the most recent research has shifted towards parallelization on a single node equipped with many cores [2–4]. These approaches focus on designing in-memory, efficient, and cache-friendly data structures and algorithms, in order to win back CPU cycles that hardware-unoptimized counterparts would otherwise lose. By designing cache-friendly systems, data and code locality are expected to fully exploit CPU subsystems such as the prefetcher, the translation lookaside buffers, and page management, to name a few. Indeed, most of reasoning consists of memory I/O rather than raw arithmetic computation: few computations are made, but data structures must usually be traversed several times in different ways. For instance, systems such as RDFox [2] defy Amdahl's law, as they exhibit close-to-linear scalability with respect to the number of threads devoted to the reasoning task. With major chip makers such as Intel pursuing a many-core strategy for the coming years, one can expect that research efforts combined with technological advances will push reasoning performance further into the field. In order to sustain this research effort, it is therefore of the utmost practical interest to understand reasoner performance more deeply: examining reasoners from a CPU-friendliness point of view is a mandatory effort.

* This work has been supported by the CNRS PEPS-Secu project.
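As a minimal illustration of this point (our own sketch, not code from USE-RB or any of the systems above), the following Java snippet traverses the same matrix in two orders. The row-major order walks memory sequentially and is typically several times faster than the column-major order, purely because of cache-line reuse and prefetcher behaviour; the arithmetic work is identical.

```java
// Illustrative only: shows why traversal order (data locality) matters
// for memory-bound workloads such as rule materialisation.
public class LocalityDemo {
    static final int N = 2048;

    // Row-major traversal: consecutive accesses hit the same cache lines
    // and the prefetcher can stream the data in ahead of use.
    static long sumRowMajor(int[][] m) {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    // Column-major traversal: each access jumps to a different row array,
    // defeating the prefetcher and causing far more cache misses.
    static long sumColMajor(int[][] m) {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }

    public static void main(String[] args) {
        int[][] m = new int[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = 1;
        long t0 = System.nanoTime();
        long a = sumRowMajor(m);
        long t1 = System.nanoTime();
        long b = sumColMajor(m);
        long t2 = System.nanoTime();
        System.out.printf("row-major: %d ms, column-major: %d ms (same sum: %b)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

Hardware counters such as cache misses and dTLB misses, exactly the counters USE-RB monitors, make this difference directly observable.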
Through USE-RB – the Université Saint-Etienne Reasoners Benchmark – we envision a CPU-friendly reasoners benchmark. USE-RB integrates all the facilities needed to plug in any existing or future reasoner or dataset, and to easily run and evaluate how a reasoner works in harmony with the CPU. We believe that providing our benchmark to the community will contribute to the research on high-performance many-core reasoners.

2 USE-RB

2.1 Outline

USE-RB is configured with a list of reasoners, a list of datasets, and a list of logical fragments. The cartesian product of these sets yields as many benchmark configurations to be run – each one a benchmark task. The module named USE-RB, the entry point of the program, is responsible for sequentially spawning one instance of an external program named ReasonersBenchmarked per task – thus each benchmark task runs in isolation. A Java interface provides functional genericity, so that new reasoners can easily be integrated into the benchmark. Each execution of ReasonersBenchmarked can be parameterized with the number of warm-up iterations and the number of actual iterations for which USE-RB will measure CPU counters.

2.2 Performance metrics

USE-RB tracks CPU counters for each execution of a ReasonersBenchmarked instance. The primary counter is the wallclock time taken by the program to run one benchmark configuration – the actual number of CPU cycles, translated to temporal units, used by the CPU for this process. Other metrics focus on hardware-specific counters for various CPU components, categorized as follows.

Instructions. Instruction metrics include the number of branch mispredictions (branch misses) by the CPU. This is relevant for reasoners given the tremendous number of data structures to traverse – these traversals should be as uniform, and therefore as predictable for early CPU pipeline units, as possible.
USE-RB also reports the total number of instructions, the number of instructions per CPU cycle, and the number of stalled CPU cycles per instruction.

Memory. Memory metrics include the number of page faults and the number of translation lookaside buffer (TLB) loads and misses (total number and hit ratio). They also include the number of stalled CPU cycles when accessing any level of the memory hierarchy.

Cache. Cache metrics are a subcategory of memory metrics – but such a prominent one when designing high-performance applications that they fall into their own category in USE-RB. Cache metrics include the miss rate on all levels of cache, together with per-level information on cache hits and misses. For the first level of cache (L1), USE-RB differentiates between the data and instruction L1 caches: in case of a high cache miss rate, this makes it possible to understand whether the reasoner fails to provide good data or good instruction locality.

2.3 Results and extensibility

USE-RB ships with vanilla datasets, logical fragments, and reasoners. It is natively able to run benchmarks by creating benchmark configurations (see 2.1) that select one or several datasets, logical fragments, and reasoners among:

– Datasets: 9 sizes of the BSBM dataset (from 100,000 up to 100 million triples), the Wikipedia ontology, the WordNet ontology, the Yago taxonomy, and 7 LUBM datasets (from 1 up to 100 million triples). We also ship dedicated datasets focusing on benchmarking the closure computation of subsumption axioms.
[Figure 1: bar charts of dTLB-load-miss rate and page faults per 1,000 triples for rdfox, inferray, and owlimse on the lubm5–lubm100, Wiki, Yago, and Wordnet datasets.]

Fig. 1. Cache misses, TLB misses and page faults per triple inferred for the RDFS benchmark.

– Reasoners: Jena [5], OWLIM [6]¹, Slider [3], Sesame [7], RDFox [2], and Inferray [4].
– Logical fragments²: RDFS default, RDFS-Full, Rho-DF, RDFS+.

One can easily add their own datasets, reasoners, or logical fragments³. Figure 1 provides an example of the figures that can be drawn from an execution of USE-RB. It reports data TLB misses and page faults per triple inferred for an RDFS-Full benchmark configuration⁴. Though the benchmark was run on all vanilla reasoners, we kept only the most performant ones – the omitted reasoners are outperformed by several orders of magnitude, as reported in [4].

3 Conclusion

In this paper, we presented USE-RB, a system for creating reasoner benchmarks and for observing how reasoners work in harmony with the CPU through the monitoring of CPU counters. USE-RB is publicly available at https://github.com/telecom-se/USE-RB. We believe that the presented benchmark is of the utmost practical interest for researchers and industries willing to provide high-performance reasoners. We also think that such frameworks are mandatory for promoting the reproducibility of experiments, whenever possible.
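As a closing illustration, the benchmark-task enumeration and isolation strategy described in Sect. 2.1 can be sketched as follows. The class name, command-line flags, and argument order are ours for illustration – only the ReasonersBenchmarked program name and the reasoner/dataset/fragment names come from the paper, and this is not USE-RB's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Sect. 2.1: the cartesian product of reasoners x datasets x
// fragments yields one benchmark task each, to be run in an isolated
// child JVM (one ReasonersBenchmarked process per task).
public class TaskEnumerator {

    // Build one command line per benchmark task. Flag names are illustrative.
    static List<List<String>> tasks(List<String> reasoners,
                                    List<String> datasets,
                                    List<String> fragments,
                                    int warmup, int measured) {
        List<List<String>> cmds = new ArrayList<>();
        for (String r : reasoners)
            for (String d : datasets)
                for (String f : fragments)
                    cmds.add(List.of("java", "ReasonersBenchmarked",
                            r, d, f,
                            "--warmup", Integer.toString(warmup),
                            "--iterations", Integer.toString(measured)));
        return cmds;
    }

    public static void main(String[] args) {
        List<List<String>> cmds = tasks(
                List.of("rdfox", "inferray"),
                List.of("lubm5", "yago"),
                List.of("rdfs-default"), 5, 10);
        for (List<String> cmd : cmds) {
            // A real harness would spawn each task in isolation, e.g.:
            // new ProcessBuilder(cmd).inheritIO().start().waitFor();
            System.out.println(String.join(" ", cmd)); // dry run: print only
        }
    }
}
```

Running each task in a fresh child process, rather than in the driver's own JVM, keeps the warm-up state, heap, and CPU counters of one configuration from contaminating the next.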
This framework also includes various popular datasets, reasoners, and implemented logical fragments, while remaining fully customizable.

References

1. Urbani, J., Kotoulas, S., Maassen, J., Van Harmelen, F., Bal, H.: WebPIE: A web-scale parallel inference engine using MapReduce. Web Semantics: Science, Services and Agents on the World Wide Web 10, 59–75 (2012)
2. Motik, B., Nenov, Y., Piro, R., Horrocks, I., Olteanu, D.: Parallel materialisation of Datalog programs in centralised, main-memory RDF systems. In: AAAI, pp. 129–137 (2014)
3. Chevalier, J., Subercaze, J., Gravier, C., Laforest, F.: Slider: an efficient incremental reasoner. In: SIGMOD, pp. 1081–1086 (2015)
4. Subercaze, J., Gravier, C., Chevalier, J., Laforest, F.: Inferray: fast in-memory RDF inference. PVLDB 9(6), 468–479 (2016)
5. McBride, B.: Jena: A semantic web toolkit. IEEE Internet Computing 6(6), 55–59 (2002)
6. Bishop, B., et al.: OWLIM: A family of scalable semantic repositories. Semantic Web 2(1), 33–42 (2011)
7. Broekstra, J., Kampman, A., Van Harmelen, F.: Sesame: A generic architecture for storing and querying RDF and RDF Schema. In: ISWC, pp. 54–68 (2002)

¹ Requires an OWLIM-SE licence.
² Logical fragment descriptions in [4].
³ https://github.com/telecom-se/USE-RB
⁴ This excludes the BSBM datasets, since BSBM does not support the expressivity of the RDFS-Full logical fragment.