Empirical Investigation of Subsumption Test Hardness in Description Logic Classification

Nicolas Matentzoglu, Bijan Parsia, and Uli Sattler

The University of Manchester, Oxford Road, Manchester, M13 9PL, UK
{matentzn,bparsia,sattler}@cs.manchester.ac.uk

Abstract. Recently, modular techniques have been employed for optimising Description Logic reasoning, specifically to enable incremental reasoning and to improve overall classification time. Classifying a module of an ontology should be significantly easier than reasoning in the whole ontology. However, we observed in previous work that it is neither generally true that modular reasoning techniques have a reliable positive effect, nor that the classification time of a module is always less than or equal to the classification time of the whole ontology. One possible explanation for the latter could be that counter-productive optimisations are triggered within the reasoner when dealing with the sub-module, so that individual subsumption tests get harder when parts of the ontology are missing. The goal of this paper is to understand the contribution of subsumption tests to the hardness of classification. The contribution is twofold: (1) we analyse the impact of subsumption test hardness on DL classification by analysing a well-known corpus of ontologies, and (2) we present a novel approach based on modularity to robustly detect subsumption tests that are too hard.

Keywords: classification, ontologies, benchmarking, modular reasoning

1 Introduction

Reasoning in popular, very expressive Description Logics (DLs) is very difficult (e.g., SROIQ is N2ExpTime-complete) [12]. Perhaps surprisingly, modern reasoning systems suitable for the entirety of OWL 2 DL (essentially a notational variant of SROIQ), such as FaCT++ [20], Pellet [18], HermiT [5] and, recently, Konclude [19], generally perform well against real ontologies. However, due to the poor performance in some (often important) cases, the quest for optimisations is ongoing. The need to empirically validate such optimisations stems from the sheer complexity of reasoner architectures. Worst-case complexity analysis and its variants do not account for the high variability of classification times of real ontologies. Modern reasoning systems have to accommodate multiple reasoning services and also tend to implement a wide range of optimisation techniques that might affect each other. Various sources of non-determinism, mainly traversal order (subsumption test order) and or-branch exploration of a tableau, further complicate the situation. Statistical methods such as linear regression [16] tend to be precise only for trivial cases, and are limited in their explanatory richness. Currently, principled benchmarking provides the only way to create detailed characterisations of DL reasoning performance.

Using locality-based modules to optimise Description Logic classification has experienced a resurgence in recent years [15, 21]. Intuitively, breaking the input problem into smaller pieces, reasoning over those pieces separately, and then recombining the results is appealing. Furthermore, if there are especially difficult parts of the ontology, perhaps they can be isolated to reduce their impact. In practice, however, modular reasoning techniques do not always improve the performance of classification [6]. In fact, they can drastically impair performance, making it a hit-and-miss game to choose between a modular reasoner (e.g.
MORe-HermiT, Chainsaw-JFact) and its monolithic counterpart (e.g. HermiT, JFact). These cases can often be traced back to various kinds of overhead induced by modular reasoners (module extraction) or to the redundancy introduced by the mostly unavoidable and often significant overlap between the various modules extracted. In a preliminary set of experiments [13] we observed not only that there are individual subsumption tests that can be, often significantly, harder in a module extracted by a modular reasoner than in the whole ontology, but also that there are occasionally modules whose classification time exceeds that of the entire ontology O they were extracted from.

The goal of this paper is to understand the contribution of subsumption tests (ST) to the hardness of classification. The contribution is twofold: (1) we analyse the impact of ST hardness on DL classification by characterising a well-known corpus of ontologies, and (2) we present a novel approach based on modularity to robustly detect subsumption tests that are potentially too hard. As a result, we re-confirm the almost 20-year-old results by Horrocks [11] that STs are generally rather easy. We also isolate counter-intuitive instances which, however, are often likely to be a consequence of the surprising degree of stochasticity observed in the classification process.

2 Background

Understanding the experimental design and methodology presented here does not require more than a cursory understanding of the syntax, semantics, and proof theories implemented. The most prominent families of reasoning algorithms for Description Logics are tableau-based (including hyper-tableau) and consequence-based; in this work, we are mainly concerned with tableau-based algorithms. The reasoners under investigation in this paper are designed to implement key reasoning services for the Web Ontology Language (OWL), most importantly classification and consistency checking.

Given an ontology (a set of axioms) O, the signature of O (written Õ) is the set of names appearing in the axioms of O. We use CT(O) (classification time) to denote the time needed to compute the set of atomic subsumptions of O, i.e., statements of the form A ⊑ B where A and B are properties or classes in the signature; this is the time needed to classify O. CT(M) is defined analogously for a module M. For brevity, we refer to overall classification time as OCT and to subsumption test time as STT.

While subsumption testing, and therefore classification, is in theory intractable, highly optimised reasoners do fairly well in practice. The observed efficiency despite the worst-case complexity essentially comes down to four factors. (1) Real ontologies are bounded in size, which means that even an exponential algorithm might fully classify an ontology, from a user perspective, in an acceptable amount of time. (2) Many ontologies fall into tractable fragments of OWL and can be classified using efficient polynomial algorithms, such as those from the family of consequence-based algorithms. (3) The last 20 years have brought a plethora of different optimisations that make satisfiability checks easier [11, 3]. (4) Very efficient algorithms were developed to avoid the vast majority of subsumption tests altogether [1, 4, 17].
Current modular classification approaches use so-called syntactic locality-based ⊥-modules [10], which have a number of desirable properties: (1) they are relatively cheap to extract and are reasonably compact and exact, and (2) if O |= A ⊑ C, then for any ⊥-module M⊥ of O with A in the signature of M⊥, we have M⊥ |= A ⊑ C (where C is an arbitrary expression over the signature of O). Thus, ⊥-modules are classification complete for their signature with respect to their parent ontology. Hereafter, we will use M to refer to a syntactic locality-based ⊥-module. Very recently, reasoner developers have started to utilise modularity for classification: they either (1) use modules for incremental reasoning [9] or (2) use modules to improve classification time [15, 21].

3 Related Work

Attempts to understand DL reasoning performance are, up until today, rarely systematic or comprehensive. Recently, the ORE reasoner competition has tried to establish the methodological foundations for more reliable comparisons [6] between different reasoners and across a range of different reasoning services. OWL reasoner benchmarks have been conducted for varying purposes, for example (and most prominently) guiding end-users in selecting appropriate reasoners for their problem [2, 6] or understanding reasoning, or the state of reasoning, in general [7]. Dentler et al. [2] conduct a principled investigation to identify suitable criteria for choosing an appropriate reasoner for EL ontologies. In our work, we are interested in mapping out subsumption test hardness during full classification across reasoner-ontology pairs (a phenomenological characterisation) and in the potential of modularity to pinpoint counter-intuitive cases (i.e., harder tests in a sub-module). Most benchmarks conduct an only semi-principled dataset selection: even carefully executed benchmarks such as Dentler et al. [2] usually cherry-pick a set of somehow relevant ontologies. Few works sample from existing corpora or the web, and only Gonçalves et al. [7], to the best of our knowledge, deal with corpora larger than 500 ontologies. In practice, the current de facto gold-standard corpus for ontology experimentation is BioPortal [14], which also provides a well-designed infrastructure to obtain an interesting range of biomedical ontologies programmatically; we use a snapshot of BioPortal in this work. As far as we know, no benchmark to date has investigated subsumption testing during classification across reasoners in a principled manner. However, various benchmarks have investigated the effect of certain optimisations on subsumption test avoidance [4]. While the literature on classification optimisation and reasoning is vast, little progress has been made in understanding the classification hardness of real ontologies, both empirically and formally.

4 Subsumption Test Hardness

The phenomenon under investigation is subsumption test hardness in the context of classification. A subsumption test is a question asked by the reasoner to determine whether A ⊑ B. The subsumption test hardness is the time it takes to compute the answer, operationalised as wall-clock time. In this work the answer to a test is either yes or no. Note, however, that for any implementation (1) more than just a binary answer will be provided (e.g., cached models, derived subsumptions) and (2) no guarantee is given that the answer is correct (bugs in the reasoner).
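To make the question concrete: a tableau reasoner typically decides A ⊑ B by testing whether A ⊓ ¬B is unsatisfiable. The sketch below poses (and times) such a test from outside the reasoner through the OWL API's OWLReasoner interface. It is purely illustrative of the question being asked; it is an external "cold" test in the sense of the next paragraph, not one of the internal, optimised tests measured in this paper, and the class and method names other than the OWL API's own are ours.

```java
import java.util.concurrent.TimeUnit;

import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLClassExpression;
import org.semanticweb.owlapi.model.OWLDataFactory;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class SubsumptionTestSketch {

    /**
     * Times a single subsumption test A ⊑ B, posed as the (un)satisfiability
     * of A ⊓ ¬B. An external "cold" test, for illustration only.
     */
    public static long timeSubsumptionTest(OWLReasoner reasoner,
                                           OWLClass subCls, OWLClass superCls) {
        OWLDataFactory df = reasoner.getRootOntology()
                .getOWLOntologyManager().getOWLDataFactory();
        // A ⊑ B holds iff A ⊓ ¬B is unsatisfiable w.r.t. the ontology.
        OWLClassExpression candidate = df.getOWLObjectIntersectionOf(
                subCls, df.getOWLObjectComplementOf(superCls));

        long start = System.nanoTime();
        boolean subsumed = !reasoner.isSatisfiable(candidate);
        long elapsedNanos = System.nanoTime() - start;

        System.out.println(subCls + " SubClassOf " + superCls + " : " + subsumed
                + " (" + TimeUnit.NANOSECONDS.toMillis(elapsedNanos) + " ms)");
        return elapsedNanos;
    }
}
```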
“In the context of classification” means that we are not exploring individual “cold” tests, i.e. letting the reasoner compute whether A ⊑ B for some A, B outside the classification process, because we want to understand the contribution of subsumption testing to classification as a whole, with all the optimisations involved.

Our model of subsumption test hardness with respect to sub-modules is based on the following intuition: given a positive subsumption test ST, it should be the case that for every two modules M1, M2 with M1 ⊂ M2 ⊆ O in which ST is triggered, the hardness of ST stays the same. We write ST^M for the occurrence of ST in module M. The reason for this intuition lies in the module properties: every justification for an entailment is part of every module that entails it. Thus, every way in which the entailment holds is contained in the module; no “new” information about the entailment exists in the rest of the ontology. Intuitively, additional “stuff” can make it harder to figure out the entailment, but it cannot make it easier. This makes the metric a possible indicator of counter-productive optimisations: if we find that ST^M2 is harder than ST^M1, we might conclude that the reasoner is doing some unnecessary extra work in M2 (case 1); if ST^M1 is harder than ST^M2, there is a possibility that a counter-productive optimisation has been triggered (case 2). Only the second case is truly pathological: a test should never get harder when irrelevant axioms are removed from the ontology. The first case might simply occur because, as M grows, it gets harder to identify the irrelevant axioms. One possible explanation for both cases is the inherent stochasticity of classification as implemented by current OWL reasoners. For example, a random factor (such as a change in the test order) might simply shift the load of ST in M1 to another test ST2, which in turn makes ST easier. Another reason for a test becoming easier in a sub-module might be the exploitation of partial results from negative tests (e.g. caching).

Our empirical investigation of subsumption tests has two parts: (A) a broad characterisation of the landscape of subsumption testing and (B) an in-depth characterisation of non-easy subsumption tests. We treat a test as non-easy if it takes longer than 100 ms. The first part (A) attempts to answer the following questions: What is the impact of subsumption testing on reasoning performance in general (RQ1)? How many tests are positive or negative, and how do they differ in hardness (RQ2)? How hard are real tests actually (RQ3)?

Part (B) serves as an in-depth characterisation that addresses questions related to the general stability of the measurements (intra-module) and the effect of modularity (inter-module). Is ST hardness a stable phenomenon (RQ4)? This is important in order to judge how reliably we can trace a single subsumption test through different sub-modules of an ontology, and it may also give a warning sign for triggered non-determinism, for example when a test appears or disappears across runs for a particular ontology-reasoner pair. We address this question mainly by looking at the coefficient of variation (COV) of subsumption test hardness. The COV is a statistical, standardised measure of the dispersion of a distribution (for example the distribution of test hardness), defined as the ratio of the standard deviation to the mean; it is used to compare the variation of one data series to another, even if they are on different scales.
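For concreteness, a minimal sketch of such a COV computation follows. It assumes the sample standard deviation (denominator n-1); the paper does not state which variant was used, and the choice barely affects the qualitative comparisons.

```java
/** Coefficient of variation (COV) of a series of measurements. */
public final class Cov {

    /** Returns sd(x) / mean(x); assumes the sample standard deviation (n-1). */
    public static double cov(double[] x) {
        if (x.length < 2) {
            throw new IllegalArgumentException("need at least two measurements");
        }
        double mean = 0.0;
        for (double v : x) {
            mean += v;
        }
        mean /= x.length;
        double sumSq = 0.0;
        for (double v : x) {
            sumSq += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSq / (x.length - 1));
        return sd / mean;
    }

    public static void main(String[] args) {
        // Three hypothetical wall-clock times (ms) for the same subsumption test.
        System.out.println(cov(new double[]{120.0, 150.0, 135.0})); // ~0.11
    }
}
```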
What are the reasons for instability (RQ5)? We will not try to answer this question conclusively, but we collect some evidence for stochasticity by looking at the intra-module variation of test counts, a strong indicator of non-determinism.

Does modularity change the hardness of tests (RQ6)? In order to answer this question, we classify tests by analysing how modularity affects their hardness. This happens as follows: we identify all super- and sub-module combinations M1, M2 as described earlier. For each test triggered in both M1 and M2, we determine: (1) was the effect positive on average (across runs), (2) what was the magnitude of the effect, and (3) was the effect stable? We define the stability of an effect as follows: given a subsumption test ST that occurs in two modules M1, M2 with M1 ⊂ M2, and two sets of measurements X(ST^M1) and X(ST^M2), either (a) the measurements in X(ST^M1) are all harder or all easier than the measurements in X(ST^M2) (strong stability; we call such effects clear cut), or (b) the overlap of the ranges of X(ST^M1) and X(ST^M2) is less than 10% of the range of X(ST^M1) ∪ X(ST^M2) (high stability); otherwise the stability of the effect is low.

We group ST hardness into the following bins: Very Hard (more than 100 seconds), Hard (>10 sec), Medium Hard (>1 sec), Medium (>100 ms), Medium Easy (>10 ms), Easy (>1 ms), Very Easy (>100 µs), Trivial (<100 µs). The upper bound of each bin corresponds to the lower bound of the previous one.

5 Experimental Design

We conducted our study on a corpus of 339 OWL API (3.5.0)-parsable BioPortal ontologies, obtained through the BioPortal REST services (http://data.bioontology.org/documentation), January 2015 snapshot. All ontologies were serialised into OWL/XML, with merged imports closure. A minimal amount of repair (injecting missing declarations, dropping empty n-ary axioms, etc.) was applied to ensure that trivial violations do not impair DLness.

For all our experiments, we use four OWL reasoners that implement the OWL API interface: HermiT 1.3.8, Pellet 2.3.1, JFact 1.2.3 and FaCT++ 1.6.3. All four are among the most heavily used reasoners for OWL 2 DL. The reasoners have been modified for the benchmark: when a subsumption test is conducted, the start and end timestamps, the sub- and super-class under consideration and the result of the test are recorded. While we can use this approach to compare results for each reasoner, the interpretation of comparisons between reasoners might be misleading due to implementation details; for example, methods that test for subsumption and ultimately satisfiability may be nested. See the companion website for more information (Section 6). Because we are interested in real-life behaviour, we allowed the reasoners to fall into states like the deterministic part of HermiT for Horn-SHIQ or Pellet's internal EL reasoner. That said, we cannot claim to measure all subsumption tests a reasoner does. We can, however, establish a lower bound, and we are confident that we capture the vast majority of the hard tests, because the sum of test times occasionally accounts for almost 100% of the OCT for all reasoners.

A set of four equal-spec Mac Minis with Mac OS X Lion 10 (64 bit), 16 GB RAM and a 2.7 GHz Intel Core i7 (64-bit) was used for the benchmarking. Every single classification was done in a separate, isolated virtual machine (Java 7, -Xms2G, -Xmx12G).
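To illustrate the kind of per-test record described above (start and end timestamps, the sub- and super-class under consideration, and the test result), together with the hardness bins of Section 4, a hypothetical logging helper is sketched below. All names are invented for illustration; the actual modifications are reasoner-specific and documented on the companion website.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical per-test log, illustrating the data recorded by the modified
 * reasoners and the hardness bins of Section 4. Not the actual instrumentation.
 */
public class SubsumptionTestLog {

    /** One recorded subsumption test. */
    public static final class Entry {
        final String subClass, superClass;
        final long startNanos, endNanos;
        final boolean subsumed;

        Entry(String sub, String sup, long start, long end, boolean result) {
            subClass = sub; superClass = sup;
            startNanos = start; endNanos = end; subsumed = result;
        }

        /** Subsumption test hardness, operationalised as wall-clock time. */
        long wallClockNanos() { return endNanos - startNanos; }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    /** Called (conceptually) by the modified reasoner around each test. */
    public void record(String sub, String sup,
                       long startNanos, long endNanos, boolean subsumed) {
        entries.add(new Entry(sub, sup, startNanos, endNanos, subsumed));
    }

    /** Sum of all recorded test times, to be compared against the OCT. */
    public long totalTestTimeNanos() {
        long total = 0;
        for (Entry e : entries) {
            total += e.wallClockNanos();
        }
        return total;
    }

    /** Maps a test time onto the hardness bins defined in Section 4. */
    public static String bin(long nanos) {
        long micros = TimeUnit.NANOSECONDS.toMicros(nanos);
        if (micros > 100_000_000L) return "Very Hard";   // > 100 s
        if (micros > 10_000_000L)  return "Hard";        // > 10 s
        if (micros > 1_000_000L)   return "Medium Hard"; // > 1 s
        if (micros > 100_000L)     return "Medium";      // > 100 ms
        if (micros > 10_000L)      return "Medium Easy"; // > 10 ms
        if (micros > 1_000L)       return "Easy";        // > 1 ms
        if (micros > 100L)         return "Very Easy";   // > 100 µs
        return "Trivial";                                // <= 100 µs
    }
}
```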
In order to reduce potential bias induced by run order (unaccounted-for background processes kicking in, runtime optimisations), we fully randomise the run order and evenly distribute the experiment jobs across the four machines.

Experimental Pipeline: For the first experiment we execute a single run of all reasoners across the entire corpus, with a timeout of 60 minutes per run. Due to technical details, the timeout is a lower bound and might not be triggered until some minutes later. Note that we include every ontology in the corpus, including the ones not strictly in OWL 2 DL (53). The reason is that these ontologies do form part of the landscape, and reasoners are used on them. The main sources of violations are uses of reserved vocabulary (37% of all violations across the corpus), illegal punning (32%) and uses of datatypes not in the OWL 2 datatype map (11%).

For the second experiment, we select the set of reasoner-ontology pairs for which, according to the results of experiment 1, at least one test was measured that was harder than 100 milliseconds. Because of the various claims we make with respect to modules, we also excluded ontologies that do not fall under OWL 2 DL. Runtime limitations forced us to exclude the NCIt from the sample, due to the extreme number of measured subsumption tests (JFact 751,907 tests, Pellet 461,831, FaCT++ 605,481).

For this experiment, we first obtain random cumulative subsets of the ontologies in our narrowed-down sample, similar to Gonçalves et al. [8], with 16 slices. In a nutshell, given the set of logical axioms an ontology is comprised of, we obtain a random 1/16th of the axioms, serialise this subset, add another randomly drawn 1/16th from the remaining axioms to the first, serialise them together, and then iteratively grow each consecutive subset until the final subset is the whole ontology. From the signature of each subset sampled, we obtain the ⊥-locality module using the OWL API module extractor. Module properties ensure that, for subsets S1 ⊂ S2, the module extracted for the signature of S1 is contained in the module extracted for the signature of S2. The module for the 16/16th subset, i.e. the module for the signature of the whole ontology, corresponds to the whole ontology. We call this nested set of modules a path. Note that the modules are usually considerably larger than their respective subsets, which gives us a good sample of relatively large modules with hopefully hard subsumption tests. Each of the modules obtained is classified three times (i.e., three independent runs) by each reasoner.
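The path construction just described can be sketched with the OWL API's ⊥-module extractor roughly as follows. This is a simplified sketch under the description above (slices = 16, modules extracted for the signature of each cumulative subset); serialisation, bookkeeping and the per-module classification runs are omitted, and the class and method names other than the OWL API's own are ours.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLEntity;
import org.semanticweb.owlapi.model.OWLLogicalAxiom;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;

import uk.ac.manchester.cs.owlapi.modularity.ModuleType;
import uk.ac.manchester.cs.owlapi.modularity.SyntacticLocalityModuleExtractor;

public class PathSampler {

    /**
     * Builds a "path" of nested ⊥-modules: the logical axioms are shuffled,
     * cumulative subsets of 1/16, 2/16, ..., 16/16 of them are formed, and
     * for the signature of each subset the ⊥-module of the whole ontology
     * is extracted.
     */
    public static List<Set<OWLAxiom>> samplePath(OWLOntologyManager manager,
                                                 OWLOntology ontology,
                                                 int slices, long seed) {
        List<OWLLogicalAxiom> axioms =
                new ArrayList<OWLLogicalAxiom>(ontology.getLogicalAxioms());
        Collections.shuffle(axioms, new Random(seed));

        SyntacticLocalityModuleExtractor extractor =
                new SyntacticLocalityModuleExtractor(manager, ontology, ModuleType.BOT);

        List<Set<OWLAxiom>> path = new ArrayList<Set<OWLAxiom>>();
        Set<OWLLogicalAxiom> subset = new HashSet<OWLLogicalAxiom>();
        int from = 0;
        for (int i = 1; i <= slices; i++) {
            // Grow the cumulative subset by the next randomly drawn slice.
            int to = (int) Math.round(axioms.size() * (i / (double) slices));
            subset.addAll(axioms.subList(from, to));
            from = to;

            // Signature of the cumulative subset ...
            Set<OWLEntity> signature = new HashSet<OWLEntity>();
            for (OWLAxiom ax : subset) {
                signature.addAll(ax.getSignature());
            }
            // ... and the ⊥-module of the whole ontology for that signature.
            path.add(extractor.extract(signature));
        }
        return path; // nested: each module contains all previous ones
    }
}
```

Since the signatures of the cumulative subsets are nested, the extracted modules are nested as well, which is what makes the pairwise comparisons that follow well-defined.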
Given a path M1 ⊂ M2 ⊂ ... ⊂ Mn, we call P the set of all pairs (M1, M2) with M1 ⊂ M2. For each pair in P we look either at the change from CT(M1) to CT(M2) or at the change from a subsumption test ST^M1 to ST^M2. Every pair of measurements has a tendency, a magnitude and a degree of stability. The tendency easier (mean hardness change of less than -5%) denotes that a test is easier in the super-module (potentially pathological), harder (mean hardness change of more than 5%) denotes the reverse, and neutral means that the mean measurement does not change by more than 5%. High magnitudes are changes above 50%, medium magnitudes are changes between 5% and 50%, and low magnitudes are changes below 5%. An effect can have one of three degrees of stability: clear cut, high or low (see Section 4). Neutral cases have high stability if both sets of measurements have a coefficient of variation of less than 5%.

Fig. 2. Top 2 rows: histograms of variation (COV) by reasoner (FaCT++, HermiT, JFact, Pellet); top: OCT, bottom: SUMST. Bottom row: histogram of variation (COV) of ST measurements by reasoner. Mind the log scale.

From the module perspective, the main observation to be made here is that there are 39 cases in the set where the sub-module is harder than the super-module and 173 where there is no significant change in hardness (less than 5% change). Test time stability varies a lot across reasoners: while FaCT++ measurements are mostly stable, Pellet and JFact measurements vary a lot across almost all categories. The pathological cases described in Section 4, EHC and EHH, occur rarely. Out of the 1,507,654 tests that got easier overall, only 399,644 (26%) are of high magnitude. Out of those, 376,078 are clear cut and 8,850 of high stability. Of the clear-cut cases, 59,656 are potentially unaffected by non-determinism, out of which 204 are harder than 100 ms. Of the highly stable cases, only 302 are potentially unaffected by non-determinism, out of which only 10 are harder than 100 ms. None of the tests in either group is harder than a second.

7 Discussion

We quantify the impact of subsumption test hardness (RQ1) on classification time in two ways: (1) the contribution of the measured test times to the OCT and (2) the ratio of the number of ontologies with tests to those without. Only a few of the 136 ontologies with tests were dominated by test hardness: only 1 ontology had more than a 50% contribution of the summed subsumption test times (SST) for HermiT, 7 for Pellet, 19 for FaCT++ and 23 for JFact. However, there are cases where the contribution is very high. The ratio of ontologies entirely without tests is very high: FaCT++ 52%-71%, HermiT 55%-80%, JFact 56%-76% and Pellet 64%-84%. We have established only the lower bound; the upper bound covers the very unlikely possibility that the failed classifications might all be without tests.
Additionally, 36 of the ontologies entirely without tests have fewer than 100 TBox axioms (12 have fewer than 10).

Fig. 3. Hardness classes by reasoner (FaCT++, HermiT, JFact, Pellet). Top row: OCT; bottom four: SST. Bin labels on the x-axis: 1st letter: tendency (easier, neutral, harder); 2nd: magnitude (low, medium, high); 3rd: stability (clear cut, high, low). Y-axis: number of comparisons.

RQ2 is quantified by the ratio of positive to negative tests. Positive tests account for between 0.12% (Pellet) and 2.61% (FaCT++) of the overall number of tests (JFact 2.49%, HermiT 2.12%). This low ratio is not surprising, given that the worst case N² is dominated by far by non-subsumptions. As a side observation, current traversal algorithms appear highly efficient: only 3 ontology-reasoner pairs (two distinct ontologies, small TBoxes) trigger more than 10% of the worst-case N² number of subsumption tests, and 50 pairs (30 unique ontologies) trigger more than 1% of the worst case. This result, however, is only indicative of the efficiency, as we do not guarantee to measure all tests.

The distribution of test hardness as shown in Figure 1 tends towards easy tests (RQ3). Figure 1 shows that the really hard tests are in the minority: only 346 out of the 2,671,896 tests measured overall are harder than a second. This result may emphasise the importance of test avoidance over further optimising individual subsumption tests. However, as there are individual tests that can make up to 25% of the overall reasoning time, subsumption test hardness cannot be disregarded.

The variation of the measurements, both for individual tests and overall times, is, at least in its magnitude, surprising (RQ4). While the variation of test times could be so high merely because of the low number of measurements, which are very vulnerable to experimental error (for example an unaccounted-for background process kicking in, stochasticity in the garbage collection, or room temperature), we cannot claim the same for the variation in the numbers of triggered tests. That 68% of the modules in the sample vary in the number of tests is a very strong indicator of the stochasticity of the classification process (at least in this particular sample), be it due to random effects in the programming language or deliberate randomness induced by the implementation. This poses a serious threat to single-run benchmarking, which is still general practice in the DL community. A small indication of the potential impact of a particular programming environment is the very low average variation in the test times collected for FaCT++, which is the only reasoner in the set implemented in C++. In the inter-module comparison we learned that our pathological cases rarely happen and that, if they do, the effect they might have on overall classification time is negligible, given the potential size of the effect and the rarity with which such cases occur.
Furthermore, the strong evidence of stochasticity in the classification process makes it unclear whether the effect might not simply be due to non-determinism. Despite having detected some cases that are clearly counter-intuitive (in the sense of a test getting easier when irrelevant axioms are added), we cannot be sure whether modularity is the cause, because of the small effect size (RQ6). On top of that, easier and harder tests almost balance each other out. Given our sample bias, our results are not conclusive.

8 Conclusions and Future Work

In this paper we have presented a procedure for the reliable and reproducible isolation of counter-intuitive reasoning behaviour on subsumption tests during classification, and we have presented some such isolated cases. Future work includes completing the full characterisation of the corpus with respect to the pathological cases and then investigating their causal basis. The most likely explanation is that the additional axioms trigger a cheaper choice in the complex non-deterministic algorithms. The big challenge is whether any progress can be made in a fairly reasoner-independent way. One idea is to extract the justifications for a given entailment and see whether they are disproportionately difficult individually. This would suggest that the additional information is directing the algorithm toward “easier” reasoning.

References

1. F. Baader, B. Hollunder, B. Nebel, H.-J. Profitlich, and E. Franconi. An empirical analysis of optimization techniques for terminological representation systems. Appl. Intell., 4(2):109–132, 1994.
2. K. Dentler, R. Cornet, A. ten Teije, and N. de Keizer. Comparison of reasoners for large ontologies in the OWL 2 EL profile. Semantic Web, 2(2):71–87, 2011.
3. B. Glimm, I. Horrocks, and B. Motik. Optimized Description Logic Reasoning via Core Blocking. In IJCAR 2010, pages 457–471, 2010.
4. B. Glimm, I. Horrocks, B. Motik, R. Shearer, and G. Stoilos. A novel approach to ontology classification. J. Web Sem., 14:84–101, 2012.
5. B. Glimm, I. Horrocks, B. Motik, G. Stoilos, and Z. Wang. HermiT: An OWL 2 Reasoner. J. Autom. Reasoning, 53(3):245–269, 2014.
6. R. S. Gonçalves, S. Bail, E. Jiménez-Ruiz, N. Matentzoglu, B. Parsia, B. Glimm, and Y. Kazakov. OWL Reasoner Evaluation (ORE) Workshop 2013 Results: Short Report. In ORE 2013, pages 1–18, 2013.
7. R. S. Gonçalves, N. Matentzoglu, B. Parsia, and U. Sattler. The Empirical Robustness of Description Logic Classification. In ISWC 2013, pages 277–280, 2013.
8. R. S. Gonçalves, B. Parsia, and U. Sattler. Performance Heterogeneity and Approximate Reasoning in Description Logic Ontologies. In ISWC 2012, pages 82–98, 2012.
9. B. C. Grau, C. Halaschek-Wiener, and Y. Kazakov. History Matters: Incremental Ontology Reasoning Using Modules. In ISWC 2007, pages 183–196, 2007.
10. B. C. Grau, I. Horrocks, Y. Kazakov, and U. Sattler. Modular Reuse of Ontologies: Theory and Practice. J. Artif. Intell. Res. (JAIR), 31:273–318, 2008.
11. I. R. Horrocks. Optimising Tableaux Decision Procedures for Description Logics. PhD thesis, University of Manchester, 1997.
12. Y. Kazakov. RIQ and SROIQ Are Harder than SHOIQ. In KR 2008, pages 274–284, 2008.
13. N. Matentzoglu, B. Parsia, and U. Sattler. An Empirical Investigation of Difficulty of Subsets of Description Logic Ontologies. In DL 2014, pages 659–670, 2014.
14. N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L. Rubin, M.-A. D. Storey, C. G. Chute, and M. A. Musen.
BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research, 37(Web-Server-Issue):170–173, 2009.
15. A. A. Romero, B. C. Grau, and I. Horrocks. MORe: Modular Combination of OWL Reasoners for Ontology Classification. In ISWC 2012, pages 1–16, 2012.
16. V. Sazonau, U. Sattler, and G. Brown. Predicting Performance of OWL Reasoners: Locally or Globally? In KR 2014, 2014.
17. R. Shearer and I. Horrocks. Exploiting Partial Information in Taxonomy Construction. In ISWC 2009, pages 569–584, 2009.
18. E. Sirin, B. Parsia, B. C. Grau, A. Kalyanpur, and Y. Katz. Pellet: A practical OWL-DL reasoner. J. Web Sem., 5(2):51–53, 2007.
19. A. Steigmiller, T. Liebig, and B. Glimm. Konclude: System description. J. Web Sem., 27:78–85, 2014.
20. D. Tsarkov and I. Horrocks. FaCT++ Description Logic Reasoner: System Description. In IJCAR 2006, pages 292–297, 2006.
21. D. Tsarkov and I. Palmisano. Chainsaw: a Metareasoner for Large Ontologies. In ORE 2012, 2012.