        Predicting OWL Reasoners: Locally or Globally?

                      Viachaslau Sazonau, Uli Sattler, and Gavin Brown

                              The University of Manchester
                         Oxford Road, Manchester, M13 9PL, UK
                {sazonauv,sattler,gbrown}@cs.manchester.ac.uk




          Abstract. We propose a novel approach for performance prediction of OWL
          reasoners. It selects suitable, small ontology subsets, and then extrapolates the rea-
          soner’s performance on them to the whole ontology. We investigate intercorrela-
          tion of ontology features using PCA. Finally, we discuss various error measures
          for performance prediction and compare our approach against an existing one
          using these measures.



1      Introduction

Although there exist many reasoners, there is no single reasoner that performs best
on all given inputs. The same ontology may take just several seconds to reason over
for some reasoners and require hours to process for others [5]. Recent results of the
competition between 14 reasoners taking place at the OWL Reasoner Evaluation Work-
shop, ORE 2013,1 show that there is no single winner in all ontology categories. The
huge difference in performance of reasoners on a given input often surprises ontology
designers. During ontology design and maintenance, reasoning performance may vary
significantly and in surprising ways: axiom additions or removals may cause a signif-
icant increase or decrease of classification time, often in contrast to intuitive expecta-
tions [5]. Recent studies of performance variability suggest that there are ontologies
which are performance homogeneous and others which are performance heterogeneous for
a reasoner [5]. An ontology O is performance heterogeneous if the classification time of
O′ ⊂ O is not linearly correlated with its relative size. The latter suggests that inter-
action effects come into play when certain axioms are part of the ontology and, conse-
quently, lose their impact on the reasoner’s performance when those axioms are removed.
    These observations make performance prediction of a reasoner on a given input
challenging and often impossible even for experienced and skilled ontology engineers.
However, the ability to predict performance is attractive. Firstly, reasoner developers
equipped with a predictor would be able to find out which ontologies are harder for
a reasoner. This can speed up tests of reasoners and facilitate new optimizations. Sec-
ondly, an estimate of classification time might allow an ontology engineer to decide
whether the reasoning task is worth waiting for or not. As a practical example, the
progress bar of an ontology editor, such as Protégé 4,2 could be made more reliable
 1
     http://ore2013.cs.manchester.ac.uk/competition/
 2
     http://protege.stanford.edu/
with a good performance predictor. It could also support automatically picking the fastest
reasoner for a given ontology from the list of available reasoners.
    The contributions of this work are three-fold. Firstly, we suggest an approach to
analyse intercorrelation of ontology features via Principal Component Analysis, PCA,
given a corpus of ontologies, and observe that ontology features are highly intercorre-
lated, so that 57 features can be “faithfully” replaced by one or two features. Secondly,
we introduce a new method for performance prediction which is competitive with exist-
ing performance predictors. Thirdly, we discuss different prediction accuracy measures.

2      Preliminaries
We introduce the machine learning concepts that we will refer to [1]. Assume there is
a finite collection of objects (here ontologies). Each object from the collection has its
own label which is not necessarily unique. Hence, the objects can be separated into
different categories, or classes, based on their labels. Then, a new object of the same
type appears and the task is to predict its label, or to classify it into an appropriate cat-
egory. This is called the machine learning classification problem. In order to make a
prediction for an object, some information has to be extracted to describe its proper-
ties. It is commonly represented as a multidimensional vector. The vector variables are
called features and the vector is called a feature vector. Feature vectors with known
and presumably unknown labels are called training and testing examples, respectively.
A set of N feature vectors of dimensionality d is commonly written as the N × d
matrix X = [x_1, . . . , x_N]^T, where x_i = (x_{i1}, x_{i2}, . . . , x_{id}), i = 1 . . . N, is the i-th
feature vector in the set and T stands for matrix transposition. Let y_i, i = 1 . . . N, be the
label of x_i and let y be the vector of all labels. Thus, the data set is [X | y^T].
     Training examples with labels are used to construct a machine learning model. This
process is called learning, or training through “supervision”. The set of related meth-
ods is called supervised machine learning. In essence, a model is a data structure, for
instance, a graph or a probability distribution, which is built from the data set. Once the
model is trained, it can be used to predict a label of an unknown object based on its fea-
tures. More specifically, the model implicitly compares the unknown object with known
objects, identifies similar known objects, and uses their labels to make a prediction.
     There exist recent attempts to apply supervised machine learning to performance
prediction of a reasoner on a given ontology. In [10], the authors define 27 ontol-
ogy features, or metrics, of 4 types: ontology level (ONT), class hierarchy level (CH),
anonymous class expressions (ANE) and properties (PRO). A feature vector x_i is ex-
tracted from each ontology O_i, i = 1, ..., N, in the corpus. It is assigned a label
y_i = b(O_i, R), i = 1 . . . N, which is the performance bin of O_i given a reasoner
R. Kang et al. attempt to predict the performance bin ŷ = b̂(O, R) for an ontology O
based on its feature vector x, or “profile”.3 To do so, they train a model using feature
vectors X of all training ontologies in the corpus with their labels y = (y_1, y_2, ..., y_N)
per reasoner, i.e. the “global” corpus view [X | y^T]. We call such a technique a global
approach. For basics of predicting algorithm runtime, we refer the interested reader to
a survey of machine learning methods for three NP-complete problems [8].
 3
     Not to be confused with OWL profiles such as OWL 2 EL
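
To make the global approach concrete, the following sketch trains a classifier on the corpus-level data set [X | y] and predicts the bin of an unseen ontology. It uses scikit-learn and a k-NN classifier (the model we also use in Section 8); the file names and the choice of k are illustrative assumptions, not taken from [10].

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical precomputed data for one reasoner R:
# X: N x d matrix of ontology feature vectors, y: performance bin per ontology
X = np.load("bioportal_features.npy")
y = np.load("bioportal_bins_reasoner_R.npy")

# "Global" view: a single model trained on the whole corpus [X | y]
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

# Predict the performance bin of an unseen ontology from its feature vector x
x_new = np.load("new_ontology_features.npy")
predicted_bin = model.predict(x_new.reshape(1, -1))[0]
print(predicted_bin)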
3    Feature Analysis and Dimensionality Reduction
As we will see, it is considerably difficult, or even impossible, to define the “ideal” set
of ontology features for performance prediction of reasoners. Nonetheless, we need to
take features of different scales into account, e.g. entropy of the graph, EOG, and size of
vocabulary, SOV,4 used by Kang et al. Competing approaches exist in machine learning
to address this problem [18].
    We skip some features from Kang et al. and add our own features [15]. Features
based on the inferred class hierarchy are not included because this requires classification
and, thus, makes little sense for performance prediction. Overall, we consider d = 57
ontology features including 33 new features. Intuitively, more features should allow us
to capture more information about the object. However, it turns out that a smaller set
of features can produce a more accurate model in practice [6]. It can be achieved either
by eliminating weakly useful features from the original set, hence, selecting features,
or by transforming the features into a lower dimensional space with minimal loss of
information, known as constructing features [6].
    One simple feature selection approach is based on measuring Pearson’s
correlation coefficient [6] between labels and features in the data set. However, in order
to understand correlations between subsets of features, one would have to consider the
correlation coefficients for all subsets, which is clearly infeasible. A common feature
construction method is Principal Component Analysis, PCA [9]. PCA projects feature
vectors of the data set into a new coordinate system, maximizing the variances in the
projection space. In summary, PCA helps to identify the presence of intercorrelation of
features and provides a new, smaller set of uncorrelated features.
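
As a brief illustration of how PCA is used for this purpose, the following sketch fits PCA to a corpus feature matrix and inspects how much of the total variance the leading components capture. The file name is a placeholder, and whether (and how) the features are rescaled beforehand is a separate design choice, cf. [18].

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical N x 57 matrix of ontology feature vectors for a corpus
X = np.load("bioportal_features.npy")

pca = PCA()
pca.fit(X)

# Variance lambda_j captured by each principal component PC_j, and the
# cumulative fraction of the total variance explained by the leading PCs
print(pca.explained_variance_[:3])
print(np.cumsum(pca.explained_variance_ratio_)[:3])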


4    Experimental Setup
In this work, we investigate three commonly used reasoners: Hermit 1.3.8 [14], Pellet
2.3.0 [16], and JFact 1.0.0, a Java version of FaCT++ [17]. We have chosen the NCBO
BioPortal5 as an ontology corpus for experiments since it is a natural set of ontologies
primarily maintained by biomedical domain experts. In general, a corpus of ontologies
should be selected carefully because a significant bias towards easy ontologies may
cause misleading generalizations. Although dealing with such bias is usually difficult
due to the lack of hard ontologies, we need to take this into account.
     We exclude empty ontologies, duplicates, multiple versions of the same ontology,
and ontologies with loading errors. As a result, we obtain the BioPortal corpus of 357
ontologies of various sizes. In order to confirm the results of PCA experiments (and only
for this purpose), we examine an additional corpus, the Tones Ontology Repository.6 We
filter it via exactly the same procedure. As a result, we obtain the Tones corpus of
196 ontologies. We measure the classification time of each reasoner on each ontology
in BioPortal. We run a reasoner 20 times on each ontology and record the average
execution time for this ontology. The ontology loading time is not included. A timeout
 4
   http://www.csse.monash.edu/∼yli/metrics perf/
 5
   http://bioportal.bioontology.org/
 6
   http://owl.cs.manchester.ac.uk/repository/
of 1,000 seconds is used for all reasoners. All experiments are executed on the same
machine. The operating system is Windows 7, 64-bit. CPU is Intel Core i3-2330M, 2
cores, 2.2 GHz. RAM is DDR2, 4 GB. All algorithms are implemented in Java, JDK
1.7, using OWL API 3.4.3.
    In addition to the performance bins specified by Kang et al., we add one more bin E
in order to distinguish the hardest ontologies with classification times longer than 1,000
seconds. We also include all ontologies with a classification time of less than one second
in bin A. Thus, we consider the following 5 bins: A ∈ (0s, 1s), B ∈ [1s, 10s), C ∈ [10s, 100s),
D ∈ [100s, 1,000s), E ∈ [1,000s, +∞). The performance statistics on BioPortal are shown
in Table 1, where the “Exc” column means that a reasoner raises an exception on those
ontologies and does not finish its job properly.

  R        A     B    C    D    E    Exc
  JFact    250   50   16   11   23   7
  Hermit   222   53   21   13   16   32
  Pellet   247   48   21   11   28   2

 Table 1: Performance statistics on BioPortal
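
For illustration, a small helper that maps a measured classification time (or a timeout/exception flag) to these bins could look as follows; the function name and signature are ours, not part of the experimental framework.

def performance_bin(ct_seconds, exception=False, timeout=1000.0):
    """Map a classification time in seconds to bin A-E, or 'Exc' on a reasoner exception."""
    if exception:
        return "Exc"
    if ct_seconds >= timeout:
        return "E"   # E = [1,000s, +inf), i.e. timed out
    if ct_seconds < 1:
        return "A"   # A = (0s, 1s)
    if ct_seconds < 10:
        return "B"   # B = [1s, 10s)
    if ct_seconds < 100:
        return "C"   # C = [10s, 100s)
    return "D"       # D = [100s, 1,000s)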


5   Intercorrelation of Ontology Features and Ontology Projections

Given a data set [X | y^T], PCA deals only with the set of feature vectors X and ignores
the labels y. It produces a set of PCs, PC_j, with respective variances λ_j. We apply PCA
to BioPortal and obtain a set of PCs, PC_j^B. To check our findings, we also apply PCA
to Tones, which induces another set of PCs, PC_j^T, and compare the results. We use
PC_j^X to denote both PC_j^B and PC_j^T. As a result, we find that the first 3 out of 57
components explain ≈ 99% of the total variance for both data sets. Hence, we can hypothesize
that the original 57 features are highly intercorrelated. Another important observation is
that PC_1^B explains ≈ 87% and PC_1^T explains ≈ 96% of the total variance for BioPortal
and Tones, respectively.
     An additional advantage of PCA is that it allows us to visualise a data set in a
lower dimensional space. Since just a few components have significant variances, we can
reasonably visualise the corpora in the two-dimensional space (PC_1^X, PC_2^X) and explore
their spread in the projection space. This way, we preserve ≈ 94% of the total variance for
BioPortal and ≈ 98% of the total variance for Tones. As Figures 1a, 1b show, ontologies of
both corpora are mainly spread along the first coordinate, with smaller deviations along
the second coordinate and even smaller along the others.
     BioPortal has more outliers, i.e. ontologies which differ from the main cluster of
similar ontologies. There are 9 evident outliers out of 357 ontologies. They include 9 of
the 10 biggest ontologies in the corpus. It is interesting to examine the classification times of
these outliers for given reasoners, see Table 2. We use unique identifiers of ontologies
for Figure 1a and Table 2. We can see from Tables 1 and 2 that the outliers are clearly
among the computationally hardest ontologies for all three reasoners. It is important to
note that these outliers have been identified without any performance measurements.
None of these outliers are the heterogeneous ontologies from [5].
     In addition, observing such an efficient representation of the total variance by PC_1^X
in Figures 1a, 1b, a reasonable question is how the new features PC_1^B, PC_1^T are
 [Fig. 1: Comparing BioPortal (B) and Tones (T) using PCA, X ∈ {B, T}. (a) BioPortal: ontology
 projections onto (PC_1^B, PC_2^B); (b) Tones: ontology projections onto (PC_1^T, PC_2^T);
 (c) contribution coefficients of the original features to PC_1^X; (d) correlation between each
 PC_i^X and ontology size (correlation ≈ 0 for PC_16^X – PC_57^X).]

related to the original features. We examine the (linear) coefficients with which each of
the original features contributes to PC_1^X, see Figure 1c. These coefficients are the respective
components of the eigenvector with the largest eigenvalue. Figure 1c shows that
there are only 6 original features, see Table 3, that contribute significantly to PC_1^X for
both corpora (see [15] for descriptions of the features). The two most contributing features
are #ClassNameOccurrences and #SubClassOfAxioms. The first one counts all occurrences
of named classes, regardless of whether several occurrences refer to the same class name.
Table 3 shows that #ClassNameOccurrences contributes to
the first component much more than #ClassNames, which is the number of (unique)
class names. The feature #SubClassOfAxioms counts the number of axioms that explicitly
encode subsumption relationships between classes in the ontology. These features
are anticipated to be relevant for ontology classification performance. What is not expected,
however, is that these 6 features are the only significantly contributing features, with similar
values for both corpora. This implies that PC_1^B is approximately equal to PC_1^T.
     Exploring the feature vectors, we observe that these 6 identified features significantly
correlate with ontology size.7 Hence, one may ask how PC_1^X is affected by it.
To measure this, we calculate the correlation coefficient between the first component
and ontology size. It turns out that the first component correlates surprisingly well with
 7
     In the following, ontology size is the number of axioms
  ID  Acr.    Size, #axioms   DL     CT, sec (JFa / Her / Pel)
  1   NCBI    847,755         AL     t / t / t
  2   GAZ     652,361         ALE+   206 / t / 318
  3   CCO     624,708         SRI    t / 99 / t
  4   GEXO    549,132         SRI    t / t / t
  5   Biom.   660,188         SRIF   t / t / t
  6   REXO    463,799         SRI    t / t / t
  7   RH-M.   403,210         AL     t / 53 / 26
  8   RETO    415,551         SRI    t / t / t
  9   CPO     379,734         SR     t / t / t

 Table 2: BioPortal outliers revealed by PCA and their classification times (timeout t=1,000 s;
 DL by OWL API)

  ID  Feature                          Contrib. B   Contrib. T
  1   #ClassNameOccurrences            0.80         0.85
  24  #SubClassOfAxioms                0.38         0.37
  4   #ClassNames                      0.21         0.25
  2   #ObjectPropertyNameOccurrences   0.24         0.15
  57  parsingDepth                     0.24         0.15
  7   #ObjectSomeValuesFrom            0.23         0.15

 Table 3: Six features with the highest contributions to PC_1^X, X ∈ {B, T}

ontology size and no other components do, see Figure 1d. As a consequence, one can
ask: is ontology size a good performance indicator? Can we predict performance better
with 57 features or with size alone? We will answer these questions below.
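
The quantities behind Figures 1c and 1d can be computed along the following lines (an illustrative sketch only: the file names are placeholders and `sizes` is assumed to hold the number of axioms of each ontology).

import numpy as np
from sklearn.decomposition import PCA

X = np.load("bioportal_features.npy")    # hypothetical N x 57 feature matrix
sizes = np.load("bioportal_sizes.npy")   # hypothetical vector of ontology sizes (#axioms)

pca = PCA()
Z = pca.fit_transform(X)                 # Z[:, i] = projection of each ontology onto PC_{i+1}

# Contribution coefficients of the original features to PC_1: the components
# of the eigenvector with the largest eigenvalue (cf. Figure 1c and Table 3)
loadings_pc1 = pca.components_[0]
top6 = np.argsort(np.abs(loadings_pc1))[::-1][:6]
print(list(zip(top6, loadings_pc1[top6])))

# Correlation between each PC_i and ontology size (cf. Figure 1d)
corr_with_size = [np.corrcoef(Z[:, i], sizes)[0, 1] for i in range(Z.shape[1])]
print(corr_with_size[:3])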


6   A Local Approach to Performance Prediction

Our “local” approach predicts performance for the whole ontology, given the data gath-
ered from its samples. Each sample is an ontology subset, ideally small, and comes with
a classification time and a size. We call such a technique a local approach. A reasonable
way to construct samples is to represent an ontology as a directed graph G = (V, E).
Each vertex v ∈ V is labelled with some set of axioms, called a clump. Edges E repre-
sent some relationships between clumps. From such a representation we can assemble
sets of clumps in many possible ways and use these sets as samples for performance
prediction.
     We introduce growing-based sampling. Given G = (V, E), the process of sampling
begins from a single atom. A next sample is constructed by extending the current sample
with a new atom, see Algorithm 1. A sensible choice of the next atom can cause
significant increases in the classification time of a sample. Obviously, the quality of
the subsequent performance prediction depends on our notion of a clump, the relations between
clumps, and the assembling mechanism used. For example, a trivial way is to take small
arbitrary ontology subsets as samples, without any relations between them. Yet, such samples
are not suitable performance indicators.
     A more informed way to define clumps is Atomic Decomposition, AD [3]. In this
case, clumps are atoms a ∈ AD_O, which are pairwise disjoint sets of axioms whose
disjoint union is the ontology, O = ⋃_{a ∈ AD_O} a. Then, an ontology is represented as
a directed acyclic graph G = (V, E), where V = {v_a | a ∈ AD_O} and (v_a, v_b) ∈ E if
a ≺ b, with ≺ the dependency relation from [3]. For v ∈ V we use at(v) ⊆ O to denote
the axioms in the atom associated with v, i.e. at(v_a) = a.
     Algorithm 1 produces a set of samples with their performance measurements for
an ontology. The relative size limits ⟨c_j⟩^n_{j=1} determine the number of samples. New
Algorithm 1 Growing-based sampling
 1: Input: G = (V, E) a graph-based representation of an ontology O, R a reasoner, ⟨c_j⟩^n_{j=1}
    relative size limits in increasing order, m an optional parameter
 2: Output: Ω a list of n samples with time measurements by R
 3: S ← ∅; V ← ∅; H ← ∅; v ← random(V)
 4: for each j from 1 to n do
 5:    while |S|/|O| < c_j do
 6:       V ← V ∪ {v}; S ← S ∪ at(v)
 7:       append E(v) \ (H ∪ V) to H; v ← nextClump(V, H?, m?)
 8:    end while
 9:    CT(S, R) ← measure(S, R)
10:    append (S, CT(S, R)) to Ω
11: end for
12: return Ω

atoms are added to a sample S until a size limit cj is exceeded. At that moment, the
classification time CT (S, R) is measured. After that, growing of the sample continues
without performance measurements until the next size limit cj+1 is exceeded.
    The subroutine nextClump(V, H?, m?) specifies the behaviour of Algorithm 1. It
has one mandatory, V , and two optional, H and m, input arguments. A version of the
subroutine nextClump() is determined by these input parameters. We use three different
strategies of growing-based sampling [15] (see the sketch after this list):
  1. nextClump(V ): select randomly a new atom from the remaining set of atoms;
  2. nextClump(V, H): select randomly an atom that is directly related to an atom in V ;
  3. nextClump(V, H, m): select randomly an atom that is directly related to an atom in
V amongst m most recently added atoms.
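
The sketch below mirrors one run of Algorithm 1 with the second strategy, nextClump(V, H). It assumes the atomic decomposition is given as a dictionary `atoms` mapping an atom id to its set of axioms and a dictionary `edges` giving, for each atom, the directly related atoms; `measure(S, R)` stands for classifying the axiom set S with reasoner R and returning the wall-clock time. All names are illustrative, not part of our implementation.

import random

def growing_based_sampling(atoms, edges, ontology_size, measure, reasoner, size_limits):
    """One run of growing-based sampling over the atom graph (cf. Algorithm 1)."""
    samples = []                     # Omega: list of (sample size, classification time)
    S = set()                        # axioms of the current sample
    visited = set()                  # atoms already added to the sample (V)
    frontier = []                    # atoms directly related to the sample (H)
    v = random.choice(list(atoms))   # start from a random atom

    for c in size_limits:            # relative size limits c_1 < ... < c_n
        while len(S) / ontology_size < c:
            visited.add(v)
            S |= atoms[v]
            # extend the frontier with the yet unseen neighbours of v
            frontier.extend(u for u in edges.get(v, ())
                            if u not in visited and u not in frontier)
            if frontier:
                # strategy 2: pick a random atom directly related to the sample
                v = frontier.pop(random.randrange(len(frontier)))
            else:
                # graph exhausted from here; fall back to any remaining atom
                remaining = [u for u in atoms if u not in visited]
                if not remaining:
                    break
                v = random.choice(remaining)
        ct = measure(S, reasoner)    # classification time CT(S, R) of the sample
        samples.append((len(S), ct))
    return samples

Running this function r times and concatenating the results yields the suite of samples used below.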
    We repeat Algorithm 1 r times and gather all q = rn samples in a single suite
Ω = ⟨S_j, y_j⟩^q_{j=1} for an ontology O, where y_j = CT(S_j, R), j = 1 . . . q. This local
data is then used to make predictions for the whole ontology via regression analysis.
Based on the results of our feature analysis, we assume that the regression function
f can be a simple polynomial of x = |S|:

                                 f(x, β) = β_0 + Σ_{l=1}^{L} β_l x^l.                        (1)



    Popular methods for estimating the unknown parameters β are least squares and
non-linear least squares [12]. Given a performance prediction task, we can assign β0 =
0 because f (0, β) = 0, i.e. the classification time of an empty ontology is zero. After
that, once the regression function f is fitted to the local data by estimating remaining
parameters β, it can be used to predict the performance via extrapolation ŷ = f (|O|, β),
given the size |O| of an ontology O.
    Moreover, we suggest estimating the polynomial degree L for each ontology sep-
arately by evaluating how well the model fits the samples. We choose as the model the
polynomial of degree L_min that has the minimal residual sum of squares [7], RSS, once the
parameters β (β_0 = 0) are fitted to the samples:
       L_min = argmin_L RSS(L),  where  RSS(L) = Σ_{j=1}^{q} ( y_j − Σ_{l=1}^{L} β_l x_j^l )².   (2)
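
A sketch of this fitting and extrapolation step: the polynomial is forced through the origin (β_0 = 0), the degree L with the smallest RSS is kept, and the fitted model is evaluated at the full ontology size. It uses ordinary least squares from numpy; the function name is ours.

import numpy as np

def predict_classification_time(sample_sizes, sample_times, ontology_size, max_degree=2):
    """Fit f(x) = sum_{l=1..L} beta_l * x^l to the samples and extrapolate to |O|."""
    x = np.asarray(sample_sizes, dtype=float)
    y = np.asarray(sample_times, dtype=float)

    best = None                                      # (RSS, L, beta)
    for L in range(1, max_degree + 1):
        # design matrix without a constant column, so that f(0, beta) = 0
        A = np.column_stack([x ** l for l in range(1, L + 1)])
        beta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
        rss = float(np.sum((y - A @ beta) ** 2))     # RSS(L) as in (2)
        if best is None or rss < best[0]:
            best = (rss, L, beta)

    _, L_min, beta = best
    # extrapolate: y_hat = f(|O|, beta)
    return sum(b * ontology_size ** l for l, b in enumerate(beta, start=1))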


    Thus, where a global approach gathers data from many ontologies and interpolates
for unseen ontologies, a local approach collects data only from part of a given ontology
and extrapolates to its full size. In contrast to a global approach, it does not rely on the
global view of a corpus to exploit similarities/dissimilarities between ontologies and it
uses no features except sample sizes.


7   Prediction Accuracy Measures

We aim to test a global and local performance predictor and compare them. Conse-
quently, we need a procedure to evaluate a performance predictor. The first possibility
is based on performance bins b(Oi , R), i = 1 . . . M . The model predicts a bin b̂(Oi , R)
for each ontology from the testing set. Whenever it predicts a wrong bin for an ontol-
ogy, i.e. b̂(O_i, R) ≠ b(O_i, R), this is counted as an error. Once the model is tested on
all ontologies, the average classification error, ACE, is calculated:
                     ACE(R) = 1 − (1/M) Σ_{i=1}^{M} δ(b̂(O_i, R), b(O_i, R)),                 (3)

    where δ(b̂, b) is the Kronecker delta. The classification accuracy is used by Kang
et al. to evaluate a model. In fact, it equals 1 − ACE(R). However, ACE and, con-
sequently, the classification accuracy have their disadvantages. Firstly, they depend on
binning which may cause incorrect conclusions while assessing a model. Secondly, both
measures count all wrong predictions equally. For example, predicting bin 5 or bin 2
for actual bin 1 is equally wrong, regardless of the huge difference between the errors.
    Consequently, when predicting performance bins, one can consider a second way to
evaluate the model that is more precise than ACE: the confusion matrix [4]. The confusion
matrix is especially useful for evaluating the model on an unbalanced data set, i.e. if the
distribution of examples over labels is uneven. As Table 1 shows, our data set is, in
fact, significantly biased towards label A. In contrast to ACE, the confusion matrix allows us
to investigate the errors the model makes while predicting certain labels (performance bins
here). It essentially counts the number of ontologies of actual bin b which are classified
into bin b̂, for each possible pair (b, b̂) over {A, B, C, D, E}.
    A third way to assess a model is to measure the raw differences between the actual
and predicted classification time for each testing ontology. The mean absolute error,
MAE, is a common way to measure the differences between actual and predicted values:
                     MAE(R) = (1/M) Σ_{i=1}^{M} |ŷ(O_i, R) − CT(O_i, R)|,                     (4)
    where ŷ(Oi , R) is the predicted classification time of a reasoner R on an ontol-
ogy Oi , and CT (Oi , R) is the actual classification time. This measure avoids ontology
binning whose usefulness for evaluating performance predictors is an open question.
    However, MAE has its own disadvantage because the hardest ontologies, which are
rare, can contribute more to MAE than all remaining ontologies together. In this case,
MAE measures the model’s fitness to hard ontologies but not to the data set. This is even
worse for another common measure, the mean squared error (MSE), which amplifies
the differences. Thus, all aforementioned measures have their disadvantages, no single
measure is enough for predictor evaluation, and we should rather consider them all.
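
For completeness, the three measures discussed above can be computed as follows (an illustrative sketch; the bin order A–E is fixed in `BINS`).

import numpy as np

BINS = ["A", "B", "C", "D", "E"]

def ace(predicted_bins, actual_bins):
    """Average classification error, cf. (3)."""
    hits = sum(p == a for p, a in zip(predicted_bins, actual_bins))
    return 1.0 - hits / len(actual_bins)

def mae(predicted_times, actual_times):
    """Mean absolute error of the predicted classification times, cf. (4)."""
    diffs = np.asarray(predicted_times, dtype=float) - np.asarray(actual_times, dtype=float)
    return float(np.mean(np.abs(diffs)))

def confusion_matrix(predicted_bins, actual_bins, bins=BINS):
    """Rows correspond to the actual bin, columns to the predicted bin."""
    index = {b: i for i, b in enumerate(bins)}
    cm = np.zeros((len(bins), len(bins)), dtype=int)
    for p, a in zip(predicted_bins, actual_bins):
        cm[index[a], index[p]] += 1
    return cm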


8   Performance Prediction Experiments and Results

We use the k-NN classifier [2] as a global approach model. The choice of a machine
learning model is not critical here because we test the overall utility of global approaches
for reasoning performance prediction and plan to compare these results with the local
approach. The k-NN classifier is a reasonable choice here because, firstly, it allows us to
obtain sufficiently good predictions for many data sets. Secondly, it has only one input
parameter, k, which makes its tuning easy and the analysis of its predictions transparent.
Thirdly, it can be used to predict both performance bins and classification times.
     A common approach for evaluating a supervised machine learning model using the
same set of examples is cross-validation [11]. In order to obtain the best prediction
results, the parameter k needs to be tuned for each reasoner. Observing that our data set
consists of at most N = 357 examples, we apply leave-one-out cross-validation [11],
LOOCV. An optimal k is estimated via LOOCV on the first 80% of the data set, and the
best model with tuned k is evaluated on the remaining 20%. This procedure is repeated
1,000 times per reasoner, each time shuffling the data set randomly.
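
One repetition of this protocol could be sketched as follows with scikit-learn: k is tuned by LOOCV on a random 80% of the data set and the tuned model is scored on the remaining 20% (the candidate values of k are an illustrative assumption).

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

def evaluate_global_once(X, y, candidate_ks=(1, 3, 5, 7, 9, 11)):
    """Tune k via LOOCV on 80% of the data, report ACE on the held-out 20%."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

    # leave-one-out cross-validation accuracy for each candidate k
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train, y_train, cv=LeaveOneOut())
        if scores.mean() > best_acc:
            best_k, best_acc = k, scores.mean()

    model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return 1.0 - model.score(X_test, y_test)   # ACE = 1 - accuracy on the test part

Averaging the returned ACE over the 1,000 repetitions per reasoner gives the cross-validation generalization error reported below.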
     For the sake of simplicity, we denote the cross-validation generalization error,
which is either average ACE or average MAE over all best models, by ACE or MAE,
respectively. We conduct two performance prediction experiments for the global ap-
proach. In the first one, we extract all 57 features from each ontology in BioPortal and
train a model, Glob.57f, using the resulting data set. In the second one, we use ontology
size as the only feature to train a model, Glob.1f. The latter is based on our earlier
observations and PCA results. While constructing the confusion matrix, the parameter
k is tuned for each reasoner via LOOCV on the whole data set.
     For the local approach, we construct samples for an ontology via Algorithm 1
with the subroutine nextClump(V, H, m). This strategy produces the most performance-
indicative samples by building a specific path through the ontology graph [15]. Algo-
rithm 1 is executed r = 10 times per ontology with the following parameters: m = 1,
⟨c_j⟩^n_{j=1} = ⟨0.01, 0.02, ..., 0.10⟩ (n = 10). Hence, the algorithm constructs a suite of
q = 100 samples of relative size ≤ 10% for each ontology. Here, we consider the
following trade-off: higher size limits should give more accurate predictions but take
longer to be classified. We assume that f (x) is a polynomial of degree L ≤ 2, consider-
ing the observations in [5]. Prediction results of the global approach (Glob.57f, Glob.1f)
with respective error deviations and local approach (Local) are gathered in Table 4. We
do not report error deviations for a local approach because its predictions are determin-
istic. Ontologies that cause an exception for a reasoner, see Table 1, are excluded from
the data set for that reasoner.
  R     ACE                                            MAE, sec
        Glob.57f        Glob.1f         Local          Glob.57f   Glob.1f    Local
  JFa   0.230 (0.048)   0.211 (0.045)   0.169          61 (24)    69 (18)    43
  Her   0.231 (0.053)   0.237 (0.047)   0.224          70 (25)    89 (23)    42
  Pel   0.301 (0.048)   0.301 (0.048)   0.250          106 (31)   127 (27)   69

 Table 4: Comparison of global and local approaches on BioPortal (with error deviations)

  Real bin   Predicted: A     B          C        D       E
  A          245/242/232      4/7/16     0/0/1    0/0/0   1/1/1
  B          31/23/15         19/27/28   0/0/7    0/0/0   0/0/0
  C          2/0/0            9/9/0      3/2/11   0/0/3   2/5/2
  D          1/1/0            2/2/0      3/4/1    0/0/6   5/4/4
  E          2/2/1            5/4/1      4/5/3    0/3/4   12/9/14

 Table 5: Confusion matrix for JFact on BioPortal (Glob.57f/Glob.1f/Local)

     As Table 4 shows, the local approach delivers better prediction accuracy than the
global one in all cases, for both ACE and MAE. In addition, one can notice that Glob.1f
outperforms Glob.57f for JFact, shows the same error for Pellet, and slightly loses for
Hermit, if we compare via ACE. However, if we compare via MAE, Glob.57f slightly
outperforms Glob.1f for all reasoners. Despite that, Glob.1f has a more stable model
with respect to error deviations. Hence, extracting 57 ontology features makes little
sense, especially for bin predictions. One can notice that Pellet is harder to predict
than others on BioPortal for all approaches via either ACE or MAE. It is also worth
mentioning that the local approach chooses the polynomial of degree L = 1 as the best
prediction model in 72%, 78%, and 76% of cases for JFact, Hermit, and Pellet, respectively.
Hence, all reasoners are mostly linearly predictable. Table 5 shows the confusion matrix of
JFact predictions (the confusion matrices for Hermit and Pellet are similar). Glob.57f
and Glob.1f are slightly more accurate than the local approach on the easiest bin A, while
the latter is more accurate on all harder bins B, C, D, E, especially on bins C and D.
  R     ACE                                            MAE, sec
        Glob.57f        Glob.1f         Local          Glob.57f   Glob.1f    Local
  JFa   0.533 (0.298)   0.375 (0.112)   0.260          201 (70)   237 (56)   141
  Her   0.476 (0.245)   0.407 (0.096)   0.379          198 (68)   260 (53)   120
  Pel   0.506 (0.107)   0.565 (0.108)   0.481          337 (74)   323 (73)   227

 Table 6: Comparison of global and local approaches on BioPortal without easy ontologies
 (with error deviations)

  Real bin   Predicted: B   C        D       E
  B          46/50/43       4/0/7    0/0/0   0/0/0
  C          7/10/0         8/6/11   0/0/3   1/0/2
  D          4/3/0          2/2/1    0/0/6   5/6/4
  E          6/6/2          2/6/3    2/0/4   13/11/14

 Table 7: Confusion matrix for JFact on BioPortal without easy ontologies (Glob.57f/Glob.1f/Local)

    Considering the strong bias of the corpus, the question arises how performance pre-
dictors behave if easy ontologies are excluded. Therefore, for each reasoner, we remove
ontologies of bin A ∈ (0s, 1s) and repeat the same chain of experiments (including
training) on the remaining 100, 103, and 108 “hard” ontologies for JFact, Hermit, and Pellet,
respectively, see Table 6. Hard ontologies cause much higher performance prediction errors
for all approaches. Glob.1f shows surprisingly more accurate and stable predictions than
Glob.57f via ACE for JFact and Hermit but fails to do so for Pellet. In contrast, Glob.57f
outperforms Glob.1f via MAE for JFact and Hermit (though Glob.1f is more stable) but fails
to do so for Pellet. Finally, the local approach outperforms, sometimes significantly, the global one in
all cases, for both ACE and MAE. The confusion matrix in Table 7 shows that Glob.57f
and Glob.1f are again more accurate on the easiest bin B, while the local approach is
more accurate on harder bins C, D, E. We also observe that, if we exclude the hetero-
geneous ontologies found in [5] from the hard ontologies and repeat the performance
prediction experiments (including training), then Table 6 changes just slightly. Hence,
heterogeneous ontologies do not cause unusually high errors of predictions.


9   Discussion of Results and Future Work

To conclude, we have observed that performance prediction of reasoners is a hard and
interesting problem. In this work, we propose a novel approach, called local, to solve
it. It is based on carefully selecting ontology samples, ideally small, and then using
them to predict reasoning performance on the full ontology. The AD, as an ontology
graph, is useful to construct such samples: samples of sizes ≤ 10% allow us to obtain
good predictions via simple extrapolations. In comparison to a global approach, a local
approach gives more accurate performance predictions, does not rely on an ontology
corpus and, therefore, cannot be biased by it. Moreover, a local approach reveals in-
teresting information about reasoners’ behaviour: linear/nonlinear predictability on the
corpus. However, it requires reasoning over samples to make a prediction.
     Although selecting an unbiased corpus remains a difficult problem [13], we defi-
nitely need to acknowledge the bias and interpret observations correctly. Moreover, the
error measures for evaluating predictors should be chosen carefully because there is no
single best measure. A reasonable way is to use several measures evaluating different
aspects of prediction accuracy.
     One of our main findings is that, for both BioPortal and Tones, multiple ontology
features are efficiently represented by ontology size. Predictions using ontology size
alone are comparable to predictions using all 57 ontology features. Therefore, ontology
features for performance prediction should be defined and used cautiously. Firstly, we
should not mix features of different scales in the feature vector because low-scale fea-
tures can be “hidden” by others. If such features are informative, suitable
techniques that are able to handle different scales must be used. Secondly, we have to investi-
gate intercorrelation of features because the features may turn out to carry little useful
information and even degrade the accuracy.
     As future work, we consider local approach improvements. The maximal sample
size can be estimated for each ontology separately because 10% samples are not in-
dicative enough to make good predictions for some ontologies. In addition, there may
exist more suitable regression models for extrapolation. One can also study appropriate
methods to cope with ontology features of different scales with respect to performance
prediction, e.g. ensemble learning. Finally, we can work on assembling “good” corpora
using dimensionality reduction techniques such as PCA.
References
 1. Alpaydin, E.: Introduction to Machine Learning. MIT, 2nd edn. (2010)
 2. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information
    Theory 13(1), 21–27 (1967)
 3. Del Vescovo, C.: The modular structure of an ontology: Atomic Decomposition and its ap-
    plications. Ph.D. thesis, School of Computer Science, University of Manchester (2013)
 4. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861 – 874
    (2006)
 5. Gonçalves, R.S., Parsia, B., Sattler, U.: Performance heterogeneity and approximate reason-
    ing in Description Logic ontologies. In: Proc. of ISWC-12. LNCS, vol. 7649, pp. 82–98
    (2012)
 6. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. of Mach. Learn-
    ing Res. 3, 1157–1182 (Mar 2003)
 7. Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning, vol. 1.
    Springer, New York (2001)
 8. Hutter, F., Xu, L., Hoos, H.H., Leyton-Brown, K.: Algorithm runtime prediction: methods &
    evaluation. Artificial Intelligence 206, 79–111 (2014)
 9. Jolliffe, I.T.: Principal component analysis. Springer, second edn. (Oct 2002)
10. Kang, Y.B., Li, Y.F., Krishnaswamy, S.: Predicting reasoning performance using ontology
    metrics. In: Proc. of ISWC-12. LNCS, vol. 7649, pp. 198–214 (2012)
11. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model
    selection. In: Proc. of IJCAI-95. pp. 1137–1143. Morgan Kaufmann (1995)
12. Lawson, C.L., Hanson, R.J.: Solving least squares problems. 3 edn. (1995)
13. Matentzoglu, N., Bail, S., Parsia, B.: A corpus of OWL DL ontologies. In: Eiter, T., Glimm,
    B., Kazakov, Y., Krötzsch, M. (eds.) Description Logics. CEUR Workshop Proceedings, vol.
    1014, pp. 829–841. CEUR-WS.org (2013)
14. Motik, B., Shearer, R., Horrocks, I.: A hypertableau calculus for SHIQ. In: Proc. of DL-07.
    pp. 419–426. Bozen/Bolzano University Press (2007)
15. Sazonau, V.: Performance prediction of OWL reasoners. Master’s thesis, School of Computer
    Science, University of Manchester (2013)
16. Sirin, E., Parsia, B., Cuenca Grau, B., Kalyanpur, A., Katz, Y.: Pellet: a practical OWL-DL
    reasoner. J. of Web Semantics 5(2), 51–53 (2007)
17. Tsarkov, D., Horrocks, I.: FaCT++ Description Logic reasoner: system description. In: Proc.
    of IJCAR-06. LNCS, vol. 4130, pp. 292–297. Springer-Verlag (2006)
18. Tuv, E., Borisov, A., Torkkola, K.: Best subset feature selection for massive mixed-type
    problems. In: Proc. of IDEAL-06. pp. 1048–1056. Springer-Verlag (2006)