J. Yaghob (Ed.): ITAT 2015 pp. 172–178
Charles University in Prague, Prague, 2015



           Benchmarking Classifier Performance with Sparse Measurements

                                                                 Jan Motl

                                             Czech Technical University, Prague, Czech Republic,
                                                        jan.motl@fit.cvut.cz,
                                             WWW home page: http://relational.cvut.cz

Abstract: The presented paper describes a methodology for benchmarking when classifier performance measurements are sparse. The described methodology is based on missing value imputation and is demonstrated to work even when 80% of the measurements are missing, for example because of unavailable algorithm implementations or unavailable datasets. The methodology is then applied to 29 relational classifiers and propositionalization tools on 15 datasets, making it the biggest meta-analysis in relational classification to date.


1   Introduction

You can't improve what you can't measure. However, in some fields the comparison of different approaches is demanding. For example, in the field of relational classifiers, essentially each classifier uses a different syntax and requires data in a different format, which makes the comparison of relational classifiers difficult. Despite these obstacles, each author of a new relational classifier attempts to prove that the new algorithm is better than some previous algorithm and takes on the burden of comparing it to a small set of algorithms on a limited set of datasets. But how can we compare the algorithms if they are not evaluated on the same set of datasets?


1.1   Literature Review

The biggest meta-analysis of relational classifiers (to our best knowledge) is "Is mutagenesis still challenging?" [27], where 19 algorithms are examined. While this analysis is highly interesting, it limits itself to a comparison of the classifiers on a single dataset.
The biggest analysis with regard to the number of datasets used is from Dhafer [21], where a single algorithm is tested on 10 datasets and 20 tasks (some datasets have multiple targets).
If we are interested in a comparison of multiple algorithms on multiple datasets, the counts are considerably smaller. For example, in the article from Bina [2], 6 algorithms are examined on 5 datasets.
This meta-analysis presents a comparison of 29 algorithms on 15 datasets.

1.2   Baseline

Traditionally, a set of algorithms is evaluated on a set of datasets, and the algorithms are then ordered with one (or multiple) of the following methods:

   • Average measure
   • Average ranking
   • Count of wins

                       Algo. 1   Algo. 2   Algo. 3
   Dataset A             0.55      0.5       0.45
   Dataset B             0.65      0.6       0.55
   Dataset C             0.95      1         0.9
   Average accuracy      0.72      0.7       0.63
   Average ranking       1.33      1.67      3
   # Wins                2         1         0

Table 1: Hypothetical evaluation of classifiers based on accuracy (bigger is better) with three ordering methods. In this scenario, all the methods are in agreement (Algorithm 1 is always the best).

                       Algo. 1   Algo. 2   Algo. 3
   Dataset A             0.55      0.5       -
   Dataset B             0.65      0.6       -
   Dataset C             -         1         0.9
   Average accuracy      0.6       0.7       0.9
   Average ranking       1         1.67      2
   # Wins                2         1         0

Table 2: With sparse measurements, the average measure predicts that Algorithm 3 is the best, while the rest of the methods predict that Algorithm 1 is the best.

The different ordering methods [6] are illustrated on an example in Table 1. In this hypothetical scenario, 3 algorithms are evaluated on 3 datasets with accuracy. Based on each ordering method, the first algorithm is the best and the third algorithm is the worst.
But what if not all the measures are available? With the same data, but some of it missing, we can get different results (Table 2). Based on average accuracy, the third algorithm is the best. But we get this result only because the third algorithm was evaluated on the datasets with high average accuracy (i.e. easy datasets), while the rest of the algorithms were evaluated on datasets with lower average accuracy (i.e. hard datasets).
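To make the three ordering methods concrete, the following NumPy sketch (our own illustration, not code from the paper) recomputes the bottom three rows of Table 1 from the dense accuracy matrix:

    import numpy as np

    # Accuracy matrix from Table 1: rows = datasets, columns = algorithms.
    acc = np.array([[0.55, 0.50, 0.45],    # Dataset A
                    [0.65, 0.60, 0.55],    # Dataset B
                    [0.95, 1.00, 0.90]])   # Dataset C

    # Average measure: mean accuracy of each algorithm over all datasets.
    avg_measure = acc.mean(axis=0)                        # approx. [0.72, 0.70, 0.63]

    # Average ranking: rank the algorithms within each dataset (1 = best),
    # then average the ranks over the datasets.
    ranks = np.argsort(np.argsort(-acc, axis=1), axis=1) + 1
    avg_ranking = ranks.mean(axis=0)                      # approx. [1.33, 1.67, 3.00]

    # Count of wins: number of datasets on which the algorithm scores best.
    wins = (acc == acc.max(axis=1, keepdims=True)).sum(axis=0)    # [2, 1, 0]

On the sparse matrix of Table 2, the same statistics are computed only over the available entries, which is exactly where the disagreement between the ordering methods comes from.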
Average ranking and count of wins are more robust to missing values. However, neither of them is infallible. Imagine that someone publishes an algorithm together with a weaker version of it and measures the accuracy not only on the common datasets but also on thousands of randomly generated datasets that are never going to be classified by any other algorithm. Then the stronger version of the classifier is going to score at least a thousand wins and place first on the leaderboard, regardless of its score on the few common datasets.
If all the algorithms were evaluated on at least one common dataset, we could also order the algorithms just based on the common datasets. But if there isn't any dataset on which all the algorithms are evaluated, we have to come up with another solution.
The solution is to perform missing value imputation and convert the problem to one we can already solve.

Figure 1: Correlation of the predicted algorithm order with the ground truth based on the proportion of missing data (evaluation on propositional classifiers; x-axis: fraction of data, y-axis: Pearson and Spearman correlation).
2   Imputation

The proposed missing value imputation iteratively approximates

    acc ≈ alg ∗ dat                                                            (1)

where alg is a column vector, dat is a row vector, and alg ∗ dat is their outer product. The approximation is computed with the following pseudocode:

    acc = pivot(input, @mean)
    alg = rowmean(acc)
    dat = ones(1, ncol(acc))

    for i = 1:nit
        dat = dat + colmean(acc - alg * dat)
        alg = alg + rowmean(acc - alg * dat)
    end

Where:

input: Matrix with three columns: {algorithm name, dataset name, measured accuracy}.
acc: Matrix with accuracies, where algorithms are in rows and datasets in columns.
alg: Column vector with the average accuracy of the algorithms over all the datasets. Initialized to the average algorithm accuracy.
dat: Row vector with the relative difficulty of the datasets. Initialized to a vector of ones.
nit: Parameter describing the count of iterations. 10 iterations are commonly sufficient.

Figure 2: RMSE between the sampled-and-imputed submatrix and the dense submatrix (evaluation on propositional classifiers; x-axis: fraction of data; compared methods: Global Average, User k-NN, Item k-NN, User Item Baseline, Slope One, Matrix Factorization, and the proposed imputation).
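The pseudocode translates almost directly into NumPy. The sketch below is our own reading of it: missing measurements are stored as NaN, the row and column means skip them (a detail the pseudocode leaves implicit), and a missing cell is finally estimated as the product alg[i] * dat[j]:

    import numpy as np

    def impute(acc, nit=10):
        # acc: 2-D array, algorithms in rows, datasets in columns, NaN = missing.
        alg = np.nanmean(acc, axis=1, keepdims=True)    # column vector: algorithm strength
        dat = np.ones((1, acc.shape[1]))                # row vector: dataset difficulty

        for _ in range(nit):
            dat = dat + np.nanmean(acc - alg * dat, axis=0, keepdims=True)
            alg = alg + np.nanmean(acc - alg * dat, axis=1, keepdims=True)

        imputed = np.where(np.isnan(acc), alg * dat, acc)   # fill only the missing cells
        return alg.ravel(), dat.ravel(), imputed

    # Toy example in the spirit of Table 2 (rows = algorithms, columns = datasets).
    acc = np.array([[0.55, 0.65, np.nan],
                    [0.50, 0.60, 1.00],
                    [np.nan, np.nan, 0.90]])
    alg, dat, full = impute(acc)
    ordering = np.argsort(-alg)    # algorithms ordered from strongest to weakest

The pivot step of the pseudocode (averaging repeated measurements of the same algorithm × dataset pair) is assumed to have been performed before calling impute.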
2.1   Evaluation on a Dense Dataset

To assess the ability of the proposed imputation to properly order relational classifiers, a test on a related task was performed. Arguably the closest task to relational classification that can be benchmarked well is propositional classification - the most common type of classification, where a single table is classified.
Conveniently, accuracies of 179 propositional classifiers on 121 datasets were published in a recent study by Fernandez-Delgado [8]. Since not all the examined algorithms always finished successfully, e.g. due to collinearity in the data, a dense submatrix of 179 algorithms on 14 datasets was extracted. The dense submatrix was then randomly sampled with a variable count of measurements and the missing values were imputed. The resulting learning curve, depicted in figure 1, suggests that once 20% of all algorithm × dataset combinations are used, a fairly good estimate of the actual ordering is obtained.
A comparison of the proposed imputation method to other imputation methods in terms of Root Mean Square Error (RMSE) is in figure 2. The reference methods are from RapidMiner and their parameters were optimized with grid search.
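The sampling experiment described above can be replicated with a short loop. The sketch below is a hypothetical reconstruction (the exact sampling scheme in the paper may differ): it hides a random share of a dense accuracy matrix, imputes the hidden entries with the impute function sketched earlier, and correlates the recovered algorithm ordering with the ground truth, as in figure 1:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def ordering_recovery(dense_acc, fraction, rng):
        # Keep roughly `fraction` of the measurements and hide the rest as NaN.
        sparse_acc = dense_acc.copy()
        sparse_acc[rng.random(dense_acc.shape) > fraction] = np.nan

        truth = dense_acc.mean(axis=1)      # ground-truth mean accuracy per algorithm
        alg, _, _ = impute(sparse_acc)      # estimated algorithm strength
        return pearsonr(truth, alg)[0], spearmanr(truth, alg)[0]

    # dense_acc would be the dense 179 x 14 submatrix extracted from [8];
    # very small fractions can leave an algorithm without any measurement,
    # which this sketch does not guard against.
    # rng = np.random.default_rng(0)
    # for fraction in np.linspace(0.1, 1.0, 10):
    #     print(fraction, ordering_recovery(dense_acc, fraction, rng))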
2.2   Theoretical Evaluation

According to the No-Free-Lunch theorem [41], the best classifier will not be the same for all datasets. Hence we shouldn't even attempt to measure and average a classifier's performance over a wide set of datasets, but merely describe the strengths and weaknesses of different classifiers. In this respect, the selected imputation method fails, because it is not able to model interactions between datasets and algorithms. Nevertheless, in practice some classifiers appear to be systematically better than others [8]. If all we want is to order classifiers based on their expected accuracy on a set of datasets, the inability to model interactions is irrelevant.
Another property of the used methodology is that it doesn't permit a mixture of measures. That is unfortunate, since some articles [18] report only accuracy, while other articles [10] report only precision, recall and F-measure.
Special attention is necessary when we are comparing results from several authors, because not only can the evaluation methodology differ (e.g. 10-fold cross-validation vs. a single hold-out sample), but the datasets can also differ despite the common name. For example, the canonical version of the East-West dataset (further abbreviated as Trains) contains 10 instances [30]. However, some authors prefer an extended version of the dataset with 20 instances [35]. Nevertheless, data quality is the pitfall common to all analyses. To alleviate the problem with different datasets in the future, a new (and the first) relational repository was established at relational.fit.cvut.cz. Further discussion of the collected data is provided in the next section.
The final limitation of the method is that it doesn't provide trustworthy confidence intervals. The first reason is that measures for the same algorithm and dataset are averaged and treated as a single measure. The second reason is that the algorithm performs missing value imputation, violating the assumption of sample independence.
A list summarizing the advantages and constraints of the proposed method follows:

Advantages:
   • Permits benchmarking with sparse measures.
   • Respects that some datasets are tougher than others.
   • Allows conflicting measurements (for example, by different authors).

Disadvantages:
   • Neglects interactions between datasets and algorithms.
   • Requires one common measure (e.g. we cannot mix accuracy and F-measure).
   • Requires comparably prepared datasets (e.g. using the same instance count).
   • Doesn't provide confidence intervals.

   Algorithm        Algorithm type          Reference
   Aleph            ILP                     [35]
   CILP++∗          Neural Network          [9]
   CrossMine        ILP                     [42]
   E-NB∗            Probabilistic           [37]
   FOIL             ILP                     [23]
   FORF-NA          Decision Tree           [40]
   Graph-NB         Probabilistic           [26]
   HNBC∗            Probabilistic           [37]
   kFOIL            Kernel                  [23]
   Lynx-RSM∗        Propositionalization    [29]
   MLN              Probabilistic           [36]
   MRDTL-2          Decision Tree           [25]
   MRNBC∗           Probabilistic           [37]
   MVC∗             Multi-View              [11]
   MVC-IM∗          Multi-View              [12]
   MulSVM∗          Propositionalization    [43]
   nFOIL∗           Probabilistic           [22]
   PIC∗             Probabilistic           [37]
   RELAGGS          Propositionalization    [17]
   RPT∗             Probabilistic           [28]
   RSD              Propositionalization    [18]
   RollUp           Propositionalization    [16]
   SINUS            Propositionalization    [18]
   SimFlat∗         Propositionalization    [12]
   SDF∗             Decision Tree           [2]
   TILDE            Decision Tree           [37]
   TreeLiker-Poly   ILP                     [20]
   TreeLiker-RelF   ILP                     [20]
   Wordification∗   Propositionalization    [35]

Table 3: List of the 29 relational classifiers and propositionalization tools used in the meta-analysis. A star by the algorithm name marks algorithms for which no measurement by someone other than the algorithm authors was found.

3   Classification of Relational Data

The proposed methodology for benchmarking with sparse measurements is applied to relational classifiers, including propositionalization tools. In the following sections, descriptions of the collected measures, the benchmarked algorithms and the datasets follow. The collected data can be downloaded from motl.us\benchmarking.

3.1   Measure Selection

Since almost all relational classifiers in the literature are evaluated on classification accuracy, and only a few are evaluated with a different measure (like precision & recall, F-measure, AUC or AUC-PR) - exceptions such as CLAMF [10], ACORA [34] or SAYU [4] are evaluated in the literature only with measures based on precision & recall - the meta-analysis limits itself to classification accuracy. The methods of measuring accuracy may differ, but only testing accuracies (not training accuracies) were collected.
Other interesting measures, like runtime or memory consumption, were not evaluated, as they are rarely published. And even if they were published, they would hardly be comparable, as the measurements are platform dependent.
   Dataset           Target          #Instances   Reference
   Alzheimer         acetyl          1326         [15]
   Alzheimer         amine           686          [15]
   Alzheimer         memory          642          [15]
   Alzheimer         toxicity        886          [15]
   Carcinogenesis    carcinogenic    329          [38]
   ECML              insurance       7329         [19]
   Financial         loan status     682          [1]
   Hepatitis         biopsy          690          [35]
   IMDb              ratings         281449       [2]
   KRK               depth-of-win    1000         [18]
   Mondial           religion        185          [39]
   MovieLens         age             941          [2]
   Musk-small        musk            92           [7]
   Musk-large        musk            102          [7]
   Mutagenesis       mutagenic       188          [5]
   Thrombosis        degree          770          [3]
   Trains            direction       10           [30]
   UW-CSE            advisedBy       339          [36]

Table 4: List of the 15 datasets with their targets (the Alzheimer dataset has multiple targets) used in the meta-analysis.

3.2   Algorithm Selection

The selection of relational classifiers and propositionalization tools was restricted to algorithms which:

   • Were published in a conference or journal paper.
   • Were benchmarked on at least four datasets.
   • Were evaluated on classification accuracy.

The list of compared algorithms is in table 3.

3.3   Dataset Selection

Datasets were selected based on the following criteria:

   • The dataset has a defined classification target.
   • The dataset consists of at least two tables.
   • The dataset is used by at least four algorithms.

The used relational datasets are listed in table 4.

4   Results

The box plot in Figure 3 depicts the estimated average classification accuracies of 29 algorithms on 15 datasets (18 tasks). The input data covers 26% of all algorithm × dataset combinations, making the estimates solid (recall figure 1). The accuracies were estimated with 1000 bootstrap samples. Whiskers depict 1.5 IQR.

Figure 3: Box plot with expected accuracies (x-axis: expected mean accuracy on all the datasets; algorithms ordered from best to worst: Wordification, Lynx-RSM, RPT, MVC-IM, PIC, MVC, kFOIL, MRDTL-2, SimFlat, CrossMine, RELAGGS, SDF, MulSVM, RollUp, HNBC, SINUS, E-NB, nFOIL, RSD, RelF, MLN, MRNBC, Poly, Aleph, Graph-NB, FOIL, CILP++, TILDE, FORF-NA).
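The paper does not publish its resampling code, so the sketch below only illustrates one plausible way to obtain such bootstrap estimates (our own assumption: datasets, i.e. columns, are the resampled unit): each resample draws datasets with replacement, reruns the imputation, and the row means of the completed matrix give one estimate per algorithm.

    import numpy as np

    def bootstrap_expected_accuracy(acc, n_boot=1000, seed=0):
        # acc: algorithms x datasets matrix of collected accuracies, NaN = missing.
        rng = np.random.default_rng(seed)
        n_alg, n_dat = acc.shape
        estimates = np.empty((n_boot, n_alg))
        for b in range(n_boot):
            cols = rng.integers(0, n_dat, size=n_dat)    # datasets drawn with replacement
            _, _, completed = impute(acc[:, cols])       # impute() sketched in section 2
            estimates[b] = completed.mean(axis=1)        # expected accuracy on the resample
        # Note: a resample can drop every dataset on which a rarely evaluated
        # algorithm was measured; a production version would have to guard against that.
        return estimates   # quartiles per column give the boxes, 1.5 IQR the whiskers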
4.1   Validation

The ordering of algorithms from the meta-analysis should roughly correspond to the ordering of the algorithms in the individual articles. Differences in the orderings are evaluated with Spearman correlation in table 5.
As we can see, the orderings in the literature can contradict each other – once RELAGGS is better than CrossMine, once CrossMine is better than RELAGGS.

   Ordering                                           Spearman Correlation   Reference
   TILDE < FORF-NA < Graph-NB < RelF < Poly < SDF     0.89                   [2]
   MRNBC < TILDE < E-NB < HNBC < PIC                  0.9                    [37]
   TILDE < RELAGGS < CrossMine < MVC-IM               1.0                    [12]
   FOIL < TILDE < CrossMine < RELAGGS < MVC           0.9                    [31]

Table 5: Comparison of the algorithm orderings in the literature with the ordering from the imputation.
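The correlations in table 5 can be recomputed from the published orderings and the imputed ranking. The following sketch (our own; the ranks are read off figure 3 and are therefore approximate) reproduces the third row of the table:

    from scipy.stats import spearmanr

    # Ordering published in [12], listed from worst to best.
    literature_order = ["TILDE", "RELAGGS", "CrossMine", "MVC-IM"]

    # Rank of the same four algorithms in the imputed ordering of figure 3
    # (1 = best of the 29 algorithms).
    imputed_rank = {"MVC-IM": 4, "CrossMine": 10, "RELAGGS": 11, "TILDE": 28}

    x = list(range(len(literature_order)))                      # 0 = worst, ..., 3 = best
    y = [-imputed_rank[name] for name in literature_order]      # negate so larger = better
    rho = spearmanr(x, y)[0]                                     # 1.0, matching table 5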


5   Discussion

A summary of figure 3 based on algorithm type is in figure 4. Interestingly, kernel and multi-view approaches are on average the most accurate algorithms. But some propositionalization algorithms, namely Wordification, Lynx-RSM and RPT (Relational Probabilistic Tree), beat them. Nevertheless, note that propositionalization algorithms are over-represented in the meta-analysis, making it more likely that some of them place at extreme positions.

Figure 4: Estimated accuracies by algorithm type (x-axis: classification accuracy; approaches shown: Multi-View, Kernel, Propositionalization, Decision Tree, Probabilistic, ILP, Neural Network).

Also note that the performance of algorithms is influenced by their settings. While algorithms in figure 3 are ordered based on the average accuracy per algorithm × dataset combination, algorithms in figure 5 are ordered based on the maximal known accuracy per algorithm × dataset combination. Notably, with the right setting, RSD improves its ranking by 5 positions.
Finally, it is trusted that each individual author measured accuracy correctly. For example, in the article from 2014 [24], the authors of Wordification applied cross-validation only to the propositional classifiers, leaving discretization and propositionalization out of the cross-validation. This design can lead to overly optimistic estimates of the accuracy. In the follow-up article from 2015 [35], the authors of Wordification already apply cross-validation to the whole process flow. Wordification accuracies used in this article are exclusively from [35].

5.1   Fairness of Comparison

A comparison based solely on classification accuracy can be unfair. For example, CLAMF∗ [10] and CoMoVi∗ [32] are designed to work with temporal datasets. Hence, on the Financial dataset they estimate the probability of a loan default only from the data before the time of the loan application, while classifiers designed for static datasets also use data at and after the time of the loan application. All classifiers used in the meta-analysis treat all the datasets as if they were static.
   ∗ The algorithm is not included in the meta-analysis because its accuracy wasn't measured on enough datasets.
Validation on temporal datasets can furthermore be complicated by repeated target events. For example, a customer may apply for a loan many times. There are then two perfectly plausible goals - we may want to calculate the probability of default of a current customer with a history of loans, or the probability of default of a new customer. Generally, the second task is tougher, because one of the best predictors of a customer's behavior is the customer's past behavior. Nevertheless, all datasets in the meta-analysis have exactly one target value per classified object (including the Financial dataset). Note that the difference between within/across classification in the IMDb and MovieLens datasets is another issue [33].
Also, the goals of modeling can differ. For example, the Markov Logic Network∗ in [14] is evaluated as a generative model, so the reported accuracies are over all predicates, not just the target one. And accuracies can vary substantially with respect to the chosen target predicate. All the algorithms in the meta-analysis are evaluated in a discriminative setting.
Additionally, not all classifiers are designed to perform well on a wide spectrum of datasets. Indeed, there are algorithms like MOLFEA [13] that are designed to work only on a narrow subset of datasets. A possible specialization of the algorithms in the meta-analysis is not taken into consideration.
At last, different authors may have a different ethos. Algorithms that were evaluated only by the algorithm authors (to our best knowledge) were marked with a star in table 3. And many algorithms that place at the top of the ranking are starred algorithms. However, this trend can also be explained with the following hypotheses:

   • Recent algorithms tend to be better than old algorithms. And recent algorithms (like Wordification [35]) did not have enough time to accumulate references.
   • New algorithms tend to look better in comparison to mediocre algorithms than in comparison to the best algorithm in the field. Hence authors prefer to compare their algorithms against mediocre algorithms.
   • Third-party evaluators do not have the knowledge and resources to find the best algorithm setting. Hence popular algorithms have, on average, low accuracy. This problem is partially mitigated by considering only the best reported accuracies in figure 5.

Overall, the comparison of measurements from several sources is not a simple task at all.
Figure 5: Box plot with optimistic accuracies (x-axis: optimistic mean accuracy on all the datasets; algorithms ordered from best to worst: RPT, Lynx-RSM, Wordification, PIC, MVC-IM, CrossMine, RELAGGS, kFOIL, MVC, MRDTL-2, SimFlat, RollUp, SDF, MulSVM, RSD, HNBC, SINUS, RelF, nFOIL, E-NB, MLN, Aleph, MRNBC, FOIL, Poly, Graph-NB, TILDE, CILP++, FORF-NA).

6   Conclusion

Based on the performed analysis, Wordification, Lynx-RSM and the Relational Probabilistic Tree on average outperform the other 26 algorithms for relational classification. Other promising categories of relational classifiers are multi-view and kernel-based approaches.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 13-17187S.

References

 [1] Berka, P.: Workshop notes on Discovery Challenge PKDD'99, 1999
 [2] Bina, B., Schulte, O., Crawford, B., Qian, Z., Xiong, Y.: Simple decision forests for multi-relational classification. Decision Support Systems 54(3) (2013), 1269–1279
 [3] Coursac, I., Duteil, N.: PKDD 2001 Discovery Challenge – Medical Domain, 2001
 [4] Davis, J., Burnside, E., Page, D.: View learning extended: inventing new tables for statistical relational learning. ICML Workshop on Open Problems in Statistical Relational Learning, 2006
 [5] Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., Hansch, C.: Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry 34(2) (1991), 786–797
 [6] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7 (2006), 1–30
 [7] Dietterich, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89(1–2) (1997), 31–71
 [8] Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15 (2014), 3133–3181
 [9] França, M. V. M., Zaverucha, G., D'Avila Garcez, A.: Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning 94(1) (2014), 81–104
[10] Frank, R., Moser, F., Ester, M.: A method for multi-relational classification using single and multi-feature aggregation functions. Lecture Notes in Computer Science 4702 (2007), 430–437
[11] Guo, H., Viktor, H. L.: Mining relational databases with multi-view learning. Proceedings of the 4th International Workshop on Multi-Relational Mining – MRDM'05, 2005, 15–24
[12] Guo, H., Viktor, H. L.: Learning from skewed class multi-relational databases. Fundamenta Informaticae 89(1) (2008), 69–94
[13] Helma, C., Kramer, S., De Raedt, L.: The molecular feature miner MOLFEA. In: Proceedings of the Beilstein Workshop 2002: Molecular Informatics: Confronting Complexity, Beilstein Institut, 2002, 1–16
[14] Khosravi, H., Schulte, O., Hu, J., Gao, T.: Learning compact Markov logic networks with decision trees. Machine Learning 89(3) (2012), 257–277
[15] King, R. D., Sternberg, M., Srinivasan, A.: Relating chemical activity to structure: an examination of ILP successes. New Generation Computing 13(3–4) (1995), 411–433
[16] Knobbe, A. J., De Haas, M., Siebes, A.: Propositionalisation and aggregates. Lecture Notes in Computer Science 2168 (2001), 277–288
[17] Krogel, M.-A.: On propositionalization for knowledge discovery in relational databases. PhD Thesis, Otto-von-Guericke-Universität Magdeburg, 2005
[18] Krogel, M.-A., Rawles, S., Železný, F., Flach, P. A., Lavrač, N., Wrobel, S.: Comparative evaluation of approaches to propositionalization. In: Proceedings of the 13th International Conference on Inductive Logic Programming, volume 2835, 2003, 194–217
[19] Krogel, M.-A., Wrobel, S.: Facets of aggregation approaches to propositionalization. In: Inductive Logic Programming: 13th International Conference, Springer, Berlin, 2003, 30–39
[20] Kuželka, O.: Fast construction of relational features for machine learning. PhD Thesis, Czech Technical University, 2013
[21] Lahbib, D., Boullé, M., Laurent, D.: Itemset-based variable construction in multi-relational supervised learning. Lecture Notes in Computer Science 7842 (2013), 130–150
[22] Landwehr, N.: Integrating naive Bayes and FOIL. Journal of Machine Learning Research 8 (2007), 481–507
[23] Landwehr, N., Passerini, A., De Raedt, L., Frasconi, P.: kFOIL: Learning simple relational kernels. AAAI 6 (2006), 389–394
[24] Lavrač, N., Perovšek, M., Vavpetič, A.: Propositionalization online. In: ECML PKDD 2014, Springer-Verlag, 2014, 456–459
[25] Leiva, H. A., Atramentov, A., Honavar, V.: A multi-relational decision tree learning algorithm. Proceedings of the 13th International Conference on Inductive Logic Programming, 2002, 38–56
[26] Liu, H., Yin, X., Han, J.: An efficient multi-relational Naïve Bayesian classifier based on semantic relationship graph. Proceedings of the 4th International Workshop on Multi-Relational Mining – MRDM'05, Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2005), 2005, 39–48
[27] Lodhi, H., Muggleton, S.: Is mutagenesis still challenging? Proceedings of the 15th International Conference on Inductive Logic Programming, ILP 2005, Late-Breaking Papers, 2005, 35–40
[28] Macskassy, S. A., Provost, F.: A simple relational classifier. Technical Report, Stern New York University, 2003
[29] Di Mauro, N., Esposito, F.: Ensemble relational learning based on selective propositionalization. CoRR (2013), 1–10
[30] Michie, D., Muggleton, S., Page, D., Srinivasan, A.: To the international computing community: a new east-west challenge. Technical Report, Oxford University Computing Laboratory, Oxford, 1994
[31] Modi, S.: Relational classification using multiple view approach with voting. International Journal of Computer Applications 70(16) (2013), 31–36
[32] Neto, R. O., Adeodato, P. J. L., Salgado, A. C., Filho, D. R., Machado, G. R.: CoMoVi: a framework for data transformation in credit behavioral scoring applications using model driven architecture. SEKE 2014 (2014), 286–291
[33] Neville, J., Gallagher, B., Eliassi-Rad, T., Wang, T.: Correcting evaluation bias of relational classifiers with network cross validation. Knowledge and Information Systems 30(1) (2012), 31–55
[34] Perlich, C., Provost, F.: Distribution-based aggregation for relational learning with identifier attributes. Machine Learning 62(1–2) SPEC. ISS. (2006), 65–105
[35] Perovšek, M., Vavpetič, A., Kranjc, J., Cestnik, B., Lavrač, N.: Wordification: propositionalization by unfolding relational data into bags of words. Expert Systems with Applications 42(17–18) (2015), 6442–6456
[36] Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1–2) SPEC. ISS. (February 2006), 107–136
[37] Schulte, O., Bina, B., Crawford, B., Bingham, D., Xiong, Y.: A hierarchy of independence assumptions for multi-relational Bayes net classifiers. Proceedings of the 2013 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2013 – 2013 IEEE Symposium Series on Computational Intelligence, SSCI 2013, 2013, 150–159
[38] Srinivasan, A., King, R. D., Muggleton, S. H., Sternberg, M. J. E.: Carcinogenesis predictions using ILP. Inductive Logic Programming 1297 (1997), 273–287
[39] Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. UAI'02 Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002, 485–492
[40] Vens, C., Van Assche, A., Blockeel, H., Džeroski, S.: First order random forests with complex aggregates. Lecture Notes in Computer Science 3194 (2004), 323–340
[41] Wolpert, D.: The existence of a priori distinctions between learning algorithms. Neural Computation 8(7) (1996), 1391–1420
[42] Yin, X., Han, J., Yang, J., Yu, P. S.: CrossMine: efficient classification across multiple database relations. Lecture Notes in Computer Science 3848 LNAI(6) (2006), 172–195
[43] Zou, M., Wang, T., Li, H., Yang, D.: A general multi-relational classification approach using feature generation and selection. In: 6th International Conference, ADMA, Springer-Verlag, 2010, 21–33