J. Yaghob (Ed.): ITAT 2015 pp. 172–178
Charles University in Prague, Prague, 2015



           Benchmarking Classifier Performance with Sparse Measurements

                                                                 Jan Motl

                                             Czech Technical University, Prague, Czech Republic,
                                                        jan.motl@fit.cvut.cz,
                                             WWW home page: http://relational.cvut.cz

Abstract: The presented paper describes a methodology for benchmarking when classifier performance measurements are sparse. The described methodology is based on missing value imputation and is demonstrated to work even when 80% of the measurements are missing, for example because of unavailable algorithm implementations or unavailable datasets. The methodology is then applied to 29 relational classifiers and propositionalization tools on 15 datasets, making it the biggest meta-analysis in relational classification to date.


1   Introduction

You can't improve what you can't measure. However, in some fields the comparison of different approaches is demanding. For example, in the field of relational classifiers, essentially each classifier uses a different syntax and requires data in a different format, which makes the comparison of relational classifiers difficult. Despite these obstacles, each author of a new relational classifier attempts to prove that the new algorithm is better than some previous algorithm and takes on the burden of comparing it to a small set of algorithms on a limited set of datasets. But how can we compare the algorithms if they are not evaluated on the same set of datasets?


1.1   Literature Review

The biggest meta-analysis of relational classifiers (to our best knowledge) is "Is mutagenesis still challenging?" [27], where 19 algorithms are examined. While this analysis is highly interesting, it limits itself to a comparison of the classifiers on a single dataset.
The biggest analysis with regard to the number of datasets used is from Dhafer [21], where a single algorithm is tested on 10 datasets and 20 tasks (some datasets have multiple targets).
If we are interested in a comparison of multiple algorithms on multiple datasets, the counts are considerably smaller. For example, in the article from Bina [2], 6 algorithms are examined on 5 datasets.
This meta-analysis presents a comparison of 29 algorithms on 15 datasets.

1.2   Baseline

Traditionally, a set of algorithms is evaluated on a set of datasets, and the algorithms are then ordered with one (or multiple) of the following methods:

   • Average measure
   • Average ranking
   • Count of wins

                       Algo. 1   Algo. 2   Algo. 3
   Dataset A             0.55      0.5       0.45
   Dataset B             0.65      0.6       0.55
   Dataset C             0.95      1         0.9
   Average accuracy      0.72      0.7       0.63
   Average ranking       1.33      1.67      3
   # Wins                2         1         0

Table 1: Hypothetical evaluation of classifiers based on accuracy (bigger is better) with three ordering methods. In this scenario, all the methods are in agreement (Algorithm 1 is always the best).

                       Algo. 1   Algo. 2   Algo. 3
   Dataset A             0.55      0.5       -
   Dataset B             0.65      0.6       -
   Dataset C             -         1         0.9
   Average accuracy      0.6       0.7       0.9
   Average ranking       1         1.67      2
   # Wins                2         1         0

Table 2: With sparse measurements, the average measure predicts that Algorithm 3 is the best, while the rest of the methods predict that Algorithm 1 is the best.

The different ordering methods [6] are illustrated on an example in Table 1. In this hypothetical scenario, 3 algorithms are evaluated on 3 datasets with accuracy. Based on each ordering method, the first algorithm is the best and the third algorithm is the worst.
But what if not all the measures are available? With the same data, but some of it missing, we can get different results (Table 2). Based on average accuracy, the third algorithm is the best. But we get this result only because the third algorithm was evaluated on the datasets with high average accuracy (i.e. easy datasets), while the rest of the algorithms were evaluated on datasets with lower average accuracy (i.e. hard datasets).
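To make the three ordering methods concrete, the following NumPy sketch (our own illustration, not code from the paper) recomputes the bottom three rows of Table 1 from the dense accuracy matrix:

    import numpy as np

    # Accuracy matrix from Table 1: rows = datasets, columns = algorithms.
    acc = np.array([[0.55, 0.50, 0.45],    # Dataset A
                    [0.65, 0.60, 0.55],    # Dataset B
                    [0.95, 1.00, 0.90]])   # Dataset C

    # Average measure: mean accuracy of each algorithm over all datasets.
    avg_measure = acc.mean(axis=0)                        # approx. [0.72, 0.70, 0.63]

    # Average ranking: rank the algorithms within each dataset (1 = best),
    # then average the ranks over the datasets.
    ranks = np.argsort(np.argsort(-acc, axis=1), axis=1) + 1
    avg_ranking = ranks.mean(axis=0)                      # approx. [1.33, 1.67, 3.00]

    # Count of wins: number of datasets on which the algorithm scores best.
    wins = (acc == acc.max(axis=1, keepdims=True)).sum(axis=0)    # [2, 1, 0]

On the sparse matrix of Table 2, the same statistics are computed only over the available entries, which is exactly where the disagreement between the ordering methods comes from.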
Average ranking and count of wins are more robust to missing values. However, neither of them is infallible. Imagine that someone publishes an algorithm together with a weaker version of it and measures the accuracy not only on the common datasets but also on thousands of randomly generated datasets that are never going to be classified by any other algorithm. Then the stronger version of the classifier is going to score at least a thousand wins and place first on the leaderboard, regardless of its score on the few common datasets.
If all the algorithms were evaluated on at least one common dataset, we could also order the algorithms just based on the common datasets. But if there isn't any dataset on which all the algorithms are evaluated, we have to come up with another solution.
The solution is to perform missing value imputation and convert the problem to one we can already solve.

Figure 1: Correlation of the predicted algorithm order with the ground truth based on the proportion of missing data (evaluation on propositional classifiers; x-axis: fraction of data, y-axis: Pearson and Spearman correlation).
2   Imputation

The proposed missing value imputation iteratively approximates

    acc ≈ alg ∗ dat                                                            (1)

where alg is a column vector, dat is a row vector, and alg ∗ dat is their outer product. The approximation is computed with the following pseudocode:

    acc = pivot(input, @mean)
    alg = rowmean(acc)
    dat = ones(1, ncol(acc))

    for i = 1:nit
        dat = dat + colmean(acc - alg * dat)
        alg = alg + rowmean(acc - alg * dat)
    end

Where:

input: Matrix with three columns: {algorithm name, dataset name, measured accuracy}.
acc: Matrix with accuracies, where algorithms are in rows and datasets in columns.
alg: Column vector with the average accuracy of the algorithms over all the datasets. Initialized to the average algorithm accuracy.
dat: Row vector with the relative difficulty of the datasets. Initialized to a vector of ones.
nit: Parameter describing the count of iterations. 10 iterations are commonly sufficient.

Figure 2: RMSE between the sampled-and-imputed submatrix and the dense submatrix (evaluation on propositional classifiers; x-axis: fraction of data; compared methods: Global Average, User k-NN, Item k-NN, User Item Baseline, Slope One, Matrix Factorization, and the proposed imputation).
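The pseudocode translates almost directly into NumPy. The sketch below is our own reading of it: missing measurements are stored as NaN, the row and column means skip them (a detail the pseudocode leaves implicit), and a missing cell is finally estimated as the product alg[i] * dat[j]:

    import numpy as np

    def impute(acc, nit=10):
        # acc: 2-D array, algorithms in rows, datasets in columns, NaN = missing.
        alg = np.nanmean(acc, axis=1, keepdims=True)    # column vector: algorithm strength
        dat = np.ones((1, acc.shape[1]))                # row vector: dataset difficulty

        for _ in range(nit):
            dat = dat + np.nanmean(acc - alg * dat, axis=0, keepdims=True)
            alg = alg + np.nanmean(acc - alg * dat, axis=1, keepdims=True)

        imputed = np.where(np.isnan(acc), alg * dat, acc)   # fill only the missing cells
        return alg.ravel(), dat.ravel(), imputed

    # Toy example in the spirit of Table 2 (rows = algorithms, columns = datasets).
    acc = np.array([[0.55, 0.65, np.nan],
                    [0.50, 0.60, 1.00],
                    [np.nan, np.nan, 0.90]])
    alg, dat, full = impute(acc)
    ordering = np.argsort(-alg)    # algorithms ordered from strongest to weakest

The pivot step of the pseudocode (averaging repeated measurements of the same algorithm × dataset pair) is assumed to have been performed before calling impute.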
2.1   Evaluation on a Dense Dataset

To assess the ability of the proposed imputation to properly order relational classifiers, a test on a related task was performed. Arguably the closest task to relational classification that can be benchmarked well is propositional classification - the most common type of classification, where a single table is classified.
Conveniently, accuracies of 179 propositional classifiers on 121 datasets were published in a recent study by Fernandez-Delgado [8]. Since not all the examined algorithms always finished successfully, e.g. due to collinearity in the data, a dense submatrix of 179 algorithms on 14 datasets was extracted. The dense submatrix was then randomly sampled with a variable count of measurements and the missing values were imputed. The resulting learning curve, depicted in figure 1, suggests that once 20% of all algorithm × dataset combinations are used, a fairly good estimate of the actual ordering is obtained.
A comparison of the proposed imputation method to other imputation methods in terms of Root Mean Square Error (RMSE) is in figure 2. The reference methods are from RapidMiner and their parameters were optimized with grid search.
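The sampling experiment described above can be replicated with a short loop. The sketch below is a hypothetical reconstruction (the exact sampling scheme in the paper may differ): it hides a random share of a dense accuracy matrix, imputes the hidden entries with the impute function sketched earlier, and correlates the recovered algorithm ordering with the ground truth, as in figure 1:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def ordering_recovery(dense_acc, fraction, rng):
        # Keep roughly `fraction` of the measurements and hide the rest as NaN.
        sparse_acc = dense_acc.copy()
        sparse_acc[rng.random(dense_acc.shape) > fraction] = np.nan

        truth = dense_acc.mean(axis=1)      # ground-truth mean accuracy per algorithm
        alg, _, _ = impute(sparse_acc)      # estimated algorithm strength
        return pearsonr(truth, alg)[0], spearmanr(truth, alg)[0]

    # dense_acc would be the dense 179 x 14 submatrix extracted from [8];
    # very small fractions can leave an algorithm without any measurement,
    # which this sketch does not guard against.
    # rng = np.random.default_rng(0)
    # for fraction in np.linspace(0.1, 1.0, 10):
    #     print(fraction, ordering_recovery(dense_acc, fraction, rng))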
2.2   Theoretical Evaluation

According to the No-Free-Lunch theorem [41], the best classifier will not be the same for all datasets. Hence we shouldn't even attempt to measure and average a classifier's performance over a wide set of datasets, but merely describe the strengths and weaknesses of different classifiers. In this respect, the selected imputation method fails, because it is not able to model interactions between datasets and algorithms. Nevertheless, in practice some classifiers appear to be systematically better than others [8]. If all we want is to order classifiers based on their expected accuracy on a set of datasets, the inability to model interactions is irrelevant.
Another property of the used methodology is that it doesn't permit a mixture of measures. That is unfortunate, since some articles [18] report only accuracy, while other articles [10] report only precision, recall and F-measure.
Special attention is necessary when we are comparing results from several authors, because not only can the evaluation methodology differ (e.g. 10-fold cross-validation vs. a single hold-out sample), but the datasets can also differ despite the common name. For example, the canonical version of the East-West dataset (further abbreviated as Trains) contains 10 instances [30]. However, some authors prefer an extended version of the dataset with 20 instances [35]. Nevertheless, data quality is the pitfall common to all analyses. To alleviate the problem with different datasets in the future, a new (and the first) relational repository was established at relational.fit.cvut.cz. Further discussion of the collected data is provided in the next section.
The final limitation of the method is that it doesn't provide trustworthy confidence intervals. The first reason is that measures for the same algorithm and dataset are averaged and treated as a single measure. The second reason is that the algorithm performs missing value imputation, violating the assumption of sample independence.
A list summarizing the advantages and constraints of the proposed method follows:

Advantages:
   • Permits benchmarking with sparse measures.
   • Respects that some datasets are tougher than others.
   • Allows conflicting measurements (for example, by different authors).

Disadvantages:
   • Neglects interactions between datasets and algorithms.
   • Requires one common measure (e.g. we cannot mix accuracy and F-measure).
   • Requires comparably prepared datasets (e.g. using the same instance count).
   • Doesn't provide confidence intervals.

   Algorithm        Algorithm type          Reference
   Aleph            ILP                     [35]
   CILP++∗          Neural Network          [9]
   CrossMine        ILP                     [42]
   E-NB∗            Probabilistic           [37]
   FOIL             ILP                     [23]
   FORF-NA          Decision Tree           [40]
   Graph-NB         Probabilistic           [26]
   HNBC∗            Probabilistic           [37]
   kFOIL            Kernel                  [23]
   Lynx-RSM∗        Propositionalization    [29]
   MLN              Probabilistic           [36]
   MRDTL-2          Decision Tree           [25]
   MRNBC∗           Probabilistic           [37]
   MVC∗             Multi-View              [11]
   MVC-IM∗          Multi-View              [12]
   MulSVM∗          Propositionalization    [43]
   nFOIL∗           Probabilistic           [22]
   PIC∗             Probabilistic           [37]
   RELAGGS          Propositionalization    [17]
   RPT∗             Probabilistic           [28]
   RSD              Propositionalization    [18]
   RollUp           Propositionalization    [16]
   SINUS            Propositionalization    [18]
   SimFlat∗         Propositionalization    [12]
   SDF∗             Decision Tree           [2]
   TILDE            Decision Tree           [37]
   TreeLiker-Poly   ILP                     [20]
   TreeLiker-RelF   ILP                     [20]
   Wordification∗   Propositionalization    [35]

Table 3: List of the 29 relational classifiers and propositionalization tools used in the meta-analysis. A star by the algorithm name marks algorithms for which no measurement by someone other than the algorithm authors was found.

3   Classification of Relational Data

The proposed methodology for benchmarking with sparse measurements is applied to relational classifiers, including propositionalization tools. In the following sections, descriptions of the collected measures, the benchmarked algorithms and the datasets follow. The collected data can be downloaded from motl.us\benchmarking.

3.1   Measure Selection

Since almost all relational classifiers in the literature are evaluated on classification accuracy, and only a few are evaluated with a different measure (like precision & recall, F-measure, AUC or AUC-PR) - exceptions such as CLAMF [10], ACORA [34] or SAYU [4] are evaluated in the literature only with measures based on precision & recall - the meta-analysis limits itself to classification accuracy. The methods of measuring accuracy may differ, but only testing accuracies (not training accuracies) were collected.
Other interesting measures, like runtime or memory consumption, were not evaluated, as they are rarely published. And even if they were published, they would hardly be comparable, as the measurements are platform dependent.
   Dataset           Target          #Instances   Reference
   Alzheimer         acetyl          1326         [15]
   Alzheimer         amine           686          [15]
   Alzheimer         memory          642          [15]
   Alzheimer         toxicity        886          [15]
   Carcinogenesis    carcinogenic    329          [38]
   ECML              insurance       7329         [19]
   Financial         loan status     682          [1]
   Hepatitis         biopsy          690          [35]
   IMDb              ratings         281449       [2]
   KRK               depth-of-win    1000         [18]
   Mondial           religion        185          [39]
   MovieLens         age             941          [2]
   Musk-small        musk            92           [7]
   Musk-large        musk            102          [7]
   Mutagenesis       mutagenic       188          [5]
   Thrombosis        degree          770          [3]
   Trains            direction       10           [30]
   UW-CSE            advisedBy       339          [36]

Table 4: List of the 15 datasets with their targets (the Alzheimer dataset has multiple targets) used in the meta-analysis.

3.2   Algorithm Selection

The selection of relational classifiers and propositionalization tools was restricted to algorithms which:

   • Were published in a conference or journal paper.
   • Were benchmarked on at least four datasets.
   • Were evaluated on classification accuracy.

The list of compared algorithms is in table 3.

3.3   Dataset Selection

Datasets were selected based on the following criteria:

   • The dataset has a defined classification target.
   • The dataset consists of at least two tables.
   • The dataset is used by at least four algorithms.

The used relational datasets are listed in table 4.

4   Results

The box plot in Figure 3 depicts the estimated average classification accuracies of 29 algorithms on 15 datasets (18 tasks). The input data covers 26% of all algorithm × dataset combinations, making the estimates solid (recall figure 1). The accuracies were estimated with 1000 bootstrap samples. Whiskers depict 1.5 IQR.

Figure 3: Box plot with expected accuracies (x-axis: expected mean accuracy on all the datasets; algorithms ordered from best to worst: Wordification, Lynx-RSM, RPT, MVC-IM, PIC, MVC, kFOIL, MRDTL-2, SimFlat, CrossMine, RELAGGS, SDF, MulSVM, RollUp, HNBC, SINUS, E-NB, nFOIL, RSD, RelF, MLN, MRNBC, Poly, Aleph, Graph-NB, FOIL, CILP++, TILDE, FORF-NA).
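The paper does not publish its resampling code, so the sketch below only illustrates one plausible way to obtain such bootstrap estimates (our own assumption: datasets, i.e. columns, are the resampled unit): each resample draws datasets with replacement, reruns the imputation, and the row means of the completed matrix give one estimate per algorithm.

    import numpy as np

    def bootstrap_expected_accuracy(acc, n_boot=1000, seed=0):
        # acc: algorithms x datasets matrix of collected accuracies, NaN = missing.
        rng = np.random.default_rng(seed)
        n_alg, n_dat = acc.shape
        estimates = np.empty((n_boot, n_alg))
        for b in range(n_boot):
            cols = rng.integers(0, n_dat, size=n_dat)    # datasets drawn with replacement
            _, _, completed = impute(acc[:, cols])       # impute() sketched in section 2
            estimates[b] = completed.mean(axis=1)        # expected accuracy on the resample
        # Note: a resample can drop every dataset on which a rarely evaluated
        # algorithm was measured; a production version would have to guard against that.
        return estimates   # quartiles per column give the boxes, 1.5 IQR the whiskers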
4.1   Validation

The ordering of algorithms from the meta-analysis should roughly correspond to the ordering of the algorithms in the individual articles. Differences in the orderings are evaluated with Spearman correlation in table 5.
As we can see, the orderings in the literature can contradict each other – once RELAGGS is better than CrossMine, once CrossMine is better than RELAGGS.

   Ordering                                           Spearman Correlation   Reference
   TILDE < FORF-NA < Graph-NB < RelF < Poly < SDF     0.89                   [2]
   MRNBC < TILDE < E-NB < HNBC < PIC                  0.9                    [37]
   TILDE < RELAGGS < CrossMine < MVC-IM               1.0                    [12]
   FOIL < TILDE < CrossMine < RELAGGS < MVC           0.9                    [31]

Table 5: Comparison of the algorithm orderings in the literature with the ordering from the imputation.
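The correlations in table 5 can be recomputed from the published orderings and the imputed ranking. The following sketch (our own; the ranks are read off figure 3 and are therefore approximate) reproduces the third row of the table:

    from scipy.stats import spearmanr

    # Ordering published in [12], listed from worst to best.
    literature_order = ["TILDE", "RELAGGS", "CrossMine", "MVC-IM"]

    # Rank of the same four algorithms in the imputed ordering of figure 3
    # (1 = best of the 29 algorithms).
    imputed_rank = {"MVC-IM": 4, "CrossMine": 10, "RELAGGS": 11, "TILDE": 28}

    x = list(range(len(literature_order)))                      # 0 = worst, ..., 3 = best
    y = [-imputed_rank[name] for name in literature_order]      # negate so larger = better
    rho = spearmanr(x, y)[0]                                     # 1.0, matching table 5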


5   Discussion

A summary of figure 3 based on algorithm type is in figure 4. Interestingly, kernel and multi-view approaches are on average the most accurate algorithms. But some propositionalization algorithms, namely Wordification, Lynx-RSM and RPT (Relational Probabilistic Tree), beat them. Nevertheless, note that propositionalization algorithms are over-represented in the meta-analysis, making it more likely that some of them place at extreme positions.

Figure 4: Estimated accuracies by algorithm type (x-axis: classification accuracy; approaches shown: Multi-View, Kernel, Propositionalization, Decision Tree, Probabilistic, ILP, Neural Network).

Also note that the performance of algorithms is influenced by their settings. While algorithms in figure 3 are ordered based on the average accuracy per algorithm × dataset combination, algorithms in figure 5 are ordered based on the maximal known accuracy per algorithm × dataset combination. Notably, with the right setting, RSD improves its ranking by 5 positions.
Finally, it is trusted that each individual author measured accuracy correctly. For example, in the article from 2014 [24], the authors of Wordification applied cross-validation only to the propositional classifiers, leaving discretization and propositionalization out of the cross-validation. This design can lead to overly optimistic estimates of the accuracy. In the follow-up article from 2015 [35], the authors of Wordification already apply cross-validation to the whole process flow. Wordification accuracies used in this article are exclusively from [35].

5.1   Fairness of Comparison

A comparison based solely on classification accuracy can be unfair. For example, CLAMF∗ [10] and CoMoVi∗ [32] are designed to work with temporal datasets. Hence, on the Financial dataset they estimate the probability of a loan default only from the data before the time of the loan application, while classifiers designed for static datasets also use data at and after the time of the loan application. All classifiers used in the meta-analysis treat all the datasets as if they were static.
   ∗ The algorithm is not included in the meta-analysis because its accuracy wasn't measured on enough datasets.
Validation on temporal datasets can furthermore be complicated by repeated target events. For example, a customer may apply for a loan many times. There are then two perfectly plausible goals - we may want to calculate the probability of default of a current customer with a history of loans, or the probability of default of a new customer. Generally, the second task is tougher, because one of the best predictors of a customer's behavior is the customer's past behavior. Nevertheless, all datasets in the meta-analysis have exactly one target value per classified object (including the Financial dataset). Note that the difference between within/across classification in the IMDb and MovieLens datasets is another issue [33].
Also, the goals of modeling can differ. For example, the Markov Logic Network∗ in [14] is evaluated as a generative model, so the reported accuracies are over all predicates, not just the target one. And accuracies can vary substantially with respect to the chosen target predicate. All the algorithms in the meta-analysis are evaluated in a discriminative setting.
Additionally, not all classifiers are designed to perform well on a wide spectrum of datasets. Indeed, there are algorithms like MOLFEA [13] that are designed to work only on a narrow subset of datasets. A possible specialization of the algorithms in the meta-analysis is not taken into consideration.
At last, different authors may have a different ethos. Algorithms that were evaluated only by the algorithm authors (to our best knowledge) were marked with a star in table 3. And many algorithms that place at the top of the ranking are starred algorithms. However, this trend can also be explained with the following hypotheses:

   • Recent algorithms tend to be better than old algorithms. And recent algorithms (like Wordification [35]) did not have enough time to accumulate references.
   • New algorithms tend to look better in comparison to mediocre algorithms than in comparison to the best algorithm in the field. Hence authors prefer to compare their algorithms against mediocre algorithms.
   • Third-party evaluators do not have the knowledge and resources to find the best algorithm setting. Hence popular algorithms have, on average, low accuracy. This problem is partially mitigated by considering only the best reported accuracies in figure 5.

Overall, the comparison of measurements from several sources is not a simple task at all.
Figure 5: Box plot with optimistic accuracies (x-axis: optimistic mean accuracy on all the datasets; algorithms ordered from best to worst: RPT, Lynx-RSM, Wordification, PIC, MVC-IM, CrossMine, RELAGGS, kFOIL, MVC, MRDTL-2, SimFlat, RollUp, SDF, MulSVM, RSD, HNBC, SINUS, RelF, nFOIL, E-NB, MLN, Aleph, MRNBC, FOIL, Poly, Graph-NB, TILDE, CILP++, FORF-NA).

6   Conclusion

Based on the performed analysis, Wordification, Lynx-RSM and the Relational Probabilistic Tree on average outperform the other 26 algorithms for relational classification. Other promising categories of relational classifiers are multi-view and kernel-based approaches.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 13-17187S.

References

 [1] Berka, P.: Workshop notes on Discovery Challenge PKDD'99, 1999
 [2] Bina, B., Schulte, O., Crawford, B., Qian, Z., Xiong, Y.: Simple decision forests for multi-relational classification. Decision Support Systems 54(3) (2013), 1269–1279
 [3] Coursac, I., Duteil, N.: PKDD 2001 Discovery Challenge – Medical Domain, 2001
 [4] Davis, J., Burnside, E., Page, D.: View learning extended: inventing new tables for statistical relational learning. ICML Workshop on Open Problems in Statistical Relational Learning, 2006
 [5] Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., Hansch, C.: Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry 34(2) (1991), 786–797
 [6] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7 (2006), 1–30
 [7] Dietterich, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89(1–2) (1997), 31–71
 [8] Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15 (2014), 3133–3181
 [9] França, M. V. M., Zaverucha, G., D'Avila Garcez, A.: Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning 94(1) (2014), 81–104
[10] Frank, R., Moser, F., Ester, M.: A method for multi-relational classification using single and multi-feature aggregation functions. Lecture Notes in Computer Science 4702 (2007), 430–437
[11] Guo, H., Viktor, H. L.: Mining relational databases with multi-view learning. Proceedings of the 4th International Workshop on Multi-Relational Mining – MRDM'05, 2005, 15–24
[12] Guo, H., Viktor, H. L.: Learning from skewed class multi-relational databases. Fundamenta Informaticae 89(1) (2008), 69–94
[13] Helma, C., Kramer, S., De Raedt, L.: The molecular feature miner MOLFEA. In: Proceedings of the Beilstein Workshop 2002: Molecular Informatics: Confronting Complexity, Beilstein Institut, 2002, 1–16
[14] Khosravi, H., Schulte, O., Hu, J., Gao, T.: Learning compact Markov logic networks with decision trees. Machine Learning 89(3) (2012), 257–277
[15] King, R. D., Sternberg, M., Srinivasan, A.: Relating chemical activity to structure: an examination of ILP successes. New Generation Computing 13(3–4) (1995), 411–433
[16] Knobbe, A. J., De Haas, M., Siebes, A.: Propositionalisation and aggregates. Lecture Notes in Computer Science 2168 (2001), 277–288
[17] Krogel, M.-A.: On propositionalization for knowledge discovery in relational databases. PhD Thesis, Otto-von-Guericke-Universität Magdeburg, 2005
[18] Krogel, M.-A., Rawles, S., Železný, F., Flach, P. A., Lavrač, N., Wrobel, S.: Comparative evaluation of approaches to propositionalization. In: Proceedings of the 13th International Conference on Inductive Logic Programming, volume 2835, 2003, 194–217
[19] Krogel, M.-A., Wrobel, S.: Facets of aggregation approaches to propositionalization. In: Inductive Logic Programming: 13th International Conference, Springer, Berlin, 2003, 30–39
[20] Kuželka, O.: Fast construction of relational features for machine learning. PhD Thesis, Czech Technical University, 2013
[21] Lahbib, D., Boullé, M., Laurent, D.: Itemset-based variable construction in multi-relational supervised learning. Lecture Notes in Computer Science 7842 (2013), 130–150
[22] Landwehr, N.: Integrating naive Bayes and FOIL. Journal of Machine Learning Research 8 (2007), 481–507
[23] Landwehr, N., Passerini, A., De Raedt, L., Frasconi, P.: kFOIL: Learning simple relational kernels. AAAI 6 (2006), 389–394
[24] Lavrač, N., Perovšek, M., Vavpetič, A.: Propositionalization online. In: ECML PKDD 2014, Springer-Verlag, 2014, 456–459
[25] Leiva, H. A., Atramentov, A., Honavar, V.: A multi-relational decision tree learning algorithm. Proceedings of the 13th International Conference on Inductive Logic Programming, 2002, 38–56
[26] Liu, H., Yin, X., Han, J.: An efficient multi-relational Naïve Bayesian classifier based on semantic relationship graph. Proceedings of the 4th International Workshop on Multi-Relational Mining – MRDM'05, Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2005), 2005, 39–48
[27] Lodhi, H., Muggleton, S.: Is mutagenesis still challenging? Proceedings of the 15th International Conference on Inductive Logic Programming, ILP 2005, Late-Breaking Papers, 2005, 35–40
[28] Macskassy, S. A., Provost, F.: A simple relational classifier. Technical Report, Stern New York University, 2003
[29] Di Mauro, N., Esposito, F.: Ensemble relational learning based on selective propositionalization. CoRR (2013), 1–10
[30] Michie, D., Muggleton, S., Page, D., Srinivasan, A.: To the international computing community: a new east-west challenge. Technical Report, Oxford University Computing Laboratory, Oxford, 1994
[31] Modi, S.: Relational classification using multiple view approach with voting. International Journal of Computer Applications 70(16) (2013), 31–36
[32] Neto, R. O., Adeodato, P. J. L., Salgado, A. C., Filho, D. R., Machado, G. R.: CoMoVi: a framework for data transformation in credit behavioral scoring applications using model driven architecture. SEKE 2014 (2014), 286–291
[33] Neville, J., Gallagher, B., Eliassi-Rad, T., Wang, T.: Correcting evaluation bias of relational classifiers with network cross validation. Knowledge and Information Systems 30(1) (2012), 31–55
[34] Perlich, C., Provost, F.: Distribution-based aggregation for relational learning with identifier attributes. Machine Learning 62(1–2) SPEC. ISS. (2006), 65–105
[35] Perovšek, M., Vavpetič, A., Kranjc, J., Cestnik, B., Lavrač, N.: Wordification: propositionalization by unfolding relational data into bags of words. Expert Systems with Applications 42(17–18) (2015), 6442–6456
[36] Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1–2) SPEC. ISS. (February 2006), 107–136
[37] Schulte, O., Bina, B., Crawford, B., Bingham, D., Xiong, Y.: A hierarchy of independence assumptions for multi-relational Bayes net classifiers. Proceedings of the 2013 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2013 – 2013 IEEE Symposium Series on Computational Intelligence, SSCI 2013, 2013, 150–159
[38] Srinivasan, A., King, R. D., Muggleton, S. H., Sternberg, M. J. E.: Carcinogenesis predictions using ILP. Inductive Logic Programming 1297 (1997), 273–287
[39] Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. UAI'02 Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002, 485–492
[40] Vens, C., Van Assche, A., Blockeel, H., Džeroski, S.: First order random forests with complex aggregates. Lecture Notes in Computer Science 3194 (2004), 323–340
[41] Wolpert, D.: The existence of a priori distinctions between learning algorithms. Neural Computation 8(7) (1996), 1391–1420
[42] Yin, X., Han, J., Yang, J., Yu, P. S.: CrossMine: efficient classification across multiple database relations. Lecture Notes in Computer Science 3848 LNAI(6) (2006), 172–195
[43] Zou, M., Wang, T., Li, H., Yang, D.: A general multi-relational classification approach using feature generation and selection. In: 6th International Conference, ADMA, Springer-Verlag, 2010, 21–33