J. Yaghob (Ed.): ITAT 2015, pp. 172–178, Charles University in Prague, Prague, 2015

Benchmarking Classifier Performance with Sparse Measurements

Jan Motl
Czech Technical University, Prague, Czech Republic
jan.motl@fit.cvut.cz, WWW home page: http://relational.cvut.cz

Abstract: This paper describes a methodology for benchmarking classifiers when the performance measurements are sparse. The methodology is based on missing value imputation and is demonstrated to work even when 80% of the measurements are missing, for example because of unavailable algorithm implementations or unavailable datasets. The methodology was then applied to 29 relational classifiers and propositionalization tools on 15 datasets, making it the biggest meta-analysis in relational classification to date.

1 Introduction

You can't improve what you can't measure. However, in some fields the comparison of different approaches is demanding. For example, in the field of relational classifiers, essentially each classifier uses a different syntax and requires data in a different format, making the comparison of relational classifiers difficult. Despite these obstacles, each author of a new relational classifier attempts to prove that their algorithm is better than some previous algorithm and takes on the burden of comparing the algorithm to a small set of algorithms on a limited set of datasets. But how can we compare the algorithms if they are not evaluated on the same set of datasets?

                      Algo. 1   Algo. 2   Algo. 3
    Dataset A           0.55      0.5       0.45
    Dataset B           0.65      0.6       0.55
    Dataset C           0.95      1         0.9
    Average accuracy    0.72      0.7       0.63
    Average ranking     1.33      1.67      3
    # Wins              2         1         0

Table 1: Hypothetical evaluation of classifiers based on accuracy (bigger is better) with three ordering methods. In this scenario, all the methods are in agreement (Algorithm 1 is always the best).

                      Algo. 1   Algo. 2   Algo. 3
    Dataset A           0.55      0.5       -
    Dataset B           0.65      0.6       -
    Dataset C           -         1         0.9
    Average accuracy    0.6       0.7       0.9
    Average ranking     1         1.67      2
    # Wins              2         1         0

Table 2: With sparse measurements, the average measure predicts that Algorithm 3 is the best, while the remaining methods predict that Algorithm 1 is the best.

1.1 Literature Review

The biggest meta-analysis of relational classifiers (to our best knowledge) is "Is mutagenesis still challenging?" [27], where 19 algorithms are examined. While this analysis is highly interesting, it limits itself to a comparison of the classifiers on a single dataset.

The biggest analysis with regard to the number of used datasets is from Lahbib [21], where a single algorithm is tested on 10 datasets and 20 tasks (some datasets have multiple targets).

If we are interested in a comparison of multiple algorithms on multiple datasets, the counts are considerably smaller. For example, in the article from Bina [2], 6 algorithms are examined on 5 datasets.

This meta-analysis presents a comparison of 29 algorithms on 15 datasets.

1.2 Baseline

Traditionally, a set of algorithms is evaluated on a set of datasets, and the algorithms are then ordered with one (or multiple) of the following methods:

• Average measure
• Average ranking
• Count of wins

The different ordering methods [6] are illustrated on an example in Table 1. In this hypothetical scenario, 3 algorithms are evaluated on 3 datasets with accuracy. Based on each ordering method, the first algorithm is the best and the third algorithm is the worst.

But what if not all the measures are available? With the same data, but some of it missing, we can get different results (Table 2). Based on average accuracy, the third algorithm is the best. But we get this result only because the third algorithm was evaluated on the datasets with high average accuracy (i.e. easy datasets), while the rest of the algorithms were evaluated on datasets with lower average accuracy (i.e. hard datasets).
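For concreteness, the three ordering methods on the sparse data from Table 2 can be computed as follows. This is a short illustrative sketch in Python/NumPy, not code from the paper; NaN marks a missing measurement.

    import numpy as np

    # Accuracies from Table 2; np.nan marks a missing measurement.
    acc = np.array([[0.55, 0.50, np.nan],    # Dataset A
                    [0.65, 0.60, np.nan],    # Dataset B
                    [np.nan, 1.00, 0.90]])   # Dataset C

    # Average measure: mean accuracy over the datasets each algorithm was run on.
    avg_measure = np.nanmean(acc, axis=0)

    # Average ranking: rank algorithms within each dataset (1 = best), then average.
    def rank_row(row):
        ranks = np.full(row.shape, np.nan)
        measured = ~np.isnan(row)
        order = np.argsort(-row[measured])            # descending accuracy
        ranks[np.flatnonzero(measured)[order]] = np.arange(1, measured.sum() + 1)
        return ranks

    ranks = np.vstack([rank_row(row) for row in acc])
    avg_ranking = np.nanmean(ranks, axis=0)

    # Count of wins: how often an algorithm achieves the best accuracy on a dataset.
    wins = np.sum(ranks == 1, axis=0)

    print(avg_measure)   # [0.6, 0.7, 0.9]       -> Algorithm 3 looks best
    print(avg_ranking)   # approx. [1, 1.67, 2]  -> Algorithm 1 looks best
    print(wins)          # [2, 1, 0]             -> Algorithm 1 looks best

The sketch reproduces the numbers in Table 2 and shows how average accuracy favours Algorithm 3 on sparse data, while average ranking and count of wins still favour Algorithm 1.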
Average ranking and count of wins are more robust to missing values. However, neither of them is infallible. Imagine that someone publishes an algorithm together with its weaker version and measures the accuracy not only on the common datasets but also on thousands of randomly generated datasets that are never going to be classified by any other algorithm. Then the stronger version of the classifier is going to score at least a thousand wins and place first on the leaderboard, regardless of its score on the few common datasets.

If all the algorithms were evaluated on at least one common dataset, we could also order the algorithms just based on the common datasets. But if there isn't any dataset on which all the algorithms are evaluated, we have to come up with another solution.

The solution is to perform missing value imputation and convert the problem to a problem we can already solve.

2 Imputation

The proposed missing value imputation iteratively approximates

    acc ≈ alg ∗ dat                                                (1)

with the following pseudocode:

    acc = pivot(input, @mean)
    alg = rowmean(acc)
    dat = ones(1, ncol(acc))
    for i = 1:nit
        dat = dat + colmean(acc - alg * dat)
        alg = alg + rowmean(acc - alg * dat)
    end

Where:

input: Matrix with three columns: {algorithm name, dataset name, measured accuracy}.
acc: Matrix with accuracies, where algorithms are in rows and datasets in columns.
alg: Column vector with the average accuracy of the algorithms over all the datasets. Initialized to the average algorithm accuracy.
dat: Row vector with the relative difficulty of the datasets. Initialized to a vector of ones.
nit: Parameter describing the count of iterations. 10 iterations are commonly sufficient.
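The pseudocode translates almost line by line into NumPy. The sketch below is an illustrative reimplementation, assuming that missing measurements are stored as NaN and that rowmean/colmean ignore missing values; it is not the author's original code.

    import numpy as np

    def impute(acc, nit=10):
        """Iteratively approximate acc (algorithms x datasets, NaN = missing)
        with the outer product alg * dat, as in the pseudocode above."""
        alg = np.nanmean(acc, axis=1, keepdims=True)    # column vector: algorithm strength
        dat = np.ones((1, acc.shape[1]))                # row vector: dataset difficulty
        for _ in range(nit):
            residual = acc - alg * dat                  # missing cells stay NaN and are ignored
            dat = dat + np.nanmean(residual, axis=0, keepdims=True)
            residual = acc - alg * dat
            alg = alg + np.nanmean(residual, axis=1, keepdims=True)
        estimate = alg * dat                            # model's estimate for every cell
        return np.where(np.isnan(acc), estimate, acc), alg

    # Toy usage on the data from Table 2:
    acc = np.array([[0.55, 0.50, np.nan],
                    [0.65, 0.60, np.nan],
                    [np.nan, 1.00, 0.90]])
    filled, alg = impute(acc)
    print(alg.ravel())   # per-algorithm strength (average accuracy over all the datasets)

The paper's actual implementation may differ, for example in how pivot averages conflicting measurements of the same algorithm and dataset.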
2.1 Evaluation on a Dense Dataset

To assess the ability of the proposed imputation to properly order relational classifiers, a test on a related task was performed. Arguably the closest task to relational classification that is well benchmarkable is propositional classification – the most common type of classification, where a single table is classified.

Conveniently, accuracies of 179 propositional classifiers on 121 datasets were published in a recent study by Fernández-Delgado [8]. Since not all the examined algorithms always finished successfully, e.g. due to collinearity of the data, a dense submatrix of 179 algorithms on 14 datasets was extracted. The dense submatrix was then randomly sampled with a variable count of measurements and the missing values were imputed. The resulting learning curve, depicted in figure 1, suggests that once 20% of all algorithm × dataset combinations are used, a fairly good estimate of the actual ordering can be obtained.

Figure 1: Correlation (Pearson and Spearman) of the predicted algorithm order with the ground truth based on the proportion of missing data (x-axis: fraction of data, y-axis: correlation).

A comparison of the proposed imputation method to other imputation methods with regard to Root Mean Square Error (RMSE) is in figure 2. The reference methods (Global Average, User k-NN, Item k-NN, User Item Baseline, Slope One and Matrix Factorization) are from RapidMiner and their parameters were optimized with grid search.

Figure 2: RMSE of the sampled and imputed submatrix against the dense submatrix (x-axis: fraction of data, y-axis: RMSE).
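The learning-curve experiment can be replayed with the impute sketch from the previous section. The following code is purely illustrative: the synthetic dense matrix merely stands in for the 179 × 14 submatrix of [8], and sample_mask/ordering_quality are hypothetical helper names, not functions from the paper.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)

    def sample_mask(shape, fraction):
        """Keep roughly `fraction` of the cells, but at least one measurement
        per algorithm and per dataset so the imputation stays defined."""
        mask = rng.random(shape) < fraction
        mask[np.arange(shape[0]), rng.integers(shape[1], size=shape[0])] = True
        mask[rng.integers(shape[0], size=shape[1]), np.arange(shape[1])] = True
        return mask

    def ordering_quality(dense, fraction, nit=10):
        """Hide part of the dense matrix, impute it, and compare the predicted
        algorithm ordering with the ground-truth ordering."""
        sparse = np.where(sample_mask(dense.shape, fraction), dense, np.nan)
        _, alg = impute(sparse, nit=nit)         # impute() from the sketch above
        truth = dense.mean(axis=1)               # ground-truth mean accuracy per algorithm
        pearson = np.corrcoef(alg.ravel(), truth)[0, 1]
        spearman = spearmanr(alg.ravel(), truth).correlation
        return pearson, spearman

    # Synthetic stand-in for the 179 x 14 dense submatrix: each algorithm has
    # a strength, each dataset a difficulty, plus noise (purely illustrative).
    alg_true = rng.uniform(0.6, 0.9, size=(179, 1))
    dat_true = rng.uniform(0.9, 1.1, size=(1, 14))
    dense = np.clip(alg_true * dat_true + rng.normal(0, 0.02, (179, 14)), 0, 1)

    for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):
        print(fraction, ordering_quality(dense, fraction))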
2.2 Theoretical Evaluation

According to the No-Free-Lunch theorem [41], the best classifier will not be the same for all datasets. Hence we shouldn't even attempt to measure and average a classifier's performance over a wide set of datasets, but merely describe the strengths and weaknesses of different classifiers. In this respect, the selected imputation method fails because it is not able to model interactions between datasets and algorithms. Nevertheless, in practice some classifiers appear to be systematically better than others [8]. If all we want is to order classifiers based on their expected accuracy on a set of datasets, the inability to model interactions is irrelevant.

Another property of the used methodology is that it doesn't permit a mixture of measures. That is unfortunate, since some articles [18] report only accuracy, while other articles [10] report only precision, recall and F-measure.

Special attention is necessary when we are comparing results from several authors, because not only can the evaluation methodology differ (e.g. 10-fold cross-validation vs. a single hold-out sample), but the datasets can also differ despite a common name. For example, the canonical version of the East-West dataset (further abbreviated as Trains) contains 10 instances [30]. However, some authors prefer an extended version of the dataset with 20 instances [35]. Nevertheless, data quality is the pitfall common to all analyses. To alleviate the problem with different datasets in the future, a new (and the first) relational repository was founded at relational.fit.cvut.cz. Further discussion about the collected data is provided in the next section.

The final limitation of the method is that it doesn't provide trustworthy confidence intervals. The first reason is that measures for the same algorithm and dataset are averaged and treated as a single measure. The second reason is that the algorithm performs missing value imputation, violating the assumption of sample independence.

A list summarizing the advantages and constraints of the proposed method follows:

Advantages:
• Permits benchmarking with sparse measures.
• Respects that some datasets are tougher than others.
• Allows conflicting measurements (for example, by different authors).

Disadvantages:
• Neglects interactions between datasets and algorithms.
• Requires one common measure (e.g. we cannot mix accuracy and F-measure).
• Requires comparably prepared datasets (e.g. using the same instance count).
• Doesn't provide confidence intervals.

3 Classification of Relational Data

The proposed methodology for benchmarking with sparse measurements is applied to relational classifiers, including propositionalization tools. The following paragraphs describe the collected measures, the benchmarked algorithms and the datasets. The collected data can be downloaded from motl.us\benchmarking.

3.1 Measure Selection

Almost all relational classifiers in the literature are evaluated on classification accuracy, and only a few were evaluated with a different measure (like precision & recall, F-measure, AUC or AUC-PR); exceptions are CLAMF [10], ACORA [34] and SAYU [4], which are evaluated in the literature only with measures based on precision & recall. The meta-analysis therefore limits itself to classification accuracy. The methods of measuring accuracy may differ, but only testing accuracies (not training accuracies) were collected.

Other interesting measures, like runtime or memory consumption, were not evaluated, as they are rarely published. And even if they were published, they would hardly be comparable, as the measurements are platform dependent.

3.2 Algorithm Selection

The selection of relational classifiers and propositionalization tools was restricted to algorithms which:

• Were published in a conference or journal paper.
• Were benchmarked on at least four datasets.
• Were evaluated on classification accuracy.

The list of compared algorithms is in table 3.

    Algorithm        Algorithm type        Reference
    Aleph            ILP                   [35]
    CILP++∗          Neural Network        [9]
    CrossMine        ILP                   [42]
    E-NB∗            Probabilistic         [37]
    FOIL             ILP                   [23]
    FORF-NA          Decision Tree         [40]
    Graph-NB         Probabilistic         [26]
    HNBC∗            Probabilistic         [37]
    kFOIL            Kernel                [23]
    Lynx-RSM∗        Propositionalization  [29]
    MLN              Probabilistic         [36]
    MRDTL-2          Decision Tree         [25]
    MRNBC∗           Probabilistic         [37]
    MVC∗             Multi-View            [11]
    MVC-IM∗          Multi-View            [12]
    MulSVM∗          Propositionalization  [43]
    nFOIL∗           Probabilistic         [22]
    PIC∗             Probabilistic         [37]
    RELAGGS          Propositionalization  [17]
    RPT∗             Probabilistic         [28]
    RSD              Propositionalization  [18]
    RollUp           Propositionalization  [16]
    SINUS            Propositionalization  [18]
    SimFlat∗         Propositionalization  [12]
    SDF∗             Decision Tree         [2]
    TILDE            Decision Tree         [37]
    TreeLiker-Poly   ILP                   [20]
    TreeLiker-RelF   ILP                   [20]
    Wordification∗   Propositionalization  [35]

Table 3: List of 29 relational classifiers and propositional algorithms used in the meta-analysis. A star by the algorithm name marks algorithms for which no measurements by anyone other than the algorithm authors were found.
3.3 Dataset Selection

Datasets were selected based on the following criteria:

• The dataset has a defined classification target.
• The dataset consists of at least two tables.
• The dataset is used by at least four algorithms.

The used relational datasets are listed in table 4.

    Dataset          Target          #Instances   Reference
    Alzheimer        acetyl          1326         [15]
    Alzheimer        amine           686          [15]
    Alzheimer        memory          642          [15]
    Alzheimer        toxicity        886          [15]
    Carcinogenesis   carcinogenic    329          [38]
    ECML             insurance       7329         [19]
    Financial        loan status     682          [1]
    Hepatitis        biopsy          690          [35]
    IMDb             ratings         281449       [2]
    KRK              depth-of-win    1000         [18]
    Mondial          religion        185          [39]
    MovieLens        age             941          [2]
    Musk-small       musk            92           [7]
    Musk-large       musk            102          [7]
    Mutagenesis      mutagenic       188          [5]
    Thrombosis       degree          770          [3]
    Trains           direction       10           [30]
    UW-CSE           advisedBy       339          [36]

Table 4: List of 15 datasets with their targets (the Alzheimer dataset has multiple targets) used in the meta-analysis.

4 Results

The box plot in Figure 3 depicts the estimated average classification accuracies of 29 algorithms on 15 datasets (18 tasks). The input data consists of 26% of all algorithm × dataset combinations, making the estimates solid (recall figure 1). The accuracies were estimated with 1000 bootstrap samples. Whiskers depict 1.5 IQR.

Figure 3: Box plot with expected accuracies. Algorithms ordered from best to worst: Wordification, Lynx-RSM, RPT, MVC-IM, PIC, MVC, kFOIL, MRDTL-2, SimFlat, CrossMine, RELAGGS, SDF, MulSVM, RollUp, HNBC, SINUS, E-NB, nFOIL, RSD, RelF, MLN, MRNBC, Poly, Aleph, Graph-NB, FOIL, CILP++, TILDE, FORF-NA (x-axis: expected mean accuracy on all the datasets).

4.1 Validation

The ordering of algorithms from the meta-analysis should roughly correspond to the ordering of the algorithms in the individual articles. Differences in the orderings are evaluated with Spearman correlation in table 5. As we can see, the orderings in the literature can contradict each other – once RELAGGS is better than CrossMine, once CrossMine is better than RELAGGS.

    Ordering                                           Spearman correlation   Reference
    TILDE < FORF-NA < Graph-NB < RelF < Poly < SDF     0.89                   [2]
    MRNBC < TILDE < E-NB < HNBC < PIC                  0.9                    [37]
    TILDE < RELAGGS < CrossMine < MVC-IM               1.0                    [12]
    FOIL < TILDE < CrossMine < RELAGGS < MVC           0.9                    [31]

Table 5: Comparison of algorithm orderings in the literature with the ordering from the imputation.
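The accuracies in figures 3 and 5 were estimated with 1000 bootstrap samples. One plausible way to produce such estimates – resampling the collected (algorithm, dataset, accuracy) triples with replacement and re-running the imputation – is sketched below; the column names and the measurements.csv file are assumptions for illustration, not part of the paper.

    import numpy as np
    import pandas as pd

    def pivot(measurements):
        """Average conflicting measurements and pivot to an algorithms x datasets matrix."""
        return measurements.pivot_table(index="algorithm", columns="dataset",
                                        values="accuracy", aggfunc="mean")

    def bootstrap_alg(measurements, n_boot=1000, nit=10, seed=0):
        """Bootstrap the expected per-algorithm accuracy by resampling measurements."""
        rng = np.random.default_rng(seed)
        algorithms = sorted(measurements["algorithm"].unique())
        estimates = []
        for _ in range(n_boot):
            sample = measurements.sample(frac=1.0, replace=True,
                                         random_state=int(rng.integers(1 << 31)))
            acc = pivot(sample).reindex(index=algorithms)
            _, alg = impute(acc.to_numpy(), nit=nit)   # impute() from section 2
            estimates.append(alg.ravel())              # algorithms missing from a resample stay NaN
        return algorithms, np.vstack(estimates)        # n_boot x n_algorithms

    # Usage (assuming a CSV with columns algorithm, dataset, accuracy):
    # measurements = pd.read_csv("measurements.csv")
    # algorithms, est = bootstrap_alg(measurements)
    # for name, col in zip(algorithms, est.T):
    #     print(name, np.nanpercentile(col, [25, 50, 75]))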
5 Discussion

A summary of figure 3 based on algorithm type is in figure 4. Interestingly, kernel and multi-view approaches are on average the most accurate algorithms. But some propositionalization algorithms, namely Wordification, Lynx-RSM and RPT (Relational Probabilistic Tree), beat them. Nevertheless, note that propositionalization algorithms are overrepresented in the meta-analysis, making it more likely that some of them place at extreme positions.

Figure 4: Estimated accuracies by algorithm type, ordered from most to least accurate: Multi-View, Kernel, Propositionalization, Decision Tree, Probabilistic, ILP, Neural Network (x-axis: classification accuracy).

Also note that the performance of the algorithms is influenced by their settings. While the algorithms in figure 3 are ordered based on the average accuracy per algorithm × dataset combination, the algorithms in figure 5 are ordered based on the maximal known accuracy per algorithm × dataset combination. Notably, with the right setting, RSD improves its ranking by 5 positions.

Figure 5: Box plot with optimistic accuracies. Algorithms ordered from best to worst: RPT, Lynx-RSM, Wordification, PIC, MVC-IM, CrossMine, RELAGGS, kFOIL, MVC, MRDTL-2, SimFlat, RollUp, SDF, MulSVM, RSD, HNBC, SINUS, RelF, nFOIL, E-NB, MLN, Aleph, MRNBC, FOIL, Poly, Graph-NB, TILDE, CILP++, FORF-NA (x-axis: optimistic mean accuracy on all the datasets).

Finally, it is trusted that each individual author measured accuracy correctly. For example, in the article from 2014 [24], the authors of Wordification applied cross-validation only to the propositional classifiers, leaving the discretization and propositionalization out of the cross-validation. This design can lead to overly optimistic estimates of the accuracy. In the follow-up article from 2015 [35], the authors of Wordification already apply cross-validation to the whole process flow. Wordification accuracies used in this article are exclusively from [35].

5.1 Fairness of Comparison

A comparison based solely on classification accuracy can be unfair. For example, CLAMF∗ [10] and CoMoVi∗ [32] are designed to work with temporal datasets. Hence on the Financial dataset they estimate the probability of a loan default only from the data before the time of the loan application, while classifiers designed for static datasets also use data at and after the time of the loan application. All classifiers used in the meta-analysis treat all the datasets as if they were static.

Validation on temporal datasets can furthermore be complicated by repeated target events. For example, a customer may apply for a loan many times. And now there are two perfectly plausible goals – we may want to calculate the probability of default of a current customer with a history of loans, or the probability of default of a new customer. Generally, the second task is tougher, because one of the best predictors of a customer's behavior is the customer's past behavior. Nevertheless, all datasets in the meta-analysis have exactly one target value per classified object (including the Financial dataset). Note that the difference between within-network and across-network classification in the IMDb and MovieLens datasets is yet another issue [33].

Also the goals of modeling can differ. For example, the Markov Logic Network∗ in [14] is evaluated as a generative model, so the reported accuracies are over all predicates, not just the target one. And accuracies can vary substantially with respect to the chosen target predicate. All the algorithms in the meta-analysis are evaluated in a discriminative setting.

Additionally, not all classifiers are designed to perform well on a wide spectrum of datasets. Indeed, there are algorithms like MOLFEA [13] that are designed to work only on a narrow subset of datasets. A possible specialization of the algorithms in the meta-analysis is not taken into consideration.

At last, different authors may have a different ethos. Algorithms that were evaluated only by the algorithm authors (to our best knowledge) are marked with a star in table 3. And many algorithms that place at the top of the ranking are starred algorithms. However, this trend can also be explained with the following hypotheses:

• Recent algorithms tend to be better than the old algorithms. And recent algorithms (like Wordification [35]) did not have enough time to accumulate references.
• New algorithms tend to look better in comparison to mediocre algorithms than in comparison to the best algorithm in the field. Hence authors prefer to compare their algorithms against mediocre algorithms.
• Third-party evaluators do not have the knowledge and resources to find the best algorithm setting. Hence popular algorithms have, on average, low accuracy. This problem is partially mitigated by considering only the best reported accuracies in figure 5.

∗ The algorithm is not included in the meta-analysis because its accuracy wasn't measured on enough datasets.

Overall, the comparison of measurements from several sources is not a simple task at all.
6 Conclusion

Based on the performed analysis, Wordification, Lynx-RSM and the Relational Probabilistic Tree on average outperform the other 26 algorithms for relational classification. Other promising categories of relational classifiers are multi-view and kernel-based approaches.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 13-17187S.

References

[1] Berka, P.: Workshop notes on Discovery Challenge PKDD'99, 1999
[2] Bina, B., Schulte, O., Crawford, B., Qian, Z., Xiong, Y.: Simple decision forests for multi-relational classification. Decision Support Systems 54(3) (2013), 1269–1279
[3] Coursac, I., Duteil, N.: PKDD 2001 Discovery Challenge – Medical Domain, 2001
[4] Davis, J., Burnside, E., Page, D.: View learning extended: inventing new tables for statistical relational learning. ICML Workshop on Open Problems in Statistical Relational Learning, 2006
[5] Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., Hansch, C.: Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry 34(2) (1991), 786–797
[6] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7 (2006), 1–30
[7] Dietterich, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89(1–2) (1997), 31–71
[8] Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15 (2014), 3133–3181
[9] França, M. V. M., Zaverucha, G., D'Avila Garcez, A.: Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning 94(1) (2014), 81–104
[10] Frank, R., Moser, F., Ester, M.: A method for multi-relational classification using single and multi-feature aggregation functions. Lecture Notes in Computer Science 4702 (2007), 430–437
[11] Guo, H., Viktor, H. L.: Mining relational databases with multi-view learning. Proceedings of the 4th International Workshop on Multi-Relational Mining – MRDM'05, 2005, 15–24
[12] Guo, H., Viktor, H. L.: Learning from skewed class multi-relational databases. Fundamenta Informaticae 89(1) (2008), 69–94
[13] Helma, C., Kramer, S., De Raedt, L.: The molecular feature miner MOLFEA. In: Proceedings of the Beilstein Workshop 2002: Molecular Informatics: Confronting Complexity, Beilstein Institut, 2002, 1–16
[14] Khosravi, H., Schulte, O., Hu, J., Gao, T.: Learning compact Markov logic networks with decision trees. Machine Learning 89(3) (2012), 257–277
[15] King, R. D., Sternberg, M., Srinivasan, A.: Relating chemical activity to structure: an examination of ILP successes. New Generation Computing 13(3–4) (1995), 411–433
[16] Knobbe, A. J., De Haas, M., Siebes, A.: Propositionalisation and aggregates. Lecture Notes in Computer Science 2168 (2001), 277–288
[17] Krogel, M.-A.: On propositionalization for knowledge discovery in relational databases. PhD Thesis, Otto-von-Guericke-Universität Magdeburg, 2005
[18] Krogel, M.-A., Rawles, S., Železný, F., Flach, P. A., Lavrač, N., Wrobel, S.: Comparative evaluation of approaches to propositionalization. In: Proceedings of the 13th International Conference on Inductive Logic Programming, volume 2835, 2003, 194–217
[19] Krogel, M.-A., Wrobel, S.: Facets of aggregation approaches to propositionalization. In: Inductive Logic Programming: 13th International Conference, Springer, Berlin, 2003, 30–39
[20] Kuželka, O.: Fast construction of relational features for machine learning. PhD Thesis, Czech Technical University, 2013
[21] Lahbib, D., Boullé, M., Laurent, D.: Itemset-based variable construction in multi-relational supervised learning. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7842 LNAI (2013), 130–150
[22] Landwehr, N.: Integrating naive bayes and FOIL. Journal of Machine Learning Research 8 (2007), 481–507
[23] Landwehr, N., Passerini, A., De Raedt, L., Frasconi, P.: kFOIL: Learning simple relational kernels. AAAI 6 (2006), 389–394
[24] Lavrač, N., Perovšek, M., Vavpetič, A.: Propositionalization online. In: ECML PKDD 2014, Springer-Verlag, 2014, 456–459
[25] Leiva, H. A., Atramentov, A., Honavar, V.: A multi-relational decision tree learning algorithm. Proceedings of the 13th International Conference on Inductive Logic Programming, 2002, 38–56
[26] Liu, H., Yin, X., Han, J.: An efficient multi-relational Naïve Bayesian classifier based on semantic relationship graph. Proceedings of the 4th International Workshop on Multi-Relational Mining – MRDM'05, Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2005), 2005, 39–48
[27] Lodhi, H., Muggleton, S.: Is mutagenesis still challenging? Proceedings of the 15th International Conference on Inductive Logic Programming, ILP 2005, Late-Breaking Papers, 2005, 35–40
[28] Macskassy, S. A., Provost, F.: A simple relational classifier. Technical Report, Stern, New York University, 2003
[29] Di Mauro, N., Esposito, F.: Ensemble relational learning based on selective propositionalization. CoRR (2013), 1–10
[30] Michie, D., Muggleton, S., Page, D., Srinivasan, A.: To the international computing community: a new east-west challenge. Technical Report, Oxford University Computing Laboratory, Oxford, 1994
[31] Modi, S.: Relational classification using multiple view approach with voting. International Journal of Computer Applications 70(16) (2013), 31–36
[32] Neto, R. O., Adeodato, P. J. L., Salgado, A. C., Filho, D. R., Machado, G. R.: CoMoVi: a framework for data transformation in credit behavioral scoring applications using model driven architecture. SEKE 2014 (2014), 286–291
[33] Neville, J., Gallagher, B., Eliassi-Rad, T., Wang, T.: Correcting evaluation bias of relational classifiers with network cross validation. Knowledge and Information Systems 30(1) (2012), 31–55
[34] Perlich, C., Provost, F.: Distribution-based aggregation for relational learning with identifier attributes. Machine Learning 62(1–2) (2006), 65–105
[35] Perovšek, M., Vavpetič, A., Kranjc, J., Cestnik, B., Lavrač, N.: Wordification: propositionalization by unfolding relational data into bags of words. Expert Systems with Applications 42(17–18) (2015), 6442–6456
[36] Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1–2) (2006), 107–136
[37] Schulte, O., Bina, B., Crawford, B., Bingham, D., Xiong, Y.: A hierarchy of independence assumptions for multi-relational Bayes net classifiers. Proceedings of the 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2013), IEEE Symposium Series on Computational Intelligence (SSCI 2013), 2013, 150–159
[38] Srinivasan, A., King, R. D., Muggleton, S. H., Sternberg, M. J. E.: Carcinogenesis predictions using ILP. Inductive Logic Programming 1297 (1997), 273–287
[39] Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. UAI'02 Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002, 485–492
[40] Vens, C., Van Assche, A., Blockeel, H., Džeroski, S.: First order random forests with complex aggregates. Lecture Notes in Computer Science 3194 (2004), 323–340
[41] Wolpert, D.: The existence of a priori distinctions between learning algorithms. Neural Computation 8(7) (1996), 1391–1420
[42] Yin, X., Han, J., Yang, J., Yu, P. S.: CrossMine: efficient classification across multiple database relations. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3848 LNAI(6) (2006), 172–195
[43] Zou, M., Wang, T., Li, H., Yang, D.: A general multi-relational classification approach using feature generation and selection. In: 6th International Conference, ADMA, Springer-Verlag, 2010, 21–33