Ranking Feature for Classifier-based Instance Matching

Khai Nguyen 1,2,3 and Ryutaro Ichise 1,2
1 The Graduate University for Advanced Studies, Japan
2 National Institute of Informatics, Japan
{nhkhai,ichise}@nii.ac.jp
3 University of Science, VNU-HCMC, Vietnam
nhkhai@fit.hcmus.edu.vn

Abstract. Instance matching is the problem of finding the instances that describe the same object. It can be viewed as a classification problem, where a pair of instances is predicted as a match or a non-match. A common limitation of existing classifier-based matching systems is the absence of instance-pair ranking. We propose using a ranking feature to enhance the classifier in instance matching. Experiments on real datasets confirm a significant improvement when applying our method.

Keywords: instance matching, classification, ranking.

1 Introduction

Instance matching detects the instances describing the same object in two different repositories, R_S and R_T. This task can be considered as a classification problem, in which each example represents a feature vector consisting of the correlation indicators (e.g., literal similarities) of two instances [3, 6, 7]. For training data, each example is also associated with a class, which is either matched (i.e., positive) or non-matched (i.e., negative). The matching task on a new example is to predict its actual class.

In instance matching, an important technique is blocking, which groups the potentially matched instances into the same block [1]. For example, a simple method is to group the instances sharing a number of tokens. By avoiding the huge |R_S × R_T| pairwise comparisons, blocking reduces the complexity of the matching process.

Ranking is important because the blocks differ in their degree of ambiguity. For example, the block of 'Smith' is much larger than that of 'Obama' and is therefore more ambiguous. Discriminating the matched from the non-matched pairs in such blocks should be based on their local characteristics. In other words, it is better to use different discrimination strategies for different blocks instead of a single strategy for all blocks. A ranking algorithm (e.g., stable matching [4, 5], bipartite graph matching [2], or max-delta [2]) is a possible solution because it considers only the most confident pairs in each block as positive.

Traditional classifiers discriminate the positive from the negative examples based on a global boundary drawn from the training examples collected from all blocks. The disadvantage of this approach is that the local characteristics of each block are ignored. It is therefore ineffective because the blocks are naturally heterogeneous in terms of ambiguity.

We propose to reflect the ranking value of an example (i.e., the vector representing the correlations between two instances) as a feature. For each example, a ranking feature is computed using the example itself and all the related examples, which are drawn from the same block. This ranking feature is added to the original vector as an extra element, so the ranking value is included in the final feature vector. Finally, the classifiers take the final vectors as input.

2 The ranking feature and optimization

2.1 Feature design

The general idea of the ranking feature is to capture the preference of an example against all the related examples. Let Q be a collection of examples drawn from a block. The ranking feature r(x, Q) of an example x of Q is defined as follows:

    r(x, Q) = \frac{1}{|Q|} \sum_{t \in Q} \big( f(x) - f(t) \big)    (1)

    f(x) = \frac{1}{1 + e^{-w \cdot x}}

where f is the logistic function and w is the weight vector. Each element of w assigns the impact of a feature of the example x. w is optimized by a learning algorithm using the training data.

An example may exist in different blocks; that is, the ranking feature of an example changes according to the block of interest. A higher value of r indicates a better rank of an example within a block.

The logistic function is widely used in classification and regression because it normalizes the input well, which makes it useful in analyses. Furthermore, the logistic function is smooth and easy to optimize.
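As an illustration, the following Python sketch (our own code, not part of the original system; names such as logistic, ranking_feature, and extend_with_ranking are ours) computes the ranking feature of Eq. (1) for every example in a block and appends it to the original vector, assuming the examples are given as NumPy arrays and the weight vector w has already been learned.

```python
import numpy as np

def logistic(x, w):
    """f(x) = 1 / (1 + exp(-w.x)): the logistic score of a feature vector."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def ranking_feature(x, Q, w):
    """r(x, Q) of Eq. (1): mean difference between the score of x and the
    scores of all examples drawn from the same block Q."""
    f_x = logistic(x, w)
    return np.mean([f_x - logistic(t, w) for t in Q])

def extend_with_ranking(Q, w):
    """Append the ranking feature to every example of a block, producing the
    final vectors that are fed to the classifier."""
    return np.array([np.append(x, ranking_feature(x, Q, w)) for x in Q])
```

Note that the sum in Eq. (1) runs over all of Q, so the term for t = x contributes zero and needs no special handling.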
2.2 Optimization

The goal of optimization is to find the weight vector w from the training data. The ranking feature is optimal when r(x, Q) (Eq. 1) is maximized if x is a positive example and minimized otherwise. The optimization of w reflects this expectation. Let X be the set of training examples, divided into groups X = {Q_1, Q_2, ...}, where each group Q_i corresponds to the block i from which its examples are drawn. w is optimized by minimizing the loss L(w, X), which is defined as follows:

    L(w, X) = \sum_{Q_i \in X} \; \sum_{(x, y) \in Q_i \times Q_i} -\big( \ell(x) - \ell(y) \big) \cdot \big( f(x) - f(y) \big) + \frac{\lambda}{2} \|w\|^2    (2)

where λ is the regularization parameter, which can be determined using validation data, and ℓ is the class indicator: ℓ(z) = 1 if z is positive and ℓ(z) = 0 otherwise. The intuition of ℓ is to take into account only the preference between examples of different classes.
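The paper does not specify which learning algorithm is used to fit w, so the following sketch uses plain gradient descent on Eq. (2) as one possible choice; the block and label encodings, as well as the function names, are our own assumptions.

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, blocks, labels, lam):
    """L(w, X) of Eq. (2) and its gradient.
    blocks: list of 2-D arrays, one per block Q_i (rows are examples).
    labels: list of 1-D 0/1 arrays with the class indicator of each example."""
    loss = 0.5 * lam * np.dot(w, w)
    grad = lam * w
    for Q, l in zip(blocks, labels):
        f = sigmoid(Q @ w)          # scores f(x) for every example in the block
        df = f * (1.0 - f)          # derivative of the logistic function
        for i, j in product(range(len(Q)), repeat=2):
            coeff = -(l[i] - l[j])  # only pairs of different classes contribute
            loss += coeff * (f[i] - f[j])
            grad += coeff * (df[i] * Q[i] - df[j] * Q[j])
    return loss, grad

def optimize_w(blocks, labels, lam=0.1, lr=0.01, epochs=200):
    """Plain gradient descent on L(w, X); a simple stand-in for the
    (unspecified) learning algorithm used for w."""
    w = np.zeros(blocks[0].shape[1])
    for _ in range(epochs):
        _, grad = loss_and_grad(w, blocks, labels, lam)
        w -= lr * grad
    return w
```

In practice, λ would be selected on held-out training data (the 20% split described in Section 3), for example by evaluating F1 over a small grid of candidate values.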
3 Experiment

We use eight datasets for the experiment. Each dataset contains two repositories with different properties and a collection of matched instances. We apply the property alignment, blocking, and similarity estimation modules of cLink [5] to generate the examples. The property alignment creates the property mappings between the two repositories (e.g., name and label) using an overlap metric on the values of the properties. The blocking generates the blocks of instances by using token-based comparison: two instances are placed in the same block if they share at least one token. One pair may exist in different blocks; each such pair is represented by multiple examples with different ranking features. The similarity estimation computes the feature vector for each instance pair. Each element of a vector is the result of applying a similarity metric to a property mapping; that is, multiple metrics can be applied to the same mapping. For strings, we use exact matching, TF-IDF, Levenshtein, Jaro-Winkler, and Jaccard. For numbers, we use reversed difference. For date-times and URIs, we use exact matching. The summary of the datasets is given in Table 1.

Table 1. Summary of datasets.

ID  Name                  #Pairwise    #Examples  #Blocks  #Features  #Positive
D1  DBLP-ACM              6.00×10^6    344,587    7,083    37         2,224
D2  ABT-Buy               1.18×10^6    72,185     2,057    23         1,097
D3  Amazon-GoogleProduct  4.40×10^6    100,672    8,413    27         1,300
D4  Sider-Drugbank        52.57×10^6   6,120      1,894    59         1,142
D5  Sider-Diseasome       21.76×10^6   4,227      604      31         344
D6  Sider-DailyMed        26.71×10^6   6,440      1,463    31         3,225
D7  Sider-DBpedia         11.17×10^9   931,701    6,290    58         1,449
D8  Dailymed-DBpedia      41.84×10^9   973,019    5,465    114        2,454

We compare the performance of the classifiers when using and when not using the proposed ranking feature. We apply 5-fold cross-validation and split the examples based on the blocks so that the separated sets are independent. The training set is further divided into two parts: 80% for minimizing L(w, X) (Eq. 2) and 20% for optimizing the λ parameter. The hyperparameters of the classifiers are not tuned. Table 2 reports the F1 scores of four classifiers: logistic regression (LR), J48 decision tree, random forest (RF), and support vector machine (SVM). In this table, the result of using the ranking feature is marked with '*'. The italic and bold numbers indicate the best result for the same classifier and for the same dataset, respectively. According to this table, random forest achieves the best results. Furthermore, using the ranking feature enhances the performance in 26 out of 32 tests (81%). A paired t-test also reveals that the improvement is statistically significant (p = 0.0012). These results confirm the effective role of the ranking factor in classifier-based instance matching.

Table 2. F1 scores when using the ranking feature (marked with '*') and when not using it.

ID  LR      LR*     J48     J48*    RF      RF*     SVM     SVM*
D1  0.9702  0.9774  0.9736  0.9866  0.9850  0.9886  0.9652  0.9748
D2  0.5956  0.5980  0.5766  0.6028  0.6470  0.6736  0.5410  0.6024
D3  0.5508  0.5224  0.5252  0.5682  0.5766  0.6380  0.4204  0.5114
D4  0.9480  0.9710  0.9564  0.9740  0.9704  0.9810  0.9486  0.9734
D5  0.9148  0.9114  0.9036  0.9076  0.9472  0.9472  0.9436  0.9436
D6  0.9548  0.9536  0.9780  0.9854  0.9804  0.9902  0.9578  0.9586
D7  0.7104  0.7046  0.7258  0.7232  0.6886  0.6994  0.7096  0.7156
D8  0.7934  0.7968  0.8258  0.8398  0.8562  0.8638  0.7930  0.8094

4 Conclusion

The experimental results show that our proposed feature is promising. For further research, we are interested in two directions. The first is to train the models of the ranking feature and the classifier simultaneously. The second is to model any ranking algorithm so that it can be combined with classifiers.

References

[1] Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. on Knowledge and Data Engineering 24(9), pp. 1537–1555 (2012)
[2] Gal, A., Roitman, H., Sagi, T.: From diversity-based prediction to better ontology & schema matching. In: WWW'2016, pp. 1145–1155 (2016)
[3] Kejriwal, M., Miranker, D.P.: Semi-supervised instance matching using boosted classifiers. In: ESWC'2015, pp. 388–402 (2015)
[4] Ngomo, A.C.N., Lehmann, J., Auer, S., Höffner, K.: RAVEN - Active learning of link specifications. In: OM'2011, pp. 25–36 (2011)
[5] Nguyen, K., Ichise, R.: Linked data entity resolution system enhanced by configuration learning algorithm. IEICE Trans. on Information and Systems 99(6), pp. 1521–1530 (2016)
[6] Nguyen, K., Ichise, R., Le, H.B.: Learning approach for domain-independent linked data instance matching. In: MDS'2012, pp. 7–15 (2012)
[7] Rong, S., Niu, X., Xiang, W.E., Wang, H., Yang, Q., Yu, Y.: A machine learning approach for instance matching based on similarity metrics. In: ISWC'2012, pp. 460–475 (2012)