
Ranking Feature for Classifier-based Instance Matching

Khai Nguyen (0, 1, 2)

nhkhai@fit.hcmus.edu.vn

Ryutaro Ichise (0, 1)

ichiseg@nii.ac.jp

0 National Institute of Informatics, Japan
1 The Graduate University for Advanced Studies, Japan
2 University of Science, VNU-HCMC, Vietnam

Instance matching is the problem of finding the instances that describe the same object. It can be viewed as a classification problem, in which a pair of instances is predicted as match or non-match. A common limitation of existing classifier-based matching systems is that they do not rank instance pairs. We propose using a ranking feature to enhance the classifier in instance matching. Experiments on real datasets confirm a significant improvement when our method is applied.

Keywords: instance matching, classification, ranking

Introduction

Instance matching detects the instances describing the same object in two different repositories, $R_S$ and $R_T$. This task can be considered as a classification problem, in which each example is a feature vector consisting of the correlation indicators (e.g., literal similarities) of two instances [3, 6, 7]. For training data, each example is also associated with a class, which is either matched (i.e., positive) or non-matched (i.e., negative). The matching task on a new example is to predict its actual class.

In instance matching, an important technique is blocking, which groups the potentially matched instances into the same block [1]. For example, a simple method is to group the instances sharing a number of tokens. By avoiding the huge $|R_S \times R_T|$ pairwise comparisons, blocking reduces the complexity of the matching process.
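Token-based blocking can be sketched as follows. This is a minimal illustration and not the blocking module of any particular system: the whitespace tokenizer, the one-shared-token rule, and the names `tokenize`, `build_blocks`, and `candidate_pairs` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return set(text.lower().split())


def build_blocks(source, target):
    """Group instances that share a token.

    source, target: dicts mapping an instance id to a textual label.
    Returns a dict: token -> (source ids, target ids) whose labels contain it.
    """
    blocks = defaultdict(lambda: (set(), set()))
    for sid, label in source.items():
        for tok in tokenize(label):
            blocks[tok][0].add(sid)
    for tid, label in target.items():
        for tok in tokenize(label):
            blocks[tok][1].add(tid)
    # Keep only blocks that contain instances from both repositories.
    return {tok: ids for tok, ids in blocks.items() if ids[0] and ids[1]}


def candidate_pairs(blocks):
    """Yield (token, source id, target id) for every pair inside a block."""
    for tok, (sids, tids) in blocks.items():
        for sid, tid in product(sorted(sids), sorted(tids)):
            yield tok, sid, tid


# Hypothetical toy repositories.
source = {"s1": "John Smith", "s2": "Barack Obama"}
target = {"t1": "J. Smith", "t2": "Anna Smith", "t3": "B. Obama"}
blocks = build_blocks(source, target)
pairs = list(candidate_pairs(blocks))
```

On this toy input only the pairs inside the 'smith' and 'obama' blocks are generated, so fewer pairs are compared than the full cross product of the two repositories.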

Ranking is important because blocks differ in ambiguity. For example, the block of 'Smith' is much larger than that of 'Obama' and is thus more ambiguous. Discriminating the matched and non-matched pairs in such blocks should be based on their local characteristics. In other words, it is better to use different discrimination strategies for different blocks instead of a single strategy for all blocks. A ranking algorithm (e.g., stable matching [4, 5], bipartite graph matching [2], and max-delta [2]) is among the possible solutions because it considers only the most confident pairs in each block as positive.

Traditional classifiers discriminate the positive and the negative examples based on a global boundary drawn from all training examples collected from all blocks. The disadvantage of this approach is that the local characteristics of each block are ignored. Therefore, it is ineffective because the blocks are naturally heterogeneous in terms of ambiguity.

We propose to reflect the ranking value of an example (i.e., the vector representing the correlations between two instances) as a feature. For each example, a ranking feature is computed using the example itself and all the related examples, which are drawn from the same block. This ranking feature is added to the original vector as an extra element. As a result, the ranking value is included in the final feature vector. Finally, the classifiers take the final vectors as the input.

The ranking feature and optimization

Feature design

The general idea of the ranking feature is to capture the preference of an example against all the related examples. Let $Q$ be a collection of examples drawn from a block. The ranking feature $r(x, Q)$ of an example $x$ of $Q$ is defined as follows.

$$r(x, Q) = \frac{1}{|Q|} \sum_{t \in Q} \bigl(f(x) - f(t)\bigr), \qquad f(x) = \frac{1}{1 + e^{-wx}} \tag{1}$$

where $f$ is the logistic function and $w$ is the weight vector. Each element of $w$ assigns the impact of a feature in the example $x$. $w$ is optimized by a learning algorithm using the training data. An example may exist in different blocks. That is, the ranking feature of an example changes according to the block of interest. A higher value of $r$ indicates a better rank of an example in a block.
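The computation of the ranking feature can be sketched in a few lines. The sketch reads Eq. 1 as the mean of $f(x) - f(t)$ over all $t$ in the block, with $f(x) = 1 / (1 + e^{-wx})$. The weights and the toy block are hypothetical values chosen for illustration, and the function names are not taken from the paper.

```python
import math


def logistic(w, x):
    """f(x) = 1 / (1 + exp(-w . x)) for a weight vector w and a feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def ranking_feature(x, block, w):
    """r(x, Q): mean of f(x) - f(t) over all examples t of the block Q."""
    fx = logistic(w, x)
    return sum(fx - logistic(w, t) for t in block) / len(block)


def add_ranking_feature(block, w):
    """Return every example of the block with r(x, Q) appended as an extra element."""
    return [list(x) + [ranking_feature(x, block, w)] for x in block]


# Hypothetical weights; in the paper w is learned from training data.
w = [1.0, 1.0]
# One toy block of three examples, each a vector of two similarity scores.
block = [[0.9, 0.8], [0.4, 0.3], [0.1, 0.2]]
extended = add_ranking_feature(block, w)
```

Because each value is the difference between $f(x)$ and the block mean of $f$, the ranking features of a block sum to zero, and the example with the highest logistic score receives the largest value. An example that appears in several blocks would be extended once per block.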

The logistic function is widely used in classification and regression because it normalizes its input into the range (0, 1), which makes it convenient for analysis. Furthermore, the logistic function is smooth (differentiable everywhere) and therefore easy to optimize.

Optimization

The goal of optimization is to find the vector $w$ based on the training data. The optimality of the ranking feature is to maximize $r(x, Q)$ (Eq. 1) if $x$ is a positive example and to minimize this value otherwise. The optimization for $w$ also reflects this expectation.

Let $X$ be a set of training examples divided into different groups: $X = \{Q_1, Q_2, \ldots\}$. Each group $Q_i$ corresponds to the block $i$, from which the examples of $Q_i$ are drawn. The optimization of $w$ is done by minimizing the loss $L(w, X)$, which is defined as follows.

$$L(w, X) = -\sum_{Q_i \in X} \sum_{(x, y) \in Q_i \times Q_i} \bigl(\ell(x) - \ell(y)\bigr)\bigl(f(x) - f(y)\bigr) + \frac{1}{2} \lambda \|w\|^2 \tag{2}$$

where $\lambda$ is the regularization parameter, which can be determined by using validation data. $\ell$ is the class indicator: $\ell(z) = 1$ if $z$ is positive, otherwise $\ell(z) = 0$. The intuition of $\ell$ is to take only the preference between examples of different classes: pairs of the same class contribute nothing, and minimizing $L$ widens $f(x) - f(y)$ whenever $x$ is positive and $y$ is negative.
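One way to minimize this loss is plain gradient descent, sketched below. The sketch assumes the loss has the form $L(w, X) = -\sum_{Q_i} \sum_{(x, y)} (\ell(x) - \ell(y))(f(x) - f(y)) + \frac{1}{2} \lambda \|w\|^2$ and uses the identity $\partial f(x) / \partial w = f(x)(1 - f(x))\,x$. The optimizer, learning rate, step count, and toy data are assumptions; the paper does not state which learning algorithm is used.

```python
import math


def logistic(w, x):
    """f(x) = 1 / (1 + exp(-w . x))."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, groups, lam):
    """Loss and gradient for the assumed form of Eq. 2.

    groups: list of blocks; each block is a list of (features, label),
    with label 1 for a positive example and 0 for a negative one.
    """
    loss = 0.5 * lam * sum(wi * wi for wi in w)
    grad = [lam * wi for wi in w]
    for block in groups:
        scored = [(x, label, logistic(w, x)) for x, label in block]
        for x, lx, fx in scored:
            for y, ly, fy in scored:
                d = lx - ly
                if d == 0:
                    continue  # same class: the pair contributes nothing
                loss -= d * (fx - fy)
                sx = fx * (1.0 - fx)  # derivative factor of f at x
                sy = fy * (1.0 - fy)  # derivative factor of f at y
                for k in range(len(w)):
                    grad[k] -= d * (sx * x[k] - sy * y[k])
    return loss, grad


def fit(groups, dim, lam=0.1, lr=0.1, steps=200):
    """Gradient descent from w = 0 (assumed optimizer and settings)."""
    w = [0.0] * dim
    for _ in range(steps):
        _, grad = loss_and_grad(w, groups, lam)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical training data: two blocks of labeled similarity vectors.
groups = [
    [([0.9, 0.8], 1), ([0.4, 0.3], 0), ([0.2, 0.1], 0)],
    [([0.7, 0.9], 1), ([0.6, 0.2], 0)],
]
w = fit(groups, dim=2)
```

In practice `lam` would be chosen on held-out validation data, as described above, rather than fixed in advance.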

Experiment

We use eight datasets for the experiment. Each dataset contains two repositories with different properties and a collection of matched instances. We apply the property alignment, blocking, and similarity estimation modules of cLink [5] to generate the examples. The property alignment creates the property mappings between two repositories (e.g., name and label) using an overlap metric on the values of the properties. The blocking generates the blocks of instances by using token-based comparison. Two instances are placed in the same block if they share at least one token. One pair may exist in different blocks. Each such pair is represented by multiple examples with different ranking features. The similarity estimation computes the feature vector for each instance pair. Each element of a vector is the result of applying a similarity metric to a property mapping. That is, multiple metrics can be applied to the same mapping. For strings, we use exact matching, TF-IDF, Levenshtein, Jaro-Winkler, and Jaccard. For numbers, we use reversed difference. For date-times and URIs, we use exact matching. A summary of the datasets is given in Table 1.
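The construction of a feature vector from property mappings and similarity metrics can be sketched as follows. This is not the cLink implementation. The normalizations are assumptions: Levenshtein distance is turned into a similarity by dividing by the longer string length, and "reversed difference" is taken here to be $1 / (1 + |a - b|)$, which the text does not define. TF-IDF and Jaro-Winkler are omitted for brevity, and the property names are hypothetical.

```python
def exact_match(a, b):
    """1.0 if the two values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def jaccard(a, b):
    """Jaccard similarity over lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def levenshtein_distance(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def reversed_difference(a, b):
    """Assumed definition for numbers: 1 / (1 + |a - b|)."""
    return 1.0 / (1.0 + abs(a - b))


def feature_vector(src, tgt, mappings):
    """mappings: list of (source property, target property, metric)."""
    return [metric(src[p], tgt[q]) for p, q, metric in mappings]


# Hypothetical property mappings; several metrics share the same mapping.
mappings = [
    ("name", "label", exact_match),
    ("name", "label", jaccard),
    ("name", "label", levenshtein_similarity),
    ("born", "birthYear", reversed_difference),
]
src = {"name": "John Smith", "born": 1970}
tgt = {"label": "Jon Smith", "birthYear": 1970}
vec = feature_vector(src, tgt, mappings)
```

Each element of `vec` corresponds to one (mapping, metric) combination, which is why the same mapping can contribute several elements to the vector.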

We compare the performance of the classifiers with and without the proposed ranking feature. We apply 5-fold cross-validation. We separate the training set into two parts: 80% and 20%. The former is for minimizing $L(w, X)$ (Eq. 2) and the latter is for tuning the regularization parameter $\lambda$. The hyperparameters of the classifiers are not tuned. We split the examples based on the blocks so that the separated sets are independent. Table 2 reports the F1 scores of 4 classifiers: logistic regression (LR), J48 decision tree, Random Forest (RF), and Support Vector Machine (SVM). In this table, results obtained with the ranking feature are marked with '*'. The italic and bold numbers indicate the best result for the same classifier and for the same dataset, respectively. According to this table, Random Forest achieves the best result. Furthermore, using the ranking feature enhances the performance in 26 out of 32 tests (81%). A paired t-test also reveals that the improvement is statistically significant (p=0.0012). This result confirms the effective role of the ranking factor in classifier-based instance matching.
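The block-based splitting can be sketched as follows. The sketch assumes that whole blocks, identified by hypothetical block ids, are assigned to folds and that the 80%/20% separation of the training part is also done at block level. The function names and the fixed seed are illustrative choices.

```python
import random


def blockwise_folds(block_ids, k=5, seed=0):
    """Shuffle the block ids and deal them into k folds of near-equal size."""
    ids = sorted(block_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def train_validation_split(train_blocks, ratio=0.8, seed=0):
    """Split training blocks into a fitting part and a validation part."""
    ids = sorted(train_blocks)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * ratio)
    return ids[:cut], ids[cut:]


def cross_validation_splits(block_ids, k=5, seed=0):
    """Yield (fit blocks, validation blocks, test blocks) for each fold."""
    folds = blockwise_folds(block_ids, k, seed)
    for i, test in enumerate(folds):
        train = [b for j, fold in enumerate(folds) if j != i for b in fold]
        fit_part, val_part = train_validation_split(train, 0.8, seed)
        yield fit_part, val_part, test


# Hypothetical block ids.
block_ids = ["block%d" % i for i in range(25)]
splits = list(cross_validation_splits(block_ids))
```

Under these assumptions the fitting blocks would be used to minimize the loss of Eq. 2, the validation blocks to choose the regularization parameter, and the test blocks to evaluate the classifier. No block contributes examples to more than one of the three sets within a fold.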

Conclusion

The experimental results show that our proposed feature is promising. For further research, we are interested in two directions. The first one is to train the models of the ranking feature and the classifier simultaneously. The second one is to model any ranking algorithm so that it can be combined with classifiers.

[1] Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. on Knowledge and Data Engineering 24(9), pp. 1537-1555 (2012)

[2] Gal, A., Roitman, H., Sagi, T.: From diversity-based prediction to better ontology & schema matching. In: WWW'2016, pp. 1145-1155 (2016)

[3] Kejriwal, M., Miranker, D.P.: Semi-supervised instance matching using boosted classifiers. In: ESWC'2015, pp. 388-402 (2015)

[4] Ngomo, A.C.N., Lehmann, J., Auer, S., Höffner, K.: RAVEN - Active learning of link specifications. In: OM'2011, pp. 25-36 (2011)

[5] Nguyen, K., Ichise, R.: Linked data entity resolution system enhanced by configuration learning algorithm. IEICE Trans. on Information and Systems 99(6), pp. 1521-1530 (2016)

[6] Nguyen, K., Ichise, R., Le, H.B.: Learning approach for domain-independent linked data instance matching. In: MDS'2012, pp. 7-15 (2012)

[7] Rong, S., Niu, X., Xiang, W.E., Wang, H., Yang, Q., Yu, Y.: A machine learning approach for instance matching based on similarity metrics. In: ISWC'2012, pp. 460-475 (2012)