<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ranking Feature for Classi er-based Instance Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khai Nguyen</string-name>
          <email>nhkhai@fit.hcmus.edu.vn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <email>ichiseg@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Informatics</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The Graduate University for Advanced Studies</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCMC</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Instance matching is the problem of nding the instances that describe the same object. It can be viewed as a classi cation problem, where a pair of two instances is predicted as match or non-match. A common limitation of existing classi er-based matching systems is the absence of instance pairs ranking. We propose using a ranking feature to enhance the classi er in instance matching. Experiments on real datasets con rm the signi cant improvement when applying our method.</p>
      </abstract>
      <kwd-group>
        <kwd>instance matching</kwd>
        <kwd>classi cation</kwd>
        <kwd>ranking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Instance matching detects the instances describing the same object in two
different repositories, RS and RT . This task can be considered as a classi cation
problem, in which each example represents a feature vector consisting of the
correlation indicators (e.g., literal similarities) of two instances [
        <xref ref-type="bibr" rid="ref3 ref6 ref7">3, 6, 7</xref>
        ]. For training
data, each example is also associated with a class, which is either matched (i.e.,
positive) or non-matched (i.e., negative). The matching task on a new example
is to predict its actual class.
      </p>
      <p>
        In instance matching, an important technique is blocking, which groups the
potentially matched instances into the same block [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, a simple
method is to group the instances sharing a number of tokens. By avoiding the
huge jRS RT j pairwise comparisons, blocking reduces the complexity of the
matching process.
      </p>
      <p>
        The ranking is important because of the di erent ambiguity between the
blocks. For example, the block of `Smith' is much larger than `Obama' and thus,
is more ambiguous. Discriminating the matched and non-matched for such blocks
should be based on their local characteristics. In other words, it is better to use
di erent discrimination strategies for di erent blocks instead of a single strategy
for all blocks. A ranking algorithm (e.g., stable matching [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], bipartite graph
matching [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and max-delta [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) is among possible solutions because it considers
only the most con dent pairs in each block as positive.
      </p>
      <p>Traditional classi ers discriminate the positive and the negative based on a
global boundary drawn from all training examples collected from all blocks. The
disadvantage of this approach is the local characteristics of each block is ignored.
Therefore, it is ine ective because the blocks are naturally heterogeneous in
terms of ambiguity.</p>
      <p>We propose to re ect the ranking value of an example (i.e., the vector
representing the correlations between two instances) as a feature. For each example,
a ranking feature is computed using the example itself and all the related
examples, which are drawn from the same block. This ranking feature is added to the
original vector as an extra element. As a result, the ranking value is included in
the nal feature vector. Finally, the classi ers take the nal vectors as the input.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>The ranking feature and optimization</title>
      <sec id="sec-2-1">
        <title>Feature design</title>
        <p>The general idea of the ranking feature is to capture the preference of an example
against all the related examples. Let Q be a collection of examples drawn from
a block. The ranking feature r(x; Q) of an example x of Q is de ned as follows.
r(x; Q) =
f (x) =
1</p>
        <p>X (f (x)
jQj t2Q</p>
        <p>1
1 + e wx
f (t))
(1)
where f is the logistic function and w is the weight vector. Each element of w
assigns the impact of a feature in the example x. w is optimized by a learning
algorithm using the training data. An example may exist in di erent blocks.
That is, the ranking feature of an example changes accordingly to the block of
interest. Higher value of r indicates a better rank of an example in a block.</p>
        <p>The logistic function is widely used in classi cation and regression because
it has a good ability to normalize the input, thus, it is useful in analyses.
Furthermore, logistic function is convex and easy to be optimized.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Optimization</title>
        <p>The goal of optimization is to nd the vector w based on the training data. The
optimality of the ranking feature is to maximize the r(x; Q) (Eq. 1) if x is a
positive example and to minimize this value otherwise. The optimization for w
also re ects this expectation.</p>
        <p>Let X be a set of training examples and is divided into di erent groups:
X = fQ1; Q2; :::g. Each group Qi is respective to the block i, from which the
examples of Qi are drawn. The optimization of w is done by minimizing the loss
L(w; X ), which is de ned as follows.</p>
        <p>Ranking Feature for Classi er-based Instance Matching
(2)
L(w; X ) = X</p>
        <p>X
Qi2X (x;y)2Qi Qi
(`(x)
`(y))
(f (x)</p>
        <p>1
f (y)) + 2 jjwjj2
where is the regularization parameter, which can be determined by using
validation data. ` is the class indicator. `(z) = 1 if z is positive, otherwise
`(z) = 0. The intuition of ` is to take only the preference of the examples of
di erent classes.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiment</title>
      <p>
        We use eight datasets for the experiment. Each dataset contains two repositories
with di erent properties and a collection of matched instances. We apply the
property alignment, blocking, and similarity estimation modules of cLink [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to
generate the examples. The property alignment creates the property mappings
between two repositories (e.g., name and label) using an overlap metric on the
values of the properties. The blocking generates the blocks of instances by using
token-based comparison. Two instances are placed in the same block if they share
at least one token. One pair may exist in di erent blocks. Each of such pairs is
represented by multiple examples with di erent ranking features. The similarity
estimation computes the feature vector for each instance pair. Each element of a
vector is the result of applying a similarity metric on a property mapping. That
is, multiple metrics can be applied for the same mapping. For strings, we use
exact matching, TF-IDF, Levenshtein, Jaro-Winkler, and Jaccard. For numbers,
we use reversed di erence. For date-times and URIs, we use exact matching. The
summary of the datasets are in Table 1.
      </p>
      <p>We compare the performance of the classi er when using and not using the
proposed ranking feature. We apply 5 folds cross-validation. We separate the
training set into two parts: 80% and 20%. The former is for minimizing L(w; X )
(Eq. 2) and the latter is for optimizing parameter. The hyperparameters of
the classi ers are not tuned. We split the examples based on the blocks so that
the separated sets are independent. Table 2 reports the F1 scores of 4
classi ers: Logistic regression (LR), J48 decision tree, Random Forest (RF), and
Support Vector Machine (SVM). In this table, the result of using ranking
feature is marked with `*'. The italic and bold numbers indicate the best result in
the context of same classi er and dataset, respectively. According to this table,
Random Forest achieves the best result. Furthermore, using ranking feature
enhances the performance in 26 out of 32 tests (81%). A paired t-test also reveals
that the improvement is statistically signi cant (p=0.0012). Such result con rms
the e ective role of ranking factor in classi er-based instance matching.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The experiment result shows that our proposed feature is promising. For further
research, we are interested in two directions. The rst one is to train the models
of the ranking feature and the classi er simultaneously. The second one is to
model any ranking algorithm so that it can be combined with classi ers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Christen</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A survey of indexing techniques for scalable record linkage and deduplication</article-title>
          .
          <source>IEEE Trans. on Knowledge and Data Engineering</source>
          <volume>24</volume>
          (
          <issue>9</issue>
          ). pp.
          <volume>1537</volume>
          {
          <fpage>1555</fpage>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Gal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roitman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>From diversity-based prediction to better ontology &amp; schema matching</article-title>
          . In: WWW'
          <year>2016</year>
          . pp.
          <volume>1145</volume>
          {
          <fpage>1155</fpage>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Kejriwal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranker</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised instance matching using boosted classi ers</article-title>
          . In: ESWC'
          <year>2015</year>
          . pp.
          <volume>388</volume>
          {
          <fpage>402</fpage>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Ho ner, K.:
          <article-title>RAVEN - Active learning of link speci cations</article-title>
          . In: OM'
          <year>2011</year>
          . pp.
          <volume>25</volume>
          {
          <fpage>36</fpage>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ichise</surname>
          </string-name>
          , R.:
          <article-title>Linked data entity resolution system enhanced by con guration learning algorithm</article-title>
          .
          <source>IEICE Trans. on Information and Systems</source>
          <volume>99</volume>
          (
          <issue>6</issue>
          ). pp.
          <volume>1521</volume>
          {
          <fpage>1530</fpage>
          . (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ichise</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>H.B.</given-names>
          </string-name>
          :
          <article-title>Learning approach for domain-independent linked data instance matching</article-title>
          . In: MDS'
          <year>2012</year>
          . pp.
          <volume>7</volume>
          {
          <fpage>15</fpage>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Rong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>W.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>A machine learning approach for instance matching based on similarity metrics</article-title>
          . In: ISWC'
          <year>2012</year>
          . pp.
          <volume>460</volume>
          {
          <fpage>475</fpage>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>