<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Effective Configuration Learning Algorithm for Entity Resolution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khai Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <email>ichise@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Graduate University for Advanced Studies, Japan National Institute of Informatics</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity resolution is the problem of finding co-referent instances, that is, instances that describe the same topic. It is an important component of data integration systems and is indispensable in the linked data publication process. Entity resolution has been a subject of extensive research; however, the search for a perfect resolution algorithm remains a work in progress. Many approaches have been proposed for entity resolution. Among them, supervised entity resolution has been shown to be the most accurate approach [6, 2]. Meanwhile, configuration-based matching [2, 3, 5, 4] has attracted the most studies because of its advantages in scalability and interpretability. To match two instances from different repositories, configuration-based matching algorithms estimate the similarities between the values of the same attributes. These similarities are then aggregated into one matching score, which is used to determine whether the two instances are co-referent. The declarations of equivalent attributes, similarity measures, similarity aggregation, and acceptance threshold are specified by a matching configuration, which can be automatically optimized by a learning algorithm. Configuration learning using genetic algorithms has been the topic of several studies [2, 5, 3]. The limitation of genetic algorithms is that they require numerous iterations to reach convergence. We propose cLearn, a heuristic algorithm that is effective and more efficient. cLearn can be used to enhance the performance of any configuration-based entity resolution system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A configuration specifies the property mappings, similarity measures, similarity
aggregation strategy, and matching acceptance threshold. Property mappings
and similarity measures are combined into similarity functions. Given a
series of initial similarity functions, similarity aggregation options, and
labeled instance pairs, the task of cLearn is to select the optimal configuration.</p>
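      <p>The parts of a configuration listed above can be pictured as a small data structure. The following is a minimal sketch, not the ScSLINT implementation: the class names, the property names, the two similarity measures, and the averaging aggregator are all assumptions chosen for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, str]  # hypothetical representation: property name -> value


def string_similarity(a: str, b: str) -> float:
    """Illustrative similarity measure: case-insensitive character ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a: str, b: str) -> float:
    """Illustrative similarity measure: 1.0 for identical values, else 0.0."""
    return 1.0 if a == b else 0.0


@dataclass(frozen=True)
class SimilarityFunction:
    """A property mapping combined with a similarity measure."""
    source_property: str
    target_property: str
    measure: Callable[[str, str], float]

    def __call__(self, source: Instance, target: Instance) -> float:
        a = source.get(self.source_property)
        b = target.get(self.target_property)
        if a is None or b is None:
            return 0.0  # assumption: a missing value contributes no similarity
        return self.measure(a, b)


@dataclass(frozen=True)
class Configuration:
    """Similarity functions, an aggregation strategy, and an acceptance threshold."""
    functions: Tuple[SimilarityFunction, ...]
    aggregator: Callable[[List[float]], float]
    threshold: float

    def score(self, source: Instance, target: Instance) -> float:
        return self.aggregator([f(source, target) for f in self.functions])

    def is_coreferent(self, source: Instance, target: Instance) -> bool:
        return self.score(source, target) >= self.threshold


def average(scores: List[float]) -> float:
    return sum(scores) / len(scores)


# Toy configuration and instances (all values are made up for illustration).
config = Configuration(
    functions=(
        SimilarityFunction("name", "label", string_similarity),
        SimilarityFunction("year", "opened", exact_match),
    ),
    aggregator=average,
    threshold=0.8,
)
a = {"name": "Tokyo Tower", "year": "1958"}
b = {"label": "tokyo tower", "opened": "1958"}
c = {"label": "Eiffel Tower", "opened": "1889"}
print(config.is_coreferent(a, b), config.is_coreferent(a, c))
```
]]></preformat>
      <p>Keeping the similarity functions, aggregator, and threshold as separate fields mirrors the text: a learning algorithm can vary each of them independently while the matching logic stays unchanged.</p>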
      <p>[Table 1. Training size: 5%; varied by subset. Systems: ScSLINT+cLearn, ObjectCoref. Values: 0.894, 0.464.]</p>
      <p>cLearn begins by considering each similarity function individually and
then checks their combinations. When checking a new combination, the
algorithm applies a heuristic to select the most promising configurations.
Concretely, the heuristic accepts a new combination only if its performance
is better than that of the combined elements. This heuristic is reasonable because a
series of similarity functions that reduces the performance is unlikely
to yield a further combination that improves it. In addition to searching
for similarity functions, the algorithm also optimizes the similarity aggregator
and the matching acceptance threshold.</p>
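      <p>The search described in this section can be sketched as a level-wise procedure. This is only an illustration under stated assumptions, not the authors' code: performance is taken to be the F1 score on the labeled pairs, larger combinations are formed by joining two accepted sets that differ in one similarity function, and the aggregator and threshold are chosen by exhaustive search over small candidate lists. All names and the toy data are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f1_score(predictions, labels):
    """F1 of boolean predictions against boolean labels (assumed performance measure)."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_setting(functions, pairs, labels, aggregators, thresholds):
    """Best (f1, aggregator name, threshold) for one set of similarity functions."""
    best = (-1.0, None, None)
    for name, aggregate in aggregators.items():
        scores = [aggregate([f(pair) for f in functions]) for pair in pairs]
        for t in thresholds:
            f1 = f1_score([s >= t for s in scores], labels)
            if f1 > best[0]:
                best = (f1, name, t)
    return best


def learn_configuration(sim_functions, pairs, labels, aggregators, thresholds):
    def evaluate(names):
        functions = [sim_functions[n] for n in sorted(names)]
        return best_setting(functions, pairs, labels, aggregators, thresholds)

    # Step 1: consider each similarity function on its own.
    accepted = {frozenset([n]): evaluate([n]) for n in sim_functions}
    checked_sets = len(accepted)
    best_names, best = max(accepted.items(), key=lambda item: item[1][0])
    seen = set(accepted)
    # Step 2: repeatedly combine accepted sets into sets one function larger.
    while accepted:
        promoted = {}
        keys = list(accepted)
        for i, left in enumerate(keys):
            for right in keys[i + 1:]:
                union = left | right
                if len(union) != len(left) + 1 or union in seen:
                    continue
                seen.add(union)
                result = evaluate(union)
                checked_sets += 1
                # Heuristic: keep the combination only if it beats both parts.
                if result[0] > max(accepted[left][0], accepted[right][0]):
                    promoted[union] = result
                    if result[0] > best[0]:
                        best_names, best = union, result
        accepted = promoted
    return sorted(best_names), best, checked_sets


def equal_on(prop):
    """Toy similarity function: 1.0 if both instances agree on prop, else 0.0."""
    return lambda pair: 1.0 if pair[0][prop] == pair[1][prop] else 0.0


def inst(name, year):
    return {"name": name, "year": year, "city": "X"}  # city is uninformative


pairs = [(inst("A", "1"), inst("A", "1")), (inst("B", "2"), inst("B", "2")),
         (inst("A", "1"), inst("A", "9")), (inst("C", "2"), inst("D", "2")),
         (inst("E", "3"), inst("F", "4"))]
labels = [True, True, False, False, False]
sim_functions = {p: equal_on(p) for p in ("name", "year", "city")}
aggregators = {"average": lambda s: sum(s) / len(s), "minimum": min}
thresholds = (0.25, 0.5, 0.75, 1.0)

selected, setting, checked_sets = learn_configuration(
    sim_functions, pairs, labels, aggregators, thresholds)
print(selected, setting, checked_sets)
```
]]></preformat>
      <p>In this toy run, the combinations that include the uninformative <monospace>city</monospace> function do not beat their parts, so they are rejected and never extended, which is how the heuristic prunes the search.</p>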
      <p>cLearn is implemented as part of the ScSLINT framework, and its source code
is available at http://ri-www.nii.ac.jp/ScSLINT.</p>
    </sec>
    <sec id="sec-2">
      <title>Evaluation</title>
      <p>
        Table 1 reports the comparison between cLearn and other supervised systems,
including ObjectCoref [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and an AdaBoost-based instance matching system [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The OAEI 2010 dataset is used, and the same amount of training data is given to
each pair of compared systems. According to this table, cLearn consistently
outperforms the other algorithms.
      </p>
      <p>
        cLearn is efficient: the average number of configurations that cLearn has
to check before stopping is only 246. This number is promising because it is
much smaller than that required by a genetic algorithm, which is reported in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] with
a recommendation of 500 configurations for each iteration.
      </p>
      <p>With the effectiveness, potential efficiency, and small training data
requirement of cLearn on a real dataset like OAEI 2010, we believe that cLearn
has promising applications in supervised entity resolution, including the use of
an active learning strategy to further reduce the annotation effort.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , J., Cheng, G.,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>ObjectCoref &amp; Falcon-AO: results for OAEI 2010</article-title>
          .
          <source>In: 5th Ontology Matching</source>
          . pp.
          <volume>158</volume>
          –
          <issue>165</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Active learning of expressive linkage rules using genetic programming</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web 23</source>
          ,
          <issue>2</issue>
          –
          <fpage>15</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lyko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Unsupervised learning of link specifications: Deterministic vs. non-deterministic</article-title>
          .
          <source>In: 8th Ontology Matching</source>
          . pp.
          <volume>25</volume>
          –
          <issue>36</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ichise</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Interlinking linked data sources using a domain-independent system</article-title>
          .
          <source>In: 2nd JIST. LNCS</source>
          , vol.
          <volume>7774</volume>
          , pp.
          <volume>113</volume>
          –
          <fpage>128</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Aquin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
          </string-name>
          , E.:
          <article-title>Unsupervised learning of link discovery configuration</article-title>
          .
          <source>In: 9th ESWC. LNCS</source>
          , vol.
          <volume>7295</volume>
          , pp.
          <volume>119</volume>
          –
          <fpage>133</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Rong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>W.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>A machine learning approach for entity resolution based on similarity metrics</article-title>
          .
          <source>In: 11th ISWC. LNCS</source>
          , vol.
          <volume>7649</volume>
          , pp.
          <volume>460</volume>
          –
          <fpage>475</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>