      A Heuristic Approach for Configuration
     Learning of Supervised Instance Matching

                         Khai Nguyen and Ryutaro Ichise

               The Graduate University for Advanced Studies, Japan
                     National Institute of Informatics, Japan
                          {nhkhai,ichise}@nii.ac.jp


       Abstract. Instance matching is the problem of finding instances that
       describe the same real-world object. The performance of an instance
       matching system relies on its matching configuration, which has to be
       optimized for each particular dataset. We propose cLearn, a heuristic-
       based algorithm that automatically searches for the most appropriate
       matching configuration for the given repositories. cLearn is effective
       on OAEI benchmarks and outperforms previous supervised systems.

       Keywords: supervised, instance matching, heuristic search, configura-
       tion learning.


1    Background
Finding co-referent instances is an important problem in knowledge discovery
and data mining, with broad applications in data integration and linked data
publication. Many approaches have been proposed for instance matching. Among
them, supervised instance matching has proven to be the most accurate
[8, 1, 3]. Meanwhile, configuration-based matching [3, 4, 6, 5] attracts many
studies because of its advantages in scalability and interpretability. This
approach uses the matching score of two instances to predict whether they are
co-referent. The score computation and the other settings of the matching
process are specified by a matching configuration, which can be optimized by
a learning algorithm. Configuration learning using a genetic algorithm has
been the topic of several studies [3, 6, 4]. The limitation of the genetic
algorithm is that it needs many iterations to converge. We propose cLearn, a
heuristic algorithm that is effective and more efficient. cLearn can be used
to enhance the performance of any configuration-based instance matching
system.
    Fig. 1 illustrates the general architecture of supervised, configuration-
based instance matching. Given two input repositories, Rsource and Rtarget,
the property alignment generator creates the property alignments, i.e., pairs
of properties that are expected to describe the same attribute. The similarity
function generator uses the property alignments to create a list of initial
similarity functions. A similarity function is specified by two pieces of
information: a property alignment [psource, ptarget] and a similarity measure
S (e.g., Levenshtein, Jaro-Winkler, and TF-IDF). For two instances x and y, a
similarity function calculates the similarity S(value(x, psource),
value(y, ptarget)), where value(a, p) extracts the attribute value declared
by property p of instance a.
[Figure 1 here. Components: property alignment generator → property
alignments → similarity function generator → initial similarity functions;
candidate generator → labeled/unlabeled candidates → configuration learner →
optimal configuration; similarity aggregator → matching scores → co-reference
filter → co-references.]

Fig. 1. The general architecture of configuration-based supervised entity
resolution systems.

The candidate generator reduces the huge number of pairwise alignments
between the instances of the input repositories by selecting only pairs of
potentially co-referent instances [7]. Such pairs are called candidates. In
supervised instance matching, an annotation step is applied to the candidate
set in order to obtain a labeled set. The labeled candidates are then
provided to the configuration learner. The similarity aggregator computes the
final matching score for the unlabeled candidates, using the similarity
functions and the aggregation function (e.g., linear, quadratic, and boolean)
produced by the configuration learner. The final component, the co-reference
filter, applies some constraints (e.g., stable matching) to the matching
scores to construct the final co-references [5].
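
    To make these components concrete, the following is a minimal Python
sketch of a similarity function and a linear aggregator, assuming instances
are modeled as property-value dictionaries; difflib.SequenceMatcher stands in
for a string measure such as Levenshtein, and all names here are illustrative
rather than the ScSLINT API.

    # A sketch only: instances as property-value dicts; SequenceMatcher
    # stands in for a string similarity measure such as Levenshtein.
    from difflib import SequenceMatcher

    def string_sim(a, b):
        """A string similarity measure S with values in [0, 1]."""
        return SequenceMatcher(None, a, b).ratio()

    def similarity(x, y, p_source, p_target, measure=string_sim):
        """One similarity function: S(value(x, p_source), value(y, p_target))."""
        return measure(str(x.get(p_source, "")), str(y.get(p_target, "")))

    def linear_aggregate(x, y, alignments, weights):
        """A linear aggregator: weighted average of the similarity functions."""
        scores = [similarity(x, y, ps, pt) for ps, pt in alignments]
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

    # Example: two descriptions of the same drug in different vocabularies.
    x = {"rdfs:label": "Aspirin", "drugbank:indication": "pain relief"}
    y = {"skos:prefLabel": "aspirin", "sider:indication": "relief of pain"}
    alignments = [("rdfs:label", "skos:prefLabel"),
                  ("drugbank:indication", "sider:indication")]
    print(linear_aggregate(x, y, alignments, weights=[0.7, 0.3]))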

2             The cLearn algorithm
A configuration specifies the combination of similarity functions Fsim, the
similarity aggregator Agg, and the acceptance threshold δ of the co-reference
filter. Given the lists of initial similarity functions Isim and similarity
aggregators Iagg, together with the labeled candidates, the mission of cLearn
is to select the optimal Fsim, Agg, and δ.
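
For the sketches in this paper, such a configuration could be encoded as the
following Python structure; this is a hypothetical encoding, not the actual
ScSLINT data model.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SimilarityFunction:
        p_source: str    # property of instances in Rsource
        p_target: str    # aligned property of instances in Rtarget
        measure: str     # e.g., "levenshtein", "jaro-winkler", "tfidf"

    @dataclass
    class Configuration:
        agg: str                        # aggregator Agg: "linear", "quadratic", ...
        f_sim: frozenset = frozenset()  # selected similarity functions Fsim
        delta: float = 0.0              # acceptance threshold δ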
    The pseudo code of cLearn is given in Algorithm 1¹. The labeled
candidates are divided into a training set T and a validation set V before
being input to cLearn. Init creates a configuration by assigning the given
values to Agg, Fsim, and δ. Match executes the similarity aggregator and the
co-reference filter in order to obtain the detected co-referent instances.
FindThreshold automatically assigns a value to δ: it first selects the top
|T+| candidates with the highest matching scores, where |T+| is the number of
actual co-references (positive candidates) in T, and then assigns to δ the
lowest score among the correctly detected co-references. Evaluate computes
the performance of entity resolution by comparing the generated results with
the labeled data. In cLearn, the F1 score is used as the default performance
metric because F1 is the main objective of the general instance matching
task. The metric can easily be changed (e.g., to precision or recall) to suit
the objectives of a particular task.
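
    As a reading of this description (not the authors' code), FindThreshold
can be sketched in Python as follows; scored is assumed to be a list of
(candidate, score) pairs produced by Match on T, and positives the set of
candidates labeled as actual co-references, i.e., T+.

    def find_threshold(scored, positives):
        """Assign δ: the lowest score of a correctly detected co-reference
        among the top |T+| highest-scoring candidates."""
        top = sorted(scored, key=lambda cs: cs[1], reverse=True)[:len(positives)]
        correct = [score for cand, score in top if cand in positives]
        # Fallback when no true co-reference reaches the top (unspecified
        # in the paper): accept everything by returning 0.
        return min(correct) if correct else 0.0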
    cLearn begins by considering each single similarity function and then
checks their combinations. The algorithm relies on two underlying heuristics.
The first is the limitation of the number of single similarity functions to
Ktop (line 10), which is set to 16 by default. The second is the direct
enhancement assumption (line 20): the performance of a new combination must
not be lower than that of either of its components.
¹ In this pseudocode, we use the dot (‘.’) notation to indicate the member
  accessor.


    Algorithm 1: cLearn
   Input: Training set T , validation set V , integer parameter Ktop ,
   list of similarity functions Isim , list of similarity aggregators Iagg
   Output: Optimal configuration Copt
 1 Cagg ← ∅
 2 foreach A ∈ Iagg do
 3      visited ← ∅
 4      foreach sim ∈ Isim do
 5          c ← Init(Agg ← A, Fsim ← sim, δ ← 0)
 6          links ← Match(c, T )
 7          c.δ ← FindThreshold(links, T )
 8          F1 ← Evaluate(links, T )
 9          visited ← visited ∪ {[c, F1]}
10      candidate ← TopHighestF1(visited, Ktop )
11         while candidate ≠ ∅ do
12         next ← ∅
13         foreach g ∈ candidate do
14             foreach h ∈ candidate do
15                c ← Init(Agg ← A, Fsim ← g.c.Fsim ∪ h.c.Fsim , δ ← 0)
16                links ← Match(c, T )
17                c.δ ← FindThreshold(links, T )
18                F1 ← Evaluate(links, T )
19                visited ← visited ∪ {[c, F1]}
20                if g.c.Fsim ≠ h.c.Fsim and F1 ≥ g.F1 and F1 ≥ h.F1 then
21                    next ← next ∪ {[c, F1]}

22          candidate ← next
 23      c ← argmax_{v∈visited} (v.F1).c
 24      F1 ← Evaluate(Match(c, V ), V )
 25      Cagg ← Cagg ∪ {[c, F1]}
 26   [Copt , F1] ← argmax_{v∈Cagg} (v.F1)
 27   return Copt



This heuristic does not guarantee a result identical to that of an exhaustive
search. However, it is reasonable, since a set of similarity functions that
reduces the performance has little chance of generating an improved further
combination, whereas an exhaustive search is extremely expensive.
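
    The search itself condenses into the following Python sketch of
Algorithm 1. It reuses the hypothetical Configuration and find_threshold from
above; match (scoring and filtering on a labeled set) and evaluate (F1
against the labels) are assumed helpers, T.positives is an assumed accessor
for T+, and a small seen-set guard is added so that revisiting the same union
of similarity functions cannot stall the loop.

    def clearn(aggregators, init_functions, T, V, k_top=16):
        """A sketch of Algorithm 1 under the two heuristics."""
        pos = T.positives            # T+, an assumed accessor of the labeled set
        best_per_agg = []
        for agg in aggregators:
            visited = []
            for f in init_functions:                 # score single functions
                cfg = Configuration(agg, frozenset([f]))
                links = match(cfg, T)
                cfg.delta = find_threshold(links, pos)
                visited.append((cfg, evaluate(links, T)))
            # Heuristic 1: only the Ktop best single functions seed the search.
            candidates = sorted(visited, key=lambda v: v[1], reverse=True)[:k_top]
            while candidates:
                nxt, seen = [], {c.f_sim for c, _ in visited}
                for g_cfg, g_f1 in candidates:
                    for h_cfg, h_f1 in candidates:
                        union = g_cfg.f_sim | h_cfg.f_sim
                        if union in seen:            # added guard: never re-test
                            continue
                        seen.add(union)
                        cfg = Configuration(agg, union)
                        links = match(cfg, T)
                        cfg.delta = find_threshold(links, pos)
                        f1 = evaluate(links, T)
                        visited.append((cfg, f1))
                        # Heuristic 2: direct enhancement assumption.
                        if f1 >= g_f1 and f1 >= h_f1:
                            nxt.append((cfg, f1))
                candidates = nxt
            # Validate the best configuration for this aggregator on V.
            best = max(visited, key=lambda v: v[1])[0]
            best_per_agg.append((best, evaluate(match(best, V), V)))
        return max(best_per_agg, key=lambda v: v[1])[0]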
    cLearn is implemented as part of the ScSLINT framework, and its source
code is available at http://ri-www.nii.ac.jp/ScSLINT.

3     Evaluation
We use the OAEI 2010 dataset to compare ScSLINT+cLearn with ObjectCoref [2]
and with the work in [8], which uses Adaboost to train a classifier. This
dataset contains five subsets, whose sizes range from medium to large. On
each respective subset, we give cLearn the same amount of training data as
the other system in the comparison.

                Table 1. F1 score of the compared systems on OAEI 2010.
Training size      System            Sider-     Sider-      Sider-     Sider-    DailyMed-
                                     Drugbank   Diseasome   DailyMed   DBpedia   DBpedia
5%                 ScSLINT+cLearn    0.911      0.824       0.777      0.6414    0.424
                   Adaboost          0.903      0.794       0.733      0.641     0.375
Varied by subset   ScSLINT+cLearn    0.894      0.829       0.722      -         -
                   ObjectCoref       0.464      0.743       0.708      -         -


For training, Adaboost uses 5% of the candidates, while ObjectCoref uses 20
actual co-references, which is equivalent to 2.3%, 11.6%, and 1.2% of the
candidates for the first three subsets, respectively. The results of
ObjectCoref on the last two subsets are not available. In order to reduce
random noise, we repeat the test 10 times on each subset and compute the
average results, which are reported in Table 1. According to this table,
ScSLINT+cLearn consistently outperforms the other systems. Compared to
ObjectCoref, ScSLINT+cLearn remarkably improves the results. Compared to
Adaboost, cLearn is much better on the two subsets related to DailyMed, the
repository that itself contains the highest number of co-references.
    cLearn is also efficient: the average number of configurations that it
has to check before stopping is only 246. This number is promising because it
is much smaller than that of the genetic algorithm, for which [3] recommends
500 configurations per iteration. A qualitative comparison with the genetic
algorithm is left as future work.
    Given the efficiency, the effectiveness, and the small training data
requirement of cLearn on a real dataset like OAEI 2010, we believe that
cLearn has promising applications in supervised instance matching, including
the use of an active learning strategy to further reduce the annotation
effort.

References
[1] Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string
    similarity measures. In: 9th KDD. pp. 39–48 (2003)
[2] Hu, W., Chen, J., Cheng, G., Qu, Y.: ObjectCoref & Falcon-AO: results for
    OAEI 2010. In: 5th Ontology Matching. pp. 158–165 (2010)
[3] Isele, R., Bizer, C.: Active learning of expressive linkage rules using genetic pro-
    gramming. Web Semantics: Science, Services and Agents on the World Wide Web
    23, 2–15 (2013)
[4] Ngomo, A.C.N., Lyko, K.: Unsupervised learning of link specifications: Determin-
    istic vs. non-deterministic. In: 8th Ontology Matching. pp. 25–36 (2013)
[5] Nguyen, K., Ichise, R., Le, B.: Interlinking linked data sources using a domain-
    independent system. In: 2nd JIST. LNCS, vol. 7774, pp. 113–128. Springer (2013)
[6] Nikolov, A., d’Aquin, M., Motta, E.: Unsupervised learning of link discovery con-
    figuration. In: 9th ESWC. LNCS, vol. 7295, pp. 119–133. Springer (2012)
[7] Papadakis, G., Ioannou, E., Niederée, C., Fankhauser, P.: Efficient entity
    resolution for large heterogeneous information spaces. In: 4th WSDM. pp.
    535–544 (2011)
[8] Rong, S., Niu, X., Xiang, W.E., Wang, H., Yang, Q., Yu, Y.: A machine learning
    approach for instance matching based on similarity metrics. In: 11th ISWC. LNCS,
    vol. 7649, pp. 460–475. Springer (2012)