Generalizing Matching Knowledge using Active Learning

Anna Primpeli
supervised by Christian Bizer
Data and Web Science Group, University of Mannheim
anna@informatik.uni-mannheim.de

Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017, Munich, Germany. Copyright (c) 2017 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Research on integrating small numbers of datasets suggests the use of customized matching rules in order to adapt to the patterns in the data and achieve better results. The state-of-the-art work on matching large numbers of datasets exploits attribute co-occurrence as well as the similarity of values between multiple sources. We build upon these research directions in order to develop a method for generalizing matching knowledge using minimal human intervention. The central idea of our research program is that even in large numbers of datasets of a specific domain, patterns (matching knowledge) reoccur, and discovering them can facilitate the integration task. Our proposed approach plans to use and extend existing work of our group on schema and instance matching as well as on learning expressive rules with active learning. We plan to evaluate our approach on publicly available e-commerce data collected from the Web.

1. INTRODUCTION

Data integration is a long-standing and very active research topic dealing with overcoming the semantic and syntactic heterogeneity of records located in the same or separate data sources [3]. While early work focused on integrating data from small numbers of datasets in a corporate context, there is an increasing body of research on integrating large numbers of datasets in the Web context, where an increased level of heterogeneity exists on both the instance and the schema-level.

The matching approaches dealing with the task of integrating large numbers of datasets can be categorized by the addressed integration scenario. One scenario is the N:1, in which multiple datasets are matched against a central source; for instance, web tables against DBpedia [8] or product entities against a central catalog. The second scenario is the N:M, in which datasets are matched with each other without the help of an intermediate schema or a knowledge base.

Widely used matching systems such as COMA [2] indicate the need for rich matcher and aggregator libraries in order to solve different types of heterogeneity and find correspondences. A major finding of our group on integrating small numbers of datasets is that specific matchers and aggregators deriving from such rich libraries, as well as property-specific data normalization techniques, can be combined into high-quality, domain-specific matching rules [5]. Such rules achieve a twofold goal: firstly, they give an insight into the current task by encoding matching knowledge, and secondly, they achieve high-quality correspondences by adapting to the nature of every matching scenario.

Research on integrating large numbers of datasets has shown that it is valuable to exploit attribute co-occurrence in the schema corpus as well as the similarity of data values not only between a central data source and a single external data source, but also between multiple sources [4, 10]. A weak spot of these approaches is that the employed data normalization techniques, similarity functions, and matching rules are not customized for the different types of entities and thus produce lower-quality results than customized techniques. The proposed research program builds upon this work and aims as its first goal to investigate the extent to which it is possible to generalize matching knowledge in order to improve matching quality in large-scale N:1 and N:M matching situations. The rationale for this approach is that typical patterns reoccur among entities of a certain domain. An example of such a pattern would be: "When matching entities representing the product type phones, it is effective to compare the attributes [brand, producer] using the following preprocessing methods [tokenization, lowercasing], the following similarity function [Levenshtein distance] and the following threshold [0.75]".
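To make the notion of reusable matching knowledge concrete, such a pattern could, for instance, be serialized as a small configuration object. The following Python sketch is only illustrative; the field names and layout are our own assumptions and not a fixed format of the proposed approach.

```python
# Illustrative sketch: one piece of domain-specific matching knowledge,
# encoding the phone example from above as a reusable configuration.
phone_brand_pattern = {
    "product_type": "phones",                   # domain the pattern applies to
    "attributes": ["brand", "producer"],        # attributes to compare
    "preprocessing": ["tokenization", "lowercasing"],
    "similarity_function": "levenshtein",
    "threshold": 0.75,                          # minimum similarity for a match
}
```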
In most matching situations, collaboration between humans and machines is helpful in order to judge corner cases or to bootstrap matching with a certain amount of supervision [12]. Previous work of our group on guiding the collaboration between humans and machines in small-scale matching scenarios has shown that active learning can produce high-quality matching results even with a small amount of human interaction [6]: the employed active learning approach was evaluated against six different datasets, reaching between 0.8 and 0.98 F1 score after asking the human annotator to label ten pairs of entities as positive or negative matches. Building upon this work and extending certain steps of the active learning process, we formulate the second goal of the thesis, which is steering human attention in large-scale matching situations with the aim to learn relevant, high-quality matching knowledge.

Summing up, in the context of this work, we aim to answer the following research questions:

• Are domain-specific patterns transferable within large-scale matching situations?

• How can we maximize the benefit of human supervision with respect to the discovery of those patterns?

In order to answer the above stated research questions, we will experiment with the N:1 and N:M matching scenarios using datasets created in the context of the Web Data Commons project (http://webdatacommons.org/), such as web tables, schema.org data and product data.

The remainder of this paper is organized as follows. Section 2 describes the proposed matching approach. Section 3 presents the data which we plan to use for evaluation. Finally, Section 4 gives a short overview of our workplan.

2. PROPOSED MATCHING APPROACH

This section gives an overview of the current plan for generalizing matching knowledge using active learning for the N:1 matching scenario. The planned approach involves three main steps. The first step is matching on the instance and schema-level with the goal to generate a preliminary set of schema correspondences which forms the basis for later refinement. Next, we build upon the concepts of the ActiveGenLink algorithm [6], an active learning approach based on genetic programming, which uses human supervision to evolve matching rules applicable on the instance-level. The final step is the refinement of the schema correspondences based on the instance-level matching results. The last two steps are iterated until the desired accuracy is achieved or the maximum number of questions to the human annotator has been reached.

Figure 1 shows the steps of the matching process which will be the main guideline of this research. Below, the individual steps are further explained, and the related state-of-the-art work upon which we build our proposed approach is presented together with our suggested methodological contributions.

Figure 1: Proposed matching approach pipeline

2.1 Initial matching

The first step of our algorithm involves matching on the instance and schema-level with the goal to generate an initial set of correspondences which will be refined in the next steps of the algorithm. To achieve this, we employ existing techniques for large-scale matching.

For the N:1 scenario we use the T2K matching algorithm, which involves cycles of instance and schema matching and achieves an F1 score of 0.8 on the instance-level and 0.7 on the schema-level for the task of matching web tables to DBpedia [11].

The resulting schema correspondences of this step are grouped into clusters, with each cluster representing a specific property such as product name or brand. The motivation behind property clusters is that matching information concerning one property can further affect the other elements of the cluster, as will be explained later in Section 2.6. In this step, we plan to reuse existing work to form our baseline for further improvement using active learning.

2.2 Construction of unlabeled instance pair pool

The second step involves the construction of an unlabeled pool of instance pairs that are potential candidates for labeling by the user. Considering the complexity involved with matching large-scale data as well as our goal of creating generalized matching rules, the unlabeled instance pair pool should be constructed on the basis of two guidelines: computational space reduction and preservation of matching knowledge.

To achieve computational space reduction we propose an indexing and a blocking technique. We use three types of information to build an index value out of every instance: the instance name, the attribute labels and the attribute values, using words or n-grams. After indexing, we make sure that the candidates for the unlabeled instance pair pool hold valuable information while eliminating the rest of them. To ensure this, different pair characteristics are evaluated. Besides being likely matches, such characteristics could be whether the involved entities are described by many frequent properties and whether they are head or tail entities, based on how often they occur in the data corpus.

After defining such informativeness criteria, we linearly scan over the entity names of the central source and generate a pair if it is considered informative. Next, the generated pair is added to the unlabeled instance pair pool. Thus, in this step we need to discover which characteristics make a pair a good candidate for the instance pair pool and how to combine them in order to draw the line between informative and non-informative pairs.
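The sketch below illustrates one possible realization of the indexing and blocking step described above. It is a simplified assumption rather than the finalized design: the informativeness test shown here (a minimum number of shared index keys plus a minimum number of properties) is just one example of how the criteria discussed above could be combined.

```python
from collections import defaultdict

def index_keys(instance, n=3):
    """Build index values from the instance name, attribute labels and
    attribute values, using lowercased word tokens and character n-grams."""
    text = " ".join([instance["name"]] +
                    [f"{k} {v}" for k, v in instance["attributes"].items()]).lower()
    tokens = text.split()
    ngrams = {tok[i:i + n] for tok in tokens for i in range(max(len(tok) - n + 1, 1))}
    return set(tokens) | ngrams

def build_instance_pair_pool(central_source, external_instances,
                             min_shared_keys=2, min_properties=3):
    """Linearly scan the central source and keep pairs that look informative."""
    # Inverted index (blocking) over the external instances.
    block = defaultdict(set)
    for ext_id, ext in enumerate(external_instances):
        for key in index_keys(ext):
            block[key].add(ext_id)

    pool = []
    for central in central_source:                      # linear scan
        candidate_counts = defaultdict(int)
        for key in index_keys(central):
            for ext_id in block.get(key, ()):
                candidate_counts[ext_id] += 1
        for ext_id, shared in candidate_counts.items():
            ext = external_instances[ext_id]
            informative = (shared >= min_shared_keys                  # likely match
                           and len(ext["attributes"]) >= min_properties)  # rich description
            if informative:
                pool.append((central, ext))
    return pool
```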
2.3 Construction of initial population of matching rules

As a next step, the initial linkage rules are created. We build upon the linkage rule definition introduced in [6]. A linkage rule is defined as a combination of different operators having a tree representation that gradually transforms with the evolution of the GenLink algorithm, a variation of the genetic algorithm [5]. A linkage rule contains the following set of operators:

(a) Property Operator: Selects the values of a property.

(b) Transformation Operator: Defines the transformation functions for the selected property values. Such transformations may be: case normalization, address standardization, stop-word removal, and structural transformations such as segmentation and extraction of values from URIs [5].

(c) Comparison Operator: Defines the distance metric and threshold that should be used to compare the selected values.

(d) Aggregation Operator: Defines the way to combine the results of the different comparison operators of the previous level.

The difference in our setting is that the property operators do not refer to specific properties but to property clusters, as introduced in Section 2.1. Thus, when a rule is applied to a specific instance pair from the pool of unlabeled pairs, the property operator checks if both entities contain a property which is part of any property cluster. If this is the case, the property operator outputs a pair of values. Otherwise it outputs an empty set of values. The functionality of the other operators remains the same.

Figure 2 shows an example rule of our matching approach. In the illustrated example the property operator selects the clusters that represent the product brand and the product name properties. In the next level, the values of the property brand are lowercased and a specific comparison operator is defined. Based on the weights and thresholds, the comparison operators normalize the similarity score to the range [0,1]. Finally, the results of the comparison operators are aggregated into a single score using the average aggregator, which finally decides if the pair is a positive or a negative correspondence.

Figure 2: An example matching rule
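A linkage rule of this form can be sketched as a small tree of operator objects. The classes below are illustrative assumptions made for this sketch (they are not the actual GenLink/Silk implementation), and the difflib SequenceMatcher similarity merely stands in for a Levenshtein-style measure; the sketch only shows how a cluster-based property operator, a transformation, comparisons, and an average aggregation could fit together.

```python
from difflib import SequenceMatcher

class PropertyOperator:
    """Selects the values of any property belonging to a property cluster."""
    def __init__(self, cluster):
        self.cluster = cluster                    # e.g. {"brand", "producer"}
    def apply(self, entity):
        return [v for p, v in entity.items() if p in self.cluster]

class TransformationOperator:
    """Applies a transformation function to the selected values."""
    def __init__(self, source, func):
        self.source, self.func = source, func     # e.g. str.lower
    def apply(self, entity):
        return [self.func(v) for v in self.source.apply(entity)]

class ComparisonOperator:
    """Normalized string similarity in [0, 1]; empty value sets yield 0."""
    def __init__(self, left, right, weight=1.0):
        self.left, self.right, self.weight = left, right, weight
    def score(self, a, b):
        va, vb = self.left.apply(a), self.right.apply(b)
        if not va or not vb:
            return 0.0
        return max(SequenceMatcher(None, x, y).ratio() for x in va for y in vb)

class AverageAggregator:
    """Combines the comparison scores and decides on a correspondence."""
    def __init__(self, comparisons, threshold=0.75):
        self.comparisons, self.threshold = comparisons, threshold
    def is_match(self, a, b):
        total = sum(c.weight * c.score(a, b) for c in self.comparisons)
        weights = sum(c.weight for c in self.comparisons)
        return total / weights >= self.threshold

# Example rule comparing the "brand" and "name" property clusters.
brand = TransformationOperator(PropertyOperator({"brand", "producer"}), str.lower)
name = PropertyOperator({"name", "title", "product name"})
rule = AverageAggregator([ComparisonOperator(brand, brand),
                          ComparisonOperator(name, name)])
```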
2.4 Pair selection from instance pool

In this step a pair of instances is selected from the instance pool and presented to the human annotator, who provides a label as matching or non-matching. The goal of this step is to define a query strategy that selects the most informative pair to be labeled, thus minimizing the human involvement in the whole process. For this we build on the three query strategies employed by [6]:

1. Query by committee evaluates the unlabeled pairs against the current population of matching rules and selects the pair that causes the biggest disagreement.

2. Query by divergence selects one pair out of every group in the similarity space, thus considering pairs which convey different similarity patterns.

3. Query by restricted committee uses the query by committee strategy but only considers the disagreements between the top K matching rules of the current population.

Our strategy will build upon the existing ones and further clarify which other criteria should be considered in order to maximize the benefit of the selected pair. One possible direction of our query strategy could be to prefer pairs that contain many properties, so that information about a bigger variety of properties can be learned after a pair has been annotated. In addition, using a mixture of head and tail entities can prove effective in revealing information concerning the whole domain instead of focusing only on the frequent entities. Another possible component of our query strategy could be the characteristics of the properties of the selected pairs. Such characteristics might be the frequency and the size of the cluster to which they belong. The rationale behind using those features is that if the answer of the human annotator gives further insight into a centroid of a property cluster, then other properties may be indirectly affected, as described in more detail in Section 2.6, thus leading to a faster convergence of the algorithm.
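As an illustration of the first strategy, a query-by-committee step can be sketched as follows. The disagreement measure (vote entropy over the rules' binary match decisions) and the property-count bonus are assumptions we make for this sketch; the exact combination of selection criteria is left open in the research program.

```python
import math

def disagreement(rules, pair):
    """Vote entropy of the committee's binary decisions for one candidate pair."""
    votes = [rule.is_match(*pair) for rule in rules]
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0                        # the committee fully agrees
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_next_pair(rules, unlabeled_pool, property_bonus=0.05):
    """Pick the pair the current rule population disagrees on most,
    slightly preferring pairs whose entities expose many properties."""
    def informativeness(pair):
        n_properties = sum(len(entity) for entity in pair)
        return disagreement(rules, pair) + property_bonus * n_properties
    return max(unlabeled_pool, key=informativeness)
```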
2.5 Linkage rule population evolution

In this step, we exploit the information provided in the previous step by the human annotator as supervision to evolve the population of matching rules. The goal is to gradually generate more customized and accurate matching rules which evaluate correctly against the labeled set of pairs. To achieve this we use the GenLink algorithm [5].

GenLink evolves the population of linkage rules in two steps, selection and transformation. Firstly, the matching rules are evaluated on the basis of their fitness on the current set of labeled pairs and selected using tournament selection [7]. The selected rules are then transformed by applying certain crossover operations. More specifically, a crossover operator accepts two linkage rules and returns an updated linkage rule that is built by recombining parts of the parents. In our setting we use the specific set of crossover operations of GenLink: transformation, distance measure, threshold, combine operators, aggregation function, weight, and aggregation hierarchy.
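The selection and transformation cycle can be sketched as follows. The fitness function (F1 on the labeled pairs), the tournament size, and the single threshold crossover shown here are simplified assumptions of this sketch; GenLink itself [5] applies its richer set of crossover operators over the full rule tree.

```python
import copy
import random

def f1_fitness(rule, labeled_pairs):
    """F1 score of a rule on labeled pairs [(entity_a, entity_b, is_match), ...]."""
    tp = fp = fn = 0
    for a, b, is_match in labeled_pairs:
        predicted = rule.is_match(a, b)
        tp += predicted and is_match
        fp += predicted and not is_match
        fn += (not predicted) and is_match
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tournament_select(population, labeled_pairs, k=3):
    """Pick the fittest rule out of a random tournament of size k."""
    contenders = random.sample(population, k)
    return max(contenders, key=lambda r: f1_fitness(r, labeled_pairs))

def threshold_crossover(parent_a, parent_b):
    """One illustrative crossover: the child keeps parent_a's structure
    but averages the aggregation thresholds of both parents."""
    child = copy.deepcopy(parent_a)
    child.threshold = (parent_a.threshold + parent_b.threshold) / 2
    return child

def evolve(population, labeled_pairs, generations=10):
    for _ in range(generations):
        parents = [tournament_select(population, labeled_pairs) for _ in population]
        population = [threshold_crossover(random.choice(parents), random.choice(parents))
                      for _ in population]
    return sorted(population, key=lambda r: f1_fitness(r, labeled_pairs), reverse=True)
```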
2.6 Evolution of property clusters

In the final step of our approach, the evolution of the property clusters which preserve the schema-level correspondences takes place. The goal is to gradually improve the matching accuracy on the schema-level whilst exploiting the information on the instance-level.

To achieve this we select the top rules of the current linkage rule population based on their fitness score and apply them on the unlabeled candidates. Possible metrics for calculating the fitness score are F1 or the Matthews correlation coefficient in case the set of reference links is unbalanced. As a result, we retrieve instance-level correspondences which we use as input for duplicate-based schema matching with the goal to refine the property clusters.

We follow the approach of DUMAS (Duplicate-based Matching of Schemas) [1] for improving schema correspondences based on instance-level duplicates. In their work, the meta-level similarities are evolved gradually by calculating similarities on the instance-level using the SoftTFIDF measure and then solving the transformed problem as a bipartite weighted matching one. In every iteration, schema-level matches are either confirmed, doubted, or rejected.

After calculating the schema-level similarities, the property clusters of our setting are refined. We will investigate how the move of one element from a cluster may affect the rest of the related elements. The indirect effects may be a result of frequent co-occurrence or strong similarity. For example, consider the property cluster setting C1 = {A, B, C} and C2 = {D, E}. Assuming that property A matches properties D and E, we move A to cluster C2. If we additionally know that property B is very close in the similarity space to property A, then B follows A, which yields the final state of the property clusters as: C1 = {C} and C2 = {A, B, D, E}.
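To illustrate the duplicate-based refinement, the sketch below derives attribute-to-attribute similarities from a set of duplicate instance pairs and solves the resulting bipartite weighted matching problem with the Hungarian algorithm. The token-overlap similarity used here is a stand-in for the SoftTFIDF measure of DUMAS [1], and the confirm/doubt thresholds are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def value_similarity(v1, v2):
    """Simple token-overlap similarity; DUMAS uses SoftTFIDF instead."""
    t1, t2 = set(str(v1).lower().split()), set(str(v2).lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def refine_schema_matches(duplicates, attrs_a, attrs_b, confirm=0.6, doubt=0.3):
    """duplicates: list of (record_a, record_b) dicts known to describe the
    same real-world entity. Returns confirmed/doubted/rejected attribute pairs."""
    sim = np.zeros((len(attrs_a), len(attrs_b)))
    for rec_a, rec_b in duplicates:
        for i, a in enumerate(attrs_a):
            for j, b in enumerate(attrs_b):
                sim[i, j] += value_similarity(rec_a.get(a, ""), rec_b.get(b, ""))
    sim /= max(len(duplicates), 1)

    # Bipartite weighted matching: maximize the total attribute similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    results = {"confirmed": [], "doubted": [], "rejected": []}
    for i, j in zip(rows, cols):
        pair = (attrs_a[i], attrs_b[j], sim[i, j])
        if sim[i, j] >= confirm:
            results["confirmed"].append(pair)
        elif sim[i, j] >= doubt:
            results["doubted"].append(pair)
        else:
            results["rejected"].append(pair)
    return results
```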
2.7 Convergence and output

The process iterates by selecting a new pair from the unlabeled instance pair pool, evolving the linkage rules, and further refining the property clusters, as described in Sections 2.4, 2.5 and 2.6. The cycle of iterations terminates when either the evolved linkage rules achieve the desired fitness score or the maximum number of questions to the user has been reached.

The outputs of the proposed matching approach are instance and schema-level correspondences as well as generalized matching knowledge deriving from the linkage rules with the best fitness score. In the N:1 matching scenario the acquired knowledge can be used to annotate the knowledge base with rules concerning attribute relevance for matching, appropriate similarity functions, data normalization transformations, aggregation functions, and thresholds. In the N:M matching scenario we aim to exploit the resulting matching knowledge rules to annotate the implicit, mediated schema created through holistically matching entities.

3. EVALUATION

We plan to evaluate our approach on e-commerce data we have already created in the context of the Web Data Commons project. The dataset contains 13 million product-related web pages retrieved from the 32 most frequently visited websites. We have manually annotated 500 electronic product entities and created a product catalog with 160 products of the same electronic categories. The total number of correspondences contained in our gold standard is 75,000, of which 1,500 are positive [9].

Other possible use cases for evaluating our approach would be web tables, linked open data and schema.org annotations. Web Data Commons provides the WDC Web Tables Corpus (http://webdatacommons.org/webtables/), the largest non-commercial corpus of web tables, derived from 1.78 billion HTML pages with 90.2 million relational tables.

4. WORKPLAN

The outlined research program is currently in its first year. As an initial step towards accomplishing the goals defined in the context of this work, we will focus on the N:1 matching scenario by applying the steps presented in Section 2. Next, we will move on to the N:M matching scenario, for which special indexing and blocking techniques need to be defined in order to deal with the increased complexity. Finally, provided that our proposed approach meets our goals, we aim to enhance existing knowledge bases by annotating them with matching knowledge.

5. REFERENCES

[1] A. Bilke and F. Naumann. Schema matching using duplicates. In Proc. of the 21st Int. Conf. on Data Engineering, pages 69–80. IEEE, 2005.
[2] H.-H. Do and E. Rahm. COMA: a system for flexible combination of schema matching approaches. In Proc. of the 28th Int. Conf. on Very Large Data Bases, pages 610–621. VLDB Endowment, 2002.
[3] A. Halevy, A. Rajaraman, and J. Ordille. Data integration: The teenage years. In Proc. of the 32nd Int. Conf. on Very Large Data Bases, pages 9–16. VLDB Endowment, 2006.
[4] B. He and K. C.-C. Chang. A holistic paradigm for large scale schema matching. SIGMOD Record, 33(4):20, 2004.
[5] R. Isele. Learning Expressive Linkage Rules for Entity Matching using Genetic Programming. PhD thesis, University of Mannheim, 2013.
[6] R. Isele, A. Jentzsch, and C. Bizer. Active learning of expressive linkage rules for the web of data. In Proc. of the 12th Int. Conf. on Web Engineering, pages 411–418. Springer, 2012.
[7] J. R. Koza, M. A. Keane, M. J. Streeter, W. Mydlowec, J. Yu, and G. Lanza. Genetic Programming IV: Routine Human-Competitive Machine Intelligence, volume 5. Springer Science & Business Media, 2006.
[8] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, and C. Bizer. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.
[9] P. Petrovski, A. Primpeli, R. Meusel, and C. Bizer. The WDC gold standards for product feature extraction and product matching. In Proc. of the 17th Int. Conf. on Electronic Commerce and Web Technologies, pages 73–86. Springer, Cham, 2016.
[10] E. Rahm. The case for holistic data integration. In Proc. of the 20th Advances in Databases and Information Systems Conf., pages 11–27. Springer, 2016.
[11] D. Ritze, O. Lehmberg, and C. Bizer. Matching HTML tables to DBpedia. In Proc. of the 5th Int. Conf. on Web Intelligence, Mining and Semantics, page 10. ACM, 2015.
[12] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data Curation at Scale: The Data Tamer System. In Proc. of the Conf. on Innovative Data Systems Research, 2013.