Generalizing Matching Knowledge using Active Learning

Anna Primpeli
supervised by Christian Bizer
Data and Web Science Group, University of Mannheim
anna@informatik.uni-mannheim.de

Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017, Munich, Germany. Copyright (c) 2017 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Research on integrating small numbers of datasets suggests the use of customized matching rules in order to adapt to the patterns in the data and achieve better results. The state-of-the-art work on matching large numbers of datasets exploits attribute co-occurrence as well as the similarity of values between multiple sources. We build upon these research directions in order to develop a method for generalizing matching knowledge using minimal human intervention. The central idea of our research program is that even in large numbers of datasets of a specific domain, patterns (matching knowledge) reoccur, and discovering them can facilitate the integration task. Our proposed approach plans to use and extend existing work of our group on schema and instance matching as well as on learning expressive rules with active learning. We plan to evaluate our approach on publicly available e-commerce data collected from the Web.

1. INTRODUCTION

Data integration is a long-standing and very active research topic dealing with overcoming the semantic and syntactic heterogeneity of records located in the same or separate data sources [3]. While early work focused on integrating data from small numbers of datasets in a corporate context, there is an increasing body of research on integrating large numbers of datasets in the Web context, where an increased level of heterogeneity exists on both the instance and the schema-level.

The matching approaches dealing with the task of integrating large numbers of datasets can be categorized by the addressed integration scenario. One scenario is the N:1, in which multiple datasets are matched against a central source; for instance, web tables against DBpedia [8] or product entities against a central catalog. The second scenario is the N:M, in which datasets are matched with each other without the help of an intermediate schema or a knowledge base.

Widely used matching systems such as COMA [2] indicate the need for rich matcher and aggregator libraries in order to solve different types of heterogeneity and find correspondences. A major finding of our group on integrating small numbers of datasets is that specific matchers and aggregators deriving from such rich libraries, as well as property-specific data normalization techniques, can be combined into high-quality, domain-specific matching rules [5]. Such rules achieve a twofold goal: firstly, they give an insight into the current task by encoding matching knowledge, and secondly, they achieve high-quality correspondences by adapting to the nature of every matching scenario.

Research on integrating large numbers of datasets has shown that it is valuable to exploit attribute co-occurrence in the schema corpus as well as the similarity of data values not only between a central data source and a single external data source, but also between multiple sources [4, 10]. A weak spot of these approaches is that the employed data normalization techniques, similarity functions, and matching rules are not customized for the different types of entities and thus produce lower-quality results than customized techniques. The proposed research program builds upon this work and aims as its first goal to investigate the extent to which it is possible to generalize matching knowledge in order to improve matching quality in large-scale N:1 and N:M matching situations. The rationale for this approach is that typical patterns reoccur among entities of a certain domain. An example of such a pattern would be: "When matching entities representing the product type phones, it is effective to compare the attributes [brand, producer] using the following preprocessing methods [tokenization, lowercasing], the following similarity function [Levenshtein distance] and the following threshold [0.75]".
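To make the notion of reusable matching knowledge concrete, such a pattern could, for instance, be serialized as a small configuration object. The following Python sketch is only illustrative; the field names and layout are our own assumptions and not a fixed format of the proposed approach.

```python
# Illustrative sketch: one piece of domain-specific matching knowledge,
# encoding the phone example from above as a reusable configuration.
phone_brand_pattern = {
    "product_type": "phones",                   # domain the pattern applies to
    "attributes": ["brand", "producer"],        # attributes to compare
    "preprocessing": ["tokenization", "lowercasing"],
    "similarity_function": "levenshtein",
    "threshold": 0.75,                          # minimum similarity for a match
}
```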
In most matching situations, collaboration between humans and machines is helpful in order to judge corner cases or to bootstrap matching with a certain amount of supervision [12]. Previous work of our group on guiding the collaboration between humans and machines in small-scale matching scenarios has shown that active learning can produce high-quality matching results even with a small amount of human interaction [6]: the employed active learning approach was evaluated against six different datasets, reaching between 0.8 and 0.98 F1 score after asking the human annotator to label ten pairs of entities as positive or negative matches. Building upon this work and extending certain steps of the active learning process, we formulate the second goal of the thesis, which is steering human attention in large-scale matching situations with the aim to learn relevant, high-quality matching knowledge.

Summing up, in the context of this work, we aim to answer the following research questions:

• Are domain-specific patterns transferable within large-scale matching situations?

• How can we maximize the benefit of human supervision with respect to the discovery of those patterns?

In order to answer the above stated research questions, we will experiment with the N:1 and N:M matching scenarios using datasets created in the context of the Web Data Commons project (http://webdatacommons.org/), such as web tables, schema.org data and product data.

The remainder of this paper is organized as follows. Section 2 describes the proposed matching approach. Section 3 presents the data which we plan to use for evaluation. Finally, Section 4 gives a short overview of our workplan.

2. PROPOSED MATCHING APPROACH

This section gives an overview of the current plan for generalizing matching knowledge using active learning for the N:1 matching scenario. The planned approach involves three main steps. The first step is matching on the instance and schema-level with the goal to generate a preliminary set of schema correspondences which forms the basis for later refinement. Next, we build upon the concepts of the ActiveGenLink algorithm [6], an active learning approach based on genetic programming, which uses human supervision to evolve matching rules applicable on the instance-level. The final step is the refinement of the schema correspondences based on the instance-level matching results. The last two steps are iterated until the desired accuracy is achieved or the maximum number of questions to the human annotator has been reached.

Figure 1 shows the steps of the matching process which will be the main guideline of this research. Below, the individual steps are further explained, and the related state-of-the-art work upon which we build our proposed approach is presented together with our suggested methodological contributions.

Figure 1: Proposed matching approach pipeline

2.1 Initial matching

The first step of our algorithm involves matching on the instance and schema-level with the goal to generate an initial set of correspondences which will be refined in the next steps of the algorithm. To achieve this, we employ existing techniques for large-scale matching.

For the N:1 scenario we use the T2K matching algorithm, which involves cycles of instance and schema matching and achieves an F1 score of 0.8 on the instance-level and 0.7 on the schema-level for the task of matching web tables to DBpedia [11].

The resulting schema correspondences of this step are grouped into clusters, with each cluster representing a specific property such as product name or brand. The motivation behind property clusters is that matching information concerning one property can further affect the other elements of the cluster, as will be explained later in Section 2.6. In this step, we plan to reuse existing work to form our baseline for further improvement using active learning.

2.2 Construction of unlabeled instance pair pool

The second step involves the construction of an unlabeled pool of instance pairs that are potential candidates for labeling by the user. Considering the complexity involved with matching large-scale data as well as our goal of creating generalized matching rules, the unlabeled instance pair pool should be constructed on the basis of two guidelines: computational space reduction and preservation of matching knowledge.

To achieve computational space reduction we propose an indexing and a blocking technique. We use three types of information to build an index value out of every instance: the instance name, the attribute labels and the attribute values, using words or n-grams. After indexing, we make sure that the candidates for the unlabeled instance pair pool hold valuable information while eliminating the rest of them. To ensure this, different pair characteristics are evaluated. Besides being likely matches, such characteristics could be whether the involved entities are described by many frequent properties and whether they are head or tail entities, based on how often they occur in the data corpus.

After defining such informativeness criteria, we linearly scan over the entity names of the central source and generate a pair if it is considered informative. Next, the generated pair is added to the unlabeled instance pair pool. Thus, in this step we need to discover which characteristics make a pair a good candidate for the instance pair pool and how to combine them in order to draw the line between informative and non-informative pairs.
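The sketch below illustrates one possible realization of the indexing and blocking step described above. It is a simplified assumption rather than the finalized design: the informativeness test shown here (a minimum number of shared index keys plus a minimum number of properties) is just one example of how the criteria discussed above could be combined.

```python
from collections import defaultdict

def index_keys(instance, n=3):
    """Build index values from the instance name, attribute labels and
    attribute values, using lowercased word tokens and character n-grams."""
    text = " ".join([instance["name"]] +
                    [f"{k} {v}" for k, v in instance["attributes"].items()]).lower()
    tokens = text.split()
    ngrams = {tok[i:i + n] for tok in tokens for i in range(max(len(tok) - n + 1, 1))}
    return set(tokens) | ngrams

def build_instance_pair_pool(central_source, external_instances,
                             min_shared_keys=2, min_properties=3):
    """Linearly scan the central source and keep pairs that look informative."""
    # Inverted index (blocking) over the external instances.
    block = defaultdict(set)
    for ext_id, ext in enumerate(external_instances):
        for key in index_keys(ext):
            block[key].add(ext_id)

    pool = []
    for central in central_source:                      # linear scan
        candidate_counts = defaultdict(int)
        for key in index_keys(central):
            for ext_id in block.get(key, ()):
                candidate_counts[ext_id] += 1
        for ext_id, shared in candidate_counts.items():
            ext = external_instances[ext_id]
            informative = (shared >= min_shared_keys                  # likely match
                           and len(ext["attributes"]) >= min_properties)  # rich description
            if informative:
                pool.append((central, ext))
    return pool
```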
2.3 Construction of initial population of matching rules

As a next step, the initial linkage rules are created. We build upon the linkage rule definition introduced in [6]. A linkage rule is defined as a combination of different operators having a tree representation that gradually transforms with the evolution of the GenLink algorithm, a variation of the genetic algorithm [5]. A linkage rule contains the following set of operators:

(a) Property Operator: Selects the values of a property.

(b) Transformation Operator: Defines the transformation functions for the selected property values. Such transformations may be: case normalization, address standardization, stop-word removal, and structural transformations such as segmentation and extraction of values from URIs [5].

(c) Comparison Operator: Defines the distance metric and threshold that should be used to compare the selected values.

(d) Aggregation Operator: Defines the way to combine the results of the different comparison operators of the previous level.

The difference in our setting is that the property operators do not refer to specific properties but to property clusters, as introduced in Section 2.1. Thus, when a rule is applied to a specific instance pair from the pool of unlabeled pairs, the property operator checks if both entities contain a property which is part of any property cluster. If this is the case, the property operator outputs a pair of values. Otherwise it outputs an empty set of values. The functionality of the other operators remains the same.

Figure 2 shows an example rule of our matching approach. In the illustrated example the property operator selects the clusters that represent the product brand and the product name properties. In the next level, the values of the property brand are lowercased and a specific comparison operator is defined. Based on the weights and thresholds, the comparison operators normalize the similarity score to the range [0,1]. Finally, the results of the comparison operators are aggregated into a single score using the average aggregator, which finally decides if the pair is a positive or a negative correspondence.

Figure 2: An example matching rule
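A linkage rule of this form can be sketched as a small tree of operator objects. The classes below are illustrative assumptions made for this sketch (they are not the actual GenLink/Silk implementation), and the difflib SequenceMatcher similarity merely stands in for a Levenshtein-style measure; the sketch only shows how a cluster-based property operator, a transformation, comparisons, and an average aggregation could fit together.

```python
from difflib import SequenceMatcher

class PropertyOperator:
    """Selects the values of any property belonging to a property cluster."""
    def __init__(self, cluster):
        self.cluster = cluster                    # e.g. {"brand", "producer"}
    def apply(self, entity):
        return [v for p, v in entity.items() if p in self.cluster]

class TransformationOperator:
    """Applies a transformation function to the selected values."""
    def __init__(self, source, func):
        self.source, self.func = source, func     # e.g. str.lower
    def apply(self, entity):
        return [self.func(v) for v in self.source.apply(entity)]

class ComparisonOperator:
    """Normalized string similarity in [0, 1]; empty value sets yield 0."""
    def __init__(self, left, right, weight=1.0):
        self.left, self.right, self.weight = left, right, weight
    def score(self, a, b):
        va, vb = self.left.apply(a), self.right.apply(b)
        if not va or not vb:
            return 0.0
        return max(SequenceMatcher(None, x, y).ratio() for x in va for y in vb)

class AverageAggregator:
    """Combines the comparison scores and decides on a correspondence."""
    def __init__(self, comparisons, threshold=0.75):
        self.comparisons, self.threshold = comparisons, threshold
    def is_match(self, a, b):
        total = sum(c.weight * c.score(a, b) for c in self.comparisons)
        weights = sum(c.weight for c in self.comparisons)
        return total / weights >= self.threshold

# Example rule comparing the "brand" and "name" property clusters.
brand = TransformationOperator(PropertyOperator({"brand", "producer"}), str.lower)
name = PropertyOperator({"name", "title", "product name"})
rule = AverageAggregator([ComparisonOperator(brand, brand),
                          ComparisonOperator(name, name)])
```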
2.4 Pair selection from instance pool

In this step a pair of instances is selected from the instance pool and presented to the human annotator, who provides a label as matching or non-matching. The goal of this step is to define a query strategy that selects the most informative pair to be labeled, thus minimizing the human involvement in the whole process. For this we build on the three query strategies employed by [6]:

1. Query by committee evaluates the unlabeled pairs against the current population of matching rules and selects the pair that causes the biggest disagreement.

2. Query by divergence selects one pair out of every group in the similarity space, thus considering pairs which convey different similarity patterns.

3. Query by restricted committee uses the query by committee strategy but only considers the disagreements between the top K matching rules of the current population.

Our strategy will build upon the existing ones and further clarify which other criteria should be considered in order to maximize the benefit of the selected pair. One possible direction of our query strategy could be to prefer pairs that contain many properties, so that information about a bigger variety of properties can be learned after a pair has been annotated. In addition, using a mixture of head and tail entities can prove effective in revealing information concerning the whole domain instead of focusing only on the frequent entities. Another possible component of our query strategy could be the characteristics of the properties of the selected pairs. Such characteristics might be the frequency and the size of the cluster to which they belong. The rationale behind using those features is that if the answer of the human annotator gives further insight into a centroid of a property cluster, then other properties may be indirectly affected, as described in more detail in Section 2.6, thus leading to a faster convergence of the algorithm.
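As an illustration of the first strategy, a query-by-committee step can be sketched as follows. The disagreement measure (vote entropy over the rules' binary match decisions) and the property-count bonus are assumptions we make for this sketch; the exact combination of selection criteria is left open in the research program.

```python
import math

def disagreement(rules, pair):
    """Vote entropy of the committee's binary decisions for one candidate pair."""
    votes = [rule.is_match(*pair) for rule in rules]
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0                        # the committee fully agrees
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_next_pair(rules, unlabeled_pool, property_bonus=0.05):
    """Pick the pair the current rule population disagrees on most,
    slightly preferring pairs whose entities expose many properties."""
    def informativeness(pair):
        n_properties = sum(len(entity) for entity in pair)
        return disagreement(rules, pair) + property_bonus * n_properties
    return max(unlabeled_pool, key=informativeness)
```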
2.5 Linkage rule population evolution

In this step, we exploit the information provided in the previous step by the human annotator as supervision to evolve the population of matching rules. The goal is to gradually generate more customized and accurate matching rules which evaluate correctly against the labeled set of pairs. To achieve this we use the GenLink algorithm [5].

GenLink evolves the population of linkage rules in two steps, selection and transformation. Firstly, the matching rules are evaluated on the basis of their fitness on the current set of labeled pairs and selected using tournament selection [7]. The selected rules are then transformed by applying certain crossover operations. More specifically, a crossover operator accepts two linkage rules and returns an updated linkage rule that is built by recombining parts of the parents. In our setting we use the specific set of crossover operations of GenLink: transformation, distance measure, threshold, combine operators, aggregation function, weight, and aggregation hierarchy.
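The selection and transformation cycle can be sketched as follows. The fitness function (F1 on the labeled pairs), the tournament size, and the single threshold crossover shown here are simplified assumptions of this sketch; GenLink itself [5] applies its richer set of crossover operators over the full rule tree.

```python
import copy
import random

def f1_fitness(rule, labeled_pairs):
    """F1 score of a rule on labeled pairs [(entity_a, entity_b, is_match), ...]."""
    tp = fp = fn = 0
    for a, b, is_match in labeled_pairs:
        predicted = rule.is_match(a, b)
        tp += predicted and is_match
        fp += predicted and not is_match
        fn += (not predicted) and is_match
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tournament_select(population, labeled_pairs, k=3):
    """Pick the fittest rule out of a random tournament of size k."""
    contenders = random.sample(population, k)
    return max(contenders, key=lambda r: f1_fitness(r, labeled_pairs))

def threshold_crossover(parent_a, parent_b):
    """One illustrative crossover: the child keeps parent_a's structure
    but averages the aggregation thresholds of both parents."""
    child = copy.deepcopy(parent_a)
    child.threshold = (parent_a.threshold + parent_b.threshold) / 2
    return child

def evolve(population, labeled_pairs, generations=10):
    for _ in range(generations):
        parents = [tournament_select(population, labeled_pairs) for _ in population]
        population = [threshold_crossover(random.choice(parents), random.choice(parents))
                      for _ in population]
    return sorted(population, key=lambda r: f1_fitness(r, labeled_pairs), reverse=True)
```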
2.6 Evolution of property clusters

In the final step of our approach, the evolution of the property clusters which preserve the schema-level correspondences takes place. The goal is to gradually improve the matching accuracy on the schema-level whilst exploiting the information on the instance-level.

To achieve this we select the top rules of the current linkage rule population based on their fitness score and apply them on the unlabeled candidates. Possible metrics for calculating the fitness score are F1 or the Matthews correlation coefficient in case the set of reference links is unbalanced. As a result, we retrieve instance-level correspondences which we use as input for duplicate-based schema matching with the goal to refine the property clusters.

We follow the approach of DUMAS (Duplicate-based Matching of Schemas) [1] for improving schema correspondences based on instance-level duplicates. In their work, the meta-level similarities are evolved gradually by calculating similarities on the instance-level using the SoftTFIDF measure and then solving the transformed problem as a bipartite weighted matching one. In every iteration, schema-level matches are either confirmed, doubted, or rejected.

After calculating the schema-level similarities, the property clusters of our setting are refined. We will investigate how the move of one element from a cluster may affect the rest of the related elements. The indirect effects may be a result of frequent co-occurrence or strong similarity. For example, consider the property cluster setting C1 = {A, B, C} and C2 = {D, E}. Assuming that property A matches properties D and E, we move A to cluster C2. If we additionally know that property B is very close in the similarity space to property A, then B follows A, which yields the final state of the property clusters as: C1 = {C} and C2 = {A, B, D, E}.
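To illustrate the duplicate-based refinement, the sketch below derives attribute-to-attribute similarities from a set of duplicate instance pairs and solves the resulting bipartite weighted matching problem with the Hungarian algorithm. The token-overlap similarity used here is a stand-in for the SoftTFIDF measure of DUMAS [1], and the confirm/doubt thresholds are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def value_similarity(v1, v2):
    """Simple token-overlap similarity; DUMAS uses SoftTFIDF instead."""
    t1, t2 = set(str(v1).lower().split()), set(str(v2).lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def refine_schema_matches(duplicates, attrs_a, attrs_b, confirm=0.6, doubt=0.3):
    """duplicates: list of (record_a, record_b) dicts known to describe the
    same real-world entity. Returns confirmed/doubted/rejected attribute pairs."""
    sim = np.zeros((len(attrs_a), len(attrs_b)))
    for rec_a, rec_b in duplicates:
        for i, a in enumerate(attrs_a):
            for j, b in enumerate(attrs_b):
                sim[i, j] += value_similarity(rec_a.get(a, ""), rec_b.get(b, ""))
    sim /= max(len(duplicates), 1)

    # Bipartite weighted matching: maximize the total attribute similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    results = {"confirmed": [], "doubted": [], "rejected": []}
    for i, j in zip(rows, cols):
        pair = (attrs_a[i], attrs_b[j], sim[i, j])
        if sim[i, j] >= confirm:
            results["confirmed"].append(pair)
        elif sim[i, j] >= doubt:
            results["doubted"].append(pair)
        else:
            results["rejected"].append(pair)
    return results
```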
2.7 Convergence and output

The process iterates by selecting a new pair from the unlabeled instance pair pool, evolving the linkage rules, and further refining the property clusters, as described in Sections 2.4, 2.5 and 2.6. The cycle of iterations terminates when either the evolved linkage rules achieve the desired fitness score or the maximum number of questions to the user has been reached.

The outputs of the proposed matching approach are instance and schema-level correspondences as well as generalized matching knowledge deriving from the linkage rules with the best fitness score. In the N:1 matching scenario the acquired knowledge can be used to annotate the knowledge base with rules concerning attribute relevance for matching, appropriate similarity functions, data normalization transformations, aggregation functions, and thresholds. In the N:M matching scenario we aim to exploit the resulting matching knowledge rules to annotate the implicit, mediated schema created through holistically matching entities.

3. EVALUATION

We plan to evaluate our approach on e-commerce data we have already created in the context of the Web Data Commons project. The dataset contains 13 million product-related web pages retrieved from the 32 most frequently visited websites. We have manually annotated 500 electronic product entities and created a product catalog with 160 products of the same electronic categories. The total number of correspondences contained in our gold standard is 75,000, of which 1,500 are positive [9].

Other possible use cases for evaluating our approach would be web tables, linked open data and schema.org annotations. Web Data Commons provides the WDC Web Tables Corpus (http://webdatacommons.org/webtables/), the largest non-commercial corpus of web tables, derived from 1.78 billion HTML pages with 90.2 million relational tables.

4. WORKPLAN

The outlined research program is currently in its first year. As an initial step towards accomplishing the goals defined in the context of this work, we will focus on the N:1 matching scenario by applying the steps presented in Section 2. Next, we will move on to the N:M matching scenario, for which special indexing and blocking techniques need to be defined in order to deal with the increased complexity. Finally, provided that our proposed approach meets our goals, we aim to enhance existing knowledge bases by annotating them with matching knowledge.

5. REFERENCES

[1] A. Bilke and F. Naumann. Schema matching using duplicates. In Proc. of the 21st Int. Conf. on Data Engineering, pages 69–80. IEEE, 2005.
[2] H.-H. Do and E. Rahm. COMA: a system for flexible combination of schema matching approaches. In Proc. of the 28th Int. Conf. on Very Large Data Bases, pages 610–621. VLDB Endowment, 2002.
[3] A. Halevy, A. Rajaraman, and J. Ordille. Data integration: The teenage years. In Proc. of the 32nd Int. Conf. on Very Large Data Bases, pages 9–16. VLDB Endowment, 2006.
[4] B. He and K. C.-C. Chang. A holistic paradigm for large scale schema matching. SIGMOD Record, 33(4):20, 2004.
[5] R. Isele. Learning Expressive Linkage Rules for Entity Matching using Genetic Programming. PhD thesis, University of Mannheim, 2013.
[6] R. Isele, A. Jentzsch, and C. Bizer. Active learning of expressive linkage rules for the web of data. In Proc. of the 12th Int. Conf. on Web Engineering, pages 411–418. Springer, 2012.
[7] J. R. Koza, M. A. Keane, M. J. Streeter, W. Mydlowec, J. Yu, and G. Lanza. Genetic Programming IV: Routine Human-Competitive Machine Intelligence, volume 5. Springer Science & Business Media, 2006.
[8] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, and C. Bizer. DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.
[9] P. Petrovski, A. Primpeli, R. Meusel, and C. Bizer. The WDC gold standards for product feature extraction and product matching. In Proc. of the 17th Int. Conf. on Electronic Commerce and Web Technologies, pages 73–86. Springer, Cham, 2016.
[10] E. Rahm. The case for holistic data integration. In Proc. of the 20th Advances in Databases and Information Systems Conf., pages 11–27. Springer, 2016.
[11] D. Ritze, O. Lehmberg, and C. Bizer. Matching HTML tables to DBpedia. In Proc. of the 5th Int. Conf. on Web Intelligence, Mining and Semantics, page 10. ACM, 2015.
[12] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data Curation at Scale: The Data Tamer System. In Proc. of the Conf. on Innovative Data Systems Research, 2013.