<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>PhD Workshop, August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Generalizing Matching Knowledge using Active Learning</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Anna Primpeli, supervised by Christian Bizer. Data and Web Science Group, University of Mannheim</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>28</volume>
      <issue>2017</issue>
      <abstract>
        <p>Research on integrating small numbers of datasets suggests the use of customized matching rules in order to adapt to the patterns in the data and achieve better results. The state-of-the-art work on matching large numbers of datasets exploits attribute co-occurrence as well as the similarity of values between multiple sources. We build upon these research directions in order to develop a method for generalizing matching knowledge using minimal human intervention. The central idea of our research program is that even across large numbers of datasets of a specific domain, patterns (matching knowledge) reoccur, and discovering them can facilitate the integration task. Our proposed approach plans to use and extend existing work of our group on schema and instance matching as well as on learning expressive rules with active learning. We plan to evaluate our approach on publicly available e-commerce data collected from the Web.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Data integration is a long-standing and very active
research topic dealing with overcoming the semantic and
syntactic heterogeneity of records located in the same or
separate data sources [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. While early work focused on
integrating data from small numbers of datasets in a corporate
context, there is an increasing body of research on
integrating large numbers of datasets in the Web context, where an
increased level of heterogeneity exists on both the instance
and schema-level.
      </p>
      <p>
        The matching approaches dealing with the task of
integrating large numbers of datasets can be categorized by the
addressed integration scenario. One scenario is the N:1,
in which multiple datasets are matched against a central
source; for instance, web tables against DBpedia [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or
product entities against a central catalog. The second scenario
is the N:M, in which datasets are matched with each other
without the help of an intermediate schema or a knowledge
base.
      </p>
      <p>
        Widely used matching systems such as COMA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
indicate the need for rich matcher and aggregator libraries in
order to resolve different types of heterogeneity and find
correspondences. A major finding of our group on integrating
small numbers of datasets is that specific matchers and
aggregators drawn from such rich libraries, as well as
property-specific data normalization techniques, can be combined into
high-quality, domain-specific matching rules [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Such rules
achieve a twofold goal: firstly, they give insight into the
task at hand by encoding matching knowledge, and secondly
they produce high-quality correspondences by adapting to
the nature of every matching scenario.
      </p>
      <p>
        Research on integrating large numbers of datasets has
shown that it is valuable to exploit attribute co-occurrence
in the schema corpus as well as the similarity of data values
not only between a central data source and a single external
data source, but also the similarities of data values between
multiple sources [
        <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
        ]. A weak spot of
these approaches is that the employed data
normalization techniques, similarity functions, and matching rules are
not customized for the different types of entities and thus
produce lower-quality results than customized techniques.
      </p>
      <p>The proposed research program builds upon this work and
aims, as its first goal, to investigate the extent to which it is
possible to generalize matching knowledge in order to
improve matching quality in large-scale N:1 and N:M
matching situations. The rationale for this approach is that typical
patterns reoccur among entities of a certain domain. An
example of such a pattern would be: "When matching entities
representing the product type phones, it is effective to
compare the attributes [brand, producer] using the following
preprocessing methods [tokenization, lowercasing], the following
similarity function [Levenshtein distance], and the following
threshold [0.75]".</p>
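      <p>To make such a pattern concrete, the following minimal sketch (hypothetical code, not part of any existing system) applies the example pattern to a pair of brand values, using a hand-rolled normalized Levenshtein similarity and the 0.75 threshold from the example:</p>

```python
# Illustrative sketch of the example pattern above; all function
# names are assumptions made for this sketch.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalize(value: str) -> str:
    # Preprocessing from the pattern: lowercasing + whitespace tokenization.
    return " ".join(value.lower().split())

def brand_similarity(v1: str, v2: str) -> float:
    a, b = normalize(v1), normalize(v2)
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def match_brands(v1: str, v2: str, threshold: float = 0.75) -> bool:
    return brand_similarity(v1, v2) >= threshold

print(match_brands("Samsung  Electronics", "samsung electronics"))  # True
print(match_brands("Samsung", "Nokia"))  # False
```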
      <p>
        In most matching situations, collaboration between
humans and machines is helpful in order to judge corner cases
or to bootstrap matching with a certain amount of
supervision [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Previous work of our group on guiding
collaboration between humans and machines on small-scale matching
scenarios has shown that using active learning can produce
high quality matching results even with a small amount of
human interaction [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: The employed active learning
approach was evaluated against six different datasets, reaching
between 0.8 and 0.98 F1 score after asking the human
annotator to label ten pairs of entities as positive or negative
matches. Building upon this work and extending certain
steps of the active learning process, we formulate the
second goal of the thesis: steering human attention in
large-scale matching situations with the aim to learn
relevant, high-quality matching knowledge.
      </p>
      <p>Summing up, in the context of this work, we aim to answer
the following research questions:</p>
      <p>Are domain-specific patterns transferable within
large-scale matching situations? How can we maximize the
benefit of human supervision with respect to the discovery
of those patterns? In order to answer these research questions,
we will experiment with the N:1 and N:M matching
scenarios using datasets created in the context of the Web Data
Commons project (http://webdatacommons.org/), such as web
tables, schema.org data, and product data.</p>
      <p>The remainder of this paper is organized as follows.
Section 2 describes the proposed matching approach. Section
3 presents the data which we plan to use for evaluation.
Finally, Section 4 gives a short overview of our workplan.
</p>
    </sec>
    <sec id="sec-2">
      <title>2. PROPOSED MATCHING APPROACH</title>
      <p>
        This section gives an overview of the current plan for
generalizing matching knowledge using active learning for the
N:1 matching scenario. The planned approach involves three
main steps. The first step is matching on the instance and
schema-level with the goal to generate a preliminary set of
schema correspondences, which forms the basis for later
refinement. Next, we build upon the concepts of the
ActiveGenLink algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], an active learning approach based
on genetic programming, which uses human supervision to
evolve matching rules applicable on the instance-level. The
final step is the refinement of the schema correspondences
based on the instance-level matching results. The last two
steps are iterated until the desired accuracy or the
maximum number of questions to the human annotator has
been reached.
      </p>
      <p>Figure 1 shows the steps of the matching process which
will be the main guideline of this research. Below, the
individual steps are further explained, and the related
state-of-the-art work upon which we build our proposed approach is
presented together with our suggested methodological
contributions.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Initial matching</title>
      <p>The first step of our algorithm involves matching on the
instance and schema-level with the goal to generate an
initial set of correspondences, which will be refined in the next
steps of the algorithm. To achieve this, we employ existing
techniques for large-scale matching.</p>
      <p>
        For the N:1 scenario we use the T2K matching algorithm
which involves cycles of instance and schema matching and
achieves an F1 score of 0.8 on the instance-level and 0.7
on the schema-level for the task of matching web tables to
DBpedia [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The resulting schema correspondences of this step are
grouped into clusters, with each cluster representing a
specific property such as product name or brand. The
motivation behind property clusters is that matching information
concerning one property can further affect the other
elements of the cluster, as will be explained later in Section
2.6.</p>
      <p>In this step, we plan to reuse existing work to form our
baseline for further improvement using active learning.
</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Construction of unlabeled instance pair pool</title>
      <p>The second step involves the construction of an unlabeled
pool of pairs of instances that are potential candidates for
labeling by the user. Considering the complexity involved
with matching large-scale data as well as our goal for
creating generalized matching rules, the unlabeled instance pair
pool should be constructed on the basis of two guidelines:
computational space reduction and preservation of matching
knowledge.</p>
      <p>To achieve computational space reduction we propose an
indexing and a blocking technique. We use three types of
information to build an index value out of every instance:
the instance name, the attribute labels and the attribute
values, using words or n-grams. After indexing, we keep
only those candidates for the unlabeled instance pair
pool that hold valuable information and eliminate the rest.
To decide this, different pair characteristics are
evaluated. Aside from the pair being a likely match, such
characteristics include whether the involved entities are described by
many frequent properties and whether they are head or tail entities,
based on how often they occur in the data corpus.</p>
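      <p>Assuming instances are simple dictionaries of a name plus attribute labels and values, the indexing idea above could be sketched as follows: an inverted index over word tokens (n-grams would work analogously) yields candidate pairs that share at least one index token. All names are illustrative assumptions:</p>

```python
# Hypothetical sketch of token-based indexing and candidate generation.
from collections import defaultdict

def index_tokens(instance: dict) -> set:
    # Index keys from the instance name, attribute labels, and values.
    tokens = set(instance["name"].lower().split())
    for label, value in instance["attributes"].items():
        tokens.update(label.lower().split())
        tokens.update(str(value).lower().split())
    return tokens

def build_candidate_pairs(central: list, external: list) -> set:
    """Generate candidate pairs that share at least one index token."""
    inverted = defaultdict(set)
    for i, inst in enumerate(external):
        for tok in index_tokens(inst):
            inverted[tok].add(i)
    pairs = set()
    for c, inst in enumerate(central):
        for tok in index_tokens(inst):
            for e in inverted[tok]:
                pairs.add((c, e))
    return pairs

central = [{"name": "iPhone 6", "attributes": {"brand": "Apple"}}]
external = [
    {"name": "Apple iPhone 6 16GB", "attributes": {"color": "gray"}},
    {"name": "Galaxy S7", "attributes": {"maker": "Samsung"}},
]
print(build_candidate_pairs(central, external))  # {(0, 0)}
```

<p>A real implementation would additionally filter the generated pairs by the informativeness criteria described above before adding them to the pool.</p>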
      <p>After defining such informativeness criteria, we linearly
scan over the entity names of the central source and
generate a pair if it is considered informative. Next, the
generated pair is added to the unlabeled instance pair pool.</p>
      <p>Thus, in this step we need to discover which
characteristics make a pair a good candidate for the instance pair pool
and how to combine them in order to draw the line between
informative and non-informative pairs.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Construction of initial population of matching rules</title>
      <p>
        As a next step, the initial linkage rules are created. We
build upon the linkage rule definition introduced in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A
linkage rule is defined as a combination of different operators
with a tree representation that gradually transforms with
the evolution of the GenLink algorithm, a variation of a
genetic algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A linkage rule contains the following
set of operators:
(a) Property Operator: Selects the values of the properties
of the entities that should be compared.
(b) Transformation Operator: Defines the transformation
functions for the selected property values. Such
transformations may be: case normalization, address
standardization, stop-word removal, and structural
transformations such as segmentation and extraction of
values from URIs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
(c) Comparison Operator: Defines the distance metric and
threshold that should be used to compare the selected
values.
(d) Aggregation Operator: Defines the way to combine
the results of the different comparison operators of the
previous level.
      </p>
      <p>The difference in our setting is that the property operators
do not refer to specific properties but to property clusters,
as introduced in Section 2.1. Thus, when a rule is applied to
a specific instance pair from the pool of unlabeled pairs, the
property operator checks if both entities contain a property
which is part of any property cluster. If this is the case,
the property operator outputs a pair of values. Otherwise
it outputs an empty set of values. The functionality of the
other operators remains the same.</p>
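      <p>To illustrate, the operator tree with a cluster-based property operator could be sketched as follows. Class names, fields, and the exact-match similarity are illustrative assumptions made for this sketch, not the GenLink implementation:</p>

```python
# Hypothetical sketch of a linkage rule with the four operator types.

class PropertyOperator:
    def __init__(self, cluster):
        self.cluster = cluster  # a property cluster, e.g. {"brand", "producer"}

    def apply(self, e1, e2):
        # Output a value pair if both entities carry some property
        # from the cluster; otherwise an empty result (None).
        v1 = next((e1[p] for p in self.cluster if p in e1), None)
        v2 = next((e2[p] for p in self.cluster if p in e2), None)
        if v1 is None or v2 is None:
            return None
        return (v1, v2)

class ComparisonOperator:
    def __init__(self, prop, transform, similarity, threshold, weight=1.0):
        self.prop = prop
        self.transform = transform    # transformation operator
        self.similarity = similarity  # similarity metric in [0,1]
        self.threshold = threshold
        self.weight = weight

    def apply(self, e1, e2):
        pair = self.prop.apply(e1, e2)
        if pair is None:
            return 0.0
        score = self.similarity(self.transform(pair[0]), self.transform(pair[1]))
        return score if score >= self.threshold else 0.0

class AverageAggregation:
    def __init__(self, comparisons, threshold):
        self.comparisons = comparisons
        self.threshold = threshold

    def is_match(self, e1, e2):
        # Weighted average of the comparison scores decides the pair.
        total = sum(c.apply(e1, e2) * c.weight for c in self.comparisons)
        weights = sum(c.weight for c in self.comparisons)
        return total / weights >= self.threshold

exact = lambda a, b: 1.0 if a == b else 0.0  # stand-in similarity
rule = AverageAggregation(
    [ComparisonOperator(PropertyOperator({"brand", "producer"}),
                        str.lower, exact, threshold=0.5)],
    threshold=0.5)
print(rule.is_match({"brand": "Apple"}, {"producer": "APPLE"}))  # True
```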
      <p>
        Figure 2 shows an example rule of our matching approach.
In the illustrated example the property operator selects the
clusters that represent the product brand and the product
name properties. In the next level, the values of the
property brand are lowercased and a specific comparison operator
is defined. Based on the weights and threshold, the
comparison operators normalize the similarity score to the range
[0,1]. Finally, the results of the comparison operators are
aggregated into a single score using the average aggregator,
which ultimately decides if the pair is a positive or a negative
correspondence.
      </p>
    </sec>
    <sec id="sec-6">
      <title>2.4 Pair selection from instance pool</title>
      <p>In this step a pair of instances is selected from the instance
pool and presented to the human annotator who provides a
label as matching or non-matching. The goal of this step is
to define a query strategy that selects the most informative
pair to be labeled, thus minimizing the human involvement
in the whole process.</p>
      <p>
        For this we build on the three query strategies employed
by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: 1. query by committee evaluates the unlabeled pairs
against the current population of matching rules and selects
the pair that causes the biggest disagreement, 2. query by
divergence selects one pair out of every group in the
similarity space, thus considering pairs which convey different
similarity patterns, and 3. query by restricted committee uses
the query by committee strategy but only considers the
disagreements between the top K optimal matching rules of the
current population.
      </p>
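      <p>As an illustration of the first strategy, a minimal query-by-committee sketch could look as follows, where each "rule" is reduced to a boolean predicate on a pair and disagreement is measured as the distance of the vote split from unanimity. All names are illustrative assumptions:</p>

```python
# Hypothetical sketch of query by committee over an unlabeled pair pool.

def committee_disagreement(votes: list) -> float:
    """0.0 = unanimous committee, 0.5 = maximal disagreement."""
    frac = sum(votes) / len(votes)
    return min(frac, 1.0 - frac)

def select_pair(unlabeled_pairs, committee):
    """Pick the unlabeled pair the rule committee disagrees on most."""
    return max(
        unlabeled_pairs,
        key=lambda pair: committee_disagreement(
            [rule(pair) for rule in committee]),
    )

# Toy committee of three "rules" (here: plain predicates on a pair).
committee = [
    lambda p: p[0] == p[1],
    lambda p: p[0].lower() == p[1].lower(),
    lambda p: p[0][:3] == p[1][:3],
]
pairs = [("Apple", "Apple"), ("Apple", "apple"), ("Nokia", "Sony")]
print(select_pair(pairs, committee))  # ('Apple', 'apple')
```

<p>The restricted-committee variant would simply filter the committee to the top-K rules by fitness before voting.</p>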
      <p>Our strategy will build upon the existing ones and further
clarify which other criteria should be considered in order to
maximize the benefit of the selected pair. One possible
direction of our query strategy could be to prefer pairs that
contain many properties, so that information about a bigger
variety of properties can be learned after a pair has been
annotated. In addition, using a mixture of head
and tail entities can prove effective in revealing information
concerning the whole domain rather than focusing only on the
frequent entities. Another possible component of our query
strategy could be the characteristics of the properties of the
selected pairs. Such characteristics might be frequency and
the size of the cluster to which they belong. The
rationale behind using those features is that if the answer of the
human annotator gives further insight into a centroid of a
property cluster, then other properties may be indirectly
affected, as described in more detail in Section 2.6, thus
leading to a faster convergence of the algorithm.</p>
    </sec>
    <sec id="sec-7">
      <title>2.5 Linkage rule population evolution</title>
      <p>
        In this step, we exploit the information provided in the
previous step by the human annotator as supervision to
evolve the population of matching rules. The goal is to
gradually generate more customized and accurate matching rules
which evaluate correctly against the labeled set of pairs. To
achieve this we use the GenLink algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        GenLink evolves the population of linkage rules in two
steps, selection and transformation. Firstly, the matching
rules are evaluated on the basis of their fitness on the current
set of labeled pairs and selected using tournament selection
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The selected rules are then transformed by applying
certain crossover operations. More specifically, a crossover
operator accepts two linkage rules and returns an updated
linkage rule that is built by recombining parts of the parents.
In our setting we use the specific set of crossover operations
of GenLink: transformation, distance measure, threshold,
combine operators, aggregation function, weight, and
aggregation hierarchy.
      </p>
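      <p>A toy sketch of this selection-and-transformation loop, with tournament selection and a placeholder crossover (the actual GenLink crossover operators are not reproduced here), might look as follows. All names and the toy fitness are illustrative assumptions:</p>

```python
# Hypothetical sketch of one generation of selection + crossover.
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Return one rule: the fittest of k randomly sampled competitors."""
    competitors = rng.sample(population, k)
    return max(competitors, key=fitness)

def evolve(population, fitness, crossover, rng=random):
    """Build the next generation by repeated selection and crossover."""
    next_gen = []
    for _ in range(len(population)):
        parent_a = tournament_select(population, fitness, rng=rng)
        parent_b = tournament_select(population, fitness, rng=rng)
        next_gen.append(crossover(parent_a, parent_b))
    return next_gen

# Toy example: "rules" are bare thresholds, fitness prefers values
# near 0.75, and crossover averages the two parents.
rules = [0.1, 0.3, 0.5, 0.7, 0.9]
fitness = lambda t: -abs(t - 0.75)
child_rules = evolve(rules, fitness, crossover=lambda a, b: (a + b) / 2)
print(len(child_rules))  # 5
```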
    </sec>
    <sec id="sec-8">
      <title>2.6 Evolution of property clusters</title>
      <p>In the final step of our approach, the evolution of the
property clusters which preserve the schema-level
correspondences takes place. The goal is to gradually improve the
matching accuracy on the schema-level whilst exploiting the
information on the instance-level.</p>
      <p>To achieve this, we select the top rules of the current
linkage rule population based on their fitness score and apply
them to the unlabeled candidates. Possible metrics for
calculating the fitness score are F1 or the Matthews correlation
coefficient, in the case that the set of reference links is
unbalanced. As a result, we retrieve instance-level
correspondences which we use as input for duplicate-based schema
matching with the goal to refine the property clusters.</p>
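      <p>Both candidate fitness metrics can be computed directly from confusion counts; the hypothetical counts below mirror the class imbalance of the gold standard described in Section 3 (roughly 1,500 positives among 75,000 pairs):</p>

```python
# Sketch of the two fitness metrics mentioned above; MCC is the more
# robust choice when the reference links are unbalanced.
import math

def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative unbalanced confusion counts.
tp, fn = 1200, 300
tn, fp = 73000, 500
print(round(f1_score(tp, fp, fn), 3))  # 0.75
print(round(mcc(tp, tn, fp, fn), 3))
```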
      <p>
        We follow the approach of DUMAS (Duplicate-based
Matching of Schemas) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for improving schema correspondences
based on instance-level duplicates. In their work, they evolve
the meta-level similarities gradually by calculating
similarities on the instance-level using the SoftTFIDF measure
and then solving the transformed problem as a bipartite
weighted matching one. In every iteration, schema-level
matches are either confirmed, doubted, or rejected.
      </p>
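      <p>The bipartite weighted matching step can be sketched as follows: given attribute-to-attribute similarities derived from duplicate rows (e.g. via SoftTFIDF), the best schema mapping is the assignment maximizing the total weight. The brute-force search below is only illustrative for tiny schemas; a real system would use an efficient assignment algorithm such as the Hungarian method. All names and numbers are assumptions:</p>

```python
# Hypothetical sketch of duplicate-based schema matching as
# maximum-weight bipartite matching.
from itertools import permutations

def best_schema_matching(sim):
    """sim[i][j]: similarity of source attribute i to target attribute j.
    Returns (total_weight, mapping i -> j) maximizing the total."""
    n = len(sim)
    best, best_map = float("-inf"), None
    for perm in permutations(range(n)):
        weight = sum(sim[i][perm[i]] for i in range(n))
        if weight > best:
            best, best_map = weight, dict(enumerate(perm))
    return best, best_map

# Toy similarities derived from duplicate rows:
# rows = source attrs (title, maker, cost), cols = target attrs
# (name, brand, price).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.0],
    [0.0, 0.1, 0.95],
]
weight, mapping = best_schema_matching(sim)
print(mapping)  # {0: 0, 1: 1, 2: 2}
```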
      <p>After calculating the schema-level similarities, the
property clusters of our setting are refined. We will investigate
how the move of one element from a cluster may affect the
rest of the related elements. The indirect effects may be a
result of frequent co-occurrence or strong similarity. For
example, consider the property cluster setting C1 = {A, B, C}
and C2 = {D, E}. Assuming that property A matches
properties D and E, we move A to cluster C2. If we
additionally know that property B is very close in the
similarity space to property A, then B follows A, thus
yielding the final state of the property clusters: C1 = {C} and
C2 = {A, B, D, E}.</p>
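      <p>That cluster-evolution example can be executed as a small sketch; the function name, similarity table, and threshold are illustrative assumptions:</p>

```python
# Hypothetical sketch of moving a property between clusters together
# with its close neighbors in the similarity space.

def move_with_neighbors(clusters, prop, target, similar_to, sim_threshold=0.8):
    """Move `prop` into `clusters[target]`, dragging along cluster-mates
    whose similarity to `prop` exceeds `sim_threshold`."""
    source = next(name for name, members in clusters.items() if prop in members)
    movers = {prop} | {p for p in clusters[source]
                       if similar_to.get((prop, p), 0.0) > sim_threshold}
    clusters[source] -= movers
    clusters[target] |= movers
    return clusters

clusters = {"C1": {"A", "B", "C"}, "C2": {"D", "E"}}
similarity = {("A", "B"): 0.9, ("A", "C"): 0.2}
# A matches D and E, so it moves to C2; B is close to A and follows.
move_with_neighbors(clusters, "A", "C2", similarity)
print(clusters)
```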
    </sec>
    <sec id="sec-9">
      <title>2.7 Convergence and output</title>
      <p>The process iterates by selecting a new pair from the
unlabeled instance pair pool, evolving the linkage rules, and
further refining the property clusters, as described in Sections
2.4, 2.5 and 2.6. The cycle of iterations terminates when
either the evolved linkage rules achieve the desired fitness
score or the maximum number of questions to the user has
been reached.</p>
      <p>The outputs of the proposed matching approach are
instance and schema-level correspondences as well as
generalized matching knowledge deriving from the linkage rules
with the best fitness score. In the N:1 matching scenario the
acquired knowledge can be used to annotate the knowledge
base with rules concerning attribute relevance for matching,
appropriate similarity functions, data normalization
transformations, aggregation functions, and thresholds. In the
N:M matching scenario we aim to exploit the resulting
matching knowledge rules to annotate the implicit, mediated schema
created through holistically matching entities.</p>
    </sec>
    <sec id="sec-10">
      <title>3. EVALUATION</title>
      <p>
        We plan to evaluate our approach on e-commerce data we
already created in the context of the Web Data Commons
project. The dataset contains 13 million product-related
web pages retrieved from the 32 most frequently visited
websites. We have manually annotated 500 electronic product
entities and created a product catalog with 160 products of
the same electronic categories. The total number of
correspondences contained in our gold standard is 75,000, of
which 1,500 are positive [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Other possible use cases for evaluating our approach would
be web tables, linked open data and schema.org
annotations. Web Data Commons provides the WDC Web Tables
Corpus, the largest non-commercial corpus of web tables
deriving from 1.78 billion HTML pages with 90.2 million
relational tables.</p>
    </sec>
    <sec id="sec-11">
      <title>4. WORKPLAN</title>
      <p>The outlined research program is currently in its first year.
As an initial step towards accomplishing the goals defined in
the context of this work, we will focus on the N:1 matching
scenario by applying the steps presented in Section 2. Next
we will move on to the N:M matching scenario, for which
special indexing and blocking techniques need to be defined in
order to deal with the increased complexity. Finally,
granting that our proposed approach meets our goals, we aim to
enhance existing knowledge bases by annotating them with
matching knowledge.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bilke</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Schema matching using duplicates</article-title>
          .
          <source>In Proc. of the 21st Int. Conf. on Data Engineering</source>
          , pages
          <fpage>69</fpage>
          –
          <lpage>80</lpage>
          . IEEE,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.-H.</given-names>
            <surname>Do</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>COMA: a system for flexible combination of schema matching approaches</article-title>
          .
          <source>In Proc. of the 28th Int. Conf. on Very Large Data Bases</source>
          , pages
          <fpage>610</fpage>
          –
          <lpage>621</lpage>
          . VLDB Endowment,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ordille</surname>
          </string-name>
          .
          <article-title>Data integration: The teenage years</article-title>
          .
          <source>In Proc. of the 32nd Int. Conf. on Very Large Data Bases</source>
          , pages
          <fpage>9</fpage>
          –
          <lpage>16</lpage>
          . VLDB Endowment,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. C.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>A holistic paradigm for large scale schema matching</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>33</volume>
          (
          <issue>4</issue>
          ):
          <fpage>20</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          .
          <article-title>Learning Expressive Linkage Rules for Entity Matching using Genetic Programming</article-title>
          .
          <source>PhD thesis</source>
          , University of Mannheim,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Active learning of expressive linkage rules for the web of data</article-title>
          .
          <source>In Proc. of the 12th Int. Conf. on Web Engineering</source>
          , pages
          <fpage>411</fpage>
          –
          <lpage>418</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Koza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Keane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Streeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Mydlowec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lanza</surname>
          </string-name>
          .
          <article-title>Genetic programming IV: Routine human-competitive machine intelligence</article-title>
          , volume
          <volume>5</volume>
          . Springer Science &amp; Business Media,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Van Kleef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Petrovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Primpeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>The WDC gold standards for product feature extraction and product matching</article-title>
          .
          <source>In Proc. of the 17th Int. Conf. on Electronic Commerce and Web Technologies</source>
          , pages
          <fpage>73</fpage>
          –
          <lpage>86</lpage>
          . Springer, Cham,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>The case for holistic data integration</article-title>
          .
          <source>In Proc. of the 20th Advances in Databases and Information Systems Conf.</source>
          , pages
          <fpage>11</fpage>
          –
          <lpage>27</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ritze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lehmberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Matching HTML tables to DBpedia</article-title>
          .
          <source>In Proc. of the 5th Int. Conf. on Web Intelligence, Mining and Semantics, page 10. ACM</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bruckner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Beskales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherniack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Zdonik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Data Curation at Scale: The Data Tamer System</article-title>
          .
          <source>In Proc. of the Conf. on Innovative Data Systems Research</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>