<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology Augmentation Through Matching with Web Tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oliver Lehmberg</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oktie Hassanzadeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research</institution>
          ,
          <addr-line>Yorktown Heights, New York</addr-line>
          ,
          <country country="US">U.S.A</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Mannheim</institution>
          ,
          <addr-line>B6 26, 68159 Mannheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we examine the possibility of using data collected from millions of tables on the Web to extend an ontology with new attributes. There are two major challenges in using such a large number of potentially noisy tables for this task. First, table columns need to be matched to create groups of columns that represent a new (or existing) attribute for a particular class in the ontology. Second, the column groups need to be ranked according to their "usefulness" in augmenting the ontology. We show several approaches to addressing these challenges and report on the results of our extensive experiments using Web Tables from the Web Data Commons corpus, using the DBpedia Ontology as our target ontology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Web is a vast source of valuable knowledge that can be used to extend or augment a given ontology. Knowledge extraction from the Web is a well-studied problem and an active area of research [
        <xref ref-type="bibr" rid="ref11 ref5 ref7">5, 7, 11</xref>
        ]. While such knowledge
is often extracted from textual (or semi-structured) contents using information
extraction and wrapper induction techniques, there have also been attempts in
using the structured data that is exposed on web pages as HTML tables [
        <xref ref-type="bibr" rid="ref13 ref14 ref5">5, 14,
13</xref>
        ].
      </p>
      <p>In this paper, we examine the possibility of using Web tables to augment a given ontology with a new set of attributes. Our hypothesis is that for each class in the given ontology, there are tables on the Web describing instances of the class and their various attributes. Further, not only are a large number of these attributes not yet captured in the ontology, but many are also not "useful", i.e., they may be irrelevant, inaccurate, or redundant.</p>
      <p>The approach we take in this work is a two-step process. First, tables are matched among each other and to the target ontology, to group columns that refer to the same attribute and align them with classes and existing attributes in the target ontology. The second step ranks the column groups based on a measure of the quality or usefulness of each group in augmenting the existing attributes in the target ontology. We perform an empirical study of the performance of this approach using Web Tables extracted from the Common Crawl (http://commoncrawl.org/) to augment the properties in the DBpedia ontology.</p>
      <sec id="sec-1-1">
        <title>Related Work</title>
        <p>
          The pioneering work using Web tables to discover new attributes was done by
Cafarella et al. in 2008 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. They create the so-called "attribute correlation statistics database (AcsDb)", which contains attribute counts based on the column headers in a large corpus of Web tables. From these counts, they estimate attribute occurrence probabilities. Applications of this database are a schema auto-complete function, synonym generation, and a tool enabling easy join-graph traversal for end users. We extend their approach by using clusters derived from matched columns, instead of column headers, as the basic unit for the statistics.
        </p>
        <p>
          Das Sarma et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] use label- and value-based schema matching methods to map Web tables to a given query table. For their "Schema Complement" operation, they consider all unmapped columns and rank them using the AcsDb and the entity coverage of the input table provided by the user. Their goal is to rank complete tables by their usefulness for the complement task. While they use a matching of Web table columns to the query table to rule out existing attributes, when it comes to finding new attributes, they fall back to the AcsDb approach. In contrast, we calculate attribute statistics based on matched column clusters.
        </p>
        <p>
          Lee et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] extract attributes from Probase [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], Web documents, a search
engine query log and DBpedia [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and estimate their typicality using frequencies
of class/attribute and instance/attribute occurrences. The extraction process is
completely label-based. For the merging of attributes, they use synonyms derived
from Wikipedia.
        </p>
        <p>
          Several systems have been proposed to extend a user-specified query table with content from a corpus of Web tables [
          <xref ref-type="bibr" rid="ref10 ref16 ref2">2, 16, 10</xref>
          ]. For the task of finding new attributes, the user can specify a keyword query which describes the new attribute, so no ranking is required. Alternatively, the InfoGather system [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and
the Mannheim SearchJoin Engine [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] can generate additional attributes based on a schema matching, but neither system ranks the resulting attributes by a relevance score.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>Our goal is to design an ontology augmentation solution that finds new attributes for an ontology using an external source of structured data, such as a corpus of web tables. The general idea followed by existing approaches is to count attribute occurrences in the table corpus and use them to estimate probabilities for encountering these attributes. Based on these probabilities, several different metrics can be defined to assess the value of adding an attribute to the ontology (see Section 3.3). These metrics measure how likely a new attribute is to co-occur with existing attributes (in the ontology), or how consistent the resulting schema would be if the new attribute were added to the ontology.</p>
      <p>Existing methods often consider the use case of extending a user-provided data source in an ad-hoc setting. In the case of extending an ontology, however, a variety of matching methods can be used to align the schema of the Web tables with the ontology. We propose to incorporate the mapping created by such methods by calculating all co-occurrence frequencies based on that mapping.</p>
      <p>
        The relevance of new attributes is measured based on how frequently they
co-occur with known attributes. Using exact string matching, these frequencies
can be obtained from a corpus of web tables by counting, as shown by the
approaches using the AcsDb [
        <xref ref-type="bibr" rid="ref3">3</xref>
          ]. When using fuzzy matching methods, however, the attributes must first be mapped among each other and then be partitioned according to their similarity values. This results in attribute clusters whose frequency can be determined by adding up the frequencies of all attributes in the cluster.
      </p>
      <sec id="sec-2-1">
        <title>Identifying Equal Attributes</title>
        <p>We compare several different approaches for defining attribute similarity, which are introduced in the following.</p>
        <p>
          Equality of Known Attributes. For the attributes that already exist in the
ontology, we create a mapping from the web tables to the ontology. For the
results in this paper, we use T2K Match [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to map Web Tables to the DBpedia ontology. This mapping defines which columns in the web tables correspond to which property in the ontology. By transitivity, all attributes which correspond to the same property are equal.
        </p>
        <p>
          Equality of Unknown Attributes. Based on the mapping produced by T2K Match, we can group the web tables by their class in the knowledge base (blocking step) and then match all unmapped attributes among each other. For attributes which do not exist in the ontology, we compare the following schema matching approaches:
Label-based Matching. Using the column headers of the web tables as features, we evaluate exact column header equality for finding matching columns. We refer to this approach as "Exact" in our experiments. We further evaluate "String Similarity", which calculates the similarity of the column headers using the Generalised Jaccard Similarity with Edit Distance as the inner similarity function.
Instance-based Equality. We further evaluate similarities which are created by the instance-based schema matcher of the Helix System [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We refer to the configuration using cosine similarity as "Helix Cosine" and to the configuration using containment similarity as "Helix Containment".
        </p>
        <p>Key/Value-based Equality. The key/value-based equality "Key/Value Matching" compares only those values of two columns that are mapped to the same instance in the ontology. This means that two columns are equivalent only if they contain similar values for the same instances. To obtain the similarity values, we use the value-based matching component of T2K Match.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Similarity Graph Partitioning</title>
        <p>
          After the calculation of the similarity values, we must decide which sets of columns refer to the same attribute. For attributes that already exist in the ontology, all columns with a similarity value above a threshold are considered to be equal to the existing attribute. For attributes which do not exist in the ontology, however, there is no such central attribute. We hence evaluate different partitioning strategies [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] for the graph that is defined by the similarities among the columns of the web tables.
        </p>
        <p>Connected Components. We calculate the connected components on the
similarity graph. Each resulting component is a cluster.</p>
        <p>Center. The Center algorithm uses the list of similarities, sorted in descending order, to create star-shaped clusters. The first time a node is encountered in the sorted list, it becomes the center of a cluster. Any other node appearing in a similarity pair with this node is then assigned to the cluster having the former node as its center.</p>
        <p>MergeCenter. The MergeCenter algorithm is similar to the Center algorithm, but has one extension: if a node is similar to the centers of two different clusters, these clusters are merged together.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Attribute Ranking</title>
        <p>
          After de ning attribute equality, we can now specify how the relevance of new
attributes is determined. All compared ranking methods are de ned based on
attribute cooccurrence probabilities, which we de ne according to Cafarella et
al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>Let a schema s ∈ S be a set of attributes, where S is the set of all schemata. A table has schema s if its columns correspond to the attributes of s (based on the schema mapping), regardless of their order and column headers. Let freq(s) be the number of tables with schema s in the corpus, and let schema_freq(a) be the number of tables that contain attribute a:</p>
        <p>schema_freq(a) = Σ_{s ∈ S : a ∈ s} freq(s)   (1)</p>
        <p>Then the probability of encountering a in any table of the corpus is</p>
        <p>p(a) = schema_freq(a) / Σ_{s ∈ S} freq(s)   (2)</p>
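<p>The counting scheme behind these probabilities can be sketched as follows, for a corpus represented as one attribute set per table; this is a hypothetical minimal illustration in which an "attribute" stands for a matched column cluster rather than a raw column header. The pair counts also supply the conditional and joint probabilities used below.</p>

```python
from collections import Counter
from itertools import combinations


def attribute_stats(schemas):
    """Build co-occurrence counts from a corpus given as a list of attribute
    sets (one set per table). Since each table has exactly one schema, the
    total table count equals the sum of freq(s) over all schemata."""
    schema_freq = Counter()  # schema_freq(a): number of tables containing a
    pair_freq = Counter()    # schema_freq(a1, a2): tables containing both
    for s in schemas:
        schema_freq.update(s)
        pair_freq.update(frozenset(p) for p in combinations(sorted(s), 2))
    return schema_freq, pair_freq, len(schemas)


def p_attr(a, schema_freq, total):
    """Probability of encountering attribute a in any table of the corpus."""
    return schema_freq[a] / total


def p_cond(a1, a2, schema_freq, pair_freq):
    """Conditional probability of a1 given a2: pair count over schema_freq(a2)."""
    return pair_freq[frozenset((a1, a2))] / schema_freq[a2]
```

<p>For instance, over three tables {name, population}, {name, area}, and {name, population, area}, "population" occurs in 2 of 3 tables, and co-occurs with "area" in 1 of the 2 tables that contain "area".</p>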
        <p>The number of tables that contain two attributes a1, a2 is defined analogously as schema_freq(a1, a2). The conditional probability of seeing attribute a1 given a2 is</p>
        <p>p(a1 | a2) = schema_freq(a1, a2) / schema_freq(a2)   (3)</p>
        <p>and the joint probability is</p>
        <p>p(a1, a2) = schema_freq(a1, a2) / Σ_{s ∈ S} freq(s)   (4)</p>
        <p>Attribute Ranking Methods. We now define the methods that are used to calculate a score for each attribute, which is then used to rank all unknown attributes. A higher score indicates a higher relevance of the attribute for the schema extension task.</p>
        <p>
          Conditional Probability based on Class. This measure captures how likely it is to encounter attribute a given the class C in the ontology [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. If each schema is mapped to a class C and schema_freq(a, C) is the number of tables mapped to C that contain a, we can define the conditional probability of encountering an attribute given the class as in Equation 5, where SC is the schema of class C. This measure only considers the class mapping of the web tables, irrespective of the presence of known attributes in the same web table.
        </p>
        <p>p(a | C) = schema_freq(a, C) / Σ_{a2 ∈ SC} schema_freq(a2, C)   (5)</p>
        <p>
          Schema Consistency. This measure reflects the likelihood of seeing a new attribute together with the existing attributes [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. It is based on the conditional probability derived from the co-occurrence statistics. It considers all known attributes which co-occur with the new attribute a, i.e., the more known attributes that co-occur, the higher the score.
        </p>
        <p>SchemaConsistency(a, s) = (1 / |s|) Σ_{a2 ∈ s} p(a | a2)   (6)</p>
        <p>
          Schema Coherency. Based on Point-wise Mutual Information (PMI), schema coherency is the average of the PMI scores of all possible attribute combinations [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The PMI score of two attributes is positive if the attributes are correlated,
zero if they are independent, and negative if they are negatively correlated.
        </p>
        <p>SchemaCoherency(a, s) = (1 / |s|) Σ_{a1 ∈ s} npmi(a1, a)   (7)</p>
        <p>where npmi(a1, a2) = log( p(a1, a2) / (p(a1) · p(a2)) ) / (−log p(a1, a2)) is the normalized PMI of two attributes.</p>
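<p>The three ranking scores can be sketched as follows, as a simplified illustration rather than the evaluated implementation; the probability lookups p_cond, p_attr, and p_joint are assumed callables backed by the corpus counts, and class_freq is an assumed per-class count dictionary.</p>

```python
from math import log


def conditional_by_class(a, class_freq):
    """Conditional probability by class: schema_freq(a, C) normalized over
    the counts of all attributes observed for class C."""
    return class_freq[a] / sum(class_freq.values())


def schema_consistency(a, known, p_cond):
    """Average conditional probability of a given each known attribute."""
    return sum(p_cond(a, a2) for a2 in known) / len(known)


def npmi(a1, a2, p_attr, p_joint):
    """Normalized pointwise mutual information in [-1, 1]: positive when the
    attributes are correlated, zero when independent, negative otherwise."""
    pj = p_joint(a1, a2)
    if pj == 0.0:
        return -1.0  # never co-occur
    if pj == 1.0:
        return 1.0   # always co-occur
    return log(pj / (p_attr(a1) * p_attr(a2))) / -log(pj)


def schema_coherency(a, known, p_attr, p_joint):
    """Average NPMI between a and each known attribute."""
    return sum(npmi(a1, a, p_attr, p_joint) for a1 in known) / len(known)
```

<p>Note that with p(a1) = p(a2) = 0.5 and p(a1, a2) = 0.25 the attributes are independent and the NPMI is zero, so such a pair contributes nothing to the coherency score.</p>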
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Experiments on T2D Gold Standard</title>
        <p>
          Our first set of experiments is performed on the T2D Gold Standard [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which
was originally developed to evaluate systems for the web table to knowledge base
matching task (using DBpedia as the knowledge base).
        </p>
        <p>Identifying Equal Attributes. We evaluate the different matching and partitioning approaches introduced in Section 3.1. The gold standard contains mappings from the web table columns to properties in the ontology. As we are interested in finding partitions of columns which represent the same attribute, we create one partition for each property in the ontology, containing all columns which are mapped to this property. We then apply the different methods to all columns of the web tables in the gold standard and measure the degree to which we can reconstruct these partitions.</p>
        <p>Figure 1 shows a comparison of the different partitioning approaches. The x-axis depicts the similarity threshold and the y-axis shows the resulting F1-score. We can see that the best performance is achieved with rather low thresholds and the Center algorithm.</p>
        <p>Figure 2 shows the quality of the clusterings using different matching approaches. Again, the x-axis depicts the similarity threshold and the y-axis shows the resulting F1-score. We see that the label-based matching with string similarity outperforms the instance-based approaches. The reason for the good performance of the label-based approach is that the web tables are grouped by the class in the ontology to which they are mapped, and hence column headers are in most cases not ambiguous. The rather poor performance of the instance-based approaches is explained by the fact that web tables usually have very few rows, so there might not be enough overlap among the columns from different tables.</p>
        <p>Combining the instance-based and label-based approaches into a hybrid matcher did not significantly improve the performance compared to the label-based approach. Closer inspection of the results showed that this is due to the gold standard used, which contains mostly tables with column labels of high quality.</p>
        <p>Attribute Ranking. For a subset of the T2D tables, those mapped to the country class, we manually label all columns as either "useful" or "not useful". In total, this subset contains 207 columns, of which 86 are annotated as "useful". We then evaluate the performance of the different ranking methods. Figure 3 shows the precision@K and recall@K achieved by the different ranking approaches. In addition to the ranking methods described in Section 3.3, we further evaluate each ranking method in a variant that is weighted by PageRank. The intuition is that web pages with a high PageRank likely contain useful content, and hence the web tables on these pages also contain relevant attributes. The PageRank values used are obtained from the publicly available Common Crawl WWW Ranking (http://wwwranking.webdatacommons.org/). For each partition of columns, we use the maximum PageRank of all source web pages and multiply it with the score that was calculated by the ranking method. Among the different ranking methods, schema consistency performs best, followed by schema coherency. The variations with PageRank</p>
        <sec id="sec-3-1-1">
          <p>perform worst, which might be caused by the rather small number of web sites
in the gold standard.</p>
          <p>Remove One Attribute Experiment. The assessment of the usefulness of an attribute can be subjective. Hence, we design another experiment, in which we remove one existing attribute from the ontology for several classes. As this attribute already existed, we can objectively say that it is useful. We then measure the quality of the first cluster that resembles this attribute and also the rank at which we find it in the output. We use the following classes and attributes in this experiment: Company (industry), Country (population), Film (year), Mountain (height), Plant (family), VideoGame (genre). The left chart in Figure 4 shows the average rank of the first attribute cluster which matches the removed attribute over all ranking methods, by matcher. The bar "No Matching" shows the result of using neither correspondences to the ontology nor any of the matching approaches, i.e., attributes are equal only if their column headers match exactly. The results without prior mapping knowledge show the importance of matching attributes before calculating the ranking functions. Without mapping knowledge, attribute frequencies are under-estimated, and the respective attribute is ranked too low. The right chart in Figure 4 shows the average rank over all matching methods, by ranking method. Again, the schema consistency ranking performs best, and the variations including PageRank consistently perform worse.</p>
          <p>We now repeat our experiments on the WDC Web Tables Corpus 2012 (http://webdatacommons.org/webtables/index.html), which contains 147 million relational web tables. To give an overall impression of the full corpus, Figure 5 shows the number of new columns and clusters that we can generate for selected classes. These numbers show the large number of potentially new attributes that can be found in the corpus.</p>
          <p>Attribute Ranking. As we have no gold standard for the full corpus, we manually annotate the top 15 ranked clusters for each ranking method for several classes as either "useful" or "not useful". Figure 5 shows the performance of each method, averaged over all classes, in terms of precision@15. The results show again that the schema coherency and consistency measures outperform the conditional measure. This indicates that attribute co-occurrence is a stronger signal than the pure frequency of attributes, even when conditioned on a class from the ontology.</p>
        </sec>
        <sec id="sec-3-1-2">
          <p>Remove One Attribute Experiment. Again, to have a more objective view of the results, we remove one attribute from DBpedia as before and find the top-ranked attribute cluster which matches the removed attribute. The left chart in Figure 6 shows the rank of these clusters by matcher and the right chart by ranking method. Concerning the matching approach, we now find that the label-based and key/value-based methods achieve comparable results. The difference from the experiment on the gold standard is that we take into account a much larger number of tables and hence have more variety and a more realistic sample of the data quality. If we compare both of the matching approaches to a baseline approach ("No Matching"), which does not use the prior knowledge of the mappings to the ontology, we can again see that the ranking results are worse. Looking at the different ranking methods, we see a result that differs from the previous results: the Conditional Probability ranking now performs best. A possible explanation is that the attributes that we removed are quite common; hence, many tables contain such attributes and the ranking by frequency is sufficient. Another interesting observation is that the PageRank now makes a difference. Although it is still worse than the variants without it, we can presume that a reasonable evaluation of a ranking method incorporating PageRank requires the use of a large corpus.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion &amp; Future Work</title>
      <p>In summary, the results of our experiments show that:
- It is feasible to use a large corpus of structured data from the Web to augment an ontology. In particular, we are able to augment a general-domain ontology such as DBpedia with millions of Web Tables extracted from the Web. We manually verified that a number of attributes ranked highly by our algorithms were strong candidates for augmenting the DBpedia ontology, and such augmentations would enable new applications of the ontology.
- Our results comparing different algorithms were mixed and without a clear winner across all the experiments. The size of the gold standard and the classes chosen for manual verification clearly affected the relative performance of the algorithms. This calls for larger benchmarks, more comprehensive evaluation, and hybrid/ensemble methods that effectively take advantage of the benefits of each of the algorithms.</p>
      <p>Future work also includes: 1) extending our framework to include more advanced matching techniques, particularly from recent work in ontology matching; 2) evaluation on other sources of structured data (e.g., open data portals such as data.gov) and other ontologies; 3) extending the augmentation to relations and classes of the ontology; and 4) using the same quality metrics for ontology augmentation from textual and semi-structured sources, with an evaluation of how well structured data on the Web can contribute to building and augmenting an ontology compared with textual and semi-structured sources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. Springer, 2007.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. Data integration for the relational web. Proceedings of the VLDB Endowment, 2(1):1090–1101, 2009.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. WebTables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1):538–549, 2008.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu. Finding related tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 817–828. ACM, 2012.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge Vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 601–610. ACM, 2014.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Jason Ellis, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, and Michael J. Ward. Exploring big data with Helix: Finding needles in a big haystack. ACM SIGMOD Record, 43(4):43–54, 2015.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Rahul Gupta, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu. Biperpedia: An ontology for search applications. Proceedings of the VLDB Endowment, 7(7):505–516, 2014.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Oktie Hassanzadeh, Fei Chiang, Hyun Chul Lee, and Renée J. Miller. Framework for evaluating clustering algorithms in duplicate detection. Proceedings of the VLDB Endowment, 2(1):1282–1293, 2009.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Taesung Lee, Zhongyuan Wang, Haixun Wang, and Seung-won Hwang. Attribute extraction and scoring: A probabilistic approach. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 194–205. IEEE, 2013.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Oliver</surname>
            <given-names>Lehmberg</given-names>
          </string-name>
          , Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>The Mannheim Search Join Engine</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Nickel</surname>
          </string-name>
          , Kevin Murphy, Volker Tresp, and
          <string-name>
            <given-names>Evgeniy</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          .
          <article-title>A review of relational machine learning for knowledge graphs</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>104</volume>
          (
          <issue>1</issue>
          ):
          <fpage>11</fpage>
          –
          <lpage>33</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Ritze</surname>
          </string-name>
          , Oliver Lehmberg, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Matching HTML tables to DBpedia</article-title>
          .
          <source>In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics</source>
          , page
          <fpage>10</fpage>
          . ACM
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Ritze</surname>
          </string-name>
          , Oliver Lehmberg, Yaser Oulabi, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Profiling the potential of web tables for augmenting cross-domain knowledge bases</article-title>
          .
          <source>In Proceedings of the 25th International Conference on World Wide Web</source>
          , pages
          <fpage>251</fpage>
          –
          <lpage>261</lpage>
          . International World Wide Web Conferences Steering Committee,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Petros</given-names>
            <surname>Venetis</surname>
          </string-name>
          , Alon Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and
          <string-name>
            <given-names>Chung</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Recovering semantics of tables on the web</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>4</volume>
          (
          <issue>9</issue>
          ):
          <fpage>528</fpage>
          –
          <lpage>538</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wentao</surname>
            <given-names>Wu</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Hongsong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haixun</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kenny Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Probase: A probabilistic taxonomy for text understanding</article-title>
          .
          <source>In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data</source>
          , pages
          <fpage>481</fpage>
          –
          <lpage>492</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Yakout</surname>
          </string-name>
          , Kris Ganjam, Kaushik Chakrabarti, and
          <string-name>
            <given-names>Surajit</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          .
          <article-title>InfoGather: entity augmentation and attribute discovery by holistic matching with web tables</article-title>
          .
          <source>In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data</source>
          , pages
          <fpage>97</fpage>
          –
          <lpage>108</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>