<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Understanding a Large Corpus of Web Tables Through Matching with Knowledge Bases - An Empirical Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oktie Hassanzadeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael J. Ward</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariano Rodriguez-Muro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kavitha Srinivas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM T.J. Watson Research Center, Yorktown Heights</institution>
          ,
          <addr-line>NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Extracting and analyzing the vast amount of structured tabular data available on the Web is a challenging task that has received significant attention in the past few years. In this paper, we present the results of our analysis of the contents of a large corpus of over 90 million Web Tables, through matching table contents with instances from a public cross-domain knowledge base such as DBpedia. The goal of this study is twofold. First, we examine how a large-scale matching of all table contents with a knowledge base can help us gain a better understanding of the corpus, beyond what we gain from simple statistical measures such as the distribution of table sizes and values. Second, we show how the results of our analysis are affected by the choice of ontology and knowledge base. The ontologies studied include the DBpedia Ontology, Schema.org, YAGO, Wikidata, and Freebase. Our results can provide a guideline for practitioners relying on these knowledge bases for data analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Web Tables</kwd>
        <kwd>Annotation</kwd>
        <kwd>Instance-Based Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The World Wide Web contains a large amount of structured data embedded
in HTML pages. A study by Cafarella et al. [
        <xref ref-type="bibr" rid="ref3">6</xref>
        ] over Google's index of English
documents found an estimated 154 million high-quality relational tables.
Subsequent studies show the value of web tables in various applications, ranging from
table search [
        <xref ref-type="bibr" rid="ref12">15</xref>
        ] and enhancing Web search [1, 3] to data discovery in
spreadsheet software [2, 3] to mining table contents to enhance open-domain
information extraction [
        <xref ref-type="bibr" rid="ref4">7</xref>
        ]. A major challenge in applications relying on Web Tables is the
lack of metadata, along with missing or ambiguous column headers. Therefore, a
content-based analysis needs to be performed to understand the contents of the
tables and their relevance to a particular application.
      </p>
      <p>
        Recently, a large corpus of web tables has been made publicly available as a
part of the Web Data Commons project [
        <xref ref-type="bibr" rid="ref9">12</xref>
        ]. As a part of the project
documentation [
        <xref ref-type="bibr" rid="ref10 ref11">13, 14</xref>
        ], detailed statistics about the corpus are provided, such as the distribution
of the number of columns and rows, headers, label values, and data types. In this
paper, our goal is to perform a semantic analysis of the contents of the tables,
to find similarly detailed statistics about the kinds of entity types found in this
corpus. We follow previous work on recovering semantics of web tables [
        <xref ref-type="bibr" rid="ref12">15</xref>
        ] and
column concept determination [
        <xref ref-type="bibr" rid="ref5">8</xref>
        ] and perform our analysis through matching
table contents with instances of large cross-domain knowledge bases.
      </p>
      <p>Shortly after we started our study, it became apparent that the results of our
analysis reflect not only the contents of the tables, but also the contents and
ontology structure of the knowledge base used. For example, using our approach for
tagging columns with entity types (RDF classes) in knowledge bases (details in
Section 2), we observe a very different distribution of tags in the output depending
on the knowledge base used. Figure 1 shows a "word cloud" visualization of
the most frequent entity types using four different ontologies. Using only
DBpedia ontology classes, the most dominant types of entities seem to be related to
people, places, and organizations. Using only YAGO classes, the most frequent
types are similar to those from the DBpedia ontology results, but with a more detailed
breakdown and additional types such as "Event" and "Organism" that do not
appear in the DBpedia results. Freebase results, on the other hand, are very
different, and clearly show a large number of music- and media-related contents in
Web tables. The figure looks completely different for Wikidata results, showing
"chemical compound" as a very frequent type, which is not observed among Freebase
or YAGO types. This shows the important role the choice of knowledge base and
ontology plays in semantic data analysis.</p>
      <p>
        In the following section, we briefly describe the matching framework used to produce
the results of our analysis. We then revisit some of the basic statistics provided
by the authors of the source data documentation [
        <xref ref-type="bibr" rid="ref11">14</xref>
        ], and then provide a detailed
analysis of the entity types found in the corpus using our matching framework.
We end the paper with a discussion of the results and a few interesting directions
for future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Matching Framework</title>
      <p>
        In this section, we briefly describe the framework used for matching table
contents with instances in public cross-domain knowledge bases. Although the
implementation of this framework required a significant amount of engineering work
to make it scale, the methods used at the core of the framework are not new
and have been explored in the past. In particular, our MapReduce-based overlap
analysis is similar to the work of Deng et al. [
        <xref ref-type="bibr" rid="ref5">8</xref>
        ], and based on an extension of
our previous work on large-scale instance-based matching of ontologies [
        <xref ref-type="bibr" rid="ref6">9</xref>
        ]. Here,
we only provide the big picture, to help in understanding the results of our analysis
described in the following sections.
      </p>
      <p>
        Figure 2 shows the overall matching framework. As input, we have the
whole corpus of Web Tables as structured CSV files on one hand, and a set of
RDF knowledge bases, which we refer to as reference knowledge, on the other
hand.
      </p>
      <p>[Figure 1: (a) DBpedia Ontology Tags; (b) DBpedia YAGO Classes Tags; (c) Freebase Type Tags; (d) Wikidata Type Tags]</p>
      <p>
        Based on our previous work on data virtualization [
        <xref ref-type="bibr" rid="ref7">10</xref>
        ], we turn both
the tabular data and RDF reference knowledge into a common format and
store them as key-values on HDFS. For tabular data, the key is a unique URI
identifying a column in an input table, and the values are the values that
appear in the column. For reference knowledge input, the key is the RDF class
URI, and the values are the labels of instances of that class. For example, the URI
rep://webtables/23793831_0_4377639018067805567.csv/company+name
represents the column with header company+name in file
23793831_0_4377639018067805567.csv in the input data. The values
associated with this URI are the contents of the column, which in this case
is a list of company names. An example of a reference knowledge URI is
http://dbpedia.org/ontology/Company, which is the DBpedia ontology class
representing entities of type "Company". The values associated with this URI
are labels of instances of this type, i.e., a list of all company names in
DBpedia.</p>
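      <p>As an illustration, the conversion of one table into this key-value format can be sketched as follows. This is a minimal sketch under assumed naming conventions, not the authors' production code; the rep:// URI scheme and the example file name follow the example above.</p>

```python
# Minimal sketch: turn one Web Table (as CSV text) into the key-value
# format described above. Illustrative only; the real system stores
# these key-values on HDFS and processes them with MapReduce.
import csv
import io

def table_to_kv(file_name, csv_text):
    """Return {column URI: list of cell values} for one table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    kv = {}
    for i, col_name in enumerate(header):
        # Key: a unique URI identifying the column in the input table.
        key = "rep://webtables/{}/{}".format(file_name, col_name.replace(" ", "+"))
        # Values: the cells that appear in that column.
        kv[key] = [r[i] for r in body if len(r) > i]
    return kv

kv = table_to_kv("example.csv", "company name,founded\nIBM,1911\nApple,1976\n")
print(kv["rep://webtables/example.csv/company+name"])  # ['IBM', 'Apple']
```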
      <p>
        The similarity analysis component of the framework takes in the key-values
and returns as output a table with each record associating a column in an input
table with a tag which is an RDF class in reference knowledge, along with a
confidence score. This tag indicates a similarity between the values associated with
the column and the class in the input key-values, based on a similarity measure.
Our system includes a large number of similarity functions, but for the purpose
of this study we focus on one similarity measure that is very simple yet accurate
and powerful for the annotation of tables. Similar to Deng et al. [
        <xref ref-type="bibr" rid="ref5">8</xref>
        ], we refer to this
similarity analysis as overlap analysis. The values are first normalized, i.e., values
are changed to lowercase and special characters are removed. We also filter out
numeric and date values to focus only on string-valued contents that are useful for
semantic annotation. The similarity score is then the size of the intersection of
the sets of filtered, normalized values associated with the input URIs. The goal of
overlap analysis is to find the number of values in a given column that represent
a given entity type (class) in the input reference knowledge. In the above
example, the column is tagged with class http://dbpedia.org/ontology/Company
with score 158, which indicates there are 158 values in the column that (after
normalization) appear as labels of entities of type Company in DBpedia.</p>
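      <p>The overlap analysis above can be sketched as follows. This is a minimal single-machine illustration with function names of our own choosing, not the authors' MapReduce implementation; date filtering is omitted for brevity.</p>

```python
# Minimal sketch of the overlap analysis: normalize values, drop
# numeric ones, and count the intersection of the resulting sets.

def normalize(value):
    """Lowercase and remove special characters, as described above."""
    cleaned = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def is_string_valued(value):
    """Keep only non-numeric values (date filtering omitted for brevity)."""
    try:
        float(value)
        return False
    except ValueError:
        return True

def overlap_score(column_values, class_labels):
    """Size of the intersection of the filtered, normalized value sets."""
    col = {normalize(v) for v in column_values if is_string_valued(v)}
    ref = {normalize(lbl) for lbl in class_labels}
    return len(col.intersection(ref))

# Toy example: a company-name column against labels of dbo:Company.
column = ["IBM", "Google Inc.", "Apple", "1911"]
labels = ["IBM", "Apple", "Microsoft", "Google Inc."]
print(overlap_score(column, labels))  # 3
```

A column is then tagged with every class whose score passes the chosen threshold, exactly as in the dbo:Company example with score 158.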
      <p>
        The reference knowledge in this study consists of three knowledge bases: (i)
DBpedia [
        <xref ref-type="bibr" rid="ref1">4</xref>
        ], (ii) Freebase [
        <xref ref-type="bibr" rid="ref2">5</xref>
        ], and (iii) Wikidata [
        <xref ref-type="bibr" rid="ref13 ref8">11, 16</xref>
        ]. We have downloaded
the latest versions of these sources (as of April 2015) as RDF N-Triples dumps.
DBpedia uses several vocabularies of entity types, including the DBpedia Ontology,
Schema.org, and YAGO. We report the results of our analysis separately for these
three type systems, which results in five different result sets for each analysis. We only
process the English portion of the knowledge bases and drop non-English labels.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Basic Statistics</title>
      <p>
        We first report some basic statistics from the Web Tables corpus we analyzed.
Note that for this study, our input is the English subset of the Web Tables
corpus [
        <xref ref-type="bibr" rid="ref11">14</xref>
        ], in the same way that we only keep the English portion of the reference
knowledge. Some of the statistics we report can be found in the data publisher's
documentation [
        <xref ref-type="bibr" rid="ref11">14</xref>
        ] as well, but there are small differences between the numbers,
which could be due to the different mechanisms used for processing the data. For
example, we had to drop a number of files due to parsing errors or decompression
failures, but that could be a result of differences between the libraries used.
      </p>
      <p>The number of tables we successfully processed is 91,357,232, which contain
a total of 320,327,999 columns (on average 3.5 columns per table). This results in
320,327,999 unique keys and 3,194,624,478 values (roughly 10 values per column)
in the key-value input of Web Tables after filtering numeric and non-string
values for similarity analysis. DBpedia contains 369,153 classes, of which
445 are from the DBpedia Ontology, 43 are from Schema.org, and 368,447 are from
YAGO. Freebase contains 15,576 classes, while Wikidata contains 10,250 classes.
The number of values after filtering numeric and non-string values is 67,390,185
in DBpedia and 169,783,412 in Freebase, while Wikidata has 2,349,915 values. These
numbers already show how different the knowledge bases are in terms of types
and values.</p>
      <p>We first examine the distribution of rows and columns. Figure 3(a) shows the
overall distribution of columns in the Web Tables. As can be seen, the majority
of the tables have fewer than 3 columns. There are 1,574,872 tables with only
1 column, and roughly 62 million out of the 91 million tables (68%) have 2 or
3 columns. Now let us consider only the tables that appear in the output of
our overlap analysis with the intersection threshold set to 20, i.e., tables in which at
least one column has more than 20 normalized values shared with one
of the reference knowledge sources. Such tables are much more likely to be of
higher quality and useful for further analysis and applications. Figure 3(b)
shows the distribution of columns over these tables. As the figure shows, there is
a smaller percentage of tables with a small number of columns, with roughly 59%
of the tables having 4 or more columns. This confirms the intuition that higher-quality
tables are more likely to have more columns, although there
is still a significant number of tables with meaningful contents that have 3 or
fewer columns.</p>
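      <p>The table selection described above can be sketched as follows, with hypothetical per-column scores; in our framework, the scores come from the overlap analysis of Section 2.</p>

```python
# Sketch of the "higher-quality" table selection: a table qualifies if
# at least one of its columns shares more than the threshold (here 20)
# normalized values with some class in the reference knowledge.

def high_quality_tables(column_scores, threshold=20):
    """column_scores: {table id: list of best per-column overlap scores}"""
    return sorted(t for t, scores in column_scores.items()
                  if any(s > threshold for s in scores))

# Toy input: three tables with hypothetical per-column scores.
scores = {"t1": [3, 25, 0], "t2": [5, 7], "t3": [21]}
print(high_quality_tables(scores))  # ['t1', 't3']
```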
      <p>Figure 3(c) shows the overall distribution of the number of rows in the whole
corpus. Again, the majority of the tables are smaller ones, with roughly 78
million tables having under 20 rows, and roughly 1.5 million tables containing
over 100 rows. Figure 3(d) shows the same statistics for tables with an overlap
score over 20. Here again, the distribution of rows is clearly different from the
whole corpus, with the majority of the tables having over 100 rows.</p>
      <p>Next, we study the distribution of overlap scores over all tables and across
different ontologies. Figure 4 shows the results (Schema.org results omitted for
brevity). In all cases, the majority of tags have a score under 40, but there is
a notable percentage of tags with a score above 100, i.e., the column has over
100 values shared with the set of labels of at least one type in the reference
knowledge, a clear indication that the table is describing entities of that type.
The main difference in the results across different ontologies is in the overall
number of tags. With an overlap score threshold of 20, there are 1,736,531 DBpedia
Ontology tags, 542,178 Schema.org tags, 6,319,559 YAGO tags, 26,620,967 Freebase tags, and
865,718 Wikidata tags. The number of tags is a function of the size of the
ontology in terms of the number of classes and instances, but also of the type system
of the ontology. For example, Schema.org has only 43 classes, resulting in an
average of over 12,600 columns per tag, whereas YAGO contains 368,447 classes,
which means an average of 17 columns per tag.</p>
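      <p>The columns-per-tag averages quoted above follow directly from the tag and class counts, as this back-of-the-envelope check shows:</p>

```python
# Back-of-the-envelope check of the columns-per-tag figures quoted
# above, using the tag counts and class counts reported in the text.
tags = {"Schema.org": 542178, "YAGO": 6319559}
classes = {"Schema.org": 43, "YAGO": 368447}

for onto in ("Schema.org", "YAGO"):
    print(onto, round(tags[onto] / classes[onto]))
# Schema.org: ~12,609 columns per tag; YAGO: ~17 columns per tag
```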
      <p>[Figure 3: (a) Distribution of Number of Columns per Table; (b) Distribution of Number of Columns for Tables with 20+ Overlap Score; (c) Distribution of Number of Rows per Table; (d) Distribution of Number of Rows per Table with 20+ Overlap Score]</p>
      <p>
We now present detailed statistics on the tags returned by the overlap similarity
analysis described in Section 2. Going back to Figure 1 in Section 1, the
word cloud figures are generated using the overlap analysis with the overlap
threshold set to 20. Each figure is then made using the top 150 most frequent
tags in the output of the overlap analysis, with the size of each tag reflecting
the number of columns annotated with that tag. The labels are derived
either from the last portion of the class URI (for DBpedia and Freebase)
or by looking up English class labels (for Wikidata). For example, "Person"
in Figure 1(a) represents class http://dbpedia.org/ontology/Person,
whereas music.recording in Figure 1(c) represents
http://rdf.freebase.com/ns/music.recording, and chemical compound in
Figure 1(d) represents https://www.wikidata.org/wiki/Q11173, which has
"chemical compound" as its English label.</p>
      <p>In addition to the word cloud figures, Tables 1 and 2 show the top 20 most
frequent tags in the output of our similarity analysis for each of the ontologies,
along with their frequency in the output. From these results, it is clear that no
single ontology on its own can provide the full picture of the types of entities
that can be found in Web tables. The DBpedia Ontology seems to have better
coverage for person- and place-related entities, whereas YAGO has a large number
of abstract classes among its most frequent output. Schema.org provides a
cleaner view over the small number of types it contains. Wikidata has a few
surprising types on its top list, such as "commune of France". This may be due
to a bias in the source in terms of the number of editors contributing to entities under
certain topics. Freebase clearly has better coverage for media-related types,
and the abundance of tags in the music and media domain shows both that
there is a large number of tables in the Web Tables corpus containing music- and
entertainment-related contents, and that Freebase has good coverage in this
domain.</p>
      <p>Finally, we examine a sample set of entity types across knowledge bases
and see how many times they appear as a column tag in the overlap
analysis output. Table 3 shows the results. Note that we have picked popular
entity types that can easily be mapped manually. For example, Person
entity type is represented by class http://dbpedia.org/ontology/Person
in DBpedia, http://dbpedia.org/class/yago/Person in
YAGO, http://schema.org/Person in Schema.org, and
http://rdf.freebase.com/ns/people.person in Freebase. The numbers
show a notable difference between the number of times these classes appear as
column tags, reflecting the different coverage of instances across the knowledge
bases. Freebase has by far the largest number of tags among these sample types.
Even for the three ontologies that share the same instance data from DBpedia,
there is a difference in the number of times they are used as a tag,
showing, for example, that there are instances in DBpedia that have type Person
in the DBpedia Ontology and Schema.org but not in YAGO, and, surprisingly, that there
are instances of the Country class in YAGO that are not marked as Country
in the DBpedia Ontology or Schema.org.</p>
      <p>Table 1. Top 20 most frequent tags for the DBpedia Ontology, YAGO, and Schema.org (Type, Frequency):
DBpedia Ontology YAGO Schema.org
Type Freq. Type Freq. Type Freq.
Agent 242,410 PhysicalEntity 364,830 Person 186,332
Person 186,332 Object 349,139 Place 120,361
Place 120,361 YagoLegalActorGeo 344,487 CreativeWork 53,959
PopulatedPlace 112,647 Whole 230,667 Organization 50,509
Athlete 85,427 YagoLegalActor 226,633 Country 37,221
Settlement 60,219 YagoPerm.LocatedEntity 198,304 MusicGroup 22,926
ChemicalSubstance 57,519 CausalAgent 186,789 EducationalOrg. 12,159
ChemicalCompound 57,227 LivingThing 182,570 City 10,743
Work 53,959 Organism 182,569 CollegeOrUniversity 10,598
Organisation 50,509 Person 175,501 Movie 10,243
OfficeHolder 40,198 Abstraction 145,407 SportsTeam 9,594
Politician 39,121 LivingPeople 136,955 MusicAlbum 4,786
Country 37,221 YagoGeoEntity 120,433 Book 2,103
BaseballPlayer 30,301 Location 109,739 School 1,181
MotorsportRacer 26,293 Region 106,200 MusicRecording 1,166
RacingDriver 25,135 District 95,294 Product 1,130
Congressman 24,143 AdministrativeDistrict 92,808 TelevisionStation 1,037
MusicalWork 17,881 Group 85,668 StadiumOrArena 918
NascarDriver 16,766 Contestant 60,177 AdministrativeArea 896
Senator 15,087 Player 56,373 RadioStation 815</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion &amp; Future Directions</title>
      <p>In this paper, we presented the results of our study on understanding a large
corpus of web tables through matching with public cross-domain knowledge bases.
We focused on only one mechanism for understanding the corpus of tables,
namely, tagging columns with entity types (classes) in knowledge bases. We
believe that our study with its strict focus can provide new insights into the use
of public cross-domain knowledge bases for similar analytics tasks. Our results
clearly show the differences in size and coverage of domains among public cross-domain
knowledge bases, and how they can affect the results of a large-scale analysis.
Our results also show several issues in the Web Data Commons Web Tables
corpus, such as the relatively large number of tables that contain very little or no
meaningful content.</p>
      <p>
        Our immediate next step is to expand this study to include other
similarity measures and large-scale instance-matching techniques [
        <xref ref-type="bibr" rid="ref6">9</xref>
        ]. Another
interesting direction for future work is the use of domain-specific knowledge
bases to study the coverage of a particular domain in the corpus of Web Tables. For
example, biomedical ontologies could be used in matching to discover healthcare-related
structured data on the Web.
      </p>
      <p>The results reported in this paper may change as the reference knowledge
sources or the corpus of tables are updated. Therefore, our plan is to maintain a
website containing our latest results, along with the output of our analysis, which
can be used to build various search and discovery applications over the Web.
For the latest results, refer to our project page: http://purl.org/net/webtables.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-1">
        <mixed-citation>1. Google Web Tables. http://research.google.com/tables. [Online; accessed 29-04-2015].</mixed-citation>
      </ref>
      <ref id="ref-2">
        <mixed-citation>2. Microsoft Excel Power Query. http://office.microsoft.com/powerbi. [Online; accessed 29-04-2015].</mixed-citation>
      </ref>
      <ref id="ref-3">
        <mixed-citation>3. S. Balakrishnan, A. Y. Halevy, B. Harb, H. Lee, J. Madhavan, A. Rostamizadeh, W. Shen, K. Wilder, F. Wu, and C. Yu. Applying WebTables in Practice. In CIDR, 2015.</mixed-citation>
      </ref>
      <ref id="ref1">
        <mixed-citation>4. C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - A Crystallization Point for the Web of Data. JWS, 7(3):154-165, 2009.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>5. K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247-1250, 2008.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>6. M. J. Cafarella, A. Y. Halevy, D. Zhe Wang, E. Wu, and Y. Zhang. WebTables: Exploring the Power of Tables on the Web. PVLDB, 1(1):538-549, 2008.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>7. B. B. Dalvi, W. W. Cohen, and J. Callan. WebSets: extracting sets of entities from the web using unsupervised information extraction. In WSDM, pages 243-252, 2012.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>8. D. Deng, Y. Jiang, G. Li, J. Li, and C. Yu. Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases. PVLDB, 6(13):1606-1617, 2013.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>9. S. Duan, A. Fokoue, O. Hassanzadeh, A. Kementsietsidis, K. Srinivas, and M. J. Ward. Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing. In ISWC, pages 49-64, 2012.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>10. J. B. Ellis, A. Fokoue, O. Hassanzadeh, A. Kementsietsidis, K. Srinivas, and M. J. Ward. Exploring Big Data with Helix: Finding Needles in a Big Haystack. SIGMOD Record, 43(4):43-54, 2014.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>11. F. Erxleben, M. Gunther, M. Krotzsch, J. Mendez, and D. Vrandecic. Introducing Wikidata to the Linked Data Web. In ISWC, pages 50-65, 2014.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>12. H. Muhleisen and C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora. 2012.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>13. P. Ristoski, O. Lehmberg, R. Meusel, C. Bizer, A. Diete, N. Heist, S. Krstanovic, and T. A. Knller. Web Data Commons - Web Tables. http://webdatacommons.org/webtables. [Online; accessed 29-04-2015].</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>14. P. Ristoski, O. Lehmberg, H. Paulheim, and C. Bizer. Web Data Commons - English Subset of the Web Tables Corpus. http://webdatacommons.org/webtables/englishTables.html. [Online; accessed 29-04-2015].</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>15. P. Venetis, A. Y. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering Semantics of Tables on the Web. PVLDB, 4(9):528-538, 2011.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>16. D. Vrandecic and M. Krotzsch. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78-85, 2014.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>