<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Creating Large-scale Training and Test Corpora for Extracting Structured Data from the Web</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>Data and Web Science Group, University of Mannheim, Germany</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>For making the web of linked data grow, information extraction methods are a good alternative to manual dataset curation, since there is an abundance of semi-structured and unstructured information which can be harvested that way. At the same time, existing structured datasets can be used for training and evaluating such information extraction systems. In this paper, we introduce a method for creating training and test corpora from websites annotated with structured data. Using different classes in schema.org and websites annotated with Microdata, we show how training and test data can be curated at large scale and across various domains. Furthermore, we discuss how negative examples can be generated, as well as open challenges and future directions for this kind of training data curation.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Extraction</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Web Data Commons</kwd>
        <kwd>Microdata</kwd>
        <kwd>schema.org</kwd>
        <kwd>Bootstrapping the Web of Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The web of linked data is constantly growing, from a small number of hand-curated
datasets to around 1,000 datasets [
        <xref ref-type="bibr" rid="ref1 ref11">1, 11</xref>
        ], many of which are created using heuristics
and/or crowdsourcing. Since manual creation of datasets has its inherent scalability
limitations, methods that automatically populate the web of linked data are a suitable
means for its future growth.
      </p>
      <p>
        Different methods for automatic population have been proposed. Open information
extraction methods are unconstrained in the data they try to create, i.e., they do not use
any predefined schema [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In contrast, supervised methods have been proposed that are
trained using existing LOD datasets and applied to extract new facts, either by using the
dataset as a training set for the extraction [
        <xref ref-type="bibr" rid="ref13 ref2">2, 13</xref>
        ], or by performing open information
extraction first, and mapping the extracted facts to a given schema or ontology [
        <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
        ]
afterwards. In this paper, we discuss the creation of large-scale training and evaluation
data sets for such supervised information extraction methods.
In recent years, more and more websites have started making use of markup languages such as
Microdata, RDFa, or Microformats to annotate information on their pages. In 2014, over
17.3% of popular websites made use of at least one of those three markup languages,
with schema.org and Microdata being among the most widely deployed standards [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Tools like Any23<sup>1</sup> are capable of extracting such annotated information from those web
pages and returning it as RDF triples.
      </p>
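      <p>To illustrate the kind of extraction such tools perform, the following sketch collects Microdata annotations from an HTML page and turns them into simple triples. This is a simplified illustration rather than Any23's actual implementation: the class name and the flat triple representation are our own, and a real extractor additionally handles nested items, itemref, itemid, and the per-tag value rules of the Microdata specification.</p>

```python
from html.parser import HTMLParser

class MicrodataTripleExtractor(HTMLParser):
    """Simplified sketch: collect itemtype and itemprop values from
    Microdata-annotated HTML as (subject, predicate, object) triples."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.item_count = 0
        self.current_item = None   # blank node id of the current itemscope
        self.pending_prop = None   # itemprop whose value is the tag's text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_count += 1
            self.current_item = "_:item%d" % self.item_count
            if "itemtype" in attrs:
                self.triples.append(
                    (self.current_item, "rdf:type", attrs["itemtype"]))
        if "itemprop" in attrs and self.current_item:
            if tag == "a" and "href" in attrs:
                # link-valued properties take the href as object
                self.triples.append(
                    (self.current_item, attrs["itemprop"], attrs["href"]))
            else:
                self.pending_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self.pending_prop and data.strip():
            self.triples.append(
                (self.current_item, self.pending_prop, data.strip()))
            self.pending_prop = None

extractor = MicrodataTripleExtractor()
extractor.feed('<div itemscope itemtype="http://schema.org/Person">'
               '<span itemprop="name">Alice</span></div>')
# extractor.triples now holds a type triple and a name triple
```

      <p>Feeding a page of a Person-annotated website through such a parser yields the blank node, its rdf:type, and one triple per annotated property, which is exactly the pairing of page and triples exploited in the remainder of this paper.</p>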
      <p>
        One of the largest publicly available collections of such triples extracted from HTML
pages is provided by the Web Data Commons project.<sup>2</sup> The triples were extracted by
the project using Any23 and Web crawls curated by the Common Crawl Foundation,<sup>3</sup>
which maintains one of the largest publicly available Web crawl corpora. So far, the
project offers four different datasets, gathered from crawls from 2010, 2012, 2013, and
2014, altogether including over 50 billion triples. The latest dataset, comprising 20
billion triples, which were extracted from over half a billion HTML pages, contains large
quantities of product, review, address, blog post, people, organization, event, and
cooking recipe data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The largest fraction of structured data, i.e., 58% of all triples and
54% of all entities, uses the same schema, i.e., schema.org<sup>4</sup>, and the HTML Microdata
standard<sup>5</sup> for annotating data. At the same time, being promoted by major search
engines, this format is the one whose deployment is growing the most rapidly [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8–10</xref>
        ].
      </p>
      <p>Since both the original web page and the extracted RDF triples are publicly
available, those pairs (i.e., a web page and the corresponding set of triples) can serve as
training and test data for a supervised information extraction system.</p>
      <p>As the ultimate goal of an information extraction system would be to extract such
data from web pages without markup, the test set should consist of non-markup pages.
However, for such pages, it would be very time-consuming to curate a reasonably sized
gold standard. As an alternative, we use the original pages from the Common Crawl
and remove the markup. This removal is done by erasing all Microdata attributes found
in the HTML code.</p>
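      <p>The removal step can be sketched as follows. The function name and regular expression are our own simplification; a production version would operate on a proper HTML parse tree to avoid touching attribute-like strings that happen to appear in text content.</p>

```python
import re

# Matches the Microdata attributes itemscope, itemtype, itemprop,
# itemid, and itemref, with or without an attribute value.
MICRODATA_ATTR = re.compile(
    r'\s+item(?:scope|type|prop|id|ref)'
    r'(?:\s*=\s*"[^"]*"|\s*=\s*\'[^\']*\'|\s*=\s*[^\s>]+)?',
    re.IGNORECASE)

def strip_microdata(html: str) -> str:
    """Erase all Microdata attributes from the HTML source,
    leaving the visible page content untouched."""
    return MICRODATA_ATTR.sub("", html)

page = ('<div itemscope itemtype="http://schema.org/Recipe">'
        '<h1 itemprop="name">Pancakes</h1></div>')
strip_microdata(page)  # '<div><h1>Pancakes</h1></div>'
```

      <p>After this transformation, the page renders identically in a browser, but an extraction system can no longer rely on the markup and has to recover the annotated information from the content itself.</p>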
      <p>In order to train and evaluate high precision extraction frameworks, negative
examples are also useful, i.e., examples for pages that do not contain any information for a
given class (e.g., person data). While it is hard to obtain negative examples without
human inspection, we propose the use of an approximate approach here: given that a
page is already annotated with Microdata and schema.org, we assume that the
website creator has annotated all information which can be potentially annotated with the
respective method. Thus, if a web page which contains Microdata does not contain
annotations for a specific class, we assume that the page does not contain any information
about instances of that class.</p>
      <p>Figure 1 summarizes the creation of the data sets and the evaluation process.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset Description</title>
      <p>The datasets that we created for evaluation focus on five different classes in schema.org.
The classes were chosen in a way such that (a) a good variety of domains is covered, and
(b) the class is used by many different unique hosts. The latter is important, since for
classes only deployed on a few different domains, which are potentially template-driven
web sites, there is a danger of overfitting to those templates.</p>
      <p>For each class, we provide a training dataset with a minimum of 7,000 and a maximum of
20,000 instances, and a test dataset with a minimum of 1,900 and a maximum of 4,700 instances.<sup>6</sup>
Those can be used to set up systematic evaluations.<sup>7</sup></p>
      <p>In addition to the five class-specific datasets, we propose to evaluate approaches
also on a mixed dataset, which contains instances from multiple classes. Table 1 shows
some basic statistics about the datasets created.</p>
      <p>[Figure 1: Overview of the dataset creation and evaluation process. Web pages with
Microdata (the input, from Web Data Commons) are split into a training and a test dataset.
For each split, Microdata extraction yields the extracted statements, while markup removal
yields plain HTML pages. An extraction system, developed by challenge participants, is
trained on the plain HTML and extracted statements of the training split and executed on
the plain HTML of the test split; the statements submitted by the participants are then
evaluated by the challenge organizers against the extracted statements of the test split.]</p>
      <p>
        <sup>1</sup> https://code.google.com/p/any23/
        <sup>2</sup> http://webdatacommons.org/structureddata
        <sup>3</sup> http://commoncrawl.org/
        <sup>4</sup> http://schema.org/
        <sup>5</sup> http://www.w3.org/TR/microdata/
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Metrics and Baselines</title>
      <p>For evaluating information extraction systems that use the methodology described above
in order to train models for information extraction, we propose to evaluate them using
the originally extracted triples, with recall, precision, and F-measure as performance
metrics. For obtaining stable results, the use of cross validation is advised.</p>
      <p>
        <sup>6</sup> Note that for each page, there is exactly one root entity of the respective class, e.g., MusicRecording. The other entities
are connected to the root entity, e.g., the artist and the record company of that recording.
        <sup>7</sup> http://oak.dcs.shef.ac.uk/ld4ie2015/LD4IE2015/IE_challenge.html
      </p>
      <p>The baseline for a class-specific extractor is to create a single blank node of the
given schema.org class for each web page. This results in extractors of high precision
(as the information is always correct) and low recall (since no further information is
extracted). Such a system can be seen as a minimal baseline. Table 2 depicts the results
of that baseline on the datasets discussed above.</p>
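      <p>The proposed metrics and the blank-node baseline can be sketched as follows. This is a simplified illustration with names of our own choosing; in practice, blank node identifiers have to be aligned between the extracted and the gold standard triples before the sets are compared.</p>

```python
def evaluate(extracted, gold):
    """Precision, recall, and F-measure for two sets of (s, p, o) triples."""
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Gold standard for one page with a single Recipe root entity:
gold = {("_:e1", "rdf:type", "http://schema.org/Recipe"),
        ("_:e1", "name", "Pancakes"),
        ("_:e1", "recipeYield", "4 servings")}

# The minimal baseline emits exactly one typed blank node per page:
baseline = {("_:e1", "rdf:type", "http://schema.org/Recipe")}

p, r, f = evaluate(baseline, gold)  # precision 1.0, recall 1/3
```

      <p>As the example shows, the baseline's precision is perfect, since the single type statement is always correct, while its recall shrinks with the number of gold standard triples per page.</p>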
      <p>
        For running challenges, such as the Linked Data for Information Extraction
challenges [
        <xref ref-type="bibr" rid="ref7">7</xref>
], it is easily possible to create additional holdout sets, for which only the
transformed web pages are given to the participants, while the corresponding original
pages and triples are kept secret. This allows participants to submit the triples they
found and enables a comparison of different systems.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we have shown that it is possible to create large-scale training and
evaluation data sets, which allow for benchmarking supervised information extraction
systems. Using Microdata annotations with schema.org, we have discussed the creation of
a corpus of training and test sets from various domains, ranging from recipes to sports
events and music recordings. We have also discussed how to address the problem of
generating negative examples.</p>
      <p>
        While the corpus used in this paper focuses on schema.org and Microdata, similar
datasets can be created when exploiting other markup languages, such as Microformats<sup>8</sup>
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or RDFa<sup>9</sup>. Also, as existing crawl corpora might have limitations in terms of
coverage, focused crawling for specific formats, vocabularies and classes can be applied to
gather a sufficient data corpus for supervised learning as proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In the future, it will be interesting to see how existing information extraction
systems perform given these datasets, as well as which new information extraction systems
will be developed for bootstrapping the Web of Data.</p>
      <p>
        <sup>8</sup> http://microformats.org/
        <sup>9</sup> http://www.w3.org/TR/xhtml-rdfa/
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tom</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Ciravegna</surname>
          </string-name>
          , Anna Lisa Gentile, and Ziqi Zhang. LODIE:
          <article-title>linked open data for web-scale information extraction</article-title>
          .
          <source>In Proceedings of the Workshop on Semantic Web and Information Extraction</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anthony</given-names>
            <surname>Fader</surname>
          </string-name>
          , Janara Christensen, Stephen Soderland, and
          <string-name>
            <surname>Mausam</surname>
            <given-names>Mausam</given-names>
          </string-name>
          .
          <article-title>Open information extraction: The second generation</article-title>
          .
          <source>In IJCAI</source>
          , volume
          <volume>11</volume>
          , pages
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Antonis</given-names>
            <surname>Koukourikos</surname>
          </string-name>
          , Vangelis Karkaletsis, and
          <string-name>
            <given-names>George A.</given-names>
            <surname>Vouros</surname>
          </string-name>
          .
          <article-title>Towards enriching linked open data via open information extraction</article-title>
          .
          <source>In Workshop on Knowledge Discovery and Data Mining meets Linked Open Data (KnowLOD)</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>42</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Robert</given-names>
            <surname>Meusel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>A web-scale study of the adoption and evolution of the schema.org vocabulary over time</article-title>
          .
          <source>In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, page 15. ACM</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Robert</given-names>
            <surname>Meusel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Mika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Roi</given-names>
            <surname>Blanco</surname>
          </string-name>
          .
          <article-title>Focused crawling for structured data</article-title>
          .
          <source>In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management</source>
          , pages
          <fpage>1039</fpage>
          -
          <lpage>1048</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Robert</given-names>
            <surname>Meusel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Linked Data for Information Extraction Challenge 2014: Tasks and Results</article-title>
          .
          <source>In Linked Data for Information Extraction</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Robert</given-names>
            <surname>Meusel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Heuristics for fixing common errors in deployed schema.org microdata</article-title>
          .
          <source>In The Semantic Web. Latest Advances and New Domains</source>
          , pages
          <fpage>152</fpage>
          -
          <lpage>168</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Robert</given-names>
            <surname>Meusel</surname>
          </string-name>
          , Petar Petrovski, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>The WebDataCommons Microdata, RDFa and Microformat Dataset Series</article-title>
          .
          <source>In 13th Int. Semantic Web Conference (ISWC14)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>What the adoption of schema.org tells about linked open data</article-title>
          .
          <source>In 2nd International Workshop on Dataset PROFIling and fEderated Search for Linked Data (PROFILES '15)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Max</given-names>
            <surname>Schmachtenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Heiko</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Adoption of the Linked Data Best Practices in Different Topical Domains</article-title>
          . In International Semantic Web Conference,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Soderland</surname>
          </string-name>
          , Brendan Roof, Bo Qin, Shi Xu,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          , et al.
          <article-title>Adapting open information extraction to domain-specific relations</article-title>
          .
          <source>AI Magazine</source>
          ,
          <volume>31</volume>
          (
          <issue>3</issue>
          ):
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Ziqi</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Anna Lisa Gentile, and
          <string-name>
            <given-names>Isabelle</given-names>
            <surname>Augenstein</surname>
          </string-name>
          .
          <article-title>Linked data as background knowledge for information extraction on the web</article-title>
          .
          <source>SIGWEB Newsl</source>
          ., (Summer):
          <fpage>5:1</fpage>
          -
          <lpage>5:9</lpage>
          , July
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>