<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Web Data Commons - Extracting Structured Data from Two Large Web Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hannes Mühleisen</string-name>
          <email>muehleis@inf.fu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Bizer</string-name>
          <email>christian.bizer@fu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Web-based Systems Group, Freie Universität Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Web-based Systems Group, Freie Universität Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>More and more websites embed structured data describing for instance products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF quads. In this paper, we give an overview of the project and present statistics about the popularity of the different encoding standards as well as the kinds of data that are published using each format.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2.1 Microformats</title>
      <p>&lt;span class="vcard"&gt;
  &lt;span class="fn"&gt;Jane Doe&lt;/span&gt;
&lt;/span&gt;</p>
      <p>In this example, two inert &lt;span&gt; elements are used to first create a person description and then to define the name of the described person. The main disadvantages of Microformats are their case-by-case syntax and their restriction to a specific set of vocabulary terms. To improve the situation, the newer formats, RDFa and Microdata, provide vocabulary-independent syntaxes and allow terms from arbitrary vocabularies to be used.</p>
      <p>2http://microformats.org</p>
    </sec>
    <sec id="sec-1-2">
      <title>2.2 RDFa</title>
      <p>RDFa defines a serialization format for embedding RDF data [<xref ref-type="bibr" rid="ref3">3</xref>] within (X)HTML pages. RDFa provides a vocabulary-agnostic syntax to describe resources, annotate them with literal values, and create links to other resources on other pages using custom HTML attributes. By also providing a reference to the vocabulary used, consuming applications are able to interpret the annotations. To express information about a person in RDFa, one could write the following markup:</p>
      <p>&lt;span xmlns:foaf="http://xmlns.com/foaf/0.1/"
      typeof="foaf:Person"&gt;
  &lt;span property="foaf:name"&gt;Jane Doe&lt;/span&gt;
&lt;/span&gt;</p>
      <p>Using RDFa markup, we refer to an external, commonly used vocabulary and define a URI for the thing we are describing. Using terms from the vocabulary, we then select the "Person" type for the described thing and annotate it with a name. While requiring more markup than the hCard example above, considerable flexibility is gained. A major supporter of RDFa is Facebook, which has based its Open Graph Protocol3 on the RDFa standard.</p>
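      <p>The vocabulary reference in the markup above works through CURIE expansion: a consumer resolves a compact term such as foaf:name against the declared prefix into a full URI. A minimal sketch of this step, with a prefix table mirroring the xmlns:foaf declaration (everything else is illustrative, not any specific RDFa library):</p>

```python
# Illustrative CURIE expansion as an RDFa consumer performs it.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}

def expand_curie(curie, prefixes):
    """Expand a compact URI like 'foaf:name' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix in prefixes:
        return prefixes[prefix] + local
    return curie  # already a full URI, or unknown prefix

print(expand_curie("foaf:Person", PREFIXES))
# http://xmlns.com/foaf/0.1/Person
```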
    </sec>
    <sec id="sec-2">
      <title>2.3 Microdata</title>
      <p>While impressive, the graph model underlying RDF was thought to represent an entrance barrier for web authors. Therefore, the competing Microdata format [<xref ref-type="bibr" rid="ref2">2</xref>] emerged as part of the HTML5 standardization effort. In many ways, Microdata is very similar to RDFa: it defines a set of new HTML attributes and allows the use of arbitrary vocabularies to create structured data. However, Microdata uses key-value pairs as its underlying data model, which lacks much of the expressiveness of RDF, but at the same time also simplifies usage and processing. Again, our running example of embedded structured data describing a person is given below:</p>
      <p>&lt;span itemscope
      itemtype="http://schema.org/Person"&gt;
  &lt;span itemprop="name"&gt;Jane Doe&lt;/span&gt;
&lt;/span&gt;</p>
      <p>We see how the reference to both type and vocabulary document found in RDFa is compressed into a single type definition. Apart from that, the annotations are similar, but without the ability to mix vocabularies as is the case in RDFa. The Microdata standard gained attention when it was selected as the preferred syntax by the Schema.org initiative, a joint effort of Google, Bing and Yahoo!, which defines a number of vocabularies for common items and carries the promise that data represented using these vocabularies will be used within the applications of the founding organizations.</p>
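      <p>The key-value model of a Microdata item, and its flattening into RDF quads of the form (subject, predicate, object, source page) as published by Web Data Commons, can be sketched as follows. This is an illustrative sketch, not the project's code; the item dict, blank-node label and page URL are made up, and property resolution against the Schema.org base URL is hard-coded for the example.</p>

```python
# A Microdata item as a key-value map, mirroring the person example above.
item = {
    "itemtype": "http://schema.org/Person",
    "properties": {"name": ["Jane Doe"]},
}

def item_to_quads(item, subject, page_url):
    """Flatten one Microdata item into (subject, predicate, object, page) quads."""
    rdf_type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    quads = [(subject, rdf_type, item["itemtype"], page_url)]
    for prop, values in item["properties"].items():
        # Illustrative: resolve property names against the Schema.org namespace.
        predicate = "http://schema.org/" + prop
        for value in values:
            quads.append((subject, predicate, value, page_url))
    return quads

quads = item_to_quads(item, "_:n0", "http://example.com/page.html")
```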
    </sec>
    <sec id="sec-3">
      <title>3. FORMAT USAGE</title>
      <p>As of February 2012, the Common Crawl foundation has released two web corpora. The first corpus contains web resources (pages, images, ...) that were crawled between September 2009 and September 2010. The second corpus contains resources crawled in February 2012, thereby yielding two distinct data points.</p>
      <p>3http://ogp.me/</p>
      <sec id="sec-3-1">
        <title>Crawl Statistics</title>
        <p>[Table 1: comparison of the two corpora — crawl dates, total URLs, HTML pages, pages with data]</p>
        <p>Using these two corpora, we can compare the usage of structured data within the pages. As a first step, we have filtered the web resources contained in the corpora to only include HTML pages. Table 1 shows a comparison of the two corpora. We can see that HTML pages represent the bulk of both corpora, and that the newer crawl contains fewer web pages. 148 million HTML pages within the 2009/2010 crawl contained structured data, while 189 million pages within the 2012 crawl did. Taking the different sizes of the crawls into account, we can see that the fraction of web pages that contain structured data has increased from 6 % in 2010 to 12 % in 2012. The absolute numbers of web pages that used the different formats are given in Table 2. The data sets that we extracted from the corpora consist of 3.2 billion RDF quads (2012 corpus) and 5.2 billion RDF quads (2009/2010 corpus).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Comparison with Other Studies</title>
        <p>[Table 2: number of web pages using each format — RDFa, Microdata, geo, hcalendar, hcard, hlisting, hresume, hreview, species, hrecipe, xfn]</p>
        <p>
The study by Yahoo! Research [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] confirms our observations. The study is based on a Yahoo! corpus consisting of 12 billion web pages. The analysis was repeated three times between 2008 and 2010 to investigate the development of the formats. They measured the percentage of URLs that contained the respective format. These results are shown in Fig. 2. We can see that RDFa exhibits non-linear growth, while the other formats do not show comparable development. A second survey on the format distribution was presented by Sindice, a web index specializing in structured data [<xref ref-type="bibr" rid="ref1">1</xref>]. Their survey was based on a set of 231 million web documents collected in 2011. Their results were similar to the 2010 sample from the Yahoo! survey, showing major uptake of RDFa.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. TYPES OF DATA</title>
      <p>While each Microformat can only be used to annotate the specific types of data it was designed for, RDFa and Microdata are able to use arbitrary vocabularies. Therefore, the format comparison alone does not yield insight into the types of data being published. RDFa and Microdata both support the definition of a data type for the annotated entities. Thus, simply counting the occurrences of these types can give an indicator of their popularity. The top 20 values for type definitions of the RDFa data within the 2012 corpus are given in Table 3. Type definitions are given as shortened URLs, using common prefixes4. Note that gd: stands for Google's Data-Vocabulary, one of the predecessors of Schema.org.</p>
      <p>We have then manually grouped the 100 most frequently occurring types by entity count. These groups are given in Table 4. The most frequent types were from the area of website structure annotation, where for example navigational aids are marked. The second most popular area is information about people, businesses and organizations in general, followed by media such as audio files, pictures and videos. Product offers and corresponding reviews represent the fourth most frequent group, and geographical information such as addresses and coordinates was least frequent. Groups below 1 % frequency are not given.</p>
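      <p>The counting and prefix-shortening step described above can be sketched as follows. The quads are made-up stand-ins for the extracted data, and the prefix table is reduced to two illustrative entries; this is not the Web Data Commons code.</p>

```python
from collections import Counter

# Hypothetical extracted quads: (subject, predicate, object, source page).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
quads = [
    ("_:a", RDF_TYPE, "http://rdf.data-vocabulary.org/#Breadcrumb", "http://example.com/1"),
    ("_:b", RDF_TYPE, "http://rdf.data-vocabulary.org/#Breadcrumb", "http://example.com/2"),
    ("_:c", RDF_TYPE, "http://xmlns.com/foaf/0.1/Image", "http://example.com/3"),
]

# Shorten full type URIs with common prefixes, as done for Table 3.
PREFIXES = {
    "http://rdf.data-vocabulary.org/#": "gd:",
    "http://xmlns.com/foaf/0.1/": "foaf:",
}

def shorten(uri):
    for base, prefix in PREFIXES.items():
        if uri.startswith(base):
            return prefix + uri[len(base):]
    return uri

# Count occurrences of each (shortened) type across all type quads.
type_counts = Counter(shorten(o) for s, p, o, g in quads if p == RDF_TYPE)
top_types = type_counts.most_common(20)
```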
      <p>Table 6 shows the same analysis for the Microdata data within the 2012 corpus. Apart from variations in the specific percentages, the same groups were found to be most frequently used. An interesting observation was that only two of the 100 most frequently occurring types were not from the Schema.org namespace, confirming the overwhelming prevalence of types from this namespace.</p>
      <p>4http://prefix.cc/popular.file.ini</p>
      <p>[Table 3: top 20 RDFa types in the 2012 corpus — gd:Breadcrumb, foaf:Image, gd:Organization, foaf:Document, skos:Concept, gd:Review-aggregate, sioc:UserAccount, gd:Rating, gd:Person, sioctypes:Comment, gd:Product, gd:Address, gd:Review, mo:Track, gd:Geo, mo:Release, commerce:Business, sioctypes:BlogPost, mo:SignalGroup, mo:ReleaseEvent]</p>
      <p>To further investigate the lack of diversity that has become apparent in the analysis of the 100 most frequently used types for RDFa and Microdata, we have calculated a histogram for the most frequently used types. These histograms are shown in Fig. 3. For both corpora, the histogram of type frequencies is plotted on a logarithmic scale. From the graph, we can make two observations: a small number of types enjoy very high popularity, and the long tail is rather short. For both formats, no more than 200 types had more than 1000 instances.</p>
      <p>[Table: top 20 Microdata types in the 2012 corpus — gd:Breadcrumb, schema:VideoObject, schema:Offer, schema:PostalAddress, schema:MusicRecording, schema:AggregateRating, schema:Product, schema:Person, gd:Offer, schema:Article, schema:WebPage, gd:Rating, schema:Review, schema:Organization, schema:Rating, gd:Organization, gd:Product, gd:Person, gd:Review-aggregate, gd:Address]</p>
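      <p>The histogram in Fig. 3 buckets types by how many instances each has, on a logarithmic scale. A minimal sketch of that bucketing, with made-up frequency data standing in for the real per-type instance counts:</p>

```python
import math
from collections import Counter

# Hypothetical instance counts per type (stand-ins for the extracted data).
type_counts = {
    "schema:Person": 120000,
    "gd:Breadcrumb": 95000,
    "mo:Track": 800,
    "ex:Rare": 3,
}

# Bucket types by the order of magnitude (log10) of their instance count.
histogram = Counter(int(math.log10(n)) for n in type_counts.values())

# Count types passing the significance threshold used in the text (1000 instances).
significant = sum(1 for n in type_counts.values() if n > 1000)
```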
      <p>[Table: entity groups and their share of annotated entities — Website Structure: 23 %, Products, Reviews: 19 %, Media: 15 %, Geodata: 8 %, People, Organizations: 7 %]</p>
    </sec>
    <sec id="sec-5">
      <title>5. EXTRACTION PROCESS</title>
      <p>The Common Crawl data sets are stored in the AWS Simple Storage Service (S3); hence, extraction was also performed in the Amazon cloud (EC2). The main criterion here was the cost of achieving the task. Extracting the structured data had to be performed in a distributed way in order to finish in a reasonable time. Instead of using the ubiquitous Hadoop framework, we found that using the Simple Queue Service (SQS) to coordinate our extraction process increased efficiency. SQS provides a message queue implementation, which we used to coordinate 100 extraction nodes.</p>
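      <p>The coordination pattern — file identifiers in a shared queue, worker nodes pulling identifiers until the queue is empty — can be simulated locally as follows. This is a sketch of the pattern only, using Python's in-process queue and threads rather than SQS; the file names are made up, and download, extraction, and upload are stubbed out.</p>

```python
import queue
import threading

# Shared work queue holding file identifiers, as the SQS queue did.
work = queue.Queue()
for file_id in ["common-crawl/part-00000.arc.gz", "common-crawl/part-00001.arc.gz"]:
    work.put(file_id)

results = []
lock = threading.Lock()

def worker():
    """One 'extraction node': pull identifiers until the queue is empty."""
    while True:
        try:
            file_id = work.get_nowait()
        except queue.Empty:
            return
        # Stand-ins for: download from S3, split archive, run the extractor.
        extracted = f"quads-from-{file_id}"
        with lock:
            results.append(extracted)
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```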
      <p>The Common Crawl corpora were already partitioned into compressed files of around 100 MB each. We added the identifiers of each of these files as messages to the queue. The extraction nodes share this queue and take file identifiers from it. The corresponding file was then downloaded from S3 to the node, and the compressed archive was split into individual web pages. On each page, we ran our RDF extractor based on the Anything To Triples (Any23) library. The resulting RDF triples were written back to S3 together with extraction statistics and later collected.</p>
      <p>The costs for parsing the 28.9 Terabytes of compressed input data of
the 2009/2010 Common Crawl corpus, extracting the RDF data and
storing the extracted data on S3 totaled 576 EUR (excluding VAT)
in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge
for the extraction which altogether required 3,537 machine hours.
For the 20.9 Terabytes of the February 2012 corpus, 3,007 machine
hours at a total cost of 523 EUR were required.</p>
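      <p>From the figures reported above, a few back-of-the-envelope ratios follow directly:</p>

```python
# Reported figures: compressed input (TB), machine hours, cost (EUR, excl. VAT).
tb_2009, hours_2009, eur_2009 = 28.9, 3537, 576
tb_2012, hours_2012, eur_2012 = 20.9, 3007, 523

cost_per_tb_2009 = eur_2009 / tb_2009          # roughly 20 EUR per compressed TB
cost_per_tb_2012 = eur_2012 / tb_2012          # roughly 25 EUR per compressed TB
throughput_2009 = tb_2009 * 1024 / hours_2009  # roughly 8.4 GB per machine-hour
```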
    </sec>
    <sec id="sec-6">
      <title>6. CONCLUSION</title>
      <p>The analysis of the two Common Crawl corpora has shown that the percentage of web pages that contain structured data has increased from 6 % in 2010 to 12 % in 2012. The analysis showed an increasing uptake of RDFa and Microdata, while Microformat deployment remained more or less constant.</p>
      <p>The analysis of the types of the annotated entities revealed that the generic formats are used to annotate web pages with structural information (breadcrumbs) as well as to embed data describing people, organizations, media files, e-commerce data such as products and corresponding reviews, and geographical information such as coordinates. Further analysis of the usage frequency of the type definitions of the annotated entities showed a very short tail, with fewer than 200 significant types. The deployed types as well as the deployed formats seem to correlate closely with the announced support of the big web companies for specific types and formats, meaning that Google, Microsoft, and Yahoo! almost exclusively determine adoption. We hope that the data we have extracted from the two web crawls will serve as a resource for future analysis, enabling public research on a topic that was previously almost exclusively accessible to organizations with access to large web corpora. More detailed statistics about the extracted data, as well as the extracted data itself, are available at http://webdatacommons.org.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We would like to thank the Common Crawl foundation for creating
and publishing their crawl corpora as well as the Any23 team for
their structured data extraction framework. This work has been
supported by the PlanetData and LOD2 projects funded by the
European Community’s Seventh Framework Programme.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>S.</given-names> <surname>Campinas</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Ceccarelli</surname></string-name>,
          <string-name><given-names>T. E.</given-names> <surname>Perry</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Delbru</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Balog</surname></string-name>, and
          <string-name><given-names>G.</given-names> <surname>Tummarello</surname></string-name>.
          <article-title>The Sindice-2011 dataset for entity-oriented search in the web of data</article-title>.
          In <source>EOS, SIGIR 2011 workshop</source>, July 28, Beijing, China,
          <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>I.</given-names> <surname>Hickson</surname></string-name>.
          <article-title>HTML Microdata</article-title>. http://www.w3.org/TR/microdata/,
          <year>2011</year>. Working Draft.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>G.</given-names> <surname>Klyne</surname></string-name> and
          <string-name><given-names>J. J.</given-names> <surname>Carroll</surname></string-name>.
          <article-title>Resource Description Framework (RDF): Concepts and Abstract Syntax</article-title>. W3C Recommendation,
          <year>2004</year>. http://www.w3.org/TR/rdf-concepts/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>P.</given-names> <surname>Mika</surname></string-name>.
          <article-title>Microformats and RDFa deployment across the Web</article-title>. http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/,
          <year>2011</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>