=Paper= {{Paper |id=None |storemode=property |title=Web Data Commons - Extracting Structured Data from Two Large Web Corpora |pdfUrl=https://ceur-ws.org/Vol-937/ldow2012-inv-paper-2.pdf |volume=Vol-937 |dblpUrl=https://dblp.org/rec/conf/www/MuhleisenB12 }} ==Web Data Commons - Extracting Structured Data from Two Large Web Corpora== https://ceur-ws.org/Vol-937/ldow2012-inv-paper-2.pdf
                      Web Data Commons –
     Extracting Structured Data from Two Large Web Corpora

                          Hannes Mühleisen                                                    Christian Bizer
                      Web-based Systems Group                                           Web-based Systems Group
                       Freie Universität Berlin                                          Freie Universität Berlin
                              Germany                                                           Germany
                    muehleis@inf.fu-berlin.de                                        christian.bizer@fu-berlin.de

Copyright is held by the author/owner(s).
LDOW2012, April 16, 2012, Lyon, France.

ABSTRACT
More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads. In this paper, we give an overview of the project and present statistics about the popularity of the different encoding standards as well as the kinds of data that are published using each format.

1. INTRODUCTION
In recent years, much work has been invested in transforming the so-called “eyeball” web, where information is presented for visual human perception, towards a “Web of Data”, where data is produced, consumed and recombined in a more or less formal way. A part of this transformation is the increasing number of websites which embed structured data into their HTML pages using different encoding formats. The most prevalent formats for embedding structured data are Microformats, which use style definitions to annotate HTML text with terms from a fixed set of vocabularies; RDFa, which is used to embed any kind of RDF data into HTML pages; and Microdata, a recent format developed in the context of HTML5.

The embedded data is crawled together with the HTML pages by Google, Microsoft and Yahoo!, which use the data to enrich their search results. These companies have so far been the only ones capable of providing insights into the amount as well as the types of data that are currently published on the Web using Microformats, RDFa and Microdata. While a previously published study by Yahoo! Research [4] provided many insights, the analyzed web corpus is not publicly available. This prohibits further analysis, and the figures provided in the study have to be taken at face value.

However, the situation has changed with the advent of the Common Crawl. Common Crawl¹ is a non-profit foundation that collects data from web pages using crawler software and publishes this data. So far, the Common Crawl foundation has published two Web corpora, one dating 2009/2010 and one dating February 2012. Together the two corpora contain over 4.5 billion web pages. Pages are included into the crawls based on their PageRank score, making these corpora snapshots of the popular part of the web.

¹ http://commoncrawl.org

The Web Data Commons project has extracted all Microformat, Microdata and RDFa data from the Common Crawl web corpora and provides the extracted data for download in the form of RDF-quads. In this paper, we give an overview of the project and present statistics about the popularity of the different encoding formats as well as the kinds of data that are published using each format.

The remainder of this paper is structured as follows: Section 2 gives an overview of the different formats that are used to embed structured data into HTML pages. Section 3 describes and compares the changes in format popularity over time, while Section 4 discusses the kinds of structured data that are embedded into web pages today. Section 5 presents the extraction framework that was used to process the Common Crawl corpora on the Amazon Compute Cloud. Finally, Section 6 summarizes the findings of this paper.

2. EMBEDDING STRUCTURED DATA
This section summarizes the basics of Microformats, RDFa and Microdata and provides references for further reading.

2.1 Microformats
An early approach for adding structure to HTML pages were Microformats². Microformats define a number of fixed vocabularies to annotate specific things such as people, calendar entries, products etc. within HTML pages. Well-known Microformats include hCalendar for calendar entries according to RFC2445, hCard for people, organizations and places according to RFC2426, geo for geographic coordinates, hListing for classified ads, hResume for resume information, hReview for product reviews, hRecipe for cooking recipes, Species for taxonomic names of species and XFN for modeling relationships between humans.

² http://microformats.org

For example, to represent a person within an HTML page using the hCard Microformat, one could use the following markup:

    <span class="vcard">
      <span class="fn">Jane Doe</span>
    </span>

In this example, two inert <span> elements are used to first create a person description and then define the name of the person described. The main disadvantages of Microformats are their case-by-case syntax and their restriction to a specific set of vocabulary terms.
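To illustrate how a consumer might read hCard markup of this kind, the following minimal Python sketch collects the text of every element with class `fn` that is nested inside an element with class `vcard`. It relies only on the standard library; the class `HCardNameParser` and the sample string are illustrative assumptions and are not part of the Web Data Commons code, which is based on Any23.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HCardNameParser(HTMLParser):
    """Collects the text of "fn" elements nested inside a "vcard" element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # class lists of the currently open elements
        self.names = []   # formatted names found so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(classes)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack or not data.strip():
            return
        inside_vcard = any("vcard" in classes for classes in self.stack[:-1])
        if inside_vcard and "fn" in self.stack[-1]:
            self.names.append(data.strip())


# Hypothetical sample page fragment.
sample = '<span class="vcard"><span class="fn">Jane Doe</span></span>'
parser = HCardNameParser()
parser.feed(sample)
print(parser.names)
```

The sketch assumes reasonably well-formed markup; a production extractor would need to handle unbalanced tags and the remaining hCard properties.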
To improve the situation, the newer formats, RDFa and Microdata, provide vocabulary-independent syntaxes and allow terms from arbitrary vocabularies to be used.

2.2 RDFa
RDFa defines a serialization format for embedding RDF data [3] within (X)HTML pages. RDFa provides a vocabulary-agnostic syntax to describe resources, annotate them with literal values, and create links to other resources on other pages using custom HTML attributes. By also providing a reference to the used vocabulary, consuming applications are able to discern annotations. To express information about a person in RDFa, one could write the following markup:

    <span xmlns:foaf="http://xmlns.com/foaf/0.1/" about="#jane" typeof="foaf:Person">
      <span property="foaf:name">Jane Doe</span>
    </span>

Using RDFa markup, we refer to an external, commonly used vocabulary and define a URI for the thing we are describing. Using terms from the vocabulary, we then select the “Person” type for the described thing, and annotate it with a name. While requiring more markup than the hCard example above, considerable flexibility is gained. A major supporter of RDFa is Facebook, which has based its Open Graph Protocol³ on the RDFa standard.

³ http://ogp.me/

2.3 Microdata
While impressive, the graph model underlying RDF was thought to represent entrance barriers for web authors. Therefore, the competing Microdata format [2] emerged as part of the HTML5 standardization effort. In many ways, Microdata is very similar to RDFa: it defines a set of new HTML attributes and allows the use of arbitrary vocabularies to create structured data. However, Microdata uses key-value pairs as its underlying data model, which lacks much of the expressiveness of RDF, but at the same time also simplifies usage and processing. Again, our running example of embedded structured data to describe a person is given below:

    <span itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Jane Doe</span>
    </span>

We see how the reference to both type and vocabulary document found in RDFa is compressed into a single type definition. Apart from that, the annotations are similar, but without the ability to mix vocabularies as is the case in RDFa. The Microdata standard gained attention as it was selected as preferred syntax by the Schema.org initiative, a joint effort of Google, Bing and Yahoo, which defines a number of vocabularies for common items and carries the promise that data that is represented using these vocabularies will be used within the applications of the founding organizations.

3. FORMAT USAGE
As of February 2012, the Common Crawl foundation has released two Web corpora. The first corpus contains web resources (pages, images, ...) that have been crawled between September 2009 and September 2010. The second corpus contains resources dating February 2012, thereby yielding two distinct data points for which we can compare the usage of structured data within the pages. As a first step, we have filtered the web resources contained in the corpora to only include HTML pages. Table 1 shows a comparison of the two corpora. We can see how HTML pages represent the bulk of the corpora. The newer crawl contains fewer web pages. 148 million pages within the 2009/2010 crawl contained structured data, while 189 million pages within the 2012 crawl contained structured data. Taking the different sizes of the crawls into account, we can see that the fraction of web pages that contain structured data has increased from 6 % in 2010 to 12 % in 2012. The absolute numbers of web pages that used the different formats are given in Table 2. The data sets that we extracted from the corpora consist of 3.2 billion RDF quads (2012 corpus) and 5.2 billion RDF quads (2009/2010 corpus).

                       2009/2010        2012
    Crawl Dates        09/09 – 09/11    02/12
    Total URLs         2.8B             1.7B
    HTML Pages         2.5B             1.5B
    Pages with Data    148M             189M

Table 1: Comparison of the Common Crawl corpora

    Format        2009/2010          2012
    RDFa         14,314,036    67,901,246
    Microdata        56,964    26,929,865
    geo           5,051,622     2,491,933
    hcalendar     2,747,276     1,506,379
    hcard        83,583,167    61,360,686
    hlisting      1,227,574       197,027
    hresume         387,364        20,762
    hreview       2,836,701     1,971,870
    species          25,158        14,033
    hrecipe         115,345       422,289
    xfn          37,526,630    26,004,925

Table 2: URLs using the different Formats

Fig. 1 shows the distribution of the different formats as a percentage of URLs within the corpora and compares the fractions for the 2009/2010 corpus and the 2012 corpus. We see that RDFa and Microdata gain popularity, while the usage of the single-purpose Microformats remains more or less constant. The reason for the explosive adoption of the Microdata syntax between 2010 and 2012 might be the announcement in 2011 that Microdata is the preferred syntax of the Schema.org initiative.

[Figure: bar chart of the percentage of URLs (y-axis, 0 to 4) per format (x-axis: RDFa, Microdata, geo, hcalendar, hcard, hreview, XFN) for the 2009/2010 and 02-2012 corpora]

Figure 1: Common Crawl Corpora – Format Distribution
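The share of HTML pages that carry structured data can be recomputed from the figures in Table 1. The short Python sketch below uses the rounded table values (2.5B and 1.5B HTML pages, 148M and 189M pages with data); the dictionary layout and the function name `data_fraction` are illustrative choices.

```python
# Rounded figures taken from Table 1.
corpora = {
    "2009/2010": {"html_pages": 2.5e9, "pages_with_data": 148e6},
    "2012": {"html_pages": 1.5e9, "pages_with_data": 189e6},
}


def data_fraction(stats):
    """Fraction of HTML pages that contain structured data."""
    return stats["pages_with_data"] / stats["html_pages"]


for name, stats in corpora.items():
    print(f"{name}: {data_fraction(stats):.1%}")
```

With these rounded inputs the script yields about 5.9 % for the 2009/2010 corpus and 12.6 % for the 2012 corpus, in line with the reported doubling from roughly 6 % to 12 %; the small deviation presumably stems from the rounding of the table values.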
The study by Yahoo! Research [4] confirms our observations. The study is based on a Yahoo corpus consisting of 12 billion web pages. The analysis was repeated three times between 2008 and 2010 to investigate the development of the formats. They measured the percentage of URLs that contained the respective format. These results are shown in Fig. 2.

[Figure: bar chart of the percentage of URLs (y-axis, 0.0 to 3.0) per format (x-axis: RDFa, eRDF, tag, hcard, adr, hatom, xfn, geo, hreview) for the samples 09/2008, 03/2009 and 10/2010]

Figure 2: Yahoo! Corpora – Format Distribution (adapted from [4])

We can see that RDFa exhibits non-linear growth, while the other formats do not show comparable developments. A second survey on the format distribution was presented by Sindice, a web index specializing in structured data [1]. Their survey was based on a set of 231 million web documents collected in 2011. Their results were similar to the 2010 sample from the Yahoo! survey, showing major uptake for RDFa.

4. TYPES OF DATA
While each Microformat can only be used to annotate the specific types of data it was designed for, RDFa and Microdata are able to use arbitrary vocabularies. Therefore, the format comparison alone does not yield insight into the types of data being published. RDFa and Microdata both support the definition of a data type for the annotated entities. Thus, simply counting the occurrences of these types can give an indicator of their popularity. The top 20 values for type definitions of the RDFa data within the 2012 corpus are given in Table 3. Type definitions are given as shortened URLs, using common prefixes⁴. Note that gd: stands for Google’s Data-Vocabulary, one of the predecessors of Schema.org.

⁴ http://prefix.cc/popular.file.ini

    Type                      Entities
    gd:Breadcrumb           13,541,661
    foaf:Image               4,705,292
    gd:Organization          3,430,437
    foaf:Document            2,732,134
    skos:Concept             2,307,455
    gd:Review-aggregate      2,166,435
    sioc:UserAccount         1,150,720
    gd:Rating                1,055,997
    gd:Person                  880,670
    sioctypes:Comment          666,844
    gd:Product                 619,493
    gd:Address                 615,930
    gd:Review                  540,537
    mo:Track                   444,998
    gd:Geo                     380,323
    mo:Release                 238,262
    commerce:Business          197,305
    sioctypes:BlogPost         177,031
    mo:SignalGroup             174,289
    mo:ReleaseEvent            139,118

Table 3: Top-20 Types for RDFa

We have then manually grouped the 100 most frequently occurring types by entity count into groups. These groups are given in Table 4. The most frequent types were from the area of website structure annotation, where for example navigational aids are marked. The second most popular area is information about people, businesses and organizations in general, followed by media such as audio files, pictures and videos. Product offers and corresponding reviews represent the fourth most frequent group, and geographical information such as addresses and coordinates was least frequent. Groups below 1 % frequency are not given.

    Area                     % Entities
    Website Structure            29 %
    People, Organizations        12 %
    Media                        11 %
    Products, Reviews            10 %
    Geodata                       2 %

Table 4: Entities by Area for RDFa

Table 6 shows the same analysis for the Microdata data within the 2012 corpus. Apart from variations in the specific percentages, the same groups were found to be most frequently used. An interesting observation was that only two of the 100 most frequently occurring types were not from the Schema.org namespace, confirming the overwhelming prevalence of types from this namespace, which is not surprising since the Microdata format itself was made popular by this initiative. This is also shown in the listing of the 20 most frequent type definitions given in Table 5, where all URLs originate from either the Schema.org domain or the Data-Vocabulary domain.

To further investigate the lack of diversity that has become apparent in the analysis of the 100 most frequently used types for RDFa and Microdata, we have calculated a histogram for the most frequently used types. These histograms are shown in Fig. 3. For both corpora, the histogram of type frequencies is plotted on a logarithmic scale. From the graph, we can make two observations: a small number of types enjoy very high popularity, and the long tail is rather short. For both formats, no more than 200 types had more than 1000 instances.

[Figure: entity count (y-axis, logarithmic, 5e+03 to 5e+06) per type rank (x-axis, 0 to 200) for the series Microdata 02/2012, RDFa 02/2012, RDFa 2009/2010 and Microdata 2009/2010]

Figure 3: Microdata / RDFa type frequency
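The type counting described in Section 4 can be sketched in a few lines of Python. The sketch assumes that the extracted quads are available as N-Quads lines whose fourth element is the URL of the source page; the simplified pattern `QUAD`, the function names and the sample data are illustrative and not taken from the project's code.

```python
import re
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified N-Quads pattern: subject, predicate, IRI object, graph IRI.
# Quads with literal objects do not match and are skipped.
QUAD = re.compile(r"^(\S+)\s+(\S+)\s+<([^>]*)>\s+<[^>]*>\s*\.\s*$")


def count_types(lines):
    """Count how often each type IRI appears as the object of rdf:type."""
    counts = Counter()
    for line in lines:
        match = QUAD.match(line)
        if match and match.group(2) == RDF_TYPE:
            counts[match.group(3)] += 1
    return counts


def types_above(counts, threshold):
    """Number of distinct types with more than `threshold` instances."""
    return sum(1 for n in counts.values() if n > threshold)


# Hypothetical sample quads.
sample = [
    f"_:b1 {RDF_TYPE} <http://schema.org/Person> <http://example.org/a.html> .",
    '_:b1 <http://schema.org/name> "Jane Doe" <http://example.org/a.html> .',
    f"_:b2 {RDF_TYPE} <http://schema.org/Person> <http://example.org/b.html> .",
    f"_:b3 {RDF_TYPE} <http://schema.org/Product> <http://example.org/c.html> .",
]
counts = count_types(sample)
print(counts.most_common(20))
print(types_above(counts, 1))
```

Sorting the counts in descending order gives the kind of ranking shown in the top-20 tables, and `types_above(counts, 1000)` corresponds to the observation about the number of types with more than 1000 instances.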
    Type                      Entities
    gd:Breadcrumb           18,528,472
    schema:VideoObject      10,760,983
    schema:Offer             6,608,047
    schema:PostalAddress     5,714,201
    schema:MusicRecording    2,054,647
    schema:AggregateRating   2,035,318
    schema:Product           1,811,496
    schema:Person            1,746,049
    gd:Offer                 1,542,498
    schema:Article           1,243,972
    schema:WebPage           1,189,900
    gd:Rating                1,135,718
    schema:Review            1,016,285
    schema:Organization      1,011,754
    schema:Rating              872,688
    gd:Organization            861,558
    gd:Product                 647,419
    gd:Person                  564,921
    gd:Review-aggregate        539,642
    gd:Address                 538,163

Table 5: Top-20 Types for Microdata

    Area                     % Entities
    Website Structure            23 %
    Products, Reviews            19 %
    Media                        15 %
    Geodata                       8 %
    People, Organizations         7 %

Table 6: Entities by Area for Microdata

5. EXTRACTION PROCESS
The Common Crawl data sets are stored in the AWS Simple Storage Service (S3), hence extraction was also performed in the Amazon cloud (EC2). The main criterion here is the cost of achieving a given task. Extracting structured data had to be performed in a distributed way in order to finish this task in a reasonable time. Instead of using the ubiquitous Hadoop framework, we found that using the Simple Queue Service (SQS) to coordinate our extraction process increased efficiency. SQS provides a message queue implementation, which we used to coordinate 100 extraction nodes.

The Common Crawl corpora were already partitioned into compressed files of around 100MB each. We added the identifiers of each of these files as messages to the queue. The extraction nodes share this queue and take file identifiers from it. The corresponding file was then downloaded from S3 to the node. The compressed archive was split into individual web pages. On each page, we ran our RDF extractor based on the Anything To Triples (Any23) library. The resulting RDF triples were written back to S3 together with extraction statistics and later collected.

Any23 parses web pages for structured data by building a DOM tree and then evaluates XPath expressions to extract the structured data. While profiling, we found this tree generation to account for much of the parsing cost, and we have thus searched for a way to reduce the number of times this tree is built. Our solution was to run regular expressions against each archived web page prior to extraction, which detected the presence of structured data within the HTML page, and only to run the Any23 extractor when the regular expression found potential matches.

The costs for parsing the 28.9 Terabytes of compressed input data of the 2009/2010 Common Crawl corpus, extracting the RDF data and storing the extracted data on S3 totaled 576 EUR (excluding VAT) in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction, which altogether required 3,537 machine hours. For the 20.9 Terabytes of the February 2012 corpus, 3,007 machine hours at a total cost of 523 EUR were required.

6. CONCLUSION
The analysis of the two Common Crawl corpora has shown that the percentage of web pages that contain structured data has increased from 6 % in 2010 to 12 % in 2012. The analysis showed an increasing uptake of RDFa and Microdata, while the Microformat deployment stood more or less constant.

The analysis of the types of the annotated entities revealed that the generic formats are used to annotate web pages with structural information (breadcrumbs) as well as to embed data describing people, organizations, media files, e-commerce data such as products and corresponding reviews, and geographical information such as coordinates. Further analysis of the usage frequency of the type definitions of the annotated entities showed a very short tail, with fewer than 200 significant types. The deployed types as well as the deployed formats seem to closely correlate to the announced support of the big web companies for specific types and formats, meaning that Google, Microsoft and Yahoo almost exclusively determine adoption.

We hope that the data we have extracted from the two web crawls will serve as a resource for future analysis, enabling public research on a topic that was previously almost exclusive to organizations with access to large web corpora. More detailed statistics about the extracted data as well as the extracted data itself are available at http://webdatacommons.org.

Acknowledgements
We would like to thank the Common Crawl foundation for creating and publishing their crawl corpora as well as the Any23 team for their structured data extraction framework. This work has been supported by the PlanetData and LOD2 projects funded by the European Community’s Seventh Framework Programme.

7. REFERENCES
[1] S. Campinas, D. Ceccarelli, T. E. Perry, R. Delbru, K. Balog, and G. Tummarello. The sindice-2011 dataset for entity-oriented search in the web of data. In EOS, SIGIR 2011 workshop, July 28, Beijing, China, 2011.
[2] I. Hickson. HTML Microdata. http://www.w3.org/TR/microdata/, 2011. Working Draft.
[3] G. Klyne and J. J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax - W3C Recommendation, 2004. http://www.w3.org/TR/rdf-concepts/.
[4] P. Mika. Microformats and RDFa deployment across the Web. http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/, 2011.
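As an illustration of the extraction process described in Section 5, the following Python sketch combines the two ideas from that section: workers that take file identifiers from a shared queue, and a cheap regular-expression check that decides whether the expensive extractor is run at all. It is a local stand-in under stated assumptions: `queue.Queue` replaces SQS, the callbacks `fetch_pages` and `extract` are hypothetical placeholders for the S3 download and the Any23-based extractor, and the pattern `STRUCTURED_DATA_HINT` is a guess, since the paper does not list the expressions that were actually used.

```python
import queue
import re
import threading

# Assumed pre-filter: attribute names and class values typical of the formats.
STRUCTURED_DATA_HINT = re.compile(
    r"""itemscope|itemprop|typeof=|property=|"""
    r"""class=["'][^"']*\b(?:vcard|vevent|hreview|hrecipe)\b""",
    re.IGNORECASE,
)


def may_contain_structured_data(html):
    """Cheap check that avoids building a DOM tree for most pages."""
    return STRUCTURED_DATA_HINT.search(html) is not None


def worker(tasks, fetch_pages, extract, results):
    """Take file identifiers from the shared queue until it is empty."""
    while True:
        try:
            file_id = tasks.get_nowait()
        except queue.Empty:
            return
        for url, html in fetch_pages(file_id):
            if may_contain_structured_data(html):
                # Only now run the expensive, DOM-based extraction step.
                results.append((url, extract(html)))


def run(file_ids, fetch_pages, extract, n_workers=4):
    """Fill the queue with file identifiers and process them in parallel."""
    tasks = queue.Queue()
    for file_id in file_ids:
        tasks.put(file_id)
    results = []
    threads = [
        threading.Thread(target=worker, args=(tasks, fetch_pages, extract, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

A queue of independent file identifiers needs no coordination beyond handing out the next message, which fits a workload where every archive file can be processed in isolation. The pre-filter may let through pages without usable data; that only costs one unnecessary extractor run, whereas a pattern that is too strict would silently lose data.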