<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Analysis Based on the Web Data Commons Data Set</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elias Kärle</string-name>
          <email>elias.kaerle@sti</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioan Toma</string-name>
          <email>ioan.toma@sti</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dieter Fensel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Semantic Technology Institute (STI) Innsbruck, University of Innsbruck</institution>
          , Technikerstrasse 21a, 6020 Innsbruck,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>75</fpage>
      <lpage>78</lpage>
      <abstract>
        <p>It has been almost four years since the world's leading search engine operators (Bing, Google, Yahoo! and Yandex) decided to start working on an initiative to enrich web pages with structured data, an initiative known as schema.org. Since then, many webmasters and web page designers have started adopting this technology to enrich websites with semantic information. This paper analyzes parts of the structured data in the largest publicly available web crawl, the Common Crawl, in order to find out how the tourism sector is using schema.org. Using hotels as a use case, it studies the usage and distribution of schema.org/Hotel and examines who uses schema.org, how it is applied, and whether the classes and properties of the vocabulary are used in a syntactically and semantically correct way.</p>
      </abstract>
      <kwd-group>
        <kwd>schema.org</kwd>
        <kwd>semantic annotation</kwd>
        <kwd>analysis</kwd>
        <kwd>hotel</kwd>
        <kwd>tourism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The Common Crawl<sup>2</sup> is an organization which crawls the web several times a year and provides the collected archives and data sets to the public for free. Web Data Commons<sup>3</sup> is a project started in 2012 by Freie Universität Berlin and the Karlsruhe Institute of Technology; it extracts different types of structured data from the Common Crawl and also provides them to the public for free. The interest of this paper lies in a data set within Web
Data Commons containing Microdata, RDFa and
Microformat data used to annotate web page content with schema.org
[3]. In this paper we present our work on obtaining a
comprehensive overview of the distribution of tourism-specific
schema.org vocabulary across the web, using the type
schema.org/Hotel as an example.</p>
      <p>This paper is structured as follows: Section 2 describes
related work, Section 3 states the research questions and
explains the methodology used to analyze the data, Section 4
presents the findings of the research, and Section 5 concludes
the paper.</p>
      <p><sup>1</sup> http://googlewebmastercentral.blogspot.co.at/2011/06/introducing-schemaorg-search-engines.html
<sup>2</sup> http://commoncrawl.org/
<sup>3</sup> http://webdatacommons.org/</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>During our work on this project, we came across work that
is related to our research. First of all, in the paper by
Stavrakantonakis et al. (2013), the authors survey the use
of Web 2.0 technologies, content management
systems, social channels and Web 3.0 technologies, as well
as the use of semantic web technologies and structured data,
on the websites of 2155 hotels in Austria. The outcome of
this research is that only 5% of the websites employ
semantic technologies, and the vast majority of hotels "completely
ignore the existence of technologies that could enrich the
website content with high level metadata and give machine
readable meaning to the presented information" [4].</p>
      <p>During the analysis, we came across several cases of incorrect
usage of schema.org. To detect, analyze and solve those
problems, the work of Meusel et al. (2015) [2] serves as a
starting point for our further work, in which we wish to give
advice on the semantically and syntactically correct
usage of schema.org annotations.</p>
      <p>When it comes to finding and choosing the most suitable
vocabulary, a project worth mentioning is vocab.cc. It is an
open source project which allows users to search for linked
data vocabularies, based on the dataset of the Billion Triple
Challenge<sup>4</sup>.</p>
      <p>The available schema.org annotations have a commercial
exploitation potential, which is currently being pursued by several
institutions. For example, ONLIM<sup>5</sup>, a current start-up
effort of STI Innsbruck, applies annotations to online social
media technologies in its product, a social media marketing tool.
The start-up already runs pilots, for example with the tourism
association of Innsbruck<sup>6</sup>, and implements semantic
dissemination support by adding schema.org support to their
website and publishing the touristic data of the regions as
linked open data.</p>
      <p>
        Another direction towards widespread real-life application
of schema.org is the development of tools that help web
developers introduce schema.org
annotations easily and correctly. One example here is the WYSIWYM project
described in Khalili et al. (2013) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. RESEARCH QUESTION AND METHODOLOGY</title>
      <p>As a starting point for our analysis, we define the key research
questions we want to answer. These are:</p>
      <p>1. How many hotels use schema.org? This question
triggers an analysis of whether it is possible to
state the number of hotels that are annotated with
schema.org, either on their own website or on third-party
websites.</p>
      <p>2. Is schema.org used syntactically and
semantically correctly, or are there many mistakes? The
answer to this question surveys the mistakes made
when annotating hotels.</p>
      <p>3. Who is using schema.org in the touristic field?
The last question asks whether
hotels use schema.org on their own websites and
which other platforms annotate hotels with schema.org.</p>
      <p>As mentioned in the introduction, the primary source of our
data was the result of the Common Crawl. Since our
analysis should be based only on structured data and, to be more
precise, on schema.org, we took advantage of a project called
Web Data Commons. This project uses the data from the
Common Crawl and extracts all sorts of structured data, which
are then divided into three main data sets: the Hyperlink
Graph, the Web Tables, and the RDFa, Microdata and
Microformat dataset, on which our interest lies. From this
dataset we use the "Schema.org Class Specific
Data Subsets", and from those subsets the one containing all triples
related to schema.org/Hotel.</p>
      <p>The schema.org/Hotel specific subset of the 2013 crawl was
2.2 GB compressed and 35 GB uncompressed.
Overall, we used 37 different queries. The measurement
and analysis of the collected data, which was available as CSV
tables, was mostly done by hand or with arithmetic functions
in Microsoft Excel, as well as by generating charts and
diagrams.</p>
      <p><sup>4</sup> http://km.aifb.kit.edu/projects/btc-2012/
<sup>5</sup> http://www.onlim.com
<sup>6</sup> http://www.innsbruck.info</p>
    </sec>
    <sec id="sec-5">
      <title>4. RESULTS</title>
      <p>In this section we present the results of our
analysis of the schema.org/Hotel related structured data in
the 2013 corpus of the Web Data Commons project. In
Section 3 of the paper, we defined three main questions,
which are answered below.</p>
    </sec>
    <sec id="sec-6">
      <title>4.1 How many hotels use schema.org?</title>
      <p>When trying to find out how many hotels are present in the
triple store, one can first query for all triples with predicate
rdf:type and object schema.org/Hotel and count them. The
result is about 4,841,000 hotels in the whole data
set. After a little manual inspection, however, it becomes clear
that many hotels are annotated more than once because,
for example, they have schema.org annotations on their own
website and are also annotated in the listings of one or several
booking platforms. Running the same query with the
restriction of counting only hotels with unique names yields
a reduced number of about 740,000. This number is not
meaningful either, because distinct hotels that share a name,
such as Hotel Post or Hotel Adler (both very
common hotel names in Austria), are collapsed into one.
A solution to that problem would have been to
search on unique combinations of hotel name and location or address, but
we observe that less than 75% of the hotels in the dataset
have a proper annotation for an address. To be more specific,
only about 3 million hotels used schema.org/Address,
2.2 million hotels used schema.org/street, 2 million hotels
used schema.org/zip, 1.9 million hotels used schema.org/land
and schema.org/Region, and only 1.1 million hotels used
schema.org/name as a country name. See Table 1 for more
details.</p>
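      <p>The difference between the two counts above can be illustrated with a small script. This is only a minimal sketch and not the procedure used for the analysis, which relied on queries against a triple store. It assumes that the data is available as N-Quads lines, that hotel names are stored under the predicate http://schema.org/Hotel/name, and that a simplified regular expression is good enough to split a line into its four columns. The function names and the sample data are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HOTEL_CLASS = "http://schema.org/Hotel"
NAME_PROP = "http://schema.org/Hotel/name"  # assumed predicate form

# Simplified N-Quads line: subject, <predicate>, object, <graph> .
QUAD_RE = re.compile(r"^(\S+)\s+<([^>]+)>\s+(.+)\s+<([^>]+)>\s*\.\s*$")


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None for unparsable lines."""
    match = QUAD_RE.match(line.strip())
    return match.groups() if match else None


def count_hotels(lines):
    """Count typed hotel entities and distinct (case-insensitive) hotel names."""
    hotels = set()
    names = set()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue
        subject, predicate, obj, graph = quad
        if predicate == RDF_TYPE and obj == "<" + HOTEL_CLASS + ">":
            # Blank node labels are only unique per page, so key on (graph, subject).
            hotels.add((graph, subject))
        elif predicate == NAME_PROP:
            names.add(obj.strip().lower())
    return len(hotels), len(names)


# Hypothetical sample: the same hotel on its own site and on a booking site.
sample = [
    '_:a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Hotel> <http://hotel-post.example/> .',
    '_:a <http://schema.org/Hotel/name> "Hotel Post" <http://hotel-post.example/> .',
    '_:a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Hotel> <http://booking.example/listing> .',
    '_:a <http://schema.org/Hotel/name> "Hotel Post" <http://booking.example/listing> .',
]
total, unique_names = count_hotels(sample)
print(total, unique_names)
```
]]></preformat>
      <p>In this sample the same hotel is counted twice by type but only once by name, which mirrors the gap between the two figures reported above. The sketch has the same weakness as the name-based query: two different hotels called Hotel Post would also collapse into a single name.</p>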
      <p>If we count all appearances of hotel annotations per
country (of course only for those 23.2% of hotels that have an
annotation for schema.org/postalAddress and schema.org/name
within postalAddress), we come to the conclusion that the
large majority of triples is found in the United States,
followed by Canada, China, Great Britain, Germany and
others. See the detailed listing in Table 2.</p>
      <p>Another interesting aspect of this data set was to find out
which categories of hotels either use schema.org on
their own or are annotated by others. For this purpose
we inspected the appearance of schema.org/Rating, which,
according to the documentation<sup>7</sup>, aims to show the rating on a
numerical scale from one to five, as is done for hotels with
the star rating (*, ..., *****). In our understanding, values
ending in .5, as in 3.5 stars, indicate a higher-level hotel, such
as the ***Superior rating. Again, the
observation is that only about 2.3 million hotel triples make
use of the schema.org/Rating class at all (see also Table 1 for
details). Analysis of these 2.3 million triples showed
a clear tendency for higher-rated hotels to be annotated more
accurately and more frequently; see Table 3 for details.</p>
    </sec>
    <sec id="sec-7">
      <title>4.2 How is schema.org used in the hotel domain?</title>
      <p>This question is answered by taking a detailed look at
which classes and which properties are used to annotate
hotels.</p>
      <p>To find out which classes and properties are used and how
often they appear, we iterated over all hotel triples and all
related properties. We grouped those properties by name
and counted their appearances. With this method we found
37,192,502 triples directly related to hotel triples. The
most frequently used property was schema.org/Hotel/name,
with 5,666,474 occurrences, which is interesting because there are only
about 4.8 million hotel triples. Obviously some hotels were
annotated with two or more names. The second most
frequently used property was schema.org/Hotel/review, with 5,226,132
occurrences, which is not very surprising because, as we will see
in Section 4.3, a large number of hotels are annotated with
schema.org on rating websites. Third in this ranking is
rdf:type, with 4,841,353 appearances; this is the property
that declares an entity to be a hotel, so this number of course
equals the total number of hotel triples. Overall there are
119 different properties in use, which refer either to literals
or to classes. For more details about the ten most frequently used
properties, see Table 4.</p>
      <p>In the documentation for schema.org/Hotel, 62
properties are listed, either from the Hotel class itself or
inherited from LocalBusiness, Organization, Thing and Place,
while our analysis came up with 111 different properties.
This again indicates that annotations contain considerable
inaccuracies. Properties are
misspelled, for example makeOffer instead of makesOffer,
and some properties are even invented out of thin air, like
Hotel/wedding, Hotel/telefax or Hotel/?description?. Even
though almost all properties of schema.org are generally in
use, only 8 properties appear in more than 30% of hotel
triples, and only 20 of the 62 documented properties are used
in more than 1% of the hotel triples.</p>
      <p>To sum up this question, a movement towards semantic
annotation of hotels is observable, but there is still a lot
to be done to reach a sufficient level of annotation.</p>
      <p><sup>7</sup> https://schema.org/Rating</p>
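      <p>The grouping and counting of properties described in this section can be sketched as follows. This is a hedged illustration under stated assumptions, not the queries actually used: it assumes N-Quads input, treats a hotel as the pair of provenance URL and subject, and uses a simplified regular expression for parsing. The function names and the sample lines are hypothetical.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HOTEL_CLASS = "<http://schema.org/Hotel>"

# Simplified N-Quads line: subject, <predicate>, object, <graph> .
QUAD_RE = re.compile(r"^(\S+)\s+<([^>]+)>\s+(.+)\s+<([^>]+)>\s*\.\s*$")


def property_usage(lines):
    """Return (number of hotel entities, Counter of predicate -> triple count)."""
    quads = []
    for line in lines:
        match = QUAD_RE.match(line.strip())
        if match:
            quads.append(match.groups())
    # A hotel is identified by (graph, subject); blank nodes repeat across pages.
    hotels = {(g, s) for s, p, o, g in quads if p == RDF_TYPE and o == HOTEL_CLASS}
    # Count only predicates whose subject is one of the hotel entities.
    counts = Counter(p for s, p, o, g in quads if (g, s) in hotels)
    return len(hotels), counts


def per_hotel(counts, hotel_count):
    """List (property, occurrences, occurrences per hotel), most frequent first."""
    return [(prop, n, n / hotel_count) for prop, n in counts.most_common()]


# Hypothetical sample: one hotel with two names and one invented property.
sample = [
    '_:h <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Hotel> <http://a.example/> .',
    '_:h <http://schema.org/Hotel/name> "Hotel Adler" <http://a.example/> .',
    '_:h <http://schema.org/Hotel/name> "Adler" <http://a.example/> .',
    '_:h <http://schema.org/Hotel/telefax> "123" <http://a.example/> .',
    '_:r <http://schema.org/Rating/ratingValue> "4" <http://a.example/> .',
]
hotel_count, counts = property_usage(sample)
for prop, n, ratio in per_hotel(counts, hotel_count):
    print(prop, n, round(ratio, 2))
```
]]></preformat>
      <p>Applied to the figures reported above, the same ratio for schema.org/Hotel/name is 5,666,474 / 4,841,353, or about 1.17 names per hotel, which is why some hotels must carry two or more names.</p>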
    </sec>
    <sec id="sec-8">
      <title>4.3 Who is using schema.org in the hotel domain?</title>
      <p>With this question we wanted to find out whether it is the
individual hotel that uses schema.org most or best to describe
its properties, or whether it is a third-party page that displays
and annotates hotels for whatever reason (e.g. to provide
the hotel information in order to collect hotel bookings).
While manually browsing through some of the hotels in the
data set during the analysis, we noticed, by
looking at the fourth column of the N-Quads (the
data provenance column), that only a very small number
of hotels showed their own URL as provenance. The vast
majority of the hotels appeared to be annotated by
third-party websites. So we came up with the following hypothesis:
"In the tourism domain, schema.org is predominantly
used by booking or rating websites, barely by hotel
websites themselves".</p>
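      <p>The provenance check described above, comparing the fourth N-Quads column with the hotel's own URL, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the procedure used in the study: it assumes the hotel URL is given by the predicate http://schema.org/Hotel/url, that its value is a plain IRI or an untyped literal, and it approximates the pay-level domain by the host name without a leading "www.". All function names and sample lines are hypothetical.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_PROP = "http://schema.org/Hotel/url"  # assumed predicate form

# Simplified N-Quads line: subject, <predicate>, object, <graph> .
QUAD_RE = re.compile(r"^(\S+)\s+<([^>]+)>\s+(.+)\s+<([^>]+)>\s*\.\s*$")


def host(url):
    """Lower-cased host without a leading 'www.' (rough stand-in for the pay-level domain)."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def provenance_split(lines):
    """Count hotel URL statements found on the hotel's own site vs. on other sites."""
    result = Counter()
    for line in lines:
        match = QUAD_RE.match(line.strip())
        if not match:
            continue
        subject, predicate, obj, graph = match.groups()
        if predicate != URL_PROP:
            continue
        # Strip IRI brackets or literal quotes around the hotel URL.
        hotel_url = obj.strip('<>"')
        if host(hotel_url) == host(graph):
            result["own_site"] += 1
        else:
            result["third_party:" + host(graph)] += 1
    return result


# Hypothetical sample: one self-annotated hotel, two hotels on a rating site.
sample = [
    '_:a <http://schema.org/Hotel/url> <http://www.hotel-adler.example/> <http://www.hotel-adler.example/rooms> .',
    '_:b <http://schema.org/Hotel/url> "http://www.hotel-adler.example/" <http://ratings.example/hotel/42> .',
    '_:c <http://schema.org/Hotel/url> <http://hotel-post.example/> <http://ratings.example/hotel/43> .',
]
split = provenance_split(sample)
print(split["own_site"], split["third_party:ratings.example"])
```
]]></preformat>
      <p>A real pay-level domain lookup would need a public suffix list, since host names such as hotel.example.co.at cannot be reduced correctly by stripping "www." alone.</p>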
      <p>The approach we took to test this hypothesis was
the following: we iterate over all hotel triples found on
booking and rating websites which offer a hotel URL
(schema.org/Hotel/url is used as the hotel URL) and check whether the hotel
website is annotated with schema.org. Further, we use the hotel's
pay-level domain as a unique identifier and note whether a schema.org
annotation was found on the hotel website or not.
Finally, we check whether the specific hotel appears multiple times in
the data set and, if so, note on which other websites and
count the appearances. With this method we get a detailed
overview of how many hotels use schema.org themselves and
which other websites, such as rating or booking sites, use schema.org
to annotate hotels.
      </p>
    </sec>
    <sec id="sec-10">
      <title>5. CONCLUSIONS AND FUTURE WORK</title>
      <p>To conclude this paper we would like to highlight that schema.org
is used in the touristic domain. Hotels are starting to annotate their
websites for more visibility in search engines and to power rich
snippets. Third-party websites such as rating or booking
platforms are also using schema.org more often, sometimes
even excessively, to increase search engine visibility as well
as to make their data more visible and useful for other
developments, such as usage in mobile apps. Nevertheless,
especially for the hotels' own websites, there is much more
that could and should be done when it comes to annotation.
Very often schema.org classes and properties are used
incorrectly. Some properties are invented by the website
developers, and often very important classes and properties, such
as the URL, telephone number, description or geographic
location, are omitted entirely. It appears that the hotel
owners' only concern is to be visible and highly ranked in
web search engines, while they completely ignore what could
be created from their hotel's data, if properly annotated, by
third-party apps such as event platforms or other service-
or information-oriented websites.</p>
      <p>We also wish to highlight that since May 2015 (when schema.org
version 2 was released), a newly introduced schema.org
extension mechanism has been enabling extensions for
various domains. One other idea for future work, which we are
currently addressing, is to create an extension (similar to that
of schema.org) for tourism. As we discovered in the
hotel domain (and this is true for other touristic fields as
well), a lot of important information cannot yet be
annotated with schema.org: for example, the number of beds per
hotel room, the availability of a TV or a whirlpool, etc.
Extending schema.org with terminology for describing hotels, hotel
rooms, amenities and in general any other aspect of
accommodations and their features could really enrich schema.org
and make it even more valuable for tourism.</p>
      <p>As mentioned in Section 4.3, we would also like to survey the
whole data set to find out who is using schema.org most and
to get specific numbers about the distribution of schema.org
on hotels' own websites. As a very interesting part of our
future work, we would like to compare all the findings
described in this paper with the newly published 2014 data
set of the Web Data Commons project, and perhaps even
newer data sets as soon as they are published.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The authors would like to thank the Online Communications
working group for their active discussions and input during
the OC meetings; Christian Bizer and Robert Meusel from
the University of Mannheim for their input on how to work
with the WDC data set; Ontotext for a free license of the
GraphDB repository software, used as our triple store; and
third-party funded projects in which STI is involved, such as
Byte, FITMAN, LDCT, TourPack, OntoHealth, OpenFridge
and EuTravel, for their support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khalili</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>WYSIWYM authoring of structured content based on schema.org</article-title>
          . In
          <source>Web Information Systems Engineering – WISE 2013</source>
          , pages
          <fpage>425</fpage>
          –
          <lpage>438</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Heuristics for fixing common errors in deployed schema.org microdata</article-title>
          . In
          <source>The Semantic Web. Latest Advances and New Domains</source>
          , pages
          <fpage>152</fpage>
          –
          <lpage>168</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Petrovski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>The WebDataCommons microdata, RDFa and microformat dataset series</article-title>
          . In
          <source>The Semantic Web – ISWC 2014</source>
          , pages
          <fpage>277</fpage>
          –
          <lpage>292</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Stavrakantonakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Toma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fensel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Fensel</surname>
          </string-name>
          .
          <article-title>Hotel websites, web 2.0, web 3.0 and online direct marketing: The case of Austria</article-title>
          . In
          <source>Information and Communication Technologies in Tourism 2014</source>
          , pages
          <fpage>665</fpage>
          –
          <lpage>677</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>