Schema.org Usage for Hotels
An Analysis Based on the Web Data Commons Data Set

Elias Kärle* (elias.kaerle@sti2.at)
Anna Fensel* (anna.fensel@sti2.at)
Ioan Toma*† (ioan.toma@sti2.at)
Dieter Fensel* (dieter.fensel@sti2.at)

* Semantic Technology Institute (STI) Innsbruck, University of Innsbruck, Technikerstrasse 21a, 6020 Innsbruck, Austria
† UMIT - University for Health Sciences, Medical Informatics and Technology, Eduard-Wallnöfer-Zentrum 1, 6060 Hall in Tyrol, Austria

ABSTRACT
It has been almost four years since the world's leading search engine operators (Bing, Google, Yahoo! and Yandex) decided to start working on an initiative to enrich web pages with structured data, an initiative known as schema.org. Since then, many web masters and web page designers have started adopting this technology to enrich websites with semantic information. This paper analyzes parts of the structured data in the largest web crawl available and open to the public, the Common Crawl, in order to find out how the tourism branch is using schema.org. Using hotels as a use case, it studies the usage and distribution of schema.org/Hotel, and examines who uses schema.org, how it is applied and whether or not the classes and properties of the vocabulary are used in a syntactically and semantically correct way.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Analysis

Keywords
schema.org, semantic annotation, analysis, hotel, tourism

1. INTRODUCTION
Particularly in the tourism branch, the web has evolved to be the most important tool for representing businesses and distributing information about offers, events and other facts to potential customers. How search engines rank the popularity of certain pages changes frequently over time and is probably the best kept secret of search engine providers, and hence a big challenge for web masters and search engine optimization experts. This makes it even more important to stick to certain recommendations or standards concerning content markup on web pages and to follow initiatives launched by search engine operators, such as schema.org.

On June 2nd 2011 the world's biggest search engines, Google, Bing and Yahoo!, decided to "create and support a common set of schemas for structured data markup on web pages"¹, called schema.org. On November 1st of the same year, the operator of the largest Russian search engine, Yandex, joined the initiative, and together they are constantly working on the refinement and further development of this vocabulary. After these companies announced that the usage of schema.org would lead to significantly better search results and search engine presence and rankings, numerous websites started annotating their content with the vocabulary provided by schema.org.

The Common Crawl² is an organization which crawls the web several times a year and provides the collected archives and data sets to the public for free. Web Data Commons³ is a project started in 2012 by Freie Universität Berlin and the Karlsruhe Institute of Technology. It extracts different types of structured data from the Common Crawl and also provides them to the public for free.

This paper focuses on a data set within Web Data Commons that contains the Microdata, RDFa and Microformat markup used to annotate web page content with schema.org [3]. We present our work on getting a comprehensive overview of the distribution of tourism specific schema.org vocabulary over the web, using the example of the type schema.org/Hotel.

This paper is structured as follows: Section 2 describes related work, Section 3 states the research questions and explains the methodology used to analyze the data, Section 4 presents the findings of the research, and Section 5 concludes the paper.

¹ http://googlewebmastercentral.blogspot.co.at/2011/06/introducing-schemaorg-search-engines.html
² http://commoncrawl.org/
³ http://webdatacommons.org/


2. RELATED WORK
During our work on this project, we came across work which is related to our research. First of all, in the paper by Stavrakantonakis et al. (2013), the authors survey the use of Web 2.0 technologies, content management systems and social channels, and Web 3.0 technologies, as well as the use of semantic web technologies and structured data, on the websites of 2155 hotels in Austria. The outcome of this research is that only 5% of the websites employ semantic technologies and the vast majority of hotels "completely ignore the existence of technologies that could enrich the website content with high level metadata and give machine readable meaning to the presented information" [4].

During the analysis, we came across several cases of wrong usage of schema.org. To detect, analyze and solve those problems, the work of Meusel and Paulheim (2015) [2] serves as a starting point for our further work, in which we wish to give advice on the semantically and syntactically correct usage of schema.org annotations.

When it comes to finding and choosing the most suitable vocabulary, a project worth mentioning is vocab.cc. It is an open source project which allows users to search for linked data vocabularies, based on the data set of the Billion Triple Challenge⁴.

The available schema.org annotations have a commercial exploitation potential, which is currently pursued by several institutions. For example, ONLIM⁵, a current start-up effort of STI Innsbruck, applies annotations to online social media technologies in its product, a social media marketing tool. The start-up already runs pilots, for example with the touristic associations of Innsbruck⁶, and implements semantic dissemination support by adding schema.org support to their website and publishing the touristic data of the regions as linked open data.

Another direction towards widespread real life application of schema.org is the development of tools that help web developers introduce schema.org annotations easily and correctly. One example here is the WYSIWYM project described in Khalili and Auer (2013) [1].

⁴ http://km.aifb.kit.edu/projects/btc-2012/
⁵ http://www.onlim.com
⁶ http://www.innsbruck.info

3. RESEARCH QUESTION AND METHODOLOGY
As a starting point for our analysis, we define the key research questions we want to answer. These are:

     1. How many hotels use schema.org? This question
        triggers an analysis of whether or not it is possible to
        indicate a number of hotels that are annotated with
        schema.org, either on their own website or on third
        party websites.

     2. Is schema.org used syntactically and semantically
        correctly or are there many mistakes? The
        answer to this question surveys the mistakes made
        when it comes to annotating hotels.

     3. Who is using schema.org in the touristic field?
        The last question asks whether or not hotels use
        schema.org on their own web sites and which other
        platforms annotate hotels with schema.org.

As mentioned in the introduction, the primary source of our data was the result of the Common Crawl. Since our analysis should be based only on structured data and, to be more precise, on schema.org, we took advantage of a project called Web Data Commons. This project uses the data from Common Crawl and extracts all sorts of structured data, which are then divided into three main data sets: the Hyperlink Graph, the Web Tables, and the RDFa, Microdata and Microformat data set, on which our interest lies. From this data set we use the "Schema.org Class Specific Data Subsets" and, from those subsets, the one containing all triples related to schema.org/Hotel.

The schema.org/Hotel specific subset of the 2013 crawl was 2.2 GB compressed and 35 GB uncompressed.

Overall, we used 37 different queries. The measurement and analysis of the collected data, which was present in CSV tables, was mostly done by hand or with arithmetic functions in Microsoft Excel, as well as by generating charts and diagrams.

4. RESULTS
In this section we present the results of our analysis of the schema.org/Hotel related structured data in the 2013 corpus of the Web Data Commons project. In Section 3 of the paper, we defined three main questions, which are answered below.

4.1 How many hotels use schema.org?
When trying to find out how many hotels are present in the triple store, one can first query for all triples with predicate rdf:type and object schema.org/Hotel and count them. The output would be about 4,841,000 hotels in the whole data set. But after a little manual inspection it is clearly visible that many hotels are annotated more than once because, for example, they have schema.org annotations on their own website and are also annotated in the listings of one or several booking platforms. Running the same query with the restriction of only counting hotels with unique names results in a reduced number, about 740,000. This number is still not meaningful, because hotels that share a name, such as Hotel Post or Hotel Adler (both very common hotel names in Austria), cannot be told apart.

A solution to that problem would have been to perform a search on unique hotel names combined with locations or addresses, but we observe that less than 75% of the hotels in the data set have a proper annotation for an address. To be more specific, only about 3 million hotels used schema.org/PostalAddress, 2.2 million hotels used schema.org/streetAddress, 2 million hotels used schema.org/postalCode, 1.9 million hotels used schema.org/addressCountry and schema.org/addressRegion, and only 1.1 million hotels used schema.org/name as a country name. See Table 1 for more details.

Table 1: Classes and properties used in the data set
 Class or Property            Usage sum    Percentage
 schema.org/Hotel             4,841,000    100%
 schema.org/PostalAddress     3,035,000    62.7%
 schema.org/addressCountry    1,904,000    39.3%
 schema.org/Country/name      1,125,000    23.2%
 schema.org/addressRegion     1,902,000    39.3%
 schema.org/postalCode        2,011,000    41.5%
 schema.org/streetAddress     2,284,000    47.2%
 schema.org/Rating            2,377,000    49.10%
 schema.org/ratingValue       2,375,000    49.06%

If we count all appearances of annotations of hotels per country (which is, of course, only possible for those 23.2% of hotels which have an annotation for schema.org/PostalAddress and a country schema.org/name within that address), we come to the conclusion that the large majority of triples is found within the United States, followed by Canada, China, Great Britain, Germany and others. See the detailed listing in Table 2.

Table 2: Distribution of hotel triples per country
 Rank    Country    Sum          Percentage
 1       US         1,021,513    90.8%
 2       CA            52,360     4.7%
 3       CN            20,648     1.8%
 4       GB            11,580     1.0%
 5       DE             3,163     0.28%
 6       MX             1,921     0.17%
 7       PR             1,250     0.1%
 8       AR             1,016     0.09%
 9       PH               765     0.07%
 10      IN               699     0.06%
         other         10,085     0.9%

Another interesting aspect of this data set was to find out which categories of hotels either use schema.org on their own or are annotated by others. For this purpose we inspected the appearance of schema.org/Rating, which aims, according to the documentation⁷, to show the rating on a numerical scale from one to five, as is done for hotels with the star rating (*, ..., *****). In our understanding, values with .5, as in 3.5 stars, indicate a higher level hotel, such as the ***Superior rating. But again, the observation is that only about 2.3 million hotel triples even make use of the schema.org/Rating class (see also Table 1 for details). Analysis of these 2.3 million triples showed a clear tendency of higher rated hotels to be annotated more accurately and more frequently; see Table 3 for details.

Table 3: Distribution of ratings among annotated hotels
 Rating    Usage sum    Percentage
 5           866,932    36.5%
 4.5          35,079     1.5%
 4           651,606    27.4%
 3.5          66,208     2.8%
 3           426,925    18%
 2.5          15,476     0.6%
 2           176,800     7.4%
 1.5             941     0.03%
 1           135,958     5.7%

⁷ https://schema.org/Rating

4.2 How is schema.org used in the hotel domain?
This question will be answered by taking a detailed look at which classes are used when it comes to annotating hotels and which attributes are in use.

To find out which classes and properties are used and how often they appear, we iterated over all hotel triples and all related properties. We grouped those properties by name and counted their appearances. With this method we found 37,192,502 triples directly related to hotel triples. The most frequently used property was schema.org/Hotel/name, with 5,666,474 appearances, which is interesting because there are only about 4.8 million hotel triples. Obviously some hotels were annotated with two or more names. The second most frequently used property was schema.org/Hotel/review, with 5,226,132 appearances, which is not very surprising because, as we will see in Section 4.3, a large number of hotels are annotated with schema.org on rating websites. Third place in this ranking goes to rdf:type, with 4,841,353 appearances. This is the property that declares an entity to be a hotel, so its count of course equals the total number of hotel triples. Overall there are 119 different properties in use which either refer to literals or to classes. For more details about the ten most used properties, see Table 4.

Table 4: Usage of properties in hotel triples. (For space reasons http://schema.org/ is shortened to sc:)
 Property name               Usage sum    Percentage
 sc:Hotel/name               5,666,474    117%
 sc:Hotel/review             5,226,132    108%
 rdf:type                    4,841,353    100%
 sc:Hotel/image              3,439,579    71.0%
 sc:Hotel/address            3,035,301    62.7%
 sc:Hotel/aggregateRating    2,723,587    56.3%
 sc:Hotel/rating             2,377,406    49.1%
 sc:Hotel/description        1,934,486    40%
 sc:Hotel/url                1,749,830    36.1%
 sc:Hotel/geo                1,323,333    27.3%

In the documentation for schema.org/Hotel there are 62 properties mentioned, either from the Hotel class itself or inherited from LocalBusiness, Organization, Thing and Place, while our analysis came up with 111 different properties. This again is an indicator that large inaccuracies occur when it comes to annotations. Attributes are written syntactically wrong, for example makeOffer instead of makesOffer, and some properties are even invented out of thin air, like Hotel/wedding, Hotel/telefax or Hotel/?description?. Even though almost all properties of schema.org are generally in use, only 8 properties appear in more than 30% of hotel triples and only 20 of the 62 described properties are used in more than 1% of the hotel triples.
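The counting steps behind Sections 4.1 and 4.2 can be illustrated with a minimal Python sketch. It is not the procedure used for the analysis above, which relied on triple store queries and spreadsheet arithmetic. It assumes the input is a list of N-Quads lines whose fourth element is the provenance URL, uses a simplified regular expression instead of a full N-Quads parser, and treats the names `analyze`, `DOCUMENTED` and `SAMPLE` as well as the example URLs as hypothetical. `DOCUMENTED` is only a small excerpt of property names chosen for illustration, not the full list of 62 documented properties.

```python
import difflib
import re
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HOTEL = "<http://schema.org/Hotel>"

# Simplified N-Quads pattern (assumption): subject <predicate> object <graph> .
QUAD = re.compile(r"^(\S+)\s+<([^>]+)>\s+(.+?)\s+<([^>]+)>\s*\.\s*$")

# Hypothetical excerpt of documented property names, for illustration only.
DOCUMENTED = ["name", "review", "image", "address", "aggregateRating",
              "description", "url", "geo", "telephone", "makesOffer"]


def parse_quads(lines):
    """Yield (subject, predicate, object, graph) for every line that parses."""
    for line in lines:
        match = QUAD.match(line.strip())
        if match:
            yield match.groups()


def analyze(lines, documented=DOCUMENTED):
    quads = list(parse_quads(lines))
    # An entity is identified by its subject node plus the page it came from.
    hotels = {(s, g) for s, p, o, g in quads if p == RDF_TYPE and o == HOTEL}
    usage, names = Counter(), set()
    for s, p, o, g in quads:
        if (s, g) not in hotels or p == RDF_TYPE:
            continue
        prop = p.rsplit("/", 1)[-1]  # e.g. .../Hotel/name -> name
        usage[prop] += 1
        if prop == "name":
            # Naive normalization; language tags and datatypes are not handled.
            names.add(o.strip('"').lower())
    # Usage relative to the number of hotel entities; this can exceed 100%.
    share = {prop: 100.0 * n / len(hotels) for prop, n in usage.items()}
    # Undocumented properties, with the closest documented name if one exists.
    unknown = {}
    for prop in usage:
        if prop not in documented:
            close = difflib.get_close_matches(prop, documented, n=1, cutoff=0.8)
            unknown[prop] = close[0] if close else None
    return {"hotels": len(hotels), "unique_names": len(names),
            "usage": usage, "share": share, "unknown": unknown}


# Made-up sample: one hotel listed on a booking site and on its own site.
G1, G2 = "<http://booking.example/list>", "<http://hotel-post.example/>"
SAMPLE = [
    f"_:h1 <{RDF_TYPE}> {HOTEL} {G1} .",
    f'_:h1 <http://schema.org/Hotel/name> "Hotel Post" {G1} .',
    f"_:h1 <http://schema.org/Hotel/makeOffer> _:o1 {G1} .",
    f"_:h2 <{RDF_TYPE}> {HOTEL} {G2} .",
    f'_:h2 <http://schema.org/Hotel/name> "Hotel Post" {G2} .',
    f'_:h2 <http://schema.org/Hotel/name> "Posthotel" {G2} .',
    f'_:h2 <http://schema.org/Hotel/telefax> "+43 000" {G2} .',
]
report = analyze(SAMPLE)
```

On this sample the name property is counted three times for two hotel entities, which mirrors how a usage share above 100% arises in Table 4. The misspelled makeOffer is matched to makesOffer, while telefax has no close documented counterpart and would be reported as an invented property. The similarity cutoff of 0.8 is an arbitrary choice for this sketch.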


To sum up this question, a movement towards semantic annotation of hotels is observable, but there is still a lot to be done to reach a sufficient level of annotation.

4.3 Who is using schema.org in the hotel domain?
With this question we wanted to find out whether it is the individual hotel that uses schema.org most or best to describe its properties, or whether it is a third party page which displays and annotates hotels for whatever reason (e.g. providing the hotel information in order to collect hotel bookings). After manually browsing through some of the hotels in the data set during the analysis, it appeared, by looking at the fourth column of the N-Quad (the data provenance column), that only a very small number of hotels showed their own URL as provenance. The vast majority of the hotels appeared to be annotated by third party websites. So we came up with a hypothesis which says: "In the tourism domain, schema.org is predominantly used by booking or rating web sites, barely by hotel web sites themselves".

The approach we took to test this hypothesis was the following: iterate over all hotel triples found on booking and rating websites which offer a hotel URL (schema.org/Hotel/url is used as the hotel URL) and check whether the hotel web site is annotated with schema.org. Further, we use the hotel's pay-level domain as a unique identifier and note whether a schema.org annotation was found on the hotel web site or not. Finally, we check whether the specific hotel appears multiple times in the data set and, if so, note on which other web sites it appears and count the appearances. With this method we get a detailed overview of how many hotels use schema.org themselves and which other websites, rating or booking sites, use schema.org to annotate hotels.

5. CONCLUSIONS AND FUTURE WORK
To conclude this paper we would like to highlight that schema.org is used in the touristic domain. Hotels are starting to annotate their web sites for more visibility in search engines and to power rich snippets. Third party web sites such as rating or booking platforms are also using schema.org more often (sometimes even excessively) to increase search engine visibility as well as to make their data more visible and useful for other developments, such as usage in mobile apps. Nevertheless, especially for the hotels' own web sites, there is much more that could and should be done when it comes to annotation. Very often schema.org classes and properties are used incorrectly. Some properties are invented by the website developers, and very important classes and properties (such as the URL, telephone number, description or geographic location) are often omitted entirely. It appears that the hotel owners' only concern is to be visible and highly ranked in web search engines, while they completely ignore what third party apps, such as event platforms or other service- or information-oriented web sites, could create from their hotels' data if it were properly annotated.

We also wish to highlight that since May 2015 (when schema.org version 2⁸ was released), a newly introduced schema.org extension mechanism has been enabling extensions for various domains. One other idea for future work, which we are currently addressing, is to create an extension (similar to that of schema.org) for tourism. As we discovered in the hotel domain (and this is true for other touristic fields as well), a lot of important information cannot yet be annotated with schema.org: for example, the number of beds per hotel room, the availability of a TV or a whirlpool, etc. Extending schema.org with terminology for describing hotels, hotel rooms, amenities and in general any other aspect of accommodations and their features could really enrich schema.org and make it even more valuable for tourism.

As mentioned in Section 4.3, we would also like to survey the whole data set to find out who is using schema.org most and to get specific numbers about the distribution of schema.org on hotels' own websites. As a very interesting part of our future work, we would like to compare all the findings described in this paper with the newly published 2014 data set of the Web Data Commons project, and perhaps even newer data sets as soon as they are published.

⁸ http://schema.org/version/2.0/

Acknowledgments
The authors would like to thank the Online Communications working group⁹ for their active discussions and input during the OC meetings; Christian Bizer and Robert Meusel from the University of Mannheim for their input on how to work with the WDC data set; Ontotext¹⁰ for a free license of the GraphDB repository software, used as our triple store; and the third-party funded projects in which STI is involved, such as Byte, FITMAN, LDCT, TourPack, OntoHealth, OpenFridge and EuTravel, for their support.

⁹ http://oc.sti2.at/
¹⁰ http://ontotext.com/products/ontotext-graphdb/

6. REFERENCES
[1] A. Khalili and S. Auer. WYSIWYM authoring of structured content based on schema.org. In Web Information Systems Engineering–WISE 2013, pages 425–438. Springer, 2013.
[2] R. Meusel and H. Paulheim. Heuristics for fixing common errors in deployed schema.org microdata. In The Semantic Web. Latest Advances and New Domains, pages 152–168. Springer, 2015.
[3] R. Meusel, P. Petrovski, and C. Bizer. The WebDataCommons Microdata, RDFa and Microformat dataset series. In The Semantic Web–ISWC 2014, pages 277–292. Springer, 2014.
[4] I. Stavrakantonakis, I. Toma, A. Fensel, and D. Fensel. Hotel websites, Web 2.0, Web 3.0 and online direct marketing: The case of Austria. In Information and Communication Technologies in Tourism 2014, pages 665–677. Springer, 2013.

