Schema.org Usage for Hotels
An Analysis Based on the Web Data Commons Data Set

Elias Kärle* (elias.kaerle@sti2.at), Anna Fensel* (anna.fensel@sti2.at), Ioan Toma*† (ioan.toma@sti2.at), Dieter Fensel* (dieter.fensel@sti2.at)

* Semantic Technology Institute (STI) Innsbruck, University of Innsbruck, Technikerstrasse 21a, 6020 Innsbruck, Austria
† UMIT - University for Health Sciences, Medical Informatics and Technology, Eduard-Wallnöfer-Zentrum 1, 6060 Hall in Tyrol, Austria

ABSTRACT
It has been almost four years since the world's leading search engine operators (Bing, Google, Yahoo! and Yandex) decided to start working on an initiative to enrich web pages with structured data, an initiative known as schema.org. Since then, many web masters and those responsible for designing web pages have started adopting this technology to enrich websites with semantic information. This paper analyzes parts of the structured data in the largest web crawl available and open to the public, the Common Crawl, in order to find out how the tourism branch is using schema.org. Taking hotels as the use case, it studies the usage and distribution of schema.org/Hotel, examines who uses schema.org, how it is applied and whether or not the classes and properties of the vocabulary are used in a syntactically and semantically correct way.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Analysis

Keywords
schema.org, semantic annotation, analysis, hotel, tourism

1. INTRODUCTION
Particularly in the tourism branch, the web has evolved to be the most important tool for representing businesses and distributing information about offers, events and other facts to potential customers. How search engines rank the popularity of certain pages changes frequently over time and is probably the best kept secret of search engine providers, and hence a big challenge for web masters and search engine optimization experts. This makes it even more important to stick to certain recommendations or standards concerning content markup on web pages and to follow initiatives launched by search engine operators, such as schema.org.

On June 2nd 2011 the world's biggest search engines, Google, Bing and Yahoo!, decided to "create and support a common set of schemas for structured data markup on web pages"¹, called schema.org. On November 1st of the same year, the operator of the largest Russian search engine, Yandex, joined the initiative, and together they are constantly working on the refinement and further development of this vocabulary. After these companies announced that the usage of schema.org would lead to significantly better search results, search engine presence and rankings, numerous websites started annotating their content with the vocabulary provided by schema.org.

The Common Crawl² is an organization which crawls the web several times a year and provides the collected archives and data sets to the public for free. Web Data Commons³ is a project started in 2012 by Freie Universität Berlin and the Karlsruhe Institute of Technology. It extracts different types of structured data from the Common Crawl and also provides them to the public for free.

The focus of this paper is a data set within Web Data Commons containing the Microdata, RDFa and Microformat markup used to annotate web page content with schema.org [3]. In this paper we present our work on getting a comprehensive overview of the distribution of tourism specific schema.org vocabulary over the web, using the example of the type schema.org/Hotel.

This paper is structured as follows: Section 2 describes related work, Section 3 states the research questions and explains the methodology used to analyze the data, Section 4 presents the findings of the research, and Section 5 concludes the paper.

¹ http://googlewebmastercentral.blogspot.co.at/2011/06/introducing-schemaorg-search-engines.html
² http://commoncrawl.org/
³ http://webdatacommons.org/

2. RELATED WORK
During our work on this project, we came across work which is related to our research. First of all, in the paper by Stavrakantonakis et al. (2013), the authors survey the use of Web 2.0 technologies, the use of content management systems and social channels and Web 3.0 technologies, as well as the use of semantic web technologies and structured data on the websites of 2155 hotels in Austria. The outcome of this research is that only 5% of the websites employ semantic technologies and the vast majority of hotels "completely ignore the existence of technologies that could enrich the website content with high level metadata and give machine readable meaning to the presented information" [4].

During the analysis, we came across several cases of wrong usage of schema.org. To detect, analyze and solve those problems, the work of Meusel et al. (2015) [2] serves as a starting point for our further work, in which we wish to give advice towards the semantically and syntactically correct usage of schema.org annotations.

When it comes to finding and choosing the most suitable vocabulary, a project worth mentioning is vocab.cc. It is an open source project which allows users to search for linked data vocabularies, based on the dataset of the Billion Triple Challenge⁴.

The available schema.org annotations have a commercial exploitation potential, which is currently pursued by several institutions. For example, STI Innsbruck's current start-up effort ONLIM⁵ is applying annotations on online social media technologies in its product, a social media marketing tool. The start-up already runs pilots, such as with the touristic associations of Innsbruck⁶, and implements semantic dissemination support by implementing schema.org support on their website and publishing the touristic data of the regions as linked open data.

Another direction towards widespread real life application of schema.org is the development of tools assisting web developers to easily and correctly introduce schema.org annotations. One example here is the WYSIWYM project described in Khalili et al. (2013) [1].

⁴ http://km.aifb.kit.edu/projects/btc-2012/
⁵ http://www.onlim.com
⁶ http://www.innsbruck.info

3. RESEARCH QUESTION AND METHODOLOGY
As a starting point for our analysis, we define the key research questions we want to answer. These are:

1. How many hotels use schema.org? This question triggers an analysis of whether or not it is possible to indicate a number of hotels that are annotated with schema.org, either on their own website or on third party websites.

2. Is schema.org used syntactically and semantically correctly or are there many mistakes? The answer to this question surveys the mistakes made when it comes to annotating hotels.

3. Who is using schema.org in the touristic field? The last question is looking for answers w.r.t. whether or not hotels use schema.org on their own web sites and which other platforms annotate hotels with schema.org.

As mentioned in the introduction, the primary source of our data was the result of the Common Crawl. Since our analysis should be based only on structured data and, to be more precise, on schema.org, we took advantage of a project called Web Data Commons. This project uses the data from the Common Crawl and extracts all sorts of structured data, which are then divided into three main data sets: the Hyperlink Graph, the Web Tables, and the RDFa, Microdata and Microformat dataset, which is the one of interest to us. From this dataset we are using the "Schema.org Class Specific Data-Subsets" and from those subsets the one containing all triples related to schema.org/Hotel.

The schema.org/Hotel specific subset of the 2013 crawl was 2.2GB in compressed and 35GB in uncompressed size.

Overall, we used 37 different queries. The measurement and analysis of the collected data, which was present in CSV tables, was mostly done by hand or with arithmetic functions in Microsoft Excel, as well as by generating charts and diagrams.

4. RESULTS
In the following section we present the results of our analysis of the schema.org/Hotel related structured data in the 2013 corpus of the Web Data Commons project. In Section 3 of the paper, we defined three main questions, which are answered below.

4.1 How many hotels use schema.org?
When trying to find out how many hotels are present in the triple store, one can first query for all triples with predicate rdf:type and object schema.org/Hotel and count them. The output would be about 4.841.000 hotels in the whole data set. But after a little manual inspection it is clearly visible that many hotels are annotated more than once because, for example, they have schema.org annotations on their own website and are also annotated in the listings of one or several booking platforms. Running the same query with the restriction of only counting hotels with unique names results in a reduced number, about 740.000, which is also not meaningful, because different hotels with the same name, for example Hotel Post or Hotel Adler, which are very common hotel names in Austria, are collapsed into a single entry.

A solution to that problem would have been to perform a search on unique hotel names and locations or addresses, but we observe that less than 75% of the hotels in the dataset (about 63%, see Table 1) have a proper annotation for an address. To be more specific, only about 3 million hotels used schema.org/PostalAddress, 2.3 million hotels used schema.org/streetAddress, 2 million hotels used schema.org/postalCode, 1.9 million hotels used schema.org/addressCountry and schema.org/addressRegion, and only 1.1 million hotels used schema.org/Country/name as a country name. See Table 1 for more details.

Table 1: Classes and properties used in the data set

Class or Property            Usage sum    Percentage
schema.org/Hotel             4.841.000    100%
schema.org/PostalAddress     3.035.000    62,7%
schema.org/addressCountry    1.904.000    39,3%
schema.org/Country/name      1.125.000    23,2%
schema.org/addressRegion     1.902.000    39,3%
schema.org/postalCode        2.011.000    41,5%
schema.org/streetAddress     2.284.000    47,2%
schema.org/Rating            2.377.000    49,10%
schema.org/ratingValue       2.375.000    49,06%

If we count all appearances of annotations of hotels per country, which is of course only possible for those 23.2% of hotels which have an annotation for schema.org/PostalAddress including a country name (schema.org/Country/name), we come to the conclusion that the large majority of triples is found within the United States, followed by Canada, China, Great Britain, Germany and others. See the detailed listing in Table 2.

Table 2: Distribution of hotel triples per country

Rank    Country    Sum          Percentage
1       US         1.021.513    90,8%
2       CA         52.360       4,7%
3       CN         20.648       1,8%
4       GB         11.580       1,0%
5       DE         3.163        0,28%
6       MX         1.921        0,17%
7       PR         1.250        0,1%
8       AR         1016         0,09%
9       PH         765          0,07%
10      IN         699          0,06%
        other      10.085       0,9%

Another interesting aspect of this data set was to find out which categories of hotels are either using schema.org on their own or are annotated by others. For this purpose we inspected the appearance of schema.org/Rating, which, according to the documentation⁷, aims to show the rating on a numerical scale from one to five, as is done for hotels with the star rating (*, ..., *****). In our understanding, values with .5, as in 3.5 stars, indicate a higher level hotel, such as for example the ***Superior rating. But again, the observation is that only about 2.4 million hotel triples even make use of the schema.org/Rating class (see also Table 1 for details). Analysis of these triples showed a clear tendency of higher rated hotels to be annotated more accurately and more frequently, see Table 3 for details.

Table 3: Distribution of ratings among annotated hotels

Rating    Usage sum    Percentage
5         866.932      36,5%
4.5       35.079       1,5%
4         651.606      27,4%
3.5       66.208       2,8%
3         426.925      18%
2.5       15.476       0,6%
2         176.800      7,4%
1.5       941          0,03%
1         135.958      5,7%

⁷ https://schema.org/Rating

4.2 How is schema.org used in the hotel domain?
This question will be answered by taking a detailed look at which classes are used when it comes to annotating hotels and which attributes are in use.

To find out which classes and properties are used and how often they appear, we iterated over all hotel triples and all related properties. We grouped those properties by name and counted their appearances. With this method we found 37.192.502 triples directly related to hotel triples. The most frequently used property was schema.org/Hotel/name, used 5.666.474 times, which is interesting, because there are only about 4.8 million hotel triples. Obviously some hotels were annotated with two or more names. The second most frequently used property was schema.org/Hotel/review, used 5.226.132 times, which is not very surprising because, as we will see in Section 4.3, a large number of hotels are annotated with schema.org on rating websites. Third place in this ranking goes to rdf:type, with 4.841.353 appearances, which is the attribute that states that an entity is a hotel; this number of course equals the total number of hotel triples. Overall there are 119 different properties in use which either refer to literals or to classes. For more details about the top ten used properties, see Table 4.

Table 4: Usage of properties in hotel triples. (For space reasons http://schema.org/ is shortened to sc:)

Property name               Usage sum    Percentage
sc:Hotel/name               5.666.474    117%
sc:Hotel/review             5.226.132    108%
rdf:type                    4.841.353    100%
sc:Hotel/image              3.439.579    71,0%
sc:Hotel/address            3.035.301    62,7%
sc:Hotel/aggregateRating    2.723.587    56,3%
sc:Hotel/rating             2.377.406    49,1%
sc:Hotel/description        1.934.486    40%
sc:Hotel/url                1.749.830    36,1%
sc:Hotel/geo                1.323.333    27,3%

In the documentation for schema.org/Hotel there are 62 properties mentioned, from either the Hotel class itself or inherited from LocalBusiness, Organization, Thing and Place, while our analysis came up with 111 different properties. This again is an indicator that large inaccuracies take place when it comes to annotations. Attributes are written syntactically wrong, for example makeOffer instead of makesOffer, and some properties even get invented out of thin air, like Hotel/wedding, Hotel/telefax or Hotel/?description?. Even though almost all properties of schema.org are generally in use, only 8 properties appear in more than 30% of hotel triples and only 20 of the 62 described properties are used in more than 1% of the hotel triples.

To sum up this question, there is an observable movement towards semantic annotation of hotels, but there is still a lot to be done to reach a sufficient level of annotation.

4.3 Who is using schema.org in the hotel domain?
With this question we wanted to find out if it is the individual hotel that uses schema.org most or best to describe its properties, or if it is a third party page which displays and annotates hotels for whatever reason (e.g. for providing the hotel information in order to collect hotel bookings). After manually browsing through some of the hotels in the data set during the analysis, it appeared, by looking at the fourth column of the N-Quads (the data provenance column), that only a very small number of hotels showed their own URL as provenance. The vast majority of the hotels appeared to be annotated by third party websites. So we came up with a hypothesis which says: "In the tourism domain, schema.org is predominantly used by booking or rating web sites, barely by hotel web sites themselves".

The approach we took to test the derived hypothesis was the following: we iterate over all hotel triples found on booking and rating websites which offer a hotel URL (schema.org/Hotel/url is used as the hotel URL) and check whether the hotel web site is schema.org annotated. Further, we use the hotel's pay-level domain as a unique identifier and note whether or not a schema.org annotation was found on the hotel web site. Finally, we check whether the specific hotel appears multiple times in the data set and, if so, note on which other web sites and count the appearances. With this method we get a detailed overview of how many hotels use schema.org themselves and which other websites, such as rating or booking sites, use schema.org to annotate hotels.

5. CONCLUSIONS AND FUTURE WORK
To conclude this paper we would like to highlight that schema.org is used in the touristic domain. Hotels are starting to annotate their web sites for more visibility in search engines and to power rich snippets. Third party web sites such as rating or booking platforms are also using schema.org more often, sometimes even excessively, to increase search engine visibility as well as to make their data more visible and useful for other developments, like the usage in mobile apps. Nevertheless, especially for the hotels' own web sites, there is much more that could and should be done when it comes to annotation. Very often schema.org classes and properties are used incorrectly. Some properties are invented by the website developers, and often very important classes and properties, such as the URL, telephone number, description or geographic location, are omitted entirely. It appears that the hotel owners' only concern is to be visible and highly ranked in the web search engines, but they completely ignore what could be created from their hotel's data, if properly annotated, by third party apps such as event platforms or other service or information oriented web sites.

We also wish to highlight that since May 2015 (when schema.org version 2.0⁸ was released), a newly introduced schema.org extension mechanism has been enabling extensions for various domains. One other idea for future work we are currently addressing is to create an extension (similar to that of schema.org) for tourism. As we discovered in the hotel domain (and this is true for other touristic fields as well), a lot of important information cannot yet be annotated with schema.org: for example, the number of beds per hotel room, the availability of a TV or a whirlpool, etc. Extending schema.org with terminology for describing hotels, hotel rooms, amenities and in general any other aspect of accommodations and their features could really enrich schema.org and make it even more valuable for tourism.

As mentioned in Section 4.3, we would also like to survey the whole data set to find out who is using schema.org most and to get specific numbers about the distribution of schema.org on hotels' own websites. As a very interesting part of our future work, we would like to compare all the findings we described in this paper with the newly published 2014 data set of the Web Data Commons project, and perhaps even newer data sets as soon as they are published.

⁸ http://schema.org/version/2.0/

Acknowledgments
The authors would like to thank the Online Communications working group⁹ for their active discussions and input during the OC meetings, Christian Bizer and Robert Meusel from the University of Mannheim for their input on how to work with the WDC data set, Ontotext¹⁰ for a free license of the GraphDB repository software, used as our triple store, and third-party funded projects in which STI is involved, such as Byte, FITMAN, LDCT, TourPack, OntoHealth, OpenFridge and EuTravel, for their support.

⁹ http://oc.sti2.at/
¹⁰ http://ontotext.com/products/ontotext-graphdb/

6. REFERENCES
[1] A. Khalili and S. Auer. Wysiwym authoring of structured content based on schema.org. In Web Information Systems Engineering–WISE 2013, pages 425–438. Springer, 2013.
[2] R. Meusel and H. Paulheim. Heuristics for fixing common errors in deployed schema.org microdata. In The Semantic Web. Latest Advances and New Domains, pages 152–168. Springer, 2015.
[3] R. Meusel, P. Petrovski, and C. Bizer. The webdatacommons microdata, rdfa and microformat dataset series. In The Semantic Web–ISWC 2014, pages 277–292. Springer, 2014.
[4] I. Stavrakantonakis, I. Toma, A. Fensel, and D. Fensel. Hotel websites, web 2.0, web 3.0 and online direct marketing: The case of austria. In Information and Communication Technologies in Tourism 2014, pages 665–677. Springer, 2013.
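
The counting steps described in Sections 4.1 and 4.3 (counting rdf:type triples with object schema.org/Hotel, counting distinct hotel names, and grouping hotels by the provenance column of the N-Quads) can be illustrated with a minimal sketch. It is not the authors' implementation: the paper reports that the queries were run against a GraphDB triple store, whereas this sketch reads N-Quads lines directly. The function names (parse_quad, literal_value, summarize), the sample quads and the example domains are all hypothetical. The regular expression covers only simple, well-formed quads rather than the full N-Quads grammar, and the host name of the provenance URL is used as a rough stand-in for the pay-level domain, which would in practice require a public suffix list.

```python
import re
from collections import Counter
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HOTEL = "http://schema.org/Hotel"
HOTEL_NAME = "http://schema.org/Hotel/name"  # property naming as used in Table 4

# One RDF term: <iri>, _:blank-node, or "literal" with optional @lang / ^^<datatype>.
# Simplifying assumption: this does not cover the complete N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(r'^\s*' + r'\s+'.join([TERM] * 4) + r'\s*\.\s*$')


def parse_quad(line):
    """Split one N-Quads line into (subject, predicate, object, graph), or None."""
    match = QUAD.match(line)
    if match is None:
        return None
    # Strip the angle brackets from IRIs; leave blank nodes and literals as-is.
    return tuple(t[1:-1] if t.startswith("<") else t for t in match.groups())


def literal_value(term):
    """Return the lexical form of a literal, dropping language tag or datatype."""
    match = re.match(r'^"((?:[^"\\]|\\.)*)"', term)
    return match.group(1) if match else term


def summarize(lines):
    """Return (typed hotel entities, distinct hotel names, hotels per provenance host)."""
    hotels = set()  # (subject, graph): blank node labels are only unique per page
    names = set()   # normalized hotel names
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip lines the simplified parser cannot read
        subject, predicate, obj, graph = quad
        if predicate == RDF_TYPE and obj == HOTEL:
            hotels.add((subject, graph))
        elif predicate == HOTEL_NAME:
            names.add(literal_value(obj).strip().lower())
    # Host of the provenance URL (fourth column) as a stand-in for the pay-level domain.
    hosts = Counter(urlparse(graph).hostname for _, graph in hotels)
    return len(hotels), len(names), hosts


# Hypothetical sample: one hotel annotated on its own site and on a booking site,
# plus a second hotel that only appears on the booking site.
SAMPLE = [
    '_:b1 <%s> <%s> <http://booking.example/at/hotel-post> .' % (RDF_TYPE, HOTEL),
    '_:b1 <%s> "Hotel Post"@en <http://booking.example/at/hotel-post> .' % HOTEL_NAME,
    '_:b1 <%s> <%s> <http://hotel-post.example/> .' % (RDF_TYPE, HOTEL),
    '_:b1 <%s> "Hotel Post" <http://hotel-post.example/> .' % HOTEL_NAME,
    '_:b2 <%s> <%s> <http://booking.example/at/hotel-adler> .' % (RDF_TYPE, HOTEL),
    '_:b2 <%s> "Hotel Adler" <http://booking.example/at/hotel-adler> .' % HOTEL_NAME,
]

if __name__ == "__main__":
    typed, unique_names, hosts = summarize(SAMPLE)
    print("typed hotel entities:", typed)
    print("distinct hotel names:", unique_names)
    print("hotels per provenance host:", dict(hosts))
```

On the sample, the sketch finds three typed hotel entities but only two distinct names, with two of the entities coming from the booking host and one from the hotel's own host. This mirrors, on a toy scale, the gap between the raw count and the unique-name count discussed in Section 4.1, and the provenance grouping used for the hypothesis in Section 4.3. Note that the sketch counts distinct (subject, provenance) pairs, which may differ from a raw triple count if the input contains duplicate lines.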