
An Analysis Based on the Web Data Commons Data Set

Elias Kärle


elias.kaerle@sti


Ioan Toma


ioan.toma@sti

Dieter Fensel

Semantic Technology Institute (STI) Innsbruck, University of Innsbruck, Technikerstrasse 21a, 6020 Innsbruck, Austria

The Common Crawl is an organization which crawls the web several times a year and provides the collected archives and data sets to the public for free. Web Data Commons is a project started in 2012 by Freie Universität Berlin and the Karlsruhe Institute of Technology; it extracts different types of structured data from the Common Crawl and also provides them to the public for free.



It has been almost four years since the world's leading search engine operators (Bing, Google, Yahoo! and Yandex) decided to start working on an initiative to enrich web pages with structured data, an initiative known as schema.org. Since then, many webmasters and web page designers have started adopting this technology to enrich websites with semantic information. This paper analyzes parts of the structured data in the largest web crawl available and open to the public, the Common Crawl, in order to find out how the tourism branch is using schema.org. Using hotels as a use case, it studies the usage and distribution of schema.org/Hotel, and examines who uses schema.org, how it is applied, and whether or not the classes and properties of the vocabulary are used in a syntactically and semantically correct way.

schema.org, semantic annotation, analysis, hotel, tourism

The focus of this paper is a data set within Web Data Commons containing Microdata, RDFa and Microformat data used to annotate web page content with schema.org [3]. We present our work on obtaining a comprehensive overview of the distribution of tourism-specific schema.org vocabulary over the web, using the example of the type schema.org/Hotel.

This paper is structured as follows: Section 2 describes related work, Section 3 states the research questions and explains the methodology used to analyze the data, Section 4 presents the findings of the research, and Section 5 concludes the paper.

1 http://googlewebmastercentral.blogspot.co.at/2011/06/introducingschemaorg-search-engines.html
2 http://commoncrawl.org/
3 http://webdatacommons.org/

2. RELATED WORK

Several existing works are related to our research. Stavrakantonakis et al. (2013) survey the use of Web 2.0 technologies, content management systems, social channels and Web 3.0 technologies, as well as the use of semantic web technologies and structured data, on the websites of 2155 hotels in Austria. They find that only 5% of the websites employ semantic technologies and that the vast majority of hotels "completely ignore the existence of technologies that could enrich the website content with high level metadata and give machine readable meaning to the presented information" [4].

During the analysis, we came across several cases of incorrect usage of schema.org. To detect, analyze and solve those problems, the work of Meusel and Paulheim (2015) [2] serves as a starting point for our further work, in which we intend to give advice on the semantically and syntactically correct usage of schema.org annotations.

When it comes to finding and choosing the most suitable vocabulary, a project worth mentioning is vocab.cc, an open source project that allows users to search for linked data vocabularies based on the data set of the Billion Triple Challenge4.

The available schema.org annotations have commercial exploitation potential, which is currently being pursued by several institutions. For example, ONLIM5, a current start-up effort of STI Innsbruck, applies annotations to online social media technologies in its product, a social media marketing tool. The start-up already runs pilots, for instance with touristic associations such as that of Innsbruck6, and provides semantic dissemination support by implementing schema.org on their website and publishing the touristic data of the regions as linked open data.

Another direction towards widespread real-life application of schema.org is the development of tools that help web developers introduce schema.org annotations easily and correctly. One example is the WYSIWYM project described in Khalili and Auer (2013) [1].

3. RESEARCH QUESTION AND METHODOLOGY

As a starting point for our analysis, we define the key research questions we want to answer:

1. How many hotels use schema.org? This question triggers an analysis of whether it is possible to state the number of hotels that are annotated with schema.org, either on their own website or on third party websites.
2. Is schema.org used syntactically and semantically correctly, or are there many mistakes? Answering this question involves surveying the mistakes made when annotating hotels.

4 http://km.aifb.kit.edu/projects/btc-2012/
5 http://www.onlim.com
6 http://www.innsbruck.info

3. Who is using schema.org in the touristic field? This last question asks whether hotels use schema.org on their own websites and which other platforms annotate hotels with schema.org.

As mentioned in the introduction, the primary source of our data is the Common Crawl. Since our analysis is based only on structured data and, more precisely, on schema.org, we took advantage of the Web Data Commons project. This project uses the data from the Common Crawl and extracts all sorts of structured data, which are then divided into three main data sets: the Hyperlink Graph, the Web Tables, and the RDFa, Microdata and Microformat data set, which is the one of interest here. From this data set we use the "Schema.org Class Specific Data Subsets", and from those subsets the one containing all triples related to schema.org/Hotel.

The schema.org/Hotel-specific subset of the 2013 crawl was 2.2 GB in compressed and 35 GB in uncompressed size. Overall, we used 37 different queries. The measurement and analysis of the collected data, which was available as CSV tables, was mostly done by hand or with arithmetic functions in Microsoft Excel, as well as by generating charts and diagrams.

4. RESULTS

In this section we present the results of our analysis of the schema.org/Hotel-related structured data in the 2013 corpus of the Web Data Commons project. In Section 3, we defined three main questions, which are answered below.

4.1 How many hotels use schema.org?

To find out how many hotels are present in the triple store, one can first query for all triples with predicate rdf:type and object schema.org/Hotel and count them. The result is about 4,841,000 hotels in the whole data set. A little manual inspection, however, clearly shows that many hotels are annotated more than once because, for example, they have schema.org annotations on their own website and are also annotated in the listings of one or several booking platforms. Restricting the same query to hotels with unique names reduces the number to about 740,000, which is not meaningful either, because different hotels sharing the same name, for example Hotel Post or Hotel Adler, which are very common hotel names in Austria, are then no longer distinguished. A solution to that problem would have been to search on unique combinations of hotel name and location or address, but we observe that less than 75% of the hotels in the data set have a proper annotation for an address. More specifically, only about 3 million hotels used schema.org/Address, 2.2 million hotels used schema.org/street, 2 million hotels used schema.org/zip, 1.9 million hotels used schema.org/land and schema.org/Region, and only 1.1 million hotels used schema.org/name as a country name. See Table 1 for more details.
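The two counts discussed above (all typed hotel nodes versus distinct hotel names) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the queries used in the study, which ran against a triple store. The regular expression is a simplified N-Quads reader, the predicate IRI `http://schema.org/Hotel/name` follows the property naming used in this paper, and the sample quads and `.example` hosts are invented.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HOTEL = "http://schema.org/Hotel"
HOTEL_NAME = "http://schema.org/Hotel/name"  # assumed predicate IRI

# Simplified N-Quads pattern: subject, predicate, object, graph (provenance URL).
# It does not cover every corner of the N-Quads grammar.
QUAD = re.compile(r'^(\S+)\s+<([^>]+)>\s+(.+?)\s+<([^>]+)>\s*\.\s*$')
LITERAL = re.compile(r'^"(.*)"(?:@[\w-]+|\^\^<[^>]+>)?$')


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None for unparsable lines."""
    match = QUAD.match(line.strip())
    return match.groups() if match else None


def literal_value(obj):
    """Return the text of a literal without quotes, language tag or datatype."""
    match = LITERAL.match(obj)
    return match.group(1) if match else None


def count_hotels(lines):
    """Return (number of typed hotel nodes, number of distinct hotel names)."""
    hotels = set()
    names = {}
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue
        subj, pred, obj, graph = quad
        key = (graph, subj)  # blank node labels are only unique per page
        if pred == RDF_TYPE and obj == f"<{HOTEL}>":
            hotels.add(key)
        elif pred == HOTEL_NAME:
            value = literal_value(obj)
            if value:
                names[key] = value.strip().lower()
    unique_names = {names[key] for key in hotels if key in names}
    return len(hotels), len(unique_names)


# Invented sample: "Hotel Post" is annotated on a listing page and on its own site.
SAMPLE = [
    f'_:n1 <{RDF_TYPE}> <{HOTEL}> <http://booking.example/list> .',
    f'_:n1 <{HOTEL_NAME}> "Hotel Post" <http://booking.example/list> .',
    f'_:n2 <{RDF_TYPE}> <{HOTEL}> <http://booking.example/list> .',
    f'_:n2 <{HOTEL_NAME}> "Hotel Adler" <http://booking.example/list> .',
    f'_:n1 <{RDF_TYPE}> <{HOTEL}> <http://hotel-post.example/> .',
    f'_:n1 <{HOTEL_NAME}> "Hotel Post"@de <http://hotel-post.example/> .',
]

if __name__ == "__main__":
    print(count_hotels(SAMPLE))
```

In the sample, the same hotel appears once on a listing page and once on its own site, so the node count exceeds the name count. Conversely, two different hotels that share a name would collapse into one, which is the weakness of name-based deduplication noted above.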

If we count all appearances of hotel annotations per country, naturally only for those 23.2% of hotels that have an annotation for schema.org/postalAddress and schema.org/name within postalAddress, we find that the large majority of triples is found within the United States, followed by Canada, China, Great Britain, Germany and others. See Table 2 for a detailed listing.

Another interesting aspect of this data set is which categories of hotels either use schema.org on their own or are annotated by others. For this purpose we inspected the appearance of schema.org/Rating, which, according to the documentation7, expresses a rating on a numerical scale from one to five, as is done for hotels with the star rating (*, ..., *****). In our understanding, values ending in .5, as in 3.5 stars, indicate a higher level hotel, for example the ***Superior rating. Again, however, only about 2.3 million hotel triples make use of the schema.org/Rating class at all (see also Table 1 for details). Analysis of these 2.3 million triples showed a clear tendency for higher rated hotels to be annotated more accurately and more frequently; see Table 3 for details.

4.2 How is schema.org used in the hotel domain?

This question will be answered by taking a detailed look at which classes are used when it comes to annotating hotels and which attributes are in use.

To find out which classes and properties are used and how often they appear, we iterated over all hotel triples and all related properties. We grouped those properties by name and counted their appearances. With this method we found 37,192,502 triples directly related to hotel triples. The most frequently used property was schema.org/Hotel/name, which appeared 5,666,474 times. This is interesting, because there are only about 4.8 million hotel triples; obviously some hotels were annotated with two or more names. The second most frequently used property was schema.org/Hotel/review, with 5,226,132 appearances, which is not very surprising because, as we will see in Section 4.3, a large number of hotels are annotated with schema.org on rating websites. Third in this ranking is rdf:type, with 4,841,353 appearances. This is the property that declares a node to be a hotel, so its count naturally equals the total number of hotel triples. Overall there are 119 different properties in use, which refer either to literals or to classes. For more details about the ten most used properties, see Table 4.

7 https://schema.org/Rating
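The grouping and counting step can be illustrated with a small Python sketch. It assumes the quads have already been parsed into `(subject, predicate, object, graph)` tuples; the predicate IRIs follow the naming used in this paper and the sample data is invented, so this is not the procedure actually run on the triple store.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HOTEL = "http://schema.org/Hotel"
NAME = "http://schema.org/Hotel/name"      # assumed predicate IRI
REVIEW = "http://schema.org/Hotel/review"  # assumed predicate IRI


def property_histogram(quads):
    """Count how often each predicate occurs on nodes typed as schema.org/Hotel."""
    quads = list(quads)
    # A node is identified by (graph, subject), since blank node labels repeat across pages.
    hotel_nodes = {(g, s) for s, p, o, g in quads if p == RDF_TYPE and o == HOTEL}
    return Counter(p for s, p, o, g in quads if (g, s) in hotel_nodes)


# Invented sample: the first hotel carries two names, so name outnumbers rdf:type.
QUADS = [
    ("_:h1", RDF_TYPE, HOTEL, "http://a.example/"),
    ("_:h1", NAME, "Hotel Post", "http://a.example/"),
    ("_:h1", NAME, "Post Hotel Innsbruck", "http://a.example/"),
    ("_:h1", REVIEW, "_:r1", "http://a.example/"),
    ("_:r1", "http://schema.org/Review/reviewBody", "Nice", "http://a.example/"),
    ("_:h2", RDF_TYPE, HOTEL, "http://b.example/"),
    ("_:h2", NAME, "Hotel Adler", "http://b.example/"),
]

if __name__ == "__main__":
    for predicate, count in property_histogram(QUADS).most_common():
        print(count, predicate)
```

Because one sample hotel has two names, the name property is counted more often than rdf:type, mirroring the observation above that name appears more often than there are hotel triples.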

The documentation for schema.org/Hotel lists 62 properties, either from the Hotel class itself or inherited from LocalBusiness, Organization, Thing and Place, while our analysis came up with 111 different properties. This again indicates that annotations contain substantial inaccuracies. Property names are misspelled, for example makeOffer instead of makesOffer, and some properties are simply invented, like Hotel/wedding, Hotel/telefax or Hotel/"description". Even though almost all properties of schema.org are generally in use, only 8 properties appear in more than 30% of hotel triples and only 20 of the 62 documented properties are used in more than 1% of the hotel triples.
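One possible way to separate misspellings from invented properties is to compare each observed property name with the documented ones and look for a close match. The sketch below uses Python's standard `difflib` module for this. It is an assumption about how such a check could be done, not the method used in this study, and the `DOCUMENTED` list is a small hand-picked subset rather than the full set of 62 properties.

```python
import difflib

# Hand-picked subset of documented property names, for illustration only.
DOCUMENTED = [
    "name", "url", "telephone", "description", "address",
    "review", "makesOffer", "faxNumber", "aggregateRating", "geo",
]


def audit_properties(observed, documented, cutoff=0.8):
    """Map each undocumented property to its closest documented name, or None.

    A suggestion points to a probable misspelling; None points to a property
    that was probably invented by the page author.
    """
    known = sorted(set(documented))
    report = {}
    for prop in sorted(set(observed) - set(known)):
        matches = difflib.get_close_matches(prop, known, n=1, cutoff=cutoff)
        report[prop] = matches[0] if matches else None
    return report


if __name__ == "__main__":
    observed = ["name", "makeOffer", "telefax", "wedding"]
    for prop, suggestion in audit_properties(observed, DOCUMENTED).items():
        print(prop, "->", suggestion or "no documented counterpart")
```

The similarity cutoff of 0.8 is an arbitrary choice. A lower value would produce more suggestions, but also more false matches between unrelated property names.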

To sum up, a movement towards semantic annotation of hotels is observable, but there is still a lot to be done to reach a sufficient level of annotation.

4.3 Who is using schema.org in the hotel domain?

With this question we wanted to find out whether it is the individual hotel that uses schema.org most or best to describe its properties, or whether it is a third party page which displays and annotates hotels for its own purposes (e.g. providing hotel information in order to collect hotel bookings). While manually browsing through some of the hotels in the data set during the analysis, we saw from the fourth column of the N-Quads (the data provenance column) that only a very small number of hotels showed their own URL as provenance. The vast majority of the hotels appeared to be annotated by third party websites. We therefore formulated the following hypothesis: "In the tourism domain, schema.org is predominantly used by booking or rating websites, barely by hotel websites themselves".
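The provenance check described above, comparing the URL a hotel states for itself with the page on which the annotation was found, can be sketched as follows. This is a minimal illustration with invented data. The helper `naive_domain` is hypothetical and simply keeps the last two host labels, which is only a rough stand-in for a real pay-level domain: it is wrong for suffixes such as co.at or co.uk, where a public suffix list would be needed.

```python
from urllib.parse import urlsplit


def naive_domain(url):
    """Return the last two labels of the host name, or None if there is no host."""
    host = urlsplit(url).hostname
    if not host:
        return None
    return ".".join(host.lower().split(".")[-2:])


def classify_provenance(records):
    """Count annotations found on the hotel's own site versus third party sites.

    records is an iterable of (hotel_url, provenance_url) pairs; pairs without
    a usable host name are skipped.
    """
    counts = {"own_site": 0, "third_party": 0}
    for hotel_url, provenance_url in records:
        hotel = naive_domain(hotel_url)
        source = naive_domain(provenance_url)
        if hotel is None or source is None:
            continue
        counts["own_site" if hotel == source else "third_party"] += 1
    return counts


# Invented sample pairs: (value of the hotel's url property, N-Quads provenance URL).
RECORDS = [
    ("http://www.hotel-post.example/", "http://hotel-post.example/rooms"),
    ("http://www.hotel-adler.example", "http://booking.example/h/adler"),
    ("http://hotel-post.example", "http://ratings.example/hotel-post"),
    ("not a url", "http://booking.example/h/unknown"),
]

if __name__ == "__main__":
    print(classify_provenance(RECORDS))
```

Only the first sample pair counts as an annotation on the hotel's own site; the last pair is skipped because the hotel URL has no host name.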

The approach we took to test this hypothesis was the following. First, we iterate over all hotel triples found on booking and rating websites which offer a hotel URL (schema.org/Hotel/url is used as the hotel URL) and check whether the hotel website is annotated with schema.org. Further, we use the hotel's pay-level domain as a unique identifier and note whether or not a schema.org annotation was found on the hotel website. Finally, we check whether the specific hotel appears multiple times in the data set and, if so, note on which other websites it appears and count the appearances. With this method we get a detailed overview of how many hotels use schema.org themselves and which other websites, rating or booking sites, use schema.org to annotate hotels.

5. CONCLUSIONS AND FUTURE WORK

To conclude this paper, we would like to highlight that schema.org is used in the touristic domain. Hotels are starting to annotate websites for more visibility in search engines and to power rich snippets. Third party websites such as rating or booking platforms are also using schema.org more often, sometimes even excessively, to increase search engine visibility as well as to make their data more visible and useful for other developments, like the usage in mobile apps. Nevertheless, especially for the hotels' own websites, there is much more that could and should be done when it comes to annotation. Very often schema.org classes and properties are used incorrectly. Some properties are invented by the website developers, and very important classes and properties, such as the URL, telephone number, description or geographic location, are often omitted entirely. It appears that the hotel owners' only concern is to be visible and highly ranked in the web search engines, but they completely ignore what could be created from their hotel's data, if properly annotated, by third party apps such as event platforms or other service- or information-oriented websites.

We also wish to highlight that since May 2015 (when schema.org version 2 was released), a newly introduced schema.org extension mechanism has been enabling extensions for various domains. One other idea for future work, which we are currently addressing, is to create an extension (similar to that of schema.org) for tourism. As we discovered in the hotel domain (and this is true for other touristic fields as well), a lot of important information cannot yet be annotated with schema.org: for example, the number of beds per hotel room, the availability of a TV or a whirlpool, etc. Extending schema.org with terminology for describing hotels, hotel rooms, amenities and, in general, any other aspect of accommodations and their features could really enrich schema.org and make it even more valuable for tourism.

As mentioned in Section 4.3, we would also like to survey the whole data set to find out who is using schema.org most and to get specific numbers about the distribution of schema.org on hotels' own websites. As a very interesting part of our future work, we would like to compare all the findings described in this paper with the newly published 2014 data set of the Web Data Commons project, and perhaps even newer data sets as soon as they are published.

Acknowledgments

The authors would like to thank the Online Communications working group9 for their active discussions and input during the OC meetings; Christian Bizer and Robert Meusel from the University of Mannheim for their input on how to work with the WDC data set; Ontotext10 for a free license of the GraphDB repository software, used as our triple store; and third-party funded projects in which STI is involved, such as Byte, FITMAN, LDCT, TourPack, OntoHealth, OpenFridge and EuTravel, for their support.

References

[1] A. Khalili and S. Auer. Wysiwym authoring of structured content based on schema.org. In Web Information Systems Engineering - WISE 2013, pages 425-438. Springer, 2013.
[2] R. Meusel and H. Paulheim. Heuristics for fixing common errors in deployed schema.org microdata. In The Semantic Web. Latest Advances and New Domains, pages 152-168. Springer, 2015.
[3] R. Meusel, P. Petrovski, and C. Bizer. The WebDataCommons microdata, RDFa and microformat dataset series. In The Semantic Web - ISWC 2014, pages 277-292. Springer, 2014.
[4] I. Stavrakantonakis, I. Toma, A. Fensel, and D. Fensel. Hotel websites, web 2.0, web 3.0 and online direct marketing: The case of Austria. In Information and Communication Technologies in Tourism 2014, pages 665-677. Springer, 2013.