<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>What the Adoption of schema.org Tells About Linked Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Group Data and Web Science, University of Mannheim, Germany</institution>
        </aff>
      </contrib-group>
      <fpage>84</fpage>
      <lpage>90</lpage>
      <abstract>
        <p>schema.org is a common data markup schema, pushed by large search engine providers such as Google, Yahoo!, and Bing. To date, a few hundred thousand web site providers have adopted schema.org annotations embedded in their web pages via Microdata. While Microdata and Linked Open Data are not 100% the same, there are some commonalities which make a joint analysis of the two valuable and reasonable. Profiling this data reveals interesting insights into the ways a schema is used (and also misused) on a large scale. Furthermore, adding a temporal dimension to the analysis can make the interaction between the adoption and the evolution of the standard visible. In this paper, we discuss our group's efforts to profile the corpus of deployed schema.org data, and suggest which lessons learned from that endeavour can be transferred to the Linked Open Data community.</p>
      </abstract>
      <kwd-group>
        <kwd>Microdata</kwd>
        <kwd>schema.org</kwd>
        <kwd>Linked Open Data</kwd>
        <kwd>Data Profiling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Microdata is a mechanism for embedding meta information in HTML. Among
its competitors, i.e., microformats and RDFa, it is currently the most widely deployed
annotation format [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Microdata is embedded directly into the HTML code. That is, different
sections of the HTML code are marked up, or annotated, with schema classes and
properties. As an example, consider an address in an HTML page that is marked up
with Microdata.</p>
      <p>A parser like Any23<sup>4</sup> can extract the knowledge encoded in this HTML page,
e.g., to RDF. The corresponding RDF triples in this example are:</p>
      <preformat>_:1 a &lt;http://schema.org/PostalAddress&gt; .
_:1 &lt;http://schema.org/name&gt; "Data and Web Science Group" .
_:1 &lt;http://schema.org/addressLocality&gt; "Mannheim" .
_:1 &lt;http://schema.org/postalCode&gt; "68131" .
_:1 &lt;http://schema.org/addressCountry&gt; "Germany" .</preformat>
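      <p>The extraction step can be illustrated with a minimal sketch. The HTML snippet below is hypothetical markup, written only to be consistent with the triples listed above, and the class FlatMicrodataParser and the function extract_triples are illustrative names, not part of Any23. The sketch assumes a single, non-nested item whose property values are plain text; a real Microdata parser handles far more cases.</p>
      <preformat><![CDATA[
```python
from html.parser import HTMLParser

# Hypothetical markup, reconstructed to match the triples in the text.
HTML = """
<div itemscope itemtype="http://schema.org/PostalAddress">
  <span itemprop="name">Data and Web Science Group</span>
  <span itemprop="addressLocality">Mannheim</span>
  <span itemprop="postalCode">68131</span>
  <span itemprop="addressCountry">Germany</span>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Extracts triples from a single, non-nested Microdata item."""

    def __init__(self):
        super().__init__()
        self.triples = []         # (subject, predicate, object) strings
        self.vocab = ""           # namespace derived from the itemtype
        self.pending_prop = None  # itemprop whose text has not been seen yet

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and "itemtype" in attrs:
            itemtype = attrs["itemtype"]
            self.vocab = itemtype.rsplit("/", 1)[0] + "/"
            # The item itself becomes a blank node with a type statement.
            self.triples.append(("_:1", "a", f"<{itemtype}>"))
        elif "itemprop" in attrs:
            self.pending_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.pending_prop and text:
            predicate = f"<{self.vocab}{self.pending_prop}>"
            self.triples.append(("_:1", predicate, f'"{text}"'))
            self.pending_prop = None


def extract_triples(html):
    parser = FlatMicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.triples


if __name__ == "__main__":
    for s, p, o in extract_triples(HTML):
        print(s, p, o, ".")
```
]]></preformat>
      <p>Because the item carries no identifier of its own, the sketch assigns it the blank node label _:1, which mirrors the use of blank nodes in the extracted triples.</p>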
      <p>
        Although it is possible to use arbitrary vocabularies for Microdata markup,
schema.org<sup>5</sup> has become a de facto standard, with other vocabularies playing
only minor roles. This is mainly due to the fact that schema.org is pushed by
major search engines, i.e., Google, Yahoo!, Bing, and Yandex. While schema.org
can be used with both Microdata and RDFa, the latter is only rarely deployed
[
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ].<sup>6</sup> In its latest release (1.93), schema.org comprises 620 classes and 890
properties.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Microdata and schema.org vs. Linked Open Data</title>
      <p>
        As shown above, Microdata markup can be parsed into RDF, and thus, like
RDFa, provides a means to publish Linked Data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, there are a few
essential differences between Microdata and Linked Data as it is commonly
used.
      </p>
      <p>Microdata, as shown above, is embedded into HTML. Since HTML
documents themselves are trees, the graph encoded by each RDF document
extracted from Microdata is a set of trees. This means that Microdata is less
expressive than pure RDF, which also allows for arbitrary directed graphs (containing
cycles, and even more advanced constructs such as reification).</p>
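      <p>The structural restriction can be made concrete with a small check. The following sketch assumes that the resource-to-resource edges of an extracted document are given as (subject, object) pairs, with literal-valued properties left out; the function name is_forest is illustrative. It tests the two conditions that distinguish a set of trees from an arbitrary directed graph: no node has more than one incoming edge, and there is no cycle.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def is_forest(edges):
    """Return True if the directed edges form a set of trees.

    edges: iterable of (subject, object) pairs between resource nodes.
    """
    children = defaultdict(list)
    parent_count = defaultdict(int)
    nodes = set()
    for s, o in edges:
        children[s].append(o)
        parent_count[o] += 1
        nodes.update((s, o))

    # In a tree, every node has at most one parent.
    if any(count > 1 for count in parent_count.values()):
        return False

    # Walk down from every root. With at most one parent per node, any
    # node that is never reached must lie on or below a cycle.
    roots = [n for n in nodes if parent_count[n] == 0]
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children[node])
    return seen == nodes
```
]]></preformat>
      <p>For instance, is_forest([("item", "address"), ("item", "offer")]) holds, whereas a graph in which two items point to the same node, or in which two nodes point to each other, is valid RDF but fails the check.</p>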
      <p>In 2006, Tim Berners-Lee formulated four principles for publishing Linked
Data:<sup>7</sup></p>
      <list list-type="order">
        <list-item><p>Use URIs as names for things,</p></list-item>
        <list-item><p>Use HTTP URIs so that people can look up those names,</p></list-item>
        <list-item><p>When someone looks up a URI, provide useful information, using the
standards (RDF*, SPARQL), and</p></list-item>
        <list-item><p>Include links to other URIs, so that they can discover more things.</p></list-item>
      </list>
      <p>In the example above, blank nodes were used for identifying concepts annotated
on the web site, following the W3C recommendation.<sup>8</sup> Thus, in that case, the first
two principles are not met: blank node identifiers are volatile, which makes them
unsuitable as names for things, and they cannot be resolved by HTTP requests.</p>
      <p><sup>4</sup> https://any23.apache.org/</p>
      <p><sup>5</sup> http://schema.org</p>
      <p><sup>6</sup> Furthermore, although possible, schema.org is rarely used in Linked Open Data.</p>
      <p><sup>7</sup> http://www.w3.org/DesignIssues/LinkedData.html</p>
      <p><sup>8</sup> http://www.w3.org/TR/microdata/</p>
      <p>As far as links to other resources are concerned, schema.org foresees the
property sameAs<sup>9</sup>, which, however, is currently deployed by less than 0.02% of
all Microdata providers.<sup>10</sup> Thus, in its current form, schema.org Microdata only
fulfills the third out of the four principles, if we accept Microdata as a standard
on equal terms with RDFa.</p>
      <p>In 2010, Berners-Lee added the five-star scheme to the same document,
defining five levels of Linked Open Data:</p>
      <list list-type="simple">
        <list-item><p>* Available on the web (whatever format) but with an open licence, to be Open
Data</p></list-item>
        <list-item><p>** Available as machine-readable structured data (e.g. excel instead of image
scan of a table)</p></list-item>
        <list-item><p>*** As (2), plus non-proprietary format (e.g. CSV instead of excel)</p></list-item>
        <list-item><p>**** All the above, plus: use open standards from W3C (RDF and SPARQL)
to identify things, so that people can point at your stuff</p></list-item>
        <list-item><p>***** All the above, plus: link your data to other people's data to provide
context</p></list-item>
      </list>
      <p>While the license issue is tricky (many web pages do not come with an explicit
license for their content), the first four stars are fulfilled by Microdata.</p>
      <p>These reflections show that while Microdata is not essentially the same as
Linked Open Data, there are a few commonalities which make it reasonable
to take a closer look at both together, and to see which lessons learned can be
transferred from one to the other.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Standard Conformance in Linked Open Data and schema.org Microdata</title>
      <p>
        schema.org provides a formal schema definition for the classes and properties
to be used for annotating data. This allows for analyzing the conformance to
standard, schema, and best practices, as has been done in various places
for Linked Open Data as well [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we compare the conformance of
schema.org Microdata to that of Linked Open Data, based on a corpus of RDF extracted
from Microdata in the Web Data Commons project<sup>11</sup>. That corpus comprises
data from 398,542 pay-level domains (PLDs), with a total of 6.4 billion triples.
      </p>
      <p>The analysis covers the following aspects:</p>
      <list list-type="bullet">
        <list-item><p>Usage of wrong namespaces, such as http://shema.org/</p></list-item>
        <list-item><p>Usage of undefined types</p></list-item>
        <list-item><p>Usage of undefined properties</p></list-item>
        <list-item><p>Confusion of datatype properties and object properties</p></list-item>
        <list-item><p>Datatype range violations (e.g., using a number instead of a date)</p></list-item>
        <list-item><p>Property domain violations (i.e., using properties with subjects of a class
not contained in the domain definition)</p></list-item>
        <list-item><p>Object property range violations (i.e., using properties with objects of a class
not contained in the range definition)</p></list-item>
      </list>
      <p><sup>9</sup> http://schema.org/sameAs</p>
      <p><sup>10</sup> http://webdatacommons.org/structureddata/2014-12/stats/stats.html</p>
      <p><sup>11</sup> http://webdatacommons.org/</p>
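      <p>Several of these checks can be sketched in a few lines. The example below is not the procedure used in the study: it works on a tiny, hand-written schema excerpt (the sets KNOWN_TYPES, DATATYPE_PROPERTIES, and OBJECT_PROPERTIES are assumptions for illustration), it treats every namespace other than http://schema.org/ as wrong, and it leaves out domain and range checks, which would require the class hierarchy.</p>
      <preformat><![CDATA[
```python
SCHEMA_NS = "http://schema.org/"

# Tiny, hand-written excerpt; the real schema.org definition is far larger.
KNOWN_TYPES = {"PostalAddress", "Organization", "Product"}
DATATYPE_PROPERTIES = {"name", "postalCode", "addressLocality"}
OBJECT_PROPERTIES = {"address", "manufacturer"}


def check_triple(predicate, obj, obj_is_literal):
    """Return a list of violation labels for a single triple.

    predicate: full predicate URI, or "a" for a type statement.
    obj: object URI, blank node label, or literal value.
    obj_is_literal: True if the object is a literal.
    """
    problems = []
    if predicate == "a":
        if not obj.startswith(SCHEMA_NS):
            problems.append("wrong namespace")
        elif obj[len(SCHEMA_NS):] not in KNOWN_TYPES:
            problems.append("undefined type")
        return problems

    if not predicate.startswith(SCHEMA_NS):
        problems.append("wrong namespace")
        return problems

    name = predicate[len(SCHEMA_NS):]
    if name in OBJECT_PROPERTIES:
        if obj_is_literal:
            problems.append("object property with literal value")
    elif name in DATATYPE_PROPERTIES:
        if not obj_is_literal:
            problems.append("datatype property with resource value")
    else:
        problems.append("undefined property")
    return problems
```
]]></preformat>
      <p>With this excerpt, a type statement using http://shema.org/Product is reported as a wrong namespace, and the property http://schema.org/address with a plain string as its value is reported as an object property with a literal value.</p>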
      <p>By comparing the numbers generated from the RDF corpus to those in
similar works conducted on Linked Open Data, we could identify a few interesting
differences:</p>
      <list list-type="bullet">
        <list-item><p>The usage of undefined elements (i.e., types and properties) is less frequent
for Microdata than for LOD. For Microdata, 5.6% and 9.7% of the
documents use undefined types and properties, respectively, as opposed to 38.8% and 72.4%
of all documents in LOD.</p></list-item>
        <list-item><p>The confusion of datatype and object properties is much more frequent in Microdata.
In our corpus, 24.35% of all documents use object properties with a literal
object, compared to only 8% in LOD.</p></list-item>
        <list-item><p>Datatype ranges are violated more than twice as often in Microdata as in
LOD (12.1% vs. 4.6%). In both cases, date formats are the most frequent
problem.</p></list-item>
        <list-item><p>Domain violations and object property range violations occur slightly more
often in Microdata (3.2% of all documents) than in LOD (2.4% of all
documents).</p></list-item>
      </list>
      <p>
        Generally, we can see that in absolute numbers, Microdata has a surprisingly
high conformance to the schema. The only deviation is the datatype and object
property confusion, which can be partly attributed to the way triples are
generated from the annotated HTML code. If an object property is used without any
subordinate elements, a triple is extracted which contains the subsequent text as
a string literal. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], it has been argued that parsing the contents into a blank
node might be the better option here. We have shown in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that this strategy is
feasible, and that in many cases, it is even possible to assign a meaningful type
to the new blank node.
      </p>
      <p>There are quite a few possible reasons for the quality of schema.org
Microdata often being higher than that of LOD. First, there is a direct economic
incentive for providing correct Microdata (as it leads to better visibility in search
engine results). Second, schema.org is well documented, with lots of
ready-to-use and easy-to-adapt examples on the documentation web pages. Third, content
management systems (CMS) such as Drupal have adopted schema.org<sup>12</sup> and,
with millions of installations, serve as multipliers. Finally, schema.org
continuously evolves, taking up users' suggestions, which may also lead to "misused"
constructs becoming officially allowed in later releases.</p>
      <p><sup>12</sup> https://www.drupal.org/project/schemaorg</p>
    </sec>
    <sec id="sec-4">
      <title>4 Co-Evolution of the schema.org A-box and T-box</title>
      <p>To examine these hypotheses quantitatively, we have started a diachronic analysis of
deployed schema.org Microdata. To that end, we have taken three snapshots of
Microdata, i.e., from 2012, 2013, and 2014, and look at the corresponding schema
definitions which were valid at the respective point in time. With that data
collection, we are able to analyze the following:</p>
      <list list-type="bullet">
        <list-item><p>The overall convergence (or divergence) of the data, i.e., whether instances
of a given class are described more uniformly over time</p></list-item>
        <list-item><p>Top-down (i.e., schema first) effects, such as the adoption rate of new features
in the schema.org standard definition, or the adoption rate of deprecations</p></list-item>
        <list-item><p>Bottom-up (i.e., data first) effects, such as standard elements being
introduced after they have been used "unofficially"</p></list-item>
      </list>
      <p>For measuring convergence, we represent the set of properties used by
an instance of a class as a bit vector, and compute the heterogeneity of each class
as the normalized entropy rate across all the instances of the class. An increase
in the entropy rate reflects a growing heterogeneity, while a decrease reflects a
growing homogeneity.</p>
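      <p>The text does not spell out the exact formula, so the following is only one plausible reading, stated as an assumption: each instance is mapped to a bit vector over the properties of its class, the Shannon entropy of the distribution of distinct bit vectors is computed, and the result is divided by the maximum entropy possible for the given number of instances. The function name normalized_entropy and the choice of normalization are illustrative.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def normalized_entropy(instances, properties):
    """Heterogeneity of one class, from 0 (all alike) to 1 (all differ).

    instances: list of sets, each holding the property names used by one
               instance of the class.
    properties: ordered list of all properties considered for the class.
    """
    if len(instances) < 2:
        return 0.0
    # One bit per property: 1 if the instance uses it, else 0.
    vectors = [tuple(int(p in inst) for p in properties) for inst in instances]
    counts = Counter(vectors)
    total = len(vectors)
    # Shannon entropy of the distribution of distinct bit vectors.
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    # Normalize by the maximum entropy reachable with `total` instances.
    return entropy / math.log2(total)
```
]]></preformat>
      <p>Under this reading, a class whose instances all use the same properties scores 0, and a class in which every instance uses a different combination scores 1, so a falling value between snapshots indicates homogenization.</p>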
      <p>Globally, the entropy drops drastically, so that we can diagnose a strong
homogenization of the data. Looking at class-specific differences, we can see
that the adoption of schema.org by content management systems (CMS) such as
Drupal has led to an increase in homogeneity (e.g., for classes like Website or
Blog). The same holds for classes promoted by Google Rich Snippets<sup>13</sup>, such as
Product and Offer, which lead to better search engine visibility and are also extensively
documented with ready-to-use examples.</p>
      <p>For top-down processes, we compared the usage of classes and properties
before and after they were officially included in the schema.org standard. We
found that new classes and properties are often adopted very slowly. There are
even domains covered by schema.org for which no deployed data can be found at
all, such as the medical domain, where a larger vocabulary was bulk-integrated
into schema.org.<sup>14</sup> Deprecations, on the other hand, quickly lead to elements in
deployed data being replaced by the newly recommended versions.</p>
      <p>
        For bottom-up processes, we also compare the usage of classes and properties
before and after their official announcement. We can observe a mild influence on
new classes and properties (i.e., they are occasionally used before becoming
official). This is particularly visible in data-vocabulary.org, the deprecated predecessor
of schema.org, which is still the second most deployed Microdata vocabulary [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>There is, however, a strong influence on domains and ranges of existing
properties: here, properties are often used in a different context than intended, and
this is likely to be reflected in later versions of the standard.</p>
      <p>In addition to simply using undefined classes and properties, there is an
official mechanism in schema.org for this purpose, i.e., the extension mechanism.<sup>15</sup> This mechanism
allows users to create subclasses and subproperties of existing classes and
properties on the fly. Overall, this mechanism is only rarely used, and we could not
measure any impact of classes and properties that were first used as extensions and
became part of the standard later.</p>
      <p><sup>13</sup> https://developers.google.com/structured-data/rich-snippets/</p>
      <p><sup>14</sup> http://blog.schema.org/2012/06/health-and-medical-vocabulary-for.html</p>
      <p><sup>15</sup> http://schema.org/docs/extension.html</p>
      <p>
        It is particularly noteworthy that schema.org is in a state of constant
evolution. In the past three years, more than 25 revisions have been published.
Together with the fact that both top-down and bottom-up processes can be
observed, i.e., the schema influences the deployed data and the deployed data
influences the schema, we can see that there is a
co-evolution of data (A-box) and schema (T-box) in schema.org, which is rarely
observed for Linked Open Data. For comparison, FOAF<sup>16</sup>, the most widely
deployed LOD vocabulary [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], has undergone only six revisions within the past
eight years.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and Outlook</title>
      <p>In this paper, we have contrasted the usage of schema.org Microdata and Linked
Open Data. We have looked at standard conformance for Microdata, taking both
a synchronic and a diachronic perspective. We have identified several drivers of
Microdata adoption:</p>
      <p><bold>Business Incentive.</bold> Whenever there is a direct incentive to use Microdata,
such as a better listing in search engine results, we can observe that the
schema is followed more strictly.</p>
      <p><bold>Availability of Documentation.</bold> Ready-to-adapt examples increase standard
conformance.</p>
      <p><bold>Implementation in Widely Deployed Platforms.</bold> Content management
systems like Drupal use schema.org, which leads to large-scale usage
(sometimes even without the website owner being aware of it) and, at the same time, a greater
homogeneity of the provided data.</p>
      <p><bold>Standard Flexibility.</bold> If the standard is violated in the same way at a larger
scale, this may hint at a shortcoming in the standard. schema.org frequently
adapts to violations by either declaring them valid, or by offering solutions
for the gaps that are filled by the violations.</p>
      <p>The first two findings are also supported by the rare adoption of schema.org's
extension mechanism. There is hardly a business incentive to define a class via
the extension mechanism (if it manages to parse it correctly, a data consumer
is likely to treat it exactly the same as the defined super-type), and, in contrast
to the rest of the schema, the extension mechanism is only described in a rather
far-off section of the schema.org documentation pages.</p>
      <p>For Linked Open Data, we can state that things are slightly different. There
are no major drivers comparable to the big search engine companies promoting schema.org,
which have led to a dramatic increase of available schema.org Microdata (the
adoption of schema.org has grown by roughly a factor of 10 during the past
two years). Documentation is often scarce and/or at a deep technical level, and
data is provided by technology evangelists rather than commercial providers.
Last, schema flexibility is not as strongly observable as for schema.org, as the
comparison with FOAF shows.</p>
      <p><sup>16</sup> http://www.foaf-project.org/</p>
      <p>In summary, although Microdata and Linked Open Data have some essential
differences, they are similar enough to make a comparison feasible and
reasonable. Some of the factors identified as driving the quick adoption of schema.org
Microdata are also interesting findings which could be adopted to further push
the adoption of Linked Open Data.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The author would like to thank Robert Meusel and Christian Bizer for their
valuable ideas, input and analyses, part of which are reflected in this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meusel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Muhleisen, H.,
          <string-name>
            <surname>Schuhmacher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Volker, J.:
          <article-title>Deployment of rdfa, microdata, and microformats on the web{a quantitative analysis</article-title>
          .
          <source>In: The Semantic Web{ISWC</source>
          <year>2013</year>
          , pp.
          <volume>17</volume>
          {
          <fpage>32</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked Data – The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passant</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Weaving the pedantic web</article-title>
          .
          <source>In: Linked Data on the Web</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Meusel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Heuristics for fixing errors in deployed schema.org microdata</article-title>
          .
          <source>In: Extended Semantic Web Conference</source>
          (
          <year>2015</year>
          ), to appear
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Meusel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petrovski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The webdatacommons microdata, rdfa and microformat dataset series</article-title>
          . In: ISWC (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          :
          <article-title>Analyzing Schema.org</article-title>
          . In: International Semantic Web Conference (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Schmachtenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , H.:
          <article-title>Adoption of the linked data best practices in different topical domains</article-title>
          . In: International Semantic Web Conference (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>