<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Survey on Challenges for Entity Retrieval in Web Markup Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ran Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Besnik Fetahu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ujwal Gadiraju</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Dietze</string-name>
          <email>dietzeg@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center, Leibniz Universität Hannover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Embedded markup based on Microdata, RDFa, and Microformats has become prevalent on the Web and constitutes an unprecedented data source. RDF statements extracted from markup are highly redundant and frequently contain errors; co-references are very frequent, yet explicit links between them are missing. We present a preliminary analysis of the challenges associated with markup data in the context of entity retrieval. We analyze four main factors: (i) co-references, (ii) redundancy, (iii) inconsistencies, and (iv) accessibility of information in the case of URLs. We conclude with general guidelines on how to handle such challenges when dealing with embedded markup data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Markup annotations embedded in HTML pages have become prevalent on the Web,
building on standards such as RDFa, Microdata and Microformats, and driven by
initiatives such as schema.org, a joint effort led by Google, Yahoo!, Bing and Yandex.</p>
      <p>
        The Web Data Commons [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a recent initiative investigating a Web crawl of 2.01
billion HTML pages from over 15 million pay-level-domains (PLDs), found that 30%
of all pages already contain some form of embedded markup, resulting in a corpus of
20.48 billion RDF quads. The scale of the data suggests potential for a range of tasks,
such as entity retrieval, knowledge base population, or entity summarization. Initial case
studies investigate, for instance, the scope of bibliographic data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or learning resources
metadata [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] sourced from Web markup.
      </p>
      <p>However, entities described through extracted facts from embedded markup have
different characteristics compared to Linked Data. For example, co-references are very
frequent, yet not linked through explicit statements. In addition, statements are highly
redundant and often limited to a small set of highly popular predicates, complemented
by a long tail of less frequent predicates. Moreover, the extracted data contains a wide
variety of syntactical and semantic errors.</p>
      <p>
        In this work, we present an overview of challenges and attributes of markup data.
We focus on the entity retrieval use case which represents one of the prerequisites for
tasks like entity summarization [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or knowledge base augmentation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Web Markup Attributes and Challenges</title>
      <p>In this section, we present a preliminary analysis of some of the key attributes of markup
data, and the challenges that are associated with them. We focus on the use of markup
data for entity retrieval.</p>
      <p>We conduct a preliminary analysis for two entity types extracted from the WDC
2014 dataset, Movie and Book, since they correspond to a considerable amount of
data and are easy to validate manually. For each type, we randomly pick 10 entity queries,
selected from the Wikipedia entity names in the respective categories. We set up
a Lucene<sup>4</sup> index for the different subsets of markup data for the two types. We retrieve
the top-500 entities by querying on the predicate sc:name with BM25 as our query
similarity model. We manually evaluate the corresponding result sets and find that
there are no relevant entities beyond the top-50 (see Table 1 for details). In more detail, we
analyze the following aspects of Web markup data.</p>
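      <p>The retrieval step described above can be approximated without a full Lucene setup. The following is a minimal sketch, assuming each entity description is a dictionary whose hypothetical name field holds its sc:name value. It uses a common Okapi BM25 formulation with k1 = 1.2 and b = 0.75 and a whitespace tokenizer, so scores may differ in detail from those produced by Lucene.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple analyzer)."""
    return text.lower().split()


def bm25_rank(query, entities, k=500, k1=1.2, b=0.75):
    """Rank entities by the BM25 score of the query against their 'name' field.

    Returns up to k (score, entity) pairs with a positive score, best first.
    The 'name' key is an illustrative stand-in for the sc:name predicate.
    """
    docs = [tokenize(e.get("name", "")) for e in entities]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of names containing each term.
    df = Counter(term for d in docs for term in set(d))

    scored = []
    for entity, doc in zip(entities, docs):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, entity))
    # Sort on the score only, so entity dictionaries are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```
]]></preformat>
      <p>Calling bm25_rank with a query string and a list of entity dictionaries returns the ranked candidates; the cut-off k mirrors the top-500 setting used in our analysis.</p>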
      <p>Entity Co-References. We investigate the occurrence of co-references, namely that of
resource descriptions that are about the same entity. In contrast to traditional RDF
datasets, statements extracted from markup form sparsely connected graphs. In our
case, none of the relevant entities are interlinked explicitly. From our evaluation for
the types Movie and Book, we find the following number of relevant entities for our
random query set.</p>
      <p>Table 1. Columns: type, q1, q2, q3, q4, q5, q6, q7, q8, q9, q10.</p>
      <p>Information Redundancy. Here we deal with the redundancy aspect of statements
from Web markup. From the relevant entities for our query set, we have a set of 42
distinct predicates. The distribution of statements for each predicate of the two types
is shown in Figure 1. The distribution in the case of Movie is highly skewed, with the
top predicate accounting for more than 30% of the statements. In the case of Book we
see a more uniform spread of statements across the predicates.</p>
      <p>(a) Movie (b) Book</p>
      <p>Fig. 1: Statement distribution across predicates for types Movie and Book.</p>
      <p>In terms of redundancy, we notice that the amount of redundant information varies
across predicates. Figure 2 shows the proportion of redundant information for the
different predicates for the two entity types. It is worth noting that for Object predicates (e.g.
sc:actor) we consider a statement to be a duplicate if the corresponding sc:label
values are the same, whereas in the case of Datatype predicates we simply match the
corresponding values.</p>
      <p><sup>4</sup> https://lucene.apache.org</p>
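      <p>The duplicate criterion described above can be sketched as follows. This is a minimal illustration, assuming statements are given as (predicate, value) pairs and that object values are dictionaries carrying a hypothetical label key that stands in for the label of the referenced resource. The normalization (lowercasing and whitespace stripping) is an assumption made for the example and not a description of the exact matching used in our analysis.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def comparison_key(value):
    """Return the string used to decide whether two values are duplicates.

    Object values (dicts) are compared by their label, datatype values
    (plain literals) by the literal itself.
    """
    if isinstance(value, dict):
        value = value.get("label", "")
    return str(value).strip().lower()


def redundancy_by_predicate(statements):
    """Map each predicate to the fraction of its statements that are redundant.

    A statement counts as redundant when an earlier statement with the same
    predicate already produced the same comparison key.
    """
    total = defaultdict(int)
    seen = defaultdict(set)
    redundant = defaultdict(int)
    for predicate, value in statements:
        key = comparison_key(value)
        total[predicate] += 1
        if key in seen[predicate]:
            redundant[predicate] += 1
        else:
            seen[predicate].add(key)
    return {p: redundant[p] / total[p] for p in total}
```
]]></preformat>
      <p>The non-redundant share of a predicate is then one minus the returned fraction.</p>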
      <p>We note that due to the inherently different nature of the types Movie and Book,
redundant information is emphasized for different predicates and with a varying degree
of redundancy. This is expected given that the range of these predicates varies across
types. For example, for type Movie, we note that the predicate sc:director has
a high degree of redundancy, since the director of an entity of type Movie is a very
frequently provided, i.e. important, property. Hence, it occurs very often in the result
set (and, naturally, with the same value).</p>
      <p>In terms of non-redundancy, we note that the highest degree of information comes
from predicates whose ranges are literals. Specifically, the highest degree of
non-redundancy is found for sc:headline with 53% and for sc:inLanguage with 57%,
for type Movie and Book, respectively.</p>
      <p>Legend: redundant, non-redundant. (a) Movie (b) Book</p>
      <p>Fig. 2: Redundancy of information across predicates for types Movie and Book.</p>
      <p>
        Information Inconsistency. Meusel and Paulheim [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] provide a preliminary analysis
on some of the most common errors encountered in markup data. The errors range from
syntactical to semantic errors, i.e. typos or misinterpretation and misuse of vocabulary
terms. In this work, we follow their guidelines and measure such violations.
      </p>
      <p>In terms of semantic errors, we focus on the violation of Object and Datatype
properties. On average, we find up to 3% and 4% semantic errors for Movie and Book,
respectively. For other errors such as range violations, we focus on a few predicates that have
specific requirements on the literal value, such as datePublished, and find 20% and
45% wrong values for Movie and Book, respectively.</p>
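      <p>A range check for date-valued predicates can be sketched as below. This is a minimal example, assuming that a valid datePublished literal is an ISO 8601 calendar date (YYYY-MM-DD), optionally with a time part, in line with the Date and DateTime types expected by schema.org. The function names and the list of accepted layouts are illustrative assumptions, not the checks used to obtain the figures above.</p>
      <preformat><![CDATA[
```python
from datetime import datetime

# Accepted layouts (an assumption): a calendar date, or a date with a time part.
ACCEPTED_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S")


def is_valid_date_literal(value):
    """Return True if the literal parses with one of the accepted layouts."""
    text = value.strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def violation_rate(values):
    """Fraction of literals that fail the date check (0.0 if there are none)."""
    values = list(values)
    if not values:
        return 0.0
    invalid = sum(1 for v in values if not is_valid_date_literal(v))
    return invalid / len(values)
```
]]></preformat>
      <p>Free-text values such as a month name followed by a year fail this check, which is the kind of violation counted for datePublished.</p>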
      <p>Information Accessibility. An interesting aspect we note in Figure 1 is the high
proportion of statements that contain links to external resources. Considering that such
statements account for approximately 12% of markup data (on average for both types),
we assess the accessibility of such links considering two aspects: (i) the HTTP response
from the URLs (i.e. HTTP status code 200 in case the URL is accessible), and (ii) the
content type (e.g. text/html).</p>
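      <p>The two accessibility aspects can be probed per URL as sketched below. This is a minimal example using the urllib module from the Python standard library. The function names probe and classify, the use of HEAD requests and the 10-second timeout are assumptions made for illustration; some servers answer HEAD requests differently from GET requests, so results may deviate from a full download.</p>
      <preformat><![CDATA[
```python
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Issue a HEAD request and return (status, content_type).

    Returns (code, None) for HTTP error responses and (None, None) when the
    URL is malformed or the host cannot be reached.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get_content_type()
    except urllib.error.HTTPError as err:
        return err.code, None
    except (ValueError, OSError):
        # ValueError: malformed URL; OSError covers URLError and timeouts.
        return None, None


def classify(status, content_type):
    """Label a probe result as 'unavailable', 'html', 'image' or 'other'."""
    if status != 200:
        return "unavailable"
    if content_type == "text/html":
        return "html"
    if content_type is not None and content_type.startswith("image/"):
        return "image"
    return "other"
```
]]></preformat>
      <p>Applying classify to the result of probe for every URL-valued statement yields per-predicate counts of available and unavailable links, split by content type.</p>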
      <p>In Figure 3 we show the availability of such URLs for both types. We note that there
is a significant difference between the entity types, where for type Movie the ratio of
unavailable links is higher in comparison to Book.</p>
      <p>In terms of content type, the majority of URLs are of content type text/html. In the
case of Movie we have 65% text/html and 35% jpg/png; similarly, for Book we have
70% text/html and 30% jpg/png. It is worth noting that even though there is a high
amount of information available in the form of markup data, for tasks like entity
retrieval or entity summarization, the information available behind these URLs presents a
non-negligible set of resources which cannot be ignored.</p>
      <p>(a) Movie (b) Book</p>
      <p>Fig. 3: HTTP status codes for the different predicates for type Movie and Book.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this work, we presented a preliminary analysis of some of the attributes of Web
markup data, and the challenges associated with them. Since obtaining a ground truth
for marked-up facts is costly, we focused our investigation on a preliminary subset of
the Web Data Commons.</p>
      <p>When attempting tasks like entity retrieval, which we envisage as one of the
prerequisites for more advanced tasks such as knowledge base augmentation, markup data
poses several key challenges. Our study underlines that there are inherently different
characteristics of markup data when compared to traditional Linked Data and
knowledge graphs, and that these characteristics vary heavily even across different entity types.</p>
      <p>
        Some of our preliminary findings include: (i) Statements about the same entities
sourced from different PLDs, such as descriptions, keywords, ratings etc., usually
contain information complementary to traditional Linked Data and knowledge bases (e.g.
DBpedia); (ii) Predicates that encode dates usually contain a large number of errors, and
thus require plausibility checks and further quality assurance metrics; (iii) When
considering standard IR indexes, retrieval beyond the top-50 does not provide any additional
relevant information in most cases for entity-centric queries. While these findings are
based on a small subset of markup data, focused on two specific types, related case
studies (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) have uncovered similar challenges in other domains.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Heuristics for fixing common errors in deployed schema.org microdata</article-title>
          .
          <source>In ESWC</source>
          , volume
          <volume>9088</volume>
          , pages
          <fpage>152</fpage>
          -
          <lpage>168</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Petrovski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>The webdatacommons microdata, rdfa and microformat dataset series</article-title>
          .
          <source>In The Semantic Web-ISWC</source>
          <year>2014</year>
          , pages
          <fpage>277</fpage>
          -
          <lpage>292</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D.</given-names>
            <surname>Ritze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lehmberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Oulabi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Profiling the potential of web tables for augmenting cross-domain knowledge bases</article-title>
          .
          <source>In WWW. ACM</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Gadiraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          .
          <article-title>Analysing structured scholarly data embedded in web pages</article-title>
          .
          <source>Apr</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D.</given-names>
            <surname>Taibi</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          .
          <article-title>Towards embedded markup of learning resources on the web: An initial quantitative analysis of lrmi terms usage</article-title>
          . In
          <string-name>
            <given-names>J.</given-names>
            <surname>Bourdeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nkambou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , editors,
          <source>Companion Proceedings of the 25th International Conference on World Wide Web (WWW2016)</source>
          , pages
          <fpage>513</fpage>
          -
          <lpage>517</lpage>
          . ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Gadiraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fetahu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          .
          <article-title>Entity summarisation on structured web markup</article-title>
          .
          <source>In ESWC: Satellite Events</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>