<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Studying Linked Data Accessibility Healthiness for the Long Tail of the Data Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johannes Frey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marvin Hofer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Hellmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KMI Competence Center @ Institute for Applied Informatics, Leipzig University</institution>
          ,
          <addr-line>Germany, https://kmi-leipzig.de</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Knowledge Integration and Linked Data Technologies (KILT/AKSW) / DBpedia Association @ Institute for Applied Informatics</institution>
          ,
          <addr-line>Leipzig, Germany</addr-line>
          ,
          <institution>https://aksw.org/Groups/KILT</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>In this paper, we explore the accessibility healthiness of Linked Data within the context of the Data Web, focusing on the long tail of data sources. Unlike the traditional web, Linked Data lacks a driving infrastructure to enhance accessibility, leading to negative impacts on data consumers, adoption, and the creation of large-scale infrastructures. We investigate challenges posed by issues such as link rot, unparseable content, downtime, and timeouts that hinder effective access to Linked Data. The study involves a novel Linked Data client that logs debugging information, providing insights into the efficiency and effectiveness of accessing Linked Data. The research also includes discussions on the methods and approach taken, IRI identity mismatch handling, crawling results, and Linked Data parsing statistics. Through extensive analysis of HTTP response status codes and accessibility issues, the paper not only quantifies common problems but also proposes methods for enhancing Linked Data accessibility in order to retrieve consistent sub-graphs from the Data Web.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>Accessibility Issues</kwd>
        <kwd>Web Crawling</kwd>
        <kwd>Long Tail</kwd>
        <kwd>Data Quality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Linked Data was proposed as a way of creating a giant global graph (GGG) of interconnected
data on the Web. The idea was that by using shared vocabularies and standard protocols,
disparate datasets could be connected to form a single, unified resource that could be navigated
and explored in a way similar to the WWW with web browsers.</p>
      <p>The evolution of the data itself, but also of the hosting environment and circumstances, leads to
accessibility issues such as link rot, unparseable content, downtime, or timeouts when trying to
access it. In the traditional web, Google’s search engine offers specific browsing entry points
based on detailed information needs and enhances accessibility by caching sites and incentivizing
proper syntax and standards (e.g. schema.org). Such a driving infrastructure is missing for the
Web of Data. Accessibility issues negatively affect consumers and adoption, but also
hinder the creation of large-scale infrastructures for better usability of the Data Web, impacting
areas such as data management (e.g. entity indexing, sameAs link clustering) and preservation
(archiving, crawling) of Linked Data. While Linked Data crawls that also explore accessibility
aspects have already been the subject of previous work, we focused exclusively on Linked Data
according to Tim Berners-Lee’s original Design Issues<sup>1</sup>, thus excluding SPARQL endpoints and
RDF dataset dumps, but allowing anything that follows the rules, in particular embedded JSON-LD.
Specifically, we collected crawling seed IRIs from several million domains to assess
the long tail, which deserves better exploration, especially in the context of establishing
the aforementioned usability infrastructure. Furthermore, we implemented a novel Linked Data
client which extensively logs and stores debugging information as a necessary first step to study
the efficiency and effectiveness of accessing Linked Data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Linked Data crawls have been the subject of several research efforts typically aimed at building
large collections of Linked Open Data resources from the Web. In this section, we summarize, to
the best of our knowledge, the most notable Linked Data and RDF crawling efforts with public
access.</p>
      <p>
        The most recent iteration of the Billion Triple Challenge, the BTC-2019 dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
involved crawling over 2.6 million documents from 394 pay-level domains of the Web. The
authors retrieved more than 2.1 billion unique quads and 256 million distinct triples. The crawl
was performed using LD-Spider 1.3 in a breadth-first manner based on 442 URLs from
DyLDO.
      </p>
      <p>
        The Dynamic Linked Data Observatory (DyLDO) project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has performed weekly crawls
of Linked Open Data since 2012. Based on a fixed seed list (containing 95,737 URIs from 652
domains), it dereferences RDF data in a first round. Subsequently, all discovered IRIs are used
to perform another crawl, persisting the retrieved RDF data, HTTP headers, and redirects.
The crawler applies breadth-first search and performs 2-5 more rounds, where each round
dereferences all unseen IRIs of the next hop based on a frontier list from the previous round.
DyLDO was developed to assess the temporal stability of Linked Data resources; the availability
and functioning of the Linked Data mechanisms as well as data evolution for particular RDF
resources can be analyzed over time. Although the authors used techniques aiming at covering
a wide cross-section of domains in the initial seed, the design is focused on completing the
crawl under limited time and resource constraints, and the seed list captures a state of the
LOD cloud that is over 12 years old.
      </p>
      <p>
        LOD Laundromat [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a tool that crawls and cleans RDF dumps. The seed was created
using a combination of manual and automated methods. The authors added dump URLs using
the CKAN API, e.g. from Datahub, but also added several datasets for which they knew the location
of the dumps. The dump files are retrieved with a custom, fault-tolerant crawler/parser,
and VoID triples in the dumps are used to (recursively) discover new datasets and their dump
file locations. Moreover, users can submit URLs pointing to either an RDF dump or a VoID
description of a dataset. However, as of August 2023, the service URL http://lodlaundromat.org
no longer hosted any data or project-related information, and the GitHub page states that it
has been closed for maintenance since July 2021. Fortunately, a subset of the data (650K RDF documents
summing up to 524 GB of compressed HDT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] data and over 3.3 billion triples) is available in
LOD-a-lot [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which has been used as a basis for this work.
      </p>
      <p>
        Web Data Commons [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a large-scale project that aims to extract structured data from
the Common Crawl. Since 2012, the project has released several structured data dumps based on
semantic annotations in the crawled HTML files, including Microdata, Microformats, RDFa, as
well as Schema.org from embedded JSON-LD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Out of 1.5 billion URLs with semantic markup
(of over 3 billion URLs from the October 2022 crawl) that were hosted on 14 million domains,
over 19 billion typed entities and 86 billion triples were extracted<sup>2</sup>.
      </p>
      <p>
        DBpedia Archivo [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is an augmented ontology archive that automatically crawls, discovers,
versions, and archives ontologies. In order to discover OWL and SKOS ontologies, it performs
follow-your-nose Linked Data on (transitive) dependencies/imports in ontologies from previous
iterations of Archivo crawls, but also employs vocabulary usage reports in VoID files, ontology
repositories, and user inclusion requests. As such it tries to crawl the Web of Ontologies, a
subset of the LOD cloud, and a study in 2022 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] has shown that it improves the accessibility of
the terminological context (property &amp; class IRIs) for 80% of the triples in LOD-a-lot and for
45% of the used terms.
      </p>
      <p>To the best of our knowledge, none of the efforts specifically focus on and evaluate the long
tail of Linked Data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and Approach</title>
      <sec id="sec-3-1">
        <title>3.1. Linked Data Access</title>
        <p>Linked Data access follows the principles of web architecture<sup>3</sup>, a multitude of standards,
protocols, and rules, including the use of URIs (Uniform Resource Identifiers) and IRIs
(Internationalized Resource Identifiers) to identify resources, and data models, like RDF (Resource Description
Framework), to represent data.</p>
        <p>As the main actors of the Linked Data web architecture, the implementation of clients and
servers plays an important role in enabling the publication, discovery, and consumption of
Linked Data. A Linked Data server provides access to resources through HTTP(S) IRIs. Clients,
on the other hand, are responsible for consuming and processing Linked Data from servers and
thus retrieve a local sub-graph of the globally accessible Linked Data graph. The Linked Data
consumption process involves the following major phases between client and server.</p>
        <p>1. IRI dereferencing: The client sends a GET request to the server identified by the HTTP(S)
IRI and follows redirects.</p>
        <p>2. Representation selection: The server responds with a representation of the resource in a
particular format (such as plain RDF serialization formats, RDFa, or JSON-LD).</p>
        <p>3. Representation parsing: The client parses the representation (the payload contained in
the HTTP body) to extract the RDF or other structured data from it.</p>
        <p>4. Follow-your-nose: The client might dereference any IRI it finds in the structured data to
retrieve additional resources.</p>
        <sec id="sec-3-1-1">
          <title>2 http://webdatacommons.org/structureddata/2022-12/stats/stats.html 3 https://www.w3.org/standards/webarch/</title>
        </sec>
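        <p>The first two phases can be illustrated with a minimal Python sketch. It is not the client used in this study: the transport is passed in as a hypothetical fetch callable (so the example runs without network access), the Accept value is shortened, and the header key Location is assumed to be spelled exactly like that. The function records every IRI it visits, stops on redirect loops, and gives up after a configurable number of redirects.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urljoin

# Hypothetical transport signature: (iri, accept) -> (status, headers, body).
Fetch = Callable[[str, str], Tuple[int, Dict[str, str], bytes]]

# Shortened Accept value, for illustration only.
ACCEPT = "application/n-triples;q=1, text/turtle;q=0.9, */*;q=0.5"


class RedirectError(Exception):
    """Raised for redirect loops, missing targets, or overly long chains."""


def dereference(iri: str, fetch: Fetch, max_redirects: int = 10):
    """Phases 1 and 2: GET the IRI, follow redirects, return the representation.

    Returns (chain, status, content_type, body); chain lists every IRI visited,
    starting with the requested one.
    """
    chain: List[str] = [iri]
    current = iri
    while True:
        status, headers, body = fetch(current, ACCEPT)
        if status not in (301, 302, 303, 307, 308):
            return chain, status, headers.get("Content-Type", ""), body
        location = headers.get("Location")
        if location is None:
            raise RedirectError("redirect without Location header")
        target = urljoin(current, location)  # Location may be relative
        if target in chain:
            raise RedirectError("redirect loop at " + target)
        if len(chain) > max_redirects:
            raise RedirectError("too many redirects")
        chain.append(target)
        current = target
```
]]></preformat>
        <p>Keeping the whole chain, rather than only the final IRI, is what later allows a client to relate the requested IRI to the entity ID used in the returned document.</p>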
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Access Mechanisms and Pitfalls</title>
        <p>Several mechanisms allow for flexibility in requesting and serving Linked Data and therefore
lead to an increased variety and complexity for both servers and clients. Furthermore,
implementations and setups may not fully adhere to expectations and specifications, which can
result in accessibility failures when trying to fetch data from the GGG. A third area that is
highly relevant for access is retrieval performance and throughput, i.e. how much data can be
requested in what time.</p>
        <p>• Redirects and Links. Servers can use redirects (HTTP code 3xx) as well as link
rel="alternate" ... in the HTTP header and the HTML &lt;meta&gt; tag to point to the
data document. Getting the server configuration right is a common pitfall; long(er) redirect
chains decrease the performance of servers and clients; loops prevent accessibility.</p>
        <p>• Serialization Variety. There is no single mandatory format specified in Linked Data;
rather, a multitude of RDF serialization formats (e.g. Turtle, N-Triples, RDF/XML) and
HTML-embedded formats such as RDFa or JSON-LD exist that the server could use in its
response; clients should support them to be able to retrieve all parts of the GGG.</p>
        <p>• Content Negotiation. As not all servers/clients can deal with all formats, clients may
send a prioritized list of formats in the Accept header. The server should select a supported
format in favor of the client's request. However, there are no guarantees on how the
server selects the response format, thus requiring the client to be flexible and to employ
trial-and-error heuristics; manual proxy/rewrite rule configurations on the
server side in particular are a common source of errors.</p>
        <p>• Parsing. The retrieved serialization can contain erroneous elements, syntactical errors,
or a deviant serialization format that was incorrectly reported, requiring fault-tolerant
parsing methods (e.g. skipping erroneous parts).</p>
        <p>• Access Limits &amp; Performance. Servers can be at capacity, resulting in timeout errors,
or they can apply rate limits indicated by an HTTP 429 error. Clients need to obey
robots.txt and Retry-After headers (which might not be correctly computed by the
server). Overall, slow responses to requests (due to server capacity or configured
limits) can be a major bottleneck when accessing the GGG.</p>
        <p>• IRI normalization. URIs/IRIs can be represented and encoded in different forms, but
merging interlinked subgraphs of the GGG fetched from several servers requires string
equality. IRI normalization can be necessary to make use of the links (see Section 3.3).</p>
        <p>To create a practical Linked Data client<sup>4</sup>, we used a bottom-up approach and adapted
the configuration and code based on what we learned from Sections 4 and 5.</p>
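        <p>To illustrate what fault-tolerant parsing of a line-based format can look like, the following Python sketch keeps every line that matches a simplified N-Triples pattern and records the numbers of the lines it skips. The regular expressions are a deliberate simplification of the N-Triples grammar and an assumption of this example; they are not the parser used in our client.</p>
        <preformat><![CDATA[
```python
import re
from typing import Iterable, List, Tuple

# Simplified token patterns (assumption: good enough for a sketch, not the full grammar).
IRI = r'<[^<>"{}|^`\\\x00-\x20]*>'
BNODE = r'_:[A-Za-z0-9][A-Za-z0-9_.-]*'
LITERAL = r'"(?:[^"\\\n\r]|\\.)*"(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'
TRIPLE = re.compile(
    r'^\s*(' + IRI + '|' + BNODE + r')\s+(' + IRI + r')\s+('
    + IRI + '|' + BNODE + '|' + LITERAL + r')\s*\.\s*(?:#.*)?$'
)


def parse_lenient(lines: Iterable[str]) -> Tuple[List[Tuple[str, str, str]], List[int]]:
    """Return (triples, skipped_line_numbers); invalid lines are pruned, not fatal."""
    triples: List[Tuple[str, str, str]] = []
    skipped: List[int] = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments carry no triples
        match = TRIPLE.match(line)
        if match:
            triples.append(match.groups())
        else:
            skipped.append(number)
    return triples, skipped
```
]]></preformat>
        <p>Because each line stands on its own, one broken statement costs exactly one triple; this is the reason to prefer line-based formats when content is expected to be noisy.</p>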
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Handling IRI identity mismatch</title>
        <p>Many RDF tools such as graph databases, RDF libraries, and reasoners often require exact
matches for IRIs in order to correctly identify them as the same, whereas Linked Data does not
as it has mechanisms like redirects (e.g. for HTTPS upgrades).</p>
        <p>While some of the IRI representation variety can be normalized by syntactic transformations
of the IRI string, which are covered by RFC standards, other cases need to be further canonicalized by
dereferencing the IRI and matching it against the delivered RDF response.</p>
        <p>IRI Syntax Normalization is a set of rules that transform an IRI into a normal form
that allows equality checks. The RFC<sup>5</sup> includes case normalization, character normalization,
percent-encoding normalization, path segment normalization, and scheme-based normalization.
However, we argue that this is not sufficient for Linked Data in practice.</p>
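        <p>The normalization steps listed above can be sketched in a few lines of Python using only the standard library. This is an illustrative approximation and not our implementation: character normalization is reduced to Unicode NFC, percent-escapes of non-ASCII characters are only uppercased rather than decoded, and only the default ports of http and https are known. All function names are chosen for this example.</p>
        <preformat><![CDATA[
```python
import re
import unicodedata
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
DEFAULT_PORTS = {"http": 80, "https": 443}
PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_pct(text):
    """Decode escapes of unreserved ASCII characters, uppercase all other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return PCT.sub(fix, text)


def remove_dot_segments(path):
    """Resolve '.' and '..' segments of a path (cf. RFC 3986, Section 5.2.4)."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        if segment in (".", ".."):
            if segment == ".." and len(output) > 1:
                output.pop()
            if index == len(segments) - 1:
                output.append("")  # a trailing dot segment leaves a trailing slash
        else:
            output.append(segment)
    return "/".join(output)


def normalize_iri(iri):
    parts = urlsplit(unicodedata.normalize("NFC", iri))
    scheme = parts.scheme.lower()              # case normalization
    userinfo = parts.netloc.rpartition("@")[0]
    host = parts.hostname or ""                # hostname is returned in lower case
    if ":" in host:
        host = "[" + host + "]"                # restore brackets of IPv6 literals
    netloc = (userinfo + "@" if userinfo else "") + host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += ":" + str(parts.port)        # scheme-based: drop default ports
    path = remove_dot_segments(normalize_pct(parts.path))
    if netloc and not path:
        path = "/"                             # scheme-based: empty path equals "/"
    return urlunsplit((scheme, netloc, path,
                       normalize_pct(parts.query), normalize_pct(parts.fragment)))
```
]]></preformat>
        <p>Note that the http and https variants of an IRI remain different strings after this step, which is one reason why syntactic normalization alone does not reconnect sub-graphs in practice.</p>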
        <p>Canonicalization &amp; Consistently Dereferenceable IRIs (CDIRI). An identity mismatch
that needs additional canonicalization occurs when the normalized form of the IRI that was
supposed to be dereferenced does not occur, or is not described, as an element of the dereferenced
Linked Data document. Subsequently, we define an IRI as the consistently dereferenceable IRI
(CDIRI) for a resource if its normal form matches the normalized IRI of the resource (most commonly
used in subject position) in the data or, in simple words: what you request is what you get. In
our implementation, we treat the CDIRI as the canonical IRI for the referenced resource or its
parts if several CDIRIs are present, which is the case for fragment (#) IRIs. Determining the
CDIRIs and using them as a replacement for all IRIs in third-party Linked Data that provide
owl:sameAs or other relationships to a resource makes the resulting merged
local graphs of the linking and the linked resource connected (e.g. such that a SPARQL pattern that joins
an owl:sameAs triple with a ?p ?o2 pattern on the link target would succeed). In the following, we show two examples of RDF resources, listing
the actual entity IDs in the RDF and the IRIs used in third-party Linked Data (the CDIRI is the IRI named as entity ID):</p>
        <p>1. http://dbpedia.org/resource/Björk, https://dbpedia.org/resource/Björk, and https://dbpedia.org/page/Björk (HTML view)
redirect, resolve, or link to RDF using http://dbpedia.org/resource/Björk as entity ID.</p>
        <p>2. http://d-nb.info/gnd/1140180746 and https://d-nb.info/gnd/1140180746 nowadays redirect/resolve to RDF using
https://d-nb.info/gnd/1140180746 as entity ID.</p>
        <p>Note that the DNB identifiers were switched to HTTPS several years ago. However, datasets
linking to the legacy DNB HTTP identifiers are still widespread. With respect to DBpedia, the
HTML page IRI is sometimes confused with the entity ID. Working with CDIRIs thus supports cleaning
up the connection of sub-graphs provided by different Linked Data providers and increases
accessibility.</p>
        <p>We developed the Pinguin<sup>6</sup> canonicalization algorithm for Linked Data clients. It collects
all intermediate IRIs, normalizes them, creates a surjective mapping from these IRIs to the CDIRI, and
validates the CDIRI by resolving it again.</p>
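        <p>The idea of mapping intermediate IRIs to a CDIRI can be sketched as follows. This is a simplified illustration under our own assumptions and not the Pinguin implementation: it only covers the case in which the entity ID is among the IRIs visited while dereferencing, the normalize and refetch_subjects callables are hypothetical hooks, and the function name is chosen for this example.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def canonicalize(
    chain: List[str],
    subjects: Iterable[str],
    normalize: Callable[[str], str] = lambda iri: iri,
    refetch_subjects: Optional[Callable[[str], Iterable[str]]] = None,
) -> Tuple[Optional[str], Dict[str, str]]:
    """Map all IRIs of a redirect chain to one CDIRI, if the document has one.

    chain: IRIs visited while dereferencing, requested IRI first.
    subjects: IRIs described in the parsed document (e.g. in subject position).
    refetch_subjects: optional hook returning the subjects found when the
        CDIRI candidate itself is resolved again (validation step).
    """
    described = {normalize(iri) for iri in subjects}
    normalized_chain = [normalize(iri) for iri in chain]
    # "What you request is what you get": a visited IRI that the document describes.
    cdiri = next((iri for iri in normalized_chain if iri in described), None)
    if cdiri is None:
        return None, {}
    if refetch_subjects is not None:
        if cdiri not in {normalize(iri) for iri in refetch_subjects(cdiri)}:
            return None, {}
    # Many IRIs map to one CDIRI, so the mapping is surjective onto {cdiri}.
    mapping = {iri: cdiri for iri in chain}
    mapping.update({iri: cdiri for iri in normalized_chain})
    return cdiri, mapping
```
]]></preformat>
        <p>Applied to the DNB example above, the chain consisting of the HTTP and the HTTPS identifier and a document that describes the HTTPS identifier yields the HTTPS identifier as CDIRI, and the legacy HTTP identifier is mapped to it.</p>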
        <sec id="sec-3-3-1">
          <title>5 Normalization of URIs in RFC 3986, Section 6, and of IRIs in RFC 3987, Section 5. 6 Named after the random resonance of Ping URI to the German word Pinguin.</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. IRI Seed Selection</title>
        <p>We picked two sets of IRIs that cover a broad spectrum of domains to study the long tail.</p>
        <p>
          Source 1 (LAL) from LOD-a-lot: For the first source of IRI seeds for crawling, we used
LOD-a-lot [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], a compact archive of RDF dumps. It contains more than 28.36 billion triples with
3.21 billion distinct subjects. According to a previous study [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the data still contains noisy and
erroneous data (e.g. IRIs containing prefixes although supposed to be absolute, usage of subject
IRIs in property position, etc.). As we are interested in IRIs that enable us to retrieve RDF via
Linked Data, we filtered LOD-a-lot for all rdf:type statements (3,321,354,308 triples) and then
collected all distinct IRIs (2,911,686,622) that occur in the subject position of those. This also
removed all invalid and non-absolute IRIs, as confirmed via the Java 11 URL checker.
        </p>
        <p>Source 2 (DEL) from DBpedia: As the second source, we used the DBpedia external links
2022-09 (DEL) dataset<sup>7</sup>, which contains all hyperlinks to external websites from the Wikitext of
articles of 137 Wikipedia versions. The links point to over 33.5 million distinct IRIs (2 invalid
IRIs were removed).</p>
        <p>Since almost 3 billion URLs would pose a significant challenge (in terms of time, traffic, and
storage requirements) for a crawling experiment, we decided to study the distribution of URLs
per domain (FQDN, fully qualified domain name) in order to understand whether we can
shrink the seed without limiting the number of domains. The rationale behind this is that our focus is
on studying the accessibility of Linked Data for the domains, instead of a full analysis of the
(payload) data of each domain.</p>
        <p>As can be seen in Figure 1 (left), the URL counts per domain for the
top-100k domains (the 100,000 domains having the most URLs) follow a power-law distribution.
This type of statistical distribution is characterized by the pattern that a small number of items
occur frequently (called the "head"), while a large number of items occur rarely (denoted as the
"long tail"). In our case, the items correspond to the domains that occur in the host part of the
seed URLs.</p>
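        <p>How such a rank distribution and the share held by its head can be computed is shown in the following Python sketch. It is an illustration only; the function names are chosen for this example, and the host part is taken as returned by the standard library URL splitter.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from urllib.parse import urlsplit


def domain_counts(urls):
    """Count seed URLs per host (FQDN)."""
    return Counter(urlsplit(url).hostname or "" for url in urls)


def top_k_share(counts, k):
    """Fraction of all URLs that belong to the k domains with the most URLs."""
    total = sum(counts.values())
    head = sum(count for _, count in counts.most_common(k))
    return head / total if total else 0.0
```
]]></preformat>
        <p>Plotting the sorted counts against their rank on log-log axes gives the kind of curve described above, and evaluating top_k_share for increasing k gives the cumulative view.</p>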
        <p>The curve of LAL is steeper compared to DEL in log-log scale, implying that the URL counts
decrease more rapidly for LAL as the rank increases, indicating a greater inequality in the
distribution. Moreover, LAL top-k domains have more URLs per domain than DEL until rank
2,802. Figure 1 (right) shows which portion of the overall number of URLs of each
dataset is contained in the cumulative counts of the top-k ranks. While for LAL the top-15
domains accumulate over 90 % of all LAL URLs, for DEL they only contain approximately 21 %.
In other words, the head of LAL is much shorter with higher IRI counts compared to DEL. In
order to reduce the number of URLs, we decided to sample URLs of the head and to
limit every domain to 2000 URLs. We used Waterman’s Algorithm R for random
sampling to pick 2000 URLs for each of the top-k domains that contain more than 2000 URLs,
which splits the domains into a subset that is sampled (2k+) and a subset that is not
(2k-) (see Table 1). For LAL, this resulted in sampling URLs for the top-1492 domains, which cover
99.63 % of the URLs, and reducing the number of 2k+ URLs from 2.901 billion to 2.948 million,
whereas for DEL the top-1152 domains covering 45.99 % were reduced from 15.422 million to
2.304 million. As a result, 13.699 million (LAL) and 20.412 million (DEL) URLs were selected for
the seeds. The resulting seeds are quite complementary since only 46,810 (0.14 %) of the URLs
and 115,313 of the domains (2.08 %) overlap between the two sources.</p>
        <sec id="sec-4-1-1">
          <title>7 https://databus.dbpedia.org/dbpedia/generic/external-links/2022.09.01</title>
        </sec>
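        <p>For readers unfamiliar with Algorithm R, the following Python sketch shows the reservoir sampling scheme in its textbook form. It is not the code used for our seed selection; the function name and the optional random generator argument are assumptions of this example.</p>
        <preformat><![CDATA[
```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Algorithm R: draw k items uniformly at random from a stream in one pass."""
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randint(0, index)    # uniform over 0..index, both inclusive
            if j < k:
                reservoir[j] = item      # keep the new item with probability k/(index+1)
    return reservoir
```
]]></preformat>
        <p>The scheme needs only one pass and memory for k items, which matters when a single domain contributes hundreds of millions of URLs. Domains with at most k URLs are returned unchanged, which corresponds to the 2k- subset.</p>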
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Crawling Results</title>
        <p>From the samples described in Section 4.1, we removed the IRI fragment identifiers and
deduplicated the resulting IRIs. We then used our custom crawler implementation that can deal with
scheduling millions of domains for parallel crawling.</p>
        <p>Crawling Parameters: Several parameters can customize the crawling process. We
describe the most important parameters of our setup below and refer to our paper repository<sup>8</sup> for
transparency and reproducibility. A single request’s timeout is set to 10 seconds. The default
request delay is 100 ms. The crawler obeys Retry-After headers up to a delay of 10 s. The system
follows a maximum of 10 redirects. IRIs of one domain are requested sequentially, and the
total time budget for one domain is max time = 2 * iriCountForDomain + 13 seconds; if it is exceeded,
a DomainTooSlow exception occurs. The crawler is configured in this way because we
consider this an appropriate setting performance-wise for an effective GGG, given that network
and server performance have increased significantly since the inception of Linked Data in 2006.
After encountering more than 50 exceptions of either Java IOException or request timeout
exceptions (signaled by Java ConnectTimeoutException, HttpConnectTimeoutException, or
HttpTimeoutException) in a row for one domain, the crawling of that domain is stopped.
Subsequently, all remaining IRIs of that domain are discarded and marked with the
TooManyFailuresInARowException. Moreover, we only store resource payloads that are smaller than 10
megabytes. Each request contains the following Accept request header (adapted from Apache
Jena, but giving priority to line-based and triple formats for fault-tolerant parsing, i.e. pruning
invalid lines):</p>
        <preformat>application/n-triples;q=1, text/turtle;q=0.9, application/n-quads;q=0.9,
application/ld+json;q=0.8, application/rdf+xml;q=0.7, application/trig;q=0.7,
*/*;q=0.5 # covers text/html with embedded JSON-LD</preformat>
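        <p>The per-domain stopping rules described above can be summarized in a short Python sketch. It is an illustration under our own naming and not an excerpt of the crawler: the class, its methods, and the constant names are hypothetical, and interpreting 10 megabytes as decimal megabytes is an assumption.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass

REQUEST_TIMEOUT_S = 10
REQUEST_DELAY_S = 0.1
MAX_RETRY_AFTER_S = 10
MAX_REDIRECTS = 10
MAX_FAILURES_IN_A_ROW = 50
MAX_PAYLOAD_BYTES = 10 * 1000 * 1000  # assumption: decimal megabytes


def domain_time_budget(iri_count: int) -> int:
    """Total time budget in seconds for one domain: 2 * iriCountForDomain + 13."""
    return 2 * iri_count + 13


@dataclass
class DomainState:
    iri_count: int
    elapsed_s: float = 0.0
    failures_in_a_row: int = 0

    def record(self, duration_s: float, failed: bool) -> None:
        """Account for one request; a success resets the failure streak."""
        self.elapsed_s += duration_s
        self.failures_in_a_row = self.failures_in_a_row + 1 if failed else 0

    def verdict(self) -> str:
        """Decide whether the remaining IRIs of the domain are still requested."""
        if self.elapsed_s > domain_time_budget(self.iri_count):
            return "DomainTooSlow"
        if self.failures_in_a_row > MAX_FAILURES_IN_A_ROW:
            return "TooManyFailuresInARow"
        return "continue"
```
]]></preformat>
        <p>For a sampled domain with 2000 IRIs, the budget evaluates to 4013 seconds, i.e. roughly two seconds per IRI on average.</p>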
        <p>Table 2 displays the analyzed log data of four seed partitions. The table is split into two parts.
The first part shows the distribution of HTTP response status codes, while the second part
displays the accessibility issues that prevented the completion of the HTTP request.</p>
        <p>In terms of HTTP status codes, it is worth noting that status code 200 has an average
recall rate of only 16%, with the highest value being 42.7% for DEL2k+ and the lowest being
a mere 8.4% for LAL2k-. The second most frequently returned status code is 404 Not Found, with an
overall percentage of 6.3%. It is also important to mention that 2.4%, or 809 thousand IRIs, end up
causing too many redirects and therefore remain unresolved with a 3xx code. When it comes to
overall accessibility, DEL was more accessible than LAL, with 83.1% for the 2k+ seed
and 28.6% for the 2k- seed, as opposed to LAL’s 58.7% for the 2k+ and 21.2% for the 2k- seed.</p>
        <p>Out of the 33.1 million IRIs that were requested, 67.7% failed due to accessibility issues. The
crawler encountered two types of issues preventing the completion of the HTTP request: those
related to the remote server and those related to our crawling requirements. The former includes
DNS record inaccessibility and resource reachability, while the latter is caused by exceeding
manually set thresholds such as payload size (MaxResourceSize). The table displays that over
95% of the total exceptions come from the top four accessibility issues. The most common
issue is the UnreliableDomain exception (ranging from 20 to 54%), which can be triggered
by DomainTooSlow and TooManyFailuresInARow exceptions. The second most common
exception is the ConnectException (31.5%) which indicates the unreachability of a host due to
DNS retrieval or a missing web server under the resolved IP address and port. The IOException
(14.6%) indicates that the remote server unexpectedly closed an HTTP request, e.g., the response
stream. Further, it was found that 12.9% of the issues were related to RequestTimeouts that
lasted for more than 10 seconds. Two exceptions related to IRI resolution are InvalidRedirectIRI
and PrivateIPSkipped. InvalidRedirectIRI denotes that the initial well-formed IRI redirects to
an invalid IRI, while PrivateIPSkipped indicates skipped requests due to the IRI resolving to a
private IP range. Other Java errors, which account for 1.4%, are primarily caused by issues
with the Java HTTP library, such as HTTP message syntax or SSL problems.</p>
        <p>Table 3 analyzes the body of successfully fetched resources (status code 200). 15k responses
with code 200 but an empty body could not be analyzed. 26% of LAL2k+ and 42% of DEL2k+
IRIs had a payload in the body. A significantly lower portion, but with a similar ratio between
LAL and DEL, was measured for the tails with 8.4% and 16.2%, respectively. In a subsequent step, we
used our failure-tolerant parser to extract Linked Data from the payload. It was configured
to parse all plain RDF formats (including plain JSON-LD, but excluding RDFa) and JSON-LD
embedded in HTML. For LAL2k+ and LAL2k-, 37 vs. 44% of the content could be parsed, making
up a total of only around 10 vs. 4% of the IRIs leading to Linked Data. Since DEL is likely
to contain more resources that are not described using Linked Data, it is not surprising that
these numbers are also low with 8 vs. 3.7%. The final measure of the accessibility of a Linked
Data IRI is given in row 3. The existence of the CDIRI in the parsed document is necessary to retrieve
information about it and to navigate further in the GGG (exploiting incoming and outgoing links).
For LAL, this is the case for 8 vs. 1.1% of the IRIs, in contrast to 5.7 vs. 3.2% for DEL.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>To shed light on Linked Data accessibility healthiness of the long tail, we sent requests to IRIs
from 5,661,415 distinct domains. We compared accessibility statistics for IRIs from head and
tail of LOD-a-lot (containing IRIs from dumped RDF resources) to Wikipedia external links
(containing manually curated IRIs, mostly intended for browsers, not necessarily with RDF data)
extracted with DBpedia. We discovered that the head and tail of these Wikipedia links were also
significantly impacted by link rot or very slow speeds, with only 43% and 16%, respectively,
returning HTTP 200 status codes. For LAL, we actually expected a much lower and divergent
number (26% and 8%), since on the one hand it contains a portion of dumped resources that
never were accessible via Linked Data, and on the other hand it was published before 2017,
several years earlier than the Wikipedia links (based on the assumption that age increases link rot).</p>
      <p>To address the identified accessibility challenges, an infrastructure for improving resource
availability and refined Linked Data consumption strategies seem to be potential steps toward
fostering a more usable and accessible Linked Data web.</p>
      <p>Acknowledgements: This work was partially supported by grants from the German Federal
Ministry for Economic Affairs and Climate Action (BMWK) to the projects KISS (01MK22001A)
and OpenFlaaS (100594042), by the European Research Council for the project ScienceGRAPH
(819536) as well as by the Federal Ministry of Education and Research of Germany and by the
Sächsische Staatsministerium für Wissenschaft Kultur und Tourismus in the program Center
of Excellence for AI-research "Center for Scalable Data Analytics and Artificial Intelligence
Dresden/Leipzig", project identification number: ScaDS.AI.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Käfer</surname>
          </string-name>
          ,
          <article-title>BTC-2019: the 2019 billion triple challenge dataset</article-title>
          ,
          <source>in: ISWC 2019</source>
          , volume
          <volume>11779</volume>
          of LNCS, Springer,
          <year>2019</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>180</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-030-30796-7_11</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Käfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. O'Byrne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hogan</surname>
          </string-name>
          ,
          <article-title>Observing linked data dynamics, in: The Semantic Web: Semantics and Big Data</article-title>
          ,
          <string-name>
            <surname>ESWC</surname>
          </string-name>
          <year>2013</year>
          , volume
          <volume>7882</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>227</lpage>
          . doi:10.1007/978-3-642-38288-8_15.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Beek</surname>
          </string-name>
          , et al.,
          <article-title>LOD Laundromat: A uniform way of publishing other people's dirty data</article-title>
          ,
          <source>in: ISWC</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>228</lpage>
          . doi:10.1007/978-3-319-11964-9_14.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Martínez-Prieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arias</surname>
          </string-name>
          ,
          <article-title>Binary RDF representation for publication and exchange (HDT)</article-title>
          ,
          <source>in: Journal of Web Semantics</source>
          , volume
          <volume>19</volume>
          , Elsevier,
          <year>2013</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>41</lpage>
          . doi:10.1016/j.websem.2013.01.002.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Beek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <article-title>LOD-a-lot: A single-file enabler for data science</article-title>
          ,
          <source>in: SEMANTICS</source>
          <year>2017</year>
          , ACM,
          <year>2017</year>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>184</lpage>
          . doi:10.1145/3132218.3132241.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Petrovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <article-title>The WebDataCommons Microdata, RDFa and Microformat dataset series</article-title>
          ,
          <source>in: ISWC</source>
          <year>2014</year>
          , volume
          <volume>8796</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>292</lpage>
          . doi:10.1007/978-3-319-11964-9_18.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Brinkmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Primpeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <article-title>The web data commons schema.org data set series</article-title>
          , in: WWW 2023 Companion, ACM,
          <year>2023</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>139</lpage>
          . doi:10.1145/3543873.3587331.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Streitmatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Götz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arndt</surname>
          </string-name>
          ,
          <article-title>DBpedia archivo: A web-scale interface for ontology archiving under consumer-oriented aspects</article-title>
          ,
          <source>in: Semantic Systems</source>
          , volume
          <volume>12378</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>35</lpage>
          . doi:10.1007/978-3-030-59833-4_2.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Streitmatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arndt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <article-title>Reproducibility crisis in the LOD cloud? Studying the impact of ontology accessibility and archiving as a counter measure</article-title>
          ,
          <source>in: ISWC</source>
          <year>2022</year>
          , volume
          <volume>13489</volume>
          <source>of LNCS</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>107</lpage>
          . doi:10.1007/978-3-031-19433-7_6.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>