<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Managing Broken URLs in Federated Metadata?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tien-Dung Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Shulman</string-name>
          <email>elena.shulman@eun.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Schoolnet</institution>
          ,
          <addr-line>Rue de Treves 61, B-1040 Brussels</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <fpage>35</fpage>
      <lpage>45</lpage>
      <abstract>
        <p>In a large federation of learning object repositories, learning object locations or learning object URLs in metadata can become out of date when learning objects are moved or deleted. Regular checking of learning object URLs is essential. However, fully checking across a large federation imposes an unsustainable burden on time and resources while negatively impacting networks and repositories. This paper describes a broken URL handling system with a heuristic model that can provide a sustainable solution for federation service managers while enhancing communication and collaboration among federation stakeholders.</p>
      </abstract>
      <kwd-group>
        <kwd>Broken URLs</kwd>
        <kwd>federated metadata</kwd>
        <kwd>heuristic model</kwd>
        <kwd>learning resource exchange</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>[Fig. 1. Learning Resource Exchange.]</p>
      <p>This paper presents the results of our efforts to develop a solution that will
effectively detect broken URLs, automate a system for communicating with
content providers and allow greater flexibility for LRE service managers in resolving
issues that can negatively impact user experience with the LRE.</p>
      <p>Section 2 presents the background to the problem. Section 3 presents the
architecture of the broken URL handling system and details our proposed
heuristic algorithm to automatically detect broken URLs. Section 4 details the
guidelines developed to facilitate effective and expeditious communication with
content providers, and the procedures for LRE managers and system responses
depending on the status of broken URLs once reported. Section 5 is a discussion
including some elements of comparison with related work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Broken URLs are problematic on several levels for a service that manages a
large federation of repositories. Unsurprisingly, broken URLs significantly
impact user satisfaction with the system. Teachers have reported a loss of
confidence in the LRE's catalog and in the perceived quality of the LRE more broadly [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
From the users' point of view, broken URLs are also a source of frustration when
teachers attempt to access resources they have previously placed in a favorites
folder or when they recommend to colleagues resources that have become
suddenly unavailable. As depicted in Figure 1, the LRE federates LO repositories
from various origins. LRE content is provided by ministries of education (MoE),
commercial and non-profit content providers (Publisher), and cultural heritage
organizations (Museums). Some of these repositories might store user-generated
content (Teachers). Because the LRE does not host the objects or control access
to objects directly, the LRE must rely on the timely collaboration of the content
providers depicted in Figure 1 to update their metadata whenever object
locations have changed.
      </p>
      <p>Experience has shown that broken URLs can become a significant problem.
Part of the solution we propose is based on a heuristic algorithm developed
and tested for this purpose. This paper also illustrates the way we have linked a
solution for broken URL detection to support communication and collaboration
with content providers.</p>
      <p>The LRE has a number of techniques in place intended to ameliorate issues
with metadata quality, but none has provided a sustainable solution in a system
reliant on coordinated actions in a growing federation. First, to keep the LRE
catalog up to date, metadata is collected on a regular basis (e.g., every day). This
technique ensures that updates to the metadata on the part of content providers
are reflected in the main catalog. However, it requires content providers to
maintain the accuracy of their own metadata. We have found that this technique
has its own set of drawbacks because some providers do not use the metadata
they produce for the LRE. Another technique relies on users' scrutiny: teachers
who find broken URLs are invited to report them.</p>
      <p>Reported URLs are checked by the system and, if the problem is confirmed,
the corresponding entries are removed from the catalog and their content providers
receive a request to correct the problem. The main drawback of this technique
is that it is a highly individualized solution, impacting the
quality of only a few records at a time. Moreover, it is fully reliant on the
goodwill of catalog users and does not ultimately shield them from an
unsatisfactory search-and-retrieval experience culminating in the discovery of broken
URLs.</p>
      <p>A third technique consists of systematically checking all the URLs of the
catalog on a regular basis. Unfortunately, this solution does not scale: as the
number of catalog entries grows, the time necessary to check the catalog fully
becomes prohibitive. For example, in the LRE a full check
of more than 200,000 URLs takes more than two days. To make matters worse,
systematically checking all the learning resource URLs available on a system can
sometimes be considered "unfriendly" by its administrators.</p>
    </sec>
    <sec id="sec-3">
      <title>Broken URL Handling System</title>
      <p>While allowing the LRE to efficiently detect and make decisions to remove
catalog entries with broken URLs from the search, the proposed solution is ultimately
intended to support and automate more effective communication with content
providers and to facilitate their collaboration in expeditiously updating their
metadata. The handling system (Figure 2) and guidelines described in this
article accomplish several objectives. They demonstrate the feasibility of a heuristic
model in efficiently detecting broken URLs and propose a mechanism to trigger
ameliorative actions from providers. They illustrate our efforts at establishing
clear guidelines to support decision making and communication with providers
for the LRE service managers responsible for metadata quality and user
satisfaction. A component to detect and hide broken URLs is only part of the solution.
The proposed solution also takes into account and avoids broken URL detection
techniques that can appear to be denial-of-service attacks on partner systems.</p>
      <p>There are several steps in the detection and handling of broken URLs. The
first stage was a full system check for broken URLs to determine the scale of the
problem and to create a baseline against which to test the proposed algorithm. As discussed,
while this kind of check can produce useful data on URLs, it is not feasible
on a regular basis for the reasons already mentioned. The second
stage is our proposed long-term solution to detect and trigger corrections for
bad URLs based on checks of smaller subsets, providing better automated
channels for communicating with content providers. Techniques for determining
the size and location of the sets to be checked are described below.</p>
      <p>As mentioned above, it is necessary to put in place a heuristic checking
algorithm that regularly checks the URLs of the catalog without "attacking" the systems
hosting LOs. Therefore only a subset of URLs should be checked at a time. Our
analysis of the LO location distribution shows that URL selection should be
based on the LO domain name (or domain for short). For each domain,
only a subset of its URLs should be checked. The first step is to determine
which domains should be targeted for checks. To determine whether the system stops
or continues checking URLs in a domain, a sampling plan is applied after all
selected URLs are checked.</p>
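      <p>The per-domain selection described above presupposes that the catalog's URLs are first grouped by the domain that hosts them. A minimal Python sketch of that grouping step (the function name and sample URLs are our own illustrative choices, not the LRE code):</p>

```python
from urllib.parse import urlparse
from collections import defaultdict

def group_by_domain(urls):
    """Group catalog URLs by their host (domain) so that each
    domain can be sampled and checked independently."""
    domains = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc
        if host:  # skip malformed entries without a host part
            domains[host].append(url)
    return dict(domains)

catalog = [
    "http://repo-a.example/lo/1",
    "http://repo-a.example/lo/2",
    "http://repo-b.example/item/9",
]
groups = group_by_domain(catalog)
print(sorted(groups))                 # ['repo-a.example', 'repo-b.example']
print(len(groups["repo-a.example"]))  # 2
```

      <p>Each resulting bucket is then a candidate for the sampling plan described in the following subsections.</p>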
      <sec id="sec-3-1">
        <title>Learning Object Location Distribution</title>
        <p>Learning objects in the LRE federation are hosted on one or several domains.
Typically, a URL or LO location refers to only one domain, while a domain
normally hosts more than one LO. We analyzed the relationship between LOs
and domains in two respects. First we looked at the domain distribution
in the catalog, which indicates how many domains host a certain number of
LOs. Second we analyzed the distribution of LOs to determine how LOs are
distributed among domains.</p>
        <p>
          In June 2010, there were 842 domains in the LRE. Figure 3 shows the
domain distribution and Figure 4 shows the LO distribution. A significantly large
percentage of domains (60%) hosted just a single learning object. Even more
important for the development of our solution was the finding that only 4% of
domains hosted more than 1,000 LOs and that 70% of LOs were located on domains
hosting over 10,000 LOs. The fact that a small number of domains
host 96% of all LOs was instrumental in building a heuristic checking
algorithm. Based on this analysis of LO distributions, we propose to check each domain
separately. For each domain, two checking steps are applied. First, the system
checks all reported-broken URLs, i.e., URLs reported as broken in the last
check, and calculates the confidence of the system in the domain. The second
step applies a sampling plan to check all previously good URLs.
The sampling plan (Figure 5) is a simplified version of the sequential sampling
plan [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p><bold>Single sampling plan.</bold></p>
        <p>One sample of the URLs on a domain, such as 10% of its URLs, is selected at
random. If more than p2 of the selected URLs (e.g., 90%)
are good, all unchecked URLs are considered good. Otherwise, all the
URLs on the domain are checked.</p>
        <p><bold>Sequential sampling plan.</bold></p>
        <p>A set of URLs in a domain, for instance 100 URLs, is selected at
a time, and after inspection a decision is made to accept (stop checking),
reject (check all URLs in the domain), or select another set in the same
domain (continue sampling). In general, sequential sampling allows for
quick decisions, especially when the frequency of broken URLs in a domain is
particularly high or particularly low.</p>
        <p><bold>Sampling plan in the algorithm.</bold></p>
        <p>Because we recognize the difficulty of calculating the slope of the accept and
reject lines, the checking algorithm uses a sequential sampling plan with one
simplification: the slope is zero (Figure 5).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Sampling Plan's Parameters</title>
        <p>A pre-defined threshold p1 represents low confidence of the checking system
in a domain; p = p1 is the reject line in the sampling plan.
Two pre-defined thresholds p2⁻ and p2⁺ represent the low-value confidence and
the high-value confidence of the checking system in a domain.
Npb is the number of broken URLs in the previous check and Npsb is the number
of stay-broken URLs in the current check; the rate r is calculated as follows:
r = Npsb / Npb   (1)
A threshold p2 represents the confidence of the checking system in a domain;
p2 is calculated as follows:
p2 = p2⁻ + r (p2⁺ − p2⁻)   (2)
p = p2 is the accept line in the sampling plan.</p>
        <p><bold>Sampling plan procedure.</bold> A group G of URLs is selected at a time for checking. Assume that Nc is the
total number of checked URLs in the current check and Ng is the total number
of good URLs; the percentage p of good URLs is calculated as follows:
p = Ng / Nc   (3)
If p &lt; p1, the system checks all URLs of the domain. If p ≥ p2, the
system assumes that all the other URLs on the domain are good and stops
checking. Otherwise, the system continues sampling and checking (Figure 5).
Algorithm 1 shows the detail of the checking technique.</p>
        <p>Algorithm 1: Heuristic checking algorithm
procedure Check
  for all domain in domains do
    re-check all broken URLs on domain
    calculate p2 using equation (2)
    sampling = true
    while sampling and (domain has unchecked URLs) do
      take a group of URLs for checking
      check this group
      update p using equation (3)
      if p &lt; p1 then
        check all other URLs
        sampling = false
      else if p ≥ p2 then
        sampling = false   (assume that all other URLs are good)
      end if
    end while
  end for
end procedure</p>
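        <p>To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 for simulation purposes. It is not the LRE implementation: the function and parameter names (check_url, group_size, p2_minus, p2_plus, previously_broken) are our own illustrative choices, and check_url stands in for whatever HTTP probe the system uses.</p>

```python
def heuristic_check(domains, check_url, group_size=100,
                    p1=0.5, p2_minus=0.9, p2_plus=0.95,
                    previously_broken=None):
    """Heuristic per-domain check (sketch of Algorithm 1).

    domains: dict mapping a domain name to the list of URLs it hosts.
    check_url: callable returning True when a URL resolves.
    previously_broken: dict mapping a domain to the URLs reported
    broken in the previous check.
    Returns a dict mapping each domain to the broken URLs found.
    """
    previously_broken = previously_broken or {}
    broken = {}
    for domain, urls in domains.items():
        # Step 1: re-check last check's broken URLs and compute r (eq. 1).
        prev = set(previously_broken.get(domain, []))
        stay_broken = [u for u in prev if not check_url(u)]
        r = len(stay_broken) / len(prev) if prev else 0.0
        # Accept line p2 interpolates between p2- and p2+ (eq. 2).
        p2 = p2_minus + r * (p2_plus - p2_minus)
        # Step 2: sample the previously good URLs group by group.
        unchecked = [u for u in urls if u not in prev]
        checked = good = 0
        found = list(stay_broken)
        sampling = True
        while sampling and unchecked:
            group, unchecked = unchecked[:group_size], unchecked[group_size:]
            for u in group:
                checked += 1
                if check_url(u):
                    good += 1
                else:
                    found.append(u)
            p = good / checked  # eq. 3
            if p < p1:
                # Reject line crossed: check every remaining URL.
                found.extend(u for u in unchecked if not check_url(u))
                sampling = False
            elif p >= p2:
                # Accept line crossed: assume remaining URLs are good.
                sampling = False
        broken[domain] = found
    return broken
```

        <p>With a healthy domain, a single sampled group suffices to accept the whole domain; with a heavily broken one, the first group crosses the reject line and everything is checked, which mirrors the quick-decision behavior of sequential sampling described above.</p>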
      </sec>
      <sec id="sec-3-3">
        <title>Experiment</title>
        <p>To compare our algorithm with the full system check, we first executed a full
system check to detect all broken URLs in the LRE; then we executed the
heuristic checking algorithm three times in simulation mode, using the findings
of the full system check for comparison.</p>
        <p>In June 2010, the total number of URLs in the LRE was 236,763; among
them, 45,711 (19.31%) were broken (the experiment took place on June 23rd, 2010).
The first run was executed under the assumption that all URLs are good, while
the second and third runs used knowledge from the previous run.
The following parameters were used:
G = 100, p1 = 50%, p2⁻ = 90%, p2⁺ = 95%.</p>
        <p>The first run checked only 17.36% of the URLs but was able to detect 73.48% of the
broken URLs. The second and third runs detected 94.92% and 98.67% of the
broken URLs, respectively. Overall, after three runs, the heuristic checking algorithm
checked only 36.78% of all URLs on average but detected 98.67% of the broken URLs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Procedures to Correct Broken URLs</title>
      <p>Given that the LRE federates LO repositories from various origins, effective
handling of broken URLs depends simultaneously on the heuristic checking algorithm
that can be deployed to test discrete domains and on the automation of communication
channels with content providers. We therefore established a schedule for check
sequences and conditions under which LRE service managers could act on the
findings of the checking algorithm, initiate the removal of records from search
results, set in motion further checks and determine when it is appropriate to
return records to the search. Making these conditions explicit allows for better
management of metadata quality in the LRE, shields users from negative
experiences during search and retrieval, and opens opportunities to collaborate with
content providers in improving the quality of their own metadata.</p>
      <p>[Table (experiment of June 23rd, 2010): for each run, the selected URLs (in
number and as a percentage of the total number of URLs), the broken URLs
detected (in number and as a percentage of the total number of broken URLs),
and the rate of broken URLs among selected URLs.]</p>
      <p>The first step in this process begins when a broken link is first found and the
LRE system generates an automated notification for providers. A system controller
coordinates all the other components and generates reports. These machine-readable
reports are set to trigger a fix and can be exposed to providers via RSS or Atom
feeds or downloadable Excel sheets, depending on the preferences of the providers.
After seven days, the LRE system is set to recheck the problem URLs. If the
problem persists and there has been no other response from the provider, an
automated email message is generated for the LRE system manager, notifying
them of the URLs' status and providing relevant details about the record(s) in question
and contact information for the provider. Using this data, the LRE manager
sends a personal email to the repository manager, who is thereby notified
that the problem has not been corrected despite an automated report.</p>
      <p>The guidelines stipulate that after three weeks records with broken URLs will
be removed from the search results. At this point we hope to begin a dialog
with the repository manager, determine the source of the problem, offer advice
or other assistance if needed, and provide feedback more generally to repository
administrators. In order to mitigate the detrimental effects of broken URLs on
user experience, we have instituted a time limit on how long records whose broken
URLs have been identified and reported to providers can remain in the LRE
search. One month after the initial discovery of a broken URL, the system
initiates another check of the domain using the algorithm described above. If
the problem has not been corrected, records with broken URLs are removed
from the search results. This action is accompanied by an automated notification
sent to the provider listing the items that are no longer in the search results and
the reason for their removal.</p>
      <p>Once records have been removed from the search, a
Fibonacci sequence is used to determine if and when the problem URLs
are corrected by providers. When the check sequence indicates that a broken
URL has been corrected, the records are returned to the search results. An
automated notification to the providers, both machine readable and as an email,
is generated to alert the providers of their record(s)' change in status.
This ability to return records to the search accommodates scenarios in
which providers are not able to respond expeditiously to reports of broken URLs,
without unduly impacting the quality of the LRE's services.</p>
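      <p>The text above does not specify exactly how the Fibonacci sequence is applied to the recheck schedule. One plausible reading is that the gaps between successive rechecks, in days, follow the sequence, so rechecks become progressively less frequent for URLs that stay broken. A hypothetical sketch under that assumption (the function name and horizon parameter are our own):</p>

```python
from datetime import date, timedelta

def fibonacci_recheck_dates(removal_date, horizon_days=90):
    """Recheck schedule after a record is removed from search:
    the gaps between consecutive checks follow the Fibonacci
    sequence in days (1, 1, 2, 3, 5, ...), up to a horizon."""
    a, b = 1, 1
    day, dates = 0, []
    while day + a <= horizon_days:
        day += a                                  # next gap, in days
        dates.append(removal_date + timedelta(days=day))
        a, b = b, a + b                           # advance the sequence
    return dates

start = date(2010, 6, 23)
for d in fibonacci_recheck_dates(start, horizon_days=60):
    print(d)  # checks fall 1, 2, 4, 7, 12, 20, 33 and 54 days after removal
```

      <p>Such a back-off keeps the load on provider systems low while still detecting a fix reasonably soon after it happens.</p>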
      <p>The guidelines described above take into account the need to provide
flexibility of response for LRE service managers and opportunities to improve
collaboration between system administrators within the federation.</p>
    </sec>
    <sec id="sec-5">
      <title>Discussion and Related Works</title>
      <p>
        Although recognition of the problem and attempts to grapple with the
challenges of broken URLs or "link rot" are not new, most existing approaches
to automatically detecting broken URLs are based on the relationships between
resources or pages [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which is not applicable in an LO federation, where
there are almost no links between two learning object locations.
      </p>
      <p>
        There are also some methods such as [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
] to ensure that the links are
always available. These methods can be divided into two categories. The first
approach involves creating copies and keeping resources in a local repository [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
[
        <xref ref-type="bibr" rid="ref11">11</xref>
]. This approach assumes that the LOs will not be modified, or it requires
checking the links on the fly to determine if and when the system needs to switch
the links.
      </p>
      <p>
        The second approach uses a persistent identi er service to resolve the URLs
[
        <xref ref-type="bibr" rid="ref3">3</xref>
]. However, this approach only works well if such a service is available at the content
providers' sites, which is not the case in the LRE.
      </p>
      <p>The proposed mechanism to manage broken URLs allows LRE managers to
control the quality of LO metadata in a federation. It combines different ways
to detect broken URLs and report problems to the content providers. Because
the heuristic checking algorithm is able to detect broken URLs without blocking
network communication, it can be implemented to run in a sequence and on a schedule
intended to provide detection, reporting and corrective actions either on the
part of providers or, if necessary, on the part of LRE managers.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        In this paper, we have presented a framework with a heuristic checking algorithm
that allows for the correction of broken URLs in federated metadata. Future
work will focus on further enhancements of this framework to detect broken
URLs when metadata arrives in the LRE. This future work will involve the
development of a filter to remove broken URLs from metadata collected via the
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or the
Simple Publishing Interface (SPI) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masinter</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCahill</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Uniform Resource Locators (URL) (RFC 1738)</article-title>
          . Network Working Group (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ingham</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caughey</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Little</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Fixing the "Broken-Link" Problem: The W3Objects Approach</article-title>
          . In: Computer Networks and
          <source>ISDN Systems</source>
          ,
          <volume>28</volume>
          , pp.
          <fpage>1255</fpage>
          -
          <lpage>1268</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kahn</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilensky</surname>
          </string-name>
          , R.:
          <article-title>A Framework for Distributed Digital Object Services</article-title>
          .
          <source>In: International Journal on Digital Libraries</source>
          ,
          <volume>6</volume>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>123</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Massart</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Towards a Pan-European Learning Resource Exchange Infrastructure</article-title>
          . In: Feldman,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Kraft</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          , Kuflik, T. (eds.) NGITS'
          <year>2009</year>
          , LNCS, vol.
          <volume>5831</volume>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>132</lpage>
          . Springer, Haifa, Israel (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lagoze</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sompel</surname>
            ,
            <given-names>H. V.</given-names>
          </string-name>
          :
          <article-title>The Open Archives Initiative Protocol for Metadata Harvesting</article-title>
          . http://www.openarchives.org/OAI/openarchivesprotocol.htm (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <source>MELT: Final Evaluation Report, D 7</source>
          .3. http://info.meltproject.eu/shared/data/melt/MELT D7 3
          <article-title>Final Evaluation Report</article-title>
          .pdf (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. NIST/SEMATECH:
          <article-title>e-Handbook of Statistical Methods</article-title>
          . http://www.itl.nist.gov/div898/handbook (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Popitsch</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haslhofer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>DSNotify: Handling Broken Links in the Web of Data</article-title>
          .
          <source>In: Proceedings of the 19th international WWW conference (WWW2010)</source>
          , NC, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Reich, V.,
          <string-name>
            <surname>Rosenthal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>LOCKSS: A Permanent Web Publishing and Access System</article-title>
          .
          <source>D-Lib Magazine</source>
          , vol.
          <volume>7</volume>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ternier</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Interim Report: The Simple Publishing Interface Specification</article-title>
          . http://ariadne.cs.kuleuven.be/lomi/images/b/ba/CEN
          <source>SPI interim report.pdf</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Veiga</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferreira</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>RepWeb: Replicated Web With Referential Integrity</article-title>
          .
          <source>In: Proceedings of the 2003 ACM Symposium on Applied Computing</source>
          , pp.
          <fpage>1206</fpage>
          -
          <lpage>1211</lpage>
          , New York, NY, USA (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Wiley, G.,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Improving OpenURL Metadata</article-title>
          .
          <source>Serials Librarian</source>
          , vol.
          <volume>56</volume>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>286</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>