<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Content Drift on the Web Using Web Archives and Textual Similarity (short paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Brenda Reyes Ayala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiufeng Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juyi Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Alberta, School of Library and Information Studies</institution>
          ,
          <addr-line>11210 87 Ave, Edmonton, Alberta T6G 2G5</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Content drift, which occurs when a website's content changes and moves away from the content it originally referenced, is a problem that affects both live websites and web archives. Content drift can also occur when a page has been hacked, its domain has expired, or the service has been discontinued. In this paper, we present a simple method for detecting content drift on the live web based on comparing the titles of live websites to those of their archived versions. Our assumption was that the greater the difference between the title of an archived website and that of its live counterpart, the more likely it was that content drift had taken place. To test our approach, we first had human evaluators manually judge websites from three Canadian web archives to determine whether or not content drift had occurred. We then extracted the titles from all websites and used cosine similarity to compare the titles of the live websites to those of the archived websites. Our approach achieved positive results, with an accuracy of 85.2%, precision of 89.3%, recall of 92.1%, and an F-measure of 90.7%. Simple methods such as the one presented in this paper allow institutions and researchers to quickly and effectively detect content drift without needing many technological resources.</p>
      </abstract>
      <kwd-group>
        <kwd>web archiving</kwd>
        <kwd>cultural heritage</kwd>
        <kwd>relevance</kwd>
        <kwd>reference rot</kwd>
        <kwd>link rot</kwd>
        <kwd>content drift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ephemeral and transient nature of the web has been well established. For many institutions
that seek to preserve their online cultural heritage, the process of web archiving can seem like a
race against time, as web archivists struggle to capture websites before they disappear from the
web. The danger of websites disappearing from the web altogether is part of a larger problem
known as reference rot, which has two components, as identified by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
1. Link rot: The resource identified by a URI vanishes from the web. As a result, a URI
reference to the resource ceases to provide access to referenced content.
2. Content drift: The resource identified by a URI changes over time. The resource’s content
evolves and can change to such an extent that it ceases to be representative of the content
that was originally referenced.
      </p>
      <p>
        The process of web archiving arose in part to combat link rot; however, the subtler and more
insidious problem of content drift persists in both the live web and in web archives. The study
of content drift is complicated by the appearance of soft 404s. A soft 404 is an incorrect HTTP
status code of 200 (OK) or 3xx (redirect) that masks a correct status code of 404 (Not found).
It occurs when websites redirect failed URLs to the site's homepage, thus masking
the standard 404 return code that occurs when there is a failure to access a web resource [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Content drift can also occur when the page has been hacked, its domain has expired, or the
service has been discontinued. Many web archives, such as those created using the Internet
Archive’s Archive-It service [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], are topic-specific or thematic in that they collect and preserve
many websites that cover a single topic or news event, such as Human Rights or the COVID-19
pandemic. If a live website is affected by content drift, and the site is then crawled, the resulting
archived webpage, and the web archive which contains it, will also have content drift. Within
web archives, webpages affected by content drift can also be referred to as being "off-topic", and
are defined as those "that have changed through time to move away from the initial scope of
the page, which should be relevant to the topic of the collection" [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>It is very useful for both web archivists and researchers to be able to determine if a live website
has been affected by content drift. Web archivists who preserve websites for their institutions,
if they see that a URI has drifted, can decide whether or not they wish to keep crawling it.
Since web archiving technologies and services usually entail a significant investment of time,
money, and resources, institutions can avoid some of these costs by ceasing to crawl websites
that have drifted. Detecting content drift on the live web can prevent it from occurring in web
archives. Furthermore, researchers who study reference rot on the web could use our method
as a way of detecting soft 404s on the live web. The presence of soft 404s can lead researchers
to underestimate the occurrence of reference rot, since these pages do not return the typical
404 "Not Found" error.</p>
      <p>
        In this paper, we present an approach to detecting content drift on the live web. The purpose
of this research is to find an approach to detecting content drift that simulates a human evaluator
inspecting a website and determining if content drift has occurred. In the past, researchers have
developed accurate methods for detecting content drift (discussed in Section 2); however, many
of these approaches are computationally intensive, and, in the case of web archives, may require
access to the Web ARChive (WARC) files that store the archived pages [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We take a different
approach based on a very simple assumption: that changes in the title of a website are indicative
of changes in its content, and thus large changes in the title of a website may be indicative of
content drift. This approach is quick, and is suitable for researchers or institutions that
may not have access to the WARC files they are studying or lack the computational power
required to perform these calculations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Previous work</title>
      <p>
        The prevalence of reference rot in publications and on the web has been well-documented.
As early as 2001, the authors of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] noted that improved citation practices were
necessary to minimize the future loss of information. One of the first studies to quantify
reference rot was by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], who monitored the status of a random set of URIs over four years. His
results showed that approximately 67% of URIs became inaccessible after a four-year period.
      </p>
      <p>
        The increased use of URLs and URIs in online publications has in part fueled the increase in
reference rot over time. According to a study [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Electronic Theses and Dissertations (ETDs)
that include URL references have increased over the past 14 years from 23% in 1999 to 80% in
2012. In a study of the persistence of web resources in the arXiv repository and the University
of North Texas (UNT) Digital library, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] found that 45% of the URLs referenced from arXiv still
exist, but are not preserved for future generations, and 28% of resources referenced by UNT
papers have been lost. In a 2014 paper, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] investigated how reference rot impacted the ability
to revisit the context of scholarly articles after they had been published. The authors extracted
the URIs referenced in a collection of 3.5 million scholarly articles from Science, Technology,
and Medicine (STM) fields, and observed that one out of five articles suffered from reference rot,
meaning it is impossible to revisit the web context that surrounds them some time after their
publication. In 2021, researchers at Harvard Law School examined New York Times articles
from 1996 to 2019. They found that link rot had increased linearly over time, and that out of over
2 million hyperlinks, 25% were inaccessible and over 13% of links that were still reachable
had suffered content drift [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Other researchers have employed web archives to study the nature of reference rot. A study
of reference rot in the UK web archives of the British Library found that, of over 1,000 archived
URIs, 40% were gone from the live web after two years, while another 40% had been
affected by content drift [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In 2012, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] developed a Naïve Bayes classifier that could detect
soft 404 pages with a precision of 99% and a recall of 92%.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors compiled three different Archive-It collections and experimented with
several methods of detecting off-topic webpages, and with how to define thresholds that
separate the on-topic from the off-topic pages. This involved comparing the text (after
preprocessing, stemming, and stopword removal) of the archived website as it was first captured
with the text of the same archived website as captured at a later time.
The authors tested a variety of methods and found that the cosine similarity method proved
the best at detecting off-topic web pages, with an average accuracy of 0.983, an F-measure
(harmonic mean of precision and recall) of 0.881, and an Area Under the Curve (AUC) measure of
0.961. The second-best performing measure was word count. The authors also experimented with
combining several similarity measures in an attempt to increase performance. The combination
of the cosine similarity and word count methods yielded the best results, with an accuracy of
0.987, an F-measure of 0.906, and an AUC of 0.968 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] examined content drift in the same collection used by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. They first extracted the URIs
referenced in the collection of articles, then obtained the archived version (snapshots) of these
URIs whenever available. The text from these archived websites was extracted, and textual
similarity measures were used to compare their content to their live web counterparts. The
authors found that representative snapshots exist for about 30% of all URI references, and that
for over 75% of references the content had drifted away from what it was when referenced. A
high degree of both link rot and content drift was detected in the scholarly collection.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>As was seen in Section 2, researchers have used web archives to study the notion of reference
rot on the web. Past work on detecting content drift in web archives has focused on comparing
the extracted text of some archived websites to the extracted text of other archived websites.
This necessitates access to the WARC files that contain the archived websites in the first place.</p>
      <p>
        WARC is a container file format that can store a very large number of data objects, including
audio and video resources, inside of a single, compressed file. In web archiving, WARC files
are used to store content that has been harvested from the web via web crawlers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Most
WARC files are quite large, ranging from a few gigabytes to many terabytes in size, which
makes them cumbersome and slow to traverse and analyze. Extracting content from WARC files
requires pre-processing steps for extracting the text, stop-word removal, and stemming. As
noted in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], extracting text from HTML and PDF files proved substantially time-consuming
and arduous, and necessitated the writing of custom code even beyond the pre-existing code
that is already available to do so. As [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] states, analyzing WARC files often means working
with 100s of terabytes of data. This necessitates large amounts of storage space, memory, and a
robust research infrastructure, which many cultural heritage institutions do not have access to.
      </p>
      <sec id="sec-3-1">
        <title>3.1. The dataset</title>
        <p>
          We used the following three web archive collections to gather test data. These were created
and maintained by the University of Alberta using the Archive-It subscription service. These
collections were created by the University of Alberta Libraries in an effort to preserve western
Canadian cultural heritage on the web [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>1. Idle No More (INM): websites related to “Idle No More”, a Canadian political movement
encompassing environmental concerns and the rights of indigenous communities [15].
2. Fort McMurray Wildfire 2016 (FMW): websites related to the Fort McMurray Wildfire of
2016 in the province of Alberta, Canada [16].
3. Western Canadian Arts (WCA): born-digital resources created by filmmakers in Western
Canada [17].</p>
        <p>The collections Idle No More and Fort McMurray Wildfire consist mostly of news articles
and social media posts, primarily from Twitter. As a result, we expected both of these collections
to suffer significantly more from content drift due to frequent webpage changes. Since the WCA
collection includes the personal websites of many artists, we expected this particular collection
to have much less content drift.</p>
        <p>Figure 1 (left: archived website from January 29, 2013; right: live website as of June 5, 2022).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation of content drift</title>
        <p>In order to determine the amount of content drift in our web archive collections, all three
authors manually inspected each of the live websites and compared them to their archived
versions. Each capture of each website was evaluated by inspecting its content, as well as its look
and feel. A website was labeled "off-topic" if it had been affected by content drift, and "on-topic"
otherwise. Most captures were judged, except in a few cases where the archived version was of
very poor quality and evaluators were not able to determine if content drift had taken place.</p>
        <p>Figure 1 shows an example of content drift from the INM collection. The archived website
originally contained a news article about the movement, but the live website now redirects to
the homepage of the newspaper, showing content drift has occurred. Were it not for an archived
copy of the page, the article would have been lost forever. The details for each collection are
given in Table 1.</p>
        <p>Overall, about a quarter of the data set had been affected by content drift, with the FMW
collection experiencing 33.2% content drift. The INM collection had undergone surprisingly
little content drift (9.6%). The judgements as to whether a website was on-topic or off-topic
were used as the ground truth for our next steps.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Extracting titles</title>
        <p>In the title extraction process, the Python library "Beautiful Soup"1 was applied first.
However, Beautiful Soup encountered errors with some Twitter URLs. To solve this problem,
the Selenium WebDriver2 was used as a fallback. Because extracting titles with
Selenium is much more time-consuming than with Beautiful Soup, this method is only triggered
when an error occurs. Many Twitter URLs redirected to different pages. To
avoid extracting the wrong title, the program waits five seconds after the URL is loaded before
extracting the title.</p>
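        <p>As a self-contained illustration of the extraction step described above, the sketch below pulls a page title out of raw HTML using Python's built-in html.parser. Note that the actual study used Beautiful Soup, with Selenium WebDriver as a fallback for redirecting Twitter URLs; the class and function names here are ours, not the authors'.</p>
        <preformat>
```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the title element of an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Accumulate only the character data that falls inside the title tag.
        if self.in_title:
            self.title += data

def extract_title(html):
    """Return the whitespace-trimmed page title, or an empty string."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```
        </preformat>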
        <p>1: https://www.crummy.com/software/BeautifulSoup/; 2: https://www.selenium.dev/documentation/webdriver/</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Using similarity measures</title>
        <p>
          After retrieving the titles, we removed stop words and converted each title to lowercase. We
then used a well-known textual similarity metric to compare the titles of the archived websites
to those of the live websites: cosine similarity. We based our choice of cosine similarity partly
on its previous, successful usage in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Cosine similarity is a commonly used metric that is not sensitive to high-frequency words. It
measures the angle between two vectors x and y using the formula cos(x, y) = (x · y) / (‖x‖ ‖y‖).
The values calculated by cosine similarity range from 0, for vectors that do not share any terms,
to 1, for vectors that are identical, to -1, for vectors that point in opposite directions [18].</p>
        <p>
          Our assumption was that the greater the difference between the title of an archived website
and that of its live counterpart, the more likely content drift had taken place (off-topic). However,
we needed a threshold value to classify each URL pair as "on-topic" or "off-topic". We initially
experimented with higher threshold values of 0.7 and 0.8, after the work of [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. However, we
found that, over time, some websites had lengthened or shortened their titles slightly, but had
remained on-topic, thus resulting in some false positives. We eventually settled on a threshold
value of 0.6, which gave us the best performance for cosine similarity. Running the code to both
extract the website titles and perform the similarity calculations took several hours for each
collection, and the resulting text files were 232 KB (INM), 180 KB (FMW), and 20 KB (WCA) in
size. This is a much smaller footprint than that of the much larger and slower WARC files.
        </p>
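        <p>The pipeline described above (lowercasing, stop-word removal, cosine similarity, and the 0.6 threshold) can be sketched in a few lines of Python. This is a minimal bag-of-words illustration rather than the authors' code, and the tiny stop-word list is a stand-in for whatever list was actually used.</p>
        <preformat>
```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on"}  # illustrative subset

def vectorize(title):
    # Lowercase, split on whitespace, drop stop words, count term frequencies.
    return Counter(w for w in title.lower().split() if w not in STOP_WORDS)

def cosine_similarity(a, b):
    # cos(x, y) = (x · y) / (‖x‖ ‖y‖) over the two titles' term-frequency vectors.
    va, vb = vectorize(a), vectorize(b)
    dot = sum(va[w] * vb[w] for w in va if w in vb)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def classify(archived_title, live_title, threshold=0.6):
    # Title pairs whose similarity falls below the threshold are flagged as drifted.
    if cosine_similarity(archived_title, live_title) >= threshold:
        return "on-topic"
    return "off-topic"
```
        </preformat>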
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>In this section, we present the results of the similarity calculations between the titles of the
archived websites and those of their live counterparts. Table 2 presents the confusion matrix
values for the cosine similarity calculations. The following metrics are provided:</p>
      <sec id="sec-4-1">
        <title>Confusion matrix categories</title>
        <p>1. True Positives (TP): URLs that were on-topic and judged to be on-topic
2. False Negatives (FN): URLs that were off-topic but judged to be on-topic
3. False Positives (FP): URLs that were on-topic but judged to be off-topic
4. True Negatives (TN): URLs that were off-topic and judged to be off-topic</p>
        <p>Since our intent is to be able to detect off-topic URLs, we were particularly interested in
keeping the number of false negatives (FNs) as low as possible.</p>
        <p>Table 3 presents the evaluation metrics and results for the cosine similarity calculations, as
compared to human judgements of content drift. Accuracy is the fraction of both on-topic
and off-topic URLs that were correctly classified, or Accuracy = (TP + TN) / (TP + FP + FN + TN). Precision
is the fraction of retrieved URLs that were correctly classified as being on-topic, defined as
Precision = TP / (TP + FP), while recall is the fraction of on-topic URLs that were retrieved, defined
as Recall = TP / (TP + FN). The F-measure is the harmonic mean of precision and recall, or
F-measure = 2TP / (2TP + FP + FN).</p>
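        <p>For concreteness, the four metrics above can be computed directly from the confusion-matrix counts; the helper below is a generic illustration, not the authors' evaluation script.</p>
        <preformat>
```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f_measure
```
        </preformat>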
        <p>Overall, good performance was achieved, with high or medium-high values of accuracy,
precision, recall, and F-measure. Because the recall reflects the fraction of off-topic
websites that were detected, we were particularly interested in achieving high levels of it.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we presented a simple method for detecting content drift on the live web based
on comparing the titles of live websites to those of their archived versions. Our assumption
was that the greater the difference between the title of an archived website and that of its live
counterpart, the more likely content drift had taken place. Our proposed method achieved high
values of accuracy, precision, recall, and F-measure, and has the following advantages:
• It is highly consistent with human judgements of content drift.
• It is quicker and less computationally intensive than other methods which require the
extraction and comparison of the full text of archived websites.
• It does not require access to the WARC files which contain the archived websites, which
are large and require much storage space.</p>
      <p>Having simple methods such as the one presented in this paper can allow institutions or
researchers to quickly and effectively detect content drift without needing many technological
resources. In the future, we wish to apply this method for detecting content drift to larger
web archives, and seek to refine and improve its performance without sacrificing its speed and
simplicity.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Thanks to Shawn M. Jones and Michael L. Nelson for some of the ideas that inspired this
work. The research in this paper was supported in part by funding from the Social Sciences and
Humanities Research Council of Canada.</p>
      <p>[15] University of Alberta, Idle No More collection, n.d. URL: https://archive-it.org/collections/3490.
[16] University of Alberta, Fort McMurray wildfire 2016 collection, 2016. URL: https://archive-it.org/collections/7368.
[17] University of Alberta, Western Canadian Arts collection, n.d. URL: https://archive-it.org/collections/6296.
[18] D. Jurafsky, J. H. Martin, Speech and Language Processing, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 2008.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jones</surname>
          </string-name>
          , H. Van de Sompel,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <article-title>Scholarly context adrift: Three out of four uri references lead to changed content</article-title>
          ,
          <source>PLOS ONE 11</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          . URL: https://doi.org/10.1371/journal.pone.0167475. doi:10.1371/journal.pone.0167475.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Meneses</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Furuta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shipman</surname>
          </string-name>
          , Identifying “soft 404”
          <article-title>error pages: Analyzing the lexical signatures of documents in distributed collections</article-title>
          , in: P. Zaphiris,
          <string-name>
            <given-names>G.</given-names>
            <surname>Buchanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loizides</surname>
          </string-name>
          (Eds.),
          <source>Theory and Practice of Digital Libraries</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2012</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Archive-It</surname>
          </string-name>
          , Learn more,
          <year>2020</year>
          . URL: https://archive-it.org/learn-more.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Alnoamany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Weigle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Nelson</surname>
          </string-name>
          ,
          <article-title>Detecting off-topic pages within timemaps in web archives</article-title>
          ,
          <source>International Journal on Digital Libraries</source>
          <volume>17</volume>
          (
          <year>2016</year>
          )
          <fpage>203</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>International</given-names>
            <surname>Internet Preservation Consortium</surname>
          </string-name>
          ,
          <source>The WARC format 1.1</source>
          , n.d. URL: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pennock</surname>
          </string-name>
          , G. Flake,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krovetz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Coetzee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Glover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <article-title>Persistence of web references in scientific research</article-title>
          ,
          <source>Computer</source>
          <volume>34</volume>
          (
          <year>2001</year>
          )
          <fpage>26</fpage>
          -
          <lpage>31</lpage>
          . doi:10.1109/2.901164.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Koehler</surname>
          </string-name>
          ,
          <article-title>Web page change and persistence-a four-year longitudinal study</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>53</volume>
          (
          <year>2002</year>
          )
          <fpage>162</fpage>
          -
          <lpage>171</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.10018. doi:10.1002/asi.10018.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alemneh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Reyes</given-names>
            <surname>Ayala</surname>
          </string-name>
          ,
          <article-title>Analysis of url references in etds: a case study at the university of north texas</article-title>
          ,
          <source>Library Management</source>
          <volume>35</volume>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Phillips</surname>
          </string-name>
          , H. Van de Sompel,
          <article-title>Analyzing the persistence of referenced web resources with memento</article-title>
          , Austin, TX, USA,
          <year>2011</year>
          . URL: http://digital.library.unt.edu/ark: /67531/metadc39318/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klein</surname>
          </string-name>
          , H. Van de Sompel,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Balakireva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tobin</surname>
          </string-name>
          , Scholarly Context Not Found:
          <article-title>One in Five Articles Suffers from Reference Rot</article-title>
          ,
          <source>PLoS ONE 9</source>
          (
          <year>2014</year>
          )
          <article-title>e115253+</article-title>
          . doi:10.1371/journal.pone.0115253.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zittrain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bowers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stanton</surname>
          </string-name>
          ,
          <source>The Paper of Record Meets an Ephemeral Web: An Examination of Linkrot and Content Drift within The New York Times, Research Report</source>
          , Berkman Klein Center for Internet &amp; Society at Harvard University,
          <year>2021</year>
          . doi:10.2139/ssrn.3833133.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <article-title>Ten years of the UK Web Archive: what have we saved?</article-title>
          ,
          <year>2015</year>
          . URL: https://anjackson.net/2015/04/27/what-have-we-saved-iipc-ga-2015/.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>I. Milligan</surname>
          </string-name>
          , Demystifying the WARC: Research use of web archives,
          <year>2022</year>
          . URL: https://archive.org/details/demystifying-the-warc-research-use-of-web-archives-slides.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] University of Alberta Library, Digital preservation services, n.d. URL: https://www.library. ualberta.ca/digital-initiatives/preservation.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>