Detecting Content Drift on the Web Using Web Archives and Textual Similarity (short paper)

Brenda Reyes Ayala*, Qiufeng Du and Juyi Han
University of Alberta, School of Library and Information Studies, 11210 87 Ave, Edmonton, Alberta T6G 2G5, Canada

Abstract
Content drift, which occurs when a website's content changes and moves away from the content it originally referenced, is a problem that affects both live websites and web archives. Content drift can also occur when the page has been hacked, its domain has expired, or the service has been discontinued. In this paper, we present a simple method for detecting content drift on the live web based on comparing the titles of live websites to those of their archived versions. Our assumption was that the greater the difference between the title of an archived website and that of its live counterpart, the more likely it was that content drift had taken place. In order to test our approach, we first had human evaluators manually judge websites from three Canadian web archives to determine whether or not content drift had occurred. We then extracted the titles from all websites and used cosine similarity to compare the titles of the live websites to those of the archived websites. Our approach achieved positive results, with an accuracy of 85.2%, a precision of 89.3%, a recall of 92.1%, and an F-measure of 90.7%. Simple methods such as the one presented in this paper can allow institutions or researchers to quickly and effectively detect content drift without needing many technological resources.

Keywords
web archiving, cultural heritage, relevance, reference rot, link rot, content drift

TPDL2022: 26th International Conference on Theory and Practice of Digital Libraries, 20-23 September 2022, Padua, Italy
* Corresponding author: brenda.reyes@ualberta.ca (B. Reyes Ayala); qiufeng@ualberta.ca (Q. Du); juyi@ualberta.ca (J. Han). ORCID: 0000-0002-9342-3832 (B. Reyes Ayala)

1. Introduction

The ephemeral and transient nature of the web has been well established. For many institutions that seek to preserve their online cultural heritage, the process of web archiving can seem like a race against time as web archivists struggle to capture websites before they disappear from the web. The danger of websites disappearing from the web altogether is part of a larger problem known as reference rot, which has two components, as identified by [1]:

1. Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
2. Content drift: The resource identified by a URI changes over time. The resource's content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.

The process of web archiving arose in part to combat link rot; however, the subtler and more insidious problem of content drift persists both on the live web and in web archives. The study of content drift is complicated by the appearance of soft 404s. A soft 404 is an incorrect HTTP status code of 200 (OK) or 3xx (redirect) that masks a correct status code of 404 (Not Found). It occurs when websites redirect failed URLs to the site's homepage, thus masking the standard 404 return code that would otherwise signal a failure to access a web resource [2].
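As an illustration, this redirect behaviour can be probed with a simple heuristic: request a path on the same host that almost certainly does not exist and see whether the server admits the failure. The following is a minimal sketch in Python; the function name and the probe strategy are illustrative assumptions of ours, not the lexical-signature classifier later developed in [2].

```python
# Minimal heuristic sketch for probing soft-404 behaviour.
# Assumptions: the `requests` library is installed; `probe_soft_404`
# and its probe strategy are illustrative, not the method of [2].
import uuid
from urllib.parse import urljoin, urlparse

import requests


def probe_soft_404(url: str, timeout: float = 10.0) -> bool:
    """Return True if the host of `url` appears to mask real 404s."""
    # Request a sibling path that almost certainly does not exist.
    nonsense = urljoin(url, "/" + uuid.uuid4().hex)
    resp = requests.get(nonsense, timeout=timeout, allow_redirects=True)
    parsed = urlparse(url)
    homepage = f"{parsed.scheme}://{parsed.netloc}/"
    # An honest server answers 404; a 200, or a redirect that lands on
    # the homepage, suggests that failed URLs are being masked.
    landed_on_homepage = resp.url.rstrip("/") + "/" == homepage
    return resp.status_code == 200 or landed_on_homepage
```

A page flagged this way is only a candidate soft 404; confirming that its content has actually drifted requires comparing the content itself, which motivates the approach presented in this paper.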
Content drift can also occur when the page has been hacked, its domain has expired, or the service has been discontinued. Many web archives, such as those created using the Internet Archive's Archive-It service [3], are topic-specific or thematic, in that they collect and preserve many websites that cover a single topic or news event, such as human rights or the COVID-19 pandemic. If a live website is affected by content drift and the site is then crawled, the resulting archived webpage, and the web archive that contains it, will also have content drift. Within web archives, webpages affected by content drift can also be referred to as being "off-topic", and are defined as those "that have changed through time to move away from the initial scope of the page, which should be relevant to the topic of the collection" [4].

It is very useful for both web archivists and researchers to be able to determine whether a live website has been affected by content drift. Web archivists who preserve websites for their institutions can, if they see that a URI has drifted, decide whether or not they wish to keep crawling it. Since web archiving technologies and services usually entail a significant investment of time, money, and resources, institutions can avoid some of these costs by ceasing to crawl websites that have drifted. Detecting content drift on the live web can thus prevent it from occurring in web archives. Furthermore, researchers who study reference rot on the web could use our method as a way of detecting soft 404s on the live web. The presence of soft 404s can lead researchers to underestimate the occurrence of reference rot, since these pages do not return the typical 404 "Not Found" error.

In this paper, we present an approach to detecting content drift on the live web. The purpose of this research is to find an approach to detecting content drift that simulates a human evaluator inspecting a website and determining whether content drift has occurred. In the past, researchers have developed accurate methods for detecting content drift (discussed in Section 2); however, many of these approaches are computationally intensive and, in the case of web archives, may require access to the Web ARChive (WARC) files that store the archived pages [5]. We take a different approach based on a very simple assumption: that changes in the title of a website are indicative of changes in its content, and thus that large changes in the title of a website may be indicative of content drift. This approach is quick and is suitable for researchers or other institutions that may not have access to the WARC files they are studying or to the computational power required to perform these calculations.

2. Previous work

The prevalence of reference rot in publications and on the web is a topic that has been well documented. As early as 2001, the authors of [6] noted that improved citation practices were necessary to minimize the future loss of information. One of the first studies to quantify reference rot was by [7], who monitored the status of a random set of URIs over four years. His results showed that approximately 67% of URIs became inaccessible after a four-year period. The increased use of URLs and URIs in online publications has in part fueled the increase in reference rot over time.
According to a study [8], the share of Electronic Theses and Dissertations (ETDs) that include URL references increased from 23% in 1999 to 80% in 2012. In a study of the persistence of web resources in the arXiv repository and the University of North Texas (UNT) Digital Library, [9] found that 45% of the URLs referenced from arXiv still exist but are not preserved for future generations, and that 28% of the resources referenced by UNT papers have been lost. In a 2014 paper, [10] investigated how reference rot impacted the ability to revisit the context of scholarly articles after they had been published. The authors extracted the URIs referenced in a collection of 3.5 million scholarly articles from Science, Technology, and Medicine (STM) fields, and observed that one out of five articles suffered from reference rot, meaning that it is impossible to revisit the web context that surrounded them some time after their publication. In 2021, researchers at Harvard Law School examined New York Times articles from 1996 to 2019. They found that link rot had increased linearly over time, that out of over 2 million hyperlinks 25% were inaccessible, and that over 13% of the links that were still reachable had suffered content drift [11].

Other researchers have employed web archives to study the nature of reference rot. A study of reference rot in the UK web archives of the British Library found that, of over 1,000 archived URIs, 40% were gone from the live web after two years, while another 40% had been affected by content drift [12]. In 2012, [2] developed a Naïve Bayes classifier that could detect soft 404 pages with a precision of 99% and a recall of 92%. In [4], the authors compiled three different Archive-It collections and experimented with several methods of detecting off-topic webpages, and with how to define thresholds that separate the on-topic from the off-topic pages. This involved comparing the text (after pre-processing, stemming, and stopword removal) of the archived website when it was first captured ($URI\text{-}R@t_0$) with the text of the archived website as captured at a later time ($URI\text{-}R@t$). The authors tested a variety of methods and found that cosine similarity proved the best at detecting off-topic web pages, with an average accuracy of 0.983, an F-measure (the harmonic mean of precision and recall) of 0.881, and an Area Under the Curve (AUC) of 0.961. The second-best performing measure was word count. The authors also experimented with combining several similarity measures in an attempt to increase performance. The combination of the cosine similarity and word count methods yielded the best results, with an accuracy of 0.987, $F = 0.906$, and $AUC = 0.968$ [4].

[1] examined content drift in the same collection used by [10]. They first extracted the URIs referenced in the collection of articles, then obtained the archived versions (snapshots) of these URIs whenever available. The text of these archived websites was extracted, and textual similarity measures were used to compare their content to that of their live web counterparts. The authors found that representative snapshots exist for about 30% of all URI references, and that for over 75% of references the content had drifted away from what it was when referenced. A high degree of both link rot and content drift was detected in the scholarly collection.

3. Methodology

As was seen in Section 2, researchers have used web archives to study the notion of reference rot on the web.
Past work on detecting content drift in web archives has focused on comparing the extracted text of some archived websites to the extracted text of other archived websites. This necessitates access to the WARC files that contain the archived websites in the first place. WARC is a container file format that can store a very large number of data objects, including audio and video resources, inside a single, compressed file. In web archiving, WARC files are used to store content that has been harvested from the web via web crawlers [5]. Most WARC files are quite large, ranging from a few gigabytes to many terabytes in size, which makes them cumbersome and slow to traverse and analyze. Extracting content from WARC files requires pre-processing steps such as text extraction, stop-word removal, and stemming. As noted in [1], extracting text from HTML and PDF files proved substantially time-consuming and arduous, and necessitated the writing of custom code beyond the pre-existing code already available for the task. As [13] states, analyzing WARC files often means working with hundreds of terabytes of data. This necessitates large amounts of storage space, memory, and a robust research infrastructure, which many cultural heritage institutions do not have access to.

3.1. The dataset

We used the following three web archive collections to gather test data. These collections were created and are maintained by the University of Alberta Libraries using the Archive-It subscription service, in an effort to preserve western Canadian cultural heritage on the web [14].

1. Idle No More (INM): websites related to "Idle No More", a Canadian political movement encompassing environmental concerns and the rights of indigenous communities [15].
2. Fort McMurray Wildfire 2016 (FMW): websites related to the Fort McMurray Wildfire of 2016 in the province of Alberta, Canada [16].
3. Western Canadian Arts (WCA): born-digital resources created by filmmakers in Western Canada [17].

The Idle No More and Fort McMurray Wildfire collections consist mostly of news articles and social media posts, primarily from Twitter. As a result, we expected both of these collections to suffer significantly more from content drift due to frequent webpage changes. Since the WCA collection includes the personal websites of many artists, we expected this particular collection to have much less content drift.

Figure 1: Screenshots of a URL of a news article from the INM collection (archived website from January 29, 2013; live website as of June 5, 2022). The archived website originally contained a National Post article about the movement. The live website now redirects to the homepage of the newspaper, showing that content drift has occurred.

3.2. Evaluation of content drift

In order to determine the amount of content drift in our web archive collections, all three authors manually inspected each of the live websites and compared them to their archived versions. Each capture of each website was evaluated by inspecting its content as well as its look and feel. A website was labeled "off-topic" if it had been affected by content drift and "on-topic" otherwise. Most captures were judged, except in a few cases where the archived version was of very poor quality and the evaluators were not able to determine whether content drift had taken place.

Figure 1 shows an example of content drift from the INM collection. The archived website originally contained a news article about the movement, but the live website now redirects to the homepage of the newspaper, showing that content drift has occurred. Were it not for an archived copy of the page, the article would have been lost forever.

The details for each collection are given in Table 1. Overall, about a quarter of the data set had been affected by content drift, with the FMW collection experiencing 33.2% content drift. The INM collection had undergone surprisingly little content drift (9.6%). The judgements as to whether a website was on-topic or off-topic were used as the ground truth for our next steps.

Table 1: Details of the collections

Collection   No. seeds   No. captures   No. judged captures   % Content drift
INM                 73           863                   784              9.6%
FMW                 37           618                   618             33.2%
WCA                 86            95                    94             11.6%
Total              196          1576                  1496             25.1%

3.3. Extracting titles

In the title extraction process, the Python library Beautiful Soup¹ was initially applied. However, Beautiful Soup encountered errors with some Twitter URLs. To solve this problem, Selenium WebDriver² was used as a fallback. Because extracting titles with Selenium is much more time-consuming than with Beautiful Soup, this method was only triggered when errors occurred. Many Twitter URLs redirected to different pages; to avoid extracting the wrong title, the program waits five seconds after the URL has loaded before extracting the title. A condensed sketch of this procedure is shown below.

¹ https://www.crummy.com/software/BeautifulSoup/
² https://www.selenium.dev/documentation/webdriver/
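The following sketch condenses the procedure just described, assuming pages are fetched with the `requests` library and that a Firefox driver is available to Selenium; the scripts we actually used differ in their error handling.

```python
# Condensed sketch of the title-extraction step. Assumptions: pages
# are fetched with `requests`, and a Firefox driver is available to
# Selenium; the production scripts differ in their error handling.
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver


def extract_title(url: str) -> str:
    try:
        # Fast path: fetch the page and read <title> with Beautiful Soup.
        html = requests.get(url, timeout=10).text
        tag = BeautifulSoup(html, "html.parser").title
        if tag is None or not tag.get_text(strip=True):
            raise ValueError("no usable <title> element")
        return tag.get_text(strip=True)
    except Exception:
        # Fallback: drive a real browser so that client-side redirects
        # (e.g., on Twitter URLs) settle before the title is read.
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            time.sleep(5)  # wait out redirects, as described above
            return driver.title
        finally:
            driver.quit()
```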
3.4. Using similarity measures

After retrieving the titles, we removed stop words and converted each title to lowercase. We then used a well-known textual similarity metric, cosine similarity, to compare the titles of the archived websites to those of the live websites. We based our choice of cosine similarity partly on its previous, successful use in [4] and [1]. Cosine similarity is a commonly used metric that is not sensitive to high-frequency words. It measures the angle between two vectors using the formula

$k(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$

The values calculated by cosine similarity range from 1, for vectors that are identical, through 0, for vectors that do not share any terms, to -1, for vectors that point in opposite directions [18].

Our assumption was that the greater the difference between the title of an archived website and that of its live counterpart, the more likely it was that content drift had taken place (off-topic). However, we needed a threshold value to classify each URL pair as "on-topic" or "off-topic". We initially experimented with the higher threshold values of 0.7 and 0.8, following the work of [4]. However, we found that, over time, some websites had lengthened or shortened their titles slightly but had remained on-topic, resulting in some false positives. We eventually decided on a threshold value of 0.6, which gave us the best performance for cosine similarity; a minimal sketch of this comparison step is given below.

Running the code to both extract the website titles and perform the similarity calculations took several hours for each collection, and the resulting text files were 232 KB (INM), 180 KB (FMW), and 20 KB (WCA) in size. This is a much smaller footprint than that of the much larger and slower WARC files.
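The sketch below illustrates the comparison step, assuming scikit-learn is available and using its built-in English stop-word list as a stand-in for the one we used; tokenization details may differ from our pipeline.

```python
# Minimal sketch of the title-comparison step. Assumptions:
# scikit-learn is available, and its built-in English stop-word list
# stands in for the one we used; tokenization details may differ.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

THRESHOLD = 0.6  # the value that performed best in our experiments


def classify_pair(archived_title: str, live_title: str) -> str:
    """Label an (archived, live) title pair as on-topic or off-topic."""
    # Lowercase the titles, drop stop words, and build term-count vectors.
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    vectors = vectorizer.fit_transform([archived_title, live_title])
    # A similarity below the threshold indicates likely content drift.
    score = cosine_similarity(vectors[0], vectors[1])[0, 0]
    return "on-topic" if score >= THRESHOLD else "off-topic"


# Hypothetical titles, for illustration only:
print(classify_pair("Wildfire forces evacuation of Fort McMurray",
                    "Page not found | Example News"))  # -> "off-topic"
```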
4. Results and Discussion

In this section, we present the results of the similarity calculations between the titles of the archived websites and those of their live counterparts. Table 2 presents the confusion matrix values for the cosine similarity calculations, where "were" refers to the human ground-truth judgements and "classified" refers to the output of our method:

1. True Positives (TP): URLs that were on-topic and were classified as on-topic
2. False Negatives (FN): URLs that were off-topic but were classified as on-topic
3. False Positives (FP): URLs that were on-topic but were classified as off-topic
4. True Negatives (TN): URLs that were off-topic and were classified as off-topic

Table 2: Confusion matrix

Collection     TP    FN    FP    TN   Total
INM           585    24   124    51     784
FMW           412    68     1   137     618
WCA            80     1     4     9      94
Overall      1077    93   129   197    1496

Since our intent is to be able to detect off-topic URLs, we were particularly interested in keeping the number of false negatives (FNs), that is, off-topic URLs misclassified as on-topic, as low as possible.

Table 3 presents the evaluation metrics and results for the cosine similarity calculations, as compared to the human judgements of content drift. Accuracy is the fraction of both on-topic and off-topic URLs that were correctly classified:

$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$

Precision is the fraction of retrieved URLs that were correctly classified as being on-topic, while recall is the fraction of on-topic URLs that were retrieved:

$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$

The F-measure is the harmonic mean of precision and recall:

$F\text{-measure} = \frac{2TP}{2TP + FP + FN}$

Table 3: Evaluation results for the collections (in %)

Collection   Accuracy   Precision   Recall   F-measure
INM              81.1        82.5     96.1        88.8
FMW              88.8        99.8     85.8        92.3
WCA              94.7        95.2     98.8        97.0
Overall          85.2        89.3     92.1        90.7

Overall, good performance was achieved, with high or medium-high values of accuracy, precision, recall, and F-measure. Because a high recall corresponds to a low number of false negatives, that is, to few off-topic websites escaping detection, we were particularly interested in achieving high levels of it.
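As a check on the arithmetic, the figures in Table 3 can be recomputed directly from the confusion-matrix counts in Table 2; the short sketch below does so for the FMW row.

```python
# Recomputing the Table 3 metrics from the Table 2 counts (FMW row).
def evaluate(tp: int, fn: int, fp: int, tn: int) -> dict:
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f-measure": f_measure}


print(evaluate(tp=412, fn=68, fp=1, tn=137))
# accuracy 0.888, precision 0.998, recall 0.858, f-measure 0.923,
# matching the FMW row of Table 3.
```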
5. Conclusion

In this paper, we presented a simple method for detecting content drift on the live web based on comparing the titles of live websites to those of their archived versions. Our assumption was that the greater the difference between the title of an archived website and that of its live counterpart, the more likely it was that content drift had taken place. Our proposed method achieved high values of accuracy, precision, recall, and F-measure, and has the following advantages:

• It is highly consistent with human judgements of content drift.
• It is quicker and less computationally intensive than other methods that require the extraction and comparison of the full text of archived websites.
• It does not require access to the WARC files that contain the archived websites, which are large and require much storage space.

Simple methods such as the one presented in this paper can allow institutions or researchers to quickly and effectively detect content drift without needing many technological resources. In the future, we wish to apply this method for detecting content drift to larger web archives, and we seek to refine and improve its performance without sacrificing its speed and simplicity.

Acknowledgments

Thanks to Shawn M. Jones and Michael L. Nelson for some of the ideas that inspired this work. The research in this paper was supported in part by funding from the Social Sciences and Humanities Research Council of Canada.

References

[1] S. M. Jones, H. Van de Sompel, H. Shankar, M. Klein, R. Tobin, C. Grover, Scholarly context adrift: Three out of four URI references lead to changed content, PLOS ONE 11 (2016) 1–32. URL: https://doi.org/10.1371/journal.pone.0167475. doi:10.1371/journal.pone.0167475.
[2] L. Meneses, R. Furuta, F. Shipman, Identifying "soft 404" error pages: Analyzing the lexical signatures of documents in distributed collections, in: P. Zaphiris, G. Buchanan, E. Rasmussen, F. Loizides (Eds.), Theory and Practice of Digital Libraries, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 197–208.
[3] Archive-It, Learn more, 2020. URL: https://archive-it.org/learn-more.
[4] Y. AlNoamany, M. C. Weigle, M. L. Nelson, Detecting off-topic pages within TimeMaps in web archives, International Journal on Digital Libraries 17 (2016) 203–221.
[5] International Internet Preservation Consortium, The WARC format 1.1, n.d. URL: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/.
[6] S. Lawrence, D. Pennock, G. Flake, R. Krovetz, F. Coetzee, E. Glover, F. Nielsen, A. Kruger, C. Giles, Persistence of web references in scientific research, Computer 34 (2001) 26–31. doi:10.1109/2.901164.
[7] W. Koehler, Web page change and persistence—a four-year longitudinal study, Journal of the American Society for Information Science and Technology 53 (2002) 162–171. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.10018. doi:10.1002/asi.10018.
[8] M. Phillips, D. Alemneh, B. Reyes Ayala, Analysis of URL references in ETDs: a case study at the University of North Texas, Library Management 35 (2014).
[9] R. Sanderson, M. E. Phillips, H. Van de Sompel, Analyzing the persistence of referenced web resources with Memento, Austin, TX, USA, 2011. URL: http://digital.library.unt.edu/ark:/67531/metadc39318/.
[10] M. Klein, H. Van de Sompel, R. Sanderson, H. Shankar, L. Balakireva, K. Zhou, R. Tobin, Scholarly context not found: One in five articles suffers from reference rot, PLoS ONE 9 (2014) e115253+. doi:10.1371/journal.pone.0115253.
[11] J. Zittrain, J. Bowers, C. Stanton, The Paper of Record Meets an Ephemeral Web: An Examination of Linkrot and Content Drift within The New York Times, Research Report, Berkman Klein Center for Internet & Society at Harvard University, 2021. doi:10.2139/ssrn.3833133.
[12] A. N. Jackson, Ten years of the UK Web Archive: what have we saved?, 2015. URL: https://anjackson.net/2015/04/27/what-have-we-saved-iipc-ga-2015/.
[13] I. Milligan, Demystifying the WARC: Research use of web archives, 2022. URL: https://archive.org/details/demystifying-the-warc-research-use-of-web-archives-slides.
[14] University of Alberta Library, Digital preservation services, n.d. URL: https://www.library.ualberta.ca/digital-initiatives/preservation.
[15] University of Alberta, Idle No More collection, n.d. URL: https://archive-it.org/collections/3490.
[16] University of Alberta, Fort McMurray Wildfire 2016 collection, 2016. URL: https://archive-it.org/collections/7368.
[17] University of Alberta, Western Canadian Arts collection, n.d. URL: https://archive-it.org/collections/6296.
[18] D. Jurafsky, J. H. Martin, Speech and Language Processing, 2nd ed., Prentice Hall, Upper Saddle River, NJ, 2008.