“I’m not sure how feasible capture is”: archivability as a dimension of website quality

Brenda Reyes Ayala1 (Associate Professor)
1 University of Alberta, Faculty of Education, 11210 87 Ave, Edmonton AB T6G 2G5, Canada

Abstract
This paper presents the results of a study of the quality of archived websites using support tickets from the Internet Archive’s Archive-It (AIT) service, currently the most widely used web archiving service. The study uses grounded theory to create a definition of quality for archived websites that is composed of three dimensions: correspondence, relevance, and archivability. The focus of this paper is on website archivability, which I redefine as the intrinsic properties of a website that make it easier or more difficult to archive. I argue that archivability is a latent construct of information quality expressed by the manifest variables of correspondence and relevance, a finding not previously seen in the literature. The definition is independent of the technology currently in use to create web archives, making it suitable for a wide variety of platforms, preservation contexts, and situations. Furthermore, the paper describes how low levels of website archivability influence the quality of web archives as historical records.

Keywords
web archiving, web archives, archivability, digital preservation, website quality

1. Introduction
Since the 1990s, many cultural heritage institutions, such as museums, archives, and libraries, have undertaken the task of creating a historical record of the web through the practice of web archiving. Web archivists who preserve websites are concerned both with quantity and with quality. Their goal is to create archived websites that are as close as possible in appearance and functionality to the original, live website. Failing to adequately capture a website might mean a flawed or incomplete historical record of it. The importance of creating high-quality web archives was echoed by the results of the 2016 survey on web archiving in the United States, conducted by the National Digital Stewardship Alliance (NDSA). When asked about their top concerns when developing a web archiving program at their respective institutions, 52% of participants cited quality as a top issue. Quality was the third most-cited concern for respondents, after cost and access and use (60% each) [1].

The most popular web archiving service is the Internet Archive’s Archive-It (AIT), which helps organizations build and manage their own web archives. It currently has over 800 partners, consisting of universities, archives, museums, and libraries in over 24 countries [2]. In 2022, the National Digital Stewardship Alliance (NDSA) conducted another survey of web archiving practices worldwide [3]. The authors found that, of the over 190 institutions that had web archiving programs in place, 71% captured content with Archive-It. Archive-It is popular with many institutions throughout the world, which have entrusted it with the creation and management of their web archives.

Despite the popularity of Archive-It and the Internet Archive, web archiving is a field with few conceptual tools or theoretical definitions. In a previous work [4] I presented a grounded theory of Information Quality (IQ) for web archives, derived from an analysis of tickets submitted to the Internet Archive’s AIT service by web archivists. Its goal was to create a theory of IQ that is both human-centered and independent of the technology currently in use to create web archives.
This theory consists of three dimensions (or core categories) that determine the quality of a web archive: correspondence, relevance, and archivability, along with their subdimensions:

1. Correspondence: degree of similarity, or resemblance, between the original website and the archived website
• Visual correspondence: similarity in appearance between the original website and the archived website
• Interactional correspondence: the degree to which a user’s interaction with the archived website is similar to that of the original
• Completeness: the degree to which the archived website contains all of the components of the original
2. Relevance: pertinence of the contents of an archived website to the original website
• Topic relevance: degree to which an archived website (or a web archive) includes only content that is closely related to that of the original website or the topic of the larger web archive
• Size relevance: the similarity in size of the archived website to the original website
3. Archivability: degree to which the intrinsic properties of a website make it easier or more difficult to archive

A web archive can be said to have high quality if it has high correspondence, high relevance, and high archivability. Due to the heavy presence of the dimension of correspondence in the data set, I asserted that it is the most important facet of quality in a web archive.

This work builds on the theory of IQ for web archives advanced earlier, but turns its attention to archivability, another important dimension. The purpose of this paper is to elaborate on and deepen the original definition of archivability, place it in a human-centred context, and predict how it will affect the future of web archiving. This goal leads to the following research questions:

1. How do people perceive the notion of archivability in web archives?
2. How does website archivability affect web archives and thus the future historical record?

2. Previous Work: “Damaged” web archives and archivability
When deploying crawlers to capture a website, some crawl engineers pay special attention to embedded resources. Embedded resources are files, such as images, videos, or CSS stylesheets, that are present and referenced in a website. A user might not notice their presence, but embedded resources play a key role in ensuring the website looks and operates in the correct way. To this end, crawl engineers might calculate the percentage of missing embedded resources, $M_m$, in an archived website and use it to estimate the overall quality of the site. Brunelle, Kelly, SalahEldeen, Weigle, and Nelson [5] showed that $M_m$ is not always consistent with human judgments of the quality of an archived website and is thus not a suitable metric for measuring the damage to an archived website caused by missing embedded resources. Instead, the authors proposed a new metric to assess this damage that is based on three factors: the MIME type, size, and location of the embedded resource [5].
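As an illustration of the simpler percentage-based measure (not the damage metric of [5]), the following minimal sketch computes the share of a page’s embedded resources that fail to resolve in an archive. The resource list, the use of HTTP HEAD requests, and the example URLs are assumptions made for the sketch, not part of the cited work.

import requests

def missing_resource_percentage(embedded_resource_urls):
    # Share of embedded resources that do not resolve (0.0 to 1.0).
    if not embedded_resource_urls:
        return 0.0
    missing = 0
    for url in embedded_resource_urls:
        try:
            # A HEAD request is enough to check whether an archived copy exists.
            response = requests.head(url, allow_redirects=True, timeout=10)
            if response.status_code >= 400:
                missing += 1
        except requests.RequestException:
            missing += 1
    return missing / len(embedded_resource_urls)

# Hypothetical embedded resources (a stylesheet and an image) of one archived page.
resources = [
    "https://web.archive.org/web/2016/http://example.org/style.css",
    "https://web.archive.org/web/2016/http://example.org/logo.png",
]
print(f"M_m = {missing_resource_percentage(resources):.0%}")

The damage metric proposed in [5] goes further by weighting each missing resource by its MIME type, size, and location, which a flat percentage like this cannot capture.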
In their iPres paper “CLEAR: A Credible Method to Evaluate Website Archivability”, [6] introduced the concept of website archivability. Archivability was defined as the "sum of the attributes that make a website amenable to being archived" [6]. The easier it was to archive a website, the greater its archivability. The authors introduced a set of facets designed to determine the archivability of a website, termed the Credible Live Evaluation of Archive Readiness, or CLEAR, method. These facets were: standards compliance, performance, cohesion, and metadata usage. Later the authors expanded on their original work by introducing the CLEAR+ method, the incremental evolution of their original CLEAR method. According to CLEAR+, the archivability of a website depends on the following facets [7]:

• Accessibility ($F_A$): the ease with which a web crawler can visit a site, traverse its entirety, and retrieve it via standard HTTP protocol requests. The website should provide resources so that a web crawler can discover and retrieve its different components (such as individual pages, images, and scripts). This facet also includes performance, or the speed at which a crawler can access the site.
• Standards Compliance ($F_S$): the website and its individual components conform to commonly accepted technical standards. For example, its HTML pages conform to the W3C standards for HTML. It is also important that the website provide content in open file formats, instead of closed, proprietary formats such as QuickTime and Flash.
• Cohesion ($F_C$): the website does not have components, such as images, JavaScript files, and widgets, that are dispersed across different locations on the web.
• Metadata Usage ($F_M$): the website contains descriptive metadata such as HTTP headers and HTML META headers. It is important to note that the authors do not commit to a specific metadata model, but recommend using widely accepted metadata models such as the Dublin Core standards.

Each of these facets has several components, or criteria, each with its own significance. Criteria with high significance are more important to the archivability of a website and, if they are not met, can cause problematic web archiving results or even prevent the website from being archived at all. Medium-significance criteria are not critical but are still important, while low-significance criteria are considered minor issues. [7] stated that a website’s archivability (WA) can be computed as the sum of its weighted facet scores. The value of each facet is the weighted average of its criteria. The website has a score for each facet, represented as a tuple $(x_1, \ldots, x_k, \ldots, x_N)$. The value of $x_k$ is either 0 or 1, which represents a negative or positive answer to a specific criterion. The components of a single facet are not weighted evenly, but are assigned a weight $\omega_k$ depending on their significance; the weighted answers are then summed and divided by the sum of the weights to produce the facet value. Once the value for each facet has been calculated, the total archivability score for the website can be calculated using the following equation:

$WA = \sum_{\lambda \in \{A,S,C,M\}} w_\lambda F_\lambda$

Here $F_A$, $F_S$, $F_C$, and $F_M$ represent the value of each facet with respect to accessibility, standards compliance, cohesion, and metadata usage. [7] also created ArchiveReady, an evaluation system that implements the CLEAR+ model as a web application. ArchiveReady calculates the website’s archivability and presents it to the user as a percentage.
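To make this two-level arithmetic concrete, here is a minimal sketch of the scoring scheme just described. The facet names follow CLEAR+, but every criterion answer and weight below is an invented placeholder, not a value produced by ArchiveReady or taken from [7].

def facet_score(answers, weights):
    # Weighted average of binary criterion answers (x_k) for one facet.
    return sum(a * w for a, w in zip(answers, weights)) / sum(weights)

# Hypothetical per-facet evaluations: (criterion answers x_k, weights w_k).
facets = {
    "accessibility":        ([1, 1, 0], [3, 2, 1]),  # F_A
    "standards_compliance": ([1, 0, 1], [3, 3, 1]),  # F_S
    "cohesion":             ([1, 1],    [2, 1]),     # F_C
    "metadata_usage":       ([0, 1],    [2, 2]),     # F_M
}

# Hypothetical facet weights w_lambda; WA is the weighted sum of facet values.
facet_weights = {"accessibility": 0.4, "standards_compliance": 0.3,
                 "cohesion": 0.2, "metadata_usage": 0.1}

facet_values = {name: facet_score(a, w) for name, (a, w) in facets.items()}
wa = sum(facet_weights[name] * value for name, value in facet_values.items())
print(f"Website archivability WA = {wa:.0%}")

Reporting WA as a percentage mirrors how ArchiveReady presents its result; the choice of facet weights summing to one is an assumption of the sketch.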
Other researchers have also focused on the notion of archivability and attempted to operationalize it. In their paper “The impact of JavaScript on archivability”, [8] defined archivability as the ease with which a website can be archived, which is similar to the concept put forward by [7]. The authors held the current, live version of a website to be the ideal version. Thus, a perfectly archived website is one that replicates the original, live version in its entirety: “The web page in its live, native environment is the best version possible, and if an archival tool replicates the live web, it has perfectly captured and archived that resource” [8]. However, obtaining a perfect copy of the original is an onerous process, made more difficult by the widespread use of the JavaScript programming language. The use of JavaScript, in the form of small pieces of code called scripts, has made websites more personalized and interactive. Unfortunately, it has also made websites more difficult to archive. As the authors state, today’s archival tools, such as the Heritrix web crawler employed by the Internet Archive, are unable to fully capture and render this complexity [8].

A website that contains JavaScript, such as Google Maps, functions differently from a traditional, HTML-only website. Typically, a web browser requests a website from a server, then proceeds to load the basic elements, such as HTML code and images. After the initial page is loaded, the JavaScript code is executed. This code then requests additional components to be loaded onto the page, such as the panning and zooming functions of an interactive map or geographic location features. [8] define these types of websites as deferred representations because they are not “fully realized and constructed until after the client-side representation is rendered”. When attempting to archive such a website, a crawler will usually capture the initial components that are loaded first, but will not capture the other components that are loaded after the JavaScript code is executed. This is because crawlers cannot execute JavaScript code.

To study the impact of JavaScript on archivability, the researchers compiled two sets of archived URLs: some taken from the social media platform Twitter and others from the Internet Archive’s Archive-It service [9]. The authors studied the quality of the archived URLs and their use of the JavaScript language, and presented several metrics to measure their archivability. Each URL had a specific number of client-side components (files which execute on the end user’s computer, such as JavaScript) and server-side components (files which execute on the server). The authors called these components parameters and defined them in Equation 1. The complexity of a single URL was measured as the arithmetic mean of its depth (number of levels down from the top-level domain) and the number of client-side and server-side parameters, as shown in Equation 2. They also counted the number of script tags in a page’s HTML (Equation 3) and the number of resources loaded via JavaScript, that is, resources loaded at render time but not referenced in the HTML tags or CSS (Equation 4).

$F = \max(|\text{client-side parameters}|, |\text{server-side parameters}|)$ (1)

$U_C = \frac{|\text{Depth}| + F}{2}$ (2)

$C_C = \sum \text{script tags} \in \text{HTML}$ (3)

$\text{JavaScript-loaded resources} = \text{Number of resources loaded} - \text{Number of resources in HTML tags and CSS}$ (4)
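The following minimal sketch shows how these four quantities could be computed for a single page; the counts and the example markup are hypothetical, and the way parameters and resources are identified here is a simplification of the measurement procedure in [8].

def parameter_count(client_side_params, server_side_params):
    # Equation 1: F = max(|client-side parameters|, |server-side parameters|).
    return max(len(client_side_params), len(server_side_params))

def url_complexity(depth, f):
    # Equation 2: U_C is the arithmetic mean of the URL's depth and F.
    return (depth + f) / 2

def script_tag_count(html):
    # Equation 3: C_C, the number of <script> tags in the HTML.
    return html.lower().count("<script")

def javascript_loaded_resources(resources_loaded, resources_in_html_and_css):
    # Equation 4: resources fetched at render time minus those named in HTML/CSS.
    return resources_loaded - resources_in_html_and_css

# Hypothetical page: two levels below the top-level domain, three client-side
# scripts, one server-side component, and 25 resources observed at load time,
# of which 18 are referenced directly in the HTML or CSS.
f = parameter_count(["map.js", "tracker.js", "widget.js"], ["search.php"])
print("F =", f)                               # 3
print("U_C =", url_complexity(depth=2, f=f))  # 2.5
print("C_C =", script_tag_count("<html><script src='map.js'></script></html>"))
print("JavaScript-loaded resources =", javascript_loaded_resources(25, 18))

Under this scheme, deeper URLs with more script-driven parameters receive higher complexity scores, and a larger count of JavaScript-loaded resources signals more content that a non-executing crawler will miss.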
Unlike [7], [8] thought of archivability not as a static measurement, but as a dynamic one that changed over time. They found that over half (54.5%) of the URLs in their collection used JavaScript to load embedded resources, an increase of 14.7% between 2005 and 2012. Similarly, JavaScript was responsible for 52.7% of all missing embedded resources during the same time period, an increase of 32.5% [8]. Based on these findings, they concluded that the archivability of websites was being negatively affected by the increasing use of JavaScript, and that in the future the completeness of archived websites would also decrease as a result.

It is worth noting that the research published by [8] focuses on specific, single URLs, not on an entire website, which can consist of dozens or even thousands of URLs. However, it would be reasonable to assume that, if a single web page becomes less archivable the more JavaScript it contains, the same would apply to a complete website and even an entire web archive. The more JavaScript a website contains, the less archivable it is, and the more JavaScript a collection of websites contains, the less archivable they are as a group.

In 2016, [10] examined how popular open-source tools such as the Heritrix crawler and the Wayback Machine can be used to archive a corporate intranet. They found that the crawler was able to successfully crawl many pages; however, it would sometimes accidentally crawl sensitive information, and could not crawl pages that required private user credentials. Another important issue was the inability to correctly archive resources that were constructed with JavaScript, such as YouTube, Facebook, and widget dashboards. To improve the archivability of these resources, [10] recommended using a headless browsing client such as PhantomJS, resulting in slower but more complete crawls of websites.

[11] introduced Retrospective Web Archiving (RWA), an approach for filling in gaps left by events that were not archived in real time when they originally took place. They tested this approach by building a retrospective web archive about the 2014 war in Gaza and found that the ability to retrieve content retrospectively differed across platforms. [11] framed these differences as a matter of “platform archivability”, where some platforms contain links that a) suffered from link rot (meaning they no longer functioned on the live web) and others that b) had low archival coverage, meaning they had never been archived in the first place. They found that social media websites are particularly vulnerable to link rot. Shortened URIs on Twitter were the most likely to suffer from both link rot and low archival coverage, while URIs extracted from Google and Facebook were less vulnerable to these conditions.

In [12], I examined how clients of Archive-It form mental models of websites and web archives that are often at odds with the realities of both web archiving and the web itself. These misconceptions are often not addressed until "something goes wrong", that is, until a problem has occurred and the client needs to contact an AIT employee for help and support. Additionally, these misconceptions can create false expectations about what current web archiving technologies are capable of. If they are not addressed, web archivists run the risk of assuming that everything can be preserved, when this is not the reality.

3. Methodology
This paper focuses on the specific dimension of archivability in web archives, and was part of a larger project to build a comprehensive theory of IQ in web archives. The methodology described in this section is the same as the one employed in my previous work “Correspondence as the Primary Measure of Quality for Web Archives: A Grounded Theory Study” [4]. Though I describe it here, additional details can be found in that work.
Barney Glaser and Anselm Strauss created the methodology of Grounded Theory (GT), which they defined as “the discovery of theory from data - systematically obtained and analysed in social research” [13]. For the authors, theory was not a perfected product that explains all facets of a phenomenon, but a process, an ever-developing entity. GT is an inductive methodology; working closely from the data, the researcher begins the work of generating a theory.

3.1. Data Gathering and Processing
The Internet Archive’s Archive-It (AIT) is a subscription-based web archiving service that helps organisations build and manage their own web archives. Archive-It is currently the most popular web archiving service, with over 600 clients (called “partners”) consisting of universities, state libraries and archives, museums, and national libraries in several countries [14]. The accounts of Archive-It clients are managed by a team of partner specialists. When a client encounters a problem with Archive-It, she first opens a support ticket using Zendesk, a popular customer-service platform. The ticket is received by a partner specialist, who is then responsible for addressing the issue.

AIT support tickets are a rich source of information regarding quality problems in web archives. They contain the opinions and views of individuals who are experienced creators of web archives, well-versed in web archiving processes, and familiar with institutional web preservation goals, whether they be clients or the partner specialists themselves. They contain rich descriptions of how quality problems are detected, analysed, and addressed, and are thus an ideal dataset for studying quality in all its dimensions.

The first step was to obtain Archive-It support tickets in order to analyse them. Since these tickets belonged to the Internet Archive, I negotiated a research agreement with the organisation to obtain support tickets from the years 2012 through 2016. The tickets received came from a wide variety of institutions reflecting AIT’s client base, from national libraries, to private organisations, to universities and museums from Europe, North America, and Asia. After the tickets were cleaned, I randomly selected the same number of tickets for each year from 2013 through 2016. This randomisation approach was taken to minimise the selection bias that might have occurred if I had manually chosen which tickets to analyse. The final dataset of 645 tickets was then imported into the NVivo software package, a popular program for performing qualitative data analysis [15]. Among other conditions, the research agreement stipulates that the researcher anonymise any personal or institutional information present in the tickets, as well as any other potentially identifying information. In order to comply with the terms of this agreement, all the information presented in this paper has been anonymised: identifying elements such as personal names, names of institutions, and website addresses have been removed or changed.

3.2. Data Analysis
The tickets collected were Level 1 support tickets that had been submitted by AIT clients. They included the initial question submitted by the client, the response given by the AIT partner specialist, and any subsequent communication between the two. It is important to note that not all the AIT tickets deal with issues of quality in a web archive.
Quite a few deal with collection management issues, such as how to manage user accounts for a collection of web archives, storage limitations, and questions about the privacy of or public access to archived content. This research focuses on tickets in which the client discusses a perceived flaw in an individual archived website or an entire web archive. From prior experience, I had seen that these types of tickets are the most likely to deal with issues of quality. Support tickets not pertaining to quality issues were classified as such and separated from the main data of interest.

Each ticket analysed consisted of the original ticket submitted by the client, the response sent by the AIT partner specialist, and any subsequent interactions between them. Tickets could be quite brief, consisting of three interactions (the original client ticket, the partner specialist’s response, and the client’s response), or they could have many interactions over time, spanning weeks or even months. A total of 305 tickets and 2544 interactions were analysed. These support tickets were analysed using the GT techniques of open coding and theoretical memos to identify the main concepts and categories present in the data. According to the precepts of GT, after several rounds of coding, the researcher will reach saturation, a state in which nothing new is being extracted from the data. Per the guidelines of Grounded Theory, only the core categories (that is, the ones that explain most of the variation in quality) are part of the final theory.

In order to increase the quality and rigour of the study, I engaged in purposeful peer review. University professors were periodically invited to audit the entire research project, including the codebook, preliminary findings, and core categories. In addition to peers, employees of the Internet Archive were also invited to see the findings and comment on them.

4. Findings
As previously discussed, the notion of archivability has already received some attention from academic researchers. It was defined by [8] as the ease with which a website can be archived. I redefine archivability as the intrinsic properties of a website that make it easier or more difficult to archive. Archivability is highly dependent on the technology being employed to do web archiving. As technology evolves over time, web components that were previously thought to be unarchivable might become archivable, and vice versa. Archivability proved to be a prominent dimension, appearing 101 times in 78 tickets. The data showed several factors that greatly affect the archivability of a website. According to the data presented here, archivability problems occur because a website:

1. has changed the way the content is delivered to the user.
2. is media-heavy or contains much dynamic content.
3. renders content in a unique, “non-standard” way.

Table 1 presents examples of the first situation. The marker C: refers to the original query presented by the AIT client, while the marker AIT: refers to the answer given by the web archivist. Many websites routinely change the way content is delivered to the user; thus, a website can go from being easily archivable to practically unarchivable fairly quickly. As one AIT employee said: “The web, and specifically social networking sites can be a moving target.” When websites change their internal functionality, it can result in the archived website looking different from the original (tickets 8 and 129) and missing content (tickets 8 and 258).
Cases where archivability was negatively impacted by the heavy presence of dynamic content are shown in Table 2. Generally, sites that utilize technologies such as JavaScript, Flash, and streaming audio and video are difficult to capture and render like the original. This finding is consistent with the work of [6] and [8]. A special case of this situation is seen with websites that are database-, form-, or search-driven, such as library catalogs, web forms, or search engines. As the AIT employee explains, these are elements that depend on a myriad of complex, dynamic interactions that cannot be replicated in an archived website.

Table 1
Examples of archivability problems caused by websites changing how they deliver content to users.

Ticket 8
C: Our Facebook page didn’t get archived. When I viewed what was crawled, all that came up was basically a blank Facebook page.
AIT: Both Facebook and Twitter have made some changes recently to the way they set up their sites, which requires a little bit of work on our end to catch up. 1. For Facebook, your site was archived, there is just an issue that is keeping the archived page from displaying normally. Our engineers are working on this and it should be fixed this week. I will let you know as soon as I have further information. 2. For Twitter, they recently removed the “more” button from twitter feeds and instead users access older tweets by scrolling down the page. The way this feature is set up makes it difficult for our crawlers to access the older content that is not displayed automatically.

Ticket 129
C: I am getting an error on the following Facebook crawl.
AIT: Facebook made a change to the settings for their stylesheets.

Ticket 258
C: In the "Township of " collection I am trying to capture this facebook site: http://www.facebook.com/pages/township
AIT: We are still generally able to capture the initial content on a Facebook timeline; however the most recent change from Facebook has made it once again difficult to capture dynamically loading content as a user scrolls down through the page.

Sometimes websites will have unique or unusual ways of rendering content, which can negatively affect archivability, as seen in Table 3. For example, some content management systems can create endlessly repeating directory structures (such as http://somesite.com/news, http://somesite.com/news/news, and http://somesite.com/news/news/news). The presence of these will cause the crawler to go into infinite loops (crawler traps) in an attempt to capture all levels of the website. This can lead to poor-quality archived websites, stalled or incomplete crawls, and large amounts of unnecessary data.

5. Discussion
5.1. How do people conceptualise the notion of archivability?
Archivability is not a dimension of quality that is directly perceived by most AIT clients. AIT employees, who have a deep knowledge of and experience with the technical process of archiving websites, were much more likely to perceive a quality problem as an archivability problem. Instead, AIT clients framed archivability issues as correspondence or relevance problems that negatively affected the quality of an archived website or of an entire web archive. For example, in Table 1, ticket 8 describes a situation where the AIT client is writing because she sees a blank Facebook page, while it is the AIT employee who is able to determine that the blank page is being caused by Facebook changing how it delivers its content.
That is, an archivability problem is being framed in terms of a problem with visual correspondence, since the archived website does not look like the original. Similarly, in Table 2, in ticket 100, the AIT client frames the problem as one of interactional correspondence: the archived site does not "flip through the slides" in a slideshow, and so does not behave as the original website does. Table 4 lists the support tickets analysed, as well as the IQ dimension in which they are framed.

Website archivability can thus be seen as a latent dimension, because it is hidden from most people and framed in terms of other quality problems. The archivability of a website can only be perceived after its archived counterpart exhibits a quality problem. In social science research, the term manifest variables is used to describe variables that we can directly observe [16]. Manifest variables stand in contrast to hypothetical constructs, also known as latent constructs, which cannot be directly observed [17]. In order to detect the presence of a latent construct or measure its impact, manifest variables are used as operational definitions. “We therefore assume that the presence and strength of the manifest variable reflects, albeit imperfectly, the presence and strength of the hypothetical construct” [16]. I advance that website archivability is one such latent construct, because it cannot be directly measured until the website is actually archived. Any proposed archivability measurement that is taken before the website is actually archived, such as those presented by [7], is a probability measure and is at best an estimate of the likelihood that a website will be preserved. The actual archivability of a website can only be seen after archival.

5.2. How does website archivability affect web archives and thus the future historical record?
As can be seen in the findings, a website with low archivability can negatively impact the quality of its archived version by causing correspondence or relevance problems. An archived website with low correspondence will look different from the original, behave in a different, degraded manner, and have missing elements. An archived website with low relevance will have content that is unrelated to that of the original website, or will be much larger in size than the original. Low archivability in a website leads to low-quality web archives, and low-quality web archives lead to low-quality historical records.

The degree of archivability of a website can be estimated a priori by calculating how much of it is composed of dynamic content, such as JavaScript. However, the true archivability of a website can only be determined a posteriori, by detecting correspondence or relevance problems. This points to the nature of web archiving as a reactive practice instead of a proactive one. Any substantial change in web technologies, standards, or platforms necessitates a change in web archiving practice in order to "catch up" and create high-quality web archives that result in a high-quality historical record.

Many parts of the web have always been out of reach: websites that are database-, form-, or search-driven have always been impossible to capture. The increasing reliance on dynamic, client-side technologies such as JavaScript has also done much to decrease website archivability [8]. Given these findings, it is safe to conclude that the web is no longer fully archivable, and will become even less so as time passes. Current web archiving technologies cannot adequately capture the web as it is now, yielding, at worst, highly degraded versions of the original.
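As a rough illustration of such an a priori estimate, the short sketch below computes the share of a page’s referenced resources that are script elements. Both the heuristic and the example markup are assumptions made for this sketch rather than a validated archivability measure; as argued above, the true archivability only becomes visible after capture.

from html.parser import HTMLParser

class ResourceCounter(HTMLParser):
    # Counts resource-bearing tags and how many of them are scripts.
    RESOURCE_TAGS = {"img", "link", "script", "iframe", "embed", "source", "video", "audio"}

    def __init__(self):
        super().__init__()
        self.total = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.RESOURCE_TAGS:
            self.total += 1
            if tag == "script":
                self.scripts += 1

def dynamic_content_share(html):
    counter = ResourceCounter()
    counter.feed(html)
    return counter.scripts / counter.total if counter.total else 0.0

# Hypothetical page markup: one stylesheet, one image, and two scripts.
page = """<html><head>
  <link rel="stylesheet" href="site.css">
  <script src="app.js"></script>
  <script src="analytics.js"></script>
</head><body><img src="banner.png"></body></html>"""
print(f"Share of script resources: {dynamic_content_share(page):.0%}")  # 50%

A higher share of script-driven resources suggests more content that is assembled only at render time, and therefore a greater risk of the correspondence and relevance problems described in the findings.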
6. Conclusion
This paper makes the following contributions:

1. It presents a detailed, comprehensive definition of website archivability, one of the most important dimensions of information quality in web archives.
2. This definition of archivability is human-centred and grounded in how web archivists perceive quality in web archives.
3. The definition is independent of the technology currently in use to create web archives, making it suitable for a wide variety of platforms, preservation contexts, and situations.
4. It characterizes archivability as a latent construct of information quality, expressed by the manifest variables of correspondence and relevance, a finding not previously seen in the literature.
5. It describes how low website archivability influences the quality of web archives as historical records.

As historian Ian Milligan stated, “web archives do not provide a perfect representation of the past...but neither do traditional archives, which have had to be very selective with what they select, appraise, and preserve” [18]. Though this paper paints a sobering picture of decreasing website archivability and its implications for the historical record of the web, perhaps it is not web archivists, but historians, who will be most able to deal with the errors and omissions of the past web.

References
[1] J. Bailey, A. Grotke, E. McCain, C. Moffatt, N. Taylor, Web Archiving in the United States: A 2016 Survey, Research Report, 2016. URL: http://ndsa.org/publications/.
[2] Archive-It, Learn more, 2021. URL: https://archive-it.org/learn-more.
[3] S. Abrams, Z. Collier, E. Colón-Marrero, keondra bills freemyn, N. Krabbenhoeft, M. E. Wertheimer, A. Wickner, 2022 web archiving survey results, 2023. URL: http://ndsa.org/publications/.
[4] B. Reyes Ayala, Correspondence as the primary measure of quality for web archives: A grounded theory study, in: M. Hall, T. Merčun, T. Risse, F. Duchateau (Eds.), Digital Libraries for Open Knowledge, Springer International Publishing, Cham, 2020, pp. 73–86.
[5] J. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, M. L. Nelson, Not all mementos are created equal: measuring the impact of missing resources, International Journal on Digital Libraries (2015) 1–19. doi:10.1007/s00799-015-0150-6.
[6] V. Banos, Y. Kim, S. Ross, Y. Manolopoulos, CLEAR: A credible method to evaluate website archivability, Presented at the 10th International Conference on Preservation of Digital Objects (iPRES 2013), 2013. URL: http://www.academia.edu/10967309/CLEAR_a_credible_method_to_evaluate_website_archivability.
[7] V. Banos, Y. Manolopoulos, A quantitative approach to evaluate website archivability using the CLEAR+ method, International Journal on Digital Libraries (2015) 1–23. doi:10.1007/s00799-015-0144-4.
[8] J. Brunelle, M. Kelly, M. Weigle, M. L. Nelson, The impact of JavaScript on archivability, International Journal on Digital Libraries (2015) 1–23. doi:10.1007/s00799-015-0140-8.
[9] Archive-It, Learn more, 2014. URL: https://archive-it.org/learn-more.
[10] J. F. Brunelle, K. Ferrante, E. Wilczek, M. C. Weigle, M. L. Nelson, Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives, D-Lib Magazine 22 (2016). URL: http://www.dlib.org/dlib/january16/brunelle/01brunelle.html. doi:10.1045/january2016-brunelle.
[11] A. Ben-David, 2014 not found: a cross-platform approach to retrospective web archiving, Internet Histories 3 (2019) 316–342. doi:10.1080/24701475.2019.1654290.
[12] B. Reyes Ayala, When expectations meet reality: common misconceptions about web archives and challenges for scholars, International Journal of Digital Humanities (2021). doi:10.1007/s42803-021-00034-3.
[13] B. Glaser, A. Strauss, The Discovery of Grounded Theory: Strategies for Qualitative Research, Aldine Transaction, 2009.
[14] Archive-It, Learn more, 2020. URL: https://archive-it.org/learn-more.
[15] QSR International, NVivo product range, 2016. URL: http://www.qsrinternational.com/nvivo-product.
[16] M. E. Kite, B. E. Whitley, Principles of Research in Behavioral Science, Routledge, New York, NY, 2018.
[17] S. El-Den, C. Schneider, A. Mirzaei, S. Carter, How to measure a latent construct: Psychometric principles for the development and validation of measurement instruments, International Journal of Pharmacy Practice 28 (2020) 326–336. doi:10.1111/ijpp.12600.
[18] I. Milligan, Historiography and the Web, Sage Publications Ltd, Los Angeles, CA, USA, 2018, pp. 16–29.

Table 2
Examples of archivability problems caused by websites with dynamic content.

Ticket 100
C: In reviewing our crawls, I have noticed a few of our pages do not display correctly. The problem pages either utilize flash or javascript. I know javascript can be problematic. The following pages are not displaying correctly:
• http://www. .edu/ Video does not display/does not flip through slides
• http://www. edu/y79.xml Does not flip through slides
• www. .edu/y55333.xml Does not display additional photographs
• http://www. .edu/y213.xml Does not display photograph slides
• http://www. .edu/BBQ.xml Does not display photograph slides
• http://www. .edu/toc.xml Does not display photograph slides
AIT: flash and Javascript can be difficult to capture or display sometimes

Ticket 76
C: From the crawl report, it looks like a reasonable number of urls were captured, but also a good number show up in the “out of scope” column. Then, when I checked the Wayback version I got a message that the archived site cannot be displayed within a frame (see screenshot). I’m not sure what this means or if it is possible to adjust the crawl to fix it. Do you have any suggestions for the best way to capture these videos?
AIT: streaming video can be difficult to archive sometimes

Ticket 369
C: The athletics department has their game day programs online. I seem to be able to view the sections but can’t see a way to capture printer-friendly formats from their link. Is this possible?
AIT: It looks like the site uses a fair bit of javascript to generate those “printer friendly” pages, but I’m not sure how feasible capture is

Ticket 2884
C: Under the About Us tab, under Press Room, the tabs other than News Releases (In the News, Annual Report, Media Kit, and Social Media) do not work.
AIT: Regarding the tabs on the Press Room URL, I am not sure if we will be able to capture this content due to the dynamic way in which these links are generated

Special case: websites that are database-, form-, or search-driven

Ticket 30
C: Much of this content is located in databases, so, in general I’m curious about how Archive-It will handle these databases. Here are two examples:
• http://www.ourhistory.org/search.asp#index The results of a search come up on a site with a URL with the same seed, but will Archive-It crawl this database?
• http://www.ourhistory.org/iq/register/welcome.asp Same as the site above, the results remain on the same seed, will Archive-It crawl the database?
AIT: If database-driven parts of sites have direct links to the content, the crawler will capture those; however, the crawler can’t enter search terms or interact with forms, so if that is the only way to access the database content, the crawler likely will not automatically be able to access that content

Ticket 3481
C: This site has a database backend and is queried via text input fields on the phonebook homepage...Could I get some help on how to do this successfully?
AIT: Because of their interactive nature, search boxes cannot operate in an archived website in the same way as they would on the live web

Ticket 3458
C: I would like to know if there is any way I can capture the search feature of the website
AIT: Search boxes are something that will not behave in an archived site like they do on the live web. We can archive content that would be returned by using the search function (as you noticed with the “Browse All Projects” button); however, the crawler is not able to archive the database or search engine that the live site search runs off of

Table 3
Examples of archivability problems caused by websites rendering content in unique ways.

Ticket 464
C: When you’re navigating through the catalogs themselves, you also come across the same issue of not being able to get from page 1 to page 2 or 3 or 4 or 5. However, all of the lots are captured, but you have no way of accessing them through the site
AIT: The way that this site does its navigation is significantly more complicated than your average site due to the form-based dropdowns that you notice to the right of the pagination at the top of the list. The “Sort” and “per page” options are actually forms, so instead of simply clicking on links to subsequent or previous pages (the way that most sites do pagination), the crawler would actually have to select an option from the dropdown and submit a form each time, in order to get content back. These are types of interactive behavior the crawler does not perform by default, so it will require additional development...Because this site is so uniquely complicated in the way it has implemented pagination, any work our engineers put into developing a new crawling feature to capture it would be very specific to this site and likely not transferrable to other examples

Ticket 3423
C: I’ve done a test crawl on all ".stateu.edu" while it has captured thousands of pages it also seems to determine many "stateu.edu" pages to be "out of scope". These pages are not blocked by a robots.txt. Why would that be happening?
AIT: We do see these types of repetitive URLs from time to time, and they appear to be generated by code in certain implementations of content management systems like Drupal

Ticket 3001
C: We’re having some trouble limiting the URLs on one of our crawls to a reasonable number. We’re actually getting the content we’d like, but we’re also getting a ton of extraneous URLs that are either bad content or aliases to content we’re already capturing with another URL.
AIT: After taking a look at the queued URLs for this host, it appears that the crawler is running into a trap that we see from time to time on some websites (including some Drupal sites) where the site generates links with repeating directories

Ticket 86
C: We ran another test crawl on this site and now we seem to have opened a can of worms for the main site we wanted to crawl. It looks like we have the Flickr URL under control and may just put a limit on the number captured. The main URL that we want captured 83,600 with over 1 million in the queue
AIT: The issue with your http://www.pl.gov/tef/ site is one that we see from time to time, where something in the way the site is put together creates urls with repeating directories that all point back to the same page

Table 4
Archivability problems and the IQ dimensions and sub-dimensions in which they are framed.

Correspondence, Visual: tickets 8, 129, 100, 76
Correspondence, Interactional: tickets 2884, 464
Correspondence, Completeness: tickets 258, 369, 30, 3481, 3458
Relevance, Topic: ticket 3423
Relevance, Size: tickets 3001, 86, 76