Discoverability in a Digital Library: A Study of “Rabbit Holes” within Gallica’s Corpus

Anne-Laure Tettoni1,∗,†, Simon Dumas Primbault1,2,3,†

1 Laboratory for the history of science and technology (LHST), Swiss Federal Institute of Technology (EPFL), CH-1015 Lausanne, Switzerland
2 OpenEdition (UAR 2504, CNRS/EHESS/AMU/AU), 22 rue John Maynard Keynes, 13013 Marseille, France
3 Bibliothèque nationale de France (BnF), Quai François Mauriac, 75706 Paris, France

CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
∗ Corresponding author.
† These authors contributed equally.
Email: ana.tettoni@gmail.com (A. Tettoni); simon.dumas-primbault@openedition.org (S. Dumas Primbault)
ORCID: 0000-0002-0877-7063 (A. Tettoni); 0000-0001-7116-9338 (S. Dumas Primbault)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract
The phenomenon of aimless web navigation, often compared to falling “down the rabbit hole,” brings to light significant aspects of the Internet’s “long tail” concept. This research examines whether longer, non-goal-oriented web sessions genuinely lead users into the long tail of digital libraries, thereby exploring the discoverability of cultural heritage. The study focuses on Gallica, the French national library’s online platform, and aims to identify and characterize such sessions within it, defining rabbit holes as long and diversified navigation sessions. The difficulty lies in identifying rabbit holes within server logs, which requires a mixed-methods approach involving interviews, qualitative studies, and simple statistical analyses. Despite Gallica’s lack of hypertextual structure, we show that users do engage in rabbit-hole-like behavior, navigating through keyword searches and filters. The study’s findings align with user testimonies. A crucial conclusion is that rabbit holes in Gallica do not generally lead users to less-consulted content. We attribute this limitation to the search engine, which users must somewhat “hack” to navigate effectively. Enhancing Gallica’s discoverability tools without compromising the existing user experience is essential for improving content accessibility.

Keywords
digital library, navigation practices, discoverability, rabbit holes, long tail

1. Introduction

1.1. Research Question

Who has never found themselves surfing almost aimlessly through the vastness of content on the Internet? Indeed, the hypertextual structure of most content on the World Wide Web allows users to navigate from page to page (out of curiosity, distraction, or mere boredom), to the extent that, not unlike Alice in Wonderland, they may fall “down the rabbit hole”, supposedly into the most unknown of places. This trope alludes to the two meanings of the Internet’s “long tail”. Initially deriving from the network’s structure [12], the long tail denotes the mass of Internet pages with very few links pointing towards them, as opposed to the head of the network, composed of very few websites with a great number of inward links that attract the major part of the traffic. This structural observation was later reconceptualized as a business opportunity specific to the Internet, claiming that markets should now focus on the vast diversity contained in the long tail [1]. However, this view relies on the assumption that, on top of this hypertextual structure, there is actual, and substantial, traffic in the long tail.
This points to another meaning of the long tail, in terms of user practices: while most Internet users make brief sessions to retrieve known content by querying search engines with relevant keywords, a small but significant portion of sessions are longer and seemingly not goal-oriented, undertaken by users navigating contextually from place to place. But do these longer sessions really lead users into the long tail of the content? Do longer aimless sessions actually take readers into the “dark matter” of the Internet? By disentangling the two meanings of the long tail on the Internet, this work-in-progress aims to address a broader issue: that of the actual accessibility of content, beyond mere access, and the consequent need for discoverability tools, beyond mere findability. Indeed, research has shown that content in open access may not actually be accessible due to indexing issues, but also to a lack of infrastructure in specific areas, the need for different skills or abilities to simply grasp the content, or the absence of incentives to wander beyond the supposed relevance of search engines [5].

With the exponential growth of their corpora (Gallica now hosts ten million digitized documents), digital libraries have become genuine informational milieus that users practically navigate, step by step and iteratively. In this context, the traditional retrieval and recommendation systems (search engines, aggregators, social media, etc.), if not reassessed, tend to reinforce cultural asymmetries, favouring certain authors, languages, and media types while rendering the rest invisible, thereby offering a certain perspective on history and the social world in the case of cultural heritage.

Discoverability could be defined at the intersection between users and content, as the propensity to stumble upon some unexpected content that the user was not searching for and nonetheless deems relevant for their purpose [15, 24].
In that sense, discoverability does not only reflect the desire of readers to make new and unforeseen connections (serendipity); it also fosters cognitive justice and values such as inclusion and diversity. Consequently, it would not be enough to ensure findability in information systems, i.e. to provide relevant results to detailed queries within a closed and known informational space. It would also be necessary to design and develop tools to foster discoverability, i.e. to promote lesser known or less consulted content that users were not looking for but might nonetheless find useful.

1.2. Related Work

Previous studies have mainly been conducted on social media. Indeed, because their informational architecture is built around algorithmic recommendations that invite endless scrolling, most social media are suited to rabbit holes, even “doomscrolling”, and to the discovery of content. In general, they have been criticized for promoting content that generates more reaction, specifically outrage in response to hateful content. For example, the YouTube algorithm has been criticised for leading to echo chambers and promoting extremist content [3]. In this sense, rabbit holes can be understood as paths to radicalisation [11]. They can also foster the circulation of fake news and be a path to conspiracy beliefs [22].

Extensive work has also been conducted on Wikipedia. In an effort to understand how users either search in or navigate through Wikipedia, and how these two strategies relate [7, 6, 18, 21, 23], scholars have shown the reliance on the structure of the articles [14], the link structure of the corpus [8], or the citations [20]. More specifically, Tiziano Piccardi, Martin Gerlach, and Robert West [19] have shown, through the computational analysis of Wikipedia server logs, that they could extract a consistent and coherent subset of user sessions corresponding to a specific user navigation regime they call “rabbit hole”.
Rabbit holes are longer sessions, more likely to be undertaken on desktop computers during the workday or on a mobile device at night, and they exhibit random navigation patterns. Let us emphasize that although such rabbit holes span a vast array of pages, they mostly remain in the same semantic or topical area as the first pages visited.

1.3. Case Study

While it has become common parlance to say that a library reader may serendipitously find something they were not looking for by roaming through bookshelves, there is surprisingly little work about rabbit holes in digital libraries. Studies have long shown that, besides directed search, navigation is a common, if underrepresented, informational practice, e.g. “berrypicking” [2], browsing [4], “bouncing” [16]. The present work-in-progress intends to further research on navigation practices within Gallica by addressing rabbit holes as a kind of navigation.

A cultural heritage repository, Gallica is the online platform of the French national library. As a consequence of a policy of mass digitization initiated in 2004, the digital library now hosts 10 million documents in the public domain, of 9 different types ranging from prints to manuscripts to maps and periodicals. Following the seminal work of Nouvellet and Beaudouin [17], and using mixed methods at the intersection of ethnography (semi-directed interviews with users), semiotics (analysis of the interface and architecture), and digital humanities (analysis of server logs), we have identified several “navigation regimes” across the digital corpus [10, 13].

For the present study, it is important to note that the computational analysis of server logs was conducted after two campaigns of respectively 7 and 17 semi-structured interviews with users of Gallica, as shown in figure 1 (see also [10, 13]).
These interviews helped us point out that most users exhibited a specific navigation regime, unstructured exploratory browsing, whenever they could find the time to wander aimlessly throughout Gallica’s corpus, that is either at the end of a long day or during less intensive phases of their work. Nonetheless, these moments were considered highly important in their practice, leading them to make new and unforeseen connections between heterogeneous elements of their research; and the semi-structured interviews helped us identify the properties of such a regime (length, diversification) by modelling user sessions. Therefore, although rabbit holes are of low statistical significance, they are of paramount epistemological significance in practice. This decorrelation between the statistical significance of unstructured exploratory browsing and its epistemological importance in knowledge making is also due to the fact that directed search is technically favoured by the prevalence of the search engine and by an interface that somewhat reinforces this common practice.

Figure 1: A Mixed-Methods Ethnography: Pipeline and Methodology

Note that, contrary to Wikipedia, Gallica’s corpus is not hypertextually structured: the documents are not linked in the database, and they are not clickable on the interface, nor are the metadata. Nonetheless, users have told us that they would engage in rabbit-hole-like behaviours: e.g. one told us they would look for historical “lolcats” after a long day’s work, while another said Gallica was their own Candy Crush where they would waste time playing around. Indeed, another study showed us how, in order to navigate from document to document, users iterated keyword searches and filtering to “manage the noise” in the search results, thereby constructing step by step their own path throughout the corpus, sometimes even a rabbit hole [9].
The present work-in-progress aims at circumscribing a robust subcorpus of such sessions and characterizing it. We will define rabbit holes as a subregime of navigation, that is as a subset of user sessions that are long and diversified (Sec 3), and we will try to distinguish them from average sessions using simple statistics (Sec 4).

2. Data and Sessionization

We used Gallica server logs spanning the period from 31 January 2016 at 13:00 to 29 February 2016 at 05:36. We chose to keep only one month to make computations easier to run. This still represents 319,344,032 log entries. The month covered is not particular in terms of academic deadlines or cycles, so it should not see an increase in students which would skew the data. Time zones were not taken into account, as the vast majority of users are from French-speaking countries in Europe, located in the same time zone. Moreover, users could be using VPNs, so the location of users is uncertain anyway. This research involves no sensitive data and all personal data were irreversibly anonymized: IP addresses were hashed, and the hash table destroyed. It is compliant with GDPR and the Swiss Federal Act on Data Protection. It was approved by EPFL’s Human Research Ethics Committee. For more details on the data, refer to appendix A or [13].

2.1. Sessionization

The pre-processing of this data involves enriching it by requesting individual Archival Resource Keys (ARKs) in order to provide additional information about the documents consulted. As mentioned before, the size of the data is considerable, so to make this step manageable, we ran the pre-processing and sessionization tasks in chunks. This leads to more sessions, as it severs sessions that run over multiple chunks. To estimate how many sessions were added, we ran the process on a chunk and then on the same chunk divided in two. We found that splitting added 797 sessions to the 74,917 found without it.
This represents an increase of merely 1.06%. We also created a dictionary of document “visibilities”: every time a document is consulted by a user, we increment the visibility of the corresponding ARK.

From our enriched logs, we want to define user sessions. To do so, we start by grouping the log entries by hashed IP address and aggregating the features. We then compute the time difference between consecutive requests from the same hashed IP address; whenever that difference is higher than a given inactivity threshold, we consider that a new session starts, and we create session IDs accordingly. The use of time heuristics to define user sessions has been documented in [25]. We choose 60 minutes as the inactivity threshold. This choice relies on two factors. First, Halfaker et al. [11] found that a 60-minute inactivity threshold is a good rule of thumb. Second, we tested other thresholds (45, 75, and 90 minutes) on a chunk representing 24 hours and found that they did not drastically change the number of sessions, as shown in table 1.

Table 1
Number of sessions generated per inactivity threshold

| Threshold | 45 min | 60 min | 75 min | 90 min |
|---|---|---|---|---|
| Number of sessions | 32,858 | 32,610 | 32,478 | 32,329 |

From these session IDs, we separate the sessions from each other, and enrich them by adding features such as the length in minutes, a list of the visibilities associated with each consulted ARK, and the first referrer. We also removed the sessions with no ARK visited, which accounted for about 16.6% of the sessions. In total over the month, we have 1,181,190 sessions where at least one document was consulted.

From the list of visibilities, we create new features that enable us to evaluate the evolution of the visibility of documents across a session: first, the mean and minimum visibility over all documents of the session; then the mean and minimum visibility of the first three and of the last three documents; and finally the variation of the mean and minimum visibility between the first three and the last three documents.
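As an illustration, the sessionization rule and the visibility features described above can be sketched in a few lines of Python. This is not the pipeline used in the study: the record layout (hashed IP, timestamp in minutes, ARK or None), the function names, and the toy data are assumptions made for the example. Only the 60-minute threshold and the feature definitions come from the text.

```python
from collections import Counter, defaultdict
from statistics import mean

INACTIVITY_THRESHOLD_MIN = 60  # inactivity threshold chosen in the text


def sessionize(requests, threshold_min=INACTIVITY_THRESHOLD_MIN):
    """Group requests by hashed IP and cut a new session after each long gap.

    `requests` holds (hashed_ip, minutes_since_start, ark_or_None) tuples;
    this layout is hypothetical, not the real log format.
    """
    by_ip = defaultdict(list)
    for hashed_ip, ts, ark in requests:
        by_ip[hashed_ip].append((ts, ark))

    sessions = []
    for hashed_ip, events in by_ip.items():
        events.sort(key=lambda event: event[0])  # chronological order
        current = [events[0]]
        for previous, following in zip(events, events[1:]):
            if following[0] - previous[0] > threshold_min:
                sessions.append((hashed_ip, current))  # gap too long: close
                current = []
            current.append(following)
        sessions.append((hashed_ip, current))
    return sessions


def visibility_features(visibilities):
    """Summarize the visibilities of the documents consulted in one session."""
    first, last = visibilities[:3], visibilities[-3:]
    return {
        "mean_visibility": mean(visibilities),
        "min_visibility": min(visibilities),
        "mean_first_3": mean(first),
        "min_first_3": min(first),
        "mean_last_3": mean(last),
        "min_last_3": min(last),
        # Positive variation: the session drifts towards more visited documents.
        "variation_mean_vis": mean(last) - mean(first),
        "variation_min_vis": min(last) - min(first),
    }


# Toy log: ip1 pauses for 90 minutes, so its requests form two sessions.
toy_requests = [
    ("ip1", 0, "a"), ("ip1", 10, "b"), ("ip1", 100, "a"),
    ("ip2", 5, None), ("ip2", 20, "c"),
]
# Visibility of an ARK = number of times it is consulted over the whole log.
visibility = Counter(ark for _, _, ark in toy_requests if ark is not None)
sessions = [
    (ip, events) for ip, events in sessionize(toy_requests)
    if any(ark is not None for _, ark in events)  # drop sessions with no ARK
]
for ip, events in sessions:
    arks = [ark for _, ark in events if ark is not None]
    length_minutes = events[-1][0] - events[0][0]
    print(ip, length_minutes, visibility_features([visibility[a] for a in arks]))
```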
From this, we will be able to tell whether a session led to more popular, more visited documents or to the opposite.

3. Defining a Rabbit Hole

We broadly define a “rabbit hole” as a session that is long and diversified. To characterize diversity, we add to each session the list of themes (Dewey classes) and the list of document types (prints, manuscripts, images, maps, etc.) consulted, using the metadata from the requested ARKs. To characterize length, we create features that indicate whether the session is in the top 10% or the top 5% of lengths in minutes: a session must be over 30 minutes long to be in the top 10%, and over 60 minutes long to be in the top 5%. We also add a feature giving the number of visited documents, and another indicating whether this number is above 10. The distribution of all sessions, irrespective of their length or diversity, is shown in figure 2. The curve is red above the 10% threshold.

Figure 2: Distribution of sessions depending on length

Note that this curve exhibits an interesting behaviour above 300 minutes, i.e. 5 hours: regular spikes emerge from the noise roughly every hour, possibly denoting a surplus of timed sessions due to bots programmed to stop after a certain number of hours. Furthermore, the longer tail exhibits a bump above 780 minutes, i.e. 13 hours: these very long sessions cannot be considered human either. Such a length indicates non-human users that regularly query the website without being declared as robots. An example of such a user is someone gathering the metadata of a list of ARKs, which is what we did in our data enrichment step. Therefore, in order to set aside these robotic behaviours, our selected rabbit holes correspond to the long tail of this distribution, in the highlighted gray area.

We also count the number of different themes and different document types across the session. Finally, from this information we can create the diversity metrics. We consider a session to be diverse if the documents consulted span 2 or more types or 2 or more themes (the ≥ 2 criterion of Table 2).

Then, we filter the sessions to extract the rabbit holes. We start by taking only sessions in the top 10% of length in minutes (long sessions), then only those that are diversified, which represents 28.85% of the long sessions. From them, we select the ones with over 10 documents consulted. This leaves 1.59% of the overall sessions.

To check whether this percentage is reasonable, we create three other diversity metrics and apply the same filtering process with them. First, a restrictive diversity metric, which requires 2 types or more and 2 themes or more: with this one, the rabbit hole sessions amount to 1.16% of sessions. Second, an augmented metric, which requires 5 types or more or 5 themes or more: this represents 1.02% of sessions. Lastly, an augmented and restrictive metric, which requires 5 types or more and 5 themes or more: with this one, rabbit hole sessions are only 0.11% of the sessions. We conclude that around 1% is a reasonable percentage. Table 2 summarizes the added features and their definitions.
Table 2
Features added to the sessions

| Feature | Description |
|---|---|
| first_referrer | The website from which the session started |
| length_minutes | (last_timestamp - first_timestamp) in minutes |
| visibility | A list of the visibility associated with each ARK |
| min_visibility | Smallest non-zero visibility |
| mean_visibility | Average of all visibilities |
| min_first_3 | Minimum visibility of the first 3 documents |
| mean_first_3 | Average visibility of the first 3 documents |
| min_last_3 | Minimum visibility of the last 3 documents |
| mean_last_3 | Average visibility of the last 3 documents |
| variation_min_vis, variation_mean_vis | Difference between the last three documents’ min/mean visibility and the first three |
| themes, types | Lists of themes and types associated with each ARK |
| nb_themes, nb_types | Number of unique themes and unique types |
| nb_docs | Number of accessed ARKs |
| over_10_docs | nb_docs ≥ 10 |
| top_10%_length | True if length_minutes ≥ 30 |
| top_5%_length | True if length_minutes ≥ 60 |
| diversified | nb_themes ≥ 2 or nb_types ≥ 2 |
| div_restrictive | nb_themes ≥ 2 and nb_types ≥ 2 |
| diversified_5 | nb_themes ≥ 5 or nb_types ≥ 5 |
| div_restrictive_5 | nb_themes ≥ 5 and nb_types ≥ 5 |

We now have filtered sessions that correspond to rabbit holes. These are sessions in the top 10% of session lengths, so over 30 minutes; they are diversified, meaning that at least two types of documents or at least two themes were consulted; and more than 10 documents were requested. We also remove the sessions above 5 hours, as they do not represent human behaviour, which removes an additional 12% of the longer sessions. We will now compute a variety of statistics on them to find how they differ from average sessions and to see whether they lead to less visible content.

4. Provisional Results

We begin by examining the least and most consulted themes and types of documents across both types of sessions, as well as the top 10% most consulted themes and types among the least visited documents, in both types of sessions.
Figure 3: Top themes in most visited ARKs. (a) Average sessions; (b) RH sessions.

Figure 4: Top types in most visited ARKs. (a) Average sessions; (b) RH sessions.

The top 5 most consulted themes are the same in both kinds of sessions, although in different orders. They correspond to the most popular themes on Gallica in general and to the interests of its audience, mainly made up of historians and genealogists (figure 3).

Figure 4 shows that the most popular type is “fascicule”, which corresponds to an issue of a periodical publication, such as a journal or a newspaper, and which was also the main type of document on Gallica at the time. Out of the approximately 3.6 million documents at the time, there were about 1.7 million of these, the next main type being images, at almost 1 million.

The same examination is done for the least popular types and themes (figures 5 and 6). We observe that in rabbit holes, books (type “monographie”) are much more popular than in average sessions. Indeed, longer sessions give more time to dive into longer documents.

Figure 5: Top themes in least visited ARKs. (a) Average sessions; (b) RH sessions.

Figure 6: Top types in least visited ARKs. (a) Average sessions; (b) RH sessions.

Next, we examined the most searched terms in rabbit holes and average sessions (figures 7 and 8). In average sessions, the most searched terms are “bnf.fr” and “Bibliothèque de France”, which suggests users who are not familiar with the website. These are followed closely by the terms “croix” and “epub”, a popular format for ebooks. In RH sessions, the third most popular term is “image”, which may indicate users who are not looking for something in particular.

We also compute the correlation between the length of a session and the minimum and mean visibility of the documents. The results are summarized in table 3.
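The text does not state which correlation coefficient is computed. As a minimal sketch, assuming a Pearson correlation between session length and a visibility feature, the computation can be written as follows. The function name and the toy values are hypothetical and are not taken from the study's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Toy sessions (invented values): length in minutes and mean visibility.
lengths = [35, 50, 80, 120, 200]
mean_visibilities = [14.0, 9.5, 12.0, 8.0, 10.5]
print(round(pearson(lengths, mean_visibilities), 4))
```

A negative coefficient would mean that longer sessions tend to reach less visited documents. A value close to zero means that session length says little about the visibility of what is consulted.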
Table 3
Correlations of visibilities and length

| | length-minimum | length-mean |
|---|---|---|
| Average sessions | -0.0005 | 0.0027 |
| RH sessions | -0.0109 | -0.0358 |

The correlations are not statistically significant, but we see that they are higher in absolute value for RH sessions. In RH sessions, a longer session sees a very slight decrease in the mean and minimum visibility of its documents. This suggests that when engaging in an RH session, the user is very marginally more likely to visit less popular documents. However, we cannot conclude that, in general, rabbit hole sessions lead users to the “dark matter” of Gallica.

Figure 7: Top 50 search terms in average sessions

Figure 8: Top 50 search terms in RH sessions

Next, we want to examine when rabbit hole sessions are most likely to happen in the day or the week. For the beginning of the sessions, the distribution in terms of day of the week is similar across all sessions (figure 10). For the hour, the distribution looks similar, but figure 9 shows that rabbit holes have peaks at 6pm and 9pm, indicating sessions that happen after working hours but before night. They are also less likely than other sessions to happen late at night or early in the morning (from midnight to 5am). This suggests that users falling into rabbit holes are not exceptional, e.g. insomniacs or early risers. Rather, they might be just the average kind of users that otherwise populate the library for other purposes.

Figure 9: Hour of beginning of session. (a) Average sessions; (b) RH sessions.

We also plot the most common referrer to begin a session with (figure 11). Google and Gallica itself rank first for any type of session. Previous user interviews showed that it is easier to retrieve content from Gallica through Google than with Gallica’s own search engine. The prevalence of Google for average sessions, and the big gap separating it from Gallica, could indicate users who are searching for something in particular, while for RH sessions, Gallica is much closer.
Gallica’s faulty search feature could then be used by wandering users who are less concerned with accuracy.

Figure 10: Day of beginning of session. (a) Average sessions; (b) RH sessions.

Figure 11: Most common referrer. (a) Average sessions; (b) RH sessions.

Finally, to show the difference between average and rabbit hole sessions, we plot the distributions of their length, number of themes, number of types, and number of documents (figures 12, 13, 14 and 15).

Figure 12: Length of a session. (a) Average sessions; (b) RH sessions.

Figure 13: Number of themes. (a) Average sessions; (b) RH sessions.

Figure 14: Number of types. (a) Average sessions; (b) RH sessions.

Figure 15: Number of documents. (a) Average sessions; (b) RH sessions.

Through plots of the length of sessions and the number of types, we show that the sub-corpus of rabbit hole sessions is not representative of all sessions. The length is by construction in the top 10% of lengths, the number of different Dewey classes shows a peak at 2, and the sessions are overall more diversified. Similarly, the number of themes peaks at 2 instead of 1, and there are systematically more documents.

Lastly, we want to examine the variation in minimum and mean visibility in both types of sessions. The variation of the minimum visibility has a wider range in average sessions than in RH sessions and, while it is concentrated around zero, it is mostly positive, which corresponds to sessions that lead to more popular documents (figure 16). RH sessions are slightly more likely to have a negative variation, which indicates a session that leads to less visible content. The same behaviour is observed for the mean variation (figure 17).

Figure 16: Variation of minimum visibility. (a) Average sessions; (b) RH sessions.

Figure 17: Variation of mean visibility. (a) Average sessions; (b) RH sessions.

5. Conclusion

This work-in-progress shows that we can indeed circumscribe a robust subcorpus of long and diversified navigation sessions that we call “rabbit holes”.
They can be considered a subregime of what we previously identified as “exploratory non-structured navigation” and represent about 1% of overall sessions (including non-human sessions). Interestingly enough, from a merely statistical point of view, these sessions do not appear too different from average sessions and can only be distinguished by looking at significant but subtle details, such as their prevalence after working hours or the importance of monographs.

These observations are in line both with users’ testimonies (e.g. one interviewee told us “I’m a modernist, but as a historian, requesting something about Joan of Arc is already entertainment for me, a rabbit hole in sum”) and with the digital library’s information architecture: a weakly structured database of documents that are rather homogeneous in content but not in type. Indeed, we identified in previous work that the condition for Gallica’s corpus to be actually navigable, rather than just retrievable, is that its impressive mass is heterogeneous in both types (prints, manuscripts, musical scores, etc.) and periods (from Antiquity to Modern Times), but rather homogeneous in content (historical documents in the public domain). It thereby attracts a rather homogeneous audience, or at least a community of practice that is difficult to segment, as its practices do not differ much from one another.

Searching for rabbit holes within Gallica’s server logs amounts to looking for a needle in a haystack! Therefore, rather than employing cutting-edge algorithms straightaway, it was necessary to criss-cross multiple perspectives on the phenomenon (preparatory topological data analysis (TDA) work, interviews, qualitative study of the interface and architecture, extraction of a subcorpus, computation of simple statistics) in order to build a body of small pieces of corroborating evidence.
This critical appraisal will now allow us to deploy slightly more complex methods to characterize rabbit holes, notably Markov chains, sequential pattern mining, and topological data analysis, in the hope of articulating this strand of research more clearly with our previous findings [13].

Turning back to matters of accessibility and discoverability, one important conclusion of this study is that rabbit holes on Gallica do not generally lead users towards less consulted content. Longer and more diversified sessions do not wander in the long tail of the corpus. This, we think, is due to the fact that the central tool to access documents in Gallica is a search engine that users have to hack and bypass if they want to build navigation paths for themselves. Indeed, we showed elsewhere that precisely because this search engine is somewhat “faulty” (according merely to relevance criteria), it helps users “manage the noise” in the search results to their taste [9]. Due to policies of mass digitization initiated in the 2000s, Gallica can today boast an impressive corpus of over 10 million documents, but it still lacks proper navigation tools; and we believe that “fixing” the search engine might do more harm than good to the discoverability of its content.

Acknowledgments

This exploratory study was conducted in collaboration between EPFL and OpenEdition. It was supported by the BnF thanks to a Mark Pigott Fellowship in Digital Humanities.

References

[1] C. Anderson. The Long Tail: Why the Future of Business is Selling Less of More. New York: Hyperion, 2006.
[2] M. J. Bates. “The design of browsing and berrypicking techniques for the online search interface”. In: Online Review 13.5 (1989), pp. 407–424. url: https://pages.gseis.ucla.edu/faculty/bates/articles/berrypicking.pdf.
[3] M. Brown, J. Bisbee, A. Lai, R. Bonneau, J. Nagler, and J. A. Tucker. “Echo Chambers, Rabbit Holes, and Algorithmic Bias: How YouTube Recommends Content to Real Users”. In: SSRN Electronic Journal (2022).
doi: 10.2139/ssrn.4114905. url: https://www.ssrn.com/abstract=4114905.
[4] D. O. Case. Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior. Academic Press, 2002.
[5] L. Chan, ed. Contextualizing Openness: Situating Open Science. Ottawa: University of Ottawa Press, 2019.
[6] D. Dimitrov, F. Lemmerich, F. Flöck, and M. Strohmaier. “Different topic, different traffic: How search and navigation interplay on Wikipedia”. In: The Journal of Web Science 1 (2019). doi: 10.34962/jws-71.
[7] D. Dimitrov, F. Lemmerich, F. Flöck, and M. Strohmaier. “Query for Architecture, Click through Military: Comparing the Roles of Search and Navigation on Wikipedia”. In: Proceedings of the Conference on Web Science (WebSci). 2018. doi: 10.1145/3201064.3201092.
[8] D. Dimitrov, P. Singer, F. Lemmerich, and M. Strohmaier. “What Makes a Link Successful on Wikipedia?” In: Proceedings of the International World Wide Web Conference (WWW). 2017. doi: 10.48550/arXiv.1611.02508.
[9] S. Dumas Primbault. “‘Managing the Noise’: Users’ Hacks of Gallica’s Search Engine as a Navigational Practice”. Forthcoming, 2025.
[10] S. Dumas Primbault. “Naviguer dans les savoirs à l’ère numérique. Pour une ethnographie des pratiques informationnelles sur Gallica”. In: Études de communication 61 (Dec. 18, 2023), pp. 61–89. doi: 10.4000/edc.16108. url: http://journals.openedition.org/edc/16108.
[11] A. Halfaker, O. Keyes, D. Kluver, J. Thebault-Spieker, T. Nguyen, K. Grandprey-Shores, A. Uduwage, and M. Warncke-Wang. “User Session Identification Based on Strong Regularities in Inter-activity Time”. In: Proceedings of the 24th International Conference on World Wide Web. WWW ’15: 24th International World Wide Web Conference.
Florence, Italy: International World Wide Web Conferences Steering Committee, May 18, 2015, pp. 410–418. doi: 10.1145/2736277.2741117. url: https://dl.acm.org/doi/10.1145/2736277.2741117.
[12] B. H. Huberman. The Laws of the Web: Patterns in the Ecology of Information. Cambridge, MA: MIT Press, 2001.
[13] B. Kaabachi and S. Dumas Primbault. “A Topological Data Analysis of Navigation Paths within Digital Libraries”. 2023. url: https://ceur-ws.org/Vol-3558/paper935.pdf.
[14] D. Lamprecht, K. Lerman, D. Helic, and M. Strohmaier. “How the structure of Wikipedia articles influences user navigation”. In: New Review of Hypermedia and Multimedia 23.1 (2017), pp. 29–50. doi: 10.1080/13614568.2016.1179798.
[15] L. Magnani. Discoverability: The Urgent Need of an Ecology of Human Creativity. Springer, 2022.
[16] D. Nicholas and D. Clark. “‘Reading’ in the Digital Environment”. In: Learned Publishing 25.2 (2012), pp. 93–98. doi: 10.1087/20120203.
[17] A. Nouvellet and V. Beaudouin. “Analyse des traces d’usage de Gallica”. In: [Rapport de recherche] Télécom ParisTech; Bibliothèque nationale de France (2017). url: https://hal.science/hal-01709264.
[18] T. Piccardi, M. Gerlach, A. Arora, and R. West. “A Large-Scale Characterization of How Readers Browse Wikipedia”. In: ACM Transactions on the Web 17.2 (2023), pp. 1–22. doi: 10.1145/3580318. url: http://dx.doi.org/10.1145/3580318.
[19] T. Piccardi, M. Gerlach, and R. West. “Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions”. In: Companion Proceedings of the Web Conference. WWW ’22: The ACM Web Conference. Virtual Event, Lyon, France: ACM, Apr. 25, 2022, pp. 1324–1330. doi: 10.1145/3487553.3524930. url: https://dl.acm.org/doi/10.1145/3487553.3524930.
[20] T. Piccardi, M. Redi, G. Colavizza, and R. West. “Quantifying engagement with citations on Wikipedia”. In: Proceedings of the International World Wide Web Conference (WWW).
2020. doi: 10.48550/arXiv.2001.08614.
[21] G. C. Rodi, V. Loreto, and F. Tria. “Search strategies of Wikipedia readers”. In: PloS one 12.2 (2017). doi: 10.1371/journal.pone.0170746.
[22] R. M. Sutton and K. M. Douglas. “Rabbit Hole Syndrome: Inadvertent, accelerating, and entrenched commitment to conspiracy beliefs”. In: Current Opinion in Psychology 48 (Dec. 2022), p. 101462. doi: 10.1016/j.copsyc.2022.101462. url: https://linkinghub.elsevier.com/retrieve/pii/S2352250X2200183X.
[23] R. West and J. Leskovec. “Human Wayfinding in Information Networks”. In: Proceedings of the International World Wide Web Conference (WWW). 2012. doi: 10.1145/2187836.2187920.
[24] L. Woolcott and A. Shiri, eds. Discoverability in Digital Repositories: Systems, Perspectives, and User Studies. Routledge, 2023.
[25] J. Zhang and A. Ghorbani. “The reconstruction of user sessions from a server log using improved time-oriented heuristics”. In: Proceedings. Second Annual Conference on Communication Networks and Services Research, 2004. May 2004, pp. 315–322. doi: 10.1109/dnsr.2004.1344744. url: https://ieeexplore.ieee.org/document/1344744.

A. Data and pre-processing

A log entry is a string from which we can extract the meaningful features described in Table 4. Some of these fields may be empty, in which case they are represented as either “null” or ‘-’. Two examples of logs are shown in figures 18 and 19.

A.1. ARK (Archival Resource Key)

As mentioned, some requests contain ARKs that can be extracted from them. For example, from the log in figure 18, we could extract the ARK “bpt6k20211m”.
These Archival Resource Keys represent unique identifiers for each document, from which document metadata can be obtained by querying Gallica’s website. These ARKs do not change over time. An ARK request is shown in figure 20 (figure taken from [17]):

Figure 20: An ARK request to Gallica

The NAAN (name assigning authority number) is always the same for Gallica: 12148. The NMA indicates the website at which the resource can be accessed. From a request like this, we can extract the ARK name and then use it to request the metadata of the document.

Table 4: Features description

Feature               Description
Hashed IP address     Anonymized IP address
Country and City      Location of the request
Date of the request   Format: day/month/year:hour:minute time zone offset
HTTP request          Contains additional information such as the ARK
Protocol              Communication protocol
Response number       HTTP response code
Length                Length of the request
Referrer              Website the user comes from

Figure 18: A log with missing information

Figure 19: A log with all information

A.2. Enrichment

From our logs, we extract meaningful features and the ARK. From the request, we can also recover the search terms, if there were any: we check whether “search” appears in the request, then parse the URL, find the query parameters, and use a regular expression to extract the terms. From the ARKs, we want to obtain document metadata. For our study of rabbit holes, what interests us is qualifying the diversity and semantic diffusion of a session. For that, we need the type and theme of each document consulted. This is done using Gallica’s information retrieval service. An example of what the request “https://gallica.bnf.fr/services/OAIRecord?ark=btv1b6907077k” yields is shown in figure 21.

Figure 21: Result of an OAIRecord query

We can obtain the type of document under the typedoc field, here an image, and the theme, which is a Dewey class, under sdewey, not featured here as it is an optional field.
This is a major limitation to measuring the diversity of themes across a session, as only prints have a Dewey class. Lastly, we want to know how visible a document is, namely how many times it was consulted across all sessions of that month. To retrieve this information, we count unique pairs of ARK and hashed IP address (a person can only contribute one view to a document), and store each ARK with its associated count for later use.
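
The extraction and counting steps described in this appendix can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline actually used in the study. The ARK pattern assumes requests of the form ark:/12148/<name>, as described in A.1. The query parameter name `query`, the sample request strings, and the hashed IP values are hypothetical. The search-term step uses the standard library's query-string parser in place of the regular expression mentioned in the text.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Gallica's NAAN is 12148 (see A.1); the ARK name is the segment after it.
ARK_PATTERN = re.compile(r"ark:/12148/([a-z0-9]+)")


def extract_ark(request):
    """Return the ARK name found in an HTTP request string, or None."""
    match = ARK_PATTERN.search(request)
    return match.group(1) if match else None


def extract_search_terms(request, param="query"):
    """Return the search terms if the request is a search, else None.

    `param` is a hypothetical parameter name; adapt it to the real logs.
    """
    if "search" not in request:
        return None
    # A request line looks like 'GET /path?x=y HTTP/1.1'; keep the URL part.
    parts = request.split()
    url = parts[1] if len(parts) >= 2 else request
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


def count_views(entries):
    """Count, per ARK, the number of distinct hashed IPs that requested it."""
    unique_pairs = set()
    for hashed_ip, request in entries:
        ark = extract_ark(request)
        if ark is not None:
            unique_pairs.add((ark, hashed_ip))
    return Counter(ark for ark, _ in unique_pairs)


# Hypothetical (hashed IP, HTTP request) pairs, for illustration only.
entries = [
    ("ip_a", "GET /ark:/12148/bpt6k20211m/f1.item HTTP/1.1"),
    ("ip_a", "GET /ark:/12148/bpt6k20211m/f2.item HTTP/1.1"),
    ("ip_b", "GET /ark:/12148/bpt6k20211m HTTP/1.1"),
    ("ip_b", "GET /services/engine/search/sru?query=alice HTTP/1.1"),
]
views = count_views(entries)
print(views["bpt6k20211m"])  # 2: ip_a is counted once, ip_b once
```

Collecting (ARK, hashed IP) pairs in a set before counting is what ensures that repeated requests from the same address contribute a single view to a document.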