<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IIR</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesca Pezzuti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sean MacAvaney</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Glasgow</institution>
          ,
          <addr-line>Glasgow</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>15</volume>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Given the vast scale of the Web, crawling prioritisation techniques based on graph traversal, popularity, link analysis, and textual content are frequently applied to surface documents that are most likely to be valuable. While these techniques have proven effective for keyword-based search, retrieval methods and user search behaviours are shifting from keyword-based matching to natural language semantic matching. Semantic matching and quality signals have been applied during ranking with great success, and recently, researchers have proposed to exploit them also to prioritise the frontier of Web crawlers. To investigate this further, we propose two novel neural policies with the goal of surfacing content that is semantically rich and valuable for modern search needs, ultimately aligning the crawler behaviour with the recent shift towards natural language search. Our experiments on the English subset of ClueWeb22-B and the MS MARCO Web Search and Researchy Questions query sets show that, compared to existing crawling techniques, neural crawling policies significantly improve harvest rate during the early stages of crawling.</p>
      </abstract>
      <kwd-group>
        <kwd>Crawling</kwd>
        <kwd>Web search</kwd>
        <kwd>Quality estimation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The effectiveness of search engines heavily depends on the indexed corpus: if the corpus is incomplete
or filled with low-quality pages, search results could be irrelevant [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. A crawler is a program that
systematically traverses the Web and downloads Web pages to build such a search corpus and keep it
up to date. Its ability to prioritise high-quality pages is crucial for providing accurate and relevant results
to user queries [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Crawlers maintain a priority queue of URLs of pages to visit, called the frontier, and
continuously download pages, extract their outgoing links, and prioritise them in the frontier to select
the next pages to crawl. Traditional graph traversal algorithms like Breadth-First Search (BFS) can
be used for traversal, without utilising any heuristic to prioritise URLs [
        <xref ref-type="bibr" rid="ref13 ref6">13, 6</xref>
        ]. In contrast, Best-First
policies (BF) are designed to prioritise the frontier using a heuristic such as click-through rate [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
PageRank [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], or textual content [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], although each of them comes with limitations. For instance,
the most notable BF policy, PageRank, assigns higher priority to well-linked pages, but requires storing
the full Web graph, involves resource-intensive computations [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and is inaccurate on sub-graphs [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ].
On the other hand, content-based quality estimation using keyword matching or term frequency has
been mainly used in focused crawlers [24], which are query-driven and outside the scope of our work.
      </p>
      <p>
        Although all these prioritisation policies work well for keyword-based queries, they either (i) ignore
the textual content of pages, (ii) use it for keyword matching w.r.t. topic keywords, or (iii) use it for
query-driven focused-crawling, relying on relevance signals based on term frequency and inverse
document frequency, ignoring the semantics of texts. Recent advancements in contextualised Large
Language Models (LLMs) have shifted search on the Web toward conversational and complex,
question-based queries rather than simple keywords [
        <xref ref-type="bibr" rid="ref8">23, 8</xref>
        ]. Consequently, many search applications, such as
question answering systems, LLM-based assistants, and mobile voice search, shifted their focus from
short keyword queries to natural language ones [
        <xref ref-type="bibr" rid="ref4 ref8">25, 22, 8, 4, 26, 21</xref>
        ].
      </p>
      <p>
        In this work, we argue that crawlers should also adapt to this shift. We hypothesise that, by using
LLMs to estimate quality during crawling, we can improve the ability of crawlers to surface documents
valuable for search tasks, particularly for natural language search. Building on the approach by Pezzuti
et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], we propose to prioritise the crawling frontier based on the quality scores generated by the
neural quality estimators introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We call this approach neural crawling. A neural quality
estimator is an LLM fine-tuned to assess the likelihood of a textual document being relevant to any
query. Given its semantic understanding of text, the quality scores it produces capture the semantic
quality of the input text. Relying on the observation that Web pages of similar quality are likely to link to each
other, we propose two neural policies that exploit this property to propagate quality from the inlinking
neighbourhood of each Web page. While existing studies have shown the potential of neural crawlers for
targeting high-quality LLM pre-training data and for generic search tasks [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], the alignment of neural
and traditional crawlers with the recent trend towards natural language search remains unexplored.
To investigate this, we evaluate the crawling effectiveness on keyword queries and natural language
queries, using search corpora crawled by a traditional BFS crawler and by our proposed neural crawlers.
      </p>
      <p>
        Our experiments on the English subset of ClueWeb22-B [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] show that, w.r.t. BFS, our neural
policies can significantly improve crawling effectiveness for natural language queries from Researchy
Questions [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] (up to +149% in HR), while remaining competitive for keyword queries from MS
MARCO Web Search [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (up to +20% in HR).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Neural Crawling Policies</title>
      <p>Let $p$ denote a Web page, and $\mathcal{O}_p$ denote the set of outlinks from $p$, i.e., the pages that $p$ links to. Let $\mathcal{F}$
denote the frontier, which stores the URLs of the pages yet to be crawled. The frontier is composed of
$(u, \pi)$ pairs, where $u$ is a URL and $\pi$ is its priority. A crawling policy is composed of: (i) a
priority-assignment function $P : u \mapsto \pi$, typically based on heuristics such as PageRank, (ii) an update policy,
which defines how priorities are updated when discovering a new link to a page whose URL is already in
the frontier, and (iii) a selection policy, which decides which page to crawl next according to its priority. For
instance, the well-known Breadth-First Search (BFS) crawling policy uses a constant priority-assignment
function, a First-In-First-Out selection policy, and an update policy that does not change priorities upon
rediscovery. In contrast, the Best-First (BF) policy employs a priority-assignment function based on a
link-based quality estimation heuristic such as PageRank, a maximum priority selection policy, and
advanced update policies. Inspired by the link-based nature of PageRank, we propose two neural BF
crawling policies, called QFirst and QMin, which leverage an LLM-based quality estimation heuristic to
prioritise Web pages with high semantic quality during the crawling process.</p>
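      <p>As an illustration of the three components above, the BFS policy can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in our experiments: the function name bfs_crawl, the in-memory outlinks dictionary that stands in for downloading a page and extracting its links, and the budget parameter are all hypothetical.</p>
      <preformat>
```python
from collections import deque
from typing import Dict, List, Sequence


def bfs_crawl(seeds: Sequence[str], outlinks: Dict[str, List[str]], budget: int) -> List[str]:
    """Crawl in FIFO order; all URLs share the same constant priority."""
    unique_seeds = list(dict.fromkeys(seeds))   # drop duplicate seeds, keep their order
    frontier = deque(unique_seeds)              # FIFO selection policy
    enqueued = set(unique_seeds)                # every URL discovered so far
    order: List[str] = []                       # URLs in the order they are crawled
    while frontier and budget > len(order):
        url = frontier.popleft()
        order.append(url)
        # Hypothetical stand-in for downloading the page and extracting its links.
        for child in outlinks.get(url, ()):
            if child not in enqueued:           # update policy: rediscovery changes nothing
                enqueued.add(child)
                frontier.append(child)
    return order
```
      </preformat>
      <p>Because every URL has the same priority, the frontier reduces to a plain queue, and a rediscovered URL keeps its original position.</p>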
      <p>
        At the core of our proposed neural BF crawling policies is the use of an LLM-based heuristic function
$Q_\theta : p \mapsto \mathbb{R}$, parametrised by $\theta$ and optimised to distinguish high semantic quality pages from
low-quality ones. In particular, we aim to exploit this neural heuristic in the priority-assignment function,
ideally using $P : u \mapsto Q_\theta(p)$ to prioritise a page $p$ whose URL is $u$. However, in real-world crawling
settings we cannot access the textual content of a Web page before its download. Indeed, using the ideal
semantic quality as the priority when enqueueing pages into $\mathcal{F}$ is only feasible in a theoretical scenario
where an oracle function has access to a page's text prior to its download. We refer to this oracle-based
crawling policy as QOracle [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and we use it as an upper-bound on the performance achievable by
our practical neural crawling policies.
      </p>
      <p>In the absence of an oracle, we can reasonably assume that the quality of a page is related to the
quality of the pages it is connected to. Indeed, existing literature shows that the quality of a page is
positively correlated with that of its linking neighbours. If this relationship holds, we can effectively
propagate quality via the link structure, using the quality of one of a page's ancestors, i.e., a page that
links to it, as a proxy estimate of the page's own quality.</p>
      <p>In our first neural crawling policy, referred to as QFirst, when processing a page $p$ and encountering
the outgoing URL $\tilde{u}$ of a new page $\tilde{p} \in \mathcal{O}_p$ for the first time, we insert $\tilde{u}$ into $\mathcal{F}$ with priority $\tilde{\pi} = Q_\theta(p)$
and we never update it.</p>
      <p>In our second neural crawling policy, referred to as QMin, we additionally assume that if a page is
linked from a low-quality page, it is highly unlikely to be of high quality. If this holds, we could postpone
the crawling of low-quality pages by decreasing their priority whenever a link from a low-quality
ancestor is discovered. In doing so, we aim to boost the prioritisation of high-quality Web pages while
deprioritising low-quality ones. To implement the QMin policy, when processing a page $p$ and
re-encountering an already enqueued URL $\tilde{u} \in \mathcal{F}$ of a page $\tilde{p} \in \mathcal{O}_p$, we update its priority as the minimum
between the current priority and the quality of the new ancestor $p$, i.e., $\tilde{\pi} \leftarrow \min\{\tilde{\pi}, Q_\theta(p)\}$.</p>
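      <p>The two policies differ only in what happens when an enqueued URL is rediscovered, which the following Python sketch tries to make concrete. It is an illustration under stated assumptions rather than our actual implementation: the Frontier class, the crawl function, the in-memory outlinks dictionary, the quality callable standing in for the neural quality estimator, and the choice of giving seeds an infinite priority are all hypothetical.</p>
      <preformat>
```python
import heapq
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple


class Frontier:
    """Max-priority frontier; QMin lowers priorities on rediscovery, QFirst never does."""

    def __init__(self, use_min_update: bool) -> None:
        self.use_min_update = use_min_update            # False: QFirst, True: QMin
        self.priority: Dict[str, float] = {}            # current priority of each known URL
        self.heap: List[Tuple[float, int, str]] = []    # (-priority, insertion order, URL)
        self.done: Set[str] = set()                     # URLs already selected for crawling
        self.counter = 0

    def _push(self, url: str, priority: float) -> None:
        self.priority[url] = priority
        heapq.heappush(self.heap, (-priority, self.counter, url))
        self.counter += 1

    def discover(self, url: str, parent_quality: float) -> None:
        if url in self.done:
            return
        current = self.priority.get(url)
        if current is None:
            self._push(url, parent_quality)             # first discovery: inherit ancestor quality
        elif self.use_min_update and current > parent_quality:
            self._push(url, parent_quality)             # QMin: keep the minimum; old entry goes stale

    def pop(self) -> Optional[str]:
        while self.heap:
            neg_priority, _, url = heapq.heappop(self.heap)
            if url not in self.done and self.priority[url] == -neg_priority:
                self.done.add(url)
                return url                              # stale heap entries are skipped
        return None


def crawl(seeds: Sequence[str], outlinks: Dict[str, List[str]],
          quality: Callable[[str], float], use_min_update: bool, budget: int) -> List[str]:
    frontier = Frontier(use_min_update)
    for seed in seeds:
        frontier.discover(seed, float("inf"))           # assumption: seeds are crawled first
    order: List[str] = []
    while budget > len(order):
        url = frontier.pop()
        if url is None:
            break
        order.append(url)
        page_quality = quality(url)                     # the page is scored only after download
        for child in outlinks.get(url, ()):
            frontier.discover(child, page_quality)
    return order
```
      </preformat>
      <p>In this sketch, ties between equal priorities are broken by insertion order, and a lowered priority is handled by pushing a fresh heap entry and ignoring the outdated one when it surfaces, which avoids re-heapifying the whole frontier.</p>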
      <p>
        We do not propose a QMax policy, as our preliminary experiments, consistent with prior
research [
        <xref ref-type="bibr" rid="ref19 ref3">19, 3</xref>
        ], suggest that neural estimators better identify low-quality than high-quality content.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>
        We perform a single-threaded simulation of the crawling process on the English subset of ClueWeb22-B
(CW22B-eng), which contains 87 head Web pages from ClueWeb22 (CW22) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], a recently released
corpus crawled by a commercial search engine. In our simulations, Web pages are crawled sequentially,
and we assume a constant per-page crawling time. In real-world scenarios, crawlers iteratively download
Web pages and stop after periods of duration $T$ to allow retrievers to update their index. To simulate
this, we measure the crawling effectiveness every $T$ = 2.5 crawled pages. We start to crawl from
100 randomly selected seed URLs, and we reach a total of 29 pages. The source code to reproduce
our experiments is publicly available on GitHub<sup>1</sup>.
      </p>
      <p>
        Queries. We use MS MARCO Web Search (MSM-WS) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Researchy Questions (RQ) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] query
sets, both generated from the logs of commercial search engines. The former contains queries reflecting
a real query distribution and relevance labels over CW22 extracted from a real click-log, with explicit
relevance assessments. The latter contains multi-perspective, non-factoid English queries, and a click
distribution over CW22. For each query in RQ, we consider the most clicked page to be relevant. Since
we work with a subset of CW22B, but both query sets are related to CW22, we excluded queries without
relevant results in CW22B-eng. MSM-WS queries are generally shorter than RQ queries, narrower in
scope, and mainly keyword-based. Moreover, MSM-WS queries contain fewer interrogatives such as
"how" and "why", while RQ queries have a broader scope and are similar to natural language.
      </p>
      <p>
        Neural Quality Estimation. In our experiments, we use the QT5-Small<sup>2</sup> neural quality estimator [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
fine-tuned on the MSM-WS training set by Pezzuti et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Effectiveness. There exist several approaches to compute the effectiveness of crawling policies [
        <xref ref-type="bibr" rid="ref11 ref7">11, 7</xref>
        ].
In this work, we use the Harvest Rate (HR), one of the most widely used metrics for
this purpose [
        <xref ref-type="bibr" rid="ref17 ref2">2, 17</xref>
        ]. Let $\mathcal{R}_{\mathcal{Q}}$ denote the set of all pages relevant to at least one query $q$ in a query set $\mathcal{Q}$. At time $t$,
for the query set $\mathcal{Q}$, the harvest rate $\mathrm{HR}(\mathcal{Q}, t)$ is defined as:
      </p>
      <p>$\mathrm{HR}(\mathcal{Q}, t) = |\mathcal{R}_{\mathcal{Q}}^{t}| / t$,
where $\mathcal{R}_{\mathcal{Q}}^{t} \subseteq \mathcal{R}_{\mathcal{Q}}$ is the subset of relevant pages crawled up to time $t$. As noted before, the page crawl time is
our time unit, and $t$ corresponds to the crawling of $t$ pages. This metric measures the crawler's ability to
maximise the number of crawled relevant pages while minimising that of irrelevant ones.</p>
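      <p>To make the metric concrete, the harvest rate can be computed incrementally from the sequence of crawled URLs. The Python sketch below is illustrative only: the function name harvest_rate_curve and its parameters are hypothetical, and it assumes that each page appears at most once in the crawl order.</p>
      <preformat>
```python
from typing import List, Sequence, Set, Tuple


def harvest_rate_curve(crawl_order: Sequence[str], relevant: Set[str],
                       step: int) -> List[Tuple[int, float]]:
    """Return (t, HR) pairs, measured after every `step` crawled pages."""
    curve: List[Tuple[int, float]] = []
    found = 0                                    # relevant pages crawled so far
    for t, url in enumerate(crawl_order, start=1):
        if url in relevant:
            found += 1
        if t % step == 0:
            curve.append((t, found / t))         # relevant pages crawled up to t, divided by t
    return curve
```
      </preformat>
      <p>Here relevant plays the role of the set of pages relevant to at least one query of the query set, and step is the number of pages crawled between two consecutive measurements.</p>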
      <p>
        Baseline. We compare our policies against BFS, a simple yet effective policy. We do not compare
with PageRank since prior research showed that on small graphs PageRank is not accurate and BFS is
stronger [
        <xref ref-type="bibr" rid="ref1 ref13 ref9">1, 9, 13</xref>
        ]. For significance testing, we use a two-tailed Z-test for proportions with $\alpha$ = 0.01.
      </p>
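      <p>For readers unfamiliar with the test, a two-tailed Z-test for the difference between two proportions can be sketched as follows. The pooled-variance form shown here is an assumption, since the exact variant is not specified above, and the function name and parameters are hypothetical; in our setting, a proportion would be the fraction of crawled pages that are relevant.</p>
      <preformat>
```python
import math
from typing import Tuple


def two_proportion_z_test(hits_a: int, n_a: int, hits_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-tailed p-value) for H0: both proportions are equal."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)     # pooled proportion under H0 (assumed variant)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0                          # degenerate case: no variability at all
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value
```
      </preformat>
      <p>A difference would then be reported as significant when the returned p-value falls below the chosen significance level.</p>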
      <sec id="sec-3-1">
        <p><sup>1</sup> https://github.com/fpezzuti/neural_crawling</p>
        <p><sup>2</sup> https://huggingface.co/macavaney/qt5-small-msw</p>
        <p>[Figure 1: harvest rate (HR) over time; plot not recovered.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>To investigate if using neural policies during crawling helps a crawler find relevant pages earlier than
BFS, Figure 1 shows the HR over time for our oracle neural policy QOracle, our two practical neural
policies QFirst and QMin, and BFS, on both the MSM-WS and RQ datasets.</p>
      <p>From the figure, we note that the QOracle policy initially exhibits superior performance w.r.t. all
other policies in surfacing pages relevant to keyword queries in MSM-WS, and is immediately followed
by the QFirst policy. On natural language queries, all our neural crawlers substantially outperform BFS,
and the QMin policy attains almost the same HR as QOracle.</p>
      <p>The fact that QMin exhibits comparable performance to QOracle on RQ, while being unexpectedly
on par with BFS on MSM-WS, suggests that most pages valuable for natural language search can be
easily reached by postponing the exploration of low-quality links, while some of the pages valuable
for keyword-oriented search may only be reachable through low-quality links. Thus, reluctance in
following these links may hamper the discovery of valuable Web pages located deeper in the Web graph.</p>
      <p>Meanwhile, QFirst, our simplest policy, significantly outperforms BFS on both query sets, and achieves
competitive performance w.r.t. the other methods without introducing excessive overhead. Unlike
QOracle and QMin, which rely on a greedier prioritisation and favour exploitation, QFirst is more
exploration-oriented as it relies on noisier estimates. As a result, it has higher chances of discovering
valuable pages only reachable through local minima.</p>
      <p>To conclude, our experiments show that, w.r.t. BFS, our QMin policy can significantly improve
crawling effectiveness for natural language queries from RQ by up to +149% in early HR, while
remaining competitive for keyword queries from MSM-WS, with up to +20% in early HR.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we proposed two neural policies for neural crawling, both leveraging neural quality
estimators to prioritise the early crawl of semantically high-quality pages.</p>
      <p>We compared our proposed policies with an oracle policy and with the well-established BFS baseline
in terms of crawling effectiveness. Our findings reveal that, especially for natural language queries,
we can markedly improve the early effectiveness of the crawler by using a neural policy in place of a
traditional one. While our results show the promise of our approach, we recognise several limitations
of this work that open up meaningful directions for future research. First, our experiments were
conducted in a controlled, simulated setting; the effectiveness of our approach has not been validated
in real-world, multi-threaded environments, which are typically subject to practical constraints such as
politeness policies and host reachability issues. Second, further investigation is needed to better
understand the potential biases introduced by neural quality estimators, particularly in terms of fairness
and transparency. Third, our proposed policies may be vulnerable to adversarial manipulation, and
their robustness to such attacks has yet to be explored.</p>
      <p>We leave for future work experiments on other Web corpora and query sets, as well as experiments
with other policies and other baseline comparisons.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <p>During the preparation of this work, the author did not use any AI tool.</p>
        <p>Jennifer Neville, and Nikhil Rao. Researchy Questions: A Dataset of Multi-Perspective,
Decompositional Questions for LLM Web Agents. arXiv:2402.17896, 2024.</p>
        <p>[21] Artsiom Sauchuk, James Thorne, Alon Y. Halevy, Nicola Tonellotto, and Fabrizio Silvestri. On the
Role of Relevance in Natural Language Processing Tasks. In Proc. SIGIR, pages 1785–1789, 2022.</p>
        <p>[22] Khan Tajmir, Rashid Umer, and Rehman Abdur. End-to-end pseudo relevance feedback based
vertical web search queries recommendation. Multimedia Tools and Applications, 83(31):75995–76033, 2024.</p>
        <p>[23] Johanne R. Trippas, Sara Fahad Dawood Al Lawati, Joel Mackenzie, and Luke Gallagher. What do
Users Really Ask Large Language Models? An Initial Log Analysis of Google Bard Interactions in
the Wild. In Proc. SIGIR, pages 2703–2707, 2024.</p>
        <p>[24] Lalit Kumar Tyagi, Anish Gupta, and Vibhash Singh Sisodia. A New Era of Web Mining: Innovative
Approaches in Focused Web Crawling for Domain-Specific Information. In Proc. ICTACS, pages
1–6, 2023.</p>
        <p>[25] Ryen W. White. Advancing the Search Frontier with AI Agents. Commun. ACM, 67(9):54–65, 2024.</p>
        <p>[26] Yutong Xie, Zhaoying Pan, Jinge Ma, Luo Jie, and Qiaozhu Mei. A Prompt Log Analysis of
Text-to-Image Generation Systems. In Proc. WWW, pages 3892–3902, 2023.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Boldi</surname>
          </string-name>
          , Massimo Santini, and
          <string-name>
            <given-names>Sebastiano</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
          <article-title>Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations</article-title>
          .
          <source>In Proc. WAW</source>
          , pages
          <fpage>168</fpage>
          -
          <lpage>180</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Soumen</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          , Martin van den Berg, and
          <string-name>
            <given-names>Byron</given-names>
            <surname>Dom</surname>
          </string-name>
          .
          <article-title>Focused crawling: a new approach to topic-specific Web resource discovery</article-title>
          .
          <source>Computer Networks</source>
          ,
          <volume>31</volume>
          (
          <issue>11-16</issue>
          ):
          <fpage>1623</fpage>
          -
          <lpage>1640</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Xuejun</given-names>
            <surname>Chang</surname>
          </string-name>
          , Debabrata Mishra, Craig Macdonald, and Sean MacAvaney.
          <article-title>Neural Passage Quality Estimation for Static Pruning</article-title>
          .
          <source>In Proc. SIGIR</source>
          , pages
          <fpage>174</fpage>
          -
          <lpage>185</lpage>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jia</given-names>
            <surname>Chen</surname>
          </string-name>
          , Jiaxin Mao, Yiqun Liu, Fan Zhang, Min Zhang, and Shaoping Ma.
          <article-title>Towards a Better Understanding of Query Reformulation Behavior in Web Search</article-title>
          .
          <source>In Proc. WWW</source>
          , pages
          <fpage>743</fpage>
          -
          <lpage>755</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Qi</given-names>
            <surname>Chen</surname>
          </string-name>
          , Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen,
          <string-name>
            <given-names>Kun</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu,
          <string-name>
            <given-names>Mingqin</given-names>
            <surname>Li</surname>
          </string-name>
          , Chuanjie Liu,
          <string-name>
            <given-names>Zengzhong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rangan</given-names>
            <surname>Majumder</surname>
          </string-name>
          , Jennifer Neville, Andy Oakley, Knut Magne Risvik, Harsha Vardhan Simhadri, Manik Varma, Yujing Wang,
          <string-name>
            <given-names>Linjun</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mao</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ce</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels</article-title>
          .
          <source>In Proc. WWW</source>
          , pages
          <fpage>292</fpage>
          -
          <lpage>301</lpage>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Paul M.E.</given-names>
            <surname>De Bra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Reinier D.J.</given-names>
            <surname>Post</surname>
          </string-name>
          .
          <article-title>Information retrieval in the World-Wide Web: Making client-based searching feasible</article-title>
          .
          <source>Computer Networks</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <fpage>183</fpage>
          -
          <lpage>192</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Dennis</given-names>
            <surname>Fetterly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nick</given-names>
            <surname>Craswell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vishwa</given-names>
            <surname>Vinay</surname>
          </string-name>
          .
          <article-title>The impact of crawl policy on web search effectiveness</article-title>
          .
          <source>In Proc. SIGIR</source>
          , pages
          <fpage>580</fpage>
          -
          <lpage>587</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ido</given-names>
            <surname>Guy</surname>
          </string-name>
          .
          <article-title>Searching by Talking: Analysis of Voice Queries on Mobile Web Search</article-title>
          .
          <source>In Proc. SIGIR</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>44</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Helge</given-names>
            <surname>Holzmann</surname>
          </string-name>
          , Avishek Anand, and
          <string-name>
            <given-names>Megha</given-names>
            <surname>Khosla</surname>
          </string-name>
          .
          <article-title>Estimating pagerank deviations in crawled graphs</article-title>
          .
          <source>Applied Network Science</source>
          ,
          <volume>4</volume>
          (
          <issue>86</issue>
          ),
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Dirk</given-names>
            <surname>Lewandowski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nadine</given-names>
            <surname>Höchstötter</surname>
          </string-name>
          .
          <source>Web Searching: A Quality Measurement Perspective</source>
          , pages
          <fpage>309</fpage>
          -
          <lpage>340</lpage>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Filippo</given-names>
            <surname>Menczer</surname>
          </string-name>
          , Gautam Pant, Padmini Srinivasan, and
          <string-name>
            <given-names>Miguel E.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          .
          <article-title>Evaluating topic-driven web crawlers</article-title>
          .
          <source>In Proc. SIGIR</source>
          , pages
          <fpage>241</fpage>
          -
          <lpage>249</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Filippo</given-names>
            <surname>Menczer</surname>
          </string-name>
          , Gautam Pant, and
          <string-name>
            <given-names>Padmini</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          .
          <article-title>Topical web crawlers: Evaluating adaptive algorithms</article-title>
          .
          <source>ACM TOIT</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ):
          <fpage>378</fpage>
          -
          <lpage>419</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Marc</given-names>
            <surname>Najork</surname>
          </string-name>
          and
          <string-name>
            <given-names>Janet L.</given-names>
            <surname>Wiener</surname>
          </string-name>
          .
          <article-title>Breadth-first crawling yields high-quality pages</article-title>
          .
          <source>In Proc. WWW</source>
          , pages
          <fpage>114</fpage>
          -
          <lpage>118</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Liudmila</given-names>
            <surname>Ostroumova</surname>
          </string-name>
          , Ivan Bogatyy, Arseniy Chelnokov, Alexey Tikhonov, and
          <string-name>
            <given-names>Gleb</given-names>
            <surname>Gusev</surname>
          </string-name>
          .
          <article-title>Crawling Policies Based on Web Page Popularity Prediction</article-title>
          .
          <source>In Proc. ECIR</source>
          , pages
          <fpage>100</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Arnold</given-names>
            <surname>Overwijk</surname>
          </string-name>
          , Chenyan Xiong, and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <article-title>ClueWeb22: 10 Billion Web Documents with Rich Information</article-title>
          .
          <source>In Proc. SIGIR</source>
          , pages
          <fpage>3360</fpage>
          -
          <lpage>3362</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Page</surname>
          </string-name>
          , Sergey Brin, Rajeev Motwani, and
          <string-name>
            <given-names>Terry</given-names>
            <surname>Winograd</surname>
          </string-name>
          .
          <article-title>The PageRank Citation Ranking: Bringing Order to the Web</article-title>
          .
          <source>In Proc. WWW</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Gautam</given-names>
            <surname>Pant</surname>
          </string-name>
          and
          <string-name>
            <given-names>Padmini</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          .
          <article-title>Link contexts in classifier-guided topical crawlers</article-title>
          .
          <source>IEEE TKDE</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>122</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Francesca</given-names>
            <surname>Pezzuti</surname>
          </string-name>
          , Sean MacAvaney, and Nicola Tonellotto.
          <article-title>Neural Prioritisation for Web Crawling</article-title>
          .
          <source>In Proc. ICTIR, page 8</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Francesca</given-names>
            <surname>Pezzuti</surname>
          </string-name>
          , Ariane Mueller, Sean MacAvaney, and Nicola Tonellotto.
          <article-title>Document Quality Scoring for Web Crawling</article-title>
          .
          <source>arXiv:2504.11011</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Corby</given-names>
            <surname>Rosset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ho-Lam</given-names>
            <surname>Chung</surname>
          </string-name>
          , Guanghui Qin, Ethan C. Chau, Zhuo Feng, Ahmed Awadallah,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>