<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Exploration versus Exploitation in Topic Driven Crawlers</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gautam</forename><surname>Pant</surname></persName>
							<email>gautam-pant@uiowa.edu</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Management Sciences</orgName>
								<orgName type="department" key="dep2">School of Library and Information Science</orgName>
								<orgName type="institution">The University of Iowa Iowa City</orgName>
								<address>
									<postCode>52242</postCode>
									<region>IA</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Padmini</forename><surname>Srinivasan</surname></persName>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Management Sciences</orgName>
								<orgName type="department" key="dep2">School of Library and Information Science</orgName>
								<orgName type="institution">The University of Iowa Iowa City</orgName>
								<address>
									<postCode>52242</postCode>
									<region>IA</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Filippo</forename><surname>Menczer</surname></persName>
							<email>filippo-menczer@uiowa.edu</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Management Sciences</orgName>
								<orgName type="department" key="dep2">School of Library and Information Science</orgName>
								<orgName type="institution">The University of Iowa Iowa City</orgName>
								<address>
									<postCode>52242</postCode>
									<region>IA</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Exploration versus Exploitation in Topic Driven Crawlers</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F467152D8E3D07714C75E0351D7384DF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:29+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The dynamic nature of the Web highlights the scalability limitations of universal search engines. Topic driven crawlers can address the problem by distributing the crawling process across users, queries, or even client computers. The context available to a topic driven crawler allows for informed decisions about how to prioritize the links to be visited. Here we focus on the balance between a crawler's need to exploit this information to focus on the most promising links, and the need to explore links that appear suboptimal but might lead to more relevant pages. We investigate the issue for two different tasks: (i) seeking new relevant pages starting from a known relevant subset, and (ii) seeking relevant pages starting a few links away from the relevant subset. Using a framework and a number of quality metrics developed to evaluate topic driven crawling algorithms in a fair way, we find that a mix of exploitation and exploration is essential for both tasks, in spite of a penalty in the early stage of the crawl.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>A recent projection estimates the size of the visible Web today (March 2002) to be around 7 billion "static" pages <ref type="bibr" target="#b9">[10]</ref>. The largest search engine, Google, claims to be "searching" about 2 billion pages. The fraction of the Web cov-ered by search engines has not improved much over the past few years <ref type="bibr" target="#b15">[16]</ref>. Even with increasing hardware and bandwidth resources at their disposal, search engines cannot keep up with the growth of the Web and with its rate of change <ref type="bibr" target="#b4">[5]</ref>.</p><p>These scalability limitations stem from search engines' attempt to crawl the whole Web, and to answer any query from any user. Decentralizing the crawling process is a more scalable approach, and bears the additional benefit that crawlers can be driven by a rich context (topics, queries, user profiles) within which to interpret pages and select the links to be visited. It comes as no surprise, therefore, that the development of topic driven crawler algorithms has received significant attention in recent years <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b18">19]</ref>.</p><p>Topic driven crawlers (also known as focused crawlers) respond to the particular information needs expressed by topical queries or interest profiles. These could be the needs of an individual user (query time or online crawlers) or those of a community with shared interests (topical search engines and portals).</p><p>Evaluation of topic driven crawlers is difficult due to the lack of known relevant sets for Web searches, to the presence of many conflicting page quality measures, and to the need for fair gauging of crawlers' time and space algorithmic complexity. In recent research we presented an evaluation framework designed to support the comparison of topic driven crawler algorithms 1 under specified resource constraints <ref type="bibr" target="#b18">[19]</ref>. In this paper we further this line of research by investigating the relative merits of exploration versus exploitation as a defining characteristic of the crawling mechanism.</p><p>The issue of exploitation versus exploration is a universal one in machine learning and artificial intelligence, since it presents itself in any task where search is guided by quality estimations. Under some regularity assumption, one can assume that a measure of quality at one point in the search space provides some information on the quality of nearby points. A greedy algorithm can then exploit this information by concentrating the search in the vicinity of the most promising points. However, this strategy can lead to missing other equally good or even better points, for two reasons: first, the estimates may be noisy; and second, the search space may have local optima that trap the algorithm and keep it from locating global optima. In other words, it may be necessary to visit some "bad" points in order to arrive at the best ones. At the other extreme, algorithms that completely disregard quality estimates and continue to explore in a uniform or random fashion do not risk getting stuck at local optima, but they do not use the available information to bias the search and thus may spend most of their time exploring suboptimal areas. A balance between exploitation and exploration of clues is obviously called for in heuristic search algorithms, but the optimal compromise point is unknown unless the topology of the search space is well understood -which is typically not the case.</p><p>Topic driven crawlers fit into this picture very well if one views the Web as the search space, with pages as points and neighborhoods as defined by hyperlinks. A crawler must decide which pages to visit based on the cues provided by links from nearby pages. If one assumes that a relevant page has a higher probability to be near other relevant pages than to any random page, then quality estimate of pages provide cues that can be exploited to bias the search process. However, given the short range of relevance clues on the Web <ref type="bibr" target="#b16">[17]</ref>, a very relevant page might be only a few links behind an apparently irrelevant one.</p><p>Balancing the exploitation of quality estimate information with exploration of suboptimal pages is thus crucial for the performance of topic driven crawlers. It is a question that we study empirically with respect to two different tasks. In the first, we seek relevant pages starting from a set of relevant links. Applications of such a task are query-time search agents that use results of a search engine as starting points to provide a user with recent and personalized results. Since we start from relevant links, we may expect an exploratory crawler to perform reasonably well. The second task involves seeking relevant pages while starting the crawl from links that are a few links away from a relevant subset. Such a task may be a part of Web mining or competitive intelligence applications (e.g., a search starting from competitors' home pages). If we do not start from a known relevant subset, the appropriate balance of exploration vs. exploitation becomes an empirical question.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Evaluation Framework</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Topics, Examples and Neighbors</head><p>In order to evaluate crawler algorithms, we need topics, some corresponding relevant examples, and neighbors. The neighbors are URLs extracted from neighborhood of the examples. We obtain our topics from the Open Directory (DMOZ). We ran randomized Breadth-First crawls starting from each of the main categories on the DMOZ site. <ref type="foot" target="#foot_0">1</ref> The crawlers identify DMOZ "leaves," i.e., pages that have no children category nodes. Leaves with five or more external links are then used to derive topics. We thus collected 100 topics.</p><p>A topic is represented by three types of information derived from the corresponding leaf page. First, the words in the DMOZ hierarchy form the topic's keywords. Second, up to 10 external links form the topic's examples. Third, we concatenate the text descriptions and anchor text of the target URLs (written by DMOZ human editors) to form a topic description. The difference be- tween topic keywords and topic descriptions is that we give the former to the crawlers, as models of (short) query-like topics, while we use the latter, which are much more detailed representations of the topics, to gauge the relevance of the crawled pages in our post-hoc analysis. Table <ref type="table" target="#tab_0">1</ref> shows a sample topic. The neighbors are obtained for each topic through the following process. For each of the examples, we obtain the top 20 inlinks as returned by Google. <ref type="foot" target="#foot_3">2</ref> Next, we get the top 20 inlinks for each of the inlinks obtained earlier. Hence, if we had 10 examples to start with, we may now have a maximum of 4000 unique URLs. A subset of 10 URLs is then picked at random from this set. The links in such a subset are called the neighbors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Architecture</head><p>We use the a previously proposed evaluation framework to compare different crawlers <ref type="bibr" target="#b18">[19]</ref>. The framework allows one to easily plug in modules implementing arbitrary crawling algorithms, which share data structures and utilities to optimize efficiency without affecting the fairness of the evaluation.</p><p>As mentioned before, we use the crawlers for two different tasks. For the first task, the crawlers start from the examples while for the second the starting points are the neighbors. In either case, as the pages are fetched their component URLs are added to a list that we call the frontier. A crawler may use topic's keywords to guide the selection of frontier URLs that are to be fetched at each iteration. For a given topic, a crawler is allowed to crawl up to MAX PAGES = 2000 pages. However, a crawl may end sooner if the crawler's frontier should become empty. We use a timeout of 10 seconds for Web downloads. Large pages are chopped so that we retrieve only the first 100 KB. The only protocol allowed is HTTP (with redirection allowed), and we also filter out all but "static pages" with text/html content. Stale links yielding HTTP error codes are removed as they are found (only good links are used in the analysis).</p><p>We constrain the space resources a crawler algorithm can use by restricting the frontier size to MAX BUFFER = 256 URLs. If the buffer becomes full then the crawler must decide which links are to be replaced as new links are added.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Crawling Algorithms</head><p>In this paper we study the notion of exploration versus exploitation. We begin with a single family of crawler algorithms with a single greediness parameter to control the exploration/exploitation behavior. In our previous experiments <ref type="bibr" target="#b18">[19]</ref> we found that a naive Best-First crawler displayed the best performance among three crawlers considered. Hence, in this study we explore variants of the Best-First crawler. More generally, we examine the Best-N-First family of crawlers where the parameter N controls the characteristic of interest.</p><p>Best-First crawlers have been studied before <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b13">14]</ref>. The basic idea is that given a frontier of  links, the best link according to some estimation criterion is selected for crawling.</p><p>Best-N-First is a generalization in that at each iteration a batch of top N links to crawl are selected. After completing the crawl of N pages the crawler decides on the next batch of N and so on. As mentioned above, the topic's keywords are used to guide the crawl. More specifically this is done in the link selection process by computing the lexical similarity between a topic's keywords and the source page for the link. Thus the similarity between a page p and the topic is used to estimate the relevance of the pages linked from p. The N URLs with the best estimates are then selected for crawling. Cosine similarity is used by the crawlers and the links with minimum similarity score are removed from the frontier if necessary in order to not exceed the MAX BUFFER size. Figure <ref type="figure" target="#fig_0">1</ref> offers a simplified pseudocode of a Best-N-First crawler.</p><p>Best-N-First offers an ideal context for our study. The parameter N controls the greedy behavior of the crawler. Increasing N results in crawlers with greater emphasis on exploration and consequently a reduced emphasis on exploitation. Decreasing N reverses this; selecting a smaller set of links is more exploitative of the evidence available regarding the potential merits of the links. In our experiments we test five mutants of the crawler by setting N to 1, 16, 64, 128 and 256. We refer to them as BFSN where N is one of the above values.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation Methods</head><p>Table <ref type="table" target="#tab_2">2</ref> depicts our overall methodology for crawler evaluation. The two rows of Table <ref type="table" target="#tab_2">2</ref> indicate two different methods for gauging page quality. The first is a purely lexical approach wherein similarity to the topic description is used to assess relevance. The second method is primarily linkage based and is an approximation of the retrieval/ranking method used by Google <ref type="bibr" target="#b5">[6]</ref>; it uses PageRank to discriminate between pages containing the same number of topic keywords.</p><p>The columns of the table show that our measures are used both from a static and a dynamic perspective. The static approach examines crawl quality assessed from the full set of (up to 2000) pages crawled for each query. In contrast the dynamic measures provide a temporal characterization of the crawl strategy, by considering the pages fetched while the crawl is in progress. More specifically, the static approach measures coverage, i.e., the ability to retrieve "good" pages where the quality of a page is assessed in two different ways (corresponding to the rows of the table). Our static plots show the ability of each crawler to retrieve more or fewer highly relevant pages. This is analogous to plotting recall as a function of generality.</p><p>The dynamic approach examines the quality of retrieval as the crawl progresses. Dynamic plots offer a trajectory over time that displays the dynamic behavior of the crawl. The measures are built on average (quality-based) ranks and are generally inversely related to precision. As the average rank decreases, an increasing proportion of the crawled set can be expected to be relevant.</p><p>It should be noted that scores and ranks used in each dynamic measure are computed omnisciently, i.e., all calculations for each point in time for a crawler are done using data generated from the full crawl. For instance, all PageRank scores are calculated using the full set of retrieved pages. This strategy is quite reasonable given that we want to use the best possible evidence when judging page quality. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Lexical Based Page Quality</head><p>We use the SMART system <ref type="bibr" target="#b22">[23]</ref> to rank the retrieved pages by their lexical similarity to the topic. The SMART system allows us to pool all the pages crawled by all the crawlers for a topic and then rank these against the topic description. The system utilizes term weighting strategies involving term frequency and inverse document frequency computed from the pooled pages for a topic. SMART computes the similarity between the query and the topic as a dot product of the topic and page vectors. It outputs a ranked set of pages based on their topic similarity scores. That is, for each page we get a rank which we refer to as rank SM ART (cf. Table <ref type="table" target="#tab_2">2</ref>). Thus given a topic, the percentage of top n pages ranked by SMART (where n varies) that are retrieved by each crawler may be calculated, yielding the static evaluation metric.</p><p>For the dynamic view we use the rank SM ART values for pages to calculate mean rank SM ART at different points of the crawl. If we let S crawler (t) denote the set of pages retrieved up to time t, then we calculate mean rank SM ART over S crawler (t). The set S crawler (t) of pages increases in size as we proceed in time. We approximate t by the number of pages crawled. The trajectory of mean rank SM ART values over time displays the dynamic behavior of a crawler.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Linkage Based Page Quality</head><p>It has been observed that content alone does not give a fair measure of the quality of the page <ref type="bibr" target="#b14">[15]</ref>. Algorithms such as HITS <ref type="bibr" target="#b14">[15]</ref> and PageRank <ref type="bibr" target="#b5">[6]</ref> use the linkage structure of the Web to rank pages. PageRank in particular estimates the global popularity of a page. The computation of PageRanks can be done through an iterative process. PageRanks are calculated once after all the crawls are completed. That is, we pool the pages crawled for all the topics by all the crawlers and then calculate the PageRanks according to the algorithm described in <ref type="bibr" target="#b12">[13]</ref>. We sort the pages crawled for a given topic, by all crawlers, first based on the number of topic keywords they contain and then sort the pages with same number of keywords by their PageRank. The process gives us a rank KW,P R for each page crawled for a topic.</p><p>Once again, our static evaluation metric measures the percentage of top n pages (ranked by rank KW,P R ) crawled by a crawler on a topic. In the dynamic metric, mean rank KW,P R is plotted over each S crawler (t) where t is the number of pages crawled.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>For each of the evaluation schemes and metrics outlined in Table <ref type="table" target="#tab_2">2</ref>, we analyzed the performance of each crawler on the two tasks. crawlers (BFS16, BFS64 and BFS128) can be extrapolated between the curves corresponding to BFS1 and BFS256.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Task 1 : Starting from Examples</head><p>The most general observation we can draw from the plots is that BFS256 achieves a significantly better performance under the static evaluation schemes, i.e., a superior coverage of the most highly relevant pages based on both quality metrics and across different numbers of top pages (cf. Figure <ref type="figure" target="#fig_1">2a,c</ref>). The difference between the coverage by crawlers for different N increases as one considers fewer highly relevant pages. These results indicate that exploration is important to locate the highly relevant pages when starting from relevant links, whereas too much exploitation is harmful.</p><p>The dynamic plots give us a richer picture. (Recall that here lowest average rank is best.) BFS256 still does significantly better than other crawlers on the lexical metric (cf. Figure <ref type="figure" target="#fig_1">2b</ref>). However, the linkage metric shows that BFS256 pays a large penalty in the early stage of the crawl (cf. Figure <ref type="figure" target="#fig_1">2d</ref>). However, the crawler regains quality over the longer run. The better coverage of highly relevant pages by this crawler (cf. Figure <ref type="figure" target="#fig_1">2c</ref>) may help us interpret the improvement observed in the second phase of the crawl. We conjecture that by exploring suboptimal links early on, BFS256 is capable of eventually discovering paths to highly relevant pages that escape more greedy strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Task 2: Starting from Neighbors</head><p>The success of a more exploratory algorithm on the first task may not come as a surprise since we start from known relevant pages. However, in  the second task we use the links obtained from the neighborhood of relevant subset as the starting points with the goal of finding more relevant pages. We take the worst (BFS1) and the best (BFS256) crawlers on Task 1, and use them for Task 2. In addition, we add a simple Breadth-First crawler that uses the limited size frontier as a FIFO queue. The Breadth-First crawler is added to observe the performance of a blind exploratory algorithm. A summary of the results is shown through plots in Figure <ref type="figure" target="#fig_3">3</ref>. As for Task 1, we find that the more exploratory algorithm, BFS256, performs significantly better than BFS1 under static evaluations for both lexical and linkage quality metrics (cf. Figure <ref type="figure" target="#fig_3">3a,c</ref>). In the dynamic plots (cf. Figure <ref type="figure" target="#fig_3">3b,d</ref>) BFS256 seems to bear an initial penalty for exploration but recovers in the long run. The Breadth-First crawler performs poorly on all evaluations. Hence, as a general result we find that exploration helps an exploitative algorithm, but exploration without guidance goes astray.</p><p>Due to the availability of relevant subsets (examples) for each of the topics in the current task, we plot the average recall of the relevant examples against number of pages crawled (Figure <ref type="figure" target="#fig_4">4</ref>). The plot illustrates the target-seeking behavior of the three crawlers if the examples are viewed as the targets. We again find BFS256 outperforming BFS1 while Breadth-First trails behind.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Related Research</head><p>Research on the design of effective focused crawlers is very vibrant. Many different types of crawling algorithms have been developed. For example, Chakrabarti et al. <ref type="bibr" target="#b7">[8]</ref>  example pages to guide their focused crawlers. Fetuccino <ref type="bibr" target="#b2">[3]</ref> and InfoSpiders <ref type="bibr" target="#b17">[18]</ref> begin their focused crawling with starting points generated from CLEVER <ref type="bibr" target="#b6">[7]</ref> or other search engines. Most crawlers follow fixed strategies, while some can adapt in the course of the crawl by learning to estimate the quality of links <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b21">22]</ref>.</p><p>The question of exploration versus exploitation in crawler strategies has been addressed in a number of papers, more or less directly. Fish-Search <ref type="bibr" target="#b10">[11]</ref> limited exploration by bounding the depth along any path that appeared suboptimal. Cho et al. <ref type="bibr" target="#b8">[9]</ref> found that exploratory crawling behaviors such as implemented in the Breadth-First algorithm lead to efficient discovery of pages with good PageRank. They also discuss the issue of limiting the memory resources (buffer size) of a crawler, which has an impact on the exploitative behavior of the crawling strategy because it forces the crawler to make frequent filtering decisions. Breadth-First crawlers also seem to find popular pages early in the crawl <ref type="bibr" target="#b19">[20]</ref>. The exploration versus exploitation issue continues to be studied via variations on the two major classes of Breadth-First and Best-First crawlers. For example, in recent research on Breadth-First focused crawling, Diligenti et al. <ref type="bibr" target="#b11">[12]</ref> address the "shortsightedness" of some crawlers when assessing the potential value of links to crawl. In particular, they look at how to avoid short-term gains at the expense of less-obvious but larger long-term gains. Their solution is to build classifiers that can assign pages to different classes based on the expected link distance between the current page and relevant documents.</p><p>The area of crawler quality evaluation has also received much attention in recent research <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b3">4]</ref>. For instance, many alternatives for assessing page importance have been explored, showing a range of sophistication. Cho et al. <ref type="bibr" target="#b8">[9]</ref> use the simple presence of a word such as "computer" to indicate relevance. Amento et al. <ref type="bibr" target="#b1">[2]</ref> compute the similarity between a page and the centroid of the seeds. In fact contentbased similarity assessments form the basis of relevance decisions in several examples of research <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b18">19]</ref>. Others exploit link information to estimate page relevance with methods based on in-degree, out-degree, PageRank, hubs and authorities <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b19">20]</ref>. For example, Cho et al. <ref type="bibr" target="#b8">[9]</ref> consider pages with PageRank score above a threshold as relevant. Najork and Wiener <ref type="bibr" target="#b19">[20]</ref> use a crawler that can fetch millions of pages per day; they then calculate the average PageRank of the pages crawled daily, under the assumption that PageRank estimates relevance. Combinations of link and content-based relevance estimators are evident in several approaches <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b17">18]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions</head><p>In this paper we used an evaluation framework for topic driven crawlers to study the role of exploitation of link estimates versus exploration of suboptimal pages. We experimented with a family of simple crawler algorithms of varying greediness, under limited memory resources for two different tasks. A number of schemes and quality metrics derived from lexical features and link analysis were introduced and applied to gauge crawler performance.</p><p>We found consistently that exploration leads to better coverage of highly relevant pages, in spite of a possible penalty during the early stage of the crawl. An obvious explanation is that exploration allows to trade off short term gains for longer term and potentially larger gains. However, we also found that a blind exploration when starting from neighbors of relevant pages, leads to poor results. Therefore, a mix of exploration and exploitation is necessary for good overall performance. When starting from relevant examples (Task 1), the better performance of crawlers with higher exploration could be attributed to their better coverage of documents close to the relevant subset. The good performance of BFS256 starting away from relevant pages, shows that its exploratory nature complements its greedy side in finding highly relevant pages. Extreme exploitation (BFS1) and blind exploration (Breadth-First), impede performance. Nevertheless, any exploitation seems to be better than none. Our results are based on short crawls of 2000 pages. The same may not hold for longer crawls; this is an issue to be addressed in future research. The dynamic evaluations do suggest that for very short crawls it is best to be greedy; this is a lesson that should be incorporated into algorithms for query time (online) crawlers such as MySpiders 3 <ref type="bibr" target="#b20">[21]</ref>.</p><p>The observation that higher exploration yields better results can motivate parallel and/or distributed implementations of topic driven crawlers, since complete orderings of the links in the frontier, as required by greedy crawler algorithms, do not seem to be necessary for good performance. Therefore crawlers based on local decisions seem to hold promise both for the performance of exploratory strategies and for the efficiency and scalability of distributed implementations. In particular, we intend to experiment with variations of crawling algorithms such as InfoSpiders <ref type="bibr" target="#b17">[18]</ref>, that allow for adaptive and distributed exploratory strategies.</p><p>Other crawler algorithms that we intend to study in future research include Best-First strategies driven by estimates other than lexical ones. For example we plan to implement a Best-N-First family using link estimates based on local versions of the rank KW,P R metric used in this paper for evaluations purposes. We also plan to test more sophisticated lexical crawlers such as InfoSpiders and Shark Search <ref type="bibr" target="#b13">[14]</ref>, which can prioritize over links from a single page. A goal of present research is to identify optimal trade-offs between exploration and exploitation, where either more exploration or more greediness would degrade performance. A large enough buffer size will have to be used so as not to constrain the range of exploration/exploitation strategies as much as happened in the experiments described here due to the small MAX BUFFER. Identifying an optimal exploration/exploitation trade-off would be the first step toward the development of an adaptive crawler that would attempt to adjust the level of greediness during the crawl.</p><p>Finally, two things that we have not done in this paper are to analyze the time complexity of the crawlers and the topic-specific performance of each strategy. Regarding the former, clearly more greedy strategies require more frequent decisions and this may have an impact on the efficiency of the crawlers. Regarding the latter, we have only considered quality measures in the aggregate (across topics). It would be useful to study how appropriate trade-offs between exploration and exploitation depend on different characteristics such as topic heterogeneity. Both of these issues are the object of ongoing research.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Pseudocode of Best-N-First crawlers.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>ForFigure 2 :</head><label>2</label><figDesc>Figure 2: Static evaluation (left) and dynamic evaluation (right) of representative crawlers on Task 1. The plots correspond to lexical (top) and linkage (bottom) quality metric. Error bars correspond to ±1 standard error across the 100 topics in this and the following plots.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Static evaluation (left) and dynamic evaluation (right) of representative crawlers on Task 2. The plots correspond to lexical (top) and linkage (bottom) quality metric.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Average recall of examples when the crawls start from the neighbors.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>3</head><label></label><figDesc>http://myspiders.biz.uiowa.edu</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>A sample topic. The description is truncated for space limitations.</figDesc><table><row><cell cols="2">Topic Keywords Topic Description</cell><cell>Examples</cell></row><row><cell>Recreation Hot Air Ballooning Organizations</cell><cell>Aerostat Society of Australia Varied collec-tion of photos and facts about ballooning in Australia, Airships, Parachutes, Balloon Building and more. Includes an article on</cell></row><row><cell></cell><cell>the Theory of Flight. Albuquerque Aerostat</cell></row><row><cell></cell><cell>Ascension Association A comprehensive site</cell></row><row><cell></cell><cell>covering a range of ballooning topics includ-</cell></row><row><cell></cell><cell>ing the Albuqeurque Balloon Fiesta, local ed-</cell></row><row><cell></cell><cell>ucation and safety programs, flying events,</cell></row><row><cell></cell><cell>club activities and committees, and club his-</cell></row><row><cell></cell><cell>tory. Arizona Hot Air Balloon Club [...]</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Evaluation Schemes and Measures. The static scheme is based on coverage of top pages (ranked by quality metric among all crawled pages, S). S crawler is the set of pages visited by a crawler. The dynamic scheme is based on the ranks (by quality metric among all crawled pages, S) averaged over the crawl sets at time t, S crawler (t). |S crawler ∩ top rank SM ART (S)| p∈S crawler (t) rank SM ART (p)/|S crawler (t)| Linkage |S crawler ∩ top rank KW,P R (S)| p∈S crawler (t) rank KW,P R (p)/|S crawler (t)|</figDesc><table><row><cell>Static Scheme</cell><cell>Dynamic Scheme</cell></row><row><cell>Lexical</cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://dmoz.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2">http://www.ozemail.com.au/~p0gwil http://www.hotairballooning.org/ http://www.ballooningaz.com http://www.aristotle.net/~mikev/ http://www89.pair.com/techinfo/ABAC/abac.htm http://www.prairienet.org/bagi http://www.atu.com.au/~balloon/club1.html http://communities.msn.com/BalloonCrews http://www.bfa.net http://www.ask.ne.jp/~kanako/ebst.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_3">http://google.yahoo.com</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Intelligent crawling on the World Wide Web with arbitrary predicates</title>
		<author>
			<persName><forename type="first">C</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Al-Garawi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 10th Intl. World Wide Web Conference</title>
				<meeting>10th Intl. World Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="96" to="105" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Does &quot;authority&quot; mean quality? Predicting expert quality ratings of web documents</title>
		<author>
			<persName><forename type="first">B</forename><surname>Amento</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Terveen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 23rd ACM SIGIR Conf. on Research and Development in Information Retrieval</title>
				<meeting>23rd ACM SIGIR Conf. on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="296" to="303" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Adding support for dynamic and focused search with Fetuccino</title>
		<author>
			<persName><forename type="first">I</forename><surname>Ben-Shaul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Herscovici</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jacovi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Maarek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pelleg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shtalhaim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Soroka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Networks</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">11-16</biblScope>
			<biblScope unit="page" from="1653" to="1665" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Improved algorithms for topic distillation in hyperlinked environments</title>
		<author>
			<persName><forename type="first">K</forename><surname>Bharat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Henzinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 21st ACM SIGIR Conf. on Research and Development in Information Retrieval</title>
				<meeting>21st ACM SIGIR Conf. on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="104" to="111" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">How dynamic is the Web?</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">E</forename><surname>Brewington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cybenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 9th International World-Wide Web Conference</title>
				<meeting>9th International World-Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The anatomy of a large-scale hypertextual Web search engine</title>
		<author>
			<persName><forename type="first">S</forename><surname>Brin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Page</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Networks</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">1-7</biblScope>
			<biblScope unit="page" from="107" to="117" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Automatic resource compilation by analyzing hyperlink structure and associated text</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rajagopalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gibson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kleinberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Networks</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">1-7</biblScope>
			<biblScope unit="page" from="65" to="74" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Focused crawling: A new approach to topic-specific Web resource discovery</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Den</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Berg</surname></persName>
		</author>
		<author>
			<persName><surname>Dom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Networks</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">11-16</biblScope>
			<biblScope unit="page" from="1623" to="1640" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Efficient crawling through url ordering</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Page</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 7th Intl. World Wide Web Conference</title>
				<meeting>7th Intl. World Wide Web Conference<address><addrLine>Brisbane, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Sizing the internet</title>
		<author>
			<persName><surname>Cyveillance</surname></persName>
		</author>
		<ptr target="http://www.cyveillance.com/web/us/corporate/whitepapers.htm" />
		<imprint>
			<date type="published" when="2000-07">July 2000</date>
		</imprint>
	</monogr>
	<note>White paper</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Information retrieval in the World Wide Web: Making clientbased searching feasible</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">De</forename><surname>Bra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Post</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 1st Intl. World Wide Web Conference</title>
				<meeting>1st Intl. World Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Focused crawling using context graphs</title>
		<author>
			<persName><forename type="first">M</forename><surname>Diligenti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Coetzee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gori</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 26th International Conference on Very Large Databases (VLDB 2000)</title>
				<meeting>26th International Conference on Very Large Databases (VLDB 2000)<address><addrLine>Cairo, Egypt</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="527" to="534" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Efficient computation of pagerank</title>
		<author>
			<persName><forename type="first">T</forename><surname>Haveliwala</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
		<respStmt>
			<orgName>Stanford Database Group</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The shark-search algorithm -An application: Tailored Web site mapping</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hersovici</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jacovi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">S</forename><surname>Maarek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pelleg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shtalhaim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 7th Intl. World-Wide Web Conference</title>
				<meeting>7th Intl. World-Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Authoritative sources in a hyperlinked environment</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kleinberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the ACM</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="604" to="632" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Accessibility of information on the Web</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">400</biblScope>
			<biblScope unit="page" from="107" to="109" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Links tell us about lexical and semantic web content</title>
		<author>
			<persName><forename type="first">F</forename><surname>Menczer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:cs.IR/0108004</idno>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Adaptive retrieval agents: Internalizing local context and scaling up to the Web</title>
		<author>
			<persName><forename type="first">F</forename><surname>Menczer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Belew</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">2-3</biblScope>
			<biblScope unit="page" from="203" to="242" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Evaluating topic-driven Web crawlers</title>
		<author>
			<persName><forename type="first">F</forename><surname>Menczer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Srinivasan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval</title>
				<meeting>24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Breadth-first search crawling yields high-quality pages</title>
		<author>
			<persName><forename type="first">M</forename><surname>Najork</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Wiener</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 10th International World Wide Web Conference</title>
				<meeting>10th International World Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Myspiders: Evolve your own intelligent web crawlers</title>
		<author>
			<persName><forename type="first">G</forename><surname>Pant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Menczer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Autonomous Agents and Multi-Agent Systems</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="221" to="229" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Using reinforcement learning to spider the Web efficiently</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rennie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Mccallum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 16th International Conf. on Machine Learning</title>
				<meeting>16th International Conf. on Machine Learning<address><addrLine>San Francisco, CA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="335" to="343" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">The SMART Retrieval System -Experiments in Automatic Document Processing</title>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1971">1971</date>
			<publisher>Prentice-Hall</publisher>
			<pubPlace>Englewood Cliffs, NJ</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
