<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploration versus Exploitation in Topic Driven Crawlers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Padmini Srinivasan@!</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Menczer@</string-name>
          <email>filippo-menczerg@uiowa.edu</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>1994</year>
      </pub-date>
      <abstract>
        <p>ered by search engines has not improved much over the past few years [16]. Even with increasThe dynamic nature of the Web highlights the ing hardware and bandwidth resources at their scalability limitations of universal search en- disposal, search engines cannot keep up with the gines. Topic driven crawlers can address the growth of the Web and with its rate of change problem by distributing the crawling process [5]. across users, queries, or even client computers. These scalability limitations stem from search The context available to a topic driven crawler engines' attempt to crawl the whole Web, and to allows for informed decisions about how to prior- answer any query from any user. Decentralizing itize the links to be visited. Here we focus on the the crawling process is a more scalable approach, balance between a crawler's need to exploit this and bears the additional bene t that crawlers information to focus on the most promising links, can be driven by a rich context (topics, queries, and the need to explore links that appear sub- user pro les) within which to interpret pages and optimal but might lead to more relevant pages. select the links to be visited. It comes as no We investigate the issue for two di erent tasks: surprise, therefore, that the development of topic (i) seeking new relevant pages starting from a driven crawler algorithms has received signi cant known relevant subset, and (ii) seeking relevant attention in recent years [9, 14, 8, 18, 1, 19]. pages starting a few links away from the relevant Topic driven crawlers (also known as focused subset. Using a framework and a number of qual- crawlers) respond to the particular information ity metrics developed to evaluate topic driven needs expressed by topical queries or interest crawling algorithms in a fair way, we nd that a pro les. These could be the needs of an indimix of exploitation and exploration is essential vidual user (query time or online crawlers) or for both tasks, in spite of a penalty in the early those of a community with shared interests (topstage of the crawl. ical search engines and portals). Evaluation of topic driven crawlers is di 1 Introduction cult due to the lack of known relevant sets for Web searches, to the presence of many con ictA recent projection estimates the size of the vis- ing page quality measures, and to the need for ible Web today (March 2002) to be around 7 fair gauging of crawlers' time and space algorithbillion \static" pages [10]. The largest search mic complexity. In recent research we presented engine, Google, claims to be \searching" about an evaluation framework designed to support the 2 billion pages. The fraction of the Web cov- comparison of topic driven crawler algorithms</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>under speci ed resource constraints [19]. In this Balancing the exploitation of quality
estipaper we further this line of research by inves- mate information with exploration of
suboptitigating the relative merits of exploration versus mal pages is thus crucial for the performance of
exploitation as a de ning characteristic of the topic driven crawlers. It is a question that we
crawling mechanism. study empirically with respect to two di erent</p>
      <p>The issue of exploitation versus exploration is tasks. In the rst, we seek relevant pages
starta universal one in machine learning and arti cial ing from a set of relevant links. Applications of
intelligence, since it presents itself in any task such a task are query-time search agents that
where search is guided by quality estimations. use results of a search engine as starting points
Under some regularity assumption, one can as- to provide a user with recent and personalized
sume that a measure of quality at one point in results. Since we start from relevant links, we
the search space provides some information on may expect an exploratory crawler to perform
the quality of nearby points. A greedy algo- reasonably well. The second task involves
seekrithm can then exploit this information by con- ing relevant pages while starting the crawl from
centrating the search in the vicinity of the most links that are a few links away from a relevant
promising points. However, this strategy can subset. Such a task may be a part of Web
minlead to missing other equally good or even bet- ing or competitive intelligence applications (e.g.,
ter points, for two reasons: rst, the estimates a search starting from competitors' home pages).
may be noisy; and second, the search space may If we do not start from a known relevant subset,
have local optima that trap the algorithm and the appropriate balance of exploration vs.
exkeep it from locating global optima. In other ploitation becomes an empirical question.
words, it may be necessary to visit some \bad"
points in order to arrive at the best ones. At
the other extreme, algorithms that completely 2 Evaluation Framework
disregard quality estimates and continue to
explore in a uniform or random fashion do not risk 2.1 Topics, Examples and Neighbors
getting stuck at local optima, but they do not In order to evaluate crawler algorithms, we
use the available information to bias the search need topics, some corresponding relevant
examand thus may spend most of their time exploring ples, and neighbors. The neighbors are URLs
suboptimal areas. A balance between exploita- extracted from neighborhood of the examples.
tion and exploration of clues is obviously called We obtain our topics from the Open
Direcfor in heuristic search algorithms, but the op- tory (DMOZ). We ran randomized Breadth-First
timal compromise point is unknown unless the crawls starting from each of the main
catetopology of the search space is well understood gories on the DMOZ site.1 The crawlers identify
| which is typically not the case. DMOZ \leaves," i.e., pages that have no children</p>
      <p>Topic driven crawlers t into this picture very category nodes. Leaves with ve or more
exterwell if one views the Web as the search space, nal links are then used to derive topics. We thus
with pages as points and neighborhoods as de- collected 100 topics.</p>
      <p>ned by hyperlinks. A crawler must decide which A topic is represented by three types of
inforpages to visit based on the cues provided by links mation derived from the corresponding leaf page.
from nearby pages. If one assumes that a rele- First, the words in the DMOZ hierarchy form the
vant page has a higher probability to be near topic's keywords. Second, up to 10 external links
other relevant pages than to any random page, form the topic's examples. Third, we
concatethen quality estimate of pages provide cues that nate the text descriptions and anchor text of the
can be exploited to bias the search process. How- target URLs (written by DMOZ human editors)
ever, given the short range of relevance clues on to form a topic description. The di erence
bethe Web [17], a very relevant page might be only
a few links behind an apparently irrelevant one. 1http://dmoz.org
tween topic keywords and topic descriptions is guide the selection of frontier URLs that are to
that we give the former to the crawlers, as mod- be fetched at each iteration. For a given topic,
els of (short) query-like topics, while we use the a crawler is allowed to crawl up to MAX PAGES =
latter, which are much more detailed representa- 2000 pages. However, a crawl may end sooner if
tions of the topics, to gauge the relevance of the the crawler's frontier should become empty. We
crawled pages in our post-hoc analysis. Table 1 use a timeout of 10 seconds for Web downloads.
shows a sample topic. Large pages are chopped so that we retrieve only</p>
      <p>The neighbors are obtained for each topic the rst 100 KB. The only protocol allowed is
through the following process. For each of the HTTP (with redirection allowed), and we also
examples, we obtain the top 20 inlinks as re- lter out all but \static pages" with text/html
turned by Google.2 Next, we get the top 20 content. Stale links yielding HTTP error codes
inlinks for each of the inlinks obtained earlier. are removed as they are found (only good links
Hence, if we had 10 examples to start with, we are used in the analysis).
may now have a maximum of 4000 unique URLs. We constrain the space resources a crawler
alA subset of 10 URLs is then picked at random gorithm can use by restricting the frontier size to
from this set. The links in such a subset are MAX BUFFER = 256 URLs. If the bu er becomes
called the neighbors. full then the crawler must decide which links are
to be replaced as new links are added.
2.2</p>
      <p>Architecture
We use the a previously proposed evaluation 3 Crawling Algorithms
framework to compare di erent crawlers [19].</p>
      <p>The framework allows one to easily plug in mod- In this paper we study the notion of
exploules implementing arbitrary crawling algorithms, ration versus exploitation. We begin with a
which share data structures and utilities to op- single family of crawler algorithms with a
sintimize e ciency without a ecting the fairness of gle greediness parameter to control the
explothe evaluation. ration/exploitation behavior. In our previous
experiments [19] we found that a naive Best-First</p>
      <p>
        As mentioned before, we use the crawlers crawler displayed the best performance among
for two di erent tasks. For the rst task, the three crawlers considered. Hence, in this study
crawlers start from the examples while for the we explore variants of the Best-First crawler.
second the starting points are the neighbors. In More generally, we examine the Best-N-First
either case, as the pages are fetched their com- family of crawlers where the parameter N
conponent URLs are added to a list that we call the trols the characteristic of interest.
frontier. A crawler may use topic's keywords to Best-First crawlers have been studied before
2http://google.yahoo.com [
        <xref ref-type="bibr" rid="ref3 ref8">9, 14</xref>
        ]. The basic idea is that given a frontier of
      </p>
      <p>
        Evaluation Methods
}
}
while (#frontier &gt; 0 and visited &lt; MAX_PAGES) { Table 2 depicts our overall methodology for
links_to_crawl := dequeue_top_links(frontier, N); crawler evaluation. The two rows of Table 2
inforeach link (randomize(links_to_crawl)) {
doc := fetch_new_document(link); dicate two di erent methods for gauging page
score := sim(topic, doc); quality. The rst is a purely lexical approach
foreiafch(#foruotnltiinekr(e&gt;x=trMaAcXt__BlUiFnFkEsR()do{c)) { wherein similarity to the topic description is used
dequeue_link_with_min_score(frontier); to assess relevance. The second method is
prie}nqueue(frontier, outlink, score); marily linkage based and is an approximation of
} the retrieval/ranking method used by Google [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ];
} } it uses PageRank to discriminate between pages
containing the same number of topic keywords.
      </p>
    </sec>
    <sec id="sec-2">
      <title>The columns of the table show that our mea</title>
      <p>sures are used both from a static and a dynamic
links, the best link according to some estimation perspective. The static approach examines crawl
criterion is selected for crawling. quality assessed from the full set of (up to 2000)</p>
      <p>Best-N-First is a generalization in that at each pages crawled for each query. In contrast the
iteration a batch of top N links to crawl are se- dynamic measures provide a temporal
characlected. After completing the crawl of N pages terization of the crawl strategy, by considering
the crawler decides on the next batch of N and the pages fetched while the crawl is in progress.
so on. As mentioned above, the topic's keywords More speci cally, the static approach measures
are used to guide the crawl. More speci cally coverage, i.e., the ability to retrieve \good" pages
this is done in the link selection process by com- where the quality of a page is assessed in two
puting the lexical similarity between a topic's di erent ways (corresponding to the rows of the
keywords and the source page for the link. Thus table). Our static plots show the ability of each
the similarity between a page p and the topic crawler to retrieve more or fewer highly relevant
is used to estimate the relevance of the pages pages. This is analogous to plotting recall as a
linked from p. The N URLs with the best es- function of generality.
timates are then selected for crawling. Cosine The dynamic approach examines the quality of
similarity is used by the crawlers and the links retrieval as the crawl progresses. Dynamic plots
with minimum similarity score are removed from o er a trajectory over time that displays the
dythe frontier if necessary in order to not exceed namic behavior of the crawl. The measures are
the MAX BUFFER size. Figure 1 o ers a simpli ed built on average (quality-based) ranks and are
pseudocode of a Best-N-First crawler. generally inversely related to precision. As the</p>
      <p>Best-N-First o ers an ideal context for our average rank decreases, an increasing proportion
study. The parameter N controls the greedy be- of the crawled set can be expected to be relevant.
havior of the crawler. Increasing N results in
crawlers with greater emphasis on exploration It should be noted that scores and ranks used
and consequently a reduced emphasis on ex- in each dynamic measure are computed
omniploitation. Decreasing N reverses this; selecting sciently, i.e., all calculations for each point in
a smaller set of links is more exploitative of the time for a crawler are done using data generated
evidence available regarding the potential mer- from the full crawl. For instance, all
PageRits of the links. In our experiments we test ve ank scores are calculated using the full set of
mutants of the crawler by setting N to 1, 16, 64, retrieved pages. This strategy is quite
reason128 and 256. We refer to them as BFSN where able given that we want to use the best possible
N is one of the above values. evidence when judging page quality.
P
Pp2Scrawler(t) rankSMART (p)=jScrawler(t)j
p2Scrawler(t) rankKW;P R(p)=jScrawler(t)j
4.1 Lexical Based Page Quality rank pages. PageRank in particular estimates
the global popularity of a page. The
computaWe use the SMART system [23] to rank the re- tion of PageRanks can be done through an
ittrieved pages by their lexical similarity to the erative process. PageRanks are calculated once
topic. The SMART system allows us to pool after all the crawls are completed. That is, we
all the pages crawled by all the crawlers for a pool the pages crawled for all the topics by all
topic and then rank these against the topic de- the crawlers and then calculate the PageRanks
scription. The system utilizes term weighting according to the algorithm described in [13]. We
strategies involving term frequency and inverse sort the pages crawled for a given topic, by all
document frequency computed from the pooled crawlers, rst based on the number of topic
keypages for a topic. SMART computes the simi- words they contain and then sort the pages with
larity between the query and the topic as a dot same number of keywords by their PageRank.
product of the topic and page vectors. It out- The process gives us a rankKW;P R for each page
puts a ranked set of pages based on their topic crawled for a topic.
similarity scores. That is, for each page we get Once again, our static evaluation metric
meaa rank which we refer to as rankSMART (cf. Ta- sures the percentage of top n pages (ranked by
ble 2). Thus given a topic, the percentage of top rankKW;P R) crawled by a crawler on a topic. In
n pages ranked by SMART (where n varies) that the dynamic metric, mean rankKW;P R is plotted
are retrieved by each crawler may be calculated, over each Scrawler(t) where t is the number of
yielding the static evaluation metric. pages crawled.</p>
      <p>For the dynamic view we use the rankSMART
values for pages to calculate mean rankSMART
at di erent points of the crawl. If we let 5 Results
Scrawler(t) denote the set of pages retrieved up to
time t, then we calculate mean rankSMART over For each of the evaluation schemes and metrics
iSncrsaiwzelera(st)w. eTphreocseetedScirnawtilmer(et.) Wofepaapgpesroixnicmreaatseest oofutelainchedcirnawTlaebrleon2,twhee atwnaolytazeskdst.he performance
by the number of pages crawled. The trajectory
of mean rankSMART values over time displays 5.1 Task 1 : Starting from Examples
the dynamic behavior of a crawler.</p>
    </sec>
    <sec id="sec-3">
      <title>For the rst task the crawlers start from a rele</title>
      <p>
        vant subset of links, the examples, and use the
4.2 Linkage Based Page Quality hyperlinks to navigate and discover more
relevant pages. The results for the task are
sumIt has been observed that content alone does not marized by the plots in Figure 2. For
readgive a fair measure of the quality of the page ability, we are only plotting the performance of
[
        <xref ref-type="bibr" rid="ref4">15</xref>
        ]. Algorithms such as HITS [
        <xref ref-type="bibr" rid="ref4">15</xref>
        ] and PageR- a selected subset of the Best-N-First crawlers
ank [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ] use the linkage structure of the Web to (N = 1; 256). The behavior of the remaining
led60
w
a
r
c%55
50
45
40
35 0
65
60
55
d
e
l
raw50
c
%
45
40
      </p>
      <sec id="sec-3-1">
        <title>BFBSF2S561</title>
        <p>BFBSF2S561
1000
pages crawled
d) Dynamic Linkage Performance
300
number of top pages
400
500
500
1000
pages crawled</p>
      </sec>
      <sec id="sec-3-2">
        <title>BFBSF2S561</title>
        <p>1500
2000
crawlers (BFS16, BFS64 and BFS128) can be BFS256 still does signi cantly better than other
extrapolated between the curves corresponding crawlers on the lexical metric (cf. Figure 2b).
to BFS1 and BFS256. However, the linkage metric shows that BFS256</p>
        <p>The most general observation we can draw pays a large penalty in the early stage of the
from the plots is that BFS256 achieves a sig- crawl (cf. Figure 2d). However, the crawler
reni cantly better performance under the static gains quality over the longer run. The better
evaluation schemes, i.e., a superior coverage of coverage of highly relevant pages by this crawler
the most highly relevant pages based on both (cf. Figure 2c) may help us interpret the
imquality metrics and across di erent numbers of provement observed in the second phase of the
top pages (cf. Figure 2a,c). The di erence be- crawl. We conjecture that by exploring
suboptween the coverage by crawlers for di erent N timal links early on, BFS256 is capable of
evenincreases as one considers fewer highly relevant tually discovering paths to highly relevant pages
pages. These results indicate that exploration that escape more greedy strategies.
is important to locate the highly relevant pages
when starting from relevant links, whereas too 5.2 Task 2: Starting from Neighbors
much exploitation is harmful.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The success of a more exploratory algorithm on The dynamic plots give us a richer picture. the rst task may not come as a surprise since (Recall that here lowest average rank is best.) we start from known relevant pages. However, in 6</title>
      <p>30
d
e
l
raw25
c
%
20
15</p>
      <p>BFS1
BreaBdFthSF2i5rs6t</p>
      <p>BFS1
BreaBdFthSF2i5rs6t
3500 0
4500
1000
pages crawled
d) Dynamic Linkage Performance
100
400
500
500
1500
2000
100
200</p>
      <p>300
number of top pages
400
500
500
1500</p>
      <p>2000
1000
pages crawled
the second task we use the links obtained from general result we nd that exploration helps an
the neighborhood of relevant subset as the start- exploitative algorithm, but exploration without
ing points with the goal of nding more relevant guidance goes astray.
pages. We take the worst (BFS1) and the best Due to the availability of relevant subsets
(ex(BFS256) crawlers on Task 1, and use them for amples) for each of the topics in the current task,
Task 2. In addition, we add a simple Breadth- we plot the average recall of the relevant
examFirst crawler that uses the limited size frontier ples against number of pages crawled (Figure 4).
as a FIFO queue. The Breadth-First crawler is The plot illustrates the target-seeking behavior
added to observe the performance of a blind ex- of the three crawlers if the examples are viewed
ploratory algorithm. A summary of the results as the targets. We again nd BFS256
outperis shown through plots in Figure 3. As for Task forming BFS1 while Breadth-First trails behind.
1, we nd that the more exploratory algorithm,
BFS256, performs signi cantly better than BFS1
under static evaluations for both lexical and link- 6 Related Research
age quality metrics (cf. Figure 3a,c). In the
dynamic plots (cf. Figure 3b,d) BFS256 seems to Research on the design of e ective focused
bear an initial penalty for exploration but recov- crawlers is very vibrant. Many di erent types
ers in the long run. The Breadth-First crawler of crawling algorithms have been developed. For
performs poorly on all evaluations. Hence, as a example, Chakrabarti et al. [8] use classi ers
built from training sets of positive and negative</p>
      <p>BFS1</p>
      <p>
        BFS256
BreadthFirst
0.45 long-term gains. Their solution is to build
clas0.4 si ers that can assign pages to di erent classes
0.35 based on the expected link distance between the
0.3 current page and relevant documents.
lca 0.25 The area of crawler quality evaluation has
rgee also received much attention in recent research
rveaa 0.2 [
        <xref ref-type="bibr" rid="ref2 ref8">2, 9, 8, 19, 4</xref>
        ]. For instance, many
alterna0.15 tives for assessing page importance have been
0.1 explored, showing a range of sophistication. Cho
0.05 BFS1 et al. [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ] use the simple presence of a word such
0 0 500 page1s0c0r0awled 1500BreaBdFthSF2i5rs6t 2000 aals. \c[2o]mcpoumtepru"tteotihnedisciamteilarreilteyvabnectew. eAenmeantpoageet
and the centroid of the seeds. In fact
contentFigure 4: Average recall of examples when the based similarity assessments form the basis of
crawls start from the neighbors. relevance decisions in several examples of
research [8, 19]. Others exploit link information to
estimate page relevance with methods based on
example pages to guide their focused crawlers. in-degree, out-degree, PageRank, hubs and
auFetuccino [
        <xref ref-type="bibr" rid="ref1">3</xref>
        ] and InfoSpiders [18] begin their thorities [
        <xref ref-type="bibr" rid="ref1 ref10 ref2 ref8">2, 3, 4, 8, 9, 20</xref>
        ]. For example, Cho et
focused crawling with starting points generated al. [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ] consider pages with PageRank score above
from CLEVER [7] or other search engines. Most a threshold as relevant. Najork and Wiener [
        <xref ref-type="bibr" rid="ref10">20</xref>
        ]
crawlers follow xed strategies, while some can use a crawler that can fetch millions of pages per
adapt in the course of the crawl by learning to day; they then calculate the average PageRank
estimate the quality of links [18, 1, 22]. of the pages crawled daily, under the assumption
      </p>
      <p>
        The question of exploration versus exploita- that PageRank estimates relevance.
Combination in crawler strategies has been addressed tions of link and content-based relevance
estimain a number of papers, more or less directly. tors are evident in several approaches [
        <xref ref-type="bibr" rid="ref2">4, 7, 18</xref>
        ].
Fish-Search [11] limited exploration by
bounding the depth along any path that appeared
suboptimal. Cho et al. [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ] found that exploratory 7 Conclusions
crawling behaviors such as implemented in the
Breadth-First algorithm lead to e cient discov- In this paper we used an evaluation framework
ery of pages with good PageRank. They also dis- for topic driven crawlers to study the role of
excuss the issue of limiting the memory resources ploitation of link estimates versus exploration of
(bu er size) of a crawler, which has an impact on suboptimal pages. We experimented with a
famthe exploitative behavior of the crawling strategy ily of simple crawler algorithms of varying
greedbecause it forces the crawler to make frequent iness, under limited memory resources for two
ltering decisions. Breadth-First crawlers also di erent tasks. A number of schemes and
qualseem to nd popular pages early in the crawl [
        <xref ref-type="bibr" rid="ref10">20</xref>
        ]. ity metrics derived from lexical features and link
The exploration versus exploitation issue contin- analysis were introduced and applied to gauge
ues to be studied via variations on the two major crawler performance.
classes of Breadth-First and Best-First crawlers. We found consistently that exploration leads
For example, in recent research on Breadth-First to better coverage of highly relevant pages, in
focused crawling, Diligenti et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] address spite of a possible penalty during the early stage
the \shortsightedness" of some crawlers when as- of the crawl. An obvious explanation is that
exsessing the potential value of links to crawl. In ploration allows to trade o short term gains for
particular, they look at how to avoid short-term longer term and potentially larger gains.
Howgains at the expense of less-obvious but larger ever, we also found that a blind exploration
when starting from neighbors of relevant pages, A goal of present research is to identify
opleads to poor results. Therefore, a mix of ex- timal trade-o s between exploration and
exploration and exploitation is necessary for good ploitation, where either more exploration or
overall performance. When starting from rele- more greediness would degrade performance. A
vant examples (Task 1), the better performance large enough bu er size will have to be used
of crawlers with higher exploration could be at- so as not to constrain the range of
explotributed to their better coverage of documents ration/exploitation strategies as much as
hapclose to the relevant subset. The good perfor- pened in the experiments described here due to
mance of BFS256 starting away from relevant the small MAX BUFFER. Identifying an optimal
pages, shows that its exploratory nature com- exploration/exploitation trade-o would be the
plements its greedy side in nding highly rele- rst step toward the development of an adaptive
vant pages. Extreme exploitation (BFS1) and crawler that would attempt to adjust the level of
blind exploration (Breadth-First), impede per- greediness during the crawl.
formance. Nevertheless, any exploitation seems Finally, two things that we have not done in
to be better than none. Our results are based this paper are to analyze the time complexity of
on short crawls of 2000 pages. The same may the crawlers and the topic-speci c performance
not hold for longer crawls; this is an issue to be of each strategy. Regarding the former, clearly
addressed in future research. The dynamic eval- more greedy strategies require more frequent
deuations do suggest that for very short crawls it cisions and this may have an impact on the
efis best to be greedy; this is a lesson that should ciency of the crawlers. Regarding the latter,
be incorporated into algorithms for query time we have only considered quality measures in the
(online) crawlers such as MySpiders3 [21]. aggregate (across topics). It would be useful to
      </p>
      <p>The observation that higher exploration yields study how appropriate trade-o s between
explobetter results can motivate parallel and/or ration and exploitation depend on di erent
chardistributed implementations of topic driven acteristics such as topic heterogeneity. Both of
crawlers, since complete orderings of the links these issues are the object of ongoing research.
in the frontier, as required by greedy crawler
algorithms, do not seem to be necessary for good
performance. Therefore crawlers based on local References
decisions seem to hold promise both for the per- [1] C. Aggarwal, F. Al-Garawi, and P. Yu.
Informance of exploratory strategies and for the ef- telligent crawling on the World Wide Web
ciency and scalability of distributed implemen- with arbitrary predicates. In Proc. 10th Intl.
tations. In particular, we intend to experiment World Wide Web Conference, pages 96{105,
with variations of crawling algorithms such as 2001.</p>
      <p>InfoSpiders [18], that allow for adaptive and
distributed exploratory strategies. [2] B. Amento, L. Terveen, and W. Hill. Does</p>
      <p>
        Other crawler algorithms that we intend "authority" mean quality? Predicting
exto study in future research include Best-First pert quality ratings of web documents. In
strategies driven by estimates other than lexi- Proc. 23rd ACM SIGIR Conf. on Research
cal ones. For example we plan to implement a and Development in Information Retrieval,
Best-N-First family using link estimates based pages 296{303, 2000.
on local versions of the rankKW;P R metric used
in this paper for evaluations purposes. We also
plan to test more sophisticated lexical crawlers
such as InfoSpiders and Shark Search [
        <xref ref-type="bibr" rid="ref3">14</xref>
        ], which
can prioritize over links from a single page.
[7] S. Chakrabarti, B. Dom, P. Raghavan,
      </p>
      <p>S. Rajagopalan, D. Gibson, and J.
Kleinberg. Automatic resource compilation by [17] F. Menczer. Links tell us about lexical and
analyzing hyperlink structure and associ- semantic web content. arXiv:cs.IR/0108004.
ated text. Computer Networks, 30(1{7):65{ [18] F. Menczer and R. Belew. Adaptive
re74, 1998. trieval agents: Internalizing local context
and scaling up to the Web. Machine
Learning, 39(2{3):203{242, 2000.
[8] S. Chakrabarti, M. van den Berg, and</p>
      <p>B. Dom. Focused crawling: A new approach
to topic-speci c Web resource discovery. [19] F. Menczer, G. Pant, M. Ruiz, and
Computer Networks, 31(11{16):1623{1640, P. Srinivasan. Evaluating topic-driven Web
1999. crawlers. In Proc. 24th Annual Intl. ACM
SIGIR Conf. on Research and Development
in Information Retrieval, 2001.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ben-Shaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herscovici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jacovi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pelleg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shtalhaim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Soroka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ur</surname>
          </string-name>
          .
          <article-title>Adding support for dynamic and focused search with Fetuccino</article-title>
          .
          <source>Computer Networks</source>
          ,
          <volume>31</volume>
          (
          <issue>11</issue>
          {16):
          <volume>1653</volume>
          {
          <fpage>1665</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bharat</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Henzinger</surname>
          </string-name>
          . Improved al- [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Haveliwala</surname>
          </string-name>
          .
          <article-title>E cient computation of gorithms for topic distillation in hyperlinked pagerank</article-title>
          .
          <source>Technical report</source>
          , Stanford environments.
          <source>In Proc. 21st ACM SIGIR Database Group</source>
          ,
          <year>1999</year>
          . Conf.
          <source>on Research and Development in Information Retrieval</source>
          , pages
          <volume>104</volume>
          {
          <fpage>111</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hersovici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jacovi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pelleg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shtalhaim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ur</surname>
          </string-name>
          .
          <article-title>The shark-search algorithm | An application: Tailored Web site mapping</article-title>
          .
          <source>In Proc. 7th Intl. World-Wide Web Conference</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Authoritative sources in a hyperlinked environment</article-title>
          .
          <source>Journal of the ACM</source>
          ,
          <volume>46</volume>
          (
          <issue>5</issue>
          ):
          <volume>604</volume>
          {
          <fpage>632</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Giles</surname>
          </string-name>
          .
          <source>Accessibility of information on the Web. Nature</source>
          ,
          <volume>400</volume>
          :
          <fpage>107</fpage>
          {
          <fpage>109</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Brewington</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Cybenko</surname>
          </string-name>
          .
          <article-title>How dynamic is the Web?</article-title>
          <source>In Proc. 9th International World-Wide Web Conference</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          and
          <string-name>
            <surname>L. Page.</surname>
          </string-name>
          <article-title>The anatomy of a large-scale hypertextual Web search engine</article-title>
          .
          <source>Computer Networks</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          {7):
          <volume>107</volume>
          {
          <fpage>117</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Page</surname>
          </string-name>
          .
          <article-title>Efcient crawling through url ordering</article-title>
          .
          <source>In Proc. 7th Intl. World Wide Web Conference</source>
          , Brisbane, Australia,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Cyveillance</surname>
          </string-name>
          .
          <article-title>Sizing the net</article-title>
          .
          <source>White paper</source>
          , July http://www.cyveillance.com/web/us/ corporate/white papers.
          <source>htm.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Wiener</surname>
          </string-name>
          .
          <article-title>Breadth- rst search crawling yields high-quality pages</article-title>
          .
          <source>In Proc. 10th International World Wide Web Conference</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          inter2000. [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pant</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Menczer</surname>
          </string-name>
          . Myspiders:
          <article-title>Evolve your own intelligent web crawlers</article-title>
          .
          <source>Autonomous Agents and Multi-Agent Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <volume>221</volume>
          {
          <fpage>229</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Diligenti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Coetzee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          . Focused crawl- [23]
          <string-name>
            <surname>G. Salton.</surname>
          </string-name>
          <article-title>The SMART Retrieval System { ing using context graphs</article-title>
          .
          <source>In Proc. 26th Experiments in Automatic Document ProInternational Conference on Very Large cessing. Prentice-Hall</source>
          , Englewood Cli s,
          <source>Databases (VLDB</source>
          <year>2000</year>
          ), pages
          <fpage>527</fpage>
          {
          <fpage>534</fpage>
          , NJ,
          <year>1971</year>
          . Cairo, Egypt,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>