Exploration versus Exploitation in Topic Driven Crawlers

Gautam Pant*, Padmini Srinivasan*†, Filippo Menczer*
*Department of Management Sciences, †School of Library and Information Science
The University of Iowa, Iowa City, IA 52242
{gautam-pant,padmini-srinivasan,filippo-menczer}@uiowa.edu

Abstract

The dynamic nature of the Web highlights the scalability limitations of universal search engines. Topic driven crawlers can address the problem by distributing the crawling process across users, queries, or even client computers. The context available to a topic driven crawler allows for informed decisions about how to prioritize the links to be visited. Here we focus on the balance between a crawler's need to exploit this information to focus on the most promising links, and the need to explore links that appear suboptimal but might lead to more relevant pages. We investigate the issue for two different tasks: (i) seeking new relevant pages starting from a known relevant subset, and (ii) seeking relevant pages starting a few links away from the relevant subset. Using a framework and a number of quality metrics developed to evaluate topic driven crawling algorithms in a fair way, we find that a mix of exploitation and exploration is essential for both tasks, in spite of a penalty in the early stage of the crawl.

1 Introduction

A recent projection estimates the size of the visible Web today (March 2002) to be around 7 billion "static" pages [10]. The largest search engine, Google, claims to be "searching" about 2 billion pages. The fraction of the Web covered by search engines has not improved much over the past few years [16]. Even with increasing hardware and bandwidth resources at their disposal, search engines cannot keep up with the growth of the Web and with its rate of change [5].

These scalability limitations stem from search engines' attempt to crawl the whole Web, and to answer any query from any user. Decentralizing the crawling process is a more scalable approach, and bears the additional benefit that crawlers can be driven by a rich context (topics, queries, user profiles) within which to interpret pages and select the links to be visited. It comes as no surprise, therefore, that the development of topic driven crawler algorithms has received significant attention in recent years [9, 14, 8, 18, 1, 19].

Topic driven crawlers (also known as focused crawlers) respond to the particular information needs expressed by topical queries or interest profiles. These could be the needs of an individual user (query time or online crawlers) or those of a community with shared interests (topical search engines and portals).

Evaluation of topic driven crawlers is difficult due to the lack of known relevant sets for Web searches, to the presence of many conflicting page quality measures, and to the need for fair gauging of crawlers' time and space algorithmic complexity. In recent research we presented an evaluation framework designed to support the comparison of topic driven crawler algorithms under specified resource constraints [19]. In this paper we further this line of research by investigating the relative merits of exploration versus exploitation as a defining characteristic of the crawling mechanism.
The issue of exploitation versus exploration is a universal one in machine learning and artificial intelligence, since it presents itself in any task where search is guided by quality estimations. Under some regularity assumption, one can assume that a measure of quality at one point in the search space provides some information on the quality of nearby points. A greedy algorithm can then exploit this information by concentrating the search in the vicinity of the most promising points. However, this strategy can lead to missing other equally good or even better points, for two reasons: first, the estimates may be noisy; and second, the search space may have local optima that trap the algorithm and keep it from locating global optima. In other words, it may be necessary to visit some "bad" points in order to arrive at the best ones. At the other extreme, algorithms that completely disregard quality estimates and continue to explore in a uniform or random fashion do not risk getting stuck at local optima, but they do not use the available information to bias the search and thus may spend most of their time exploring suboptimal areas. A balance between exploitation and exploration of clues is obviously called for in heuristic search algorithms, but the optimal compromise point is unknown unless the topology of the search space is well understood — which is typically not the case.

Topic driven crawlers fit into this picture very well if one views the Web as the search space, with pages as points and neighborhoods as defined by hyperlinks. A crawler must decide which pages to visit based on the cues provided by links from nearby pages. If one assumes that a relevant page has a higher probability to be near other relevant pages than any random page, then quality estimates of pages provide cues that can be exploited to bias the search process. However, given the short range of relevance clues on the Web [17], a very relevant page might be only a few links behind an apparently irrelevant one.

Balancing the exploitation of quality estimate information with exploration of suboptimal pages is thus crucial for the performance of topic driven crawlers. It is a question that we study empirically with respect to two different tasks. In the first, we seek relevant pages starting from a set of relevant links. Applications of such a task are query-time search agents that use the results of a search engine as starting points to provide a user with recent and personalized results. Since we start from relevant links, we may expect an exploratory crawler to perform reasonably well. The second task involves seeking relevant pages while starting the crawl from links that are a few links away from a relevant subset. Such a task may be a part of Web mining or competitive intelligence applications (e.g., a search starting from competitors' home pages). If we do not start from a known relevant subset, the appropriate balance of exploration vs. exploitation becomes an empirical question.

2 Evaluation Framework

2.1 Topics, Examples and Neighbors

In order to evaluate crawler algorithms, we need topics, some corresponding relevant examples, and neighbors. The neighbors are URLs extracted from the neighborhood of the examples. We obtain our topics from the Open Directory (DMOZ, http://dmoz.org). We ran randomized Breadth-First crawls starting from each of the main categories on the DMOZ site. The crawlers identify DMOZ "leaves," i.e., pages that have no children category nodes. Leaves with five or more external links are then used to derive topics. We thus collected 100 topics.

A topic is represented by three types of information derived from the corresponding leaf page. First, the words in the DMOZ hierarchy form the topic's keywords. Second, up to 10 external links form the topic's examples. Third, we concatenate the text descriptions and anchor text of the target URLs (written by DMOZ human editors) to form a topic description. The difference between topic keywords and topic descriptions is that we give the former to the crawlers, as models of (short) query-like topics, while we use the latter, which are much more detailed representations of the topics, to gauge the relevance of the crawled pages in our post-hoc analysis. Table 1 shows a sample topic.

Table 1: A sample topic. The description is truncated for space limitations.

Topic Keywords:
    Recreation Hot Air Ballooning Organizations

Topic Description:
    Aerostat Society of Australia. Varied collection of photos and facts about ballooning in Australia, Airships, Parachutes, Balloon Building and more. Includes an article on the Theory of Flight. Albuquerque Aerostat Ascension Association. A comprehensive site covering a range of ballooning topics including the Albuquerque Balloon Fiesta, local education and safety programs, flying events, club activities and committees, and club history. Arizona Hot Air Balloon Club [...]

Examples:
    http://www.ozemail.com.au/~p0gwil
    http://www.hotairballooning.org/
    http://www.ballooningaz.com
    http://www.aristotle.net/~mikev/
    http://www89.pair.com/techinfo/ABAC/abac.htm
    http://www.prairienet.org/bagi
    http://www.atu.com.au/~balloon/club1.html
    http://communities.msn.com/BalloonCrews
    http://www.bfa.net
    http://www.ask.ne.jp/~kanako/ebst.html
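As a concrete illustration of how such a topic record could be assembled from a DMOZ leaf page, the following minimal Python sketch builds the three representations described above (keywords, examples, description). The leaf dictionary layout and field names are hypothetical conveniences for this sketch; they are not part of the framework described in the paper.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Topic:
        keywords: str          # words in the DMOZ hierarchy (given to the crawlers)
        examples: List[str]    # up to 10 external links
        description: str       # concatenated editor descriptions and anchor text

    def make_topic(leaf: dict, max_examples: int = 10) -> Topic:
        # leaf["hierarchy"] e.g. ["Recreation", "Hot Air Ballooning", "Organizations"]
        keywords = " ".join(leaf["hierarchy"])
        # leaf["external_links"] is assumed to be [(url, anchor_text, editor_description), ...]
        external = leaf["external_links"][:max_examples]
        examples = [url for url, _, _ in external]
        description = " ".join(f"{anchor} {desc}" for _, anchor, desc in external)
        return Topic(keywords=keywords, examples=examples, description=description)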
The neighbors are obtained for each topic through the following process. For each of the examples, we obtain the top 20 inlinks as returned by Google (http://google.yahoo.com). Next, we get the top 20 inlinks for each of the inlinks obtained earlier. Hence, if we had 10 examples to start with, we may now have a maximum of 4000 unique URLs. A subset of 10 URLs is then picked at random from this set. The links in such a subset are called the neighbors.

2.2 Architecture

We use a previously proposed evaluation framework to compare different crawlers [19]. The framework allows one to easily plug in modules implementing arbitrary crawling algorithms, which share data structures and utilities to optimize efficiency without affecting the fairness of the evaluation.

As mentioned before, we use the crawlers for two different tasks. For the first task, the crawlers start from the examples, while for the second the starting points are the neighbors. In either case, as the pages are fetched their component URLs are added to a list that we call the frontier. A crawler may use the topic's keywords to guide the selection of frontier URLs that are to be fetched at each iteration. For a given topic, a crawler is allowed to crawl up to MAX_PAGES = 2000 pages. However, a crawl may end sooner if the crawler's frontier should become empty. We use a timeout of 10 seconds for Web downloads. Large pages are chopped so that we retrieve only the first 100 KB. The only protocol allowed is HTTP (with redirection allowed), and we also filter out all but "static pages" with text/html content. Stale links yielding HTTP error codes are removed as they are found (only good links are used in the analysis).

We constrain the space resources a crawler algorithm can use by restricting the frontier size to MAX_BUFFER = 256 URLs. If the buffer becomes full then the crawler must decide which links are to be replaced as new links are added.
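The resource constraints above (10-second timeout, 100 KB cap, HTTP only, text/html only) translate directly into a download routine. The sketch below shows one plausible way to enforce them in Python; the constants mirror the paper, but the function itself is an illustration, not the framework's actual implementation.

    from typing import Optional
    import urllib.request

    MAX_PAGES = 2000        # per-topic crawl limit (MAX_PAGES in the paper)
    MAX_BUFFER = 256        # frontier size limit (MAX_BUFFER in the paper)
    TIMEOUT = 10            # seconds allowed per download
    MAX_BYTES = 100 * 1024  # keep only the first 100 KB of a page

    def fetch(url: str) -> Optional[str]:
        """Fetch one page under the crawl constraints; return None on any failure."""
        if not url.startswith("http"):          # HTTP only (redirects are followed)
            return None
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    return None                 # keep only "static" HTML pages
                return resp.read(MAX_BYTES).decode("utf-8", errors="replace")
        except Exception:
            return None                         # stale or broken links are simply dropped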
3 Crawling Algorithms

In this paper we study the notion of exploration versus exploitation. We begin with a single family of crawler algorithms with a single greediness parameter to control the exploration/exploitation behavior. In our previous experiments [19] we found that a naive Best-First crawler displayed the best performance among the three crawlers considered. Hence, in this study we explore variants of the Best-First crawler. More generally, we examine the Best-N-First family of crawlers, where the parameter N controls the characteristic of interest.

Best-First crawlers have been studied before [9, 14]. The basic idea is that, given a frontier of links, the best link according to some estimation criterion is selected for crawling. Best-N-First is a generalization in that at each iteration a batch of the top N links to crawl is selected. After completing the crawl of N pages, the crawler decides on the next batch of N, and so on. As mentioned above, the topic's keywords are used to guide the crawl. More specifically, this is done in the link selection process by computing the lexical similarity between a topic's keywords and the source page for the link. Thus the similarity between a page p and the topic is used to estimate the relevance of the pages linked from p. The N URLs with the best estimates are then selected for crawling. Cosine similarity is used by the crawlers, and the links with minimum similarity score are removed from the frontier if necessary in order not to exceed the MAX_BUFFER size. Figure 1 offers a simplified pseudocode of a Best-N-First crawler.

    Best_N_First(topic, starting_urls, N) {
        foreach link (starting_urls) {
            enqueue(frontier, link);
        }
        while (#frontier > 0 and visited < MAX_PAGES) {
            links_to_crawl := dequeue_top_links(frontier, N);
            foreach link (randomize(links_to_crawl)) {
                doc := fetch_new_document(link);
                score := sim(topic, doc);
                foreach outlink (extract_links(doc)) {
                    if (#frontier >= MAX_BUFFER) {
                        dequeue_link_with_min_score(frontier);
                    }
                    enqueue(frontier, outlink, score);
                }
            }
        }
    }

Figure 1: Pseudocode of Best-N-First crawlers.

Best-N-First offers an ideal context for our study. The parameter N controls the greedy behavior of the crawler. Increasing N results in crawlers with greater emphasis on exploration and consequently a reduced emphasis on exploitation. Decreasing N reverses this; selecting a smaller set of links is more exploitative of the evidence available regarding the potential merits of the links. In our experiments we test five mutants of the crawler by setting N to 1, 16, 64, 128 and 256. We refer to them as BFSN, where N is one of the above values.
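To make the exploration/exploitation knob concrete, here is a small runnable Python rendering of the Figure 1 loop. The fetch, link-extraction and similarity functions are caller-supplied stand-ins (hypothetical helpers, not the authors' code), but the frontier handling follows the pseudocode: take the top N links, score new outlinks by the similarity of their source page, and evict the lowest-scored links when the frontier exceeds MAX_BUFFER.

    import heapq
    import random

    def best_n_first(topic, starting_urls, n, fetch, extract_links, sim,
                     max_pages=2000, max_buffer=256):
        """Sketch of Best-N-First; fetch/extract_links/sim are caller-supplied."""
        frontier = {url: 1.0 for url in starting_urls}    # url -> estimated score
        visited = set()
        while frontier and len(visited) < max_pages:
            # Exploitation: pick the N most promising links (N = 1 is greediest).
            batch = heapq.nlargest(n, frontier, key=frontier.get)
            for url in batch:
                del frontier[url]
            random.shuffle(batch)                          # randomize within the batch
            for url in batch:
                if url in visited:
                    continue
                doc = fetch(url)
                visited.add(url)
                if doc is None:
                    continue
                score = sim(topic, doc)                    # similarity of the source page
                for outlink in extract_links(doc):
                    if outlink in visited:
                        continue
                    frontier[outlink] = max(score, frontier.get(outlink, 0.0))
                    # Bounded frontier: evict the lowest-scored link when full.
                    while len(frontier) > max_buffer:
                        worst = min(frontier, key=frontier.get)
                        del frontier[worst]
        return visited

With n = 1 this reduces to a standard Best-First crawler, while n = 256 corresponds to the most exploratory variant studied in the paper.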
4 Evaluation Methods

Table 2 depicts our overall methodology for crawler evaluation. The two rows of Table 2 indicate two different methods for gauging page quality. The first is a purely lexical approach wherein similarity to the topic description is used to assess relevance. The second method is primarily linkage based and is an approximation of the retrieval/ranking method used by Google [6]; it uses PageRank to discriminate between pages containing the same number of topic keywords. The columns of the table show that our measures are used both from a static and a dynamic perspective. The static approach examines crawl quality assessed from the full set of (up to 2000) pages crawled for each query. In contrast, the dynamic measures provide a temporal characterization of the crawl strategy, by considering the pages fetched while the crawl is in progress. More specifically, the static approach measures coverage, i.e., the ability to retrieve "good" pages, where the quality of a page is assessed in two different ways (corresponding to the rows of the table). Our static plots show the ability of each crawler to retrieve more or fewer highly relevant pages. This is analogous to plotting recall as a function of generality.

The dynamic approach examines the quality of retrieval as the crawl progresses. Dynamic plots offer a trajectory over time that displays the dynamic behavior of the crawl. The measures are built on average (quality-based) ranks and are generally inversely related to precision. As the average rank decreases, an increasing proportion of the crawled set can be expected to be relevant.

It should be noted that scores and ranks used in each dynamic measure are computed omnisciently, i.e., all calculations for each point in time for a crawler are done using data generated from the full crawl. For instance, all PageRank scores are calculated using the full set of retrieved pages. This strategy is quite reasonable given that we want to use the best possible evidence when judging page quality.

Table 2: Evaluation schemes and measures. The static scheme is based on coverage of top pages (ranked by a quality metric among all crawled pages S); S_crawler is the set of pages visited by a crawler. The dynamic scheme is based on the ranks (by quality metric among all crawled pages S) averaged over the crawl set at time t, S_crawler(t).

                Static Scheme                        Dynamic Scheme
    Lexical     |S_crawler ∩ top_rankSMART(S)|       sum_{p in S_crawler(t)} rankSMART(p) / |S_crawler(t)|
    Linkage     |S_crawler ∩ top_rankKW,PR(S)|       sum_{p in S_crawler(t)} rankKW,PR(p) / |S_crawler(t)|
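Both cells of Table 2 reduce to simple set and averaging computations once the omniscient rank of every crawled page is known. A hedged Python sketch, with hypothetical dictionaries mapping page identifiers to ranks, could look like this:

    # Sketch of the Table 2 measures; `rank` maps each page in the pooled set S to
    # its omniscient rank (rankSMART or rankKW,PR), where rank 1 is the best page.
    def static_coverage(crawled_pages, rank, top_n):
        """Fraction of the top_n best-ranked pages in S that this crawler visited."""
        top_pages = sorted(rank, key=rank.get)[:top_n]
        return len(set(crawled_pages) & set(top_pages)) / top_n

    def dynamic_mean_rank(crawl_sequence, rank):
        """Mean rank of the pages crawled so far, at each point t of the crawl."""
        trajectory, total = [], 0.0
        for t, page in enumerate(crawl_sequence, start=1):
            total += rank.get(page, len(rank))    # unranked pages get the worst rank
            trajectory.append(total / t)
        return trajectory

Multiplying the static coverage by 100 gives the "% crawled" values plotted in the static figures, while the dynamic trajectory corresponds to the "average rank" curves.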
4.1 Lexical Based Page Quality

We use the SMART system [23] to rank the retrieved pages by their lexical similarity to the topic. The SMART system allows us to pool all the pages crawled by all the crawlers for a topic and then rank these against the topic description. The system utilizes term weighting strategies involving term frequency and inverse document frequency computed from the pooled pages for a topic. SMART computes the similarity between the topic (used as the query) and each page as a dot product of the topic and page vectors. It outputs a ranked set of pages based on their topic similarity scores. That is, for each page we get a rank which we refer to as rankSMART (cf. Table 2). Thus, given a topic, the percentage of the top n pages ranked by SMART (where n varies) that are retrieved by each crawler may be calculated, yielding the static evaluation metric.

For the dynamic view we use the rankSMART values for pages to calculate the mean rankSMART at different points of the crawl. If we let S_crawler(t) denote the set of pages retrieved up to time t, then we calculate the mean rankSMART over S_crawler(t). The set S_crawler(t) of pages increases in size as we proceed in time. We approximate t by the number of pages crawled. The trajectory of mean rankSMART values over time displays the dynamic behavior of a crawler.
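The lexical ranking can be approximated without SMART itself: weight terms by TF-IDF over the pooled pages for a topic, then score each page by the dot product with the topic-description vector. The sketch below is a simplified stand-in for SMART's weighting scheme, not a reproduction of it.

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """docs: {page_id: list of tokens}. Returns {page_id: {term: weight}}."""
        df = Counter(term for tokens in docs.values() for term in set(tokens))
        n = len(docs)
        return {
            pid: {t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
            for pid, tokens in docs.items()
        }

    def rank_lexical(pooled_docs, topic_description_tokens):
        """Rank pooled pages by dot product with the topic description vector."""
        vectors = tfidf_vectors(pooled_docs)
        topic_vec = Counter(topic_description_tokens)
        scores = {
            pid: sum(w * topic_vec.get(t, 0) for t, w in vec.items())
            for pid, vec in vectors.items()
        }
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {pid: r for r, pid in enumerate(ordered, start=1)}   # rank 1 = most similar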
4.2 Linkage Based Page Quality

It has been observed that content alone does not give a fair measure of the quality of a page [15]. Algorithms such as HITS [15] and PageRank [6] use the linkage structure of the Web to rank pages. PageRank in particular estimates the global popularity of a page. The computation of PageRank can be done through an iterative process. PageRanks are calculated once, after all the crawls are completed. That is, we pool the pages crawled for all the topics by all the crawlers and then calculate the PageRanks according to the algorithm described in [13]. We sort the pages crawled for a given topic, by all crawlers, first based on the number of topic keywords they contain, and then sort the pages with the same number of keywords by their PageRank. The process gives us a rankKW,PR for each page crawled for a topic.

Once again, our static evaluation metric measures the percentage of the top n pages (ranked by rankKW,PR) crawled by a crawler on a topic. In the dynamic metric, the mean rankKW,PR is plotted over each S_crawler(t), where t is the number of pages crawled.
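The linkage-based ranking therefore combines two signals: the count of topic keywords on a page (primary key) and PageRank over the pooled crawl graph (tie-breaker). The sketch below uses a plain power-iteration PageRank with a damping factor of 0.85, a common textbook choice assumed here rather than the exact variant of [13].

    def pagerank(graph, damping=0.85, iterations=50):
        """graph: {page: [outlinked pages]}. Plain power iteration (an assumption,
        not the exact algorithm of [13]); dangling mass is simply ignored."""
        pages = list(graph)
        pr = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for p, outlinks in graph.items():
                if outlinks:
                    share = damping * pr[p] / len(outlinks)
                    for q in outlinks:
                        if q in new:
                            new[q] += share
            pr = new
        return pr

    def rank_kw_pr(pages_text, topic_keywords, pr):
        """Sort pages by topic-keyword count (descending), ties broken by PageRank."""
        kw = set(topic_keywords)
        def key(p):
            count = sum(1 for w in kw if w in pages_text[p])
            return (count, pr.get(p, 0.0))
        ordered = sorted(pages_text, key=key, reverse=True)
        return {p: r for r, p in enumerate(ordered, start=1)}   # rank 1 = best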
5 Results

For each of the evaluation schemes and metrics outlined in Table 2, we analyzed the performance of each crawler on the two tasks.

5.1 Task 1: Starting from Examples

For the first task the crawlers start from a relevant subset of links, the examples, and use the hyperlinks to navigate and discover more relevant pages. The results for the task are summarized by the plots in Figure 2. For readability, we are only plotting the performance of a selected subset of the Best-N-First crawlers (N = 1, 256). The behavior of the remaining crawlers (BFS16, BFS64 and BFS128) can be extrapolated between the curves corresponding to BFS1 and BFS256.

[Figure 2 comprises four plots comparing BFS1 and BFS256: (a) static lexical performance and (c) static linkage performance (% crawled vs. number of top pages), and (b) dynamic lexical performance and (d) dynamic linkage performance (average rank vs. pages crawled).]

Figure 2: Static evaluation (left) and dynamic evaluation (right) of representative crawlers on Task 1. The plots correspond to the lexical (top) and linkage (bottom) quality metrics. Error bars correspond to ±1 standard error across the 100 topics in this and the following plots.

The most general observation we can draw from the plots is that BFS256 achieves a significantly better performance under the static evaluation schemes, i.e., a superior coverage of the most highly relevant pages based on both quality metrics and across different numbers of top pages (cf. Figure 2a,c). The difference between the coverage by crawlers for different N increases as one considers fewer highly relevant pages. These results indicate that exploration is important to locate the highly relevant pages when starting from relevant links, whereas too much exploitation is harmful.

The dynamic plots give us a richer picture. (Recall that here the lowest average rank is best.) BFS256 still does significantly better than the other crawlers on the lexical metric (cf. Figure 2b). However, the linkage metric shows that BFS256 pays a large penalty in the early stage of the crawl (cf. Figure 2d), although the crawler regains quality over the longer run. The better coverage of highly relevant pages by this crawler (cf. Figure 2c) may help us interpret the improvement observed in the second phase of the crawl. We conjecture that by exploring suboptimal links early on, BFS256 is capable of eventually discovering paths to highly relevant pages that escape more greedy strategies.

5.2 Task 2: Starting from Neighbors

The success of a more exploratory algorithm on the first task may not come as a surprise since we start from known relevant pages. However, in the second task we use the links obtained from the neighborhood of the relevant subset as the starting points, with the goal of finding more relevant pages. We take the worst (BFS1) and the best (BFS256) crawlers on Task 1, and use them for Task 2. In addition, we add a simple Breadth-First crawler that uses the limited size frontier as a FIFO queue. The Breadth-First crawler is added to observe the performance of a blind exploratory algorithm. A summary of the results is shown through the plots in Figure 3.

[Figure 3 comprises four plots comparing BFS1, BFS256 and BreadthFirst: (a) static lexical performance and (c) static linkage performance (% crawled vs. number of top pages), and (b) dynamic lexical performance and (d) dynamic linkage performance (average rank vs. pages crawled).]

Figure 3: Static evaluation (left) and dynamic evaluation (right) of representative crawlers on Task 2. The plots correspond to the lexical (top) and linkage (bottom) quality metrics.

As for Task 1, we find that the more exploratory algorithm, BFS256, performs significantly better than BFS1 under static evaluations for both lexical and linkage quality metrics (cf. Figure 3a,c). In the dynamic plots (cf. Figure 3b,d) BFS256 seems to bear an initial penalty for exploration but recovers in the long run. The Breadth-First crawler performs poorly on all evaluations. Hence, as a general result we find that exploration helps an exploitative algorithm, but exploration without guidance goes astray.

Due to the availability of relevant subsets (examples) for each of the topics in the current task, we plot the average recall of the relevant examples against the number of pages crawled (Figure 4). The plot illustrates the target-seeking behavior of the three crawlers if the examples are viewed as the targets. We again find BFS256 outperforming BFS1, while Breadth-First trails behind.

[Figure 4 plots average recall (roughly 0 to 0.45) against pages crawled (0 to 2000) for BFS1, BFS256 and BreadthFirst.]

Figure 4: Average recall of examples when the crawls start from the neighbors.
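The recall curves of Figure 4 only require the ordered list of fetched URLs and the set of example targets for each topic. As a hedged illustration of how such a curve could be computed (with per-topic curves averaged afterwards), consider:

    def recall_trajectory(crawl_sequence, targets):
        """Recall of the target examples after each fetched page."""
        targets = set(targets)
        found, curve = set(), []
        for url in crawl_sequence:
            if url in targets:
                found.add(url)
            curve.append(len(found) / len(targets))
        return curve

    # Averaging over topics (each with its own crawl sequence and example set)
    # would then give curves like those plotted in Figure 4:
    # avg = [sum(c[t] for c in curves) / len(curves) for t in range(min_len)]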
6 Related Research

Research on the design of effective focused crawlers is very vibrant. Many different types of crawling algorithms have been developed. For example, Chakrabarti et al. [8] use classifiers built from training sets of positive and negative example pages to guide their focused crawlers. Fetuccino [3] and InfoSpiders [18] begin their focused crawling with starting points generated from CLEVER [7] or other search engines. Most crawlers follow fixed strategies, while some can adapt in the course of the crawl by learning to estimate the quality of links [18, 1, 22].

The question of exploration versus exploitation in crawler strategies has been addressed in a number of papers, more or less directly. Fish-Search [11] limited exploration by bounding the depth along any path that appeared suboptimal. Cho et al. [9] found that exploratory crawling behaviors such as implemented in the Breadth-First algorithm lead to efficient discovery of pages with good PageRank. They also discuss the issue of limiting the memory resources (buffer size) of a crawler, which has an impact on the exploitative behavior of the crawling strategy because it forces the crawler to make frequent filtering decisions. Breadth-First crawlers also seem to find popular pages early in the crawl [20]. The exploration versus exploitation issue continues to be studied via variations on the two major classes of Breadth-First and Best-First crawlers. For example, in recent research on Breadth-First focused crawling, Diligenti et al. [12] address the "shortsightedness" of some crawlers when assessing the potential value of links to crawl. In particular, they look at how to avoid short-term gains at the expense of less obvious but larger long-term gains. Their solution is to build classifiers that can assign pages to different classes based on the expected link distance between the current page and relevant documents.

The area of crawler quality evaluation has also received much attention in recent research [2, 9, 8, 19, 4]. For instance, many alternatives for assessing page importance have been explored, showing a range of sophistication. Cho et al. [9] use the simple presence of a word such as "computer" to indicate relevance. Amento et al. [2] compute the similarity between a page and the centroid of the seeds. In fact, content-based similarity assessments form the basis of relevance decisions in several examples of research [8, 19]. Others exploit link information to estimate page relevance, with methods based on in-degree, out-degree, PageRank, hubs and authorities [2, 3, 4, 8, 9, 20]. For example, Cho et al. [9] consider pages with a PageRank score above a threshold as relevant. Najork and Wiener [20] use a crawler that can fetch millions of pages per day; they then calculate the average PageRank of the pages crawled daily, under the assumption that PageRank estimates relevance. Combinations of link and content-based relevance estimators are evident in several approaches [4, 7, 18].

7 Conclusions

In this paper we used an evaluation framework for topic driven crawlers to study the role of exploitation of link estimates versus exploration of suboptimal pages. We experimented with a family of simple crawler algorithms of varying greediness, under limited memory resources, for two different tasks. A number of schemes and quality metrics derived from lexical features and link analysis were introduced and applied to gauge crawler performance.

We found consistently that exploration leads to better coverage of highly relevant pages, in spite of a possible penalty during the early stage of the crawl. An obvious explanation is that exploration allows the crawler to trade off short term gains for longer term and potentially larger gains. However, we also found that blind exploration, when starting from neighbors of relevant pages, leads to poor results. Therefore, a mix of exploration and exploitation is necessary for good overall performance. When starting from relevant examples (Task 1), the better performance of crawlers with higher exploration could be attributed to their better coverage of documents close to the relevant subset. The good performance of BFS256 starting away from relevant pages shows that its exploratory nature complements its greedy side in finding highly relevant pages. Extreme exploitation (BFS1) and blind exploration (Breadth-First) impede performance. Nevertheless, any exploitation seems to be better than none. Our results are based on short crawls of 2000 pages. The same may not hold for longer crawls; this is an issue to be addressed in future research. The dynamic evaluations do suggest that for very short crawls it is best to be greedy; this is a lesson that should be incorporated into algorithms for query time (online) crawlers such as MySpiders (http://myspiders.biz.uiowa.edu) [21].

The observation that higher exploration yields better results can motivate parallel and/or distributed implementations of topic driven crawlers, since complete orderings of the links in the frontier, as required by greedy crawler algorithms, do not seem to be necessary for good performance. Therefore crawlers based on local decisions seem to hold promise both for the performance of exploratory strategies and for the efficiency and scalability of distributed implementations. In particular, we intend to experiment with variations of crawling algorithms such as InfoSpiders [18], which allow for adaptive and distributed exploratory strategies.

Other crawler algorithms that we intend to study in future research include Best-First strategies driven by estimates other than lexical ones. For example, we plan to implement a Best-N-First family using link estimates based on local versions of the rankKW,PR metric used in this paper for evaluation purposes. We also plan to test more sophisticated lexical crawlers such as InfoSpiders and Shark Search [14], which can prioritize over links from a single page.

A goal of present research is to identify optimal trade-offs between exploration and exploitation, where either more exploration or more greediness would degrade performance. A large enough buffer size will have to be used so as not to constrain the range of exploration/exploitation strategies as much as happened in the experiments described here due to the small MAX_BUFFER. Identifying an optimal exploration/exploitation trade-off would be the first step toward the development of an adaptive crawler that would attempt to adjust the level of greediness during the crawl.

Finally, two things that we have not done in this paper are to analyze the time complexity of the crawlers and the topic-specific performance of each strategy. Regarding the former, clearly more greedy strategies require more frequent decisions and this may have an impact on the efficiency of the crawlers. Regarding the latter, we have only considered quality measures in the aggregate (across topics). It would be useful to study how appropriate trade-offs between exploration and exploitation depend on different characteristics such as topic heterogeneity. Both of these issues are the object of ongoing research.
References

[1] C. Aggarwal, F. Al-Garawi, and P. Yu. Intelligent crawling on the World Wide Web with arbitrary predicates. In Proc. 10th Intl. World Wide Web Conference, pages 96–105, 2001.

[2] B. Amento, L. Terveen, and W. Hill. Does "authority" mean quality? Predicting expert quality ratings of web documents. In Proc. 23rd ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 296–303, 2000.

[3] I. Ben-Shaul, M. Herscovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalhaim, V. Soroka, and S. Ur. Adding support for dynamic and focused search with Fetuccino. Computer Networks, 31(11–16):1653–1665, 1999.

[4] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proc. 21st ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 104–111, 1998.

[5] B. E. Brewington and G. Cybenko. How dynamic is the Web? In Proc. 9th International World-Wide Web Conference, 2000.

[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, 1998.

[7] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. Computer Networks, 30(1–7):65–74, 1998.

[8] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11–16):1623–1640, 1999.

[9] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proc. 7th Intl. World Wide Web Conference, Brisbane, Australia, 1998.

[10] Cyveillance. Sizing the internet. White paper, July 2000. http://www.cyveillance.com/web/us/corporate/white papers.htm.

[11] P. De Bra and R. Post. Information retrieval in the World Wide Web: Making client-based searching feasible. In Proc. 1st Intl. World Wide Web Conference, 1994.

[12] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In Proc. 26th International Conference on Very Large Databases (VLDB 2000), pages 527–534, Cairo, Egypt, 2000.

[13] T. Haveliwala. Efficient computation of PageRank. Technical report, Stanford Database Group, 1999.

[14] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim, and S. Ur. The shark-search algorithm — An application: Tailored Web site mapping. In Proc. 7th Intl. World-Wide Web Conference, 1998.

[15] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

[16] S. Lawrence and C. Giles. Accessibility of information on the Web. Nature, 400:107–109, 1999.

[17] F. Menczer. Links tell us about lexical and semantic web content. arXiv:cs.IR/0108004.

[18] F. Menczer and R. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2–3):203–242, 2000.

[19] F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.

[20] M. Najork and J. L. Wiener. Breadth-first search crawling yields high-quality pages. In Proc. 10th International World Wide Web Conference, 2001.

[21] G. Pant and F. Menczer. MySpiders: Evolve your own intelligent web crawlers. Autonomous Agents and Multi-Agent Systems, 5(2):221–229, 2002.

[22] J. Rennie and A. K. McCallum. Using reinforcement learning to spider the Web efficiently. In Proc. 16th International Conf. on Machine Learning, pages 335–343. Morgan Kaufmann, San Francisco, CA, 1999.

[23] G. Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ, 1971.