-

November

2000

20 2000 95 97

1 Introduction 2 pages (reciprocal or not) is used as a basis for ranking web pages in search on sales resulting from such visits. Finally, the existence of links between thousands of rings into a hierarchy, and allow to search web rings by topic. will of course miss the most general web rings. ther simply for the number of click-throughs, or by distributing commissions ties. Many sites link to each other voluntarily [1], and monitoring the trac Web rings turn out be very useful for casual browsing, since they provide can be hidden and be more or less obvious to the user. Very often, each rings are accessed through ring portal sites [6] which organize hundreds of often depends on the eorts of a ring administrator. Man y rings provide They help organize collections of pages, create directories, or form communiHyperlinks are one of the most important features of the World Wide Web. shortcuts, some have broken links, and even when all the links exist some member site of a web ring will link to an index page which itself links to In reality these rings are obviously much more complicated. Their quality results [3, 4, 5]. to dene and detect w eb rings automatically [ 7 ], even though this criterion and advertising banners [2]. Sites also frequently pay for incoming trac, eisite, in such a way that the member sites together form a ring. Most web structure. In its most ideal form, sites link to a "next" and a "previous" all the member sites of the ring. Some suggest using this spoke structure that o ws from one site to another is the basis for both aliate programs One particular link structure that has become very widespread is the ring ring to sample the main page on a few dozen related sites. This makes it example, it is often very dicult to come up with an accurate searc h query Conventional search engines tend to be rather ineectiv e in this case. For that captures the nuances of the pages that one is trying to discover. On an ecien t way to discover and explore smaller sites about a particular topic. possible to nd great sources of information that one w ould otherwise not the other hand, web ring portal sites make it easy to nd a ring co vering a given generic topic. It is then very easy to quickly follow the links in the to a web ring, is how large such a ring should become. Obviously, a web site for another is the rst site’s loss and the second site’s gain, mem bers of dieren t consideration should limit the size of web rings. Since web rings bring him back at a later stage, especially if another member of the web ring sales. And if users are going to leave a site anyhow, it is best if they leave slight advantage over directories in that it is possible to surf from one site web rings recognize that helping a visitor nd what he needs ma y ultimately ring that is very large might as well simply be a list that doesn’t connect to the next without returning to the directory every time. often form web rings as well. While one might think that a user leaving one Surprisingly, competing sites such as online stores or online auction sites it by going to a partner site [11].

One question asked by web ring administrators as new sites get added links back. Thus, by belonging to a web ring a site can increase trac and are typically organized in a hierarchy, the size of a ring should be determined back to itself { a line instead of a loop. However, we will argue that a even have known to search for. Note that for this purpose web rings have a sites. The question then becomes when to split these rings up. Is it better far a user is likely to surf before he stops. We briey summarize this model a random walk, in which the increments are independent and identically zn distributed. Thus the value of the nth page is given by to have one large ring, or several rings of smaller size? here. It is assumed that the pages encountered while browsing have a value Web rings that are allowed to grow freely as people add their sites to the A simple answer is given by the Law of Surng [8 ], which tells us how or utility. In particular, the utility of the nth page is These values follow xn. ring soon become very large. Some rings can have in excess of a thousand cien tly connected web, the threshold will be independent of n, so that the probability p(x) that a surfer surfs x links is are distributed according to the inverse Gaussian distribution [ 10 ]. That is, = x is a constant. The number of pages visited before < x is then xn xn such that it is optimal to continue only if > xn xn With a nite (non-zero) discoun t rate > 0, and a very large and sufknown, since in the continuous limit of the random walk specied b y Eq. (1) one obtains true Brownian motion. In that limit the rst passage times p(x) = r 2 x 2 (x : 3=2e 2x )2

0.15 particular site among these 2n sites that a surfer needs to nd.

(a) (b) 1-e (c) rings is larger than the expected probability of nding it in one large ring: Putting everything together, we see that splitting one ring into two equally sized rings will be optimal (i.e. lead to a higher probability of nding dy p(x)dx: n1 Zn2n Zy1 the target site) provided the expected probability of nding the page in t wo particular, there is a regime in which the inverse Gaussian distribution is care is taken in the integral approximation in order to avoid the divergence split the ring if < 1 . p12 initially chose the incorrect ring, and then assume that we can not return inequality.

We note that small modications to the model ma y change the threshold discrete case by explicitly summing the sums instead of approximating them dependence on ring size.

These results can be applied recursively. Any ring can be replaced by probability of error that we allow until we split the ring. For example, if we always better to keep the ring intact, since there is no way to satisfy the at zero, we get that a large ring should always be split provided that < . 12 two rings and a decision of which ring to follow. Thus, as long as there is a What happens if we assume a dieren t form for the Law of Surng? In other hand, the probability of chosing the correct ring is or less, it is 21 distribution. It may at rst be surprising that this result is independen t of Note that we obtain the same result if we solve the problem for the the size of the original ring. However, this must be true since a power-law to the correct ring once we are done surng our rst c hoice, we should only by integrals. organized as a hierarchical tree structure, rather than in a ring. If, on the will be advantageous to split the ring. This result is true for any power-law distribution does not provide a natural length scale that is necessary for a That is, as long as the sites can be divided into two meaningful groups, it approximated by a power-law with exponent 3=2. In this case, if proper The parameters in the model that helps to construct such a hierarchy links. It is then only a question of deciding how to optimally organize sites between this model and real usage data from a portal site. Combining the [12]. can in principle be measured, even though in practice there are a number into rings, and how to t rings in to a hierarchy. have a circular structure is not important { what matters is the ease with The model presented in the last section suggests that there is an optimal then lead to automated ways of optimally organizing an index to web sites which one navigates from one site to the next by following the web ring of challenges. It should then be possible to make quantitative comparisons tree structure to organize web rings of similar topic. The fact that web rings ideas presented here with machine learning and clustering algorithms may

VLDB

Conference , Edinburgh, Scotland, 1999 .

large-scale knowledge bases from the web" , Proceedings of the 25th

[7]

Kumar ,

Raghavan ,

Rajagopalan and

Tomkins , "Extracting

[10] Raj

Chhikara and J.

Leroy

Folks

, The Inverse Gaussian Distribution,

Marcel

Dekker , Inc., 1989 .