Dynamics of the Chilean Web Structure

Ricardo Baeza-Yates, Barbara Poblete

Center for Web Research, Dept. of Computer Science, University of Chile
bpoblete@dcc.uchile.cl

In this paper we present further results on the evolution of the structure of the Chilean Web between 2000 and 2003, focusing on the transitions of Web sites within the structure. Our results show that there are many stable Web sites, but also chaotic changes. We also expand our own results on the death behavior of Web sites.

1 Introduction

The Web is highly dynamic and not much is known about its evolution. There has been some work on page evolution, yielding models that predict when a page will change, but this behavior differs a lot from site to site. There are also generative models for Web growth, but they do not include Web death. In fact, new websites appear and others disappear, but little is known about how this happens. Even less work has been done using sites or domains as the granularity for studying this evolution.

In [BYP03] we presented the evolution of the structure composition of the Chilean Web at the site and domain level, based on data gathered between 2000 and 2002 by TodoCL.cl, a search engine targeted at this Web domain. In this paper we include data from 2003, extending our previous results. In addition, we focus not only on macro statistics, but also on the transitions of Web sites among the different structure components. That is, are the changes in the size of the components due to small transitions in one direction or to large transitions in both directions? Our results show that for some Web components the first is true, while for others the second is true.

[Table 1. Columns: Year, Pages, Sites (crawled), Sites (known), Domains (crawled), Domains (known). Data not recovered.]

Our results show how the structure evolves, how sites migrate from one component to another, and where sites appear and disappear. The changes are dramatic, which suggests that we may be studying a process that is still in a transient phase, or one that cannot be modeled in detail. This is a first step toward measuring and following the evolution of part of the Web structure, as well as toward understanding the process behind the changes. To the best of our knowledge there are no other studies on Web composition as specific as ours. Most statistical studies deal with global attributes such as language or size. We would have liked to separate the Chilean Web into commercial, educational, governmental, and other sites, but Chile does not use a subdomain level indicating this, so the classification is not trivial.

In section 2 we review the results on the structure of the Web and the problems faced in obtaining it. Section 3 shows the evolution of this structure, and section 4 analyzes the migrations of Web sites in the structure in relation to the expected typical life cycle of a Web site. The last section has some concluding remarks.

2 Web Structure

The most complete study of the Web structure [BKM+00] focuses on page connectivity. One problem with this is that a page is not a logical unit (for example, a page can describe several documents and one document can be stored in several pages). Hence, we study how websites are connected, as websites are closer to being real logical units. Not surprisingly, we found in [BYC01] that the structure at the website level is similar to that of the global Web, and hence we use the same notation as [BKM+00]. The components are: a) MAIN, sites that are in the strongly connected component of the connectivity graph of sites (that is, we can navigate from any site to any other site in the same component); b) IN, sites that can reach MAIN but cannot be reached from MAIN; c) OUT, sites that can be reached from MAIN, but from which there is no path back to MAIN; and d) other sites: those that can be reached from IN (T.IN, where T is an abbreviation for tentacles), those in paths between IN and OUT (TUNNEL), those that only reach OUT (T.OUT), and unconnected sites (ISLANDS).
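As an illustration, the component definitions above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the study: the function names are hypothetical, MAIN is taken to be the strongly connected component containing a given `seed` site (rather than being discovered automatically), and ISLANDS is used as the residual label for every site that fits none of the other definitions.

```python
from collections import defaultdict, deque


def reachable(starts, adj):
    """Return every node reachable from `starts` by following `adj` (BFS)."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(sites, links, seed):
    """Label each site with a component name.

    `links` is an iterable of (source_site, target_site) pairs.
    `seed` is a site assumed (hypothetically) to belong to MAIN.
    """
    fwd, bwd = defaultdict(set), defaultdict(set)
    for src, dst in links:
        fwd[src].add(dst)
        bwd[dst].add(src)

    down = reachable({seed}, fwd)  # sites reachable from the seed
    up = reachable({seed}, bwd)    # sites that can reach the seed
    main = down & up
    in_sites = up - main
    out_sites = down - main

    core = up | down  # MAIN, IN and OUT together
    from_in = reachable(in_sites, fwd) - core  # reached from IN, outside the core
    to_out = reachable(out_sites, bwd) - core  # reach OUT, outside the core

    labels = {}
    for site in sites:
        if site in main:
            labels[site] = "MAIN"
        elif site in in_sites:
            labels[site] = "IN"
        elif site in out_sites:
            labels[site] = "OUT"
        elif site in from_in and site in to_out:
            labels[site] = "TUNNEL"
        elif site in from_in:
            labels[site] = "T.IN"
        elif site in to_out:
            labels[site] = "T.OUT"
        else:
            labels[site] = "ISLANDS"  # residual label (assumption)
    return labels
```

Two breadth-first searches from the seed (one forward, one backward) give MAIN, IN and OUT; two further searches starting from IN and OUT separate the tentacles and tunnels from the remaining sites.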

In [BYC01] we analyzed the data for 2000 and extended this notation by dividing the MAIN component into four parts: a) MAIN-MAIN, sites that can be reached directly from the IN component and can directly reach the OUT component; b) MAIN-IN, sites that can be reached directly from the IN component but are not in MAIN-MAIN; c) MAIN-OUT, sites that can directly reach the OUT component but are not in MAIN-MAIN; d) MAIN-NORM, sites that do not belong to any of the previously defined subcomponents. Figure 1 shows all these components. The average update time of pages and sites, and its relation to structure and link ranking techniques, was studied in [BYSJC02] for the first two collections (2000 and 2001). We could consider domains in our study, but domains may contain sites that are quite different, for example in the case of Web hosting by an ISP using a common second-level domain such as co.cl.
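The four-way split of MAIN depends only on direct links to and from IN and OUT, so it can be illustrated with two set comprehensions. The following is a minimal sketch; the function name `split_main` and its argument layout are assumptions made for illustration, and the sets `main`, `in_sites` and `out_sites` are taken as already computed.

```python
def split_main(main, in_sites, out_sites, links):
    """Split MAIN into MAIN-MAIN, MAIN-IN, MAIN-OUT and MAIN-NORM.

    `links` is an iterable of (source_site, target_site) pairs.
    Only direct links between MAIN and IN/OUT are considered.
    """
    links = list(links)
    # MAIN sites that receive a direct link from some IN site.
    linked_from_in = {dst for src, dst in links
                      if src in in_sites and dst in main}
    # MAIN sites that link directly to some OUT site.
    links_to_out = {src for src, dst in links
                    if src in main and dst in out_sites}

    labels = {}
    for site in main:
        if site in linked_from_in and site in links_to_out:
            labels[site] = "MAIN-MAIN"
        elif site in linked_from_in:
            labels[site] = "MAIN-IN"
        elif site in links_to_out:
            labels[site] = "MAIN-OUT"
        else:
            labels[site] = "MAIN-NORM"
    return labels
```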

[Figure 1: Components of the Web structure at the site level: IN, T.IN, MAIN (subdivided into MAIN-MAIN, MAIN-IN, MAIN-OUT, and MAIN-NORM), OUT, TUNNEL, T.OUT, and ISLANDS.]

Because any crawl is incomplete (for example, dynamic pages can be unbounded), any Web graph is incomplete. That means that any analysis of the Web structure will be an approximation, even more so in our case, as we do not consider paths through links outside the Chilean Web. On the other hand, our Web subset is a very coherent one and is not just a Web sample. To know whether a site exists, it is enough to crawl its home page. However, to know all the links of that site, a thorough crawl of the site is needed, and many sites, sometimes out of ignorance, do not allow crawlers to enter. For example, in 2001, 56% of the domains and 54% of the sites had only one page. However, for 25% of them (14% of the total), this was because they had an initial Flash page or called a similar kind of program.

3 Evolution of the Structure Composition

Table 2 shows the number of sites that have appeared and disappeared from year to year, out of a total of 56,699 different sites belonging to 49,790 domains crawled at some point, 56,020 of them being .cl. As of March 1st, 2004, there were 75,758 registered domains with a name server in .cl, approximately 56,100 of them with a Web server; our coverage was around 67% in 2003 and higher in 2002. The last three rows represent the new sites (NEW), the sites that were not crawled but exist (UNKNOWN), and the sites that disappeared (DEAD), respectively. UNKNOWN includes non-crawled existing sites and sites with connectivity or access problems. NEW sites may not be really new, as the crawling coverage is not 100%. Death of a site means that there is no IP address associated with it (this might be incorrect if the site changes its name, but then it is considered a new site, and there are few such cases), and death of a domain means that there are no sites associated with it (in particular, neither the domain name itself nor the name prefixed by www).¹
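The NEW, UNKNOWN and DEAD categories described above can be illustrated with simple set operations plus a name-resolution check. This is a minimal sketch, not the tooling used in the study: the function names are hypothetical, `previous` is assumed to hold all sites known up to the previous year, and a site is treated as dead when `socket.gethostbyname` cannot resolve it to an IP address. The resolver is passed as a parameter so that it can be replaced by a stub.

```python
import socket


def has_ip(host):
    """Return True if `host` currently resolves to an IP address."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def classify_sites(previous, current, resolves=has_ip):
    """Compare two yearly snapshots of site names.

    previous: set of sites known up to the previous year (assumption)
    current:  set of sites crawled this year
    Returns the sets (new, unknown, dead).
    """
    new = current - previous          # crawled now, never seen before
    missing = previous - current      # known before, not crawled now
    unknown = {site for site in missing if resolves(site)}
    dead = missing - unknown          # no IP address associated with the site
    return new, unknown, dead
```

For example, `classify_sites(sites_2002, sites_2003)` would perform one DNS lookup per missing site; the set names here are placeholders.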

[Table 2. Number of sites per year. Rows: CRAWLED, NEW, UNKNOWN, DEAD. Data not recovered.]

In table 3 we give the relative size of each component. Notice the size of ISLANDS, which is near 40% of the Chilean Web sites. These sites are usually recent, and the main growth of the Web is in that component. We can also observe the growth of MAIN, which may indicate a more mature Web. As our collection is not complete, the percentages for MAIN are lower bounds while those for ISLANDS are upper bounds. Since we checked whether the non-crawled sites exist, but do not know the actual component they belong to, we can obtain upper and lower bounds for MAIN and ISLANDS by adding and subtracting, respectively, the number of sites with an unknown component.

¹ The domain name could still be registered and have a name server, though.

In table 4 we show the migration of sites among the components. There are two ways of reading these tables. By columns, we see which components the sites in each component in a given year come from. By rows, we see where the sites of each component in the previous year are now. In most cases the UNKNOWN sites will belong to ISLANDS or OUT, although in the latter case just one link back to MAIN is enough to place the site in MAIN. Notice that OUT and MAIN are quite stable components, because a large fraction of their sites stay there. It is also interesting to see that MAIN grows mainly from OUT or NEW sites, and that ISLANDS is the component with the largest growth and also the most deaths, followed by OUT (and not IN!).
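The two ways of reading a migration table correspond to normalizing the same counts by rows or by columns. The following minimal sketch shows this with two yearly snapshots; the function names are hypothetical, and for simplicity a site absent from the later snapshot is labeled DEAD (the UNKNOWN case is ignored) and a site absent from the earlier one is labeled NEW.

```python
from collections import Counter


def transition_counts(before, after):
    """Count sites moving between components in consecutive years.

    before, after: dicts mapping site name -> component label.
    """
    counts = Counter()
    for site in before.keys() | after.keys():
        src = before.get(site, "NEW")   # not present in the earlier year
        dst = after.get(site, "DEAD")   # simplification: missing means DEAD
        counts[(src, dst)] += 1
    return counts


def row_shares(counts):
    """By rows: where the sites of each source component went."""
    totals = Counter()
    for (src, _dst), n in counts.items():
        totals[src] += n
    return {(src, dst): n / totals[src] for (src, dst), n in counts.items()}


def column_shares(counts):
    """By columns: where the sites of each destination component came from."""
    totals = Counter()
    for (_src, dst), n in counts.items():
        totals[dst] += n
    return {(src, dst): n / totals[dst] for (src, dst), n in counts.items()}
```

A component is stable in the sense used above when its diagonal entry in `row_shares`, such as `("MAIN", "MAIN")`, is close to 1.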

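[Table 4. Migration of sites among components between consecutive years. Recovered labels: ISLANDS, TUNNEL, TIN, TOUT, UNKNOWN, DEAD, 2001. Data not recovered.]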

Web sites evolve and hence migrate inside the structure. First, a typical Web site should start as part of ISLANDS or IN (depending on whether or not it links to a good Web site). If the site becomes popular and also links to known sites, it migrates to MAIN. If its links are not well chosen or updated, it starts in or migrates to OUT. Figure 2 shows the expected life path of a website on its way to MAIN. We also include migrations from MAIN to OUT for sites that are not well maintained. On the other hand, the left side of figure 3 shows what really happened, aggregating all the transitions in our data (blue arrows are sites that disappear). The main differences from our intuition are that there are very few IN to MAIN and IN to ISLANDS transitions. However, some of the transitions involve changes in two links, for example from IN to OUT, or between MAIN and ISLANDS in either direction. Assuming that the two links do not appear exactly at the same time, the transition from IN to OUT went through MAIN or ISLANDS, ISLANDS to MAIN went through IN or OUT, and MAIN to ISLANDS went through OUT or IN. Taking the first choice in all three cases as the most probable, we get the right side of figure 3. This means that a finer time granularity of the Web snapshots is needed to understand 3.4% of the transitions.
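The rewriting of two-link transitions into two single-link steps can be sketched as follows. This is only an illustration of the reasoning above, under the assumption that each two-link transition is replaced by its first listed intermediate component; the mapping `VIA` and the function name are hypothetical, and the counts in the usage example are made up.

```python
from collections import Counter

# First-choice intermediate component for each two-link transition,
# as listed in the text: IN->OUT via MAIN, ISLANDS->MAIN via IN,
# MAIN->ISLANDS via OUT.
VIA = {
    ("IN", "OUT"): "MAIN",
    ("ISLANDS", "MAIN"): "IN",
    ("MAIN", "ISLANDS"): "OUT",
}


def expand_two_link(counts):
    """Replace each two-link transition by two single-link transitions."""
    expanded = Counter()
    for (src, dst), n in counts.items():
        mid = VIA.get((src, dst))
        if mid is None:
            expanded[(src, dst)] += n   # already a single-link transition
        else:
            expanded[(src, mid)] += n   # first link change
            expanded[(mid, dst)] += n   # second link change
    return expanded


# Usage with made-up counts:
example = {("IN", "OUT"): 3, ("IN", "MAIN"): 2,
           ("MAIN", "ISLANDS"): 1, ("MAIN", "OUT"): 4}
single_steps = expand_two_link(example)
```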

Concluding Remarks

The overall number of sites of the Chilean Web is almost doubling each year, as we believe that the 2003 data did not reflect the actual growth. That is the result of about a 100% increase combined with a 20% death rate. So, using a simple model for Web site growth as $f_n = (1 + \alpha - \beta) f_{n-1}$, where $\alpha$ is the growth rate and $\beta$ the death rate, according to our results we have $f_n \approx 1.8 f_{n-1}$. However, the Web grows continuously, and we only have one snapshot per year. Different time granularities for this type of data could be considered to see whether a one-year sampling is good enough.
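As a small worked example, the yearly model can be read as multiplying last year's number of sites by one plus the growth rate minus the death rate, which gives a factor of 1 + 1.0 - 0.2 = 1.8 for a 100% increase and a 20% death rate. The sketch below iterates that recurrence; the function name is hypothetical and the starting value of 1,000 sites in the usage line is made up for illustration, not taken from the data.

```python
def project_sites(initial_sites, growth_rate, death_rate, years):
    """Iterate f_n = (1 + growth_rate - death_rate) * f_(n-1).

    Returns the list [f_0, f_1, ..., f_years].
    """
    factor = 1 + growth_rate - death_rate
    sizes = [initial_sites]
    for _ in range(years):
        sizes.append(sizes[-1] * factor)
    return sizes


# Usage with a made-up starting size and the rates quoted in the text:
projection = project_sites(1000, growth_rate=1.0, death_rate=0.2, years=2)
```

Under these rates the projected size goes from 1,000 to about 1,800 and then to about 3,240 sites, that is, slightly less than doubling each year.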

There is still a lot to do to understand how the composition of the structure changes, but perhaps there are no formal processes behind it and it is just a transient phase. Another problem is the dynamics of the content of the sites. For example, the 100 largest sites (in pages) of each year involve 328 distinct sites over all years (so there are many changes in content), and only 6 and 60 sites were in the top 100 for 3 and 2 years, respectively. Although page counts depend on crawling policies, we have used more or less the same policies all the time, and the changes are quite radical.

Acknowledgements

We thank Edgardo Krell and Sebastian Castro from NIC Chile for providing the .CL domain data, and acknowledge the support of Millennium Nucleus Grant P01-029-F from Mideplan, Chile, and of Chilean Fondecyt Project 1020803 Grant from CONICYT.

References

[BYC01] Ricardo Baeza-Yates and Carlos Castillo. Relating web characteristics with link analysis. In String Processing and Information Retrieval. IEEE Computer Science Press, 2001.

[BYSJC02] Ricardo Baeza-Yates, Felipe Saint-Jean, and Carlos Castillo. Web dynamics, structure, and link ranking. In String Processing and Information Retrieval. Lecture Notes in CS, Springer, 2002.

[BYP03] Ricardo Baeza-Yates and Barbara Poblete. Evolution of the Chilean Web Structure Composition. In First Latin American World Wide Web Conference. IEEE CS Press, Santiago, Chile, November 2003.

[BKM+00] Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, and Tomkins. Graph structure in the Web: Experiments and models. In 9th World Wide Web Conference, 2000.