<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamics of the Chilean Web Structure</article-title>
      </title-group>
      <contrib-group>
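        <contrib contrib-type="author">
          <string-name>Ricardo Baeza-Yates</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Poblete</string-name>
          <email>bpoblete@dcc.uchile.cl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>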
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Web Research, Dept. of Computer Science, University of Chile</institution>
        </aff>
      </contrib-group>
      <fpage>96</fpage>
      <lpage>105</lpage>
      <abstract>
        <p>In this paper we present further results on the evolution of the structure of the Chilean Web between 2000 and 2003, focusing on the transitions of Web sites within the structure. Our results show that there are many stable Web sites, but also chaotic changes. We also expand our own results on the death behavior of Web sites.</p>
        <p>The Web is highly dynamic and not much is known about its evolution. There has been some work on page evolution, yielding models that predict when a page will change, but this behavior differs a lot from site to site. There are also generative models for Web growth, but they do not include Web death. In fact, new websites appear and others disappear, but little is known about how this happens. Less work has been done using sites or domains as the granularity for studying this evolution. In [BYP03] we presented the evolution of the structure composition of the Chilean Web at the site and domain level, based on data gathered between 2000 and 2002 by TodoCL.cl, a search engine targeted at this Web domain. In this paper we include data from 2003, extending our previous results. In addition, we focus not only on macro statistics, but also on the transitions of Web sites among the different structure components. That is, are the changes in the size of the components due to small transitions in one direction or to large transitions in both directions? Our results show that for some Web components the first is true, while for others the second is true.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <p>Table (collection statistics per year): Pages; Sites (crawled); Sites (known); Domains (crawled); Domains (known).</p>
        <p>Our results show how the structure evolves, how sites migrate from one component to another,
and where sites appear and disappear. The changes are dramatic, corroborating that perhaps we are
trying to study a process that is still in a transient phase, or that cannot be modeled in detail. This is a first
step to measure and follow the evolution of part of the Web structure, as well as to try to understand the process
behind the changes. To the best of our knowledge there are no other studies on Web composition as specific
as ours. Most statistical studies deal with global attributes such as language or size. We would have liked
to separate the Chilean Web into commercial, educational, governmental, etc. sites, but Chile does not use a
subdomain level indicating this, so the classification is not trivial.</p>
        <p>In section 2 we review the results on the structure of the Web and the problems faced to obtain it. Section
3 shows the evolution of this structure, and section 4 analyzes the migrations of Web sites in the structure in
relation to the expected typical life cycle of a Web site. The last section has some concluding remarks.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Web Structure</title>
      <p>The most complete study of the Web structure [BKM+00] focuses on page connectivity. One problem
with this is that a page is not a logical unit (for example, a page can describe several documents and one
document can be stored in several pages). Hence, we study how websites are connected,
as websites are closer to being real logical units. Not surprisingly, we found in [BYC01] that the structure
at the website level was similar to that of the global Web, and hence we use the same notation as [BKM+00]. The
components are:
a) MAIN, sites that are in the strongly connected component of the connectivity graph of sites (that is, we
can navigate from any site to any other site in the same component);
b) IN, sites that can reach MAIN but cannot be reached from MAIN;
c) OUT, sites that can be reached from MAIN, but from which there is no path back to MAIN; and
d) other sites that can be reached from IN (T.IN, where T is an abbreviation for tentacles), sites in
paths between IN and OUT (TUNNEL), sites that only reach OUT (T.OUT), and unconnected sites
(ISLANDS).</p>
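      <p>The component definitions above can be sketched in code. The following Python fragment is only a minimal illustration and not the procedure used to obtain our results: it assumes that the site graph is given as a list of (source, destination) link pairs and that one seed site known to belong to MAIN is supplied. The names reachable and classify_sites are hypothetical, and every site that fits none of the connected categories is labeled ISLANDS, which is a simplification.</p>
      <preformat>
```python
from collections import defaultdict


def reachable(adj, sources):
    """Return all nodes reachable from `sources` by following `adj`."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def classify_sites(sites, links, seed):
    """Label each site with a component, given a seed site assumed to be in MAIN."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for src, dst in links:
        fwd[src].add(dst)   # outgoing links
        bwd[dst].add(src)   # incoming links

    down = reachable(fwd, [seed])    # sites the seed can reach
    up = reachable(bwd, [seed])      # sites that can reach the seed
    main = down.intersection(up)     # strongly connected with the seed
    in_part = up - main
    out_part = down - main

    rest = set(sites) - main - in_part - out_part
    from_in = reachable(fwd, in_part).intersection(rest)   # reachable from IN
    to_out = reachable(bwd, out_part).intersection(rest)   # can reach OUT

    labels = {}
    for site in sites:
        if site in main:
            labels[site] = "MAIN"
        elif site in in_part:
            labels[site] = "IN"
        elif site in out_part:
            labels[site] = "OUT"
        elif site in from_in and site in to_out:
            labels[site] = "TUNNEL"
        elif site in from_in:
            labels[site] = "T.IN"
        elif site in to_out:
            labels[site] = "T.OUT"
        else:
            labels[site] = "ISLANDS"   # simplification: everything else
    return labels
```
      </preformat>
      <p>In practice MAIN would be taken as the largest strongly connected component rather than being derived from a seed; the seed is used here only to keep the sketch short.</p>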
      <p>In [BYC01] we analyzed the data for 2000 and extended this notation by dividing the MAIN
component into four parts:
a) MAIN-MAIN, which are sites that can be reached directly from the IN component and can reach
the OUT component directly;
b) MAIN-IN, which are sites that can be reached directly from the IN component but are not in
MAIN-MAIN;
c) MAIN-OUT, which are sites that can reach the OUT component directly, but are not in MAIN-MAIN;
d) MAIN-NORM, which are sites not belonging to the previously defined subcomponents.</p>
      <p>Figure 1 shows all these components. The average update time of pages and sites, and their relation to
structure and link ranking techniques, was studied in [BYSJC02] for the first two collections (2000 and
2001). We could consider domains in our study, but domains may contain sites that are quite different, for
example when an ISP hosts many sites under a common second-level domain such as co.cl.</p>
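      <p>The subdivision of MAIN can be illustrated in the same spirit. The Python sketch below assumes that every site already carries one of the coarse labels (MAIN, IN, OUT, and so on) in a dictionary and that the links are given as (source, destination) pairs; the function name split_main is hypothetical and the fragment is an illustration, not the code used for our measurements.</p>
      <preformat>
```python
def split_main(labels, links):
    """Refine MAIN labels using direct links from IN and direct links to OUT."""
    # MAIN candidates that receive a direct link from some IN site
    linked_from_in = {dst for src, dst in links if labels.get(src) == "IN"}
    # MAIN candidates that link directly to some OUT site
    links_to_out = {src for src, dst in links if labels.get(dst) == "OUT"}

    refined = dict(labels)
    for site, label in labels.items():
        if label != "MAIN":
            continue
        has_in = site in linked_from_in
        has_out = site in links_to_out
        if has_in and has_out:
            refined[site] = "MAIN-MAIN"
        elif has_in:
            refined[site] = "MAIN-IN"
        elif has_out:
            refined[site] = "MAIN-OUT"
        else:
            refined[site] = "MAIN-NORM"
    return refined
```
      </preformat>
      <p>Only direct links are inspected, which matches the word "directly" in the definitions of the four subcomponents.</p>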
      <p>Figure 1: Components of the Web structure: IN, T.IN, MAIN (divided into MAIN-MAIN, MAIN-IN, MAIN-OUT and MAIN-NORM), OUT, TUNNEL, T.OUT and ISLANDS.</p>
      <sec id="sec-2-7">
        <p>Because any crawl is incomplete (for example, dynamic pages can be unbounded), any Web graph is
incomplete. That means that any analysis of the Web structure will be an approximation, even more so in
our case, as we do not consider paths through links outside the Chilean Web. On the other hand, our Web
subset is a very coherent one and is not just a Web sample. To know if a site exists, it is enough to crawl the
home page. To know all the links of that site, however, a thorough crawl of the site is needed, and
many sites, sometimes out of ignorance, do not allow crawlers to enter. For example, in 2001, 56% of
the domains and 54% of the sites had only one page. However, for 25% of them (14% of the total) this was because
they had an initial Flash page or called a similar kind of program.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evolution of the Structure Composition</title>
      <p>Table 2 shows the number of sites that have appeared and disappeared from year to year, out of a total of
56,699 different sites belonging to 49,790 domains that were crawled at some point, 56,020 of them being .cl. As of
March 1st, 2004, there were 75,758 registered .cl domains with a name server, approximately 56,100 of them
with a Web server; our coverage was around 67% in 2003 and higher in 2002. The last three rows represent
the new sites (NEW), the sites that were not crawled but exist (UNKNOWN), and the sites that disappeared
(DEAD), respectively. UNKNOWN includes non-crawled existing sites and sites with connectivity or access
problems. NEW sites may not be really new, as the crawling coverage is not 100%. Death of a site means
that there is no IP address associated with it (this might be incorrect if the site changes its name, but then it is
considered a new site, and there are few such cases), and death of a domain means that there are no sites
associated with it (in particular the domain name itself or the name prefixed by www)<sup>1</sup>.</p>
      <sec id="sec-3-1">
        <p>Table 2: Number of sites per year, with rows CRAWLED, NEW, UNKNOWN and DEAD.</p>
        <p>In table 3 we give the relative size of each component. Notice the size of ISLANDS, which is near
40% of the Chilean Web sites. These sites are usually recent, and the main growth of the Web is in that
component. We can also observe the growth of MAIN, which may indicate a more mature Web. As our
collection is not complete, the percentages for MAIN are lower bounds while for ISLANDS they are upper
bounds. Because we checked whether non-crawled sites exist, but do not know the actual component
they belong to, we can obtain upper and lower bounds for MAIN and ISLANDS by adding and subtracting
the number of sites with an unknown component, respectively.</p>
        <p><sup>1</sup> The domain name could still be registered and have a name server, though.</p>
        <p>In table 4 we show the migration of sites among the components. There are two ways of reading these
tables. By columns, we see which components the sites in each component in a given year came from. By
rows, we see where the sites of each component in the previous year are now. In most cases the
UNKNOWN component sites will belong to ISLANDS or OUT, although in the latter case just one
link back to MAIN is needed to have that site in MAIN. Notice that OUT and MAIN are quite stable components,
because a large fraction of their sites stay there. It is also interesting to see that MAIN grows mainly from
OUT or NEW sites, and that ISLANDS is the component with the largest growth and also death, followed by
OUT (and not IN!).</p>
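        <p>A migration table of this kind, and its two readings, can be sketched as follows. The Python fragment assumes that each yearly snapshot is a dictionary from site name to component label, that a site missing from the earlier snapshot counts as NEW, and that a site missing from the later one counts as DEAD (sites that exist but were not crawled would have to be labeled UNKNOWN in the later snapshot). The function names are hypothetical and the fragment is an illustration only.</p>
        <preformat>
```python
from collections import Counter


def migration_table(prev, curr):
    """Count sites by (component last year, component this year)."""
    table = Counter()
    for site in set(prev) | set(curr):
        src = prev.get(site, "NEW")    # assumption: unseen before means NEW
        dst = curr.get(site, "DEAD")   # assumption: absent now means DEAD
        table[(src, dst)] += 1
    return table


def row_shares(table, src):
    """Reading by rows: where the sites of component `src` are now."""
    total = sum(n for (s, _), n in table.items() if s == src)
    if total == 0:
        return {}
    return {d: n / total for (s, d), n in table.items() if s == src}


def column_shares(table, dst):
    """Reading by columns: where the sites now in `dst` came from."""
    total = sum(n for (_, d), n in table.items() if d == dst)
    if total == 0:
        return {}
    return {s: n / total for (s, d), n in table.items() if d == dst}
```
        </preformat>
        <p>A component is stable in the sense used above when the share returned by row_shares for staying in the same component is large.</p>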
      </sec>
      <sec id="sec-3-5">
        <p>Table 4 (migration of sites among the components): ISLANDS, TUNNEL, T.IN, T.OUT, UNKNOWN, DEAD; 2001.</p>
        <p>Web sites evolve and hence migrate inside the structure. First, a typical Web site should start as part
of ISLANDS or IN (depending on whether or not it links to a good Web site). If the site becomes popular and
also links to known sites, it migrates to MAIN. If its links are not well chosen or updated, it starts
in or migrates to OUT. Figure 2 shows the expected life path of a website on its way to MAIN. We also
include migrations from MAIN to OUT if the site is not well maintained. On the other hand, the left side
of figure 3 shows what really happened, aggregating all the transitions in our data (blue arrows are sites
that disappear). The main differences from our intuition are that there are very few IN to MAIN and IN
to ISLANDS transitions. However, some of the transitions involve changes in two links, for example from
IN to OUT or MAIN to or from ISLANDS. Assuming that the two links do not appear exactly at the same
time, the transition from IN to OUT went through MAIN or ISLANDS, ISLANDS to MAIN went through
IN or OUT, and MAIN to ISLANDS went through OUT or IN. Taking the first choice in all three cases as
the most probable, we get the right side of figure 3. This means that a finer time granularity of the Web
snapshots is needed to understand 3.4% of the transitions.</p>
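        <p>The rewriting of two-link transitions into single-link steps can be sketched as below. The Python fragment assumes that the observed transitions are given as counts keyed by (source, destination) component, and that each two-link transition contributes its full count to both intermediate steps; that accounting rule, like the names VIA and expand_two_link, is an assumption made for illustration.</p>
        <preformat>
```python
from collections import Counter

# First-choice intermediate component for each two-link transition,
# as stated in the text: IN to OUT via MAIN, ISLANDS to MAIN via IN,
# and MAIN to ISLANDS via OUT.
VIA = {
    ("IN", "OUT"): "MAIN",
    ("ISLANDS", "MAIN"): "IN",
    ("MAIN", "ISLANDS"): "OUT",
}


def expand_two_link(transitions):
    """Replace each two-link transition by two single-link steps."""
    single = Counter()
    for (src, dst), count in transitions.items():
        mid = VIA.get((src, dst))
        if mid is None:
            single[(src, dst)] += count    # already a single-link change
        else:
            single[(src, mid)] += count    # first link change
            single[(mid, dst)] += count    # second link change
    return single
```
        </preformat>
        <p>With yearly snapshots the intermediate step is never observed, so any such expansion remains a guess about what happened between two crawls.</p>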
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Concluding Remarks</title>
      <p>The overall number of sites of the Chilean Web is almost doubling each year, as we believe that the
2003 data did not reflect the actual growth. That is the result of about a 100% increase plus a 20% death.
So, using a simple model for Web site growth such as <inline-formula><tex-math>f_n = (1 + \alpha - \beta) f_{n-1}</tex-math></inline-formula>, where <inline-formula><tex-math>\alpha</tex-math></inline-formula> is the growth rate and <inline-formula><tex-math>\beta</tex-math></inline-formula> the
death rate, according to our results we have <inline-formula><tex-math>f_n \approx 1.8 f_{n-1}</tex-math></inline-formula>. However, the Web grows continuously, and
we only have one snapshot per year. Different time granularities for this type of data could be considered to
see if a one-year sampling is good enough.</p>
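      <p>As a worked illustration of this growth model, the following Python sketch projects the number of sites under a constant yearly growth rate and death rate. The function name project_sites and the starting value of 1,000 sites are hypothetical; only the rates (about 100% growth and 20% death, giving a factor of about 1.8) come from the text.</p>
      <preformat>
```python
def project_sites(initial, growth_rate, death_rate, years):
    """Apply the yearly factor (1 + growth_rate - death_rate) repeatedly."""
    factor = 1 + growth_rate - death_rate
    counts = [initial]
    for _ in range(years):
        counts.append(counts[-1] * factor)
    return counts


# Hypothetical starting point of 1,000 sites, projected three years ahead
# with the rates quoted in the text (100% growth, 20% death).
projection = project_sites(1000, 1.0, 0.2, 3)
```
      </preformat>
      <p>Because the factor is applied once per year, the sketch shares the limitation noted above: it says nothing about how the number of sites changes between two yearly snapshots.</p>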
      <p>There is still a lot to do to understand how the composition of the structure changes, but perhaps there
are no formal processes behind it and it is just a transient phase. Another problem is the dynamics of the sites'
content. For example, the 100 largest sites (in pages) of each year involve 328 different sites over all years (so there are
many changes in content), and only 6 and 60 sites were in the top 100 for 3 and 2 years, respectively. Although
page count depends on crawling policies, we have used more or less the same policies all the time, and the
changes are quite radical.</p>
      <p>Acknowledgements</p>
      <p>We thank Edgardo Krell and Sebastian Castro from NIC Chile for providing the .CL domain
data, and acknowledge the support of Millennium Nucleus Grant P01-029-F from Mideplan, Chile, and of Chilean
Fondecyt Project 1020803 Grant from CONICYT.</p>
      <p>[BYC01] Ricardo Baeza-Yates and Carlos Castillo. Relating web characteristics with link analysis. In
String Processing and Information Retrieval. IEEE Computer Science Press, 2001.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [BYSJC02]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Felipe</given-names>
            <surname>Saint-Jean</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Castillo</surname>
          </string-name>
          .
          <article-title>Web dynamics, structure, and link ranking</article-title>
          . In
          <source>String Processing and Information Retrieval. Lecture Notes in CS</source>
          , Springer,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
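        <mixed-citation>
          [BYP03]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Poblete</surname>
          </string-name>
          .
          <article-title>Evolution of the Chilean Web Structure Composition</article-title>
          . In
          <source>First Latin American World Wide Web Conference</source>
          . IEEE CS Press, Santiago, Chile,
          <year>November 2003</year>
          .
        </mixed-citation>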
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [BKM+00]
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Maghoul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajagopalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomkins</surname>
          </string-name>
          .
          <article-title>Graph structure in the Web: Experiments and models</article-title>
          . In
          <source>9th World Wide Web Conference</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>