Hierarchical Summarizing and Evaluating for
                    Web Pages

            Kou TAKAHASHI1 , Takao MIURA1 , and Isamu SHIOYA2
            1
              HOSEI University, Kajinocho 3-7-2, Koganei, Tokyo, JAPAN
        2
            Sannno University,Kamikasuya 1563, Isehara, Kanagawa, JAPAN

      Abstract. In this investigation we propose a novel method for summarizing
      Web pages using a hierarchical expression. We discuss the close relationship
      between summarization and hierarchical clustering used to obtain the results,
      and we examine how to evaluate hierarchical summarization from both
      correlation and structural aspects. We report experimental results on
      NTCIR Web documents to examine our method.

1   Introduction
Nowadays we face a huge number of Web pages, which keeps growing every day.
Whenever we want to know the current state of a topic of interest, we can easily
examine what is going on all over the world through these pages. At the same
time, however, this flood of information makes it impossible for us to grasp the
contents quickly and intuitively. This is why much attention has been paid to
how to grasp and summarize the contents quickly [7, 4]. Several approaches have
been proposed to attack these issues; among them, two techniques are important:
clustering and summarization [12, 4].
    Clustering is a method of putting objects into several groups in such a way
that objects in a group are similar to each other while objects in distinct groups
are not [6]. In other words, clustering depends on both the definition of
similarity and the algorithm by which we extract interesting hidden patterns.
Most information in Web pages is categorical (say, "Computer Science",
"Biology", "Mathematics") rather than numerical, so that notions of metric or
order are hard to interpret. This is the reason why traditional clustering
techniques are not well suited.
    Summarization presents the important contents of an information source in a
simplified form that we can grasp quickly. In research on automatic
summarization, two major aspects have been discussed: extraction and
abstraction. To extract the important parts of the contents, we assign weights
such as frequency to words or paragraphs and score them, and then select
sentences in descending order of score. To abstract the contents, on the other
hand, we must analyze them by means of Natural Language Processing (NLP) or
Information Retrieval (IR), and thus few approaches have been proposed so far.
    In this work, we propose a new kind of Web summarization technique based
on hierarchical clustering. It combines two ideas, Web page clustering and
Web page summarization. We focus on the frequency and co-occurrence
of both index words and hyperlinks appearing in the Web pages. We decompose
the page contents into semantic textual units (STUs), as described later.
With the help of hierarchical clustering, we extract a structure among
the collection of STUs and regard this structure as an abstraction of the
contents in the form of a hierarchy. We obtain a center STU from each cluster,
and thus represent the summarization in terms of the structure among STUs.
    We must also discuss how to evaluate the hierarchical summarization.
Evaluation is generally difficult for both clustering and summarization, because
the correct output cannot be defined. In this investigation we propose a novel
evaluation method from the viewpoints of the readability of both clusters and
hierarchical structures, as well as reading comprehension. We evaluate the
hierarchical summarization in terms of a penalty, called cost.
    In section 2 we introduce several issues of automatic summarization and its
application to the Web, and in section 3 we discuss a new technique for
summarizing the contents in a hierarchical manner. In section 4 we propose a
novel evaluation method for hierarchical expressions. We show some experimental
results in section 5, and conclude our work in section 6.
2   Automatic Summarization
In this section we review automatic summarization, focusing on
Web pages [7]. Generally, summarization is the process of distilling the most
important information from document sources for particular users and tasks.
Automatic summarization for particular users is used to grasp the contents of
documents (such as news articles). Three kinds of techniques have been proposed:
extracting, abstracting and clustering.
    Extracting means identifying the parts of the document most relevant to its
main theme. This approach is advantageous because statistical measures such as
frequency and correlation very often correspond to relevance or importance. It
needs no background knowledge of NLP, and the process can be automated easily
and efficiently. On the other hand, the results may lack coherence because of
dangling anaphors such as pronouns and conjunctions [7].
    Abstracting means putting text objects into more general and super-ordinate
concepts so that the results don’t contain the expressions not appeared in the
objects. The approach is advantageous because the results should be coherent
and a better compression rate can be expected. On the other hand, we
must analyze what the documents mean in order to generalize the context to some
extent, which requires NLP or IR techniques [7].
    Clustering means collecting objects into several groups according to a
certain similarity. It can extract the underlying semantics of the objects, but
we need other techniques, such as labeling the groups, to interpret the results.
The approach is advantageous because clustering can be applied to objects easily
and efficiently without any NLP knowledge, but it is difficult to define
similarity and to interpret the results, although several labeling methods have
been proposed [6, 8].
    As an application of automatic summarization to Web pages, extracting
has been discussed for hand-held devices [3]. The technique is
indispensable because their displays are very small for Web browsing. Web pages are
       LATEX style ﬁle for Lecture Notes in Computer Science – documentation     3

divided into small units, called semantic textual units (STUs), where each STU is
defined as a unit of meaningful content, usually corresponding to a paragraph or
a caption. The first line of every STU is sent for browsing. For closer reading,
users may ask for the first 3 lines or for all lines.
     A different approach comes from Web clustering, since each cluster
corresponds to a collection of closely related Web pages. Labeling clusters
corresponds to summarizing the page contents, because the labels represent
what the contents of the cluster mean. Very often frequent words can be regarded
as the labels of the pages [7]. But because pages obtained by search engines are
strongly similar to each other, frequency cannot distinguish one from the
others [8]. A better approach is to use important words rather than frequent
words, since these words let us see quickly what is going on. Moreover, it
has been pointed out that labels consisting of word sequences, rather than
single words, help us grasp the contents. A typical approach is Suffix Tree
Clustering (STC) [15], where sequences of words are used. However, STC relies on
the strong assumption that we are given a set of documents that are closely
related to each other (for example, the hit list of a search engine). Another
study extracts important words based on Key-Graph [10] and word sequences
based on Suffix Tree [15], and then combines both results to abstract the pages
[9]. However, the results depend on temporal aspects as well as on clustering.

3   Summarizing Pages Hierarchically
Web documents consist of tag parts and link specifications as well as textual
parts. We examine relationships among Web pages and propose structuring as a
novel method of automatic summarization. By structuring objects, we mean
a technique for restating documents in terms of a data structure. Since any data
structure carries its own meaning, we use a specific structure, concisely and
clearly, as a substitute for parts of documents, and thus we understand what the
information implies. Note that a data structure can be applied to any situation
because data structures are polymorphic and require no NLP knowledge.
    However, what kinds of structures are suitable for our situation, and how
should we organize our documents appropriately? What we work with are either
words (minimum lexical units) or sentences (minimum semantic units). In the
former case, we can examine fine semantics but can hardly grasp a global view
because of the detailed relationships among words. In the latter case, it seems
easier to examine relationships among sentences, although the outlines depend on
syntax. Frequency and correlation help us find the important parts for words,
while the distance between sentences and the centroids of document clusters
shows us what parts should be extracted. Thus, to grasp the contents of
documents, important words and the relationships among them, or document
clusters, play an important role. There can be several expressions for
describing the semantic structure of the contents. Possible examples are
directed graphs and trees, since we expect an abstraction mechanism that
represents the contents at a variety of levels. In a tree, nodes at higher
levels correspond to an overview (global viewpoint) while nodes at lower levels
correspond to details (local aspects). One such technique is Key Graph [10], by
which the relationships can be modeled by means of graphs. However, it has no
mechanism for interpreting the contents at several levels of abstraction.
    In this investigation, we propose a novel technique for Web page
summarization. One of the main issues is how to group a huge number of Web
pages. It is well known that naive clustering of Web pages generates a few
gigantic clusters and many trivial clusters, so that the clusters provide no
new knowledge. However, we have already discussed a sophisticated clustering
method for Web documents that combines correlation of hyperlinks with naive
clustering under vector space modeling [12]. With these results, let us go one
step further. In this work, we apply hierarchical clustering to each group of
Web pages using STUs, and we discuss a new "labeling" method for each group by
taking a centroid. The relationship among the clusters corresponds to the
hierarchical structure shown in figure 1.
[Figure: Documents 1 to 4, each a sequence of sentences (Sentence 1 ... Sentence n),
are decomposed into STU 1 ... STU n, which are arranged in a hierarchy of STUs.]

                                                Fig. 1. Overview
3.1   Combination Clustering
In this section and the next, we discuss how to build a hierarchical
summarization of Web pages. The process consists of two steps. In the first
step, we put a collection of Web pages into disjoint groups by combining two
kinds of clustering results; we call this combination clustering. In the second
step, we summarize each group. The final result is a set of hierarchical
expressions that can be seen as summaries of the Web page groups.
    Our technique of combination clustering is based on the idea that similar
documents share many hyperlinks. We extract characteristic hyperlinks as well
as index words to distinguish documents from each other. This approach has been
shown to be effective [12]. Let us outline the approach with examples. First, we
focus on the co-occurrence of hyperlinks and cluster the pages accordingly;
we call this LINK clustering. Then we extract index terms, build the
vector space and cluster the document vectors. This traditional approach is
called VSM clustering. Finally we combine the two results into one to obtain a
new clustering of Web pages. Each group is not only characterized by some topic,
but its pages are also tightly connected to each other in terms of contents.

EXAMPLE 31 Let us show LINK clustering by an example. By interpreting
nodes as Web pages and arcs as hyperlinks, we can represent Web pages by a
graph, the number of references by in-degree, and so on. Suppose there are 6
nodes a1, ..., a6 as in figure 2. Let From(a) be the set of arcs starting from a,
called the set of out-arcs of a; its cardinality is called the out-degree of a.
Similarly, the set of arcs ending at node b, To(b), is called the set of in-arcs
of b; its cardinality is called the in-degree of b. We apply complete link
hierarchical clustering. This process is called LINK clustering. We get two LINK
clusters A1 and A2: A1 = {a1, a2, a4}, A2 = {a3, a6}. The node a5 forms a
singleton cluster, which we remove.

[Figure 2: graph of Web pages a1, ..., a6 and link targets b1, ..., b4.]
[Figure 3: document vectors a1 = (1,0,0,0,0), a2 = (0,0,1,1,1), a3 = (0,0,1,1,0),
 a4 = (1,1,0,0,0), a5 = (0,0,0,1,0), a6 = (0,0,1,1,1).]
[Figure 4: LINK clusters and VSM clusters overlaid.]

Fig. 2. LINK Clustering           Fig. 3. VSM Clustering               Fig. 4. Combination Clusters
EXAMPLE 32 VSM clustering is nothing but clustering by document vectors.
For example, given 6 Web pages P = {a1, ..., a6} with the document vectors
in figure 3, we apply complete link hierarchical clustering. This approach is
called VSM clustering, and each cluster is called a VSM cluster. We get 2
clusters B1, B2: B1 = {a1, a4}, B2 = {a2, a3, a5, a6}. Then let us combine the
two results. In figure 4, the two ovals represent the LINK clusters A1, A2 while
the two rectangles describe the VSM clusters B1, B2.
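As a rough check of Example 32, the following Python sketch runs complete-link agglomerative clustering on the six document vectors of figure 3. The cosine distance, the stopping rule (stop when k = 2 clusters remain) and all function names are assumptions made for illustration; the text only states that complete link hierarchical clustering is applied.

```python
import math
from itertools import combinations

# Document vectors as read from figure 3.
VECTORS = {
    "a1": (1, 0, 0, 0, 0), "a2": (0, 0, 1, 1, 1), "a3": (0, 0, 1, 1, 0),
    "a4": (1, 1, 0, 0, 0), "a5": (0, 0, 0, 1, 0), "a6": (0, 0, 1, 1, 1),
}

def cosine_distance(u, v):
    """1 - cosine similarity (an assumed distance; the example does not fix one)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0

def complete_link_distance(a, b, vectors):
    """Complete link: distance of the farthest pair across two clusters."""
    return max(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)

def complete_link_clustering(vectors, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [frozenset([name]) for name in vectors]
    while len(clusters) > k:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: complete_link_distance(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

vsm_clusters = complete_link_clustering(VECTORS, k=2)
print(sorted(sorted(c) for c in vsm_clusters))
```

Under these assumptions the sketch groups a1 with a4 and a2, a3, a5 with a6, which agrees with the clusters B1 and B2 of the example.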
EXAMPLE 33 In examples 31 and 32, we obtain two combined clusters C11 =
{a1, a4} and C22 = {a3, a6}, but we discard the small clusters C12 = {a2} and
C02 = {a5}, so the final result consists of the two groups C11 and C22.
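The combination step of Examples 31 to 33 can be sketched as a pairwise intersection of LINK clusters and VSM clusters. This is only one reading of the examples: the function name `combine_clusters` and the threshold `min_size=2` are hypothetical, since the text says only that small clusters are discarded.

```python
def combine_clusters(link_clusters, vsm_clusters, min_size=2):
    """Intersect every LINK cluster with every VSM cluster; keep large groups.

    min_size is an assumed threshold for "small" clusters.
    """
    combined = {}
    for i, link_cluster in link_clusters.items():
        for j, vsm_cluster in vsm_clusters.items():
            common = link_cluster & vsm_cluster
            if len(common) >= min_size:
                combined[f"C{i}{j}"] = common
    return combined

# Clusters taken from Examples 31 and 32 (a5 was removed from the LINK result).
link_clusters = {1: {"a1", "a2", "a4"}, 2: {"a3", "a6"}}        # A1, A2
vsm_clusters = {1: {"a1", "a4"}, 2: {"a2", "a3", "a5", "a6"}}   # B1, B2

groups = combine_clusters(link_clusters, vsm_clusters)
print(groups)
```

With the clusters of the examples, only C11 and C22 survive; the singleton intersection {a2} falls below the assumed threshold.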

3.2        Hierarchical Expression
Let us discuss the technique for structuring each group of Web pages [13]. We
assume that sets of Web pages have been obtained through combination clustering.
Web pages are divided into small units, called Semantic Textual Units (STUs),
which are defined as units of meaningful content that capture the intended
semantics of paragraphs or caption parts (in, say, an alt attribute) of Web
pages. A similar approach is found in [3].
    Well-formed Web pages contain string parts and tag parts in a nested manner.
Any string enclosed by a tag (i.e. <tag> ... </tag>) constitutes an STU that
carries the meaningful unit of semantics intended by the tag. When the string
contains tag structures inside, the STU is called nested. When analyzing STU
values, we extract STUs together with their nested structure, called STU (Nest).
In this investigation, we examine <UL>, <OL>, <DL>, <TITLE>, <TABLE> and
<BLOCKQUOTE>.
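To make the notion of STU and STU (Nest) concrete, here is a minimal sketch using Python's standard `html.parser` module. The class `STUExtractor`, the helper `extract_stus` and the sample page are hypothetical; the sketch only collects the text enclosed by the six tags listed above together with a nesting depth, and it does not perform the link replacement of STU (Link).

```python
from html.parser import HTMLParser

# Tags examined in this investigation (HTMLParser reports tag names in lowercase).
STU_TAGS = {"ul", "ol", "dl", "title", "table", "blockquote"}

class STUExtractor(HTMLParser):
    """Collects (tag, nesting_depth, text) for every STU-forming tag."""

    def __init__(self):
        super().__init__()
        self.open_stus = []   # stack of [tag, text_parts] for currently open STUs
        self.stus = []        # finished STUs

    def handle_starttag(self, tag, attrs):
        if tag in STU_TAGS:
            self.open_stus.append([tag, []])

    def handle_data(self, data):
        # Text belongs to every open STU, so an outer STU includes nested text.
        for _, parts in self.open_stus:
            parts.append(data)

    def handle_endtag(self, tag):
        if self.open_stus and self.open_stus[-1][0] == tag:
            _, parts = self.open_stus.pop()
            text = " ".join(" ".join(parts).split())  # normalize whitespace
            self.stus.append((tag, len(self.open_stus), text))

def extract_stus(html):
    parser = STUExtractor()
    parser.feed(html)
    parser.close()
    return parser.stus

# Hypothetical page; the words are placeholders.
page = ("<html><head><title>A B</title></head><body>"
        "<ul><li>A</li><li>B C</li></ul>"
        "<blockquote>C <ul><li>D E</li></ul></blockquote>"
        "</body></html>")
for stu in extract_stus(page):
    print(stu)
```

A depth of 0 marks a top-level STU and a depth of 1 marks an STU nested inside another one, so the inner list of the sample page is reported both on its own and as part of the enclosing block quote.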
    One special tag, <A HREF>, describes a link specification of the Web page,
sometimes called a context. Here we replace the specification part by the
contents of the linked Web page. This is called STU (Link).
    All the words appearing in these STUs are represented using vector space
modeling. We take the index terms extracted from all the STUs, count
the frequencies of the string expressions, and reduce the dimensionality by
using Zipf's law. The similarity of two STUs is then defined as the cosine
value, as usual.
    We apply a hierarchical clustering algorithm to extract the hierarchical
relationship among Web pages. By examining the distances between clusters and
repeatedly merging the two clusters at minimum distance, we obtain the
hierarchical structure among the clusters (we repeat this merge process until
only one big cluster remains). The results depend on the definition of distance
(similarity) between clusters and on the initial set of Web pages, so we must
consider how to construct the hierarchy. Well-known definitions of the distance
(similarity) between clusters are the single linkage, complete linkage and
average linkage methods, which take the distance between two clusters to be the
minimum, maximum and average distance, respectively, between STUs of the two
clusters. The first method is not suitable for our purpose since noisy STUs
(accidental similarity) cannot be avoided. We build the hierarchy with the other
two methods in this investigation. Moreover, to overview the whole relationship,
we define the centroid STU of each cluster as the STU nearest to the average of
the STU vectors in the cluster, and build up the hierarchy consisting of these
center STUs. To label the clusters, we provide the option of using the centroids
as representative sentences of the clusters. Because the centroid follows from
the definition of similarity, our method can be seen as a way of selecting the
most important sentence as the summary.
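The three linkage definitions and the centroid STU can be sketched in a few lines of Python. The binary word vectors, the use of cosine similarity to find the "nearest" STU, and the names `linkage_distance`, `centroid_stu`, `stu_x`, `stu_y` and `stu_z` are assumptions for illustration; the vectors actually used in the investigation are frequency based and may give different values.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def linkage_distance(a, b, method):
    """Distance between two clusters (lists of vectors) under a linkage method."""
    distances = [1.0 - cosine(u, v) for u in a for v in b]
    if method == "single":       # minimum pairwise distance
        return min(distances)
    if method == "complete":     # maximum pairwise distance
        return max(distances)
    if method == "average":      # mean pairwise distance
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage method: {method}")

def centroid_stu(cluster):
    """Name of the STU whose vector is most similar to the cluster mean."""
    vectors = list(cluster.values())
    mean = tuple(sum(column) / len(vectors) for column in zip(*vectors))
    return max(cluster, key=lambda name: cosine(cluster[name], mean))

# Hypothetical cluster over the words (A, B, C, D, E).
cluster = {
    "stu_x": (0, 0, 1, 0, 1),   # words C, E
    "stu_y": (0, 0, 1, 1, 1),   # words C, D, E
    "stu_z": (0, 0, 1, 1, 1),   # words C, D, E
}
center = centroid_stu(cluster)
print(center, cluster[center])
```

In this toy cluster the center is an STU with the words C, D and E, because two of the three members share that vector and pull the mean towards it.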
EXAMPLE 34 Let us explain our approach by example. In figure 5, given two
Web pages containing the 5 words A, B, C, D and E, we generate 5 STUs. Then we
obtain a hierarchical clustering by the average linkage method; figure 6
contains the details of the result. As shown in figure 5,
STU1 corresponds to a <TITLE>, STU2 to a <UL> and STU5 to a <BLOCKQUOTE>
without any nesting. On the other hand, STU4 contains both a word E enclosed
by an <A> and a word C appearing in the linked Web page, and STU3 contains STU3
as an inner structure. The similarity value 1.000 means that the cluster is
generated by STU3 and STU5.

[Figure 5: two HTML files (not reproduced).]
[Figure 6: hierarchy drawn as a tree, node labels as printed.
 Root: (5) [0.222] CDE, with children (2) [0.666] AB and (3) [0.666] CDE;
 the latter contains (2) [1.000] CDE.
 Leaves, left to right: AB, ABC, CDE, CE, CDE, labeled STU1, STU2, STU4, STU3, STU5.]

         Fig. 5. Two HTML files          Fig. 6. Hierarchical Expression
4   Evaluating Hierarchical Pages
When we summarize a set of Web pages, we wonder whether the answers are really
correct. The same is true of machine translation. Even if we prepare typical
answers by hand and examine the difference between the summarization results and
the typical answers, it is still hard to evaluate how good the results are [7].
This is because the judgment is subjective. One approach that has been
proposed requires agreement on the summarization results among
the participants [11]. This means that summarization by hand is not always
correct. In this section we discuss several evaluation methods for automatic
summarization [7], and focus on the evaluation of hierarchical topic
detection. We propose three measures for evaluating the hierarchical expression
of Web pages, that is, (1) Cluster readability, (2) Hierarchy readability and
(3) Reading comprehension. We call the evaluation method the CHR method, after
the three initial letters.


4.1     Evaluating Summarization

Several measures have been proposed so far for evaluating summarization
results [7]. The first is the compression rate, the ratio between the lengths
of the source and the summary. Clearly a summary with a high compression rate
cannot contain all the contents of the source. The second is informativeness, a
measure of how much of the content of the source is preserved. The
idea is that a summary should contain part of the contents of the source.
The compression rate and informativeness trade off against each other.
    Traditionally the F-measure (footnote 3) has been proposed and discussed.
This measure corresponds to the evaluation of informativeness in both clustering
and extracting. However, we cannot apply the F-measure to our case since it is
hard to obtain all the answers in advance. Moreover, in the case of abstracting,
it is difficult to evaluate informativeness.
    Another approach measures how many words overlap, using the Dice
coefficient and the cosine similarity. Unfortunately, this approach requires us
to resolve several issues of synonyms, homonyms and construction information.
For example, we would have to consider thesauri, Latent Semantic Indexing,
syntactic analysis and conversation understanding of natural languages.
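The word-overlap measures mentioned here can be illustrated with a short Python sketch. The function names and the sample word lists are hypothetical; the Dice coefficient is computed on word sets and the cosine similarity on word frequencies, which is one common formulation rather than necessarily the one intended by the cited work.

```python
import math
from collections import Counter

def dice(a_words, b_words):
    """Dice coefficient on word sets: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a_words), set(b_words)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def cosine(a_words, b_words):
    """Cosine similarity on word-frequency vectors."""
    a, b = Counter(a_words), Counter(b_words)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

summary = "a b c".split()
source = "b c d".split()
print(dice(summary, source), cosine(summary, source))

# Synonyms share no surface form, so pure word overlap scores them as unrelated.
print(dice(["car"], ["automobile"]))
```

The last call shows the synonym issue raised above: two words with the same meaning but different spelling contribute nothing to the overlap.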
    Let us turn to the evaluation of readability and reading comprehension.
Readability is a measure of "how well we can understand the
summarization results". Mani [7] introduces some scoring techniques for dangling
anaphors and context. Reading comprehension examines how well a reader
who has read the summary understands the contents. Mani also introduces
sophisticated methods for this [7].


4.2     Evaluating Hierarchical Topic Detection

What points should we pay attention to when evaluating summarization
by hierarchical expression? Some techniques have been proposed
for hierarchical topic detection, so let us discuss them as examples. The topic
detection and tracking (TDT) task includes topic detection, where news articles
are organized into clusters that correspond to events (or topics) [1]. Two
assumptions are made here about news articles. The first is that each news
article describes only one event and is assigned to only one cluster. The second
is that any hierarchical relationship among events is ignored. As a result
of these assumptions, a topic detection system assigns each article to one of
the non-hierarchical groups. Obviously these assumptions do not reflect the
actual situation. A new Task Definition and Evaluation Plan was introduced in
TDT 2004 [2] for the hierarchical topic detection task. The evaluation method
used for the old topic detection task is no longer suitable for this new task.

Footnote 3: The F-measure combines recall (r) and precision (p) into one formula,
$F = \frac{2rp}{r+p}$, where recall is the ratio of correctly retrieved documents
to all the answer documents and precision is the ratio of retrieved documents
that are correct.
    In TDT 2004, Allan proposed a new evaluation method for the hierarchical
topic detection task [2]. This work has inspired many research activities.
Trieschnigg examines three methods based on the method by Allan [14]. In
particular, that work revealed that the detection cost is not proportional to
the "power set" by focusing on the false alarm cost and the miss detection cost,
where a power set means a cluster that includes all topics. Fiscus and
Doddington discuss the fact that a miss costs more than a false alarm [5]. To
evaluate the detection cost, Allan introduced the 4 combinations of system
output and clustering shown in table 1, where in cluster means that an element
is properly put into the cluster of interest (called "in correct cluster") and
relevant means that the system says the element is relevant to the cluster of
interest.

                           Table 1. Four combinations
                        system output    relevant   non-relevant
                        in cluster       R+         N+
                        not in cluster   R−         N−
                        total            r          n−r

In the table, R+, N+, R−, N− denote the number of elements in each category,
given the number n of all elements and the number r of all relevant
elements without duplication. We define the miss detection ratio $P_{miss}$ of
the cluster and the false alarm ratio $P_{fa}$ as follows:

$$P_{miss} = \frac{R_{-}}{r} \tag{1}$$

$$P_{fa} = \frac{N_{+}}{n - r} \tag{2}$$

Finally we define detection cost as a linear sum of $P_{miss}$ and $P_{fa}$.

$$C_{det} = C_{miss} \, P_{miss} \, P(target) + C_{fa} \, P_{fa} \, (1 - P(target)) \tag{3}$$

$C_{miss}$ and $C_{fa}$ are the costs of miss detection and false alarm respectively and
$P(target)$ is the prior probability to obtain the target.
    Fiscus and Doddington examine the relationship between the costs
and the prior probabilities; in TDT, misses should be penalized much more
heavily than false alarms [5]. This is the reason they give $C_{miss} = 10$ and
$C_{fa} = 1$. They assume a common value of $P(target)$ for all topics based on
corpus statistics, namely the constant 0.02. Substituting these values into (3)
gives the coefficients $10 \times 0.02 = 0.2$ and $1 \times (1 - 0.02) = 0.98$,
so we define the TDT cost function as:

$$C_{det} = 0.2 \, P_{miss} + 0.98 \, P_{fa} \tag{4}$$
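Equations (1) to (4) translate directly into a small function. The following Python sketch is an illustration only: the function name `detection_cost` and its argument names are hypothetical, and the defaults are the constants quoted above ($C_{miss} = 10$, $C_{fa} = 1$, $P(target) = 0.02$).

```python
def detection_cost(r_plus, n_plus, r_minus, n_minus,
                   c_miss=10.0, c_fa=1.0, p_target=0.02):
    """Detection cost from the four counts of table 1.

    r_plus / n_plus:   relevant / non-relevant elements in the cluster
    r_minus / n_minus: relevant / non-relevant elements not in the cluster
    """
    r = r_plus + r_minus                # all relevant elements
    non_relevant = n_plus + n_minus     # n - r
    p_miss = r_minus / r                # equation (1)
    p_fa = n_plus / non_relevant        # equation (2)
    # Equation (3); with the default constants this reduces to equation (4).
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# One relevant element missed out of two, one false alarm out of three.
print(detection_cost(r_plus=1, n_plus=1, r_minus=1, n_minus=2))
```

Keeping the constants as parameters makes it easy to see how strongly the result depends on the assumed prior: with $P(target) = 0.02$ the false alarm term dominates unless almost every relevant element is missed.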

    Allan has also defined the travel cost as the cost of finding the most
suitable cluster for each topic from the root node. This cost consists of a
detection cost and the depth needed to reach the cluster. Let us denote the
depth from the root by D; then we define the travel cost as follows:

$$dep = \frac{D}{\max(D)} \tag{5}$$

$$C_{travel} = C_{det} + dep \tag{6}$$

    Finally, in Allan's approach, the minimal cost of a cluster is defined as
the shortest path to this cluster from the structure's root cluster.

$$C_{minimal} = \min(C_{travel}) \tag{7}$$
EXAMPLE 41 Let us evaluate our example 34. We assume that the two clusters
{1,2} and {3,4} are the correct clusters, and we compare a merged cluster {4,5}
with the correct cluster {3,4}.
    R− is assigned to {3}, and N+ is assigned to {5}. The details of the
calculation of each cost are shown in figure 7. Then we obtain the minimal cost
$C_{minimal} = 0.5$ in cluster {1,2}.


[Figure 7, rebuilt as a table from its four panels; each panel lists totals r = 2 and n − r = 5 − 2.]

  cluster       R+  N+  R−  N−  miss        fa              Cdet                         Ctravel
  {4,5}         1   1   1   3   1/2 = 0.5   1/(5-2) = 0.33  0.2*0.5 + 0.98*0.33 = 0.423  0.423 + 1 = 1.423
  {1,2}         2   0   0   3   0/2 = 0     0/(5-2) = 0     0.2*0 + 0.98*0 = 0           0 + 0.5 = 0.5
  {3,4,5}       2   1   0   2   0/3 = 0     1/(5-2) = 0.33  0.2*0 + 0.98*0.33 = 0.323    0.323 + 0.5 = 0.823
  {1,2,3,4,5}   2   3   0   0   0/3 = 0     3/(5-2) = 1     0.2*0 + 0.98*1 = 0.98        0.98 + 0 = 0.98

                           Fig. 7. Allan's Calculation

  D    depth
  0    0
  1    0.5
  2    1.0

                           Fig. 8. Depth Calculation
    Allan's approach has several problems. One problem is that the correct
clusters are not really correct hierarchical clusters. For example, in the cluster
{1,2,3,4,5} obtained after merging in the example above, the cluster {1,2} is
counted as relevant elements while the cluster {3,4,5} is counted as non-relevant
elements. This is not suitable, since the correct clusters of non-hierarchical
clustering are mutually exclusive. It is therefore problematic to use correct
non-hierarchical clusters for evaluating hierarchical clustering.
    Another problem comes from the Power set. We have Cmiss = 0 and Cfa = 1
when combining {1,2} and {3,4,5}. Note that the Power set can't be evaluated
in terms of false alarm cost. With a huge number of elements we obtain
Cfa = N+ /(270000 − r). When the number r of cluster elements is small (for
instance, if a cluster lies on a deep path from the root node), the Power set
can't be evaluated, as in the example.
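
The arithmetic of Fig. 7 can be replayed with a short script. This is a minimal sketch, not Allan's implementation: the weights 0.2 and 0.98 and the fractions are read off the figure, while the function and variable names are hypothetical. Because the figure rounds 1/3 to 0.33, the exact values differ slightly from the printed ones in the third decimal place.

```python
# Weights as they appear in the arithmetic of Fig. 7.
W_MISS = 0.2
W_FA = 0.98


def allan_costs(miss, fa, dep):
    """Return (Cdet, Ctravel) for one step, given Cmiss, Cfa and the depth term."""
    c_det = W_MISS * miss + W_FA * fa
    return c_det, c_det + dep


# (Cmiss, Cfa, dep) for the four steps of Fig. 7, left to right.
steps = [
    (1 / 2, 1 / 3, 1.0),
    (0 / 2, 0 / 3, 0.5),
    (0 / 3, 1 / 3, 0.5),
    (0 / 3, 3 / 3, 0.0),
]
results = [allan_costs(*s) for s in steps]

for c_det, c_travel in results:
    print(f"Cdet = {c_det:.3f}  Ctravel = {c_travel:.3f}")
```

The last step illustrates the problem discussed above: once every element outside the chosen topic counts as a false alarm, Cfa reaches 1 and dominates the detection cost.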
4.3   CHR Method
We propose a novel evaluation method that uses both correlation and hierarchy
aspects. Our basic idea is that a hierarchical expression describes the relations
among the clusters by means of correlation. We propose 3 kinds of evaluation,
for cluster readability, hierarchy readability and reading comprehension.
    Let us discuss how to evaluate the readability of a cluster. Cluster granularity
means "the roughness of the cluster", not "the size of the cluster". We say the
granularity is compact if the cluster elements are similar to each other; otherwise
we say the granularity is coarse. For instance, a cluster of 26 elements A − Z
and a cluster of 26 elements containing only A have the same size, but the
contents of the latter are easier to grasp. We believe there is a strong relationship
between the readability of a cluster and its granularity. Therefore, we evaluate
the granularity by means of Allan's detection cost.
    Allan has proposed a detection cost consisting of two measures, Cfa and Cmiss.
Cfa evaluates how dissimilar the elements in a cluster are to each other,
and Cmiss evaluates how many similar elements the cluster misses.
    Following these ideas, we introduce a notion of cluster granularity consisting
of inner and outer granularities. Corresponding to Allan's Cfa, we define the
inner granularity Cin as follows:

      C_{in} = 1 - \frac{2}{m(m-1)} \sum_{i=1}^{m} \sum_{j=i}^{m} sim(x_i, x_j)        (8)

where m is the number of elements in a cluster, xk (k : 1 . . . m) is an
element of the cluster, and sim(xi , xj ) is the similarity between two elements.
This cost expresses how coarsely the cluster elements are related.
   Similarly we define the outer granularity Cout as follows:

      C_{out} = \sum_{i=1}^{s} \sum_{j=i}^{s} sim(Cl_i, Cl_j)                   (9)

where s is the number of clusters, Clr (r : 1 . . . s) is a cluster, and
sim(Cli , Clj ) is the similarity between the clusters Cli and Clj based on the
average-linkage method. This cost represents how similar the clusters are to
each other.
    Then we restate Cdet in terms of the two costs, where Cdet expresses the
readability of a cluster.
                             Cdet = Cin + Cout                             (10)

Note that this cost reflects how compact a cluster is and how similar it is
to the other clusters.
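
Equations (8) to (10) can be tried out on the similarity values of Table 3. The sketch below is one literal reading of the definitions under stated assumptions: similarities are looked up symmetrically from the upper triangle, the diagonal and the j = i terms contribute 0, a singleton cluster is given Cin = 0, and all function names are hypothetical.

```python
from itertools import combinations

# Upper-triangular similarities from Table 3 (pairs not listed are 0.0).
SIM = {(1, 2): 0.66, (2, 3): 0.22, (2, 4): 0.22, (2, 5): 0.22,
       (3, 4): 0.66, (3, 5): 0.66, (4, 5): 1.0}


def sim(a, b):
    """Symmetric element similarity; the diagonal is 0 as in Table 3."""
    if a == b:
        return 0.0
    return SIM.get((min(a, b), max(a, b)), 0.0)


def c_in(cluster):
    """Inner granularity, equation (8); assumed to be 0 for a singleton."""
    m = len(cluster)
    if m < 2:
        return 0.0
    total = sum(sim(a, b) for a, b in combinations(cluster, 2))
    return 1.0 - 2.0 * total / (m * (m - 1))


def avg_link(c1, c2):
    """Average-linkage similarity between two disjoint clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def c_out(clusters):
    """Outer granularity, equation (9), summed over distinct cluster pairs."""
    return sum(avg_link(c1, c2) for c1, c2 in combinations(clusters, 2))


def c_det(cluster, clusters):
    """Cluster readability, equation (10)."""
    return c_in(cluster) + c_out(clusters)


clusters = [[1, 2], [3], [4, 5]]
print(c_in([4, 5]), c_in([1, 2]), c_out(clusters))
```

A cluster whose elements are all highly similar, such as {4,5}, gets an inner granularity of 0, which is the "compact" case described above.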
    Let us introduce the cophenetic correlation coefficient for the purpose of
evaluating hierarchical relationships. Let us show how two hierarchical
clusterings are evaluated by means of a similarity matrix.
    Assume that we have similarity 0.0 between the elements 1 and 3, as shown in
the similarity matrix of Table 2. When we combine {1,2} and {3} as in figure 9,
we get the similarity 0.8 between 1 and 3 based on single linkage. In figure 10,
we get the similarity 0.4 by average linkage. In any case, when two clusters are
combined the similarity changes, so we can't preserve the original similarity
matrix, and the new value depends on the kind of linkage we use.

         Table 2. Similarity Matrix

                1    2    3
             1 0.0  1.0  0.0
             2 0.0  0.0  0.8
             3 0.0  0.0  0.0

         Fig. 9. Hierarchical Structure 1 (leaves 1 2 3; levels 1 and 0.8)

         Fig. 10. Hierarchical Structure 2 (leaves 1 2 3; levels 1 and 0.4)

    The idea of the cophenetic correlation coefficient allows us both to keep
the original similarity matrix and to capture the current state of the
hierarchical clustering. We can describe the hierarchy by a cophenetic matrix.
The elements x of the similarity matrix and the elements y of the cophenetic
matrix are compared by means of the Pearson product-moment correlation
coefficient, defined as follows:

      r_{x,y} = \frac{\sum xy - (1/n)(\sum x)(\sum y)}{\sqrt{\{\sum x^2 - (1/n)(\sum x)^2\}\{\sum y^2 - (1/n)(\sum y)^2\}}}        (11)

where n is the number of (x, y) pairs.
    This is called the cophenetic correlation coefficient. If the value is close
to 1, there is a positive correlation, and if it is close to -1, there is a
negative correlation.
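
Equation (11) can be checked on the three-element example of Table 2. The following is a minimal sketch: it assumes that only the upper-triangle pairs (1,2), (1,3), (2,3) enter the sums, and that the cophenetic value of a pair is the level at which its two elements are first merged (1 and 0.8 for single linkage, 1 and 0.4 for average linkage). The function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation, written as in equation (11)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return num / den


# Similarities of the pairs (1,2), (1,3), (2,3) from Table 2.
original = [1.0, 0.0, 0.8]
# Cophenetic values of the same pairs under the two hierarchies.
single_link = [1.0, 0.8, 0.8]
average_link = [1.0, 0.4, 0.4]

r_single = pearson(original, single_link)
r_average = pearson(original, average_link)
print(round(r_single, 4), round(r_average, 4))
```

In this small example both hierarchies give the same coefficient, because the two cophenetic vectors differ only by a linear rescaling; on larger hierarchies the linkage choice generally changes the value.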
    As described, we evaluate reading comprehension by the length of the path
from the root to the most suitable clusters. A hierarchical expression latently
carries as much information as its sources, because the expression assigns all
STUs to leaf nodes. Recall that reading comprehension increases whenever
clusters are shallow and informative.
    We evaluate this by two costs, Ctravel and Cminimal. Let D be
the depth from the root, and define the costs as follows.

                                   dep = D/max(D)                              (12)
                                Ctravel = Cdet + dep                           (13)
                            Cminimal = min(Ctravel )                           (14)
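
Equations (12) to (14) translate directly into a few lines of code. The sketch below is illustrative only: the Cdet values and depths are the ones that appear in Fig. 11 and Fig. 12, and the function name and dictionary layout are hypothetical.

```python
def travel_costs(c_det, depth):
    """Ctravel = Cdet + D / max(D) for every cluster, equations (12) and (13)."""
    max_depth = max(depth.values())
    costs = {}
    for name, cost in c_det.items():
        dep = depth[name] / max_depth if max_depth > 0 else 0.0
        costs[name] = cost + dep
    return costs


# Cdet and depth D per cluster, as read from Fig. 11 and Fig. 12.
c_det = {"{4,5}": 0.88, "{1,2}": 0.56, "{3,4,5}": 0.45, "{1,2,3,4,5}": 0.89}
depth = {"{4,5}": 2, "{1,2}": 1, "{3,4,5}": 1, "{1,2,3,4,5}": 0}

costs = travel_costs(c_det, depth)
best = min(costs, key=costs.get)   # cluster attaining Cminimal, equation (14)
c_minimal = costs[best]
print(best, round(c_minimal, 2))
```

The depth term penalizes clusters far from the root, so a shallow cluster can attain Cminimal even when its Cdet is not the smallest.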

EXAMPLE 42 Let us apply the CHR method to our example. Table 3 contains the
similarity matrix of example 34. Figure 12 illustrates how cluster readability and
reading comprehension are calculated.

         Table 3. Our Similarity Matrix

                1    2     3     4     5
             1 0.0  0.66  0.0   0.0   0.0
             2 0.0  0.0   0.22  0.22  0.22
             3 0.0  0.0   0.0   0.66  0.66
             4 0.0  0.0   0.0   0.0   1.0
             5 0.0  0.0   0.0   0.0   0.0

         Fig. 11. Detail of Depth Calculation (leaves 1 2 3 4 5)

             D    depth
             0    0
             1    0.5
             2    1.0

         Fig. 12. Detail of Cost Calculation

    Cluster {4,5}:
                    1    2     3     {4,5}
            1       0    0.66  0     0
            2       0    0     0.22  0.22
            3       0    0     0     0.66
            {4,5}   0    0     0     0
        Cout = 0.66+0.22+0 = 0.88
        Cin = 1-1 = 0
        Cdet = 0.88+0 = 0.88
        Ctravel = 0.88+1.0 = 1.88

    Cluster {1,2}:
                    {1,2}  3     {4,5}
            {1,2}   0      0.11  0.11
            3       0      0     0.66
            {4,5}   0      0     0
        Cout = 0.11+0.11 = 0.22
        Cin = 1-0.66 = 0.34
        Cdet = 0.34+0.22 = 0.56
        Ctravel = 0.56+0.5 = 1.06

    Cluster {3,4,5}:
                    {1,2}  {3,4,5}
            {1,2}   0      0.11
            {3,4,5} 0      0
        Cout = 0.11
        Cin = 1-0.66 = 0.34
        Cdet = 0.11+0.34 = 0.45
        Ctravel = 0.45+0.5 = 0.95

    Cluster {1,2,3,4,5}:
        Cout = 0
        Cin = 1-0.11 = 0.89
        Cdet = 0.89
        Ctravel = 0.89+0 = 0.89

         Fig. 13. Hierarchical Summarization (nodes listed level by level, left to right)

    C00 (1621) [0.0003] MS/MS Outline. MS/MS Method is Two series

    C02 (110) [0.3820] The photograph of the aurora
    C01 (1479) [0.0174] MS/MS Outline. MS/MS Method is Two series

    C06 (55) The photograph of the aurora
    C05 (52) [0.5818] African total solar eclipse
    C04 (149) [0.4149] Bulletin "Asahi-ten" Star and Asahikawa astronomical club
    C03 (1168) [0.0219] MS/MS Outline. MS/MS Method is Two series

    C12 (18) A Turkish total solar eclipse
    C11 (30) African total solar eclipse
    C10 (15) the heavenly bodies image. Asahikawa astronomical club
    C09 (31) Bulletin "Asahi-ten" Star and Asahikawa astronomical club
    C08 (19) a cause of death. Schedule. the chief mourner
    C07 (756) [0.0605] the experiment machine center HP

    C14 (264) [0.0761] FIR Filter Design
    C13 (196) [0.1012] GET! NEW information

    C08 (11) FIR Filter Design
    C08 (11) Equipment, machine use manual
    C08 (42) Free BBS
    C15 (17) The publication request person of the report of death

5   Experimental Results
In this section we show some experimental results to see how effectively our
approach works. We examine a test collection of Web pages called NTCIR-3,
provided by NII, Japan. We discuss two experiments that apply our
algorithms to one of the clusters. In the first experiment, we compare our CHR
method with Allan's method and examine the evaluation results. In the second
experiment, we compare several kinds of STU to see how important a role they
play in hierarchical summarization.

5.1   Preliminaries
Before describing our results, let us quickly review our previous work. The
NTCIR-3 collection used in this experiment contains many HTML and text
documents from the Japanese internet (i.e., the .jp domain). We have selected
9,929 pages dated September 29 to October 5, 2001 for examination. We have
applied combination clustering to the data and obtained 6 meaningful clusters.
     Focusing on one cluster, let us discuss its hierarchical expression.
Figure 13 shows the result of the hierarchical summarization generated by average
linkage, where the digits in (..) and [..] show the number of STU elements
and the similarity at cluster merging, respectively.
     Let us examine each summarization result and the corresponding cluster. Each
leaf node is an STU that corresponds to a Web page contained in this group. In
fact, "experiment equipments", which appears in cluster 1, corresponds to clusters
C00/ C01/ C03/ C07/ C14/ C17/ C18, while "a cause of death" corresponds to C08
and "free BBS" to C16. There is an STU about "Asahikawa astronomical club",
and we can see topic drift because many STUs about "Asahikawa university"
arise in this cluster.
     A parent node describes more abstract contents than its child nodes. For
example, the African total solar eclipse cluster (C11) contains thirty pages, and
the Turkish total solar eclipse cluster (C12) contains 18 pages. When we combine
them (C05), we get African total solar eclipse as the label. The bigger cluster
then becomes an abstraction of those two. This means that the higher a node is,
the more STUs it contains and the more frequent the topics it shows.

    Commonly, the higher the level of an STU, the more general the contents it
describes. In fact, the highest-level STUs are Asahikawa university and digital
filter lab, which are the dominant pages.
    To label clusters, we introduce a centroid, which can be seen as a
representative sentence of a cluster. Because a centroid is defined in terms of
frequency, our method allows us to extract the most important sentences
hierarchically. The centroid approach is also consistent with the hierarchical
structure of the clusters.


5.2    Experiment 1

As shown in the previous subsection, we examine 1621 STUs from 35 clusters.
In this experiment, let us compare our CHR method with Allan's. We calculate
the costs defined by the CHR method as well as by Allan's method, and compare
Cin with Cfa, and Cout with Cmiss.
    When two clusters C1 and C2 are combined into one, say Cm, let us discuss
how many false alarms and erroneous elements we have with respect to Allan's
method, as shown in table 4. In the table, when the two clusters C and G are put
together, we see that Cmiss decreases. Generally Allan's Cmiss never increases
when we combine clusters, because every article carries only one event (topic)
and the topic with the lowest Cmiss is chosen. For example, Cmiss cannot increase
at step 3 in table 4. However, when a cluster contains more than one topic, we
should calculate Cmiss over all topics, and table 5 shows that Cout increases
under the CHR method.
    Allan's cost Cfa depends on the number of cluster elements and the total
number of elements. When clusters are combined, all the elements except those
of a certain topic are classified as N+. This means Cfa increases greatly when
we get a cluster of many elements. For example, in table 4, Cfa is small at
step 1 and step 2 but increases drastically at step 3. On the other hand, the
CHR method calculates granularity, so its costs depend on the similarity
between topics in a cluster rather than heavily on the number of elements.
In our case, Cin doesn't increase drastically at step 3, as table 5 shows.
        Table 4. Allan's miss and fa

      step  C1  C2  Cm  Cmiss     Cfa
        1   A   B   C   0.517241  0.0712743
        2   E   F   G   0.873418  0.0155642
        3   C   G   H   0.107759  0.419726

        Table 5. CHR's miss and fa

      step  C1  C2  Cm  Cout         Cin
        1   A   B   C   0.004692449  0.9083263
        2   E   F   G   0.001672507  0.9379648
        3   C   G   H   0.002181099  0.9438765

    Figure 14 shows the average of Cfa and Cmiss at each depth, to compare
Cout with Cmiss, and Cin with Cfa. The depth at which the two costs are the
smallest under the CHR method corresponds to the one under Allan's method, as
shown in figure 15. However, in figure 14, the CHR method doesn't correspond to
Allan's method at all. Allan applies weights, as a sort of approximation,
to come close to the ideal detection cost. With the CHR method we get results
similar to Allan's, but with no weights and no correct clusters given in advance.

            Fig. 14. miss Cost and fa Cost

            Fig. 15. Detection Cost and Travel Cost

5.3   Experiment 2
Our next experiment examines whether links and nesting are really effective
for obtaining better summarization, that is, how useful STU(Nest) and STU(Link)
are. In this experiment, we generate STUs by means of 4 types of constructs:
STU, STU(Nest), STU(Link), and STU(Nest+Link), which mean STU without any
nesting, with nesting, with link specification but no nesting, and with both
nesting and link, respectively. We obtain ŷ in such a way that:

      \hat{y} = \arg\min_{y \in Y} C_{det}(y)                    (15)

This equation means that ŷ is predominant in Cdet. Ctravel(u) and cophenetic(v)
are treated similarly:

      \hat{u} = \arg\min_{u \in Y} C_{travel}(u)                 (16)
      \hat{v} = \arg\max_{v \in Y} cophenetic(v)                 (17)

       Fig. 16. Calculating argmin

       Fig. 17. Cophenetic Correlation Coefficient

                                cophenetic correlation coefficient
           STU                  0.678701
           STU (Link)           0.775679
           STU (Nest)           0.667447
           STU (Nest+Link)      0.717835


    We examine the three argmin/argmax values for each of the 4 methods;
the results are shown in figure 16.
    We see that STU(Nest) achieves the best result in figure 16. We found that
some STUs also appear in STU(Nest); in this case, we must have compact
granularity. The results for the cophenetic correlation coefficient are shown in
figure 17. Here STU(Link) shows the best result, which means that referenced
pages generally tell us the main topics. When we replace links by representative
contents, the centroids change heavily. This means STU(Link) is one of the key
factors that improve the cophenetic correlation coefficient.
    From these results we draw two conclusions. The first is that STU(Nest)
plays an important role in both detection cost and travel cost. The second is
that STU(Link) essentially improves the cophenetic correlation coefficient.
Because STU(Nest+Link) carries both aspects, we can say the CHR method is
excellent.
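
The selection in equation (17) amounts to picking the STU construct with the largest coefficient. As a minimal sketch, the snippet below applies it to the values reported in Fig. 17; the helper name `arg_best` is hypothetical. Equations (15) and (16) would use the same helper with `maximize=False`, but the underlying Cdet and Ctravel values are not listed in the text.

```python
# Cophenetic correlation coefficients as reported in Fig. 17.
cophenetic = {
    "STU": 0.678701,
    "STU (Link)": 0.775679,
    "STU (Nest)": 0.667447,
    "STU (Nest+Link)": 0.717835,
}


def arg_best(scores, maximize):
    """Return the key with the largest (or smallest) score."""
    pick = max if maximize else min
    return pick(scores, key=scores.get)


v_hat = arg_best(cophenetic, maximize=True)
print(v_hat)
```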

6    Conclusion
In this investigation, we have proposed a new technique to summarize the
contents of Web pages in the form of a hierarchical expression. We have then
proposed a novel evaluation method for the expression. Experiments show good
results, and we can say the approach is promising.
    Our approach starts with the decomposition of pages into STUs and the
application of hierarchical clustering. An automated procedure can be
implemented easily.


References
 1. Allan, J., Carbonell, J., Doddington, G., Yamron, J. and Yang, Y. : Topic detection
    and tracking pilot study: Final report, the DARPA Broadcast News Transcription
    and Understanding Workshop, 1998
 2. Allan, J., Feng, A. and Bolivar, A.: Flexible Intrinsic Evaluation of Hierarchical
    Clustering for TDT, the 12th international Conference on Information and Knowl-
    edge Management (CIKM), 2003, pp.263 - 270
 3. Buyukkokten, O., Garcia-Molina, H. and Paepcke, A.: Seeing the Whole in
    Parts:Text Summarization for Web Browsing on Handheld Devices, In Proceed-
    ings International WWW Conference, 2001
 4. Chakrabarti, S.: Mining the Web, Morgan Kaufmann, 2003
 5. Fiscus, J.G. and Doddington, G.R.: Topic detection and tracking evaluation
    overview, Event-based Information Organization, 2002, pp. 17-31
 6. Jain, A.K., Murty, M. N. and Flynn, P. J.: Data clustering: a review, ACM Com-
    puting Surveys, 31-3, 1999, pp.264-323
 7. Mani,I. : Automatic Summarization, John Benjamins, USA, 2001
 8. Mori, M., Miura, T and Shioya, I.: Extracting Events From Web Pages, Interna-
    tional Conference on Advances in Intelligent Systems - Theory and Applications
 9. Mori, M., Miura, T and Shioya, I.: Abstracting Temporal Clusters, Internet Tech-
    nologies and Applications (ITA), 2005
10. Yukio Ohsawa, Nels E. Benson and Masahiko Yachida: KeyGraph: Automatic In-
    dexing by Co-occurrence Graph based on Building Construction Metaphor Proc.
    Advanced Digital Library Conference (IEEE ADL’98), pp.12-18, 1998
11. Radev, D., Jing, H. and Budzikowska, M.: Centroid-based summarization of
    multiple documents: sentence extraction, Information Processing and Manage-
    ment, 2004, pp.919-938
12. Takahashi, K., Miura, T and Shioya, I.: Combination Clustering for Web Corre-
    lation, IEEE Paciﬁc Rim Conference on Communications, Computers and Signal
    Processing (PACRIM), 2005, pp.434-437
13. Takahashi, K., Miura, T and Shioya, I.: Hierarchical Summarization of Web Pages,
    IADIS Applied Computing, 2006, pp.612-617
14. Trieschnigg, D. and Kraaij, W.: Scalable Hierarchical Topic Detection, SIGIR, 2005
15. Zamir, O. and Etzioni, O.: Web Document Clustering: A Feasibility Demonstra-
    tion, the 21st International ACM SIGIR Conference, 1998, pp.46-54