<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hierarchical Summarizing and Evaluating for Web Pages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kou TAKAHASHI</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Takao MIURA</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isamu SHIOYA</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HOSEI University</institution>
          ,
          <addr-line>Kajinocho 3-7-2, Koganei, Tokyo</addr-line>
          ,
          <country country="JP">JAPAN</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sannno University</institution>
          ,
          <addr-line>Kamikasuya 1563, Isehara, Kanagawa</addr-line>
          ,
          <country country="JP">JAPAN</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this investigation we propose a novel summarization method of Web pages using hierarchical expression. We discuss close relationship between summarization and hierarchical clustering to obtain the results, and we examine how to evaluate hierarchical summarization based on both correlation and structural aspects. We describe some experimental results using NTCIR Web documents to examine our method.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Nowadays we face to huge amount of Web pages which keep growing everyday.
Whenever we like to know current situation of interests, we can examine easily
what’s going now all over the world through these pages. However, at the same
time, such information flooding makes it impossible for us to grasp the contents
quickly and intuitively. That’s reason why much attention has been paid on how
to grasp and summarize the contents quickly [
        <xref ref-type="bibr" rid="ref4 ref7">7, 4</xref>
        ]. To attack these issues, there
have been several approach proposed, and among others, we have two important
techniques, clustering and summarization[
        <xref ref-type="bibr" rid="ref12 ref4">12, 4</xref>
        ].
      </p>
      <p>
        Clustering means a method to put objects into several groups in such a way
that objects in a group are similar with each other while objects in distinct groups
are not [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In other words, clustering depends on both definition of similarity and
algorithm by which we can extract interesting patterns that are hidden. Most
information in Web pages are categorical (say, we see ”Computer Science”,
”Biology”, ”Mathematics”) but not numerical so that we interpret hardly the notions
of metric or order. This is the true reason traditional clustering techniques are
not well-suited.
      </p>
      <p>Summarization provides us with presentation of important contents of
information source in a simplified way by which we can grasp quickly. In investigation
of automatic summarization, two major aspects have been discussed, extraction
and abstraction. To extract important part from the contents, we put weights
such as frequency to words or paragraphs, and score them. Eventually we select
sentences with high score order. On the other hand, to abstract contents, we
should analyze them in terms of Natural Language Processing (NLP) or
Information Retrieval (IR), thus few approaches has been proposed so far.</p>
      <p>In this work, we propose a new kind of Web summarization technique based
on hierarchical clustering. there are two kinds of ideas, Web page clustering and
Web page summarization. We put our attention on frequency and co-occurrence
of both index words and hyper-links appeared in the Web pages. We
decompose the page contents into semantic textual units (STU) as described later.
With the help of hierarchical clustering among them, we extract some sort of
structure among collections of STUs so that we can consider the structure as
the abstraction of the contents in a form of hierarchy. We obtain a center STU
from each cluster, thus we can represent the summarization in terms of structure
among STUs.</p>
      <p>We must discuss how to evaluate the hierarchical summarization. Generally
evaluation is difficult as to both clustering and summarization, This is reason
because correct output can’t be defined. In this investigation we propose a novel
evaluation method, from view points readability of both clusters and
hierarchical structures as well as reading comprehension. We evaluate the hierarchical
summarization in terms of penalty, called cost.</p>
      <p>In section 2 we introduce several issues of automatic summarization and
application to Web, and we discuss a new technique how to summarization the
contents in a hierarchical manner in section 3. In section 4 we propose a novel
evaluation method of hierarchical expression. We show some experimental results
in section 5, and conclude our work in section 6.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Automatic Summarization</title>
      <p>
        In this section we review automatic summarization putting our attention on
Web pages[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Generally summarization means a process to distill most
important information from document sources for particular users and tasks. The
automatic summarization for particular users is used for contents grasping of
documents(such as news articles). There are 3 kinds of the techniques proposed,
Extracting, Abstracting and Clustering.
      </p>
      <p>
        Extracting means identifying the most relevant parts to the main theme
appeared within the document. This approach is advantageous because very often
statistical techniques such as frequency and correlation correspond to the
relevance or importance. Here we don’t need any background knowledge of NLP,
and we could automate the process easily and efficiently. On the other hand, the
results can be lack of coherence because of dangling anaphor such as pronoun
and conjunction [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Abstracting means putting text objects into more general and super-ordinate
concepts so that the results don’t contain the expressions not appeared in the
objects. The approach is advantageous because there should be coherent in the
results, and better compression rate could be expected. On the other hand, we
should analyze what documents mean to generalize context to some extent and
NLP or IR technique should be required [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Clustering means collecting objects into several groups by using a certain
similarity. We could extract underlying semantics from the objects. But we need
other techniques such as labeling groups to interpret the results. The approach
is advantageous because we can apply clustering easily and efficiently to objects
without any NLP knowledge, but is difficult to define similarity and interpret
the results, although several labeling have been proposed [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ].
      </p>
      <p>
        As for some applications of automatic summarization for Web pages,
extracting has been discussed for the purpose of hand-held device [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The technique is
indispensable since the display is really small for Web browsing. Web pages are
divided into small units, called semantic textual unit (STU), where each STU is
defined as a unit of meaningful contents, usually corresponded to paragraphs or
caption parts. Every first line is sent for browsing purpose. For deeply reading,
users may ask of first 3 lines or whole lines.
      </p>
      <p>
        A different approach comes from Web clustering, since each cluster
corresponds to closely related collection of Web pages with each other. Labeling
clusters corresponds to summarizing the page contents, because the labels represent
what the contents of the cluster mean. Very often frequent words could be seen
as the labels of the pages [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. But because there are strong similarities among the
pages obtained by search engines, the frequency can’t distinguish one from
others [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Better approach could come from important words rather than frequent
words. With these words we could see what’s going on quickly. Moreover, there
has been pointed out that labels consisting of sequences rather than the ones of
words be helpful to grasp the contents. A typical approach as Suffix Tree
Clustering (STC)[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] where sequences of words are used. However, STC is based on
heavy assumption that we can use a set of documents which are closely related
to each other(for example, hit list by search engine). Some investigation has
proposed to extract important words based on Key-Graph [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and word-sequences
based on Suffix Tree [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and then combined both results to abstract the pages
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, the results depend on temporal aspects as well as clustering.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Summarizing Pages Hierarchically</title>
      <p>Web documents consist of tag parts and link specifications as well as textual
parts. We examine relationship over Web pages, and propose structuring as a
method of novel automatic summarization. By structuring objects, we mean
a technique to restate documents in terms of data structure. Since any data
structure carries its own meaning, we use specific structure concisely and clearly
as a substitute for parts of documents, thus we understand what the information
does imply. Note that any data structure can be applied to any situation because
they are polymorphic without any NLP knowledge.</p>
      <p>
        However what kinds of structures are suitable for our situation and how to
organize our documents appropriately ? What we work with are either words
(minimum lexical units) or sentences (minimum semantic units). In the former
case, we can examine fine semantics but hardly grasp global view because of
detail relationship among words. On the other hand, in the latter case, it seems
easier to examine relationship among sentences although outlines depend on
syntax. Frequency and their correlation help us to obtain important parts for words,
while distance between sentences and centroids of document clusters could show
us what parts should be extracted. Thus, to grasp the contents of documents,
important words and the relationship among them or document clusters play
important role. There can be several expressions to describe semantic structure
of the contents. Possible examples are directed graph and trees, since we expect
abstraction mechanism to represent a variety of levels of the contents. In a case
of tree, we think nodes in higher levels correspond to overview/global viewpoint
while nodes in lower levels to detail/local aspects. One of these techniques is Key
Graph [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] by which the relationship can be modeled by means of graphs.
However, there is no mechanism to interpret the contents from several abstraction
levels of viewpoints.
      </p>
      <p>
        In this investigation, we propose a novel technique for Web page
summarization. One of the main issues is how to make groups to a huge amount of Web
pages. It is well-known that naive clustering of Web pages generates a few
gigantic clusters and many trivial clusters so that clusters provide us with nothing
new knowledge. However, we have already discussed a sophisticated clustering
method to Web documents based on correlation of hyperlinks with naive
clustering under vector space modeling [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. With these results, let us go one step
further. In this work, we apply hierarchical clustering to each group of Web pages
using STUs, and we discuss a new ”labeling” method to each group by taking a
centroid. Relationship among the clusters corresponds to the hierarchical
structure as shown in the figure 1.
      </p>
      <p>Document 2
Sentence 1
Sentence 2
Document 1
Sentence n
Sentence 1
Sentence 2
Sentence n
In this section and the next section, we discuss how to make hierarchical
summarization of Web pages. The process consists of two steps. The first step is that
we put a collection of Web pages into disjoint groups by combining two kinds
of clustering results, called combination clustering. The second step is that we
summarize each group. The final results are a set of hierarchical expressions
which can be seen as summarization of the Web page groups.</p>
      <p>
        Our technique of combination clustering is made based on an idea that similar
documents share many hyperlinks. We extract characteristic hyperlinks as well
as index words to distinguish from each other. It is shown that the approach
is effective to obtain fruitful results[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Here let us explain the outline of this
approach by examples. Basically we put attention on co-occurrence of hyperlinks.
We call this approach LINK clustering. Then we extract index terms, build the
vector space and make clustering them. This traditional approach is called VSM
clustering. Then we combine the two results into one to obtain new clustering
of Web pages. Each group is not only characterized by some topic but also
connected tightly with each other in a sense of contents.
      </p>
      <p>EXAMPLE 31 Let us show LINK clustering by an example. By interpreting
nodes as Web pages and arcs as hyperlinks, we can represent Web pages by the
graph, the number of references by in-degree and so on. Suppose there are 6 nodes
a1, ..., a6 as in figure 2. Let F rom(a) be a set of nodes staring from a, called a
set of out-arcs of a, its cardinality is called out-degree of a. Similarly, the set of
arcs with the ending node b, T o(b) is called a set of in-arcs of b, its cardinality is
LATEX style file for Lecture Notes in Computer Science – documentation
(1,0,0,0,0) (0,0,1,1,1) (0,0,1,1,0)
(a) LINK clustering
called in-degree of b. We apply complete link hierarchical clustering. This process
is called LINK clustering. we get two LINK clusters A1 and A2 :A1=a1 , a2, a4
, A2 = a3,a6 As for a node a5, we consider a cluster of singleton and remove
this.</p>
      <p>EXAMPLE 32 VSM clustering is nothing but a clustering by document
vectors. For example, given 6 Web pages P = {a1, · · · , a6} with the document
vectors in figure 3, We apply complete link hierarchical clustering. This approach is
called VSM clustering, and each cluster is called VSM cluster. We get 2 clusters
B1, B2: B1 = {a1, a4}, B2 = {a2, a3, a5, a6} Then let us combine the two
results. In figure 4, two ovals represent LINK clusters A1, A2 while two rectangles
describe VSM clusters B1, B2.</p>
      <p>
        EXAMPLE 33 In example31 and 32, we obtain two combined clusters C11 =
{a1, a4} and C22{a3, a6} but we discard small clusters C12 = {a2} and C02 =
{a5} so we got the final results of two groups C11, C22.
Let us discuss technique of structuring each group of Web pages[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We assume
sets of Web pages through combination clustering. Web pages are divided into
small units, called Semantic Textual Unit (STU), which are defined as a unit
of meaningful contents to capture intended semantics of paragraphs or caption
parts (in, say, alt tag) of Web pages. The similar approach is found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Well-formed Web pages contain strings parts and tag parts in a nested
manner. Any strings quoted by a tag (i.e. &lt;tag&gt; ... &lt;/tag&gt;) constitute an STU that
carries meaningful unit of semantics intended by the tag. When the strings
contain tag structures inside, the STU is called nested. When analyzing STU values,
we extract STUs and their nested structure, called STU (Nest). In this
investigation, we examine &lt;UL&gt; &lt;OL&gt;, &lt;DL&gt;, &lt;TITLE&gt;, &lt;TABLE&gt; and &lt;BLOCKQUOTE&gt;.</p>
      <p>One special tag &lt;A HREF&gt; describes link specification of the Web page,
sometimes called a context. Here we replace the specification part by the contents of
the linked Web pages. This is called STU (Link).</p>
      <p>All the words appeared in these STUs are represented by using vector space
modeling. We should give index terms extracted from all the STUs. We take
frequencies of the string expressions and reduce the dimensionality by using
Zipf’s law. Similarity of two STUs is now defined as the cosine value as usual.</p>
      <p>We apply hierarchical algorithm to extract hierarchical relationship among
Web pages. By examining distances between two clusters and combining the
two of the minimum in a hierarchical manner, we get the hierarchical structures
among the clusters (we do this merge process many time until we get to one big
cluster). The results depend on the definition of distance (similarity) of clusters
and the initial set of Web pages, and we should think about how to construct
hierarchy. It is well-known that there are single linkage method, complete linkage
method and average linkage method to define distance (similarity) of clusters,
where we consider the distance of two clusters as the minimum, maximum and
average distance of STUs between two clusters respectively. The first method is
not suitable for our purpose since noisy STUs (accidental similarity) can’t be
avoided. By other two methods, we make up the hierarchy in this investigation.
Moreover, to overview the whole relationship, we define centroid STU of each
cluster as the nearest neighbor STU to the average values from the STU vectors,
and build up the hierarchy consisting of the center STUs. To put labels to
clusters, we introduce an option of centroids that are Representative sentence of the
clusters. Because centroid comes from the definition of the similarity, our method
could be seen as how to select the most important sentence as summarization.
EXAMPLE 34 Let us explain our approach by example. In figure 5, given two
Web pages and 5 words A,B,C,D and E within, we generate 5 STUs. Then we
obtain hierarchical clustering by average linkage method in figure 6. A figure 6
contains the details of the result by average linkage method. As shown in figure 5,
STU1 corresponds to a &lt;TITLE&gt; , STU2 to a &lt;UL&gt; and STU5 to a &lt;BLOCKQUOTE&gt;
without any nesting. On the other hand, STU4 contains both a word E quoted
by an &lt;A&gt; and a word C appeared in the linked Web page, STU3 contains STU3
as an inner structure. The similarity value is 1.000 which means it is generated
by STU3 and STU5.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluating Hierarchical Pages</title>
      <p>
        When we summarize a set of Web pages, we wonder the answers are really
correct or not. So is true for a case of machine translation. Even if we have typical
answers by hand and examine the difference between summarization results and
the typical answers, it is still hard for us to evaluate how well we get them[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
This is because the judgment depends on subjectivity. One approach has been
proposed where there should be agreement of the summarization results among
the participants[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This means that summarization by hand is not always
correct. In this section we discuss several evaluation methods of the automatic
summarization[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and put focus on the evaluation method of hierarchical topic
detection. We propose three measures to evaluate hierarchical expression of Web
pages, taht is, (1) Cluster readability, (2) Hierarchy readability and (3) Reading
comprehension. We call the evaluation method as CHR method by taking the
tree initial letters.
4.1
      </p>
      <sec id="sec-4-1">
        <title>Evaluating Summarization</title>
        <p>
          Generally, to evaluate summarization results, there have been proposed several
measures so far[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The first point is compression rate. This means the ratio
of length of sources and summarization. Clearly high compression rate doesn’t
contain all the contents of the source. The second is Informativeness, defined as
measure of quantity to see how many contents of the source are preserved. The
idea suggests that summarization should contain contents of source as a part.
The compression rate and the informativeness are trade-off with each other.
        </p>
        <p>Traditionally F-measure 3 has been proposed ans discussed. This measure
corresponds to the evaluation of informativeness in both clustering and
extracting. However we can’t utilize F-measure to our case since it is hard to obtain all
the answer in advance. Moreover, in the case of abstracting, it is difficult for us
to evaluate informativeness.</p>
        <p>Another approach comes from the viewpoints of how many word are
overlapped by using Dice coefficient and the cosine similarity. Unfortunately, in this
approach, we should resolve several issues of synonym, homonym and
construction information. For example, we should think about thesaurus, Latent
Semantic Indexing, syntactic analysis and conversation understanding of Natural
Languages.</p>
        <p>
          Let us put attention on the evaluation of readability and reading
comprehension. Readability means the measure of ”how well we can understand the
summarization results”. Mani[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] introduces some scoring techniques to dangling
anaphor and a context. By Reading comprehension we examine the intelligibility
of the reader who did read the summarization. Mani also introduces
sophisticated methods[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
4.2
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Evaluating Hierarchical Topic Detection</title>
        <p>
          What kinds of points should we pay attention on evaluation of the
summarization for hierarchical expression ? There have been proposed some techniques
for hierarchical topic detection, and let us discuss them as examples. The topic
detection and tracking (TDT) task contains topic detection where news articles
are organized into clusters which correspond to events (or topics)[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. There exist
two assumptions here about news articles. The first says that each news article
describes only one event and is assigned to only one cluster. The second says
3 F-measure combines recall (r) and precision (p) into one formula F = r2+rpp where
recall means ratio of retrieved documents to all the answer documents while precision
means correctness ratio.
that any hierarchical relationship among events should be ignored. As a result
of the assumptions, any topic detection system assigns an article to among the
non-hierarchical group. Obviously these assumptions don’t reflect actual
situation. A new Task Definition and Evaluation Plan of TDT 2004[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] has began for
the purpose of hierarchical topic detection task. The evaluation method used for
the old topic detection task is not suitable any more for this new task.
        </p>
        <p>
          In TDT 2004, Allan has proposed a new evaluation method for hierarchical
topic detection task[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This work has inspired many reserach activities.
Trieschnigg examines three methods based on the method by Allan[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Especially
they have revealed the fact that the detection cost isn’t proportional to ”power
set” by putting focus on false alarm cost and miss detection cost, where a power
set means to include all topics in a cluster. Fiscus and Doddington discusses a
fact that miss detection costs more than false alarm[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. To evaluate detection
cost, Allan has introduced 4 combinations of some aspects defined by system
output and clustering as in the table 1, where in cluster means an element is
properly put into a cluster of interests (called ”in correct cluster”) and relevant
means a system says the element is relevant to the cluster of interests. In the
Cmiss and Cfa are the costs of miss detection and false alarm respectively and
P(target) is the prior probability to obtain the target.
        </p>
        <p>
          Fiscus and Doddington discusses examines relationship between the costs
and prior probabilities, that is, in TDT, misses should be penalized much more
heavily than false alarms [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This is the reason they give Cmiss = 10 and
Cfa = 1. They assume a common value of P(target) for all topics based on
corpus statistics; they give 0.02 as the constant. Eventually we define TDT cost
function as:
        </p>
        <p>Cdet = 0.2Pmiss + 0.98Pfa
(4)
(5)
(6)
(7)</p>
        <p>Allan also have defined travel cost as the cost to find the most suitable cluster
of each topic from the root node. This cost consists of a detection cost and depth
to find the cluster. Here let us denote a depth from the root by D, then lst us
define travel cost as follow:</p>
        <p>EXAMPLE 41 Let us evaluate our example 34. We assume two clusters of
{1,2},{3,4} are correct clusters. ans we compare a merged cluster {4,5} with a
correct cluster {3,4}.</p>
        <p>R− is assigned to {3}, and N+ is assigned to {5}. The details of the
calculation of each cost are shown in the figure7. Then we obtain the minimal cost as
Cminimal = 0.5 in cluster {1,2}.
{4,5}
ispeonyumustt1t 2rra1enjultedvg3menrn14teolne-va5nt isoeunytumst1pt 2rrn2etjuledv3gamen4trn0eolne-v5ant isuoenytumst1pt 2rra2enjultedv3gmen4rn1teolne-v5ant iseounytum1stpt 2rra2enjulted3vgmen4rn3teolnev-5ant 0D de0pth
itnnoottal 21 53-2 itnnoottal 02 35-2 itnnoottal 20 52-2 itnnoottal 20 50-2 1 0.5
miss miss miss miss 2 1.0
= 1/2 =0.5 = 0/2 =0 = 0/3 =0 = 0/3 =0
f=a1/(5-2) =0.33 f=a0/(5-2) =0 f=a1/(5-2) =0.33 f=a3/(5-2) =1 1 2 3 4 5
0===CC=.0td9001r.e8...4a244t*2v*2203e033.l+3.513+ ====CC00td00r.+e.5a20tv*.e50l+ 0.98*0 ====CC00td00r..e..83a23t22v*233e03l++0..598*0.33 ===CC=00td0r..0e.99a.2t889v*e8+0l0+ 0.98*1 Fig. 8. Depth Calculation</p>
        <p>Fig. 7. Allan’s Calculation</p>
        <p>In Allan’s approach, we see several problems. One problem is that correct
clusters are not realy correct hierarchical clusters. For example, in a cluster
{1,2,3,4,5} after merged in the example above, a cluster {1,2} is computed as
relevant elements while a cluster {3,4,5} is computed as non-relevant elements,
which is not suitable since correct clusters of non-hierarchical clustering are
exclusive. It should be problematic to utilize correct non-hierarchical clusters
for hierarchical clustering.</p>
        <p>Another problem comes from Power set. We have Cmiss = 0 and Cfa = 1
when combining {1,2} and {3,4,5}. Let us note that Power set can’t be evaluated
in terms of false alarm cost. We obtain Cfa = N+/(270000 − r) if we have huge
number of elements. It is noted that, when the number r of the cluster element
is small (for instance, if a cluster has a deep path from root node), Power set
can’t be evaluated as the example.</p>
        <p>We propose a novel evaluation method using both correlation and hierarchy
aspects. Our basic idea is that hierarchical expression describes the relations
among the cluster by using correlation. We propose 3 kinds of evaluation methods
for cluster readability, hierarchy readability and reading comprehension.</p>
        <p>Let us discuss how to evaluate the readability of cluster. Cluster granularity
means ”the roughness of the cluster” but not ”size of clusters”. We say the
granularity is compact if cluster elements are similar to each other. Otherwise
we say the granularity is coarse. For instance, a cluster of 26 elements A − Z
and a cluster of 26 elements containing only A have the same size of cluster
but the latter cluster is easier to grasp contents. We believe strong relationship
between the readability of cluster and the granularity. Therefore, we evaluate
the granularity by means of Allan’s detection cost.</p>
        <p>Allan has proposed detection cost consisting of two measures Cfa and Cmiss.
By Cfa they evaluate how dissimilar elements in a cluster are with each other,
and by Cmiss they evaluate how many similar elements in a cluster they miss.</p>
        <p>According to the ideas, we introduce a notion of cluster granularity consisting
of inner and outer granularities. Correponded to Cfa by Allan, we define
innergranularity by Cin as follow:</p>
        <p>2
Cin = 1 − m(m − 1)
m m
∑ ∑ sim(xi, xj )
i=1 j=i
where m means the number of elements in a cluster, xk(k : 1 . . . m) means an
element in a cluster, and sim(xi, xj ) means similarity between the elements.
This cost means how coarse the cluster elements are related.</p>
        <p>Similarly we define outer-granularity by Cout as follows:
(8)
(9)
s s
Cout = ∑ ∑ sim(Cli, Clj )</p>
        <p>i=1 j=i
where s means the number of clusters, Clr(r : 1 . . . s) means a cluster, and
similarity between Cli and Clj clusters is denoted by sim(Cli, Clj ) based on
average-linkage method. This cost represents how similar with each other the
clusters are.</p>
        <p>Then we restate Cdet by the two costs, where Cdet means the readability of
cluster.</p>
        <p>Cdet = Cin + Cout
(10)
Note that the cost means how compact a cluster is and how similar the cluster
is to other ones.</p>
        <p>Let us introduce Cophenetic coefficient correlation for the purpose of
evaluation of hierarchical relationship. Let us show how we evaluate two hierarchical
clusters by similarity matrix.</p>
        <p>Assume that we have similarity 0.0 between the element 1 and 3 as shown in
the similarity matrix. When we combine {1,2} and {3} as in figure 9, we get the
similarity 0.8 between 1 and 3 based on single linkage. In a figure 10, we get the
similarity 0.4 by average linkage. In any cases, when two clusters are combined,
similarity should be changed, so that we can’t preserve the original similarity
matrix, and the value depends on what kind of linkage we have.</p>
        <p>The idea of cophenetic coefficient correlation provides us both to maintain
similarity matrix and to capture current situation of hierarchical clustering. We
can describe hierarchy by cophenetic matrix. Elements of similarity matrix x and
cophenetic matrix y are obtained by means of pearson product-moment
correlation coefficient, defined as follow:
rx,y = √{∑x2 − (1/n)(∑x)2}{∑y2 − (1/n)(∑y)2}</p>
        <p>∑xy − (1/n)(∑x)(∑y)</p>
        <p>This is called cophenetic coefficient correlation. If the value is close to 1,
there exists positive correlation, and if it is close to -1, there exists negative
correlation.</p>
        <p>As described, we evaluate reading comprehension by the length of a path
from the root to the most suitable clusters. Hierarchical expression contains
informativeness latently as much as sources, because the expression assigns all
STU’s to leaf nodes. Remember the reading comprehension increases whenever
clusters are shallow and informative.</p>
        <p>We can evaluate the situation by two costs Ctravel and Cminimal. Let D be
a depth from the root and define the costs as follows.
(11)
(12)
(13)
(14)
dep = D/max(D)</p>
        <p>Ctravel = Cdet + dep</p>
        <p>Cminimal = min(Ctravel)
EXAMPLE 42 Let us apply CHR method to our example. The table 3 contains
similarity matrix of example 34. Figure12 illustrates how cluster readability and
reading comprehension are calculated.</p>
        <p>Fig. 13. Hierarchical Summarization</p>
        <p>Fig. 12. Detail of Cost Calculation</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Experimental Results</title>
      <p>In this section we show some experimental results to see how effective our
approach works. Here we examine a test collection of Web pages, called NTCIR-3,
provided by NII Japan. Here we discuss the two experiments to examine our
algorithms to one of the clusters. In the first experiment, we compare our CHR
method to Allan’s method and examine the evaluation results. In the second
experiment, we compare several STU to see how well they play important roles
for hierarchical summarization.
5.1</p>
      <sec id="sec-5-1">
        <title>Preliminaries</title>
        <p>Before describing our results, let us review our previous work quickly. In this
experiment, NTCIR-3 collection contains many HTML and textual data in Japanese
internet (i.e., in .jp domain). We have selected 9,929 pages dated September 29
to October 5 in 2001 to be examined. We have applied the combination clustering
to the data and and we got 6 meaningful clusters.</p>
        <p>By putting our focus on a Cluster, let us discuss Hierarchical expression. A
figure 13 contains result of the hierarchical summarization generated by average
linkage where the digits in (..) and [..] show the number of STU elements
and the similarity at cluster merging.</p>
        <p>Let us examine each summarization result and corresponding cluster. There
is leaf node STU which correspond to Web page contained in this group. In fact,
”experiment equipments” appeared in a cluster 1 correspond to clusters C00/
C01/ C03/ C07/ C14/ C17/ C18 while ”a cause of death” corresponds to C08,
”free BBS” to C16. There exists an STU about ”Asahikawa astronomical club”,
and we can see topic drifting because many STUs about ”Asahikawa university”
arise in this cluster.</p>
        <p>A parent node describes the more abstract contents compared to the child
nodes. For example, the number of African total solar eclipse cluster (C12)
elements is thirty pages. Turkish total solar eclipse cluster (C11) contains 18 pages.
When combining them (C05), we get African total solar eclipse as a label. Then
a bigger cluster becomes an abstraction of those two. This means that the higher
nodes contain, the more STUs we have and the more frequent topics we see.</p>
        <p>It is common that the higher level STUs describe, the more general contents
we have. In fact, the highest level STUs are Asahikawa university and digital
filter lab which are the dominant pages.</p>
        <p>To put labels to clusters, we introduce a centroid that can be seen as a
representative sentence of the clusters. Because a centroid comes from the definition
of the frequency, our method provides us to extract the most important
sentence hierarchically. Also the centroid approach is consistent with hierarchical
structure of clusters.
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Experiment 1</title>
        <p>As shown in the previous subsection, we examine 1621 STUs from 35 clusters.
In this experiment, let us compare our CHR method with Allan’s. We calculate
costs defined by CHR method as well as Allan method, and compare Cin with
Cfa, and Cout with Cmiss.</p>
        <p>When two clusters C1, C2 are combined into one, say Cm, let us discuss
how many false alarm and errorneous element we have with respect to Allan’s
method, as shown in table4. In the table, when the two clusters C, G are put
together, we see Cmiss decreases. Generally Allan’s Cmiss never increases when
we combine clusters, because every article carries only one event (topic) and the
topic of the lowest Cmiss is chosen. For example, Cmiss cannot increase at step 3
in the table 4. However, when a cluster contains more than one topics, we should
calculate Cmiss over all topics, and the table 5 shows Cout increases according
to the CHR method.</p>
        <p>Allan’s cost Cfa depends on the number of cluster elements and the number
of all the elements. When clusters are combined, all the elements except elements
in a certain topic are classified as N+. This means Cfa increases very much when
we get a cluster of many elements. For example, in the table 4, Cfa at step1 and
step2 are small, but Cfa increases in the step 3 drastically. On the other hand,
CHR method allows us to calculate granularity and the costs come based on
similarity between topics in a cluster but not heavily on the number of elements.
In our case, Cin doesn’t increase at step 3 drastically as in the table 5.</p>
        <p>We show the average of Cfa and Cmiss in each depth in figure 14, to compare
Cout with Cmiss, and Cin with Cout. The depth at which two costs are the
smallest (by CHR method) corresponds to the one by the Allan’s method as
shown in figure15. However, in figure14, CHR method doesn’t correspond to
Allan’s method at all. Allan gives some weight with some sort of approximation
to come close to ideal detection cost. By CHR method we get similar results to
Allan, but with no weight and no correct clusters given in advance.</p>
        <p>Fig. 15. Detection Cost and Travel Cost
Our next experiment concerns on a fact that links and nesting are really effective
to obtain better summarization. We examine how STU(Nest) and STU(Link) are
useful. In this experiment, we generate STUs by means of 4 types of constructs:
STU, STU(Nest), STU(Link), and STU(Nest+Link), which mean that we have
STU without any nesting, with nesting, with link specification but not nesting,
and STU with nesting and link. We obtain yˆ in such a way that:
This equation means that yˆ is predominant in Cdet. Ctravel(u) and cophenetic(v)
are defined similarly.</p>
        <p>yˆ = argminy∈Y (Cdet(y))
uˆ = argminu∈Y (Ctravel(u))
vˆ = argmaxv∈Y (cophenetic(v))
(15)
(16)
(17)</p>
        <p>We examine the three argmin/argmax values by means of one of the 4
methods, the results are shown in the figure16.</p>
        <p>We see STU(Nest) wins the best result in figure16. We found some STUs
appear also in STU(nest), in this case, we must have compact granularity. Let
us illustrate the results of cophenetic correlation coefficient in table 17. Here
STU(Link) shows the best results, this means referenced pages tells us main
topics generally. When we replace links by representative contents, we should
have heavy change of centroids. This means STU(Link) is one of the key factors
which improves cophenetic coefficient correlation.</p>
        <p>By looking at these results, we conclude two aspects as follows. The first
is that STU(Nest) plays important role on both detection cost and travel cost.
Second is that STU(Link) improves cophenetic coefficient correlation essentially.
Because STU(Nest+Link) carries these aspects, we can say CHR method is
excellent.
In this investigation, we have proposed a new technique to summarize contents
of Web pages in a form of hierarchical expression. And then, we have proposed a
novel evaluation method for the expression. Experiments show the good results
and we can say the approach is promising.</p>
        <p>Our approach starts with decomposition of pages into STUs and application
of hierarchical clustering. An automated procedure can be implemented easily.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Allan</surname>
          </string-name>
          , J., Carbonell, J.,
          <string-name>
            <surname>Doddington</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamron</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Topic detection and tracking pilot study: Final report</article-title>
          ,
          <source>the DARPA Broadcast News Transcription and Understanding Workshop</source>
          ,
          <year>1998</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Allan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bolivar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Flexible Intrinsic Evaluation of Hierarchical Clustering for TDT, the 12th international Conference on Information and Knowledge Management (CIKM</article-title>
          ),
          <year>2003</year>
          , pp.
          <fpage>263</fpage>
          -
          <lpage>270</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Buyukkokten</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>and</article-title>
          <string-name>
            <surname>Paepcke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Seeing the Whole in Parts:Text Summarization for Web Browsing on Handheld Devices</article-title>
          ,
          <source>In Proceedings International WWW Conference</source>
          ,
          <year>2001</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chakrabat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Mining the Web, Morgan Kaufmann, 2003
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fiscus</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          ,G. and
          <string-name>
            <surname>Doddington</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          :
          <article-title>Topic detection and tracking evaluation overview</article-title>
          ,
          <source>Event-based Information Organization</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>31</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murty</surname>
            ,
            <given-names>M. N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Flynn</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          :
          <article-title>Data clustering: a review</article-title>
          ,
          <source>ACM Computing Surveys</source>
          ,
          <fpage>31</fpage>
          -
          <lpage>3</lpage>
          ,
          <year>1999</year>
          , pp.
          <fpage>264</fpage>
          -
          <lpage>323</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mani</surname>
            ,
            <given-names>I. : Automatic</given-names>
          </string-name>
          <string-name>
            <surname>Summarization</surname>
          </string-name>
          , John Benjamins, USA,
          <year>2001</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miura</surname>
          </string-name>
          ,
          <source>T and Shioya, I.: Extracting Events From Web Pages, International Conference on Advances in Intelligent Systems - Theory and Applications</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miura</surname>
          </string-name>
          , T and Shioya, I.:
          <article-title>Abstracting Temporal Clusters, Internet Technologies and Applications (ITA</article-title>
          ),
          <year>2005</year>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yukio</surname>
            <given-names>Ohsawa</given-names>
          </string-name>
          , Nels E. Benson and Masahiko Yachida:
          <article-title>KeyGraph: Automatic Indexing by Co-occurrence Graph based on Building Construction Metaphor Proc</article-title>
          .
          <source>Advanced Digital Library Conference (IEEE ADL'98)</source>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>1998</year>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jing</surname>
            , H. and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Budzikowska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Centroid-based summarization of multiple documents: sentence extraction</article-title>
          ,
          <source>Information Processing and Management</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>919</fpage>
          -
          <lpage>938</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Takahashi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miura</surname>
          </string-name>
          , T and Shioya, I.:
          <article-title>Combination Clustering for Web Correlation</article-title>
          ,
          <source>IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM)</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>434</fpage>
          -
          <lpage>437</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Takahashi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miura</surname>
          </string-name>
          , T and Shioya, I.:
          <article-title>Hierarchical Summarization of Web Pages</article-title>
          ,
          <source>IADIS Applied Computing</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>612</fpage>
          -
          <lpage>617</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Trieschnigg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kraaij</surname>
          </string-name>
          , W.:
          <article-title>Scalable Hiearachical Topic Detection</article-title>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          ,
          <year>2005</year>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Zamir</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Web Document Clustering: A Feasibility Demonstration, the 21st</article-title>
          <source>International ACM SIGIR Conference</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>