Hierarchical Summarizing and Evaluating for
Web Pages
Kou TAKAHASHI1 , Takao MIURA1 , and Isamu SHIOYA2
1
HOSEI University, Kajinocho 3-7-2, Koganei, Tokyo, JAPAN
2
Sannno University,Kamikasuya 1563, Isehara, Kanagawa, JAPAN
Abstract. In this investigation we propose a novel summarization method
of Web pages using hierarchical expression. We discuss close relationship
between summarization and hierarchical clustering to obtain the results,
and we examine how to evaluate hierarchical summarization based on
both correlation and structural aspects. We describe some experimental
results using NTCIR Web documents to examine our method.
1 Introduction
Nowadays we face to huge amount of Web pages which keep growing everyday.
Whenever we like to know current situation of interests, we can examine easily
what’s going now all over the world through these pages. However, at the same
time, such information flooding makes it impossible for us to grasp the contents
quickly and intuitively. That’s reason why much attention has been paid on how
to grasp and summarize the contents quickly [7, 4]. To attack these issues, there
have been several approach proposed, and among others, we have two important
techniques, clustering and summarization[12, 4].
Clustering means a method to put objects into several groups in such a way
that objects in a group are similar with each other while objects in distinct groups
are not [6]. In other words, clustering depends on both definition of similarity and
algorithm by which we can extract interesting patterns that are hidden. Most
information in Web pages are categorical (say, we see ”Computer Science”, ”Biol-
ogy”, ”Mathematics”) but not numerical so that we interpret hardly the notions
of metric or order. This is the true reason traditional clustering techniques are
not well-suited.
Summarization provides us with presentation of important contents of infor-
mation source in a simplified way by which we can grasp quickly. In investigation
of automatic summarization, two major aspects have been discussed, extraction
and abstraction. To extract important part from the contents, we put weights
such as frequency to words or paragraphs, and score them. Eventually we select
sentences with high score order. On the other hand, to abstract contents, we
should analyze them in terms of Natural Language Processing (NLP) or Infor-
mation Retrieval (IR), thus few approaches has been proposed so far.
In this work, we propose a new kind of Web summarization technique based
on hierarchical clustering. there are two kinds of ideas, Web page clustering and
Web page summarization. We put our attention on frequency and co-occurrence
of both index words and hyper-links appeared in the Web pages. We decom-
pose the page contents into semantic textual units (STU) as described later.
With the help of hierarchical clustering among them, we extract some sort of
structure among collections of STUs so that we can consider the structure as
the abstraction of the contents in a form of hierarchy. We obtain a center STU
from each cluster, thus we can represent the summarization in terms of structure
among STUs.
We must discuss how to evaluate the hierarchical summarization. Generally
evaluation is difficult as to both clustering and summarization, This is reason
because correct output can’t be defined. In this investigation we propose a novel
evaluation method, from view points readability of both clusters and hierarchi-
cal structures as well as reading comprehension. We evaluate the hierarchical
summarization in terms of penalty, called cost.
In section 2 we introduce several issues of automatic summarization and
application to Web, and we discuss a new technique how to summarization the
contents in a hierarchical manner in section 3. In section 4 we propose a novel
evaluation method of hierarchical expression. We show some experimental results
in section 5, and conclude our work in section 6.
2 Automatic Summarization
In this section we review automatic summarization putting our attention on
Web pages[7]. Generally summarization means a process to distill most impor-
tant information from document sources for particular users and tasks. The
automatic summarization for particular users is used for contents grasping of
documents(such as news articles). There are 3 kinds of the techniques proposed,
Extracting, Abstracting and Clustering.
Extracting means identifying the most relevant parts to the main theme ap-
peared within the document. This approach is advantageous because very often
statistical techniques such as frequency and correlation correspond to the rel-
evance or importance. Here we don’t need any background knowledge of NLP,
and we could automate the process easily and efficiently. On the other hand, the
results can be lack of coherence because of dangling anaphor such as pronoun
and conjunction [7].
Abstracting means putting text objects into more general and super-ordinate
concepts so that the results don’t contain the expressions not appeared in the
objects. The approach is advantageous because there should be coherent in the
results, and better compression rate could be expected. On the other hand, we
should analyze what documents mean to generalize context to some extent and
NLP or IR technique should be required [7].
Clustering means collecting objects into several groups by using a certain
similarity. We could extract underlying semantics from the objects. But we need
other techniques such as labeling groups to interpret the results. The approach
is advantageous because we can apply clustering easily and efficiently to objects
without any NLP knowledge, but is difficult to define similarity and interpret
the results, although several labeling have been proposed [6, 8].
As for some applications of automatic summarization for Web pages, extract-
ing has been discussed for the purpose of hand-held device [3]. The technique is
indispensable since the display is really small for Web browsing. Web pages are
LATEX style file for Lecture Notes in Computer Science – documentation 3
divided into small units, called semantic textual unit (STU), where each STU is
defined as a unit of meaningful contents, usually corresponded to paragraphs or
caption parts. Every first line is sent for browsing purpose. For deeply reading,
users may ask of first 3 lines or whole lines.
A different approach comes from Web clustering, since each cluster corre-
sponds to closely related collection of Web pages with each other. Labeling clus-
ters corresponds to summarizing the page contents, because the labels represent
what the contents of the cluster mean. Very often frequent words could be seen
as the labels of the pages [7]. But because there are strong similarities among the
pages obtained by search engines, the frequency can’t distinguish one from oth-
ers [8]. Better approach could come from important words rather than frequent
words. With these words we could see what’s going on quickly. Moreover, there
has been pointed out that labels consisting of sequences rather than the ones of
words be helpful to grasp the contents. A typical approach as Suffix Tree Clus-
tering (STC)[15] where sequences of words are used. However, STC is based on
heavy assumption that we can use a set of documents which are closely related
to each other(for example, hit list by search engine). Some investigation has pro-
posed to extract important words based on Key-Graph [10] and word-sequences
based on Suffix Tree [15], and then combined both results to abstract the pages
[9]. However, the results depend on temporal aspects as well as clustering.
3 Summarizing Pages Hierarchically
Web documents consist of tag parts and link specifications as well as textual
parts. We examine relationship over Web pages, and propose structuring as a
method of novel automatic summarization. By structuring objects, we mean
a technique to restate documents in terms of data structure. Since any data
structure carries its own meaning, we use specific structure concisely and clearly
as a substitute for parts of documents, thus we understand what the information
does imply. Note that any data structure can be applied to any situation because
they are polymorphic without any NLP knowledge.
However what kinds of structures are suitable for our situation and how to
organize our documents appropriately ? What we work with are either words
(minimum lexical units) or sentences (minimum semantic units). In the former
case, we can examine fine semantics but hardly grasp global view because of
detail relationship among words. On the other hand, in the latter case, it seems
easier to examine relationship among sentences although outlines depend on syn-
tax. Frequency and their correlation help us to obtain important parts for words,
while distance between sentences and centroids of document clusters could show
us what parts should be extracted. Thus, to grasp the contents of documents,
important words and the relationship among them or document clusters play
important role. There can be several expressions to describe semantic structure
of the contents. Possible examples are directed graph and trees, since we expect
abstraction mechanism to represent a variety of levels of the contents. In a case
of tree, we think nodes in higher levels correspond to overview/global viewpoint
while nodes in lower levels to detail/local aspects. One of these techniques is Key
Graph [10] by which the relationship can be modeled by means of graphs. How-
ever, there is no mechanism to interpret the contents from several abstraction
levels of viewpoints.
In this investigation, we propose a novel technique for Web page summariza-
tion. One of the main issues is how to make groups to a huge amount of Web
pages. It is well-known that naive clustering of Web pages generates a few gi-
gantic clusters and many trivial clusters so that clusters provide us with nothing
new knowledge. However, we have already discussed a sophisticated clustering
method to Web documents based on correlation of hyperlinks with naive clus-
tering under vector space modeling [12]. With these results, let us go one step
further. In this work, we apply hierarchical clustering to each group of Web pages
using STUs, and we discuss a new ”labeling” method to each group by taking a
centroid. Relationship among the clusters corresponds to the hierarchical struc-
ture as shown in the figure 1.
Document 2 Document 4 STU 1
Sentence 1 Sentence 1 .
Sentence 2 Sentence 2
. STU 1
Document 3 .
Document 1
Sentence n Sentence n . STU 1
Sentence 1
Sentence 1
Sentence 2
Sentence 2
STU 1 STU 2 STU 3
Sentence n STU n
Sentence n
Fig. 1. Overview
3.1 Combination Clustering
In this section and the next section, we discuss how to make hierarchical sum-
marization of Web pages. The process consists of two steps. The first step is that
we put a collection of Web pages into disjoint groups by combining two kinds
of clustering results, called combination clustering. The second step is that we
summarize each group. The final results are a set of hierarchical expressions
which can be seen as summarization of the Web page groups.
Our technique of combination clustering is made based on an idea that similar
documents share many hyperlinks. We extract characteristic hyperlinks as well
as index words to distinguish from each other. It is shown that the approach
is effective to obtain fruitful results[12]. Here let us explain the outline of this
approach by examples. Basically we put attention on co-occurrence of hyperlinks.
We call this approach LINK clustering. Then we extract index terms, build the
vector space and make clustering them. This traditional approach is called VSM
clustering. Then we combine the two results into one to obtain new clustering
of Web pages. Each group is not only characterized by some topic but also
connected tightly with each other in a sense of contents.
EXAMPLE 31 Let us show LINK clustering by an example. By interpreting
nodes as Web pages and arcs as hyperlinks, we can represent Web pages by the
graph, the number of references by in-degree and so on. Suppose there are 6 nodes
a1 , ..., a6 as in figure 2. Let F rom(a) be a set of nodes staring from a, called a
set of out-arcs of a, its cardinality is called out-degree of a. Similarly, the set of
arcs with the ending node b, T o(b) is called a set of in-arcs of b, its cardinality is
LATEX style file for Lecture Notes in Computer Science – documentation 5
b1 b2 b3 (1,0,0,0,0) (0,0,1,1,1) (0,0,1,1,0)
a1 a2 a3
a1 a2 a3 b4
a5 a4 a5 a6
a4 a6
(1,1,0,0,0) (0,0,0,1,0) (0,0,1,1,1)
(a) LINK clustering
(b) VSM clustering
Fig. 2. LINK Clustering Fig. 3. VSM Clustering Fig. 4. Combination Clusters
called in-degree of b. We apply complete link hierarchical clustering. This process
is called LINK clustering. we get two LINK clusters A1 and A2 :A1=a1 , a2, a4
, A2 = a3,a6 As for a node a5, we consider a cluster of singleton and remove
this.
EXAMPLE 32 VSM clustering is nothing but a clustering by document vec-
tors. For example, given 6 Web pages P = {a1 , · · · , a6 } with the document vec-
tors in figure 3, We apply complete link hierarchical clustering. This approach is
called VSM clustering, and each cluster is called VSM cluster. We get 2 clusters
B1 , B2 : B1 = {a1 , a4 }, B2 = {a2 , a3 , a5 , a6 } Then let us combine the two re-
sults. In figure 4, two ovals represent LINK clusters A1 , A2 while two rectangles
describe VSM clusters B1 , B2 .
EXAMPLE 33 In example31 and 32, we obtain two combined clusters C11 =
{a1 , a4 } and C22 {a3 , a6 } but we discard small clusters C12 = {a2 } and C02 =
{a5 } so we got the final results of two groups C11 , C22 .
3.2 Hierarchical Expression
Let us discuss technique of structuring each group of Web pages[13]. We assume
sets of Web pages through combination clustering. Web pages are divided into
small units, called Semantic Textual Unit (STU), which are defined as a unit
of meaningful contents to capture intended semantics of paragraphs or caption
parts (in, say, alt tag) of Web pages. The similar approach is found in [3].
Well-formed Web pages contain strings parts and tag parts in a nested man-
ner. Any strings quoted by a tag (i.e.