Hierarchical Summarizing and Evaluating for Web Pages Kou TAKAHASHI1 , Takao MIURA1 , and Isamu SHIOYA2 1 HOSEI University, Kajinocho 3-7-2, Koganei, Tokyo, JAPAN 2 Sannno University,Kamikasuya 1563, Isehara, Kanagawa, JAPAN Abstract. In this investigation we propose a novel summarization method of Web pages using hierarchical expression. We discuss close relationship between summarization and hierarchical clustering to obtain the results, and we examine how to evaluate hierarchical summarization based on both correlation and structural aspects. We describe some experimental results using NTCIR Web documents to examine our method. 1 Introduction Nowadays we face to huge amount of Web pages which keep growing everyday. Whenever we like to know current situation of interests, we can examine easily what’s going now all over the world through these pages. However, at the same time, such information flooding makes it impossible for us to grasp the contents quickly and intuitively. That’s reason why much attention has been paid on how to grasp and summarize the contents quickly [7, 4]. To attack these issues, there have been several approach proposed, and among others, we have two important techniques, clustering and summarization[12, 4]. Clustering means a method to put objects into several groups in such a way that objects in a group are similar with each other while objects in distinct groups are not [6]. In other words, clustering depends on both definition of similarity and algorithm by which we can extract interesting patterns that are hidden. Most information in Web pages are categorical (say, we see ”Computer Science”, ”Biol- ogy”, ”Mathematics”) but not numerical so that we interpret hardly the notions of metric or order. This is the true reason traditional clustering techniques are not well-suited. Summarization provides us with presentation of important contents of infor- mation source in a simplified way by which we can grasp quickly. In investigation of automatic summarization, two major aspects have been discussed, extraction and abstraction. To extract important part from the contents, we put weights such as frequency to words or paragraphs, and score them. Eventually we select sentences with high score order. On the other hand, to abstract contents, we should analyze them in terms of Natural Language Processing (NLP) or Infor- mation Retrieval (IR), thus few approaches has been proposed so far. In this work, we propose a new kind of Web summarization technique based on hierarchical clustering. there are two kinds of ideas, Web page clustering and Web page summarization. We put our attention on frequency and co-occurrence of both index words and hyper-links appeared in the Web pages. We decom- pose the page contents into semantic textual units (STU) as described later. With the help of hierarchical clustering among them, we extract some sort of structure among collections of STUs so that we can consider the structure as the abstraction of the contents in a form of hierarchy. We obtain a center STU from each cluster, thus we can represent the summarization in terms of structure among STUs. We must discuss how to evaluate the hierarchical summarization. Generally evaluation is difficult as to both clustering and summarization, This is reason because correct output can’t be defined. In this investigation we propose a novel evaluation method, from view points readability of both clusters and hierarchi- cal structures as well as reading comprehension. We evaluate the hierarchical summarization in terms of penalty, called cost. In section 2 we introduce several issues of automatic summarization and application to Web, and we discuss a new technique how to summarization the contents in a hierarchical manner in section 3. In section 4 we propose a novel evaluation method of hierarchical expression. We show some experimental results in section 5, and conclude our work in section 6. 2 Automatic Summarization In this section we review automatic summarization putting our attention on Web pages[7]. Generally summarization means a process to distill most impor- tant information from document sources for particular users and tasks. The automatic summarization for particular users is used for contents grasping of documents(such as news articles). There are 3 kinds of the techniques proposed, Extracting, Abstracting and Clustering. Extracting means identifying the most relevant parts to the main theme ap- peared within the document. This approach is advantageous because very often statistical techniques such as frequency and correlation correspond to the rel- evance or importance. Here we don’t need any background knowledge of NLP, and we could automate the process easily and efficiently. On the other hand, the results can be lack of coherence because of dangling anaphor such as pronoun and conjunction [7]. Abstracting means putting text objects into more general and super-ordinate concepts so that the results don’t contain the expressions not appeared in the objects. The approach is advantageous because there should be coherent in the results, and better compression rate could be expected. On the other hand, we should analyze what documents mean to generalize context to some extent and NLP or IR technique should be required [7]. Clustering means collecting objects into several groups by using a certain similarity. We could extract underlying semantics from the objects. But we need other techniques such as labeling groups to interpret the results. The approach is advantageous because we can apply clustering easily and efficiently to objects without any NLP knowledge, but is difficult to define similarity and interpret the results, although several labeling have been proposed [6, 8]. As for some applications of automatic summarization for Web pages, extract- ing has been discussed for the purpose of hand-held device [3]. The technique is indispensable since the display is really small for Web browsing. Web pages are LATEX style file for Lecture Notes in Computer Science – documentation 3 divided into small units, called semantic textual unit (STU), where each STU is defined as a unit of meaningful contents, usually corresponded to paragraphs or caption parts. Every first line is sent for browsing purpose. For deeply reading, users may ask of first 3 lines or whole lines. A different approach comes from Web clustering, since each cluster corre- sponds to closely related collection of Web pages with each other. Labeling clus- ters corresponds to summarizing the page contents, because the labels represent what the contents of the cluster mean. Very often frequent words could be seen as the labels of the pages [7]. But because there are strong similarities among the pages obtained by search engines, the frequency can’t distinguish one from oth- ers [8]. Better approach could come from important words rather than frequent words. With these words we could see what’s going on quickly. Moreover, there has been pointed out that labels consisting of sequences rather than the ones of words be helpful to grasp the contents. A typical approach as Suffix Tree Clus- tering (STC)[15] where sequences of words are used. However, STC is based on heavy assumption that we can use a set of documents which are closely related to each other(for example, hit list by search engine). Some investigation has pro- posed to extract important words based on Key-Graph [10] and word-sequences based on Suffix Tree [15], and then combined both results to abstract the pages [9]. However, the results depend on temporal aspects as well as clustering. 3 Summarizing Pages Hierarchically Web documents consist of tag parts and link specifications as well as textual parts. We examine relationship over Web pages, and propose structuring as a method of novel automatic summarization. By structuring objects, we mean a technique to restate documents in terms of data structure. Since any data structure carries its own meaning, we use specific structure concisely and clearly as a substitute for parts of documents, thus we understand what the information does imply. Note that any data structure can be applied to any situation because they are polymorphic without any NLP knowledge. However what kinds of structures are suitable for our situation and how to organize our documents appropriately ? What we work with are either words (minimum lexical units) or sentences (minimum semantic units). In the former case, we can examine fine semantics but hardly grasp global view because of detail relationship among words. On the other hand, in the latter case, it seems easier to examine relationship among sentences although outlines depend on syn- tax. Frequency and their correlation help us to obtain important parts for words, while distance between sentences and centroids of document clusters could show us what parts should be extracted. Thus, to grasp the contents of documents, important words and the relationship among them or document clusters play important role. There can be several expressions to describe semantic structure of the contents. Possible examples are directed graph and trees, since we expect abstraction mechanism to represent a variety of levels of the contents. In a case of tree, we think nodes in higher levels correspond to overview/global viewpoint while nodes in lower levels to detail/local aspects. One of these techniques is Key Graph [10] by which the relationship can be modeled by means of graphs. How- ever, there is no mechanism to interpret the contents from several abstraction levels of viewpoints. In this investigation, we propose a novel technique for Web page summariza- tion. One of the main issues is how to make groups to a huge amount of Web pages. It is well-known that naive clustering of Web pages generates a few gi- gantic clusters and many trivial clusters so that clusters provide us with nothing new knowledge. However, we have already discussed a sophisticated clustering method to Web documents based on correlation of hyperlinks with naive clus- tering under vector space modeling [12]. With these results, let us go one step further. In this work, we apply hierarchical clustering to each group of Web pages using STUs, and we discuss a new ”labeling” method to each group by taking a centroid. Relationship among the clusters corresponds to the hierarchical struc- ture as shown in the figure 1. Document 2 Document 4 STU 1 Sentence 1 Sentence 1 . Sentence 2 Sentence 2 . STU 1 Document 3 . Document 1 Sentence n Sentence n . STU 1 Sentence 1 Sentence 1 Sentence 2 Sentence 2 STU 1 STU 2 STU 3 Sentence n STU n Sentence n Fig. 1. Overview 3.1 Combination Clustering In this section and the next section, we discuss how to make hierarchical sum- marization of Web pages. The process consists of two steps. The first step is that we put a collection of Web pages into disjoint groups by combining two kinds of clustering results, called combination clustering. The second step is that we summarize each group. The final results are a set of hierarchical expressions which can be seen as summarization of the Web page groups. Our technique of combination clustering is made based on an idea that similar documents share many hyperlinks. We extract characteristic hyperlinks as well as index words to distinguish from each other. It is shown that the approach is effective to obtain fruitful results[12]. Here let us explain the outline of this approach by examples. Basically we put attention on co-occurrence of hyperlinks. We call this approach LINK clustering. Then we extract index terms, build the vector space and make clustering them. This traditional approach is called VSM clustering. Then we combine the two results into one to obtain new clustering of Web pages. Each group is not only characterized by some topic but also connected tightly with each other in a sense of contents. EXAMPLE 31 Let us show LINK clustering by an example. By interpreting nodes as Web pages and arcs as hyperlinks, we can represent Web pages by the graph, the number of references by in-degree and so on. Suppose there are 6 nodes a1 , ..., a6 as in figure 2. Let F rom(a) be a set of nodes staring from a, called a set of out-arcs of a, its cardinality is called out-degree of a. Similarly, the set of arcs with the ending node b, T o(b) is called a set of in-arcs of b, its cardinality is LATEX style file for Lecture Notes in Computer Science – documentation 5 b1 b2 b3 (1,0,0,0,0) (0,0,1,1,1) (0,0,1,1,0) a1 a2 a3 a1 a2 a3 b4 a5 a4 a5 a6 a4 a6 (1,1,0,0,0) (0,0,0,1,0) (0,0,1,1,1) (a) LINK clustering (b) VSM clustering Fig. 2. LINK Clustering Fig. 3. VSM Clustering Fig. 4. Combination Clusters called in-degree of b. We apply complete link hierarchical clustering. This process is called LINK clustering. we get two LINK clusters A1 and A2 :A1=a1 , a2, a4 , A2 = a3,a6 As for a node a5, we consider a cluster of singleton and remove this. EXAMPLE 32 VSM clustering is nothing but a clustering by document vec- tors. For example, given 6 Web pages P = {a1 , · · · , a6 } with the document vec- tors in figure 3, We apply complete link hierarchical clustering. This approach is called VSM clustering, and each cluster is called VSM cluster. We get 2 clusters B1 , B2 : B1 = {a1 , a4 }, B2 = {a2 , a3 , a5 , a6 } Then let us combine the two re- sults. In figure 4, two ovals represent LINK clusters A1 , A2 while two rectangles describe VSM clusters B1 , B2 . EXAMPLE 33 In example31 and 32, we obtain two combined clusters C11 = {a1 , a4 } and C22 {a3 , a6 } but we discard small clusters C12 = {a2 } and C02 = {a5 } so we got the final results of two groups C11 , C22 . 3.2 Hierarchical Expression Let us discuss technique of structuring each group of Web pages[13]. We assume sets of Web pages through combination clustering. Web pages are divided into small units, called Semantic Textual Unit (STU), which are defined as a unit of meaningful contents to capture intended semantics of paragraphs or caption parts (in, say, alt tag) of Web pages. The similar approach is found in [3]. Well-formed Web pages contain strings parts and tag parts in a nested man- ner. Any strings quoted by a tag (i.e. ... ) constitute an STU that carries meaningful unit of semantics intended by the tag. When the strings con- tain tag structures inside, the STU is called nested. When analyzing STU values, we extract STUs and their nested structure, called STU (Nest). In this investi- gation, we examine