STRUCTURING THE WEB TO COPE WITH DYNAMIC CHANGES

Wookey Lee (1) and Jinho Kim (2)
(1) Dept. of Computer Science, Sungkyul University, 147-2 Anyang 8 Dong, Anyang, Korea
(2) Dept. of Computer Science, Kangwon National University, 192-1 Hyoja Dong 2, Chunchon, Kangwon, Korea

Abstract: The web structure yields significant insights into web algorithms for searching, structuring, discovering, mining, and revealing web information. We formalize our view of the web structure in terms of integer linear programming that converts the web directed graph into the optimal hierarchical structure. The model represents a high-level structure regardless of the particular weight measure used, such as the cosine similarity and tf-idf measures of the vector space model as well as Google's PageRank. Another advantage of our approach is that the corresponding sensitivity analysis yields an allowable range for the optimal structure, so that the optimal structure can be maintained even though dynamic changes take place in the web pages, links, and structures.

Key words: web structuring, linear programming, VSM, tf-idf, PageRank

1. INTRODUCTION

The WWW can be viewed as a hierarchy of web objects. The WWW can be seen as a set of web sites, and a web site can be seen as a digraph with web nodes and arcs, where the web nodes correspond to HTML files holding page contents and the arcs correspond to hypertext links interconnecting the web pages. The web-as-a-graph approach can be a starting point to generate a structure that can be used by web site designers, search engines, web crawling robots, and web marketers and analysts.

One of the typical examples of web site structuring is tree modeling, such as breadth-first search (BFS), depth-first search (DFS), web catalogues, or site maps [1, 5, 15]. Complex and static web abstractions, however, do little to help a web designer or a web crawler that wants to model a web site, and they also often pose navigational hazards. Structural abstraction is useful in organizing information and reducing the number of alternatives that must be considered at any one time [10].

The BFS, including the backlink count method and PageRank [5, 7, 9], has the advantage that an 'important' web page can easily be accessed by clicking relatively few steps from the default page. It is easy to reduce a graph to a tree in this way and to keep the depth at which each web page is accessed small. Thus, the BFS can statistically minimize the total access time, but structurally it is extremely flat because almost all pages stick to the root node. On the other hand, the DFS is an easy method to adopt and, from a cognitive science point of view, its search order is similar to the browsing behavior of a human user. But the DFS is inappropriate for structuring the web, because a long series of web pages requires as many clicks as there are pages, which in turn entails massive time consumption to access each page. A topological ordering algorithm [2, 14] converts a web site structure to an unbiased hierarchical form that minimizes the average distance from the root node to a specific web page; the semantic relevance between the web nodes is also considered. Its weakness is that when there are minor changes in a web page's weight or in the link structure, the entire structure needs to be reorganized.
Corresponding to the modification of the web structure, there are some intriguing findings: most pairs of pages on the web are separated by only a handful of links, on average fewer than about 20, and this number grows logarithmically with the size of the web [2, 25]. We formalize our view of the web structure in terms of integer programming that converts the web directed graph to the optimal hierarchical structure. The structure of a web site can prevent search robots or crawling agents from getting lost amid the huge number of web pages in a web site. Therefore, in this paper, we introduce an attribute called the weight to evaluate the significance of web pages.

This paper is organized as follows: In Section 2, we present the data model of web sites, the web schema, and conventional approaches. In Section 3, we discuss several weight measures endowed on web nodes. In Section 4, we discuss the integer linear programming model of the web structure. In Section 5, we analyze the optimal structure and how it copes with changes, and in Section 6 we present experimental results for an example of our model that demonstrate its robustness. Finally, we conclude the paper.

2. WEB SCHEMA AND SIMILARITY MEASURES

2.1 The Web Schema

A web site can be defined by a set of web nodes N_w = {N_1, ..., N_n}, a directed graph G_w = (N_w, E_w), and an arc function x_ij : E_w → {0, 1}, ∀(i, j) ∈ E_w, where N_w is a finite web node set, E_w is a finite web arc set of ordered pairs of web nodes with elements (i, j), i, j ∈ {0, 1, 2, 3, ..., n-1}, and n = |N_w| is the cardinality of the web pages. There is a mapping system between the nodes and web pages and between the arcs and Uniform Resource Identifiers [4, 5]. The web node (N_w) can be defined as follows:

N_w = [N_i, {(i, j) ∈ E_w}, w_i]   (2-1)

where N_i represents a web node corresponding to an HTML file (we use i as the node identifier), the homepage is defined as the default page (index.html) predetermined by the web server, {(i, j), ∀(i, j) ∈ E_w} is the set of web arcs carrying the hypertext links to which the web page points, and w_i is the weight of the node.

There are three approaches to investigating the web digraph domain: (1) the whole web [1, 2, 12], (2) a set of strongly connected components [9], and (3) a single web site [14, 17]. The first is utilized to measure the overall size or growth rate, but it is not appropriate for deriving a web structure. The second focuses on a mathematical model, and the third is inclined to the practical side. We adopt the third for implementation's sake, where the homepage is the default page predetermined by the web server and the other pages are interconnected with one another.

2.2 How to Measure the Web Nodes?

In order to generate the web structure, we introduce representative weight measures: the cosine measure and tf-idf measure from the Vector Space Model (VSM), and PageRank. First, most web ranking algorithms utilize a similarity measure in terms of the VSM, which has been studied extensively in the information retrieval community [6, 8, 13]. In the VSM, to compute the similarities among a set of web pages, each page can be viewed as an m-dimensional term vector. The cosine similarity [13] between a query, Q, and a web page, N_i, can then be defined as the normalized inner product of the two vectors.
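To make this concrete, the following minimal Python sketch computes the cosine weight between a query vector and page vectors, as formalized later in equation (3.1); the toy term vectors and page identifiers are invented for illustration and are not part of the model.

    import math

    def cosine_weight(query, page):
        # Inner product of the two vectors divided by the product of
        # their Euclidean norms, as in the cosine measure of the VSM.
        dot = sum(q * a for q, a in zip(query, page))
        norm_q = math.sqrt(sum(q * q for q in query))
        norm_a = math.sqrt(sum(a * a for a in page))
        if norm_q == 0 or norm_a == 0:
            return 0.0
        return dot / (norm_q * norm_a)

    # Hypothetical 4-term vocabulary, one query and two web pages.
    query = [1, 0, 1, 0]
    pages = {"N1": [2, 1, 0, 0], "N2": [1, 0, 3, 1]}
    for name, vector in pages.items():
        print(name, round(cosine_weight(query, vector), 3))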
Another common way of computing a web page's weight, W, is tf-idf, which is obtained as a normalized vector, W' = <w'_1, ..., w'_m>^T, where each w'_i is the product of a term frequency (tf) factor and an inverse document frequency (idf) factor. The tf factor is proportional to the frequency of the i-th word within a web page. The idf factor is inversely related to the number of web pages in the entire collection in which the word appears, and corresponds to the content-discriminating power of the word: a word that appears rarely in documents has a high idf, while a word that occurs in a large number of documents has a low idf. Typically, idf is computed as the logarithm of the ratio between the total number of documents and the number of documents containing the word. If a word appears in every document, its discriminating power is 0. If a word appears in a single document, its discriminating power can be very large. Once the vector is computed, the normalized vector, W, is typically obtained by dividing it by its norm. The similarity between a query, Q, and a web page, N_i, from which the weight is specified, can then be defined as the inner product of the two. There are other measures from the VSM in terms of local weights, such as the Binary model [13] and Logarithmic model [14], and global weights, such as the Entropy model [8], GfIdf [13], and Probabilistic Inverse [18].

There are two major link-based search algorithms, HITS and PageRank. The basic idea of the HITS algorithm is to identify a small sub-graph of the web and apply link analysis on this sub-graph to locate the authorities and hubs for the given query. The sub-graph that is chosen depends on the user query. The selection of a small sub-graph (typically a few thousand pages) not only focuses the link analysis on the most relevant part of the web, but also reduces the amount of work for the next phase. The main weaknesses of HITS are known to be non-uniqueness and nil-weighting. Haveliwala [12] suggested a domain-based PageRank algorithm, but its limitation lies in the ontological state of the algorithm and the particular thesaurus that the system uses to discriminate semantics among web documents.

One of the representative arc-oriented approaches is the PageRank algorithm [7, 9, 12]. PageRank is an arc-based measure of a page, similar to its indegree, which is a possible measure of the importance of a page. The PageRank of a page is high if many pages with a high PageRank contain links to it, and a page containing few outgoing links contributes more weight to the pages it links to than would a page containing many outgoing links. It is a static measure, modeled to rank pages in the absence of any query, so PageRank computes the global worth of each page. The algorithm, however, has a weakness known as Google bombing: it can be attacked by creating a large number of bogus web pages, all pointing to a single target page. Since many search engines take into account the number of incoming links in ranking pages, the rank of the target page is likely to increase and appear earlier in query result sets. Thus, the page can achieve 'higher-than-deserved' rankings in Google [4]. The weights in this paper are assumed to accurately indicate the importance of the web page [19, 20, 21, 22]. We generalized the formula and weight from the previous work [17].
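A minimal Python sketch of the tf-idf weighting described above is given below; the normalization by the most frequent term and the log2(N/df) idf follow equation (3.2) of the next section, while the small document collection itself is hypothetical.

    import math
    from collections import Counter

    def tf_idf_weights(docs):
        # docs maps a page identifier to its list of terms.
        # tf is normalized by the most frequent term in the page, and
        # idf is computed as log2(N / df), so a term occurring in every
        # page gets weight 0.
        n_docs = len(docs)
        df = Counter()
        for terms in docs.values():
            df.update(set(terms))
        weights = {}
        for page, terms in docs.items():
            tf = Counter(terms)
            tf_max = max(tf.values())
            weights[page] = {t: (tf[t] / tf_max) * math.log2(n_docs / df[t])
                             for t in tf}
        return weights

    # Hypothetical three-page collection.
    docs = {"N1": ["web", "structure", "web", "graph"],
            "N2": ["pagerank", "web", "link"],
            "N3": ["graph", "tree", "structure"]}
    print(tf_idf_weights(docs)["N1"])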
3. SIMILARITY MEASURES

The similarity measure in this paper is the weight of a web node. We introduce the cosine measure, the tf-idf measure, and the PageRank measure as the weight measures that can be used to determine the topological ordering of web sites. The prototype system has been experimentally tested to search for the structure of the test web site. The link structure of the site is shown in Figs. 3.1 and 3.2. The circles in the figures represent web nodes and the arrows represent hyperlinks, or web arcs.

First of all, the cosine measure from the VSM is introduced [13] as the following equation (3.1), which is the measure exploited in the experiment:

weight(W_j) = \frac{a_j^T q}{\|a_j\|_2 \|q\|_2} = \frac{\sum_{i=1}^{m} a_{ij} q_i}{\sqrt{\sum_{i=1}^{m} a_{ij}^2} \sqrt{\sum_{i=1}^{m} q_i^2}}   (3.1)

In the tf-idf measure, the tf factor itself is sometimes normalized by dividing it by the frequency of the most frequent non-stop term in the document, i.e., tf_norm = tf / tf_max. The idf factor is typically computed from [df(w_i)/N]^{-1}, and most often log_2[N / df(w_i)] is used, where N is the total number of documents and df(w_i) is the number of documents containing the i-th word. Refer to equation (3.2) [8].

weight(W_j) = Q^T * W_j = [q_i]^T * [tf_i \cdot \log_2(N / df(w_i))] = \sum_i q_i \cdot (tf_i \cdot \log_2(N / df(w_i)))   (3.2)

The PageRank measure says that if a source page, i, has a link to a target page, j, then the author of source page i is implicitly conferring some importance to page j. Let N_i be the out-degree of page i, and let Rank(p) represent the importance of page p. Then the link (i, j) confers Rank(i)/N_i units of rank to j. This simple idea leads to the following iterative fixed-point computation that yields the rank vector over all of the pages on the web. If n is the number of pages, assign all pages the initial value 1/n. Let B_i represent the set of pages pointing to i. Links between web pages propagate the ranks [9]. We continue the iterations until the rank stabilizes to within some defined threshold. The final rank vector contains the PageRank vector over the web. This vector is computed only once after each crawl of the web; the values can then be used to influence the ranking of search results [6]. To guarantee that the rank vector converges, the PageRank algorithm uses the following equation with a damping factor d:

\forall i, Rank^{(k+1)}(i) = (1 - d) E + d \sum_{j \in B_i} Rank^{(k)}(j) / N_j,  where E = [1/n]_{n \times 1}   (3.3)

In Google, the damping factor is usually set to 0.85 [7, 8]; the PageRank vector converges more slowly or more quickly depending on the magnitude of the damping factor. For example, by (3.3) we obtain the weights in Table 3.1 and the derived weights of Fig. 3.1. The node weights by PageRank are <0.865, 0.479, 0.50, 2.25, 0.64, 1.00, 1.25>. The link weights, generated by the simple Euclidean distance between the web nodes, are <0.664, 0.80, 0.777, 0.777, 0.984, 0.640, 0.846, 1.166, 1.220, 1.402, 1.161, 1.374, 1.343>.

Table 3.1 PageRank values for Fig. 3.1
  N0      N1      N2      N3      N4      N5      N6
1.000   1.000   1.000   1.000   1.000   1.000   1.000
0.858   0.603   0.745   1.878   1.028   1.141   0.745
0.757   0.587   0.552   2.275   0.763   0.969   1.094
0.831   0.495   0.528   2.226   0.684   0.987   1.245
0.873   0.485   0.501   2.226   0.651   1.023   1.237
0.860   0.483   0.504   2.256   0.646   1.002   1.244
0.862   0.479   0.501   2.250   0.644   1.005   1.255
0.866   0.479   0.500   2.249   0.642   1.008   1.253
0.864   0.479   0.500   2.252   0.642   1.006   1.253
0.864   0.479   0.500   2.251   0.642   1.006   1.254
0.865   0.479   0.500   2.251   0.642   1.006   1.253

Fig. 3.1 Topology for PageRank on the example site
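The iteration in equation (3.3) can be sketched in Python as follows. The adjacency list below is a hypothetical 7-node site, not the exact link structure of Fig. 3.1, so the numbers it prints will differ from Table 3.1.

    def pagerank(out_links, d=0.85, eps=1e-6, max_iter=100):
        # out_links maps each node to the list of nodes it links to.
        # Implements Rank(k+1)(i) = (1-d)/n + d * sum_{j in B_i} Rank(k)(j)/N_j,
        # iterated until the rank vector changes by less than eps.
        nodes = list(out_links)
        n = len(nodes)
        rank = {v: 1.0 / n for v in nodes}
        in_links = {v: [u for u in nodes if v in out_links[u]] for v in nodes}
        for _ in range(max_iter):
            new = {v: (1 - d) / n
                      + d * sum(rank[u] / len(out_links[u]) for u in in_links[v])
                   for v in nodes}
            if max(abs(new[v] - rank[v]) for v in nodes) < eps:
                rank = new
                break
            rank = new
        return rank

    # Hypothetical 7-node site; N0 is the default page and every page
    # has at least one out-link, so no dangling-node handling is needed.
    site = {"N0": ["N1", "N2", "N3"], "N1": ["N3"], "N2": ["N3", "N4"],
            "N3": ["N5", "N6"], "N4": ["N6"], "N5": ["N0"], "N6": ["N3"]}
    print({v: round(r, 3) for v, r in pagerank(site).items()})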
4. MATHEMATICAL MODELING

The model can be applied to transform a digraph into a tree. First, except for the root node, each tree node's indegree should be 1. Second, all cycles that may exist in the graph should be removed during the transformation phase; cycle detection is implemented by a stack-based DFS algorithm [25], which results in constraints (4.4) to (4.6). Third, any self-cycle at a graph node should be removed. And last, duplicate paths between two adjacent nodes in the graph should be removed. Considering the above constraints, the IP formulation is as follows:

max \sum_{(i,j) \in A,\; i,j \in N,\; j \ge 1} w_{ij} x_{ij}   (4.1)

s.t. \sum_{(i,0) \in A,\; i \in N} x_{i0} = 0   (4.2)

\sum_{(i,j) \in A,\; i \in N} x_{ij} = 1   for all j \ne 0   (4.3)

x_{ii} = 0   for all i   (4.4)

x_{ij} + x_{ji} \le 1   \forall i, j   (4.5)

x_{i j_1} + \sum_{k=1}^{m-1} x_{j_k j_{k+1}} + x_{j_m i} \le m   for 2 \le m \le |N| - 1   (4.6)

x_{ij} = 0   \forall i, j \notin N   (4.7)

x_{ij} = 0 or 1   \forall i, j   (4.8)

The variable x_ij is 1 if the web arc from node i to node j is selected, and zero otherwise. The set A represents the set of all existing links between any two web pages in a web site, where the notation (i, j) is used to represent a link from node i to node j. The weight parameter w_ij represents the average distance from node i to node j. There are several alternatives for deriving the weight of a node and for generating a distance for a link, including the number of inward or outward links [24]. In this paper, we apply the tf-idf, cosine, and PageRank weight measures to each node and generate a Euclidean distance w_ij from node i to node j.

The objective function (4.1), which is to be maximized, denotes the sum of the weights of the selected links. Constraint (4.2) ensures that no link toward the root page is included, so that a feasible spanning tree can be formed, while constraint (4.3) ensures that every web page except the root has exactly one web page directly linked to it as a parent node. Constraint (4.4) removes self-cycles. Constraints (4.5) and (4.6) remove cycles of length 2 and cycles of length 3 or more, respectively. Constraint (4.6) covers every cycle starting from any web page i in a given web site, where a cycle starting from web page i and returning to the same page through m intermediate web pages is represented as an ordered list of web pages (i, j_1, j_2, ..., j_k, j_{k+1}, ..., j_{m-1}, j_m, i). Constraint (4.7) restricts the domain to the web site, and constraint (4.8) requires the decision variables to be binary integers.
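As an illustration only (not the solution procedure of the formulation itself), the following Python sketch enumerates, for a small hypothetical site, every choice of one parent per non-root page (constraints (4.2)-(4.3)), discards choices containing cycles (the role of constraints (4.4)-(4.6)), and returns the tree that maximizes the objective (4.1). In practice the same model would be handed to an ILP solver; the arcs and weights below are invented.

    from itertools import product

    def optimal_tree(arcs, weights, root="N0"):
        # arcs: list of (i, j) links; weights: dict (i, j) -> w_ij.
        nodes = sorted({v for arc in arcs for v in arc})
        parents = {j: [i for (i, k) in arcs if k == j and i != j]
                   for j in nodes if j != root}
        best, best_value = None, float("-inf")
        for choice in product(*(parents[j] for j in parents)):
            tree = dict(zip(parents, choice))   # child -> parent
            def reaches_root(node):
                # A choice is a tree iff every node can walk up to the root.
                seen = set()
                while node != root:
                    if node in seen:
                        return False
                    seen.add(node)
                    node = tree[node]
                return True
            if not all(reaches_root(j) for j in tree):
                continue
            value = sum(weights[(tree[j], j)] for j in tree)
            if value > best_value:
                best, best_value = tree, value
        return best, best_value

    # Hypothetical 4-node site with invented arc weights.
    arcs = [("N0", "N1"), ("N0", "N2"), ("N1", "N2"),
            ("N2", "N3"), ("N3", "N1"), ("N1", "N3")]
    weights = dict(zip(arcs, [0.8, 0.7, 0.9, 1.1, 0.6, 1.0]))
    print(optimal_tree(arcs, weights))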
5. OPTIMAL STRUCTURE AND CHANGE ANALYSIS

When the query terms are altered, the measurements of the web nodes are also altered. In this case, sensitivity analysis is used to determine whether the entire problem needs to be reformulated and recalculated or not. The standard form of the problem separates the basic variables from the nonbasic variables, and after applying a simplex algorithm the above IP problem is rearranged accordingly. The optimality criterion of the simplex algorithm is that the reduced cost of every nonbasic variable, \bar{c}_j = c_{BV} B^{-1} a_j - c_j (where a_j is the column vector of N for x_j and c_j is the objective function coefficient of the nonbasic variable x_j), must be non-negative. Also, the feasibility condition of the current basic solution is that the right-hand side of the system, B^{-1} b, must be non-negative. When the weight of a web node is changed but the change does not violate the above two conditions (i.e., the optimality and feasibility conditions), the current basis is conserved.

Fig. 3.1 represents a digraph consisting of 7 web pages, N0 to N6. The weight of an arc is the mean of the weight values of the two terminal nodes of the arc. After reorganizing the IP problem, the BV (basic variables), NBV (nonbasic variables), objective function coefficients c_BV and c_NBV, basis matrix B, and RHS (right-hand side) can each be described. If a web page is changed, the corresponding objective function coefficients are also affected. If the change does not break the primal feasibility condition, i.e., B^{-1} b \ge 0, then the current basis does not change; the feasibility and optimality conditions of the current solution are maintained. In case a link is deleted, a corresponding constraint can be inserted into the formulation. If the current solution satisfies the new constraint, the current basis is maintained; otherwise, a dual simplex algorithm can be used. Consider the case where the two links between nodes N1 and N4 are disconnected. Then the two constraints (x_14 = 0 and x_41 = 0) are included in the previous IP formulation. As the current solution satisfies the inserted constraints, the current basis is maintained. When a new link is inserted, the objective function coefficient, the column of the N matrix for the newly inserted variable, and all constraint types except the non-negativity constraints can be included in the LP formulation; in this case, the current basis is not conserved.

6. EXPERIMENTS

For the example site of Section 3, the number of nodes is increased from 1 to 7. The node weights are derived from the cosine measure of the VSM, tf-idf, and PageRank. A total of four structuring methods are applied: DFS, BFS, Semantic, and LP.

Fig. 6.1 tf-idf weight measure traverses w.r.t. the number of nodes (plotted data):
# of nodes    1       2       3       4       5       6       7
LP          9.490   9.580   16.13   24.08   35.12   44.66   54.80
Semantic    9.490   9.580   16.05   20.97   35.04   41.55   54.16
DFS         9.490   9.580   16.05   20.97   32.01   41.55   54.16
BFS         9.490   9.580   13.02   24.00   35.04   44.58   51.69

Fig. 6.2 PageRank weight measure traverses w.r.t. the number of nodes (plotted data):
# of nodes    1       2       3       4       5       6       7
LP          1.000   0.800   1.784   2.448   3.614   5.016   6.390
Semantic    1.000   0.800   1.761   2.425   3.591   4.993   6.331
DFS         1.000   0.800   1.577   2.241   3.407   4.627   6.183
BFS         1.000   0.800   1.784   2.448   3.614   4.835   5.995

We conclude that the LP solution produces the optimal solution with the optimal structure. In other words, our approach generates the optimal hierarchical structure among the structuring methods. Our method also shows the openness of its mechanism with respect to the weight measure, whether the cosine, tf-idf, or PageRank measure is used.

Fig. 6.3 Cosine weight measure traverses w.r.t. the number of nodes (plotted data):
# of nodes    1       2       3       4       5       6       7
LP          1.410   1.000   1.996   2.706   3.348   4.728   6.557
Semantic    1.410   1.000   1.577   2.287   2.929   4.309   5.719
DFS         1.410   1.000   1.577   2.287   2.929   4.309   5.331
BFS         1.410   1.000   1.996   2.706   3.348   4.728   6.138
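The change analysis of Section 5 amounts to a cheap test before any re-optimization: if a deleted link is not used by the current optimal tree, the new constraints x_ij = 0 are already satisfied and the current solution is kept; otherwise the model is re-solved (a dual simplex step in the LP setting). The sketch below illustrates this test; it reuses the hypothetical optimal_tree helper, arcs, and weights from the sketch in Section 4.

    def handle_link_deletion(tree, arcs, weights, deleted, root="N0"):
        # tree maps each non-root page to its parent in the current optimum.
        # If no deleted arc is used by the tree, the current solution stays
        # feasible and optimal; otherwise re-solve on the reduced arc set.
        used = {(parent, child) for child, parent in tree.items()}
        if not used & set(deleted):
            return tree, "current solution kept"
        remaining = [a for a in arcs if a not in set(deleted)]
        new_weights = {a: weights[a] for a in remaining}
        return optimal_tree(remaining, new_weights, root)[0], "re-solved"

    current, _ = optimal_tree(arcs, weights)
    print(handle_link_deletion(current, arcs, weights, deleted=[("N3", "N1")]))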
7. CONCLUSIONS

The goal of this paper has been to generate the optimal hierarchical structure of a web site from directed graphs. With respect to the weight measure, the optimal solution can be derived: on one hand, from a user's point of view, the optimum will be consonant with the user's preferences in terms of queries with the tf-idf or cosine measures; on the other hand, from a web robot's point of view, the optimum reflects the search robot's perspective. The model guarantees that the optimal hierarchical structure will be generated with respect to the measurement and the optimization model. A mathematical model is applied to a web site in terms of integer linear programming, by which the web digraph can be converted to the hierarchical structure. The optimal solution of the hierarchical structure and the corresponding sensitivity analysis make the model robust, maintaining the solution in a dynamic manner. The model can secure the optimum under any weight-measure condition, and by using the LP model a user is always able to get the optimal result even when actions such as adding, deleting, or inserting pages and links are undertaken within a web site. Under circumstances of frequent web content change, the LP model can be a very good tool for developing an advanced search engine scheme. The LP model also shows its openness: using different similarity measures (such as tf-idf, the cosine measure, and PageRank) does not affect the optimal model, the corresponding solution, or the sensitivity analysis, which implies that the model can easily be adopted by and integrated into current web search robots as well as by web site managers using various similarity measures.

ACKNOWLEDGEMENT

This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc).

REFERENCES

1. Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A.: Crawling the Web for Cyber Communities. In Proc. 8th WWW (1999) 403-415.
2. Barabasi, A., Albert, R., and Jeong, H.: Scale-free Characteristics of Random Networks: the Topology of the World-Wide Web. Physica A 281 (2000) 69-77.
3. Etzioni, O., Cafarella, M., Downey, D., Popescu, A., Shaked, T., Soderland, S., Weld, D., and Yates, A.: Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison. AAAI (2004) 391-398.
4. Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J.: Combating Web Spam with TrustRank. VLDB (2004) 576-587.
5. Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M.: On Near-Uniform URL Sampling. Computer Networks 33(1) (2000) 295-308.
6. Demaine, E. and Lopez-Ortiz, A.: A Linear Lower Bound on Index Size for Text Retrieval. Journal of Algorithms 48(1) (2003) 2-15.
7. Thom, L. H. and Iochpe, C.: Integrating a Pattern Catalogue in a Business Process Model. ICEIS 3 (2004) 651-654.
8. Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D., and Flake, G.: Using Web Structure for Classifying and Describing Web Pages. WWW (2002) 562-569.
9. Pandurangan, G., Raghavan, P., and Upfal, E.: Using PageRank to Characterize Web Structure. COCOON (2002) 330-339.
10. Nivasch, G.: Cycle Detection Using a Stack. Information Processing Letters 90(3) (2004) 135-140.
11. Cooley, R.: The Use of Web Structure and Content to Identify Subjectively Interesting Web Usage Patterns. ACM Transactions on Internet Technology 3(2) (2003) 93-116.
12. Mendelzon, A. O. and Milo, T.: Formal Model of Web Queries. ACM PODS (1997) 134-143.
13. Berry, M. and Browne, M.: Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM (1999).
14. Lee, W. and Geller, J.: Semantic Hierarchical Abstraction of Web Site Structures for Web Searchers. Journal of Research and Practice in Information Technology 36(1) (2004) 71-82.
15. Gurrin, C. and Smeaton, A. F.: Replicating Web Structure in Small-Scale Test Collections. Information Retrieval 7(3) (2004) 239-263.
16. Zwol, R. and Apers, P.: The Webspace Method: On the Integration of Database Technology with Multimedia Retrieval. CIKM (2000) 438-445.
17. Lee, W., Kim, S., and Kang, S.-H.: Dynamic Hierarchical Website Structuring Using Linear Programming. LNCS 3182 (2004) 328-337.
18. Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D. M., and Flake, G.: Using Web Structure for Classifying and Describing Web Pages. WWW (2002) 562-569.
19. Cho, J. and Garcia-Molina, H.: Effective Page Refresh Policies for Web Crawlers. ACM TODS 28(4) (2003) 390-426.
20. Chen, M., Hearst, M., Hong, J., and Lin, J.: Cha-Cha: A System for Organizing Intranet Search Results. USENIX ITS (1999) 11-14.
21. Subramani, K. and Kovalchick, L.: Contraction versus Relaxation: A Comparison of Two Approaches for the Negative Cost Cycle Detection Problem. Computational Science (2003) 377-387.
22. Garofalakis, M., Kappos, P., and Mourloukos, D.: Web Site Optimization Using Page Popularity. IEEE Internet Computing 3(4) (1999) 22-29.
23. Hou, J. and Zhang, Y.: Effectively Finding Relevant Web Pages from Linkage Information. IEEE TKDE 15(4) (2003) 940-951.
24. Takeuchi, F.: Nonregular Triangulations, View Graphs of Triangulations, and Linear Programming Duality. Discrete and Computational Geometry (2000) 330-338.
25. Shmueli, O.: Dynamic Cycle Detection. Information Processing Letters 17(4) (1983) 185-188.