<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mathias</forename><surname>Géry</surname></persName>
						</author>
						<author role="corresp">
							<persName><forename type="first">Jean-Pierre</forename><surname>Chevallet</surname></persName>
							<email>pierre.chevallet@imag.fr</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="laboratory">Equipe MRIM (Modélisation et Recherche d&apos;Information Multimédia)</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="laboratory">Laboratoire CLIPS-IMAG</orgName>
								<address>
									<postBox>P. 53</postBox>
									<postCode>38041, Cedex 9</postCode>
									<settlement>Grenoble</settlement>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B9FD0C92C0BEE436BCE016047EDC9C6B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Web Information Retrieval</term>
					<term>Web Pages Analysis</term>
					<term>Structure Extraction</term>
					<term>Statistics</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The World Wide Web is a distributed, heterogeneous and semi-structured information space. With the growth of available data, retrieving interesting information is becoming quite difficult, and classical search engines often give very poor results. The Web is changing very quickly, yet search engines mainly use old and well-known IR techniques. One of the main problems is the lack of explicit HTML page structure, and more generally the lack of explicit Web site structure. We show in this paper that it is possible to extract such a structure, which can be explicit or implicit: hypertext links between pages, the implicit relations between pages, the HTML tags describing structure, etc. We present some preliminary results of a Web sample analysis extracting several levels of structure (a hierarchical tree structure, a graph-like structure).</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The task of an Information Retrieval System (IRS) is to process a whole set of electronic documents (a corpus), with the aim of enabling users to retrieve the documents that match their information need. Unlike in Database Management Systems (DBMS), the user expresses in a query the semantic content of the documents that he seeks. We distinguish two principal tasks:</p><p>Indexing: The extraction and storage of the documents' semantic content. This phase requires a representation model of this content, called the document model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Querying: The representation of the user's information need, generally in the form of a query. It is followed by the retrieval task and the presentation of the results. This phase requires a representation model called the query model, and a matching function to evaluate document relevance.</p><p>IRS were classically used for textual document databases, or multimedia databases like medical corpora. The growth of the Web constitutes a new field of application for IR. The number of users on the Web was estimated at 119 million in 1998 (NUA Ltd Internet survey, July 1998) and 333 million in 2000 (NUA Ltd Internet survey, June 2000). The number of accessible pages was estimated at 320 million in December 1997 <ref type="bibr" target="#b19">[20]</ref>, at 800 million in February 1999 <ref type="bibr" target="#b20">[21]</ref> and at more than 2 billion in July 2000 <ref type="bibr" target="#b25">[26]</ref>.</p><p>The Web is a huge and sometimes chaotic information space without central authority. In this context, and in spite of standardization efforts, documents are very heterogeneous in their content as well as in their presentation: the HTML standard is respected in less than 7% of HTML pages <ref type="bibr" target="#b3">[4]</ref>. We can expect to find almost everything there, but retrieving relevant information seems to be like finding a needle in a haystack. Nowadays, the great challenge for research in IR is to help people benefit from the huge amount of resources existing on the Web. However, no approach yet satisfies this information need in a way that is both effective <ref type="foot" target="#foot_0">1</ref> and efficient<ref type="foot" target="#foot_1">2</ref>. To assist the user in this task, some search engines (like Altavista, AllTheWeb, or Google <ref type="foot" target="#foot_2">3</ref> ) are available on the Web. 
They are able to process huge volumes of documents, with several tens of millions of indexed pages. They are nevertheless very fast and are able to answer several thousand queries per second. In spite of all these efforts, the answers provided by these systems are generally not very satisfactory. Preliminary results obtained with a test collection of the TREC conference Web Track have shown the poor result quality of 5 well-known Web search engines, compared to that of 6 systems taking part in TREC <ref type="bibr" target="#b16">[17]</ref>.</p><p>In fact, most existing search engines use well-known techniques like those described by Salton 30 years ago <ref type="bibr" target="#b29">[30]</ref>. Most of them prefer wide coverage of the Web with low indexing quality to better indexing of a smaller part of the Web. In particular, they generally consider HTML pages as atomic and independent documents, without taking into account the relations existing between them. The notion of document for a search engine is reduced to its physical appearance, an HTML page. The Web's structure is used by only a few Web search engines, such as Google <ref type="bibr" target="#b5">[6]</ref>.</p><p>With the aim of Structured IR, we wanted to determine which structure exists on the Web, and which structure it is possible to extract. This paper is organized as follows: after a presentation of related work in section 2 (i.e. IR with structured documents, hypertexts and the Web), we present in section 3.1 our hypotheses about what an ideal structure on the Web would be. Then, in section 3.2, we propose our approach to validate our hypotheses and to check whether this kind of structure exists on the Web. Finally, we introduce the Web sample that we have analysed in section 4.1 and some preliminary results of our experiments in sections 4.2, 4.3 and 4.4, while section 6 gives a conclusion about this work in progress and some future directions of our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">IR and structure on the Web</head><p>The Web is not merely a set of atomic documents. The HTML standard, which is widely used to publish on the Web, allows the description of structured multimedia documents. Furthermore, the Web is a hypertext, which uses URLs (Uniform Resource Locators) to describe links. Structure of this kind has been used for IR, in the context of structured documents as well as in the context of classical hypertexts. We distinguish 3 main approaches proposing information access techniques that use structure: the navigation, DBMS and IR approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Navigation approach</head><p>Navigation is based on links, which are used for finding and consulting interesting information. For navigation within a hypertext composed of several hundred nodes, this solution can be useful. The task is more difficult to achieve on a larger hypertext, mainly because of disorientation and cognitive overload problems. Furthermore, it is necessary to have the right links at the right place. A solution is proposed by "Web Directories" such as Yahoo or the Open Directory Project <ref type="foot" target="#foot_3">4</ref>, which offer an organized hierarchy of several million sites. These hierarchies are built and verified manually, and thus are expensive and difficult to keep up to date. Furthermore, exhaustiveness is impossible to reach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">DBMS approach</head><p>Documents are represented using a data schema encapsulated in a relational or object-oriented <ref type="bibr" target="#b9">[10]</ref> data schema. This allows querying with a declarative query language that is based on exact matching and constrained by the structure of the data schema. The integration of hypertext structure into the database schema has been widely studied, for example by <ref type="bibr" target="#b24">[25]</ref>, <ref type="bibr" target="#b2">[3]</ref> (ARANEUS project), <ref type="bibr" target="#b15">[16]</ref> (TSIMMIS project), etc. Attempts at integration at the query language level can be found in hyperpaths <ref type="bibr" target="#b0">[1]</ref> or POQL <ref type="bibr" target="#b9">[10]</ref>. In fact, these approaches are extensions of the solution proposed for integrating document structure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">IR approach</head><p>The IR approach deals with structured documents by promoting hierarchical indexing: during the indexing process, information is propagated from document sections to the top of the document, along composition relations. This method is refined by Lee <ref type="bibr" target="#b21">[22]</ref>, who distinguishes several ascent strategies. Paradis <ref type="bibr" target="#b27">[28]</ref> distinguishes several structure types, with data ascent depending on the link type.</p><p>Hypertext structure has been taken into account at the indexing step. For example, the hypertext graph can be incorporated into a global indexing schema using the conceptual graph model <ref type="bibr" target="#b8">[9]</ref> or inference networks <ref type="bibr" target="#b10">[11]</ref>. The World Wide Web Worm <ref type="bibr" target="#b23">[24]</ref> enables the indexing of multimedia documents by using the text surrounding the anchor. Amitay <ref type="bibr" target="#b1">[2]</ref> also promotes the use of a document's context. Marchiori <ref type="bibr" target="#b22">[23]</ref> adds the notion of "navigation cost", which expresses the navigation effort needed to reach a given page.</p><p>SmartWeb <ref type="bibr" target="#b12">[13]</ref> considers the information accessible from a Web page at the indexing step, so that page relevance is evaluated from the content of the page but also from the content of its neighbors. Kleinberg (HITS <ref type="bibr" target="#b17">[18]</ref>) promotes the use of both link directions: he introduces the concepts of hub page <ref type="foot" target="#foot_4">5</ref> and authority page<ref type="foot" target="#foot_5">6</ref>. For automatic resource compilation, the CLEVER system <ref type="bibr" target="#b7">[8]</ref>, based on the same idea, obtains good results compared with a manually generated compilation (Yahoo!<ref type="foot" target="#foot_6">7</ref> ). 
Gurrin <ref type="bibr" target="#b14">[15]</ref> has tried to improve Kleinberg's approach. He distinguishes 2 link types (structural and functional) and uses only the structural ones. The well-known Google search engine <ref type="bibr" target="#b5">[6]</ref> uses the text of anchors to describe the pages referenced by the links from these anchors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Related Works: Discussion</head><p>We think that the navigation approach is well suited to manually manageable collections, but the Web is too big to be accessed through navigation alone. Navigation can usefully complement other techniques, for example for consulting search results.</p><p>As for DBMS approaches, we think that a declarative query language is not suited to the heterogeneity of the Web. Moreover, these approaches rely on the underlying database schema, and Web pages have to be expressed following this schema or predefined templates. Like Nestorov <ref type="bibr" target="#b26">[27]</ref>, we think that even if some Web pages are strongly structured, this structure is too irregular to be modeled efficiently with structured models such as the relational or object models.</p><p>The IR approach enables natural language querying and considers relevance in a hypertext context. At present, most IR approaches are based on the use of page connectivity, with the notion of relevance propagation along links. The drawback is that this information is poorly exploited, because relations (links) and nodes (documents) are not typed on the Web.</p><p>We think that these approaches are interesting and useful. The lack of explicit Web structure with which to improve them encourages us to work on Web structure extraction. Several works have focused on statistical studies <ref type="bibr" target="#b4">[5]</ref>, <ref type="bibr" target="#b3">[4]</ref>, <ref type="bibr" target="#b30">[31]</ref>, dealing with the use of HTML tags or with the distribution of links, which leads for example to the notion of hub and authority pages. Pirolli <ref type="bibr" target="#b28">[29]</ref> has categorized Web pages into 5 predefined categories related to site structure, based on usage, site connectivity and content data. 
Broder <ref type="bibr" target="#b6">[7]</ref> has studied Web connectivity and has extracted a macroscopic Web structure. But none of these works deals with the extraction of Web structure (structured documents and hypertexts) in relation to IR objectives.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Is the Web well structured?</head><p>The main objectives of our Web sample analysis are to identify the explicit structure of the Web, and to extract its implicit structure. The Web is clearly not structured in the database sense of the term. But HTML allows people to publish structured sites. Thus we will talk about a Web that is structured hierarchically as well as in the hypertext sense. The question is: "Is the Web sufficiently structured (especially hierarchically) to be indexed following a structured IR model?". This main question leads us to some other interesting questions such as "What is a Document on the Web?" or "How can I classify a Web link?".</p><p>We present our approach to answering these questions. Firstly, in section 3.1 we present the kind of structure that we wanted to identify in, or extract from, the Web. We hypothesize that this ideal structure for the Web exists. The underlying issue is that of a structured IR model: our final goal is to develop an IR model adapted to the Web.</p><p>Secondly, we present in section 3.2 our interpretation of HTML use in relation to our hypotheses. The structure of the Web depends mainly on the way site authors use HTML; thus, finally, we present in section 4 some preliminary results of a Web sample analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Hypotheses</head><p>We will try to check the following assumptions, directly related to the concept of information in a semi-structured and heterogeneous context:</p><p>Hypothesis 1: information granularity. We think that information units on the Web can have different granularities. By assembling these various constituents, one can build larger entities. We distinguish at least 6 main granularities, and hypothesis 1 is detailed following these 6 granularities:</p><p>H1.1: elementary constituent. We think that on the Web there is a notion of elementary constituent, which can be a morpheme, a word or a sentence. In our approach, the elementary constituent is at the sentence level.</p><p>H1.2: paragraph. By assembling sentences, one can build paragraph-sized entities. This is our first level of structure. This structure is a list, reflecting the logical sequence needed to constitute an understandable unit.</p><p>H1.3: document section. This second level includes all the elements that compose a "classical" document, like sub-sections, sections, chapters, etc. All of them are built from paragraphs. They also include some other attributes like title, author, etc.</p><p>H1.4: document. This third level is the first one that introduces a tree-like structure, based on document sections. Moreover, the reader must follow a reading direction for a better understanding. For example, people generally read the "introduction" before the "conclusion".</p><p>H1.5: hyper-document. This level loses the reading organization when gluing documents together. It can be associated with parts of a hypertext, where a reading direction is no longer obligatory.</p><p>H1.6: clusters of hyper-documents. This last level is useful to glue together the hyper-documents that have some characteristics in common, like the theme or the authors. This can be seen through the library shelf metaphor.</p><p>Hypothesis 2: relations. 
There are various relations between documents, whatever their granularity. We distinguish at least 3 main relation types, and hypothesis 2 is detailed following these 3 types:</p><p>H2.1: composition. This relation expresses the hierarchical (tree-like) construction of an entity of higher granularity. This relation is used in the first five levels of the previous granularity description (e.g. paragraphs are composed of sentences). Composition deals with attributes shared along composition relations, for example the author name. It also deals with the lifetime of the compound element: a paragraph no longer exists without its sentences. Composition can be split into weak and strong composition according to the sharing status. The composition is weak if an element can be shared. In this case the relation draws a lattice; otherwise we obtain a tree.</p><p>H2.2: sequence. Certain document parts can be organized by the author in an orderly way: part B precedes part C and follows part A. This order suggests a reading direction to the reader, for a better understanding. This relation only concerns H1.1 to H1.4. It can be modeled using the probability that a part B can be best understood after the reading of a part A. This conditional probability value can be the fuzzy value of the sequence from A to B.</p><p>H2.3: reference. This relation is weak, in the sense that it can link elements at any granularity level because they have something in common. For example, an author can refer to another document for complementary information, or two documents can refer to each other because of their similarity.</p><p>The next generation of Web search engines will have to consider all these granularities and relations. In the next section, we interpret HTML usage on the Web in relation to these hypotheses.</p></div>
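<div xmlns="http://www.tei-c.org/ns/1.0"><p>One possible way to write down the sequence relation of H2.2 is given here, as a sketch only. The symbols A and B for two document parts and the name seq are notation introduced for this illustration; they are an assumption and do not come from a fixed formalism.</p>
<p>
```latex
\[
  \mathrm{seq}(A \to B) \;=\;
  P\bigl(\text{$B$ is best understood} \mid \text{$A$ has been read}\bigr),
  \qquad 0 \le \mathrm{seq}(A \to B) \le 1 .
\]
```
</p><p>Under this reading, a value close to 1 corresponds to a strongly suggested reading order from A to B, and a value close to 0 to the absence of any suggested order.</p></div>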
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Web analysis to validate assumptions</head><p>Our objective is to study different Web characteristics, with the aim of validating our hypotheses. Leaving aside sub-sentence granularities, we have hypothesized that there are 6 main granularities on the Web (cf section 3.1), from the sentence level up to the cluster of hyper-documents level. To validate hypotheses 1.1, 1.2 and 1.3, we have chosen HTML tags as descriptors of inside-page granularities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H1.1</head><p>HTML makes it possible to describe elementary constituents, for example with &lt;ADDRESS&gt; or &lt;CODE&gt;. Some of these tags are at the presentation level, others at the semantic level. We place our analysis at the sentence level, and we have not found many tags that explicitly isolate a sentence as &lt;CITE&gt; does. All other tags are sentence-internal elements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H1.2</head><p>We propose to place at this level simple paragraphs and "block elements" like &lt;TABLE&gt; or &lt;FORM&gt;. There are also sub-block elements like &lt;PRE&gt;, which we place at this level as well. We also propose to use the paragraph separators &lt;P&gt; and &lt;HR&gt;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H1.3</head><p>To express document sections, one can use the HTML separators &lt;Hn&gt;. In fact, we could also use the whole Web page as a section.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>H1.4</head><p>We propose to consider the physical HTML page as a document. But we could also take a set of interconnected Web pages as a document, assuming that the links between them represent composition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H1.5</head><p>Our first proposal is to consider a hyper-document to be an Internet site, defined as a set of pages on the same site.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H1.6</head><p>To represent our clusters of hyper-documents, we propose the notion of Web domain (e.g. ".imag.fr").</p><p>To validate hypothesis 2, we have tried to identify composition and sequence links. All unidentified links are categorized as reference links. Implicit similarity and reference relations are not extracted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H2.1</head><p>Composition can be identified from the inside-page H1.3 tags, which represent strong composition. In addition, inside-site links can be identified as hierarchical, representing strong or weak composition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H2.2</head><p>The sequence can be found by looking at the implicit position of a fragment relative to the following text segment (inside pages). In addition, some inside-site links from a page to one of its sisters can be considered.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H2.3</head><p>All the remaining links are classified in this category. This type of relation can be represented on the Web using hypertext links. But it can also be implicit, as with quotations for example.</p><p>It is possible to describe a structure, but is it a reality on the Web? We have to verify whether these sub-page granularity tags are used by authors (H1.1 to H1.3), and we have to check whether the concepts of page, site and domain are relevant on the Web (H1.4 to H1.6). For each page, we have to rebuild a hierarchical tree structure, and to identify a hierarchy of structured documents between HTML pages.</p></div>
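<div xmlns="http://www.tei-c.org/ns/1.0"><p>The tag-based reading of H1.1 to H1.3 described in this section can be sketched as a small tag counter. This is a minimal Python sketch under stated assumptions, not the Perl tools used for this study: the TAG_LEVEL mapping, the class GranularityCounter and the function count_granularities are hypothetical names, and the tag lists only repeat the examples given above (&lt;CITE&gt;, &lt;P&gt;, &lt;HR&gt;, &lt;TABLE&gt;, &lt;FORM&gt;, &lt;PRE&gt;, &lt;Hn&gt;) rather than a complete inventory.</p>
<p>
```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to the inside-page granularities of
# section 3.2; the tag lists are illustrative, not an exhaustive inventory.
TAG_LEVEL = {
    "cite": "H1.1 sentence",
    "p": "H1.2 paragraph",
    "hr": "H1.2 paragraph",
    "table": "H1.2 paragraph",
    "form": "H1.2 paragraph",
    "pre": "H1.2 paragraph",
}
TAG_LEVEL.update({"h%d" % n: "H1.3 section" for n in range(1, 7)})


class GranularityCounter(HTMLParser):
    """Count opening tags per granularity level in one HTML page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so P and p are counted together.
        level = TAG_LEVEL.get(tag)
        if level is not None:
            self.counts[level] += 1


def count_granularities(html_text):
    """Return a dict mapping each granularity level to its tag count."""
    parser = GranularityCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.counts)
```
</p><p>Averaging such counts over all pages of a sample gives per-page statistics of the kind reported in section 4.2. Counting opening tags only is a deliberate simplification, since closing tags are often omitted in real pages.</p></div>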
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental results</head><p>In this section we present some preliminary results of a Web sample analysis, and in particular the statistics used to validate our assumptions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Web pages sample: IMAG collection</head><p>We have collected a Web sample, an "October 5 2000 snapshot", using our Web crawler "CLIPS-Index" (cf section 5). We have chosen to restrict our experiment to the Web pages of the IMAG domain <ref type="foot" target="#foot_7">8</ref> , which are browsable starting from the URL "http://www.imag.fr". These pages deal with several topics, but most of them are scientific documents, particularly in the field of computer science. The main characteristics of this collection are summarized in figure <ref type="figure">1</ref>.</p><p>Our spider collected, in less than 2 hours, almost 39'000 pages, identified by their URLs, from 39 hosts, for a total size of 443 MB. It is not surprising that most of the pages are in HTML format (72 % of .html and .htm, cf figure <ref type="figure" target="#fig_0">2</ref>). After analysis and text extraction, about 140 MB of textual data remain, containing more than 241'000 distinct terms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Granularity analysis</head><p>We have extracted statistics related to the entity granularities described in section 3.1. It appears that the ability of HTML to represent different inside-page granularities, as described in section 3.2, is widely used: each page contains on average 17 level 1 objects, 17 block elements, 29 paragraph separators and 3,3 sections (cf figure <ref type="figure">3</ref>). Hypotheses 1.1, 1.2 and 1.3 seem to be correct, but manual experiments are needed to validate them.</p><p>The average HTML page size is 11,6 KB (cf figure <ref type="figure">1</ref>). This is greater than the results of other studies (almost 7 KB <ref type="bibr" target="#b30">[31]</ref>, <ref type="bibr" target="#b4">[5]</ref>). The size of textual pages (pages without HTML tags) is on average 3,69 KB. But these statistics relate to physical aspects of documents. We have to consider the links between entities to conclude anything about logical aspects. There are on average 37 links per page in our collection (cf figure <ref type="figure">4</ref>). If we do not count redundant links (same source and same destination), only 550'000 distinct links remain: on average 14,11 per page, which is not far from the results of other studies (13,9/page <ref type="bibr" target="#b30">[31]</ref>, 16,1/page <ref type="bibr" target="#b3">[4]</ref>).</p></div>
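<div xmlns="http://www.tei-c.org/ns/1.0"><p>The removal of redundant links mentioned above can be illustrated in a few lines of Python. This is a minimal sketch, not the Perl tools used for this study; the function name link_statistics and the representation of links as (source, target) pairs are assumptions made for the example.</p>
<p>
```python
def link_statistics(links, page_count):
    """Return total and distinct link counts plus per-page averages.

    links is an iterable of (source_url, target_url) pairs; two links are
    redundant when both the source and the target are identical.
    """
    links = list(links)
    distinct = set(links)  # a set keeps one copy of each (source, target)
    return {
        "total": len(links),
        "distinct": len(distinct),
        "total_per_page": len(links) / page_count,
        "distinct_per_page": len(distinct) / page_count,
    }
```
</p><p>With the values reported for the IMAG collection, 550'207 distinct links over 38'994 pages give 550207 / 38994, that is about 14,11 distinct links per page.</p></div>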
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Internal HTML page structure extraction</head><p>We have identified several levels inside pages (cf section 4.2): section, paragraph, sentence and even sub-sentence. These levels are defined by the writers of HTML pages. With these structure elements (cf figure <ref type="figure">3</ref>), we are able to rebuild hierarchical tree structures that are relatively large (cf figure <ref type="figure">6</ref>). In particular, we have to extract most of the composition and sequence relations. Composition relations are implicit, from a page to all its sections, but also from sections to all their paragraphs, etc. Sequence relations are also implicit, from each page element (except the last one) to its physical successor.</p></div>
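<div xmlns="http://www.tei-c.org/ns/1.0"><p>The implicit composition and sequence relations described above can be sketched as follows. This is a minimal Python sketch under assumptions of our own: the function name extract_relations is hypothetical, a page is reduced to the flat list of its structural tags in document order, a heading tag stands for the section it opens, and heading levels alone decide nesting.</p>
<p>
```python
def extract_relations(elements):
    """Derive composition and sequence relations from a flat page.

    elements is the list of structural tags of one page in document
    order, for example ["h1", "p", "h2", "p", "p"].  Elements are named
    by their position in that list; the page itself is named "page".
    """
    composition = []    # (parent, child) pairs
    sequence = []       # (element, physical successor) pairs
    open_sections = []  # stack of (heading level, element index)
    for index, tag in enumerate(elements):
        is_heading = len(tag) == 2 and tag[0] == "h" and tag[1].isdigit()
        if is_heading:
            level = int(tag[1])
            # A heading closes every open section of the same or deeper level.
            while open_sections and open_sections[-1][0] >= level:
                open_sections.pop()
        # The parent is the innermost open section, or the page itself.
        parent = open_sections[-1][1] if open_sections else "page"
        composition.append((parent, index))
        if is_heading:
            open_sections.append((level, index))
        # Every element except the first is the successor of the previous one.
        if index > 0:
            sequence.append((index - 1, index))
    return composition, sequence
```
</p><p>The composition pairs form a tree rooted at the page, and the sequence pairs form a chain over the page elements, which matches the description of implicit relations given in this section.</p></div>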
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">External HTML page structure extraction</head><p>We assume that the directory structure of a Web site carries some semantics that can be extracted automatically. This semantics is proposed "a priori", because we suppose that the way pages are placed in the directory hierarchy follows the "principle of least effort". We assume that the directory hierarchy reflects the composition relation. This must of course be checked experimentally through manual validation. In this part we examine all the ways in which links join pages across the directory hierarchy, and we propose the following link categories: internal (inside-page), hierarchical and transversal (inside-site), cross (outside-site) and out (outside-domain) (cf figure <ref type="figure" target="#fig_1">7</ref>).</p><p>We are interested in categorizing the relations represented by these links. We have interpreted each link type in order to type the relation that it expresses, in the following way:</p><p>Internal: 8% of links stay within the same page. We do not propose a category for them.</p></div>
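<div xmlns="http://www.tei-c.org/ns/1.0"><p>The five link categories can be illustrated by a small URL-based classifier. This is a minimal Python sketch, not the Perl tools used for this study, and it rests on three assumptions that are ours: a site is identified with a host name, a domain with the last two labels of the host name (for example imag.fr), and two pages are "in the same directory path" when the directory of one is an ancestor of, or equal to, the directory of the other. The function names are hypothetical.</p>
<p>
```python
import posixpath
from urllib.parse import urldefrag, urlsplit


def _directory(path):
    # Directory part of a URL path, as a tuple of segments.
    directory = path if path.endswith("/") else posixpath.dirname(path)
    return tuple(segment for segment in directory.split("/") if segment)


def _domain(host):
    # Assumption: the domain is the last two labels of the host name.
    return ".".join(host.split(".")[-2:])


def classify_link(source, target):
    """Return internal, hierarchical, transversal, cross or out."""
    # Fragments are dropped first, so page.html#top is the same page.
    src = urlsplit(urldefrag(source).url)
    dst = urlsplit(urldefrag(target).url)
    src_host = (src.hostname or "").lower()
    dst_host = (dst.hostname or "").lower()
    if _domain(src_host) != _domain(dst_host):
        return "out"    # outside-domain
    if src_host != dst_host:
        return "cross"  # outside-site, inside-domain
    src_path, dst_path = src.path or "/", dst.path or "/"
    if (src_path, src.query) == (dst_path, dst.query):
        return "internal"  # inside-page
    src_dir, dst_dir = _directory(src_path), _directory(dst_path)
    shorter = min(len(src_dir), len(dst_dir))
    # Assumption: "same directory path" means that one directory is an
    # ancestor of (or equal to) the other.
    if src_dir[:shorter] == dst_dir[:shorter]:
        return "hierarchical"
    return "transversal"
```
</p><p>Under these assumptions, a link from http://www.imag.fr/a/index.html to http://www.imag.fr/a/c/d.html is hierarchical, while a link from the same page to http://www.imag.fr/b/y.html is transversal. Links between sister pages of one directory also come out as hierarchical here, which is a simplification.</p></div>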
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hierarchical</head><p>We call hierarchical links those whose source and target are in the same directory path. These links are the most common in our sample, with 60%. If these links reflect the composition</p><p>We have also developed several analysis tools (23'000 lines) in Perl (Practical Extraction and Report Language) for HTML extraction, link analysis and typing, topology analysis, statistics extraction, text indexing, language extraction, etc.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and future works</head><p>We think that it is interesting and useful to use Web structure for IR. Because the Web lacks explicit structure, we have to identify the explicit structure that exists and extract the implicit one. We have proposed a framework composed of 6 entity granularities and 3 main relation types. We have proposed some rules to extract these granularities and relations, based mainly on the ability of HTML to describe structured elements, and on the study of the relations that exist between hypertext links and the directory hierarchy of Web servers.</p><p>Our first experiments show that, on the one hand, hypotheses H1.1, H1.2, H1.3 (internal structure level) and H1.5 (site granularity) seem to be correct and, on the other hand, hypotheses H1.4 (page granularity) and H1.6 (cluster of sites granularity as Internet sub-domains) seem to be false.</p><p>It is possible to identify and extract structure from the Web: several granularities and several types of relations. But we have to continue these experiments. Firstly, we have to improve our categorization of relations and our hierarchical structure extraction. Secondly, we need to check the extracted information manually, to validate our hypotheses. Thirdly, we have to analyze bigger collections: several domains, more heterogeneous pages. The IMAG collection is undoubtedly not very representative of the Web, because of its small size compared to the French Web. Moreover, it represents only a single Web domain. Finally, our main objective is to propose a structured IR model, based on these 6 granularity levels and 3 relation types. An Information Retrieval System based on this model will use some IR methods used in the context of structured documents and hypertexts <ref type="bibr" target="#b11">[12]</ref>. It will actually use Web structure for IR, and thus will be able to help address the IR problem on the Web. 
We are also working on the use of data mining techniques to extract knowledge useful for improving IR results <ref type="bibr" target="#b13">[14]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Pages format</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Links types</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>IMAG sample: main characteristics (left columns) and page counts by file extension (right columns).</figDesc><table><row><cell></cell><cell>#per page</cell><cell># in coll</cell><cell>Extension</cell><cell>#pages</cell><cell>%</cell></row><row><cell>Hosts</cell><cell></cell><cell>39</cell><cell>.html</cell><cell>25'665</cell><cell>65,82</cell></row><row><cell>Pages</cell><cell></cell><cell>38'994</cell><cell>.htm</cell><cell>2'530</cell><cell>6,49</cell></row><row><cell>French language</cell><cell></cell><cell>5'649</cell><cell>.java</cell><cell>1'021</cell><cell>2,62</cell></row><row><cell>English language</cell><cell></cell><cell>23'819</cell><cell>.cgi</cell><cell>219</cell><cell>0,56</cell></row><row><cell>Other languages</cell><cell></cell><cell>9'068</cell><cell>.txt</cell><cell>82</cell><cell>0,21</cell></row><row><cell>Distinct terms</cell><cell></cell><cell>241'000</cell><cell>.php3</cell><cell>71</cell><cell>0,18</cell></row><row><cell>Size (HTML)</cell><cell>11,6 KB</cell><cell>443 MB</cell><cell>No extension</cell><cell>8'134</cell><cell>20,86</cell></row><row><cell>Size (text)</cell><cell>3,7 KB</cell><cell>141 MB</cell><cell>Directory</cell><cell>933</cell><cell>2,39</cell></row><row><cell>Lines</cell><cell>207</cell><cell>8'079'676</cell><cell>Others</cell><cell>339</cell><cell>0,87</cell></row><row><cell>Links</cell><cell>37,8</cell><cell>1'475'096</cell><cell>Total</cell><cell>38'994</cell><cell>100</cell></row><row><cell cols="3">Figure 1: IMAG sample: main characteristics</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Web pages are heavily linked together, but without link categorization it is difficult to distinguish which pages are hyper-documents, which are structured documents and which are sections (Hypothesis 1.4). In particular, we cannot confirm that a Web document is represented by an HTML page. There are few outside-sites links: only 2,6% of the links, contained in 2,4% of pages. We therefore think that site compactness validates Hypothesis 1.5: hyper-documents are represented by sites. Only 5,4% of these outside-sites links are inside-domain links: most sites are connected with outside-domain sites, so we conclude that a cluster of hyper-documents is not represented by a Web domain (Hypothesis 1.6).</figDesc><table><row><cell>#links/page</cell><cell>#pages</cell><cell>%</cell></row><row><cell>0</cell><cell>38'275</cell><cell>97,63</cell></row><row><cell>1</cell><cell>396</cell><cell>1,52</cell></row><row><cell>2</cell><cell>106</cell><cell>0,36</cell></row><row><cell>3</cell><cell>85</cell><cell>0,18</cell></row><row><cell>4</cell><cell>39</cell><cell>0,1</cell></row><row><cell>5 +</cell><cell>93</cell><cell>0,21</cell></row><row><cell cols="3">Figure 5: Outside-sites links per page</cell></row><row><cell></cell><cell>#links</cell><cell>%</cell><cell>Per page</cell><cell>Per site</cell><cell>Distinct</cell><cell>%</cell><cell>Per page</cell><cell>Per site</cell></row><row><cell>Inside-pages</cell><cell>118'248</cell><cell>8</cell><cell>2,97</cell><cell>3'128</cell><cell>13'897</cell><cell>2,53</cell><cell>0,36</cell><cell>356</cell></row><row><cell>Outside-pages</cell><cell>1'318'490</cell><cell>89,38</cell><cell>33,81</cell><cell>33'807</cell><cell>500'472</cell><cell>90,96</cell><cell>12,83</cell><cell>12'832</cell></row><row><cell>Outside-sites</cell><cell>2'093</cell><cell>0,14</cell><cell>0,05</cell><cell>57,67</cell><cell>1'708</cell><cell>0,31</cell><cell>0,04</cell><cell>43,79</cell></row><row><cell>Outside-domain</cell><cell>36'265</cell><cell>2,46</cell><cell>0,93</cell><cell>930</cell><cell>34'130</cell><cell>6,20</cell><cell>0,87</cell><cell>875</cell></row><row><cell>Total</cell><cell>1'475'096</cell><cell>100</cell><cell>37,12</cell><cell>39'118</cell><cell>550'207</cell><cell>100</cell><cell>14,11</cell><cell>14'108</cell></row><row><cell cols="9">Figure 4: Links analysis: all/distinct links</cell></row></table></figure>
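<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two percentages quoted with Figure 5 can be rechecked from the link counts of Figure 4. The short Python sketch below is only an illustration: the names LINK_COUNTS, TOTAL_LINKS and percentage are hypothetical, and it assumes that "links leaving the site" means the Outside-sites and Outside-domain rows taken together, which is our reading of the text rather than a definition given by the authors.</p>
<p>
```python
# Link counts copied from the "Links analysis" table (Figure 4).
LINK_COUNTS = {
    "inside-pages": 118_248,
    "outside-pages": 1_318_490,
    "outside-sites": 2_093,
    "outside-domain": 36_265,
}
TOTAL_LINKS = 1_475_096


def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Assumption: a link "leaves the site" if it is an outside-sites link
# (another IMAG site) or an outside-domain link (outside IMAG).
leaving_site = LINK_COUNTS["outside-sites"] + LINK_COUNTS["outside-domain"]

# Share of all links that leave their site.
share_leaving_site = percentage(leaving_site, TOTAL_LINKS)

# Share of those links that nevertheless stay inside the IMAG domain.
share_inside_domain = percentage(LINK_COUNTS["outside-sites"], leaving_site)

print(f"{share_leaving_site:.2f}% of links leave their site")
print(f"{share_inside_domain:.2f}% of those stay inside the domain")
```
</p>
<p>Under this reading, the first value rounds to the 2,6% quoted in the text, and the second lies between 5,4% and 5,5%, consistent with the quoted 5,4% if that figure was truncated rather than rounded.</p></div>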
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">measure of IR-tool quality, evaluated classically using precision and recall measures</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">measure of system resource use: memory usage, network load, etc.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://www.altavista.com, http://www.alltheweb.com, http://www.google.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://www.yahoo.com, http://www.dmoz.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">A page that references many authority pages.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">A page that is referenced by many hub pages.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">www.yahoo.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">Institut d'Informatique et de Mathématiques Appliquées de Grenoble: hosts whose names end in .imag.fr</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>structure, we can deduce that this sample is strongly structured.</p><p>Transversal The target of the link is neither in the ascendant directories nor in the descendant directories, but is in the same site. 30% of links fall in this category. We can probably classify them as weak composition links or as reference links.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Cross site</head><p>The target is on another site: only 0,1% of links are concerned. They are candidate reference links.</p><p>Outside IMAG The target is outside the IMAG domain: only 2,5% of links, which are also candidate reference links.</p><p>We divide hierarchical links into three categories: horizontal, up and down.</p><p>#  <ref type="figure">8</ref>).</p><p>Up The source is deeper in the directory path. These links go up in the site hierarchy. This is the second most frequent category, with 25% (cf. figure <ref type="figure">8</ref>). There are more links going up than going down; we think that this is caused by the many "Back to the Top" links.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Down:</head><p>The target is deeper in the directory path. These links are less frequent (5%) (cf. figure <ref type="figure">9</ref>) than the other hierarchical links. We think that they could belong to the composition hierarchy.</p><p>We conclude that the directory path is not built in a random manner: the majority of the links follow it. The site also seems to have a strong consistency: 98% of the links are inside-sites links.</p></div>
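<div xmlns="http://www.tei-c.org/ns/1.0"><p>The link categories described above can be illustrated with a minimal Python sketch. It is not the analysis code used for this study. The function names classify_link and directory_parts are hypothetical, a site is assumed to be identified by its host name, the IMAG domain is assumed to correspond to the host suffix .imag.fr, and a "horizontal" link is assumed to be one whose source and target lie in the same directory. Links inside a single page are not treated separately.</p>
<p>
```python
from posixpath import dirname
from urllib.parse import urlsplit


def directory_parts(path: str) -> list:
    """Return the directory components of a URL path, ignoring the file name."""
    return [part for part in dirname(path or "/").split("/") if part]


def classify_link(source: str, target: str, domain_suffix: str = ".imag.fr") -> str:
    """Classify a link by comparing the hosts and directory paths of two URLs."""
    src, dst = urlsplit(source), urlsplit(target)
    src_host = (src.hostname or "").lower()
    dst_host = (dst.hostname or "").lower()

    # Target outside the studied domain (assumed to be a host suffix).
    if not dst_host.endswith(domain_suffix):
        return "outside-domain"
    # Target on another site of the same domain (assumption: site == host).
    if dst_host != src_host:
        return "cross-site"

    src_dirs = directory_parts(src.path)
    dst_dirs = directory_parts(dst.path)

    # Assumption: "horizontal" means source and target share a directory.
    if src_dirs == dst_dirs:
        return "horizontal"

    # If neither directory is a prefix of the other, the target is neither
    # an ascendant nor a descendant of the source: a transversal link.
    shared = min(len(src_dirs), len(dst_dirs))
    if src_dirs[:shared] != dst_dirs[:shared]:
        return "transversal"

    # Otherwise one directory contains the other: the link goes up or down.
    return "up" if len(src_dirs) > len(dst_dirs) else "down"


if __name__ == "__main__":
    page = "http://www.imag.fr/a/b/page.html"
    print(classify_link(page, "http://www.imag.fr/a/index.html"))
    print(classify_link(page, "http://www.imag.fr/a/d/other.html"))
```
</p>
<p>Comparing directory components rather than raw path strings avoids treating /a/bc/ as a descendant of /a/b/. A real crawler would also need to resolve relative URLs and redirections before classifying, which this sketch leaves out.</p></div>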
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Technical details: CLIPS-Index and Web pages analysis</head><p>We have developed a robot called CLIPS-Index (http://CLIPS-Index.imag.fr) in collaboration with Dominique Vaufreydaz (from the GEOD team), with the aim of creating Web corpora. This spider crawls the Web, collecting and storing pages. CLIPS-Index tries to collect as much information as possible in this heterogeneous context, which does not always respect existing standards. Collecting the Web correctly is an interesting problem in itself. In spite of this, our spider is quite efficient: for example, we collected (October 5 2000) 38'994 pages on the .imag.fr domain, compared with Altavista, which indexes 24.859 pages (October 24 2000) on the same domain, and AllTheWeb, which indexes 21.208 pages (October 24 2000). 3,5 million pages from French-speaking Web domains were collected in 4 days, using a 600 MHz PC with 1 GB of RAM. CLIPS-Index crawls this huge hypertext without considering non-textual pages, and respects the robot exclusion protocol <ref type="bibr" target="#b18">[19]</ref>. It does not overload remote Web servers, despite launching several hundred simultaneous HTTP queries. Running on an ordinary 333 MHz PC with 128 MB of RAM that costs less than 1.000 dollars, CLIPS-Index is able to find, load, analyze and store something like ½ ¾ million pages per day.</p></div>			</div>
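			<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of what respecting the robot exclusion protocol involves, the following Python sketch filters candidate URLs against the rules of a robots.txt file using the standard library module urllib.robotparser. It is not the CLIPS-Index implementation, whose internals are not described here. The helper names build_policy and allowed_urls, the user agent "ExampleBot", the example.org URLs and the sample rules are all hypothetical.</p>
<p>
```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a reusable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(policy: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs that the policy allows this user agent to fetch."""
    return [url for url in urls if policy.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical rules: every robot is asked to stay out of /private/.
    example_rules = "\n".join([
        "User-agent: *",
        "Disallow: /private/",
    ])
    policy = build_policy(example_rules)
    candidates = [
        "http://www.example.org/index.html",
        "http://www.example.org/private/data.html",
    ]
    for url in allowed_urls(policy, "ExampleBot", candidates):
        print(url)
```
</p>
<p>Parsing the rules once per host and reusing the resulting policy for every candidate URL keeps the check cheap, which matters when several hundred requests are in flight. Limiting the request rate per server is a separate concern that this sketch does not address.</p></div>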
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Interrogation d&apos;Hypertextes</title>
		<author>
			<persName><forename type="first">B</forename><surname>Amann</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1994">1994</date>
		</imprint>
		<respStmt>
			<orgName>Conservatoire National des Arts et Métiers de Paris</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Using common hypertext links to identify the best phrasal description of target Web document</title>
		<author>
			<persName><forename type="first">E</forename><surname>Amitay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Research and Development in IR (SIGIR&apos;98)</title>
				<meeting><address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Semistructured and structured data in the Web: Going back and forth</title>
		<author>
			<persName><forename type="first">P</forename><surname>Atzeni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mecca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Merialdo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Management of Semistructured Data</title>
				<meeting><address><addrLine>Tucson</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">30% accessible - a survey of the UK Wide Web</title>
		<author>
			<persName><forename type="first">D</forename><surname>Beckett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">World Wide Web Conference (WWW&apos;97)</title>
				<meeting><address><addrLine>Santa Clara, California</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Measuring the Web</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bray</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">World Wide Web Conference (WWW&apos;96)</title>
				<meeting><address><addrLine>Paris, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1996-05">May 1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The anatomy of a large-scale hypertextual Web search engine</title>
		<author>
			<persName><forename type="first">S</forename><surname>Brin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Page</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">World Wide Web Conference (WWW&apos;98)</title>
				<meeting><address><addrLine>Brisbane, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Graph structure in the Web</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Z</forename><surname>Broder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Maghoul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rajagopalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tomkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wiener</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">World Wide Web Conference (WWW&apos;00)</title>
				<meeting><address><addrLine>Amsterdam, Netherlands</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Spectral filtering for resource discovery</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">E</forename><surname>Dom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gibson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rajagopalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tomkins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Research and Development in IR (SIGIR&apos;98)</title>
				<meeting><address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">An integrated model for hypermedia and information retrieval</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chiaramella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kheirbek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Retrieval and Hypertext</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Agosti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Smeaton</surname></persName>
		</editor>
		<imprint>
			<publisher>Kluwer Academic Publisher</publisher>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Querying structured documents with hypertext links using OODBMS</title>
		<author>
			<persName><forename type="first">V</forename><surname>Christophidès</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rizk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Hypertext Technology (ECHT&apos;94)</title>
				<meeting><address><addrLine>Edinburgh, Scotland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A retrieval model for incorporating hypertext links</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Turtle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM Conference on Hypertext (HT&apos;89)</title>
				<meeting><address><addrLine>Pittsburgh, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A generic framework for structured document access</title>
		<author>
			<persName><forename type="first">F</forename><surname>Fourel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mulhem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-F</forename><surname>Bruandet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Database and Expert Systems Applications (DEXA&apos;98)</title>
				<meeting><address><addrLine>Vienna, Austria</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
	<note>LNCS 1460</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Smartweb : Recherche de zones de pertinence sur le world wide web</title>
		<author>
			<persName><forename type="first">M</forename><surname>Géry</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Congrès INFORSID&apos;99</title>
				<meeting><address><addrLine>La Garde, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Knowledge discovery for automatic query expansion on the World Wide Web</title>
		<author>
			<persName><forename type="first">M</forename><surname>Géry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Haddad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on the World-Wide Web and Conceptual Modeling (WWWCM&apos;99)</title>
				<meeting><address><addrLine>Paris, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="volume">1727</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A connectivity analysis approach to increasing precision in retrieval from hyperlinked documents</title>
		<author>
			<persName><forename type="first">C</forename><surname>Gurrin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Smeaton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text REtrieval Conference (TREC&apos;99)</title>
				<meeting><address><addrLine>Gaithersburg, Maryland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Extracting semistructured information from the Web</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hammer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aranha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Crespo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Management of Semistructured Data</title>
				<meeting><address><addrLine>Tucson</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Results and challenges in Web search evaluation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hawking</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Thistlewaite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Harman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">World Wide Web Conference (WWW&apos;99)</title>
				<meeting><address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1999-05">May 1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Authoritative sources in a hyperlinked environment</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Kleinberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM-SIAM Symposium on Discrete Algorithms (SODA&apos;98)</title>
				<meeting><address><addrLine>San Francisco, California</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">A method for Web robots control</title>
		<author>
			<persName><forename type="first">M</forename><surname>Koster</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
		<respStmt>
			<orgName>Internet Engineering Task Force (IETF)</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Searching the World Wide Web</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">280</biblScope>
			<date type="published" when="1998-04">April 1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Accessibility of information on the Web</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<date type="published" when="1999-07">July 1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Index structures for structured documents</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-J</forename><surname>Yoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yoon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM Conference on Digital Libraries (DL&apos;96)</title>
				<meeting><address><addrLine>Bethesda, Maryland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">The quest for correct information on the Web: Hyper search engines</title>
		<author>
			<persName><forename type="first">M</forename><surname>Marchiori</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">World Wide Web Conference (WWW&apos;97)</title>
				<meeting><address><addrLine>Santa Clara, California</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">GENVL and WWWW: Tools for taming the Web</title>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">A</forename><surname>McBryan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">World Wide Web Conference (WWW&apos;94)</title>
				<meeting><address><addrLine>Geneva, Switzerland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Querying the World Wide Web</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mendelzon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mihaila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Milo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Parallel and Distributed Information Systems (PDIS&apos;96)</title>
				<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Sizing the internet</title>
		<author>
			<persName><forename type="first">A</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">H</forename><surname>Murray</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
			<publisher>Cyveillance</publisher>
		</imprint>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Inferring structure in semistructured data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Nestorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Abiteboul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Motwani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Management of Semistructured Data</title>
				<meeting><address><addrLine>Tucson</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Using linguistic and discourse structures to derive topics</title>
		<author>
			<persName><forename type="first">F</forename><surname>Paradis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Information and Knowledge Management (CIKM&apos;95)</title>
				<meeting><address><addrLine>Baltimore, Maryland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Silk from a sow&apos;s ear: extracting usable structures from the Web</title>
		<author>
			<persName><forename type="first">P</forename><surname>Pirolli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pitkow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Human Factors in Computing Systems (CHI&apos;96)</title>
				<meeting><address><addrLine>Vancouver, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">The SMART retrieval system: experiments in automatic document processing</title>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1971">1971</date>
			<publisher>Prentice Hall</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">An investigation of documents from the World Wide Web</title>
		<author>
			<persName><forename type="first">A</forename><surname>Woodruff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M</forename><surname>Aoki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Brewer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gauthier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Rowe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Networks and ISDN Systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
