=Paper=
{{Paper
|id=None
|storemode=property
|title=Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages
|pdfUrl=https://ceur-ws.org/Vol-701/paper2.pdf
|volume=Vol-701
|dblpUrl=https://dblp.org/rec/conf/icdt/GeryC01
}}
==Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages==
Mathias Géry, Jean-Pierre Chevallet
Equipe MRIM (Modélisation et Recherche d’Information Multimédia)
Laboratoire CLIPS-IMAG, B.P. 53, 38041 Grenoble Cedex 9, France
E-mail: Mathias.Gery, Jean-Pierre.Chevallet@imag.fr
Abstract: The World Wide Web is a distributed, heterogeneous and semi-structured information space. With the growth of available data, retrieving interesting information is becoming quite difficult, and classical search engines often give very poor results. The Web is changing very quickly, while search engines mainly use old and well-known IR techniques. One of the main problems is the lack of explicit structure in HTML pages and, more generally, in Web sites. We show in this paper that it is possible to extract such a structure, which can be explicit or implicit: hypertext links between pages, implicit relations between pages, the HTML tags describing structure, etc. We present some preliminary results of a Web sample analysis extracting several levels of structure (a hierarchical tree structure and a graph-like structure).
Keywords: Web Information Retrieval, Web Page Analysis, Structure Extraction, Statistics
1 Introduction
The task of an Information Retrieval System (IRS) is to process a whole set of electronic documents (a corpus), with the aim of making it possible to retrieve those that match a user's information need. Unlike Database Management Systems (DBMS), the user expresses in a query the semantic content of the documents that he seeks. We distinguish two principal tasks:
Indexing: the extraction and storage of the documents' semantic content. This phase requires a representation model of these contents, called the document model.
Querying: the representation of the user's information need, generally in the form of a query. It is followed by the retrieval task and the presentation of the results. This phase requires a representation model called the query model, and a matching function to evaluate document relevance.
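For concreteness, here is a minimal sketch of these two tasks in a bag-of-words vector-space model. It is purely illustrative: the tokenization, TF-IDF weighting and matching function are our own assumptions, not the model developed in this paper.

```python
import math
from collections import Counter

def index_corpus(corpus):
    """Indexing: build one TF-IDF term-weight vector per document."""
    n = len(corpus)
    tfs = [Counter(doc.lower().split()) for doc in corpus]
    df = Counter(term for tf in tfs for term in tf)
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: f * idf[t] for t, f in tf.items()} for tf in tfs]

def query(text, vectors):
    """Querying: rank documents by a cosine-like match with the query."""
    q = Counter(text.lower().split())
    scores = []
    for i, v in enumerate(vectors):
        dot = sum(w * q[t] for t, w in v.items() if t in q)
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        scores.append((dot / norm, i))
    return sorted(scores, reverse=True)

# e.g. query("web structure", index_corpus(["web page structure", "database schema"]))
```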
IRSs have classically been used for textual document databases, or multimedia databases such as medical corpora. The growth of the Web constitutes a new field of application for IR. The number of users on the Web was estimated at 119 million in 1998 (NUA Ltd Internet survey, July 1998) and 333 million in 2000 (NUA Ltd Internet survey, June 2000). The number of accessible pages was estimated at 320 million in December 1997 [20], at 800 million in February 1999 [21] and at more than 2 billion in July 2000 [26].
The Web is a huge and sometimes chaotic information space without central authority. In this context, and in spite of standardization efforts, its documents are very heterogeneous in both content and presentation: the HTML standard is respected by less than 7% of HTML pages [4]. We can expect to find almost everything there, but retrieving relevant information is like finding the needle in the haystack...
Nowadays, the great challenge for IR research is to help people benefit from the huge amount of resources available on the Web. But no approach yet exists that satisfies this information need in a way that is both effective 1 and efficient 2. To assist users in this task, some search engines (like AltaVista, AllTheWeb or Google 3) are available on the Web. They are able to process huge document volumes, with several tens of millions of indexed pages. They are nevertheless very fast, able to answer several thousand queries per second. In spite of all their efforts, the answers provided by these systems are generally not very satisfactory. Preliminary results obtained with a test collection of the TREC Web Track showed the poor result quality of 5 well-known Web engines compared to that of 6 systems participating in TREC [17].
1 A measure of IR-tool quality, classically evaluated using precision and recall measures.
2 A measure of system resource use: memory usage, network load, etc.
3 http://www.altavista.com, http://www.alltheweb.com, http://www.google.com
In fact, most existing search engines use well-known techniques like those described by Salton 30 years ago [30]. Most of them prefer wide coverage of the Web with low indexing quality over better indexing of a smaller part of the Web. In particular, they generally consider HTML pages as atomic and independent documents, without taking into account the relations existing between them. The notion of document for a search engine is reduced to its physical appearance: an HTML page. Only a few Web search engines, like Google [6], use the Web's structure.
With an aim of structured IR, we wanted to determine which structure exists on the Web, and which structure it is possible to extract. This paper is organized as follows: after a presentation of related work (i.e. IR with structured documents, hypertexts and the Web) in section 2, we present in section 3.1 our hypotheses about what an ideal structure for the Web would be. We then propose in section 3.2 our approach to validate these hypotheses and to check whether this kind of structure exists on the Web. Finally, we introduce in section 4.1 the Web sample that we have analysed, and in sections 4.2, 4.3 and 4.4 some preliminary results of our experiments, while section 6 concludes this work-in-progress and outlines some future directions.
2 IR and structure on the Web
The Web is not just a simple set of atomic documents. The HTML standard allows the description of structured multimedia documents, and it is widely used to publish on the Web. Furthermore, the Web is a hypertext, using URLs (Uniform Resource Locators) to describe links. This structure has been used for IR, both in the context of structured documents and in that of classical hypertexts. We distinguish 3 main approaches proposing structure-based information access techniques: the navigation, DBMS and IR approaches.
2.1 Navigation approach
Navigation is based on links, used to find and consult interesting information. Within a hypertext composed of several hundred nodes, this solution can be useful. The task is more difficult to achieve on larger hypertexts, mainly because of disorientation and cognitive overload problems. Furthermore, the right links must be available at the right place. One solution is proposed by “Web directories” such as Yahoo! or the Open Directory Project 4, which offer an organized hierarchy of several million sites. These hierarchies are built and verified manually, which makes them expensive and difficult to keep up to date. Furthermore, exhaustiveness is impossible to reach.
2.2 DBMS approach
Documents are represented using a data schema, encapsulated in a relational or object-oriented [10] model. This allows interrogation using a declarative query language, based on exact matching and constrained by the structure of the data schema. The integration of hypertext structure into the database schema has been much studied, for example by [25], [3] (ARANEUS project), [16] (TSIMMIS project), etc. Integration attempts at the query-language level can be found in hyperpaths [1] or POQL [10]. In fact, these approaches are extensions of the solutions proposed for integrating document structure.
2.3 IR approach
The IR approach deals with structured documents, promoting hierarchical indexing: during the indexing process, information is propagated from document sections to the top of the document, along composition relations.
4 http://www.yahoo.com, http://www.dmoz.org
This method is refined by Lee [22], who distinguishes several strategies of ascent. Paradis [28] distinguishes several structure types, the data ascent depending on the different link types.
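As an illustration of such an indexing ascent, here is a minimal sketch; the uniform decay factor is only one possible strategy among those that Lee [22] distinguishes, and the node representation is our own assumption.

```python
from collections import Counter

def ascend(node, decay=0.5):
    """Propagate each section's terms up to its parent along composition
    relations, attenuated by `decay` at each level.
    A node is a dict: {"terms": Counter, "children": [child nodes]}."""
    index = Counter(node["terms"])
    for child in node.get("children", []):
        for term, w in ascend(child, decay).items():
            index[term] += decay * w  # attenuated ascent of the child's terms
    node["index"] = index             # index of the whole subtree
    return index
```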
Hypertext structure has also been taken into account at the indexing step. For example, the hypertext graph can be incorporated into a global indexing schema using the conceptual graph model [9] or inference networks [11]. The World Wide Web Worm [24] enables the indexing of multimedia documents by using the text surrounding anchors. Amitay [2] also promotes the use of a document's context. Marchiori [23] adds the notion of “navigation cost”, which expresses the navigation effort needed to reach a given page.
SmartWeb [13] considers the information accessible from a Web page at indexing time, so page relevance is evaluated considering not only the page content but also the content of the page's neighbors. Kleinberg (HITS [18]) promotes the use of both link directions: he introduces the concepts of hub page 5 and authority page 6. For automatic resource compilation, the CLEVER system [8], based on the same idea, obtains good results against manually generated compilations (Yahoo! 7). Gurrin [15] has tried to improve Kleinberg's approach: he distinguishes 2 link types (structural and functional) and uses only the structural ones. The well-known Google search engine [6] uses the text of anchors to describe the pages referenced by the links from these anchors.
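A minimal sketch of the hub/authority iteration underlying HITS [18]; the graph representation and the fixed iteration count are illustrative assumptions, not Kleinberg's exact formulation.

```python
import math

def hits(pages, links, iterations=20):
    """Iteratively refine hub and authority scores.
    `links` is a set of (source, target) page pairs."""
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is referenced by good hubs...
        auth = {p: sum(hub[s] for s, t in links if t == p) for p in pages}
        # ...and a good hub references good authorities.
        hub = {p: sum(auth[t] for s, t in links if s == p) for p in pages}
        # Normalize so the scores stay bounded.
        na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```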
2.4 Related Works : Discussion
We think that the navigation approach is well adapted to manually manageable collections, but the Web is too big to be accessed through navigation alone. Navigation can be an interesting complement to other techniques, for example to consult search results.
Concerning DBMS approaches, we think that a declarative query language is not adapted to the Web's heterogeneity. Moreover, these approaches rely on the underlying database schema, and Web pages have to be expressed following this schema or following predefined templates. Like Nestorov [27], we think that even if some Web pages are strongly structured, this structure is too irregular to be modeled efficiently with structured models such as the relational or object models.
The IR approach enables natural language querying, and considers relevance in a hypertext context. At present, most IR approaches are based on the use of page connectivity, with the notion of relevance propagation along links. The drawback is that this information is poorly exploited, because relations (links) and nodes (documents) are not typed on the Web.
We think that these approaches are interesting and useful. The lack of explicit Web structure to improve them encourages us to work on Web structure extraction. Several works have focused on statistical studies [5], [4], [31], dealing with the use of HTML tags or with the distribution of links, which leads for example to the notion of hub and authority pages. Pirolli [29] has categorized Web pages into 5 predefined categories related to site structure, based on usage, site connectivity and content data. Broder [7] has studied the Web's connectivity and extracted a macroscopic Web structure. But none of these works deals with the extraction of Web structure (structured documents and hypertexts) for IR purposes.
3 Is the Web well structured?
The main objectives of our Web sample analysis are to identify the Web's explicit structure and to extract its implicit structure. Obviously, the Web is not structured in the database sense of the term. But HTML allows people to publish structured sites. Thus we will talk about a hierarchically structured Web as well as a Web structured in the hypertext sense. The question is: “Is the Web sufficiently structured (especially hierarchically) to index it following a structured IR model?”. This main objective leads us to some other interesting questions, like: “What is a document on the Web?” or “How can a Web link be classified?”.
We present our approach to answering these questions. Firstly, in section 3.1, we present the kind of structure that we want to identify and extract from the Web; we hypothesize that this ideal structure for the Web exists. The underlying problem concerns a structured IR model: our final goal is to develop an IR model adapted to the Web.
5 A page that references many authority pages.
6 A page that is referenced by many hub pages.
7 http://www.yahoo.com
Secondly, we present in section 3.2 our interpretation of HTML usage in relation to our hypotheses. The Web's structure depends mainly on how site authors use HTML, so we finally present in section 4 some preliminary results of a Web sample analysis.
3.1 Hypotheses
We will try to check the following assumptions, directly related to the concept of information in a semi-
structured and heterogeneous context:
Hypothesis 1: information granularity. We think that information units on the Web can have different granularities. By assembling these various constituents, one can build entities of larger size. We distinguish at least 6 main granularities, detailed below (a small data-structure sketch follows the list):
H1.1: elementary constituent. We think that the Web exhibits a notion of elementary constituent, which could be the morpheme, the word or the sentence. In our approach, the elementary constituent is at the sentence level.
H1.2: paragraph. By assembling sentences, one can build paragraph-sized entities. This is our first level of structure. This structure is a list, reflecting the logical sequence needed to constitute an understandable unit.
H1.3: document section. This second level includes all the elements that compose a “classical” document: sub-sections, sections, chapters, etc. All of them are built from paragraphs. They also carry other attributes, like title, author, etc.
H1.4: document. This third level is the first one to introduce a tree-like structure, based on document sections. Moreover, the reader must follow a reading direction for a better understanding. For example, people generally read the “introduction” before the “conclusion”.
H1.5: hyper-document. This level loses the reading organization when gluing documents together. It can be associated with parts of a hypertext, where a reading direction is no longer compulsory.
H1.6: cluster of hyper-documents. This last level is useful to glue together hyper-documents that have some characteristics in common, like theme or authors. It can be seen as the library-shelf metaphor.
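One possible, purely illustrative encoding of these six levels as a data type (the names are ours, not the paper's):

```python
from enum import IntEnum

class Granularity(IntEnum):
    """The six granularity levels of hypothesis 1 (H1.1-H1.6)."""
    SENTENCE = 1        # H1.1: elementary constituent
    PARAGRAPH = 2       # H1.2: ordered list of sentences
    SECTION = 3         # H1.3: sub-sections, sections, chapters, ...
    DOCUMENT = 4        # H1.4: tree of sections, with a reading direction
    HYPER_DOCUMENT = 5  # H1.5: glued documents, no mandatory reading order
    CLUSTER = 6         # H1.6: hyper-documents sharing theme or authors
```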
Hypothesis 2: relations. There are various relations between documents, whatever their granularity. We distinguish at least 3 main relation types, detailed below (a small sketch of typed links follows the list):
H2.1: composition. This relation expresses the hierarchical (tree-like) building of a higher-granularity entity. It is used in the first five levels of the granularity description above (e.g. paragraphs are composed of sentences). Composition deals with attributes shared along composition relations, for example the author's name. It also deals with the lifetime of compound elements: a paragraph no longer exists without its sentences. Composition can be split into weak and strong composition, according to the sharing status: the composition is weak if an element can be shared. In that case the relation draws a lattice; otherwise we obtain a tree.
H2.2: sequence. Certain document parts can be organized by the author in an orderly way: part B precedes part C and follows part A. This order suggests a reading direction to the reader, for a better understanding. This relation only concerns H1.1 to H1.4. It can be modeled using the probability P(B|A) that a part B is best understood after the reading of a part A; this conditional probability can serve as the fuzzy value of the sequence relation from A to B.
H2.3: reference. This relation is weak, in the sense that it can link elements at any granularity level that have something in common. For example, an author can refer to another document for complementary information, or two documents can refer to each other because of their similarity.
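Purely as an illustration, such typed links could be represented as follows; for a sequence link, the weight holds the conditional probability of H2.2. All names and values here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Relation:
    """A typed link between two information units, at any granularity."""
    source: str
    target: str
    kind: str            # "composition", "sequence" or "reference"
    weight: float = 1.0  # for "sequence": P(target | source), the fuzzy
                         # value that target is best read after source

links = [
    Relation("section-1", "document", "composition"),       # tree-like build
    Relation("intro", "conclusion", "sequence", 0.9),       # reading direction
    Relation("document", "similar-doc", "reference", 0.4),  # weak link
]
```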
The next generation of Web search engines will have to consider all these granularities and relations. In the next section, we interpret HTML usage on the Web in relation to these hypotheses.
3.2 Web analysis to validate assumptions
Our objective is to study different characteristics of the Web, with the aim of validating our hypotheses. Leaving aside sub-sentence granularities, we have hypothesized that there exist 6 main granularities on the Web (cf. section 3.1), from the sentence level up to the cluster-of-hyper-documents level. To validate hypotheses 1.1, 1.2 and 1.3, we have chosen HTML tags as the markers of inside-page granularities (a sketch of such a tag analysis is given below).
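A minimal sketch of such a tag analysis, using Python's standard HTML parser; this is our assumption about how the counting could be done, not the tooling actually used for the study.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count opening tags in an HTML page, to measure how authors
    use structural markup (paragraphs, headings, lists, ...)."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def tag_profile(html_source):
    parser = TagCounter()
    parser.feed(html_source)
    return parser.tags

# e.g. tag_profile("<html><body><p>One.</p><p>Two.</p></body></html>")
# -> Counter({'p': 2, 'html': 1, 'body': 1})
```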
H1.1 It is possible with HTML to describe elementary constituents using various tags: several are at the presentation level, others at the semantic level. We place our analysis at the sentence level, and we have not found many tags that explicitly isolate sentences; all other tags are internal sentence elements.
H1.2 We propose to place at this level simple paragraphs and “block elements”.