-

78 87

a class room experiment. Section 4 shows how the structural query approach ponents of the framework and their application in the domain of ER models models are abstract enough to deliver information about the stored data presents a realization of this approach based on a framework for general model trol of retrieval behavior is explained in section 5. Before we summarize and nal ER models of Deep Web information systems and provides a front end generation of search engines. A keyword description of script pages (the enconclude this paper, section 6 gives an overview of other approaches to search the information provided by keywords relies on the relations between them try points to the Deep Web) can only describe very vague the Deep Web an information of the Deep Web and therefore not available for the index It is interesting to see that the structural information (lost when a user’s question is transformed to a set of keywords) is actually used to manage the information of the Deep Web: Most of the sites oering dynamic W eb pages in the next section. Section 3 evaluates the structural retrieval approach in models (ER models [4]) accurately describe the content and relations. ER without using the data itself. Imagine a search engine that indexes the internot deliver a link to a travel agency, because "Frankfurt" and "Hawaii" is and are not expressed. A search engine for Deep Web information should use to formulate queries in a structural manner like ER modelling. This paper retrieval [12]. The paper is structured as follows: We briey describe the comcan be combined with the necessary keyword-based retrieval. Dynamic conkeywords like "igh t", "from", "Frankfurt", "to", "Hawaii" would probably this structural information, too. information and the real content of a Web site. Additionally the main part of for Deep Web information. is used to transmit user input to the Web site. Search engines feeded with store their data within relational databases for which Entity-Relationship

Transaction

(1,1) performs

(0,N)

Account (1,1) has (0,N) Customer of a one-to-one mapping between ER models and SSMs following the rules As we can assume w.l.o.g. that there are no isolated nodes in a SSM, this is in a relational database and provides algorithms to specify the similarity bethe mappings we implemented graphical user interfaces used to state queries Figure 1 shows a simple ER model and its corresponding SSM. The mapping tween SSMs. This similarity exploits the adjacency structure of SSM graphs. nodes of the same type. The number q expresses the number of edges in G and to visualize the stored SSMs as natural ER models.

Mappings are always one to one. The matching quality is a rational number ( (v); (w)) 2 The matching quality realized by the mapping is dened E0. entities and dependencies between entities and relationships become nodes rules enforce that SSMs always decompose into two bipartite graphs. Apart of 0 by the maximum over all matching qualities of mappings from G onto G .

Given two graphs G = (V; E) and = (V we look at partial mappings G0 0; E0), between 0 and 1 with a value of 1 indicating that there exists a mapping beon the second layer with types derived from the cardinalities of the relations. the case if and only if both graphs are isomorphic. that are implicitly mapped onto edges in i.e. edges (v; w) 2 E such that G0, layer. Relationships become nodes of the third layer and the relations between of [12]: Dieren t kind of entities become dieren t kind of nodes of the rst tween the library graph and the query graph that matches exactly all edges. of the nodes from G onto the nodes of which only map nodes onto G0 We extended the framework to the eld of ER models by the denition The server part of Firestorm remains merely unchanged. It stores SSMs as d( ) = 2q=(jEj + The similarity between two graphs is then dened jE0j).

Relationship Relationship performs has polynomial for a xed n umber of nodes on layer 2, and a reasonable algorithm the exact algorithm has to be applied. Further details of the algorithms are solves for each of them the matching problems on layers 1 and 3 to optimality is N P -hard. But the three-layer structure of SSMs supports the design of heuristic and exact retrieval algorithms as well as a fast lter. The exact alThe authors of [11] showed that nding a mapping of optimal quality by using algorithms for weighted bipartite matching. The exact algorithm is up the retrieval time tremendously by reducing the set of graphs to which for small numbers of nodes on layer 2. Filter and heuristic algorithms speed explained in [11]. gorithm enumerates all feasible mappings of nodes from the second layer and proof is out of sight. of corresponding ER. Modelling is an art and one may argue that e.g. the evaluated the solutions of the students and assigned credit points according models. The restrictiveness in structure and available means to create ER made it possible to evaluate whether high similarity on the SSMs (respectively positive answer would promise to apply the structural retrieval approach for models somehow limits the modelling freedom but a mathematical quality ilarities with maximal credit points. similarity between dieren t ER models to the graph similarity between their ships so that we can concentrate only on the structure. Some sta mem bers main of ER models. This means that the system can report the similarity models for three dieren t real world cases. These cases are formulated as natural language text. The text implies the naming of entities and relationFirestorm is used by students to create and manage their ER models.

University of Tubingen (Germany) is on database and information systems. same real world situation can be modelled in very dieren t ways leading to These master models were used as queries to the system with the corbetween two dieren t ER models. The focus of our working group at the responding student models building the library. The system computed the similarities were converted into credit points by simply multiplying the simto how well the ER models represent the real world cases. For each exercise a dieren t SSMs. The structural retrieval approach reduces the detection of This section will show how well the structures of SSMs capture the semantics sta mem ber created a master model representing the expected solution. This Part of a database course is to learn how to create ER models that describe SSMs but the retrieval quality depends on semantic similarity between ER semantics of the modelled real world cases, expressed by a high grade. A the domain of ER models.

As explained in the previous section Firestorm was extended to the do(graph) similarity between each student model and the master model. The The course contains three exercises that require students to design ER an aspect of the real world to be managed with a database. In this course the corresponding ER models) reects a certain similarity according to the models (as long as the context is clear). the sta mem bers to the credit points assigned by Firestorm. The dierences between SSMs can be used to determine semantic similarity between ER in credit point assignment are marginal so it seems that graph similarity The results are encouraging. Figure 3 relates the credit points assigned by a users chooses the third tree level for the second layer the algorithms can and specializes the types along the tree paths. Figure 4 shows the subtree of cardinality distinction ((0,1),(1,1),(0,N),(1,N) and (N,M)). connections that represent two dieren t kinds of connection cardinalities in root type the second level denes the t ypes of all nodes on the second layer. words which subtypes can be matched to each other. This means e.g. that if should be used by the algorithms for the similarity computation, in other Assume that the SSM of the query model and the SSMs of all library dieren t types. We dene a t ype tree that starts with a generic root type and the second layer SSM node types for the domain of ER models. Following the The control functionality relies on a type tree of SSM nodes. As we know (min; max) notation. The fourth level of the tree captures the most specic models only contain nodes of the most specic t ype tree levels. Before a user triggers a retrieval action the user species for each layer which tree level from section 2 the elements of ER models are mapped onto SSM nodes of third level of the tree distinguishes between (*,N) connections and (*,1) These nodes represent connections between entities and relationships. The in databases exist nearly as long as the Web exists. There are a number of interface for simple text or boolean queries. The users can adjust the search contains the addresses of about 103000 databases organized in a directory structural dependencies between dieren t information units captures much The LII [9] is a free search engine with an annotated, searchable subject Also from BrightPlanet is the free search engine CompletePlanet [2]. It and boolean operators are the means to state queries.

Searching for information that is hidden in the Deep Web is not a new search engines that index those sites and query them according to some user Neither of the presented search engines use the database structure of retrieval approach. ple text input. resemble much the other mentioned directory-based search engines. strategies and result presentation to its own preferences.

Query interfaces require simple text and boolean queries. They are easier to structure with categories and subcategories. The search interface allows sim[3] from BrightPlanet. It searches among 2200 databases and provides a query input. We look at some of those engines and relate them to the structural directory of about 9000 Web resources. The resources are selected and evalumore semantics and therefore allows a more precise search. task. Sites providing information generated by scripts out of data stored ated by librarians for their usefulness to users of public libraries. Simple text One of the most popular commercial Deep Web search engines is LexiBot Deep Web information systems. They describe the content of those systems use than our query interface that is based on ER models. But especially the The search engines of IntelliSeek (ProFusion [8] and InvisibleWeb [7]) with keywords and organize them (by hand or automated) within a directory. 6. A. M. Georion. An in troduction to structured modeling. Management Science, 33(5):547{588, 1987. 5. S. Decker. The Semantic Web Community Portal, 2001. complex models. SIGMOD Record, 29(4):55{63, 2000.

Proceedings of the 1th Conference on Very Large Databases, Morgan Kaufman 3. BrightPlanet. LexiBot, 2001. http://www.lexibot.com/. 2. BrightPlanet. CompletePlanet, 2000. http://www.completeplanet.com/. www.semanticweb.org/. 1. P. A. Bernstein, A. Y. Halevy, and R. Pottinger. A vision of management of 4. P. P. S. Chen. The entity-relationship model | toward a unied view of data. pubs. (Los Altos CA), Kerr (ed), pp.173, 1975. unfamiliar with ER modelling but it is a standardized and powerful way among the individual elements. The management of RDF denitions can according to recall and precision. dynamic control of retrieval behavior based on type trees provides the users Inuenced by the special requirements of the Web we describe a technique to combine structural and keyword-based retrieval. A functionality for the with means to adjust the system with respect to the individual preferences our system to provide Web users a promising way to search for information benet a lot b y an RDF extension of Firestorm. to express dependencies between information objects. The simple graphical retrieval approach has yet to prove its applicability under the real world information. The semantic description is mostly done following the principles user interface should allow users with basic knowledge to specify their own to describe the desired information in structural correlation. Many users are up many applications that require tools to handle semantic descriptions of the general model retrieval framework Firestorm to the domain of ER models. sites need to register with the system. Most of the target sites store their data Users have to state their queries to the system by specifying ER models that provide information contained in the Deep Web. Therefore we extend ER models. conditions of the Web. Therefore we hope that many Web sites register to within relational databases. If they do not have ER models that describe their of RDF [15]. RDF denitions resemble graphs that show the dependencies Another drawback of the approach is the absence of ER models that Web Apart of the sandbox evaluation presented in section 3 the structural of the Deep Web. databases a tool could probably create a model out of the database schema.

Looking into the future the ongoing hype of the Semantic Web [5] brings This paper describes a structural retrieval approach to search for Web sites