Application of geometrical approaches to Information Retrieval ° c Vasyl Tereshchenko National Taras Shevchenko University of Kyiv vtereshch@gmail.com Abstract 3. The result of Problem B resolving is transduced into correct result for Problem A. In this paper we propose an idea of transforma- Theorem. The search problems in relational database tion relational database problems to computa- are transformed to computational geometry search prob- tional geometry problems to develop more ef- lems in time O(N ). ficient algorithms for discovering useful infor- Proof. To prove the theorem, it is necessary to prove mation from databases. We consider in detail the fulfillment of three conditions mentioned above. To relational algebra operations - the base of rela- this end, let us formalize input data sets of the relational tional language foundation - and give adequate database search problem in terms and concepts of the ge- geometrical interpretation for each of them. ometric search problem, and per contra, results of ge- ometric search problem solution interpreted in terms of 1 Introduction databases. Let each tuple of relation R put in accordance to some Over the past ten years the relational database manage- point (or IOW n-plex) of geometric space ER . Let each ment systems (DBMS) have become wide applicable in attribute of relation R put in accordance to some coordi- different areas such as automated design system, CAE nate axis in the following way: axis value area is defined system, geographic information system, office informa- by domain, the attribute is specified under so that value of tion system and so on. However, the relational database each tuple element corresponds to some coordinate value management systems have limited capacity from the ob- of corresponding space point. Such a correspondence is ject’s modeling viewpoint. That makes the DBMS to “one-one”. Ex facte, input data for relational database be non-applicable for the complicated specialized ap- problems are transformed into corresponding input data plications. Also, the recent progress of communication for computational geometry problem in time O(N ) and and network technologies makes it easy to accumulate a the received computational geometry problem solution is large collection of unstructured or semi-structured texts transformed into correct solution for relational database data [2, 5, 6]. In this context, the problem of searching problem in time O(N ) also. Let us consider the main op- more efficient algorithms to discover useful information erations of relational algebra that is the base for relational from large non-structured databases that differs from ex- languages creation. And by using examples of relational istent information retrieving methods is a point of big algebra search queries, we proved their geometrical re- interest [7, 4]. The work [3] is worth to be mentioned, alization (condition 2), and hence, the transformation of since it is devoted to problem of discovering data in large two classes of problems, mentioned above. semistructured text collections. The paper proposes an algorithm based on one of the Selection (S = σpredicate (R)). computational geometry methods that is called the re- Selection is a unary operation. The result of the se- gional search algorithm and speeds up search substan- lection is a new relation S containing only those tuples tially. So nomogenously the subject about the possibil- of the input relation R that holds the specified condition ity to transform relational database problems to computa- (predicate). Let the relation R of the relational database tional geometry problems has been occurred taking into put in accordance to the subspace ER of the space E n . account a high efficient of geometric algorithms. As it was mentioned above, the rank d of the relation R defines the dimension of the corresponding subspace 2 The geometrical approach to informa- ER , Figure 1. Predicate in the selection operation defines some do- tion retrieval main (plane of the rank k < d). We are interested in all those points of the subspace ER that lie within the Definition. Problem A are transduced into Problem B, defined domain. Thereby, the predicate determines the if: search region in the subspace ER , and under the geomet- 1. The input data for Problem A are transduced into ric interpretation the result of the selection operation is corresponding input data for Problem B. the query about the points set of the subspace ER that 2. The Problem B is solved. lie within the queried region. Thus, the regional search corresponds to the selection operation. There were pro- Proceedings of the Spring Researcher’s Colloquium on Database posed several solutions of the regional search. Among and Information Systems, Moscow, Russia, 2011 them, the algorithm based on the orthogonal range tree Figure 1: Figure 3: Assume that we are given a point P = (1) (2) (n) (x , x , . . . xP ) over a subspace ER , dim(ER ) = d; where L = {l1 , l2 , . . . , ld } is a set of the coordinate axis of the subspace ER . Then Πatr.i (R) ≡ prli (ER ) = {P 0 |P 0 = prli P, ∀P ∈ ER }= (i) {P 0 |P 0 = (0, . . . , 0, xP , 0, . . . , 0), ∀P ∈ ER }, Πatr.i,...,atr.j (R) ≡ prπ (ER )={P 0 |P 0 = prπ P, π li , . . . , lj , ∀P ∈ ER }= {P 0 |P 0 = (i) (j) (0, . . . , 0, xP , . . . , xP , 0, . . . , 0), ∀P ∈ ER }. Union (R ∪ S). The union of two relations R and S with tuples I and Figure 2: J correspondingly results their concatenation by forma- method and described in Preparata and Shamos [1] is tion a new relation enclosing the maximal number of tu- worth to be mentioned. The very algorithm uses a data ples (I + J), if the duplicated tuples are expunged. The structure called the orthogonal range tree that requires relations R and S should be a union compatible (i.e., they O(log d−1 N ) time per query, O(N log d−1 N ) space and should have the same number of attributes with coinci- O(N log d−1 N ) preprocessing time, where the N is the dent domains). Let the relations R and S of the relational number of points and d is the space dimension. For an database put in accordance to the subspace ER and ES of example let us consider the following relations: the space E n correspondingly. In the geometric space the union compatibility corresponds to the following condi- PRODUCER (PR, Surname, City, Status) tions: CUSTOMER (CS, Surname, City) 1. Relations R and S have the same number of at- DETAIL (DT, Name, Weight) tributes ↔ corresponding subspaces ER and ES have the CPD ({CS, PR, DT}, Quantity, Price) same dimension R ↔ ER Query 1. Find out the list of all the details with the S ↔ ES => dim(ER ) = dim(ES ) weight in range (0.2; 0.45). 2. Domains coincides ↔ corresponding subspaces ER and ES are given under the same field. Thus, the This query is composed in such a way: union compatibility of the relations R and S corresponds to the isomorphism of the subspaces ER and ES . Under σ0.2