=Paper= {{Paper |id=None |storemode=property |title=Application of Geometrical Approaches to Information Retrieval |pdfUrl=https://ceur-ws.org/Vol-735/paper1.pdf |volume=Vol-735 |dblpUrl=https://dblp.org/rec/conf/syrcodis/Tereshchenko11 }} ==Application of Geometrical Approaches to Information Retrieval== https://ceur-ws.org/Vol-735/paper1.pdf

Application of geometrical approaches to Information Retrieval

°
c Vasyl Tereshchenko
National Taras Shevchenko University of Kyiv
vtereshch@gmail.com

Abstract 3. The result of Problem B resolving is transduced
into correct result for Problem A.
In this paper we propose an idea of transforma- Theorem. The search problems in relational database
tion relational database problems to computa- are transformed to computational geometry search prob-
tional geometry problems to develop more ef- lems in time O(N ).
ficient algorithms for discovering useful infor- Proof. To prove the theorem, it is necessary to prove
mation from databases. We consider in detail the fulfillment of three conditions mentioned above. To
relational algebra operations - the base of rela- this end, let us formalize input data sets of the relational
tional language foundation - and give adequate database search problem in terms and concepts of the ge-
geometrical interpretation for each of them. ometric search problem, and per contra, results of ge-
ometric search problem solution interpreted in terms of
1 Introduction databases.
Let each tuple of relation R put in accordance to some
Over the past ten years the relational database manage- point (or IOW n-plex) of geometric space ER . Let each
ment systems (DBMS) have become wide applicable in attribute of relation R put in accordance to some coordi-
different areas such as automated design system, CAE nate axis in the following way: axis value area is defined
system, geographic information system, office informa- by domain, the attribute is specified under so that value of
tion system and so on. However, the relational database each tuple element corresponds to some coordinate value
management systems have limited capacity from the ob- of corresponding space point. Such a correspondence is
ject’s modeling viewpoint. That makes the DBMS to “one-one”. Ex facte, input data for relational database
be non-applicable for the complicated specialized ap- problems are transformed into corresponding input data
plications. Also, the recent progress of communication for computational geometry problem in time O(N ) and
and network technologies makes it easy to accumulate a the received computational geometry problem solution is
large collection of unstructured or semi-structured texts transformed into correct solution for relational database
data [2, 5, 6]. In this context, the problem of searching problem in time O(N ) also. Let us consider the main op-
more efficient algorithms to discover useful information erations of relational algebra that is the base for relational
from large non-structured databases that differs from ex- languages creation. And by using examples of relational
istent information retrieving methods is a point of big algebra search queries, we proved their geometrical re-
interest [7, 4]. The work [3] is worth to be mentioned, alization (condition 2), and hence, the transformation of
since it is devoted to problem of discovering data in large two classes of problems, mentioned above.
semistructured text collections.
The paper proposes an algorithm based on one of the Selection (S = σpredicate (R)).
computational geometry methods that is called the re- Selection is a unary operation. The result of the se-
gional search algorithm and speeds up search substan- lection is a new relation S containing only those tuples
tially. So nomogenously the subject about the possibil- of the input relation R that holds the specified condition
ity to transform relational database problems to computa- (predicate). Let the relation R of the relational database
tional geometry problems has been occurred taking into put in accordance to the subspace ER of the space E n .
account a high efficient of geometric algorithms. As it was mentioned above, the rank d of the relation
R defines the dimension of the corresponding subspace
2 The geometrical approach to informa- ER , Figure 1.
Predicate in the selection operation defines some do-
tion retrieval main (plane of the rank k < d). We are interested in
all those points of the subspace ER that lie within the
Definition. Problem A are transduced into Problem B, defined domain. Thereby, the predicate determines the
if: search region in the subspace ER , and under the geomet-
1. The input data for Problem A are transduced into ric interpretation the result of the selection operation is
corresponding input data for Problem B. the query about the points set of the subspace ER that
2. The Problem B is solved. lie within the queried region. Thus, the regional search
corresponds to the selection operation. There were pro-
Proceedings of the Spring Researcher’s Colloquium on Database posed several solutions of the regional search. Among
and Information Systems, Moscow, Russia, 2011 them, the algorithm based on the orthogonal range tree
Figure 1: Figure 3:

Assume that we are given a point P =
(1) (2) (n)
(x , x , . . . xP ) over a subspace ER , dim(ER ) = d;
where L = {l1 , l2 , . . . , ld } is a set of the
coordinate axis of the subspace ER . Then
Πatr.i (R) ≡ prli (ER ) = {P 0 |P 0 = prli P, ∀P ∈ ER }=
(i)
{P 0 |P 0 = (0, . . . , 0, xP , 0, . . . , 0), ∀P ∈ ER },
Πatr.i,...,atr.j (R) ≡ prπ (ER )={P 0 |P 0 =
prπ P, π li , . . . , lj , ∀P ∈ ER }= {P 0 |P 0 =
(i) (j)
(0, . . . , 0, xP , . . . , xP , 0, . . . , 0), ∀P ∈ ER }.

Union (R ∪ S).
The union of two relations R and S with tuples I and
Figure 2:
J correspondingly results their concatenation by forma-
method and described in Preparata and Shamos [1] is tion a new relation enclosing the maximal number of tu-
worth to be mentioned. The very algorithm uses a data ples (I + J), if the duplicated tuples are expunged. The
structure called the orthogonal range tree that requires relations R and S should be a union compatible (i.e., they
O(log d−1 N ) time per query, O(N log d−1 N ) space and should have the same number of attributes with coinci-
O(N log d−1 N ) preprocessing time, where the N is the dent domains). Let the relations R and S of the relational
number of points and d is the space dimension. For an database put in accordance to the subspace ER and ES of
example let us consider the following relations: the space E n correspondingly. In the geometric space the
union compatibility corresponds to the following condi-
PRODUCER (PR, Surname, City, Status) tions:
CUSTOMER (CS, Surname, City) 1. Relations R and S have the same number of at-
DETAIL (DT, Name, Weight) tributes ↔ corresponding subspaces ER and ES have the
CPD ({CS, PR, DT}, Quantity, Price) same dimension
R ↔ ER
Query 1. Find out the list of all the details with the S ↔ ES => dim(ER ) = dim(ES )
weight in range (0.2; 0.45). 2. Domains coincides ↔ corresponding subspaces
ER and ES are given under the same field. Thus, the
This query is composed in such a way: union compatibility of the relations R and S corresponds
to the isomorphism of the subspaces ER and ES . Under
σ0.2