Application of geometrical approaches to Information Retrieval

                                                © Vasyl Tereshchenko
                                      National Taras Shevchenko University of Kyiv
                                                 vtereshch@gmail.com


Proceedings of the Spring Researcher's Colloquium on Database and Information Systems, Moscow, Russia, 2011

                        Abstract

    In this paper we propose transforming relational database problems into computational geometry problems in order to develop more efficient algorithms for discovering useful information in databases. We consider in detail the relational algebra operations, which form the foundation of relational languages, and give a geometric interpretation of each of them.

1   Introduction

Over the past ten years relational database management systems (DBMS) have become widely used in areas such as automated design systems, CAE systems, geographic information systems, office information systems, and so on. However, relational DBMS have limited capacity for object modeling, which makes them unsuitable for complicated specialized applications. Also, recent progress in communication and network technologies makes it easy to accumulate large collections of unstructured or semi-structured text data [2, 5, 6]. In this context, the search for algorithms that discover useful information in large unstructured databases more efficiently than existing information retrieval methods is of great interest [7, 4]. The work [3] is worth mentioning, since it is devoted to the problem of discovering data in large semi-structured text collections.
    The paper proposes an algorithm based on a computational geometry method, the regional search algorithm, which speeds up the search substantially. Given the high efficiency of geometric algorithms, the question naturally arises whether relational database problems can be transformed into computational geometry problems.

2   The geometrical approach to information retrieval

Definition. Problem A is transformed into Problem B if:
    1. The input data for Problem A are transformed into corresponding input data for Problem B.
    2. Problem B is solved.
    3. The result of solving Problem B is transformed into a correct result for Problem A.

    Theorem. Search problems in a relational database can be transformed into computational geometry search problems in O(N) time.
    Proof. To prove the theorem, it is necessary to show that the three conditions mentioned above are fulfilled. To this end, let us formalize the input data sets of the relational database search problem in the terms and concepts of the geometric search problem and, conversely, interpret the results of the geometric search problem in terms of databases.
    Let each tuple of relation R correspond to some point (in other words, an n-tuple of coordinates) of the geometric space E_R. Let each attribute of relation R correspond to a coordinate axis in the following way: the range of values on the axis is defined by the domain over which the attribute is specified, so that the value of each tuple element corresponds to a coordinate value of the corresponding point. This correspondence is one-to-one. Evidently, the input data of a relational database problem are transformed into the corresponding input data of a computational geometry problem in O(N) time, and the solution obtained for the computational geometry problem is transformed into a correct solution of the relational database problem also in O(N) time. Let us now consider the main operations of relational algebra, which is the basis of relational languages. Using examples of relational algebra search queries, we show their geometric realization (condition 2), and hence the transformation between the two classes of problems mentioned above.

    Selection (S = σ_predicate (R)).
    Selection is a unary operation. The result of the selection is a new relation S containing only those tuples of the input relation R that satisfy the specified condition (predicate). Let the relation R of the relational database correspond to the subspace E_R of the space E^n. It follows from the correspondence above that the rank d of the relation R defines the dimension of the corresponding subspace E_R, Figure 1.
    The predicate of the selection operation defines some domain (a plane of rank k < d). We are interested in all those points of the subspace E_R that lie within the defined domain. Thereby, the predicate determines the search region in the subspace E_R, and under the geometric interpretation the result of the selection operation is the answer to a query about the set of points of the subspace E_R that lie within the queried region. Thus, the regional search corresponds to the selection operation. Several solutions of the regional search problem have been proposed. Among them, the algorithm based on the orthogonal range tree
method described in Preparata and Shamos [1] is worth mentioning. This algorithm uses a data structure called the orthogonal range tree, which requires O(log^(d−1) N) time per query (in addition to the time needed to report the points found), O(N log^(d−1) N) space, and O(N log^(d−1) N) preprocessing time, where N is the number of points and d is the dimension of the space.

                         Figure 1:

    As an example, let us consider the following relations:

   PRODUCER (PR, Surname, City, Status)
   CUSTOMER (CS, Surname, City)
   DETAIL (DT, Name, Weight)
   CPD ({CS, PR, DT}, Quantity, Price)

  Query 1. Find the list of all details with weight in the range (0.2; 0.45).

   This query is written as follows:

   σ_{0.2 < Weight < 0.45} (DETAIL)

   The geometric view of this query is shown in Figure 2.

                         Figure 2:

    Projection (S = Π_{atr.1,...,atr.n} (R)).
    Projection is also a unary operation. It determines a new relation S that contains a vertical subset (i.e., a subset of the attributes) of the input relation R, obtained by extracting the values of the specified attributes and removing all duplicate tuples from the result. Evidently, the projection operation on the relation R corresponds to the projection of the point set of the space E_R onto some coordinate plane π, defined by the coordinate axes l_1, ..., l_n that correspond to the attributes atr.1, ..., atr.n in the projection operation Π_{atr.1,...,atr.n} (R), Figure 3.

                         Figure 3:

    Assume that we are given a point P = (x_P^(1), x_P^(2), ..., x_P^(n)) of a subspace E_R with dim(E_R) = d, where L = {l_1, l_2, ..., l_d} is the set of coordinate axes of the subspace E_R. Then, for a single attribute,

\[
\Pi_{atr.i}(R) \equiv pr_{l_i}(E_R) = \{P' \mid P' = pr_{l_i} P,\ \forall P \in E_R\} = \{P' \mid P' = (0, \dots, 0, x_P^{(i)}, 0, \dots, 0),\ \forall P \in E_R\},
\]

and, for several attributes, with π the coordinate plane defined by the axes l_i, ..., l_j,

\[
\Pi_{atr.i,\dots,atr.j}(R) \equiv pr_{\pi}(E_R) = \{P' \mid P' = pr_{\pi} P,\ \forall P \in E_R\} = \{P' \mid P' = (0, \dots, 0, x_P^{(i)}, \dots, x_P^{(j)}, 0, \dots, 0),\ \forall P \in E_R\}.
\]

    Union (R ∪ S).
    The union of two relations R and S, containing I and J tuples respectively, is a new relation that contains the tuples of both; it has at most I + J tuples, since duplicate tuples are removed. The relations R and S must be union compatible (i.e., they must have the same number of attributes with coinciding domains). Let the relations R and S of the relational database correspond to the subspaces E_R and E_S of the space E^n, respectively. In the geometric space, union compatibility corresponds to the following conditions:
    1. Relations R and S have the same number of attributes ↔ the corresponding subspaces E_R and E_S have the same dimension:
    R ↔ E_R, S ↔ E_S => dim(E_R) = dim(E_S)
    2. The domains coincide ↔ the corresponding subspaces E_R and E_S are given over the same field.
    Thus, the union compatibility of the relations R and S corresponds to the isomorphism of the subspaces E_R and E_S. Under the geometric interpretation, the union of two relations R and S corresponds to the union of the point sets of the subspaces E_R and E_S.

    Intersection (R ∩ S).
    A necessary condition for the intersection of two relations R and S is their union compatibility. Consequently, the subspaces E_R and E_S that correspond to the relations R and S must be isomorphic. The intersection of two relations R and S contains all the tuples of R that also belong to S. Thus, the intersection of two relations R and S corresponds to the intersection of the point sets of the subspaces E_R and E_S.

    Difference (R − S).
    The difference of two relations R and S contains only those tuples of R that do not belong to S.
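The operations interpreted so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DETAIL tuples and the helper names (`build_index`, `select_open_range`, `project`) are hypothetical, and a sorted array with binary search stands in for the orthogonal range tree in the one-dimensional case of Query 1. Unlike the projection formula above, `project` drops the unused coordinates instead of setting them to zero.

```python
from bisect import bisect_left, bisect_right

# Hypothetical DETAIL(DT, Name, Weight) tuples; each tuple is treated as a
# point whose coordinates are its attribute values.
DETAIL = [
    ("D1", "bolt", 0.10),
    ("D2", "nut", 0.25),
    ("D3", "gear", 0.40),
    ("D4", "shaft", 0.45),
    ("D5", "plate", 0.90),
]


def build_index(relation, axis):
    """Preprocessing step: sort the points along one coordinate axis."""
    ordered = sorted(relation, key=lambda t: t[axis])
    keys = [t[axis] for t in ordered]
    return keys, ordered


def select_open_range(index, low, high):
    """Report the points with low < coordinate < high (1-D range search)."""
    keys, ordered = index
    start = bisect_right(keys, low)  # first key strictly greater than low
    stop = bisect_left(keys, high)   # first key not less than high
    return ordered[start:stop]


def project(relation, axes):
    """Keep only the chosen coordinates; the set removes duplicate points."""
    return {tuple(t[a] for a in axes) for t in relation}


# Query 1: selection 0.2 < Weight < 0.45 on DETAIL as a range search.
weight_index = build_index(DETAIL, axis=2)
query1 = select_open_range(weight_index, 0.2, 0.45)

# Union, intersection and difference of two union-compatible relations
# are the same operations on their point sets.
R = {(1, 2), (3, 4)}
S = {(3, 4), (5, 6)}
union, common, only_r = R | S, R & S, R - S
```

After the sort, each query costs two binary searches plus the time to copy out the reported points, which mirrors the split between preprocessing time and query time discussed for the range tree.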
   For the difference, as for union and intersection, the relations R and S must be union compatible. As mentioned above, the union compatibility of the relations R and S corresponds to the isomorphism of the subspaces E_R and E_S. Under the geometric interpretation, the difference of two relations R and S corresponds to the set difference of the point sets of the subspaces E_R and E_S.

    Cartesian product (R × S).
    The Cartesian product of two relations R and S corresponds to the sum of the subspaces E_R and E_S. It should be noted that this sum is not a direct one: if attribute names of the relations R and S coincide, the coordinate axes that correspond to these attributes are collinear.

   Division (R ÷ S).
   Division is a binary operation. The result consists of the restrictions of tuples in R to the attribute names unique to R, i.e., those in the header of R but not in the header of S, for which all their combinations with tuples in S are present in R. This operation may be expressed through the other ones, where c denotes the attributes of R that are not in S:

    T1 = Π_c (R)
    T2 = Π_c ((S × T1) − R)
    T = T1 − T2

   The projection, Cartesian product, and difference operations have already been interpreted in terms of geometry. Thus, the division operation can also be interpreted geometrically.

    Joins.
    The join operation derives from the Cartesian product, since it is equal to a selection applied to the Cartesian product of two relations R and S, keeping those tuples that meet the condition specified in the selection predicate. Thus, the join of two relations R and S corresponds to a regional search in the subspace that is the sum of the subspaces E_R and E_S. The natural join is a binary operator written as (R ⋈ S), where R and S are relations. The result of the natural join is the set of all combinations of tuples in R and S that are equal on their common attribute names. The right outer join of relations R and S is written as R ⟖ S. The result of the right outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no matching tuples in R.
    The outer join, or full outer join, in effect combines the results of the left and right outer joins.
    The full outer join is written as R ⟗ S, where R and S are relations. The result of the full outer join is the set of all combinations of tuples in R and S that are equal on their common attribute names, in addition to tuples in S that have no matching tuples in R and tuples in R that have no matching tuples in S on their common attribute names.

3    Conclusion
In this paper we considered problems of information retrieval from relational databases and showed how they are transformed into geometric search problems of computational geometry, which allows us to use more efficient geometric algorithms for implementing search functions in very large databases. To this end, we formalized the input sets of the relational database search problem in the terms and concepts of the geometric search problem and, conversely, interpreted the results of the geometric search problem in terms of databases. Also, using examples of relational algebra search queries, we showed their geometric realization, and hence the transformation between the two classes of problems mentioned above.
   The aim of our future research is to develop a general theory of the transformation of database problems into computational geometry problems and to create new data structures for storing and searching information using geometric methods.

References
[1] F. P. Preparata and M. I. Shamos. Computational geometry. Springer-Verlag, 1985.

[2] H. Arimura, H. Ishizaka, and T. Shinohara. Learning unions of tree patterns using queries. Theoretical Computer Science, pages 47–62, 1997.

[3] Hiroki Arimura, Atsushi Wataki, Ryoichi Fujino, and Setsuo Arikawa. A fast algorithm for discovering optimal string patterns in large text databases. 2004.

[4] J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: An attribute oriented approach. In Proc. the 18th VLDB Conference, pages 547–559, 1992.

[5] D. D. Lewis. Challenges in machine learning for text classification. Proc. 9th Computational Learning Theory, pages 1–8, 1996.

[6] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. Proc. the ACM SIGMOD Conference on Management of Data, pages 207–216, 1993.

[7] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules. Proc. the ACM SIGMOD Conference on Management of Data, pages 13–23, 1996.
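As a supplementary illustration of Section 2, the expression of division through projection, Cartesian product, and difference, and the reading of a join as a selection applied to a Cartesian product, can be checked on a small example. This is a minimal sketch under stated assumptions: the relations R(a, b) and S(b) and the function names `divide` and `theta_join` are hypothetical, and relations are modelled as Python sets of tuples (point sets).

```python
from itertools import product

# Hypothetical relations: R(a, b) as a set of 2-D points, S(b) as 1-D points.
R = {("x", 1), ("x", 2), ("y", 1), ("z", 1), ("z", 2)}
S = {(1,), (2,)}


def divide(r, s):
    """R ÷ S for R(a, b) and S(b), using projection, product and difference."""
    t1 = {(a,) for a, _ in r}                                # T1 = Π_a(R)
    candidates = {(a, b) for (a,), (b,) in product(t1, s)}   # T1 × S
    t2 = {(a,) for a, _ in candidates - r}                   # T2 = Π_a((T1 × S) − R)
    return t1 - t2                                           # T = T1 − T2


def theta_join(r, s, predicate):
    """Join as a selection (predicate) applied to the Cartesian product."""
    return {p + q for p, q in product(r, s) if predicate(p, q)}


quotient = divide(R, S)

# Join on the common attribute b: keep product points whose b-coordinates agree.
joined = theta_join(R, S, lambda p, q: p[1] == q[0])
```

Here `quotient` keeps the values of `a` that occur in R together with every tuple of S. The value "y" is removed because the point ("y", 2) is missing from R.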