-

Identifying Algebraic Properties to Support Optimization of Unary Similarity Queries?

onica Ribeiro Porto Ferreira

Agma J. M. Traina

agma@icmc.usp.br 0

Ires Dias

iresdias@icmc.usp.br 2

Richard Chbeir

richard.chbeir@u-bourgogne.fr 1

Caetano Traina Junior

caetano@icmc.usp.br 0 0 Computer Science Dept., ICMC-Univ. of Sa~o Paulo , Sa~o Carlos-SP , Brazil 1 LE2I Laboratory UMR-CNRS Univ. of Bourgogne , Dijon , France 2 Mathematics Dept., ICMC-Univ. of Sa~o Paulo , Sa~o Carlos-SP , Brazil

Conventional operators for data retrieval are either based on exact matching or on total order relationship among elements. Neither of them is appropriate to manage complex data, such as multimedia data, time series and genetic sequences. In fact, the most meaningful way to compare complex data is by similarity. However, the Relational Algebra, employed in the Relational Database Management Systems (RDBMS), cannot express similarity criteria. In order to address this issue, we provide here an extension of the Relational Algebra, aimed at representing similarity queries in algebraic expressions. This paper identi es fundamental properties to allow the integration of the unary similarity operators into the Relational Algebra to handle similarity-based operators, either alone or combined with the existing (exact matching and/or relational) operators. We also show how to take advantage of such properties to optimize similarity queries, including these properties into a similarity query optimizer developed for a Similarity Retrieval Engine, which uses an existing RDBMS to answer similarity queries.

similarity algebra algebraic properties query optimization unary similarity queries

In 1970, Codd [ 1 ] introduced the relational model, which is the foundation for most of the actual commercial DataBase Management Systems (DBMS). It is based on the mathematical relation theory: the database is represented as a set of relations, where each relation is a table with tuples (or rows) and attributes (or columns). The domain of possible values for each attribute is restricted by the data types.

Initially, the relational model supported only traditional data, i.e., numerical and string data types. Elements of these types can be compared using exact matching (= and 6=) and relational (<, >, and ) operators. Now, with the ? This work has been supported by FAPESP, CNPq and CAPES/Fulbright advent of multimedia and spatial applications, the Relational DBMS (RDBMS) must be able to support new data types, operators and kinds of queries. Thus, similarity emerges as the natural way to compare elements in complex domains, such as images, audios, videos, genomic sequences, and time series, and consequently handling operations based on similarity (or distance) between data becomes a must. To illustrate this, let us take the following examples:

Q1: In a health-care information system: \Given a mammography exam with images of left and right breast from cranio-caudal (RCC) and medio-lateral oblique (RMLO) views of a patient, show the exams whose texture do not di er more than 10 units from those in the exam ".

Q2: In a health-care information system: \Given a head tomography exam of a patient showing a pathology, retrieve the 5 exams most similar not presenting pathology, and that texture do not di er more than 5 units from those in the exam".

Q3: In Geographic Information Systems (GIS): \Find the 15 districts nearest to `Arequipa' that are not farther than 15 miles, and where the population having between 21 and 64 year is greater than 65-year-old population and over ".

To solve similarity-based queries, several extensions of relational algebra have been provided in the literature aimed at including the similarity functionality in RDBMS from various perspectives. The rst algebra to consider this issue has been the Multi-Similarity Algebra (MSA), presented by Adali et al. [ 2 ]. It has been designed to integrate di erent interpretations of similarity values coming from multiple similarity implementation in a common framework. However, it remains at a higher abstraction level and thus does not address the problem of an \operational" algebra usable for modeling, optimizing, and processing queries with similarity-based operations [ 3 ]. Therefore, it is not fully consistent to the relational model.

Other works have associated similarity to uncertainty and provided fuzzy logic-based methods to solve this [ 4, 5 ]. The problem of those approaches is that they assume that complex data manipulation involves evaluation of their similarity, but this does not mean that these data or the similarity evaluation are uncertain or imprecise (as only exact match comparisons are useless in these domains). In fact, it is possible to execute similarity queries resulting in either approximated or exact answers.

Likewise, other approaches have been based on the notion of ranking, i.e., ordering among tuples or elements [ 6, 7 ]. It is true that they are consistent to the relational model and can be applied to similarity queries considering the distance functions as the ranking criterion, but they depend on ranking criteria that are independent from queries, whereas the ranking criterion of a similarity query varies with the query.

None of these previous works has addressed optimizations based on query rewriting for the similarity-based select operators in complex expressions. Traina et al. [ 8 ] proposed an extension of relational algebra considering complex similarity queries with two or more similarity predicates combined with Boolean operators. However, they have only treated queries with the same query element (unique center), which is very restrictive and does not cover all cases occurring in a RDBMS.

In this paper, we present the fundamental properties of a Similarity Algebra aiming at integrating both unary similarity operators with the relational algebra, which allows optimizing similarity queries in relational DBMS. The properties allow handling queries including any number of query centers, and suitable to support both similarity-based and traditional operators in the same query.

The remainder of this paper is structured as follows. Section 2 presents the Similarity Algebra. Section 3 shows experimental results conducted to evaluate the relevance of our approach. Finally, Section 4 concludes this paper and draws our future steps. 2 2.1

Similarity Algebra

Preliminaries In order to execute similarity queries in relational DBMS, it is necessary to provide a measurement of how to quantify similarity between two elements. Usually, it is done by de ning a distance function d, which is the basis to create a metric space M = hS; di, where S denotes the universe of valid elements (domain) and d is a function d : S S ! R+ that expresses a \distance" between elements of S. The distance function d must satisfy the following properties: (i) symmetry: d(s1; s2) = d(s2; s1); (ii) non-negativity: 0 < d(s1; s2) < 1; if s1 6= s2 and d(s1; s1) = 0; and (iii) triangular inequality: d(s1; s2) d(s1; s3) + d(s3; s2); 8s1; s2; s3 2 S .

An attribute is comparable by similarity only if it is associated to a similarity measure d. Although distance functions can theoretically be assigned to any attribute, they are of utter importance when applied to complex attributes. Therefore, without loss of generality, we call complex attributes1 and, correspondingly, its domains, those associated to distance functions, and the others we call simple attributes.

Relations that have complex attributes should follow the same properties and de nitions of traditional relations. In this paper, we employ the following notation to express relations. Let Ah Ah be a simple attribute in a domain Ah that allows comparisons using traditional operators; Sj Sj be an complex attribute in a domain Sj in a metric space that allows comparisons using complex operators; and T be an relation with any number of both simple and complex attributes. That is, let T = fA1; : : : ; Am; S1; : : : Spg be a relational schema, a relation T T is a set of elements represented as tuples T = fA1; : : : ; Am; S1; : : : Spg, which has for each tuple t = ha1; : : : ; am; s1; : : : spi values ah (1 h m) obtained in the domain Ah and values sj (1 j p) obtained in the domain Sj . Thus, let ti(Sj ) (1 i n) be the value of the Sj complex attribute of the ith tuple in the relation, and correspondingly let ti(Ah) be the value of the Ah simple attribute. To alleviate the notation of handling several attributes in a 1 Distinctly from object-oriented models, we employ here the term \complex attribute" to refer to those having a distance function assigned. Examples are images, audios, geographical coordinates, genomaic sequences, etc. relation, in the remainder of the paper, we will use just S and S to refer to a complex attribute Sj and its respective domain Sj , and A and A refer respectively to a simple attribute Ah and its respective domain Ah whenever the focus of the text is over only one attribute. 2.2

Unary similarity queries operations Traditional selections follow the format (A a) T , where is a comparison operator valid in the domain A of the attribute A, and `a' is either a constant taken in the domain of A or the value of another attribute from the same domain of A in the same tuple. Similarity selections follow the same format: c (S c sq) T , where c represents a similarity selection, c is a similarity operator valid in the domain S of the attribute S and `sq' is either a constant taken in the domain of S or the value of another attribute from the same domain of S in the same tuple.

There are two similarity operators commonly employed: range and k-nearest neighbor. As their properties can be di erent from those of the traditional selection, we initially use the symbols ^ and to represent range and kN N selections, and ^ and to represent range and kN N operators, respectively. They are described as follows.

De nition 1. Range query - Rq: Let S be a complex attribute taken in domain S over which the similarity condition is expressed, d be a distance function, be the similarity threshold and sq 2 S be the query element. The query ^(S ^(d; ) sq)T returns every tuple ti 2 T such that d (ti (S) ; sq) . That is: ^(S ^(d; ) sq) T = fti 2 T j d (ti (S) ; sq) g : (1) De nition 2. k-Nearest Neighbor query - kN N : Let S be a complex attribute taken in domain S over which the similarity condition is expressed, d be a distance function, k 2 N be the similarity threshold and sq 2 S be the query element. The query (S (d;k) sq) T returns the tuples from T whose value of the attribute S is one of the k elements in S nearest to the query element sq based on the distance function d. That is: (S (d;k) sq) T = T 0 = fti 2 T j 8 tj 2 [T

T 0] ; T 0 = ti=1;:::;k; d (ti (S) ; sq) d (tj (S) ; sq)g : (2) 2.3

Algebraic properties The query optimizer of RDBMSs employs algebraic equivalences to rewrite queries into equivalent expressions which are expected to be executed faster. Selections are important operations because they reduce the size of relations. In the subsections following, we identify algebraic properties useful to rewrite expressions of both range operator ^ and k-nearest neighbor operator . Due to space restriction, formal proofs of these properties are omitted here (they can be found in Ferreira et al. [ 9 ]).

Range Selection - ^.

Properties 1 and 2 apply conjunctive and disjunctive conditions involving only ^ operations, respectively.

Property 1. Conjunctions of ^ operators can be rewritten into a cascade of individual ^ operations or a sequence of intersection operations, i.e., (3) (4) (5) (6) ^(S1 ^(d1; 1) sq1) ^ (S2 ^(d2; 2) sq2) T = ^(S1 ^(d1; 1) sq1) ^(S2 ^(d2; 2) sq2) T = ^(S1 ^(d1; 1) sq1) T \ ^(S2 ^(d2; 2) sq2) T : A special case exists when sq1 = sq2, as follows.

Property 1.1. Special case where sq1 = sq2 = sq.

^(S ^(d; 1) sq) T \ ^(S ^(d; 2) sq) T = ^(S ^(d; 1) sq) ^ (S ^(d; 2) sq) T = ^(S ^(d;min( 1; 2)) sq) T : Property 2. Disjunctions of ^ operators can be rewritten into a sequence of union operations as follows.

^(S1 ^(d1; 1) sq1) _ (S2 ^(d2; 2) sq2) T = ^(S1 ^(d1; 1) sq1) T [ ^(S2 ^(d2; 2) sq2) T : A special case exists when sq1 = sq2, as follows.

Property 2.1. Special case where sq1 = sq2 = sq.

^(S ^(d; 1) sq) T [ ^(S ^(d; 2) sq) T = ^(S ^(d; 1) sq) _ (S ^(d; 2) sq) T = ^(S ^(d;max( 1; 2)) sq) T :

Properties 3 and 4 explore the commutativity of ^ operation with its composition and traditional operation.

Property 3. The Rq selection operation commutes under its composition, i.e., ^(S1 ^(d1; 1) sq1) ^(S2 ^(d2; 2) sq2) T = ^(S2 ^(d2; 2) sq2) ^(S1 ^(d1; 1) sq1) T : (7) Property 4. The Rq selection operation and the traditional selection operation commute under their composition, i.e., ^(S ^(d; ) sq) (A a) T = (A a) ^(S ^(d; ) sq) T : (8)

As ^ operation is commutative with operation, Properties 1 and 2 can also be employed for these operations. Therefore, we can use Properties 1 and 2 with either the ^ operation only or ^ and operations.

The next set of properties involving ^ allows pushing range selection through the traditional binary operators: union ([), intersection (\), di erence ( ), cross product ( ) and join (on). Property 5 shows that ^ is distributive over the set binary operators [, \ and . The relations T1 and T2 must be union compatible.

Property 5. The operator ^ is distributive over the set binary operators [, and \ as follows.

Property 5.1. For union, the following expression holds: ^(S ^(d; ) sq) (T1 [ T2) = ^(S ^(d; ) sq) T1 [ ^(S ^(d; ) sq) T2

Property 6. When the complex attribute mentioned in the range predicate belongs to only one of the joined relations, the operation ^ is distributive over on or . Let T1 be the relation that has the complex attribute S. Thus: ^(S ^(d; ) sq) (T1

T2) = ^(S ^(d; ) sq) T1

T2 ; (12) for any =on or

Properties 1 to 6 show that range selection shares the same algebraic equivalences as the traditional selection. Moreover, Property 4 shows the commutativity property between similarity-based selections and traditional ones. This is an important result, as it allows the RDBMS query optimizer to treat range selection as traditional selection. Therefore, we can use the symbol instead of ^ to represent range selections, only using ^ to represent the range operator, without lost of generality. 2.3.2

k-Nearest Neighbor Selection - .

Distinctly from range and traditional selections, kN N selections have only three properties to rewrite algebraic expressions. Property 7 regards the conjunctive selection conditions, as follows: Property 7. Conjunctions of operators can be rewritten into a sequence of intersection operations but they cannot be rewritten as a cascade of individual operations, i. e.,

(S1 (d1;k1) sq1) ^ (S2 (d2;k2) sq2) T = Property 8. Disjunctions of operators can be rewritten into a sequence of union operations as follows (this property requires that the relation T is a set because, in this way, duplications will be correctly eliminated):

(S1 (d1;k1) sq1) _ (S2 (d2;k2) sq2) T = Property 9. For complex conditions, each selection should be executed separately and the intersection (for conjunctions) or the union (for disjunctions) of results must be returned, since the operator is not commutative neither with other selection operators nor with itself. That is, for conjunctive conditions: and for disjunctive conditions: The same property can be employed combining either \ \ / [ ^" or \ \ / [

When query elements sq1 and sq2 are the same, there are special cases where the kN N selection operation becomes commutative with range and self selection operations. First, for a composition of kN N selection operation, this expression can be rewritten in the conjunction of kN N condition; therefore, only the kN N selection with the smallest k condition needs to be executed. Second, for a composition of kN N and range selection operation, this expression can also be rewritten in the conjunction of kN N and range condition; then, the intersection of the results from both basic operators should be executed. Finally, for the disjunction of range and kN N condition, the union of the results from both basic operators needs to be executed.

For the set of properties involving traditional binary operators, no property involving exists, because is not distributive over these operators.

As kN N selection operations accepts only three properties (7, 8 and 9) and ve special cases (over the same query element), they do not allow optimization algorithms equivalent to traditional and range selections. Thus, speci c optimization algorithms should be implemented in the query optimizer to optimize this kind of selection. The kAndRange and the kOrRange algorithms [ 8 ] are examples speci cally created to handle the commutativity property of Range and kN N query over the same query element. 3

Experimental Results

In this section, we present experiments comparing the evaluation of the similarity queries both optimized and not optimized using the properties of the Similarity Algebra presented in Section 2. To obtain the measurements, this algebra was incorporated into the SIREN query optimizer. SIREN is a similarity retrieval engine that allows expressing similarity queries in SQL and executing them [ 10 ]. We call the new version of SIREN able to perform optimization on queries involving similarity as SIREN+O. The experiments analyze the performance of SIREN and SIREN+O to execute similarity queries. As we will see here, the rst results show that the proposed algebra leads SIREN+O to perform faster than SIREN.

The test framework was implemented in C++, and the experiments ran on an AMD Athlon XP 3000+ processor with 1024MB of main memory, under the Windows XP operational system. The RDBMS employed was Oracle 9i. Every test was performed using both sequential scan and a Slim-tree index. Due to space limitations, we only highlight here the performance regarding total time (in milliseconds) as it summarizes the whole computational cost. Four data sets were used: { RCCMammography : a set of 658 medical images obtained from mammograms of right breast with cranio-caudal view (CC). They were compared using the texture distance function [ 11 ]; { RMLOMammography : a set of 695 medical images obtained from mammography exams of right breast with medio-lateral oblique view (MLO). They were also compared using the same texture distance function [ 11 ]. { MedImage: a set of 5,180 medical images obtained from three human body parts (abdomen, cranium and thorax) by computerized tomographies (CT).

They were compared using the metric histogram distance function [ 12 ]. { PeruDistricts : a set of 1,829 Peruvian districts. They were compared using the Euclidean distance function.

The rst three sets were obtained from the Clinical Hospital at Ribeir~ao Preto of the University of S~ao Paulo and the last set was obtained from Peru Instituto Nacional de Estad stica e Informatica (INEI).

The experiments evaluated the execution time of Queries Q1 over RCCMammography and RMLOMammography data sets, Q2 over MedImage data set and Q3 over PeruDistricts data set stated in Section 1. The queries were performed 30 times and the values shown are the average of performing the same query, but varying query elements sq.

Table 1 summarizes the results executing SIREN and SIREN+O using both sequential scan and the Slim tree index, a well-known index structure for metric data [ 13 ].

Query Q1 involves a traditional join and a range selection and it can be algebraically expressed as ^(S ^(texture;0:1) sq) (RCC on RM LO). Property 6 was employed to optimize the query. Its gain was about 30.39% using sequential scan and 30.16% using a Slim-tree index.

Both Queries Q2 and Q3 involve traditional selection, range selection and kNN selection. Therefore, Properties 4 and 9 as well as their special cases should be used to optimize these queries. Q2 can be algebraically expressed as: (pathology=`N0) ^(S ^(texture;0:05) sq) (S (texture;5) sq) (M edImage) . The gain obtained was about 64.68% using a Slim-tree index and 62.92% using sequential scan.

Q3 can be expressed as: (adultpop > oldpop) (^(S ^(Euclidean;15) sq) ((S (Euclidean;15) sq)(P eruDistricts))), and the gain obtained was about 63.82% using sequential scan and 62.61% using a Slim-tree index.

Conclusion

Nowadays, storing and retrieving multimedia data is a requirement that must be provided by RDBMS. In order to allow query compilers to optimize a similarity query execution, we presented here the properties holding for the unary similarity operators, that is, for Range and k-Nearest Neighbor Selection operations. We also presented the experiments conducted to show the performance obtained with SIREN query optimizer using Similarity Algebra (SIREN+O) reducing total time in up to 64% over the performance of queries in SIREN without the algebra regardless of the usage of an index. As a follow-up of this paper, we are currently working on the properties to extend the Similarity Algebra to support similarity join. This will surely open the possibility to support the storage and retrieval of complex data in RDBMS. We are also working on developing better statistics that can be measured over data, to create heuristics able to control the DBMS query optimizer for similarity queries.

1. Codd , E.F. : A relational model of data for large shared data banks . CACM 13 ( 6 ) ( 1970 ) 377 { 387

2. Adali , S. , Bonatti , P. , Sapino , M. , Subrahmanian , V.: A multi-similarity algebra . In: SIGMOD , ACM Press ( 1998 ) 402 { 413

3. Atnafu , S. , Chbeir , R. , Coquil , D. , Brunie , L. : Integrating similarity-based queries in image DBMSs . In: SAC, ACM Press ( 2004 ) 735 { 739

4. Belohlavek , R. , Opichal , S. , Vychodil., V. : Relational algebra for ranked tables with similarity: properties and implementation . In: IDA . Volume 4723 of LNCS., Springer Verlag ( 2007 ) 140 { 151 .

5. Ciaccia , P. , Montesi , D. , Penzo , W. , Trombetta , A. : Imprecision and user preferences in multimedia queries: a generic algebraic approach . In: FoIKS. Volume 1762 of LNCS., Springer Verlag ( 2000 ) 50 { 71

6. Adali , S. , Bu , C. , Sapino , M.L. : Ranked relations: query languages and query processing methods for multimedia . MTAJ 24 ( 3 ) ( 2004 ) 197 { 214

7. Li , C. , Chang , K.C.C. , Ilyas , I.F. , Song , S.: RankSQL: query algebra and optimization for relational top-k queries . In: SIGMOD , ACM Press ( 2005 ) 131 { 142

8. Traina-Jr. , C. , Traina , A.J.M. , Vieira , M.R. , Arantes , A.S. , Faloutsos , C. : E cient processing of complex similarity queries in RDBMS through query rewriting . In: CIKM , ACM Press ( 2006 ) 4 { 13

9. Ferreira , M.R.P. , Traina-Jr. , C. , Traina , A.J.M. , Dias , I. : Extending SQL to support unary similarity queries . Technical Report 325 , ICMC/USP ( 2008 )

10. Barioni , M.C.N. , Razente , H.L. , Traina , A.J.M. , Traina-Jr. , C. : SIREN: A similarity retrieval engine for complex data . In: VLDB , ACM Press ( 2006 ) 1155 { 1158

11. Felipe , J.C. , Traina , A.J.M. , Traina-Jr. , C. : Retrieval by content of medical images using texture for tissue identi cation . In: CBMS, IEEE Computer Society ( 2003 ) 175 { 180

12. Traina , A.J.M. , Traina-Jr. , C. , Bueno , J.M. , Chino , F.J.T. , Azevedo-Marques , P.: E cient content-based image retrieval through metric histograms . WWW 6 ( 2 ) ( 2003 ) 157 { 185

13. Traina-Jr. , C. , Traina , A.J.M. , Seeger , B. , Faloutsos , C. : Slim-trees: High performance metric trees minimizing overlap between nodes . In: EDBT . Volume 1777 of LNCS., Springer Verlag ( 2000 ) 51 { 65