Introduction

QQuueerryy Ooppttiimmiizzaattiioonn bbyy GGeenneettiicc AAllggoorriitthhmmss

Suhail S. J. Owais

Pavel Kr¨omer

Va´clav Sn´aˇsel Suhail S. J. Owais

Pavel Kr¨omer

Va´clav Sn´aˇsel

1 0 , v7a0c8la3v3.Osnstarsaevla 1 Department of Computer Science , VS

125 137

This study investigated the use of Genetic algorithms in Information retrieval in the area of optimizing a Boolean query. A query with Boolean logical operators was used in information retrieval. For Genetic algorithms, encoding chromosomes was done from Boolean query; where it was represented in the form of tree prefix with indexing for all terms and all Boolean logical operators. Information retrieval effectiveness measures precision and recall used as a fitness function in our work. Other Genetic algorithms operators were used as single point crossover on Boolean logical operators, and mutation operator was used to exchange one of the Boolean operators and, or, and xor with any other one. The goal is to retrieve most relevant documents with less number of nonrelevant documents with respect to user query in Information retrieval system using genetic programming.

Introduction

Information retrieval system is used to retrieve documents that depend on or relevant to the user input query. The growth in the number of documents in the internet made it necessary to use the best knowledge or methods in retrieving the most relevant documents to the user query. For this, Information Retrieval has become a fact of life for most internet users.

Information retrieval systems deal with data bases which is composed of information items documents that may consist of textual, pictorial or vocal information. Such systems process user queries trying to allow the user to access the relevant information in an appropriate time interval. Where the art of searching will be in the databases or hypertext networked databases such as internet or intranet for text, sound, images or data, see [ 1 ]. Thus an information system has at its heart a collection of data about reality [ 4 ].

Most of the information retrieval systems are based on the Boolean queries where the query terms are joined by the logical operators AND and OR. The similarity between a query and documents is measure by different retrieval strategies that are based on the more frequent terms found in both the document and the query. The more relevant document is deemed to be the query request. The most frequently used measures of retrieval effectiveness are precision the percentage of the retrieved documents that are relevant and recall the percentage of the relevant documents that are retrieved.

Information retrieval is concerned with collection and organization of texts, responding to the requests of internet users for the information seeking text, retrieving the most relevant documents from a collection of documents; and with retrieving some of non-relevant as possible. Information retrieval is involved in: – Representation, – Storage, – Searching, – Finding documents or texts or images those are relevant to some requirements for information desired by a user.

Genetic Algorithms (GA) represents one of the artificial intelligence algorithms that are attractive paradigm to improve performance in information retrieval systems. Retrieving necessary documents from a collection of such documents will be achieved using a query to select the most relevant documents. For implementation of genetic algorithms will be on queries. A form of genetic algorithm started to be applied in information retrieval systems in order to optimize the query by genetic algorithms 2

Genetic Algorithms (GA)

Genetic algorithms (GA) first described by John Holland in 1960s and further developed by Holland and his students and colleagues at the University of Michigan in the 1960s and 1970s [ 10 ]. GA used Darwinian Evolution to extract nature optimization strategies that uses them successfully and transform them for application in mathematical optimization theory to find the global optimum in defined phase space [ 5, 3 ].

GA is used to find approximate solutions to difficult problems through a set of methods or techniques inheritance or crossover, mutation, natural selection, and fitness function. Such methods are principles of evolutionary biology applied to computer science. GAs are useful for: – Solving difficult problems.

– Modelling the natural system that inspired design.

Applying genetic algorithms over a population of individuals or chromosomes shows that several operators are utilized. They are as follows (see Figure 1): – Fitness operator, metrics to measure scheduler performance for each chromosome in the problem, and calculate the values for each chromosome. – Selection operator, select two chromosomes with the highest quality values from the population, that couple to produce two offspring. – Crossover operator, exchanges two subparts of the selected chromosomes, the position of the subparts selected randomly.

– Mutation operator, randomly changes the allele value in some location. • Solving difficult problems.

• Modeling the natural system that inspired design.

Employing of genetic algorithms over a population of individuals or chromosomes: • Fitness Operator, metrics to measure scheduler performance for each chromosome in the problem, and calculate the values for each chromosome. • Selection Operator, select two chromosomes with the highest quality values from the population, that couple to produce two offspring. • Crossover Operator, exchanges two subparts of the selected chromosomes, the position of the subparts selQeuceterydOrapntidmoimzaltyio.n by Genetic Algorithms 127 • Mutation Operator, randomly changes the allele value in some location.

R E G E N E R A T I O N

Start Generate Initial Population

Encoding generated population Evaluate Fitness Function Ft(x)

Meets Optimization

Criteria? Selection (select parents)

Crossover (married parents) Mutation (mutate individual)

Yes

Best individuals

Stop Fig – 1: Flowchart for Genetic Algorithm This section will present the implementation of information retrieval using genetic algorithms (for SQL we can see [ 11, 8, 6, 2 ]). The GA is generally used to solve optimization problems [ 5 ]. GA starts on an initial population with fixed size of chromosomes ”P-chromosomes”. Each individual are coded according to chromosome length, where genes are allocated in each position in a chromosome with different data types, and each gene values called allele. In information retrieval, query for relevant documents are representing for each individual or chromosome, and each document described by set of terms. The description di for document Di, where i = 1 . . . l, the set of terms for Di are Tj , where j = 1 . . . n, thus di = (w1i , w2i , . . . , wni ). The value for each term will be 1 if this term exists in the document or 0 if not (Note: about another weights for terms was mention in paper [ 9 ]), this indicate that the indexing function that is maps a given index term t and a given document d is

F : D × T → [ 0, 1 ].

Defining a query will be combination from set of terms and set of Boolean operators and, or, xor, not. The query set Q defined as set of queries for documents, define the query processing mechanism by which documents can be evaluated in terms of their relevance to a given query [ 7 ].

Ih this work, we develop genetic program for implementing GA with variable length of chromosomes and mixture symbolic of information, like real values and Boolean queries values. 4

Chromosome Encoding

In order to really understand the principle of chromosome encoding, consider a Boolean query with a series of terms w1, w2, . . . , wn, and set of Boolean operators from and, or, Xor , and N ot. Examples of queries in infix form are: and and (w2 or w6) and(w9 and w3) (w3 and w4) xor((w5 and w6) or w8) and(or w2w6)(and w9w3) xor(and w3w4)(or(and w5w6)w8)

In the previous queries ordinary Boolean operators are used, and they are in infix form, they will be encoded to be chromosomes for genetic programming but in prefix form such as

and also they will be represented in a tree form as the chromosomes shown in Figure 2.

and also represented a queries in tree from in a chromosomes such as

and

AND OR

AND W2

AND W3

AND

W8 XOR W5

OR W6

Evaluation of the information retrieval system, measured by effectiveness, two

statistics are used precision and recall, maximize precision subject to a constraint on the minimal recall accepted. Recall and precision are often perceived as being inversely related, i.e., complementary and competitive. [D. H. KRAFT, F. E.PETRY,

B. P. BUCKLES, AND T. SADASIVAN. ﺦﻳرﺎﺘﻟا GENETIC ALGORITHMS FOR QUERY OPTIMIZATION IN INFORMATION RETRIEVAL: RELEVANCE

and also represented a queries in tree from in a chromosomes such as and Evaluation of the inform5a.tEivoanluarteiotnriaenvdaFlitnseyssstFeumnctiiosndone by measuring its effectiveness. This is best measured by two statistics precision and recall, maximizing precision is subject toEvaaluactioonnosftthreaiinnfotrmoatinon trehtrieevamlsyinsteimm, maelasrureecdabylleffaeccticveenpestse,tdwo. Recall and precision are oftenstatpisteicrscaereiuvseeddpraecsisiboneainndgrecianllv,mearxsimeilzye prreecliasiotnesdub,jeic.tet.o,accoonmstrapinlteomnentary and competitive [ 7 ]. Ftoher mainnimyal recall acsceeptted. Rdecoalcluanmd eprnectissiocnaalrleeodfteCnopelrlceeicvetdioans boeifngDocuments, given of inversely related, i.e., complementary and competitive. [D. H. KRAFT, F. E.PETRY, there is a subset Bo. fP. BUCKLES, AND T. SADASIVAN. ﺦﻳرﺎﺘﻟا GENETIC ALGORITHMScFaOlRled Relevant documents that are relevant to such query documents, and aQsUuERbYseOtPoTIfMdIZoAcTuIOmN eInNtsINtFhOaRtMAaTrIeONretRrEiTeRvIEeVdALc:alRlEeLdEVRAeNtCrEieved Documents. DemonstraFEtEioDnBAfCoKr. نpﺎﻜrﻤﻟeا].ciDseimoonnstarantiodn froercpraeclilsiaonraendshrecoawllanre isnhowFniign ufigr-e2 w3ithwith respect to all documents rcesopelcltetcotalel ddo.cuWmenhtse.rWeheprerperecciissiioonnandarencadll arreecdeafilnledaars:e defined as: Recall = RelevantRetrie=ved

Relevant =

P recision = RelevantRetrieved

Retrieved

Collection of Documents Relevant Documents

Recall

Retrieved Documents

Precision

Relevant Retrieved Documents

Fig-2: Retrieved and relevant documents

Fitness function for our work will be considered as functions for precision and recall. One function will trade for the other function because both precision and recall are inversely related. And they will deal with the ranked documents. The shortcoming of using recall and precision as fitness functions is that if no relevant documents are retrieved by a chromosome, then its fitness is zero. This will lead to loss of all genes for this chromosome.

Computing recall and precision values for each query in our genetic programming as illustrated in equations 1 and 2, the first fitness function RecallF itnessE1 measures recall (equ-1), and the second fitness function P recisionF itnessE2 measures Precision (equ-2). Where rd is the relevance of document d (1 for relevant and 0 for nonrelevant), fd is the retrieved document d (1 for retrieval and 0 for nonretrieval), and α and β are arbitrary weights. Where α and β are added specially to precision fitness function [ 7 ].

Where α and β are added specially to precision fitness function. [D. H. KRAFT, F.

E.PETRY, B. P. BUCKLES, AND T. SADASIVAN. ﺦﻳرﺎﺘﻟا GENETIC

ALGORITHMS FOR QUERY OPTIMIZATION IN INFORMATION RETRIEVAL:

RELEVANCE FEEDBACK. نﺎﻜﻤﻟا].

RecallFitnessE1 = ∑ [rd ∗ fd ] ∑ [rd ]

d d 130

Suhail S. J. Owais, Pavel Kr¨omer, and V´aclav Sn´aˇsel PrecisionFitnessE2 = α ∑d [rd ∗ fd ] ∑d [rd ]+ β ∑d [rd ∗ fd ] ∑d [ fd ] 6. Selection operators

ReallF itnessE1 =

Pd[rd × fd]

Pd[rd] P recisionF itnessE2 =

P P To process genetic algorithms operadto[rrsd]will be done idn[feda]ch generation on the α Pd[rd × fd]

β Pd[rd × fd] + best two chromosomes. 5.1 Selection operators chromosomes that depend on the highest fitness values for precision measure or on Processing genetic algorithm operators will be done in each generation over the recall measure will be selected. These two chromosomes will be called parent1 and best two chromosomes. From the population of chromosomes, the best two chroparent2. These two parents will be used to produce two new offsprings. mosomes depending on the highest fitness values for precision or recall measures will be selected. These two chromosomes will be called parent1 and parent2. These two p7a.reCntrsoswsilol vbeeruoserdRteocpormodbuinceattiwoonnoepweorafftsoprrsings.

From the population of chromosomes, the best two see fig-3.

spring1 and offspring2. 5.2

Crossover or Recombination operators In generating two new offsprings from the existing population, offsprings must have some inheritance from the best chromosomes in the population; crossover will have some inheritance from the two parents. Crossover will do that by exchang

Generating two new offsprings form the existing population. Offsprings must

idnog sthuabttrbeye cfrhoamngipnagresnutb1trweeithfrosmubtpraereenfrt1omwiptharseunbt2tr.ePeofsriotmionpsafroernts2u.btProeesi1tiaonnsd for ssuubbttrreeee21 awnidll fbore ssuebletrcetee2dwrailnldboemselyle,catenddratnhdeopmolsyi,tiaonnd mthuesptobsietioBnomolueastnbleoBgiocoallean olpogericaatloropneordateosr isnucthheastr“eAe;NsDu,cOhRa,sXaOndR,,oarn,dxoNrO,aTn”d. not. This will produce offSingle point crossover was used in our genetic programming. Index identifier

Single point crossover used in our genetic programming. Index identifier was

was assigned for each term and each Boolean operator in a prefix form for each assigned for each term and each Boolean operator in a postfix form for each query, query that was encoded in a tree, see Figure 4.

AND

0 OR 1

AND

4 W2 2

W6 3

W9 5

W3 6

AND

1 W3 2

W4 3

XOR 0 W5 6

AND 5

W8 8 OR 4 W6 7 number for parent2 must not exceed the number of nodes of parent2.

Fig-3: Trees in Postfix form with indexing For parent1, the selecQteudesruybtOrepet1i mrainzdaotmionnubmybGeremneutsitcnAotlgeoxrcietehdmthse num1b3e1r

of nodes for tree1. And the must be done for selection subtree2 random number for parent2.

For example, if we have two random numbers 1 and 4 for subtree1 and subtree2 respectivFeolry,exaanmdplwe,eifimweplcehmoseenrtansdinogmlenupmobinerts c1roanssdo4vefrorpsruobctreeses1 oanndpsaurbetrnete12 and parerenstp2ecftoivreltyr,eaensdswheo wimnplienmFenigtusirnegl3e,ptohinet ncreowssogveenreprraotcedss offnsppareintg1sasnhdopwarnenitn2 see Figufroer t5r,eeasftsehrowenxcinhafingg-3e, sthuebnterweegse.nerated offsprings shown in fig-4, after exchange sub trees.

OR 1 W6 4 W5 3

AND 2

W8 5

AND 0

AND

6 W9 7

W3 8

XOR

0 AND

1 W3 2

W4 3

OR 4 W2 5

W6 6

Fig-4: New offsprings after single point crossover.

Fig. 5. New offsprings after single point crossover. 5.3

8. Mutation operators

Mutation operators Mutation, ranMdoumtatipoenr,truanrdboamtiopnertiunrbtahtieonchinr othme ocshorommeosroempererseepnretsaetnitoanti,onis, insenceecsessasarryy to assurteo atshsautretthheat cthuerrceunrrtengtegneenreartaitioonn iiss ccoonnnnecetcetdetdo tthoe tehnteireensetairrechssepaarcceh, asnpdaictei,s and it isnenceecsesasrsyartoy itnotroindutrcoednuewcegneneewticgmenaetetriical minatotearipaolpiunlattoioan tphoatphualsatstiaobnilitzheadtlehvaesl. stabilize[dD.leHve.lK[7R]A.FInT,ouF.r Ege.PnEeTtiRcYp, roBg.rPa.mBmUiCngK,LmESu,taAtNioDn oTp.eSrAatDoAr SmIVaAyNc.haﺦnﻳرgﺎﺘeﻟا one BooGleEaNnElToIgCicaAlLoGpOerRaItToHrMbSy oFtOhRerQoUnEe.RTYhOatP TisI MfoIrZAmTuItOaNtioInN wINillFObeR MfoArTaInOdN, or and XREorTRBIEoVolAeaLn: RopEeLrEaVtoArNs.CE FEEDBACK. نﺎﻜﻤﻟا]. In our genetic programming,

For emaucthatinoenwopoeffrastporrimnagy, cthhaengseloenceteBdooralenadnolomgicnaulmopbeerartoisr bleysosthtehraonneo.nTeh.aItfisthfoer

selectedmruatnatdioonmwinllubmebfoerrAisN Dle,sOsRt,haannd XmOuRtaBtiooonleavnaloupe,raatonros.ther selected random number will be drown and checking if it is on Boolean operator in the offspring, then exchangeFworilelapcrhoocfefsspsroinng,itse,l(eic.ter.aannddomwnilulmbbeerolresosrthxaonr)o.nEe,xifatmhepsleeleisctsehdorwanndoimn Figure 5n,uwmhbeerrelesms uthtaantimonutawtiaosn nvaoltuep, raoncoethsesrosneleocfftesdprrainndgo1mbnuutmdboenrewiollnboeffdsrpowrinngan2d, where racnhdecokminlgyiiff iwties soenleBcoto1letaondoepneoratteorfoinr tthheeosfufsbptrrinege,pthoesniteioxnchoanngoeffwspillripnrgo2ce,sasnodn from offistp,(rii.ne.gA2NwDe wseilel btehaOtRitorisXaORB)o.oElexaanmplloegwicialll boepsehroawtnorinafnigd-,5,awndherreanmduotamtiloyn it was chwaansgneodt ptroocoers,s tfhoer onfefwsprtinwgo1 obffutspdroinnegsfoarroeffsshproiwngn2.inRFanigduomre i5f.we select 1 to denote for the subtree position for offspring2, and from offspring2 we get that the

Boolean logical operator is AND, and by random it was changed to OR. The new two 5.4 UopffdspartiinngsgarPeoshpouwlnatiniofing-5.

Fitness values will be computed for the two generated new offsprings, and after that the best chromosomes will survive for the next generation. Offsprings will be replaced by the worst chromosomes in the population, for this the new population size will not be changed, replacing two worst chromosomes by two offsprings if the new offsprings fitness values exceed any of the chromosomes in the population.

The new population was ready to start next generation. And this will be repeated until number of generations or until we get the best solution for our problem. 5.5

Experiments We developed our genetic program over a testing environment to see the influence of genetic algorithms on the information retrieval. The environment had a population from set of Boolean queries. For our program an input data used Boolean model of a collection of documents and set of Boolean queries as an initial population, and for execution the genetic program we used: – Different Collections of documents with variant number of words and documents. – Two sets of queries that represent as tree prefix form used as two different initial populations with fixed size in number of chromosomes (8 individuals). – The results will be for a one requesting user query – An initial values was fixed in all experiments are • Mutation value is 0.2 • α is 0.25 • β is 1.0 – Fixed number of generations are 50 generations 5.6

Limitations of current version – At this time genetic operators are applied only for Boolean logical operators and, or and xor. This causes following: if queries in initial population do not include all the terms we have in the user query, they cannot come into existence. – In our implementation, precision fitness value can be greater than 1.0, where the maximal value of precision is 1.25 that is coming from (α and β), thus we cannot interpret it as probability. 5.7

Experiment Tests Our tests was done using user query

w8 or w2, and we used two different initial populations Q1 and Q2; and they are shown in table 1, and table 2 respectively.

We mention that in initial population Q1 there is in one query contain a sub expression w8 and w2, w8 and w2, w8 or w2, w8 xor w2. and in initial population Q2 there are in three different queries contains three different sub expressions

Our testing of execution for genetic programming was done independently from other execution results. Three cases were studied. The results of our tests with different collections (different number of documents and different number of words/terms). The three cases descriptions are: – Test case 1: collection of 10 documents, 30 words in collection, maximal number of different words in a document is 10. Results are shown in table 3. – Test case 2: collection of 200 documents, 50 words in collection, maximal number of different words in a document is 50. Results are shown in table 4. – Test case 3: collection of 5000 documents, 2000 words in collection, maximal number of different words in a document is 500. Results are shown in table 5. The results of this study suggests that the final population composed of individuals having the same strength (quality) will have the same precision and recall values. The best individual result was randomly chosen as best. We conclude that the quality of initial population was important to have best result of genetic programming process, and the less quality of initial population caused worse results. This could be seen when we chose parents based on precision, and therefore recall was very small for the large collection.

We concluded that in order to get good results, we choose parents depending on the recall fitness values than the precision fitness values, but that will increase the number of Boolean logical operators for the final best results.

For the future work we suggest to use less number of prefix tree by identifying the Boolean logical operators only without identify index for terms. For selecting the best individual with less number of Boolean operators and less number of terms instead of random selection if we got the final population with same strength. Modifying our system to work with different Boolean operators (of, adj, . . . see in [ 4 ]), and extend this for fuzzy logic.

1. Baeza-Yates , R. , Ribeiro-Neto , B. : Modern Information Retrieval . Addison Wesley, New York, 1999 .

2. Freytag , Johann Christoph: A Rule-Based View of Query Optimization . Proceedings of ACM-SIGMOD , 1987 , pp. 173 - 180 .

3. Goldberg , David E.: Genetic Algorithms in Search, Optimization and Machine Learning . Reading, Massachusetts: Addison-Wesley, 1989 .

4. Korfhage Robert R.: Information Storage and Retrieval . John Wiley & Sons, Inc. 1997 .

5. Melanie

M.:

An Introduction to Genetic Algorithms. A Bradford Book The MIT Press

1999 .

6. Kim , Won: On Optimizing an SQL-like Nested Query . ACM Transactions on Database Systems 7 , 3 ( September 1982 ), pp. 443 - 469 .

7. Kraft , D.H. , Bordogna , G. , and Pasi , G.: Fuzzy Set Techniques in Information Retrieval , in Bezdek, J.C., Didier , D. and Prade , H. (eds.), Fuzzy Sets in Approximate Reasoning and Information Systems , vol. 3 , The Handbook of Fuzzy Sets Series, Norwell, MA: Kluwer Academic Publishers, 1999 .

8. McGoveran , David: ”Evaluating Optimizers.” Database Programming and Design . January 1990 , pp. 38 - 49 .

9. Salton , G. and Buckley , C. : Terms-Weighting approach in automatic text retrieval . Information Processing and management, 1988 24 ( 5 ): 513 - 523 .

10. Suhail

S. J.

Owais .: Timetabling of Lectures in the Information Technology College at Al al-Bayt University Using Genetic Algorithms. Master thesis , Al al-Bayt University 2003 , Jordan. (in Arabic).

11. Yao , S. Bing: ” Optimization of Query Algorithms.” ACM Transactions on Database Systems 4 , 2 ( June 1979 ), pp. 133 - 155 .