Evolutionary Learning of Boolean Queries by Genetic Programming

Suhail S. J. Owais
Department of Computer Science, VŠB-Technical University of Ostrava, 17. listopadu 15, Ostrava-Poruba, Czech Republic

Pavel Krömer
Department of Computer Science, VŠB-Technical University of Ostrava, 17. listopadu 15, Ostrava-Poruba, Czech Republic

Václav Snášel (vaclav.snasel@vsb.cz)
Department of Computer Science, VŠB-Technical University of Ostrava, 17. listopadu 15, Ostrava-Poruba, Czech Republic


The performance of an information retrieval system is usually measured in terms of two different criteria, precision and recall. The optimization of any of its components is therefore a clear example of a multiobjective problem. However, although evolutionary programming has been widely applied in the information retrieval area, in all of these applications both criteria have been combined into a single scalar fitness function by means of a weighting scheme. In this paper, we deal with the use of Genetic Programming in information retrieval, specifically for optimizing a Boolean query.

Introduction

Ever since the advent of the public Internet, the quantity of available information has been rising rapidly. One of the most important uses of this public network is to find information suited to a user query. In such a huge and unstable information collection, today's greatest problem is to find information relevant to the user query.

Information filtering is concerned with finding information in unstable collections of documents such as the Internet. In the information filtering domain, the user query does not consist of a list of words or terms (word and term have the same meaning in our work) to search for, but rather of combinations of words extracted from various examples. The most important problems to solve are optimizing the significance of the user query and obtaining accurate collection statistics for calculating the term arity.

After more than two decades of using evolutionary techniques for single-objective optimization, the incorporation of more than one objective in the fitness function has become a popular area of research. The use of evolutionary algorithms to solve problems with multiple objectives (known as Multi-objective Optimization Problems) has attracted much attention [6,3,18]. An information retrieval system basically consists of three main components: a documentary database, a query subsystem, and a matching or evaluation mechanism [1,13].

Evaluation of Information Retrieval System

The effectiveness of an information retrieval system is evaluated using two statistics, precision and recall, which are computed over a set of documents called a collection. All documents in the collection are divided into four subsets: the Relevant set (documents that are relevant to the user query); the Retrieved set (documents that are returned to the user); the Relevant-Retrieved set (documents that are both retrieved and relevant to the user query); and the remaining documents (those that are neither relevant nor retrieved). Precision is the percentage of the retrieved documents that are relevant to the user query, and recall is the percentage of the relevant documents that are retrieved for the requested query [1,12,8].

$$\text{Recall} = \frac{\text{RelevantRetrieved}}{\text{Relevant}}, \qquad \text{Precision} = \frac{\text{RelevantRetrieved}}{\text{Retrieved}}$$
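The two measures can be illustrated with a minimal Python sketch. The function name `precision_recall`, the document identifiers, and the choice to return 0.0 for an empty set are assumptions made for this example and are not taken from our implementation.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document identifiers."""
    relevant_retrieved = relevant & retrieved  # the Relevant-Retrieved set
    # Assumption: an empty denominator yields 0.0 instead of raising an error.
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four relevant documents, five retrieved ones.
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5, 6, 7}
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)  # 0.4 0.5
```

Here two of the five retrieved documents are relevant (precision 0.4) and two of the four relevant documents are retrieved (recall 0.5).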

In our work we propose using Genetic Programming to implement an information retrieval system with Boolean queries, evolving the Boolean queries by genetic programming.

Genetic Algorithms

Most search engines on the Internet depend on the user query and operate an information retrieval system to respond to the user's request. The user query consists of a set of terms and a set of logical operators, in particular the and, or, of, and not operators (see [8]). Our motivation in this work is therefore to evolve Boolean queries using genetic programming in information retrieval [3,2,15].

A Genetic Algorithm is an algorithm used to find approximate solutions to difficult problems through a set of techniques, namely inheritance or crossover, mutation, natural selection, and a fitness function, that apply principles of evolutionary biology to computer science. For more detail about Genetic Algorithms see [5,16].

Genetic Programming

This section presents the implementation of information retrieval using genetic programming (for SQL see [17,11,7,4]). The GA is generally used to solve optimization problems [12,9,5]. A GA starts from an initial population with a fixed number of chromosomes ("P-chromosomes"). Each individual is coded according to the chromosome length; genes are allocated to each position in a chromosome, possibly with different data types, and the value of each gene is called an allele. In information retrieval, each individual or chromosome represents a query for relevant documents, and each document is described by a set of terms. For the collection of documents D containing l documents, the document $D_i$, where $i = 1 \ldots l$, is described over the set of terms $T_j$, where $j = 1 \ldots n$, as $D_i = (w_{1i}, w_{2i}, \ldots, w_{ni})$. The value for each term is 1 if the term exists in the document and 0 if it does not (other term weights are discussed in [14]). This means that the indexing function that maps a given index term t and a given document d is

F : D × T → [0, 1]
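As a small illustration of the binary case described above, the following Python sketch builds the description of a document as a vector of 0/1 values. The vocabulary, the two documents, and the helper name `describe` are hypothetical and serve only as an example.

```python
# Hypothetical vocabulary T and a tiny collection D (each document is a set of terms).
terms = ["w1", "w2", "w3", "w4"]
documents = {
    "D1": {"w1", "w3"},
    "D2": {"w2", "w3", "w4"},
}


def F(doc_id, term):
    """Binary indexing function: 1 if the term occurs in the document, else 0."""
    return 1 if term in documents[doc_id] else 0


def describe(doc_id):
    """Description of a document as the vector of its term values."""
    return tuple(F(doc_id, t) for t in terms)


print(describe("D1"))  # (1, 0, 1, 0)
print(describe("D2"))  # (0, 1, 1, 1)
```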

A query is defined as a combination of a set of terms and a set of Boolean operators and, or, of, and not. The query set Q is defined as the set of queries for documents, and the query processing mechanism is the means by which documents are evaluated in terms of their relevance to a given query [10].

Note: The of operator has the following general form:

N of(w_1, w_2, w_3, ..., w_M); M ≥ N

It works as follows: the document is retrieved when it contains at least N terms from the list of M terms. For example, 2 of(w_1, w_2, w_3) = ((w_1 and w_2) or (w_1 and w_3) or (w_2 and w_3))
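The equivalence in the example can be checked with a short Python sketch. The function names `n_of` and `expanded` are introduced here for illustration, and the sketch assumes that the M terms in the list are distinct.

```python
from itertools import combinations


def n_of(n, term_list, doc_terms):
    """N of(...): true when the document contains at least n of the listed terms."""
    return sum(1 for t in term_list if t in doc_terms) >= n


def expanded(n, term_list, doc_terms):
    """The same condition written as an 'or' over all 'and' groups of n terms."""
    return any(
        all(t in doc_terms for t in group)
        for group in combinations(term_list, n)
    )


term_list = ["w1", "w2", "w3"]
# Compare both forms on every possible document over these three terms.
all_docs = [
    set(group)
    for size in range(len(term_list) + 1)
    for group in combinations(term_list, size)
]
agree = all(n_of(2, term_list, d) == expanded(2, term_list, d) for d in all_docs)
print(agree)  # True
```

The counting form avoids enumerating the combinations, which grow quickly with M.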

In this work, we develop genetic programming that implements the GA operators with variable-length chromosomes and a mixture of symbolic information, such as real values and Boolean query values.

Each chromosome in the initial population represents the tree structure of one query, and an index is defined for each node in the tree. Genetic operators are applied to the individuals. Queries are encoded as trees: each chromosome contains a set of genes, each gene corresponds to a node in the tree, and the value of each node is its allele. An example of the query encoding for a chromosome in the population is shown in Figure 1.
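One possible way to picture this encoding is the following Python sketch. The `Node` class, its field names, and the pre-order indexing in `indexed_nodes` are assumptions made for this example; they are not a description of our actual data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One gene: the allele is an operator name or a term."""
    value: str
    children: List["Node"] = field(default_factory=list)
    threshold: int = 0  # N of the 'of' operator; unused by other operators


def evaluate(node, doc_terms):
    """Decide whether a document (a set of terms) satisfies the query tree."""
    if not node.children:  # leaf: a term
        return node.value in doc_terms
    results = [evaluate(child, doc_terms) for child in node.children]
    if node.value == "and":
        return all(results)
    if node.value == "or":
        return any(results)
    if node.value == "not":
        return not results[0]
    if node.value == "of":
        return sum(results) >= node.threshold
    raise ValueError(f"unknown operator: {node.value}")


def indexed_nodes(node):
    """Pre-order list of nodes; the list position serves as the node index."""
    found = [node]
    for child in node.children:
        found.extend(indexed_nodes(child))
    return found


# The query (w6 and w8) and not w10 encoded as a tree.
query = Node("and", [
    Node("and", [Node("w6"), Node("w8")]),
    Node("not", [Node("w10")]),
])
print(evaluate(query, {"w6", "w8"}))         # True
print(evaluate(query, {"w6", "w8", "w10"}))  # False
print(len(indexed_nodes(query)))             # 6
```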

Implement Genetic Operators to Evolve Boolean Queries

Genetic operators are used in our work to evolve Boolean queries. These operators, Fitness, Selection, Crossover, and Mutation, are presented below:

Fitness function operator

For each individual, the values of precision and recall are computed and used as fitness values (PrecisionFitness and RecallFitness, respectively). They depend on the relevance $r_d$ of each document d in the collection to the user query, on the retrieval status $f_d$ of the document, and on the arbitrary weights α and β, which are used only in the precision fitness function [10].

$$\text{RecallFitness} = \frac{\sum_d [r_d \times f_d]}{\sum_d [r_d]}$$

$$\text{PrecisionFitness} = \alpha \, \frac{\sum_d [r_d \times f_d]}{\sum_d [r_d]} + \beta \, \frac{\sum_d [r_d \times f_d]}{\sum_d [f_d]}$$
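The two fitness functions can be sketched in Python as follows. The sketch assumes that r_d and f_d are 0/1 values per document (1 for relevant and 1 for retrieved, respectively) and that an empty denominator yields 0.0; the function names and the sample vectors are hypothetical. The default weights are the ones used in our experiments.

```python
def recall_fitness(r, f):
    """Sum of r_d * f_d over all documents divided by the sum of r_d."""
    hits = sum(rd * fd for rd, fd in zip(r, f))
    total_relevant = sum(r)
    # Assumption: a query with no relevant documents gets fitness 0.0.
    return hits / total_relevant if total_relevant else 0.0


def precision_fitness(r, f, alpha=0.25, beta=1.0):
    """Weighted sum: alpha times the recall term plus beta times the precision term."""
    hits = sum(rd * fd for rd, fd in zip(r, f))
    total_retrieved = sum(f)
    # Assumption: a query that retrieves nothing gets a precision term of 0.0.
    precision = hits / total_retrieved if total_retrieved else 0.0
    return alpha * recall_fitness(r, f) + beta * precision


# Hypothetical five-document collection: two relevant, four retrieved.
r = [1, 1, 0, 0, 0]
f = [1, 1, 1, 1, 0]
print(recall_fitness(r, f))     # 1.0
print(precision_fitness(r, f))  # 0.75
```

With both relevant documents retrieved but only half of the retrieved documents relevant, the recall fitness is 1.0 and the precision fitness is 0.25 × 1.0 + 1.0 × 0.5 = 0.75.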

Selection operator

The two individuals with the best fitness values are chosen from the population; if more than two individuals share the highest fitness value, two of them are chosen randomly. The two selected chromosomes are called parents and are used to produce two new offspring.
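A minimal Python sketch of this selection rule is given below. The function name `select_parents` and the shuffle-then-stable-sort approach to random tie-breaking are choices made for this example; the sketch also breaks ties for the second parent randomly, which is an assumption.

```python
import random


def select_parents(population, fitness, rng=random):
    """Return the two individuals with the best fitness, ties broken randomly."""
    order = list(range(len(population)))
    rng.shuffle(order)  # randomize the order first ...
    # ... then sort by fitness; the sort is stable, so equal-fitness
    # individuals keep their shuffled (random) relative order.
    order.sort(key=lambda i: fitness[i], reverse=True)
    return population[order[0]], population[order[1]]


# Hypothetical population of four queries with their fitness values.
population = ["q1", "q2", "q3", "q4"]
fitness = [0.5, 1.25, 0.75, 1.25]
parent1, parent2 = select_parents(population, fitness)
print(sorted([parent1, parent2]))  # ['q2', 'q4']
```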

Crossover operator

Offspring must inherit from both parents; single-point crossover achieves this by exchanging a subtree from parent1 with a subtree from parent2. The positions of the exchanged subtrees, subtree1 and subtree2, are selected randomly. In our work, the position of a subtree may be:

1. The root node of the tree.
2. Any Boolean operator node.
3. Any leaf of the tree.

An example of producing two new offspring by single-point crossover is shown in Figure 2.
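The subtree exchange can be sketched in Python as follows. For brevity the sketch represents a query tree as a nested list `[operator, child, ...]` with plain strings as leaves; this representation and all helper names are assumptions made for the example, not our actual implementation.

```python
import copy
import random


def paths(tree, prefix=()):
    """All node positions as index paths; the empty path () is the root."""
    found = [prefix]
    if isinstance(tree, list):  # operator node: children start at index 1
        for i, child in enumerate(tree[1:], start=1):
            found.extend(paths(child, prefix + (i,)))
    return found


def get_subtree(tree, path):
    """Follow an index path down to the subtree it points at."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:  # replacing the root replaces the whole tree
        return new
    tree = copy.deepcopy(tree)
    get_subtree(tree, path[:-1])[path[-1]] = new
    return tree


def crossover(parent1, parent2, rng=random):
    """Swap one randomly chosen subtree between the parents (root, operator, or leaf)."""
    path1 = rng.choice(paths(parent1))
    path2 = rng.choice(paths(parent2))
    subtree1 = copy.deepcopy(get_subtree(parent1, path1))
    subtree2 = copy.deepcopy(get_subtree(parent2, path2))
    child1 = replace_subtree(parent1, path1, subtree2)
    child2 = replace_subtree(parent2, path2, subtree1)
    return child1, child2


parent1 = ["and", "w1", ["or", "w2", "w8"]]
parent2 = ["or", ["not", "w4"], "w6"]
child1, child2 = crossover(parent1, parent2, random.Random(1))
print(child1)
print(child2)
```

The parents are left unchanged, so the same pair can be reused to produce further offspring.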

Mutation operators

Mutation, a random perturbation in the chromosome representation, is necessary to ensure that the current generation is connected to the entire search space and to introduce new genetic material into a population that has stabilized [10]. In our implementation, the mutation operator is the most important operator for the evolutionary learning of Boolean queries.

Each node of the new offspring may be mutated, depending on the mutation value (0.2 by default). We work with the different types of mutation listed below:

- Mutation on a Boolean operator: randomly exchanging one operator for another, where both must have the same arity; any such exchange among (and, or, of, and not) is allowed.
- Mutation on a term node or leaf node: replacing one randomly selected term of the offspring with another term taken from one of:
  • The terms in a given collection of documents.
  • The terms in the initial population.
  • A specified list of terms.
  • The terms that appear in the user query.
- Mutation by inserting or deleting a unary operator between two nodes in the offspring.

Mutation was implemented in this way: for a generated offspring, one node is selected randomly, and for this node there are two possibilities: either mutate it into another node, or insert a unary operator before it (or delete it, if and only if the node itself is a unary operator). Some examples of mutations are shown in Figure 3. The following sections describe the experiments carried out on the evolutionary learning of Boolean queries.
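The mutation procedure can be sketched in Python under several simplifying assumptions: query trees are nested lists `[operator, child, ...]` with strings as leaves, the mutation value is applied once per offspring, the two possibilities are chosen with equal probability, and the `of` operator is left out. All names are hypothetical.

```python
import copy
import random

BINARY_OPS = ["and", "or"]  # same-arity operators that may be exchanged


def paths(tree, prefix=()):
    """All node positions as index paths; the empty path () is the root."""
    found = [prefix]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            found.extend(paths(child, prefix + (i,)))
    return found


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    tree = copy.deepcopy(tree)
    get_subtree(tree, path[:-1])[path[-1]] = new
    return tree


def mutate(tree, term_pool, rate=0.2, rng=random):
    """Mutate one randomly selected node with probability `rate`."""
    if rng.random() >= rate:
        return tree  # no mutation this time
    path = rng.choice(paths(tree))
    node = get_subtree(tree, path)
    if rng.random() < 0.5:  # possibility 1: change the node into another one
        if isinstance(node, str):
            new = rng.choice(term_pool)  # term taken from the chosen pool
        elif node[0] in BINARY_OPS:
            new = [rng.choice(BINARY_OPS)] + copy.deepcopy(node[1:])
        else:
            new = copy.deepcopy(node)  # 'not' has no same-arity alternative here
    elif isinstance(node, list) and node[0] == "not":
        new = copy.deepcopy(node[1])  # possibility 2a: delete the unary operator
    else:
        new = ["not", copy.deepcopy(node)]  # possibility 2b: insert a unary operator
    return replace_subtree(tree, path, new)


offspring = ["and", ["and", "w6", "w8"], ["not", "w10"]]
user_query_terms = ["w6", "w8", "w10"]  # one of the term pools listed above
mutant = mutate(offspring, user_query_terms, rate=1.0, rng=random.Random(0))
print(mutant)
```

Passing a different `term_pool` (terms of the whole collection, of the initial population, or of the user query) switches between the leaf mutation variants compared in the experiments.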

Introduction to experiments

We developed a genetic programming system to run experiments over a set of Boolean queries and various collections of documents; the documents contain varying numbers of words or terms. All collections used in our experiments are described in Table 1. For example, 'Collection-1' consists of 10 different documents and 30 different words, where each document includes some of these words (one, two, or more of them). The following ten Boolean queries were used as the initial population in all of our experiments:

- 2 of(w_2, w_8)
- 1 of(w_1, w_2, w_8)
- not(not w_13 and not w_8)
- (w_1 and (w_2 and w_8)) or not(w_4 or w_2)
- not(w_1 or w_2) and ((w_5 or w_4) and (w_3 and w_6))
- (w_9 and w_14)
- (not w_14) and w_1
- (w_2 or w_6) or (w_8 and w_13)
- (w_3 and w_4) or ((w_12 xor w_15) and w_8)
- (w_2 or w_8) or (w_1 and w_2)

The genetic programming run was terminated when one or more chromosomes in the population reached the highest value of the selected fitness function, or when the maximum number of generations was reached. The highest values for precision and recall are α + β and 1, respectively.
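The two termination criteria can be illustrated with a generic loop in Python. The function `run` and its parameters are hypothetical, and the toy population of integers merely stands in for a population of query trees; `evolve_fn` stands for one round of selection, crossover, and mutation.

```python
def run(population, fitness_fn, evolve_fn, max_fitness, max_generations):
    """Evolve until some individual reaches max_fitness or generations run out."""
    for generation in range(max_generations):
        if any(fitness_fn(individual) >= max_fitness for individual in population):
            return population, generation  # criterion 1: highest fitness reached
        population = evolve_fn(population)
    return population, max_generations     # criterion 2: generation limit reached


# Toy stand-ins: integers as individuals, fitness grows by 0.25 per generation.
final, generations = run(
    population=[0, 1],
    fitness_fn=lambda x: x / 4,
    evolve_fn=lambda pop: [x + 1 for x in pop],
    max_fitness=1.0,
    max_generations=200,
)
print(final, generations)  # [3, 4] 3
```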

All experiments were repeated multiple times with the same options to observe the differences in the results caused by the randomness of the genetic programming process. The following fixed options were used in all experiments:

- arbitrary weights α = 0.25 and β = 1.0
- crossover value = 0.8
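A short worked example may help in reading the precision fitness values reported in the tables. It assumes that the precision fitness is the weighted sum of α times recall and β times precision, as defined in the fitness function section; the symbol P for plain precision is introduced here only for illustration.

```latex
\[
\text{PrecisionFitness}_{\max} = \alpha \cdot 1 + \beta \cdot 1 = 0.25 + 1.0 = 1.25
\]
\[
0.75 = 0.25 \cdot 1.00 + 1.0 \cdot P \;\Longrightarrow\; P = 0.5
\]
```

Under this reading, a reported precision fitness of 0.75 together with a recall of 1.00 corresponds to half of the retrieved documents being relevant.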

Experiments on different types of mutation and different fitness functions

The mutation value is the probability of applying the mutation operator to offspring. In these experiments we observed how changes in the mutation value affect the result of the genetic programming process, using the different types of mutation described above and two different sets of options.

The first set of options is:

- User query: (w_6 and w_8) and not w_10
- Collection name: Collection-1
- Fitness measures used: precision or recall
- Maximum number of generations: 200

Experiments on using precision fitness function:

When all terms from the initial population were used for mutation on leaves, the results were as shown in Table 2. In all experiments the fitness values of the chromosomes in the final population reached the highest value, as did the recall fitness value, while the number of generations varied.

When all terms from the user query were used for mutation on leaves, the results were as shown in Table 3. In this case, the maximum number of generations was reached in most runs, because the highest value of the selected fitness function was not attained, although the highest values for recall were reached.

When all terms from the whole collection were used for mutation on leaves, the results were as shown in Table 4. In some experiments the program terminated because the maximum number of generations was reached, and in others because the highest precision fitness value was reached; in almost all experiments the highest recall fitness value was reached.

Experiments on using recall fitness function:

All chromosomes in the final population had approximately the same highest value of recall, but the precision values mostly varied; some of these results are shown in the tables below. Table 5 shows the results when mutation over leaves used terms from the user query only, Table 6 when it used terms from the initial population, and Table 7 when it used terms from the whole collection. Some experiments were also run using the second set of options, with the results shown in Tables 8 and 9.

The second set of options is:

- User query: ((not w_10) and (w_6 and w_8))
- Collection name: Collection-2
- Fitness measures used: precision or recall
- Maximum number of generations: 1200

After increasing the number of generations and running the experiments on Collection-2, the results differed: in some cases the best solution with respect to the precision fitness function was reached, as shown in Table 8, whereas in the results in Table 9 almost all experiments reached the highest value of recall within a small number of generations.

Experiments over fitness functions:

The goal of the optimization process for a Boolean query is to obtain a query with the highest possible value of precision or recall, depending on the chosen fitness function. The results above demonstrate that when recall was used as the fitness function, the program terminated within a few generations because the highest recall value was reached. When precision was used as the fitness function, recall reached the highest value in almost all experiments, while some of the precision values were not high.

Conclusions

In this paper, the optimization of a Boolean query over a collection of documents was presented. We focused especially on different types of mutation and on the comparison of two different fitness measures, precision and recall. Experiments were run over various document collections, with different types of mutation and two types of fitness function. We drew the following conclusions:

First, when applying the mutation operator to terms in the chromosomes from the initial population, it is necessary to have all the terms from the search space available for mutation. If only terms from the user query or the initial population were used for mutation, the results were worse than when terms from the whole collection were used. Only then can new queries come into existence that describe the same documents as the user query but contain terms not included in the user query or the initial population.

Second, recall seems to be more efficient than precision as a fitness function, reaching an optimized query within fewer generations than when precision was chosen. In that case we retrieved all relevant documents together with a small number of non-relevant documents. When precision was chosen as the fitness function, we mostly reached the highest values of recall before program termination, while the highest values of precision were mostly not reached and the process terminated after reaching the maximum number of generations.

Third, in some cases, especially when mutation over leaves used terms from the collection or from the initial population, an optimized query was reached within a few generations for both fitness functions, but with recall as the fitness function the results were reached within fewer generations than with precision. When mutation over leaves used terms from the user query only and the fitness function was precision, the results were worse than in the other cases.

In our future work we will focus on weighted terms and weighted Boolean operators, applying fuzzy logic to the weights of terms and Boolean operators for optimizing the user query in information retrieval systems, and also on using different methods for evaluating the performance of information retrieval, such as the harmonic mean measure (F-score). We also want to consider the number of Boolean operators and the number of terms as objectives for query optimization.

Fig. 1. Chromosome encoding of a query

Fig. 2. Single-point crossover; nodes on the parents are selected randomly

Fig. 3. Mutation applied to offspring, where nodes are selected randomly

Table 1. Description of Document Collections

| Collection Name | Number of Documents | Maximum number of terms in each document |
|---|---|---|
| Collection-1 | 10 | 30 |
| Collection-2 | 50 | 200 |
| Collection-3 | 5000 | 1000 |

Table 2. Precision, mutation over leaves using terms from all initial population

| mutation value | number of generations | final precision | final recall |
|---|---|---|---|
| 0.1 | 200 | 0.75 | 1.00 |
| 0.2 | 24 | 1.25 | 1.00 |
| 0.3 | 27 | 1.25 | 1.00 |
| 0.4 | 118 | 1.25 | 1.00 |
| 0.5 | 45 | 1.25 | 1.00 |

Table 3. Precision, mutation over leaves using terms from user query only

| mutation value | number of generations | final precision | final recall |
|---|---|---|---|
| 0.1 | 200 | 0.75 | 1.00 |
| 0.2 | 200 | 0.75 | 1.00 |
| 0.3 | 200 | 0.75 | 1.00 |
| 0.4 | 200 | 0.75 | 1.00 |
| 0.5 | 113 | 1.25 | 1.00 |

Table 4. Precision, mutation over leaves using terms from whole collection

| mutation value | number of generations | final precision | final recall |
|---|---|---|---|
| 0.1 | 200 | 0.75 | 1.00 |
| 0.2 | 58 | 1.25 | 1.00 |
| 0.3 | 11 | 1.25 | 1.00 |
| 0.4 | 187 | 1.25 | 1.00 |
| 0.5 | 42 | 1.25 | 1.00 |

Table 5. Recall, mutation over leaves using terms from user query

| mutation value | number of generations | final precision | final recall |
|---|---|---|---|
| 0.1 | 5 | 0.583 | 1.00 |
| 0.2 | 5 | 0.500 | 1.00 |
| 0.3 | 5 | 0.500 | 1.00 |
| 0.4 | 5 | 0.583 | 1.00 |
| 0.5 | 6 | 0.583 | 1.00 |

Table 6. Recall, mutation over leaves using terms from initial population

| mutation value | number of generations | final precision | final recall |
|---|---|---|---|
| 0.1 | 5 | 0.583 | 1.00 |
| 0.2 | 5 | 0.500 | 1.00 |
| 0.3 | 5 | 0.500 | 1.00 |
| 0.4 | 5 | 0.583 | 1.00 |
| 0.5 | 7 | 0.583 | 1.00 |

Table 7. Recall, mutation over leaves using terms from whole collection

| mutation value | number of generations | final precision | final recall |
|---|---|---|---|
| 0.1 | 5 | 0.583 | 1.00 |
| 0.2 | 5 | 0.500 | 1.00 |
| 0.3 | 6 | 0.500 | 1.00 |
| 0.4 | 8 | 0.583 | 1.00 |
| 0.5 | 5 | 0.583 | 1.00 |

From the results shown in Tables 5, 6 and 7, the program runs terminated when the highest recall fitness values of the chromosomes were reached, within a small number of generations.

Table 8. Precision, mutation over leaves using terms from user query, Collection-2

| mutation value | number of generations | final precision | final recall |
|---|---|---|---|
| 0.1 | 1200 | 1.00 | 1.00 |
| 0.2 | 1200 | 0.75 | 1.00 |
| 0.3 | 197 | 1.25 | 1.00 |
| 0.4 | 462 | 1.25 | 1.00 |
| 0.5 | 1200 | 0.75 | 1.00 |

Table 9. Recall, mutation over leaves using terms from user query, Collection-2

| mutation value | number of generations | final precision | final recall |
|---|---|---|---|
| 0.1 | 16 | 0.3050 | 1.00 |
| 0.2 | 13 | 0.3050 | 1.00 |
| 0.3 | 23 | 0.3050 | 1.00 |
| 0.4 | 11 | 0.3050 | 1.00 |
| 0.5 | 17 | 0.3050 | 1.00 |

References

1. R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, New York, 1999.
2. H. Chen. A machine learning approach to inductive query by examples: an experiment using relevance feedback, ID3, genetic algorithms, and simulated annealing. Journal of the American Society for Information Science, 49(8), 1998.
3. O. Cordon, E. Herrera-Viedma, M. Luque. Evolutionary Learning of Boolean Queries by Multiobjective Genetic Programming. In: J. J. Merelo Guervos (ed.), PPSN VII, LNCS 2439, Springer-Verlag, Berlin Heidelberg, 2002.
4. J. C. Freytag. A Rule-Based View of Query Optimization. Proceedings of ACM-SIGMOD, 1987.
5. D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, Massachusetts, 1989.
6. H. A. Abbass, R. Sarker, C. Newton. PDE: A Pareto-frontier Differential Evolution Approach for Multi-objective Optimization Problems. Proceedings of the Congress on Evolutionary Computation 2001 (CEC'2001), vol. 2, IEEE Service Center, Piscataway, New Jersey, 2001.
7. W. Kim. On Optimizing an SQL-like Nested Query. ACM Transactions on Database Systems, 7, 1982.
8. R. R. Korfhage. Information Storage and Retrieval. John Wiley & Sons, Inc., 1997.
9. J. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, 1992.
10. D. H. Kraft, G. Bordogna, G. Pasi. Fuzzy Set Techniques in Information Retrieval. In: J. C. Bezdek, D. Didier, H. Prade (eds.), Fuzzy Sets in Approximate Reasoning and Information Systems, The Handbook of Fuzzy Sets Series, vol. 3, Kluwer Academic Publishers, Norwell, MA, 1999.
11. D. McGoveran. Evaluating Optimizers. Database Programming and Design, 1990.
12. M. Melanie. An Introduction to Genetic Algorithms. A Bradford Book, The MIT Press, 1999.
13. C. J. Rijsbergen. Information Retrieval. 2nd edition, Butterworth, 1979.
14. G. Salton, C. Buckley. Terms-Weighting approach in automatic text retrieval. Information Processing and Management, 24(5), 1988.
15. M. P. Smith, M. Smith. The use of genetic programming to build Boolean queries for text retrieval through relevance feedback. Journal of Information Science, 23(6), 1997.
16. S. S. J. Owais. Timetabling of Lectures in the Information Technology College at Al al-Bayt University Using Genetic Algorithms. Master thesis (in Arabic), Al al-Bayt University, Jordan, 2003.
17. S. B. Yao. Optimization of Query Algorithms. ACM Transactions on Database Systems, 4, 1979.
18. E. Zitzler. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. Swiss Federal Institute of Technology Zurich, Zurich, 1999.