Evolutionary Learning of Boolean Queries by
               Genetic Programming

             Suhail S. J. Owais, Pavel Krömer, and Václav Snášel

      Department of Computer Science, VŠB-Technical University of Ostrava,
              17. listopadu 15, Ostrava - Poruba, Czech Republic
                           {vaclav.snasel}@vsb.cz


      Abstract. The performance of an information retrieval system is usu-
      ally measured in terms of two different criteria, precision and recall. This
      way, the optimization of any of its components is a clear example of a
      multiobjective problem. However, although evolutionary programming
      have been widely applied in the information retrieval area, in all of these
      applications both criteria have been combined in a single scalar fitness
      function by means of a weighting scheme. In this paper, we deal with
      using of Genetic Programming in Information retrieval specially in opti-
      mizing of a Boolean query.


1   Introduction

Ever since the advent of the public network Internet, the quantity of available
information is rapidly rising. One of the most important uses of this public
network is to find suitable information for such user query request. In such a
huge and unstable information collection, todays greatest problem is to find
relevant information to the user query.
     Information filtering is concerned with finding information from unstable
collections of documents such as the Internet. In the information filtering domain,
the user query does not consists of a list of words or terms (word and term have
the same meaning in our work) to search for but rather of combinations of
words extracted from various examples. The most important problem to solve is
to optimize the significance of the user query and obtaining accurate collection
statistics for calculating the term arity.
     After using evolutionary techniques for single-objective optimization during
more than two decades, the incorporation of more than one objective in the fit-
ness function has become a popular area of research for multiobjective problems.
The use of evolutionary algorithms to solve problems with multiple objectives
(known as Multi-objective Optimization Problems) has attracted much attention
[6, 3, 18].
     An information retrieval system is basically constituted of three main com-
ponents: documentary database, query subsystem and matching or evaluation
mechanism [1, 13].

                                          54
                                                                               55

2    Evaluation of Information Retrieval System

Evaluation of the information retrieval system, measured by effectiveness, two
statistics are used precision and recall, where these measures are evaluated over
a set of documents called a collection of documents. All documents in this collec-
tion of documents are divided into four subsets: Relevant set ”set of documents
that are relevant to the user query”; Retrieved set ”set of documents that are
returned to the user”; and Relevant-Retrieved set ”set of documents that are
retrieved and relevant to the user query”; and finally the rest set of documents
”set of documents that are not relevant and not retrieved”. Where precision is
the percentage of the retrieved documents that are relevant to the user query
and recall is the percentage of the relevant documents that are retrieved for the
requested query [1, 12, 8].

    Recall = RelevantRetrieved
                 Relevant                          P recision = RelevantRetrieved
                                                                    Retrieved

    In our work we introduce to use Genetic Programming for implementing the
Information Retrieval system with Boolean queries, trying to evolve Boolean
queries by genetic programming.


3    Genetic Algorithms

Most of the search engines in the internet depend on the user query and operate
an information retrieval system to get the response of the user query request.
Where the user query consists of set of terms and set of logical operators; espe-
cially and, or, of, and not operator see [8]. For this our motivation in our work
is to do the evolution of the Boolean queries using genetic programming in the
information retrieval [3, 2, 15].
    Genetic Algorithm is an algorithm that used to find approximate solutions
to problems that were difficult to solve it through set of methods or techniques
inheritance or crossover, mutation, natural selection, and fitness function that
are principles of evolutionary biology in computer science. For more detail about
Genetic Algorithms see [5, 16].


4    Genetic Programming

This section will present the implementation of information retrieval using ge-
netic programming (for SQL we can see [17, 11, 7, 4]). The GA is generally used
to solve optimization problems [12, 9, 5]. GA starts on an initial population
with fixed size of chromosomes ”P-chromosomes”. Each individual are coded
according to chromosome length, where genes are allocated in each position in
a chromosome with different data types, and each gene values is called allele.
In information retrieval, query for relevant documents are representing for each
individual or chromosome, and each document described by set of terms. For
56

the collection of documents D, the description for document Di from l docu-
ments, where i = 1 . . . l, the set of terms for Di are Tj , where j = 1 . . . n, thus
Di = (w1i , w2i , . . . , wni ). The value for each term will be 1 if this term exists in
the document or 0 if not (Note: about another weights for terms were mentioned
in paper [14]), this indicate that the indexing function that is maps a given index
term t and a given document d is

                                  F : D × T → [0, 1]

   Defining a query will be combination from set of terms and set of Boolean
operators and, or, of and not. The query set Q is defined as set of queries for
documents, define the query processing mechanism by which documents can be
evaluated in terms of their relevance to a given query [10].
   Note: The of operator has the following general form:

                         N of(w1 , w2 , w3 , . . . , wM ); M ≥ N

   and works like this: the document will be retrieved when it contains at least
N terms from the list of M terms. For an example,

           2 of(w1 , w2 , w3 ) = ((w1 and w2 ) or(w1 and w3 ) or(w2 and w3 ))

    In this work, we develop genetic programming for implementing GA operators
with variable length of chromosomes and mixture symbolic of information, like
real values and Boolean queries values.
    Each chromosome from the initial population represented a tree structure
for one query; an index was defined for each node in the tree. Genetic operators
were operated over individuals. Queries will be encoded as trees, where each
chromosome contains set of genes, and each gene mention to be a node in a
tree and the value for each node known as allele. An example that show query
encoding for chromosome in the population shown in Figure 1.


                     Fig. 1. Chromosome encoding form a query
                                                                               57

5   Implement Genetic Operators to Evolve Boolean
    Queries

Genetic operators used in our work to evolve Boolean queries. Presenting for
these operators Fitness, Selection, Crossover, and Mutation follows:


Fitness function operator

For each individual the value of precision and recall will be computed and known
as fitness values see RecallF itness and P recisionF itness respectively, this de-
pends on the number of relevance documents rd in the collection of documents
to the user query, number of retrieved document fd , and α and β are arbitrary
weights. Where α and β are added specially to precision fitness function [10].
                                             P
                                                   [r × fd ]
                          ReallF itness =        Pd
                                                 d
                                                   d [rd ]

                                        P              P
                                    α      [rd × fd ] β d [rd × fd ]
             P recisionF itness =        P
                                         d
                                                     +  P
                                            d [rd ]        d [fd ]


Selection operator

Two individuals with best fitness values are chosen from a population, but if
there are more than two individuals with the same highest fitness values, then
two of them will be chosen randomly. The two selected chromosomes will be
called parents and they will be used to produce two new offsprings.


Crossover operator

Offsprings must have some inheritance from the two parents; single point crossover
will do that by exchange subtree from parent1 with subtree from parent2. Posi-
tions for exchanging subtree1 and subtree2 will be select randomly. In our work
we define the selection of the position for subtree to be:

1. The root node of the tree.
2. Each Boolean operator node.
3. Each leaf from the tree.

   Producing two new offsprings from implementation of a single point crossover
was shown as an example in Figure 2.
58


          Fig. 2. Single point crossover, Randomly select nodes on parents


Mutation operators
Mutation, random perturbation in the chromosome representation, is necessary
to assure that the current generation is connected to the entire search space,
and it is necessary to introduce new genetic material into a population that has
stabilized level [10]. In our implementation, mutation operator works as the most
important operator for the evolutionary learning of Boolean query.
    Each node from the new offsprings may be mutated; that depends on muta-
tion value (by default 0.2). And we work with different type of mutations shown
below:

 – Mutation on Boolean operator: randomly exchanging one operator to an-
   other but both must be from the same arity, such as any exchange in ( and,
   or, of and not) are allowed.
 – Mutation on term node or leaf node: changing one term selected randomly
   from the offspring by any another one but the other one will be one from:
     • The terms in a given collection of documents
     • The terms in an initial population.
     • A specified list of terms.
     • The terms appeared in the user query.
 – Mutation by inserting or deleting unary operator between two nodes in the
   offspring.

    Where mutation was implemented on this way: For generated offspring select
one node randomly and for this node we have two possibilities to mutate into
another one or to apply insert an unary operator before it or delete it if and only
if this node is an unary operator. Some examples were shown for mutations in
Figure 3.
                                                                                     59


    Fig. 3. Processing mutation over offsprings, where nodes are selected randomly


6     Experiments
Presenting our work now to show how our research processed for Boolean queries
evolutionary learning was done.

6.1    Introduction to experiments
We developed a genetic programming to process some experiments over a set
of Boolean queries and various collections of documents; the documents are
with various number of words or terms. All collections used in our experiments
are described in Table 1, where collection ’Collection-1’ consists of 10 different
documents and 30 different words, where each document includes some of these
words (one, two or more of them).

                Table 1. Description of Document Collections
Collection Name Number of Documents Maximum number of terms
                                                 in each document
   Collection-1                10                           30
   Collection-2                50                          200
   Collection-3              5000                         1000
60

    For all of our experiments were used the following ten Boolean queries as an
initial population for processing our genetic programming:

2 of(w2 , w8 )
1 of(w1 , w2 , w8 )
not(not w13 and not w8 )
(w1 and(w2 and w8 )) or not(w4 or w2 )
not(w1 or w2 ) and((w5 or w4 ) and(w3 and w6 ))
(w9 and w14 )
(not w14 ) and w1
(w2 or w6 ) or(w8 and w13 )
(w3 and w4 ) or((w1 2 xor w15 ) and w8 )
(w2 or w8 ) or(w1 and w2 )

    The Genetic programming execution was terminated when one or more chro-
mosomes from the population reached the highest value of the selected fitness
function, or when reaching a maximum number of generations, where the highest
values for precision and recall are α + β and 1 respectively.
    All experiments were done multiple times with the same options to see the
differences in the results, because of probability used during genetic program-
ming process. In all experiments were used following fixed options:

 – the arbitrary weights for α = 0.25, and β = 1.0
 – crossover value = 0.8


6.2   Experiments on different types of mutation and different fitness
      functions

Mutation value is probability of applying mutation operator on offsprings. In
these experiments we observed how the changes of mutation value affect the
result of genetic programming process, where we used different types of mutation
as described above and two different sets of options where implemented on our
experiments.
   First set of options are:

 – User query is:- (w6 and w8 ) and not w10
 – Collection name is:- Collection-1
 – Used fitness measures are:- precision or recall
 – Highest number of generations are: - 200 generations.


Experiments on using precision fitness function:

All terms from the initial population were used for mutation on leaves, the results
were shown in table 2. Where in all experiments the chromosomes fitness values
in the final population reached to be highest value, and the same for recall fitness
value is the highest too, where the number of generations was variant.
                                                                               61

Table 2. Precision, mutation over leaves using terms from all initial population
  mutation value number of generations final precision final recall
         0.1                   200                   0.75            1.00
         0.2                    24                   1.25            1.00
         0.3                    27                   1.25            1.00
         0.4                   118                   1.25            1.00
         0.5                    45                   1.25            1.00


    All terms from the user query were used for mutation on leaves, and the
results were shown in Table 3. In this case, mostly maximum number of gener-
ations were reached to obtain the optimized query, because of not reaching the
highest value on the selected fitness function, when the highest values for recall
were reached.

  Table 3. Precision, mutation over leaves using terms from user query only
   mutation value number of generations final precision final recall
         0.1                   200                  0.75          1.00
         0.2                   200                  0.75          1.00
         0.3                   200                  0.75          1.00
         0.4                   200                  0.75          1.00
         0.5                   113                  1.25          1.00


   All terms form the whole collection were used for mutation of leaves, and the
results were shown in Table 4. Where in some experiments the termination of
the program execution was done because of reaching the maximum number of
generations or reaching the highest precision fitness value, where mostly in all
experiments we reached the highest values for recall fitness values.

  Table 4. Precision, mutation over leaves using terms from whole collection

   mutation value number of generations final precision final recall
        0.1               200                 0.75          1.00
        0.2                58                 1.25          1.00
        0.3                11                 1.25          1.00
        0.4               187                 1.25          1.00
        0.5                42                 1.25          1.00

Experiments on using recall fitness function:
All chromosomes in the final population had approximately the same highest
value of recall, but mostly the values of precision are various, some of these
results are shown in tables bellow. Table 5 shows the results when the mutation
over leaves used terms from user query only, table 6 shows the results when the
mutation over leaves used terms from initial population, and table 7 shows the
results when the mutation over leaves used terms from whole population.
 62

        Table 5. Recall, mutation over leaves using terms from user query
      mutation value number of generations final precision final recall
           0.1                    5                   0.583         1.00
           0.2                    5                   0.500         1.00
           0.3                    5                   0.500         1.00
           0.4                    5                   0.583         1.00
           0.5                    6                   0.583         1.00


      Table 6. Recall, mutation over leaves using terms from initial population
      mutation value number of generations final precision final recall
            0.1                    5                   0.583           1.00
            0.2                    5                   0.500           1.00
            0.3                    5                   0.500           1.00
            0.4                    5                   0.583           1.00
            0.5                    7                   0.583           1.00


      Table 7. Recall, mutation over leaves using terms from whole collection
      mutation value number of generations final precision final recall
            0.1                   5                   0.583          1.00
            0.2                   5                   0.500          1.00
            0.3                   6                   0.500          1.00
            0.4                   8                   0.583          1.00
            0.5                   5                   0.583          1.00

     From these results shown in tables 5, 6 and 7 the executions of program were
 terminated when the highest recall fitness function values of chromosomes were
 reached within few number of generations.
     Some experiments were done using the second set of options with the follow-
 ing results shown in Table 8 and 9.

      Second set of options are:
   – User query is:- ((not w10 ) and(w6 and w8 ))
   – Collection name is:- Collection-2
   – Used fitness measures are:- precision or recall
   – Highest number of generations are: - 1200 generations.
Table 8. Precision, mutation over leaves using terms from user query, Collection-2
    mutation value number of generations final precision final recall
           0.1                  1200                   1.00         1.00
           0.2                  1200                   0.75         1.00
           0.3                   197                   1.25         1.00
           0.4                   462                   1.25         1.00
           0.5                  1200                   0.75         1.00
                                                                                63

Table 9. Recall, mutation over leaves using terms from user query, Collection-2
   mutation value number of generations final precision final recall
         0.1                    16                  0.3050         1.00
         0.2                    13                  0.3050         1.00
         0.3                    23                  0.3050         1.00
         0.4                    11                  0.3050         1.00
         0.5                    17                  0.3050         1.00


    After increasing number of generations and experiments were done on Collection-
2, there were a differences in the results because in some cases we reached the
best solution depends on precision fitness function as shown in table 8 before,
where the results in table 9 shown before mostly all experiments were reached
the highest value of recall within a few number of generations.


Experiments over fitness functions:

The goal of optimization process of a Boolean query is to get a query with highest
possible values of precision or recall depends on chosen fitness function. Results
shown above demonstrate, that when we used recall as the fitness function, the
program terminated within few number of generations because of reaching the
highest value of recall as a fitness function, and when using precision as the
fitness function recall has reached the highest value mostly in all experiments,
where some of precision values were not high.


7   Conclusions

In this paper, an optimization of Boolean query over a collection of documents
is presented. We focused especially on different types of mutation and on com-
parison of two different fitness measures, precision and recall. Experiments were
done over various types of document collections and different types of mutation
and two types of fitness functions. So we obtained the following conclusions:
    First, when applying mutation operator on terms in the chromosomes from
the initial population, it is necessary to have all the terms from the search space
at disposal for mutation. If only terms from user query or initial population were
used for mutation, the results were worse than when terms from whole collection
were used. Only then there can come into existence new queries, describing the
same documents as user query, but containing terms not included into user query
or initial population.
    Second, recall seems to be more efficient than precision when chosen as a fit-
ness function to reach an optimized query within less number of generations than
when precision was chosen as a fitness function. So we retrieved all relevant doc-
uments with few number of non-relevant documents. But on choosing precision
64

as a fitness function, we reached mostly the highest values of recall before pro-
gram termination when the highest values of precision were mostly not reached
and the process terminated after reaching maximum number of generations.
    Third, in some cases, especially when we used for mutation over leaves the
terms from collection or from initial population on two different types of fitness
functions an optimized query was reached within few number of generations, but
on chosen recall as fitness function the results were reached within less number of
generations than when precision was used as a fitness function. But for mutation
over leaves the terms from user query only and the fitness function was precision
there were worse results than in other cases.
    We will focus in our future work on weighted terms and weighted Boolean
operators for implementing the fuzzy logic over terms and Boolean operators
weights for optimizing user query in information retrieval systems, and also on
using different methods for evaluating the performance of information retrieval
such as Harmonic mean measure (F-score). We also want to consider the num-
ber of Boolean operators and the number of terms as the objectives for query
optimization.


References

 1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley,
    New York (1999)
 2. Chen, H.: A machine learning approach to inductive query by examples: an exper-
    iment using relevance feedback, ID3, genetic algorithms, and simulated annealing,
    Journal of the American Society for Information Science 49:8 (1998) 693–705
 3. Cordon, O., Herrera-Viedma, E., Luque¡ M.:Evolutionary Learning of Boolean
    Queries by Multiobjective Genetic Programming. J.J. Merelo Guervos et al. (Eds.):
    PPSN VII, LNCS 2439, Springer-Verlag Berlin Heidelberg (2002) 710–719
 4. Freytag, J. C.: A Rule-Based View of Query Optimization. Proceedings of ACM-
    SIGMOD (1987) 173–180
 5. Goldberg, D. E.: Genetic Algorithms in Search, Optimization and Machine Learn-
    ing. Reading, Massachusetts: Addison-Wesley (1989)
 6. Abbass, H. A., Sarker, R., Newton, C.: PDE:A Pareto-frontier Differential Evo-
    lution Approach for Multi-objective Optimization Problems, Proceedings of the
    Congress on Evolutionary Computation 2001 (CEC’2001), Vol. 2, IEEE Service
    Center, Piscataway, New Jersey (2001) 971–978
 7. Kim, W.: On Optimizing an SQL-like Nested Query. ACM Transactions on
    Database Systems 7 (1982) 443–469
 8. Korfhage, R. R.: Information Storage and Retrieval. John Wiley & Sons, Inc. (1997)
 9. Koza, J.: Genetic programming. On the programming of computers by means of
    natural selection, The MIT Press (1992)
10. Kraft, D.H., Bordogna, G., Pasi, G.: Fuzzy Set Techniques in Information Re-
    trieval, in Bezdek, J.C., Didier, D. and Prade, H. (eds.), Fuzzy Sets in Approximate
    Reasoning and Information Systems, vol. 3, The Handbook of Fuzzy Sets Series,
    Norwell, MA: Kluwer Academic Publishers (1999)
11. McGoveran, D.: Evaluating Optimizers. Database Programming and Design (1990)
    38–49
                                                                                   65

12. Melanie, M.: An Introduction to Genetic Algorithms. A Bradford Book The MIT
    Press (1999)
13. Rijsbergen, C.J.: Information Retrieval (2nd edition), Butterworth (1979)
14. Salton, G., Buckley, C.: Terms-Weighting approach in automatic text retrieval.
    Information Processing and management (1988) 24(5):513–523
15. Smith, M.P., Smith, M.: The use of genetic programming to build Boolean queries
    for text retrieval through relevance feedback, Journal of Information Science 23:6
    (1997) 423–431
16. Owais, S. S. J.: Timetabling of Lectures in the Information Technology College at
    Al al-Bayt University Using Genetic Algorithms. Master thesis, Al al-Bayt Uni-
    versity, Jordan (2003) (in Arabic).
17. Yao, S. B.: Optimization of Query Algorithms. ACM Transactions on Database
    Systems 4 (1979) 133–155
18. Zitzler, E.: Evolutionary Algorithms for Multiobjective Optimization: Methods and
    Applications. Swiss Federal Institute of Technology Zurich, Zurich (1999)