=Paper= {{Paper |id=Vol-205/paper-1 |storemode=property |title=A Framework for the study of Evolved Term-Weighting Schemes in IR |pdfUrl=https://ceur-ws.org/Vol-205/paper1.pdf |volume=Vol-205 }} ==A Framework for the study of Evolved Term-Weighting Schemes in IR== https://ceur-ws.org/Vol-205/paper1.pdf
  A Framework for the study of Evolved Term-Weighting
          Schemes in Information Retrieval
                                              Ronan Cummins and Colm O’Riordan 1


Abstract. Evolutionary algorithms and, in particular, Genetic Pro-         shown to increase the performance of IR systems [16]. These tech-
gramming (GP) are increasingly being applied to the problem of             niques only work when the ranked lists from the different retrieval
evolving term-weighting schemes in Information Retrieval (IR). One         systems return different ranked lists. Thus, when new term-weighting
fundamental problem with the solutions generated by these stochas-         schemes are developed it is important, in many respects, to deter-
tic processes is that they are often difficult to analyse. A number of     mine if these new schemes are similar to existing ones in terms of
questions regarding these evolved term-weighting schemes remain            the ranked lists produced, or if indeed they belong to a new family of
unanswered. One interesting question is; do different runs of the GP       weighting scheme.
process bring us to similar points in the solution space?                     This paper presents a framework for evaluating the distance
   This paper deals with determining a number of measures of the           between the ranked lists produced from different term-weighting
distance between the ranked lists (phenotype) returned by differ-          schemes in order to understand the relative closeness of these
ent term-weighting schemes. Using these distance measures, we de-          schemes. We develop two different distance measures and show that
velop trees that show the phenotypic distance between these term-          they are useful in determining how the term-weighting schemes are
weighting schemes. This framework gives us a representation of             expected to perform in a general environment. We use these dis-
where these evolved solutions lie in the solution space.                   tance measures to create trees visualizing the distances between the
   Finally, we evolve several global term-weighting schemes and            weighting schemes.
show that this framework is indeed useful for determining the rel-            Section 2 of this paper introduces term-weighting schemes useful
ative closeness of these schemes and for determining the expected          for determining the discrimination value of a term. Section 3 intro-
performance on general test data.                                          duces the GP process and existing approaches using GP to evolve
                                                                           term-weighting schemes are also discussed. Section 4 introduces our
                                                                           framework and outlines two distance measures. Our experimental
1 INTRODUCTION                                                             setup is outlined in section 5 while section 6 discusses our results.
Information retrieval (IR) is concerned with the return of relevant        Finally, our conclusions and future work are summarised in section
documents from a collection of unstructured documents given a user         7.
need. It has been recognized that the effectiveness of vector space ap-
proaches to IR depend crucially on the term weighting applied to the       2 INFORMATION RETRIEVAL
terms of the document vectors [15]. These term-weights are typically
calculated using term-weighting schemes that assign values to terms        2.1 Term-Weighting for vector models
based on how useful they are likely to be in determining the relevance     Term-weighting schemes assign values to terms based on measures
of a document. Documents are scored in relation to a query using one       of the term in both a global (collection-wide) and local (document-
of these term-weighting schemes and are returned in a ranked list for-     specific) context. Yu and Salton [19] suggest that the best distinguish-
mat.                                                                       ing terms are those which occur with a high frequency in certain doc-
   Genetic Programming (GP) is a biologically inspired search algo-        uments but whose overall frequency across a collection is low (low
rithm useful for searching large complex spaces. Inspired by the the-      document frequency). They conclude from this that a term weight-
ory of natural selection, the GP process creates a random population       ing function should vary directly with term frequency and inversely
of solutions. These solutions, encoded as trees, undergo generations       with document frequency. The idf scheme, first introduced by Sparck
of selection, reproduction and mutation until suitable solutions are       Jones [17], gives a higher weight to terms that occur in fewer docu-
found. As GP is a non-deterministic algorithm it cannot be expected        ments. The original idf measure is often calculated as follows:
to produce a similar solution each time. Restart theory in GP sug-
                                                                                                                 N +1
gests that it is necessary to restart the GP a number of times in order                             idf = log(        )                        (1)
to achieve good solutions [9]. As a result, an important question re-                                             dft
garding the solutions generated by the GP process is; do all the good      where N is the number of documents in the collection and dft is
solutions behave similarly or is the GP bringing us to a different area    the number of documents containing term t. A modern weighting
in the solution space each time?                                           scheme developed by Robertson et al. [13] is the BM25 weighting
   Recently, IR fusion techniques, that use the rankings from several      scheme. The global part of this weighting scheme is a variation of
retrieval systems to determine the final document ranking, have been       the traditional idf measure and is calculated as follows:
1 University of Ireland, Galway. email: ronan.cummins@nuigalway.ie, col-                                      N − dft + 0.5
                                                                                              idfrsj = log(                 )                  (2)
  mor@it.nuigalway.ie                                                                                           dft + 0.5
The idf measure forms the basis of many modern term-weighting              It is important to gain an understanding of the solutions obtained
schemes as it determines what initial weight a search term should          from these evolutionary processes and have a means of rating the
receive [12]. It is worth noting that documents are typically not re-      differences between the schemes.
trieved by idf only, and are usually used in conjunction with local           In [7], differences in retrieval systems are analysed using the
measures to aid retrieval performance. However, if we can firstly find     ranked lists returned from the various systems. The distance between
out what initial weight a search term should be given, we can then         two ranked lists is measured using the number of out-of-order pairs.
improve upon this by looking at the within-document characteristics        Using the measure it can then be determined if two systems are in
to further improve retrieval performance. Developing global weight-        essence the same (i.e. if they return the same ranked lists for a set of
ing schemes separately has been shown to benefit the performance           queries). Spearman’s rank correlation and Kendall’s tau are two com-
of IR systems [11, 4, 14] and is an important goal in developing           mon correlations that measure the difference between ranked sets of
full weighting schemes which include local characteristics, like term-     data. Both Spearman’s rank correlation and Kendall’s tau use all of
frequency and document normalisation. These idf type schemes are           the ranked data in a pair of ranked lists.
also used in many other domains within IR to weight features (e.g.
document classification).
                                                                           4 FRAMEWORK
                                                                           4.1 Phenotypic Distance Measures
3 GENETIC PROGRAMMING
                                                                           Figure 1 shows how the GP paradigm is adopted to evolve term-
Genetic Programming [8] is a stochastic searching algorithm, in-           weighting schemes in IR. We use mean average precision (MAP)
spired by natural selection. In the GP process, a population of solu-      as our fitness function as it is a commonly used metric to evaluate
tions is created randomly (although some approaches seed the initial       the performance of IR systems and is known to be a stable measure
population with certain known solutions). The solutions are encoded        [1]. Furthermore, it has been used with success in previous research
as trees and can be thought of as the genotypes of the individuals.        evolving term-weighting schemes in IR [6, 18].
Each tree (genotype) contains nodes which are either functions (op-
erators) or terminals (operands). Each solution is rated based on how
it performs in its environment. This is achieved using a fitness func-
tion. Having assigned the fitness values, selection can occur. Individ-               Genetic Programming terminology for evolving
uals are selected for reproduction based on their fitness value. Fitter                  term-weighting for Information Retrieval
solutions will be selected more often.
   Once selection has occurred, reproduction can start. Reproduc-                           mean average
tion (recombination) can occur in variety of ways. Crossover is the                                                   fitness value
                                                                                              precision
main reproductive mechanism in GP. When two solutions are se-
lected from the selection process, their genotypes are combined to
create a new individual. One point crossover is the norm for genetic                            sets of
                                                                                                                       phenotype
programming. This is where a single point is located in both parents                          ranked lists
and the sub-trees are swapped at these points to create two new so-
lutions. Mutation (asexual reproduction) is the random change of the
value of a gene (or the change of a subtree) to create a new individual.                    term-weighting
                                                                                                                       genotype
                                                                                                scheme
   Selection and recombination occurs until the population is re-
placed by newly created individuals. Once the recombination process
is complete, each individual’s fitness in the new generation is evalu-
ated and the selection process starts again. The process usually ends
                                                                                           Figure 1.    GP for Information Retrieval
after a predefined number of generations. Bloat is a common phe-
nomenon in GP. Bloat is where solutions grow in size without a cor-
responding increase in fitness.
                                                                              For our framework, we measure the phenotype of our solutions by
3.1 Phenotype                                                              examining the sets of ranked lists returned by the term-weighting so-
                                                                           lution for a set of topics on a document collection (its environment).
The phenotype of the individual is often described as its behaviour.       Spearman’s rank correlation uses all available document ranks from
Selection occurs based on the fitness only. Fitness is determined by       two ranked lists and not just the ranks of relevant documents. We
the phenotype which is in turn determined by the genotype. As one          wish to develop distance measures for the parts of the ranked lists
can imagine, different genotypes can map to the same phenotype, and        which affect the MAP (fitness) of a solution. This is important as the
different phenotypes can have the same fitness. For most problems in       rank of relevant documents is the only direct contributing factor to
GP in an unchanging environment, identical genotypes will map to           the fitness of individuals within the GP.
identical phenotypes which will have the same fitness.                        To compare two sets of ranked lists, we introduce a measure which
                                                                           essentially measures the average difference between the ranks of rel-
                                                                           evant documents in two sets of ranked lists. In this measure, we ig-
3.2 Previous Research
                                                                           nore the ranks of non-relevant documents as they do not contribute
GP techniques have previously been adopted to evolve weighting             to the fitness although they do technically contribute to the pheno-
functions and are shown to outperform standard weighting schemes           type of the individual. This measure will tell us if the same relevant
in an adhoc framework [6, 10, 18, 4]. However, in many of these ap-        documents are being retrieved at, or close to, the same ranks and
proaches a critical analysis of the solutions evolved is not presented.    will tell us if the weighting schemes are evolving towards solutions
that promote similar features of relevant documents. Thus, one of the        5 EXPERIMENTAL SETUP
phenotypic distance measures (dist(a, b)), where a and b are two
weighting schemes, is defined as follows:                                    5.1 Approach Adopted
                         (
                             |lim − ri (b)|      if ri (a) > lim             We evolve global term-weighting schemes in the following frame-
                                                                             work:
               P
           1
           R       i∈R
                             |ri (a) − lim|      if ri (b) > lim
                             |ri (a) − ri (b)|   otherwise                                                        X
                                                                                                  score(d, q) =       (gwt × qtf )               (3)
   where R is the set of relevant documents in the collection for all
of the queries used and ri (a) is the rank position of relevant docu-           where score(d, q) is the score a document d recieves in relation to
ment i under weighting scheme a. lim is the maximum rank posi-               a query q, gwt is the global weighting and qtf is the frequency of the
tion available from a list and is usually 1000 (as this is the usually       term in the query. All documents in the collection are scored in rela-
the maximum rank for official TREC runs). As a result, relevant doc-         tion to the query and ranked accordingly. We are only evolving the
uments that are ranked outside the top 1000 are treated as being at          global (term-discrimination) part of the weighting scheme as an ex-
rank 1000. Thus, when comparing two schemes this measure will                ample of our framework. However, the entirety of the term-weighting
tell us how many rank positions, on average, a relevant document is          scheme can be evolved and analysed in a similar manner.
expected to change from scheme a to scheme b . Although different
parts of the phenotype will impact on the fitness in different amounts
(i.e. changes of rank for relevant documents at positions near 1000          5.2 Training and Test Collections
do not significantly effect the MAP) they are an important part in dis-      We use collections from TREC disks 4 and 5 as our test collections.
tinguishing the behaviour of the phenotype. The change in position           A different set of 50 TREC topics is used for each of the collections
at high ranks can tell us about certain features of weighting scheme         (apart from the Federal Register collection (FR) for which we use
and the behaviour at these ranks.                                            100 TREC topics). For each set of topics we create a medium length
   We also develop a second measure of the distance between two              query set (m), consisting of the title and description fields, and a
ranked lists which takes into account the effect a change in rank has        long query set (l) consisting of the title, description and narrative
on MAP. To measure the actual difference a change in rank could              fields. We also use documents from the OHSUMED collection as
make in terms of MAP, we modify the dist(a, b) measure so that the           a test collection for medium length queries (OH90-91). We only use
change in rank of a relevant document is weighted on how it effects          the topics in these sets that have relevant documents in the collection.
MAP. This weighted distance measure (w dist(a, b)) is similar to                The TRAIN collection (used in training) consists of 35,412 docu-
the measure described in [2] and is calculated as follows:                   ments from the OHSUMED collection and the 63 topics. The lengths
                                
                                       1                                     of these topics range from 2 to 9 terms. Standard stop-words from the
                                        − ri1(b)      if ri (a) > lim
                                     lim
                                                                             Brown Corpus2 are removed and remaining words are stemmed using
                                
                                
                                
                                       1        1
       P          P
     1
     Q
                1
           q∈Q Rq        i∈Rq        ri (a)
                                            − lim      if ri (b) > lim       Porter’s algorithm. No additional words are removed from the narra-
                                
                                      1
                                                                             tive fields as is the case in some approaches. Table 1 shows some
                                            − ri1(b)   otherwise
                                
                                                                             characteristics of the document collections used in this research.
                                
                                     ri (a)

   where Q is the number of queries and Rq is the relevant documents
for a query q. This measure tells us how a change in rank of a rele-                              Table 1. Document Collections
vant document will affect the MAP (i.e. changes of rank at positions
close to 1000 will not change the MAP significantly, while changes                   Collection    #Docs #words/doc    #Topics medium long
of rank in the top 10 may change MAP considerably). Of course, it is                 TRAIN         35,412      72.7     0-63    4.96  None
entirely possible that two ranked lists could be considerably different              LATIMES      131,896     251.7    301-350   9.9  29.9
yet have a similar MAP, as they may be promoting different relevant                  FBIS         130,471     249.9    351-400   7.9  21.9
                                                                                     FT91-93      138,668     221.8    401-450   6.5  18.7
documents.                                                                           FR            55,630     387.1    301-400   8.9  25.9
                                                                                     OH90-91      148,162      81.4     0-63    4.96  None

4.2 Neighbour-joining trees
Neighbour-joining is a bottom-up clustering method often used for
the creation of phylogenetic trees. However, we use the method to            5.3 Terminal and Function Set
produce trees that represent solutions that are from different runs of       Tables 2 and 3 show the functions and terminals that are used in all
our GP. The algorithm requires knowledge of the distance between             runs of the GP.
entities that are to be represented in the tree. A distance matrix is cre-
ated for the set of entities using a distance measure and the tree can
then be produced from the resulting data. We use this clustering tech-       5.4 GP parameters
nique to visualize the phenotypic distance between the best solutions        We use MAP as our fitness function. All tests are run for 50 gen-
output by our GP. For example, if we have N entities or solutions,           erations with an initial random population of 100 solutions on the
we can create an N × N distance matrix using one of our distance             training collection (TRAIN) detailed in Table 1. The tournament size
measures. Then, using this distance matrix, we can then create a tree        is set to 3. We restrict all trees to a depth of 6. As a result this has
using a suitable drawing package [3] which represents the data and           the effect of reducing bloat, improving generalisation, reducing the
can provide a visualisation into where our solutions lie in relation to      search space and increasing the speed of the GP. As our highest order
each other. This model is also well suited to our evolutionary para-         operator is binary, the longest individual we can have can contains 63
digm. We use this technique simply to visualise the distance between
                                                                             2 http://www.lextek.com/manuals/onix/stopwords1.html
our term-weighting solutions which are developed using GP.
                                                                                                         Best and Average of the Population from two runs of the GP
                              Table 2. Function Set                                      0.24


                                                                                         0.22


                                                                                          0.2
                        Function Description
                                                                                         0.18
                        +, ×, /, - arithmetic functions
                        log        the natural log                                       0.16




                                                                                   MAP
                        √
                                   square-root function                                  0.14
                        sq         square
                                                                                         0.12


                                                                                          0.1

                                                                                                                                                                  gw1_best
                                                                                         0.08
                              Table 3. Terminal Set                                                                                                                gw1_avg
                                                                                                                                                                  gw2_best
                                                                                                                                                                   gw2_avg
                                                                                         0.06
                                                                                                0        10                20                 30                 40              50
                                                                                                                                Generations

         Terminal Description
         N            no. of documents in the collection                           Figure 2. Best and Average Fitness for best two global runs
         df           document frequency of a term
         cf           collection frequency of a term
         V            vocabulary of collection (no. of unique terms)
         C            size of collection (total number of terms)             as a comparison to our distance measures that only look at distances
         0.5          the constant 0.5                                       of relevant documents.
         1            the constant 1
         10           the constant 10                                           The values in Table 6 indicate the average number of rank posi-
                                                                             tions a relevant document changes. While the values in Table 7 in-
                                                                             dicate the maximum possible percentage MAP difference between
genes (26 − 1). We believe that this is a large enough space in which        two schemes. By looking at the difference between the ranked lists
to find suitable term-weighting schemes. The creation type used is           of each global weighting we can get an idea of the landscape of the
the standard ramped half and half creation method used by Koza [8].          solution space in the global domain.
We use an elitist strategy where the best individual is automatically
transfered to the next generation. 4% mutation is used in our experi-        Table 5. 1− spearman’s rank correlation between all global weightings on
ments. Due to the stochastic nature of GP a number of runs is often                                        TRAIN
needed to allow the GP converge to a suitably good solution. We run
                                                                               Scheme idf            idfrsj gw1           gw2         gw3          gw4         gw5       gw6          gw7
the GP seven times and choose the best solution from each of those
runs. This gives us seven evolved solutions and two benchmark solu-            idf         00.00 0.014        0.524 0.501 0.503 0.396                          0.007     0.007        0.078
                                                                               idfrsj            00.00        0.531 0.507 0.507 0.400                          0.020     0.020        0.089
tions (1) (2) to use with our document collections.                            gw1                            00.00 0.059 0.186 0.174                          0.523     0.523        0.451
                                                                               gw2                                  00.00 0.210 0.136                          0.499     0.499        0.425
                                                                               gw3                                        00.00 0.170                          0.501     0.501        0.435
                                                                               gw4                                              00.00                          0.393     0.393        0.324
6 RESULTS                                                                      gw5                                                                             00.00     00.00        0.076
                                                                               gw6                                                                                       00.00        0.076
Table 4 shows the MAP of seven global evolved weighting schemes                gw7                                                                                                    00.00
(gw) on our training data in no particular order. We can see that all
the evolved schemes are better than our benchmarks (idf and idfrsj )             Table 6. dist measure between all global weightings on TRAIN
in terms of MAP.

                                                                               Scheme idf            idfrsj gw1           gw2         gw3          gw4         gw5       gw6          gw7
                  Table 4. % MAP for all global weightings
                                                                               idf         00.00 01.27        36.50 35.07 32.81 30.31                          03.65     03.65        11.15
                                                                               idfrsj            00.00        35.94 34.32 31.90 29.42                          04.52     04.52        12.01
                                                                               gw1                            00.00 05.07 21.26 18.46                          35.34     35.34        28.13
 Collection idf      idfrsj     gw1   gw2   gw3   gw4     gw5   gw6    gw7     gw2                                  00.00 20.60 15.96                          34.15     34.15        27.08
                                                                               gw3                                        00.00 06.62                          30.48     30.48        24.66
 TRAIN      19.83 19.98         22.05 21.98 21.60 21.69 20.11 20.11 20.75      gw4                                              00.00                          28.37     28.37        22.15
                                                                               gw5                                                                             00.00     00.00        09.16
                                                                               gw6                                                                                       00.00        09.16
                                                                               gw7                                                                                                    00.00
   Figure 2 shows the best and average of the population from the
two best runs of the GP (i.e. gw1 and gw2 ). It is worth noting that
the best individual from the seven randomly created populations (i.e.         Table 7.          w dist % measure between all global weightings on TRAIN
generation 0) is not better than the best solution produced after the
50th generation from the worst of the seven runs.
                                                                               Scheme idf            idfrsj gw1           gw2         gw3          gw4         gw5       gw6          gw7
   Tables 5, 6 and 7 shows the distance matrices for all the global
weighting schemes for the training data using Spearman’s rank cor-             idf         00.00 00.21        04.30 04.17 04.10 03.91                          00.99     00.99        01.78
                                                                               idfrsj            00.00        04.20 04.15 04.26 04.01                          01.10     01.10        01.86
relation, dist(a, b) and w dist(a, b) measures respectively. Spear-            gw1                            00.00 01.33 02.73 03.16                          04.23     04.23        04.25
                                                                               gw2                                  00.00 02.50 02.16                          04.09     04.09        04.03
man’s rank correlation gives us values in the range of −1 to +1 and            gw3                                        00.00 02.64                          03.96     03.96        03.95
uses all of the documents in the ranked list. As the Spearman corre-           gw4                                              00.00                          03.48     03.48        03.38
                                                                               gw5                                                                             00.00     00.00        01.18
lations of the ranked lists produced by the global weighting scheme            gw6                                                                                       00.00        01.18
                                                                               gw7                                                                                                    00.00
are all positively correlated, we simply use 1− Spearman’s rank cor-
relation as a distance measure. This will give us 1 if the lists are ran-
domly correlated and 0 if they are identical. We use this correlation
                                                                                                             similar. An important point to note is that as we get phenotypically
                                                                                                             further from the best solution (gw1 ) we see a relative drop in MAP
                                                                                                             on our training collection. This indicates that the solutions are evolv-
                                                                                                             ing towards the ranked lists (on the training set) that are produced by
                                                                                                             gw1 . Obviously, phenotypically close solutions will have a similar
                            gw
                                 3                                                                           fitness but it is not neccessarily true that solutions with a similar fit-
                                             gw
                                                   4                                                         ness will have a similar phenotype (e.g. as one can imagine that there
                                                                                                             exists many poor performing functions which return equally bad but
                                                                                                             different ranked lists). It is worth noting that these trees should be
                           gw2
                                                                                                             produced from the training data as this is the environment where the
                                 gw1                                                                         solutions were evolved. However, these trees can help us to predict


                                                                         gw7
                                                                                                             the behaviour of the schemes on general data (if our training data is
                                                                                                             a representative sample).
                                                                                     5
                                                                                  gw6                           Tables 8 and 9 show the MAP of all schemes for unseen test data
                                                                             idf  gw
                                                                                                             on medium and long queries. Firstly, we can see that the differences
                                                                             sj
                                                                       idf-r



                                                                                                             in MAP between the evolved weightings and idfrsj are all statisti-
                                                                                                             cally significant (p < 0.05) using a two-tailed t-test. Both version
                                                                                                             of idf perform similarly as expected. We can see that gw1 is no
                                                                                                             longer the best evolved weighting scheme, although it is still sig-
                                                                                                             nificantly better than idf . Schemes gw2 , gw3 and gw4 are now the
                     1−Spearman’s rank correlation                                                           best performing schemes on most of the collections. Schemes gw5
                                                                                                             and gw6 still perform only slightly better than idfrsj , while gw7 still
                                                                                                             performs slightly better than these again. It would seem that gw1 has
                                                                                                             overtrained slightly on the training collection. It is also worth point-
            gw                                                                     gw
                                                                                                             ing out that our training set seems to be quite general as most of the
              3                                                                          4
                      gw
                                                        gw
                                                             3                                               schemes perform similarly on test data. If we look at the genotypes
                        4
                                                                                                             of some of the schemes it leads us to a similar conclusion. We have
                                                                                                             re-written the following formulas in a more intuitive manner to pro-
gw                                                                                                   gw      vide transparency to the process. As a result, the re-written formulas
  2                                    gw                                                                7
                                            7          gw
                                                         2
                                                                                                             may also be shorter (in depth) that those that were evolved originally.
      gw                                                                                             g
        1                                    gw                                                      gww5
                                                                                                        6
                                             gw5
                                               6                 gw
                                                                   1                                                             √                                     √
                                       idfidf                                                idfidf                      V 2 cf 2 cf  p                            cf 2 cf
                                           -rs
                                               j                                                 -rs            gw1 =            3
                                                                                                                                     + cf                 gw2 =
                                                                                                     j
                                                                                                                            C.df                                     df 3

                                                                                                                                r
                                                                                                                                            cf 2 N        N2
                                                                                                                        gw3 =      (log(       )) ×    ×(    + 1)
                     dist                                                         w dist                                                    df      df    df
                 Figure 3. Neighbour-Joining trees for global weightings
                                                                                                                               r                            sr
                                                                                                                                   cf 3 N                          0.5
                                                                                                                      gw4 =                         gw5 =
                                                                                                                                    df 4                           df
   Firstly, from Figure 3 we can see that the phenotypic distance
measures produce trees of a similar structure. The only difference                                                            r√                          sp
in form is that gw3 and gw4 are clustered together directly using the                                                               df                          cf /N
unweighted dist measure. It is important to note that the trees visu-                                                 gw6 =                       gw7 =
                                                                                                                                   df                           df 2
alize different aspects of the ranked lists. For example, the distance
between the top four performing schemes (gw1 to gw4 ) and the re-                                               We can see that gw1 is a more specific form of gw2 . Schemes
maining schemes is greater in the tree created from Spearman’s rank                                          gw5 and gw6 are an example of two different genotypes producing
correlation than for the other two trees. This is because Spearman’s                                         the same ranked lists. gw6 will produce a score that is always double
rank correlation uses the ranks of non-relavant documents. Looking                                           that of gw5 . We are evolving towards a ranked list on the training
at the tree produced by the dist(a, b) measure, we can see that gw3                                          collection that is produced by the best two schemes (gw1 and gw2 ).
and gw4 are quite similar in terms of the actual ranks of relevant doc-                                      The gw2 scheme is a more general form of gw1 and performs con-
uments. However, when looking at the tree produced by w dist(a, b)                                           sistently better on our test data. The gw3 scheme contains a problem-
for these two schemes, we can see that some of these differences are                                         atic log(cf /df ) that will assign certain low frequency terms a zero
at low ranks as the possible difference in MAP is quite large.                                               weight [5] and makes it a poor choice for weighting in a retrieval
   In general, we can see that idfrsj , idf , gw5 and gw6 are pheno-                                         context. This can be seen on the results for the FR collection when
typically close. Schemes gw5 and gw6 are actually phenotypically                                             compared to one of its nearest neighbours gw4 . When looking at the
equivalent (i.e. return the same ranked lists) but not genotypically                                         individual queries for this collection (FR), we have determined that
equivalent. The two versions of idf are very close. Schemes gw1 and                                          the difference between gw4 and the other top schemes (gw1 to gw3 )
gw2 are also phenotypically close while gw3 and gw4 are somewhat                                             is only large for a very small number of queries. As a result it can be
                                             Table 8. % MAP for idf and global weightings for Medium Queries


                                Collection     Topics        idf     idfrsj    gw1      gw2       gw3       gw4       gw5     gw6      gw7
                                LATIMES        301-350 (m)   19.11   19.16     21.80    22.49     23.48     22.98     20.92   20.92    21.12
                                FBIS           351-400 (m)   10.30   10.41     15.16    15.68     14.55     14.33     11.61   11.61    11.72
                                FT91-93        401-450 (m)   27.38   28.15     27.52    27.86     27.56     27.92     27.04   27.04    27.10
                                FR             301-400 (m)   25.87   24.89     25.12    25.71     21.31     28.72     25.49   25.49    27.39
                                OH90-91        0-63 (m)      21.68   21.72     24.96    25.69     25.02     25.28     22.96   22.96    23.68
                                ≈ p-value      241 Topics    0.272   -         0.004    0.0001    0.0001    0.0001    0.018   0.018    0.021


                                              Table 9. % MAP for idf and global weightings for Long Queries


                              Collection     Topics        idf     idfrsj    gw1       gw2       gw3       gw4       gw5      gw6      gw7
                              LATIMES        301-350 (l)   13.57   13.79     21.60     24.27     24.78     24.30     16.37    16.37    16.63
                              FBIS           351-400 (l)   06.76   06.97     12.30     13.32     14.07     13.84     08.34    08.34    09.01
                              FT91-93        401-450 (l)   23.11   23.13     27.17     28.28     28.31     29.13     24.95    24.95    25.80
                              FR             301-400 (l)   16.23   16.95     22.78     22.75     20.86     27.83     19.84    19.84    19.92
                              ≈ p-value      241 Topics    0.300   -         0.0001    0.0001    0.0001    0.0001    0.0001   0.0001   0.0001

concluded that gw4 promotes certain useful features that are different                   [3] Jeong-Hyeon Choi, Ho-Youl Jung, Hye-Sun Kim, and Hwan-Gue
than those of the rest of the schemes. These differences are notice-                         Cho, ‘Phylodraw: a phylogenetic tree drawing system.’, Bioinformat-
                                                                                             ics, 16(11), 1056–1058, (2000).
able on the FR collection because of its makeup. The gw4 scheme                          [4] Ronan Cummins and Colm O’Riordan, ‘An evaluation of evolved term-
seems to be a particularly robust global weighting scheme as shown                           weighting schemes in information retrieval.’, in CIKM, pp. 305–306,
on the test data. The difference between gw4 and gw3 , for example,                          (2005).
is not statistically significant. However, we know that gw4 has ad-                      [5] Ronan Cummins and Colm O’Riordan, ‘Evolving general term-
vantagous retrieval features (as seen on the FR collection) for certain                      weighting schemes for information retrieval: Tests on larger collec-
                                                                                             tions.’, Artif. Intell. Rev., 24(3-4), 277–299, (2005).
(albeit few) queries.                                                                    [6] Weiguo Fan, Michael D. Gordon, and Praveen Pathak, ‘A generic rank-
                                                                                             ing function discovery framework by genetic programming for infor-
                                                                                             mation retrieval’, Information Processing & Management, (2004).
7 CONCLUSION                                                                             [7] P. Kantor, K. Ng, and D. Hull. Comparison of system using pairs-out-
                                                                                             of-order, 1998.
We have introduced two metrics that measure the distance between                         [8] John R. Koza, Genetic Programming: On the Programming of Comput-
the ranked lists returned by different term-weighting schemes. These                         ers by Means of Natural Selection, MIT Press, Cambridge, MA, USA,
measures are useful for determining the closeness of term-weighting                          1992.
schemes and for analysing the solutions without the need to analyse                      [9] Sean Luke, ‘When short runs beat long runs’, in Proceedings of the Ge-
                                                                                             netic and Evolutionary Computation Conference (GECCO-2001), pp.
the exact form (genotype) of a term-weighting scheme. This frame-                            74–80, San Francisco, California, USA, (7-11 2001). Morgan Kauf-
work can be used for all types of term-weighting schemes and also                            mann.
fits well into the genetic programming paradigm.                                        [10] N. Oren, ‘Re-examining tf.idf based information retrieval with genetic
   The distance matrices produced from these distance measures can                           programming’, Proceedings of SAICSIT, (2002).
                                                                                        [11] A. Pirkola and K. Jarvelin, ‘Employing the resolution power of search
be used to produce trees that aid visualization of the solution space.
                                                                                             keys’, J. Am. Soc. Inf. Sci. Technol., 52(7), 575–583, (2001).
The trees produced are also useful in determining the relative per-                     [12] S. E. Robertson and S. Walker, ‘On relevance weights with little rel-
formance of the solutions on general test data. We have also shown                           evance information’, in Proceedings of the 20th annual international
that all the evolved global weighting schemes produced are evolving                          ACM SIGIR conference on Research and development in information
to a area of the solution space that is different from the types of idf                      retrieval, pp. 16–24. ACM Press, (1997).
                                                                                        [13] Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu,
currently being used to measure the discrimination value of a term.                          Aarron Gull, and Marianna Lau, ‘Okapi at TREC-3’, in In D. K. Har-
In future work, we intend to apply this framework to analyse entire                          man, editor, The Third Text REtrieval Conference (TREC-3) NIST,
term-weighting schemes which have been evolved.                                              (1995).
                                                                                        [14] Dmitri Roussinov, Weiguo Fan, and Fernando A. Das Neves, ‘Dis-
                                                                                             cretization based learning approach to information retrieval.’, in CIKM,
ACKNOWLEDGEMENTS                                                                             pp. 321–322, (2005).
                                                                                        [15] Gerard Salton and Chris Buckley, ‘Term-weighting approaches in au-
This work is being carried out with the support of IRCSET (the Irish                         tomatic text retrieval’, Information Processing & Management, 24(5),
Research Council for Science, Engineering and Technology) under                              513–523, (1988).
                                                                                        [16] Alan F. Smeaton, ‘Independence of contributing retrieval strategies in
the Embark Initiative.                                                                       data fusion for effective information retrieval.’, in BCS-IRSG Annual
                                                                                             Colloquium on IR Research, (1998).
                                                                                        [17] Karen Sparck Jones, ‘A statistical interpretation of term specificity
REFERENCES                                                                                   and its application in retrieval’, Journal of Documentation, 28, 11–21,
                                                                                             (1972).
[1] Chris Buckley and Ellen M. Voorhees, ‘Evaluating evaluation measure
                                                                                        [18] Andrew Trotman, ‘Learning to rank’, Information Retrieval, 8, 359 –
    stability’, in SIGIR ’00: Proceedings of the 23rd annual international
                                                                                             381, (2005).
    ACM SIGIR conference on Research and development in information
                                                                                        [19] C. T. Yu and G. Salton, ‘Precision weighting - an effective automatic
    retrieval, pp. 33–40, New York, NY, USA, (2000). ACM Press.
                                                                                             indexing method’, Journal of the ACM, 23(1), 76–88, (1976).
[2] Ben Carterette and James Allan, ‘Incremental test collections’, in CIKM
    ’05: Proceedings of the 14th ACM international conference on Informa-
    tion and knowledge management, pp. 680–687, New York, NY, USA,
    (2005). ACM Press.