=Paper=
{{Paper
|id=Vol-205/paper-1
|storemode=property
|title=A Framework for the study of Evolved Term-Weighting Schemes in IR
|pdfUrl=https://ceur-ws.org/Vol-205/paper1.pdf
|volume=Vol-205
}}
==A Framework for the study of Evolved Term-Weighting Schemes in IR==
A Framework for the study of Evolved Term-Weighting
Schemes in Information Retrieval
Ronan Cummins and Colm O’Riordan 1
Abstract. Evolutionary algorithms and, in particular, Genetic Pro- shown to increase the performance of IR systems [16]. These tech-
gramming (GP) are increasingly being applied to the problem of niques only work when the ranked lists from the different retrieval
evolving term-weighting schemes in Information Retrieval (IR). One systems return different ranked lists. Thus, when new term-weighting
fundamental problem with the solutions generated by these stochas- schemes are developed it is important, in many respects, to deter-
tic processes is that they are often difficult to analyse. A number of mine if these new schemes are similar to existing ones in terms of
questions regarding these evolved term-weighting schemes remain the ranked lists produced, or if indeed they belong to a new family of
unanswered. One interesting question is; do different runs of the GP weighting scheme.
process bring us to similar points in the solution space? This paper presents a framework for evaluating the distance
This paper deals with determining a number of measures of the between the ranked lists produced from different term-weighting
distance between the ranked lists (phenotype) returned by differ- schemes in order to understand the relative closeness of these
ent term-weighting schemes. Using these distance measures, we de- schemes. We develop two different distance measures and show that
velop trees that show the phenotypic distance between these term- they are useful in determining how the term-weighting schemes are
weighting schemes. This framework gives us a representation of expected to perform in a general environment. We use these dis-
where these evolved solutions lie in the solution space. tance measures to create trees visualizing the distances between the
Finally, we evolve several global term-weighting schemes and weighting schemes.
show that this framework is indeed useful for determining the rel- Section 2 of this paper introduces term-weighting schemes useful
ative closeness of these schemes and for determining the expected for determining the discrimination value of a term. Section 3 intro-
performance on general test data. duces the GP process and existing approaches using GP to evolve
term-weighting schemes are also discussed. Section 4 introduces our
framework and outlines two distance measures. Our experimental
1 INTRODUCTION setup is outlined in section 5 while section 6 discusses our results.
Information retrieval (IR) is concerned with the return of relevant Finally, our conclusions and future work are summarised in section
documents from a collection of unstructured documents given a user 7.
need. It has been recognized that the effectiveness of vector space ap-
proaches to IR depend crucially on the term weighting applied to the 2 INFORMATION RETRIEVAL
terms of the document vectors [15]. These term-weights are typically
calculated using term-weighting schemes that assign values to terms 2.1 Term-Weighting for vector models
based on how useful they are likely to be in determining the relevance Term-weighting schemes assign values to terms based on measures
of a document. Documents are scored in relation to a query using one of the term in both a global (collection-wide) and local (document-
of these term-weighting schemes and are returned in a ranked list for- specific) context. Yu and Salton [19] suggest that the best distinguish-
mat. ing terms are those which occur with a high frequency in certain doc-
Genetic Programming (GP) is a biologically inspired search algo- uments but whose overall frequency across a collection is low (low
rithm useful for searching large complex spaces. Inspired by the the- document frequency). They conclude from this that a term weight-
ory of natural selection, the GP process creates a random population ing function should vary directly with term frequency and inversely
of solutions. These solutions, encoded as trees, undergo generations with document frequency. The idf scheme, first introduced by Sparck
of selection, reproduction and mutation until suitable solutions are Jones [17], gives a higher weight to terms that occur in fewer docu-
found. As GP is a non-deterministic algorithm it cannot be expected ments. The original idf measure is often calculated as follows:
to produce a similar solution each time. Restart theory in GP sug-
N +1
gests that it is necessary to restart the GP a number of times in order idf = log( ) (1)
to achieve good solutions [9]. As a result, an important question re- dft
garding the solutions generated by the GP process is; do all the good where N is the number of documents in the collection and dft is
solutions behave similarly or is the GP bringing us to a different area the number of documents containing term t. A modern weighting
in the solution space each time? scheme developed by Robertson et al. [13] is the BM25 weighting
Recently, IR fusion techniques, that use the rankings from several scheme. The global part of this weighting scheme is a variation of
retrieval systems to determine the final document ranking, have been the traditional idf measure and is calculated as follows:
1 University of Ireland, Galway. email: ronan.cummins@nuigalway.ie, col- N − dft + 0.5
idfrsj = log( ) (2)
mor@it.nuigalway.ie dft + 0.5
The idf measure forms the basis of many modern term-weighting It is important to gain an understanding of the solutions obtained
schemes as it determines what initial weight a search term should from these evolutionary processes and have a means of rating the
receive [12]. It is worth noting that documents are typically not re- differences between the schemes.
trieved by idf only, and are usually used in conjunction with local In [7], differences in retrieval systems are analysed using the
measures to aid retrieval performance. However, if we can firstly find ranked lists returned from the various systems. The distance between
out what initial weight a search term should be given, we can then two ranked lists is measured using the number of out-of-order pairs.
improve upon this by looking at the within-document characteristics Using the measure it can then be determined if two systems are in
to further improve retrieval performance. Developing global weight- essence the same (i.e. if they return the same ranked lists for a set of
ing schemes separately has been shown to benefit the performance queries). Spearman’s rank correlation and Kendall’s tau are two com-
of IR systems [11, 4, 14] and is an important goal in developing mon correlations that measure the difference between ranked sets of
full weighting schemes which include local characteristics, like term- data. Both Spearman’s rank correlation and Kendall’s tau use all of
frequency and document normalisation. These idf type schemes are the ranked data in a pair of ranked lists.
also used in many other domains within IR to weight features (e.g.
document classification).
4 FRAMEWORK
4.1 Phenotypic Distance Measures
3 GENETIC PROGRAMMING
Figure 1 shows how the GP paradigm is adopted to evolve term-
Genetic Programming [8] is a stochastic searching algorithm, in- weighting schemes in IR. We use mean average precision (MAP)
spired by natural selection. In the GP process, a population of solu- as our fitness function as it is a commonly used metric to evaluate
tions is created randomly (although some approaches seed the initial the performance of IR systems and is known to be a stable measure
population with certain known solutions). The solutions are encoded [1]. Furthermore, it has been used with success in previous research
as trees and can be thought of as the genotypes of the individuals. evolving term-weighting schemes in IR [6, 18].
Each tree (genotype) contains nodes which are either functions (op-
erators) or terminals (operands). Each solution is rated based on how
it performs in its environment. This is achieved using a fitness func-
tion. Having assigned the fitness values, selection can occur. Individ- Genetic Programming terminology for evolving
uals are selected for reproduction based on their fitness value. Fitter term-weighting for Information Retrieval
solutions will be selected more often.
Once selection has occurred, reproduction can start. Reproduc- mean average
tion (recombination) can occur in variety of ways. Crossover is the fitness value
precision
main reproductive mechanism in GP. When two solutions are se-
lected from the selection process, their genotypes are combined to
create a new individual. One point crossover is the norm for genetic sets of
phenotype
programming. This is where a single point is located in both parents ranked lists
and the sub-trees are swapped at these points to create two new so-
lutions. Mutation (asexual reproduction) is the random change of the
value of a gene (or the change of a subtree) to create a new individual. term-weighting
genotype
scheme
Selection and recombination occurs until the population is re-
placed by newly created individuals. Once the recombination process
is complete, each individual’s fitness in the new generation is evalu-
ated and the selection process starts again. The process usually ends
Figure 1. GP for Information Retrieval
after a predefined number of generations. Bloat is a common phe-
nomenon in GP. Bloat is where solutions grow in size without a cor-
responding increase in fitness.
For our framework, we measure the phenotype of our solutions by
3.1 Phenotype examining the sets of ranked lists returned by the term-weighting so-
lution for a set of topics on a document collection (its environment).
The phenotype of the individual is often described as its behaviour. Spearman’s rank correlation uses all available document ranks from
Selection occurs based on the fitness only. Fitness is determined by two ranked lists and not just the ranks of relevant documents. We
the phenotype which is in turn determined by the genotype. As one wish to develop distance measures for the parts of the ranked lists
can imagine, different genotypes can map to the same phenotype, and which affect the MAP (fitness) of a solution. This is important as the
different phenotypes can have the same fitness. For most problems in rank of relevant documents is the only direct contributing factor to
GP in an unchanging environment, identical genotypes will map to the fitness of individuals within the GP.
identical phenotypes which will have the same fitness. To compare two sets of ranked lists, we introduce a measure which
essentially measures the average difference between the ranks of rel-
evant documents in two sets of ranked lists. In this measure, we ig-
3.2 Previous Research
nore the ranks of non-relevant documents as they do not contribute
GP techniques have previously been adopted to evolve weighting to the fitness although they do technically contribute to the pheno-
functions and are shown to outperform standard weighting schemes type of the individual. This measure will tell us if the same relevant
in an adhoc framework [6, 10, 18, 4]. However, in many of these ap- documents are being retrieved at, or close to, the same ranks and
proaches a critical analysis of the solutions evolved is not presented. will tell us if the weighting schemes are evolving towards solutions
that promote similar features of relevant documents. Thus, one of the 5 EXPERIMENTAL SETUP
phenotypic distance measures (dist(a, b)), where a and b are two
weighting schemes, is defined as follows: 5.1 Approach Adopted
(
|lim − ri (b)| if ri (a) > lim We evolve global term-weighting schemes in the following frame-
work:
P
1
R i∈R
|ri (a) − lim| if ri (b) > lim
|ri (a) − ri (b)| otherwise X
score(d, q) = (gwt × qtf ) (3)
where R is the set of relevant documents in the collection for all
of the queries used and ri (a) is the rank position of relevant docu- where score(d, q) is the score a document d recieves in relation to
ment i under weighting scheme a. lim is the maximum rank posi- a query q, gwt is the global weighting and qtf is the frequency of the
tion available from a list and is usually 1000 (as this is the usually term in the query. All documents in the collection are scored in rela-
the maximum rank for official TREC runs). As a result, relevant doc- tion to the query and ranked accordingly. We are only evolving the
uments that are ranked outside the top 1000 are treated as being at global (term-discrimination) part of the weighting scheme as an ex-
rank 1000. Thus, when comparing two schemes this measure will ample of our framework. However, the entirety of the term-weighting
tell us how many rank positions, on average, a relevant document is scheme can be evolved and analysed in a similar manner.
expected to change from scheme a to scheme b . Although different
parts of the phenotype will impact on the fitness in different amounts
(i.e. changes of rank for relevant documents at positions near 1000 5.2 Training and Test Collections
do not significantly effect the MAP) they are an important part in dis- We use collections from TREC disks 4 and 5 as our test collections.
tinguishing the behaviour of the phenotype. The change in position A different set of 50 TREC topics is used for each of the collections
at high ranks can tell us about certain features of weighting scheme (apart from the Federal Register collection (FR) for which we use
and the behaviour at these ranks. 100 TREC topics). For each set of topics we create a medium length
We also develop a second measure of the distance between two query set (m), consisting of the title and description fields, and a
ranked lists which takes into account the effect a change in rank has long query set (l) consisting of the title, description and narrative
on MAP. To measure the actual difference a change in rank could fields. We also use documents from the OHSUMED collection as
make in terms of MAP, we modify the dist(a, b) measure so that the a test collection for medium length queries (OH90-91). We only use
change in rank of a relevant document is weighted on how it effects the topics in these sets that have relevant documents in the collection.
MAP. This weighted distance measure (w dist(a, b)) is similar to The TRAIN collection (used in training) consists of 35,412 docu-
the measure described in [2] and is calculated as follows: ments from the OHSUMED collection and the 63 topics. The lengths
1 of these topics range from 2 to 9 terms. Standard stop-words from the
− ri1(b) if ri (a) > lim
lim
Brown Corpus2 are removed and remaining words are stemmed using
1 1
P P
1
Q
1
q∈Q Rq i∈Rq ri (a)
− lim if ri (b) > lim Porter’s algorithm. No additional words are removed from the narra-
1
tive fields as is the case in some approaches. Table 1 shows some
− ri1(b) otherwise
characteristics of the document collections used in this research.
ri (a)
where Q is the number of queries and Rq is the relevant documents
for a query q. This measure tells us how a change in rank of a rele- Table 1. Document Collections
vant document will affect the MAP (i.e. changes of rank at positions
close to 1000 will not change the MAP significantly, while changes Collection #Docs #words/doc #Topics medium long
of rank in the top 10 may change MAP considerably). Of course, it is TRAIN 35,412 72.7 0-63 4.96 None
entirely possible that two ranked lists could be considerably different LATIMES 131,896 251.7 301-350 9.9 29.9
yet have a similar MAP, as they may be promoting different relevant FBIS 130,471 249.9 351-400 7.9 21.9
FT91-93 138,668 221.8 401-450 6.5 18.7
documents. FR 55,630 387.1 301-400 8.9 25.9
OH90-91 148,162 81.4 0-63 4.96 None
4.2 Neighbour-joining trees
Neighbour-joining is a bottom-up clustering method often used for
the creation of phylogenetic trees. However, we use the method to 5.3 Terminal and Function Set
produce trees that represent solutions that are from different runs of Tables 2 and 3 show the functions and terminals that are used in all
our GP. The algorithm requires knowledge of the distance between runs of the GP.
entities that are to be represented in the tree. A distance matrix is cre-
ated for the set of entities using a distance measure and the tree can
then be produced from the resulting data. We use this clustering tech- 5.4 GP parameters
nique to visualize the phenotypic distance between the best solutions We use MAP as our fitness function. All tests are run for 50 gen-
output by our GP. For example, if we have N entities or solutions, erations with an initial random population of 100 solutions on the
we can create an N × N distance matrix using one of our distance training collection (TRAIN) detailed in Table 1. The tournament size
measures. Then, using this distance matrix, we can then create a tree is set to 3. We restrict all trees to a depth of 6. As a result this has
using a suitable drawing package [3] which represents the data and the effect of reducing bloat, improving generalisation, reducing the
can provide a visualisation into where our solutions lie in relation to search space and increasing the speed of the GP. As our highest order
each other. This model is also well suited to our evolutionary para- operator is binary, the longest individual we can have can contains 63
digm. We use this technique simply to visualise the distance between
2 http://www.lextek.com/manuals/onix/stopwords1.html
our term-weighting solutions which are developed using GP.
Best and Average of the Population from two runs of the GP
Table 2. Function Set 0.24
0.22
0.2
Function Description
0.18
+, ×, /, - arithmetic functions
log the natural log 0.16
MAP
√
square-root function 0.14
sq square
0.12
0.1
gw1_best
0.08
Table 3. Terminal Set gw1_avg
gw2_best
gw2_avg
0.06
0 10 20 30 40 50
Generations
Terminal Description
N no. of documents in the collection Figure 2. Best and Average Fitness for best two global runs
df document frequency of a term
cf collection frequency of a term
V vocabulary of collection (no. of unique terms)
C size of collection (total number of terms) as a comparison to our distance measures that only look at distances
0.5 the constant 0.5 of relevant documents.
1 the constant 1
10 the constant 10 The values in Table 6 indicate the average number of rank posi-
tions a relevant document changes. While the values in Table 7 in-
dicate the maximum possible percentage MAP difference between
genes (26 − 1). We believe that this is a large enough space in which two schemes. By looking at the difference between the ranked lists
to find suitable term-weighting schemes. The creation type used is of each global weighting we can get an idea of the landscape of the
the standard ramped half and half creation method used by Koza [8]. solution space in the global domain.
We use an elitist strategy where the best individual is automatically
transfered to the next generation. 4% mutation is used in our experi- Table 5. 1− spearman’s rank correlation between all global weightings on
ments. Due to the stochastic nature of GP a number of runs is often TRAIN
needed to allow the GP converge to a suitably good solution. We run
Scheme idf idfrsj gw1 gw2 gw3 gw4 gw5 gw6 gw7
the GP seven times and choose the best solution from each of those
runs. This gives us seven evolved solutions and two benchmark solu- idf 00.00 0.014 0.524 0.501 0.503 0.396 0.007 0.007 0.078
idfrsj 00.00 0.531 0.507 0.507 0.400 0.020 0.020 0.089
tions (1) (2) to use with our document collections. gw1 00.00 0.059 0.186 0.174 0.523 0.523 0.451
gw2 00.00 0.210 0.136 0.499 0.499 0.425
gw3 00.00 0.170 0.501 0.501 0.435
gw4 00.00 0.393 0.393 0.324
6 RESULTS gw5 00.00 00.00 0.076
gw6 00.00 0.076
Table 4 shows the MAP of seven global evolved weighting schemes gw7 00.00
(gw) on our training data in no particular order. We can see that all
the evolved schemes are better than our benchmarks (idf and idfrsj ) Table 6. dist measure between all global weightings on TRAIN
in terms of MAP.
Scheme idf idfrsj gw1 gw2 gw3 gw4 gw5 gw6 gw7
Table 4. % MAP for all global weightings
idf 00.00 01.27 36.50 35.07 32.81 30.31 03.65 03.65 11.15
idfrsj 00.00 35.94 34.32 31.90 29.42 04.52 04.52 12.01
gw1 00.00 05.07 21.26 18.46 35.34 35.34 28.13
Collection idf idfrsj gw1 gw2 gw3 gw4 gw5 gw6 gw7 gw2 00.00 20.60 15.96 34.15 34.15 27.08
gw3 00.00 06.62 30.48 30.48 24.66
TRAIN 19.83 19.98 22.05 21.98 21.60 21.69 20.11 20.11 20.75 gw4 00.00 28.37 28.37 22.15
gw5 00.00 00.00 09.16
gw6 00.00 09.16
gw7 00.00
Figure 2 shows the best and average of the population from the
two best runs of the GP (i.e. gw1 and gw2 ). It is worth noting that
the best individual from the seven randomly created populations (i.e. Table 7. w dist % measure between all global weightings on TRAIN
generation 0) is not better than the best solution produced after the
50th generation from the worst of the seven runs.
Scheme idf idfrsj gw1 gw2 gw3 gw4 gw5 gw6 gw7
Tables 5, 6 and 7 shows the distance matrices for all the global
weighting schemes for the training data using Spearman’s rank cor- idf 00.00 00.21 04.30 04.17 04.10 03.91 00.99 00.99 01.78
idfrsj 00.00 04.20 04.15 04.26 04.01 01.10 01.10 01.86
relation, dist(a, b) and w dist(a, b) measures respectively. Spear- gw1 00.00 01.33 02.73 03.16 04.23 04.23 04.25
gw2 00.00 02.50 02.16 04.09 04.09 04.03
man’s rank correlation gives us values in the range of −1 to +1 and gw3 00.00 02.64 03.96 03.96 03.95
uses all of the documents in the ranked list. As the Spearman corre- gw4 00.00 03.48 03.48 03.38
gw5 00.00 00.00 01.18
lations of the ranked lists produced by the global weighting scheme gw6 00.00 01.18
gw7 00.00
are all positively correlated, we simply use 1− Spearman’s rank cor-
relation as a distance measure. This will give us 1 if the lists are ran-
domly correlated and 0 if they are identical. We use this correlation
similar. An important point to note is that as we get phenotypically
further from the best solution (gw1 ) we see a relative drop in MAP
on our training collection. This indicates that the solutions are evolv-
ing towards the ranked lists (on the training set) that are produced by
gw1 . Obviously, phenotypically close solutions will have a similar
gw
3 fitness but it is not neccessarily true that solutions with a similar fit-
gw
4 ness will have a similar phenotype (e.g. as one can imagine that there
exists many poor performing functions which return equally bad but
different ranked lists). It is worth noting that these trees should be
gw2
produced from the training data as this is the environment where the
gw1 solutions were evolved. However, these trees can help us to predict
gw7
the behaviour of the schemes on general data (if our training data is
a representative sample).
5
gw6 Tables 8 and 9 show the MAP of all schemes for unseen test data
idf gw
on medium and long queries. Firstly, we can see that the differences
sj
idf-r
in MAP between the evolved weightings and idfrsj are all statisti-
cally significant (p < 0.05) using a two-tailed t-test. Both version
of idf perform similarly as expected. We can see that gw1 is no
longer the best evolved weighting scheme, although it is still sig-
nificantly better than idf . Schemes gw2 , gw3 and gw4 are now the
1−Spearman’s rank correlation best performing schemes on most of the collections. Schemes gw5
and gw6 still perform only slightly better than idfrsj , while gw7 still
performs slightly better than these again. It would seem that gw1 has
overtrained slightly on the training collection. It is also worth point-
gw gw
ing out that our training set seems to be quite general as most of the
3 4
gw
gw
3 schemes perform similarly on test data. If we look at the genotypes
4
of some of the schemes it leads us to a similar conclusion. We have
re-written the following formulas in a more intuitive manner to pro-
gw gw vide transparency to the process. As a result, the re-written formulas
2 gw 7
7 gw
2
may also be shorter (in depth) that those that were evolved originally.
gw g
1 gw gww5
6
gw5
6 gw
1 √ √
idfidf idfidf V 2 cf 2 cf p cf 2 cf
-rs
j -rs gw1 = 3
+ cf gw2 =
j
C.df df 3
r
cf 2 N N2
gw3 = (log( )) × ×( + 1)
dist w dist df df df
Figure 3. Neighbour-Joining trees for global weightings
r sr
cf 3 N 0.5
gw4 = gw5 =
df 4 df
Firstly, from Figure 3 we can see that the phenotypic distance
measures produce trees of a similar structure. The only difference r√ sp
in form is that gw3 and gw4 are clustered together directly using the df cf /N
unweighted dist measure. It is important to note that the trees visu- gw6 = gw7 =
df df 2
alize different aspects of the ranked lists. For example, the distance
between the top four performing schemes (gw1 to gw4 ) and the re- We can see that gw1 is a more specific form of gw2 . Schemes
maining schemes is greater in the tree created from Spearman’s rank gw5 and gw6 are an example of two different genotypes producing
correlation than for the other two trees. This is because Spearman’s the same ranked lists. gw6 will produce a score that is always double
rank correlation uses the ranks of non-relavant documents. Looking that of gw5 . We are evolving towards a ranked list on the training
at the tree produced by the dist(a, b) measure, we can see that gw3 collection that is produced by the best two schemes (gw1 and gw2 ).
and gw4 are quite similar in terms of the actual ranks of relevant doc- The gw2 scheme is a more general form of gw1 and performs con-
uments. However, when looking at the tree produced by w dist(a, b) sistently better on our test data. The gw3 scheme contains a problem-
for these two schemes, we can see that some of these differences are atic log(cf /df ) that will assign certain low frequency terms a zero
at low ranks as the possible difference in MAP is quite large. weight [5] and makes it a poor choice for weighting in a retrieval
In general, we can see that idfrsj , idf , gw5 and gw6 are pheno- context. This can be seen on the results for the FR collection when
typically close. Schemes gw5 and gw6 are actually phenotypically compared to one of its nearest neighbours gw4 . When looking at the
equivalent (i.e. return the same ranked lists) but not genotypically individual queries for this collection (FR), we have determined that
equivalent. The two versions of idf are very close. Schemes gw1 and the difference between gw4 and the other top schemes (gw1 to gw3 )
gw2 are also phenotypically close while gw3 and gw4 are somewhat is only large for a very small number of queries. As a result it can be
Table 8. % MAP for idf and global weightings for Medium Queries
Collection Topics idf idfrsj gw1 gw2 gw3 gw4 gw5 gw6 gw7
LATIMES 301-350 (m) 19.11 19.16 21.80 22.49 23.48 22.98 20.92 20.92 21.12
FBIS 351-400 (m) 10.30 10.41 15.16 15.68 14.55 14.33 11.61 11.61 11.72
FT91-93 401-450 (m) 27.38 28.15 27.52 27.86 27.56 27.92 27.04 27.04 27.10
FR 301-400 (m) 25.87 24.89 25.12 25.71 21.31 28.72 25.49 25.49 27.39
OH90-91 0-63 (m) 21.68 21.72 24.96 25.69 25.02 25.28 22.96 22.96 23.68
≈ p-value 241 Topics 0.272 - 0.004 0.0001 0.0001 0.0001 0.018 0.018 0.021
Table 9. % MAP for idf and global weightings for Long Queries
Collection Topics idf idfrsj gw1 gw2 gw3 gw4 gw5 gw6 gw7
LATIMES 301-350 (l) 13.57 13.79 21.60 24.27 24.78 24.30 16.37 16.37 16.63
FBIS 351-400 (l) 06.76 06.97 12.30 13.32 14.07 13.84 08.34 08.34 09.01
FT91-93 401-450 (l) 23.11 23.13 27.17 28.28 28.31 29.13 24.95 24.95 25.80
FR 301-400 (l) 16.23 16.95 22.78 22.75 20.86 27.83 19.84 19.84 19.92
≈ p-value 241 Topics 0.300 - 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
concluded that gw4 promotes certain useful features that are different [3] Jeong-Hyeon Choi, Ho-Youl Jung, Hye-Sun Kim, and Hwan-Gue
than those of the rest of the schemes. These differences are notice- Cho, ‘Phylodraw: a phylogenetic tree drawing system.’, Bioinformat-
ics, 16(11), 1056–1058, (2000).
able on the FR collection because of its makeup. The gw4 scheme [4] Ronan Cummins and Colm O’Riordan, ‘An evaluation of evolved term-
seems to be a particularly robust global weighting scheme as shown weighting schemes in information retrieval.’, in CIKM, pp. 305–306,
on the test data. The difference between gw4 and gw3 , for example, (2005).
is not statistically significant. However, we know that gw4 has ad- [5] Ronan Cummins and Colm O’Riordan, ‘Evolving general term-
vantagous retrieval features (as seen on the FR collection) for certain weighting schemes for information retrieval: Tests on larger collec-
tions.’, Artif. Intell. Rev., 24(3-4), 277–299, (2005).
(albeit few) queries. [6] Weiguo Fan, Michael D. Gordon, and Praveen Pathak, ‘A generic rank-
ing function discovery framework by genetic programming for infor-
mation retrieval’, Information Processing & Management, (2004).
7 CONCLUSION [7] P. Kantor, K. Ng, and D. Hull. Comparison of system using pairs-out-
of-order, 1998.
We have introduced two metrics that measure the distance between [8] John R. Koza, Genetic Programming: On the Programming of Comput-
the ranked lists returned by different term-weighting schemes. These ers by Means of Natural Selection, MIT Press, Cambridge, MA, USA,
measures are useful for determining the closeness of term-weighting 1992.
schemes and for analysing the solutions without the need to analyse [9] Sean Luke, ‘When short runs beat long runs’, in Proceedings of the Ge-
netic and Evolutionary Computation Conference (GECCO-2001), pp.
the exact form (genotype) of a term-weighting scheme. This frame- 74–80, San Francisco, California, USA, (7-11 2001). Morgan Kauf-
work can be used for all types of term-weighting schemes and also mann.
fits well into the genetic programming paradigm. [10] N. Oren, ‘Re-examining tf.idf based information retrieval with genetic
The distance matrices produced from these distance measures can programming’, Proceedings of SAICSIT, (2002).
[11] A. Pirkola and K. Jarvelin, ‘Employing the resolution power of search
be used to produce trees that aid visualization of the solution space.
keys’, J. Am. Soc. Inf. Sci. Technol., 52(7), 575–583, (2001).
The trees produced are also useful in determining the relative per- [12] S. E. Robertson and S. Walker, ‘On relevance weights with little rel-
formance of the solutions on general test data. We have also shown evance information’, in Proceedings of the 20th annual international
that all the evolved global weighting schemes produced are evolving ACM SIGIR conference on Research and development in information
to a area of the solution space that is different from the types of idf retrieval, pp. 16–24. ACM Press, (1997).
[13] Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu,
currently being used to measure the discrimination value of a term. Aarron Gull, and Marianna Lau, ‘Okapi at TREC-3’, in In D. K. Har-
In future work, we intend to apply this framework to analyse entire man, editor, The Third Text REtrieval Conference (TREC-3) NIST,
term-weighting schemes which have been evolved. (1995).
[14] Dmitri Roussinov, Weiguo Fan, and Fernando A. Das Neves, ‘Dis-
cretization based learning approach to information retrieval.’, in CIKM,
ACKNOWLEDGEMENTS pp. 321–322, (2005).
[15] Gerard Salton and Chris Buckley, ‘Term-weighting approaches in au-
This work is being carried out with the support of IRCSET (the Irish tomatic text retrieval’, Information Processing & Management, 24(5),
Research Council for Science, Engineering and Technology) under 513–523, (1988).
[16] Alan F. Smeaton, ‘Independence of contributing retrieval strategies in
the Embark Initiative. data fusion for effective information retrieval.’, in BCS-IRSG Annual
Colloquium on IR Research, (1998).
[17] Karen Sparck Jones, ‘A statistical interpretation of term specificity
REFERENCES and its application in retrieval’, Journal of Documentation, 28, 11–21,
(1972).
[1] Chris Buckley and Ellen M. Voorhees, ‘Evaluating evaluation measure
[18] Andrew Trotman, ‘Learning to rank’, Information Retrieval, 8, 359 –
stability’, in SIGIR ’00: Proceedings of the 23rd annual international
381, (2005).
ACM SIGIR conference on Research and development in information
[19] C. T. Yu and G. Salton, ‘Precision weighting - an effective automatic
retrieval, pp. 33–40, New York, NY, USA, (2000). ACM Press.
indexing method’, Journal of the ACM, 23(1), 76–88, (1976).
[2] Ben Carterette and James Allan, ‘Incremental test collections’, in CIKM
’05: Proceedings of the 14th ACM international conference on Informa-
tion and knowledge management, pp. 680–687, New York, NY, USA,
(2005). ACM Press.