<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing the Robustness of Expansion Techniques and Retrieval Measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen Tomlinson</string-name>
          <email>stephen.tomlinson@hummingbird.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hummingbird</institution>
          ,
          <ext-link ext-link-type="uri" xlink:href="http://www.hummingbird.com/">http://www.hummingbird.com/</ext-link>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ottawa</institution>
          ,
          <addr-line>Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Hummingbird participated in the monolingual (Bulgarian, French, Hungarian, Portuguese and English) and robust (Dutch, English, French, German, Italian and Spanish) information retrieval tasks of the Ad-Hoc Track of the Cross-Language Evaluation Forum (CLEF) 2006. In all 22 of our experiments with blind feedback (a technique known to impair robustness across topics), the mean scores of the Average Precision, Geometric MAP and Precision@10 measures increased (and most of these increases were statistically significant), implying that these measures are not suitable as robust retrieval measures. In contrast, we found that measures based on just the first relevant item, such as a Generalized Success@10 measure, successfully discerned some robustness gains, particularly the robustness advantage of expanding Title queries by using the Description field instead of blind feedback.</p>
      </abstract>
      <kwd-group kwd-group-type="general-terms">
        <title>General Terms</title>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="author-keywords">
        <title>Keywords</title>
        <kwd>Robust Retrieval</kwd>
        <kwd>Blind Feedback</kwd>
        <kwd>First Relevant Score</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Hummingbird SearchServer<sup>1</sup> is a toolkit for developing enterprise search and retrieval applications.
The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.</p>
      <p>
        SearchServer works in Unicode internally [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and supports most of the world’s major
character sets and languages. The major conferences in text retrieval experimentation (CLEF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
NTCIR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and TREC [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) have provided judged test collections for objective experimentation
with SearchServer in more than a dozen languages.
      </p>
      <p>This paper describes experimental work with SearchServer for the task of finding relevant
documents for natural language queries in various European languages using the CLEF 2006
Ad-Hoc Track test collections.</p>
      <p><sup>1</sup>SearchServer™, SearchSQL™ and Intuitive Searching™ are trademarks of Hummingbird Ltd. All other
copyrights, trademarks and tradenames are the property of their respective owners.</p>
    </sec>
    <sec id="sec-2">
      <sec id="sec-2-1">
        <title>Methodology</title>
        <sec id="sec-2-1-1">
          <title>Data</title>
          <p>The CLEF 2006 Ad-Hoc Track document sets consisted of tagged (SGML-formatted) news articles
in 5 different languages: Bulgarian, French, Hungarian, Portuguese and English. Table 1 gives the
sizes.</p>
          <p>The CLEF organizers created 50 natural language “topics” and translated them into many
languages. Some topics were discarded for some languages because of a lack of relevant documents.
Table 1 gives the final number of topics for each language and their average number of relevant
documents (along with the lowest, median and highest number of relevant documents of any topic).
For more information on the CLEF test collections, see the track overview paper.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Indexing</title>
          <p>
            Our indexing approach was mostly the same as last year [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Accents were not indexed except
for the combining breve in Bulgarian. The apostrophe was treated as a word separator for the
investigated languages (except English). The custom text reader, cTREC, was updated to maintain
support for the CLEF guidelines of only indexing specifically tagged fields.
          </p>
          <p>
            Some stop words were excluded from indexing (e.g. “the”, “by” and “of” in English). For these
experiments, the stop word lists for Bulgarian and Hungarian were based on Savoy’s updated lists
[
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
          </p>
          <p>By default, the SearchServer index supports both exact matching (after some Unicode-based
normalizations, such as decompositions and conversion to upper-case) and morphological matching
(e.g. inflections, derivations and compounds, depending on the linguistic component used).</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Searching</title>
          <p>We experimented with the SearchServer CONTAINS predicate. Our test application specified
SearchSQL to perform a Boolean OR of the query words. For example, for English topic 279,
whose Title was “Swiss referendums”, a corresponding SearchSQL query would be:</p>
          <preformat>SELECT RELEVANCE('2:3') AS REL, DOCNO
FROM CLEF06EN
WHERE FT_TEXT CONTAINS 'Swiss'|'referendums'
ORDER BY REL DESC;</preformat>
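          <p>As an illustration of how such a query could be assembled from a topic Title, the following minimal Python sketch builds the statement shown above. The function name, the whitespace tokenization and the doubling of embedded single quotes are assumptions made for this example; they are not taken from the actual test application.</p>
          <preformat><![CDATA[
```python
def build_or_query(title, table="CLEF06EN", relevance_method="2:3"):
    """Return a SearchSQL statement that ORs together the words of a topic title."""
    # Assumption: query words are obtained by splitting the title on whitespace.
    words = title.split()
    # Assumption: an embedded single quote is escaped by doubling it.
    terms = "|".join("'" + word.replace("'", "''") + "'" for word in words)
    return (
        f"SELECT RELEVANCE('{relevance_method}') AS REL, DOCNO\n"
        f"FROM {table}\n"
        f"WHERE FT_TEXT CONTAINS {terms}\n"
        "ORDER BY REL DESC;"
    )


query = build_or_query("Swiss referendums")
print(query)
```
]]></preformat>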
          <p>
            Most aspects of the SearchServer relevance value calculation are the same as described last
year [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Briefly, SearchServer dampens the term frequency and adjusts for document length in a
manner similar to Okapi [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] and dampens the inverse document frequency using an approximation
of the logarithm. These calculations are based on the stems of the terms (roughly speaking)
when doing morphological searching (i.e. when SET TERM_GENERATOR ‘word!ftelp/inflect’
was previously specified). The SearchServer RELEVANCE_METHOD setting was set to ‘2:3’
and RELEVANCE_DLEN_IMP was set to 750 for all experiments in this paper.
          </p>
        </sec>
        <sec id="sec-2-1-4">
          <title>Experimental Runs</title>
          <p>For each language, we executed 5 experimental runs in May 2006, though just 3 were allowed
to be submitted for official assessment. In the identifiers (e.g. “humBG06tde”), ‘t’, ‘d’ and ‘n’
indicate that the Title, Description and Narrative field of the topic were used (respectively), and
‘e’ indicates that query expansion from blind feedback on the first 3 rows was used (weight of
one-half on the original query, and one-sixth each on the 3 expanded rows). From the Description and
Narrative fields for most languages, instruction words such as “find”, “relevant” and “document”
were automatically removed (based on looking at some older topic lists, not this year’s topics; this
step was skipped for Hungarian because we did not update our lists based on last year’s topics).
All runs used inflections and/or derivations from stemming.</p>
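          <p>The expansion weights described above (one-half on the original query, one-sixth on each of the 3 feedback rows) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the original query and each feedback row produce a per-document score dictionary and that the final score is their weighted sum. How SearchServer combines the scores internally is not described here, and all names in the sketch are hypothetical.</p>
          <preformat><![CDATA[
```python
from collections import defaultdict


def combine_feedback_scores(original_scores, feedback_scores, original_weight=0.5):
    """Blend original-query scores with scores derived from the top feedback rows.

    original_scores: dict mapping document id to score for the original query.
    feedback_scores: list of such dicts, one per feedback row (3 in the paper).
    The weight left over after original_weight is split evenly across the rows,
    which gives one-sixth per row when there are 3 rows.
    """
    combined = defaultdict(float)
    for doc, score in original_scores.items():
        combined[doc] += original_weight * score
    if feedback_scores:
        row_weight = (1.0 - original_weight) / len(feedback_scores)
        for row_scores in feedback_scores:
            for doc, score in row_scores.items():
                combined[doc] += row_weight * score
    return dict(combined)


# Hypothetical scores for three documents.
original = {"d1": 1.0, "d2": 0.5}
rows = [{"d1": 1.0}, {"d2": 1.0}, {"d3": 1.0}]
expanded = combine_feedback_scores(original, rows)
```
]]></preformat>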
          <p>The 5 executed runs for each language:</p>
          <list list-type="bullet">
            <list-item><p>“t”: Just the Title field of the topic was used.</p></list-item>
            <list-item><p>“te”: Same as “t” except that blind feedback (based on the first 3 rows of the “t” query) was
used to expand the query. (This run was not submitted.)</p></list-item>
            <list-item><p>“td”: Same as “t” except that the Description field was additionally used.</p></list-item>
            <list-item><p>“tde”: Same as “td” except that blind feedback (based on the first 3 rows of the “td” query)
was used to expand the query.</p></list-item>
            <list-item><p>“tdn”: Same as “td” except that the Narrative field was additionally used. (This run was not
submitted.)</p></list-item>
          </list>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Retrieval Measures</title>
        <p>Traditionally, different retrieval measures have been used for “ad hoc” tasks, which seek relevant
items for a topic, than for “known-item” tasks, which seek a particular known document. However,
we argue that the known-item measures are not only applicable to ad hoc tasks, but that they
are often preferable. For many ad hoc tasks, e.g. finding answer documents for questions, just one
relevant item is needed. Also, the traditional ad hoc measures encourage retrieval of duplicate
relevants, which does not correspond to user benefit.</p>
        <p>
          The traditional known-item measures are very coarse, e.g. Success@10 is 1 or 0 for each topic,
while reciprocal rank cannot produce a value between 1.0 and 0.5. Last year, we began investigating
a new measure, Generalized Success@10 (GS10) (introduced as “First Relevant Score” (FRS) in
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]), which is defined below. This investigation led to the discovery that the blind feedback
technique (a commonly used technique at CLEF, NTCIR and TREC, but not known to be popular
in real systems) had a downside, namely that it pushes down the first relevant item (on average),
as has now been verified not just for our own blind feedback approach, but on 7 other major blind
feedback systems [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p><bold>Primary Recall Measures.</bold> “Primary recall” is retrieval of the first relevant item for a topic. Primary recall measures include
the following:</p>
        <list list-type="bullet">
          <list-item><p>Generalized Success@30 (GS30): For a topic, GS30 is <inline-formula><tex-math>1.024^{1-r}</tex-math></inline-formula> where <italic>r</italic> is the rank of the
first row for which a desired page is found, or zero if a desired page was not found. (This is
an experimental new measure introduced in this paper; compared to GS10 (defined below),
it further deemphasizes small differences at the top of the list.)</p></list-item>
          <list-item><p>Generalized Success@10 (GS10): For a topic, GS10 is <inline-formula><tex-math>1.08^{1-r}</tex-math></inline-formula> where <italic>r</italic> is the rank of the first
row for which a desired page is found, or zero if a desired page was not found.</p></list-item>
          <list-item><p>Success@n (S@n): For a topic, Success@n is 1 if a desired page is found in the first n rows,
0 otherwise. This paper lists Success@1 (S1) and Success@10 (S10) for all runs.</p></list-item>
          <list-item><p>Reciprocal Rank (RR): For a topic, RR is <inline-formula><tex-math>1/r</tex-math></inline-formula> where <italic>r</italic> is the rank of the first row for which
a desired page is found, or zero if a desired page was not found. “Mean Reciprocal Rank”
(MRR) is the mean of the reciprocal ranks over all the topics.</p></list-item>
        </list>
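        <p>The primary recall measures defined above can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used for the experiments: the function names are hypothetical, a rank of None stands for a topic with no relevant item retrieved, and the four example ranks are made up.</p>
        <preformat><![CDATA[
```python
def generalized_success(rank, base):
    """Return base ** (1 - rank), or 0.0 when no relevant item was retrieved."""
    if rank is None:
        return 0.0
    return base ** (1 - rank)


def gs10(rank):
    """Generalized Success@10: 1.08 ** (1 - rank)."""
    return generalized_success(rank, 1.08)


def gs30(rank):
    """Generalized Success@30: 1.024 ** (1 - rank)."""
    return generalized_success(rank, 1.024)


def success_at(rank, n):
    """Return 1 if the first relevant item is within the first n rows, else 0."""
    return 1 if rank is not None and rank <= n else 0


def reciprocal_rank(rank):
    """Return 1 / rank, or 0.0 when no relevant item was retrieved."""
    return 0.0 if rank is None else 1.0 / rank


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical ranks of the first relevant item for four topics.
first_relevant_ranks = [1, 2, 10, None]
mean_gs10 = mean(gs10(r) for r in first_relevant_ranks)
mean_s10 = mean(success_at(r, 10) for r in first_relevant_ranks)
mrr = mean(reciprocal_rank(r) for r in first_relevant_ranks)
```
]]></preformat>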
        <p>Interpretation of Generalized Success@n: GS30 and GS10 are estimates of the percentage of
potential result list reading the system saved the user to get to the first relevant item, assuming that
users are less and less likely to continue reading as they get deeper into the result list.</p>
        <p>Comparison of GS10 and Reciprocal Rank: Both GS10 and RR are 1.0 if a desired page is found
at rank 1. At rank 2, GS10 is just 7 points lower (0.93), whereas RR is 50 points lower (0.50). At
rank 3, GS10 is another 7 points lower (0.86), whereas RR is 17 points lower (0.33). At rank 10,
GS10 is 0.50, whereas RR is 0.10. GS10 is greater than RR for ranks 2 to 52 and lower for ranks
53 and beyond.</p>
        <p>Connection of GS10 to Success@10: GS10 is considered a generalization of Success@10 because
it rounds to 1 for r ≤ 10 and to 0 for r &gt; 10. (Similarly, GS30 is considered a generalization of
Success@30 because it rounds to 1 for r ≤ 30 and to 0 for r &gt; 30.)</p>
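        <p>The two properties stated above (the crossover with reciprocal rank and the rounding behaviour) can be checked numerically. The short Python sketch below does so for ranks 1 to 200; the search limit of 200 and the variable names are arbitrary choices made for this illustration.</p>
        <preformat><![CDATA[
```python
def gs10(rank):
    return 1.08 ** (1 - rank)


def gs30(rank):
    return 1.024 ** (1 - rank)


def rr(rank):
    return 1.0 / rank


ranks = range(1, 201)

# Ranks at which GS10 is strictly greater than reciprocal rank.
gs10_higher = [r for r in ranks if gs10(r) > rr(r)]

# Ranks at which each generalized measure rounds to 1 rather than 0.
gs10_rounds_to_one = [r for r in ranks if round(gs10(r)) == 1]
gs30_rounds_to_one = [r for r in ranks if round(gs30(r)) == 1]

print(gs10_higher[0], gs10_higher[-1])
print(gs10_rounds_to_one[-1], gs30_rounds_to_one[-1])
```
]]></preformat>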
        <sec id="sec-2-2-1">
          <title>Secondary Recall Measures</title>
          <p>“Secondary recall” is retrieval of the additional relevant items for a topic (after the first one).
Secondary recall measures place most of their weight on these additional relevant items.</p>
          <list list-type="bullet">
            <list-item><p>Precision@n: For a topic, “precision” is the percentage of retrieved documents which are
relevant. “Precision@n” is the precision after n documents have been retrieved. This paper
lists Precision@10 (P10) for all runs.</p></list-item>
            <list-item><p>Average Precision (AP): For a topic, AP is the average of the precision after each relevant
document is retrieved (using zero as the precision for relevant documents which are not
retrieved). By convention, AP is based on the first 1000 retrieved documents for the topic.
The score ranges from 0.0 (no relevants found) to 1.0 (all relevants found at the top of the
list). “Mean Average Precision” (MAP) is the mean of the average precision scores over all
of the topics (i.e. all topics are weighted equally).</p></list-item>
            <list-item><p>Geometric MAP (GMAP): GMAP (introduced in [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]) is the primary measure for the “robust
task” this year. It is based on “Log Average Precision” which for a topic is the natural log
of the max of 0.00001 and the average precision. GMAP is the exponential of the mean log
average precision. (We argue in [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] that primary recall measures better reflect robustness
than GMAP.)</p></list-item>
            <list-item><p>GMAP′: We also define a linearized log average precision measure (denoted GMAP′) which
linearly maps the ‘log average precision’ values to the [0,1] interval. For statistical significance
purposes, GMAP′ gives the same results as GMAP, and it has advantages such as that the
individual topic differences are in the familiar −1.0 to 1.0 range and are on the same scale
as the mean. Table 2 shows examples of the mapping of the AP and GMAP′ scores for a
topic; for example, the table shows that for GMAP, an AP increase from 0.00001 to 0.01 is
considered more important than an increase from 0.01 to 1.0 (these are differences of 0.6 and
0.4 respectively in GMAP′). (This example illustrates one of our concerns with GMAP, which
is that small differences likely to be unimportant to a user can be dramatically amplified.)</p></list-item>
          </list>
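          <p>The secondary recall measures above can be sketched in Python as follows. This is a minimal illustration, not the evaluation code used for the experiments. The function names are hypothetical, and the GMAP′ mapping is written under the assumption that the floor value 0.00001 maps to 0 and an AP of 1.0 maps to 1, which reproduces the differences of 0.6 and 0.4 quoted above.</p>
          <preformat><![CDATA[
```python
import math

AP_FLOOR = 0.00001  # lower bound applied before taking the log


def precision_at(ranked_relevance, n):
    """Fraction of the first n retrieved documents that are relevant."""
    return sum(ranked_relevance[:n]) / n


def average_precision(ranked_relevance, num_relevant, cutoff=1000):
    """Average of the precision after each relevant document is retrieved.

    ranked_relevance holds 1 (relevant) or 0 for each retrieved row in order.
    Relevant documents that are never retrieved contribute a precision of zero.
    """
    hits = 0
    total = 0.0
    for position, is_relevant in enumerate(ranked_relevance[:cutoff], start=1):
        if is_relevant:
            hits += 1
            total += hits / position
    return total / num_relevant


def log_average_precision(ap):
    return math.log(max(AP_FLOOR, ap))


def gmap(ap_scores):
    """Exponential of the mean log average precision over the topics."""
    logs = [log_average_precision(ap) for ap in ap_scores]
    return math.exp(sum(logs) / len(logs))


def linearized_log_ap(ap):
    """Assumed GMAP' mapping: linear in log AP, floor maps to 0, AP = 1 maps to 1."""
    return 1.0 - log_average_precision(ap) / math.log(AP_FLOOR)


# Hypothetical topic: 3 relevant documents, 2 of them retrieved at ranks 1 and 3.
ranking = [1, 0, 1, 0, 0]
ap = average_precision(ranking, num_relevant=3)
```
]]></preformat>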
        </sec>
        <sec id="sec-2-2-2">
          <title>Statistical Significance Tables</title>
          <p>For tables comparing 2 diagnostic runs (such as Table 4), the columns are as follows:</p>
          <list list-type="bullet">
            <list-item><p>“Expt” specifies the experiment. The language code is given, followed by the labels of the
2 runs being compared. The difference is the first run minus the second run. For example,
“BG-td-t” specifies the difference of subtracting the scores of the Bulgarian ‘t’ run from the
Bulgarian ‘td’ run (of Table 3).</p></list-item>
            <list-item><p>“ΔGS30” is the difference of the mean GS30 scores of the two runs being compared (and
“ΔGS10” is the difference of the mean GS10 scores, etc.).</p></list-item>
            <list-item><p>“95% Conf” is an approximate 95% confidence interval for the difference (calculated from
plus/minus twice the standard error of the mean difference). If zero is not in the interval,
the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely to be of
neutral impact (on average), though if the average difference is small (e.g. &lt;0.020) it may
still be too minor to be considered “significant” in the magnitude sense.</p></list-item>
            <list-item><p>“vs.” is the number of topics on which the first run scored higher, lower and tied (respectively)
compared to the second run. These numbers should always add to the number of topics.</p></list-item>
            <list-item><p>“3 Extreme Diffs (Topic)” lists 3 of the individual topic differences, each followed by the
topic number in brackets. The first difference is the largest one of any topic (based on the</p></list-item>
          </list>
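          <p>The “95% Conf” and “vs.” columns described above can be illustrated with a short Python sketch that compares two runs topic by topic. The function name and the example scores are hypothetical, and the use of the sample variance (n − 1 denominator) for the standard error is an assumption, since only “twice the standard error of the mean difference” is specified.</p>
          <preformat><![CDATA[
```python
import math


def compare_runs(first_scores, second_scores):
    """Summarize per-topic score differences (first run minus second run)."""
    diffs = [a - b for a, b in zip(first_scores, second_scores)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Assumption: standard error computed from the sample variance (n - 1).
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    std_error = math.sqrt(variance / n)
    # Approximate 95% interval: plus/minus twice the standard error.
    interval = (mean_diff - 2 * std_error, mean_diff + 2 * std_error)
    higher = sum(1 for d in diffs if d > 0)
    lower = sum(1 for d in diffs if d < 0)
    tied = n - higher - lower
    # Statistically significant (5% level) when zero lies outside the interval.
    significant = not (interval[0] <= 0.0 <= interval[1])
    return {
        "mean_diff": mean_diff,
        "interval": interval,
        "vs": (higher, lower, tied),
        "significant": significant,
    }


# Hypothetical per-topic scores of two runs over four topics.
result = compare_runs([0.5, 0.7, 0.9, 0.4], [0.4, 0.5, 0.9, 0.5])
```
]]></preformat>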
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Results of Query Expansion Experiments</title>
        <sec id="sec-2-3-1">
          <title>Expansion of Title Queries</title>
          <p>Table 4 shows that expanding the Title queries by adding the Description field increased the mean
score for all investigated measures (GS30, GS10, MRR, P10, GMAP and MAP), including at
least one statistically significant increase for each measure. Adding the Description is a “robust”
technique that can sometimes improve a poor result from just using the Title field.</p>
          <p>
            Table 5 shows that expanding the Title queries via blind feedback of the first 3 rows did not
produce any statistically significant increases for the primary recall measures (GS30, GS10, MRR),
even though it produced statistically significant increases for the secondary recall measures (P10,
GMAP, MAP). Blind feedback is not a robust technique in that it is unlikely to improve poor
results. (In a larger experiment, we would expect the primary recall measures to show statistically
significant decreases, as we saw for Bulgarian last year [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].)
          </p>
          <p>Table 6 compares the results of the two title-expansion approaches. For each primary recall
measure (GS30, GS10, MRR), there is at least one positive statistically significant difference,
reflecting the robustness of using the Description instead of blind feedback. However, there are
no statistically significant differences in the secondary recall measures (P10, GMAP, MAP); these
measures do not discern the higher robustness of the “td” run compared to the “te” run.</p>
          <p>[Tables 4 to 6 (values not recovered). Rows of Table 4: BG-td-t, FR-td-t, HU-td-t, PT-td-t, EN-td-t. Rows of Table 5: BG-te-t, FR-te-t, HU-te-t, PT-te-t, EN-te-t. Rows of Table 6: BG-td-te, FR-td-te, HU-td-te, PT-td-te, EN-td-te.]</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Expansion of “Title+Desc” Queries</title>
          <p>Table 7 shows that expanding the “Title+Desc” queries by adding the Narrative field tended to
be beneficial for both primary and secondary recall measures, though not as consistently as was
adding the Description to the Title queries. (Sometimes the Narrative field specifies what is not
relevant.)</p>
          <p>
            Table 8 shows that blind feedback produced many statistically significant increases for the secondary recall measures
(P10, GMAP, MAP). We also see one statistically significant increase for a primary recall measure
(for Hungarian), which we suspect is a Type I error, because it does not fit the pattern we have
seen over several other experiments [
            <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">11, 8, 10, 9</xref>
            ] (including last year’s Hungarian experiment, for
which mean GS10 was down slightly with blind feedback [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]).
          </p>
          <p>Table 9 compares the results of the two expansion approaches for “Title+Desc” queries. The
Narrative was modestly beneficial for the primary recall measures compared to blind feedback,
reflecting a robustness advantage, even though blind feedback boosted the secondary recall measures
a little more.</p>
          <p>[Tables 7 to 9 (values mostly not recovered). Rows of Table 7: BG-tdn-td, FR-tdn-td, HU-tdn-td, PT-tdn-td, EN-tdn-td. Rows of Table 8: BG-tde-td, FR-tde-td, HU-tde-td, PT-tde-td, EN-tde-td. Rows of Table 9: BG-tdn-tde, FR-tdn-tde, HU-tdn-tde, PT-tdn-tde, EN-tdn-tde. Surviving values, in extraction order after the Table 8 row labels: −0.007, −0.006, 0.012, −0.003, −0.010; ΔGS10: −0.002, −0.011, 0.025, 0.002, −0.017. Row labels of the Robust Task tables: DE-e0, EN-e0, ES-e0, FR-e0, IT-e0, NL-e0, DE-e1, EN-e1, ES-e1, FR-e1, IT-e1, NL-e1.]</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Robust Task Results</title>
        <p>The “Robust Task” re-used the old test collections for Dutch, English, French, German, Italian
and Spanish from CLEF 2001-2003. Of the 160 old topics, 60 were allowed to be used for new
“training”, leaving the other 100 for “testing”. Participants were encouraged to train on the GMAP
measure, though we believe primary recall measures better reflect robustness. We actually did not
do any new training for this task.</p>
        <p>Note that even though the document sets were not always the same for each language in 2001,
2002 and 2003, a fixed document set was used for each language in this task. Hence there may be
more unjudged relevant items than usual. Unfortunately, we did not have time to look at metrics
on just judged items for this paper.</p>
        <p>Table 10 lists the mean scores of our submitted Robust Task runs. For each language, we
submitted a “td” run (no blind feedback) and a “tde” run (incorporating blind feedback based on
the first 3 rows of “td”). Even though blind feedback is known to tend to make results less robust,
the GMAP score was higher with blind feedback in all cases (as were P10 and MAP).</p>
        <p>
          Tables 11 and 12 isolate the impact of blind feedback on each measure. The impact on the
primary recall measures tended to be detrimental, including a statistically significant decrease
on the Spanish training topics. The increases on the secondary recall measures were mostly
statistically significant. While this generally fits the pattern we have seen in other experiments
(e.g. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]), the negative impact on the primary recall measures seems to be less strong than we have
seen elsewhere. Perhaps the old CLEF topics tend to be “easier” than, say, the old TREC topics
used at RIA [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], providing relatively fewer cases for which blind feedback would be detrimental.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>Conclusions</title>
        <p>
          For all 22 blind feedback experiments reported in this paper, the mean scores for MAP, GMAP
and P10 were up with blind feedback, and most of these increases were statistically significant.
As blind feedback is known to be bad for robustness (because of its tendency to “not help (and
frequently hurt) the worst performing topics” [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]), we conclude that none of these 3 measures
should be used as robustness measures.
        </p>
        <p>Measures based on just the first relevant item (i.e. primary recall measures such as GS30 and
GS10) reflect robustness. In this paper, we found in particular that these measures discerned the
robustness advantage of expanding Title queries by using the Description field instead of blind
feedback, while the secondary recall measures (MAP, GMAP, P10) did not.</p>
        <p>
          These results are consistent with what we have seen elsewhere [
          <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">11, 8, 10, 9</xref>
          ]. For example, in
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], 7 other groups’ blind feedback systems were studied, and it was found that blind feedback was
detrimental to the first relevant item (on average), even though it boosted the secondary recall
measures.
        </p>
        <p>
          A paper at the recent SIGIR conference [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] gives a theoretical explanation for why different
retrieval approaches are superior when seeking just one relevant item instead of several. In
particular, it finds that when seeking just one relevant item, it can theoretically be advantageous to
use negative pseudo-relevance feedback to encourage more diversity in the results.
        </p>
        <p>To encourage more research in robust retrieval, probably the simplest thing the organizers of
ad hoc tracks could do would be to use a measure based on just the first relevant item (e.g. GS10
or GS30) as the primary measure for the ad hoc task. Participants would then find it detrimental
to use the non-robust blind feedback technique, but potentially would be rewarded for finding
ways of producing more diverse results.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Harr</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>David R.</given-names>
            <surname>Karger</surname>
          </string-name>
          .
          <article-title>Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents</article-title>
          .
          <source>SIGIR</source>
          <year>2006</year>
          , pp.
          <fpage>429</fpage>
          -
          <lpage>436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          Cross-Language Evaluation Forum web site. http://www.clef-campaign.org/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Hodgson</surname>
          </string-name>
          .
          <article-title>Converting the Fulcrum Search Engine to Unicode</article-title>
          . Sixteenth International Unicode Conference,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          NTCIR (NII-NACSIS Test Collection for IR Systems) Home Page. http://research.nii.ac.jp/~ntcadm/index-en.html
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          .
          <article-title>Okapi at TREC-3</article-title>
          .
          <source>Proceedings of TREC-3</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jacques</given-names>
            <surname>Savoy</surname>
          </string-name>
          . CLEF and Multilingual information retrieval resource page. http://www.unine.ch/info/clef/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          Text REtrieval Conference (TREC) Home Page. http://trec.nist.gov/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Tomlinson</surname>
          </string-name>
          .
          <article-title>CJK Experiments with Hummingbird SearchServer™ at NTCIR-5</article-title>
          .
          <source>Proceedings of NTCIR-5</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Tomlinson</surname>
          </string-name>
          .
          <article-title>Early Precision Measures: Implications from the Downside of Blind Feedback</article-title>
          .
          <source>SIGIR</source>
          <year>2006</year>
          , pp.
          <fpage>705</fpage>
          -
          <lpage>706</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Tomlinson</surname>
          </string-name>
          .
          <article-title>Enterprise, QA, Robust and Terabyte Experiments with Hummingbird SearchServer™ at TREC 2005</article-title>
          .
          <source>Proceedings of TREC</source>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Tomlinson</surname>
          </string-name>
          .
          <article-title>European Ad Hoc Retrieval Experiments with SearchServer™ at CLEF 2005</article-title>
          .
          <source>Working Notes for the CLEF 2005 Workshop</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ellen M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>Overview of the TREC 2003 Robust Retrieval Track</article-title>
          .
          <source>Proceedings of TREC</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Ellen M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>Overview of the TREC 2004 Robust Retrieval Track</article-title>
          .
          <source>Proceedings of TREC</source>
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>