<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scalable Multilingual Information Access</article-title>
      </title-group>
      <contrib-group>
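        <contrib contrib-type="author">
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>
            <given-names>James</given-names>
            <surname>Mayfield</surname>
          </string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johns Hopkins University Applied Physics Laboratory</institution>
          <addr-line>11100 Johns Hopkins Road, Laurel, MD 20723-6099</addr-line>
          <country country="US">USA</country>
        </aff>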
      </contrib-group>
      <abstract>
        <p>The third Cross-Language Evaluation Forum workshop (CLEF-2002) provides an unprecedented opportunity to evaluate retrieval in eight different languages using a uniform set of topics and a uniform assessment methodology. This year the Johns Hopkins University Applied Physics Laboratory participated in the monolingual, bilingual, and multilingual retrieval tasks. We contend that information access in a plethora of languages requires approaches that are inexpensive in developer and run-time costs. In this paper we describe a simplified approach that seems suitable for retrieval in many languages; we also show how good retrieval is possible over many languages, even when translation resources are scarce, or when query-time translation is infeasible. In particular, we investigate the use of character n-grams for monolingual retrieval, pre-translation expansion as a technique to mitigate errors due to limited translation resources, and translation of document representations to an interlingua for computationally efficient retrieval against multiple languages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The number of languages in the CLEF document collection has grown to eight in 2002: Dutch, English,
Finnish, French, German, Italian, Spanish, and Swedish. While the Romance languages have a great deal in
common with one another, the Teutonic languages and Finnish have different origins; this set of modern
languages presents challenges in word decompounding, complex morphology, and the handling of diacritical marks.
For many years, research in information retrieval focused on the English language, where these problems
are less significant. As a result, simple rules for stemming words and case-folding are really the only common
improvements to exact string matching used by retrieval systems. The use of stopword lists is also routine,
but seems to have little effect except to reduce the size of inverted files and to improve run-time efficiency.
We have been interested in discovering how simple methods can be applied to combat the aforementioned
problems. Though their use has not found favor in English, we have demonstrated that overlapping character
n-grams are remarkably effective for retrieval in many languages, including those most widely used in
Europe. This simple technique appears to provide a surrogate means of normalizing word forms, an efficient
approximation to word bigrams (when n-grams with interior spaces are formed), and a solution to the
problem of decompounding agglutinative languages. For the CLEF-2002 evaluation we continued to use the
Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT) system, which
supports n-gram processing.</p>
      <p>We participated in three tasks at this year’s workshop: monolingual, cross-language, and multilingual
retrieval. All of our official submissions were automated runs. This year we relied on an aligned parallel
corpus as our sole translation resource; this resource was automatically mined from the Web and can be
used to support retrieval between any pair of E.U. languages, except Greek. In the sections that follow, we
first describe the standard methodology used for each language’s sub-collection and we then present initial
results in monolingual, bilingual, and multilingual retrieval. Highlights include an investigation into the use
of pre-translation expansion from a comparable collection to improve retrieval performance, a discovery that
character n-grams provide a means for effective bilingual retrieval for a close language pair, without
translation, and an efficient method for multilingual retrieval that involves no query-time translation.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>For the monolingual tasks we created sixteen indexes, a word and an n-gram (n=6) index for each of the
eight languages. For the bilingual and multilingual tasks we used the same indexes but translated topic
statements to produce our official runs; however, we also report on another approach for multilingual
retrieval that required a separate index. Information about each index is provided in Table 1.</p>
      <p>Table 1. Index information for each language (columns: collection size in MB zipped, # docs, type, # terms).</p>
      <p>
Our methods for scanning documents, creating an index, and processing queries are essentially unchanged
from last year. We include below a description from our CLEF-2001 paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; those already familiar with
our previous work using HAIRCUT should skip ahead to a description of this year’s experiments.
      </p>
      <p>The use of overlapping character n-grams provides a surrogate form of morphological normalization. For
example, in Table 2 above, the n-gram “minist” could have been generated from several different forms such as
administer, administrative, minister, ministers, ministerial, or ministry. It could also come from an unrelated
word like feminist. Another advantage of n-gram indexing comes from the fact that n-grams containing
spaces can convey phrasal information. In the table above, 6-grams such as “rime-m”, “ime-mi”, and
“me-min” may act much like the phrase “prime minister” in a word-based index using multiple-word phrases.</p>
      <p>At last year’s workshop we examined different types of translation resources for bilingual retrieval and
espoused a language-neutral approach to retrieval. We continued in this vein and did not utilize stopword
lists or morphological analyzers.</p>
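      <p>As a minimal sketch of how overlapping character n-grams can be generated (this is not HAIRCUT's actual tokenizer), the Python function below slides a window of n characters across a string. The function name char_ngrams, the lowercasing, the whitespace collapsing, and the single space of padding at each end are all assumptions made for illustration. Interior spaces are kept, so grams that span a word boundary are produced; the paper's examples print that space as a hyphen.</p>
      <preformat>
```python
def char_ngrams(text, n=6):
    """Return the overlapping character n-grams of text, keeping interior spaces."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    # Assumed padding: one space on each side so word edges are visible.
    padded = " " + normalized + " "
    # Slide a window of n characters; strings shorter than n give an empty list.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


grams = char_ngrams("prime minister")
for gram in grams:
    print(repr(gram))
```
      </preformat>
      <p>With these assumptions, "prime minister" yields grams such as "rime m", "ime mi", and "me min" that straddle the word boundary, and the gram "minist" is shared with words such as "administrative" and "feminist".</p>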
      <p>Query Processing</p>
      <p>HAIRCUT performs rudimentary preprocessing on topic statements to remove stop structure, e.g., phrases
such as “… would be relevant” or “relevant documents should …”. We have constructed a list of about 1000
such English phrases from previous topic sets (mainly TREC topics) and these have been translated into
other languages using commercial machine translation. Other than this preprocessing, queries are parsed in
the same fashion as documents in the collection.</p>
      <p>In all of our experiments we used a linguistically motivated probabilistic model for retrieval. Our official
runs all used blind relevance feedback, though it did not improve retrieval performance in every instance. To
perform relevance feedback we first retrieved the top 1000 documents. We then used the top 20 documents
for positive feedback and the bottom 75 documents for negative feedback; however, we removed any
duplicate or near-duplicate documents from these sets. We then selected terms for the expanded query based on
three factors: a term’s initial query term frequency (if any); the cube root of the (α=3, β=2, γ=2) Rocchio
score; and a term similarity metric that incorporates IDF weighting. The 60 top-ranked terms were then used
as the revised query with words as indexing terms; 400 terms were used with 6-grams. In previous work we
penalized documents containing only a fraction of the query terms; we are no longer convinced that this
technique adds much benefit and have discontinued its use. As a general trend we observe a decrease in
precision at very low recall levels when blind relevance feedback is used, but both overall recall and mean
average precision are improved.</p>
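      <p>The Rocchio component of this term selection can be sketched as follows. This is an illustrative sketch only: it computes a textbook Rocchio score with the weights quoted above and keeps the top k terms, and it leaves out the cube root, the query term frequency factor, and the IDF-weighted similarity metric, whose exact combination is not given here. The function names, the use of mean term frequencies, and the toy data are assumptions.</p>
      <preformat>
```python
from collections import Counter


def mean_tf(docs):
    """Average term frequency over a list of {term: tf} documents."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    count = max(len(docs), 1)
    return {term: value / count for term, value in total.items()}


def rocchio_scores(query_tf, positive_docs, negative_docs,
                   alpha=3.0, beta=2.0, gamma=2.0):
    """Textbook Rocchio: alpha*query + beta*mean(positive) - gamma*mean(negative)."""
    pos = mean_tf(positive_docs)
    neg = mean_tf(negative_docs)
    terms = set(query_tf) | set(pos) | set(neg)
    return {
        t: alpha * query_tf.get(t, 0) + beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
        for t in terms
    }


def expand_query(query_tf, positive_docs, negative_docs, k=60):
    """Keep the k best-scoring terms (60 for words, 400 for 6-grams in the text)."""
    scores = rocchio_scores(query_tf, positive_docs, negative_docs)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy data: top-ranked documents are positive, bottom-ranked negative.
query = {"minister": 1}
top_docs = [{"minister": 2, "cabinet": 1}, {"cabinet": 3}]
bottom_docs = [{"weather": 4}]
expanded = expand_query(query, top_docs, bottom_docs, k=2)
print(expanded)
```
      </preformat>
      <p>In this toy example the expanded query keeps the original term and adds "cabinet", while "weather", which occurs only in the presumed irrelevant documents, receives a negative score and is dropped.</p>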
    </sec>
    <sec id="sec-3">
      <title>Monolingual Experiments</title>
      <p>We submitted an official run for each target language using only the &lt;title&gt; and &lt;desc&gt; fields and only
automatic processing. These official runs were actually combinations of two base runs, one using words
and one using 6-grams; both base runs also made use of blind relevance feedback. We again relied on a
statistical language model of retrieval and used the same parameters as last year. With words as indexing
terms we used queries expanded to include 60 terms and a smoothing parameter, alpha, of 0.30. When
6-grams were used instead, queries were expanded to 400 terms and alpha was set to 0.15. In both cases the top
20 documents were used as positive examples and the bottom 75 of 1000 were presumed irrelevant for the
purposes of query expansion. Our official results are shown below in Table 3.</p>
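      <p>To make the role of the smoothing parameter concrete, the sketch below scores a document with one common formulation of a statistical language model, linear interpolation between a document model and a collection model. The text above does not spell out the model, so the formula, the choice to let alpha weight the document model, the function name lm_score, and the toy counts are all assumptions for illustration.</p>
      <preformat>
```python
import math


def lm_score(query_terms, doc_tf, coll_tf, alpha):
    """Log-probability of the query under an interpolated language model.

    Assumed form: p(t) = alpha * P(t | document) + (1 - alpha) * P(t | collection).
    The collection is assumed to be non-empty.
    """
    doc_len = sum(doc_tf.values())
    coll_len = sum(coll_tf.values())
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(term, 0) / coll_len
        p = alpha * p_doc + (1.0 - alpha) * p_coll
        if p == 0.0:
            # The term is unseen everywhere, so the query has zero probability.
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical toy counts; alpha = 0.30 is the value quoted for word indexes.
doc = {"prime": 1, "minister": 1}
coll = {"prime": 2, "minister": 2, "weather": 4}
example = lm_score(["prime", "weather"], doc, coll, alpha=0.30)
print(example)
```
      </preformat>
      <p>Under this assumed form, a query term that is missing from a document still contributes through the collection model, which is why a document is not ruled out by a single unmatched term.</p>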
      <p>
        The recall at 1000 documents is very high relative to the number of relevant documents in each of the
subcollections. Since we created runs by combining distinct runs (one using words, one using 6-grams) we
should examine the individual performance using each method. Figure 1 contains a plot that shows the mean
average precision obtained for each sub-collection using both approaches. We note that in English and the
Romance languages, the use of words yields slightly better performance, an improvement of 0.010 to 0.025
in absolute terms. We reported observing the same trend for French and Italian during last year’s evaluation
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In the Dutch sub-collection, little difference is seen, but 6-grams are clearly advantageous in the
remaining languages. A sizeable difference is seen in German (0.035) and Swedish (0.023) and, far more
significantly, in Finnish (0.13).
In Figure 1 we also plot the performance of the combined runs. Combination was generally beneficial, but
due to the large disparity between n-grams and words for Finnish, the technique depressed performance in
that language.
      </p>
      <p>We performed the same analysis when blind relevance feedback was not performed and found similar
results. Performance without feedback was generally lower than with automated feedback. Also,
within each language, differences between techniques were larger without feedback. By averaging across all
languages, we saw that feedback improved the microaveraged mean average precision from 0.3479 to 0.4141
when words were used, and from 0.3729 to 0.4295 when 6-grams were used. If, as it seems, n-grams are
more effective for retrieval in languages with complex morphology, then the fact that the two approaches
achieved more similar performance when feedback was employed would support the notion that automatic
relevance feedback improves performance by redressing the effect of inflectional variation.</p>
    </sec>
    <sec id="sec-4">
      <title>Bilingual Experiments</title>
      <p>Our official bilingual submissions were based on query translation when some attempt at translation was
made; we submitted one run for each document collection. For each collection, save English, we created one
run using the English query statements and the title and description fields. The runs are named using the
template aplbienxx. For these 7 runs, we used pre-translation expansion using the L.A. Times collection;
queries were expanded to 60 terms and we used statistical word-by-word translations mined from an aligned
parallel collection. This collection is an expanded version of the corpus we obtained from the Europa web
site (details follow). We used unnormalized words for these bilingual experiments because we have not yet
used our parallel collection to generate statistical translations that are character n-grams; we want to
investigate this, but for the evaluation, we simply used words. Since the track guidelines allowed 10 runs
to be submitted, we submitted three other runs. The first, aplbipten, used the Portuguese topic statements to
search the English sub-collection; our motivation here was only to submit a run using these statements.
The final two runs, aplbiptesa and aplbiptesb, made no use of translation whatsoever.</p>
      <p>The seven runs produced using English queries first performed pre-translation expansion using the L.A.
Times sub-collection. The query was expanded to include 60 words, and then each term was translated, if
possible, using the Europa corpus for translation. Then two runs were made, one using pre-translation
expansion alone and one using both pre- and post-translation query expansion. Scores for these two runs
were normalized and merged together to form our official submission. The eighth run, aplbipten, used the
Portuguese topic statements and no pre-translation expansion was attempted. However, two runs were still
combined, one with no expansion and one that made use of blind relevance feedback. Results for these runs
are shown below in Table 4.</p>
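      <p>The step of normalizing and merging the scores of two runs can be sketched as below. The normalization and merging formulas are not stated in the text, so this sketch assumes min-max normalization followed by a simple sum of normalized scores; the function names and the toy run data are hypothetical.</p>
      <preformat>
```python
def normalize(run):
    """Min-max normalize a non-empty {doc_id: score} run into the range 0 to 1."""
    lo, hi = min(run.values()), max(run.values())
    span = hi - lo
    # If every score is identical, give each document the same weight.
    return {doc: (score - lo) / span if span else 1.0 for doc, score in run.items()}


def combine_runs(run_a, run_b):
    """Sum normalized scores; a document missing from one run contributes 0 there."""
    a, b = normalize(run_a), normalize(run_b)
    merged = {doc: a.get(doc, 0.0) + b.get(doc, 0.0) for doc in set(a) | set(b)}
    # Rank by descending combined score, breaking ties by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical runs: one with pre-translation expansion only, one with both
# pre- and post-translation expansion.
pre_only = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
pre_and_post = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
ranking = [doc for doc, _ in combine_runs(pre_only, pre_and_post)]
print(ranking)
```
      </preformat>
      <p>A document retrieved by both runs, such as d2 here, is rewarded by the sum, which is one reason a combination can outperform either base run.</p>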
      <p>
        The results for each run in Table 4 are not comparable to one another because a different target language
collection was involved. Furthermore, the last column, which reports the comparison to a target language
monolingual baseline using mean average precision, is not especially meaningful. It is unfair to compare
against our monolingual baseline for two reasons. First, Voorhees has pointed out that a comparison between
test-sets using different topic statements (as is the case here) is not justified even though the document
collections are the same [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; the various translations of each topic may result in queries that are significantly
easier in one language than another. Second, slightly different algorithms were used in our monolingual and
bilingual results. Our monolingual runs were formed through merging n-gram and word-based runs while our
bilingual results only used words. Also, the bilingual runs all used pre-translation expansion over the English collection,
which itself only contained relevant documents for 42 of the 50 topics.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Improved Translation Resource</title>
      <p>The quality of translation resources is a critical driver for CLIR performance. Therefore, it is important to
select a translation approach that ensures translation of important query terms. At our disposal we had
translation software (Systran, L&amp;H Power Translator, and various on-line services), bilingual dictionaries
automatically extracted from lists on the Web, and a large parallel corpus. We investigated each of these
methods in our 2001 paper and found that when only a single source was used, best performance was
obtained by using the parallel collection for translation. We decided to expand the parallel collection and use
it for our official runs.</p>
      <p>
        The collection was obtained through a nightly crawl of the Europa web site where we targeted the Official
Journal of the European Union [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The Journal is available in each of the E.U. languages and consists
mainly of governmental topics, for example, trade and foreign relations. We had data available from
December 2000 through May 2002. Though focused on European topics, the time span is 5-7 years after the
CLEF-2002 document collection, so it is possible that many proper names from 1994 and 1995 will be
mentioned rarely, if at all. The Journal is published electronically in PDF format and we wanted to create an
aligned collection. Rather than attempt an 11-language, multiply aligned collection, we simply wasted disk
space and performed redundant alignments. At present we have not aligned all O(n<sup>2</sup>) pairs, but instead
created n alignments between English and the other languages. We used the publicly available pdftotext
package to extract text from the PDF files, but Greek text is not supported by the software so we neglected this
language<sup>1</sup>. Once converted to text, documents were split into pieces using conservative rules for page breaks
and paragraph breaks. Many of the documents are written in outline form, or contain large tables, so this task
is not trivial. Approximately 20 GB of PDF documents are involved; we find that the PDF files are
approximately ten times larger than the plain text versions. Thus we have roughly 2 GB of text overall, or
about 200 MB of text in each language, that may be aligned.
      </p>
      <p>Once aligned, we indexed each sub-collection using the same technique described for the CLEF-2002
document collections; in particular, unnormalized words were used as indexing terms. We relied on query
term translation and extracted candidate translations as follows. First, we would take a candidate term as
input and identify documents containing this word in the English subset of the aligned collection. Up to 5000
documents were considered; we bounded the number for reasons of efficiency and because we felt that
performance was not enhanced appreciably when a greater number of documents was used. If no document
contained this term, then the word itself was left untranslated. Second, we would identify the corresponding
documents in the target language. Third, using a similarity metric that is similar to mutual information, we
would extract a single potential translation using the frequency of occurrence in the whole collection and the
frequency in the subset of aligned documents that are believed to contain a mapping for the original
source term.</p>
      <p><sup>1</sup> We are very interested in having Danish and Portuguese document collections added to the CLEF test set.</p>
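      <p>The three steps above can be sketched in Python as follows. The actual similarity metric is described only as being similar to mutual information, so the scoring function here (subset document frequency times the log ratio of subset to collection relative frequency) is a stand-in chosen for illustration. The function name translate_term, the token-list data layout, and the toy aligned corpus are likewise assumptions.</p>
      <preformat>
```python
import math
from collections import Counter


def translate_term(term, english_docs, target_docs, max_docs=5000):
    """Pick one translation for term from an aligned corpus.

    english_docs[i] and target_docs[i] are token lists for aligned documents.
    """
    # Step 1: English documents containing the term, capped at max_docs.
    hits = [i for i, doc in enumerate(english_docs) if term in doc][:max_docs]
    if not hits:
        return term  # no evidence: leave the word untranslated
    # Step 2: document frequencies in the aligned subset and in the whole collection.
    coll_df = Counter(w for doc in target_docs for w in set(doc))
    sub_df = Counter(w for i in hits for w in set(target_docs[i]))
    if not sub_df:
        return term
    n_all, n_sub = len(target_docs), len(hits)

    # Step 3: assumed stand-in for the paper's mutual-information-like metric.
    def score(word):
        ratio = (sub_df[word] / n_sub) / (coll_df[word] / n_all)
        return sub_df[word] * math.log(ratio)

    # Iterate in sorted order so ties are broken deterministically.
    return max(sorted(sub_df), key=score)


# Hypothetical aligned English/Spanish toy corpus.
english = [["the", "minister", "travels"], ["the", "minister", "arrives"],
           ["the", "weather", "report"], ["the", "report"]]
spanish = [["el", "ministro", "viaja"], ["el", "ministro", "llega"],
           ["el", "informe", "del", "tiempo"], ["el", "informe"]]
print(translate_term("minister", english, spanish))
print(translate_term("zebra", english, spanish))
```
      </preformat>
      <p>A frequent function word such as "el" appears in every aligned document but also in the whole collection, so its ratio is 1 and its score is 0; a word concentrated in the aligned subset scores higher.</p>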
    </sec>
    <sec id="sec-6">
      <title>Pre-translation Expansion</title>
      <p>We are still analyzing the effect of query expansion on retrieval performance and will report on it in the final
version of the paper.</p>
    </sec>
    <sec id="sec-7">
      <title>No Translation</title>
      <p>
        In previous work we have shown that reasonably good retrieval between two related languages is possible,
without any translation at all. Though the use of cognate matches has been known for some time (e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), we
found that pre-translation expansion using a comparable expansion corpus enhances performance, in some
cases by 200-300% [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. During last year’s campaign we also noted that n-grams were almost twice as
effective as words in this scenario [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This year, we wanted to conduct similar work that looked at a variety
of language pairs, in contrast to our previous work, which only used English as the target language. We
looked at several language pairs and hoped to see a difference in performance when this method was used
between close languages. Our hypothesis was that translation-less retrieval between related languages (say, within the
Romance group) would be more effective than when this approach was used between, say, German and
Spanish.
      </p>
      <p>For these runs, we did not use pre-translation expansion (though we hope to examine this in the future). We
did compare performance using words and n-grams. Our two official runs for this experiment were aplbiptesa
and aplbiptesb. The first used 6-grams as indexing terms while the latter used words. Both runs used blind
relevance feedback. Results for these two runs are shown below in Table 5.</p>
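      <table-wrap id="tab5">
        <label>Table 5</label>
        <caption>
          <p>Official results for the bilingual task using no translation, the Portuguese topic statements, and the Spanish news collection.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>run id</th>
              <th>topic fields</th>
              <th>type</th>
              <th>average precision</th>
              <th>recall (at 1000)</th>
              <th># topics</th>
              <th>% mono</th>
              <th>% Eng bilingual</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>aplbiptesa</td>
              <td>TD</td>
              <td>6-grams</td>
              <td>0.3325</td>
              <td>2071 / 2854</td>
              <td>50</td>
              <td>64.04%</td>
              <td>92.31%</td>
            </tr>
            <tr>
              <td>aplbiptesb</td>
              <td>TD</td>
              <td>words</td>
              <td>0.2000</td>
              <td>1589 / 2854</td>
              <td>50</td>
              <td>38.52%</td>
              <td>55.52%</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>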
      <p>It is interesting to note that with no translation whatsoever and the use of 6-grams as indexing terms,
performance was 92% of that obtained when English topics were translated to Spanish. This is still not a fair
comparison (the English topics might be particularly hard, for example), but it is surprisingly good. The
mean precision at 5 docs for aplbiptesa was 0.3920; on average, two out of the five top documents were
relevant, despite not translating the queries. We examined several other language pairs as well, but have not
looked at all n(n-1) cases. These other results were not official runs.</p>
      <p>As would be expected, retrieval without translation is more effective in closely related language pairs. In the
table above, we see that German retrieval against Dutch is almost 50% as effective as monolingual Dutch
retrieval when using 6-grams; similarly, Dutch retrieval against German is about 60% as effective as
monolingual German retrieval. This strongly suggests that for language pairs with few direct translation
resources, translation to a closely related language for which translation from the source language is feasible
can result in good cross-language retrieval performance. It remains to examine whether shorter n-grams
result in even greater performance, and whether pre-translation expansion improves this approach. Our
previous experiments on the CLEF-2001 collection would suggest the latter, but in those we only examined
language pairs where English was the target language.</p>
    </sec>
    <sec id="sec-8">
      <title>Multilingual Experiments</title>
      <p>To date, our experiments in multilingual merging have not found a technique that results in producing a high
quality, single ranked list from documents in many languages. Last year we experimented with methods that
tried to normalize document similarity scores and to produce a single list. This year we submitted two
official runs that used either merge-by-score (aplmuena) or merge-by-rank (aplmuenb). As in the past, we
found these two methods comparable, but not tremendously effective. However, no more suitable method
has been proposed.</p>
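      <p>The two merging strategies can be contrasted with a short sketch. The exact procedures behind aplmuena and aplmuenb are not detailed here, so this code assumes that merge-by-score pools all documents and sorts them by scores that are already comparable across collections, and that merge-by-rank interleaves the per-language lists round-robin. The function names and toy rankings are hypothetical.</p>
      <preformat>
```python
from itertools import zip_longest


def merge_by_score(ranked_lists):
    """Pool (doc_id, score) pairs from every list and sort by descending score."""
    pooled = [pair for ranking in ranked_lists for pair in ranking]
    # sorted() is stable, so tied scores keep their original list order.
    return [doc for doc, _ in sorted(pooled, key=lambda pair: -pair[1])]


def merge_by_rank(ranked_lists):
    """Interleave lists round-robin: all rank-1 documents, then all rank-2, etc."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        # zip_longest pads exhausted lists with None; skip those slots.
        merged.extend(pair[0] for pair in tier if pair is not None)
    return merged


# Hypothetical per-language rankings, each sorted by descending score.
german = [("de1", 0.9), ("de2", 0.4)]
french = [("fr1", 0.7), ("fr2", 0.6), ("fr3", 0.1)]
print(merge_by_score([german, french]))
print(merge_by_rank([german, french]))
```
      </preformat>
      <p>Merge-by-score depends on scores being comparable across collections, whereas merge-by-rank ignores scores entirely and so gives every collection equal representation regardless of how many relevant documents it holds.</p>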
      <p>
        We have been intrigued by work by researchers at the University of California at Berkeley that addresses this
problem in a way that does not require score normalization. Gey et al. create a single inverted file from
documents in many languages and then, to score documents, they create a composite query composed of a
query statement in a single language concatenated with translations of that query in the other collection
languages [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This approach results in a single ranked list, and it appears to work well with Berkeley’s
logistic regression approach to retrieval. In the CLEF-2001 campaign we examined this method using both
unnormalized words and character 5-grams. Our results with simple words were disappointing and the
5-grams, though significantly better, did not perform as well as simple merging approaches. We do not yet
understand why our results differ from those reported by Berkeley, but the fact that we use a different
model of retrieval may be responsible.
      </p>
      <p>This year, we also attempted an approach that is the dual of the one described above. Rather than translate queries
into every language, we created an index in which every document was transformed into a single
language. We picked English as our interlingua and mapped each document into English using a
bag-of-words approach to translation. Strictly speaking, we did not perform translation of the documents. Rather, we
took the indexed document representations from our monolingual indexes, loaded a hash table into memory
that contained a bilingual wordlist, and created a new inverted file where the posting lists were for English
words (or untranslatable foreign terms) that referred to documents from the different languages. We also
included the native English documents. Because we felt lexical coverage was most important, we translated
the document representations by mapping each source word into all of its candidate (English)
translations. We probably should have removed stopwords, but did not do so. This process is linear in the
size of the collection since the hash-table lookups are O(1) per word occurrence.</p>
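      <p>A minimal sketch of this document-representation mapping is given below. It is not the HAIRCUT implementation: the data layout, the function name build_english_index, and the toy wordlist are assumptions, and posting lists are reduced to sets of document ids with no term frequencies. Each source term is replaced by all of its candidate English translations, and terms missing from the wordlist (including every term of a native English document) are kept unchanged.</p>
      <preformat>
```python
from collections import defaultdict


def build_english_index(doc_terms, wordlists):
    """Build {english_term: set of doc_ids} from monolingual representations.

    doc_terms: {doc_id: (language, [terms])}
    wordlists: {language: {source_word: [english candidates]}}
    """
    index = defaultdict(set)
    for doc_id, (language, terms) in doc_terms.items():
        # English documents have no wordlist, so their terms pass through.
        table = wordlists.get(language, {})
        for term in terms:
            # One dictionary lookup per word occurrence; untranslatable
            # terms fall back to the original string.
            for english in table.get(term, [term]):
                index[english].add(doc_id)
    return index


# Hypothetical toy collection and bilingual wordlist.
docs = {
    "es-1": ("es", ["primer", "ministro", "madrid"]),
    "en-1": ("en", ["prime", "minister"]),
}
wordlists = {"es": {"primer": ["first", "prime"], "ministro": ["minister"]}}
index = build_english_index(docs, wordlists)
print(sorted(index["minister"]))
```
      </preformat>
      <p>An English query for "prime minister" can then be run against this single index and will match both the Spanish and the English document, with no query-time translation and no merging step.</p>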
      <p>This approach creates an index with several peculiar characteristics. First, it makes the foreign language
document representations a bit larger, since on average, a term may have 2 or 3 potential translations. Also,
the original English documents are somewhat more focused since they don’t have erroneous translations in
their representations. Still, we are left with an approach where we can take a query in our preferred language
(preferred here because we have good resources for it) and simply run it against our transformed document
collection. This approach (aplmuend) appears to be 18% more effective than our officially submitted runs
using normalization and merging. Interestingly, precision at a small number of documents was greatly
enhanced, and recall at 1000 docs suffered; however, a subsequent combination with run aplmuena restored
the overall recall (aplmuenq). Furthermore, this method creates a composite ‘English’ index in time linear
with the collection size and requires no query-time translation or post-retrieval processing (e.g., merging).
See Table 7 for a comparison of this and our two official runs.</p>
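      <table-wrap id="tab7">
        <label>Table 7</label>
        <caption>
          <p>Multilingual results.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>run id</th>
              <th>topic fields</th>
              <th>average precision</th>
              <th>recall (at 1000)</th>
              <th>precision at 5 docs</th>
              <th>remarks</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>aplmuena</td>
              <td>TD</td>
              <td>0.2070</td>
              <td>4729 / 8068</td>
              <td>0.4680</td>
              <td>official; score-based merge</td>
            </tr>
            <tr>
              <td>aplmuenb</td>
              <td>TD</td>
              <td>0.2082</td>
              <td>4660 / 8068</td>
              <td>0.4480</td>
              <td>official; rank-based merge</td>
            </tr>
            <tr>
              <td>aplmuend</td>
              <td>TD</td>
              <td>0.2447</td>
              <td>3394 / 8068</td>
              <td>0.5760</td>
              <td>translation of document representations</td>
            </tr>
            <tr>
              <td>aplmuenq</td>
              <td>TD</td>
              <td>0.2456</td>
              <td>4766 / 8068</td>
              <td>0.5600</td>
              <td>combination of aplmuena and aplmuend</td>
            </tr>
            <tr>
              <td>aplmuenz</td>
              <td>TD</td>
              <td>0.2265</td>
              <td>4772 / 8068</td>
              <td>0.4840</td>
              <td>score-based merge using monolingual runs</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>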
One final thing we did for this year’s multilingual task was to try to isolate the effect of losses due to query
translation and multi-collection merging. To do so, we took monolingual runs for each of the
collections and attempted to merge them (aplmuenz). We found slightly better average precision when doing
this, as might be expected. We think this is an interesting way to investigate the multilingual problem; it
reduces the problem to that of finding a good merging strategy, which still seems like one of the most viable
approaches to MLIR.</p>
    </sec>
    <sec id="sec-9">
      <title>Conclusions</title>
      <p>We set out to investigate how well a simplified approach to CLIR would work. By applying our
language-neutral philosophy, we were able to submit monolingual and bilingual runs for each of the document
collections. We repeated previous experiments and confirmed that character n-grams work well in many
languages, including Finnish and Swedish which we had not previously studied. N-grams appear to have a
decided advantage over words in Finnish retrieval. We also examined retrieval using cognate matches
between close and less close language pairs; as expected, performance is higher (relative to a monolingual
baseline) with related pairs. Finally, we implemented a novel approach to multilingual retrieval that is
similar to document translation: we transformed a bag-of-words representation of documents in many
languages into a corresponding set of English terms using a bilingual dictionary. This processing is efficient
and can be done at indexing time. As a result, multilingual queries from a single interlingua can be processed
with no additional query-time processing beyond that normal for monolingual retrieval. Our preliminary
results indicate that this approach is also 18% more effective than a baseline using score normalization and
merging.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Walz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          , '
          <article-title>Using Clustering and Super Concepts within SMART: TREC-6'</article-title>
          . In E. Voorhees and D. Harman (eds.),
          <source>Proceedings of the Sixth Text REtrieval Conference (TREC-6)</source>
          ,
          <source>NIST Special Publication 500-240</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Larson</surname>
          </string-name>
          , '
          <article-title>Manual Queries and Machine Translation in Cross-language Retrieval and Interactive Retrieval with Cheshire II at TREC-7'</article-title>
          . In E. M. Voorhees and D. K. Harman (eds.),
          <source>Proceedings of the Seventh Text REtrieval Conference (TREC-7)</source>
          , pp.
          <fpage>527</fpage>
          -
          <lpage>540</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>McNamee</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mayfield</surname>
          </string-name>
          , '
          <article-title>JHU/APL Experiments at CLEF: Translation Resources and Score Normalization'</article-title>
          . In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck (eds.),
          <source>Evaluation of Cross-Language Information Retrieval Systems: Proceedings of the CLEF 2001 Workshop, Lecture Notes in Computer Science 2406</source>
          , Springer, pp.
          <fpage>193</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          and
          <string-name>
            <given-names>James</given-names>
            <surname>Mayfield</surname>
          </string-name>
          , '
          <article-title>Comparing Cross-Language Query Expansion Techniques by Degrading Translation Resources'</article-title>
          .
          <source>In the Proceedings of the 25th Annual International Conference on Research and Development in Information Retrieval (SIGIR-2002)</source>
          , Tampere, Finland,
          <year>August 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          , '
          <article-title>The Philosophy of Information Retrieval Evaluation</article-title>
          .' In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck (eds.),
          <source>Evaluation of Cross-Language Information Retrieval Systems: Proceedings of the CLEF 2001 Workshop, Lecture Notes in Computer Science 2406</source>
          , Springer, pp.
          <fpage>355</fpage>
          -
          <lpage>370</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] http://europa.eu.int/</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>