<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianni Amati</string-name>
          <aff>Fondazione Ugo Bordoni</aff>
          <email>gba@fub.it</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2002</year>
      </pub-date>
      <fpage>388</fpage>
      <lpage>400</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>3 Description of PROSIT</title>
      <sec>
        <title>3.1 Term weighting</title>
        <p>The framework is based on computing the information gain for each query term. The information gain is obtained by combining three distinct probabilistic processes: the process computing the information content of the term with respect to the entire collection; the process computing a conditional probability of occurrence of the term with respect to an "Elite" set of documents, which is the set of documents containing the query term; and the process deriving the term frequency within the document normalized to the average document length. The framework thus consists of three independent components, defined over the following quantities:</p>
        <list list-type="bullet">
          <list-item><p>tf: the within-document term frequency</p></list-item>
          <list-item><p>l: the length of the document</p></list-item>
          <list-item><p>avg_l: the average length of documents</p></list-item>
          <list-item><p>the size of the collection</p></list-item>
          <list-item><p>the size of the elite set E<sub>t</sub> of the term</p></list-item>
          <list-item><p>the total number of term occurrences in its elite set</p></list-item>
        </list>
        <p>The weighting model used in our experiments is called B<sub>E</sub>L2 (B<sub>E</sub> stands for Bose-Einstein and L for Laplace).</p>
        <p>For both the TREC-10 and CLEF collections we have introduced a parameter for document length normalization, which enhances the retrieval outcome. A correcting factor c = 3 may be inserted into the term frequency normalization, giving:</p>
        <disp-formula><tex-math><![CDATA[\mathit{tfn} = \mathit{tf} \cdot \log_2\left(1 + \frac{c \cdot \mathit{avg\_l}}{l}\right)]]></tex-math></disp-formula>
        <p>Formulas 1, 2 and 3 produce a first ranking of possibly relevant documents. The topmost ones are candidates to be assessed relevant, and therefore we might consider them to constitute a second, different "Elite set T of documents", namely the documents which best describe the content of the query. In our experiments we considered only 3 documents as pseudo-relevant and extracted from them the 10 most informative terms, which were added to the original query. The most informative terms are selected by using the information-theoretic Kullback-Leibler divergence function.</p>
      </sec>
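      <sec>
        <p>As a minimal sketch of the term frequency normalization with the correcting factor c, the following Python function computes tfn = tf * log2(1 + c * avg_l / l). The function and parameter names are illustrative assumptions and are not taken from PROSIT; only the value c = 3 comes from the text.</p>
        <preformat>
```python
import math


def normalized_tf(tf, doc_len, avg_doc_len, c=3.0):
    """Return tfn = tf * log2(1 + c * avg_doc_len / doc_len).

    All names are hypothetical; c = 3 is the correcting factor quoted in the text.
    """
    if not doc_len > 0:
        raise ValueError("doc_len must be positive")
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


if __name__ == "__main__":
    # Document of average length: log2(1 + 3) = 2, so tf is doubled.
    print(normalized_tf(tf=4, doc_len=100, avg_doc_len=100))
    # Document three times the average length: log2(1 + 1) = 1, so tf is unchanged.
    print(normalized_tf(tf=4, doc_len=300, avg_doc_len=100))
```
        </preformat>
        <p>With this normalization, occurrences in documents longer than average count for less, and a larger c weakens that penalty.</p>
      </sec>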
    </sec>
    <sec id="sec-2">
      <p>In the second retrieval we used the term weighting function:
w = … (6)
…, that is, according to Formulas 2 and 3.</p>
    </sec>
    <sec id="sec-3">
      <title>4 Augmenting PROSIT with bigrams</title>
      <p>PROSIT, like most information retrieval systems, is based on index units consisting of single keywords, or unigrams. We attempted to improve its performance by using two-word index units (bigrams).</p>
      <p>Bigrams are good for disambiguating terms and for handling topic drift, i.e., when the results of queries on specific aspects of wide topics contain documents that are relevant to the general topic but not to the requested aspect of it. This phenomenon can also be seen as some query terms matching out of context of their relationships to other terms [4]. For instance, using unigrams most returned documents for the CLEF query "Kaurismaki films" were about other famous films, whereas the use of bigrams considerably improved the precision of search.</p>
      <p>On the other hand, some bigrams that are generated automatically may, in turn, over-emphasize concepts that are common to both relevant and nonrelevant documents [9]. So far, the results about the effectiveness of bigrams versus unigrams have not been conclusive.</p>
      <p>We used a simple technique known as lexical affinities. Lexical affinities are identified by finding pairs of words that occur close to each other in a window of some predefined small size. For the CLEF experiments, we used the query title and chose a distance of 5 words. All the bigrams generated this way are seen as new index units and are used to increase the relevance of those documents that have the same pair of words occurring within the specified window. The score assigned to bigrams is computed using the same weighting function used for unigrams.</p>
      <p>From an implementation point of view, in order to efficiently compute the bigram scores it is necessary to encode the information about the position of each term in each document into the inverted file. During query evaluation, for each bigram extracted from the query, the posting lists associated with the bigram words in the inverted file are merged and a new pseudo posting list is created that contains all documents that contain the bigram along with the relevant occurrence information.</p>
      <p>The lexical affinity technique was reported to produce very good results on the web TREC collection, even better than those obtained using unigrams [5]. However, we were not able to obtain such good results on the CLEF collection. In fact, we found that the bigram performance was considerably worse than the unigram performance; even when combining the scores, the performance remained lower than that obtainable by using just unigram scores.</p>
      <p>Based on these observations, we made two choices. First, we decided to use bigrams in addition to, not in place of, single words. Second, instead of running two separate ranking systems, one for unigrams and the other for bigrams, and then combining their scores, we tried to incorporate the bigram component directly into PROSIT’s main algorithm.</p>
      <p>The bigram scores were thus combined with the unigram score to produce the first-pass ranking of PROSIT. In this way one can hope to increase the quality of the documents on which the subsequent query expansion step is based. This may happen because more top relevant documents are retrieved or because the nonrelevant documents which contribute to query expansion are more similar to the query.</p>
      <p>After the first ranking was computed using unigram and bigram scores, the top documents were used to generate the expanded query, and PROSIT computed the second ranking as if it were just using unigrams. We chose not to expand the original query with two-word units due to the dimensionality problem, and we did not use the bigram method during the second-pass ranking of PROSIT because the order of words in the expanded query is not relevant.</p>
      <p>We submitted one run to CLEF 2002, labelled "fub02l", which was produced using PROSIT augmented with the bigrams procedure just described.</p>
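      <sec>
        <p>The bigram procedure of Section 4 (a positional inverted file, and a pseudo posting list built by merging the posting lists of the two words) might be sketched in Python as follows. This is only an illustration under stated assumptions: the index layout, the function names, and the reading of "a distance of 5 words" as "positions at most five apart, in either order" are hypothetical and are not taken from PROSIT.</p>
        <preformat>
```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} (a toy positional inverted file)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def bigram_postings(index, first, second, window=5):
    """Merge two posting lists into a pseudo posting list for the word pair.

    Returns {doc_id: count}, where count is the number of position pairs of
    the two words that lie between 1 and `window` positions apart (assumed
    interpretation of the window; word order is ignored).
    """
    left = index.get(first, {})
    right = index.get(second, {})
    postings = {}
    for doc_id in sorted(set(left).intersection(right)):
        count = sum(
            1
            for p in left[doc_id]
            for q in right[doc_id]
            if abs(p - q) in range(1, window + 1)
        )
        if count:
            postings[doc_id] = count
    return postings


if __name__ == "__main__":
    docs = {
        "d1": "i film di kaurismaki sono famosi".split(),
        "d2": "kaurismaki ha diretto molti altri celebri e famosi film".split(),
        "d3": "film famosi".split(),
    }
    index = build_positional_index(docs)
    # Only d1 has both words within five positions of each other.
    print(bigram_postings(index, "kaurismaki", "film"))
```
        </preformat>
        <p>Under these assumptions, the count stored for each document could play the role of the within-document frequency when the bigram is scored with the same weighting function used for unigrams.</p>
      </sec>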
    </sec>
    <sec id="sec-4">
      <title>Reranking with coordination level matching</title>
      <p>Consistent with earlier effectiveness results, most information retrieval systems are based on best-matching algorithms between query and documents. However, the use of very short queries on the part of most users and the prevailing interest in precision rather than recall have fostered new research on exact matching retrieval, seen as an alternative or as a complementary technique to traditional best-matching retrieval. In particular, it has been shown that taking into account the number of query words matched by the documents to rerank retrieval results may improve performance in certain situations (e.g., [7], [3]).</p>
      <p>For the CLEF experiments, we focused on the query title. The goal was to prefer documents that matched all of the query keywords above documents that matched all but one of the keywords, and so on. To implement this strategy, we modified the standard best-matching similarity score between query and documents, computed as explained in Section 2, by adding a much larger addendum to it which was proportional to the number of terms shared by the document and the query title. In this way, the documents were partially ordered according to their coordination level matching with the query title, with ties being broken using their best-matching similarity score to the query.</p>
      <p>However, the results were somewhat disappointing. We obtained a much better retrieval effectiveness by simply preferring the documents that contained all the words of the query title, without paying attention to lower levels of coordination matching. This was our choice (run fub02b).</p>
      <p>Finally, we submitted a fourth run by using the fully enhanced version of PROSIT, i.e., bigrams + coordination level matching (run fub02lb).</p>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>We tested PROSIT and its three variants (i.e., PROSIT with bigrams, PROSIT with coordination level matching (CLM), and PROSIT with both bigrams and coordination level matching) on the CLEF 2001 and CLEF 2002 Italian monolingual tasks. Table 1 shows the retrieval performance of the four systems on the two test collections using the average precision as evaluation measure.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Average precision on the CLEF 2001 and CLEF 2002 Italian monolingual tasks.</p>
        </caption>
        <table>
          <tbody>
            <tr>
              <td>PROSIT+bigrams</td>
              <td>0.5208</td>
              <td>0.5088</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The results of Table 1 show that the performance of standard PROSIT was excellent on both test collections, with the value obtained for CLEF 2001 (0.5116) being much higher than the result of the best system at CLEF 2001 (0.4865). This result confirms the high effectiveness of the probabilistic ranking model implemented in PROSIT, which is exclusively based on simple document and query statistics.</p>
      <p>Table 1 also shows that, in general, the variations in performance when passing from basic PROSIT to enhanced PROSIT were small. In particular, the use of bigrams improved performance across both test collections, whereas the use of coordination level matching was slightly beneficial for CLEF 2001 and detrimental for CLEF 2002. Combining both enhancements improved the retrieval performance over using CLM alone on both test collections, but it was still worse than baseline performance on the CLEF 2002 collection and worse than using bigrams alone on CLEF 2002.</p>
    </sec>
    <sec id="sec-6">
      <p>It should also be noted that we experimented with other types of multi-word index units, by using windows of different size and by selecting a larger number of words. However, we found that using just two words with a window of size 5 was the optimal choice.</p>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusions</title>
      <p>We have experimented with the PROSIT system on the Italian monolingual task and have explored the use of bigrams and coordination level matching within PROSIT’s main algorithm. From our experimental evaluation, the following main conclusions can be drawn.</p>
      <list list-type="bullet">
        <list-item>
          <p>The novel probabilistic model implemented in PROSIT achieved high retrieval effectiveness on both the CLEF 2001 and CLEF 2002 Italian monolingual tasks. These results are even more remarkable considering that the system employs very simple indexing techniques and does not rely on any specialised or ad hoc natural language processing techniques.</p>
        </list-item>
        <list-item>
          <p>Using bigrams in the place of unigrams hurt performance; the combination of bigram scores and unigram scores performed better, but it was still inferior to the results obtained by using unigrams alone. However, using the bigram scores in the first-pass ranking, just to rank the documents used for query expansion, resulted in a performance improvement. These results held across both test collections.</p>
        </list-item>
        <list-item>
          <p>Using coordination level matching to rerank the retrieval results did not, in general, improve performance. Favouring the documents that contained all the keywords in the query title worked better on one test collection and worse on the other, whereas ordering the documents according to their level of coordination matching hurt performance on both test collections.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-8">
      <p>Overall, the results about the enhanced versions of PROSIT are inconclusive. More work is needed to collect further evidence about their effectiveness, e.g., by using a more representative sample of performance measures or by considering other query scenarios. Besides a more robust evaluation of retrieval performance, a better understanding of why integrating bigrams into PROSIT’s main algorithm yielded positive results in the experiments reported in this paper would be useful. This might be done, for instance, by analysing the variations in quality of the top ranked documents used for query expansion or by performing a query-by-query analysis of concept drift in the final retrieved documents.</p>
      <p>We regret that due to a tight schedule we were not able to test PROSIT on the other CLEF monolingual tasks. However, as the application of PROSIT to the Italian task did not require any special work, we are confident that with a small effort we could obtain similar results for the other languages. This is left for future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref-1">
        <label>[1]</label>
        <mixed-citation>Gianni Amati, Claudio Carpineto, and Giovanni Romano. FUB at TREC 10 web track: a probabilistic framework for topic relevance term weighting. In E.M. Voorhees and D.K. Harman, editors, Proceedings of the 10th Text Retrieval Conference TREC 2001, pages 182-191, Gaithersburg, MD, 2002. NIST Special Publication 500-250.</mixed-citation>
      </ref>
      <ref id="ref-2">
        <label>[2]</label>
        <mixed-citation>Gianni Amati and Cornelis Joost van Rijsbergen. Probabilistic models of information retrieval based on measuring divergence from randomness. ACM Transactions on Information Systems, (to appear), 2002.</mixed-citation>
      </ref>
      <ref id="ref-3">
        <label>[3]</label>
        <mixed-citation>E. Berenci, C. Carpineto, V. Giannini, and S. Mizzaro. Effectiveness of keyword-based display and selection of retrieval results for interactive searches. International Journal on Digital Libraries, 3(3):249-260, 2000.</mixed-citation>
      </ref>
      <ref id="ref-4">
        <label>[4]</label>
        <mixed-citation>D. Bodoff and A. Kambil. Partial coordination. I. The best of pre-coordination and post-coordination. JASIS, 49(14):1254-1269, 1998.</mixed-citation>
      </ref>
      <ref id="ref-6">
        <label>[6]</label>
        <mixed-citation>C. Carpineto, R. De Mori, G. Romano, and B. Bigi. An information theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1-27, 2001.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>