<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ENSM-SE at CLEF 2006 : AdHocUses of fuzzy proximity matching function</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Annabelle Mercier</string-name>
          <email>annabelle.mercier@emse.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michel Beigbeder</string-name>
          <email>michel.beigbeder@emse.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ecole Nationale Superieure des Mines de Saint Etienne (ENSM-SE) 158 cours Fauriel 42023 Saint Etienne Cedex 2</institution>
          <country country="FR">FRANCE</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Starting from the idea that the closer the query terms in a document are to each other the more relevant the document, we propose an information retrieval method that uses the degree of fuzzy proximity of key terms in a document to compute the relevance of the document to the query. Our model handles Boolean queries but, contrary to the traditional extensions of the basic Boolean information retrieval model, does not use a proximity operator explicitly. A single parameter makes it possible to control the proximity degree required. To improve our system we use a stemming algorithm before indexing, we take a specific influence function and we merge fuzzy proximity result list built with different spread of influence function. We explain how we construct the queries and report the results of our experiments in the ad-hoc monolingual French task of the CLEF 2006 evaluation campaign.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Experimentation
term proximity, fuzzy information retrieval
In the information retrieval domain, systems are based on three basic models: the Boolean model,
the vector model and the probabilistic model. These models have many variations (extended
Boolean models, models based on fuzzy sets theory, generalized vector space model,. . . ) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
However, they are all based on weak representations of documents: either sets of terms or bags of terms.
In the first case, what the information retrieval system knows about a document is whether it
contains a given term or not. In the second case, the system knows the number of occurrences – the
term frequency, tf – of a given term in each document. So whatever the order of the terms in
the documents, they share the same index representation if they use the same terms. Noteworthy
exceptions to this rule are most of the Boolean model implementations which propose a near
operator [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This operator is a kind of and but with the constraint that the different terms
2
      </p>
      <p>
        Uses of Proximity
are within a window of size n, where n is an integral value. The set of retrieved documents can
be restricted with this operator. For instance, it is possible to discriminate between documents
about “data structures” and those about “data about concrete structures”. Using this operator
results in an increase in precision of the system [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. But the Boolean systems that implement a
near operator share the same limitation as any basic Boolean system: these systems are not able
to rank the retrieved documents because with this model a document is or is not relevant to a
query. Different extensions have been proposed to the basic Boolean systems to circumvent this
limitation. These extensions represent the documents with some kind of term weights. Most of
the time these weights are computed on a tf basis. Some combining formulas are then applied
to compute the document score given the term weights and the query tree. But these extensions
are not compatible with the near operator. Some researchers have thus proposed models that
attempt to directly score the documents by taking into account the proximity of the query terms
within them.
      </p>
      <p>Z position in the document text and a query. This fuzzy proximity function is summed up over
To address the problem of scoring the documents taking into account the relative order of the
words in the document, we have defined a new method based on a fuzzy proximity between each
to score the document.</p>
      <p>We model the fuzzy proximity to an occurrence of a term with an influence function f that
reaches its maximum (value 1) at the value 0 and decreases on each side down to 0. Different
types of functions (Hamming, rectangular, gaussian, etc.) can be used. We used an adhoc and
a triangular one displayed in figure 1. In the following, the examples and the experiments will
be based on a triangular function x 7→ max( k−|x| , 0). The constant k controls the support of the
k
function and this support represents the extent of influence of each term occurrence. A similar
parameter can be found for other shapes.</p>
      <p>
        So, for a query term t, the fuzzy proximity function to the occurrence at position i of the
term t is x 7→ f (x − i). Now, we define the term proximity function wtd which models the fuzzy
proximity at the position x in the text to the term t by combining the fuzzy proximity functions
of the different occurrences of the term t:
Three methods have been proposed to score documents taking into account different sets of
intervals containing the query terms. These methods differ in the set of intervals that are selected in
a first step, and then in the formulas used to compute the score for a given interval. The method
of Clarke et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] selects the shortest intervals that contain all the query terms (this constraint
is relaxed if there are not enough retrieved documents), so the intervals cannot be nested. In
the method of Hawking et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], for each query term occurrence, the shortest interval containing
all the query terms is selected, thus the selected intervals can nest. Rasolofo et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] chose to
select intervals only containing two terms of the query, but with the additional constraint that the
interval is shorter than five words.
      </p>
      <p>
        Moreover, passage retrieval methods indirectly use the notion of proximity. In fact, in several
methods, documents are ranked by selecting documents which have passages with a high density
of query terms, that is to say documents where the query terms are near to each other [
        <xref ref-type="bibr" rid="ref11 ref3 ref6">11, 3, 6</xref>
        ].
The next section presents our method which scores documents on the basis of term proximity.
3
      </p>
      <p>Fuzzy Proximity Matching
x 7→ wtd(x) =</p>
      <p>max
i∈Occ(t,d)</p>
      <p>f (x − i)
where Occ(t, d) is the set of the positions of the term t in the document d and f is the influence
function.</p>
      <p>
        Z So we obtain a function wqd from to the interval [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ]. The result of the summation of this
for a conjunctive node. With a post-order tree traversal a fuzzy proximity function to the query
can be computed at the root of the query tree as the fuzzy proximity functions are defined on the
leaves.
function is used as the score of the document:
      </p>
      <p>The query model is the classical Boolean model: A tree with terms on the leaves and or or
and operators on the internal nodes. At an internal node, the proximity functions of the sons of
this node are combined in the query tree with the usual fuzzy set theory formulas. So the fuzzy
proximity is computed by
wqd or q′ = max(wqd, wqd′ )
wqdand q′ = min(wqd, wqd′ )
s(q, d) =
+∞</p>
      <p>X wqd(x) .</p>
      <p>x=−∞
Thus, the computed score s(q, d) depends on the fuzzy proximity functions and enables document
ranking according to the query term proximity in the documents.
4</p>
      <p>
        Experiments and Evaluation
We carried out experiments within the context of the clef 2006 evaluation campaign in the
adhoc monolingual French task1. We used the retrieval search engine Lucy2 which is based on the
Okapi information retrieval model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to index this collection. It was easy to adapt this tool to
our method because it keeps the positions of the terms occurring in the documents in the index.
Thus, we extended this tool to compute the relevance score values for our fuzzy proximity matching
function.
      </p>
      <p>Documents in the clef 2006 test collection are newspapers articles in XML format from SDA
and Le Monde of the years 1994 and 1995. For each document (tag &lt;DOC&gt;), we keep the fields
&lt;DOCNO&gt; with the tag and the document number, the textual contents of the tags &lt;TX&gt;, &lt;LD&gt;,
&lt;TI&gt;, &lt;ST&gt; for SDA French and &lt;TEXT&gt;, &lt;LEAD1&gt;, &lt;TITLE&gt; for Le Monde 1995. We used the
topics and the relevance judgements to evaluate the different methods by the trec eval program.
4.1</p>
      <p>Building the Queries
1http://clef.isti.cnr.it/
2http://www.seg.rmit.edu.au/lucy/
Each topic is composed of three tags: &lt;FR-title&gt;, &lt;FR-desc&gt;, &lt;FR-narr&gt;. Two sets of queries
were built for our experiments.</p>
      <p>Automatically built queries. For this set, a query is built with the terms from the title field
where the stop words3 are removed. Here is an example with the topic #278. The original topic
is expressed by:
&lt;top&gt;
&lt;num&gt; 278 &lt;/num&gt;
&lt;FR-title&gt; Les moyens de transport pour handicap´es&lt;/FR-title&gt;
&lt;FR-desc&gt; A quels probl`emes doivent faire face les personnes
handicap´ees physiques lorsquelles empruntent les transports
publics et quelles solutions sont propos´ees ou adopt´ees?
&lt;/FR-desc&gt;
&lt;FR-narr&gt; Les documents pertinents devront d´ecrire les
difficulte´s auxquelles doivent faire face les personnes
diminu´ees physiquement lorsquelles utilisent les transports
publics et/ou traiter des progr`es accomplis pour r´esoudre ces
probl`emes.
&lt;/FR-narr&gt;
&lt;/top&gt;
First, the topic number and the title field are extracted and concatenated:</p>
      <p>278 moyens transport handicap´es
From this form, the queries are automatically built by simple derivations:
Lucy: 278 moyens transport handicap´es
conjunctive fuzzy proximity: 278 moyens &amp; transport &amp; handicap´es
disjunctive fuzzy proximity: 278 moyens | transport | handicap´es
Manually built queries. They are built with all the terms from the title field and some terms
from the description field. The general idea was to build conjunctions (which are the basis of our
method) of disjunctions. The disjunctions are composed of the plural form of the terms and some
derivations to compensate the lack of a stemming tool in Lucy. Sometimes some terms from the
same semantic field were grouped together in the disjunctions.</p>
      <p>Queries for the method implemented in the Lucy tool are flat queries composed of different
inflectional and/or derivational forms of the terms. Here is an example for topic #278:
fuzzy proximity: 278 (moyen | moyens) &amp; (transport | transports)</p>
      <p>&amp; (handicap | handicap´e | handicap´es)
Lucy: 278 moyen moyens transport transports</p>
      <p>handicap handicap´e handicap´es
4.2</p>
      <p>Building the Result Lists
The Okapi model and our fuzzy method were compared.</p>
      <p>In first time, it is well known that the Okapi method gives one of the best performances.</p>
      <p>
        However, a previous study showed that proximity based methods improve retrieval [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. If one
of our experiments with our proximity based method does not retrieve enough documents (one
thousand for the clef experiments), then its results list is supplemented by documents from the
Okapi result list that have not yet been retrieved by the proximity based method.
      </p>
      <p>In second time, we note in past experiments that the higher the area (k = 200 was used) of
influence of a term the better the results are. Moreover, we retrieved more documents with fuzzy
3Removed stop words: a`, aux, au, chez, et, dans, des, de, du, en, la, les, le, par, sur, uns, unes, une, un, d’, l’.
proximity with a large spread of influence function. So, we merge for each type of queries (title,
description and manual), the results obtained with several k values (k equal to 200, 100, 80, 50,
20, 5).
4.3</p>
      <p>Submitted Runs
0.8
0.7
0.6
0.5
0.3
0.2
0.1</p>
      <p>The run RIMAM06TDMLRef use the triangular influence function.</p>
      <p>Here we present with CLEF 2005 queries the differences obtained between runs with stemming
(or not) at indexing step or at query formulation step.</p>
      <p>For the runs where the Okapi method was used, the queries are flat (bag of terms). These
runs were produced by using the native Lucy search engine and they provide the baselines for
the comparison with our method. The recall precision results are provided in figures 2.</p>
      <p>We can see that the runs with no stemming step before indexing have less precision than
the others. The figure 3 shows also the stemming step provide better results. But we can see
that the run TitleDescManual is above the ConjTitle100 which means that the words added for
“stemming” queries increase the precision results. The last figure 4 the differences between Okapi
Lucy method and fuzzy proximity method. We can see that our method is better than Lucy one
(ManualNoStem200 vs. LucyTitleNoStem; StemManual200 vs. ConjTitle200). We can note the
run with stemming at indexing and at query time is the better one.
0.7
0
0
We have presented our information retrieval model which takes into account the position of the
query terms in the documents to compute the relevance scores. We experimented this method on
the clef 2006 Ad-Hoc French test collection.</p>
      <p>In these experiment, we submit runs which use different k values in order to retrieve more
documents with our method. We think also that the results could be improved by using an
automatic stemming procedure. Eventually we think that the user can add query word taken
within a thesaurus in order to retrieve more documents with our method. In further experiments,
we will use other function at AND and OR nodes and we want to specifies “phrases” in the queries.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>Berthier</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          .
          <article-title>Modern Information Retrieval</article-title>
          . ACM Press / Addison-Wesley,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Charles</surname>
            <given-names>L. A.</given-names>
          </string-name>
          <string-name>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <surname>Gordon</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Cormack</surname>
          </string-name>
          , and
          <string-name>
            <surname>Elizabeth</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Tudhope</surname>
          </string-name>
          .
          <article-title>Relevance ranking for one to three term queries</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>36</volume>
          (
          <issue>2</issue>
          ):
          <fpage>291</fpage>
          -
          <lpage>311</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Owen de Kretser and
          <string-name>
            <given-names>Alistair</given-names>
            <surname>Moffat</surname>
          </string-name>
          .
          <article-title>Effective document presentation with a locality-based similarity heuristic</article-title>
          .
          <source>In SIGIR '99: Proceedings of the 22nd ACM SIGIR Annual International Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>113</fpage>
          -
          <lpage>120</lpage>
          . ACM, 9
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Thistlewaite</surname>
          </string-name>
          .
          <article-title>Proximity operators - so near and yet so far</article-title>
          . In D. K. Harman, editor,
          <source>The Fourth Text REtrieval Conference (TREC-4)</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>143</lpage>
          . Department of Commerce,
          <source>National Institute of Standards and Technology</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Keen</surname>
          </string-name>
          .
          <article-title>Some aspects of proximity searching in text retrieval systems</article-title>
          .
          <source>Journal of Information Science</source>
          ,
          <volume>18</volume>
          :
          <fpage>89</fpage>
          -
          <lpage>98</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Koichi</given-names>
            <surname>Kise</surname>
          </string-name>
          , Markus Junker, Andreas Dengel, and
          <string-name>
            <given-names>Keinosuke</given-names>
            <surname>Matsumoto</surname>
          </string-name>
          .
          <article-title>Passage retrieval based on density distributions of terms and its applications to document retrieval and question answering</article-title>
          . In Andreas Dengel, Markus Junker, and Anette Weisbecker, editors,
          <source>Reading and Learning: Adaptive Content Recognition</source>
          , volume
          <volume>2956</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>306</fpage>
          -
          <lpage>327</lpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mercier</surname>
          </string-name>
          . E´
          <article-title>tude comparative de trois approches utilisant la proximit´e entre les termes de la requˆete pour le calcul des scores des documents</article-title>
          .
          <source>In INFORSID 2004</source>
          , pages
          <fpage>95</fpage>
          -
          <lpage>106</lpage>
          , mai
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Yves</given-names>
            <surname>Rasolofo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jacques</given-names>
            <surname>Savoy</surname>
          </string-name>
          .
          <article-title>Term proximity scoring for keyword-based retrieval systems</article-title>
          .
          <source>In 25th European Conference on Information Retrieval Research</source>
          , number 2633 in LNCS, pages
          <fpage>207</fpage>
          -
          <lpage>218</lpage>
          . Springer,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          .
          <article-title>Okapi at trec-3</article-title>
          . In D. K. Harman, editor,
          <source>Overview of the Third Text REtrieval Conference (TREC-3)</source>
          , pages
          <fpage>109</fpage>
          -
          <lpage>126</lpage>
          . Department of Commerce,
          <source>National Institute of Standards and Technology</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Gerard</given-names>
            <surname>Salton and M. J. McGill</surname>
          </string-name>
          .
          <article-title>Introduction to Modern Information Retrieval</article-title>
          .
          <string-name>
            <surname>McGraw-Hill Book</surname>
          </string-name>
          Company,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ross</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          .
          <article-title>Effective retrieval of structured documents</article-title>
          .
          <source>In SIGIR '94, Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>311</fpage>
          -
          <lpage>317</lpage>
          . Springer-Verlag New York,
          <volume>7</volume>
          <fpage>1994</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>