<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From terms to concepts: a revisited approach to Local Context Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Annalina Caputo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>semerarog@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Computer Science, University of Bari, 70126 Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Pseudo-Relevance Feedback (PRF) is a widely used technique which aims to improve the query representation by assuming the top-ranked documents to be relevant. This should result in better performance since, after the expansion and re-weighting of the original query, the resultant vector should contain all those worthwhile features able to fully express the user's information need. This paper presents the application of a pseudo-relevance feedback technique, called Local Context Analysis (LCA), to SENSE (SEmantic N-levels Search Engine). SENSE is an IR system that tries to overcome the limitations of the ranked keyword approach by introducing semantic levels which integrate (and do not simply replace) the lexical level represented by keywords. The evaluation shows that this PRF technique works effectively on both the lexical level, represented by keywords, and the semantic level, represented by WordNet synsets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        LCA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a PRF technique which exploits the context of query words in a collection of documents by analyzing which words in the top-ranked documents simultaneously co-occur with most of the query terms. This paper presents an extension of LCA in SENSE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an IR system which aims to be a step beyond traditional keyword-based systems. The main idea underlying SENSE is the definition of an open framework to model different semantic aspects (or levels) pertaining to document content. Two basic levels are available in the framework: the keyword level, the entry level in which the document is represented by the words occurring in the text, and the word meaning level, represented through synsets obtained from WordNet, a semantic lexicon for the English language. A synset is a set of synonymous words. Word Sense Disambiguation algorithms are adopted to assign synsets to words. Analogously, several different levels of representation are needed for representing queries. In this model, the notion of relevance of a document d in the collection for the user query q is also extended to several levels of representation. A local similarity function computes the document relevance for each level, according to feature weights defined by the corresponding local scoring function. Then, a global ranking function is needed to merge all the result lists that come from each level into a single list of documents ranked in decreasing order of relevance. In the same way, the PRF technique should be able to work over all the levels involved in our model.
      </p>
<p>
        <bold>nLCA.</bold> LCA proved its effectiveness on several test collections. This technique combines the strength of a global relevance feedback method like PhraseFinder [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] while preventing its drawbacks. LCA selects the expansion terms directly from the collection on the basis of their co-occurrences with query terms. Differently from PhraseFinder, this method computes these statistics on the basis of the top-ranked documents, which are assumed to be the relevant ones, with a considerable gain in efficiency. Thus, LCA joins the advantage of a global technique with the efficiency of a local one. This technique is grounded on the hypothesis that terms frequently occurring in the top-ranked documents also frequently co-occur with all query terms in those documents. Our work exploits the idea of LCA in the N-levels model. In that model, LCA is integrated into two representation levels: keyword and word meaning. The challenge lies in the idea that the LCA hypothesis could also be applied to the word meaning level, in which meanings are involved instead of terms. The original measure of co-occurrence degree is extended to encompass the weight of a generic feature (keyword or word meaning) rather than just a term.
      </p>
<p>We modify the original formula by introducing two new factors, <bold>α</bold> and <bold>β</bold> (in bold in the following formulae):</p>
      <p>codegree(f, q_i) = log10(co(f, q_i) + 1) · idf(f) / log10(n)</p>
      <p>codegree is computed starting from the degree of co-occurrence co(f, q_i) of the feature f with the query feature q_i, but it also takes into account the frequency of f in the whole collection (idf(f)) and normalizes this value with respect to n, the number of documents in the top-ranked set.</p>
      <p>co(f, q_i) = Σ_{d ∈ S} tf(f, d) · tf(q_i, d)</p>
      <p>idf(f) = min(1.0, log10(N / N_f) / 5.0)</p>
      <p>where tf(f, d) and tf(q_i, d) are the frequencies in d of f and q_i respectively, S is the set of top-ranked documents, N is the number of documents in the collection and N_f is the number of documents containing the feature f. For each level, we retrieve the n top-ranked documents for a query q and then rank the features belonging to those documents by computing the function lca, as follows:</p>
      <p>lca(f, q) = Π_{q_i ∈ q} (δ + <bold>α</bold> · <bold>β</bold> · codegree(f, q_i))^idf(q_i)</p>
      <p>α and β transfer the importance of a query term into the weight of the words it co-occurs with. In fact, α takes into account the frequency qf of a query term in the original query (α = 1 + log(qf(q_i))), while β takes into account the boost factor associated with a specific query term (β = 1 + log(boost(q_i))). lca is used to rank the list of features that occur in the top-ranked documents; δ is a smoothing factor, while the power idf(q_i) is used to raise the impact of rare features. The new query is given by the sum of the original query q and the expanded query q′, where q′ = (w_{f_1}, ..., w_{f_k}) and w_{f_i} = 1.0 − 0.9 · i/k is the weight of the i-th feature f_i. Hence, the new query is re-executed to obtain the final list of ranked documents for each level. Differently from the original work, we applied LCA to the top-ranked documents rather than to passages¹.</p>
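<p>For concreteness, the ranking described above can be sketched in Python. This is a minimal illustration, not the SENSE implementation: the function name lca_rank, the tokenized-document input and the doc_freq dictionary are our own assumptions (and n is assumed greater than 1); the computations follow the definitions of co, idf, codegree and lca given in the text.</p>

```python
import math
from collections import Counter

def lca_rank(top_docs, query, boost, n_total, doc_freq, delta=0.1, k=10):
    """Rank candidate expansion features with the lca function.

    top_docs: the n top-ranked documents, each a list of features
              (keywords or synset ids); query: Counter of query
              feature -> qf; boost: query feature -> boost factor;
    n_total: N, the collection size; doc_freq: feature -> N_f,
    the number of documents containing the feature.
    """
    n = len(top_docs)                      # size of the top-ranked set
    tfs = [Counter(doc) for doc in top_docs]

    def idf(f):
        # idf(f) = min(1.0, log10(N / N_f) / 5.0)
        return min(1.0, math.log10(n_total / doc_freq.get(f, 1)) / 5.0)

    def co(f, qi):
        # co(f, q_i) = sum over top docs d of tf(f, d) * tf(q_i, d)
        return sum(tf[f] * tf[qi] for tf in tfs)

    def codegree(f, qi):
        return math.log10(co(f, qi) + 1) * idf(f) / math.log10(n)

    candidates = set().union(*tfs) - set(query)
    scores = {}
    for f in candidates:
        score = 1.0
        for qi, qf in query.items():
            alpha = 1 + math.log(qf)                  # query-term frequency
            beta = 1 + math.log(boost.get(qi, 1.0))   # query-term boost
            score *= (delta + alpha * beta * codegree(f, qi)) ** idf(qi)
        scores[f] = score

    # keep the k best features; the i-th one gets weight 1.0 - 0.9 * i / k
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [(f, 1.0 - 0.9 * i / k) for i, f in enumerate(ranked)]
```

<p>The same routine applies at each level, with keywords or synset identifiers playing the role of features.</p>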
    </sec>
    <sec id="sec-2">
      <title>Setting the scene</title>
<p>
        We evaluate our technique on the CLEF Ad-Hoc Robust Task collection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The CLEF collection is composed of 166,717 documents and 160 topics. In this collection, both documents and topics are disambiguated by the task organizers. Topics are structured in three fields: Title, Description and Narrative. All query fields are exploited in the search phase with a different boost factor: Title = 8, Description = 2 and Narrative = 1. We use Okapi BM25 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as the local similarity function for both the meaning and keyword levels. In particular, we adopt the BM25-based strategy which takes into account multi-field documents. Documents in the CLEF collection are represented by two fields: HEADLINE and TEXT. The multi-field representation reflects this structure. We set the BM25 parameters as follows: b = 0.7 in both levels, k1 = 3.25 and 3.50 in the keyword and meaning levels respectively. We tested several n, k, and δ values, and we set n = k = 10 and δ = 0.1. To compute the global ranking function we adopt the CombSUM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] strategy, giving a weight of 0.8 to the keyword level and 0.2 to the meaning level. All parameters (boosting factors, BM25 and global ranking function) were set after a tuning phase over a set of training topics provided by the organizers. To evaluate our approach we consider the Mean Average Precision (MAP) and the Geometric Mean Average Precision (GMAP).
      </p>
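<p>A minimal sketch of the weighted CombSUM merge, under our own assumptions: each per-level run is min-max normalized before summing (the normalization step is an assumption, not stated in the text), and the 0.8/0.2 weights are those used in the experiments.</p>

```python
def comb_sum(level_runs, level_weights):
    """Merge per-level result lists with weighted CombSUM: a document's
    fused score is the weighted sum of its normalized per-level scores.

    level_runs: level name -> {doc_id: score}; level_weights: level
    name -> weight (e.g. 0.8 for the keyword level, 0.2 for meaning).
    """
    fused = {}
    for level, run in level_runs.items():
        if not run:
            continue
        # min-max normalize each run so scores produced by different
        # local similarity functions are comparable before summing
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0
        for doc_id, score in run.items():
            fused[doc_id] = fused.get(doc_id, 0.0) \
                + level_weights[level] * (score - lo) / span
    # single list in decreasing order of fused relevance
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

<p>A document appearing in only one level simply contributes a single weighted term to the sum, so the keyword and meaning lists need not cover the same documents.</p>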
    </sec>
    <sec id="sec-3">
      <title>Results and Remarks</title>
<p>
        We performed two experiments in which one level at a time is considered and then the two lists are merged, producing a single list of ranked documents. We explored three strategies involving LCA: the first strategy (lca) is based on the formula proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In the second strategy (lca-n), we also took the meaning level into account and decided to expand only synsets referring to nouns. This strategy tries to overcome a limitation of Word Sense Disambiguation algorithms, which in general perform better on nouns. The last strategy (lca-n-αβ) is based on lca-n, but introduces the α and β factors. The results of our evaluation are reported in Table 1.
      </p>
      <p>¹ In the original work, passages are parts of document text of about 300 words.</p>
      <p>[Table 1: MAP and GMAP of the keyword, synset, and combined n-levels runs.]</p>
<p>While the synset level alone is not able to reach the performance of the keyword level, the combination of the two levels without any expansion strategy (no-expansion) improves performance in both MAP and GMAP. All lca strategies exploited in this paper outperform our baseline (no-expansion). However, it is worth highlighting that the expansion on the synset level produces slightly better results than the standard method lca when it involves only nouns (lca-n). The introduction of the α and β parameters results in the best performance. This result supports the claim that the weight of query terms is important also for weighing the expansion terms. Future work will include the comparison, in the N-levels model, of the proposed approach with other PRF techniques, such as Rocchio, Divergence from Randomness and Kullback-Leibler language modeling.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Nunzio</surname>
            ,
            <given-names>G.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Otegi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>CLEF 2009 Ad Hoc Track Overview: Robust-WSD Task</article-title>
          . In: Peters,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Kurimo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Mostefa</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
, Peñas,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Roda</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . (eds.)
          <source>Multilingual Information Access Evaluation</source>
          ,
          <source>Vol. I: Text Retrieval Experiments. Lecture Notes in Computer Science</source>
          , Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caputo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gentile</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Degemmis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lops</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
          </string-name>
          , G.:
          <article-title>Enhancing semantic search using N-levels document representation</article-title>
          . In: Bloehdorn,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Grobelnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Mika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Tran</surname>
          </string-name>
          , D.T. (eds.)
          <source>Proceedings of the Workshop on Semantic Search (SemSearch</source>
          <year>2008</year>
          )
          <article-title>at the 5th European Semantic Web Conference (ESWC</article-title>
          <year>2008</year>
          ), Tenerife, Spain, June 2nd,
          <year>2008</year>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>334</volume>
          , pp.
          <volume>29</volume>
–
          <fpage>43</fpage>
          .
          <string-name>
            <surname>CEUR-WS.org</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaw</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Combination of Multiple Searches</article-title>
          . In: TREC. pp.
          <volume>243</volume>
–
          <issue>252</issue>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jing</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          :
          <article-title>An association thesaurus for information retrieval</article-title>
          .
          <source>In: RIAO 94 Conference Proceedings</source>
          . pp.
          <volume>146</volume>
–
          <issue>160</issue>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , M.:
<article-title>Simple BM25 extension to multiple weighted fields</article-title>
          .
          <source>In: Proceedings of the thirteenth ACM international conference on Information and knowledge management</source>
          . pp.
          <volume>42</volume>
–
          <fpage>49</fpage>
          . CIKM '04,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
          </string-name>
          , W.B.:
<article-title>Improving the effectiveness of information retrieval with local context analysis</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <volume>79</volume>
–
          <fpage>112</fpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>