<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Back to Basics - Again - for Domain Specific Retrieval</article-title>
      </title-group>
      <contrib-group>
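        <contrib contrib-type="author">
          <name>
            <surname>Larson</surname>
            <given-names>Ray R.</given-names>
          </name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information, University of California</institution>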
          ,
          <addr-line>Berkeley</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe Berkeley's approach to the Domain Specific (DS) track for CLEF 2008. Last year we used Entry Vocabulary Indexes and thesaurus expansion approaches for DS, but found in later testing that some simple text retrieval approaches gave better results than these more complex query expansion approaches. This year we decided to revisit our basic text retrieval approaches and see how they would stack up against the various expansion approaches used by other groups. The results are now in and the answer is clear: they perform pretty badly compared to other groups' approaches. All of the runs submitted were performed using the Cheshire II system. This year the Berkeley/Cheshire group submitted a total of twenty-four runs, two for each subtask of the DS track. These comprise six Monolingual runs (two each for English, German, and Russian), twelve Bilingual runs (four X2EN, four X2DE, and four X2RU), and six Multilingual runs (two EN, two DE, and two RU). The overall results include Cheshire runs in the top five participants for each task, but usually as the lowest of the five (and often fewer) groups.</p>
      </abstract>
      <kwd-group>
        <kwd>Cheshire II</kwd>
        <kwd>Logistic Regression</kwd>
        <kwd>Entry Vocabulary Indexes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper discusses the retrieval methods and evaluation results for Berkeley’s participation in the
CLEF 2008 Domain Specific track. In 2007 we focused on query expansion using Entry Vocabulary
Indexes (EVIs) [
        <xref ref-type="bibr" rid="ref3 ref5">4, 6</xref>
        ], and thesaurus lookup of topic terms. Once the relevance judgements for 2007
were released we discovered that these rather complex methods did not actually perform as well as
basic text retrieval on the topics without additional query expansion. So, this year for the Domain
Specific track we have returned to a basic text retrieval approach: probabilistic retrieval
based on Logistic Regression with the inclusion of blind feedback, as used in 2006 [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ].
      </p>
      <p>[Figure: Recall-Precision plot; vertical axis Precision, 0 to 1.]</p>
    </sec>
    <sec id="sec-2">
      <title>The Retrieval Algorithms</title>
      <p>
        As we have discussed in our other papers for the Adhoc-TEL and GeoCLEF tracks, the basic form and
variables of the Logistic Regression (LR) algorithm used for all of our submissions were originally
developed by Cooper, et al. [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ]. Formally, the goal of the logistic regression
method is to define a regression model that will estimate (given a set of training data), for a
particular query Q and a particular document D in a collection, the value P(R | Q, D), that is,
the probability of relevance for that Q and D. This value is then used to rank the documents
in the collection, which are presented to the user in order of decreasing values of that probability.
To avoid invalid probability values, the usual calculation of P(R | Q, D) uses the “log odds” of
relevance given a set of S statistics, s_i, derived from the query and database, giving a regression
formula for estimating the log odds from those statistics:
        <disp-formula id="eq1"><label>(1)</label><tex-math>\log O(R \mid Q, D) = b_0 + \sum_{i=1}^{S} b_i s_i</tex-math></disp-formula>
where b_0 is the intercept term and the b_i are the coefficients obtained from the regression analysis
of a sample set of queries, a collection and relevance judgements. The final ranking is determined
by the conversion of the log odds form to probabilities:
        <disp-formula id="eq2"><label>(2)</label><tex-math>P(R \mid Q, D) = \frac{e^{\log O(R \mid Q, D)}}{1 + e^{\log O(R \mid Q, D)}}</tex-math></disp-formula>
      </p>
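      <p>As an illustration only, the two steps described above (a linear estimate of the log odds, followed by conversion to a probability that is used for ranking) can be sketched in Python. The function names, coefficient values and statistics below are hypothetical and are not taken from Cheshire II.</p>
      <preformat>
```python
import math


def relevance_probability(log_odds_value: float) -> float:
    """Convert log odds of relevance into a probability (logistic function)."""
    # Numerically stable form of e^x / (1 + e^x).
    if log_odds_value >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds_value))
    exp_x = math.exp(log_odds_value)
    return exp_x / (1.0 + exp_x)


def log_odds(intercept: float, coefficients: list[float], statistics: list[float]) -> float:
    """Linear model: b_0 plus the sum of b_i * s_i."""
    return intercept + sum(b * s for b, s in zip(coefficients, statistics))


def rank_documents(intercept, coefficients, doc_statistics):
    """Return (doc_id, probability) pairs in order of decreasing probability."""
    scored = [
        (doc_id, relevance_probability(log_odds(intercept, coefficients, stats)))
        for doc_id, stats in doc_statistics.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical coefficients and statistics, for illustration only.
example = rank_documents(
    intercept=-1.0,
    coefficients=[2.0, 0.5],
    doc_statistics={"doc-a": [0.2, 1.0], "doc-b": [0.9, 0.4], "doc-c": [0.0, 0.0]},
)
```
      </preformat>
      <p>With these made-up values the ranking is doc-b, doc-a, doc-c. Because the logistic function is monotonic, ranking directly by log odds would give the same order.</p>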
      <sec id="sec-2-1">
        <title>TREC2 Logistic Regression Algorithm</title>
        <p>
          For all of our Domain Specific submissions this year we used a version of the Logistic Regression
(LR) algorithm that has been used very successfully in Cross-Language IR by Berkeley researchers
for a number of years [1] and which is also used in our GeoCLEF and Adhoc-TEL submissions.
For the Domain Specific track we used the Cheshire II information retrieval system
implementation of this algorithm. One current limitation of this implementation is the lack of
decompounding for German documents and query terms. As noted in our
other CLEF notebook papers, the Logistic Regression algorithm used was originally developed by
Cooper et al. [
          <xref ref-type="bibr" rid="ref1">2</xref>
          ] for text retrieval from the TREC collections for TREC2. The basic formula is:
          <disp-formula><tex-math>\begin{gathered}
\log O(R \mid C, Q) = \log \frac{p(R \mid C, Q)}{1 - p(R \mid C, Q)} = \log \frac{p(R \mid C, Q)}{p(\overline{R} \mid C, Q)} \\
= c_0 + c_1 \cdot \frac{1}{\sqrt{|Q_c| + 1}} \sum_{i=1}^{|Q_c|} \frac{qtf_i}{ql + 35}
+ c_2 \cdot \frac{1}{\sqrt{|Q_c| + 1}} \sum_{i=1}^{|Q_c|} \log \frac{tf_i}{cl + 80} \\
- c_3 \cdot \frac{1}{\sqrt{|Q_c| + 1}} \sum_{i=1}^{|Q_c|} \log \frac{ctf_i}{N_t}
+ c_4 \cdot |Q_c|
\end{gathered}</tex-math></disp-formula>
        </p>
        <p>where C denotes a document component (i.e., an indexed part of a document which may be the
entire document) and Q a query, R is a relevance variable,
p(R|C, Q) is the probability that document component C is relevant to query Q,
p(R̄|C, Q) is the probability that document component C is not relevant to query Q, which is 1.0 − p(R|C, Q),
|Q_c| is the number of matching terms between a document component and a query,
qtf_i is the within-query frequency of the ith matching term,
tf_i is the within-document frequency of the ith matching term,
ctf_i is the occurrence frequency in a collection of the ith matching term,
ql is query length (i.e., number of terms in a query like |Q| for non-feedback situations),
cl is component length (i.e., number of terms in a component),
N_t is collection length (i.e., number of terms in a test collection), and
c_k are the k coefficients obtained through the regression analysis.</p>
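        <p>The following Python sketch shows how the statistics defined above could be combined into a log odds score for one component. It assumes the commonly cited form of the TREC2 formula (an intercept, three averaged sums normalized by the square root of |Q_c| + 1, and a final term in |Q_c|, with the constants 35 and 80). The coefficient values are supplied by the caller, since they are not given in this paper, and the class and function names are hypothetical.</p>
        <preformat>
```python
import math
from dataclasses import dataclass


@dataclass
class TermMatch:
    qtf: float  # within-query frequency of the matching term
    tf: float   # within-document (component) frequency of the term
    ctf: float  # occurrence frequency of the term in the collection


def trec2_log_odds(matches, ql, cl, nt, c):
    """Log odds of relevance for one component.

    matches: TermMatch objects for the terms shared by query and component.
    ql, cl, nt: query length, component length, collection length (in terms).
    c: sequence (c0, c1, c2, c3, c4) of regression coefficients (assumed input).
    """
    n = len(matches)  # |Q_c|
    norm = 1.0 / math.sqrt(n + 1)
    x1 = norm * sum(m.qtf / (ql + 35) for m in matches)
    x2 = norm * sum(math.log(m.tf / (cl + 80)) for m in matches)
    x3 = norm * sum(math.log(m.ctf / nt) for m in matches)
    return c[0] + c[1] * x1 + c[2] * x2 - c[3] * x3 + c[4] * n
```
        </preformat>
        <p>The resulting value can be converted to a probability with the logistic function, although for ranking purposes the log odds ordering is already sufficient.</p>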
        <p>
          More details of this algorithm and the coefficients used with it may be found in our Adhoc-TEL
notebook paper where the same algorithm and coefficients were used. In addition to this primary
algorithm we used a version that performs “blind feedback” during the retrieval process. The
method used is also described in detail in our Adhoc-TEL paper. Our blind feedback approach
uses some number of top-ranked documents from an initial retrieval using the LR algorithm above,
and selects some number of terms from the content of those documents, using a version of the
Robertson and Sparck Jones probabilistic term relevance weights [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ]. Those terms are merged
with the original query, new term frequency weights are calculated, and the revised query is
submitted to obtain the final ranking. We used different numbers of documents and terms for
different collections based on some tests run on the 2007 data, varying these numbers to find the
optimal point for the specific collection. For the German collection we selected 20 documents and
the 35 top-ranked terms from those documents for feedback. For English we used 14 documents
and 16 terms, and for Russian we used 16 documents and the 10 top-ranked terms.
        </p>
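        <p>A minimal sketch of the term selection step of blind feedback is given below. The per-collection document and term counts are those reported above. The weight function is the standard Robertson and Sparck Jones form with 0.5 corrections, which is an assumption, since the text only says that "a version of" these weights is used. The input structures and function names are hypothetical, and the merging and re-weighting of the revised query is not shown.</p>
        <preformat>
```python
import math
from collections import Counter

# (documents, terms) used for feedback per collection, as reported in the text.
FEEDBACK_SETTINGS = {"de": (20, 35), "en": (14, 16), "ru": (16, 10)}


def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight with 0.5 corrections.

    r: feedback documents containing the term
    n: collection documents containing the term
    R: number of feedback documents (assumed relevant)
    N: number of documents in the collection
    """
    return math.log(((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)))


def select_feedback_terms(ranked_docs, doc_freq, num_docs, language):
    """Pick expansion terms from the top-ranked documents of an initial search.

    ranked_docs: list of token lists, best document first.
    doc_freq: term -> number of collection documents containing the term.
    """
    top_docs, top_terms = FEEDBACK_SETTINGS[language]
    feedback = ranked_docs[:top_docs]
    # For each term, count how many feedback documents contain it.
    r_counts = Counter(term for doc in feedback for term in set(doc))
    weights = {
        term: rsj_weight(r, doc_freq[term], len(feedback), num_docs)
        for term, r in r_counts.items()
    }
    # Highest weight first; ties broken alphabetically for determinism.
    return sorted(weights, key=lambda t: (-weights[t], t))[:top_terms]
```
        </preformat>
        <p>Very common terms receive low or negative weights even when they occur in every feedback document, which is why they tend not to be selected.</p>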
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Approaches for Domain Specific Retrieval</title>
      <p>In this section we describe the specific approaches taken for our submitted runs for the Domain
Specific track. First we describe the database creation and the indexing and term extraction
methods used, and then the search features we used for the submitted runs.
Although the Cheshire II system uses the XML structure of documents and extracts selected
portions of the record for indexing and retrieval, for the submitted runs this year we used only
one of these indexes, the one that contains the entire content of the document.</p>
      <p>Table 1 lists the indexes created for the Domain Specific database and the document elements
from which the contents of those indexes were extracted. The “Used” column in Table 1 indicates
whether or not a particular index was used in the submitted Domain Specific runs.</p>
      <p>For all indexing we used language-specific stoplists to exclude function words and very common
words from the indexing and searching. The German language runs, however, did not use
decompounding in the indexing and querying processes to generate simple word forms from compounds.
      </p>
      <sec id="sec-3-1">
        <title>Search Processing</title>
        <p>Searching the Domain Specific collection used Cheshire II scripts to parse the topics and submit
the title and description elements from the topics to the “topic” index containing all terms from
the documents. For the monolingual search tasks we used the topics in the appropriate language
(English, German, or Russian), and for bilingual tasks the topics were translated from the source
language to the target language using the LEC Power Translator PC-based program. Overall
we have found that this translation program seems to generate good translations between any of
the languages needed for this track, but we still intend to do some further testing to compare it
to previous approaches (which used web-based translation tools like Babelfish and PROMT). We
suspect that, as always, different tools provide a more accurate representation of different topics
for some languages, but the LEC Power Translator seemed to produce good (and often better)
translations for all of the needed languages.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Name</th><th>Description</th><th>Used</th></tr>
            </thead>
            <tbody>
              <tr><td>docno</td><td>Document ID</td><td>no</td></tr>
              <tr><td>author</td><td>Author name</td><td>no</td></tr>
              <tr><td>title</td><td>Article Title</td><td>no</td></tr>
              <tr><td>topic</td><td>All Content Words</td><td>yes</td></tr>
              <tr><td>date</td><td>Date</td><td>no</td></tr>
              <tr><td>subject</td><td>Controlled Vocabulary</td><td>no</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>All searches were submitted using the TREC2 Algorithm with blind feedback described above.
This year we did no expansion of topics or use of the thesaurus or the classification clusters created
last year. The differences in the runs for a given language or language pair (for bilingual) in Table
2 are primarily whether the topic title and description only (TD) or the title, description and narrative
(TDN) were used.</p>
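        <p>The difference between TD and TDN runs can be illustrated with a short Python sketch. The topic field names (title, desc, narr) and the dictionary representation of a topic are assumptions made for the example; the run name pattern follows the names listed in Table 2.</p>
        <preformat>
```python
# Topic fields contributing to the query for each run type (field names assumed).
TOPIC_FIELDS = {"TD": ("title", "desc"), "TDN": ("title", "desc", "narr")}


def build_query(topic: dict, run_type: str) -> str:
    """Concatenate the topic fields used by a TD or TDN run."""
    return " ".join(topic[field].strip() for field in TOPIC_FIELDS[run_type])


def run_name(task: str, languages: str, run_type: str) -> str:
    """Compose a run identifier such as BRK-BI-DEEN-TD."""
    return "-".join(["BRK", task, languages, run_type])


# Hypothetical topic, for illustration only.
topic = {
    "title": "Sports in Germany ",
    "desc": "Find documents about sports.",
    "narr": "Relevant documents discuss clubs.",
}
td_query = build_query(topic, "TD")
tdn_query = build_query(topic, "TDN")
```
        </preformat>
        <p>For bilingual runs the same construction would be applied to the translated topic text before the query is submitted to the "topic" index.</p>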
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results for Submitted Runs</title>
      <p>The summary results (as Mean Average Precision) for all of our submitted runs for English,
German and Russian are shown in Table 2; the Recall-Precision curves for these runs are also
shown in Figure 1 (for monolingual), Figure 2 (for bilingual) and Figure 3 (for multilingual). In
Figures 1, 2, and 3 the names are abbreviated to the letters and numbers of the full name in Table 2
describing the languages and query expansion approach used. For example, in Figure 2 DEEN-TD
corresponds to run BRK-BI-DEEN-TD in Table 2.</p>
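      <p>For readers unfamiliar with the measure, the following Python sketch computes Mean Average Precision in its usual uninterpolated form. It is an illustration under that assumption and does not reproduce the exact evaluation tooling used for the official results; the function and parameter names are hypothetical.</p>
      <preformat>
```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgements):
    """Mean of the per-topic average precision values.

    runs: topic id -> ranked document ids.
    judgements: topic id -> relevant document ids.
    """
    scores = [average_precision(runs.get(t, []), rel) for t, rel in judgements.items()]
    return sum(scores) / len(scores) if scores else 0.0
```
      </preformat>
      <p>A topic for which no relevant document is retrieved contributes an average precision of zero, so a few failed topics can pull the mean down considerably.</p>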
      <p>We observe that for the vast majority of our runs, using the narrative tends to degrade instead
of improve performance. (We observed the same in other tracks as well.)</p>
      <p>It is worth noting that the approaches used in our submitted runs provided the best results
when testing with 2007 data and topics when compared to our official 2007 runs. In fact we
may have over-simplified for this track. Although at least one Cheshire run appeared in the top
five runs of the overall summary results available on the DIRECT system, none of them were
top-ranked and for many tasks there appeared to be fewer than five participants.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Run Name</th><th>Description</th><th>Type</th></tr>
          </thead>
          <tbody>
            <tr><td>BRK-MO-DE-TD</td><td>Monolingual German</td><td>TD auto</td></tr>
            <tr><td>BRK-MO-DE-TDN</td><td>Monolingual German</td><td>TDN auto</td></tr>
            <tr><td>BRK-MO-EN-TD</td><td>Monolingual English</td><td>TD auto</td></tr>
            <tr><td>BRK-MO-EN-TDN</td><td>Monolingual English</td><td>TDN auto</td></tr>
            <tr><td>BRK-MO-RU-TD</td><td>Monolingual Russian</td><td>TD auto</td></tr>
            <tr><td>BRK-MO-RU-TDN</td><td>Monolingual Russian</td><td>TDN auto</td></tr>
            <tr><td>BRK-BI-ENDE-TD</td><td>Bilingual English⇒German</td><td>TD auto</td></tr>
            <tr><td>BRK-BI-ENDE-TDN</td><td>Bilingual English⇒German</td><td>TDN auto</td></tr>
            <tr><td>BRK-BI-RUDE-TD</td><td>Bilingual Russian⇒German</td><td>TD auto</td></tr>
            <tr><td>BRK-BI-RUDE-TDN</td><td>Bilingual Russian⇒German</td><td>TDN auto</td></tr>
            <tr><td>BRK-BI-DEEN-TD</td><td>Bilingual German⇒English</td><td>TD auto</td></tr>
            <tr><td>BRK-BI-DEEN-TDN</td><td>Bilingual German⇒English</td><td>TDN auto</td></tr>
            <tr><td>BRK-BI-RUEN-TD</td><td>Bilingual Russian⇒English</td><td>TD auto</td></tr>
            <tr><td>BRK-BI-RUEN-TDN</td><td>Bilingual Russian⇒English</td><td>TDN auto</td></tr>
            <tr><td>BRK-BI-DERU-TD</td><td>Bilingual German⇒Russian</td><td>TD auto</td></tr>
            <tr><td>BRK-BI-DERU-TDN</td><td>Bilingual German⇒Russian</td><td>TDN auto</td></tr>
            <tr><td>BRK-BI-ENRU-TD</td><td>Bilingual English⇒Russian</td><td>TD auto</td></tr>
            <tr><td>BRK-BI-ENRU-TDN</td><td>Bilingual English⇒Russian</td><td>TDN auto</td></tr>
            <tr><td>BRK-MU-DE-TD</td><td>Multilingual German</td><td>TD auto</td></tr>
            <tr><td>BRK-MU-DE-TDN</td><td>Multilingual German</td><td>TDN auto</td></tr>
            <tr><td>BRK-MU-EN-TD</td><td>Multilingual English</td><td>TD auto</td></tr>
            <tr><td>BRK-MU-EN-TDN</td><td>Multilingual English</td><td>TDN auto</td></tr>
            <tr><td>BRK-MU-RU-TD</td><td>Multilingual Russian</td><td>TD auto</td></tr>
            <tr><td>BRK-MU-RU-TDN</td><td>Multilingual Russian</td><td>TDN auto</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Since we have not yet had a chance to test alternative approaches on the 2008 topics and relevance
judgements, we don’t yet have much to report on ways forward. Given that the re-introduction of
fusion approaches in our GeoCLEF entry led to very good results, we suspect that the application
of selected fusion approaches for this task may also prove valuable.</p>
      <p>We are much more curious to see what approaches the other groups in this task used this year,
since some very strong results (at least compared to our own) appeared in the overall summary
data.</p>
      <p>[1] Aitao Chen and Fredric C. Gey. Multilingual information retrieval using machine translation,
relevance feedback and decompounding. Information Retrieval, 7:149–182, 2004.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Gey</surname>
          </string-name>
          .
          <article-title>Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression</article-title>
          .
          <source>In Text REtrieval Conference (TREC-2)</source>
          , pages
          <fpage>57</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>William S.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fredric C.</given-names>
            <surname>Gey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Daniel P.</given-names>
            <surname>Dabney</surname>
          </string-name>
          .
          <article-title>Probabilistic retrieval based on staged logistic regression</article-title>
          .
          <source>In 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Copenhagen, Denmark, June 21-24, pages
          <fpage>198</fpage>
          -
          <lpage>210</lpage>
          , New York,
          <year>1992</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Fredric</given-names>
            <surname>Gey</surname>
          </string-name>
          , Michael Buckland, Aitao Chen, and
          <string-name>
            <given-names>Ray</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Entry vocabulary - a technology to enhance digital search</article-title>
          .
          <source>In Proceedings of HLT2001, First International Conference on Human Language Technology</source>
          , San Diego, pages
          <fpage>91</fpage>
          -
          <lpage>95</lpage>
          ,
          <year>March 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ray R.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Domain specific retrieval: Back to basics</article-title>
          .
          <source>In Evaluation of Multilingual and Multi-modal Information Retrieval - Seventh Workshop of the Cross-Language Evaluation Forum, CLEF 2006</source>
          , LNCS, page to appear, Alicante, Spain,
          <year>September 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Vivien</given-names>
            <surname>Petras</surname>
          </string-name>
          , Fredric Gey, and
          <string-name>
            <given-names>Ray</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Domain-specific CLIR of English, German and Russian using fusion and subject metadata for query expansion</article-title>
          .
          <source>In Cross-Language Evaluation Forum: CLEF</source>
          <year>2005</year>
          , pages
          <fpage>226</fpage>
          -
          <lpage>237</lpage>
          .
          <source>Springer (Lecture Notes in Computer Science LNCS 4022)</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          .
          <article-title>Relevance weighting of search terms</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>146</lpage>
          , May-June
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>