<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dublin City University at WebCLEF 2007</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sisay Fissaha Adafre</string-name>
          <email>sfissaha@computing.dcu.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing</institution>
          ,
          <addr-line>DCU</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>4</lpage>
      <abstract>
        <p>This paper describes our participation in the Multilingual Web Track (WebCLEF) 2007.</p>
      </abstract>
      <kwd-group>
        <kwd>Web retrieval</kwd>
        <kwd>Questions beyond factoids</kwd>
        <kwd>Importance ranking</kwd>
        <kwd>Duplicate removal</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>WebCLEF 2007 deals with identifying relevant pieces of information from online sources that meet the information needs of an expert user writing an article on a topic. Such an information need is expressed in terms of a short topic title, a description of the goals and intended audience of the article, known online sources that the user considers relevant for the topic, and a set of Google queries. Given this input, systems are expected to return a ranked list of relevant snippets from the collection provided for the task. The collection consists of the top 1000 (at most) Google hits for each of the retrieval queries specified in the topic, or for the topic title if no queries are specified. For this task, we devised a simple aggregative ranking approach that combines evidence from the different elements of the topic description, which we describe in the following sections.</p>
    </sec>
    <sec id="sec-2">
      <title>Description</title>
      <p>Our approach consists of the following two steps:
• Preprocessing of the collection
• Ranking of the snippets</p>
      <p>Preprocessing consists of splitting the documents in the collection into smaller snippets, each of which constitutes an information nugget. We take sentences as our basic information nuggets and split the documents into their constituent sentences, or snippets. Following this, we rank the snippets according to their importance to the topic.</p>
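      <p>The sentence-level preprocessing step could be sketched as follows. This is a minimal illustration, not the system's actual splitter; the regular expression is an assumed stand-in for whatever sentence segmenter was used.</p>

```python
import re

def split_into_snippets(document: str) -> list[str]:
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    # An illustrative stand-in; the paper does not specify its segmenter.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    # Each non-empty sentence becomes one candidate snippet (information nugget).
    return [s for s in sentences if s]
```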
    </sec>
    <sec id="sec-3">
      <title>Ranking Snippets</title>
      <p>Snippets are ranked based on their similarity to the topic description. As mentioned in Section 1, the topic is described by multiple informational items: the topic title, the Google queries, and the known online sources. We devised three methods for ranking snippets using these topic descriptions: baseline, filtering, and parsing.</p>
      <sec id="sec-3-1">
        <title>Baseline</title>
        <p>This approach applies a set of preprocessing steps to the topic descriptions. First, the documents from the known online sources are split into sentences or smaller snippets. We then remove stopwords from these sentences, and likewise from the topic titles. The revised topic description consists of the content words of the topic titles and Google queries, together with a set of snippets from the known online sources, each represented by its content words (important snippets).</p>
        <p>For each candidate snippet, we compute its rank as follows. We first remove stopwords from the candidate snippet. We then compute its word-overlap similarity with the topic titles, the Google queries, and each of the snippets from the known online sources. The final score for the snippet is the average of these similarity scores.</p>
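        <p>The baseline scoring could be sketched as follows. This is a minimal illustration under assumptions: the paper does not specify the exact word-overlap normalisation, so Jaccard similarity over content-word sets is assumed, and all function names are ours.</p>

```python
def content_words(text: str, stopwords: set) -> set:
    # Represent a text by its lowercased non-stopword tokens.
    return {w.lower() for w in text.split() if w.lower() not in stopwords}

def word_overlap(a: set, b: set) -> float:
    # Jaccard overlap assumed; the paper's exact normalisation is unspecified.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def baseline_score(candidate, title, queries, source_snippets, stopwords):
    # Average the candidate's overlap with the title, each query,
    # and each known-source snippet, all represented by content words.
    cand = content_words(candidate, stopwords)
    refs = [content_words(title, stopwords)]
    refs += [content_words(q, stopwords) for q in queries]
    refs += [content_words(s, stopwords) for s in source_snippets]
    scores = [word_overlap(cand, r) for r in refs]
    return sum(scores) / len(scores)
```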
      </sec>
      <sec id="sec-3-2">
        <title>Filtering</title>
        <p>In this approach, we preprocess the topic descriptions as in the Baseline. Unlike the Baseline, this method applies a two-step ranking process.</p>
        <p>First, we combine the content words of the topic titles and Google queries to form a set of query terms. We compute the word-overlap score between this set and each candidate snippet represented by its content words, and keep the candidate snippets that score above a certain threshold.</p>
        <p>Following this, we rerank the resulting candidate snippets according to their average word-overlap score with the set of snippets from the known online sources, each represented by its content words.</p>
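        <p>The two-step filtering procedure could be sketched as follows, assuming snippets are already reduced to content-word sets. As before, Jaccard overlap is an assumption, and the threshold value is illustrative.</p>

```python
def word_overlap(a: set, b: set) -> float:
    # Jaccard overlap; the paper's exact normalisation is unspecified.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_and_rerank(candidates, query_terms, source_snippets, threshold):
    """candidates, source_snippets: lists of content-word sets;
    query_terms: combined content words of the title and Google queries."""
    # Step 1: keep candidates scoring above the threshold against the query terms.
    kept = [c for c in candidates if word_overlap(c, query_terms) > threshold]
    # Step 2: rerank survivors by average overlap with the known-source snippets.
    def avg_source_overlap(c):
        return sum(word_overlap(c, s) for s in source_snippets) / len(source_snippets)
    return sorted(kept, key=avg_source_overlap, reverse=True)
```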
      </sec>
      <sec id="sec-3-3">
        <title>Parsing</title>
        <p>The above two methods are based on relatively language-independent (and shallow) techniques that can easily be applied to a problem with multilingual requirements. However, the underlying scoring method uses a simple word-overlap metric, in which the topic description and the candidate snippets are represented by their content words, and content words are selected merely by checking that they do not appear in a precompiled stopword list.</p>
        <p>In this approach, we impose stronger constraints on the choice of content words. We considered only the top 20 ranking web documents for processing. As in the Filtering method, we filtered the candidate snippets based on their word overlap with the topic title and Google queries.</p>
        <p>
          The resulting list of candidate snippets is reranked as follows. We parse each of the candidate snippets and the set of snippets from the known online sources using a Lexical Functional Grammar based parser [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Each candidate snippet is represented by the set of head words obtained from its parse tree. We then compute a word-overlap score for each candidate snippet against the snippets from the known online sources; the final score for a candidate snippet is the average of these scores. For non-English topics, the system applies the baseline ranking method, i.e. without parsing.
        </p>
        <p>Finally, a simple duplicate-removal procedure is applied to the outputs of the above three systems. We sort the output of each system in descending order of score. For each high-ranking snippet, we compute its word-overlap scores with the lower-ranked snippets and remove those whose overlap exceeds a threshold value. We then return the top 30 snippets as our final result set.</p>
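        <p>The duplicate-removal step could be sketched as follows; Jaccard overlap over content-word sets is again our assumption, and the threshold is illustrative.</p>

```python
def word_overlap(a: set, b: set) -> float:
    # Jaccard overlap; the paper's exact normalisation is unspecified.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def remove_duplicates(scored_snippets, threshold, top_k=30):
    """scored_snippets: list of (score, content_word_set) pairs."""
    ranked = sorted(scored_snippets, key=lambda pair: pair[0], reverse=True)
    result = []
    for score, words in ranked:
        # Drop a snippet if it overlaps too strongly with any higher-ranked keeper.
        if not any(word_overlap(words, kept) > threshold for _, kept in result):
            result.append((score, words))
    return result[:top_k]
```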
        <p>The submitted runs were evaluated with the following measures:
• Nugget-based (resp., character-based) recall: the number of all identified nuggets (resp., their character length) that are covered by the snippets of a system S, divided by the total number of nuggets (resp., their total character length).
• Precision: the number of characters that belong to at least one span linked to a nugget, divided by the total character length of the system’s response.</p>
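        <p>The two measures could be sketched as follows. This is a simplified illustration of the definitions above, not the official WebCLEF scorer; spans are assumed to be character offsets into the system's response.</p>

```python
def nugget_recall(covered_nuggets: set, all_nuggets: set) -> float:
    # Fraction of all identified nuggets covered by the system's snippets.
    return len(covered_nuggets & all_nuggets) / len(all_nuggets)

def char_precision(response: str, linked_spans) -> float:
    # Characters belonging to at least one span linked to a nugget,
    # divided by the total character length of the response.
    covered = set()
    for start, end in linked_spans:   # half-open [start, end) offsets assumed
        covered.update(range(start, end))
    return len(covered) / len(response) if response else 0.0
```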
        <p>Overall, our scores are very low compared to those of the top-scoring system (Precision: 0.202 and Recall: 0.256). Parsing seems to improve the results to some extent. Since the improvement is small, further experiments are needed to determine the actual contribution of parsing to the overall scores. The first two approaches, on the other hand, did not show any significant differences. We observed that our system returns mostly short snippets (compared to other systems), since we split the documents into sentences. In some cases, the splitting process returns fragmented sentences containing incomplete information. Larger textual units, such as paragraphs, may be a more appropriate retrieval unit for this task.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We applied three different methods to the task of identifying important snippets from the Web. Our aim is to devise a simple and generic method that can be applied in the context of multilingual Web retrieval. We also applied a deeper natural language analysis method that is mainly targeted at English. The experimental results show that there is room for improvement. In the future, we will investigate the contribution of redundancy information and the structural properties of Web documents to ranking the snippets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Aoife</given-names>
            <surname>Cahill</surname>
          </string-name>
          , Michael Burke, Martin Forst, Ruth O'Donovan, Christian Rohrer, Josef van Genabith, and
          <string-name>
            <given-names>Andy</given-names>
            <surname>Way</surname>
          </string-name>
          .
          <article-title>Treebank-based acquisition of multilingual unification grammar resources</article-title>
          .
          <source>Research on Language and Computation</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>247</fpage>
          -
          <lpage>279</lpage>
          ,
          <year>July 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>WebCLEF</surname>
          </string-name>
          ,
          <year>2007</year>
          . WebCLEF: The CLEF Crosslingual Web Track. http://ilps.science.uva.nl/WebCLEF/WebCLEF2007.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>