                          FBK-irst at CLEF 2007
          Milen Kouylekov, Matteo Negri, Bernardo Magnini and Bonaventura Coppola
                       Fondazione Bruno Kessler FBK-irst, Trento, Italy
                       {kouylekov,magnini,negri,coppolab}@itc.it


                                            Abstract


          This report presents the outcomes of the activity carried out at FBK-irst for the
      participation in the CLEF-2007 Main QA track. Both the major improvements over
      last year’s version of the DIOGENE system, and the results achieved in the evaluation
      exercise are reported.

Keywords
Question answering, Wikipedia, Anaphoric expressions processing


1     Introduction
The main novelties in this year’s setting of the Main QA Task at CLEF are represented by:
    • Introduction of topic-related questions. Questions, possibly referring to each other through
      anaphoric expressions, are organized into clusters related to a specific topic.
    • Extended answer search space. Besides the past years' document collection, Wikipedia
      articles were added as a possible answer source.

    Even though the overall system architecture is the same one we adopted for our previous
participations in the CLEF evaluation exercises (see [2]), some adaptations were necessary to
address the increased complexity of this year's edition of the task. These are briefly overviewed
in Section 2, which presents our work on the new answer search space, and Section 3, which
reports on our simple approach to topic-related questions. Sections 4 and 5 conclude the report,
respectively reporting the results achieved by DIOGENE in the CLEF-2007 Main QA task and
presenting directions for future work.


2     Exploring Wikipedia
This year the dataset provided by the organizers included a dump of Wikipedia articles. This
new data source posed new problems that had to be addressed, including:

    • Processing Wikipedia articles.
    • Integrating the new document source in an appropriate position in the DIOGENE system
      dataflow.
2.1     Processing Wikipedia Articles
Wikipedia articles contain different types of text: descriptive prose about a topic, formulas,
lists, tables, etc. We considered as a processable unit any text paragraph inside an article, apart
from the Wikipedia links; no other information contained in the remaining parts of the articles
was processed. For each processable unit we cleaned the text, using regular expressions, to
remove the following formatting information:

    • HTML tags.

    • Wikipedia links.
    • Wikipedia comments.
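The cleaning step described above can be sketched as a chain of regular-expression substitutions. The patterns below are illustrative assumptions about the markup involved, not the exact expressions used in DIOGENE:

```python
import re

def clean_unit(text: str) -> str:
    """Strip formatting noise from a Wikipedia text paragraph.

    The three substitutions mirror the three items removed in the
    paper: comments, HTML tags, and Wikipedia links. The patterns
    are a hypothetical reconstruction, not the system's own.
    """
    # Comments (HTML/Wiki style): <!-- ... -->
    text = re.sub(r"<!--.*?-->", "", text, flags=re.S)
    # HTML tags: <b>, </p>, <br/>, ...
    text = re.sub(r"<[^>]+>", "", text)
    # Wikipedia links: [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    # Normalize whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

A cleaned paragraph such as `clean_unit("<b>Rome</b> is the [[Italy|Italian]] capital.")` keeps only the plain text, which is then treated as a processable unit.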

   As a result, the cleaned processable units were considered as potential answer sources. The
open-source search engine Lucene [1] was used to index these Wikipedia documents, while the
MG search engine [5] was used to index the news document collection, as in last year's version
of the DIOGENE QA system.
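Conceptually, indexing the cleaned units amounts to building an inverted index from terms to the units that contain them. The toy implementation below is a stand-in for the Lucene index actually used, shown only to make the retrieval step concrete:

```python
from collections import defaultdict

def build_index(units):
    """Toy inverted index over cleaned processable units.

    Maps each lowercased term to the set of unit ids containing it;
    a simplified stand-in for the Lucene index used in the system.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(units):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of units containing every query term (AND semantics)."""
    postings = [index.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()
```

A real engine adds tokenization, ranking, and compression on top of this basic structure, but the term-to-document mapping is the same.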

2.2     Integration in the System Dataflow
We decided to integrate the Wikipedia document index inside the document retrieval component
of DIOGENE. The system uses a document retrieval technique based on query relaxation loops [3].
This technique is designed to output a limited set of ranked documents (at least 30, at most 100).
The Wikipedia document collection, however, is only considered as an auxiliary information
source because of the noisy documents it contains: our first implementation of the cleaning
procedure often does not return fully reliable processable units, due to the large number of
unremoved tags, special symbols, and other XML annotations. As a result, Wikipedia documents
are treated as a less reliable information source and are accessed only if an insufficient number of
articles (fewer than 30) is returned by the MG search engine over the news document collection.
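The two-tier retrieval policy above can be sketched as follows. The `search_news` and `search_wikipedia` callables are hypothetical stand-ins for the MG and Lucene engines; the 30/100 bounds come from the query-relaxation loop described in the text:

```python
MIN_DOCS, MAX_DOCS = 30, 100  # bounds of the query-relaxation loop

def retrieve(query, search_news, search_wikipedia):
    """Sketch of the fallback policy: the news collection is the
    primary source; the noisier Wikipedia index is queried only when
    the news collection yields too few articles.
    """
    # Primary source: the news document collection (MG).
    docs = search_news(query)[:MAX_DOCS]
    if len(docs) < MIN_DOCS:
        # Auxiliary source: top up with Wikipedia units (Lucene).
        docs += search_wikipedia(query)[:MAX_DOCS - len(docs)]
    return docs
```

Keeping Wikipedia as a top-up rather than a peer source limits the impact of its noisy units on answer extraction.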


3     Dealing with Topic-Related Questions
The other new problem that we had to address was handling a set of questions which share the
same focus. To this end, the focus of the first question in a cluster has to be recognized. For this
purpose, we adopted the following simple heuristic, which defines the focus of a question as the
first noun phrase or multi-word expression after the main verb of the question, if it is capitalized,
or the second one if the first is in lower case.
    Examples of the focus identified for some CLEF-2007 questions are the following:

    1. Question – In quale anno è uscito il film Flashdance?
       (In what year was the film Flashdance released? )
       Focus – Flashdance
    2. Question – Quali sono i Grandi Laghi africani?
       (What are the Great African Lakes? )
       Focus – Grandi Laghi africani
    3. Question – Chi è l’autore del libro “Giorni giapponesi”?
       (Who wrote the book “Giorni giapponesi”? )
       Focus – libro “Giorni giapponesi”
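The heuristic can be sketched as a function over chunked question tokens. The input representation below, a list of `(token, is_noun_phrase_member)` pairs produced by a hypothetical chunker, is an assumption; the real system's representation may differ:

```python
def extract_focus(tagged, main_verb_idx):
    """Rough sketch of the focus heuristic: the first capitalized noun
    phrase after the main verb, otherwise the second noun phrase.

    `tagged` is a list of (token, in_noun_phrase) pairs; `main_verb_idx`
    is the position of the question's main verb (both assumed to come
    from upstream linguistic processing).
    """
    # Collect maximal noun-phrase chunks occurring after the main verb.
    phrases, current = [], []
    for token, in_np in tagged[main_verb_idx + 1:]:
        if in_np:
            current.append(token)
        elif current:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    if not phrases:
        return None
    first = phrases[0]
    if first[0][0].isupper():
        return " ".join(first)  # first NP is capitalized: take it
    # Otherwise fall back to the second noun phrase, if any.
    return " ".join(phrases[1]) if len(phrases) > 1 else None
```

On example 2 above ("Quali sono i Grandi Laghi africani?", main verb "sono"), the first noun phrase after the verb is the capitalized "Grandi Laghi africani", which the sketch returns as the focus.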

    Once the focus of the input question Q1 is identified, it is added as a keyword (or a conjunction
of keywords) to the search queries of the following questions Q2, ..., Qn in the cluster, unless it is
already present among their terms.
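The expansion step is a direct keyword-list operation, sketched below under the assumption that queries are represented as lists of terms:

```python
def expand_query(query_terms, focus):
    """Add the focus of Q1, as a conjunction of keywords, to the query
    of a follow-up question, unless its terms are already present.
    """
    present = {t.lower() for t in query_terms}
    extra = [t for t in focus.split() if t.lower() not in present]
    return query_terms + extra
```

For instance, the query of the follow-up question "Dove si trova?" would be expanded with the focus "Rhode Island" of its cluster's first question, while a query already containing those terms is left unchanged.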



4    Results
Apart from these slight modifications to the system architecture, our submission to this year’s
edition of the CLEF QA task (results are reported in Table 1) was obtained with the same
system components described in our previous participation in CLEF [2], and reflects the “work-
in-progress” status of the DIOGENE QA system.

        Task               Overall (%)    Def. (%)     List (%)    Factoid (%)     Temp. (%)
        Italian/Italian          11.50         2.36         0.00          15.17         12.50

                           Table 1: System performance in the QA tasks


    A preliminary analysis of the results achieved focused on the impact of the adaptations of the
system to this year’s task.
    As for Wikipedia articles, potential answer candidates were extracted from this additional
resource for only 9 questions (for a total of 38 candidates). Among them, the final answer returned
by DIOGENE came from Wikipedia in 6 cases, but only in one case was it the correct one (i.e.
Q-0134: “Quanto dista Dunleary da Dublino?” - “How far is it from Dunleary to Dublin?”).
    As for topic-related questions, our focus extraction heuristic was applied to 67 questions. The
focus was correctly added to the search keywords of a question in 42 cases, leading to 5 questions
being correctly answered. In one case it is not clear what the focus actually is, making a decision
about its correctness rather difficult. This is:

      Q-0113 – Qual é la capitale di Rhode Island?
      (What is the capital of Rhode Island? )
      Q-0114 – Dove si trova?
      (Where is it located? )



5    Conclusions
In this report we presented the adaptations of the FBK-irst DIOGENE QA system made for our
participation in the CLEF-2007 Main QA track. These adaptations addressed the problems posed
by the two novelties of this year’s edition of the task, namely the extension of the document
collection with Wikipedia articles and the introduction of topic-related questions. The results
achieved by the system show that our basic procedures for dealing with these problems need to be
refined. In particular, as a first step, the cleaning procedure designed to extract reliable processable
units from Wikipedia articles will be improved, allowing for a more effective exploitation of this
resource. As for topic-related questions, future improvements will address the focus selection
strategy, either with refined heuristics or with supervised approaches as proposed in [4].


References
[1] Erik Hatcher and Otis Gospodnetic. Lucene in Action (In Action series). Manning Publica-
    tions, December 2004.
[2] Milen Kouylekov, Matteo Negri, Bernardo Magnini, and Bonaventura Coppola. Towards
    Entailment-based Question Answering: ITC-irst at CLEF 2006. In Cross Language Evalua-
    tion Forum (CLEF-2006), Alicante, Spain, 2006.
[3] Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. Is It the Right Answer?
    Exploiting Web Redundancy for Answer Validation. In Proceedings of the 40th Annual Meeting
    of the Association for Computational Linguistics (ACL-2002), pages 1495–1500, Philadelphia
    (PA), 7-12 July 2002.
[4] Matteo Negri and Milen Kouylekov. “Who Are We Talking About?” Tracking the Referent
    in a Question Answering Series. In Proceedings of the 6th Discourse Anaphora and Anaphor
    Resolution Colloquium (DAARC 2007), Lagos, Portugal, March 29-30 2007.
[5] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and
    Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco, CA, 1999.