<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The LIMSI participation to the QAst track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sophie Rosset</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Galibert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Bernard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Bilinski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilles Adda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Spoken Language Processing Group, LIMSI-CNRS</institution>
          ,
          <addr-line>Orsay cedex, France</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present the LIMSI question-answering systems for speech transcripts which participated in the QAst 2008 evaluation. These systems are based on a complete and multilevel analysis of both queries and documents. They use an automatically generated search descriptor. A score based on this descriptor is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the search descriptor elements and a number of secondary factors. We participated in all the subtasks and submitted 18 runs (for 16 sub-tasks). The evaluation results range from 31% to 45% accuracy for manual transcripts, depending on the task, and from 16% to 41% for automatic transcripts.</p>
      </abstract>
      <kwd-group>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Introduction</title>
      <sec id="sec-2-1">
        <title>Question answering, speech transcriptions</title>
        <p>
          In the QA and Information Retrieval domains, progress has been demonstrated via evaluation campaigns
for both open domain and limited domains [
          <xref ref-type="bibr" rid="ref3 ref6">7, 4, 1</xref>
          ]. In these evaluations, systems are presented with either
independent or linked questions and should provide one answer, extracted from textual data, to each question.
Recently, there has been growing interest in extracting information from multimedia data such as meetings
and lectures. Spoken data differs from textual data in various ways. The grammatical structure of
spontaneous speech is quite different from that of written discourse and includes various types of disfluencies. The lecture
and interactive meeting data provided in the QAst evaluation are particularly difficult due to run-on sentences
and interruptions. Most QA systems use a complete and deep syntactic and semantic analysis of both
the question and the document, or snippets given by a search engine, and search for the answer in the result.
Such an analysis cannot be performed reliably on the data we are interested in.
        </p>
        <p>In this paper, we present the architecture of the QA systems developed at LIMSI for the QAst evaluation.
This year 10 general subtasks were proposed:</p>
        <list list-type="bullet">
          <list-item><p>T1a: Question Answering in manual transcriptions of lectures (CHIL corpus)</p></list-item>
          <list-item><p>T1b: Question Answering in automatic transcriptions of lectures (CHIL corpus)</p></list-item>
          <list-item><p>T2a: Question Answering in manual transcriptions of meetings (AMI corpus)</p></list-item>
          <list-item><p>T2b: Question Answering in automatic transcriptions of meetings (AMI corpus)</p></list-item>
          <list-item><p>T3a: Question Answering in manual transcriptions of broadcast news for French (ESTER corpus)</p></list-item>
          <list-item><p>T3b: Question Answering in automatic transcriptions of broadcast news for French (ESTER corpus)</p></list-item>
          <list-item><p>T4a: Question Answering in manual transcriptions of European Parliament Plenary sessions in English (EPPS English corpus)</p></list-item>
          <list-item><p>T4b: Question Answering in automatic transcriptions of European Parliament Plenary sessions in English (EPPS English corpus)</p></list-item>
          <list-item><p>T5a: Question Answering in manual transcriptions of European Parliament Plenary sessions in Spanish (EPPS Spanish corpus)</p></list-item>
          <list-item><p>T5b: Question Answering in automatic transcriptions of European Parliament Plenary sessions in Spanish (EPPS Spanish corpus)</p></list-item>
        </list>
        <p>For the tasks T3b, T4b and T5b, 3 different collections (each corresponding to one automatic
speech recognition output) were provided, with 3 different Word Error Rates (WER), to allow
studies of the impact of the WER on the Question Answering task. We submitted 2 runs for the T3a and T5a
tasks and one for each of the other tasks. In total, we submitted 18 runs. We used exactly the same system for each
manual and ASR collection in order to be able to evaluate the impact of the WER on the overall system. For
the different languages and tasks, we used basically the same system; the only changes were the analysis,
which is language dependent, and the tuning parameters learned on the development data set.
The next section presents the pre-processing of documents and queries and the non-contextual analysis,
including the work carried out this year to adapt our analysis system to Spanish. In Section 3, we
present the selection of documents and snippets and the extraction and scoring of answers. Finally, Section 4
presents the results for these systems on both development and test data.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Analysis of documents and queries</title>
      <p>Usually, the syntactic/semantic analysis is different for the document and for the query; our approach is
instead to perform the same complete and multilevel analysis on both queries and documents. There are
several reasons for this. First of all, the system has to deal with both transcribed speech (transcriptions
of meetings and lectures, user utterances) and text documents, so there should be a common analysis that
takes into account the specifics of both data types. Moreover, incorrect analyses due to the lack of context
or to the limitations of hand-coded rules are likely to happen on both data types, so using the same strategy for
document and utterance analysis helps to reduce their negative impact. But first, we need to reduce the
surface form variations between the different modalities (text, manual transcripts, automatic transcripts) in
order to have a common representation and use of words, sentences, case, etc. This process, a superset of
tokenization, is called normalization.</p>
      <sec id="sec-3-1">
        <title>Normalization</title>
        <p>Normalization, in our application, is the process by which raw texts are converted to a text form where
words and numbers are unambiguously delimited, capitalization happens on proper nouns only, punctuation
is separated from words, and the text is split into sentence-like segments (or as close to sentences as is
reasonably possible). Different normalization steps are applied, depending on the kind of input data; these
steps are:</p>
        <list list-type="order">
          <list-item><p>Separating words and numbers from punctuation.</p></list-item>
          <list-item><p>Reconstructing correct case for the words.</p></list-item>
          <list-item><p>Adding punctuation.</p></list-item>
          <list-item><p>Splitting into sentences at period marks.</p></list-item>
        </list>
        <p>
          Reconstructing the case and adding punctuation are done in the same process, based on a fully-cased,
punctuated language model [
          <xref ref-type="bibr" rid="ref2">3</xref>
          ]. A word graph was built covering all the possible variants (all possible
punctuations added between words, all possible word cases), and a 4-gram language model was used to select
the most probable hypothesis. The language model was estimated on the House of Commons Daily Debates,
the final edition of the European Parliament Proceedings and various newspaper archives. The final result,
with uppercase only on proper nouns and words clearly separated by white-spaces, is then passed to the
non-contextual analysis.
        </p>
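        <p>As an illustration of the joint case and punctuation reconstruction, the following minimal sketch runs a Viterbi search over the variants of each word. It is not the system described here: it assumes a toy bigram model in place of the 4-gram model, only considers a lowercase and a capitalized variant per word plus an optional period, and every probability in the BIGRAMS table is invented for the example. The names restore, BIGRAMS and UNSEEN are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Toy bigram log-probabilities, invented for this example. The system in the
# text uses a 4-gram model estimated on large cased, punctuated corpora.
BIGRAMS = {
    ("<s>", "the"): -0.5,
    ("the", "president"): -1.0,
    ("president", "visited"): -1.0,
    ("visited", "Paris"): -0.7,
    ("visited", "paris"): -3.0,
    ("Paris", "."): -0.4,
    ("paris", "."): -1.5,
    (".", "</s>"): -0.1,
}
UNSEEN = -6.0  # flat penalty for any bigram missing from the toy table


def restore(words):
    """Return the most probable cased, punctuated token sequence."""
    # One slot per position of the word graph: the case variants of a word,
    # then an optional period ("" stands for "no punctuation here").
    slots = []
    for word in words:
        slots.append([word.lower(), word.capitalize()])
        slots.append(["", "."])
    # Viterbi search: best[last_token] = (log-probability, tokens so far).
    best = {"<s>": (0.0, [])}
    for slot in slots:
        new_best = {}
        for prev, (score, seq) in best.items():
            for tok in slot:
                if tok == "":
                    key, cand = prev, (score, seq)
                else:
                    step = BIGRAMS.get((prev, tok), UNSEEN)
                    key, cand = tok, (score + step, seq + [tok])
                if key not in new_best or cand[0] > new_best[key][0]:
                    new_best[key] = cand
        best = new_best
    # Close every hypothesis with the end-of-sentence marker and keep the best.
    final = [(score + BIGRAMS.get((prev, "</s>"), UNSEEN), seq)
             for prev, (score, seq) in best.items()]
    return max(final, key=lambda item: item[0])[1]


print(restore("the president visited paris".split()))
```
]]></preformat>
        <p>With this toy table, the lowercase input "the president visited paris" comes back with a capitalized proper noun and a final period. A longer n-gram context only changes the state kept by the search, not its principle.</p>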
      </sec>
      <sec id="sec-3-2">
        <title>Analysis module</title>
        <p>The non-contextual analysis aims at extracting, from both user utterances and documents, what is
considered to be pertinent information. The analysis covers multiple levels: named entity detection, linguistic
chunking, question word classification and question topic detection. An example of an analysis result
appears in Figure 2. In that example, New-York is recognized as a named entity, specifically an organization.
The words municipal elections are chunked together as a compound noun, which makes it available as a search key in the
QA system. The word who is detected as a question word related to a person, and its combination with won allows
the question to be classified as one about someone’s victory or achievement.</p>
        <p>
          The types we need to detect correspond to two levels of analysis: named-entity recognition and
chunk-based shallow parsing. Various strategies for named-entity recognition using machine learning techniques
have been proposed [
          <xref ref-type="bibr" rid="ref1 ref4 ref5">2, 5, 6</xref>
          ]. In these approaches, a statistically pertinent coverage of all defined types
and subtypes requires a large number of occurrences; they therefore rely on the availability of
large annotated corpora, which are difficult to build. Rule-based approaches to named-entity recognition
(e.g. [
          <xref ref-type="bibr" rid="ref7">8</xref>
          ]) rely on morphosyntactic and/or syntactic analysis of the documents. However, in the present work,
performing this sort of analysis is not feasible: the speech transcriptions are too noisy to allow for both
accurate and robust linguistic analysis based on typical rules. We use an internal tool to write grammars based
on regular expressions on words. Our tool allows the use of lists for initial detection, and the definition of
local contexts and simple categorizations. This engine matches (and substitutes) regular expressions using
words as the base unit instead of characters. This property allows for a more readable syntax than traditional
regular expressions and enables the use of classes (lists of words) and macros (sub-expressions in-line in a
larger expression).
        </p>
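        <p>The internal grammar tool is not described further, so the following is only a minimal sketch of what matching and substituting patterns with words as the base unit can look like. The pattern syntax (a percent sign for a word class, a dollar sign for a macro, a caret for a capitalized word), the class and macro contents, and the function names are all assumptions made for this example. Patterns here have a fixed length, whereas a real engine would also support repetition and alternatives.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical word classes (lists of words) and macros (sub-patterns).
CLASSES = {"TITLE": {"mr", "mrs", "dr", "president"}}
MACROS = {"FULLNAME": ["^Cap", "^Cap"]}


def expand(pattern):
    """Inline every $MACRO item into the pattern it belongs to."""
    items = []
    for item in pattern:
        if item.startswith("$"):
            items.extend(expand(MACROS[item[1:]]))
        else:
            items.append(item)
    return items


def matches(item, word):
    """Test one pattern item against one word."""
    if item.startswith("%"):          # %NAME: membership in a word class
        return word.lower() in CLASSES[item[1:]]
    if item.startswith("^"):          # ^Cap: any capitalized word
        return word[:1].isupper()
    return item == word.lower()       # literal word


def annotate(words, pattern, label):
    """Replace each non-overlapping match by a (label, matched words) node."""
    items = expand(pattern)
    out, i, n = [], 0, len(items)
    while i < len(words):
        window = words[i:i + n]
        if len(window) == n and all(matches(p, w) for p, w in zip(items, window)):
            out.append((label, window))
            i += n
        else:
            out.append(words[i])
            i += 1
    return out


print(annotate("yesterday Dr Hans Krasa spoke".split(), ["%TITLE", "$FULLNAME"], "pers"))
```
]]></preformat>
        <p>In this example the three words Dr Hans Krasa are replaced by a single node typed pers, while the surrounding words are left untouched. Working on words rather than characters is what keeps the pattern short and readable.</p>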
        <sec id="sec-3-2-1">
          <title>Adaptation to English and Spanish languages</title>
          <p>This analysis is obviously language dependent. The French analyzer detects about 300 types and forms
the basis for the adaptation of the Spanish and English (T4 task only) analyzers. This year was our first attempt
at working with Spanish. The Spanish analyzer was created as a simple adaptation of the French one
where only the lexicons were adapted, and only around 50% of them. For English, a deeper adaptation is
required; in particular, the order in which the blocks of rules are applied is reversed. The English and Spanish
analyzers detect only about a hundred types.</p>
          <p>We now plan to use aligned corpora in order to automatically acquire some specific lexicons.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Question-Answering system</title>
      <p>The input request takes the form of an analyzed question. From that information a Search Descriptor is built,
which is the basis of all the following search algorithms.</p>
      <sec id="sec-4-1">
        <title>Search Descriptor Generation</title>
        <p>This descriptor is structured in 3 parts: the elements of the input considered pertinent for the search, the
expected type or types for the answer, and a number of tuning parameters.</p>
        <p>The types considered pertinent are the named entities (standard, extended and nonspecific) and the linguistic
chunks. Each entity also carries a weight, set by rules, and a critical/secondary flag. Critical entities must be
present in a document near a candidate answer; secondary entities only give a bonus to the final score. This
distinction aims at increasing the precision of the system. In practice, all named entities and some linguistic chunks
are considered critical according to, once again, a set of rules. The expected answer types and their weights
are decided using a 2-level rule-based classifier built by examining the development data and generalized by
hand. The tuning parameters are set empirically by systematic trials on the development data. Moreover, as
shown in Figure 3, possible transformations of the elements are described. These possible transformations are
obtained from a few rules. This year, we used this concept to allow weighted morphological derivations and
synonymic transformations. The lexicon used for morphological derivations was built on our corpus
by using the analysis module to extract all values of the considered types (for example all adjectives and nouns)
and applying some derivational rules to these lists in order to build morphological correspondences. We tried
various algorithms, and that simple method was the one obtaining the best results on the development data
set for each language and task.</p>
        <fig id="fig-3">
          <label>Figure 3</label>
          <preformat>Question: when was Hans Krasa killed?
- Critical element
    - 1.0 pers identity(Hans Krasa)
    - 0.2 pers expand(Hans Krasa)
- Secondary element
    - 1.0 verb identity(killed)
    - 0.7 verb lemma(killed)
    - 0.5 verb synonym(killed)
    - 0.5 subs verb_subs(killed)
- Answer types
    - 1.0 full_date
    - 0.9 month_year, day_month, hour
    - 0.7 year</preformat>
        </fig>
        <p>Once the Search Descriptor (SD) is built, the next step is to generate a list of the n documents with the
highest probability of containing the answer. The method is fundamentally simple: give a score to all the
documents that include at least one element of the SD and pick the n with the best scores. The score we have
chosen is based on the occurrence counts of the elements, weighted by the SD weights. The tree structure
is taken into account: the scores of elements in the same node are added, and the scores of the children of a node are
combined by taking their geometric mean. The geometric mean has two advantages: it avoids the need to compensate for the
differences in global frequency of the elements, since the counts are multiplied together, and it ensures that a
zero count on a critical element propagates into a global zero count. Accordingly, 1 is added to the secondary
element nodes to avoid the zero-propagation effect. The document score is the score of a virtual root node
placed above all the top nodes.</p>
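        <p>The scoring rule above can be sketched as follows. This is one possible reading, not the actual implementation: the tree is assumed to be a nested dictionary whose leaves hold (weight, count) pairs for the variants of one element, internal nodes take the geometric mean of their children, and the keys variants, children and secondary, like the helper descriptor, are hypothetical names.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def node_score(node):
    """Score one node of the Search Descriptor tree for a single document."""
    if "children" in node:
        # Internal node: geometric mean of the child scores, so that a zero
        # child (a missing critical element) forces the result to zero.
        scores = [node_score(child) for child in node["children"]]
        score = math.prod(scores) ** (1.0 / len(scores))
    else:
        # Leaf node: occurrence counts of the variants, weighted and added.
        score = sum(weight * count for weight, count in node["variants"])
    if node.get("secondary"):
        score += 1.0  # secondary nodes must never zero out the document
    return score


def select_documents(trees, n):
    """Return the n best (name, score) pairs among documents scoring above 0."""
    scored = [(name, node_score(tree)) for name, tree in trees.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


def descriptor(krasa, krasa_expanded, killed):
    """Hypothetical tree for the Hans Krasa question, filled with counts."""
    return {"children": [
        {"variants": [(1.0, krasa), (0.2, krasa_expanded)]},   # critical
        {"variants": [(1.0, killed)], "secondary": True},      # secondary
    ]}


docs = {
    "doc1": descriptor(2, 1, 0),
    "doc2": descriptor(0, 0, 3),   # no critical element: score is zero
    "doc3": descriptor(1, 0, 2),
}
print(select_documents(docs, 2))
```
]]></preformat>
        <p>In this example doc2 mentions the verb three times but never the person, so its score is zero and it is dropped, whereas doc3 outranks doc1 thanks to the bonus brought by the secondary element.</p>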
        <p>The index gives the raw occurrence counts for each of the elements. Because the analysis produces hierarchical
annotations, the same instance of an element can appear under multiple types. For instance, France is typed
as both country and location or organization each time it appears in a document. To compensate for that, the
counts are recomputed by subtracting the number of occurrences taken into account for the other elements
of the same or upper nodes.</p>
        <p>In the specific case of QAst, where the document count is very low, n is set high enough that all the documents
with at least one element are picked.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Snippets selection and scoring</title>
        <p>The snippet selection step aims at selecting, in the documents, blocks of lines with a high expectation of
containing the answer. That action has a dual effect: faster answers, by reducing the number of candidates to
look at, and better precision of the answers given, by reducing the noise introduced by faraway candidates.
The idea of the method is that each element of the SD has a distance of influence, or range, which is counted
in lines, that is sentences for text documents or utterances for spoken documents. The algorithm starts by
extracting all the lines which have elements in range to satisfy all the critical elements of the SD, building
that way a series of blocks. Blocks that are too big, i.e. above a critical size, are split up to try to push them under
the critical size by temporarily promoting some of the secondary elements to critical status. Eventually all
the blocks are small enough, or all the elements have become critical and no more splitting is possible.
We want these snippets to be self-contained for later candidate evaluation, which means that they must
include all the elements found in the SD that made them pertinent. But in some cases two critical elements
are too far apart from each other for the lines they are in to be kept, while some lines in the middle are within
range of both and as such form an element-less snippet. To fix these situations, the snippet boundaries are
extended to cover the neighboring lines where influential elements are present.</p>
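        <p>A minimal sketch of the first and last steps of this selection (keeping the lines in range of every critical element, then extending the blocks to the lines that hold those elements) is given below. The splitting of oversized blocks is left out, and the function name, its arguments and the example ranges are assumptions made for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def select_snippets(positions, ranges, n_lines):
    """Return (first_line, last_line) snippets for one document.

    positions: element -> list of line numbers where it occurs.
    ranges: critical element -> range of influence, counted in lines.
    """
    def in_range(line, elem):
        return any(abs(line - p) <= ranges[elem] for p in positions.get(elem, []))

    # Keep the lines that are in range of every critical element.
    kept = [line for line in range(n_lines)
            if all(in_range(line, elem) for elem in ranges)]
    # Group consecutive kept lines into [start, end] blocks.
    blocks = []
    for line in kept:
        if blocks and line == blocks[-1][1] + 1:
            blocks[-1][1] = line
        else:
            blocks.append([line, line])
    # Extend each block to the lines holding the elements that influence it,
    # so that the snippet contains the elements that made it pertinent.
    snippets = []
    for start, end in blocks:
        lo, hi = start, end
        for elem, reach in ranges.items():
            for p in positions.get(elem, []):
                if start - reach <= p <= end + reach:
                    lo, hi = min(lo, p), max(hi, p)
        snippets.append((lo, hi))
    return snippets


# Two critical elements ten lines apart, each with a range of six lines:
# only lines 4 to 6 are in range of both, and neither contains an element.
print(select_snippets({"A": [0], "B": [10]}, {"A": 6, "B": 6}, 12))
```
]]></preformat>
        <p>In the example, the element-less block made of lines 4 to 6 is extended to lines 0 to 10 so that both critical elements fall inside the snippet.</p>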
      </sec>
      <sec id="sec-4-3">
        <title>Answers selection and scoring</title>
        <p>The snippets are sorted by score and examined one by one independently. Every element in a snippet with a
type found in the list of expected answer types of the SD is considered an answer candidate. Each candidate
is given a score, which is the sum of the distances between itself and the elements of the SD, each
raised to the power −α and weighted by the element weights. That score is smoothed with the snippet score
through a δ-weighted geometric mean. This extraction and scoring stops once a number m of candidates
has been reached, once again to control the speed of the system. All the scores for the different instances of
the same element are added together and, in order to compensate for the differing natural frequencies of
the entities in the documents, the final score is divided by the occurrence count in all the documents and in
all the examined snippets, raised to the power β and γ respectively. The entities with the best scores
then win. The tuning parameters α, β, γ, δ all come from the third part of the SD and have been selected by
systematic trials on the development corpus. These parameters are set for each question class.</p>
        <p>Our second approach for answer scoring is built upon the results of that first one. We compute a new ranking
of the answers with a tree transformation method. For each candidate answer to a question, we transform
the tree of the snippet from which the answer was extracted into the tree of the question. The sequence of
operations used for the transformation gives us a transformation cost. The candidate answers are re-ranked
using these costs. We applied this method as a second run for the T3a and T5a tasks. The results do not yet show
the expected improvement, but this work is still in progress and further analysis is needed. One positive
aspect of these trials is that they show that this approach is completely language independent (the same results
are obtained for French and Spanish).</p>
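        <p>The first, proximity-based scoring can be sketched as below. The sketch makes assumptions that the text leaves open: distances are taken to be strictly positive, the geometric mean gives the exponent δ to the snippet score and 1 − δ to the proximity score, and the function name, data layout and numeric values are invented for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def rank_answers(candidates, doc_counts, snip_counts, alpha, beta, gamma, delta):
    """Rank answer values from their candidate instances.

    candidates: list of (value, snippet_score, [(weight, distance), ...]),
    one entry per instance found in a snippet; distances must be positive.
    doc_counts / snip_counts: occurrences of each value in all the documents
    and in all the examined snippets.
    """
    totals = defaultdict(float)
    for value, snippet_score, elements in candidates:
        # Proximity: SD element weights times distance ** -alpha, summed.
        proximity = sum(weight * dist ** -alpha for weight, dist in elements)
        # Smoothing with the snippet score: delta-weighted geometric mean.
        totals[value] += proximity ** (1 - delta) * snippet_score ** delta
    # Compensate for the natural frequency of each value.
    ranked = {
        value: total / (doc_counts[value] ** beta * snip_counts[value] ** gamma)
        for value, total in totals.items()
    }
    return sorted(ranked.items(), key=lambda pair: pair[1], reverse=True)


candidates = [
    ("1944", 2.0, [(1.0, 1), (0.5, 2)]),   # close to both SD elements
    ("1944", 1.0, [(1.0, 4)]),             # second instance, further away
    ("1942", 2.0, [(1.0, 5)]),
]
print(rank_answers(candidates, {"1944": 4, "1942": 4}, {"1944": 2, "1942": 1},
                   alpha=1.0, beta=0.5, gamma=0.5, delta=0.5))
```
]]></preformat>
        <p>Here the value 1944 comes first: its two instances add up, and the closest one sits next to the highest-weighted element. Raising β or γ penalizes values that are frequent everywhere in the collection.</p>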
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <sec id="sec-5-1">
        <title>Training and Development data</title>
        <p>The official development data consisted of 50 questions for each task. The development documents were 10
seminars for the T1 task, 50 meetings for the T2 task, 6 shows for the T3 task, 4 for the T4 task and 1 for the
T5 task. As we observed last year, 50 questions are clearly not enough to correctly tune a system. We therefore
decided to hand-build a corpus of reformulated questions for each task and to use it as a training
corpus. We built corpora of questions, answers and documents for the T3, T4 and T5 tasks, and we used the 2007
evaluation data for the T1 and T2 tasks as blind development data. Table 1 gives a general overview of the
different corpora used.</p>
        <p>[Table 1: overview of the corpora used for the T1, T2, T3, T4 and T5 tasks. Only the row labels survive; the cell contents are not recoverable.]</p>
        <p>We compared the results obtained on our different corpora (training, on which the tuning is done, and
development, a blind corpus on which only the synthetic scores are looked at) and on the 2008 evaluation. The
following tables give the results obtained on the different development sets and on the test.</p>
        <p>[Results tables: only the column headers (Accuracy, MRR, Recall) and a single row (50, 82%, 0.90, 100%) survive.]</p>
        <p>We did not do anything specific to handle recognition errors in the documents; the systems were
used as-is. As such, our results show the loss due to the ASR on a decent but non-adapted system. The
T3b, T4b and T5b tasks provided three different ASR outputs, allowing an analysis of the impact of WER on
the overall QA results. Table 8 gives the results on the ASR outputs depending on the task, the word error rate
and the accuracy obtained on the respective manual transcriptions. The WERs for the T1b and T2b tasks are
unknown.</p>
        <table-wrap id="tab-8">
          <label>Table 8</label>
          <caption>
            <p>Accuracy (Acc.) and word error rate (WER) on each ASR output, and accuracy on the corresponding manual transcriptions (MAN), for the T3, T4 and T5 tasks.</p>
          </caption>
          <table>
            <thead>
              <tr><th rowspan="2">Task</th><th colspan="2">ASR_A</th><th colspan="2">ASR_B</th><th colspan="2">ASR_C</th><th>MAN</th></tr>
              <tr><th>Acc.</th><th>WER</th><th>Acc.</th><th>WER</th><th>Acc.</th><th>WER</th><th>Acc.</th></tr>
            </thead>
            <tbody>
              <tr><td>T3</td><td>41%</td><td>11%</td><td>25%</td><td>23.9%</td><td>21%</td><td>35.4%</td><td>45%</td></tr>
              <tr><td>T4</td><td>21%</td><td>10.6%</td><td>20%</td><td>14%</td><td>19%</td><td>24.1%</td><td>33%</td></tr>
              <tr><td>T5</td><td>24%</td><td>11.5%</td><td>19%</td><td>12.7%</td><td>23%</td><td>13.7%</td><td>33%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The better quality, including robustness, of the French analysis shows up immediately again, the loss at
equivalent error rate being roughly halved (5% instead of 10% at 11% WER). The loss rate does not seem to
be easily predictable from the WER, but there are not enough data points to be sure. It may just be that 100
questions and a small number of documents are not enough to compute reliable statistics. A deeper analysis
measuring the word error rate by word category could provide some interesting insights.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We presented the LIMSI question-answering systems for speech transcripts which participated in the QAst
2008 evaluation. These systems are based on a complete and multi-level language-dependent analysis of both
queries and documents, followed by language-independent information retrieval and answer extraction and
scoring. These systems obtained state-of-the-art results on the different tasks and languages.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] Christelle Ayache, Brigitte Grau, and Anne Vilnat. Evaluation of question-answering systems: The
French EQueR-EVALDA Evaluation Campaign. In Proceedings of LREC’06, Genoa, Italy, 24-26 May
2006.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.M.</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Weischedel</surname>
          </string-name>
          .
          <article-title>Nymble : a high-performance learning namefinder</article-title>
          .
          <source>In Proceedings of ANLP'97</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Déchelotte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
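          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Adda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          .
          <article-title>Improved machine translation of speech-to-text outputs</article-title>
          .
          <source>Antwerp, Belgium</source>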
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Giampiccolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Forner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ayache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cristea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jijkoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Osenova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sacaleanu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sutcliffe</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF 2007 Multilingual Question Answering Track</article-title>
          .
          <source>In Working Notes for the CLEF 2007 Workshop</source>
          , Budapest, Hungary,
          <year>September 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Isozaki</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Kazawa</surname>
          </string-name>
          .
          <article-title>Efficient support vector classifiers for named entity recognition</article-title>
          .
          <source>In Proceedings of COLING</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Turmo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Comelles</surname>
          </string-name>
          .
          <article-title>Named entity recognition from spontaneous open-domain speech</article-title>
          .
          <source>In InterSpeech'05</source>
          , Lisbon, Portugal,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          and
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Buckland</surname>
          </string-name>
          .
          <article-title>The Sixteenth Text REtrieval Conference Proceedings (TREC 2007)</article-title>
          . In Voorhees and Buckland, editors,
          <source>NIST Special Publication 500-274</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vichot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dillet</surname>
          </string-name>
          .
          <article-title>Automatic processing of proper names in texts</article-title>
          .
          <source>In Proceedings of EACL'95</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>