<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLEF2007 Question Answering Experiments at Tokyo Institute of Technology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E.W.D. Whittaker</string-name>
          <email>edw@furui.cs.titech.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J.R. Novak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Heie</string-name>
          <email>heie@furui.cs.titech.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S. Furui</string-name>
          <email>furui@furui.cs.titech.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, Tokyo Institute of Technology</institution>
          ,
          <addr-line>2-12-1, Ookayama, Meguro-ku, Tokyo 152-8552</addr-line>
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe the experiments carried out at Tokyo Institute of Technology for the CLEF 2007 QAst (Question Answering in speech transcripts) pilot task, as well as our results from the official evaluation. We apply a non-linguistic, data-driven approach to Question Answering (QA), based on a noisy channel model. The system we use for the QAst evaluation comprises an Information Retrieval (IR) module, which uses an LM-based approach to sentence retrieval, and an Answer Extraction (AE) module, which identifies and ranks the exact answer candidates in the retrieved sentences. Our team participated in the CLEF 2007 QAst pilot track, task T1: QA in manual transcriptions of lectures, and task T2: QA in automatic transcriptions of lectures. In the official evaluation our system achieved a best run MRR of 0.20 and a top1 score of 0.14 on task T1, and a best run MRR of 0.12 and a top1 score of 0.08 on task T2, placing us 3rd in a field of 5 teams that submitted results for these tasks. All experiments and evaluations described in this paper were conducted using the CHIL corpus (transcriptions of lectures), which was supplied to all track participants by the QAst track coordinators. ASR lattices were also provided by LIMSI; however, we did not use these during the official evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Question answering</kwd>
        <kwd>Language modeling</kwd>
        <kwd>Speech recognition</kwd>
        <kwd>Spoken document retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this paper we explain our experimental setup and general approach to automatic Question
Answering (QA), and report our official evaluation results for the CLEF 2007 QAst (Question
Answering in speech transcripts) pilot track. We employed an entirely data-driven, non-linguistic
and largely language independent QA framework for the QAst track, which was similar but not
identical to that which we used in previous QA evaluations such as TREC 2006, CLEF 2006,
NTCIR 2006, etc. This approach, which is detailed in [11, 12, 13], centers on a noisy-channel
model of the QA problem and generally speaking relies on the redundancy of answer data in the
target corpus in order to identify and extract correct answers.</p>
      <p>Our QAst system comprises two major components, an Information Retrieval (IR) module used
to identify and retrieve relevant sentences from a large corpus, and an Answer Extraction module
which is used to identify and rank exact answers in the sentences returned by the IR module.
Our approach, which is data-driven and does not require human-guided interaction except for the
development of a short list of frequent stop words and common question words, makes it possible
to rapidly develop new systems for a wide variety of different languages. Furthermore, performance is roughly comparable even across very disparate languages such as English and Japanese,
and developers need not have more than a perfunctory acquaintance with the language [6, 14] in
order to build and deploy a new system.</p>
      <p>Our data-driven approach differs substantially from conventional rule-based approaches, yet it
does share certain features with other approaches in the literature [1, 2, 3, 4, 8, 9, 10]. Systems
which employ similar answer-typing approaches have lately begun to appear [7]; however, most of these systems still utilize some form of specific linguistic knowledge, in contrast to our all-data-driven, non-linguistic classification approach. Although our approach requires that a small
number of parameters be optimized to minimize the effects of data sparsity, these parameters
are all determined at system initialization time and are invariant across different questions. This
means that new data or system settings can be applied without the need for wearisome model
re-training.</p>
      <p>Due to its data-driven nature our QA system performs best when there are numerous redundant
sentences containing the correct answer and question words. This reliance on data redundancy to
help identify correct answers has seldom been a source of difficulty in past evaluations, however the
QAst pilot track presented a unique challenge due to the relatively small size of the CHIL lectures
target corpus. In other closed domain evaluations with medium-sized corpora we have opted to
utilize web data, however this did not seem entirely appropriate for the QAst track due to the
spoken nature of the data and very small corpus size. In part to help combat the resulting data
sparsity, we employed a new language-modeling based sentence retrieval IR module as a precursor
to the Answer Extraction (AE) stage. This sentence retrieval module acts as an intermediate filter
and helps to eliminate noise usually contained in the larger original documents.</p>
      <p>The rest of the paper is structured as follows. Section 2 describes our QA architecture in detail, Section 3 describes our experimental setup, Section 4 presents our results, and Section 5 gives a brief discussion of the results. Finally, Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>QA Architecture for QAst</title>
      <p>The answer to a question depends primarily on the question itself but also on many other factors
such as the identity and location of the questioner, previous questions, social context and so on.
Although such factors are clearly relevant in many situations, they are difficult to model and also
to test. In our approach to QA we therefore limit ourselves to modeling the most straightforward
dependence, the probability of an answer A given the question Q. In the system used for the
QAst evaluation, we divide the work of identifying answers between two major modules, the
Information Retrieval (IR) module which employs an LM-based approach to sentence retrieval,
and the Answer Extraction (AE) module. We briefly describe the IR module, the AE module and
the Query Expansion process below.</p>
      <sec id="sec-2-1">
        <title>Information Retrieval module</title>
        <p>The general approach to IR for QA is to treat the question as a standard search query, but discard
question-type words such as “what”, “when”, “who”, etc., and possibly also a set of stop words.
We employ a language modeling approach to this problem where an individual LM is estimated for
each document. The documents are then ranked according to the conditional probability P (Q|D),
the probability of generating the query Q given the document D.</p>
        <p>In our system we employ a sentence-based retrieval approach similar to that described in [5], where each document comprises only one sentence. Due to the lack of data for training the sentence-specific LMs, all words are treated as independent and a unigram model is applied,</p>
        <p>P(Q|S) = \prod_{i=1}^{|Q|} P(q_i|S), (1)</p>
        <p>where q_i is the i-th query term in the query Q = (q_1 ... q_{|Q|}) composed of |Q| query terms. Throughout this paper we calculate the probability of a query term q given a sentence S in three different ways: P1(q|S), P2(q|S) and P3(q|S), as explained below.</p>
        <p>We use absolute discounting in order to smooth the otherwise sparse LMs, where the probability of a query term q given a sentence S is calculated as:</p>
        <p>P_1(q|S) = \frac{\max\{tf(q,S) - \delta, 0\}}{l(S)} + \frac{\delta \cdot h(S,\delta)}{l(S)} \cdot P(q|B), (2)</p>
        <p>where tf(q,S) is the term frequency of q in S, l(S) is the length (number of words) of S, δ is the discount parameter, h(S,δ) is the count of how many unique words in S have a term frequency higher than δ, and P(q|B) is the unigram probability of the query term q according to the background collection model. Note that if δ &lt; 1 then h(S,δ) is equal to the number of unique words in S.</p>
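        <p>As a rough illustration of how Eqs. (1) and (2) might be computed, the following Python sketch scores toy sentences with an absolute-discounting sentence LM. The function and variable names are our own, and the background model is simply the unigram distribution over the whole toy collection; this is a sketch of the technique, not the system's actual implementation.</p>
        <preformat>
from collections import Counter

def p1(q, sentence, background, delta=0.5):
    """Absolute-discounting estimate P1(q|S) of Eq. (2)."""
    tf = Counter(sentence)
    l_s = len(sentence)
    # h(S, delta): number of unique words in S with term frequency above delta
    # (equal to the number of unique words when delta is below 1)
    h = sum(1 for c in tf.values() if c > delta)
    discounted = max(tf[q] - delta, 0.0) / l_s
    backoff = delta * h / l_s * background.get(q, 1e-9)
    return discounted + backoff

def score(query, sentence, background, delta=0.5):
    """P(Q|S) of Eq. (1): product of unigram term probabilities."""
    p = 1.0
    for q in query:
        p *= p1(q, sentence, background, delta)
    return p

# Toy usage: rank sentences for a (stop-word-filtered) query.
sentences = [["he", "is", "married", "to", "katie", "holmes"],
             ["the", "lecture", "starts", "at", "nine"]]
all_words = [w for s in sentences for w in s]
background = {w: c / len(all_words) for w, c in Counter(all_words).items()}
query = ["married", "katie"]
ranked = sorted(sentences, key=lambda s: score(query, s, background), reverse=True)
</preformat>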
        <p>A problem with the model presented in [5] is that words relevant to the sentence might not occur in the sentence itself, but in the surrounding text. For example, for the question “Who is Tom Cruise married to?”, the sentence “He is married to Katie Holmes” in an article about Tom Cruise should ideally be assigned a high probability, despite the sentence missing the words “Tom” and “Cruise”. To account for this, we train document LMs, P1(q|D), in the same manner as for P1(q|S) in Eq. (2), and perform a linear interpolation between P1(q|S) and P1(q|D):</p>
        <p>P_2(q|S) = (1 - \alpha) \cdot P_1(q|S) + \alpha \cdot P_1(q|D), (3)</p>
        <p>where 0 ≤ α ≤ 1 is an interpolation parameter.</p>
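        <p>A minimal sketch of the interpolation in Eq. (3), reusing the p1 function from the sketch above and assuming the enclosing document has likewise been reduced to a bag of words; the default value of alpha here is arbitrary, whereas in the system the weights were optimized on the development set:</p>
        <preformat>
def p2(q, sentence, document, background, alpha=0.3, delta=0.5):
    """Eq. (3): interpolate the sentence LM with the LM of its whole document."""
    return ((1 - alpha) * p1(q, sentence, background, delta)
            + alpha * p1(q, document, background, delta))
</preformat>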
      </sec>
      <sec id="sec-2-2">
        <title>Query expansion</title>
        <p>In order to help further improve QA performance we experiment with a global query expansion method in which words are grouped beforehand into a set C = {c_1 ... c_{|C|}} of |C| overlapping classes, and we calculate the unigram class model probability of a query term q given a sentence S as follows:</p>
        <p>P_C(q|S) = \sum_{j=1}^{|C|} P(q|c_j) \cdot P(c_j|S), (4)</p>
        <p>where P(q|c_j) = 1/|c_j| if q ∈ c_j, else P(q|c_j) = 0, and |c_j| is the number of words in c_j. P(c_j|S) can be re-written as a sum over the |V| words in the vocabulary V = {w_1 ... w_{|V|}}:</p>
        <p>P(c_j|S) = \sum_{k=1}^{|V|} P(c_j|w_k) \cdot P(w_k|S), (5)</p>
        <p>where P(c_j|w_k) = 1/N(w_k, C) if w_k ∈ c_j, else P(c_j|w_k) = 0, N(w_k, C) is the number of classes in C in which w_k occurs, and P(w_k|S) is the unigram probability of the word w_k given the sentence S.</p>
        <p>The word LM in Eq. (2) and the class LM in Eq. (4) are combined using linear interpolation:</p>
        <p>P_{int}(q|S) = (1 - \beta) \cdot P_1(q|S) + \beta \cdot P_C(q|S), (6)</p>
        <p>where 0 ≤ β ≤ 1 is an interpolation parameter. Pint(q|D) is calculated in a similar manner, and Eq. (3) is then adjusted to give P3(q|S) as follows:</p>
        <p>P_3(q|S) = (1 - \gamma) \cdot P_{int}(q|S) + \gamma \cdot P_{int}(q|D), (7)</p>
        <p>where 0 ≤ γ ≤ 1 is an interpolation parameter. For all QAst evaluation runs, either P2 or P3 was used.</p>
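        <p>The class model of Eqs. (4)-(7) can be sketched as follows, assuming the overlapping classes are available as plain sets of words. The interpolation weights shown are placeholders (in our system they were tuned on the development set), and the helper names are our own rather than part of the actual implementation.</p>
        <preformat>
def p_class(q, sentence, classes):
    """Eqs. (4)-(5): class model P_C(q|S) over overlapping word classes."""
    total = 0.0
    for c in classes:
        if q not in c:
            continue                        # P(q|c_j) = 0 unless q is in c_j
        p_q_given_c = 1.0 / len(c)          # P(q|c_j) = 1/|c_j|
        p_c_given_s = 0.0
        for w in set(sentence):             # only words w_k with P(w_k|S) above zero matter
            if w in c:                      # P(c_j|w_k) = 0 unless w_k is in c_j
                n_wk = sum(1 for d in classes if w in d)        # N(w_k, C)
                p_w_given_s = sentence.count(w) / len(sentence)
                p_c_given_s += (1.0 / n_wk) * p_w_given_s
        total += p_q_given_c * p_c_given_s
    return total

def p3(q, sentence, document, classes, background,
       beta=0.2, gamma=0.3, delta=0.5):
    """Eqs. (6)-(7): combine word and class LMs, then sentence and document LMs."""
    p_int_s = ((1 - beta) * p1(q, sentence, background, delta)
               + beta * p_class(q, sentence, classes))
    p_int_d = ((1 - beta) * p1(q, document, background, delta)
               + beta * p_class(q, document, classes))
    return (1 - gamma) * p_int_s + gamma * p_int_d
</preformat>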
      </sec>
      <sec id="sec-2-3">
        <title>Answer Extraction</title>
        <p>The AE module models the probability of an answer A given a question Q as:</p>
        <p>P(A|Q) = P(A|W, X), (8)</p>
        <p>where W is a set of features describing the question-type part of Q, such as “when”, “why”, “how”, etc., while X is a set of features describing the information-bearing part of Q, i.e. what the question is about and what it refers to. For example, in the questions “Where was Tom Cruise married?” and “When was Tom Cruise married?”, the information-bearing parts are identical while the question-type parts differ. Finding the best answer Â involves a search over all A for the one which maximizes the probability of the above model:</p>
        <p>\hat{A} = \arg\max_A P(A|W, X). (9)</p>
        <p>Using Bayes’ rule and making various conditional independence and uniform prior distribution assumptions, Eq. (9) can be rearranged to give:</p>
        <p>\hat{A} \approx \arg\max_A P(A|X) \cdot P(W|A), (10)</p>
        <p>where P(A|X) is termed the answer retrieval model and P(W|A) the answer filter model. P(A|X) essentially models the proximity of A to features in X, while P(W|A) can be viewed as a LM that models the probability of the question-type features W given a candidate answer A.</p>
        <p>We do not examine the answer retrieval model and the answer filter model further here; see [15] for details.</p>
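        <p>The decomposition in Eq. (10) amounts to re-ranking candidate answer strings drawn from the retrieved sentences by the product of a retrieval score and a type-filter score. The sketch below is purely illustrative: the proximity heuristic and the hard answer-type lookup are crude stand-ins for the actual answer retrieval and answer filter models described in [15].</p>
        <preformat>
def retrieval_score(candidate_position, info_word_positions):
    """Stand-in for P(A|X): reward candidates close to information-bearing words."""
    if not info_word_positions:
        return 1e-9
    nearest = min(abs(candidate_position - p) for p in info_word_positions)
    return 1.0 / (1.0 + nearest)

def filter_score(candidate, question_type, answer_type_of):
    """Stand-in for P(W|A): high score when the candidate has the expected answer type."""
    return 0.9 if answer_type_of.get(candidate) == question_type else 0.1

def rank_answers(candidates, question_type, answer_type_of):
    """Rank (candidate, position, info_positions) tuples by the product in Eq. (10)."""
    scored = [(c, retrieval_score(pos, info) * filter_score(c, question_type, answer_type_of))
              for c, pos, info in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
</preformat>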
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup for QAst</title>
      <p>We participated in task T1: QA in manual transcriptions of lectures, and task T2: QA in automatic
transcriptions of lectures. For the official evaluation we used the data released for the QAst
evaluation task T1 and task T2. This data comprised a development set and an evaluation set
with characteristics described in Table 1. The development set consisted of manual transcripts
(MAN) and ASR-based transcripts (ASR) for 10 lectures, a set of questions, and a set of answers
for each transcript set. The evaluation set consisted of MAN and ASR for 15 lectures, and a set
of 100 questions. The development and evaluation data did not overlap. All questions were of one
of the following answer types: person, location, organization, language, system/method, measure,
time, color, shape, and material. Word lattices were also made available; however, after preliminary experiments with the development data revealed minor inconsistencies between the lattices and the ASR transcripts, we chose not to use any of the lattices in the actual evaluation. No audio was provided.</p>
      <p>We cleaned the data by automatically removing fillers and pauses, and performed simple text processing of abbreviations and numerical expressions using Perl's Lingua CPAN module to ensure consistency between ASR, MAN, questions and answers. ASR documents were sentence-segmented according to the sentence boundaries provided, and MAN was sentence-segmented using an in-house segmenter developed by one of the authors. Our system is not able to identify whether the answer to a question can be found in the corpus; we therefore chose never to return a “nil” response for any question.</p>
      <p>[Table 1: characteristics of the development (Dev. Set) and evaluation (Eval. Set) data sets.]</p>
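      <p>For illustration only, the kind of filler removal and number normalization performed can be approximated as below. The actual system used Perl's Lingua CPAN module, whereas this Python sketch uses a deliberately tiny, hypothetical filler list and normalization table.</p>
      <preformat>
import re

FILLERS = {"uh", "um", "er", "mm"}                  # assumed filler inventory
NUMBERS = {"one": "1", "two": "2", "nine": "9"}     # tiny illustrative table only

def clean(transcript):
    tokens = []
    for tok in transcript.lower().split():
        tok = re.sub(r"[^a-z0-9'-]", "", tok)       # drop stray punctuation and markers
        if not tok or tok in FILLERS:
            continue
        tokens.append(NUMBERS.get(tok, tok))
    return tokens

print(clean("uh the lecture um starts at nine"))    # ['the', 'lecture', 'starts', 'at', '9']
</preformat>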
      <p>For retrieval purposes we filtered out question-type words and stop words (in total 28 words)
from the questions. Using the remaining words as query terms, we ranked sentences according to
either P2(q|S) or P3(q|S), depending on the run. We optimized weights on the development set
and used these weights for the official evaluation.</p>
      <p>Classes for query expansion were generated based on the overlap in features, computed for each word in the vocabulary from a large text corpus using standard mutual information techniques.</p>
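      <p>The paper does not spell out the clustering procedure, but one way such overlapping classes could be derived from mutual-information features is sketched below: each word is represented by its highest-PMI co-occurring words, and words whose feature sets overlap sufficiently are grouped together. This is an assumption-laden sketch, not the procedure actually used.</p>
      <preformat>
import math
from collections import Counter, defaultdict

def pmi_features(sentences, top_k=20):
    """Represent each word by its top_k co-occurring words, ranked by a rough PMI estimate."""
    word_count, pair_count = Counter(), Counter()
    for s in sentences:
        word_count.update(s)
        for i, w in enumerate(s):
            for v in s[:i] + s[i + 1:]:
                pair_count[(w, v)] += 1
    total = sum(word_count.values())
    feats = defaultdict(dict)
    for (w, v), c in pair_count.items():
        feats[w][v] = math.log(c * total / (word_count[w] * word_count[v]))  # unnormalized PMI
    return {w: set(sorted(f, key=f.get, reverse=True)[:top_k]) for w, f in feats.items()}

def build_classes(features, min_overlap=5):
    """Group each word with all words whose feature sets overlap it; classes may overlap."""
    return [{w}.union(v for v, fv in features.items()
                      if len(fw.intersection(fv)) >= min_overlap)
            for w, fw in features.items()]
</preformat>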
    </sec>
    <sec id="sec-4">
      <title>QAst Evaluation Results</title>
      <p>
        Question sets for both task T1 and task T2 comprised the same 100 factoid questions; however, 2 of these questions were deemed faulty by the coordinators following submission and were removed prior to assessment, resulting in a total of 98 evaluation questions. Our system returned a maximum of 5 answer candidates per question per run. We submitted two runs each for task
T1 and task T2. For both tasks, P2(q|S) was used for the first run and P3(q|S) was used for the
second run. In addition to our group, four other teams participated. Table 2 details the official
best-run results for the entire field for task T1.
      </p>
      <p>[Table 2: official best-run results for task T1 by team (clt1, dfki1, limsi2, tokyo2, upc1).]</p>
      <p>As can be seen in Table 2, our system achieved a best run MRR of 0.20 and was able to correctly answer 34 of 98 questions on the manual data set, placing us third overall. Results for the ASR transcripts were lower, as expected, at 18 correct answers for 98 questions; however, other systems showed similar losses on the ASR data. Table 3 shows a comparison of our group's manual versus ASR results by submission. P2(q|S) was used for runs tokyo1 t1 and tokyo1 t2, while P3(q|S) was used for runs tokyo2 t1 and tokyo2 t2. As can be seen, the query expansion employed by P3(q|S) slightly improved our Top5 scores, but had no effect on Top1 accuracy. There was a performance drop of approximately 44% for results based on the top 5 answers using P3(q|S), and a drop of approximately 43% for results based on the top 1 answer for both P2(q|S) and P3(q|S). Similar drops were reflected in other participants' results, however, and we suspect that this primarily reflects ASR errors.</p>
      <p>[Table 3: manual versus ASR results by submission: Top5(P2), Top5(P3), Top1(P2), Top1(P3), and MRR.]</p>
    </sec>
    <sec id="sec-5">
      <title>Discussion and Analysis</title>
      <p>Our results from task T1 compare favorably with results from previous CLEF and TREC
evaluations, despite the size and relative lack of redundancy in the target CHIL lectures corpus.
Additional experiments on this corpus, documented in a paper currently pending publication, show that our system is able to correctly select the sentence containing the answer
over 50% of the time, indicating that there is upwards of a 20% performance loss between the
sentence retrieval and answer extraction stages.</p>
      <p>While performance across different answer types was fairly consistent, there was a conspicuous
gap for the time type, where we did not answer any of the related questions correctly. Analysis
of the data indicates that this was caused by multiple factors. There were two time questions
for which there was no appropriate answer in the document corpus. There was also a problem with automatically normalizing complex dates, which the Perl Lingua module did not handle consistently; as our system generally performs better when times and dates are represented as digits, this made it difficult to correctly extract answers such as “nineteen ninety-eight”. Finally,
there was at least one time question for which the question itself did not clearly specify the type.</p>
      <p>Finally, we observed a considerable drop in performance between task T1 and task T2, which
was similarly mirrored in all other participants’ results. We surmise that in our case this was
mainly due to answer typing issues resulting from ASR errors since answer words of the correct
answer type are crucial for good AE performance in our system. This can be explained by the way
the answer filter model (Section 2.3) works: if the answer words in ASR are of the wrong answer
type, then P (W |A) will assign a low probability to the correct answer candidate.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper we have presented our results from the CLEF 2007 QAst pilot track for task T1 and
T2, and described our system and experimental setup for the evaluation. In general our results
compare favorably with past evaluations, and place us in the middle of the field for this evaluation.
We noticed considerable performance drops between the manual transcripts and ASR transcripts,
but because these drops were consistent across submissions and participants we are led to believe
that this is mainly a result of ASR errors. In future evaluations we think it would be preferable both to supply recognition lattices that consistently match the ASR transcripts and to make the actual audio available. Given that the real aim of this track is to find answers to natural-language factoid questions in spoken documents, having access to these resources might provide greater opportunities for teams to directly exploit the source data in more interesting ways.</p>
    </sec>
    <sec id="sec-7">
      <title>Online demonstration</title>
      <p>A demonstration of the system using model ONE, supporting questions in English, Japanese, Chinese, Russian, French, Spanish and Swedish, can be found online at http://www.inferret.com/</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This research was supported in part by the Japanese government’s 21st century COE programme: “Framework for Systematization and Application of Large-scale Knowledge Resources”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mittal</surname>
          </string-name>
          .
          <article-title>Bridging the Lexical Chasm: Statistical Approaches to Answer-Finding</article-title>
          .
          <source>In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , Athens, Greece,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Brill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Banko</surname>
          </string-name>
          .
          <article-title>An Analysis of the AskMSR Question-answering System</article-title>
          .
          <source>In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Echihabi</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcu</surname>
          </string-name>
          .
          <article-title>A Noisy-Channel Approach to Question Answering</article-title>
          .
          <source>In Proceedings of the 41st Annual Meeting of the ACL</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ittycheriah</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          .
          <article-title>IBM's Statistical Question Answering System-TREC-11</article-title>
          .
          <source>In Proceedings of the TREC 2002 Conference</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Merkel</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          .
          <article-title>Comparing Improved Language Models for Sentence Retrieval in Question Answering</article-title>
          .
          <source>In Proceedings of CLIN</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Novak</surname>
          </string-name>
          , E. Whittaker,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Imai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          .
          <article-title>NTCIR-6 CLQA Question Answering Experiments at the Tokyo Institute of Technology</article-title>
          .
          <source>In Proceedings of the NTCIR-6 Conference</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pinchak</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>A Probabilistic Answer Type Model</article-title>
          .
          <source>In European Chapter of the ACL</source>
          , Trento, Italy,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Grewal</surname>
          </string-name>
          .
          <article-title>Probabilistic Question Answering on the Web</article-title>
          .
          <source>In Proc. of the 11th international conference on WWW</source>
          , Hawaii, US,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ravichandran</surname>
          </string-name>
          , E. Hovy, and
          <string-name>
            <given-names>F. Josef</given-names>
            <surname>Och</surname>
          </string-name>
          .
          <article-title>Statistical QA - Classifier vs. Re-ranker: What's the difference?</article-title>
          <source>In Proceedings of the ACL Workshop on Multilingual Summarization and Question Answering</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Brill</surname>
          </string-name>
          .
          <article-title>Automatic Question Answering: Beyond the Factoid</article-title>
          .
          <source>In Proceedings of the HLT/NAACL 2004: Main Conference</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Whittaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chatain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          . TREC2005 Question Answering Experiments at Tokyo Institute of Technology.
          <source>In Proceedings of the 14th Text Retrieval Conference</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Whittaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          .
          <article-title>A Statistical Pattern Recognition Approach to Question Answering using Web Data</article-title>
          .
          <source>In Proceedings of Cyberworlds</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Whittaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hamonic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Furui</surname>
          </string-name>
          .
          <article-title>A Unified Approach to Japanese and English Question Answering</article-title>
          .
          <source>In Proceedings of NTCIR-5</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] E. Whittaker, J. Novak, P. Chatain, P. Dixon, M. Heie, and S. Furui. CLEF2006 Question Answering Experiments at Tokyo Institute of Technology. In CLEF 2006, LNCS 4730 proceedings, 2006.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] E. Whittaker, J. Novak, P. Chatain, and S. Furui. TREC 2006 Question Answering Experiments at Tokyo Institute of Technology. In Proceedings of TREC-15, 2006.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>