<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Models for Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Piero Molino (Supervisor: Pasquale Lops)</string-name>
          <email>piero.molino@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, University of Bari, Via Orabona - I-70125 Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>28</fpage>
      <lpage>32</lpage>
      <abstract>
        <p>The research presented in this paper focuses on the adoption of semantic models in Question Answering (QA) systems. We propose a framework that exploits semantic technologies to analyze the question and to retrieve and rank relevant passages. It exploits: (a) Natural Language Processing algorithms for the analysis of questions and candidate answers in both English and Italian; (b) probabilistic Information Retrieval (IR) models for retrieving candidate answers; and (c) Machine Learning methods for question classification. The data source for the answers is a collection of unstructured text documents stored in search indices. The aim of the research is to improve system performance by introducing semantic models in every step of the answering process.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Question Answering (QA) is the task of answering users’ questions with answers
obtained from a collection of documents or from the Web.</p>
      <p>
        Traditional search engines usually retrieve long lists of full-text documents
that the user must check in order to find the needed information. QA systems
instead exploit Information Retrieval (IR) and Natural Language
Processing (NLP) [
        <xref ref-type="bibr" rid="ref9">18, 9</xref>
        ] to find the answer to a natural language question, or short passages of text
containing it. Open-domain QA systems search the Web
and exploit redundancy, textual pattern extraction and matching to solve the
problem [
        <xref ref-type="bibr" rid="ref12 ref14">14, 12</xref>
        ].
      </p>
      <p>
        QA emerged in the last decade as one of the most promising fields in Artificial
Intelligence thanks to competitions organized during international
conferences [
        <xref ref-type="bibr" rid="ref16">19, 16</xref>
        ], but the first studies on the subject date back to the 1960s [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
In recent years some enterprise applications, such as IBM’s Watson/DeepQA
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], have shown the potential of state-of-the-art technology.
      </p>
      <p>This paper describes a study on the introduction of semantic models into
a QA framework in order to improve its performance in answering users’
questions. Although Semantic Role Labelling and Word Sense Disambiguation have
already been employed in the past [18], distributional and latent models
are a new approach that has yet to be investigated for QA.</p>
      <p>A framework for building real-time QA systems with a focus on closed domains
was built for this purpose. The framework is general enough that applying it
to open domains should also be fairly easy. It exploits NLP algorithms for
both English and Italian and integrates a question categorization component
based on Machine Learning techniques and linguistic rules written by human
experts. Text document collections used as data sources are organized in indices
for generic unstructured data storage, with fast and reliable search functions
that exploit state-of-the-art IR weighting schemes.</p>
      <p>The paper is structured as follows. Section 2 provides a general overview
of the framework architecture, while Section 3 presents the different semantic
models. Section 4 provides a preliminary evaluation of the impact of adopting
semantic models. Final conclusions close the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Framework overview</title>
      <p>[Figure 1. Framework architecture. Components shown: Document Base, Indexer, Index 1 ... Index N (indexing); User question, NLP Analyzer Pipeline, Search Engine 1 ... Search Engine N, Filter Pipeline, Answer (search).]</p>
      <sec id="sec-2-2">
        <p>The architecture, shown in Figure 1, introduces some new aspects that make
it general and easier to expand, such as the adoption of different indices, parallel
search engines and different NLP and filtering pipelines, which can also run in
parallel.</p>
        <p>The first step is a linguistic analysis of the user’s question. Question analysis
is performed by a pipeline of NLP analyzers. NLP analyzers are provided both for
English and Italian and include a stemmer, a Part of Speech tagger, a lemmatizer,
a Named Entity Recognizer and a chunker.</p>
        <p>This step also includes a question classifier that uses an ensemble learning
approach exploiting both hand-written rules and rules inferred by machine learning
categorization techniques (Support Vector Machines are adopted), thus
combining the effectiveness and precision of the hand-written rules with the recall
of the machine learning classifier.</p>
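        <p>One possible way to combine the two kinds of classifier is sketched below. It is only an illustration under stated assumptions: the rules, the category names and the function <monospace>toy_learned_classifier</monospace> (a stand-in for a trained Support Vector Machine) are hypothetical and are not taken from the framework. The sketch lets a matching hand-written rule decide and falls back to the learned model otherwise.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Callable, Optional

# Hypothetical hand-written rules: (pattern, question category).
RULES = [
    (re.compile(r"^\s*who\b", re.IGNORECASE), "PERSON"),
    (re.compile(r"^\s*when\b", re.IGNORECASE), "DATE"),
    (re.compile(r"^\s*where\b", re.IGNORECASE), "LOCATION"),
    (re.compile(r"^\s*how (many|much)\b", re.IGNORECASE), "QUANTITY"),
]


def classify_by_rules(question: str) -> Optional[str]:
    """Return the category of the first matching rule, or None if no rule fires."""
    for pattern, category in RULES:
        if pattern.search(question):
            return category
    return None


def classify(question: str, learned: Callable[[str], str]) -> str:
    """Prefer a rule match (precision); otherwise use the learned model (recall)."""
    by_rule = classify_by_rules(question)
    return by_rule if by_rule is not None else learned(question)


def toy_learned_classifier(question: str) -> str:
    """Stand-in for a trained SVM; a real model would be learned from labelled questions."""
    return "DEFINITION" if "what is" in question.lower() else "OTHER"
```
]]></preformat>
        <p>In this sketch the learned classifier is passed in as a function, so a different model could be plugged in without touching the rules.</p>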
        <p>
          The question is then passed to the search engines, whose architecture is highly
parallel and distributed. Each engine has its own query
generator, because the structure and syntax of the query may differ between
engines. Two query expansion techniques are also
implemented: Kullback-Leibler Divergence [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and Divergence From Randomness [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
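        <p>As an illustration of Kullback-Leibler Divergence query expansion, the following minimal sketch ranks candidate terms by p_R(t) log(p_R(t) / p_C(t)), where p_R is the relative frequency of a term in the top-ranked (pseudo-relevant) passages and p_C its relative frequency in the whole collection. The function name, the tokenized toy data and the parameter <monospace>k</monospace> are assumptions made for the example; the framework's own implementation may differ.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def kld_expansion_terms(feedback_docs, collection_docs, query_terms, k=3):
    """Return the k candidate terms with the highest KLD contribution."""
    fb = Counter(t for doc in feedback_docs for t in doc)
    coll = Counter(t for doc in collection_docs for t in doc)
    fb_total = sum(fb.values())
    coll_total = sum(coll.values())
    scores = {}
    for term, freq in fb.items():
        # Skip original query terms and terms unseen in the collection.
        if term in query_terms or coll[term] == 0:
            continue
        p_r = freq / fb_total            # probability in the feedback passages
        p_c = coll[term] / coll_total    # probability in the whole collection
        scores[term] = p_r * math.log(p_r / p_c)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```
]]></preformat>
        <p>Terms that are frequent in the feedback passages but rare in the collection receive the highest scores and would be appended to the query.</p>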
        <p>The filter pipeline is then responsible for the scoring and filtering of the
passages retrieved by the search engines. Finally, a ranked list of passages is
presented to the user.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Semantic Models</title>
      <p>The word “semantics” describes the study of meaning. In NLP and IR it is
used to refer to “lexical semantics”, i.e. the meaning of words, and to “semantic
role”, the role of a phrase in a sentence. The aim of this research is to investigate
whether semantic models can improve the performance of QA systems and under
which conditions improvements are achieved. An in-depth cost-benefit
analysis of semantic models will also be performed. New models will also be developed
and tested.</p>
      <p>From the point of view of QA, semantic models can be applied to different
parts of the process.</p>
      <p>
        Among the NLP Analyzers, Word Sense Disambiguation techniques should
be applied to find the explicit meaning of every word taken from a semantic
lexicon like WordNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The application of Semantic Role Labelling algorithms
is also needed to extract the role of a phrase, thus helping the identification of
the most important parts of the user’s question [18]. Word Sense Induction [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
methods should also be applied to discriminate word meanings implicitly, based on how
each word is used within the collection.
      </p>
      <p>In the query expansion step, both synonyms taken from explicit
repositories and similar words obtained from the similarity of their contexts of use need
to be investigated.</p>
      <p>
        During the search step, different approaches to semantic analysis should be
applied, ranging from algebraic matrix approaches like Latent Semantic Analysis
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Non-negative Matrix Factorization [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Random Indexing [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], to explicit
representation approaches like Explicit Semantic Analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As most of these
techniques are too expensive to apply in real time to all documents, their
adoption can be shifted to the filtering step, thus applying them only to a
reduced and pre-filtered subset of the candidate answers. To the best of my
knowledge, these semantic models have never been adopted in QA, in particular
for candidate answer filtering and scoring. Different search models, such as fuzzy and
neural network based models for IR [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] will also be investigated.
      </p>
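      <p>To make the idea of a distributional model concrete, a minimal Random Indexing sketch follows. Each word receives a sparse random index vector with a few +1/-1 entries, and the context vector of a word is the sum of the index vectors of its neighbours; a question and a candidate passage can then be compared by the cosine of their summed context vectors. The dimensionality, the number of non-zero entries, the window size and all function names are assumptions chosen for illustration, not the settings used in the framework.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random
from collections import defaultdict

DIM = 200      # reduced dimensionality (assumed value)
NONZERO = 10   # number of +1/-1 entries per index vector (assumed value)


def index_vector(rng):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0.0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec


def build_word_space(sentences, window=2, seed=0):
    """Context vector of a word = sum of the index vectors of its neighbours."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(rng)
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    neighbour = index[tokens[j]]
                    ctx = context[tok]
                    for d in range(DIM):
                        ctx[d] += neighbour[d]
    return dict(context)


def text_vector(tokens, space):
    """Represent a question or passage as the sum of its words' context vectors."""
    vec = [0.0] * DIM
    for tok in tokens:
        for d, v in enumerate(space.get(tok, ())):
            vec[d] += v
    return vec


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```
]]></preformat>
      <p>A semantic filter built along these lines would score each candidate passage with <monospace>cosine(text_vector(question, space), text_vector(passage, space))</monospace>, which only needs to be computed for the passages that survive retrieval.</p>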
      <p>
        In the filtering step, semantic distance and semantic correlation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
measures can be applied to score the candidate answers, according to the adopted
representation of meaning.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Preliminary Evaluation</title>
      <p>
        A preliminary evaluation has been conducted on the ResPubliQA 2010 dataset
adopted in the 2010 CLEF QA Competition [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This dataset contains about
10700 documents about European Union Legislation and European Parliament
transcriptions, aligned in several languages including English and Italian, with
200 questions.
      </p>
      <p>The adopted metric is the c@1 proposed for the competition:</p>
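      <p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math><![CDATA[c@1 = \frac{1}{N}\left(n_c + n_n \frac{n_c}{N}\right)]]></tex-math>
        </disp-formula>
where N is the number of the questions, n<sub>c</sub> is the number of the system’s
correct answers and n<sub>n</sub> is the number of unanswered questions.</p>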
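      <p>The measure can be computed with a few lines of code. The sketch below assumes the usual form of c@1, (nc + nn · nc / N) / N, in which every unanswered question is credited with the accuracy observed on the whole question set; the function and parameter names are chosen here for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def c_at_1(n_correct, n_unanswered, n_questions):
    """c@1: accuracy that rewards leaving a question unanswered over answering wrongly."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    if n_correct + n_unanswered > n_questions:
        raise ValueError("correct plus unanswered cannot exceed the number of questions")
    # Each unanswered question counts as n_correct / n_questions of a correct answer.
    return (n_correct + n_unanswered * n_correct / n_questions) / n_questions
```
]]></preformat>
      <p>When no question is left unanswered, the measure reduces to plain accuracy.</p>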
      <p>
        Several combinations of different parameters have been tested, but
only the best one is shown in Table 1. It employs two BM25-based searchers [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
as search component, one using a keyword index and the other using a lemma index,
and the filter pipeline is built with keyword matching, lemma matching, exact
overlap, density and n-gram filters. The framework is named qc1. The adoption
of a Random Indexing based semantic filter improves the system performance by
0.045 for English and 0.04 for Italian, as shown in Table 1.
      </p>
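      <p>The role of the k1 and b parameters can be seen in a minimal BM25 scorer such as the one sketched below. It is a textbook-style illustration, not the search component of the framework: the function name and toy documents are assumptions, the defaults are the English values reported in Table 1, and the inverse document frequency uses a smoothed, non-negative variant chosen here for convenience.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.6, b=0.8):
    """Score every tokenized document in docs against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # b controls length normalization; k1 controls term-frequency saturation.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```
]]></preformat>
      <p>The same function could be applied to keyword tokens or to lemmas, mirroring the two indices used by the searchers.</p>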
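      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Preliminary evaluation</p>
        </caption>
        <table>
          <thead>
            <tr><th>System</th><th>Search</th><th>k1</th><th>b</th><th>Semantic filter</th><th>c@1</th></tr>
          </thead>
          <tbody>
            <tr><td>qc1 english</td><td>BM25</td><td>1.6</td><td>0.8</td><td>no</td><td>0.705</td></tr>
            <tr><td>qc1+sf english</td><td>BM25</td><td>1.6</td><td>0.8</td><td>yes</td><td>0.75</td></tr>
            <tr><td>best CLEF2010 en</td><td /><td /><td /><td /><td>0.73</td></tr>
            <tr><td>qc1 italian</td><td>BM25</td><td>1.8</td><td>0.75</td><td>no</td><td>0.635</td></tr>
            <tr><td>qc1+sf italian</td><td>BM25</td><td>1.8</td><td>0.75</td><td>yes</td><td>0.675</td></tr>
            <tr><td>best CLEF2010 it</td><td /><td /><td /><td /><td>0.63</td></tr>
          </tbody>
        </table>
      </table-wrap>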
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper, a research proposal on the adoption of semantic models for QA
has been presented. A short overview of the adopted QA framework, along
with a description of the different semantic models to be adopted in the research,
has been provided. Finally, a preliminary evaluation on a standard dataset,
CLEF 2010 ResPubliQA, has been given; it shows an improvement over
the best results reported at CLEF 2010 and suggests that the
use of semantic models is promising for the field of QA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amati</surname>
          </string-name>
          , G.,
          <string-name>
            <surname>Van Rijsbergen</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>20</volume>
          ,
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          (
          <year>October 2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro-Neto</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Modern information retrieval</article-title>
          .
          <source>Paperback (May</source>
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>B.F., Jr.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chomsky</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laughery</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Baseball: an automatic question-answerer</article-title>
          .
          <source>In: Papers presented at the May 9-11, 1961, Western Joint IRE-AIEE-ACM Computer Conference</source>
          . pp.
          <fpage>219</fpage>
          -
          <lpage>224</lpage>
          .
          <article-title>IRE-AIEE-ACM '61 (Western)</article-title>
          , ACM, New York, NY, USA (
          <year>1961</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Carpineto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de</surname>
            <given-names>Mori</given-names>
          </string-name>
          , R.,
          <string-name>
            <surname>Romano</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bigi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>An information-theoretic approach to automatic query expansion</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>19</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          (
          <year>January 2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Deerwester</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furnas</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harshman</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society of Information Science</source>
          <volume>41</volume>
          (
          <issue>6</issue>
          ),
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>WordNet: an electronic lexical database. Language, speech, and communication</article-title>
          , MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ferrucci</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>IBM's Watson/DeepQA</article-title>
          . SIGARCH
          <source>Computer Architecture News</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          ) (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gabrilovich</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markovitch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In: Proceedings of the 20th International Joint Conference on Artificial Intelligence</source>
          . pp.
          <fpage>1606</fpage>
          -
          <lpage>1611</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermjakob</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Junk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          :
          <article-title>Question answering in webclopedia</article-title>
          .
          <source>In: TREC</source>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kanerva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Sparse distributed memory</article-title>
          .
          <source>Bradford Books</source>
          , MIT Press (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seung</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          :
          <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>
          .
          <source>Nature</source>
          <volume>401</volume>
          (
          <issue>6755</issue>
          ),
          <fpage>788</fpage>
          -
          <lpage>791</lpage>
          (
          <year>Oct 1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>An exploration of the principles underlying redundancy-based factoid question answering</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>25</volume>
          (
          <year>April 2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Navigli</surname>
          </string-name>
          , R.:
          <article-title>Word sense disambiguation: A survey</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>41</volume>
          (
          <issue>2</issue>
          ),
          <volume>10</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          :
          <fpage>69</fpage>
          (Feb
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Paşca, M.:
          <article-title>Open-domain question answering from large text collections. Studies in computational linguistics</article-title>
          ,
          <source>CSLI Publications</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patwardhan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michelizzi</surname>
          </string-name>
          , J.:
          <article-title>WordNet::Similarity: measuring the relatedness of concepts</article-title>
          .
          <source>In: Demonstration Papers at HLT-NAACL 2004</source>
          . pp.
          <fpage>38</fpage>
          -
          <lpage>41</lpage>
          . HLT-NAACL-Demonstrations '04, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutcliffe</surname>
            ,
            <given-names>R.F.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forascu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mota</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of ResPubliQA 2010: Question Answering Evaluation over European Legislation</article-title>
          . In: Braschler,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Harman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Pianta</surname>
          </string-name>
          , E. (eds.)
          <source>Working notes of ResPubliQA 2010 Lab at CLEF 2010</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
          </string-name>
          , H.:
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          .
          <source>Found. Trends Inf. Retr</source>
          .
          <volume>3</volume>
          ,
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          (
          <year>April 2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>