<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Question Answering System for Frequently Asked Questions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Divyanshu Bhardwaj, Partha Pakray, Jereemi Bentham, Saurav Saha</string-name>
          <email>contact.srvsaha@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Gelbukh</string-name>
          <email>gelbukh@gelbukh.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIC, IPN</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of CSE, NIT Mizoram</institution>
          ,
          <addr-line>India</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. Question Answering (QA) is an important aspect of Natural Language Processing. It comprises building a system that automatically answers questions posed in natural language. Frequently Asked Questions (FAQs) are a set of listed questions and answers concerning a specific topic, which are those most likely to be asked by a user. This paper deals with developing an open-domain QA system for retrieving a list of relevant FAQs related to the query issued by the user. Our approach combines orthodox AND/OR searching with a combinatorics-based searching technique that is able to produce an exhaustive list of results for a particular generated query.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract (Italiano)</title>
      <p>Italiano. Question Answering (QA) è un
aspetto importante di Natural Language
Processing. Si compone di costruire
un sistema che risponde automaticamente
alle domande poste in linguaggio
naturale. Le domande frequenti (FAQ) sono un
insieme di domande elencate e risposte
riguardanti un argomento specifico, che
hanno più probabilità di essere indagate da un
utente. Questo documento si occupa dello
sviluppo di un sistema di QA a dominio
aperto per il recupero di un elenco di
domande frequenti pertinenti relative alla
query emessa da parte dell'utente. Il
nostro approccio combina la ricerca
ortodossa AND/OR con la tecnica di ricerca
combinatoria, che è in grado di produrre un elenco
esaustivo dei risultati per una determinata
query generata.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Question Answering (QA) is an emerging topic in
today’s world. It is an aggregate of Information
Retrieval (IR) and Natural Language Processing
(NLP) and is concerned with developing an
automated engine which is able to respond to the
queries presented by users in natural language.
Frequently Asked Questions (FAQs) represent
an effective and efficient way to quickly resolve
queries posed by users. They are usually
represented as an ensembled list of questions and their
answers.</p>
      <p>Searching within FAQs can be a tedious
task. It becomes even more drawn out when
paraphrasing comes into the fray. As a result, the
user is pushed into a maze of questions and
answers, having to manually look for a particular
one, as shown in figure 1. It is here that a QA
system is of utmost importance, retrieving the
particular desired answer instantly.
The rest of this paper is organised as
follows: Section 2 describes the corpus and its
preprocessing, Section 3 describes our system's
architecture and the tools used, Section 4 describes
the experiments, Section 5 describes the
performance of the system, Section 6 analyses the results
and Section 7 describes the conclusion and future
work.</p>
    </sec>
    <sec id="sec-3">
      <title>Corpus and Preprocessing</title>
      <p>The corpus obtained from the QA4FAQ task
website (http://qa4faq.github.io) provided us with the FAQs in .csv (comma
separated values) format, using ; as the separator, and in
XML format. The CSV file was in UTF-8 encoding
and contained 4 fields, viz.:
1. id: a number that uniquely identifies the FAQ;
2. question: the question text of the current FAQ;
3. answer: the answer text of the current FAQ;
4. tag: a set of tags separated by commas.</p>
      <p>An example of the data provided is given below:
193;Cosa significa AEEGSI?; l'Autorità
per l'Energia Elettrica il Gas ed il Sistema
Idrico.;acqua, acquedotto, distribuzione, AEEGSI</p>
    </sec>
    <sec id="sec-4">
      <title>System Architecture</title>
      <p>The architecture of our system is shown in figure
2.</p>
      <sec id="sec-4-1">
        <title>Parsing</title>
        <p>For the purpose of pre-processing the training
data, we developed a CSV parser which could
extract the ID and the remaining fields. The
development dataset had 406 files with id, question,
answer and tag(s). We extracted the question, answer
and tags and saved them in a file named
ID.txt.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Stopword Removal</title>
        <p>In order to increase the efficiency of our input
data, we decided to perform stopword removal.
Stop words are words which occur in 80% of the
documents in the collection. However, while
searching for lists of Italian stopwords
(http://www.ranks.nl/stopwords/italian,
http://members.unine.ch/jacques.savoy/clef/italianST.txt,
https://github.com/themnd/stopword-it/blob/master/stopwords.txt),
we realised that the existing ones had only 133 to 399
stopwords each. So, we merged them and
developed our own exhaustive Italian stopword corpus
from the existing ones. This corpus, openly shared on GitHub for further
use (https://github.com/SRvSaha/QA4FAQ-EVALITA16/blob/master/italian stopwords.txt),
had approximately 546 unique stopwords in total. This
operation helped us get rid of the unwanted words
which would hinder the system's performance.</p>
        <p>The architecture may be divided into two
distinct parts, as shown in the figure. One part
contains the architecture of Nutch (https://nutch.apache.org) enclosed in the
rectangle; it contains all the basic components
essential to the implementation of a search
engine. The other part represents the aggregation
of the searching techniques adopted while
searching the FAQs. This includes a module that
processes the queries obtained for both AND/OR
searching as well as combinatorics-based
searching. The two major steps involved in developing
the architecture were Crawling &amp; Indexing and
Searching (described in Section 4).</p>
        <p>The steps involved in crawling and indexing are
described below:
1. Run a generic Java code taking the ids (taken
from ID.txt) as the input to generate URL seeds.
2. The Injector injects the list of seed URLs into the
crawlDB.
3. The Generator takes the list of seed URLs
from the crawlDB, forms a fetch list and adds a
crawl generate folder into the segments.
4. These fetch lists are used by fetchers to fetch
the raw content of the documents, which is then
stored in segments.
5. The Parser is called to parse the content of the
documents, and the parsed content is stored back
in the segments.
6. The links are inverted in the link graph and
stored in the LinkDB.
7. The terms present in the segments are indexed
and the indices are updated in the segments.
8. Information on the newly fetched documents
is updated in the crawlDB.</p>
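        <p>As an illustrative sketch (not our exact implementation), the CSV parsing and stopword-removal steps described above can be expressed in Python; the stopword set below is a small placeholder for the merged ~546-word corpus:</p>
        <preformat>
```python
import csv
import io

# Placeholder stopword set; the merged corpus has ~546 entries.
ITALIAN_STOPWORDS = frozenset(["cosa", "significa", "il", "ed", "per"])

def parse_faq(raw):
    # Each record: id;question;answer;tags (tags are comma-separated).
    reader = csv.reader(io.StringIO(raw), delimiter=";")
    faqs = []
    for row in reader:
        faq_id, question, answer, tags = row
        faqs.append({
            "id": faq_id,
            "question": question,
            "answer": answer,
            "tags": [t.strip() for t in tags.split(",")],
        })
    return faqs

def remove_stopwords(text):
    # Drop stopwords so they do not hinder indexing and searching.
    return [w for w in text.lower().split() if w not in ITALIAN_STOPWORDS]
```
        </preformat>
        <p>Each parsed record can then be written out as ID.txt and fed to the crawler as a seed, as in the steps above.</p>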
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>The corpus obtained after pre-processing was
experimented upon by means of various
methodologies. A total of 1132 FAQs were available
in the test data set. A prototype system was
created by feeding the input data into Nutch. We
performed two separate runs so as to carry out a
comparative study between unprocessed and
pre-processed data.</p>
      <p>We used Nutch’s own configuration for the
Indexing, Searching and Ranking of the data for one of
the runs and implemented our own configuration
for the other run. The ranking provided by Nutch
may be explained using Lucene's classic practical
scoring function:
score(q, d) = coord(q, d) * queryNorm(q) *
sum over terms t in q of
[ tf(t in d) * idf(t)^2 * t.boost() * norm(t, d) ]
where:
1. queryNorm() : indicates the normalization
factor for the query;
2. coord() : indicates how many of the query terms
are present in the given document;
3. norm() : a score indicating the field-based
normalization factor;
4. tf : term frequency;
5. idf : inverse document frequency;
6. t.boost() : a score indicating the importance of
a term's occurrence in a particular field.</p>
      <sec id="sec-5-2">
        <title>Combinatorics-Based Searching and Ranking</title>
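        <p>To make the interaction of these factors concrete, here is a simplified Python sketch of Lucene's classic practical scoring function (illustrative only, not Nutch's actual code; queryNorm and boosts are taken as 1):</p>
        <preformat>
```python
import math

def lucene_score(query_terms, doc_terms, corpus):
    # Simplified sketch of Lucene's practical scoring function:
    # score(q, d) = coord(q, d) * queryNorm(q)
    #               * sum_t( tf(t, d) * idf(t)^2 * boost(t) * norm(t, d) )
    # queryNorm and boost are taken as 1 here for clarity.
    matching = [t for t in set(query_terms) if t in doc_terms]
    coord = len(matching) / len(query_terms)
    score = 0.0
    for t in matching:
        tf = math.sqrt(doc_terms.count(t))        # term frequency
        df = sum(1 for d in corpus if t in d)     # document frequency
        idf = 1.0 + math.log(len(corpus) / (df + 1.0))
        norm = 1.0 / math.sqrt(len(doc_terms))    # field-length norm
        score += tf * idf * idf * norm
    return coord * score
```
        </preformat>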
        <p>Apart from this, we developed our own
configuration, which was a combination of both
the traditional AND/OR search and the
combinatorics approach. To implement the
combinatorics approach, we split the query on the
space separator and generated all possible combinations
of the words in the query. This is the
methodology adopted in subset generation from a
given set. So, given n words in a query
after removing stopwords, we would have 2^n - 1
possible combinations of the query. These were
then used for searching by Nutch, and ranking
was done based on the ranking algorithm we
developed. The benefit of this approach was that
the search was exhaustive, so the maximum number
of relevant results could be retrieved,
given a proper ranking algorithm.</p>
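        <p>The subset-generation step described above can be sketched in a few lines of Python (an illustration, with stopword handling shown as an optional placeholder set):</p>
        <preformat>
```python
from itertools import combinations

def query_combinations(query, stopwords=frozenset()):
    # Remove stopwords, then generate every non-empty subset of the
    # remaining words: 2**n - 1 sub-queries, longest first.
    words = [w for w in query.split() if w not in stopwords]
    subqueries = []
    for size in range(len(words), 0, -1):
        for combo in combinations(words, size):
            subqueries.append(" ".join(combo))
    return subqueries
```
        </preformat>
        <p>Note that, within each subset size, the enumeration order may differ slightly from the listing shown below.</p>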
        <p>This approach could be explained using the
following example:
Consider the following query:
numero verde aqp
For this query, all the possible combinations
would be created in the following order :
numero verde aqp
numero verde
verde aqp
numero aqp
numero
verde
aqp
From this example we can clearly visualize
how this approach would be extremely efficient in
retrieving the most relevant answers for queries
provided by the user. After applying this
approach, we were left with 29 unanswered queries.
We also implemented our own ranking system
which ranked the retrieved pages in the following
way :
Consider a query of 4 words. We used a 4-point
scale to rank the pages, with the highest score
being assigned to the page with 4*(number of
matches). Thus, for a query of length n, the
highest score would be assigned as n*(number of
matches). Assuming we have a query of n words,
all possible combinations, i.e., the 2^n - 1 possible
queries, were ranked according to the
above-mentioned algorithm.</p>
        <p>Consider the following query:
numero verde
and let the text be: il numero verde non è verde, un
numero che può essere dipinta di verde.</p>
        <p>Ranking of queries would be done as :
1. numero verde : 2*1 = 2
2. numero : 1*2 = 2
3. verde : 1*3 = 3
Since we get the highest score for the query
verde, the most relevant document will be
fetched by verde. Our system retrieved results
based on this methodology.</p>
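        <p>The ranking rule above can be sketched in Python as follows (a simplified illustration in which matches are counted as raw substring occurrences):</p>
        <preformat>
```python
def rank_subqueries(subqueries, text):
    # Score each sub-query as (number of words) * (number of matches),
    # then sort so the highest-scoring sub-query is used first.
    scored = []
    for sub in subqueries:
        score = len(sub.split()) * text.count(sub)
        scored.append((sub, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```
        </preformat>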
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Performance</title>
      <p>The relevant statistics of both the runs based on the
experiments performed are outlined in Table 1.</p>
      <p>Table 1: Statistics of the two runs.
Run | Total no. of queries | No. of queries answered | No. of queries unanswered
Run 1 | 1132 | 684 | 448
Run 2 | 1132 | 1103 | 29</p>
      <p>As can be inferred from Table 1, while during
Run 1 there were a large number of unanswered
queries, they were significantly reduced in Run 2.
This was possible due to the combinatorics
approach used in Run 2. The performance of our
system in both the runs is depicted in Table 2.</p>
      <p>Table 2: Scores obtained by the two runs.
Run | Score obtained
Run 1 | 0.2125
Run 2 | 0.0168</p>
      <p>Systems were ranked according to accuracy@1
(c@1). In this method of ranking, the precision of
the system was computed taking into account only the
first answer generated by the system. The
formulation of c@1 is given below:
c@1 = (nR + nU * (nR / n)) / n
where:
1. nR : number of questions correctly answered
2. nU : number of questions unanswered
3. n : total number of questions</p>
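      <p>As a hedged sketch, the c@1 measure reduces to a one-line computation (the figures used to exercise it below are hypothetical, not our runs' results):</p>
      <preformat>
```python
def c_at_1(n_correct, n_unanswered, n_total):
    # c@1 credits unanswered questions in proportion to the
    # system's accuracy on the questions it did answer:
    # c@1 = (nR + nU * (nR / n)) / n
    return (n_correct + n_unanswered * (n_correct / n_total)) / n_total
```
      </preformat>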
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>As the evaluation was done according to
accuracy@1 which considered only the first answer
retrieved by the systems, the results obtained
were not extremely accurate. We did, however,
manage to implement a search engine which was
97.33% accurate in retrieving results for queries,
leaving only a trivial number of unanswered
queries. This system conveyed a lot of information
and made us realise that combinatorics can be an
extremely powerful tool for searching if
implemented in a proper way. However, the relevance
of the results obtained depends on how efficiently
the ranking is done.</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusion and Future Direction</title>
      <p>In this paper, we intended to frame an automated
Question Answering (QA) system for Frequently
Asked Questions (FAQs). We described the
preprocessing of the corpus and the experiments
performed on them. We also described the
combinatorics approach used for searching. While the
evaluation results were only decent, we did
manage to materialise a remarkably accurate search
engine for FAQs. Now that we have an adept
search engine we would next endeavour towards
perfecting our ranking techniques and algorithms
in order to take steps towards implementing a
state-of-the-art QA system.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work presented here falls under the research
project Grant No. YSS/2015/000988 and
supported by the Department of Science &amp;
Technology (DST) and Science and Engineering Research
Board (SERB), Govt. of India. The authors would
like to acknowledge the Department of Computer
Science &amp; Engineering, National Institute of
Technology, Mizoram for providing infrastructural
facilities in order to facilitate research on this task.</p>
      <p>Valentin Jijkoun and Maarten de Rijke. 2005.
Retrieving answers from frequently asked questions pages
on the web. Proceedings of the 14th ACM
International Conference on Information and Knowledge
Management (CIKM 2005).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Annalina</given-names>
            <surname>Caputo</surname>
          </string-name>
          , Marco de Gemmis, Pasquale Lops,
          <source>Franco Lovecchio and Vito Manzari</source>
          <year>2016</year>
          .
          <article-title>Overview of the EVALITA 2016 Question Answering for Frequently Asked Questions (QA4FAQ) Task</article-title>
          ,
          <source>Proceedings of Third Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2016</year>
          ) &amp;
          <article-title>Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2016</year>
          ).
          <article-title>Associazione Italiana di Linguistica Computazionale (AILC)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>Deepak Ravichandran and Eduard Hovy</source>
          <year>2002</year>
          .
          <article-title>Learning Surface Text Patterns for a Question Answering System, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>Lynette Hirschman and Robert Gaizauskas</source>
          <year>2001</year>
          .
          <article-title>Natural language question answering: the view from here</article-title>
          ,
          <source>Natural Language Engineering</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>Marius Pasca and Sanda Harabagiu</source>
          <year>2001</year>
          . High Performance Question/Answering, ACM SIGIR-2001
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Narendra</surname>
            <given-names>K Gupta</given-names>
          </string-name>
          , Mazin G Rahim,
          Giuseppe Riccardi,
          <year>2007</year>
          .
          <article-title>System for handling frequently asked questions in a natural language dialog service</article-title>
          , Google Patents
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Partha Pakray</source>
          <year>2014</year>
          .
          <article-title>Yamraj: Binary-class and Multiclass based Textual Entailment System for Japanese (JA) and Chinese Simplified (CS)</article-title>
          ,
          <source>Proceedings of the 11th NTCIR Conference</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>Partha Pakray and Petr Sojka</source>
          <year>2014</year>
          .
          <article-title>An Architecture for Scientific Document Retrieval Using Textual</article-title>
          and Math Entailment Modules,
          <source>Recent Advances in Slavonic Natural Language Processing</source>
          , Karlova Studnka, Czech Republic
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Partha</given-names>
            <surname>Pakray</surname>
          </string-name>
          , Pinaki Bhaskar, Somnath Banerjee, Bidhan Chandra Pal,
          <source>Sivaji Bandyopadhyay and Alexander Gelbukh</source>
          <year>2011</year>
          .
          <article-title>A Hybrid Question Answering System based on Information Retrieval and Answer Validation</article-title>
          ,
          <source>CLEF 2011 Workshop on Question Answering For Machine Reading Evaluation (QA4MRE)</source>
          ,
          <source>CLEF 2011 Labs and Workshop</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Pinaki</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <surname>Amitava Das</surname>
          </string-name>
          ,
          <source>Partha Pakray and Sivaji Bandyopadhyay</source>
          <year>2010</year>
          .
          <article-title>Theme Based English and Bengali Ad-hoc Monolingual Information Retrieval in FIRE 2010</article-title>
          ,
          <article-title>FIRE 2010</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Partha</given-names>
            <surname>Pakray</surname>
          </string-name>
          , Pinaki Bhaskar, Santanu Pal,
          <string-name>
            <surname>Dipankar Das</surname>
          </string-name>
          ,
          <source>Sivaji Bandyopadhyay and Alexander Gelbukh</source>
          <year>2010</year>
          .
          <article-title>JU CSE TE: System Description QA@CLEF 2010 - ResPubliQA</article-title>
          ,
          <source>CLEF 2010 Workshop on Multiple Language Question Answering (MLQA</source>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Pinaki</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          , Partha Pakray, Somnath Banerjee, Samadrita Banerjee, Sivaji Bandyopadhyay and
          <string-name>
            <surname>Alexander F Gelbukh</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Question Answering System for QA4MRE@ CLEF 2012, CLEF</article-title>
          (Online Working Notes/Labs/Workshop)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Pinaki</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          , Partha Pakray, Somnath Banerjee, Samadrita Banerjee,
          <source>Sivaji Bandyopadhyay and Alexander Gelbukh</source>
          <year>2012</year>
          .
          <article-title>Question Answering System for QA4MRE@CLEF 2012</article-title>
          ,
          <article-title>Workshop on Question Answering For Machine Reading Evaluation (QA4MRE), CLEF 2012 Labs</article-title>
          and Workshop
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Pinaki</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          , Somnath Banerjee, Partha Pakray, Samadrita Banerjee, Sivaji Bandyopadhyay and
          <string-name>
            <surname>Alexander F Gelbukh</surname>
          </string-name>
          <year>2013</year>
          .
          <article-title>A hybrid question answering system for Multiple Choice Question (MCQ), CEUR-WS</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Robin D Burke</surname>
          </string-name>
          , Kristian J Hammond, Vladimir Kulyukin, Steven L Lytinen,
          <string-name>
            <surname>Noriko Tomuro</surname>
            , and
            <given-names>Scott Schoenberg</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Question answering from frequently asked question files: Experiences with the faq finder system</article-title>
          ,
          <source>AI magazine</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Somnath</given-names>
            <surname>Banerjee</surname>
          </string-name>
          , Pinaki Bhaskar, Partha Pakray, Sivaji Bandyopadhyay and
          <string-name>
            <surname>Alexander F Gelbukh</surname>
          </string-name>
          <year>2013</year>
          .
          <article-title>Multiple Choice Question (MCQ) Answering System for Entrance Examination</article-title>
          ,
          <source>CLEF (Working Notes)</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>