Question Answering System for Frequently Asked Questions

Divyanshu Bhardwaj, Partha Pakray, Jereemi Bentham, Saurav Saha
Dept. of CSE, NIT Mizoram, India
(divbhardwaj42, parthapakray, jereemibentham, contact.srvsaha)@gmail.com

Alexander Gelbukh
CIC, IPN, Mexico
gelbukh@gelbukh.com

Abstract

English. Question Answering (QA) is an important aspect of Natural Language Processing. It comprises building a system that automatically answers questions posed in natural language. Frequently Asked Questions (FAQs) are a set of listed questions and answers concerning a specific topic, which are the ones most likely to be asked by a user. This paper deals with developing an open-domain QA system for retrieving a list of relevant FAQs related to the query issued by the user. Our approach combines orthodox AND/OR searching with a combinatorics-based searching technique that is able to produce an exhaustive list of results for a particular query.

Italiano. Question Answering (QA) is an important aspect of Natural Language Processing. It consists of building a system that automatically answers questions posed in natural language. Frequently Asked Questions (FAQs) are a set of listed questions and answers concerning a specific topic, which are the ones most likely to be asked by a user. This paper deals with the development of an open-domain QA system for retrieving a list of relevant FAQs related to the query issued by the user. Our approach combines orthodox AND/OR searching with a combinatorics-based search technique that is able to produce an exhaustive list of results for a given query.

1 Introduction

Question Answering (QA) is an emerging topic in today's world. It is an aggregate of Information Retrieval (IR) and Natural Language Processing (NLP) and is concerned with developing an automated engine that can respond to queries presented by users in natural language.

Frequently Asked Questions (FAQs) represent an effective and efficient way to quickly resolve queries posed by users. They are usually presented as an assembled list of questions and their answers.

Searching within FAQs can be a tedious task, and it becomes even more drawn out when paraphrasing comes into play. As a result, the user is pushed into a maze of questions and answers, having to look for a particular one manually, as shown in Figure 1. It is here that a QA system is of utmost importance, as it retrieves the answer to the desired query instantly.

Figure 1: FAQs of Microsoft Download Center

The rest of this paper is organised as follows: Section 2 describes the corpus and its preprocessing, Section 3 describes our system's architecture and the tools used, Section 4 describes the experiments, Section 5 reports the performance of the system, Section 6 analyses the results, and Section 7 presents the conclusion and future work.

2 Corpus and Preprocessing

The corpus obtained from the QA4FAQ task website (http://qa4faq.github.io) provided the FAQs in .csv (comma-separated values) format, using ";" as the separator, and in XML format. The CSV file was in UTF-8 format and contained 4 fields, viz.:

1. id: a number that uniquely identifies the FAQ;
2. question: the question text of the current FAQ;
3. answer: the answer text of the current FAQ;
4. tag: a set of tags separated by ",".

An example of the data provided is given below:

193;Cosa significa AEEGSI?;È l'Autorità per l'Energia Elettrica il Gas ed il Sistema Idrico.;acqua, acquedotto, distribuzione, AEEGSI

2.1 Parsing

For the purpose of preprocessing the training data, we developed a CSV parser which could extract the ID and the remaining fields. The development dataset had 406 entries, each with an id, question, answer and tag(s). For each entry we extracted the question, answer and tags and saved them in a file named ID.txt.
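To make this step concrete, the following is a minimal sketch of such a parser, assuming the record layout described above; the file, folder and class names are illustrative rather than taken from the original system, and the naive split would need refinement if a question or answer itself contained ";".

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Minimal sketch of a CSV parser for the QA4FAQ corpus (assumed layout:
// id;question;answer;tag1, tag2, ...). Paths and names are illustrative.
public class FaqCsvParser {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("faq.csv");   // hypothetical input file
        Path outDir = Paths.get("parsed");   // one <ID>.txt file per FAQ
        Files.createDirectories(outDir);

        for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
            // Split into at most 4 fields: id, question, answer, tags.
            String[] fields = line.split(";", 4);
            if (fields.length < 4) continue; // skip malformed records

            String id = fields[0].trim();
            String question = fields[1].trim();
            String answer = fields[2].trim();
            String tags = fields[3].trim();  // comma-separated tags

            // Save question, answer and tags into a file named <ID>.txt.
            Files.write(outDir.resolve(id + ".txt"),
                    (question + "\n" + answer + "\n" + tags + "\n")
                            .getBytes(StandardCharsets.UTF_8));
        }
    }
}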
2.2 Stopword Removal

In order to improve the quality of our input data, we decided to perform stopword removal. Words that occur in 80% of the documents in a collection are considered stopwords. However, while searching for a list of Italian stopwords, we realised that the existing ones (http://www.ranks.nl/stopwords/italian, http://members.unine.ch/jacques.savoy/clef/italianST.txt, https://github.com/themnd/stopword-it/blob/master/stopwords.txt) had only 133 to 399 stopwords each. So, we merged them and developed our own exhaustive Italian stopword corpus from the existing ones. This corpus, openly shared on GitHub for further use (https://github.com/SRvSaha/QA4FAQ-EVALITA-16/blob/master/italian_stopwords.txt), has approximately 546 unique stopwords in total. This operation helped us get rid of unwanted words which would otherwise hinder the system's performance.
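A set-based filter suffices for this step. Below is a minimal sketch of stopword removal using the merged list, assumed to be a plain-text file with one stopword per line; file names are illustrative.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Collectors;

// Minimal sketch of stopword filtering with the merged Italian stopword
// list (one word per line). File names are illustrative.
public class StopwordFilter {
    public static void main(String[] args) throws IOException {
        Set<String> stopwords = new HashSet<>(Files.readAllLines(
                Paths.get("italian_stopwords.txt"), StandardCharsets.UTF_8));

        String text = new String(
                Files.readAllBytes(Paths.get("parsed/193.txt")),
                StandardCharsets.UTF_8);

        // Lowercase, tokenize on non-letter characters, drop stopwords.
        String filtered = Arrays.stream(text.toLowerCase().split("[^\\p{L}]+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !stopwords.contains(token))
                .collect(Collectors.joining(" "));

        System.out.println(filtered);
    }
}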
3 System Architecture

The architecture of our system is shown in Figure 2.

Figure 2: Architecture of the implemented system

The architecture may be divided into two distinct parts, as shown in the figure. One part contains the architecture of Nutch (https://nutch.apache.org), enclosed in the rectangle; it contains all the basic components essential to the implementation of a search engine. The other part represents the aggregation of the searching techniques adopted while searching the FAQs. This includes a module that processes the incoming queries for both AND/OR searching and combinatorics-based searching. The two major steps involved in developing the architecture were Crawling & Indexing and Searching (described in Section 4).

The steps involved in crawling and indexing are described below:

1. Run a generic Java program taking the ids (taken from ID.txt) as input to generate URL seeds.
2. The Injector injects the list of seed URLs into the crawlDB.
3. The Generator takes the list of seed URLs from the crawlDB, forms the fetch list and adds a crawl_generate folder to the segments.
4. These fetch lists are used by the fetchers to fetch the raw content of the documents, which is then stored in the segments.
5. The Parser is called to parse the content of the documents, and the parsed content is stored back in the segments.
6. The links are inverted in the link graph and stored in the LinkDB.
7. The terms present in the segments are indexed and the indices are updated in the segments.
8. Information on the newly fetched documents is updated in the crawlDB.

4 Experiments

We experimented on the corpus obtained after preprocessing by means of various methodologies. A total of 1132 queries were available in the test data set. A prototype system was created by feeding the input data into Nutch. We performed two separate runs so as to carry out a comparative study between unprocessed and preprocessed data.

For one of the runs we used Nutch's own configuration for indexing, searching and ranking; for the other run we implemented our own configuration. The ranking provided by Nutch may be explained using Lucene's scoring function, shown in Figure 3:

score(q, d) = queryNorm(q) · coord(q, d) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.boost() · norm(t, d) )

Figure 3: Nutch's ranking equation

Here,

1. queryNorm(): the normalization factor for the query;
2. coord(): how many of the query terms are present in the given document;
3. norm(): a field-based normalization factor;
4. tf: term frequency;
5. idf: inverse document frequency;
6. t.boost(): the importance of the term's occurrence in a particular field.

Apart from this, we developed our own configuration, which combined the traditional AND/OR search with a combinatorics approach. To implement the combinatorics approach, we split the query on the space separator and generated all possible combinations of the query words, the methodology adopted in subset generation from a given set. So, given n words in a query after removing stopwords, we have 2^n − 1 possible combinations of the query. These were then used for searching by Nutch, and ranking was done with the ranking algorithm we developed. The benefit of this approach is that the search is exhaustive, so with a proper ranking algorithm the maximum number of relevant results can be retrieved.

The approach can be explained using the following example. Consider the query:

numero verde aqp

For this query, all the possible combinations would be created in the following order:

numero verde aqp
numero verde
verde aqp
numero aqp
numero
verde
aqp

From this example we can clearly visualize how this approach helps retrieve the most relevant answers for the queries provided by the user. After applying this approach, we were left with 29 unanswered queries.

We also implemented our own ranking system, which ranked the retrieved pages in the following way. Consider a query of 4 words: we used a 4-point scale to rank the pages, the highest score being assigned to a page scoring 4 × (number of matches). Thus, for a query of length n, the highest match is assigned n × (number of matches). Given a query of n words, all 2^n − 1 possible subqueries were ranked according to the above-mentioned algorithm. Consider the query:

numero verde

and let the text be "il numero verde non è verde, è un numero che può essere dipinta di verde." ("the toll-free number is not green, it is a number that can be painted green"). The subqueries would be scored as:

1. numero verde: 2 × 1 = 2
2. numero: 1 × 2 = 2
3. verde: 1 × 3 = 3

Since the subquery verde gets the highest score, the most relevant document will be fetched by verde. Our system retrieved results based on this methodology.
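To illustrate the mechanics of this section, here is a minimal, self-contained sketch of the subset generation and the length-times-matches scoring; it simplifies matching to raw substring counting, is not the actual Nutch-integrated implementation, and may enumerate the subsets in a different order than listed above (the scores do not depend on that order).

import java.util.*;

// Minimal sketch of the combinatorics approach: generate all 2^n - 1
// non-empty subsets of the query words, then score each subquery against
// a text as (subquery length) x (number of matches).
public class CombinatoricsSearch {

    // Generate all non-empty word subsets of the query, using the bits of
    // a counter to select which words take part in each subquery.
    static List<String> subqueries(String query) {
        String[] words = query.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int mask = 1; mask < (1 << words.length); mask++) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < words.length; i++) {
                if ((mask & (1 << i)) != 0) {
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(words[i]);
                }
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Count occurrences of the subquery in the text (simple substring scan).
    static int matches(String text, String subquery) {
        int count = 0;
        for (int idx = text.indexOf(subquery); idx >= 0;
                idx = text.indexOf(subquery, idx + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "il numero verde non è verde, è un numero che può "
                + "essere dipinta di verde.";
        for (String sub : subqueries("numero verde")) {
            int length = sub.split("\\s+").length;
            int score = length * matches(text, sub); // length x matches
            System.out.println(sub + " -> " + score);
        }
    }
}

On the example text this prints numero -> 2, verde -> 3 and numero verde -> 2, reproducing the scores worked out above.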
5 Performance

Systems were ranked according to accuracy@1. In this method of ranking, the precision of a system is computed taking into account only the first answer generated by the system. The formulation of c@1, shown in Figure 4, is:

c@1 = (1 / n) · ( nR + nU · (nR / n) )

Figure 4: Formula for c@1

where:

1. nR: number of questions correctly answered;
2. nU: number of questions unanswered;
3. n: total number of questions.

The relevant statistics of both runs, based on the experiments performed, are outlined in Table 1.

         Total no.     No. of queries   No. of queries
         of queries    answered         unanswered
Run 1    1132          684              448
Run 2    1132          1103             29

Table 1: Statistics of both approaches

As can be inferred from Table 1, while there was a large number of unanswered queries in Run 1, they were significantly reduced in Run 2. This was possible due to the combinatorics approach used in Run 2. The performance of our system in both runs is shown in Table 2.

         Score obtained
Run 1    0.2125
Run 2    0.0168

Table 2: Performance of NLP-NITMZ in both runs
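Once nR, nU and n are known, the metric is straightforward to compute; the sketch below uses hypothetical counts purely for illustration and is not the task's official scorer.

// Minimal sketch of the c@1 metric from Section 5: unanswered questions
// contribute partial credit proportional to the observed precision nR / n.
public class CAt1 {

    // c@1 = (nR + nU * (nR / n)) / n
    static double cAt1(int nR, int nU, int n) {
        return (nR + nU * ((double) nR / n)) / n;
    }

    public static void main(String[] args) {
        // Hypothetical counts, for illustration only (not the paper's runs).
        int n = 1000;   // total questions
        int nR = 200;   // correctly answered
        int nU = 100;   // left unanswered
        System.out.printf("c@1 = %.4f%n", cAt1(nR, nU, n)); // prints 0.2200
    }
}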
6 Discussion

As the evaluation was done according to accuracy@1, which considers only the first answer retrieved by a system, the results obtained were not extremely accurate. We did, however, manage to implement a search engine that was 97.33% accurate in answering the queries, which left only a trivial number of queries unanswered. This made us realise that combinatorics can be an extremely powerful tool for searching if implemented in a proper way. However, the relevancy of the results obtained depends on how efficiently the ranking is done.

7 Conclusion and Future Direction

In this paper, we set out to frame an automated Question Answering (QA) system for Frequently Asked Questions (FAQs). We described the preprocessing of the corpus and the experiments performed on it, as well as the combinatorics approach used for searching. While the evaluation results were only decent, we did manage to build a remarkably accurate search engine for FAQs. Now that we have an adept search engine, we will next endeavour to perfect our ranking techniques and algorithms, taking steps towards a state-of-the-art QA system.

Acknowledgments

The work presented here falls under research project Grant No. YSS/2015/000988, supported by the Department of Science & Technology (DST) and the Science and Engineering Research Board (SERB), Govt. of India. The authors would like to acknowledge the Department of Computer Science & Engineering, National Institute of Technology, Mizoram for providing the infrastructural facilities that made research on this task possible.