=Paper=
{{Paper
|id=Vol-2029/paper2
|storemode=property
|title=Automated Question Answering System
|pdfUrl=https://ceur-ws.org/Vol-2029/paper2.pdf
|volume=Vol-2029
|authors=Chanin Pithyaachariyakul,Maria Khvalchik,Anagha Kulkarni
|dblpUrl=https://dblp.org/rec/conf/simbig/Pithyaachariyakul17
}}
==Automated Question Answering System==
<pdf width="1500px">https://ceur-ws.org/Vol-2029/paper2.pdf</pdf>
<pre>
                               Automated Question Answering System

               Chanin Pithyaachariyakul, Maria Khvalchik, Anagha Kulkarni
                              Department of Computer Science
                               San Francisco State University
                             1600 Holloway Ave, San Francisco
                                     CA, USA, 94132
        cpithyaa@mail.sfsu.edu, mkhvalch@mail.sfsu.edu, ak@sfsu.edu


                          Abstract                          question is provided by the members of the on-
                                                            line community. Compared to the experience that
        We present SFS-QA, an automatic real-               users have with commercial search engines, the
        time question-answering system, that                QA sites provide two key advantages: (i) freedom
        can answer free-text questions within a             to specify their information need (question) in free
        minute. SFS-QA system analyzes a ques-              flowing natural language, and (ii) convenience of
        tion and transforms it into a boolean key-          receiving focused answer(s) to the question, as op-
        word query using Stanford Dependency                posed to receiving 10 web pages that have to be
        Parser and MetaMap tagger. The query                read and parsed to identify the answer to the ques-
        is used to obtain matching web pages                tion. On the other hand, one clear advantage of
        from the World Wide Web through Google              search engines over QA sites is the information
        Search API. In addition, two specific data          response time. The search results are available to
        sources: Yahoo! Answers and Wikipedia               the user instantaneously, whereas human-authored
        are used for matching the documents. The            answers may take much longer to be posted.
        resulting web pages are mined for candi-               These observations motivate our QA system
        date answers. Finally, Learning to Rank             that allows the user to specify the question in nat-
        based LambdaMart, and Recurrent Neu-                ural language, which in turn is internally trans-
        ral Network based BLSTM algorithms are              formed into a boolean query composed of key
        used to learn, rank, and select the best an-        terms and phrases. Our approach works on the
        swer from candidate answers. For empir-             premise that the answers for the majority of ques-
        ical evaluation, TREC LiveQA 2015 and               tions are already available on some web page(s)
        2016 datasets which consist of about 1000           on WWW. Thus the goal of the next phase of our
        questions each were used. The results               system is to obtain these web pages, and extract
        demonstrate that our system substantially           concise answers from the content of the pages. An
        outperforms strong baselines.                       additional objective that we set for the QA system
1       Introduction                                        is to have low latency: every question has to be
                                                            answered in less than one minute. This require-
Even with modern search engines, there are many             ment influences many design decisions made for
scenarios where the users struggle to find the in-          the system’s architecture.
formation they are looking for. This is especially             The key contributions of this work are design-
true when the information need is complex, and              ing and developing: (i) a light-weight but effec-
when the user is unable to distill down their infor-        tive question to query transformation approach
mation need into a few keywords. These factors              (ii) a multi-sourced document retrieval approach,
motivate many users to seek answers on commu-               and (iii) a highly effective answer ranking ap-
nity based Question-Answering (QA) sites, such              proach. A thorough empirical evaluation of in-
as, Yahoo! Answers1 , and Quora2 , where the                dividual phases of the system, and of the end-to-
question can be posted in natural language, as op-          end system was undertaken, which demonstrates
posed to keyword query, and the answer(s) to the            that the proposed QA system performs substan-
    1
        https://answers.yahoo.com                           tially better than strong baselines while meeting
    2
        https://www.quora.com                               the response time requirement.


                                                       47
2   Related Work                                               Pinter et al. introduced a new method that ap-
                                                            plied a grammatical dependency parser to iden-
Liu et al. analyzed the difference between queries          tify segments of CQA questions to generate
and questions in the context of community QA                queries (Pinter and Reichart, 2016). Because CQA
sites (Qiaoling Liu, 2016). They confirmed the              questions are often long and verbose, the depen-
common belief that queries are focused on key               dency parser is required to partition a question into
concepts (things/nouns and actions/verb). While             several syntactically independent segments. The
questions also include contextual and etiquette re-         segment queries, which are generated from frag-
lated terms. In their analysis only 31.4% of terms          ments of a question, are more effective to find the
overlapped between question and query. They                 relevant answers than the simple phrase queries.
also observed that questions contain more abbre-
viations, and shortened versions of terms, since               Soricut and Brill (R. and E., 2006) published
the writer perceives more freedom of writing with           one of the first papers on non-factoid question an-
questions.                                                  swering, and many others have followed (R. and
                                                            H.; Surdeanu et al., 2011; Oh J. H., 2012). As a
   Savenkov et al. introduced a new QA sys-
                                                            training set they used a corpus of 1M question-
tem that uses multiple data resources such as
                                                            answer pairs from FAQ collected on the Web.
Wikipedia data, WWW, and Yahoo! Answer Col-
                                                            To search for the answer candidates they used
lection (Savenkov and Eugene, 2016). Wikipedia
                                                            MSNSearch and Google. Our work uses dif-
corpus is used to obtain relevant documents for an
                                                            ferent algorithm for QFM, is trained using Ya-
answer, and results from Web search are used for
                                                            hoo! Answers dataset and uses learning to rank
query expansion. They detected keywords from
                                                            techniques which started to advance in mid-00s.
top snippet results. Then, they expanded keyword
                                                            In recent years the advancements in NLP/ML
terms that may be misspelled and have multiple
                                                            techniques and availability of large QA datasets
thesauruses from the snippets. This lexical match
                                                            have propelled research and contests on answering
increases the opportunities to match more rele-
                                                            open-domain non-factoid questions (Agichtein E.,
vant documents when a less popular keywords is
                                                            2015). Wang et al. (Wang and Nyberg, 2015,
used in a query. Finally, they used WebScope L6
                                                            2016) works were the winner of two subsequent
dataset, which contains 4.4 million Yahoo! An-
                                                            TREC LiveQA competitions. In the first paper
swer questions, to run through a supervised learn-
                                                            they trained an answer prediction model using
ing to label the association between question terms
                                                            BLSTM Neural Network. In the second - Neural
and answer entities. The label is used to eval-
                                                            Machine Translation techniques to train the model
uate the candidate answer based on its associa-
                                                            which generates the answer itself given only a
tion score. Their result showed a significant im-
                                                            question. We use their method as a baseline com-
provement because web search results are the ef-
                                                            paring our work against to.
fective resources that enhance the query under-
standing. Moreover, the question-answer pairing
successfully ranked candidate answers by helping
with entity identification.
                                                            3   SFS-QA: An automated real-time
                                                                question-answering system
   Shtok et al. proposed a new approach to an-
swer a new question from CQA site such as Ya-
hoo! Answers by reusing past answers from previ-            Our QA system is structured as a pipeline of four
ous similar questions from CQA site itself (Shtok           components: Query formulation, Document re-
and Szpektor, 2012). They applied a cosine simi-            trieval, Candidate answer extraction, and Answer
larity to match potential past questions as the can-        selection. The first module is responsible for pars-
didate selections. The next step is to extract only         ing, and transforming the question into a query,
the best answer from the selected old questions             that is used by the second phase to retrieve docu-
as candidate answers. Then they applied statistic           ments that may contain the answer, the third phase
classifier to select the final answer. This research        is tasked with identifying the minimal unit of text
achieved high precision answering and preserved             that answers the original question, and the fourth
human generated content answer unlike other au-             phase selects the unit of text that serves best as
tomatic question answering systems that used web            the answer to the question. The architecture of the
extraction to generate the answers.                         system is shown in Figure 1.


                                                       48
                                     Yahoo!1                             Best1Answer
                                   Answers1 L61                                                                  Answer
     Question
                                     Dataset
           Lucene Query1
             Language                            Top13                     Top151Best1
                                 Yahoo!1Answers1 Pages                                 Answer1 Selection1   Top1Answer
                                                          Best1Answer1      Answers
                   YES             Site1Search1                                             with1
       Exact1                                              Extraction
       Match                        using1Bing                                           BLSTM+LLTR

                             Lucene1Query1Language
           NO                                                                                    Top151Answers
                                                  Top13    Wikipedia1        Answer1
    Question1 to1Query1                           Pages                    Candidates
                                   Wikipedia                 Pages1                     TF.iDF Scoring1
        Module                      Dataset                Extraction

           QFM

                  Boolean1                        Top13                      Answer1
                   Query
                                   Google         Pages     HTML1
                                                          Extraction
                                                                           Candidates


                                      DRM                   CAEM                             ASM


Figure 1: System Architecture for SFS-QA system. QFM: Query Formulation Module. DR: Document
Retrieval Module. CAEM: Candidate Answer Extraction Module. ASM: Answer Selection Module.

3.1 Query Formulation Module (QFM)                                     trieved web-pages are about orange juice and not
                                                                       about orange peel juice. In order to construct a
This module transforms free-text questions to a                        boolean query that retains the key information in
well-formed boolean conjunctive queries that can                       the question, QFM performs detailed grammatical
be evaluated by a search engine. This is a chal-                       analysis of the question. Specifically, we use the
lenging problem because questions are often ver-                       Stanford Dependency Parser (Chen and Manning,
bose. Questions often contain information that is                      2014) to identify the various phrases (noun, verb,
useful for a human reader but is superfluous, or                       preposition, and adjective phrases) in the sentence.
even misleading, if included in the search query.                      This allows us to identify important phrases, rather
We address this verbosity problem at multiple lev-                     than just individual words. For the above question,
els.                                                                   this approach selects important phrases and gen-
   First, not every sentence in the question con-                      erates the final boolean conjunctive query as fol-
tributes to the final query. Only sentences that                       lows: (juice) AND (orange peel) AND (good for
start with WH-words (e.g. Who, When, Where,                            eyes). This query is successful at retrieving web-
Why) and end with a question mark do (Varanasi                         pages about orange peel juice rather than about
and Neumann, 2015). Second, within a sentence                          orange juice even though the latter has the more
only certain select parts of the question are in-                      dominant presence on the web.
cluded in the query. Also, the length of these parts,
individual words or phrases, is selected carefully.                       The English closed class terms (pronouns, de-
For example, transforming the following question,                      terminers, prepositions) in the question are often
“Why’s juice from orange peel supposed to be                           ignored since they do not capture the central topic
good for eyes?”, into a unigram boolean query:                         of the question. However, in certain situations the
(orange) AND (peel) AND (juice) AND (good)                             prepositions should be included in the query. In
AND (eyes), is not effective because most of the re-                   case of the following question, “How much should


                                                                49
I pay for a round trip direct flight from NYC to            excellent resource, MetaMap tagger3 , that we use
Chicago in early November?”, if the preposition             in our work for health category questions (Aron-
words, from and to, are ignored then the informa-           son and Lang, 2010). For example, the ques-
tion about the travel direction is lost. Preposition        tion, ”how to treat type 2 diabetes without medica-
detection is used to address this issue where the           tion”, is transformed into the following query us-
grammatical tree structure of the sentence is lever-        ing the synonyms suggested by the MetaMap tag-
aged to identify the preposition phrases, such as,          ger for this question, (medication OR pharmaceu-
from NYC and to Chicago, and these are included             tical preparations) AND (non-insulin-dependent)
as-is in the boolean query.                                 AND (type 2 diabetes OR diabetes mellitus).
   The verb phrase is another important depen-                 In summary, the input to the Query Formulation
dency phrase that the system needs to identify be-          Module is the user question, and QFM transforms
cause sometimes single verb term is too broad, and          it into a query, which is handed over to the next
thus not useful in retrieving relevant documents.           phase.
However, in such circumstances the verb phrase
provides a focused search query. As an exam-                3.2      Document Retrieval Module (DRM)
ple question, “How to map dowse using a pen-                The goal of this module is to use the generated
dulum?”, without the verb phrase detection, the             query to obtain a set of web pages that are likely to
system generates the query: (map) AND (dowse)               contain answer(s) for the question. First, however,
AND (use) AND (pendulum). Once the query is                 it is checked if the current question has already
run through a search engine, it might retrieve a            been answered on Yahoo! Answers. If that is the
distorted set of documents because the verb map is          case then the answer that has been voted as the
ambiguous between mapping either dowse or pen-              best answer is returned by the system. Finding an
dulum. The verb phrase detection, however, gener-           exact match of the question on Yahoo! Answers is
ates a more explicit query: (map dowse) AND (use            however a rare occurrence.
pendulum) that is less likely to retrieve ambiguous            A targeted search is conducted on two online
result set.                                                 knowledge sources: English Wikipedia, and Ya-
   All of the above transformations are necessary           hoo! Answers. To facilitate faster query response
when the question is verbose. However, when the             time, we maintain a local copy of the English
question is well-articulated and succinct, no trans-        Wikipedia, and the query is run against this local
formations are performed. Questions with 8 words            copy. The top three wiki pages returned for the
or less are considered concise, and used as search          query are added to the set of answer-bearing web
queries as-is.                                              pages. We use Solr/Lucene to index and search the
                                                            Wikipedia copy.
   A substantial fraction of the posted questions on
                                                               For Yahoo! Answers (Y!A), we use a two-
the community sites are related to health. These
                                                            pronged strategy to provide short query response
questions also tend to have certain unique proper-
                                                            time. A local copy is maintained of the Webscope
ties, such as, the larger vocabulary gap between
                                                            L6 dataset 4 , which is a snapshot of the Y!A site
the question and the content of the relevant doc-
                                                            captured in October 2007. This data consist of
uments. A vocabulary gap exists between two
                                                            4.4+ million questions and all the posted answers,
units of text when they use different vocabulary
                                                            along with metadata, such as, question category,
to convey the same meaning. Since the questions
                                                            and best voted answer. This data is also indexed
are authored by ordinary people they tend to use
                                                            with Solr/Lucene for efficient access and search.
the common names for diseases, conditions, and
                                                            In addition to L6, the system is also capable of
symptoms, while the relevant documents written
                                                            conducting a site search of Yahoo! Answers, in
by medical professionals/experts tend to use the
                                                            order to obtain the most up-to-date data. The site
technical names for these concepts. As a result,
                                                            search functionality of Bing Search API is used to
a document that is relevant to the question might
                                                            accomplish this. It was found that Google Search
have very little word overlap with the question,
                                                            API is biased against Yahoo! Answers, and thus
and thus not be retrieved. To avoid this, query
                                                            Bing was chosen for this task. The top three an-
expansion is often proposed as the solution. Do-
ing effective query expansion for medical text is              3
                                                                   https://metamap.nlm.nih.gov
a non-trivial problem. Fortunately there exists an             4
                                                                   https://webscope.sandbox.yahoo.com


                                                       50
swer pages are added to the set of answer-bearing               There is a rich history of LeToR approaches be-
web pages.                                                   ing applied to automated QA (Bilotti et al., 2010;
  Finally, the larger World Wide Web is searched             Surdeanu et al., 2011; Agarwal et al., 2012). Fol-
using the Google Search API. The top three web               lowing on this tradition, for the baseline approach,
pages returned for the query are added to the set of         we employ the LambdaMart algorithm to learn a
answer-bearing web pages. In total, this module              ranking model for scoring the candidate answers,
identifies at most nine web pages that are passed            and the highest scored answer is selected as the
on to the next module.                                       final answer. We refer to this answer selection ap-
                                                             proach as LLTR. A subset of the Webscope Yahoo!
3.3 Candidate Answer Extraction Module
                                                             Answers L6 dataset7 is used for training the LLTR
    (CAEM)
                                                             model. For many questions in this dataset one of
The set of web pages identified by the Document              the answers for the question is identified as the
Retrieval Module are downloaded, and each page               best answer. For training LLTR the best answer
is passed through the following text processing              is assigned the highest rank label, and the remain-
pipeline. The first step extracts ASCII text from            ing answers are assigned a rank label proportional
the web page using an html2text library5 . We re-            to their BM25 score with the best answer. The
fer to the extracted text as a document. This doc-           following feature set is computed for each <ques-
ument is next split into passages, where each pas-           tion, answer>pair: Okapi BM25 score, cosine
sage consists of four consecutive sentences, the             similarity, number of overlapping terms, number
most popular answer length in Yahoo! Answers                 of punctuation marks in the passage, number of
dataset. A sliding span of four consecutive sen-             words in the answer, number of characters in the
tences is used to generate the passages. Thus, a             answer, query likelihood probability, largest dis-
document containing five sentences would gener-              tance between two query terms in the answer, aver-
ate two passages. This approach generates many               age distance between two terms, number of terms
passages, specifically, 1 + (n 4), where n is the            in longest continuous span, maximum number of
total number of sentences in the document. The               terms matched in a single sentence, maximum
passage length of four sentences was chosen based            number of terms in order. Before computing each
on data. On an average, high quality answers in the          of these features, all terms from query and candi-
L6 dataset contain four sentences. Shorter answer            date answer were stemmed using Porter.
lengths (single sentence) are common for factoid                Recurrent Neural Network (RNN) based ap-
questions but majority of the L6 questions cannot            proaches have received a lot of attention from the
be categorized as purely factoid.                            QA community recently (Severyn and Moschitti,
   Passages that do not contain any of the query             2015; Cohen and Croft, 2016; Wang and Nyberg,
terms, or that contain more than 2 line breaks, or           2015, 2016). Since carefully feature engineering
more than 10 punctuation marks, or non-printable             is completely unnecessary for NNs these networks
symbols are eliminated. Also, passages that are              lend themselves very well to the QA problem
not in English are filtered out. The langdetect li-          where it is difficult to defining features that gen-
brary6 is employed for language identification. All          eralize well. In fact, the best performing system
the passages that survive the filtering step are con-        (Encoder-Decoder) at the TREC 2016 LiveQA
sidered as candidate answers.                                track employed a recurrent neural network based
3.4 Answer Selection Module (ASM)                            approach. In our work we have employed the
                                                             Bidirectional Long Short Term Memory (BLSTM)
In this final step of the QA pipeline, the best an-          neural network because it adapts well to data with
swer from all the candidate answers is chosen.               varying dependency spans length. The bidirec-
We experiment with three algorithms for this task:           tional property of this network allows for track-
(i) Learning To Rank (LeToR) based LambdaMart                ing of both, forward and backward relations in
algorithm (Burges, 2010), (ii) Neural Network                the text. We use a modification of network ar-
based BLSTM algorithm (Graves and Schmidhu-                  chitecture implemented in (Wang and Nyberg,
ber, 2005), and (iii) a combination approach that            2015). The network consists of several layers: the
employs both, LambdaMart and BLSTM.                          word embedding layer followed by BLSTM layer,
   5
       https://pypi.python.org/pypi/html2text
   6                                                            7
       https://pypi.python.org/pypi/langdetect                      http://webscope.sandbox.yahoo.com


                                                        51
dropout layer to reduce overfitting, mean pooling,            1. Original-Q: This baseline tests the minimalist
and dense layer for the output. The output for the               approach where no processing is performed
network is a number from 0 to 1 identifying how                  on the question. The question is used as
likely the answer matches the question. It was                   query as-is. Here the intuition is that since
trained with ADAM optimizer, with binary cross-                  the questions are authored by humans, no in-
entropy as a target loss function. To train the net-             formation should be filtered out or added into
work a subset containing 384K <question,answer                   the original question. As such, the question
>pairs from the Webscope Yahoo! Answers L6                       title, along with the body field is used as the
dataset was used.                                                query.
   The third answer selection approach that we
                                                              2. QuestionMark-Q: This baseline works with
investigate simply combines the above two ap-
                                                                 the assumption that humans often add super-
proaches. The score assigned by BLSTM to each
                                                                 fluous details to the question, and these de-
<question, answer>pair is used as an additional
                                                                 tails typical occur in sentences that do not end
feature in the feature set used by the LLTR rank-
                                                                 with question mark. As such, sentences in
ing algorithm.
                                                                 title and body fields, that end with question
                                                                 mark are used as the query.
4   Results and Analysis
                                                              3. Unigram-Q: This last baseline seeks to fil-
We conduct a thorough empirical evaluation of the                ter out unnecessary information further by re-
individual components of our system, and of the                  moving stopwords, and by applying morpho-
end-to-end system. The results of these evalua-                  logical normalization using Krovetz stemmer
tions are presented in this section.                             to sentences ending with question mark in ti-
                                                                 tle and body fields. The terms that remain
4.1 Query Formulation Module                                     are treated as unigrams and compiled into a
Evaluation Data: Recall that QFM module is                       boolean AND query.
tasked with understanding the question, and com-            Evaluation Metrics: For the task at hand, the
piling a set of web pages (URLs) that are likely            generated query is said to be effective if it can re-
to contain the answer(s) to the question. Evalua-           trieve one or more answer-bearing web pages in
tion dataset for this task is not readily available.        the first three ranks, since these pages are mined
We had to thus re-purpose the annotated datasets            for candidate answers in the next phase. As
that are available for another task – question-             such, this is a precision oriented task, and thus
answer evaluation. Specifically, we used the                inspires our choice of evaluation metrics: Pre-
TREC LiveQA 2015 and 2016 datasets, which                   cision@ranks1,2,3 and NDCG@ranks1,2,3. The
consist of 1000 questions each. For each ques-              latter models the different grades of relevance that
tion there are one or more answers, and each an-            are present in the annotation scores.
swer is assessed for its relevance to the question
by a human, and assigned a score between 0 (non-            Results & Analysis: Table 1 reports the results
relevant) and 3 (very relevant). The LiveQA 2016            for the three baselines and QFM with and with-
dataset provides the source URL for each answer.            out site search of Yahoo! Answers. It should be
For the LiveQA 2015 dataset we had to locate the            noted that URLs for which human scores are not
source URL for each answer, since it is not in-             available are considered non-relevant in this as-
cluded in the dataset. This gives us an annotated           sessment. This is an important point that leads to
set of <Question,URL,score>tuples which we use              overall low values seen in Table 1.
to evaluate the effectiveness of QFM.                          The prominent trend in these results is that
                                                            QFM outperforms all the baselines substantially
Baselines: The LiveQA 2015 and 2016 ques-                   across all the metrics. The results for the Original-
tions consist of three fields, title, body, and cat-        Q baseline demonstrate that using the user ques-
egory. Since these questions are generated by real          tion as-is often does not lead to relevant web
users the question can be either in title or in de-         pages. This justifies the need of a Query For-
scription. Based on this observations we have de-           mulation Module. The QuestionMark-Q results
fined three baselines to compare with the proposed          show that filtering out the sentences without ques-
QFM:                                                        tion mark improves the performance by reducing


                                                       52
                                         Table 1: Results of QFM
                                       Precision@               NDCG@
                                       1        2       3       1    2                 3
                                               TREC 2015
                  Original-Q           0.030 0.027 0.020 0.048 0.056                   0.048
                  QuestionMark-Q       0.046 0.037 0.027 0.078 0.083                   0.069
                  Unigram-Q            0.032 0.028 0.024 0.062 0.064                   0.062
                  QFM                  0.073 0.064 0.055 0.159 0.143                   0.125
                                               TREC 2016
                  Original-Q           0.043 0.035 0.029 0.044 0.057                   0.065
                  QuestionMark-Q       0.064 0.046 0.041 0.069 0.072                   0.090
                  Unigram-Q            0.055 0.043 0.035 0.058 0.064                   0.078
                  QFM                  0.106 0.083 0.063 0.179 0.165                   0.158


                                        Table 2: Results of ASM
                                           MAP@                            MRR@
                                  NDCG
                                           2        3        4             2    3           4
                                              TREC 2015
             Encoder-Decoder      0.635    0.512 0.339 0.166               0.565   0.367    0.178
             LLTR                 0.622    0.484 0.316 0.155               0.549   0.352    0.156
             BLSTM                0.656    0.546 0.347 0.174               0.587   0.379    0.205
             LLTR + BLSTM         0.660    0.550 0.349 0.176               0.590   0.381    0.206
                                              TREC 2016
             LLTR                 0.648    0.512 0.346 0.217               0.621   0.381    0.241
             BLSTM                0.671    0.559 0.379 0.254               0.648   0.403    0.288
             LLTR + BLSTM         0.675    0.567 0.384 0.257               0.650   0.399    0.293


the unnecessary information and noise. But fil-             on this dataset (Wang and Nyberg, 2016). The per-
tering out stopwords, and loosing the grammatical           formance of their ASM module on 2016 data is not
structure of the question completely is detrimental,        available, we only have access to the results of the
which is demonstrated by the Unigram-Q results.             overall system performance which we compare to
As is described in Section 3.1, QFM attempts to             ours later.
balance two conflicting objectives: (i) distill down
                                                            Evaluation Metrics: The effectiveness of the
the question to the bear minimal query (ii) keep
                                                            system was evaluated using standard evalua-
the necessary units of the question intact. These
                                                            tion: NDCG (Normalized Discounted Cumulative
results indicate that this intuition behind QFM is
                                                            Gain), MAP (Mean average precision) at rank X,
correct.
                                                            and MRR (Mean Reciprocal Rank) at rank X.
4.2 Answer Selection Module                                 Results & Analysis: Table 2 provides the re-
Evaluation data: To evaluate the performance                sults for the evaluation of the ASM. LLTR is
of ASM, we use the questions TREC 2015                      less effective than the state-of-the-art approach
LiveQA evaluation set. Since answers are rated              across all the metrics. However, the neural net-
by judges, it is possible to check how good our             work based approach, BLSTM performs substan-
system is at ordering of these answers.                     tially better than Encoder-Decoder and LLTR for
                                                            both datasets. The results for LLTR+BLSTM il-
Baselines: For the baseline, we use the results             lustrate that two approaches have complementary
obtained by Encoder-Decoder system which was                strengths that can be combined to obtain the best
the best performing system of 2016 year. They               results for the task. The difference between LLTR
were testing their version of ASM on 2015 data,             and LLTR+BLSTM is statistically significant.
and therefore we can compare their system to ours              The results for MAP show that in 55% of the


                                                       53
cases our system selects the answer with at least             X=annotation score + 1. The avgScore is the av-
fair quality (2+), and in 21% of the cases the                erage score of the answers produced by a system,
quality is excellent. The performance on TREC                 where answer scores’ range is 0-3.
2016 is better than on TREC 2015. For refer-
                                                              Results & Analysis: The results are presented
ence, Table 2 provides the performance of the
                                                              in Table 3. Our system performed substantially
TREC 2016 LiveQA winning system – Encoder-
                                                              better than the state-of-the-art for both 2015 and
Decoder (Wang and Nyberg, 2016), but our sys-
                                                              2016. This is especially prominent at higher score
tem outperforms it. Interestingly, a simple LLTR
                                                              levels, indicating that our system is able to gen-
which relies only on statistical text features is very
                                                              erate better quality results. For 2015, the system
close in performance to Encoder-Decoder neural
                                                              produced 53% of answers which could be con-
network approach. It implies that LLTR is a pow-
                                                              sidered as fair, 40% questions as good, and 25%
erful method with right selection of the features.
                                                              questions were considered as excellent answers.
   We believe that quality of the model can be im-
                                                              For the 2016, 67% of answers have a fair grade,
proved by sanitizing the training dataset. Cur-
                                                              45% are good and 23% are excellent. The average
rently, two main problems are: (i) presence of
                                                              scores, although higher than those of state-of-the-
words with misspellings which make computa-
                                                              art, indicate that there is plenty of room for im-
tions of statistical features imprecise; (ii) quality
                                                              provement.
of the best answers manually selected by voters.
                                                                 On an average, the system generates an answer
There exist a few approaches to diminish impact
                                                              for a submitted question in less than half a minute
of both issues such as (Chen et al., 2007) for mis-
                                                              on a non-specialized commodity computer. The
spellings and (Agichtein et al., 2008) for keeping
                                                              Query Formulation Module transforms the ques-
only high-quality answers.
                                                              tion into a query in 3 seconds, the Document Re-
4.3 End-to-end System Performance                             trieval Module and the Candidate Answer Extrac-
Evaluation data: For the end-to-end system                    tion Module together take 8 seconds to generate all
evaluation, the existing datasets cannot be reused            the candidate answers, and the last module of An-
because the system generated answers can be                   swer Selection needs 15 seconds to select the best
different from the previously annotated answers.              answer, which amounts to 26 seconds per query,
Thus, we undertook the task of manual assessment              on average.
of the answers generated by our system. Each                     Table 4 presents examples of the user ques-
answer was rated on the same scale as in TREC                 tions, the corresponding queries, and the answers
LiveQA competition which is 0 (non-relevant)                  generated by SFS-QA system. The first exam-
through 3 (very relevant). We selected at random              ple includes an instance of query expansion us-
100 questions from LiveQA 2015, and 100 from                  ing MetaMap. The scientific name of bats, Chi-
LiveQA 2016, and assessed their respective an-                roptera, is sourced from MetaMap. The query
swers on the 4-point scale.                                   generated for the second question shows an ex-
                                                              ample of verb phrase and adjective phrase. Other
Baselines: We compare our results with those                  queries also show examples of noun phrases. In
produced by the winning systems at TREC                       general, these examples illustrate how the QFM
LiveQA 2015, and 2016, the Open Advancement                   module retains the important bigrams and phrases
of Question Answering (OAQA) (Wang and Ny-                    while reducing the superfluous terms. The gen-
berg, 2015), and the Encoder-Decoder (Wang and                erated answers, more often than not, provide the
Nyberg, 2016) system, respectively. Both systems              necessary information, at least partially.
are similar - they use Yahoo! Answers and Bing
Web Search for candidate answers retrieval. The               5   Conclusions
only difference between them is the strategy of
                                                              We presented an automated question answering
best answer selection. The former system uses a
                                                              system that accepts questions in natural language
type of BLSTM while the latter uses an encoder-
                                                              and responds with a focused answer in less than
decoder recurrent neural network model.
                                                              half a minute. With thorough empirical evalua-
Evaluation Metrics: We use the official metrics               tion we demonstrated that a light-weight question-
used by TREC LiveQA track. The succ@X+ is                     to-query transformation module can be developed
a fraction of answers with grade above X, where               that is also highly effective. We also illustrated


                                                         54
                                          Table 3: Overall Results
                                       avgScore(0-3) succ@2+               succ@3+      succ@4+
                                                TREC 2015
                 SFS-QA                1.180           0.530               0.400        0.250
                 OAQA                  1.081           0.532               0.359        0.190
                                                TREC 2016
                 SFS-QA                1.350           0.670               0.450        0.230
                 Encoder-Decoder       1.155           0.561               0.395        0.199


that various existing information sources can be                question answering. In Proceedings of the 2016
leveraged to obtain answer-bearing web pages. Fi-               ACM on International Conference on the Theory of
                                                                Information Retrieval. ACM, pages 143–146.
nally, we established that advances in deep learn-
ing can be utilized to select the best answer for the         Alex Graves and Jürgen Schmidhuber. 2005. Frame-
question.                                                       wise phoneme classification with bidirectional lstm
                                                                and other neural network architectures. Neural Net-
                                                                works 18(5):602–610.
References                                                    et al. Oh J. H. 2012. Why question answering using
Arvind Agarwal, Hema Raghavan, Karthik Subbian,                  sentiment analysis and word classes.
  Prem Melville, Richard D Lawrence, David C
  Gondek, and James Fan. 2012. Learning to rank for           Yuval Pinter and Roi Reichart. 2016. Syntactic parsing
  robust question answering. In Proceedings of the              of web queries with question intent. In Proceedings
  21st ACM international conference on Information              of NAACL-HLT 2016, pages 670-680.
  and knowledge management. ACM, pages 833–842.
                                                              et al Qiaoling Liu. 2016. When web search fails,
Eugene Agichtein, Carlos Castillo, Debora Donato,                searchers become askers: Understanding the transi-
  Aristides Gionis, and Gilad Mishne. 2008. Finding              tion. In Proceedings of SIGIR 2012.
  high-quality content in social media. In Proceedings
  of the 2008 international conference on web search          Higashinaka R. and Isozaki H. ????
  and data mining. ACM, pages 183–194.                        Soricut R. and Brill E. 2006. Automatic question an-
et al. Agichtein E. 2015. Overview of the trec 2015             swering using the web: Beyond the factoid.
   liveqa track. In Proceedings of TREC..
                                                              Denis Savenkov and Agichtein Eugene. 2016. When
Alan R Aronson and Franois-Michel Lang. 2010. An                a knowledge base is not enough: Question answer-
  overview of metamap: historical perspective and re-           ing over knowledge bases with external text data. In
  cent advances. Journal of the American Medical In-            Proceedings of the 39th International ACM SIGIR
  formatics Association: JAMIA 17(3):229–236.                   conference on Research and Development in Infor-
                                                                mation Retrieval. ACM, 2016..
Matthew W Bilotti, Jonathan Elsas, Jaime Carbonell,
 and Eric Nyberg. 2010. Rank learning for fac-                Aliaksei Severyn and Alessandro Moschitti. 2015.
 toid question answering with linguistic and semantic           Learning to rank short text pairs with convolutional
 constraints. In Proceedings of the 19th ACM inter-             deep neural networks. In Proceedings of the 38th
 national conference on Information and knowledge               International ACM SIGIR Conference on Research
 management. ACM, pages 459–468.                                and Development in Information Retrieval. ACM,
                                                                pages 373–382.
Christopher JC Burges. 2010. From ranknet to lamb-
  darank to lambdamart: An overview. Learning                 Anna Shtok and Idan Szpektor. 2012. Learning from
  11(23-581):81.                                                the past: Answering new questions with past an-
                                                                swers. In Proceedings of the 21st International Con-
Danqi Chen and Christopher D Manning. 2014. A fast              ference on World Wide Web. pages 759–768.
  and accurate dependency parser using neural net-
  works. In Proceedings of EMNLP, 2014.                       Mihai Surdeanu, Massimiliano Ciaramita, and Hugo
                                                                Zaragoza. 2011. Learning to rank answers to non-
Qing Chen, Mu Li, and Ming Zhou. 2007. Improving                factoid questions from web collections. Computa-
  query spelling correction using web search results.           tional linguistics 37(2):351–383.
  In EMNLP-CoNLL. Citeseer, volume 7, pages 181–
  189.                                                        Stalin Varanasi and Gnter Neumann. 2015. Ques-
                                                                 tion/answer matching for yahoo! answers using a
Daniel Cohen and W Bruce Croft. 2016. End to end                 corpus-based extracted ngram-based mapping. In
  long short term memory networks for non-factoid                MLT-Lab DFKI D-66123, 2015.


                                                         55
                    Table 4: Examples of Queries and Answers Returned by the System
   Question: Bat ran into me, should I be afraid of rabies?
   Query: (afraid) AND (bat OR Chiroptera) AND (Rabies)
   Answer: Bats can carry rabies. Never try and retrieve a sick looking animal. The fecal material
   from bats can also accumulate and harbor histoplasmosis fungus spores which may cause blindness
   , and pneumonia like symptoms.Source(s): 23 years in pest control
   Question: Is waterproof eyeliner necessary for traveling in hot/humid areas?
   Query: (necessary) AND (travel hot humid areas) AND (waterproof eyeliner)
   Answer: For a softer look, we found that Stila’s twist-up Smudge Stick provides the best pencil
   eyeliner experience, from the flexibility of its application, as it goes on the softest but sets the
   firmest, to its remarkable staying power.[The Stila Smudge Stick Waterproof liner] was easiest to
   use of all the eyeliners tested. Once applied and given a few seconds, it sets firmly and was among
   the hardest to budge for our testers.TheStila Smudge Stick Waterproof Eyelineris an extra-soft,
   twist-up, thin mechanical pencil that deposits a neat matte line onto the skin.
   Question: Can I rent a car in Verona, Italy if I’m 18 years old?
   Query: (18 years old) AND (rent car) AND (Verona Italy)
   Answer: Usually, car rental companies will only allow renters 21 and over. One thing you might
   check is to see if the company you work for has a corporate account with any car rental companies,
   where they have contracted a lower minimum age for their employees.Lori J * 8 years ago Thumbs
   up
   Question: Do you need an exotic animal permit to own a peacock? Im asking because im thinking
   of raising them when i move to my own house also wondering if they do well around chickens
   Query: (exotic animal) AND (peacock) AND (permit)
   Answer: Exotic fowl refers to any avian species that is not indigenous to this state, including
   ratites (emu, ostrich, rhea, cassowary, etc.There are no state bag or possession limits or closed
   seasons on exotic animals or fowl on private property. It is against the law to: * Hunt an exotic
   without a valid hunting license. Thumbs up
   Question: Which coffee shops in the netherlands has the best weed?
   Query: (best weed) AND (coffee shops) AND (netherlands)
   Answer: those with tobacco mixed with cannabis, and have made customers smoke in upstairs or
   downstairs rooms. Notable coffeeshops[edit]
   * Checkpoint coffeeshop
   * Mellow Yellow coffeeshop


Di Wang and Eric Nyberg. 2015. Discovering the right
  answer with clues. In Proceedings of TREC, 2015.
Di Wang and Eric Nyberg. 2016. An attentional neural
  encoder-decoder approach for answer ranking. In
  Proceedings of TREC.


                                                       56

</pre>