=Paper=
{{Paper
|id=Vol-2029/paper2
|storemode=property
|title=Automated Question Answering System
|pdfUrl=https://ceur-ws.org/Vol-2029/paper2.pdf
|volume=Vol-2029
|authors=Chanin Pithyaachariyakul,Maria Khvalchik,Anagha Kulkarni
|dblpUrl=https://dblp.org/rec/conf/simbig/Pithyaachariyakul17
}}
==Automated Question Answering System==
Automated Question Answering System
Chanin Pithyaachariyakul, Maria Khvalchik, Anagha Kulkarni
Department of Computer Science
San Francisco State University
1600 Holloway Ave, San Francisco
CA, USA, 94132
cpithyaa@mail.sfsu.edu, mkhvalch@mail.sfsu.edu, ak@sfsu.edu
Abstract

We present SFS-QA, an automatic real-time question-answering system that can answer free-text questions within a minute. The SFS-QA system analyzes a question and transforms it into a boolean keyword query using the Stanford Dependency Parser and the MetaMap tagger. The query is used to obtain matching web pages from the World Wide Web through the Google Search API. In addition, two specific data sources, Yahoo! Answers and Wikipedia, are used for matching documents. The resulting web pages are mined for candidate answers. Finally, the Learning to Rank based LambdaMart algorithm and the Recurrent Neural Network based BLSTM algorithm are used to rank the candidate answers and select the best one. For empirical evaluation, the TREC LiveQA 2015 and 2016 datasets, which consist of about 1000 questions each, were used. The results demonstrate that our system substantially outperforms strong baselines.

1 Introduction

Even with modern search engines, there are many scenarios where users struggle to find the information they are looking for. This is especially true when the information need is complex, and when the user is unable to distill their information need into a few keywords. These factors motivate many users to seek answers on community-based Question-Answering (QA) sites, such as Yahoo! Answers (https://answers.yahoo.com) and Quora (https://www.quora.com), where the question can be posted in natural language, as opposed to a keyword query, and the answer(s) to the question are provided by the members of the online community. Compared to the experience that users have with commercial search engines, the QA sites provide two key advantages: (i) the freedom to specify the information need (question) in free-flowing natural language, and (ii) the convenience of receiving focused answer(s) to the question, as opposed to receiving ten web pages that have to be read and parsed to identify the answer to the question. On the other hand, one clear advantage of search engines over QA sites is response time. Search results are available to the user instantaneously, whereas human-authored answers may take much longer to be posted.

These observations motivate our QA system, which allows the user to specify the question in natural language; the question is internally transformed into a boolean query composed of key terms and phrases. Our approach works on the premise that the answers for the majority of questions are already available on some web page(s) on the WWW. Thus the goal of the next phase of our system is to obtain these web pages and extract concise answers from their content. An additional objective that we set for the QA system is low latency: every question has to be answered in less than one minute. This requirement influences many design decisions made for the system's architecture.

The key contributions of this work are designing and developing: (i) a light-weight but effective question-to-query transformation approach, (ii) a multi-sourced document retrieval approach, and (iii) a highly effective answer ranking approach. A thorough empirical evaluation of the individual phases of the system, and of the end-to-end system, was undertaken, which demonstrates that the proposed QA system performs substantially better than strong baselines while meeting the response time requirement.
2 Related Work

Liu et al. analyzed the difference between queries and questions in the context of community QA sites (Qiaoling Liu, 2016). They confirmed the common belief that queries are focused on key concepts (things/nouns and actions/verbs), while questions also include contextual and etiquette-related terms. In their analysis only 31.4% of terms overlapped between question and query. They also observed that questions contain more abbreviations and shortened versions of terms, since the writer perceives more freedom of writing with questions.

Savenkov et al. introduced a QA system that uses multiple data resources, such as Wikipedia, the WWW, and a Yahoo! Answers collection (Savenkov and Agichtein, 2016). The Wikipedia corpus is used to obtain relevant documents for an answer, and results from web search are used for query expansion. They detected keywords from the top snippet results and then expanded keyword terms that may be misspelled or have multiple synonyms in the snippets. This lexical match increases the chance of matching more relevant documents when a less popular keyword is used in a query. Finally, they used the Webscope L6 dataset, which contains 4.4 million Yahoo! Answers questions, in a supervised learning setup to label the association between question terms and answer entities. The label is used to evaluate a candidate answer based on its association score. Their results showed a significant improvement because web search results are effective resources for enhancing query understanding. Moreover, the question-answer pairing successfully ranked candidate answers by helping with entity identification.

Shtok et al. proposed an approach that answers a new question from a CQA site, such as Yahoo! Answers, by reusing past answers to previous similar questions from the CQA site itself (Shtok and Szpektor, 2012). They applied cosine similarity to match potential past questions as candidates. The next step is to extract only the best answer from the selected old questions as candidate answers. They then applied a statistical classifier to select the final answer. This research achieved high-precision answering and preserved human-generated answer content, unlike other automatic question answering systems that used web extraction to generate the answers.

Pinter et al. introduced a method that applies a grammatical dependency parser to identify segments of CQA questions from which queries are generated (Pinter and Reichart, 2016). Because CQA questions are often long and verbose, the dependency parser is required to partition a question into several syntactically independent segments. The segment queries, which are generated from fragments of a question, are more effective at finding relevant answers than simple phrase queries.

Soricut and Brill (2006) published one of the first papers on non-factoid question answering, and many others have followed (Higashinaka and Isozaki; Surdeanu et al., 2011; Oh et al., 2012). As a training set they used a corpus of 1M question-answer pairs from FAQs collected on the Web. To search for the answer candidates they used MSN Search and Google. Our work uses a different algorithm for QFM, is trained using the Yahoo! Answers dataset, and uses learning-to-rank techniques, which started to advance in the mid-2000s.

In recent years the advancements in NLP/ML techniques and the availability of large QA datasets have propelled research and contests on answering open-domain non-factoid questions (Agichtein et al., 2015). The works of Wang et al. (Wang and Nyberg, 2015, 2016) were the winners of two consecutive TREC LiveQA competitions. In the first paper they trained an answer prediction model using a BLSTM neural network; in the second, they used Neural Machine Translation techniques to train a model that generates the answer itself given only the question. We use their method as a baseline to compare our work against.

3 SFS-QA: An automated real-time question-answering system

Our QA system is structured as a pipeline of four components: Query formulation, Document retrieval, Candidate answer extraction, and Answer selection. The first module is responsible for parsing and transforming the question into a query, which is used by the second phase to retrieve documents that may contain the answer; the third phase is tasked with identifying the minimal unit of text that answers the original question; and the fourth phase selects the unit of text that serves best as the answer to the question. The architecture of the system is shown in Figure 1.
Figure 1: System Architecture for the SFS-QA system. QFM: Query Formulation Module. DRM: Document Retrieval Module. CAEM: Candidate Answer Extraction Module. ASM: Answer Selection Module.
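To make the four-stage pipeline concrete, the following minimal Python sketch mirrors the module boundaries in Figure 1. The function names and the trivial placeholder bodies are our own illustrative assumptions, not the authors' implementation; the real modules are described in Sections 3.1-3.4.

```python
# Minimal sketch of the SFS-QA pipeline in Figure 1. Function names and the
# placeholder bodies are illustrative assumptions only.
from typing import List


def formulate_query(question: str) -> str:
    """3.1 QFM: turn a free-text question into a boolean keyword query."""
    return " AND ".join(f"({w})" for w in question.lower().split())  # placeholder


def retrieve_documents(query: str) -> List[str]:
    """3.2 DRM: gather up to nine answer-bearing pages (Wikipedia, Y!A, Google)."""
    return []  # placeholder: the real module queries local indexes and web APIs


def extract_candidates(pages: List[str], query: str) -> List[str]:
    """3.3 CAEM: slice pages into four-sentence passages and filter them."""
    return pages  # placeholder


def select_answer(question: str, candidates: List[str]) -> str:
    """3.4 ASM: score candidates with LLTR/BLSTM and return the best one."""
    return candidates[0] if candidates else ""


def answer_question(question: str) -> str:
    query = formulate_query(question)
    pages = retrieve_documents(query)
    candidates = extract_candidates(pages, query)
    return select_answer(question, candidates)
```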
3.1 Query Formulation Module (QFM)

This module transforms free-text questions into well-formed boolean conjunctive queries that can be evaluated by a search engine. This is a challenging problem because questions are often verbose. Questions often contain information that is useful for a human reader but is superfluous, or even misleading, if included in the search query. We address this verbosity problem at multiple levels.

First, not every sentence in the question contributes to the final query. Only sentences that start with WH-words (e.g. Who, When, Where, Why) and end with a question mark do (Varanasi and Neumann, 2015). Second, within a sentence only certain select parts of the question are included in the query. Also, the length of these parts, individual words or phrases, is selected carefully. For example, transforming the following question, "Why's juice from orange peel supposed to be good for eyes?", into the unigram boolean query (orange) AND (peel) AND (juice) AND (good) AND (eyes) is not effective, because most of the retrieved web pages are about orange juice and not about orange peel juice. In order to construct a boolean query that retains the key information in the question, QFM performs a detailed grammatical analysis of the question. Specifically, we use the Stanford Dependency Parser (Chen and Manning, 2014) to identify the various phrases (noun, verb, preposition, and adjective phrases) in the sentence. This allows us to identify important phrases, rather than just individual words. For the above question, this approach selects the important phrases and generates the final boolean conjunctive query as follows: (juice) AND (orange peel) AND (good for eyes). This query is successful at retrieving web pages about orange peel juice rather than about orange juice, even though the latter has the more dominant presence on the web.

The English closed-class terms (pronouns, determiners, prepositions) in the question are often ignored since they do not capture the central topic of the question. However, in certain situations the prepositions should be included in the query.
In the case of the following question, "How much should I pay for a round trip direct flight from NYC to Chicago in early November?", if the preposition words, from and to, are ignored then the information about the travel direction is lost. Preposition detection is used to address this issue: the grammatical tree structure of the sentence is leveraged to identify the preposition phrases, such as from NYC and to Chicago, and these are included as-is in the boolean query.

The verb phrase is another important dependency phrase that the system needs to identify, because a single verb term is sometimes too broad and thus not useful in retrieving relevant documents. In such circumstances the verb phrase provides a more focused search query. For the example question "How to map dowse using a pendulum?", without verb phrase detection the system generates the query (map) AND (dowse) AND (use) AND (pendulum). Once this query is run through a search engine, it might retrieve a distorted set of documents because the verb map is ambiguous between mapping either dowse or pendulum. Verb phrase detection, however, generates a more explicit query, (map dowse) AND (use pendulum), that is less likely to retrieve an ambiguous result set.

All of the above transformations are necessary when the question is verbose. However, when the question is well-articulated and succinct, no transformations are performed. Questions with 8 words or fewer are considered concise and are used as search queries as-is.

A substantial fraction of the questions posted on community sites are related to health. These questions also tend to have certain unique properties, such as a larger vocabulary gap between the question and the content of the relevant documents. A vocabulary gap exists between two units of text when they use different vocabulary to convey the same meaning. Since the questions are authored by ordinary people, they tend to use the common names for diseases, conditions, and symptoms, while the relevant documents, written by medical professionals and experts, tend to use the technical names for these concepts. As a result, a document that is relevant to the question might have very little word overlap with the question, and thus not be retrieved. To avoid this, query expansion is often proposed as the solution. Doing effective query expansion for medical text is a non-trivial problem. Fortunately there exists an excellent resource, the MetaMap tagger (https://metamap.nlm.nih.gov), that we use in our work for health category questions (Aronson and Lang, 2010). For example, the question "how to treat type 2 diabetes without medication" is transformed into the following query using the synonyms suggested by the MetaMap tagger: (medication OR pharmaceutical preparations) AND (non-insulin-dependent) AND (type 2 diabetes OR diabetes mellitus).

In summary, the input to the Query Formulation Module is the user question, and QFM transforms it into a query, which is handed over to the next phase.
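The rules above can be summarized in a short sketch. The snippet below is a simplified, hypothetical rendering of the QFM logic: short questions pass through unchanged, only WH-sentences ending in a question mark are kept, and the selected phrases are joined into a boolean AND query. The `extract_phrases` helper is a naive stand-in, not the Stanford Dependency Parser or MetaMap; it only marks where the parser-derived phrases would plug in.

```python
# Simplified sketch of the QFM rules described above. `extract_phrases` is a
# naive stand-in for the phrase extraction done with the Stanford Dependency
# Parser (and MetaMap expansion for health questions); it is not the real
# implementation.
import re
from typing import List

WH_WORDS = ("who", "what", "when", "where", "why", "how", "which")
STOPWORDS = {"is", "are", "be", "to", "the", "a", "an", "of", "do", "i"}


def question_sentences(question: str) -> List[str]:
    """Keep only sentences that start with a WH-word and end with '?'."""
    sentences = re.split(r"(?<=[.!?])\s+", question.strip())
    return [s for s in sentences
            if s.lower().startswith(WH_WORDS) and s.endswith("?")]


def extract_phrases(sentence: str) -> List[str]:
    """Stand-in for dependency-parse phrase extraction (noun, verb,
    preposition, and adjective phrases): here we simply drop stopwords."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def formulate_query(question: str) -> str:
    if len(question.split()) <= 8:       # concise questions are used as-is
        return question
    phrases: List[str] = []
    for sentence in question_sentences(question):
        phrases.extend(extract_phrases(sentence))
    return " AND ".join(f"({p})" for p in phrases)
```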
3.2 Document Retrieval Module (DRM)

The goal of this module is to use the generated query to obtain a set of web pages that are likely to contain answer(s) for the question. First, however, it is checked whether the current question has already been answered on Yahoo! Answers. If that is the case, then the answer that has been voted as the best answer is returned by the system. Finding an exact match of the question on Yahoo! Answers is, however, a rare occurrence.

A targeted search is conducted on two online knowledge sources: English Wikipedia and Yahoo! Answers. To facilitate faster query response time, we maintain a local copy of the English Wikipedia, and the query is run against this local copy. The top three wiki pages returned for the query are added to the set of answer-bearing web pages. We use Solr/Lucene to index and search the Wikipedia copy.

For Yahoo! Answers (Y!A), we use a two-pronged strategy to provide short query response time. A local copy is maintained of the Webscope L6 dataset (https://webscope.sandbox.yahoo.com), which is a snapshot of the Y!A site captured in October 2007. This data consists of 4.4+ million questions and all the posted answers, along with metadata such as the question category and the best voted answer. This data is also indexed with Solr/Lucene for efficient access and search. In addition to L6, the system is also capable of conducting a site search of Yahoo! Answers, in order to obtain the most up-to-date data. The site search functionality of the Bing Search API is used to accomplish this. It was found that the Google Search API is biased against Yahoo! Answers, and thus Bing was chosen for this task.
The top three answer pages are added to the set of answer-bearing web pages.

Finally, the larger World Wide Web is searched using the Google Search API. The top three web pages returned for the query are added to the set of answer-bearing web pages. In total, this module identifies at most nine web pages that are passed on to the next module.
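The sketch below shows one plausible way to wire up this multi-source retrieval: top-3 results from a local Solr index of Wikipedia, top-3 Yahoo! Answers pages (live site search with a fall-back to the local L6 index), and top-3 pages from the open web. The Solr core URLs, the `url` field name, and the two web-search placeholders are assumptions for illustration, not the authors' configuration.

```python
# Hedged sketch of the DRM's multi-source retrieval. Core names, field names,
# and the web-search helpers below are assumptions, not the real setup.
from typing import List
import requests

WIKI_SOLR = "http://localhost:8983/solr/wikipedia"   # assumed local core
YA_SOLR = "http://localhost:8983/solr/yahoo_l6"      # assumed local core


def solr_top_urls(core_url: str, query: str, k: int = 3) -> List[str]:
    """Query a local Solr core and return URLs of the top-k documents."""
    resp = requests.get(f"{core_url}/select",
                        params={"q": query, "rows": k, "fl": "url", "wt": "json"})
    docs = resp.json()["response"]["docs"]
    return [d["url"] for d in docs if "url" in d]   # 'url' is an assumed field


def bing_site_search(query: str, k: int = 3) -> List[str]:
    """Placeholder for a Bing Search API call restricted to answers.yahoo.com."""
    return []


def google_web_search(query: str, k: int = 3) -> List[str]:
    """Placeholder for a Google Search API call over the open web."""
    return []


def retrieve_documents(query: str) -> List[str]:
    pages: List[str] = []
    pages += solr_top_urls(WIKI_SOLR, query)      # top 3 local Wikipedia pages
    pages += (bing_site_search(query)             # top 3 Y!A pages via site search,
              or solr_top_urls(YA_SOLR, query))   # falling back to the local L6 index
    pages += google_web_search(query)             # top 3 pages from the open web
    return pages[:9]                              # at most nine answer-bearing pages
```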
3.3 Candidate Answer Extraction Module (CAEM)

The set of web pages identified by the Document Retrieval Module are downloaded, and each page is passed through the following text processing pipeline. The first step extracts ASCII text from the web page using the html2text library (https://pypi.python.org/pypi/html2text). We refer to the extracted text as a document. Each document is then segmented into passages, where a passage consists of four consecutive sentences, the most popular answer length in the Yahoo! Answers dataset. A sliding span of four consecutive sentences is used to generate the passages. Thus, a document containing five sentences would generate two passages. This approach generates many passages, specifically 1 + (n − 4), where n is the total number of sentences in the document. The passage length of four sentences was chosen based on data: on average, high-quality answers in the L6 dataset contain four sentences. Shorter answer lengths (a single sentence) are common for factoid questions, but the majority of the L6 questions cannot be categorized as purely factoid.

Passages that do not contain any of the query terms, or that contain more than 2 line breaks, more than 10 punctuation marks, or non-printable symbols, are eliminated. Also, passages that are not in English are filtered out. The langdetect library (https://pypi.python.org/pypi/langdetect) is employed for language identification. All the passages that survive the filtering step are considered candidate answers.
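The passage generation and filtering steps translate directly into code. The sketch below follows the thresholds stated above (four-sentence sliding window, so 1 + (n − 4) passages for n sentences; at most 2 line breaks and 10 punctuation marks; English only); the regex-based sentence splitter is our own simplification and not the module's actual implementation.

```python
# Sketch of CAEM passage generation and filtering as described above.
import re
import string
from typing import List

from langdetect import detect   # pip install langdetect


def split_sentences(document: str) -> List[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def generate_passages(document: str, span: int = 4) -> List[str]:
    sentences = split_sentences(document)
    if not sentences:
        return []
    if len(sentences) < span:
        return [" ".join(sentences)]
    # 1 + (n - span) passages for a document with n sentences
    return [" ".join(sentences[i:i + span])
            for i in range(len(sentences) - span + 1)]


def keep_passage(passage: str, query_terms: List[str]) -> bool:
    if not any(t.lower() in passage.lower() for t in query_terms):
        return False                                  # no query term present
    if passage.count("\n") > 2:
        return False                                  # more than 2 line breaks
    if sum(c in string.punctuation for c in passage) > 10:
        return False                                  # more than 10 punctuation marks
    if any((not c.isprintable()) and c not in "\n\t" for c in passage):
        return False                                  # non-printable symbols
    try:
        if detect(passage) != "en":                   # language identification
            return False
    except Exception:
        return False
    return True
```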
3.4 Answer Selection Module (ASM)

In this final step of the QA pipeline, the best answer from all the candidate answers is chosen. We experiment with three algorithms for this task: (i) the Learning To Rank (LeToR) based LambdaMart algorithm (Burges, 2010), (ii) the Neural Network based BLSTM algorithm (Graves and Schmidhuber, 2005), and (iii) a combination approach that employs both LambdaMart and BLSTM.

There is a rich history of LeToR approaches being applied to automated QA (Bilotti et al., 2010; Surdeanu et al., 2011; Agarwal et al., 2012). Following this tradition, for the baseline approach we employ the LambdaMart algorithm to learn a ranking model for scoring the candidate answers, and the highest-scored answer is selected as the final answer. We refer to this answer selection approach as LLTR. A subset of the Webscope Yahoo! Answers L6 dataset is used for training the LLTR model. For many questions in this dataset one of the answers is identified as the best answer. For training LLTR, the best answer is assigned the highest rank label, and the remaining answers are assigned a rank label proportional to their BM25 score with the best answer. The following feature set is computed for each (question, candidate answer) pair: Okapi BM25 score, cosine similarity, number of overlapping terms, number of punctuation marks in the passage, number of words in the answer, number of characters in the answer, query likelihood probability, largest distance between two query terms in the answer, average distance between two terms, number of terms in the longest continuous span, maximum number of terms matched in a single sentence, and maximum number of terms in order. Before computing each of these features, all terms from the query and the candidate answer were stemmed using the Porter stemmer.
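For illustration, the sketch below computes a handful of the listed features for one (question, candidate answer) pair. Corpus-dependent features (Okapi BM25, query likelihood) are omitted, Porter stemming is skipped, and the feature names are our own labels; this is not the paper's feature extractor.

```python
# Hedged sketch of a few LLTR features for a (question, candidate answer) pair.
import math
import re
import string
from collections import Counter
from typing import Dict, List


def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a: List[str], b: List[str]) -> float:
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def pair_features(question: str, answer: str) -> Dict[str, float]:
    q, a = tokens(question), tokens(answer)
    qset = set(q)
    positions = [i for i, t in enumerate(a) if t in qset]
    max_span = (positions[-1] - positions[0]) if len(positions) > 1 else 0.0
    return {
        "cosine_similarity": cosine_similarity(q, a),
        "num_overlapping_terms": float(len(qset & set(a))),
        "num_words_in_answer": float(len(a)),
        "num_chars_in_answer": float(len(answer)),
        "num_punctuation_marks": float(sum(c in string.punctuation for c in answer)),
        "largest_query_term_distance": float(max_span),
    }
```

A LambdaMart ranker (e.g. as provided by standard LeToR toolkits) would then be trained on these feature vectors with the rank labels described above.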
Recurrent Neural Network (RNN) based approaches have recently received a lot of attention from the QA community (Severyn and Moschitti, 2015; Cohen and Croft, 2016; Wang and Nyberg, 2015, 2016). Since careful feature engineering is unnecessary for neural networks, these networks lend themselves very well to the QA problem, where it is difficult to define features that generalize well. In fact, the best performing system (Encoder-Decoder) at the TREC 2016 LiveQA track employed a recurrent neural network based approach. In our work we have employed the Bidirectional Long Short Term Memory (BLSTM) neural network because it adapts well to data with dependency spans of varying length. The bidirectional property of this network allows for tracking of both forward and backward relations in the text. We use a modification of the network architecture implemented in (Wang and Nyberg, 2015). The network consists of several layers: a word embedding layer followed by a BLSTM layer, a dropout layer to reduce overfitting, mean pooling, and a dense layer for the output. The output of the network is a number from 0 to 1 indicating how likely the answer is to match the question. It was trained with the ADAM optimizer, with binary cross-entropy as the target loss function. To train the network, a subset containing 384K question-answer pairs from the Webscope Yahoo! Answers L6 dataset was used.
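The layer stack described above can be written down in a few lines of Keras. The sketch below is only a minimal rendering of that description (embedding, BLSTM, dropout, mean pooling, dense sigmoid, Adam with binary cross-entropy); the vocabulary size and layer widths are arbitrary assumptions, and this is not the exact architecture of (Wang and Nyberg, 2015).

```python
# Minimal Keras sketch of the answer-scoring network described above.
# Vocabulary size and layer widths are arbitrary assumptions.
import tensorflow as tf

VOCAB_SIZE = 50_000    # assumed
EMBED_DIM = 100        # assumed
HIDDEN_DIM = 128       # assumed

model = tf.keras.Sequential([
    # inputs: padded sequences of token ids for a question-answer pair
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN_DIM, return_sequences=True)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.GlobalAveragePooling1D(),   # mean pooling over time steps
    tf.keras.layers.Dense(1, activation="sigmoid"),  # match score in [0, 1]
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(...) would then be run on the 384K question-answer pairs.
```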
The third answer selection approach that we investigate simply combines the above two approaches. The score assigned by BLSTM to each (question, candidate answer) pair is used as an additional feature in the feature set used by the LLTR ranking algorithm.
4 Results and Analysis

We conduct a thorough empirical evaluation of the individual components of our system, and of the end-to-end system. The results of these evaluations are presented in this section.

4.1 Query Formulation Module

Evaluation Data: Recall that the QFM module is tasked with understanding the question and compiling a set of web pages (URLs) that are likely to contain the answer(s) to the question. An evaluation dataset for this task is not readily available. We thus had to re-purpose the annotated datasets that are available for another task, question-answer evaluation. Specifically, we used the TREC LiveQA 2015 and 2016 datasets, which consist of 1000 questions each. For each question there are one or more answers, and each answer is assessed for its relevance to the question by a human and assigned a score between 0 (non-relevant) and 3 (very relevant). The LiveQA 2016 dataset provides the source URL for each answer. For the LiveQA 2015 dataset we had to locate the source URL for each answer, since it is not included in the dataset. This gives us an annotated set of (question, URL, relevance score) tuples which we use to evaluate the effectiveness of QFM.

Baselines: The LiveQA 2015 and 2016 questions consist of three fields: title, body, and category. Since these questions are generated by real users, the question can be either in the title or in the description. Based on this observation we have defined three baselines to compare with the proposed QFM:

1. Original-Q: This baseline tests the minimalist approach where no processing is performed on the question, and the question is used as the query as-is. Here the intuition is that since the questions are authored by humans, no information should be filtered out or added. The title field, along with the body field, is used as the query.

2. QuestionMark-Q: This baseline works with the assumption that humans often add superfluous details to the question, and these details typically occur in sentences that do not end with a question mark. As such, the sentences in the title and body fields that end with a question mark are used as the query.

3. Unigram-Q: This last baseline seeks to filter out unnecessary information further by removing stopwords, and by applying morphological normalization using the Krovetz stemmer, to the sentences ending with a question mark in the title and body fields. The terms that remain are treated as unigrams and compiled into a boolean AND query.

Evaluation Metrics: For the task at hand, the generated query is said to be effective if it can retrieve one or more answer-bearing web pages in the first three ranks, since these pages are mined for candidate answers in the next phase. As such, this is a precision-oriented task, which inspires our choice of evaluation metrics: Precision at ranks 1, 2, 3 and NDCG at ranks 1, 2, 3. The latter models the different grades of relevance that are present in the annotation scores.
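As a reference point, the sketch below shows how Precision@k and NDCG@k can be computed from the graded 0-3 annotation scores of the retrieved URLs. The DCG formulation used here (relevance / log2(rank + 1)) is one standard variant; the paper does not state which exact variant was used, so treat this as an assumption.

```python
# Worked sketch of Precision@k (binary relevance, score > 0) and NDCG@k
# (graded 0-3 scores) for one query's ranked URL list.
import math
from typing import List


def precision_at_k(scores: List[float], k: int) -> float:
    """Fraction of the top-k retrieved URLs that are relevant (score > 0)."""
    return sum(1 for s in scores[:k] if s > 0) / k


def dcg_at_k(scores: List[float], k: int) -> float:
    return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))


def ndcg_at_k(scores: List[float], k: int) -> float:
    ideal = dcg_at_k(sorted(scores, reverse=True), k)
    return dcg_at_k(scores, k) / ideal if ideal > 0 else 0.0


# Example: graded scores of the top-3 URLs returned for one question.
retrieved = [0.0, 2.0, 1.0]
print(precision_at_k(retrieved, 3))        # 0.67 (2 of 3 URLs relevant)
print(round(ndcg_at_k(retrieved, 3), 3))   # ~0.67
```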
Results & Analysis: Table 1 reports the results for the three baselines and QFM with and without site search of Yahoo! Answers. It should be noted that URLs for which human scores are not available are considered non-relevant in this assessment. This is an important point that leads to the overall low values seen in Table 1.

The prominent trend in these results is that QFM outperforms all the baselines substantially across all the metrics. The results for the Original-Q baseline demonstrate that using the user question as-is often does not lead to relevant web pages. This justifies the need for a Query Formulation Module.
Table 1: Results of QFM

                  Precision@1  Precision@2  Precision@3  NDCG@1  NDCG@2  NDCG@3
TREC 2015
  Original-Q        0.030        0.027        0.020      0.048   0.056   0.048
  QuestionMark-Q    0.046        0.037        0.027      0.078   0.083   0.069
  Unigram-Q         0.032        0.028        0.024      0.062   0.064   0.062
  QFM               0.073        0.064        0.055      0.159   0.143   0.125
TREC 2016
  Original-Q        0.043        0.035        0.029      0.044   0.057   0.065
  QuestionMark-Q    0.064        0.046        0.041      0.069   0.072   0.090
  Unigram-Q         0.055        0.043        0.035      0.058   0.064   0.078
  QFM               0.106        0.083        0.063      0.179   0.165   0.158
Table 2: Results of ASM

                   NDCG   MAP@2  MAP@3  MAP@4  MRR@2  MRR@3  MRR@4
TREC 2015
  Encoder-Decoder  0.635  0.512  0.339  0.166  0.565  0.367  0.178
  LLTR             0.622  0.484  0.316  0.155  0.549  0.352  0.156
  BLSTM            0.656  0.546  0.347  0.174  0.587  0.379  0.205
  LLTR + BLSTM     0.660  0.550  0.349  0.176  0.590  0.381  0.206
TREC 2016
  LLTR             0.648  0.512  0.346  0.217  0.621  0.381  0.241
  BLSTM            0.671  0.559  0.379  0.254  0.648  0.403  0.288
  LLTR + BLSTM     0.675  0.567  0.384  0.257  0.650  0.399  0.293
The QuestionMark-Q results show that filtering out the sentences without a question mark improves performance by reducing unnecessary information and noise. But filtering out stopwords and losing the grammatical structure of the question completely is detrimental, as demonstrated by the Unigram-Q results. As described in Section 3.1, QFM attempts to balance two conflicting objectives: (i) distill the question down to a bare minimal query, and (ii) keep the necessary units of the question intact. These results indicate that this intuition behind QFM is correct.

4.2 Answer Selection Module

Evaluation data: To evaluate the performance of the ASM, we use the questions from the TREC 2015 LiveQA evaluation set. Since the answers are rated by judges, it is possible to check how good our system is at ordering these answers.

Baselines: For the baseline, we use the results obtained by the Encoder-Decoder system, which was the best performing system in 2016. They tested their version of the ASM on 2015 data, and therefore we can compare their system to ours on this dataset (Wang and Nyberg, 2016). The performance of their ASM module on 2016 data is not available; we only have access to the results of the overall system performance, which we compare to ours later.

Evaluation Metrics: The effectiveness of the system was evaluated using standard evaluation metrics: NDCG (Normalized Discounted Cumulative Gain), MAP (Mean Average Precision) at rank X, and MRR (Mean Reciprocal Rank) at rank X.
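For reference, the sketch below computes MRR and MAP over ranked lists of graded answers after binarizing the 0-3 grades at a threshold X. Reading the paper's MAP@X / MRR@X as such a grade threshold (an answer counts as relevant if its grade is at least X) is our interpretation, consistent with how the results are discussed below, and not a definition given explicitly in the paper.

```python
# Sketch of MRR and MAP over ranked answer lists with graded (0-3) judgments,
# binarized at a grade threshold X; the threshold reading of "@X" is an
# assumption on our part.
from typing import Callable, List


def reciprocal_rank(grades: List[int], threshold: int) -> float:
    for rank, grade in enumerate(grades, start=1):
        if grade >= threshold:
            return 1.0 / rank
    return 0.0


def average_precision(grades: List[int], threshold: int) -> float:
    hits, score = 0, 0.0
    for rank, grade in enumerate(grades, start=1):
        if grade >= threshold:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0


def mean_over_questions(ranked_lists: List[List[int]],
                        metric: Callable[[List[int], int], float],
                        threshold: int) -> float:
    return sum(metric(g, threshold) for g in ranked_lists) / len(ranked_lists)


# Example: two questions, each with a ranked list of candidate-answer grades.
lists = [[1, 3, 0, 2], [0, 0, 2, 1]]
print(mean_over_questions(lists, reciprocal_rank, threshold=2))    # MRR at grade 2+
print(mean_over_questions(lists, average_precision, threshold=2))  # MAP at grade 2+
```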
Results & Analysis: Table 2 provides the results of the evaluation of the ASM. LLTR is less effective than the state-of-the-art approach across all the metrics. However, the neural network based approach, BLSTM, performs substantially better than Encoder-Decoder and LLTR for both datasets. The results for LLTR+BLSTM illustrate that the two approaches have complementary strengths that can be combined to obtain the best results for the task. The difference between LLTR and LLTR+BLSTM is statistically significant.
The results for MAP show that in 55% of the cases our system selects an answer of at least fair quality (2+), and in 21% of the cases the quality is excellent. The performance on TREC 2016 is better than on TREC 2015. For reference, Table 2 provides the performance of the TREC 2016 LiveQA winning system, Encoder-Decoder (Wang and Nyberg, 2016), but our system outperforms it. Interestingly, the simple LLTR, which relies only on statistical text features, is very close in performance to the Encoder-Decoder neural network approach. This implies that LLTR is a powerful method given the right selection of features. We believe that the quality of the model can be improved by sanitizing the training dataset. Currently, the two main problems are: (i) the presence of misspelled words, which makes the computation of statistical features imprecise; and (ii) the quality of the best answers manually selected by voters. There exist a few approaches to diminish the impact of both issues, such as (Chen et al., 2007) for misspellings and (Agichtein et al., 2008) for keeping only high-quality answers.

4.3 End-to-end System Performance

Evaluation data: For the end-to-end system evaluation, the existing datasets cannot be reused because the system-generated answers can be different from the previously annotated answers. Thus, we undertook the task of manually assessing the answers generated by our system. Each answer was rated on the same scale as in the TREC LiveQA competition, which is 0 (non-relevant) through 3 (very relevant). We selected at random 100 questions from LiveQA 2015 and 100 from LiveQA 2016, and assessed their respective answers on this 4-point scale.

Baselines: We compare our results with those produced by the winning systems at TREC LiveQA 2015 and 2016: the Open Advancement of Question Answering (OAQA) system (Wang and Nyberg, 2015) and the Encoder-Decoder system (Wang and Nyberg, 2016), respectively. Both systems are similar: they use Yahoo! Answers and Bing Web Search for candidate answer retrieval. The only difference between them is the strategy for best answer selection. The former system uses a type of BLSTM while the latter uses an encoder-decoder recurrent neural network model.

Evaluation Metrics: We use the official metrics of the TREC LiveQA track. succ@X+ is the fraction of answers with grade at least X, where the grade is the annotation score + 1. avgScore is the average score of the answers produced by a system, where answer scores range from 0 to 3.
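These two metrics are simple to reproduce; the sketch below follows the definition given above (grade = annotation score + 1), which is our reading of the TREC LiveQA convention.

```python
# Worked sketch of the end-to-end metrics described above: avgScore averages
# the 0-3 annotation scores, and succ@X+ is the fraction of answers whose
# grade (annotation score + 1) is at least X.
from typing import List


def avg_score(scores: List[int]) -> float:
    """Average annotation score (0-3) over all evaluated questions."""
    return sum(scores) / len(scores)


def succ_at(scores: List[int], x: int) -> float:
    """Fraction of answers with grade >= x, where grade = annotation score + 1."""
    return sum(1 for s in scores if s + 1 >= x) / len(scores)


scores = [0, 2, 3, 1, 2]        # annotation scores for five hypothetical questions
print(avg_score(scores))        # 1.6
print(succ_at(scores, 2))       # 0.8: answers rated at least "fair"
print(succ_at(scores, 4))       # 0.2: answers rated "excellent"
```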
Results & Analysis: The results are presented in Table 3. Our system performed substantially better than the state-of-the-art for both 2015 and 2016. This is especially prominent at the higher score levels, indicating that our system is able to generate better quality results. For 2015, the system produced answers that could be considered fair for 53% of the questions, good for 40%, and excellent for 25%. For 2016, 67% of the answers have a fair grade, 45% are good, and 23% are excellent. The average scores, although higher than those of the state-of-the-art, indicate that there is plenty of room for improvement.

On average, the system generates an answer for a submitted question in less than half a minute on a non-specialized commodity computer. The Query Formulation Module transforms the question into a query in 3 seconds, the Document Retrieval Module and the Candidate Answer Extraction Module together take 8 seconds to generate all the candidate answers, and the last module, Answer Selection, needs 15 seconds to select the best answer, which amounts to 26 seconds per query on average.

Table 4 presents examples of user questions, the corresponding queries, and the answers generated by the SFS-QA system. The first example includes an instance of query expansion using MetaMap: the scientific name for bats, Chiroptera, is sourced from MetaMap. The query generated for the second question shows an example of a verb phrase and an adjective phrase. Other queries also show examples of noun phrases. In general, these examples illustrate how the QFM module retains the important bigrams and phrases while reducing the superfluous terms. The generated answers, more often than not, provide the necessary information, at least partially.
Table 3: Overall Results

                   avgScore(0-3)  succ@2+  succ@3+  succ@4+
TREC 2015
  SFS-QA               1.180       0.530    0.400    0.250
  OAQA                 1.081       0.532    0.359    0.190
TREC 2016
  SFS-QA               1.350       0.670    0.450    0.230
  Encoder-Decoder      1.155       0.561    0.395    0.199
5 Conclusions

We presented an automated question answering system that accepts questions in natural language and responds with a focused answer in less than half a minute. With a thorough empirical evaluation we demonstrated that a light-weight question-to-query transformation module can be developed that is also highly effective. We also illustrated that various existing information sources can be leveraged to obtain answer-bearing web pages. Finally, we established that advances in deep learning can be utilized to select the best answer for the question.

References

Arvind Agarwal, Hema Raghavan, Karthik Subbian, Prem Melville, Richard D. Lawrence, David C. Gondek, and James Fan. 2012. Learning to rank for robust question answering. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, pages 833-842.

Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. 2008. Finding high-quality content in social media. In Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM, pages 183-194.

Eugene Agichtein et al. 2015. Overview of the TREC 2015 LiveQA track. In Proceedings of TREC.

Alan R. Aronson and François-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17(3):229-236.

Matthew W. Bilotti, Jonathan Elsas, Jaime Carbonell, and Eric Nyberg. 2010. Rank learning for factoid question answering with linguistic and semantic constraints. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, pages 459-468.

Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning 11(23-581):81.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP 2014.

Qing Chen, Mu Li, and Ming Zhou. 2007. Improving query spelling correction using web search results. In EMNLP-CoNLL, volume 7, pages 181-189.

Daniel Cohen and W. Bruce Croft. 2016. End to end long short term memory networks for non-factoid question answering. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. ACM, pages 143-146.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5):602-610.

J. H. Oh et al. 2012. Why question answering using sentiment analysis and word classes.

Yuval Pinter and Roi Reichart. 2016. Syntactic parsing of web queries with question intent. In Proceedings of NAACL-HLT 2016, pages 670-680.

Qiaoling Liu et al. 2016. When web search fails, searchers become askers: Understanding the transition. In Proceedings of SIGIR 2012.

R. Higashinaka and H. Isozaki. ????

Radu Soricut and Eric Brill. 2006. Automatic question answering using the web: Beyond the factoid.

Denis Savenkov and Eugene Agichtein. 2016. When a knowledge base is not enough: Question answering over knowledge bases with external text data. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.

Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pages 373-382.

Anna Shtok and Idan Szpektor. 2012. Learning from the past: Answering new questions with past answers. In Proceedings of the 21st International Conference on World Wide Web, pages 759-768.

Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2011. Learning to rank answers to non-factoid questions from web collections. Computational Linguistics 37(2):351-383.

Stalin Varanasi and Günter Neumann. 2015. Question/answer matching for Yahoo! Answers using a corpus-based extracted ngram-based mapping. MLT-Lab, DFKI, 2015.
Di Wang and Eric Nyberg. 2015. Discovering the right answer with clues. In Proceedings of TREC 2015.

Di Wang and Eric Nyberg. 2016. An attentional neural encoder-decoder approach for answer ranking. In Proceedings of TREC 2016.

Table 4: Examples of Queries and Answers Returned by the System

Question: Bat ran into me, should I be afraid of rabies?
Query: (afraid) AND (bat OR Chiroptera) AND (Rabies)
Answer: Bats can carry rabies. Never try and retrieve a sick looking animal. The fecal material from bats can also accumulate and harbor histoplasmosis fungus spores which may cause blindness, and pneumonia like symptoms. Source(s): 23 years in pest control

Question: Is waterproof eyeliner necessary for traveling in hot/humid areas?
Query: (necessary) AND (travel hot humid areas) AND (waterproof eyeliner)
Answer: For a softer look, we found that Stila's twist-up Smudge Stick provides the best pencil eyeliner experience, from the flexibility of its application, as it goes on the softest but sets the firmest, to its remarkable staying power. [The Stila Smudge Stick Waterproof liner] was easiest to use of all the eyeliners tested. Once applied and given a few seconds, it sets firmly and was among the hardest to budge for our testers. The Stila Smudge Stick Waterproof Eyeliner is an extra-soft, twist-up, thin mechanical pencil that deposits a neat matte line onto the skin.

Question: Can I rent a car in Verona, Italy if I'm 18 years old?
Query: (18 years old) AND (rent car) AND (Verona Italy)
Answer: Usually, car rental companies will only allow renters 21 and over. One thing you might check is to see if the company you work for has a corporate account with any car rental companies, where they have contracted a lower minimum age for their employees. Lori J * 8 years ago Thumbs up

Question: Do you need an exotic animal permit to own a peacock? Im asking because im thinking of raising them when i move to my own house also wondering if they do well around chickens
Query: (exotic animal) AND (peacock) AND (permit)
Answer: Exotic fowl refers to any avian species that is not indigenous to this state, including ratites (emu, ostrich, rhea, cassowary, etc.). There are no state bag or possession limits or closed seasons on exotic animals or fowl on private property. It is against the law to: * Hunt an exotic without a valid hunting license. Thumbs up

Question: Which coffee shops in the netherlands has the best weed?
Query: (best weed) AND (coffee shops) AND (netherlands)
Answer: those with tobacco mixed with cannabis, and have made customers smoke in upstairs or downstairs rooms. Notable coffeeshops[edit] * Checkpoint coffeeshop * Mellow Yellow coffeeshop