=Paper=
{{Paper
|id=Vol-3052/short23
|storemode=property
|title=Powering COVID-19 Community Q&A with Curated Side Information
|pdfUrl=https://ceur-ws.org/Vol-3052/short23.pdf
|volume=Vol-3052
|authors=Manisha Verma,,Kapil Thadani,,Shaunak Mishra
|dblpUrl=https://dblp.org/rec/conf/cikm/VermaTM21
}}
==Powering COVID-19 Community Q&A with Curated Side Information==
Powering COVID-19 Community Q&A with Curated Side Information

Manisha Verma, Kapil Thadani and Shaunak Mishra (Yahoo! Research NYC)
KINN 2021: Workshop on Knowledge Injection in Neural Networks, November 2021
Contact: manishav@yahooinc.com (M. Verma); thadani@yahooinc.com (K. Thadani); shaunakm@yahooinc.com (S. Mishra)

Abstract

Community question answering and discussion platforms such as Reddit, Yahoo! Answers or Quora give users the flexibility to ask open-ended questions of a large audience, and replies to such questions may be useful both to the asker and to the community on topics such as health, sports or finance. Following the recent events around COVID-19, some of these platforms have attracted 2000+ user questions about several aspects of the disease. Given the impact of the disease on the general public, in this work we investigate ways to improve the ranking of user-generated answers about COVID-19. We specifically explore the utility of external, technical sources of side information (such as CDC guidelines or WHO FAQs) for improving answer ranking on such platforms. We find that ranking user answers based on question-answer similarity alone is not sufficient, and that existing models cannot effectively exploit external (side) information. We demonstrate the effectiveness of attention-based neural models that can directly exploit side information available in technical documents or verified forums (e.g., research publications on COVID-19 or the WHO website). Augmented with a temperature mechanism, these attention-based models can selectively determine the relevance of side information to a given user question while ranking answers.

Keywords: question answering, deep learning, knowledge injection, NLP

1. Introduction

Question answering systems are key to finding relevant and timely information about many issues. Community question answering (cQ&A) platforms such as Reddit, Yahoo! Answers or Quora have been used to ask questions on a wide range of topics. Most of these platforms let users ask, answer, vote or comment on questions posted on the platform. Question answering platforms are useful not only for gathering public opinions or votes in areas such as entertainment or sports; they can also serve as information hot-spots for more sensitive topics such as health, injuries or legal matters. It is therefore imperative that, when a user visits content on sensitive topics, answer ranking also takes into account curated side information from reliable (external) sources.

Most prior work on cQ&A has focused on incorporating question-answer similarity [1, 2], user reputation [3, 4, 5], multi-modal content [6], community interaction features [7] associated with answers, or the question answering network itself [8]. There is, however, very limited work on incorporating curated content from external sources. Existing work only exploits knowledge bases [6], which consist of different entities and the relationships between them, to score answers. Knowledge bases have limitations that make them difficult to use for community Q&A on rapidly evolving topics such as disease outbreaks (e.g., ebola, COVID-19), wild-fires or earthquakes. Firstly, knowledge bases contain information about established entities and do not evolve rapidly to incorporate new information, which makes them unreliable for novel disease outbreaks such as COVID-19, where information changes quickly and its verification is time sensitive. Secondly, it may be hard to determine what even constitutes an entity as new information about a topic arrives. To overcome these limitations, we posit that external curated free-text or semi-structured informational sources can also be used effectively for cQ&A tasks.
In this work, we demonstrate that free-text or semi-structured external information sources such as the CDC (https://www.cdc.gov/), WHO (https://www.who.int/) or NHS (https://www.nhs.uk/) can be very useful for ranking answers on community Q&A platforms, since they contain frequently updated information about topics such as ongoing disease outbreaks, vaccines, or resources on other subjects such as surgeries, birth control, or historical numerical data about diseases across the world.

We argue that for sensitive topics such as COVID-19, it is useful to use publicly available vetted information to improve ranking systems. In this work, we explore the utility of publicly available information for ranking answers to questions associated with COVID-19. We specifically focus on ranking answers for questions in two publicly available primary Q&A datasets: a) Yahoo! Answers (https://answers.yahoo.com/) and b) a recently released annotated Q&A dataset [9], in the presence of two external semi-structured curated sources: a) TREC-COVID [10] and b) WHO questions and answers on COVID-19 (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub). We explore deep learning models with attention to improve upon existing state-of-the-art systems.

More specifically, we propose a temperature-regulated attention mechanism to rank answers in the presence of external (side) information. Our experiments on 10K+ questions from both source datasets on COVID-19 show that our models can improve ranking quality by a significant margin over question-answer matching baselines when external information is available. Figure 1 illustrates the overall design of our system. We use an attention-based neural architecture with temperature to automatically determine which components of the external information are useful for ranking user answers with respect to a question. Evaluated with three metrics, precision and recall for correct answer retrieval improve by ~17% and ~9% respectively on the two source datasets over several other cQ&A models.

Figure 1: Illustrative example of COVID-19 community answer ranking powered by side information in the form of research papers and information from verified sources (such as CDC, WHO, and NHS).
2. Related work

Community question answering (cQ&A) is a well researched sub-field in both the information retrieval and NLP communities. Several systems have been proposed to rank user-submitted answers to questions on community platforms such as Yahoo! Answers, Reddit and Quora. The primary method is to determine the relevance of an answer to the input question, and text-based matching is one of the most common approaches. Researchers have used several methods to compute the similarity between a question and user-generated answers to determine relevance. For instance, feature-based question-answer matching is used in [1], with 17 features extracted from unigrams, bigrams and web correlation features derived from unstructured user search logs. It is worth noting that incorporating user and community features may yield further improvements for such models, but this is not the focus of our work. The authors of [1] used questions extracted from Yahoo! Answers for their experiments.

Researchers have also used different representation learning approaches; for instance, the authors of [11, 12] use LSTMs to represent questions and answers respectively. Convolutional networks have been used in [3, 13] to rank answers. Other approaches such as doc2vec [14], tree kernels [15], adversarial learning [16], attention [17, 6, 11, 18] and deep belief networks [19] have been used to score question-answer pairs. There have also been studies exploring community, user interaction or question-based features [3, 4, 5, 7] to rank answers. While these approaches are relevant, it is not always evident how external information in free-text or semi-structured format can be incorporated into such systems. We use several question-answer matching approaches as baselines in this work and show that for rapidly evolving topics such as COVID-19, including external curated information can boost model performance.

The line of work most closely related to ours is the incorporation of knowledge bases into Q&A systems. Existing work [20, 21, 6], however, addresses different tasks. For instance, the authors of [21, 20] focus on finding factual answers to questions using a knowledge base; this does not extend easily to cQ&A, where neither the questions nor the answers may request or refer to any facts. The most recent work is [6], which incorporates a medical knowledge base for ranking answers on medical Q&A platforms by learning path-based representations of entities (from the KB) present in the questions and answers posted by users. This approach relies on reliable detection of entities, which may be absent for emerging topics such as the COVID-19 pandemic. Another limitation is that external knowledge may not always be present in a structured format; CDC guidelines, for example, are usually simple question-answer pairs posted on the website. This makes it difficult to apply their approach to our problem. The approach proposed in this work instead incorporates semi-structured information directly with the help of temperature-regulated attention.

Finally, with the rise of COVID-19, researchers across disciplines are actively publishing information and datasets to share understanding of the virus and its impact on people. Researchers routinely organize dedicated challenges such as SemEval [22] with tasks such as ranking answers on Q&A forums. One such initiative is the TREC-COVID track [10], which released queries, documents and manual relevance judgements to power search for COVID-related information (https://ir.nist.gov/covidSubmit/data.html). The authors of [23] also released a COVID-19 QA dataset with 100+ question-answer pairs extracted from the TREC-COVID initiative; these pairs are not user-generated content and hence do not reflect real user questions. We also rely on the recently released Q&A dataset from [9] for our task, and we additionally compile a dataset of 2000+ COVID-19 questions with 10K+ answers, all submitted by users on Yahoo! Answers.
3. Method

3.1. Problem formulation

In this work, we focus on ranking answers for n questions q_1, ..., q_n related to an emerging topic such as COVID-19. Each q_i is associated with a set of two or more answers A_i = {a_ij : j >= 2} and corresponding labels Y_i = {y_ij : j >= 2} representing answer relevance. We use a binary indicator for relevance, y_ij ∈ {0, 1}, where relevance judgments (e.g., favorite, upvoted) are provided by users.

We attempt to model the relevance of each answer a_ij to its corresponding question using an external source which may contain free-text or semi-structured information. For example, the external source could consist of information-seeking queries or questions eq_1, ..., eq_m related to a topic, with each eq_k linked to a set of relevant scientific articles or answers ED_k, where each answer/document ed_1, ..., ed_p may be judged for relevance by human judges [10] or other experts. We hypothesize that this semi-structured or free-text information may be valuable in identifying user answer quality for certain kinds of questions, although not all. We investigate this with our model, which aims to recover the true labels y_ij for each user answer a_ij ∈ A_i given its question q_i, category information, and information from the external source ⟨eq_k, ED_k⟩_{k=1}^{m}.
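To make the formulation concrete, the following is a minimal Python sketch (our own illustration; the class and field names are not from the paper) of how the primary Q&A data and a two-segment external source could be represented:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QAItem:
    """One community question q_i with its user answers a_ij and binary labels y_ij."""
    question: str
    answers: List[str]   # at least two answers per question
    labels: List[int]    # y_ij in {0, 1}, e.g. 1 for the favorite/upvoted answer

@dataclass
class ExternalEntry:
    """One external query/question eq_k with its linked documents or answers ED_k."""
    query: str           # e.g. a TREC-COVID query or a WHO FAQ question
    documents: List[str] # judged scientific articles, or the FAQ answer text

# A primary dataset is a list of QAItem; an external source is a list of ExternalEntry.
primary: List[QAItem] = [
    QAItem(
        question="Can corona live on cardboard?",
        answers=["On cardboard, it can live up to 24 hours (1 day) ...",
                 "The risk is quite low for one to become infected through mail ..."],
        labels=[1, 0],
    )
]
external: List[ExternalEntry] = [
    ExternalEntry(query="How long does the coronavirus survive on surfaces?",  # illustrative only
                  documents=["<judged CORD-19 abstract>"])
]
```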
3.2. Proposed Model

Figure 2: External source augmentation model.

We explore a token-level matching mechanism to determine the relevance of information in the external source that may inform the label prediction task. Our model (τ-att) aims to match a given user question with all of its submitted answers in the presence of external information about the same domain. The question q_i, an answer a_ij and additional metadata are encoded into d-dimensional vectors using a text encoder f_input. We use an LSTM-based encoder for both the question and the answer in the primary source, which can handle input sequences of variable length.

Question encoding: Each word w^q_i in a question is represented as a K-dimensional vector using pre-trained word embeddings. The LSTM takes each token embedding as input and updates the hidden state h^q_i based on the previous state h^q_{i-1}. Finally, the hidden state is passed through a feed-forward layer of smaller dimension F < K to compress the question encoding:

h^q_i = \mathrm{LSTM}(h^q_{i-1}, w^q_i), \quad f^q_i = \mathrm{ReLU}(h^q_i W_q + b_q)   (1)

Answer encoding: Each word w^a_j in the answer is likewise represented as a K-dimensional vector using pre-trained word embeddings. The LSTM takes each token embedding as input and updates the hidden state h^a_j. We also reduce the dimension of the answer encoding with a feed-forward layer of dimension F < K:

h^a_j = \mathrm{LSTM}(h^a_{j-1}, w^a_j), \quad f^a_j = \mathrm{ReLU}(h^a_j W_a + b_a)   (2)

We concatenate the question and answer representations for further processing:

f_{ij} = [f^q_i ; f^a_j]   (3)

External source encoding: External sources of information vary from task to task, so we encode each segment of data individually. For instance, if the source has two segments (e.g., question/answer or query/document), our system encodes both segments separately, using the same encoding architecture as for the primary-source question/answer encoding above. The encoding for a two-segment external source is:

h^{eq}_t = \mathrm{LSTM}(h^{eq}_{t-1}, w^{eq}_t), \quad f^{eq}_t = \mathrm{ReLU}(h^{eq}_t W_{eq} + b_{eq})
h^{ed}_t = \mathrm{LSTM}(h^{ed}_{t-1}, w^{ed}_t), \quad f^{ed}_t = \mathrm{ReLU}(h^{ed}_t W_{ed} + b_{ed})   (4)
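As an illustration of Eqs. (1)-(3), here is a minimal PyTorch sketch of an LSTM encoder with a ReLU feed-forward compression layer. The paper does not specify a framework, so the framework choice, module names and dimensions are our assumptions:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """LSTM over pre-trained token embeddings, followed by a ReLU feed-forward
    layer that compresses the final hidden state to dimension F < K (Eqs. 1-2)."""
    def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 64, out_dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # would be initialized from GloVe in practice
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.ff = nn.Linear(hidden, out_dim)           # compression to F < K

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> encoding: (batch, out_dim)
        embedded = self.emb(token_ids)
        _, (h_n, _) = self.lstm(embedded)              # final hidden state of the LSTM
        return torch.relu(self.ff(h_n[-1]))

# Separate encoders for questions and answers; their outputs are concatenated (Eq. 3).
q_enc, a_enc = TextEncoder(30000), TextEncoder(30000)
q_ids = torch.randint(0, 30000, (4, 12))               # toy batch: 4 questions, 12 tokens each
a_ids = torch.randint(0, 30000, (4, 40))               # 4 answers, 40 tokens each
f_ij = torch.cat([q_enc(q_ids), a_enc(a_ids)], dim=-1) # shape (4, 64)
```

The same encoder architecture would also be applied to each segment of the external source (Eq. 4).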
We incorporate the external source encoding with a temperature (τ) based variant of scaled dot-product attention, which provides a straightforward conditioning approach over a set of query-document pairs. The question-answer encoding vector f_{ij} serves as a query over the keys f^{eq}_t. If two segments are present in the external source, such as query/document, the model uses the attention weights computed over the first segment (e.g., the query) to determine the importance of the second segment (e.g., the document). It is easy to extend this framework to external sources with multiple segments. The two-segment attention is:

z_{it} = \frac{f_{ij}^{\top} f^{eq}_t}{\sqrt{d}}, \quad
\alpha_{it} = \frac{e^{z_{it}/\tau}}{\sum_l e^{z_{il}/\tau}}, \quad
s'_i = \sum_t \alpha_{it} f^{ed}_t   (5)

To summarize, temperature (τ) based attention helps determine the relevance of each f^{ed}_t (through its corresponding f^{eq}_t) with respect to the question encoding, and the temperature parameter controls the uniformity of the attention weights α_{it}. Finally, labels are predicted using a multi-layer perceptron over the input vector f_{ij} and the learned weighted average of the side information s'_i:

\hat{y}_{ij} = F_{\mathrm{output}}([f_{ij} ; s'_i])

where F_output uses a sigmoid activation function; we use binary cross-entropy loss to train the proposed model. Since community questions may often be entirely unrelated to the external sources, a key aspect of this approach is determining whether the external source is useful at all, not merely attending to its most relevant entries. The temperature-based attention mechanism is useful for controlling which external source entries influence the prediction for a given user question. It is worth noting that the value of the temperature τ has to be tuned experimentally so that ranking performance improves.
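The temperature-scaled attention of Eq. (5) and the sigmoid output layer can be sketched as follows. This is again a hedged PyTorch illustration consistent with the equations, not the authors' implementation; tensor shapes and names are assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tau_attention(f_ij: torch.Tensor,   # (batch, d)  concatenated question-answer encoding
                  ext_q: torch.Tensor,  # (m, d)      encoded external queries  f^eq_t
                  ext_d: torch.Tensor,  # (m, d)      encoded external documents f^ed_t
                  tau: float = 0.4) -> torch.Tensor:
    """Temperature-scaled dot-product attention over external (query, document) pairs (Eq. 5)."""
    d = f_ij.size(-1)
    z = (f_ij @ ext_q.t()) / (d ** 0.5)  # (batch, m) scaled dot-product scores z_it
    alpha = F.softmax(z / tau, dim=-1)   # temperature tau controls how uniform the weights are
    return alpha @ ext_d                 # weighted average of document encodings, s'

class OutputLayer(nn.Module):
    """MLP with sigmoid over [f_ij ; s'], trained with binary cross-entropy."""
    def __init__(self, d: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, f_ij: torch.Tensor, side: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(torch.cat([f_ij, side], dim=-1))).squeeze(-1)

# Toy usage with random encodings (4 question-answer pairs, 50 external entries):
f_ij  = torch.randn(4, 64)
ext_q = torch.randn(50, 64)
ext_d = torch.randn(50, 64)
y_hat = OutputLayer(64)(f_ij, tau_attention(f_ij, ext_q, ext_d, tau=0.4))
loss  = F.binary_cross_entropy(y_hat, torch.tensor([1.0, 0.0, 0.0, 1.0]))
```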
4. Experimental Setup

Given the model architecture, in this section we provide a detailed overview of the datasets, metrics and baselines used in our experiments.

4.1. Data

We compiled two question answering datasets: the first was collected from Yahoo! Answers and the second was recently released in [9]; both contain questions raised by real users. In this work we focus specifically on questions associated with COVID-19. Statistics for the train and test splits of both Q&A datasets are given in Table 2, and a pair of relevant and non-relevant answers for a question from each dataset is shown in Table 1. More details are given below.

Table 1: Sample relevant and non-relevant answers from both sources.
- Yahoo! Ans. Question: "I am really scared to go places for St. Patrick's day because of the coronavirus. What do I do?" Relevant answer: "Unfortunately, there's not enough people that care and will still go out and party despite the coronavirus epidemic. I'm proud of you in that you're taking extra precautions ... Good for you!" Non-relevant answer: "Stop being scared of viruses. What's the problem?"
- Infobot. Question: "Can corona live on cardboard?" Relevant answer: "A recent study shows that the virus can live in the air ... On cardboard, it can live up to 24 hours (1 day) because..." Non-relevant answer: "The risk is quite low for one to become infected with COVID19 through mail/packages - especially (over a period of a few days/weeks)."

Yahoo! Dataset: We crawled COVID-19 related questions from Yahoo! Answers (https://answers.search.yahoo.com/search?p=coronavirus) using several keywords such as 'coronavirus', 'covid-19', 'covid', 'sars-cov2' and 'corona virus' over the period Jan 2020 to July 2020, to ensure we gathered all relevant questions for our experiments. We keep only questions with two or more answers. In total, we obtained 1880 questions with 11500 answers. We used favorite answers as positive labels (similar to previous work [1]), assuming that users over time rate the answers (with upvotes/downvotes) that are most relevant to the submitted question. We normalized the question and answer text by removing a small list of stop words, numbers, links and symbols. Figures 3a and 3b show the distributions of question and answer lengths respectively. Questions contain 12.7 ± 5.8 words (qwords) and answers 36.3 ± 93.5 words (awords, mean±std), which indicates that user-submitted answers vary widely on Yahoo! Answers. On average, a question has about 6 answers (ans/q) in the Yahoo! Ans dataset. We split the data into three sets: train (64%, 1196 questions, 7435 answers), validation (16%, 298 questions, 1858 answers) and test (20%, 374 questions, 2310 answers), where the questions for each set were uniformly sampled.

Infobot Dataset [9]: Researchers at JHU [9] recently compiled a list of user-submitted questions from different platforms and manually labeled 22K+ question-answer pairs. We cleaned this set by removing questions with fewer than two answers or no relevant answers. In total, our dataset contains 8000+ question-answer pairs, where each question may have multiple relevant answers, unlike the Yahoo! Answers dataset. Figures 3c and 3d show the distributions of question and answer lengths respectively.

Figure 3: Token distribution in different sources. Panels: (a) Yahoo! question length, (b) Yahoo! answer length, (c) Infobot question length, (d) Infobot answer length.

Table 2: Train and test data from primary sources.
Stat             Yahoo! Ans     Infobot
Train Q-A        9341           6354
Train ans/q      6.25±2.9       4.40±0.77
Train #qwords    12.71±5.8      6.55±3.93
Train #awords    36.31±93.59    92.17±59.27
Test Q-A         2232           1592
Test ans/q       5.96±2.87      4.41±0.76
Test #qwords     13.07±5.89     6.21±2.94
Test #awords     35.64±80.31    92.39±59.47

4.1.1. External sources

We use two external datasets to rank answers. Details of each are given below.

TREC-COVID [10]: We use the recently released TREC-COVID track data with 50 queries, which also contain manually drafted query descriptions and narratives. Expert judges have labeled over 5000 scientific documents for these 50 queries from the CORD-19 dataset (https://www.semanticscholar.org/cord19); these documents contain coronavirus-related research. Since the documents are scientific literature, we initialize document embeddings using SPECTER [24].

WHO: We use data released on the WHO question and answer hub (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub) to create a list of question-answer pairs. There are 147 question-answer pairs in this dataset, where questions contain 13.28±5.36 words and answers contain 133.2±100.9 words.

4.2. Baselines

We evaluated our model against the following baselines.

Random: An answer is chosen at random as relevant for a user question. This provides a lower bound on retrieval performance.

Linear attention (att): When τ = 1.0, our model defaults to simple linear attention over all the information present in the external sources. This indicates how well the model performs when it is forced to look at all the information in the external source.

Linear combination (λ-sim): We linearly combine the similarity between the Yahoo! question and answer with the similarity between the answer and the TREC queries:

\lambda\text{-sim} = \lambda \cos(ya, yq) + (1 - \lambda) \max_{tq} \cos(ya, tq)   (6)

where ya, yq and tq are the Yahoo! answer embedding, the Yahoo! question embedding, and the concatenated TREC query, narrative and description embedding respectively. This is a cruder version of temperature attention, where λ controls the contribution of each component directly; we vary λ to determine the optimal combination. Question-answer similarity (qasim) is the similarity between the question and answer embeddings alone, i.e., λ = 1. Both question and answer embeddings are obtained by averaging their individual token embeddings.

BERT Q&A (bert): Large-scale pre-trained transformers [25] are widely popular for NLP tasks, and BERT-like models have shown effectiveness on Q&A datasets such as SQuAD (https://rajpurkar.github.io/SQuAD-explorer/). We fine-tune the BERT base model with two different answer lengths: a) 128 tokens (bsl-128) and b) 256 tokens (bsl-256). The intuition is that large-scale pre-trained models are adept at language understanding and can be fine-tuned for new tasks with a small number of samples. We fine-tune BERT for both the Yahoo! Ans and Infobot datasets. It is non-trivial to include external information in BERT, and we leave this for future work.
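For reference, the λ-sim baseline of Eq. (6) above, and its qasim special case, can be sketched with averaged token embeddings as below. This is our own minimal NumPy illustration, not the authors' code:

```python
import numpy as np
from typing import Dict, List

def cos(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def avg_embedding(tokens: List[str], glove: Dict[str, np.ndarray], dim: int = 100) -> np.ndarray:
    """Average of pre-trained token vectors; zero vector if nothing is in vocabulary."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def lambda_sim(ya: np.ndarray, yq: np.ndarray, tq_list: List[np.ndarray], lam: float) -> float:
    """Eq. (6): mix of answer-question similarity and the best answer-to-external-query similarity.
    lam = 1.0 recovers the qasim baseline (question-answer similarity only)."""
    return lam * cos(ya, yq) + (1.0 - lam) * max(cos(ya, tq) for tq in tq_list)

# Toy usage with random vectors standing in for averaged GloVe embeddings:
rng = np.random.default_rng(0)
ya, yq = rng.normal(size=100), rng.normal(size=100)
tq_list = [rng.normal(size=100) for _ in range(50)]
score = lambda_sim(ya, yq, tq_list, lam=0.5)
```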
4.3. Evaluation Metrics

We evaluate the performance of our model using three popular ranking metrics: Precision (P@1), Recall (R@3), and Mean Reciprocal Rank (MRR). Each metric is described below.

Precision (P@k): Precision at position k evaluates the fraction of relevant answers retrieved up to position k. For both the Yahoo! Ans and Infobot [9] datasets, we evaluate whether the top answer (k = 1) in the ranked list is indeed correct. It is defined as:

\mathrm{Prec@}k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{\sum_{j=1}^{k} \mathbb{I}\{rel_{ij} = 1\}}{k}   (7)

where I{rel_ij = 1} indicates whether the answer at position j is relevant to the i-th question.

Recall (R@k): Recall at position k evaluates the fraction of relevant answers retrieved out of all answers marked relevant for a question; we report recall averaged over all queries in the test set. We use a cutoff of k = 3, which evaluates whether the model retrieves the correct answers within the top 3 positions. It is defined as:

\mathrm{Recall@}k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{\sum_{j=1}^{k} \mathbb{I}\{rel_{ij} = 1\}}{|rel_i|}   (8)

where |rel_i| is the number of relevant answers for the i-th question.

MRR: MRR is the average of the reciprocal ranks of the most relevant answer for each question in the test set:

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}   (9)

where |Q| is the number of queries in the test set and rank_i is the rank of the first relevant answer for the i-th query.
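The three metrics of Eqs. (7)-(9) can be computed per question as in this short sketch (a straightforward implementation of the standard definitions; function and variable names are ours):

```python
from typing import List, Tuple

def rank_metrics(ranked_labels: List[List[int]], k_p: int = 1, k_r: int = 3) -> Tuple[float, float, float]:
    """P@k_p, R@k_r and MRR over per-question relevance labels sorted by model score."""
    p_at_k, r_at_k, mrr = [], [], []
    for labels in ranked_labels:                      # labels[j] = 1 if the answer at rank j+1 is relevant
        p_at_k.append(sum(labels[:k_p]) / k_p)        # Eq. (7)
        n_rel = sum(labels)
        r_at_k.append(sum(labels[:k_r]) / n_rel if n_rel else 0.0)    # Eq. (8)
        first = next((j for j, y in enumerate(labels) if y == 1), None)
        mrr.append(1.0 / (first + 1) if first is not None else 0.0)   # Eq. (9)
    n = len(ranked_labels)
    return sum(p_at_k) / n, sum(r_at_k) / n, sum(mrr) / n

# Two toy test questions, answers already sorted by predicted score:
print(rank_metrics([[0, 1, 0, 0], [1, 0, 1]]))   # -> (0.5, 1.0, 0.75)
```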
4.4. Parameter Settings

Both primary datasets, Yahoo! Ans and Infobot, were divided into three parts: train (~60%), validation, and test (20%). The baseline models λ-sim and att are initialized with 100-dimensional GloVe embeddings (https://nlp.stanford.edu/projects/glove/). We performed a parameter sweep over λ and τ for the λ-sim and τ-att models with a step size of 0.1 between 0 and 1.0. We used the base uncased model for the bert implementation; we fine-tuned it for between 1 and 10 epochs and found that 3 epochs gave the best result on the validation set. We used an LSTM with 64 hidden units to represent questions, answers and all the information in the external datasets. We experimented with larger embedding sizes and more hidden units, but performance degraded significantly as the model tends to overfit the training data. Lastly, we used a batch size of 64 and trained the model for 30 epochs with early stopping.

5. Results

Our focus is to evaluate the utility of external information in improving answer ranking for the cQ&A task. We performed experiments to answer three main research questions:

RQ1: Does external information improve answer ranking?
RQ2: How does temperature (τ) compare with the λ parameter?
RQ3: What kind of queries/questions does the model attend to when ranking relevant/non-relevant answers?

Table 3: Evaluation with WHO external data.
           Yahoo! Ans                 Infobot
Model      P@1     R@3     MRR       P@1     R@3     MRR
τ-att      0.393   0.644   0.598     0.673   0.868   0.802
λ-sim      0.3743  0.633   0.578     0.551   0.817   0.7207
bsl-256    0.406   0.657   0.615     0.581   0.803   0.744
bsl-128    0.363   0.604   0.589     0.557   0.799   0.731
att        0.377   0.645   0.589     0.567   0.821   0.739
qasim      0.318   0.608   0.546     0.551   0.817   0.720
random     0.21    -       -         0.239   -       -

Table 4: Evaluation with TREC-COVID external data.
           Yahoo! Ans                 Infobot
Model      P@1     R@3     MRR       P@1     R@3     MRR
τ-att      0.532   0.778   0.715     0.606   0.842   0.766
λ-sim      0.326   0.616   0.555     0.556   0.813   0.722
bsl-256    0.406   0.657   0.615     0.581   0.803   0.744
bsl-128    0.363   0.604   0.589     0.557   0.799   0.731
att        0.291   0.495   0.494     0.601   0.833   0.762
qasim      0.318   0.608   0.546     0.551   0.817   0.720
random     0.21    -       -         0.239   -       -

RQ1: Does external information improve answer ranking? We evaluated the different models for ranking answers in the Yahoo! Ans and Infobot datasets in the presence of the TREC and WHO external datasets. We found that temperature-regulated attention models that incorporate external sources indeed outperform the baselines, as shown in Table 4 and Table 3. The τ-att model beats the bert models by ~30% in precision, ~18% in recall and ~16% in MRR on TREC data. However, τ-att does only marginally better than the att model in precision and MRR on Infobot data; we suspect this is due to the large set of query-document pairs in the TREC-COVID data compared to the smaller number of question-answer pairs in the Infobot dataset. Our results also clearly suggest that embedding-based matching of the question-answer pair (qasim) does not yield a good ranker, though it is better than choosing an answer at random (random). When WHO is used as the external dataset, the τ-att model is slightly worse than bert, which suggests that not all sources benefit the cQ&A task equally. Since the attention computation depends on the number of external query and key embeddings, it would be interesting to scale the computation in our model to incorporate several open external datasets in the future.

Yahoo! Ans questions are also assigned categories by users. A category-based breakdown of performance on the test set is given in Table 6 (Precision@1) and Table 5 (Recall@3), listing the categories with the largest number of questions in the test set. In all categories, our model outperforms the best λ-sim and qasim models. The largest improvement is for questions in the Family category, where our model improves on the λ-sim model by 71%. Ranking answers for questions from Society and Politics appears to be harder than for other categories. All models, however, are able to rank the top answer within the first three positions effectively, as Recall@3 is high for all categories.

Table 5: Recall@3 of models across categories.
Category            τ-att   λ-sim   qasim
Entertainment (47)  0.829   0.702   0.59
Health (62)         0.693   0.69    0.645
Politics (143)      0.727   0.629   0.587
Society (38)        0.578   0.473   0.42
Family (20)         0.85    0.750   0.65

Table 6: Precision@1 of models across categories.
Category            τ-att   λ-sim   qasim
Entertainment (47)  0.446   0.382   0.297
Health (62)         0.483   0.419   0.354
Politics (143)      0.45    0.300   0.272
Society (38)        0.28    0.157   0.236
Family (20)         0.6     0.350   0.40

RQ2: How does temperature (τ) compare with the λ parameter? We argued that linearly combining similarities between question and answer in the primary dataset and between the question and the external source may not be sufficient to boost performance. Our results confirm this: the λ-sim models do not perform better than the τ-att models, which indicates that more sophisticated models can learn to combine this information directly from training data. However, our experiments also indicate that the optimal value of τ varies across primary datasets and external sources. For instance, the τ-att model performed best at τ = 0.4 and τ = 0.9 for the Yahoo! Ans and Infobot datasets respectively when TREC was used as the external source, and at τ = 0.1 and τ = 0.5 respectively when WHO was used as the external source. We also varied τ beyond 1.0 to determine whether a trend emerges, as shown in Table 7: higher values of temperature tend to degrade model performance, and we found that the optimal temperature range is [0.1, 1]. Existing research on model distillation [26] has also found empirically that lower values of temperature yield better performance.

Table 7: Variation in P@1 and R@3 across different temperature values (τ > 1.0).
                 Prec@1                     Recall@3
Src + Ext        τ=10    τ=100   τ=1000     τ=10    τ=100   τ=1000
Y! + TREC        0.46    0.38    0.38       0.73    0.644   0.64
Y! + WHO         0.37    0.38    0.36       0.64    0.65    0.64
Ibot + TREC      0.44    0.59    0.39       0.72    0.81    0.75
Ibot + WHO       0.65    0.41    0.44       0.85    0.76    0.79

We also compared model performance in terms of precision when λ and τ are varied for the λ-sim and temperature-based models respectively, as shown in Figure 4. Temperature-based models peak at one value but
On the other hand, we observe that adding external information also helps the 𝜆-sim models until a certain threshold. Overall, both sets of models show that free-text external information can be incorporated to improve answer ranking performance. Figure 6: Infobot ques, its rel and non-rel ans and questions with 𝜏 -att model’s attention values for TREC queries. RQ3: What kind of queries/questions does the model attend to when ranking relevant/non- relevant answers? Attention based models have a this external knowledge need not always be structured very unique feature: they can aid explaining the internal text. However, it is worth noting that curated and reli- workings of neural network models. We inspect what able external sources may not always be available for all kind of queries/questions in external datasets does our domains. We addressed a very niche task in this work, model pay attention to while ranking relevant or non- and further research is required to extend it to incorpo- relevant answers. Figure 5 shows one such example of rate multiple external sources. We posit that with scal- Yahoo! question and incorporation of TREC data. At the able attention mechanisms, this work can be easily made time of scoring relevant answer, the model gives higher tractable for large external sources containing thousands weight to some queries compared to others. In the exam- or millions of entries in the future. ple, for instance, it assigns more weight to queries asso- ciated with masks or COVID virus response to weather changes. We observe higher attention weights for ques- 6. Conclusion tions when relevant answers are ranked than when non- relevant answers are scored. An example question, a Question answering platforms provide users with effec- relevant and non-relevant answer along with model at- tive and easy access to information. These platforms tention weights on TREC queries are shown from the also provide content on rapidly evolving sensitive topics Infobot data in Figure 6 respectively. It shows a simi- such as disease outbreaks (such as COVID-19) where it is lar trend where attention weights are high for external also useful to use external vetted information for ranking queries that are closely associated with the question an- answers. Existing work only exploits knowledge bases swer text. which have some limitations that makes it difficult to Overall, our experiments show that curated external use them for community Q&A for rapidly evolving top- information is useful for improving community ques- ics such as wild-fires or earthquakes. In this work, we tion answering task. Our experiments also indicate that tried to evaluate the effectiveness of external (free text or semi-structured) information in improving answer rank- question answering in social multimedia, in: ing models. We argue that simple question-answer text Proceedings of the 26th ACM International Con- matching may be insufficient and in presence of external ference on Multimedia, MM ’18, Association for knowledge, but temperature regulated attention models Computing Machinery, New York, NY, USA, 2018, can distill information better which in turn yields higher p. 456–464. URL: https://doi.org/10.1145/3240508. performance. Our proposed model with temperature reg- 3240626. doi:10.1145/3240508.3240626. ulated attention, when evaluated on two public datasets [8] J. Hu, S. Qian, Q. Fang, C. 
6. Conclusion

Question answering platforms provide users with effective and easy access to information. These platforms also provide content on rapidly evolving, sensitive topics such as disease outbreaks (e.g., COVID-19), where it is useful to draw on external vetted information for ranking answers. Existing work only exploits knowledge bases, which have limitations that make them difficult to use for community Q&A on rapidly evolving topics such as wild-fires or earthquakes. In this work, we evaluated the effectiveness of external (free-text or semi-structured) information for improving answer ranking models. We argue that simple question-answer text matching may be insufficient in the presence of external knowledge, whereas temperature-regulated attention models can distill this information better, which in turn yields higher performance. Our proposed model with temperature-regulated attention, when evaluated on two public datasets, showed significant improvements by augmenting information from two external curated sources. In future work, we aim to expand these experiments to other categories such as disaster relief and to scale the attention mechanism to include multiple external sources in one model.

References

[1] M. Surdeanu, M. Ciaramita, H. Zaragoza, Learning to rank answers on large online QA collections, in: ACL, 2008.
[2] Y. Shen, W. Rong, Z. Sun, Y. Ouyang, Z. Xiong, Question/answer matching for CQA system via combining lexical and sequential information, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, AAAI Press, 2015, pp. 275–281.
[3] L. Yang, Q. Ai, D. Spina, R.-C. Chen, L. Pang, W. B. Croft, J. Guo, F. Scholer, Beyond factoid QA: Effective methods for non-factoid answer sentence retrieval, in: European Conference on Information Retrieval, Springer, 2016, pp. 115–128.
[4] L. Hong, B. D. Davison, A classification-based approach to question answering in discussion boards, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009, pp. 171–178.
[5] D. H. Dalip, M. A. Gonçalves, M. Cristo, P. Calado, Exploiting user feedback to learn to rank answers in QA forums: A case study with Stack Overflow, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, ACM, New York, NY, USA, 2013, pp. 543–552. doi:10.1145/2484028.2484072.
[6] Y. Zhang, S. Qian, Q. Fang, C. Xu, Multi-modal knowledge-aware hierarchical attention network for explainable medical question answering, in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19, ACM, New York, NY, USA, 2019, pp. 1089–1097. doi:10.1145/3343031.3351033.
[7] J. Hu, S. Qian, Q. Fang, C. Xu, Attentive interactive convolutional matching for community question answering in social multimedia, in: Proceedings of the 26th ACM International Conference on Multimedia, MM '18, ACM, New York, NY, USA, 2018, pp. 456–464. doi:10.1145/3240508.3240626.
[8] J. Hu, S. Qian, Q. Fang, C. Xu, Hierarchical graph semantic pooling network for multi-modal community question answer matching, in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19, ACM, New York, NY, USA, 2019, pp. 1157–1165. doi:10.1145/3343031.3350966.
[9] A. Poliak, M. Fleming, C. Costello, K. W. Murray, M. Yarmohammadi, S. Pandya, D. Irani, M. Agarwal, U. Sharma, S. Sun, et al., Collecting verified COVID-19 question answer pairs, 2020.
[10] E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, L. L. Wang, TREC-COVID: Constructing a pandemic information retrieval test collection, 2020. arXiv:2005.04474.
[11] A. Rücklé, I. Gurevych, Representation learning for answer selection with LSTM-based importance weighting, in: IWCS 2017, 12th International Conference on Computational Semantics, Short papers, 2017. URL: https://www.aclweb.org/anthology/W17-6935.
[12] D. Cohen, W. Croft, End to end long short term memory networks for non-factoid question answering, 2016, pp. 143–146. doi:10.1145/2970398.2970438.
[13] X. Zhou, B. Hu, Q. Chen, B. Tang, X. Wang, Answer sequence learning with neural networks for answer selection in community question answering, arXiv preprint arXiv:1506.06490 (2015).
[14] L. Nie, X. Wei, D. Zhang, X. Wang, Z. Gao, Y. Yang, Data-driven answer selection in community QA systems, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 1186–1198.
[15] A. Severyn, A. Moschitti, Structural relationships for large-scale learning of answer re-ranking, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012, pp. 741–750.
[16] X. Yang, M. Khabsa, M. Wang, W. Wang, A. H. Awadallah, D. Kifer, C. L. Giles, Adversarial training for community question answer selection based on multi-scale matching, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 395–402.
for explainable medical question answering, in: Awadallah, D. Kifer, C. L. Giles, Adversarial training Proceedings of the 27th ACM International Con- for community question answer selection based on ference on Multimedia, MM ’19, Association for multi-scale matching, in: Proceedings of the AAAI Computing Machinery, New York, NY, USA, 2019, Conference on Artificial Intelligence, volume 33, p. 1089–1097. URL: https://doi.org/10.1145/3343031. 2019, pp. 395–402. 3351033. doi:10.1145/3343031.3351033. [17] H. Huang, X. Wei, L. Nie, X. Mao, X.-S. Xu, From [7] J. Hu, S. Qian, Q. Fang, C. Xu, Attentive in- question to text: Question-oriented feature atten- teractive convolutional matching for community tion for answer selection, ACM Transactions on Information Systems 37 (2018) 1–33. doi:10.1145/ 3233771. [18] X. Zhang, S. Li, L. Sha, H. Wang, Attentive interac- tive neural networks for answer selection in com- munity question answering, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelli- gence, AAAI’17, AAAI Press, 2017, p. 3525–3531. [19] B. Wang, X. Wang, C.-J. Sun, B. Liu, L. Sun, Model- ing semantic relevance for question-answer pairs in web social communities, in: Proceedings of the 48th Annual Meeting of the Association for Com- putational Linguistics, 2010, pp. 1230–1238. [20] B. Kratzwald, A. Eigenmann, S. Feuerriegel, Rankqa: Neural question answering with answer re-ranking, CoRR abs/1906.03008 (2019). URL: http://arxiv.org/ abs/1906.03008. arXiv:1906.03008. [21] Y. Shen, Y. Deng, M. Yang, Y. Li, N. Du, W. Fan, K. Lei, Knowledge-aware attentive neural network for ranking question answer pairs, in: The 41st International ACM SIGIR Conference on Research Development in Information Retrieval, SIGIR ’18, Association for Computing Machinery, New York, NY, USA, 2018, p. 901–904. URL: https://doi.org/ 10.1145/3209978.3210081. doi:10.1145/3209978. 3210081. [22] P. Nakov, D. Hoogeveen, L. Màrquez, A. Moschitti, H. Mubarak, T. Baldwin, K. Verspoor, Semeval- 2017 task 3: Community question answering, arXiv preprint arXiv:1912.00730 (2019). [23] D. Su, Y. Xu, T. Yu, F. B. Siddique, E. J. Barezi, P. Fung, CAiRE-COVID: A question answering and multi- document summarization system for covid-19 re- search, arXiv 2005.03975 (2020). [24] A. Cohan, S. Feldman, I. Beltagy, D. Downey, D. S. Weld, Specter: Document-level representation learning using citation-informed transformers, in: Proceedings of the 58th Annual Meeting of the As- sociation for Computational Linguistics, 2020, pp. 2270–2282. [25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional trans- formers for language understanding, 2019. arXiv:1810.04805. [26] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, 2015. arXiv:1503.02531.