Powering COVID-19 Community Q&A with Curated Side Information

Manisha Verma, Kapil Thadani and Shaunak Mishra
Yahoo! Research NYC


Abstract
Community question answering and discussion platforms such as Reddit, Yahoo! Answers or Quora give users the flexibility of asking open-ended questions to a large audience, and replies to such questions may be useful both to the user and the community on topics such as health, sports or finance. Given the recent events around COVID-19, some of these platforms have attracted 2000+ questions from users about several aspects of the disease. Given the impact of this disease on the general public, in this work we investigate ways to improve the ranking of user generated answers on COVID-19. We specifically explore the utility of external technical sources of side information (such as CDC guidelines or WHO FAQs) in improving answer ranking on such platforms. We found that ranking user answers based on question-answer similarity alone is not sufficient, and that existing models cannot effectively exploit external (side) information. In this work, we demonstrate the effectiveness of attention based neural models that can directly exploit side information available in technical documents or verified forums (e.g., research publications on COVID-19 or the WHO website). Augmented with a temperature mechanism, these attention based neural models can selectively determine the relevance of side information for a given user question while ranking answers.

Keywords
question answering, deep learning, knowledge injection, NLP



1. Introduction

Question answering systems are key to finding relevant and timely information about several issues. Community question answering (cQ&A) platforms such as Reddit, Yahoo! Answers or Quora have been used to ask questions about wide-ranging topics. Most of these platforms let users ask, answer, vote or comment on questions present on the platform. However, question answering platforms are useful not only for gathering public opinions or votes in areas such as entertainment or sports, but can also serve as information hot-spots for more sensitive topics such as health, injuries or legal matters. Thus, it is imperative that when a user visits content on sensitive topics, answer ranking also takes into account curated side information from reliable (external) sources. Most prior work on cQ&A has focused on incorporating question-answer similarity [1, 2], user reputation [3, 4, 5], integration of multi-modal content [6], community interaction features [7] associated with answers, or the question answering network [8] on the platform. However, there is very limited work on incorporating curated content from external sources. Existing work only exploits knowledge bases [6] that consist of different entities and the relationships between these entities to score answers. However, knowledge bases have limitations that make them difficult to use for community Q&A on rapidly evolving topics such as disease outbreaks (e.g., ebola, COVID-19), wild-fires or earthquakes. Firstly, knowledge bases contain information about established entities and do not rapidly evolve to incorporate new information, which makes them unreliable for novel disease outbreaks such as COVID-19, where information changes rapidly and its verification is time sensitive. Secondly, it may be hard to determine what even constitutes an entity as new information arrives about the topic. To overcome these limitations, in this work we posit that external curated free-text or semi-structured informational sources can also be used effectively for cQ&A tasks.

In this work, we demonstrate that free-text or semi-structured external information sources such as the CDC1, WHO2 or NHS3 can be very useful for ranking answers on community Q&A platforms, since they contain frequently updated information about several topics such as ongoing disease outbreaks, vaccines, or resources about other subjects such as surgeries, birth control or historical numerical data about diseases across the world.

KINN 2021: Workshop on Knowledge Injection in Neural Networks – November, 2021
manishav@yahooinc.com (M. Verma); thadani@yahooinc.com (K. Thadani); shaunakm@yahooinc.com (S. Mishra)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
1 https://www.cdc.gov/
2 https://www.who.int/
3 https://www.nhs.uk/
We argue that for sensitive topics such as COVID-19, it is useful to use publicly available vetted information to improve our ranking systems. In this work, we explore the utility of publicly available information for ranking answers to questions associated with COVID-19. We specifically focus on ranking answers for questions in two publicly available primary Q&A datasets: a) Yahoo! Answers4 and b) the recently released annotated Q&A dataset [9], in the presence of two external semi-structured curated sources: a) TREC-COVID [10] and b) WHO questions and answers5 on COVID-19. We explore the utility of deep learning models with attention to improve upon existing state-of-the-art systems.

More specifically, we propose a temperature regulated attention mechanism to rank answers in the presence of external (side) information. Our experiments on 10K+ questions from both source datasets on COVID-19 show that our models can improve ranking quality by a significant margin over question-answer matching baselines in the presence of external information. Figure 1 demonstrates the overall design of our system. We specifically use an attention based neural architecture with temperature to automatically determine which components of the external information are useful for ranking user answers with respect to a question. Ranking performance, when evaluated with three metrics, shows that precision and recall for correct answer retrieval improve by ~17% and ~9% respectively for both source datasets over several other cQ&A models.

Figure 1: Illustrative example of COVID-19 community answer ranking powered by side information in the form of research papers, and information from verified sources (such as CDC, WHO, and NHS). A sample COVID-19 question from community Q&A (Yahoo! Answers, Quora, Reddit) and curated docs (research papers, WHO, CDC, NHS) are fed to an answer ranker, which orders the answers submitted by members and surfaces the most relevant one.

4 https://answers.yahoo.com/
5 https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub

2. Related work

Community question answering (cQ&A) is a well researched sub-field in both the information retrieval and NLP communities. Several systems have been proposed to rank user submitted answers to questions on community platforms such as Yahoo! Answers, Reddit and Quora.

Ranking user submitted answers on community question-answering platforms has been addressed with several approaches. The primary method is to determine the relevance of an answer given an input question, and text based matching is one of the most common approaches to rank answers. Researchers have used several methods to compute the similarity between a question and user generated answers to determine relevance. For instance, feature based question-answer matching is used in [1], with 17 features extracted from unigrams, bigrams and web correlation features derived from unstructured user search logs to rank answers. It is worth noting that user features and community features, when incorporated, may yield further improvements in the performance of these models, but this is not the focus of our work. The authors in [1] used questions extracted from Yahoo! Answers for their experiments. Researchers have also used representation learning approaches; for instance, in [11, 12] the authors use LSTMs to represent questions and answers respectively. Convolutional networks have also been used in [3, 13] to rank answers. Other approaches such as doc2vec [14], tree-kernels [15], adversarial learning [16], attention [17, 6, 11, 18] or deep belief networks [19] have been used to score question and answer pairs. There have also been studies exploring community, user interaction or question based features [3, 4, 5, 7] to rank answers. While these approaches are relevant, it is not always evident how one can incorporate external information into these systems when it is in free-text or semi-structured format. We explore some question-answer matching approaches as baselines in this work and show that for rapidly evolving topics such as COVID-19, inclusion of external curated information can boost model performance.

The line of work most closely related to ours is the incorporation of knowledge bases in Q&A systems. Existing work [20, 21, 6], however, approaches different tasks. For instance, the authors in [21, 20] focus on finding factual answers to questions using a knowledge base. This does not extend easily to cQ&A, where neither the questions nor the answers may request or refer to any facts. The most recent work is [6] on incorporating a medical KB for ranking answers on medical Q&A platforms. They propose to learn path based representations of entities (from the KB) present in questions and answers posted by users. This approach relies on reliable detection of entities first, which may be absent for emerging topics such as the COVID-19 pandemic.
Another limitation of this work is that external knowledge may not always be present in a structured format. For example, CDC guidelines are usually simple question-answer pairs posted on the website. This makes it difficult to apply their approach to our problem. The approach proposed in this work incorporates semi-structured information directly with the help of temperature regulated attention.

Finally, with the rise of COVID-19, researchers across disciplines are actively publishing information and datasets to share understanding of the virus and its impact on people. Researchers routinely organize dedicated challenges such as SemEval [22] with tasks such as ranking answers on QA forums. One such initiative is the TREC-COVID track [10], which released queries, documents and manual relevance judgements to power search for COVID related information. The authors in [23] also released a COVID-19 related QA dataset with 100+ question and answer pairs extracted from the TREC-COVID6 initiative. These question/answer pairs are not user generated content and hence do not reflect real user questions. We also rely on the recently released Q&A dataset from [9] for our task, and we compile a dataset of 2000+ COVID-19 questions with 10K+ answers, all submitted by users on Yahoo! Answers, for this work.

6 https://ir.nist.gov/covidSubmit/data.html
3. Method

3.1. Problem formulation

In this work, we focus on ranking answers for n questions q_1, ..., q_n related to an emerging topic such as COVID-19. Each q_i is associated with a set of two or more answers A_i = {a_ij : j ≥ 2} and corresponding labels Y_i = {y_ij : j ≥ 2} representing answer relevance. We use a binary indicator for relevance, where relevance judgments (e.g., favorite, upvoted) are provided by users, i.e., y_ij ∈ {0, 1}.

We attempt to model the relevance of each answer a_ij to its corresponding question using an external source which may contain free-text or semi-structured information. For example, the external source could consist of information-seeking queries or questions eq_1, ..., eq_m related to a topic, with each eq_k linked to a set of relevant scientific articles or answers ED_k, where each answer/document ed_1, ..., ed_p may be judged for relevance by human judges [10] or other experts.

We hypothesize that this semi-structured or free-text information may be valuable in identifying user answer quality for certain kinds of questions, although not all. We investigate this with our model, which recovers the true labels y_ij for each user answer a_ij ∈ A_i given its question q_i, category information, and information from the external source ⟨eq_k, ED_k⟩ for k = 1, ..., m.
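For concreteness, the setup above can be held in simple data containers; the sketch below is our own illustration with hypothetical field names (not code from the paper), showing one way to represent a community question with its candidate answers and binary labels alongside an external source of query-document pairs.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CommunityQuestion:
    """A question q_i with its user answers A_i and binary relevance labels Y_i."""
    question: str                    # q_i
    answers: List[str]               # a_i1, ..., a_ij (at least two per question)
    labels: List[int]                # y_ij in {0, 1}, e.g. 1 for the favorite answer
    category: str = ""               # optional Yahoo! Answers category

@dataclass
class ExternalEntry:
    """One entry <eq_k, ED_k> of the external source (e.g. a TREC-COVID query)."""
    query: str                                           # eq_k, an information-seeking query
    documents: List[str] = field(default_factory=list)   # ED_k, judged relevant docs/answers

# A toy instance, purely illustrative
example = CommunityQuestion(
    question="Can corona live on cardboard?",
    answers=["On cardboard, it can live up to 24 hours.", "Stop worrying."],
    labels=[1, 0],
)
external = [ExternalEntry(query="how long does covid survive on surfaces",
                          documents=["<abstract of a judged CORD-19 paper>"])]
```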
Figure 2: External source augmentation model.

3.2. Proposed Model

In this work, we explore a token-level matching mechanism to determine the relevance of information in the external source that may inform the label prediction task. Our model (τ-att) aims to match a given user question with all the submitted answers in the presence of external information about the same domain. First, the question q_i, an answer a_ij and additional metadata are encoded into d-dimensional vectors using a text encoder f_input. We use an LSTM based encoder for both the question and the answer in the primary source, which can handle input sequences of variable length.

Question Encoding: Each word w^q_i in a question is represented as a K-dimensional vector with pre-trained word embeddings. The LSTM takes each token embedding as input and updates the hidden state h^q_i based on the previous state h^q_{i-1}. Finally, the hidden state is input to a feed forward layer with a smaller dimension F < K to compress the question encoding as follows:

$h^q_i = \mathrm{LSTM}(h^q_{i-1}, w^q_i), \quad f^q_i = \mathrm{ReLU}(h^q_i W_q + b_q)$   (1)

Answer Encoding: Each word w^a_j in the answer is also represented as a K-dimensional vector with pre-trained word embeddings. The LSTM takes each token embedding as input and updates the hidden state h^a_j. We also reduce the dimension of the answer encoding with a feed forward layer of dimension F < K as follows:

$h^a_j = \mathrm{LSTM}(h^a_{j-1}, w^a_j), \quad f^a_j = \mathrm{ReLU}(h^a_j W_a + b_a)$   (2)
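As a rough PyTorch sketch of Equations (1) and (2) (our own reconstruction under stated assumptions, not the authors' released code), the encoder below runs an LSTM over pre-trained K-dimensional word embeddings and compresses the final hidden state to F < K dimensions with a ReLU feed-forward layer; separate instances can be used for questions and answers.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """LSTM over pre-trained embeddings, followed by a ReLU projection to F < K dims."""
    def __init__(self, embeddings: torch.Tensor, hidden: int = 64, out_dim: int = 32):
        super().__init__()
        # embeddings: (vocab_size, K) pre-trained word vectors (e.g. GloVe)
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
        self.lstm = nn.LSTM(embeddings.size(1), hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)     # W, b of Eq. (1)/(2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) padded token indices
        emb = self.embed(token_ids)                # (batch, seq_len, K)
        _, (h_n, _) = self.lstm(emb)               # h_n: (1, batch, hidden), last hidden state
        return torch.relu(self.proj(h_n[-1]))      # f^q or f^a: (batch, out_dim)

# Separate encoders for questions and answers, as in Eq. (1) and Eq. (2):
# glove = torch.randn(5000, 100)                  # stand-in for real GloVe vectors
# q_enc, a_enc = TextEncoder(glove), TextEncoder(glove)
```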
Table 1: Sample relevant/non-relevant answers from both sources

Source: Yahoo! Ans
Question: I am really scared to go places for St. Patrick's day because of the coronavirus. What do I do?
Relevant answer: Unfortunately, there's not enough people that care and will still go out and party despite the coronavirus epidemic. I'm proud of you in that you're taking extra precautions ... Good for you!
Non-relevant answer: Stop being scared of viruses. What's the problem?

Source: Infobot
Question: Can corona live on cardboard?
Relevant answer: A recent study shows that the virus can live in the air ... On cardboard, it can live up to 24 hours (1 day)
Non-relevant answer: The risk is quite low for one to become infected with COVID19 through mail/packages - especially because... (over a period of a few days/weeks).



We concatenate the question and answer representations for further processing:

$f_{ij} = [f^q_i, f^a_j]$   (3)

External source encoding: External sources of information can vary from task to task. We encode each segment of the data individually. For instance, if there are two segments in the source (e.g. question/answer or query/document), our system encodes both segments individually. We use the same encoding architecture used for the primary source question/answer encoding above. An encoding example for a two-segment external source is given below:

$h^{eq}_t = \mathrm{LSTM}(h^{eq}_{t-1}, w^{eq}_t), \quad f^{eq}_t = \mathrm{ReLU}(h^{eq}_t W_{eq} + b_{eq})$
$h^{ed}_t = \mathrm{LSTM}(h^{ed}_{t-1}, w^{ed}_t), \quad f^{ed}_t = \mathrm{ReLU}(h^{ed}_t W_{ed} + b_{ed})$   (4)

We incorporate the external source encoding with a temperature (τ) based variant of scaled dot-product attention, which provides a straightforward conditioning approach over a set of query-document pairs. The question-answer encoding vector f_ij serves as a query over the keys f^eq_t. If two segments are present in the external source, such as query/document, the model uses the attention weights computed over the first segment (e.g. query) to determine the importance of the second segment (e.g. document). It is easy to extend this framework to external sources with multiple segments. The two-segment attention is described below:

$z_{it} = \frac{f_{ij}^{\top} f^{eq}_t}{\sqrt{d}}, \quad \alpha_{it} = \frac{e^{z_{it}/\tau}}{\sum_{l} e^{z_{il}/\tau}}$   (5)

$s'_{i} = \sum_{t} \alpha_{it} f^{ed}_t$

To summarize, the temperature (τ) based attention determines the relevance of each f^ed_t, through its corresponding f^eq_t, with respect to the question encoding. The temperature parameter τ helps us control the uniformity of the attention weights α_it. Finally, labels are predicted using a multi-layer perceptron over the input vector f_ij and the learned weighted average of side information s'_i. We use binary cross-entropy loss to train the proposed model:

$\hat{y}_{ij} = F_{\mathrm{output}}([f_{ij}; s'_{i}])$

where F_output uses a sigmoid activation function. Since community questions may often be entirely unrelated to the external sources, a key aspect of this approach is determining whether the external source is useful at all, not merely attending to its entries that are most relevant. The temperature based attention mechanism is useful in controlling which external source entries are used for a given user question. It is worth noting that one has to experiment with and tune the value of the temperature τ so that ranking performance improves.
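The sketch below is a minimal reconstruction of Equations (3)-(5) and the output layer, assuming encoders such as the one above and pre-computed external segment encodings; it is not the authors' implementation. In particular, the small projection that aligns the concatenated question-answer vector with the external query encoding dimension is our own assumption, since the paper applies the dot product directly.

```python
import math
import torch
import torch.nn as nn

class TauAttentionRanker(nn.Module):
    """Temperature-scaled attention over external <query, document> pairs (Eqs. 3-5)."""
    def __init__(self, dim: int = 32, tau: float = 0.4):
        super().__init__()
        self.tau = tau
        self.to_key_space = nn.Linear(2 * dim, dim)   # assumption: align f_ij with f^eq dims
        # F_output: MLP over [f_ij ; s'] with a sigmoid for binary relevance
        self.output = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, f_q, f_a, f_eq, f_ed):
        # f_q, f_a  : (batch, dim)  encodings of the user question and one candidate answer
        # f_eq, f_ed: (m, dim)      encodings of m external queries and their documents
        f_ij = torch.cat([f_q, f_a], dim=-1)                      # Eq. (3)
        query = self.to_key_space(f_ij)                           # (batch, dim)
        z = (query @ f_eq.t()) / math.sqrt(f_eq.size(-1))         # scaled dot product
        alpha = torch.softmax(z / self.tau, dim=-1)               # Eq. (5): temperature tau
        s = alpha @ f_ed                                          # attention-weighted documents
        return self.output(torch.cat([f_ij, s], dim=-1))          # \hat{y}_ij in (0, 1)

# Training uses binary cross-entropy, e.g. torch.nn.BCELoss(), against y_ij in {0, 1}.
```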
4. Experimental Setup

Given the model architecture, in this section we provide a detailed overview of the different datasets, metrics and baselines used in our experiments.

4.1. Data

We compiled two question answering datasets. The first was collected from Yahoo! Answers and the second was recently released in [9]; both datasets contain questions raised by real users. In this work we focus specifically on questions associated with COVID-19. Statistics about the train and test splits of both Q&A datasets are given in Table 2. A pair of relevant and non-relevant answers for a question in each dataset is shown in Table 1 for reference. More details are given below.

Yahoo! Dataset: We crawled COVID-19 related questions from Yahoo! Answers7 using several keywords such as 'coronavirus', 'covid-19', 'covid', 'sars-cov2' and 'corona virus' between January 2020 and July 2020 to ensure we gathered all possible questions for our experiments. We keep only those questions that have two or more answers. In total, we obtained 1880 questions with 11500 answers. We used favorite answers as positive labels (similar to previous work [1]), assuming that users, over time, rate the answers (with upvotes/downvotes) that are most relevant to the submitted question. We normalized the question and answer text by removing a small list of stop words, numbers, links and symbols. Figures 3a and 3b show the distribution of question and answer lengths respectively. Questions contain 12.7±5.8 words (qwords) and answers contain 36.3±93.5 (mean±std) words (awords), which indicates that user submitted answers can vary widely on Yahoo! Answers. On average, a question has about 6 answers (ans/q) in the Yahoo! Ans dataset. We split the data into three sets: train (64%, 1196 questions, 7435 answers), validation (16%, 298 questions, 1858 answers) and test (20%, 374 questions, 2310 answers), where questions for each set were uniformly sampled.

7 https://answers.search.yahoo.com/search?p=coronavirus

Infobot Dataset [9]: Researchers at JHU [9] have recently compiled a list of user submitted questions from different platforms and manually labeled 22K+ question-answer pairs. We cleaned this set by removing questions with fewer than two answers or no relevant answers. In total, our dataset contains 8000+ question-answer pairs, where each question may have multiple relevant answers, unlike the Yahoo! Answers dataset. Figures 3c and 3d show the distribution of question and answer lengths respectively.

Figure 3: Token distribution in different sources. Panels: (a) Yahoo! ques length, (b) Yahoo! ans length, (c) Infobot ques length, (d) Infobot ans length.

Table 2: Train and test data from primary sources

  Stat            Yahoo! Ans     Infobot
  Train Q-A       9341           6354
  Train ans/q     6.25±2.9       4.40±0.77
  Train #qwords   12.71±5.8      6.55±3.93
  Train #awords   36.31±93.59    92.17±59.27
  Test Q-A        2232           1592
  Test ans/q      5.96±2.87      4.41±0.76
  Test #qwords    13.07±5.89     6.21±2.94
  Test #awords    35.64±80.31    92.39±59.47

4.1.1. External sources

We use two external datasets to rank answers. Details of each dataset are given below.

TREC COVID [10]: We use the recently released TREC COVID-19 track data with 50 queries, which also contain manually drafted query descriptions and narratives. Expert judges have labeled over 5000 scientific documents for these 50 queries from the CORD-19 dataset8. These documents contain coronavirus related research. Given that the documents are scientific literature, we initialize document embeddings using SPECTER [24].

8 https://www.semanticscholar.org/cord19
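Since the CORD-19 abstracts are scientific text, document vectors can be obtained from the public SPECTER checkpoint; the snippet below only sketches how such embeddings might be produced with the Hugging Face allenai/specter model, and is our illustration rather than the exact pipeline used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

def specter_embed(title: str, abstract: str) -> torch.Tensor:
    """Return one document vector: the [CLS] representation of 'title [SEP] abstract'."""
    text = title + tokenizer.sep_token + abstract
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[:, 0, :].squeeze(0)   # (768,) CLS embedding

# doc_vec = specter_embed("Some CORD-19 paper title", "Abstract text of the judged document ...")
```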
WHO: We use data released on the question and answer hub of the WHO9 website to create a list of question-answer pairs. There are 147 question and answer pairs in this dataset, where questions contain 13.28±5.36 words and answers contain 133.2±100.9 words respectively.

9 https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub

4.2. Baselines

We evaluated our model against the following baselines.

Random: An answer is chosen at random as relevant for a user question. This is expected to provide a lower bound on retrieval performance.

Linear Attention (att): When τ = 1.0, our model defaults to simple linear attention over all the information present in the external sources. This gives an indication of how well the model performs when it is forced to look at all the information in the external source.
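To make the role of τ concrete before the results, the toy computation below (our own illustration, not taken from the paper) shows how the same attention scores stay relatively spread out at τ = 1.0, sharpen onto the best-matching external query at a lower temperature, and approach a uniform average as τ grows.

```python
import numpy as np

def attention_weights(scores: np.ndarray, tau: float) -> np.ndarray:
    """Softmax of scores / tau, as in Eq. (5)."""
    z = scores / tau
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.2, 0.9, 0.3, 0.1])      # toy similarities to four external queries
for tau in (0.1, 1.0, 10.0):
    print(tau, np.round(attention_weights(scores, tau), 3))
# tau = 0.1 concentrates almost all weight on the best-matching query,
# tau = 1.0 spreads it more evenly (the att baseline above), and a very
# large tau approaches a uniform average over all external entries.
```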
Table 3: Evaluation with WHO external data

                Yahoo! Ans               Infobot
  Model      P@1     R@3     MRR      P@1     R@3     MRR
  τ-att      0.393   0.644   0.598    0.673   0.868   0.802
  λ-sim      0.3743  0.633   0.578    0.551   0.817   0.7207
  bsl-256    0.406   0.657   0.615    0.581   0.803   0.744
  bsl-128    0.363   0.604   0.589    0.557   0.799   0.731
  att        0.377   0.645   0.589    0.567   0.821   0.739
  qasim      0.318   0.608   0.546    0.551   0.817   0.720
  random     0.21    -       -        0.239   -       -
                                                                                 set. For recall, we take a cutoff as (𝑘 = 3), which
Linear combination (𝜆-sim) : We linearly combine
                                                                                 evaluates whether the model is able to retrieve the
similarities between Yahoo! question-answer and Trec
                                                                                 correct answers in top 3 positions. It is defined
query-answer as shown below:
                                                                                 as follows:
 𝜆-sim = 𝜆 𝑐𝑜𝑠(𝑦𝑎, 𝑦𝑞) + (1 − 𝜆) max(𝑐𝑜𝑠(𝑦𝑎, 𝑡𝑞))                                                      |𝑄| ∑︀𝑘
                                              𝑡𝑞                                                    1 ∑︁ 𝑗=1 I{𝑟𝑒𝑙𝑖𝑗 = 1}
                                                     (6)                           𝑅𝑒𝑐𝑎𝑙𝑙@𝑘 =                                     (8)
                                                                                                   |𝑄| 𝑖=1          |𝑟𝑒𝑙𝑖 |
where 𝑦𝑎, 𝑦𝑞 and 𝑡𝑞 are Yahoo! answer, question and
concatenated trec query, narrative and description em-
                                                                                 where |𝑟𝑒𝑙𝑖 | is the number of relevant answers
beddings respectively. This is a more crude version of
                                                                                 for the 𝑖th question.
temperature attention where 𝜆 controls the contribution
of each component directly. We vary 𝜆 to determine                             • MRR (MRR): evaluates the average of the recip-
the optimal combination. Question-Answer similarity                              rocal ranks corresponding to the most relevant
(qasim) is similarity between question and answer embed-                         answer for the questions in test set, which is given
ding i.e. 𝜆 = 1. Both question and answer embeddings                             by:
                                                                                                            |𝑄|
are obtained by averaging over their individual token                                                    1 ∑︁ 1
                                                                                             𝑀 𝑅𝑅 =                                (9)
embeddings.                                                                                             |𝑄| 𝑖=1 𝑟𝑎𝑛𝑘𝑖
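A minimal sketch of the λ-sim baseline in Equation (6), assuming averaged GloVe vectors for the answer, the question, and each concatenated TREC query text (our own illustration, not the paper's code), could look as follows. With λ = 1.0 the score reduces to the plain question-answer similarity (qasim).

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def lambda_sim(ans_vec, ques_vec, trec_vecs, lam: float = 0.5) -> float:
    """Eq. (6): lam * cos(ya, yq) + (1 - lam) * max over TREC queries of cos(ya, tq)."""
    qa_sim = cosine(ans_vec, ques_vec)
    ext_sim = max(cosine(ans_vec, tq) for tq in trec_vecs)
    return lam * qa_sim + (1.0 - lam) * ext_sim
```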

BERT Q&A (bert): Large scale pre-trained transformers [25] are widely popular for NLP tasks. BERT-like models have shown effectiveness on Q&A datasets such as SQuAD10. We fine-tune the BERT base model with two different answer lengths: a) 128 tokens (bsl-128) and b) 256 tokens (bsl-256). The intuition is that large scale pre-trained models are adept at language understanding and can be fine-tuned for new tasks with a small number of samples. We fine-tune BERT for both datasets, Yahoo! Ans and Infobot. It is non-trivial to include external information in BERT and we leave this for future work.

10 https://rajpurkar.github.io/SQuAD-explorer/
4.3. Evaluation Metrics

We evaluate the performance of our model using three popular ranking metrics: Precision (P@1), Recall (R@3) and Mean Reciprocal Rank (MRR). Each metric is described below.

• Precision (P@k): Precision at position k evaluates the fraction of relevant answers retrieved up to position k. For both datasets, Yahoo! Ans and Infobot [9], we evaluate whether the top answer, i.e. k = 1, in the ranked list is indeed correct. It is defined as follows:

$\mathrm{Prec}@k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{\sum_{j=1}^{k} \mathbb{I}\{rel_{ij} = 1\}}{k}$   (7)

where $\mathbb{I}\{rel_{ij} = 1\}$ indicates whether the answer at position j is relevant to the i-th question.

• Recall (R@k): Recall at position k evaluates the fraction of relevant answers retrieved out of all the answers marked relevant for a question. We report recall averaged over all the queries in the test set. For recall, we use a cutoff of k = 3, which evaluates whether the model is able to retrieve the correct answers in the top 3 positions. It is defined as follows:

$\mathrm{Recall}@k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{\sum_{j=1}^{k} \mathbb{I}\{rel_{ij} = 1\}}{|rel_i|}$   (8)

where $|rel_i|$ is the number of relevant answers for the i-th question.

• MRR: evaluates the average of the reciprocal ranks corresponding to the most relevant answer for the questions in the test set:

$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$   (9)

where |Q| is the number of queries in the test set and $rank_i$ is the rank of the first relevant answer for the i-th query.
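These three metrics can be computed from per-question label lists sorted by predicted score; the short helper code below mirrors Equations (7)-(9) with k = 1 for precision and k = 3 for recall (our own sketch, not from the paper).

```python
from typing import List

def precision_at_k(ranked_labels: List[List[int]], k: int = 1) -> float:
    """Eq. (7): fraction of relevant answers in the top k, averaged over questions."""
    return sum(sum(labels[:k]) / k for labels in ranked_labels) / len(ranked_labels)

def recall_at_k(ranked_labels: List[List[int]], k: int = 3) -> float:
    """Eq. (8): relevant answers found in the top k over all relevant answers."""
    return sum(sum(labels[:k]) / max(sum(labels), 1) for labels in ranked_labels) / len(ranked_labels)

def mrr(ranked_labels: List[List[int]]) -> float:
    """Eq. (9): mean reciprocal rank of the first relevant answer."""
    total = 0.0
    for labels in ranked_labels:
        rank = next((i + 1 for i, y in enumerate(labels) if y == 1), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(ranked_labels)

# ranked_labels holds, per question, the gold labels sorted by predicted score,
# e.g. [[1, 0, 0], [0, 1, 0, 0]] -> P@1 = 0.5, R@3 = 1.0, MRR = 0.75.
```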
4.4. Parameter Settings

Both primary datasets, Yahoo! Ans and Infobot, were divided into three parts: train (~60%), validation and test (20%) respectively. The baseline models λ-sim and att are initialized with GloVe embeddings11 of 100 dimensions. We performed a parameter sweep over λ and τ for the λ-sim and τ-att models with a step size of 0.1 between {0, 1.0}. We used the base uncased model for the bert implementation. We fine-tuned the model for between 1 and 10 epochs and found that 3 epochs gave the best result on the validation set. We used an LSTM with 64 hidden units to represent the question, the answer and all the information in the external datasets. We experimented with larger embedding sizes and more hidden units, but performance degraded significantly as the model tends to overfit on the training data. Lastly, we used a batch size of 64 and trained the model for 30 epochs with early stopping.

11 https://nlp.stanford.edu/projects/glove/
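The sweep over τ described above amounts to training one model per temperature value and keeping the one with the best validation precision; the compressed sketch below is our own illustration of that loop, where build_model and train_and_validate are caller-supplied helpers that the paper does not specify.

```python
import numpy as np

def sweep_temperature(build_model, train_and_validate, taus=np.arange(0.1, 1.01, 0.1)):
    """Train one model per temperature and keep the tau with the best validation P@1.

    build_model(tau) -> model and train_and_validate(model) -> validation P@1 are
    assumed, caller-supplied helpers (e.g. 30 epochs of BCE training with early stopping).
    """
    best_tau, best_p1 = None, -1.0
    for tau in taus:
        model = build_model(float(tau))
        p1 = train_and_validate(model)
        if p1 > best_p1:
            best_tau, best_p1 = float(tau), p1
    return best_tau, best_p1
```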
Table 4: Evaluation with TREC-COVID external data

                Yahoo! Ans               Infobot
  Model      P@1     R@3     MRR      P@1     R@3     MRR
  τ-att      0.532   0.778   0.715    0.606   0.842   0.766
  λ-sim      0.326   0.616   0.555    0.556   0.813   0.722
  bsl-256    0.406   0.657   0.615    0.581   0.803   0.744
  bsl-128    0.363   0.604   0.589    0.557   0.799   0.731
  att        0.291   0.495   0.494    0.601   0.833   0.762
  qasim      0.318   0.608   0.546    0.551   0.817   0.720
  random     0.21    -       -        0.239   -       -

Table 5: Recall@3 of models across categories

  Category             τ-att    λ-sim    qasim
  Entertainment (47)   0.829    0.702    0.59
  Health (62)          0.693    0.69     0.645
  Politics (143)       0.727    0.629    0.587
  Society (38)         0.578    0.473    0.42
  Family (20)          0.85     0.750    0.65

Table 6: Precision@1 of models across categories

  Category             τ-att    λ-sim    qasim
  Entertainment (47)   0.446    0.382    0.297
  Health (62)          0.483    0.419    0.354
  Politics (143)       0.45     0.300    0.272
  Society (38)         0.28     0.157    0.236
  Family (20)          0.6      0.350    0.40

5. Results

In this work, our focus is to evaluate the utility of external information in improving answer ranking for the cQ&A task. Thus, we performed experiments to answer three main research questions:
RQ1: Does external information improve answer ranking?
RQ2: How does temperature (τ) compare with the λ parameter?
RQ3: What kind of queries/questions does the model attend to when ranking relevant/non-relevant answers?

RQ1: Does external information improve answer ranking? We evaluated different models for ranking answers in the Yahoo! Ans and Infobot datasets in the presence of the TREC and WHO datasets respectively. We found that temperature regulated attention models that incorporate external sources indeed outperform the baselines, as shown in Table 4 and Table 3 respectively. The τ-att model beats the bert models by ~30% in precision, ~18% in recall and ~16% in MRR on TREC data. However, τ-att does only marginally better than the att model in precision and MRR on Infobot data. We suspect this is due to the large set of query-document pairs in the TREC-COVID data compared to the smaller number of question-answer pairs in the Infobot dataset. Our results also clearly suggest that embedding based matching of the question-answer pair (qasim) does not yield a good ranker, though it is better than choosing an answer at random (random). When WHO is used as an external dataset, we find that the τ-att model is slightly worse than bert. This suggests that not all sources would equally benefit the cQ&A task. Since attention is dependent on the input query and key embedding lengths, it would be interesting to scale this computation in our model to incorporate several open external datasets and overcome this limitation in the future.

Yahoo! Ans questions are also assigned categories by users. A category based breakdown of performance on the test set is given in Table 6 and Table 5 respectively, where the categories with the largest number of questions in the test set are listed. In all categories, our model outperforms the best λ-sim and qasim models. The largest improvement occurs for questions in the Family category, where our model achieves an improvement of 71% over the λ-sim model. It appears that ranking answers for questions from the Society and Politics categories is harder than for other categories. All the models, however, are able to rank the top answer in the first three positions effectively, as Recall@3 is high for all categories.

RQ2: How does temperature (τ) compare with the λ parameter? We argued that linearly combining similarities between the question-answer pair in the primary dataset and between the question and the external source may not be sufficient to boost performance. We observe this in our results too, i.e. the λ-sim models do not perform better than the τ-att models. This clearly indicates that more sophisticated models can learn to combine this information directly from training data. However, our experiments indicate that the optimal value of τ varies across primary datasets and external sources of information. For instance, the τ-att model performed best at τ = 0.4 and τ = 0.9 for the Yahoo! Ans and Infobot datasets respectively when TREC was used as the external source. It performed best at τ = 0.1 and τ = 0.5 for the Yahoo! Ans and Infobot datasets respectively when WHO was used as the external source. We also varied τ beyond 1.0 to determine whether it yielded a trend, as shown in Table 7. Higher values of temperature appear to degrade model performance, and we found that the optimal temperature range is [0.1-1]. Existing research in model distillation [26] has also empirically found that lower values of temperature yield better performance. We also compared model performance in terms of precision when λ and τ are varied for the λ-sim and temperature based models respectively, as shown in Figure 4.
Figure 4: Temperature and λ variation impact on Prec@1. Panels: (a) Yahoo!+TREC, (b) Yahoo!+WHO, (c) Infobot+TREC, (d) Infobot+WHO.

Table 7: Variation in P@1 and R@3 across different temperature values (τ > 1.0)

                     Prec@1                    Recall@3
  Src + Ext      τ=10    τ=100   τ=1000    τ=10    τ=100   τ=1000
  Y! + TREC      0.46    0.38    0.38      0.73    0.644   0.64
  Y! + WHO       0.37    0.38    0.36      0.64    0.65    0.64
  Ibot + TREC    0.44    0.59    0.39      0.72    0.81    0.75
  Ibot + WHO     0.65    0.41    0.44      0.85    0.76    0.79

Temperature based models peak at one value but do not show a clear trend, indicating that one needs to explore different τ values at training time for better performance. On the other hand, we observe that adding external information also helps the λ-sim models up to a certain threshold. Overall, both sets of models show that free-text external information can be incorporated to improve answer ranking performance.

Figure 5: Yahoo! question, its relevant and non-relevant answers, and TREC queries with the τ-att model's attention values.

Figure 6: Infobot question, its relevant and non-relevant answers, and TREC queries with the τ-att model's attention values.

RQ3: What kind of queries/questions does the model attend to when ranking relevant/non-relevant answers? Attention based models have a very useful property: they can help explain the internal workings of neural network models. We inspect which queries/questions in the external datasets our model pays attention to while ranking relevant or non-relevant answers. Figure 5 shows one such example of a Yahoo! question and the incorporation of TREC data. When scoring the relevant answer, the model gives higher weight to some queries than to others. In the example, for instance, it assigns more weight to queries associated with masks or the response of the COVID virus to weather changes. We observe higher attention weights on the external questions when relevant answers are ranked than when non-relevant answers are scored. An example question, a relevant and a non-relevant answer, along with the model's attention weights on TREC queries, are shown for the Infobot data in Figure 6. It shows a similar trend, where attention weights are high for external queries that are closely associated with the question-answer text.

Overall, our experiments show that curated external information is useful for improving the community question answering task. Our experiments also indicate that this external knowledge need not always be structured text. However, it is worth noting that curated and reliable external sources may not always be available for all domains. We addressed a very niche task in this work, and further research is required to extend it to incorporate multiple external sources. We posit that with scalable attention mechanisms, this work can be made tractable for large external sources containing thousands or millions of entries in the future.

6. Conclusion

Question answering platforms provide users with effective and easy access to information. These platforms also provide content on rapidly evolving sensitive topics such as disease outbreaks (e.g. COVID-19), where it is also useful to use external vetted information for ranking answers. Existing work only exploits knowledge bases, which have limitations that make them difficult to use for community Q&A on rapidly evolving topics such as wild-fires or earthquakes. In this work, we evaluated the effectiveness of external (free-text or semi-structured) information in improving answer ranking models.
We argue that simple question-answer text matching may be insufficient in the presence of external knowledge, whereas temperature regulated attention models can distill this information better, which in turn yields higher performance. Our proposed model with temperature regulated attention, when evaluated on two public datasets, showed significant improvements by augmenting information from two external curated sources. In the future, we aim to expand these experiments to other categories such as disaster relief, and to scale the attention mechanism to include multiple external sources in one model.
                                                              [9] A. Poliak, M. Fleming, C. Costello, K. W. Murray,
References

[1] M. Surdeanu, M. Ciaramita, H. Zaragoza, Learning to rank answers on large online QA collections, in: ACL, 2008.
[2] Y. Shen, W. Rong, Z. Sun, Y. Ouyang, Z. Xiong, Question/answer matching for CQA system via combining lexical and sequential information, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, AAAI Press, 2015, pp. 275–281.
[3] L. Yang, Q. Ai, D. Spina, R.-C. Chen, L. Pang, W. B. Croft, J. Guo, F. Scholer, Beyond factoid QA: Effective methods for non-factoid answer sentence retrieval, in: European Conference on Information Retrieval, Springer, 2016, pp. 115–128.
[4] L. Hong, B. D. Davison, A classification-based approach to question answering in discussion boards, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009, pp. 171–178.
[5] D. H. Dalip, M. A. Gonçalves, M. Cristo, P. Calado, Exploiting user feedback to learn to rank answers in QA forums: A case study with Stack Overflow, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 543–552. doi:10.1145/2484028.2484072.
[6] Y. Zhang, S. Qian, Q. Fang, C. Xu, Multi-modal knowledge-aware hierarchical attention network for explainable medical question answering, in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1089–1097. doi:10.1145/3343031.3351033.
[7] J. Hu, S. Qian, Q. Fang, C. Xu, Attentive interactive convolutional matching for community question answering in social multimedia, in: Proceedings of the 26th ACM International Conference on Multimedia, MM '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 456–464. doi:10.1145/3240508.3240626.
[8] J. Hu, S. Qian, Q. Fang, C. Xu, Hierarchical graph semantic pooling network for multi-modal community question answer matching, in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1157–1165. doi:10.1145/3343031.3350966.
[9] A. Poliak, M. Fleming, C. Costello, K. W. Murray, M. Yarmohammadi, S. Pandya, D. Irani, M. Agarwal, U. Sharma, S. Sun, et al., Collecting verified COVID-19 question answer pairs, 2020.
[10] E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, L. L. Wang, TREC-COVID: Constructing a pandemic information retrieval test collection, 2020. arXiv:2005.04474.
[11] A. Rücklé, I. Gurevych, Representation learning for answer selection with LSTM-based importance weighting, in: IWCS 2017 — 12th International Conference on Computational Semantics — Short papers, 2017. URL: https://www.aclweb.org/anthology/W17-6935.
[12] D. Cohen, W. Croft, End to end long short term memory networks for non-factoid question answering, 2016, pp. 143–146. doi:10.1145/2970398.2970438.
[13] X. Zhou, B. Hu, Q. Chen, B. Tang, X. Wang, Answer sequence learning with neural networks for answer selection in community question answering, arXiv preprint arXiv:1506.06490 (2015).
[14] L. Nie, X. Wei, D. Zhang, X. Wang, Z. Gao, Y. Yang, Data-driven answer selection in community QA systems, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 1186–1198.
[15] A. Severyn, A. Moschitti, Structural relationships for large-scale learning of answer re-ranking, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012, pp. 741–750.
[16] X. Yang, M. Khabsa, M. Wang, W. Wang, A. H. Awadallah, D. Kifer, C. L. Giles, Adversarial training for community question answer selection based on multi-scale matching, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 395–402.
[17] H. Huang, X. Wei, L. Nie, X. Mao, X.-S. Xu, From question to text: Question-oriented feature attention for answer selection, ACM Transactions on Information Systems 37 (2018) 1–33. doi:10.1145/3233771.
[18] X. Zhang, S. Li, L. Sha, H. Wang, Attentive interactive neural networks for answer selection in community question answering, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, AAAI Press, 2017, pp. 3525–3531.
[19] B. Wang, X. Wang, C.-J. Sun, B. Liu, L. Sun, Modeling semantic relevance for question-answer pairs in web social communities, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 1230–1238.
[20] B. Kratzwald, A. Eigenmann, S. Feuerriegel, RankQA: Neural question answering with answer re-ranking, CoRR abs/1906.03008 (2019). arXiv:1906.03008.
[21] Y. Shen, Y. Deng, M. Yang, Y. Li, N. Du, W. Fan, K. Lei, Knowledge-aware attentive neural network for ranking question answer pairs, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 901–904. doi:10.1145/3209978.3210081.
[22] P. Nakov, D. Hoogeveen, L. Màrquez, A. Moschitti, H. Mubarak, T. Baldwin, K. Verspoor, SemEval-2017 task 3: Community question answering, arXiv preprint arXiv:1912.00730 (2019).
[23] D. Su, Y. Xu, T. Yu, F. B. Siddique, E. J. Barezi, P. Fung, CAiRE-COVID: A question answering and multi-document summarization system for COVID-19 research, arXiv:2005.03975 (2020).
[24] A. Cohan, S. Feldman, I. Beltagy, D. Downey, D. S. Weld, SPECTER: Document-level representation learning using citation-informed transformers, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2270–2282.
[25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[26] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, 2015. arXiv:1503.02531.