=Paper=
{{Paper
|id=Vol-3052/short23
|storemode=property
|title=Powering COVID-19 Community Q&A with Curated Side Information
|pdfUrl=https://ceur-ws.org/Vol-3052/short23.pdf
|volume=Vol-3052
|authors=Manisha Verma,,Kapil Thadani,,Shaunak Mishra
|dblpUrl=https://dblp.org/rec/conf/cikm/VermaTM21
}}
==Powering COVID-19 Community Q&A with Curated Side Information==
Powering COVID-19 Community Q&A with Curated Side Information

Manisha Verma, Kapil Thadani and Shaunak Mishra (Yahoo! Research NYC)
KINN 2021: Workshop on Knowledge Injection in Neural Networks, November 2021
Contact: manishav@yahooinc.com (M. Verma); thadani@yahooinc.com (K. Thadani); shaunakm@yahooinc.com (S. Mishra)

Abstract

Community question answering and discussion platforms such as Reddit, Yahoo! Answers or Quora give users the flexibility to ask open-ended questions of a large audience, and replies to such questions may be useful both to the asker and to the community on topics such as health, sports or finance. Following the recent events around COVID-19, some of these platforms have attracted 2000+ user questions about several aspects of the disease. Given the impact of the disease on the general public, in this work we investigate ways to improve the ranking of user-generated answers about COVID-19. We specifically explore the utility of external, technical sources of side information (such as CDC guidelines or WHO FAQs) for improving answer ranking on such platforms. We find that ranking user answers based on question-answer similarity alone is not sufficient, and that existing models cannot effectively exploit external (side) information. We demonstrate the effectiveness of attention-based neural models that can directly exploit side information available in technical documents or verified forums (e.g., research publications on COVID-19 or the WHO website). Augmented with a temperature mechanism, these attention-based models can selectively determine the relevance of side information to a given user question while ranking answers.

Keywords: question answering, deep learning, knowledge injection, NLP

1. Introduction

Question answering systems are key to finding relevant and timely information about many issues. Community question answering (cQ&A) platforms such as Reddit, Yahoo! Answers or Quora have been used to ask questions on a wide range of topics. Most of these platforms let users ask, answer, vote or comment on questions posted on the platform. Question answering platforms are useful not only for gathering public opinions or votes in areas such as entertainment or sports; they can also serve as information hot-spots for more sensitive topics such as health, injuries or legal matters. It is therefore imperative that, when a user visits content on sensitive topics, answer ranking also takes into account curated side information from reliable (external) sources.

Most prior work on cQ&A has focused on incorporating question-answer similarity [1, 2], user reputation [3, 4, 5], multi-modal content [6], community interaction features [7] associated with answers, or the question answering network itself [8]. There is, however, very limited work on incorporating curated content from external sources. Existing work only exploits knowledge bases [6], which consist of different entities and the relationships between them, to score answers. Knowledge bases have limitations that make them difficult to use for community Q&A on rapidly evolving topics such as disease outbreaks (e.g., ebola, COVID-19), wild-fires or earthquakes. Firstly, knowledge bases contain information about established entities and do not evolve rapidly to incorporate new information, which makes them unreliable for novel disease outbreaks such as COVID-19, where information changes quickly and its verification is time sensitive. Secondly, it may be hard to determine what even constitutes an entity as new information about a topic arrives. To overcome these limitations, we posit that external curated free-text or semi-structured informational sources can also be used effectively for cQ&A tasks.
In this work, we demonstrate that free-text or semi-structured external information sources such as the CDC (https://www.cdc.gov/), WHO (https://www.who.int/) or NHS (https://www.nhs.uk/) can be very useful for ranking answers on community Q&A platforms, since they contain frequently updated information about topics such as ongoing disease outbreaks, vaccines, or resources on other subjects such as surgeries, birth control, or historical numerical data about diseases across the world.

We argue that for sensitive topics such as COVID-19, it is useful to use publicly available vetted information to improve ranking systems. In this work, we explore the utility of publicly available information for ranking answers to questions associated with COVID-19. We specifically focus on ranking answers for questions in two publicly available primary Q&A datasets: a) Yahoo! Answers (https://answers.yahoo.com/) and b) a recently released annotated Q&A dataset [9], in the presence of two external semi-structured curated sources: a) TREC-COVID [10] and b) WHO questions and answers on COVID-19 (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub). We explore deep learning models with attention to improve upon existing state-of-the-art systems.

More specifically, we propose a temperature-regulated attention mechanism to rank answers in the presence of external (side) information. Our experiments on 10K+ questions from both source datasets on COVID-19 show that our models can improve ranking quality by a significant margin over question-answer matching baselines when external information is available. Figure 1 illustrates the overall design of our system. We use an attention-based neural architecture with temperature to automatically determine which components of the external information are useful for ranking user answers with respect to a question. Evaluated with three metrics, precision and recall for correct answer retrieval improve by ~17% and ~9% respectively on the two source datasets over several other cQ&A models.

Figure 1: Illustrative example of COVID-19 community answer ranking powered by side information in the form of research papers and information from verified sources (such as CDC, WHO, and NHS).
2. Related work

Community question answering (cQ&A) is a well researched sub-field in both the information retrieval and NLP communities. Several systems have been proposed to rank user-submitted answers to questions on community platforms such as Yahoo! Answers, Reddit and Quora. The primary method is to determine the relevance of an answer to the input question, and text-based matching is one of the most common approaches. Researchers have used several methods to compute the similarity between a question and user-generated answers to determine relevance. For instance, feature-based question-answer matching is used in [1], with 17 features extracted from unigrams, bigrams and web correlation features derived from unstructured user search logs. It is worth noting that incorporating user and community features may yield further improvements for such models, but this is not the focus of our work. The authors of [1] used questions extracted from Yahoo! Answers for their experiments.

Researchers have also used different representation learning approaches; for instance, the authors of [11, 12] use LSTMs to represent questions and answers respectively. Convolutional networks have been used in [3, 13] to rank answers. Other approaches such as doc2vec [14], tree kernels [15], adversarial learning [16], attention [17, 6, 11, 18] and deep belief networks [19] have been used to score question-answer pairs. There have also been studies exploring community, user interaction or question-based features [3, 4, 5, 7] to rank answers. While these approaches are relevant, it is not always evident how external information in free-text or semi-structured format can be incorporated into such systems. We use several question-answer matching approaches as baselines in this work and show that for rapidly evolving topics such as COVID-19, including external curated information can boost model performance.

The line of work most closely related to ours is the incorporation of knowledge bases into Q&A systems. Existing work [20, 21, 6], however, addresses different tasks. For instance, the authors of [21, 20] focus on finding factual answers to questions using a knowledge base; this does not extend easily to cQ&A, where neither the questions nor the answers may request or refer to any facts. The most recent work is [6], which incorporates a medical knowledge base for ranking answers on medical Q&A platforms by learning path-based representations of entities (from the KB) present in the questions and answers posted by users. This approach relies on reliable detection of entities, which may be absent for emerging topics such as the COVID-19 pandemic. Another limitation is that external knowledge may not always be present in a structured format; CDC guidelines, for example, are usually simple question-answer pairs posted on the website. This makes it difficult to apply their approach to our problem. The approach proposed in this work instead incorporates semi-structured information directly with the help of temperature-regulated attention.

Finally, with the rise of COVID-19, researchers across disciplines are actively publishing information and datasets to share understanding of the virus and its impact on people. Researchers routinely organize dedicated challenges such as SemEval [22] with tasks such as ranking answers on Q&A forums. One such initiative is the TREC-COVID track [10], which released queries, documents and manual relevance judgements to power search for COVID-related information (https://ir.nist.gov/covidSubmit/data.html). The authors of [23] also released a COVID-19 QA dataset with 100+ question-answer pairs extracted from the TREC-COVID initiative; these pairs are not user-generated content and hence do not reflect real user questions. We also rely on the recently released Q&A dataset from [9] for our task, and we additionally compile a dataset of 2000+ COVID-19 questions with 10K+ answers, all submitted by users on Yahoo! Answers.
3. Method

3.1. Problem formulation

In this work, we focus on ranking answers for n questions q_1, ..., q_n related to an emerging topic such as COVID-19. Each q_i is associated with a set of two or more answers A_i = {a_ij : j >= 2} and corresponding labels Y_i = {y_ij : j >= 2} representing answer relevance. We use a binary indicator for relevance, y_ij ∈ {0, 1}, where relevance judgments (e.g., favorite, upvoted) are provided by users.

We attempt to model the relevance of each answer a_ij to its corresponding question using an external source which may contain free-text or semi-structured information. For example, the external source could consist of information-seeking queries or questions eq_1, ..., eq_m related to a topic, with each eq_k linked to a set of relevant scientific articles or answers ED_k, where each answer/document ed_1, ..., ed_p may be judged for relevance by human judges [10] or other experts. We hypothesize that this semi-structured or free-text information may be valuable in identifying user answer quality for certain kinds of questions, although not all. We investigate this with our model, which aims to recover the true labels y_ij for each user answer a_ij ∈ A_i given its question q_i, category information, and information from the external source ⟨eq_k, ED_k⟩_{k=1}^{m}.
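To make the formulation concrete, the following is a minimal Python sketch (our own illustration; the class and field names are not from the paper) of how the primary Q&A data and a two-segment external source could be represented:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QAItem:
    """One community question q_i with its user answers a_ij and binary labels y_ij."""
    question: str
    answers: List[str]   # at least two answers per question
    labels: List[int]    # y_ij in {0, 1}, e.g. 1 for the favorite/upvoted answer

@dataclass
class ExternalEntry:
    """One external query/question eq_k with its linked documents or answers ED_k."""
    query: str           # e.g. a TREC-COVID query or a WHO FAQ question
    documents: List[str] # judged scientific articles, or the FAQ answer text

# A primary dataset is a list of QAItem; an external source is a list of ExternalEntry.
primary: List[QAItem] = [
    QAItem(
        question="Can corona live on cardboard?",
        answers=["On cardboard, it can live up to 24 hours (1 day) ...",
                 "The risk is quite low for one to become infected through mail ..."],
        labels=[1, 0],
    )
]
external: List[ExternalEntry] = [
    ExternalEntry(query="How long does the coronavirus survive on surfaces?",  # illustrative only
                  documents=["<judged CORD-19 abstract>"])
]
```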
3.2. Proposed Model

Figure 2: External source augmentation model.

We explore a token-level matching mechanism to determine the relevance of information in the external source that may inform the label prediction task. Our model (τ-att) aims to match a given user question with all of its submitted answers in the presence of external information about the same domain. The question q_i, an answer a_ij and additional metadata are encoded into d-dimensional vectors using a text encoder f_input. We use an LSTM-based encoder for both the question and the answer in the primary source, which can handle input sequences of variable length.

Question encoding: Each word w^q_i in a question is represented as a K-dimensional vector using pre-trained word embeddings. The LSTM takes each token embedding as input and updates the hidden state h^q_i based on the previous state h^q_{i-1}. Finally, the hidden state is passed through a feed-forward layer of smaller dimension F < K to compress the question encoding:

h^q_i = \mathrm{LSTM}(h^q_{i-1}, w^q_i), \quad f^q_i = \mathrm{ReLU}(h^q_i W_q + b_q)   (1)

Answer encoding: Each word w^a_j in the answer is likewise represented as a K-dimensional vector using pre-trained word embeddings. The LSTM takes each token embedding as input and updates the hidden state h^a_j. We also reduce the dimension of the answer encoding with a feed-forward layer of dimension F < K:

h^a_j = \mathrm{LSTM}(h^a_{j-1}, w^a_j), \quad f^a_j = \mathrm{ReLU}(h^a_j W_a + b_a)   (2)

We concatenate the question and answer representations for further processing:

f_{ij} = [f^q_i ; f^a_j]   (3)

External source encoding: External sources of information vary from task to task, so we encode each segment of data individually. For instance, if the source has two segments (e.g., question/answer or query/document), our system encodes both segments separately, using the same encoding architecture as for the primary-source question/answer encoding above. The encoding for a two-segment external source is:

h^{eq}_t = \mathrm{LSTM}(h^{eq}_{t-1}, w^{eq}_t), \quad f^{eq}_t = \mathrm{ReLU}(h^{eq}_t W_{eq} + b_{eq})
h^{ed}_t = \mathrm{LSTM}(h^{ed}_{t-1}, w^{ed}_t), \quad f^{ed}_t = \mathrm{ReLU}(h^{ed}_t W_{ed} + b_{ed})   (4)
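As an illustration of Eqs. (1)-(3), here is a minimal PyTorch sketch of an LSTM encoder with a ReLU feed-forward compression layer. The paper does not specify a framework, so the framework choice, module names and dimensions are our assumptions:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """LSTM over pre-trained token embeddings, followed by a ReLU feed-forward
    layer that compresses the final hidden state to dimension F < K (Eqs. 1-2)."""
    def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 64, out_dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # would be initialized from GloVe in practice
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.ff = nn.Linear(hidden, out_dim)           # compression to F < K

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> encoding: (batch, out_dim)
        embedded = self.emb(token_ids)
        _, (h_n, _) = self.lstm(embedded)              # final hidden state of the LSTM
        return torch.relu(self.ff(h_n[-1]))

# Separate encoders for questions and answers; their outputs are concatenated (Eq. 3).
q_enc, a_enc = TextEncoder(30000), TextEncoder(30000)
q_ids = torch.randint(0, 30000, (4, 12))               # toy batch: 4 questions, 12 tokens each
a_ids = torch.randint(0, 30000, (4, 40))               # 4 answers, 40 tokens each
f_ij = torch.cat([q_enc(q_ids), a_enc(a_ids)], dim=-1) # shape (4, 64)
```

The same encoder architecture would also be applied to each segment of the external source (Eq. 4).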
We incorporate the external source encoding with a temperature (τ) based variant of scaled dot-product attention, which provides a straightforward conditioning approach over a set of query-document pairs. The question-answer encoding vector f_{ij} serves as a query over the keys f^{eq}_t. If two segments are present in the external source, such as query/document, the model uses the attention weights computed over the first segment (e.g., the query) to determine the importance of the second segment (e.g., the document). It is easy to extend this framework to external sources with multiple segments. The two-segment attention is:

z_{it} = \frac{f_{ij}^{\top} f^{eq}_t}{\sqrt{d}}, \quad
\alpha_{it} = \frac{e^{z_{it}/\tau}}{\sum_l e^{z_{il}/\tau}}, \quad
s'_i = \sum_t \alpha_{it} f^{ed}_t   (5)

To summarize, temperature (τ) based attention helps determine the relevance of each f^{ed}_t (through its corresponding f^{eq}_t) with respect to the question encoding, and the temperature parameter controls the uniformity of the attention weights α_{it}. Finally, labels are predicted using a multi-layer perceptron over the input vector f_{ij} and the learned weighted average of the side information s'_i:

\hat{y}_{ij} = F_{\mathrm{output}}([f_{ij} ; s'_i])

where F_output uses a sigmoid activation function; we use binary cross-entropy loss to train the proposed model. Since community questions may often be entirely unrelated to the external sources, a key aspect of this approach is determining whether the external source is useful at all, not merely attending to its most relevant entries. The temperature-based attention mechanism is useful for controlling which external source entries influence the prediction for a given user question. It is worth noting that the value of the temperature τ has to be tuned experimentally so that ranking performance improves.
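The temperature-scaled attention of Eq. (5) and the sigmoid output layer can be sketched as follows. This is again a hedged PyTorch illustration consistent with the equations, not the authors' implementation; tensor shapes and names are assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tau_attention(f_ij: torch.Tensor,   # (batch, d)  concatenated question-answer encoding
                  ext_q: torch.Tensor,  # (m, d)      encoded external queries  f^eq_t
                  ext_d: torch.Tensor,  # (m, d)      encoded external documents f^ed_t
                  tau: float = 0.4) -> torch.Tensor:
    """Temperature-scaled dot-product attention over external (query, document) pairs (Eq. 5)."""
    d = f_ij.size(-1)
    z = (f_ij @ ext_q.t()) / (d ** 0.5)  # (batch, m) scaled dot-product scores z_it
    alpha = F.softmax(z / tau, dim=-1)   # temperature tau controls how uniform the weights are
    return alpha @ ext_d                 # weighted average of document encodings, s'

class OutputLayer(nn.Module):
    """MLP with sigmoid over [f_ij ; s'], trained with binary cross-entropy."""
    def __init__(self, d: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, f_ij: torch.Tensor, side: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(torch.cat([f_ij, side], dim=-1))).squeeze(-1)

# Toy usage with random encodings (4 question-answer pairs, 50 external entries):
f_ij  = torch.randn(4, 64)
ext_q = torch.randn(50, 64)
ext_d = torch.randn(50, 64)
y_hat = OutputLayer(64)(f_ij, tau_attention(f_ij, ext_q, ext_d, tau=0.4))
loss  = F.binary_cross_entropy(y_hat, torch.tensor([1.0, 0.0, 0.0, 1.0]))
```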
4. Experimental Setup

Given the model architecture, in this section we provide a detailed overview of the datasets, metrics and baselines used in our experiments.

4.1. Data

We compiled two question answering datasets: the first was collected from Yahoo! Answers and the second was recently released in [9]; both contain questions raised by real users. In this work we focus specifically on questions associated with COVID-19. Statistics for the train and test splits of both Q&A datasets are given in Table 2, and a pair of relevant and non-relevant answers for a question from each dataset is shown in Table 1. More details are given below.

Table 1: Sample relevant and non-relevant answers from both sources.
- Yahoo! Ans. Question: "I am really scared to go places for St. Patrick's day because of the coronavirus. What do I do?" Relevant answer: "Unfortunately, there's not enough people that care and will still go out and party despite the coronavirus epidemic. I'm proud of you in that you're taking extra precautions ... Good for you!" Non-relevant answer: "Stop being scared of viruses. What's the problem?"
- Infobot. Question: "Can corona live on cardboard?" Relevant answer: "A recent study shows that the virus can live in the air ... On cardboard, it can live up to 24 hours (1 day) because..." Non-relevant answer: "The risk is quite low for one to become infected with COVID19 through mail/packages - especially (over a period of a few days/weeks)."

Yahoo! Dataset: We crawled COVID-19 related questions from Yahoo! Answers (https://answers.search.yahoo.com/search?p=coronavirus) using several keywords such as 'coronavirus', 'covid-19', 'covid', 'sars-cov2' and 'corona virus' over the period Jan 2020 to July 2020, to ensure we gathered all relevant questions for our experiments. We keep only questions with two or more answers. In total, we obtained 1880 questions with 11500 answers. We used favorite answers as positive labels (similar to previous work [1]), assuming that users over time rate the answers (with upvotes/downvotes) that are most relevant to the submitted question. We normalized the question and answer text by removing a small list of stop words, numbers, links and symbols. Figures 3a and 3b show the distributions of question and answer lengths respectively. Questions contain 12.7 ± 5.8 words (qwords) and answers 36.3 ± 93.5 words (awords, mean±std), which indicates that user-submitted answers vary widely on Yahoo! Answers. On average, a question has about 6 answers (ans/q) in the Yahoo! Ans dataset. We split the data into three sets: train (64%, 1196 questions, 7435 answers), validation (16%, 298 questions, 1858 answers) and test (20%, 374 questions, 2310 answers), where the questions for each set were uniformly sampled.

Infobot Dataset [9]: Researchers at JHU [9] recently compiled a list of user-submitted questions from different platforms and manually labeled 22K+ question-answer pairs. We cleaned this set by removing questions with fewer than two answers or no relevant answers. In total, our dataset contains 8000+ question-answer pairs, where each question may have multiple relevant answers, unlike the Yahoo! Answers dataset. Figures 3c and 3d show the distributions of question and answer lengths respectively.

Figure 3: Token distribution in different sources. Panels: (a) Yahoo! question length, (b) Yahoo! answer length, (c) Infobot question length, (d) Infobot answer length.

Table 2: Train and test data from primary sources.
Stat             Yahoo! Ans     Infobot
Train Q-A        9341           6354
Train ans/q      6.25±2.9       4.40±0.77
Train #qwords    12.71±5.8      6.55±3.93
Train #awords    36.31±93.59    92.17±59.27
Test Q-A         2232           1592
Test ans/q       5.96±2.87      4.41±0.76
Test #qwords     13.07±5.89     6.21±2.94
Test #awords     35.64±80.31    92.39±59.47

4.1.1. External sources

We use two external datasets to rank answers. Details of each are given below.

TREC-COVID [10]: We use the recently released TREC-COVID track data with 50 queries, which also contain manually drafted query descriptions and narratives. Expert judges have labeled over 5000 scientific documents for these 50 queries from the CORD-19 dataset (https://www.semanticscholar.org/cord19); these documents contain coronavirus-related research. Since the documents are scientific literature, we initialize document embeddings using SPECTER [24].

WHO: We use data released on the WHO question and answer hub (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/question-and-answers-hub) to create a list of question-answer pairs. There are 147 question-answer pairs in this dataset, where questions contain 13.28±5.36 words and answers contain 133.2±100.9 words.

4.2. Baselines

We evaluated our model against the following baselines.

Random: An answer is chosen at random as relevant for a user question. This provides a lower bound on retrieval performance.

Linear attention (att): When τ = 1.0, our model defaults to simple linear attention over all the information present in the external sources. This indicates how well the model performs when it is forced to look at all the information in the external source.

Linear combination (λ-sim): We linearly combine the similarity between the Yahoo! question and answer with the similarity between the answer and the TREC queries:

\lambda\text{-sim} = \lambda \cos(ya, yq) + (1 - \lambda) \max_{tq} \cos(ya, tq)   (6)

where ya, yq and tq are the Yahoo! answer embedding, the Yahoo! question embedding, and the concatenated TREC query, narrative and description embedding respectively. This is a cruder version of temperature attention, where λ controls the contribution of each component directly; we vary λ to determine the optimal combination. Question-answer similarity (qasim) is the similarity between the question and answer embeddings alone, i.e., λ = 1. Both question and answer embeddings are obtained by averaging their individual token embeddings.

BERT Q&A (bert): Large-scale pre-trained transformers [25] are widely popular for NLP tasks, and BERT-like models have shown effectiveness on Q&A datasets such as SQuAD (https://rajpurkar.github.io/SQuAD-explorer/). We fine-tune the BERT base model with two different answer lengths: a) 128 tokens (bsl-128) and b) 256 tokens (bsl-256). The intuition is that large-scale pre-trained models are adept at language understanding and can be fine-tuned for new tasks with a small number of samples. We fine-tune BERT for both the Yahoo! Ans and Infobot datasets. It is non-trivial to include external information in BERT, and we leave this for future work.
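For reference, the λ-sim baseline of Eq. (6) above, and its qasim special case, can be sketched with averaged token embeddings as below. This is our own minimal NumPy illustration, not the authors' code:

```python
import numpy as np
from typing import Dict, List

def cos(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def avg_embedding(tokens: List[str], glove: Dict[str, np.ndarray], dim: int = 100) -> np.ndarray:
    """Average of pre-trained token vectors; zero vector if nothing is in vocabulary."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def lambda_sim(ya: np.ndarray, yq: np.ndarray, tq_list: List[np.ndarray], lam: float) -> float:
    """Eq. (6): mix of answer-question similarity and the best answer-to-external-query similarity.
    lam = 1.0 recovers the qasim baseline (question-answer similarity only)."""
    return lam * cos(ya, yq) + (1.0 - lam) * max(cos(ya, tq) for tq in tq_list)

# Toy usage with random vectors standing in for averaged GloVe embeddings:
rng = np.random.default_rng(0)
ya, yq = rng.normal(size=100), rng.normal(size=100)
tq_list = [rng.normal(size=100) for _ in range(50)]
score = lambda_sim(ya, yq, tq_list, lam=0.5)
```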
4.3. Evaluation Metrics

We evaluate the performance of our model using three popular ranking metrics: Precision (P@1), Recall (R@3), and Mean Reciprocal Rank (MRR). Each metric is described below.

Precision (P@k): Precision at position k evaluates the fraction of relevant answers retrieved up to position k. For both the Yahoo! Ans and Infobot [9] datasets, we evaluate whether the top answer (k = 1) in the ranked list is indeed correct. It is defined as:

\mathrm{Prec@}k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{\sum_{j=1}^{k} \mathbb{I}\{rel_{ij} = 1\}}{k}   (7)

where I{rel_ij = 1} indicates whether the answer at position j is relevant to the i-th question.

Recall (R@k): Recall at position k evaluates the fraction of relevant answers retrieved out of all answers marked relevant for a question; we report recall averaged over all queries in the test set. We use a cutoff of k = 3, which evaluates whether the model retrieves the correct answers within the top 3 positions. It is defined as:

\mathrm{Recall@}k = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{\sum_{j=1}^{k} \mathbb{I}\{rel_{ij} = 1\}}{|rel_i|}   (8)

where |rel_i| is the number of relevant answers for the i-th question.

MRR: MRR is the average of the reciprocal ranks of the most relevant answer for each question in the test set:

\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}   (9)

where |Q| is the number of queries in the test set and rank_i is the rank of the first relevant answer for the i-th query.
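The three metrics of Eqs. (7)-(9) can be computed per question as in this short sketch (a straightforward implementation of the standard definitions; function and variable names are ours):

```python
from typing import List, Tuple

def rank_metrics(ranked_labels: List[List[int]], k_p: int = 1, k_r: int = 3) -> Tuple[float, float, float]:
    """P@k_p, R@k_r and MRR over per-question relevance labels sorted by model score."""
    p_at_k, r_at_k, mrr = [], [], []
    for labels in ranked_labels:                      # labels[j] = 1 if the answer at rank j+1 is relevant
        p_at_k.append(sum(labels[:k_p]) / k_p)        # Eq. (7)
        n_rel = sum(labels)
        r_at_k.append(sum(labels[:k_r]) / n_rel if n_rel else 0.0)    # Eq. (8)
        first = next((j for j, y in enumerate(labels) if y == 1), None)
        mrr.append(1.0 / (first + 1) if first is not None else 0.0)   # Eq. (9)
    n = len(ranked_labels)
    return sum(p_at_k) / n, sum(r_at_k) / n, sum(mrr) / n

# Two toy test questions, answers already sorted by predicted score:
print(rank_metrics([[0, 1, 0, 0], [1, 0, 1]]))   # -> (0.5, 1.0, 0.75)
```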
4.4. Parameter Settings

Both primary datasets, Yahoo! Ans and Infobot, were divided into three parts: train (~60%), validation, and test (20%). The baseline models λ-sim and att are initialized with 100-dimensional GloVe embeddings (https://nlp.stanford.edu/projects/glove/). We performed a parameter sweep over λ and τ for the λ-sim and τ-att models with a step size of 0.1 between 0 and 1.0. We used the base uncased model for the bert implementation; we fine-tuned it for between 1 and 10 epochs and found that 3 epochs gave the best result on the validation set. We used an LSTM with 64 hidden units to represent questions, answers and all the information in the external datasets. We experimented with larger embedding sizes and more hidden units, but performance degraded significantly as the model tends to overfit the training data. Lastly, we used a batch size of 64 and trained the model for 30 epochs with early stopping.

5. Results

Our focus is to evaluate the utility of external information in improving answer ranking for the cQ&A task. We performed experiments to answer three main research questions:

RQ1: Does external information improve answer ranking?
RQ2: How does temperature (τ) compare with the λ parameter?
RQ3: What kind of queries/questions does the model attend to when ranking relevant/non-relevant answers?

Table 3: Evaluation with WHO external data.
           Yahoo! Ans                 Infobot
Model      P@1     R@3     MRR       P@1     R@3     MRR
τ-att      0.393   0.644   0.598     0.673   0.868   0.802
λ-sim      0.3743  0.633   0.578     0.551   0.817   0.7207
bsl-256    0.406   0.657   0.615     0.581   0.803   0.744
bsl-128    0.363   0.604   0.589     0.557   0.799   0.731
att        0.377   0.645   0.589     0.567   0.821   0.739
qasim      0.318   0.608   0.546     0.551   0.817   0.720
random     0.21    -       -         0.239   -       -

Table 4: Evaluation with TREC-COVID external data.
           Yahoo! Ans                 Infobot
Model      P@1     R@3     MRR       P@1     R@3     MRR
τ-att      0.532   0.778   0.715     0.606   0.842   0.766
λ-sim      0.326   0.616   0.555     0.556   0.813   0.722
bsl-256    0.406   0.657   0.615     0.581   0.803   0.744
bsl-128    0.363   0.604   0.589     0.557   0.799   0.731
att        0.291   0.495   0.494     0.601   0.833   0.762
qasim      0.318   0.608   0.546     0.551   0.817   0.720
random     0.21    -       -         0.239   -       -

RQ1: Does external information improve answer ranking? We evaluated the different models for ranking answers in the Yahoo! Ans and Infobot datasets in the presence of the TREC and WHO external datasets. We found that temperature-regulated attention models that incorporate external sources indeed outperform the baselines, as shown in Table 4 and Table 3. The τ-att model beats the bert models by ~30% in precision, ~18% in recall and ~16% in MRR on TREC data. However, τ-att does only marginally better than the att model in precision and MRR on Infobot data; we suspect this is due to the large set of query-document pairs in the TREC-COVID data compared to the smaller number of question-answer pairs in the Infobot dataset. Our results also clearly suggest that embedding-based matching of the question-answer pair (qasim) does not yield a good ranker, though it is better than choosing an answer at random (random). When WHO is used as the external dataset, the τ-att model is slightly worse than bert, which suggests that not all sources benefit the cQ&A task equally. Since the attention computation depends on the number of external query and key embeddings, it would be interesting to scale the computation in our model to incorporate several open external datasets in the future.

Yahoo! Ans questions are also assigned categories by users. A category-based breakdown of performance on the test set is given in Table 6 (Precision@1) and Table 5 (Recall@3), listing the categories with the largest number of questions in the test set. In all categories, our model outperforms the best λ-sim and qasim models. The largest improvement is for questions in the Family category, where our model improves on the λ-sim model by 71%. Ranking answers for questions from Society and Politics appears to be harder than for other categories. All models, however, are able to rank the top answer within the first three positions effectively, as Recall@3 is high for all categories.

Table 5: Recall@3 of models across categories.
Category            τ-att   λ-sim   qasim
Entertainment (47)  0.829   0.702   0.59
Health (62)         0.693   0.69    0.645
Politics (143)      0.727   0.629   0.587
Society (38)        0.578   0.473   0.42
Family (20)         0.85    0.750   0.65

Table 6: Precision@1 of models across categories.
Category            τ-att   λ-sim   qasim
Entertainment (47)  0.446   0.382   0.297
Health (62)         0.483   0.419   0.354
Politics (143)      0.45    0.300   0.272
Society (38)        0.28    0.157   0.236
Family (20)         0.6     0.350   0.40

RQ2: How does temperature (τ) compare with the λ parameter? We argued that linearly combining similarities between question and answer in the primary dataset and between the question and the external source may not be sufficient to boost performance. Our results confirm this: the λ-sim models do not perform better than the τ-att models, which indicates that more sophisticated models can learn to combine this information directly from training data. However, our experiments also indicate that the optimal value of τ varies across primary datasets and external sources. For instance, the τ-att model performed best at τ = 0.4 and τ = 0.9 for the Yahoo! Ans and Infobot datasets respectively when TREC was used as the external source, and at τ = 0.1 and τ = 0.5 respectively when WHO was used as the external source. We also varied τ beyond 1.0 to determine whether a trend emerges, as shown in Table 7: higher values of temperature tend to degrade model performance, and we found that the optimal temperature range is [0.1, 1]. Existing research on model distillation [26] has also found empirically that lower values of temperature yield better performance.

Table 7: Variation in P@1 and R@3 across different temperature values (τ > 1.0).
                 Prec@1                     Recall@3
Src + Ext        τ=10    τ=100   τ=1000     τ=10    τ=100   τ=1000
Y! + TREC        0.46    0.38    0.38       0.73    0.644   0.64
Y! + WHO         0.37    0.38    0.36       0.64    0.65    0.64
Ibot + TREC      0.44    0.59    0.39       0.72    0.81    0.75
Ibot + WHO       0.65    0.41    0.44       0.85    0.76    0.79

We also compared model performance in terms of precision when λ and τ are varied for the λ-sim and temperature-based models respectively, as shown in Figure 4. Temperature-based models peak at one value but
On the other hand, we observe that adding external information also helps the 𝜆-sim models until a certain threshold. Overall, both sets of models show that free-text external information can be incorporated to improve answer ranking performance. Figure 6: Infobot ques, its rel and non-rel ans and questions with 𝜏 -att model’s attention values for TREC queries. RQ3: What kind of queries/questions does the model attend to when ranking relevant/non- relevant answers? Attention based models have a this external knowledge need not always be structured very unique feature: they can aid explaining the internal text. However, it is worth noting that curated and reli- workings of neural network models. We inspect what able external sources may not always be available for all kind of queries/questions in external datasets does our domains. We addressed a very niche task in this work, model pay attention to while ranking relevant or non- and further research is required to extend it to incorpo- relevant answers. Figure 5 shows one such example of rate multiple external sources. We posit that with scal- Yahoo! question and incorporation of TREC data. At the able attention mechanisms, this work can be easily made time of scoring relevant answer, the model gives higher tractable for large external sources containing thousands weight to some queries compared to others. In the exam- or millions of entries in the future. ple, for instance, it assigns more weight to queries asso- ciated with masks or COVID virus response to weather changes. We observe higher attention weights for ques- 6. Conclusion tions when relevant answers are ranked than when non- relevant answers are scored. An example question, a Question answering platforms provide users with effec- relevant and non-relevant answer along with model at- tive and easy access to information. These platforms tention weights on TREC queries are shown from the also provide content on rapidly evolving sensitive topics Infobot data in Figure 6 respectively. It shows a simi- such as disease outbreaks (such as COVID-19) where it is lar trend where attention weights are high for external also useful to use external vetted information for ranking queries that are closely associated with the question an- answers. Existing work only exploits knowledge bases swer text. which have some limitations that makes it difficult to Overall, our experiments show that curated external use them for community Q&A for rapidly evolving top- information is useful for improving community ques- ics such as wild-fires or earthquakes. In this work, we tion answering task. Our experiments also indicate that tried to evaluate the effectiveness of external (free text or semi-structured) information in improving answer rank- question answering in social multimedia, in: ing models. We argue that simple question-answer text Proceedings of the 26th ACM International Con- matching may be insufficient and in presence of external ference on Multimedia, MM ’18, Association for knowledge, but temperature regulated attention models Computing Machinery, New York, NY, USA, 2018, can distill information better which in turn yields higher p. 456–464. URL: https://doi.org/10.1145/3240508. performance. Our proposed model with temperature reg- 3240626. doi:10.1145/3240508.3240626. ulated attention, when evaluated on two public datasets [8] J. Hu, S. Qian, Q. Fang, C. 
6. Conclusion

Question answering platforms provide users with effective and easy access to information. These platforms also provide content on rapidly evolving, sensitive topics such as disease outbreaks (e.g., COVID-19), where it is useful to draw on external vetted information for ranking answers. Existing work only exploits knowledge bases, which have limitations that make them difficult to use for community Q&A on rapidly evolving topics such as wild-fires or earthquakes. In this work, we evaluated the effectiveness of external (free-text or semi-structured) information for improving answer ranking models. We argue that simple question-answer text matching may be insufficient in the presence of external knowledge, whereas temperature-regulated attention models can distill this information better, which in turn yields higher performance. Our proposed model with temperature-regulated attention, when evaluated on two public datasets, showed significant improvements by augmenting information from two external curated sources. In future work, we aim to expand these experiments to other categories such as disaster relief and to scale the attention mechanism to include multiple external sources in one model.

References

[1] M. Surdeanu, M. Ciaramita, H. Zaragoza, Learning to rank answers on large online QA collections, in: ACL, 2008.
[2] Y. Shen, W. Rong, Z. Sun, Y. Ouyang, Z. Xiong, Question/answer matching for CQA system via combining lexical and sequential information, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, AAAI Press, 2015, pp. 275–281.
[3] L. Yang, Q. Ai, D. Spina, R.-C. Chen, L. Pang, W. B. Croft, J. Guo, F. Scholer, Beyond factoid QA: Effective methods for non-factoid answer sentence retrieval, in: European Conference on Information Retrieval, Springer, 2016, pp. 115–128.
[4] L. Hong, B. D. Davison, A classification-based approach to question answering in discussion boards, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009, pp. 171–178.
[5] D. H. Dalip, M. A. Gonçalves, M. Cristo, P. Calado, Exploiting user feedback to learn to rank answers in QA forums: A case study with Stack Overflow, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, ACM, New York, NY, USA, 2013, pp. 543–552. doi:10.1145/2484028.2484072.
[6] Y. Zhang, S. Qian, Q. Fang, C. Xu, Multi-modal knowledge-aware hierarchical attention network for explainable medical question answering, in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19, ACM, New York, NY, USA, 2019, pp. 1089–1097. doi:10.1145/3343031.3351033.
[7] J. Hu, S. Qian, Q. Fang, C. Xu, Attentive interactive convolutional matching for community question answering in social multimedia, in: Proceedings of the 26th ACM International Conference on Multimedia, MM '18, ACM, New York, NY, USA, 2018, pp. 456–464. doi:10.1145/3240508.3240626.
[8] J. Hu, S. Qian, Q. Fang, C. Xu, Hierarchical graph semantic pooling network for multi-modal community question answer matching, in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19, ACM, New York, NY, USA, 2019, pp. 1157–1165. doi:10.1145/3343031.3350966.
[9] A. Poliak, M. Fleming, C. Costello, K. W. Murray, M. Yarmohammadi, S. Pandya, D. Irani, M. Agarwal, U. Sharma, S. Sun, et al., Collecting verified COVID-19 question answer pairs, 2020.
[10] E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, L. L. Wang, TREC-COVID: Constructing a pandemic information retrieval test collection, 2020. arXiv:2005.04474.
[11] A. Rücklé, I. Gurevych, Representation learning for answer selection with LSTM-based importance weighting, in: IWCS 2017, 12th International Conference on Computational Semantics, Short papers, 2017. URL: https://www.aclweb.org/anthology/W17-6935.
[12] D. Cohen, W. Croft, End to end long short term memory networks for non-factoid question answering, 2016, pp. 143–146. doi:10.1145/2970398.2970438.
[13] X. Zhou, B. Hu, Q. Chen, B. Tang, X. Wang, Answer sequence learning with neural networks for answer selection in community question answering, arXiv preprint arXiv:1506.06490 (2015).
[14] L. Nie, X. Wei, D. Zhang, X. Wang, Z. Gao, Y. Yang, Data-driven answer selection in community QA systems, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 1186–1198.
[15] A. Severyn, A. Moschitti, Structural relationships for large-scale learning of answer re-ranking, in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012, pp. 741–750.
[16] X. Yang, M. Khabsa, M. Wang, W. Wang, A. H. Awadallah, D. Kifer, C. L. Giles, Adversarial training for community question answer selection based on multi-scale matching, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 395–402.
for explainable medical question answering, in: Awadallah, D. Kifer, C. L. Giles, Adversarial training Proceedings of the 27th ACM International Con- for community question answer selection based on ference on Multimedia, MM ’19, Association for multi-scale matching, in: Proceedings of the AAAI Computing Machinery, New York, NY, USA, 2019, Conference on Artificial Intelligence, volume 33, p. 1089–1097. URL: https://doi.org/10.1145/3343031. 2019, pp. 395–402. 3351033. doi:10.1145/3343031.3351033. [17] H. Huang, X. Wei, L. Nie, X. Mao, X.-S. Xu, From [7] J. Hu, S. Qian, Q. Fang, C. Xu, Attentive in- question to text: Question-oriented feature atten- teractive convolutional matching for community tion for answer selection, ACM Transactions on Information Systems 37 (2018) 1–33. doi:10.1145/ 3233771. [18] X. Zhang, S. Li, L. Sha, H. Wang, Attentive interac- tive neural networks for answer selection in com- munity question answering, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelli- gence, AAAI’17, AAAI Press, 2017, p. 3525–3531. [19] B. Wang, X. Wang, C.-J. Sun, B. Liu, L. Sun, Model- ing semantic relevance for question-answer pairs in web social communities, in: Proceedings of the 48th Annual Meeting of the Association for Com- putational Linguistics, 2010, pp. 1230–1238. [20] B. Kratzwald, A. Eigenmann, S. Feuerriegel, Rankqa: Neural question answering with answer re-ranking, CoRR abs/1906.03008 (2019). URL: http://arxiv.org/ abs/1906.03008. arXiv:1906.03008. [21] Y. Shen, Y. Deng, M. Yang, Y. Li, N. Du, W. Fan, K. Lei, Knowledge-aware attentive neural network for ranking question answer pairs, in: The 41st International ACM SIGIR Conference on Research Development in Information Retrieval, SIGIR ’18, Association for Computing Machinery, New York, NY, USA, 2018, p. 901–904. URL: https://doi.org/ 10.1145/3209978.3210081. doi:10.1145/3209978. 3210081. [22] P. Nakov, D. Hoogeveen, L. Màrquez, A. Moschitti, H. Mubarak, T. Baldwin, K. Verspoor, Semeval- 2017 task 3: Community question answering, arXiv preprint arXiv:1912.00730 (2019). [23] D. Su, Y. Xu, T. Yu, F. B. Siddique, E. J. Barezi, P. Fung, CAiRE-COVID: A question answering and multi- document summarization system for covid-19 re- search, arXiv 2005.03975 (2020). [24] A. Cohan, S. Feldman, I. Beltagy, D. Downey, D. S. Weld, Specter: Document-level representation learning using citation-informed transformers, in: Proceedings of the 58th Annual Meeting of the As- sociation for Computational Linguistics, 2020, pp. 2270–2282. [25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional trans- formers for language understanding, 2019. arXiv:1810.04805. [26] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, 2015. arXiv:1503.02531.