<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Powering COVID-19 community Q&amp;A with Curated Side Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manisha Verma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kapil Thadani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shaunak Mishra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yahoo! Research NYC</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Community question answering and discussion platforms such as Reddit, Yahoo! answers or Quora give users the flexibility to ask open ended questions of a large audience, and replies to such questions may be useful both to the user and the community on certain topics such as health, sports or finance. Given the recent events around COVID-19, some of these platforms have attracted 2000+ questions from users about several aspects associated with the disease. Given the impact of this disease on the general public, in this work we investigate ways to improve the ranking of user generated answers on COVID-19. We specifically explore the utility of external technical sources of side information (such as CDC guidelines or WHO FAQs) in improving answer ranking on such platforms. We found that ranking user answers based on question-answer similarity alone is not sufficient, and that existing models cannot effectively exploit external (side) information. In this work, we demonstrate the effectiveness of different attention based neural models that can directly exploit side information available in technical documents or verified forums (e.g., research publications on COVID-19 or the WHO website). Augmented with a temperature mechanism, the attention based neural models can selectively determine the relevance of side information for a given user question while ranking answers.</p>
      </abstract>
      <kwd-group>
        <kwd>question answering</kwd>
        <kwd>deep learning</kwd>
        <kwd>knowledge injection</kwd>
        <kwd>NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Question answering systems are key to finding relevant and timely information about several issues. Community question answering (cQ&amp;A) platforms such as Reddit, Yahoo! answers or Quora have been used to ask questions about wide ranging topics. Most of these platforms let users ask, answer, vote or comment on questions present on the platform. However, question answering platforms are useful not only for getting public opinions or votes about areas such as entertainment or sports, but can also serve as information hot-spots for more sensitive topics such as health, injuries or legal issues. Thus, it is imperative that when the user visits sensitive content, answer ranking also takes into account curated side information from reliable (external) sources. Most prior work on cQ&amp;A has focused on incorporating question-answer similarity [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], user reputation [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ], integration of multi-modal content [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], community interaction features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] associated with answers, or just the question answering network [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] on the platform. However, there is very limited work on incorporating curated content from external sources. Existing work only exploits knowledge bases [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that consist of different entities and the relationships between these entities to score answers. However, there are some limitations of knowledge bases that would make it difficult to use them for community Q&amp;A on rapidly evolving topics such as disease outbreaks (e.g. ebola, COVID-19), wild-fires or earthquakes. Firstly, knowledge bases contain information about established entities and do not rapidly evolve to incorporate new information, which makes them unreliable for novel disease outbreaks such as COVID-19, where information changes rapidly and its verification is time sensitive. Secondly, it may be hard to determine what even constitutes an entity as new information arrives about the topic. To overcome these limitations, in this work we posit that external curated free-text or semi-structured informational sources can also be used effectively for cQ&amp;A tasks.
      </p>
      <p>
        In this work, we demonstrate that free text or semi-structured external information sources such as CDC1, WHO2 or NHS3 can be very useful for ranking answers on community Q&amp;A platforms, since they contain frequently updated information about several topics such as ongoing disease outbreaks, vaccines, or resources about other topics such as surgeries, birth control or historical numerical data about diseases across the world. We argue that for sensitive topics such as COVID-19, it is useful to use publicly available vetted information for improving our ranking systems.
      </p>
      <p>
        Community Question and Answering (cQ&amp;A) systems is a well researched sub-field in both the information retrieval and NLP communities. Several systems have been proposed to rank user submitted answers to questions on community platforms such as Yahoo! answers, Reddit and Quora.
      </p>
      <p>
        [Figure 1: Illustrative example of COVID-19 community answer ranking powered by side information in the form of research papers, and information from verified sources (such as CDC, WHO, and NHS).]
      </p>
      <p>
        Ranking user submitted answers on community question-answering platforms has been addressed with several approaches. The primary method is to determine the relevance of an answer given an input question, and text based matching is one of the most common approaches to rank answers. Researchers have used several methods to compute similarity between a question and user generated answers to determine relevance. For instance, the authors in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] rank answers with 17 features extracted from unigrams, bigrams and web correlation features using unstructured user search logs, with questions extracted from Yahoo! answers for their experiments. Different representations such as doc2vec [14], tree-kernels [15], attention [17, 6, 11, 18] and deep belief networks [19] have also been used to score question and answer pairs, and other studies explore community, user and answer features [3, 4, 5, 7] to determine answer relevance. It is worth noting that user features and community features, when incorporated, may still yield further improvements in the performance of these models, but this is not the focus of our work.
      </p>
      <p>
        In this work, we explore the utility of publicly available side information for ranking answers, and specifically focus on questions and answers about COVID-19. We present experiments on two publicly available primary Q&amp;A datasets: a) Yahoo! Answers4 and b) a recently released annotated Q&amp;A dataset [9], together containing 2000+ user questions with 10K+ answers, in the presence of two external semi-structured curated sources: a) TREC-COVID [10] and b) WHO queries5. With temperature regulated attention over side information, recall for correct answer retrieval improves by ∼ 17% and ∼ 9% for both source datasets respectively over several other cQ&amp;A models.
      </p>
      <p>
        A recent work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] incorporates a medical KB for ranking answers on medical Q&amp;A platforms. The authors propose to learn path based representations of entities (from the KB) present in questions and answers posted by users. This approach relies on reliable detection of entities first, which may be absent for emerging topics such as the COVID-19 pandemic. Another limitation of this work is that external knowledge may not always be present in a structured format. For example, CDC guidelines are usually simple question-answer pairs posted on the website. This makes it difficult to apply their approach to our problem.
      </p>
      <p>
        4https://answers.yahoo.com/ 5https://www.who.int/emergencies/diseases/novelcoronavirus-2019/question-and-answers-hub
      </p>
      <p>The proposed approach in this work incorporates semi-structured information directly with the help of temperature regulated attention.</p>
      <p>
        Finally, with the rise of COVID-19, researchers across disciplines are actively publishing information and datasets to share understanding of the virus and its impact on people. Researchers routinely organize dedicated challenges such as SemEval [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] with tasks such as ranking answers on QA forums. One such initiative is the TREC-COVID track [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which released queries, documents and manual relevance judgements to power search for COVID related information. Authors in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] also released a COVID-19 related QA dataset with 100+ question and answer pairs extracted from the TREC COVID6 initiative.
      </p>
      <p>
        [Figure 2: External source augmentation model.]
      </p>
      <p>
        These question/answer pairs are not user generated content and hence do not reflect real user questions. We also rely on the recently released Q&amp;A dataset from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for our task. We also compile a dataset of 2000+ COVID-19 questions with 10K+ answers, all submitted by users on Yahoo! answers, for this work.
      </p>
      <sec id="sec-1-1">
        <title>3. Method</title>
        <sec id="sec-1-1-1">
          <title>3.1. Problem formulation</title>
          <p>
            In this work, we focus on ranking answers for N questions q_1, . . . , q_N related to an emerging topic such as COVID. Each question q is associated with a set of two or more answers A = {a_j : |A| ≥ 2} and corresponding labels Y = {y_j : |Y| ≥ 2} representing answer relevance. We use a binary indicator for relevance, i.e. y_j ∈ {0, 1}, where relevance judgments (e.g., favorite, upvoted) are provided by the user.
          </p>
          <p>
            We attempt to model the relevance of each answer a_j to its corresponding question using an external source which may contain free text or semi-structured information. For example, the external source could consist of information-seeking queries or questions x_1, . . . , x_M related to a topic, with each x_m linked to a set of relevant scientific articles or answers d_1, . . . , d_K, where each answer/document may be judged for relevance by human judges [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] or some experts.
          </p>
          <p>
            We hypothesize that this semi-structured or free-text information may be valuable in identifying user answer quality for certain kinds of questions, although not all. We investigate this with our model to recover the true label y_j for each user answer a_j ∈ A given its question q, category information, and information from the external source pairs ⟨x_m, d_m⟩.
          </p>
        </sec>
        <sec id="sec-1-1-2">
          <title>3.2. Proposed Model</title>
          <p>
            In this work, we explore a token-level matching mechanism to determine the relevance of information in the external source that may inform the label prediction task. Our model (T-att) aims to match a given user question with all the submitted answers in the presence of external information about the same domain. First, the question q, an answer a and additional metadata are encoded into d-dimensional vectors using a text encoder. We use an LSTM based encoder for both question and answer in the primary source, which can handle input sequences of variable length.
          </p>
          <p>
            Question Encoding: Each word w_t in a question is represented as a d-dimensional vector with pre-trained word embeddings. The LSTM takes each token embedding as input and updates hidden state h_t based on the previous state h_{t−1}. Finally, the hidden state is input to a feed-forward layer with smaller dimension k &lt; d to compress the question encoding as follows:

            h_t = LSTM(h_{t−1}, w_t),   e_q = f(W h_t + b)   (1)
          </p>
          <p>
            Answer Encoding: Each word w_t in the answer is also represented as a d-dimensional vector with pre-trained word embeddings. The LSTM takes each token embedding as input and updates hidden state h_t. We also reduce the dimension of the answer encoding with a feed-forward layer of dimension k &lt; d:

            h_t = LSTM(h_{t−1}, w_t),   e_a = f(W h_t + b)   (2)
          </p>
          <p>
            We concatenate the question and answer representations for further processing:

            e_qa = [e_q, e_a]   (3)
          </p>
          <p>
            External source encoding: External sources of information can vary from task to task. We encode each segment of the data individually. For instance, if there are two segments in the source (e.g. question/answer or query/document), our system encodes both segments individually, using the same encoding architecture used for the primary source question/answer encoding above. The encoding for a two segment external source is:

            h_t = LSTM(h_{t−1}, x_t),   e_x = f(W h_t + b)
            h_t = LSTM(h_{t−1}, d_t),   e_d = f(W h_t + b)   (4)
          </p>
          <p>
            We incorporate the external source encoding with a temperature (T) based variant of scaled dot-product attention, which provides a straightforward conditioning approach over a set of query-document pairs. The question encoding vector e_q serves as a query over keys e_x. If two segments are present in the external source, such as query/document, the model uses the attention weights over the first segment (e.g. query) to determine the importance of the second segment (e.g. document). It is easy to extend this framework to external sources with multiple segments. The two segment attention is:

            s_m = e_q⊤ e_x_m / √k,   α_m = exp(s_m / T) / ∑_m′ exp(s_m′ / T),   e′ = ∑_m α_m e_d_m   (5)
          </p>
          <p>
            To summarize, temperature (T) based attention helps determine the relevance of each external source entry and its corresponding document encoding with respect to the question encoding. The temperature (T) parameter helps us control the uniformity of the attention weights. Finally, the label is predicted using the input vector and the learned weighted average of side information e′:

            ŷ = output([e_qa ; e′])

            where output uses a sigmoid activation function. We use binary cross entropy loss to train the proposed model. Since community questions may often be entirely unrelated to external sources, a key aspect of this approach is determining whether the external source is useful at all, not merely attending to its entries that are most relevant. Temperature based attention is useful in controlling which external source entries are useful for user questions. It is worth noting that one will have to experiment with and tune the value of the temperature T such that ranking performance improves.
          </p>
          <p>6https://ir.nist.gov/covidSubmit/data.html</p>
        </sec>
      </sec>
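      <p>
        To make the temperature mechanism concrete, the attention of Eqs. (4)-(5) can be sketched as follows. This is a minimal numpy sketch with illustrative array names and dimensions, not the authors' implementation:
      </p>

```python
import numpy as np

def temperature_attention(q, keys, values, T=0.5):
    """Temperature-scaled dot-product attention (cf. Eq. 5).

    q      : (k,)   question encoding, used as the attention query
    keys   : (M, k) encodings of the first external segment (e.g. TREC queries)
    values : (M, k) encodings of the second segment (e.g. documents)
    T      : temperature; T = 1 recovers plain softmax attention,
             smaller T sharpens the weights onto a few entries.
    """
    k = q.shape[0]
    scores = keys @ q / np.sqrt(k)               # s_m = e_q . e_x_m / sqrt(k)
    z = scores / T
    w = np.exp(z - np.max(z))                    # numerically stable softmax
    alpha = w / w.sum()                          # attention weights alpha_m
    side = alpha @ values                        # e' = sum_m alpha_m * e_d_m
    return side, alpha

# toy example: 3 external (query, document) pairs with 4-dim encodings
rng = np.random.default_rng(0)
q = rng.normal(size=4)
keys, values = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))

_, sharp = temperature_attention(q, keys, values, T=0.1)
_, flat = temperature_attention(q, keys, values, T=1.0)
# lower temperature concentrates the attention mass on fewer entries
assert sharp.max() > flat.max()
```

      <p>
        Lowering T below 1 sharpens the weights onto the few external entries most similar to the question, while T = 1 reduces to the plain linear attention used as the att baseline.
      </p>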
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Setup</title>
      <p>Given the model architecture, in this section we provide a detailed overview of the different datasets, metrics and baselines used in our experiments.</p>
      <sec id="sec-2-1">
        <title>4.1. Data</title>
        <p>
          We compiled two question answering datasets. The first was collected from Yahoo! answers and the second was recently released in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]; both datasets have questions raised by real users. In this work we focus specifically on questions associated with COVID-19. Different statistics about the train and test splits of both Q&amp;A datasets are given in Table 2. A pair of relevant and non-relevant answers for a question in both datasets is also shown in Table 1 for reference. More details are given below. [Figure 3: distributions of (a) Yahoo! ques length, (b) Yahoo! ans length, (c) Infobot ques length, (d) Infobot ans length.]
        </p>
        <p>
          Yahoo! Dataset: We crawled COVID-19 related questions from Yahoo! answers7 using several keywords such as 'coronavirus', 'covid-19', 'covid', 'sars-cov2' and 'corona virus' between Jan 2020 and July 2020, to ensure we gathered all possible questions for our experiments. We keep only those questions that have two or more answers. In total, we obtained 1880 questions with 11500 answers. We used favorite answers as positive labels (similar to previous work [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]), assuming that users, over time, rate answers (with upvotes/downvotes) that are most relevant to the submitted question. We normalized the question and answer text by removing a small list of stop words, numbers, links and symbols. Figures 3a and 3b show the distributions of question and answer lengths respectively. Questions contain 12.7 ± 5.8 words (qwords) and answers contain 36.3 ± 93.5 (mean ± std) words (awords) respectively, which indicates that user submitted answers can vary widely on Yahoo! answers. On average, a question has about 6 answers (ans/q) in the Yahoo! ans dataset. We split the data into three sets: train (64%, 1196 questions, 7435 answers), validation (16%, 298 questions, 1858 answers) and test (20%, 374 questions, 2310 answers), where the questions for each set were uniformly sampled.
        </p>
        <p>
          Infobot Dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]: Researchers at JHU [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] have recently compiled a list of user submitted questions on different platforms and manually labeled 22K+ question-answer pairs. We cleaned this set by removing questions with fewer than two answers or no relevant answers. In total, our dataset contains 8000+ question answer pairs, where each question may have multiple relevant answers, unlike the Yahoo! answers dataset. Figures 3c and 3d show the distributions of question and answer lengths respectively.
        </p>
        <p>7https://answers.search.yahoo.com/search?p=coronavirus</p>
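      <p>
        The question-level split described above (64/16/20, sampling whole questions so that all answers to a question land in the same split) can be sketched as follows. The helper below is hypothetical, not the authors' code:
      </p>

```python
import random

def split_by_question(qa_pairs, seed=13):
    """Split (question_id, answer, label) rows into train/val/test at the
    *question* level, so answers to one question never straddle splits.
    Proportions follow the paper: 64% train, 16% validation, 20% test."""
    qids = sorted({q for q, _, _ in qa_pairs})
    random.Random(seed).shuffle(qids)          # uniform sampling of questions
    n = len(qids)
    n_train, n_val = int(0.64 * n), int(0.16 * n)
    train_q = set(qids[:n_train])
    val_q = set(qids[n_train:n_train + n_val])
    buckets = {"train": [], "val": [], "test": []}
    for row in qa_pairs:
        qid = row[0]
        key = "train" if qid in train_q else "val" if qid in val_q else "test"
        buckets[key].append(row)
    return buckets

# toy data: 10 questions with 2 answers each
data = [(q, f"a{j}", j % 2) for q in range(10) for j in range(2)]
parts = split_by_question(data)
assert sum(len(v) for v in parts.values()) == len(data)
```

      <p>
        Splitting by question rather than by row keeps evaluation honest: a model never sees any answer to a test question at training time.
      </p>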
        <sec id="sec-2-1-1">
          <title>4.1.1. External sources</title>
          <p>We use two external datasets to rank answers. Details of each dataset are given below.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>TREC COVID [10]</title>
        <p>
          We use the recently released TREC COVID-19 track data with 50 queries, which also contains manually drafted query descriptions and narratives. Expert judges have labeled over 5000 scientific documents for these 50 queries from the CORD-19 dataset8 (https://www.semanticscholar.org/cord19). These documents contain coronavirus related research. Since the documents are scientific literature, we initialize document embeddings using SPECTER [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>WHO</title>
        <p>We use data released on the question and answer hub of the WHO9 website to create a list of question-answer pairs. There are 147 question and answer pairs in this dataset, where questions contain 13.28 ± 5.36 words and answers contain 133.2 ± 100.9 words respectively.</p>
      </sec>
      <sec id="sec-2-5">
        <title>4.2. Baselines</title>
        <p>We evaluated our model against embedding similarity baselines. We computed four baselines as follows.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Random</title>
        <p>An answer is chosen at random as relevant for a user question. This is expected to provide a lower bound on retrieval performance.</p>
        <p>Linear Attention (att): When T = 1.0, our model defaults to simple linear attention over all the information present in the external sources. This gives an indication of how well the model performs when it is forced to look at all the information in the external source.</p>
      </sec>
      <sec id="sec-2-7">
        <title>Linear combination (T-sim)</title>
        <p>
          We linearly combine similarities between the Yahoo! question-answer pair and the TREC query-answer pair as shown below:

          T-sim = λ · sim(q, a) + (1 − λ) · max_t sim(a, t)   (6)

          where a, q and t are the Yahoo! answer, the question and the concatenated TREC query, narrative and description embeddings respectively. This is a cruder version of temperature attention, where λ controls the contribution of each component directly. We vary λ to determine the optimal combination. Question-Answer similarity (qasim) is the similarity between the question and answer embeddings, i.e. λ = 1. Both question and answer embeddings are obtained by averaging over their individual token embeddings.
        </p>
        <p>
          • Precision (P@k): Precision at position k evaluates the fraction of relevant answers retrieved until position k. For both datasets, Yahoo! ans and Infobot, we evaluate whether the top answer (k = 1) in the ranked list is indeed correct:

          P@k = (1 / |Q|) ∑_i (1 / k) ∑_{j=1..k} I{y_ij = 1}   (7)

          where I{y_ij = 1} indicates whether the answer at position j is relevant to the i-th question.
        </p>
        <p>
          • Recall (R@k): Recall at position k evaluates the fraction of relevant answers retrieved from all the answers marked relevant for a question, averaged over all the queries in the test set. For recall, we take a cutoff of k = 3, which evaluates whether the model is able to retrieve the correct answers in the top 3 positions:

          R@k = (1 / |Q|) ∑_i (1 / |R_i|) ∑_{j=1..k} I{y_ij = 1}   (8)

          where |R_i| is the number of relevant answers for the i-th question.
        </p>
        <p>
          • MRR: evaluates the average of the reciprocal ranks corresponding to the most relevant answer for the questions in the test set:

          MRR = (1 / |Q|) ∑_i 1 / rank_i   (9)

          where |Q| is the number of queries in the test set and rank_i is the rank of the first relevant answer for the i-th query.
        </p>
        <p>9https://www.who.int/emergencies/diseases/novelcoronavirus-2019/question-and-answers-hub</p>
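        <p>
          The linear combination of Eq. (6) can be sketched as follows, assuming averaged token embeddings for the question, the answer and each TREC entry, with cosine similarity as sim. This is an illustrative sketch, not the authors' implementation:
        </p>

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def t_sim_score(q_emb, a_emb, trec_embs, lam):
    """lam * sim(q, a) + (1 - lam) * max_t sim(a, t)  (Eq. 6).
    lam = 1.0 reduces to the plain question-answer similarity (qasim)."""
    qa = cosine(q_emb, a_emb)
    ext = max(cosine(a_emb, t) for t in trec_embs)
    return lam * qa + (1.0 - lam) * ext

# toy embeddings: a question, an answer and 5 TREC entries
rng = np.random.default_rng(1)
q, a = rng.normal(size=8), rng.normal(size=8)
trec = [rng.normal(size=8) for _ in range(5)]

# lam = 1 ignores the external source entirely (the qasim baseline)
assert abs(t_sim_score(q, a, trec, 1.0) - cosine(q, a)) < 1e-12
```

        <p>
          Sweeping lam over {0, 0.1, ..., 1.0} and ranking each question's answers by this score reproduces the T-sim baseline family described above.
        </p>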
      </sec>
      <sec id="sec-2-8">
        <title>4.3. Evaluation Metrics</title>
        <p>
          We evaluate the performance of our model using three popular ranking metrics, namely Precision (P@1), Mean Reciprocal Rank (MRR), and Recall (R@3).
        </p>
        <p>
          BERT Q&amp;A (bert): Large scale pre-trained transformers [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] are widely popular for NLP tasks. BERT-like models have shown effectiveness on Q&amp;A datasets such as SQUAD10. We fine-tune the BERT base model with two different answer lengths: a) 128 (bsl-128) and b) 256 tokens (bsl-256) respectively. The intuition is that large scale pre-trained models are adept at language understanding and can be fine-tuned for new tasks with a small number of samples. We fine-tune BERT for both datasets, Yahoo! ans and Infobot, respectively. It is non-trivial to include external information in BERT and we leave this for future work.
        </p>
        <p>
          4.4. Parameter Settings: Both primary datasets, Yahoo! ans and Infobot, were divided into three parts: train (∼ 60%), validation and test (20%) respectively. The baseline models T-sim and T-att are initialized with GloVe embeddings11 of 100 dimensions. We performed a parameter sweep over λ and T for the T-sim and T-att models with a step size of 0.1 between {0, 1.0} respectively. We used the base uncased model for the BERT implementation. We fine-tuned the model between 1-10 epochs and found that 3 epochs gave the best result on the validation set. We used an LSTM with 64 hidden units to represent the question, answer and all the information in external datasets. We experimented with larger embedding sizes and hidden units, but performance degraded significantly as the model tends to overfit on the training data. Lastly, we used a batch size of 64 and trained the model for 30 epochs with early stopping.
        </p>
        <p>10https://rajpurkar.github.io/SQuAD-explorer/ 11https://nlp.stanford.edu/projects/glove/</p>
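        <p>
          The three metrics above can be sketched per query as follows, where ranked_labels is the list of relevance labels in model-ranked order (a sketch; the per-query values are then averaged over the test set):
        </p>

```python
def precision_at_k(ranked_labels, k=1):
    """Fraction of relevant answers in the top k positions (Eq. 7, per query)."""
    return sum(ranked_labels[:k]) / k

def recall_at_k(ranked_labels, k=3):
    """Fraction of all relevant answers retrieved in the top k (Eq. 8, per query)."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0

def mrr(ranked_labels):
    """Reciprocal rank of the first relevant answer (Eq. 9, per query)."""
    for i, y in enumerate(ranked_labels, start=1):
        if y:
            return 1.0 / i
    return 0.0

# one query with 5 answers, relevant ones at ranks 2 and 4
labels = [0, 1, 0, 1, 0]
assert precision_at_k(labels, 1) == 0.0
assert recall_at_k(labels, 3) == 0.5
assert mrr(labels) == 0.5
```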
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results</title>
      <p>
        Since attention is dependent on the input query and key embedding lengths, it would be interesting to scale the computation in our model to incorporate several open external datasets and overcome this limitation in the future.
      </p>
      <p>
        Yahoo! ans questions are also assigned categories by users. A category based breakdown of performance on the test set is given in Table 6 and Table 5 respectively, where the categories with the largest number of questions in the test set are listed: Entertainment (47), Health (62), Politics (143), Society (38) and Family (20), with qasim scores of 0.59, 0.645, 0.587, 0.42 and 0.65 respectively. In all the categories, our model outperforms the best T-sim and qasim models respectively. The largest improvement is for questions in the Family category, where our model achieves an improvement of 71% over the T-sim model. It seems that ranking answers for questions from Society and Politics is harder than for other categories. All the models, however, are able to rank the top answer in the first three positions effectively, as Recall@3 is high for all the categories.
      </p>
      <sec id="sec-3-1">
        <title>Research questions</title>
        <p>In this work, our focus is to evaluate the utility of external information in improving answer ranking for the cQ&amp;A task. Thus, we performed experiments to answer three main research questions, listed below.</p>
        <p>RQ1: Does external information improve answer ranking?
RQ2: How does temperature (T) compare with the λ parameter?
RQ3: What kind of queries/questions does the model attend to when ranking relevant/non-relevant answers?</p>
        <sec id="sec-3-1-1">
          <title>RQ1: Does external information improve answer ranking?</title>
          <p>
            We evaluated different models for ranking answers in the Yahoo! ans and Infobot datasets in the presence of the TREC and WHO datasets respectively. We found that temperature regulated attention models that incorporate external sources indeed outperform the baselines, as shown in Table 4 and Table 3 respectively. The (T-att) model beats the bert models by ∼ 30% in precision, ∼ 18% in recall and ∼ 16% in MRR respectively on TREC data. However, (T-att) does only marginally better than the att model in precision and MRR on Infobot data. We suspect that this is due to the large set of query-document pairs in TREC-COVID data compared to the fewer question-answer pairs in the Infobot dataset. Our results also clearly suggest that embedding based matching of the question-answer pair (qasim) does not yield a good ranker, though it is better than choosing an answer at random (random). When WHO is used as an external dataset, we find that the (T-att) model is slightly worse than bert. This suggests that not all sources would equally benefit the cQ&amp;A task.
          </p>
          <p>
            RQ2: How does temperature (T) compare with the λ parameter? We argued that linearly combining similarities between question-answer in the primary dataset and between question-external source may not be sufficient to boost performance. We observe this in our results too, i.e. T-sim models do not perform better than (T-att) models. This clearly indicates that more sophisticated models can learn to combine this information directly from training data. However, our experiments indicate that the optimal value of (T) varies across primary datasets and external sources of information. For instance, the (T-att) model performed best when T = 0.4 and T = 0.9 for the Yahoo! ans and Infobot datasets respectively when TREC was used as the external source. It performed best when T = 0.1 and T = 0.5 for the Yahoo! ans and Infobot datasets respectively when WHO was used as the external source. We also varied T beyond 1.0 to determine whether it yielded a trend, as shown in Table 7. Higher values of temperature seem to degrade model performance; we found that the optimal temperature range is between [0.1 − 1]. Existing research in model distillation [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ] has also empirically found that lower values of temperature yield better performance. We also compared model performance in terms of precision when λ and T are varied for the T-sim models and temperature based models respectively, as shown in Figure 4 (panels: (a) Yahoo!+TREC, (b) Yahoo!+WHO, (c) Infobot+TREC, (d) Infobot+WHO). Temperature based models peak at one value but
          </p>
          <p>
            do not have a clear trend, indicating that one needs to explore different T values at the time of training for better performance. On the other hand, we observe that adding external information also helps the T-sim models until a certain threshold. Overall, both sets of models show that free-text external information can be incorporated to improve answer ranking performance. [Table 7 rows (Src + Ext): Y! + TREC, Y! + WHO, Ibot + TREC, Ibot + WHO.]
          </p>
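          <p>
            The tuning procedure implied above (sweep T in steps of 0.1 and keep the value with the best validation score) can be sketched as follows. The evaluate callable is a hypothetical stand-in for training and evaluating the T-att model at a given temperature:
          </p>

```python
def best_temperature(evaluate, grid=None):
    """Pick the temperature with the highest validation score.

    evaluate : maps a temperature to a validation metric (e.g. P@1);
               a stand-in here for a full train/evaluate cycle.
    grid     : candidate temperatures; defaults to 0.1 .. 1.0 in 0.1 steps,
               the range found effective in the paper."""
    grid = grid or [round(0.1 * i, 1) for i in range(1, 11)]
    scores = {t: evaluate(t) for t in grid}
    return max(scores, key=scores.get), scores

# toy evaluate with a peak at T = 0.4 (mimicking the Yahoo!+TREC optimum)
best, scores = best_temperature(lambda t: -abs(t - 0.4))
assert best == 0.4
```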
          <p>RQ3: What kind of queries/questions does the
model attend to when ranking
relevant/nonrelevant answers? Attention based models have a this external knowledge need not always be structured
very unique feature: they can aid explaining the internal text. However, it is worth noting that curated and
reliworkings of neural network models. We inspect what able external sources may not always be available for all
kind of queries/questions in external datasets does our domains. We addressed a very niche task in this work,
model pay attention to while ranking relevant or non- and further research is required to extend it to
incorporelevant answers. Figure 5 shows one such example of rate multiple external sources. We posit that with
scalYahoo! question and incorporation of TREC data. At the able attention mechanisms, this work can be easily made
time of scoring relevant answer, the model gives higher tractable for large external sources containing thousands
weight to some queries compared to others. In the exam- or millions of entries in the future.
ple, for instance, it assigns more weight to queries
associated with masks or COVID virus response to weather
changes. We observe higher attention weights for ques- 6. Conclusion
tions when relevant answers are ranked than when
nonrelevant answers are scored. An example question, a Question answering platforms provide users with
efecrelevant and non-relevant answer along with model at- tive and easy access to information. These platforms
tention weights on TREC queries are shown from the also provide content on rapidly evolving sensitive topics
Infobot data in Figure 6 respectively. It shows a simi- such as disease outbreaks (such as COVID-19) where it is
lar trend where attention weights are high for external also useful to use external vetted information for ranking
queries that are closely associated with the question an- answers. Existing work only exploits knowledge bases
swer text. which have some limitations that makes it dificult to</p>
          <p>Overall, our experiments show that curated external use them for community Q&amp;A for rapidly evolving
topinformation is useful for improving community ques- ics such as wild-fires or earthquakes. In this work, we
tion answering task. Our experiments also indicate that tried to evaluate the efectiveness of external (free text or
semi-structured) information in improving answer
ranking models. We argue that simple question-answer text
matching may be insuficient and in presence of external
knowledge, but temperature regulated attention models
can distill information better which in turn yields higher
performance. Our proposed model with temperature
regulated attention, when evaluated on two public datasets
showed significant improvements by augmenting
information from two external curated sources of information.</p>
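<p>The attention-weight inspection used to answer RQ3 can be sketched as follows (a hypothetical helper of our own; the query texts and weights below are illustrative, not taken from the paper's figures):</p>

```python
def top_attended(ext_queries, weights, k=3):
    """Return the k external queries the model attends to most,
    as (query, weight) pairs sorted by descending attention weight."""
    ranked = sorted(zip(ext_queries, weights), key=lambda qw: qw[1], reverse=True)
    return ranked[:k]
```

<p>Listing the top-attended TREC or WHO queries for a relevant versus a non-relevant answer makes the evidence behind a ranking decision directly inspectable, as in Figures 5 and 6.</p>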
          <p>In future, we aim to expand these experiments to other
categories such as disaster relief and scale the attention
mechanism to include multiple external sources in one
model.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciaramita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>Learning to rank answers on large online qa collections</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>Question/answer matching for cqa system via combining lexical and sequential information</article-title>
          ,
          <source>in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence</source>
, AAAI'15
          , AAAI Press,
          <year>2015</year>
          , p.
          <fpage>275</fpage>
          -
          <lpage>281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          , R.-C. Chen,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <article-title>Beyond factoid qa: Effective methods for non-factoid answer sentence retrieval</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <article-title>A classification-based approach to question answering in discussion boards</article-title>
          ,
          <source>in: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>178</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Dalip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cristo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Calado</surname>
          </string-name>
          ,
          <article-title>Exploiting user feedback to learn to rank answers in qa forums: A case study with stack overflow</article-title>
          ,
          <source>in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '13,
          Association for Computing Machinery, New York, NY, USA,
          <year>2013</year>
          , p.
          <fpage>543</fpage>
          -
          <lpage>552</lpage>
          . URL: https://doi.org/10.1145/2484028.2484072. doi:10.1145/2484028.2484072.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Multi-modal knowledge-aware hierarchical attention network for explainable medical question answering</article-title>
          ,
          <source>in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>1089</fpage>
          -
          <lpage>1097</lpage>
          . URL: https://doi.org/10.1145/3343031.3351033. doi:10.1145/3343031.3351033.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Attentive interactive convolutional matching for community question answering in social multimedia</article-title>
          ,
          <source>in: Proceedings of the 26th ACM International Conference on Multimedia, MM '18</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>456</fpage>
          -
          <lpage>464</lpage>
          . URL: https://doi.org/10.1145/3240508.3240626. doi:10.1145/3240508.3240626.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Hierarchical graph semantic pooling network for multi-modal community question answer matching</article-title>
          ,
          <source>in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>1157</fpage>
          -
          <lpage>1165</lpage>
          . URL: https://doi.org/10.1145/3343031.3350966. doi:10.1145/3343031.3350966.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Poliak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fleming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Costello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yarmohammadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pandya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Irani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          , et al.,
          <source>Collecting verified covid-19 question answer pairs</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bedrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Hersh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Trec-covid: Constructing a pandemic information retrieval test collection</article-title>
          ,
          <year>2020</year>
          . arXiv:2005.04474.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Representation learning for answer selection with LSTM-based importance weighting</article-title>
          ,
          <source>in: IWCS 2017 - 12th International Conference on Computational Semantics - Short papers</source>
          ,
          <year>2017</year>
          . URL: https://www.aclweb.org/ anthology/W17-6935.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>End to end long short term memory networks for non-factoid question answering</article-title>
          ,
          <year>2016</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>146</lpage>
          . doi:10.1145/2970398.2970438.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Answer sequence learning with neural networks for answer selection in community question answering</article-title>
          ,
          <source>arXiv preprint arXiv:1506.06490</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Data-driven answer selection in community qa systems</article-title>
          ,
          <source>IEEE transactions on knowledge and data engineering 29</source>
          (
          <year>2017</year>
          )
          <fpage>1186</fpage>
          -
          <lpage>1198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Severyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moschitti</surname>
          </string-name>
          ,
          <article-title>Structural relationships for large-scale learning of answer re-ranking</article-title>
          ,
          <source>in: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>741</fpage>
          -
          <lpage>750</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khabsa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Awadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <article-title>Adversarial training for community question answer selection based on multi-scale matching</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>395</fpage>
          -
          <lpage>402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>From question to text: Question-oriented feature attention for answer selection</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>37</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          . doi:10.1145/3233771.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Attentive interactive neural networks for answer selection in community question answering</article-title>
          ,
          <source>in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</source>
          , AAAI'17
          , AAAI Press,
          <year>2017</year>
          , p.
          <fpage>3525</fpage>
          -
          <lpage>3531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Modeling semantic relevance for question-answer pairs in web social communities</article-title>
          ,
          <source>in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1230</fpage>
          -
          <lpage>1238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kratzwald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eigenmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feuerriegel</surname>
          </string-name>
          ,
          <article-title>RankQA: Neural question answering with answer re-ranking</article-title>
          , CoRR abs/1906.03008 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1906.03008. arXiv:1906.03008.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <article-title>Knowledge-aware attentive neural network for ranking question answer pairs</article-title>
          ,
          <source>in: The 41st International ACM SIGIR Conference on Research Development in Information Retrieval</source>
          , SIGIR '18,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>901</fpage>
          -
          <lpage>904</lpage>
          . URL: https://doi.org/10.1145/3209978.3210081. doi:10.1145/3209978.3210081.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoogeveen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moschitti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verspoor</surname>
          </string-name>
          ,
          <article-title>SemEval-2017 task 3: Community question answering</article-title>
          ,
          <source>arXiv preprint arXiv:1912.00730</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. B.</given-names>
            <surname>Siddique</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Barezi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>CAiRE-COVID: A question answering and multi-document summarization system for covid-19 research</article-title>
          ,
          <source>arXiv preprint arXiv:2005.03975</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <article-title>Specter: Document-level representation learning using citation-informed transformers</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2270</fpage>
          -
          <lpage>2282</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <year>2015</year>
          . arXiv:1503.02531.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>