=Paper=
{{Paper
|id=Vol-1749/paper_022
|storemode=property
|title=Overview of the EVALITA 2016 Question Answering for Frequently Asked Questions (QA4FAQ) Task
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_022.pdf
|volume=Vol-1749
|authors=Annalina Caputo,Marco de Gemmis,Pasquale Lops,Francesco Lovecchio,Vito Manzari
|dblpUrl=https://dblp.org/rec/conf/clic-it/CaputoGLLM16
}}
==Overview of the EVALITA 2016 Question Answering for Frequently Asked Questions (QA4FAQ) Task==
Annalina Caputo (1), Marco de Gemmis (2,5), Pasquale Lops (2), Francesco Lovecchio (3), Vito Manzari (4)

* (1) ADAPT Centre, Dublin: annalina.caputo@adaptcentre.ie
* (2) Department of Computer Science, University of Bari Aldo Moro: {marco.degemmis,pasquale.lops}@uniba.it
* (3) Acquedotto Pugliese (AQP) S.p.a.: f.lovecchio@aqp.it
* (4) Sud Sistemi S.r.l.: manzariv@sudsistemi.it
* (5) QuestionCube S.r.l.: marco.degemmis@questioncube.com

===Abstract===
English. This paper describes the first edition of the Question Answering for Frequently Asked Questions (QA4FAQ) task at the EVALITA 2016 campaign. The task concerns the retrieval of relevant frequently asked questions, given a user query. The main objective of the task is the evaluation of both question answering and information retrieval systems in this particular setting, in which the document collection is composed of FAQs. The data used for the task were collected in a real scenario by AQP Risponde, a semantic retrieval engine used by Acquedotto Pugliese (AQP, the organization that manages the public water supply in the South of Italy) to support its customer care. The system is developed by QuestionCube, an Italian startup company which designs Question Answering tools.

Italiano. Questo lavoro descrive la prima edizione del Question Answering for Frequently Asked Questions (QA4FAQ) task proposto durante la campagna di valutazione EVALITA 2016. Il task consiste nel recuperare le domande più frequenti rilevanti rispetto ad una domanda posta dall'utente. L'obiettivo principale del task è la valutazione di sistemi di question answering e di recupero dell'informazione in un contesto applicativo reale, utilizzando i dati provenienti da AQP Risponde, un motore di ricerca semantico usato da Acquedotto Pugliese (AQP, l'ente per la gestione dell'acqua pubblica nel Sud Italia). Il sistema è sviluppato da QuestionCube, una startup italiana che progetta soluzioni di Question Answering.

===1 Motivation===
Searching within the Frequently Asked Questions (FAQ) page of a web site is a critical task: customers may feel overloaded by many irrelevant questions and become frustrated by the difficulty of finding the FAQ that fits their problem. The relevant FAQ may well be there, just worded differently from what they expect.

The proposed task consists in retrieving a list of relevant FAQs and corresponding answers related to the query issued by the user.

Acquedotto Pugliese (AQP) developed a semantic retrieval engine for FAQs, called AQP Risponde (http://aqprisponde.aqp.it/ask.php), based on Question Answering (QA) techniques. The system allows customers to ask their own questions and retrieves a list of relevant FAQs and corresponding answers. Furthermore, customers can select one FAQ among those retrieved by the system and provide feedback about the perceived accuracy of the answer.

AQP Risponde poses relevant research challenges concerning both the use of the Italian language in a deep QA architecture and the variety of language expressions adopted by customers to formulate the same information need.

The proposed task is strongly related to the task on Answer Selection in Community Question Answering recently organized at SemEval 2015 and 2016 (Nakov et al., 2015). That task aims to automate the process of finding good answers to new questions in a community-created discussion forum (e.g., by retrieving similar questions in the forum and by identifying the posts in the answer threads of similar questions that also answer the original one). Moreover, QA4FAQ has some common points with the Textual Similarity task (Agirre et al., 2015), which has received an increasing amount of attention in recent years.

The paper is organized as follows: Section 2 describes the task, while Section 3 provides details about the competing systems. Results of the task are discussed in Section 4.
===2 Task Description: Dataset, Evaluation Protocol and Measures===
The task concerns the retrieval of relevant frequently asked questions, given a user query. To define an evaluation protocol, we need a set of FAQs, a set of user questions, and a set of relevance judgments for each question. In order to collect these data, we exploit an application called AQP Risponde, developed by QuestionCube for Acquedotto Pugliese. AQP Risponde provides a back-end that allows the analysis of both the query log and the customers' feedback to discover, for instance, new emerging problems that need to be encoded as FAQs. AQP Risponde is provided as a web and mobile application for Android (https://play.google.com/store/apps/details?id=com.questioncube.aqprisponde&hl=it) and iOS (https://itunes.apple.com/it/app/aqp-risponde/id1006106860) and is currently running in the Acquedotto Pugliese customer care. AQP received about 25,000 questions and collected about 2,500 user feedback entries. We rely on these data to build the dataset for the task. In particular, we provide:

* a knowledge base of 406 FAQs. Each FAQ is composed of a question, an answer, and a set of tags;
* a set of 1,132 user queries, collected by analyzing the AQP Risponde system log. From the initial set of queries, we removed those containing personal data;
* a set of 1,406 <query, relevant FAQ> pairs that are exploited to evaluate the contestants. We built these pairs by analyzing the feedback provided by real users of AQP Risponde. We manually checked the user feedback in order to remove noisy or false feedback; the check was performed by two experts of the AQP customer support.

We provided a small sample set for system development and a test set for the evaluation. We did not provide a set of training data: AQP is interested in the development of unsupervised systems, because AQP Risponde must be able to achieve good performance without any user feedback. An example of FAQ follows:

Question: "Come posso telefonare al numero verde da un cellulare?" (How can I call the toll-free number from a mobile phone?)

Answer: "È possibile chiamare il Contact Center AQP per segnalare un guasto o per un pronto intervento telefonando gratuitamente anche da cellulare al numero verde 800.735.735. Mentre per chiamare il Contact Center AQP per servizi commerciali 800.085.853 da un cellulare e dall'estero è necessario comporre il numero +39.080.5723498 (il costo della chiamata è secondo il piano tariffario del chiamante)." (You can call the AQP Contact Center to report a fault or an emergency without charge, also from a mobile phone, by the toll-free number 800 735 735...)

Tags: canali, numero verde, cellulare

For example, the previous FAQ is relevant for the query: "Si può telefonare da cellulare al numero verde?" (Is it possible to call the toll-free number from a mobile phone?)

Moreover, we provided a simple baseline based on a classical information retrieval model.

====2.1 Data Format====
FAQs are provided in both XML and CSV format, using ";" as separator. The file is encoded in UTF-8. Each FAQ is described by the following fields:

* id: a number that uniquely identifies the FAQ
* question: the question text of the current FAQ
* answer: the answer text of the current FAQ
* tag: a set of tags separated by ","

Test data are provided as a text file in which each line contains two strings separated by the TAB character: the first string is the user query id, while the second string is the text of the user query. For example: "1 Come posso telefonare al numero verde da un cellulare?" and "2 Come si effettua l'autolettura del contatore?".
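To make the data format concrete, the following sketch (not part of the official task material) loads the FAQ knowledge base and the test queries in Python. The file names, the helper names, and the assumption that the CSV carries a header row naming the four fields are illustrative, not guaranteed properties of the distributed files.

<syntaxhighlight lang="python">
import csv


def load_faqs(path):
    """Load the FAQ knowledge base from the ';'-separated CSV file.

    Assumption: the file has a header row naming the fields
    id, question, answer, tag (tags themselves are ','-separated).
    """
    faqs = {}
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            faqs[row["id"]] = {
                "question": row["question"],
                "answer": row["answer"],
                "tags": [t.strip() for t in row["tag"].split(",") if t.strip()],
            }
    return faqs


def load_test_queries(path):
    """Load test queries: one '<query id> TAB <query text>' pair per line."""
    queries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            qid, text = line.split("\t", 1)
            queries[qid] = text
    return queries


# Hypothetical usage (file names are assumptions):
# faqs = load_faqs("faq.csv")
# queries = load_test_queries("test_queries.txt")
</syntaxhighlight>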
We manually check the user feed- tag a set of tags separated by “,” back in order to remove noisy or false feed- Test data are provided as a text file composed by back. The check was performed by two ex- two strings separated by the TAB character. The perts of the AQP customer support. first string is the user query id, while the second 2 https://play.google.com/store/apps/ string is the text of the user query. For example: details?id=com.questioncube.aqprisponde& “1 Come posso telefonare al numero verde da un hl=it 3 https://itunes.apple.com/it/app/ cellulare?” and “2 Come si effettua l’autolettura aqp-risponde/id1006106860 del contatore?”. 2.2 Baseline 3 Systems The baseline is built by using Apache Lucene (ver. 4.10.4)4 . During the indexing for each FAQ, a Thirteen teams registered in the task, but only document with four fields (id, question, answer, three of them actually submitted the results for the tag) is created. For searching, a query for each evaluation. A short description of each system fol- question is built taking into account all the ques- lows: tion terms. Each field is boosted according to the following score question=4, answer=2 and tag=1. chiLab4It - The system described in (Pipitone et For both indexing and search the ItalianAnalyzer al., 2016a) is based on the cognitive model is adopted. The top 25 documents for each query proposed in (Pipitone et al., 2016b). When a are provided as result set. The baseline is freely support text is provided for finding the cor- available on GitHub5 and it was released to partic- rect answer, QuASIt is able to use this text ipants after the evaluation period. to find the required information. ChiLab4It is an adaptation of this model to the context 2.3 Evaluation of FAQs, in this case the FAQ is exploited The participants must provide results in a text file. as support text: the most relevant FAQ will For each query in the test data, the participants can be the one whose text will best fit the user’s provide 25 answers at the most, ranked according question. The authors define three similar- by their systems. Each line in the file must contain ity measures for each field of the FAQ: ques- three values separated by the TAB character: < tion, answer and tags. Moreover, an expan- queryid >< f aqid >< score >. sion step by exploiting synonyms is applied to the query. The expansion module is based Systems are ranked according to the accu- on Wiktionary. racy@1 (c@1). We compute the precision of the system by taking into account only the first cor- rect answer. This metric is used for the final rank- fbk4faq - In (Fonseca et al., 2016), the authors ing of systems. In particular, we take into account proposed a system based on vector represen- also the number of unanswered questions, follow- tations for each query, question and answer. ing the guidelines of the CLEF ResPubliQA Task Query and answer are ranked according to the (Peñas et al., 2009). The formulation of c@1 is: cosine distance to the query. Vectors are built by exploring the word embeddings generated 1 nR by (Dinu et al., 2014), and combined in a way c@1 = (nR + nU ) (1) n n to give more weight to more relevant words. where nR is the number of questions correctly answered, nU is the number of questions unan- NLP-NITMZ the system proposed by (Bhard- swered, and n is the total number of questions. waj et al., 2016) is based on a classical The system should not provide result for a par- VSM model implemented in Apache Nutch6 . 
====2.3 Evaluation====
Participants must provide their results in a text file. For each query in the test data, participants can provide at most 25 answers, ranked according to their system's score. Each line in the file must contain three values separated by the TAB character: <query id> TAB <faq id> TAB <score>.

Systems are ranked according to accuracy@1 (c@1): we compute the precision of the system by taking into account only the first correct answer. This metric is used for the final ranking of systems. In particular, we also take into account the number of unanswered questions, following the guidelines of the CLEF ResPubliQA Task (Peñas et al., 2009). The formulation of c@1 is:

c@1 = (1/n) * (nR + nU * (nR / n))    (1)

where nR is the number of questions correctly answered, nU is the number of questions unanswered, and n is the total number of questions.

A system should not provide a result for a particular question when it is not confident about the correctness of its answer. The goal is to reduce the amount of incorrect responses while keeping the number of correct ones, by leaving some questions unanswered. Systems should ensure that only the portion of wrong answers is reduced, maintaining as high as possible the number of correct answers; otherwise, the reduction in the number of correct answers is punished by the evaluation measure, which accounts for both answered and unanswered questions.
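A minimal sketch, under the assumption that a system's ranked output is kept as a dictionary from query id to a list of (faq id, score) pairs, of how a run file in the required format can be written and how c@1 of Equation (1) can be computed against the gold <query, relevant FAQ> pairs. The official scoring was performed by the organizers; this is only an illustrative re-implementation with hypothetical helper names.

<syntaxhighlight lang="python">
def write_run(results, path):
    """Write a run file: one '<query id> TAB <faq id> TAB <score>' line per returned FAQ.

    `results` maps each query id to a ranked list of (faq_id, score) pairs
    (at most 25 per query); queries a system leaves unanswered are simply absent.
    """
    with open(path, "w", encoding="utf-8") as out:
        for query_id, ranking in results.items():
            for faq_id, score in ranking:
                out.write(f"{query_id}\t{faq_id}\t{score}\n")


def c_at_1(results, gold, all_query_ids):
    """Compute c@1 = (nR + nU * nR / n) / n, as in Equation (1).

    `gold` maps each query id to the set of relevant FAQ ids; only the
    top-ranked answer of each answered query is considered correct or not.
    """
    n = len(all_query_ids)
    n_r = sum(
        1
        for qid in all_query_ids
        if results.get(qid) and results[qid][0][0] in gold.get(qid, set())
    )
    n_u = sum(1 for qid in all_query_ids if not results.get(qid))
    return (n_r + n_u * n_r / n) / n
</syntaxhighlight>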
===3 Systems===
Thirteen teams registered for the task, but only three of them actually submitted results for the evaluation. A short description of each system follows:

* chiLab4It: the system described in (Pipitone et al., 2016a) is based on the cognitive model proposed in (Pipitone et al., 2016b). When a support text is provided for finding the correct answer, QuASIt is able to use this text to find the required information. ChiLab4It is an adaptation of this model to the context of FAQs, where the FAQ itself is exploited as support text: the most relevant FAQ will be the one whose text best fits the user's question. The authors define three similarity measures, one for each field of the FAQ: question, answer and tags. Moreover, an expansion step exploiting synonyms is applied to the query; the expansion module is based on Wiktionary.
* fbk4faq: in (Fonseca et al., 2016), the authors propose a system based on vector representations of each query, question and answer. Candidate questions and answers are ranked according to their cosine distance to the query. Vectors are built from the word embeddings generated by (Dinu et al., 2014), combined in a way that gives more weight to more relevant words.
* NLP-NITMZ: the system proposed by (Bhardwaj et al., 2016) is based on a classical VSM model implemented in Apache Nutch (https://nutch.apache.org). Moreover, the authors add a combinatorial searching technique that produces a set of queries from several combinations of all the keywords occurring in the user query. A custom stop word list was developed for the task, which is freely available at https://github.com/SRvSaha/QA4FAQ-EVALITA-16/blob/master/italian_stopwords.txt.

It is important to underline that the systems adopt different strategies, and only one system (chiLab4It) is based on a typical question answering module. We provide a more detailed analysis of this aspect in Section 4.

===4 Results===
Results of the evaluation in terms of c@1 are reported in Table 1. The best performance is obtained by the chilab4it team, which is the only one able to outperform the baseline. Moreover, the chilab4it team is the only one that exploits question answering techniques: the good performance obtained by this team proves the effectiveness of question answering in the FAQ domain. All the other participants obtained results below the baseline. Another interesting outcome is that the baseline, which exploits a simple VSM model, achieved remarkable results.

{| class="wikitable"
|+ Table 1: System results.
! System !! c@1
|-
| qa4faq16.chilab4it.01 || 0.4439
|-
| baseline || 0.4076
|-
| qa4fac16.fbk4faq.2 || 0.3746
|-
| qa4fac16.fbk4faq.1 || 0.3587
|-
| qa4fac16.NLP-NITMZ.1 || 0.2125
|-
| qa4fac16.NLP-NITMZ.2 || 0.0168
|}

A deeper analysis of results is reported in (Fonseca et al., 2016), where the authors built a custom development set by paraphrasing original questions or generating new questions based on the original FAQ answer, without considering the original FAQ question. The interesting result is that their system outperformed the baseline on this development set. The authors underline that the development set is quite different from the test set, which contains sometimes short queries and more realistic user requests. This is an interesting point of view, since one of the main challenges of our task concerns the variety of language expressions adopted by customers to formulate an information need. Moreover, in their report the authors provide some examples in which the FAQ reported in the gold standard is less relevant than the FAQ reported by their system, or in which the system returns a correct answer that is not annotated in the gold standard. Regarding the first point, we want to point out that our relevance judgments are computed according to the users' feedback and reflect their concept of relevance (relevance is subjective). We tried to mitigate issues related to relevance judgments by manually checking users' feedback; however, this manual annotation process might have introduced some noise, which is common to all participants.

Regarding missing correct answers in the gold standard: this is a typical issue in retrieval evaluation, since it is impossible to assess all the FAQs for each test query. Generally, this issue is solved by creating a pool of results for each query, built by exploiting the output of several systems. In this first edition of the task we could not rely on previous evaluations on the same set of data, therefore we chose to exploit users' feedback. In the next editions of the task, we can rely on previous results of participants to build such a pool of results.

Finally, in Table 2 we report some information retrieval metrics for each system, computed by the latest version of the trec_eval tool (http://trec.nist.gov/trec_eval/). In particular, we compute Mean Average Precision (MAP), Geometric Mean Average Precision (GMAP), Mean Reciprocal Rank (MRR), and Recall after five (R@5) and ten (R@10) retrieved documents. We also report success@1, which is equal to c@1 but without taking into account unanswered queries. We can notice that on the retrieval metrics the baseline is the best approach. This was quite expected, since an information retrieval model tries to optimize retrieval performance. Conversely, the best approach according to success@1 is the chilab4it system based on question answering, since it tries to retrieve a correct answer in the first position. This result suggests that the most suitable strategy in this context is to adopt a question answering model, rather than to adapt an information retrieval approach. Another interesting outcome concerns the system NLP-NITMZ.1, which obtains an encouraging success@1 compared to its c@1. This behavior is ascribable to the fact that the system does not adopt a strategy that provides an answer for all queries.

{| class="wikitable"
|+ Table 2: Results computed using typical information retrieval metrics.
! System !! MAP !! GMAP !! MRR !! R@5 !! R@10 !! success@1
|-
| chilab4it || 0.5149 || 0.0630 || 0.5424 || 0.6485 || 0.7343 || 0.4319
|-
| baseline || 0.5190 || 0.1905 || 0.5422 || 0.6805 || 0.7898 || 0.4067
|-
| fbk4faq.2 || 0.4666 || 0.0964 || 0.4982 || 0.5917 || 0.7244 || 0.3750
|-
| fbk4faq.1 || 0.4473 || 0.0755 || 0.4781 || 0.5703 || 0.6994 || 0.3578
|-
| NLP-NITMZ.1 || 0.3936 || 0.0288 || 0.4203 || 0.5060 || 0.5879 || 0.3161
|-
| NLP-NITMZ.2 || 0.0782 || 0.0202 || 0.0799 || 0.0662 || 0.1224 || 0.0168
|}
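For illustration, the sketch below re-implements two of the rank-based metrics reported in Table 2 (success@1 and MRR) on the same hypothetical data structures used in the previous sketches. The official figures were produced with trec_eval, whose averaging conventions may differ, so this is an approximation for explanatory purposes only.

<syntaxhighlight lang="python">
def success_at_1(results, gold):
    """Fraction of answered queries whose top-ranked FAQ is relevant."""
    answered = [qid for qid, ranking in results.items() if ranking]
    if not answered:
        return 0.0
    hits = sum(
        1 for qid in answered if results[qid][0][0] in gold.get(qid, set())
    )
    return hits / len(answered)


def mean_reciprocal_rank(results, gold):
    """Mean of 1/rank of the first relevant FAQ, over answered queries."""
    rr_values = []
    for qid, ranking in results.items():
        if not ranking:
            continue
        rr = 0.0
        for rank, (faq_id, _score) in enumerate(ranking, start=1):
            if faq_id in gold.get(qid, set()):
                rr = 1.0 / rank
                break
        rr_values.append(rr)
    return sum(rr_values) / len(rr_values) if rr_values else 0.0
</syntaxhighlight>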
We plan to implement dro Moschitti, James Glass, and Bilal Randeree. this approach in future editions of the task. 2015. Semeval-2015 task 3: Answer selection in community question answering. SemEval-2015, Acknowledgments page 269. This work is supported by the project “Multilin- Anselmo Peñas, Pamela Forner, Richard Sutcliffe, gual Entity Liking” funded by the Apulia Region Álvaro Rodrigo, Corina Forăscu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, and Petya under the program FutureInResearch. Osenova. 2009. Overview of ResPubliQA 2009: question answering evaluation over European legis- lation. In Workshop of the Cross-Language Evalu- References ation Forum for European Languages, pages 174– Eneko Agirre, Carmen Banea, Claire Cardie, Daniel 196. Springer. Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Arianna Pipitone, Giuseppe Tirone, and Roberto Pir- Guo, Inigo Lopez-Gazpio, Montse Maritxalara, rone. 2016a. ChiLab4It System in the QA4FAQ Rada Mihalcea, et al. 2015. Semeval-2015 task Competition. In Pierpaolo Basile, Anna Corazza, 2: Semantic textual similarity, english, spanish and Franco Cutugno, Simonetta Montemagni, Malv- pilot on interpretability. In Proceedings of the 9th ina Nissim, Viviana Patti, Giovanni Semeraro, and international workshop on semantic evaluation (Se- Rachele Sprugnoli, editors, Proceedings of Third mEval 2015), pages 252–263. Italian Conference on Computational Linguistics Divyanshu Bhardwaj, Partha Pakray, Jereemi Bentham, (CLiC-it 2016) & Fifth Evaluation Campaign of Saurav Saha, and Alexander Gelbukh. 2016. Ques- Natural Language Processing and Speech Tools tion Answering System for Frequently Asked Ques- for Italian. Final Workshop (EVALITA 2016). As- tions. In Pierpaolo Basile, Anna Corazza, Franco sociazione Italiana di Linguistica Computazionale Cutugno, Simonetta Montemagni, Malvina Nissim, (AILC). Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Arianna Pipitone, Giuseppe Tirone, and Roberto Pir- Conference on Computational Linguistics (CLiC-it rone. 2016b. QuASIt: a Cognitive Inspired Ap- 2016) & Fifth Evaluation Campaign of Natural Lan- proach to Question Answering System for the Italian guage Processing and Speech Tools for Italian. Final Language. In Proceedings of the 15th International Workshop (EVALITA 2016). Associazione Italiana di Conference on the Italian Association for Artificial Linguistica Computazionale (AILC). Intelligence 2016. aAcademia University Press. Georgiana Dinu, Angeliki Lazaridou, and Marco Ba- roni. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.