Overview of the EVALITA 2016 Question Answering for Frequently Asked Questions (QA4FAQ) Task

Annalina Caputo (1), Marco de Gemmis (2,5), Pasquale Lops (2), Francesco Lovecchio (3), Vito Manzari (4)

(1) ADAPT Centre, Dublin
(2) Department of Computer Science, University of Bari Aldo Moro
(3) Acquedotto Pugliese (AQP) S.p.a.
(4) Sud Sistemi S.r.l.
(5) QuestionCube S.r.l.

(1) annalina.caputo@adaptcentre.ie
(2) {marco.degemmis,pasquale.lops}@uniba.it
(3) f.lovecchio@aqp.it
(4) manzariv@sudsistemi.it
(5) marco.degemmis@questioncube.com

Abstract

English. This paper describes the first edition of the Question Answering for Frequently Asked Questions (QA4FAQ) task at the EVALITA 2016 campaign. The task concerns the retrieval of relevant frequently asked questions, given a user query. The main objective of the task is the evaluation of both question answering and information retrieval systems in this particular setting, in which the document collection is composed of FAQs. The data used for the task were collected in a real scenario by AQP Risponde, a semantic retrieval engine used by Acquedotto Pugliese (AQP, the organization managing the public water supply in the South of Italy) for supporting their customer care. The system is developed by QuestionCube, an Italian startup company which designs Question Answering tools.

Italiano. Questo lavoro descrive la prima edizione del Question Answering for Frequently Asked Questions (QA4FAQ) task proposto durante la campagna di valutazione EVALITA 2016. Il task consiste nel recuperare le domande più frequenti rilevanti rispetto ad una domanda posta dall'utente. L'obiettivo principale del task è la valutazione di sistemi di question answering e di recupero dell'informazione in un contesto applicativo reale, utilizzando i dati provenienti da AQP Risponde, un motore di ricerca semantico usato da Acquedotto Pugliese (AQP, l'ente per la gestione dell'acqua pubblica nel Sud Italia). Il sistema è sviluppato da QuestionCube, una startup italiana che progetta soluzioni di Question Answering.

1 Motivation

Searching within the Frequently Asked Questions (FAQ) page of a web site is a critical task: customers might feel overloaded by many irrelevant questions and become frustrated by the difficulty of finding the FAQ suitable for their problem. Often the right FAQ is there, but worded differently from what they expect.

The proposed task consists in retrieving a list of relevant FAQs and corresponding answers related to the query issued by the user.

Acquedotto Pugliese (AQP) developed a semantic retrieval engine for FAQs, called AQP Risponde (http://aqprisponde.aqp.it/ask.php), based on Question Answering (QA) techniques. The system allows customers to ask their own questions, and retrieves a list of relevant FAQs and corresponding answers. Furthermore, customers can select one FAQ among those retrieved by the system and can provide feedback about the perceived accuracy of the answer.

AQP Risponde poses relevant research challenges concerning both the usage of the Italian language in a deep QA architecture and the variety of language expressions adopted by customers to formulate the same information need.

The proposed task is strongly related to the one recently organized at SemEval 2015 and 2016 about Answer Selection in Community Question Answering (Nakov et al., 2015). That task helps to automate the process of finding good answers to new questions in a community-created discussion forum (e.g., by retrieving similar questions in
the forum and by identifying the posts in the answer threads of similar questions that answer the original one as well). Moreover, the QA4FAQ task has some common points with the Textual Similarity task (Agirre et al., 2015), which has received an increasing amount of attention in recent years.

The paper is organized as follows: Section 2 describes the task, while Section 3 provides details about the competing systems. Results of the task are discussed in Section 4.

2 Task Description: Dataset, Evaluation Protocol and Measures

The task concerns the retrieval of relevant frequently asked questions, given a user query. For defining an evaluation protocol, we need a set of FAQs, a set of user questions and a set of relevance judgments for each question. In order to collect these data, we exploit an application called AQP Risponde, developed by QuestionCube for Acquedotto Pugliese. AQP Risponde provides a back-end that allows analyzing both the query log and the customers' feedback to discover, for instance, new emerging problems that need to be encoded as FAQs. AQP Risponde is provided as a web and mobile application for Android (https://play.google.com/store/apps/details?id=com.questioncube.aqprisponde&hl=it) and iOS (https://itunes.apple.com/it/app/aqp-risponde/id1006106860) and is currently running in the Acquedotto Pugliese customer care. AQP received about 25,000 questions and collected about 2,500 user feedback entries. We rely on these data to build the dataset for the task. In particular, we provide:

• a knowledge base of 406 FAQs. Each FAQ is composed of a question, an answer, and a set of tags;

• a set of 1,132 user queries. The queries were collected by analyzing the AQP Risponde system log. From the initial set of queries, we removed queries that contain personal data;

• a set of 1,406 <query, relevant FAQ> pairs that are exploited to evaluate the contestants. We built these pairs by analyzing the feedback provided by real users of AQP Risponde. We manually checked the user feedback in order to remove noisy or false feedback. The check was performed by two experts of the AQP customer support.

We provided a small sample set for system development and a test set for the evaluation. We did not provide a set of training data: AQP is interested in the development of unsupervised systems, because AQP Risponde must be able to achieve good performance without any user feedback. In the following, an example of FAQ is reported:

Question "Come posso telefonare al numero verde da un cellulare?" How can I call the toll-free number by a mobile phone?

Answer "È possibile chiamare il Contact Center AQP per segnalare un guasto o per un pronto intervento telefonando gratuitamente anche da cellulare al numero verde 800.735.735. Mentre per chiamare il Contact Center AQP per servizi commerciali 800.085.853 da un cellulare e dall'estero è necessario comporre il numero +39.080.5723498 (il costo della chiamata è secondo il piano tariffario del chiamante)." You can call the AQP Contact Center to report a fault or an emergency call without charge by the phone toll-free number 800 735 735...

Tags canali, numero verde, cellulare

For example, the previous FAQ is relevant for the query: "Si può telefonare da cellulare al numero verde?" Is it possible to call the toll-free number by a mobile phone?

Moreover, we provided a simple baseline based on a classical information retrieval model.

2.1 Data Format

FAQs are provided in both XML and CSV format, using ";" as separator. The file is encoded in UTF-8. Each FAQ is described by the following fields:

id a number that uniquely identifies the FAQ

question the question text of the current FAQ

answer the answer text of the current FAQ

tag a set of tags separated by ","

Test data are provided as a text file composed of two strings separated by the TAB character. The first string is the user query id, while the second string is the text of the user query. For example: "1 Come posso telefonare al numero verde da un cellulare?" and "2 Come si effettua l'autolettura del contatore?".
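To make the two formats concrete, the following Python sketch shows one possible way to load the FAQ knowledge base and the test queries. The file names (faq.csv, test.tsv) and the presence of a header row in the CSV are assumptions for illustration only; they are not prescribed by the task guidelines.

```python
import csv

def load_faqs(path="faq.csv"):
    """Load the FAQ knowledge base from the ';'-separated CSV file.

    Assumes one FAQ per row with a header row naming the fields
    id, question, answer, tag (tag is a ','-separated string),
    as described in Section 2.1.
    """
    faqs = {}
    with open(path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            faqs[row["id"]] = {
                "question": row["question"],
                "answer": row["answer"],
                "tag": row["tag"],
            }
    return faqs

def load_test_queries(path="test.tsv"):
    """Load test queries: one query per line, '<id>\t<query text>'."""
    queries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            qid, text = line.split("\t", 1)
            queries[qid] = text
    return queries
```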
2.2 Baseline

The baseline is built by using Apache Lucene (ver. 4.10.4, http://lucene.apache.org/). During indexing, a document with four fields (id, question, answer, tag) is created for each FAQ. For searching, a query is built for each user question by taking into account all the question terms. Each field is boosted according to the following weights: question=4, answer=2 and tag=1. For both indexing and search the ItalianAnalyzer is adopted. The top 25 documents for each query are provided as the result set. The baseline is freely available on GitHub (https://github.com/swapUniba/qa4faq) and was released to participants after the evaluation period.
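As a rough illustration of the boosted multi-field search performed by the baseline (the actual implementation relies on Lucene and its ItalianAnalyzer and is available at the GitHub repository above), the following Python sketch scores FAQs by weighted term overlap using the same boosts. The tokenizer and the scoring function are simplifications, not the baseline code; the FAQ dictionary is assumed to come from the load_faqs sketch above.

```python
import re
from collections import Counter

FIELD_BOOSTS = {"question": 4.0, "answer": 2.0, "tag": 1.0}

def tokenize(text):
    # Crude lowercase word tokenizer; the real baseline uses Lucene's
    # ItalianAnalyzer (stemming, stop-word removal, etc.).
    return re.findall(r"\w+", text.lower())

def score_faq(query_terms, faq):
    # Sum, over the three searchable fields, the boosted number of
    # query terms that also occur in that field.
    score = 0.0
    for field, boost in FIELD_BOOSTS.items():
        field_terms = Counter(tokenize(faq.get(field, "")))
        score += boost * sum(1 for t in query_terms if t in field_terms)
    return score

def search(query, faqs, k=25):
    # Return the ids of the top-k FAQs for a user query.
    query_terms = tokenize(query)
    ranked = sorted(faqs.items(),
                    key=lambda item: score_faq(query_terms, item[1]),
                    reverse=True)
    return [faq_id for faq_id, _ in ranked[:k]]
```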
2.3 Evaluation

The participants must provide their results in a text file. For each query in the test data, the participants can provide at most 25 answers, ranked by their systems. Each line in the file must contain three values separated by the TAB character: <query id> <faq id> <score>.

Systems are ranked according to the accuracy@1 (c@1). We compute the precision of the system by taking into account only the first correct answer. This metric is used for the final ranking of systems. In particular, we also take into account the number of unanswered questions, following the guidelines of the CLEF ResPubliQA Task (Peñas et al., 2009). The formulation of c@1 is:

c@1 = (1/n) (nR + nU (nR / n))    (1)

where nR is the number of questions correctly answered, nU is the number of questions unanswered, and n is the total number of questions.
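The measure in Equation (1) can be computed directly from these counts. The following Python sketch is a minimal illustration of the formula; the function and argument names are ours and not part of any official evaluation script.

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """Compute c@1 as defined in Equation (1).

    n_correct:    questions answered correctly at rank 1 (nR)
    n_unanswered: questions left unanswered (nU)
    n_total:      total number of questions (n)
    """
    if n_total == 0:
        return 0.0
    return (n_correct + n_unanswered * (n_correct / n_total)) / n_total

# Example: 100 test questions, 40 correct at rank 1, 10 unanswered
# -> c@1 = (40 + 10 * 0.4) / 100 = 0.44
print(c_at_1(40, 10, 100))
```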
The system should not provide a result for a particular question when it is not confident about the correctness of its answer. The goal is to reduce the amount of incorrect responses, keeping the number of correct ones, by leaving some questions unanswered. Systems should ensure that only the portion of wrong answers is reduced, maintaining the number of correct answers as high as possible. Otherwise, the reduction in the number of correct answers is punished by the evaluation measure for both the answered and unanswered questions.

3 Systems

Thirteen teams registered for the task, but only three of them actually submitted results for the evaluation. A short description of each system follows:

chiLab4It - The system described in (Pipitone et al., 2016a) is based on the cognitive model proposed in (Pipitone et al., 2016b). When a support text is provided for finding the correct answer, QuASIt is able to use this text to find the required information. ChiLab4It is an adaptation of this model to the context of FAQs: in this case the FAQ is exploited as support text, and the most relevant FAQ will be the one whose text best fits the user's question. The authors define three similarity measures, one for each field of the FAQ: question, answer and tags. Moreover, an expansion step exploiting synonyms is applied to the query. The expansion module is based on Wiktionary.

fbk4faq - In (Fonseca et al., 2016), the authors proposed a system based on vector representations for each query, question and answer. Questions and answers are ranked according to their cosine distance to the query. Vectors are built by exploiting the word embeddings generated by (Dinu et al., 2014), combined in a way that gives more weight to more relevant words (a simplified sketch of this kind of ranking is reported after this list).

NLP-NITMZ - The system proposed by (Bhardwaj et al., 2016) is based on a classical VSM model implemented in Apache Nutch (https://nutch.apache.org). Moreover, the authors add a combinatorial searching technique that produces a set of queries from several combinations of all the keywords occurring in the user query. A custom stop word list was developed for the task and is freely available (https://github.com/SRvSaha/QA4FAQ-EVALITA-16/blob/master/italian_stopwords.txt).
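As a simplified illustration of embedding-based FAQ ranking of the kind used by fbk4faq, the sketch below averages word vectors and ranks FAQs by cosine similarity. The actual system uses the embeddings of Dinu et al. (2014) and a relevance-based weighting of words; plain averaging, the word_vectors dictionary and the 300-dimensional default are assumptions made only for brevity.

```python
import numpy as np

def embed(text, word_vectors, dim=300):
    """Average the vectors of the known words in a text.

    word_vectors is assumed to be a dict mapping a word to a numpy
    array loaded from pre-trained Italian embeddings; the real system
    weights words instead of averaging them uniformly.
    """
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rank_faqs(query, faqs, word_vectors, k=25):
    """Rank FAQ ids by cosine similarity between the query vector and
    the vector of each FAQ's question and answer text."""
    q_vec = embed(query, word_vectors)
    scored = []
    for faq_id, faq in faqs.items():
        f_vec = embed(faq["question"] + " " + faq["answer"], word_vectors)
        scored.append((cosine(q_vec, f_vec), faq_id))
    scored.sort(reverse=True)
    return [faq_id for _, faq_id in scored[:k]]
```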
It is important to underline that all the systems adopt different strategies, and only one system (chiLab4It) is based on a typical question answering module. We provide a more detailed analysis of this aspect in Section 4.
Table 1: System results.

System                     c@1
qa4faq16.chilab4it.01      0.4439
baseline                   0.4076
qa4fac16.fbk4faq.2         0.3746
qa4fac16.fbk4faq.1         0.3587
qa4fac16.NLP-NITMZ.1       0.2125
qa4fac16.NLP-NITMZ.2       0.0168

4 Results

Results of the evaluation in terms of c@1 are reported in Table 1. The best performance is obtained by the chilab4it team, which is the only one able to outperform the baseline. Moreover, the chilab4it team is the only one that exploits question answering techniques: the good performance obtained by this team proves the effectiveness of question answering in the FAQ domain. All the other participants obtained results below the baseline. Another interesting outcome is that the baseline, which exploits a simple VSM model, achieved remarkable results.
A deep analysis of the results is reported in (Fonseca et al., 2016), where the authors built a custom development set by paraphrasing original questions or generating new questions (based on the original FAQ answers), without considering the original FAQ questions. The interesting result is that their system outperformed the baseline on this development set. The authors underline that the development set is completely different from the test set, which sometimes contains short queries and more realistic user requests. This is an interesting point of view, since one of the main challenges of our task concerns the variety of language expressions adopted by customers to formulate the same information need. Moreover, in their report the authors provide some examples in which the FAQ reported in the gold standard is less relevant than the FAQ returned by their system, or in which the system returns a correct answer that is not annotated in the gold standard. Regarding the first point, we want to point out that our relevance judgments are computed according to the users' feedback and reflect their concept of relevance (relevance is, after all, subjective).

We tried to mitigate issues related to the relevance judgments by manually checking users' feedback. However, this manual annotation process might have introduced some noise, which is common to all participants.
                                                       pants must provide a list of FAQs ranked by rele-
Regarding missing correct answers in the gold standard: this is a typical issue in retrieval evaluation, since it is impossible to assess all the FAQs for each test query. Generally, this issue is addressed by creating a pool of results for each query, built by exploiting the output of several systems. In this first edition of the task we could not rely on previous evaluations on the same set of data, therefore we chose to exploit users' feedback. In the next editions of the task, we can rely on previous results of the participants to build such a pool of results.

Finally, in Table 2 we report some information retrieval metrics for each system, computed with the latest version of the trec_eval tool (http://trec.nist.gov/trec_eval/). In particular, we compute Mean Average Precision (MAP), Geometric Mean Average Precision (GMAP), Mean Reciprocal Rank (MRR), and Recall after five (R@5) and ten (R@10) retrieved documents. We also report success@1, which is equal to c@1 but does not take unanswered questions into account. We can notice that on the retrieval metrics the baseline is the best approach. This was quite expected, since an information retrieval model tries to optimize retrieval performance. Conversely, the best approach according to success@1 is the chilab4it system based on question answering, since it tries to retrieve a correct answer in the first position. This result suggests that the most suitable strategy in this context is to adopt a question answering model, rather than to adapt an information retrieval approach. Another interesting outcome concerns the system NLP-NITMZ.1, which obtains an encouraging success@1 compared to its c@1. This behavior is ascribable to the fact that the system does not adopt a strategy that provides an answer for all queries.
Table 2: Results computed by using typical information retrieval metrics.

System        MAP     GMAP    MRR     R@5     R@10    success@1
chilab4it     0.5149  0.0630  0.5424  0.6485  0.7343  0.4319
baseline      0.5190  0.1905  0.5422  0.6805  0.7898  0.4067
fbk4faq.2     0.4666  0.0964  0.4982  0.5917  0.7244  0.3750
fbk4faq.1     0.4473  0.0755  0.4781  0.5703  0.6994  0.3578
NLP-NITMZ.1   0.3936  0.0288  0.4203  0.5060  0.5879  0.3161
NLP-NITMZ.2   0.0782  0.0202  0.0799  0.0662  0.1224  0.0168
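For reference, MRR and success@1 can be computed from a submission file and the gold <query, relevant FAQ> pairs as in the following sketch. This illustrates the metric definitions only; it is not the trec_eval implementation, and the run/gold data structures are assumptions.

```python
def evaluate_run(run, gold):
    """Compute MRR and success@1 for a run.

    run:  dict mapping a query id to the list of FAQ ids returned,
          in ranked order (at most 25 per query).
    gold: dict mapping a query id to the set of relevant FAQ ids.
    Both metrics are averaged over the queries present in the run,
    mirroring how trec_eval averages over evaluated queries.
    """
    rr_sum, s1_sum = 0.0, 0.0
    for qid, ranked in run.items():
        relevant = gold.get(qid, set())
        rank = next((i + 1 for i, faq_id in enumerate(ranked)
                     if faq_id in relevant), None)
        if rank is not None:
            rr_sum += 1.0 / rank
            s1_sum += 1.0 if rank == 1 else 0.0
    n = len(run)
    return {"MRR": rr_sum / n, "success@1": s1_sum / n} if n else {}
```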


5 Conclusions

For the first time for the Italian language, we propose a question answering task for frequently asked questions. Given a user query, the participants must provide a list of FAQs ranked by relevance according to the user need. The collection of FAQs was built by exploiting a real application developed by QuestionCube for Acquedotto Pugliese. The relevance judgments for the evaluation were built by taking into account the user feedback.

Results of the evaluation demonstrated that only the system based on question answering techniques was able to outperform the baseline, while all the other participants reported results below the baseline. Some issues pointed out by the participants suggest exploring a pool of results for building more accurate judgments. We plan to implement this approach in future editions of the task.

Acknowledgments

This work is supported by the project "Multilingual Entity Liking" funded by the Apulia Region under the program FutureInResearch.
References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252-263.

Divyanshu Bhardwaj, Partha Pakray, Jereemi Bentham, Saurav Saha, and Alexander Gelbukh. 2016. Question Answering System for Frequently Asked Questions. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2014. Improving Zero-Shot Learning by Mitigating the Hubness Problem. arXiv preprint arXiv:1412.6568.

Erick R. Fonseca, Simone Magnolini, Anna Feltracco, Mohammed R. H. Qwaider, and Bernardo Magnini. 2016. Tweaking Word Embeddings for FAQ Ranking. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Preslav Nakov, Lluís Màrquez, Walid Magdy, Alessandro Moschitti, James Glass, and Bilal Randeree. 2015. SemEval-2015 Task 3: Answer Selection in Community Question Answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), page 269.

Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forăscu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, and Petya Osenova. 2009. Overview of ResPubliQA 2009: Question Answering Evaluation over European Legislation. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 174-196. Springer.

Arianna Pipitone, Giuseppe Tirone, and Roberto Pirrone. 2016a. ChiLab4It System in the QA4FAQ Competition. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Arianna Pipitone, Giuseppe Tirone, and Roberto Pirrone. 2016b. QuASIt: a Cognitive Inspired Approach to Question Answering System for the Italian Language. In Proceedings of the 15th International Conference of the Italian Association for Artificial Intelligence 2016. Accademia University Press.