                     Tweaking Word Embeddings for FAQ Ranking

Erick R. Fonseca
University of São Paulo, Brazil
Fondazione Bruno Kessler
rocha@fbk.eu

Simone Magnolini
Fondazione Bruno Kessler
University of Brescia, Italy
magnolini@fbk.eu

Anna Feltracco
Fondazione Bruno Kessler
University of Pavia, Italy
University of Bergamo, Italy
feltracco@fbk.eu

Mohammed R. H. Qwaider
Fondazione Bruno Kessler
Povo-Trento, Italy
qwaider@fbk.eu

Bernardo Magnini
Fondazione Bruno Kessler
Povo-Trento, Italy
magnini@fbk.eu


                      Abstract

     English. We present the system developed at FBK for the EVALITA 2016 Shared Task
     "QA4FAQ – Question Answering for Frequently Asked Questions". A peculiar characteristic
     of this task is the total absence of training data, so we created a meaningful representation
     of the data using only word embeddings. We present the system as well as the results of the
     two submitted runs, and a qualitative analysis of them.

     Italiano. Presentiamo il sistema sviluppato presso FBK per la risoluzione del task EVALITA
     2016 "QA4FAQ - Question Answering for Frequently Asked Questions". Una caratteristica
     peculiare di questo task è la totale mancanza di dati di training, pertanto abbiamo creato
     una rappresentazione significativa dei dati utilizzando solamente word embeddings.
     Presentiamo il sistema assieme ai risultati ottenuti dalle due esecuzioni che abbiamo
     inviato e un'analisi qualitativa dei risultati stessi.


1   Introduction

FAQ ranking is an important task within the wider task of question answering, which is currently a
topic of great interest for both research and business. Analyzing Frequently Asked Questions is a way
to maximize the value of a type of knowledge source that would otherwise be difficult to consult. A
similar task was proposed in two SemEval editions (Màrquez et al., 2015; Nakov et al., 2016).
   Given a knowledge base composed of about 470 questions (henceforth, FAQ questions), their
respective answers (henceforth, FAQ answers) and metadata (tags), the task consists in retrieving the
most relevant FAQ question/answer pair for each query in the set provided by the organizers.
   For this task, no training data were provided, ruling out machine learning based approaches. We
took advantage of the a priori knowledge provided by word embeddings, and developed a word
weighting scheme to produce vector representations of the knowledge base questions, answers and the
user queries. We then rank the FAQ entries with respect to their cosine similarity to the queries.
   The paper is organized as follows. Section 2 presents the system we built and Section 3 reports the
development data we created in order to test our system. In Section 4 we show the results we obtained,
followed by Section 5, which presents an error analysis. Finally, Section 6 provides some conclusions.

2   System Description

Our system was based on creating vector representations for each user query (from the test set),
question and answer (from the knowledge base), and then ranking the latter two according to their
cosine distance to the query.
   We created the vectors using the word embeddings generated by Dinu and Baroni (2014) and
combined them in a way that gives more weight to more important words, as explained below. Since no
training data was available, using word embeddings was especially interesting, as they could provide
our system with some kind of a priori knowledge about similar words.
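   As a rough illustration of this a priori knowledge, the sketch below (not part of the submitted
system) loads a set of embeddings and ranks the vocabulary by cosine similarity to a given word. The
file name, the plain "word v1 ... vd" storage format and the example word are assumptions made only
for illustration.

    import numpy as np

    def load_embeddings(path):
        """Load vectors stored one word per line: "word v1 v2 ... vd"."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split()
                vec = np.array(parts[1:], dtype=float)
                vectors[parts[0]] = vec / np.linalg.norm(vec)  # unit length
        return vectors

    def most_similar(word, vectors, topn=5):
        """Rank the rest of the vocabulary by cosine similarity to `word`."""
        target = vectors[word]
        scores = ((float(target @ v), w) for w, v in vectors.items() if w != word)
        return sorted(scores, reverse=True)[:topn]

    # Hypothetical usage: with Italian embeddings, a word such as "fattura"
    # (bill) should rank related terms like "bolletta" among its neighbours.
    vectors = load_embeddings("embeddings.it.txt")   # file name is illustrative
    print(most_similar("fattura", vectors))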
   We applied the same operations to queries, FAQ questions and answers, and here we use the term
text to refer to any of the three. In order to create vector representations for texts, the following
steps were taken (a code sketch of the weighting and averaging steps is given after the list):

  1. Tokenization. The text is tokenized with NLTK's (Bird et al., 2009) Italian model, yielding a
     token list X.

  2. Filtering. Stopwords (obtained from NLTK's stopword list) and punctuation signs are discarded
     from X.

  3. Acronym substitution. Some words and expressions are replaced by their acronyms. We performed
     this replacement in order to circumvent cases where a query contains an acronym while the
     corresponding FAQ has the expression fully written, which would lead to a similarity score lower
     than expected. For example, we replaced Autorità Idrica Pugliese with AIP and Bari with BA. In
     total, 21 expressions were checked.

  4. Out-of-vocabulary terms. When a word outside the embedding vocabulary is found in a FAQ question
     or answer, a random embedding is generated for it¹, drawn from a normal distribution with mean 0
     and standard deviation 0.1. The same embedding is used for any new occurrence of that word. This
     includes any acronyms introduced in the previous step.

  5. IDF computation. We compute the document frequency (DF) of each word as the proportion of
     questions or answers in which it appears². Then, we compute the inverse document frequency (IDF)
     of each word as:

         \mathrm{IDF}(w) = \begin{cases} \frac{1}{\mathrm{DF}(w)} & \text{if } \mathrm{DF}(w) > 0 \\ 10 & \text{otherwise} \end{cases}    (1)

     We found that tweaking the DF by decreasing the count contributed by FAQ tags could improve our
     system's performance: when counting words in questions and answers to compute their DF, we ignore
     any word present among the tags for that FAQ entry. Thus, tag words, which are supposed to be
     more relevant, have a lower DF and a higher IDF value.

  6. Multiword expressions. We compute embeddings for 15 common multiword expressions (MWEs) we
     extracted from the FAQ. They are computed as the average of the embeddings of the MWE components,
     weighted by their IDF. If an MWE is present in the text, we add a token containing the whole
     expression to X, but do not remove the individual words. An example is codice cliente: we add
     codice cliente to X, but still keep codice and cliente.

  7. SIDF computation. We compute the Similarity-IDF (SIDF) scores. This metric can be seen as an
     extension of the IDF which also incorporates the DF of similar words. It is computed as follows:

         \mathrm{SIDF}(w) = \frac{1}{\mathrm{SDF}(w)}    (2)

         \mathrm{SDF}(w) = \mathrm{DF}(w) + \sum_{w_i \in W_{\mathrm{sim}}} \cos(w, w_i)\, \mathrm{DF}(w_i)    (3)

     Here, W_{\mathrm{sim}} denotes the set of the n words most similar to w which have non-zero DF.
     Note that under this definition SDF is never null, and thus we do not need the special case used
     in the IDF computation. We can also compute the SIDF for the MWEs introduced into the texts.

  8. Embedding averaging. After these steps, we take the mean of the embeddings, weighted by the SIDF
     values of their corresponding words:

         v = \frac{\sum_{w \in X} E(w)\, \mathrm{SIDF}(w)}{|X|}    (4)

     Here, v stands for the vector representation of the text and E(·) is the function mapping words
     and MWEs to their embeddings. Note that we do not remove duplicate words.

¹ Out-of-vocabulary words that only appear in the queries are removed from X.
² When we compare queries to FAQ questions, we only count occurrences in questions. Likewise, when
  comparing queries to answers, we only count occurrences in answers.
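   The following sketch illustrates steps 5, 7 and 8 (document frequency, SIDF, and the weighted
average of equation 4). It is a simplified re-implementation for illustration only, not the submitted
code: the random out-of-vocabulary embeddings of step 4 and the MWE handling of step 6 are omitted, the
unit-normalised vector dictionary is the one assumed in the earlier sketch, and the number of similar
words and the embedding dimensionality are placeholders.

    import numpy as np
    from collections import Counter

    def document_frequency(token_lists):
        """Step 5: DF(w) = proportion of texts (questions or answers) containing w."""
        counts = Counter()
        for tokens in token_lists:
            counts.update(set(tokens))
        return {w: c / len(token_lists) for w, c in counts.items()}

    def sidf_scores(df, vectors, n_sim=10):
        """Step 7: SIDF(w) = 1 / SDF(w), where SDF(w) adds to DF(w) the
        cosine-weighted DF of the n_sim most similar words with non-zero DF."""
        in_df = [w for w in vectors if df.get(w, 0.0) > 0]
        matrix = np.stack([vectors[w] for w in in_df])   # unit-normalised rows
        scores = {}
        for w, vec in vectors.items():
            cosines = matrix @ vec                        # cosine with every DF>0 word
            top = np.argsort(-cosines)[:n_sim]            # w itself is skipped below
            sdf = df.get(w, 0.0) + sum(
                cosines[i] * df[in_df[i]] for i in top if in_df[i] != w)
            scores[w] = 1.0 / sdf if sdf > 0 else 10.0    # IDF-style fallback
        return scores

    def text_vector(tokens, sidf, vectors, dim=300):      # dim is a placeholder
        """Step 8 (equation 4): SIDF-weighted mean of the token embeddings.
        Duplicate tokens are kept, as in the paper."""
        acc = np.zeros(dim)
        for w in tokens:
            if w in vectors:       # OOV tokens would get random vectors (step 4)
                acc += vectors[w] * sidf.get(w, 10.0)
        return acc / len(tokens)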
              id                     272
              question               Cos'è la quota fissa riportata in fattura?
  FAQ Entry   answer                 La quota fissa, prevista dal piano tariffario deliberato, è addebitata in ciascuna
                                     fattura, ed è calcolata in base ai moduli contrattuali ed ai giorni di competenza
                                     della fattura stessa. La quota fissa è dovuta indipendentemente dal consumo in
                                     quanto attiene a parte dei costi fissi che il gestore sostiene per erogare il servizio
                                     a tutti. Quindi nella fattura è addebitata proporzionalmente al periodo fatturato.
              tag                    fattura, quota, fissa, giorni, canone acqua e fogna, quota fissa, costi fissi, quote fisse
  DevSet      paraphrased query      Cosa si intende per quota fissa nella fattura?
              answer-driven query    La quota fissa è indipendente dai consumi?

                                     Table 1: Example of our development set.


   In this process, the IDF and SIDF values are calculated independently for the answers and the
questions in the FAQ. When processing queries, the value actually used depends on which of the two we
are comparing the query vectors with.
   After computing vectors for all texts, we compute the cosine similarity between query vectors and
FAQ questions, and also between queries and answers. For each FAQ entry, we take the higher of these
two values as the system confidence for returning that entry as an answer to the query.
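   A minimal sketch of this ranking step, under the same assumptions as the earlier snippets (function
and variable names are hypothetical): each query is represented by two vectors, one built with
question-side SIDF values and one with answer-side SIDF values, and the confidence for a FAQ entry is
the larger of the two cosine similarities. Run 2, described in Section 4, simply abstains when this
confidence falls below 0.5.

    import numpy as np

    def cosine(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    def rank_faq(query_vec_q, query_vec_a, question_vecs, answer_vecs):
        """Score every FAQ entry by max(similarity to its question, similarity to its answer)."""
        scored = [(max(cosine(query_vec_q, question_vecs[i]),
                       cosine(query_vec_a, answer_vecs[i])), i)
                  for i in question_vecs]
        return sorted(scored, reverse=True)            # best entry first

    def answer(query_vec_q, query_vec_a, question_vecs, answer_vecs, threshold=None):
        """Run 1: threshold=None (always answer).  Run 2: threshold=0.5 (may abstain)."""
        confidence, faq_id = rank_faq(query_vec_q, query_vec_a,
                                      question_vecs, answer_vecs)[0]
        if threshold is not None and confidence < threshold:
            return None                                # abstain
        return faq_id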

3   Evaluating our system

In order to evaluate our system, we created a development set and we computed baselines as a reference
threshold.

3.1   Development Set

We manually created a dataset of 293 queries to test our systems. Each query in the dataset is
associated with one of the entries provided in the knowledge base. In particular, the dataset is
composed of 160 paraphrased queries and 133 answer-driven queries. The paraphrased queries were
obtained by paraphrasing original FAQ questions; the answer-driven queries were generated without
considering the original FAQ questions, but have an answer in the knowledge base. Table 1 shows an
example of a paraphrased query and an answer-driven query for FAQ 272 of the knowledge base.
   Given the technical domain of the task, most of the generated paraphrases recall lexical items of
the original FAQ question (e.g. "uso commerciale", "scuola pubblica", etc.). Differently, the
answer-driven queries are not necessarily similar in content and lexicon to the FAQ question; instead,
we expected them to have a very high similarity with the answer.
   We guided the development of our system by evaluating it with different versions of this dataset.
In particular, version 1 is composed of 200 queries, 160 paraphrased and 40 answer-driven, and
version 2 is composed of 266 queries, 133 paraphrased and 133 answer-driven.
   Merging paraphrased queries and answer-driven queries (in different proportions) allowed us to
create a very heterogeneous dataset; we expected the test set and, in general, the questions asked by
users to be just as varied.

3.2   Baseline

Two baseline systems were built using Apache Lucene³. FBK-Baseline-sys1 was built by indexing, for each
FAQ entry, a Document with two fields (id, FAQ question), while FBK-Baseline-sys2 was built by
indexing, for each FAQ entry, a Document with three fields (id, FAQ question, FAQ answer).

4   Results

In Table 2 we report the results of the two runs of our system compared with the official baseline
provided by the organizers. The only difference between our two runs was that the first one always
tried to retrieve an answer, while the second one would abstain from answering when the system
confidence was below 0.5.
   The organizers' baseline (qa4faq-baseline⁴) was built using Lucene with a weighted index: for each
FAQ entry, a Document with four fields (id, FAQ question (weight=4), FAQ answer (weight=2),
tag (weight=1)) was indexed.

³ https://lucene.apache.org/
⁴ https://github.com/swapUniba/qa4faq
                       Test set
                       Accuracy@1     MAP      Top 10
  run 1                35.87          51.12    73.94
  run 2                37.46          50.10    71.91
  qa4faq-baseline      40.76          58.97    81.71
  FBK-Baseline-sys1    39.79          55.36    76.15
  FBK-Baseline-sys2    35.16          53.02    80.92

Table 2: Results on the test set. Accuracy@1: official score, MAP: Mean Average Precision, Top 10:
correct answer in the first 10 results.

   We use three different metrics to evaluate the systems: Accuracy@1, which is the official score used
to rank the systems, MAP and Top10. Accuracy@1 is the precision of the system taking into account only
the first answer; it is computed as follows:

        \mathrm{Accuracy@1} = \frac{n_c + n_u \cdot \frac{n_c}{n}}{n}    (5)

where n_c is the number of correctly answered queries, n_u is the number of unanswered queries and n is
the total number of queries. MAP is the Mean Average Precision, i.e. the mean over all queries of the
average precision score, which here is the inverse of the rank of the correct answer. Top10 is the
percentage of queries with the correct answer in the first 10 positions.
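   For concreteness, equation (5) can be computed as in the sketch below; the example numbers are
invented and do not refer to the actual task data.

    def accuracy_at_1(n_correct, n_unanswered, n_total):
        """Equation (5): unanswered queries are credited with the system's
        observed precision n_correct / n_total rather than counted as errors."""
        return (n_correct + n_unanswered * n_correct / n_total) / n_total

    # Invented example: 150 correct answers and 30 abstentions out of 406 queries
    # give (150 + 30 * 150/406) / 406, i.e. about 0.397.
    print(round(accuracy_at_1(150, 30, 406), 3))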
   Both our runs underperformed the baseline in all three metrics we use to evaluate the systems.
   Comparing our runs, it is interesting to notice that run 2 performs better when evaluated with
Accuracy@1, but worse on the other two metrics; this suggests that, even in some cases where the system
confidence was below the threshold, the correct answer was among the top 10.

5   Error Analysis

The results of our system on the development set, described in Section 3.1, compared with the official
baseline are reported in Table 3.

                       Version 1                        Version 2
                       Accuracy@1   MAP     Top 10      Accuracy@1   MAP     Top 10
  Run 1                72.00        79.64   95.00       66.17        74.77   92.48
  Run 2                72.45        77.55   92.00       66.36        73.26   90.23
  qa4faq-baseline      69.00        76.22   89.50       60.15        70.22   88.72
  FBK-baseline-sys1    49.00        58.63   76.50       39.47        49.53   68.05
  FBK-baseline-sys2    52.00        62.69   82.50       49.62        62.10   86.09

Table 3: Results on the development sets. Accuracy@1: official score, MAP: Mean Average Precision,
Top 10: correct answer in the first 10 results.

   As can be seen, both runs outperform the baseline in every metric, especially in Accuracy@1.
   This difference in behavior highlights that there is a significant difference between the
development set and the test set. The systems were developed without knowing the target style and
without training data, so it is not surprising that the system is not capable of style adaptation.
   An interesting aspect that describes the difference between the development set and the test set is
reported in Table 4: the average and the standard deviation of the number of tokens per query. From the
first line it is possible to notice not only that our development queries have, on average, more tokens
than the test queries, but also that their standard deviation is significantly lower. This distribution
of tokens is in line with a qualitative check of the test set, which includes incomplete sentences made
only of keywords, e.g. "costo depurazione", alongside long questions that include a verbose description
of the situation, e.g. "Mia figlia acquisterà casa a bari il giorno 22 prossimo. Come procedere per l
intestazione dell utenza? Quali documenti occorrono e quali i tempi tecnici necessari?". The development
set, instead, is composed of queries that are more similar in structure and well formed.

                         Version 1       Version 2       Test set
       Queries           11.42 ± 4.12    11.20 ± 3.95    7.96 ± 7.27
  R1   Answered queries  11.42 ± 4.12    11.20 ± 3.95    7.96 ± 7.27
       Right queries     11.63 ± 4.15    11.41 ± 4.06    7.32 ± 5.44
       Wrong queries     10.88 ± 4.00    10.78 ± 3.69    8.32 ± 8.09
  R2   Answered queries  11.56 ± 4.12    11.30 ± 3.94    8.09 ± 7.41
       Right queries     11.77 ± 4.12    11.52 ± 4.04    7.37 ± 5.47
       Wrong queries     11.02 ± 4.06    10.86 ± 3.71    8.52 ± 8.33
  B    Answered queries  11.42 ± 4.12    11.20 ± 3.95    7.97 ± 7.27
       Right queries     11.94 ± 4.34    11.73 ± 4.35    7.54 ± 5.98
       Wrong queries     10.26 ± 3.31    10.40 ± 3.09    8.26 ± 8.02

Table 4: Average and standard deviation of the number of tokens per query. R1: Run 1, R2: Run 2,
B: organizers' baseline (qa4faq-baseline).

   All systems behave in roughly the same way across the data sets: in the two versions of the
development set, the correctly answered queries are longer and have a higher standard deviation than
the wrongly answered ones; in the test set, on the other hand, the correct queries are shorter and have
a lower standard deviation.
   We performed a qualitative analysis of the results of our systems, limiting our observation to the
250 queries of the test set for which the right answer was not among the first ten retrieved by our
systems. We considered these cases to be the worst and wanted to investigate whether they present an
issue that cannot be solved with our approach.
   We present some of these cases in this section. In Example 1, the answer of the system is only
weakly related to the query: the query is very short and its meaning is contained in both the gold
standard and the system answer. In the gold standard, the substitution of the water meter
("sostituzione del contatore") is the main focus of the sentence, and the other part is just a
specification of a detail ("con saracinesca bloccata").
   In the system answer, the substitution of the meter ("sostituzione del contatore") is the effect of
the main focus ("Per la telelettura"), but our approach cannot differentiate these two types of text
not directly related to the query.

Example 1
Query: sostituzione del contatore
Gold standard: Come effettuare il cambio del contatore vecchio con saracinesca bloccata?
System answer: Per la telelettura il contatore sara sostituito con un nuovo contatore?

   A similar issue is visible in Example 2. In this case, the first part of the system answer ("Quali
sono i tempi di allaccio di un contatore") matches the query almost exactly, and, as in Example 1, the
second part ("in caso di ripristino in quanto l'abitazione aveva già la fornitura?"), which is not very
relevant to the query, was not enough to reduce the overall ranking of this FAQ. We think this issue
could be avoided with some additional features, but this would require training data for a machine
learning approach, or some knowledge of the domain to craft a rule-based approach.

Example 2
Query: quali sono i tempi di attivazione di un contatore?
Gold standard: Quali sono i tempi previsti per ottenere un allacciamento?
System answer: Quali sono i tempi di allaccio di un contatore in caso di ripristino in quanto
l'abitazione aveva già la fornitura?

   In some cases, as in Example 3, the semantic match (common or related words in both sentences) is
not enough to capture the relationship, or can even be misleading. Some knowledge of the world and some
cause-effect reasoning are needed to understand that the gold standard is more related to the query
than the system answer. Even if the balance ("conguaglio") and the time expressions ("quando", "luglio
e agosto e un po di settembre") are present in both the query and the system answer, and not in the
gold standard, they are not useful for finding the correct answer.

Example 3
Query: ho ricevuto una bolletta di conguaglio di e 426.69 , ma son mancata da casa a luglio e agosto e
un po di settembre , senza consumare , come mai?
Gold standard: Perche ho ricevuto una fattura elevata?
System answer: Il conguaglio quando avviene?

   Alongside this issue, there are some cases (Example 4) where our system answers correctly, but, due
to the semi-automatic nature of the gold standard, the answer has been considered wrong.

Example 4
Query: chi paga la portella del contatore?
Gold standard: Come richiedere la sostituzione dello sportello della nicchia contatore?
System answer: Chi paga la portella del contatore?

   Example 5 represents one of the cases in which the system's answer has been considered wrong but is
more related to the query than the gold standard.

Example 5
Query: abito in un condominio con 5 famiglie . se alla scadenza di una bolletta uno dei condomini non
vuole pagare la sua quota , possono gli altri 4 pagare la loro parte su un altro bollettino postale?
Gold standard: Quali sono le modalita di pagamento delle fatture?
System answer: Contratto condominiale, di cui uno moroso come comportarsi?
6   Conclusion
We have presented the system we used in the EVALITA 2016 QA4FAQ shared task, as well as the development
set we created to evaluate it and an analysis of our results.
   We found that while our system performed below the baseline on the official test set, it had
superior performance on our in-house development set. This is apparently related to the different
styles of the two sets: ours has longer queries, which are more homogeneous in length, while the
official one has many very short queries and a few very long ones.
   It could be argued that the official test set represents a more realistic scenario than the
development set we created, since it contains actual user queries, thus diminishing the relevance of
our results. However, further analysis showed that in a number of cases our system returned a more
appropriate FAQ question/answer than the one in the gold standard, due to the semi-automatic nature of
the gold standard.
   We hypothesize that our system performed better than the official results suggest; however, due to
the size of the test set, it would be prohibitive to check the answers manually and arrive at a more
precise accuracy figure.


References
Steven Bird, Edward Loper, and Ewan Klein.
   2009. Natural Language Processing with Python.
   O’Reilly Media Inc.
Georgiana Dinu and Marco Baroni. 2014. Improving
  zero-shot learning by mitigating the hubness prob-
  lem. arXiv preprint arXiv:1412.6568.
Lluís Màrquez, James Glass, Walid Magdy, Alessandro Moschitti, Preslav Nakov, and Bilal Randeree.
   2015. SemEval-2015 Task 3: Answer Selection in Community Question Answering. In Proceedings of the
   9th International Workshop on Semantic Evaluation (SemEval 2015).
Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat,
   James Glass, and Bilal Randeree. 2016. SemEval-2016 Task 3: Community Question Answering. In
   Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego,
   California. Association for Computational Linguistics.