=Paper=
{{Paper
|id=Vol-1749/paper_024
|storemode=property
|title=Tweaking Word Embeddings for FAQ Ranking
|pdfUrl=https://ceur-ws.org/Vol-1749/paper_024.pdf
|volume=Vol-1749
|authors=Erick R. Fonseca,Simone Magnolini,Anna Feltracco,Mohammed R. H. Qwaider,Bernardo Magnini
|dblpUrl=https://dblp.org/rec/conf/clic-it/FonsecaMFQM16
}}
==Tweaking Word Embeddings for FAQ Ranking==
Erick R. Fonseca (University of São Paulo, Brazil; Fondazione Bruno Kessler) rocha@fbk.eu
Simone Magnolini (Fondazione Bruno Kessler; University of Brescia, Italy) magnolini@fbk.eu
Anna Feltracco (Fondazione Bruno Kessler; University of Pavia, Italy; University of Bergamo, Italy) feltracco@fbk.eu
Mohammed R. H. Qwaider (Fondazione Bruno Kessler, Povo-Trento, Italy) qwaider@fbk.eu
Bernardo Magnini (Fondazione Bruno Kessler, Povo-Trento, Italy) magnini@fbk.eu
Abstract

English. We present the system developed at FBK for the EVALITA 2016 Shared Task "QA4FAQ – Question Answering for Frequently Asked Questions". A peculiar characteristic of this task is the total absence of training data, so we created a meaningful representation of the data using only word embeddings. We present the system as well as the results of the two submitted runs, and a qualitative analysis of them.

Italiano. Presentiamo il sistema sviluppato presso FBK per la risoluzione del task EVALITA 2016 "QA4FAQ - Question Answering for Frequently Asked Questions". Una caratteristica peculiare di questo task è la totale mancanza di dati di training, pertanto abbiamo creato una rappresentazione significativa dei dati utilizzando solamente word embeddings. Presentiamo il sistema assieme ai risultati ottenuti dalle due esecuzioni che abbiamo inviato e un'analisi qualitativa dei risultati stessi.

1 Introduction

FAQ ranking is an important task within the wider task of question answering, which is currently a topic of great interest for both research and business. Analyzing Frequently Asked Questions is a way to maximize the value of a type of knowledge source that could otherwise be difficult to consult. A similar task was proposed in two SemEval editions (Màrquez et al., 2015; Nakov et al., 2016).

Given a knowledge base composed of about 470 questions (henceforth, FAQ questions), their respective answers (henceforth, FAQ answers) and metadata (tags), the task consists in retrieving the most relevant FAQ question/answer pair for each of the queries provided by the organizers.

For this task, no training data were provided, ruling out machine learning based approaches. We took advantage of the a priori knowledge provided by word embeddings and developed a word weighting scheme to produce vector representations of the knowledge base questions, the answers and the user queries. We then rank the FAQs with respect to their cosine similarity to the queries.

The paper is organized as follows. Section 2 presents the system we built and Section 3 describes the development data we created in order to test it. In Section 4 we show the results we obtained, followed by Section 5, which presents an error analysis. Finally, Section 6 provides some conclusions.

2 System Description

Our system is based on creating vector representations for each user query (from the test set) and for each question and answer (from the knowledge base), and then ranking the latter two according to their cosine distance to the query.

We created the vectors using the word embeddings generated by Dinu and Baroni (2014) and combined them in a way that gives more weight to more important words, as explained below. Since no training data was available, using word embeddings was especially interesting, as they could provide our system with some kind of a priori knowledge about similar words.
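As a concrete illustration of how such embeddings expose similarity between words, the following minimal Python sketch loads pretrained vectors into a dictionary and retrieves nearest neighbours by cosine similarity. The word2vec-style text format, the file name and the example word are our own assumptions, not a description of the exact resources we used.

```python
import numpy as np

def load_embeddings(path):
    """Load word embeddings from a whitespace-separated text file
    (one word per line followed by its vector components; assumed format)."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 3:          # skip a possible "count dim" header line
                continue
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=float)
    return embeddings

def most_similar(word, embeddings, k=5):
    """Return the k words with the highest cosine similarity to `word`."""
    target = embeddings[word]
    def cosine(vec):
        return float(np.dot(target, vec) /
                     (np.linalg.norm(target) * np.linalg.norm(vec) + 1e-12))
    scored = [(cosine(vec), w) for w, vec in embeddings.items() if w != word]
    return sorted(scored, reverse=True)[:k]

# Hypothetical usage:
# embeddings = load_embeddings("italian_embeddings.txt")
# print(most_similar("fattura", embeddings))
```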
We applied the same operations to queries, FAQ questions and answers, and here we will use the term text to refer to any of the three. In order to create vector representations for texts, the following steps were taken (a code sketch of the full pipeline is given after the list):

1. Tokenization. The text is tokenized with NLTK's (Bird et al., 2009) Italian model, yielding a token list X.

2. Filtering. Stopwords (obtained from NLTK's stopword list) and punctuation signs are discarded from X.

3. Acronym substitution. Some words and expressions are replaced by their acronyms. We performed this replacement in order to circumvent cases where a query contains an acronym while the corresponding FAQ has the expression fully written out, which would lead to a lower similarity score than expected. For example, we replaced Autorità Idrica Pugliese with AIP and Bari with BA. In total, 21 expressions were checked.

4. Out-of-vocabulary terms. When words outside the embedding vocabulary are found in a FAQ question or answer, a random embedding is generated for them from a normal distribution with mean 0 and standard deviation 0.1; out-of-vocabulary words that only appear in the queries are instead removed from X. The same embedding is reused for any new occurrence of that word. This includes the acronyms introduced in the previous step.

5. IDF computation. We compute the document frequency (DF) of each word as the proportion of questions or answers in which it appears (when comparing queries to FAQ questions, we only count occurrences in questions; likewise, when comparing queries to answers, we only count occurrences in answers). Then, we compute the inverse document frequency (IDF) of each word as:

   IDF(w) = 1/DF(w) if DF(w) > 0, and IDF(w) = 10 otherwise    (1)

   We found that tweaking the DF by discounting FAQ tags could improve our system's performance: when counting words in questions and answers to compute their DF, we ignore any word present among the tags for that FAQ entry. Thus, tag words, which are supposed to be more relevant, have a lower DF and a higher IDF value.

6. Multiword expressions. We compute embeddings for 15 common multiword expressions (MWEs) we extracted from the FAQ. They are computed as the average of the embeddings of the MWE components, weighted by their IDF. If an MWE is present in the text, we add a token to X containing the whole expression, but do not remove the individual words. An example is codice cliente: we add codice cliente to X, but still keep codice and cliente.

7. SIDF computation. We compute the Similarity-IDF (SIDF) scores. This metric can be seen as an extension of the IDF which also incorporates the DF of similar words. It is computed as follows:

   SIDF(w) = 1 / SDF(w)    (2)

   SDF(w) = DF(w) + \sum_{w_i \in W_sim} cos(w, w_i) DF(w_i)    (3)

   Here, W_sim denotes the set of the n most similar words to w which have non-zero DF. Note that under this definition SDF is never null, and thus we do not need the special case used in the IDF computation. We can also compute the SIDF for the MWEs introduced into the texts.

8. Embedding averaging. After these steps, we take the mean of the embeddings, weighted by the SIDF values of their corresponding words:

   v = ( \sum_{w \in X} E(w) SIDF(w) ) / |X|    (4)

   Here, v stands for the vector representation of the text and E(·) is the function mapping words and MWEs to their embeddings. Note that we do not remove duplicate words.
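To make the steps above concrete, here is a minimal Python sketch of the text-to-vector pipeline (steps 1, 2, 4, 5, 7 and 8). It assumes a dictionary EMBEDDINGS mapping words to NumPy vectors; EMB_DIM, N_SIMILAR and the helper names are illustrative choices, not the exact parameters of our system.

```python
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

EMBEDDINGS = {}      # word -> vector; assumed preloaded, e.g. the Dinu and Baroni (2014) vectors
EMB_DIM = 300        # assumed embedding dimensionality
N_SIMILAR = 10       # assumed size of W_sim in Eq. (3)
STOPWORDS = set(stopwords.words("italian"))

def get_embedding(word):
    """Step 4: return the embedding of a word, creating a random N(0, 0.1) vector for OOV words."""
    if word not in EMBEDDINGS:
        EMBEDDINGS[word] = np.random.normal(0.0, 0.1, EMB_DIM)
    return EMBEDDINGS[word]

def tokenize(text):
    """Steps 1-2: tokenize with NLTK's Italian model, drop stopwords and punctuation."""
    tokens = word_tokenize(text.lower(), language="italian")
    return [t for t in tokens if t.isalnum() and t not in STOPWORDS]

def compute_df(documents, tags_per_doc):
    """Step 5: document frequency as the proportion of documents containing a word,
    ignoring words that appear among that document's own tags."""
    counts = {}
    for doc, tags in zip(documents, tags_per_doc):
        for w in set(tokenize(doc)) - set(tags):
            counts[w] = counts.get(w, 0) + 1
    return {w: c / len(documents) for w, c in counts.items()}

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def sidf(word, df):
    """Step 7: SIDF(w) = 1 / SDF(w), with SDF as in Eq. (3)."""
    similar = sorted(
        ((cos(get_embedding(word), get_embedding(w)), w) for w in df if w != word),
        reverse=True,
    )[:N_SIMILAR]
    sdf = df.get(word, 0.0) + sum(sim * df[w] for sim, w in similar)
    return 1.0 / sdf if sdf > 0 else 10.0

def text_vector(text, df):
    """Step 8: SIDF-weighted average of the token embeddings, Eq. (4)."""
    tokens = tokenize(text)
    if not tokens:
        return np.zeros(EMB_DIM)
    weighted = [get_embedding(w) * sidf(w, df) for w in tokens]
    return np.sum(weighted, axis=0) / len(tokens)
```

Acronym substitution (step 3) and MWE insertion (step 6) would amount to simple replacements in, or additions to, the token list before the weighted average is computed.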
FAQ Entry
  id: 272
  question: Cos'è la quota fissa riportata in fattura?
  answer: La quota fissa, prevista dal piano tariffario deliberato, è addebitata in ciascuna fattura, ed è calcolata in base ai moduli contrattuali ed ai giorni di competenza della fattura stessa. La quota fissa è dovuta indipendentemente dal consumo in quanto attiene a parte dei costi fissi che il gestore sostiene per erogare il servizio a tutti. Quindi nella fattura è addebitata proporzionalmente al periodo fatturato.
  tag: fattura, quota, fissa, giorni, canone acqua e fogna, quota fissa, costi fissi, quote fisse

DevSet
  paraphrased query: Cosa si intende per quota fissa nella fattura?
  answer-driven query: La quota fissa è indipendente dai consumi?

Table 1: Example of our development set.
In this process, the IDF and SIDF values are calculated independently for answers and questions in the FAQ. When processing queries, the values actually used depend on which of the two we are comparing the query vectors with.

After computing vectors for all texts, we compute the cosine similarity between query vectors and FAQ questions, and also between queries and answers. For each FAQ entry, we take the highest of these two values as the system confidence for returning that entry as an answer to the query. A sketch of this ranking step is given below.
3 Evaluating our system

In order to evaluate our system, we created a development set and computed baselines to serve as a reference.

3.1 Development Set

We manually created a dataset of 293 queries to test our systems. Each query in the dataset is associated with one of the entries provided in the knowledge base. In particular, the dataset is composed of 160 paraphrased queries and 133 answer-driven queries. The paraphrased queries were obtained by paraphrasing the original questions; the answer-driven queries were generated without considering the original FAQ questions, but have an answer in the knowledge base. Table 1 shows an example of a paraphrased query and an answer-driven query for FAQ 272 of the knowledge base.

Given the technical domain of the task, most of the generated paraphrases recall lexical items of the original FAQ question (e.g. "uso commerciale", "scuola pubblica", etc.). The answer-driven queries, instead, are not necessarily similar in content and lexicon to the FAQ question; we expected them to have a very high similarity with the answer.

We guided the development of our system by evaluating it on different versions of this dataset. In particular, version 1 is composed of 200 queries, 160 paraphrased and 40 answer-driven, while version 2 is composed of 266 queries, 133 paraphrased and 133 answer-driven.

Merging paraphrased and answer-driven queries (in different proportions) allows us to create a very heterogeneous dataset; we expected the test set and, in general, the questions posed by users to be just as varied.

3.2 Baseline

Two baseline systems were built using Apache Lucene (https://lucene.apache.org/). FBK-Baseline-sys1 was built by indexing for each FAQ entry a Document with two fields (id, FAQ question), while FBK-Baseline-sys2 was built by indexing for each FAQ entry a Document with three fields (id, FAQ question, FAQ answer).
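Our baselines were built with Lucene; the following stand-in sketch uses scikit-learn TF-IDF retrieval instead, only to illustrate the difference between indexing the question field alone (sys1-like) and question plus answer (sys2-like). It is not the Lucene implementation we actually used, and the field handling is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_tfidf_baseline(faq_entries, use_answers=False):
    """Return a search function over the FAQ: question-only index (sys1-like)
    or question-plus-answer index (sys2-like)."""
    docs = [e["question"] + ((" " + e["answer"]) if use_answers else "")
            for e in faq_entries]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs)
    ids = [e["id"] for e in faq_entries]

    def search(query, k=10):
        scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
        top = scores.argsort()[::-1][:k]
        return [(ids[i], float(scores[i])) for i in top]

    return search
```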
4 Results

In Table 2 we report the results of the two runs of our system, compared with the official baseline provided by the organizers and with our two Lucene baselines. The only difference between our two runs is that the first one always tried to retrieve an answer, while the second one would abstain from answering when the system confidence was below 0.5.

                    Accuracy@1   MAP     Top 10
run 1               35.87        51.12   73.94
run 2               37.46        50.10   71.91
qa4faq-baseline     40.76        58.97   81.71
FBK-Baseline-sys1   39.79        55.36   76.15
FBK-Baseline-sys2   35.16        53.02   80.92

Table 2: Results on the test set. Accuracy@1: official score, MAP: Mean Average Precision, Top 10: correct answer in the first 10 results.

The organizers' baseline (qa4faq-baseline, https://github.com/swapUniba/qa4faq) was built using Lucene with a weighted index: for each FAQ entry, a Document with four fields (id, FAQ question with weight 4, FAQ answer with weight 2, tag with weight 1) was indexed.

We use three different metrics to evaluate the systems: Accuracy@1, which is the official score used to
rank the systems, MAP and Top10. Accuracy@1 is the precision of the system taking into account only the first answer; it is computed as follows:

Accuracy@1 = (n_c + n_u * n_c / n) / n    (5)

where n_c is the number of correctly answered queries, n_u is the number of unanswered queries and n is the total number of queries. MAP is the Mean Average Precision, that is, the mean of the average precision scores over all queries, which here corresponds to the inverse of the rank of the correct answer. Top10 is the percentage of queries with the correct answer in the first 10 positions. A sketch of how these metrics can be computed is given below.
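The three metrics can be computed as in the following sketch, which follows Eq. (5) and the definitions above; the data structures and the treatment of unanswered queries in MAP and Top10 are our own assumptions.

```python
def evaluate(ranked_ids, gold_ids):
    """ranked_ids: query id -> ranked list of FAQ ids returned by the system
    (an empty list means the query was left unanswered);
    gold_ids: query id -> correct FAQ id."""
    n = len(gold_ids)
    n_c = sum(1 for q, ranking in ranked_ids.items()
              if ranking and ranking[0] == gold_ids[q])
    n_u = sum(1 for ranking in ranked_ids.values() if not ranking)
    accuracy_at_1 = (n_c + n_u * n_c / n) / n          # Eq. (5)

    inverse_ranks = []
    in_top_10 = 0
    for q, ranking in ranked_ids.items():
        if gold_ids[q] in ranking:
            rank = ranking.index(gold_ids[q]) + 1      # 1-based rank of the correct answer
            inverse_ranks.append(1.0 / rank)
            if rank <= 10:
                in_top_10 += 1
        else:
            inverse_ranks.append(0.0)                  # assumed: missing answer contributes 0
    map_score = sum(inverse_ranks) / n                 # "MAP" as defined in the text
    top_10 = 100.0 * in_top_10 / n
    return accuracy_at_1, map_score, top_10
```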
Both our runs underperformed the baseline on all three metrics.

Comparing our runs, it is interesting to notice that run 2 performs better when evaluated with Accuracy@1, but worse on the other two metrics; this suggests that, even in some cases where the system confidence was below the threshold, the correct answer was among the top 10.

5 Error Analysis

The results of our system on the development set, described in Section 3.1, compared with the official baseline, are reported in Table 3. As can be seen, both runs outperform the baseline on every metric, especially on Accuracy@1.

                    Version 1                          Version 2
                    Accuracy@1   MAP     Top 10        Accuracy@1   MAP     Top 10
Run 1               72.00        79.64   95.00         66.17        74.77   92.48
Run 2               72.45        77.55   92.00         66.36        73.26   90.23
qa4faq-baseline     69.00        76.22   89.50         60.15        70.22   88.72
FBK-baseline-sys1   49.00        58.63   76.50         39.47        49.53   68.05
FBK-baseline-sys2   52.00        62.69   82.50         49.62        62.10   86.09

Table 3: Results on the development sets. Accuracy@1: official score, MAP: Mean Average Precision, Top 10: correct answer in the first 10 results.

This difference in behavior highlights a significant difference between the development set and the test set. The systems were developed without knowing the target style, and without training data, so it is not surprising that the system is not capable of style adaptation.

An interesting aspect that describes the difference between the development set and the test set is reported in Table 4: the average and the standard deviation of the number of tokens per query. From the first line it is possible to notice not only that our development queries have, on average, more tokens than the test queries, but also that their standard deviation is significantly lower. This distribution of tokens is in line with a qualitative check of the test set. The test set includes incomplete sentences containing only keywords, e.g. "costo depurazione", alongside long questions that include a verbose description of the situation, e.g. "Mia figlia acquisterà casa a bari il giorno 22 prossimo. Come procedere per l intestazione dell utenza? Quali documenti occorrono e quali i tempi tecnici necessari?". The development set, instead, is composed of well-formed queries that are more similar to one another in structure.

                         Version 1       Version 2       Test set
   Queries               11.42 ± 4.12    11.20 ± 3.95    7.96 ± 7.27
R1 Answered queries      11.42 ± 4.12    11.20 ± 3.95    7.96 ± 7.27
   Right queries         11.63 ± 4.15    11.41 ± 4.06    7.32 ± 5.44
   Wrong queries         10.88 ± 4.00    10.78 ± 3.69    8.32 ± 8.09
R2 Answered queries      11.56 ± 4.12    11.30 ± 3.94    8.09 ± 7.41
   Right queries         11.77 ± 4.12    11.52 ± 4.04    7.37 ± 5.47
   Wrong queries         11.02 ± 4.06    10.86 ± 3.71    8.52 ± 8.33
B  Answered queries      11.42 ± 4.12    11.20 ± 3.95    7.97 ± 7.27
   Right queries         11.94 ± 4.34    11.73 ± 4.35    7.54 ± 5.98
   Wrong queries         10.26 ± 3.31    10.40 ± 3.09    8.26 ± 8.02

Table 4: Average and standard deviation of the number of tokens per query. R1: Run 1, R2: Run 2, B: organizers' baseline qa4faq-baseline.
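The figures in Table 4 are simple token-count statistics; the following minimal sketch shows how they can be reproduced (the query lists are placeholders).

```python
import numpy as np
from nltk.tokenize import word_tokenize

def token_stats(queries):
    """Average and standard deviation of the number of tokens per query."""
    lengths = [len(word_tokenize(q, language="italian")) for q in queries]
    return float(np.mean(lengths)), float(np.std(lengths))

# Hypothetical usage, comparing correctly and incorrectly answered queries:
# print(token_stats(right_queries))
# print(token_stats(wrong_queries))
```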
All systems behave in roughly the same way across the data sets: in the two versions of the development set, the correctly answered queries are longer and have a higher standard deviation than the wrongly answered ones; in the test set, on the other hand, the correct queries are shorter and have a lower standard deviation.

We did a qualitative analysis of the results of our systems, limiting our observation to the 250 queries of the test set for which the right answer was not among the first ten retrieved by our systems. We considered these to be the worst cases and wanted to investigate whether they present an issue that cannot be solved with our approach.

We present some of these cases in this section. In Example 1, the answer of the system is only weakly related to the query: the query is very short and its meaning is contained in both the gold standard and the system answer. In the gold standard, the substitution of the meter ("sostituzione del contatore") is the main focus of the sentence, and the other part is just a specification of a detail ("con saracinesca bloccata").

In the system answer, the substitution of the meter ("sostituzione del contatore") is the effect of the main focus ("Per la telelettura"), but our approach cannot differentiate between these two types of text that are not directly related to the query.

Example 1
Query: sostituzione del contatore
Gold standard: Come effettuare il cambio del contatore vecchio con saracinesca bloccata?
System answer: Per la telelettura il contatore sarà sostituito con un nuovo contatore?
A similar issue is visible in Example 2. In this case, the first part of the system answer ("Quali sono i tempi di allaccio di un contatore") matches the query almost exactly, but, as in Example 1, the second part ("in caso di ripristino in quanto l'abitazione aveva già la fornitura?"), which is not very relevant to the query, was not enough to lower the overall ranking of this FAQ. We think this issue could be avoided with some additional features, but this would require training data for a machine learning approach, or some domain knowledge to craft a rule-based approach.

Example 2
Query: quali sono i tempi di attivazione di un contatore ?
Gold standard: Quali sono i tempi previsti per ottenere un allacciamento?
System answer: Quali sono i tempi di allaccio di un contatore in caso di ripristino in quanto l'abitazione aveva già la fornitura?

In some cases, like in Example 3, the semantic match (such as common or related words in both sentences) is not enough to understand the relationship, or could even be misleading. Some knowledge of the world and some cause-effect reasoning is needed to understand that the gold standard is more related to the query than the system answer. Even if the balance ("conguaglio") and the time expressions ("quando", "luglio e agosto e un po di settembre") are present in both the query and the system answer, and not in the gold standard, they are not useful for finding the correct answer.

Example 3
Query: ho ricevuto una bolletta di conguaglio di e 426.69 , ma son mancata da casa a luglio e agosto e un po di settembre , senza consumare , come mai?
Gold standard: Perché ho ricevuto una fattura elevata?
System answer: Il conguaglio quando avviene?

Alongside this issue, there are some cases (Example 4) where our system answers correctly but, due to the semi-automatic nature of the gold standard, its answer has been considered wrong.

Example 4
Query: chi paga la portella del contatore?
Gold standard: Come richiedere la sostituzione dello sportello della nicchia contatore?
System answer: Chi paga la portella del contatore?

Example 5 represents one of the cases in which the system's answer has been considered wrong but is more related to the query than the gold standard.

Example 5
Query: abito in un condominio con 5 famiglie . se alla scadenza di una bolletta uno dei condomini non vuole pagare la sua quota , possono gli altri 4 pagare la loro parte su un altro bollettino postale?
Gold standard: Quali sono le modalità di pagamento delle fatture?
System answer: Contratto condominiale, di cui uno moroso come comportarsi?
6 Conclusion
We reported on the system we used in the EVALITA 2016 QA4FAQ shared task, as well as on the development set we created to evaluate it and on an analysis of our results.

We found that, while our system performed below the baseline on the official test set, it showed superior performance on our in-house development set. This is apparently related to the different style of the two sets: ours has longer queries, which are more homogeneous with respect to size, while the official one has many very short queries and a few very long ones.

It could be argued that the official test set represents a more realistic scenario than the development set we created, since it contains actual user queries, thus diminishing the relevance of our results. However, further analysis showed that in a number of cases our system returned a more appropriate FAQ question/answer pair than the one in the gold standard, due to the semi-automatic nature of the gold standard.

We hypothesize that our system performed better than the official results suggest; however, due to the size of the test set, it would be prohibitive to check this manually and arrive at a more precise accuracy figure.
References
Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Georgiana Dinu and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.

Lluís Màrquez, James Glass, Walid Magdy, Alessandro Moschitti, Preslav Nakov, and Bilal Randeree. 2015. SemEval-2015 Task 3: Answer Selection in Community Question Answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, James Glass, and Bilal Randeree. 2016. SemEval-2016 Task 3: Community Question Answering. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, California. Association for Computational Linguistics.