<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tweaking Word Embeddings for FAQ Ranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Erick R. Fonseca</string-name>
          <email>rocha@fbk.eu</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Magnolini</string-name>
          <email>magnolini@fbk.eu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Feltracco</string-name>
          <email>feltracco@fbk.eu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed R. H. Qwaider</string-name>
          <email>qwaider@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <email>magnini@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Povo-Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler, University of Brescia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fondazione Bruno Kessler, University of Pavia, Italy, University of Bergamo</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of São Paulo</institution>
          ,
          <addr-line>Brazil, Fondazione Bruno Kessler</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. We present the system developed at FBK for the EVALITA 2016 Shared Task “QA4FAQ - Question Answering for Frequently Asked Questions”. A peculiar characteristic of this task is the total absence of training data, so we created a meaningful representation of the data using only word embeddings. We present the system as well as the results of the two submitted runs, and a qualitative analysis of them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Italiano.</title>
      <p>We present the system developed at FBK for the EVALITA 2016 task "QA4FAQ - Question Answering for Frequently Asked Questions". A peculiar characteristic of this task is the total lack of training data, so we created a meaningful representation of the data using only word embeddings. We present the system together with the results of the two submitted runs, and a qualitative analysis of those results.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>
        FAQ ranking is an important task within the wider task of question answering, which is currently a topic of great interest for both research and business. Analyzing Frequently Asked Questions is a way to maximize the value of a type of knowledge source that could otherwise be difficult to consult. A similar task was proposed in two SemEval editions
        <xref ref-type="bibr" rid="ref3">(Màrquez et al., 2015)</xref>
        and
        <xref ref-type="bibr" rid="ref4">(Nakov et al., 2016)</xref>
        .
      </p>
      <p>Given a knowledge base composed of about 470 questions (henceforth, FAQ questions), their respective answers (henceforth, FAQ answers) and metadata (tags), the task consists in retrieving the most relevant FAQ question/answer pair for each of the queries provided by the organizers.</p>
      <p>For this task, no training data were provided,
ruling out machine learning based approaches. We
took advantage of the a priori knowledge
provided by word embeddings, and developed a word
weighting scheme to produce vector
representations of the knowledge base questions, answers
and the user queries. We then rank the FAQs with
respect to their cosine similarity to the queries.</p>
      <p>The paper is organized as follows. Section 2 presents the system we built and Section 3 reports on the development data we created in order to test our system. In Section 4 we show the results we obtained, followed by Section 5, which presents an error analysis. Finally, Section 6 provides some conclusions.</p>
    </sec>
    <sec id="sec-3">
      <title>2 System Description</title>
      <p>Our system was based on creating vector
representations for each user query (from the test set),
question and answer (from the knowledge base),
and then ranking the latter two according to the
cosine distance to the query.</p>
      <p>We created the vectors using the word
embeddings generated by Dinu and Baroni (2014) and
combined them in a way to give more weight to
more important words, as explained below. Since
no training data was available, using word
embeddings was especially interesting, as they could
provide our system with some kind of a priori
knowledge about similar words.</p>
      <p>
        We applied the same operations to queries, FAQ questions and answers; here we will use the term text to refer to any of the three.
In order to create vector representations for texts,
the following steps were taken:
1. Tokenization. The text is tokenized with
NLTK’s
        <xref ref-type="bibr" rid="ref1">(Bird et al., 2009)</xref>
        Italian model,
yielding a token list X.
2. Filtering. Stopwords (obtained from
NLTK’s stopword list) and punctuation signs
are discarded from X.
      </p>
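      <p>As an illustration of steps 1 and 2, a minimal sketch in Python, assuming NLTK's Italian tokenizer and stopword list are installed (the function and variable names are ours, for illustration only):</p>
      <preformat>
import string

import nltk
from nltk.corpus import stopwords

# One-time setup: nltk.download('punkt'); nltk.download('stopwords')
ITALIAN_STOPWORDS = set(stopwords.words('italian'))

def preprocess(text):
    """Steps 1-2: tokenize with NLTK's Italian model, then drop
    stopwords and punctuation signs."""
    tokens = nltk.word_tokenize(text, language='italian')
    return [t for t in tokens
            if t.lower() not in ITALIAN_STOPWORDS
            and not all(c in string.punctuation for c in t)]

X = preprocess("Cos'è la quota fissa riportata in fattura?")
</preformat>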
      <sec id="sec-3-1">
        <title>3. Acronym Substitution</title>
        <p>Some words and expressions are replaced by their acronyms.
We performed this replacement in order to
circumvent cases where a query could have
an acronym while the corresponding FAQ
has the expression fully written, which would
lead to a similarity score lower than expected.
For example, we replaced Autorità Idrica Pugliese with AIP and Bari with BA. In total, 21 expressions were checked.</p>
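        <p>A sketch of this replacement step; the mapping below shows only the two expressions mentioned above, while the full hand-crafted list has 21 entries:</p>
        <preformat>
import re

# Two of the 21 expressions mentioned in the text; the rest are analogous.
ACRONYMS = {
    'Autorità Idrica Pugliese': 'AIP',
    'Bari': 'BA',
}

def substitute_acronyms(text):
    """Step 3: replace known expressions with their acronyms."""
    for expression, acronym in ACRONYMS.items():
        text = re.sub(re.escape(expression), acronym, text, flags=re.IGNORECASE)
    return text
</preformat>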
      </sec>
      <sec id="sec-3-2">
        <title>4. Out-of-vocabulary Terms</title>
        <p>When words out of the embedding vocabulary are found in a FAQ question or answer, a random embedding is generated for them from a normal distribution with mean 0 and standard deviation 0.1 (out-of-vocabulary words that only appear in the queries are removed from X). The same embedding is used for any new occurrence of that word. This includes any acronyms introduced in the previous step.
5. IDF Computation. We compute the document frequency (DF) of each word as the proportion of questions or answers in which it appears; when comparing queries to FAQ questions, we only count occurrences in questions, and likewise, when comparing queries to answers, we only count occurrences in answers. Then, we compute the inverse document frequency (IDF) of words as:</p>
        <p>\[ \mathrm{IDF}(w) = \begin{cases} 1/\mathrm{DF}(w) &amp; \text{if } \mathrm{DF}(w) &gt; 0 \\ 10 &amp; \text{otherwise} \end{cases} \tag{1} \]</p>
        <p>We found that tweaking the DF by discounting the FAQ tags could improve our system's performance: when counting words in questions and answers to compute their DF, we ignore any word present among the tags for that FAQ entry. Thus, tag words, which are supposed to be more relevant, have a lower DF and a higher IDF value.</p>
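        <p>A minimal sketch of steps 4 and 5 under the definitions above; all names are illustrative, and pretrained stands for the word-to-vector mapping loaded from the embeddings:</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)
DIM = 300            # assumed dimensionality of the pretrained vectors
oov_vectors = {}     # cache: one fixed random vector per OOV word

def embedding(word, pretrained):
    """Step 4: pretrained vector if available, otherwise a cached random
    vector drawn from a normal distribution with mean 0 and std 0.1."""
    if word in pretrained:
        return pretrained[word]
    if word not in oov_vectors:
        oov_vectors[word] = rng.normal(0.0, 0.1, DIM)
    return oov_vectors[word]

def idf(word, docs, tags):
    """Step 5: equation (1). docs are the tokenized FAQ questions (or
    answers); tags[i] are the tag words of FAQ entry i, which are
    ignored when counting DF."""
    count = sum(1 for doc, t in zip(docs, tags)
                if word in doc and word not in t)
    df = count / len(docs)
    return 1.0 / df if df > 0 else 10.0
</preformat>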
      </sec>
      <sec id="sec-3-3">
        <title>6. Multiword Expressions</title>
        <p>We compute the embeddings for 15 common multiword
expressions (MWEs) we extracted from the
FAQ. They are computed as the average of
the embeddings of the MWE components,
weighted by their IDF. If an MWE is present
in the text, we add a token to X
containing the whole expression, but do not remove
the individual words. An example is codice
cliente: we add codice cliente to X, but still
keep codice and cliente.</p>
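        <p>A sketch of the MWE handling; the IDF-weighted average is taken over the component words (names illustrative):</p>
        <preformat>
import numpy as np

def mwe_embedding(mwe, emb, idf):
    """Step 6: the MWE vector is the IDF-weighted average of its parts.
    emb maps words to vectors, idf maps words to IDF scores."""
    words = mwe.split()
    weights = np.array([idf[w] for w in words])
    vectors = np.array([emb[w] for w in words])
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

def add_mwes(tokens, mwes):
    """If an MWE occurs in the text, append it as one extra token to X,
    keeping the individual component words (e.g. 'codice cliente')."""
    text = ' '.join(tokens)
    return tokens + [m for m in mwes if m in text]
</preformat>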
      </sec>
      <sec id="sec-3-4">
        <title>7. SIDF Computation</title>
        <p>We compute the Similarity-IDF (SIDF) scores. This metric
can be seen as an extension of the IDF which
also incorporates the DF of similar words. It
is computed as follows:</p>
        <p>\[ \mathrm{SDF}(w) = \mathrm{DF}(w) + \sum_{w' \in W_{\mathrm{sim}}} \cos(w, w') \, \mathrm{DF}(w') \tag{2} \]</p>
        <p>\[ \mathrm{SIDF}(w) = \frac{1}{\mathrm{SDF}(w)} \tag{3} \]</p>
        <p>Here, Wsim denotes the set of the n most similar words to w which have non-zero DF. Note that under this definition, SDF is never null, and thus we do not need the special case as in the IDF computation. We can also compute the SIDF for the MWEs introduced to the texts.</p>
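        <p>A sketch of the SIDF computation following equations (2) and (3); the neighborhood size n and the cosine weighting are assumptions of this sketch:</p>
        <preformat>
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sidf(word, vec, df, vocab_vectors, n=10):
    """Step 7: equations (2)-(3). vocab_vectors maps words with non-zero
    DF to their embeddings; df maps words to their document frequency.
    The neighborhood size n is an assumed value."""
    sims = {w: cosine(vec, v) for w, v in vocab_vectors.items() if w != word}
    w_sim = sorted(sims, key=sims.get, reverse=True)[:n]  # n most similar
    sdf = df.get(word, 0.0) + sum(sims[w] * df[w] for w in w_sim)  # eq. (2)
    return 1.0 / sdf                                               # eq. (3)
</preformat>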
      </sec>
      <sec id="sec-3-5">
        <title>8. Embedding Averaging</title>
        <p>After these steps, we take the mean of the embeddings, weighted by the SIDF values of their corresponding words:</p>
        <p>\[ v = \frac{\sum_{w \in X} E(w) \, \mathrm{SIDF}(w)}{|X|} \tag{4} \]</p>
        <p>Here, v stands for the vector representation of the text and E(·) is the function mapping words and MWEs to their embeddings. Note that we do not remove duplicate words.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>FAQ entry 272 from the knowledge base, with the corresponding development set queries.</p>
          </caption>
          <table>
            <tbody>
              <tr>
                <td>FAQ question</td>
                <td>Cos'è la quota fissa riportata in fattura?</td>
              </tr>
              <tr>
                <td>FAQ answer</td>
                <td>La quota fissa, prevista dal piano tariffario deliberato, è addebitata in ciascuna fattura, ed è calcolata in base ai moduli contrattuali ed ai giorni di competenza della fattura stessa. La quota fissa è dovuta indipendentemente dal consumo in quanto attiene a parte dei costi fissi che il gestore sostiene per erogare il servizio a tutti. Quindi nella fattura è addebitata proporzionalmente al periodo fatturato.</td>
              </tr>
              <tr>
                <td>FAQ tags</td>
                <td>fattura, quota, fissa, giorni, canone acqua e fogna, quota fissa, costi fissi, quote fisse</td>
              </tr>
              <tr>
                <td>Dev set paraphrased query</td>
                <td>Cosa si intende per quota fissa nella fattura?</td>
              </tr>
              <tr>
                <td>Dev set answer-driven query</td>
                <td>La quota fissa è indipendente dai consumi?</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
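        <p>Equation (4) in code form (illustrative names; emb and sidf stand for the functions sketched in the previous steps):</p>
        <preformat>
import numpy as np

def text_vector(tokens, emb, sidf):
    """Step 8: SIDF-weighted mean over all tokens in X,
    duplicates included."""
    return np.sum([emb(w) * sidf(w) for w in tokens], axis=0) / len(tokens)
</preformat>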
        <p>In this process, the IDF and SIDF values are calculated independently for answers and questions in the FAQ. When processing queries, the value actually used depends on which one we are comparing the query vectors with.</p>
        <p>After computing vectors for all texts, we compute the cosine similarity between query vectors and FAQ questions, and also between queries and answers. For each FAQ entry, we take the highest value between these two as the system confidence for returning that entry as an answer to the query.</p>
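        <p>A sketch of this final ranking step (names illustrative):</p>
        <preformat>
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_faq_entries(query_vec_q, query_vec_a, question_vecs, answer_vecs):
    """The confidence for an entry is the higher of the query-question and
    query-answer cosine similarities; entries are returned best-first.
    query_vec_q / query_vec_a are the query vectors built with the
    question-side and answer-side IDF/SIDF values, respectively."""
    scores = [max(cosine(query_vec_q, q), cosine(query_vec_a, a))
              for q, a in zip(question_vecs, answer_vecs)]
    return sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)
</preformat>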
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Evaluating our system</title>
      <p>In order to evaluate our system, we created a development set and we calculated a baseline as a reference threshold.</p>
      <sec id="sec-4-1">
        <title>3.1 Development Set</title>
        <p>We manually created a dataset of 293 queries to test our systems. Each query in the dataset is associated with one of the entries provided in the knowledge base. In particular, the dataset is composed of 160 paraphrased queries and 133 answer-driven queries. The paraphrased queries were obtained by paraphrasing original FAQ questions; the answer-driven queries were generated without considering the original FAQ questions, but have an answer in the knowledge base. Table 1 shows an example of a paraphrased query and an answer-driven query for FAQ 272 of the knowledge base.</p>
        <p>Given the technical domain of the task, most of the generated paraphrases recall lexical items of the original FAQ question (e.g. "uso commerciale", "scuola pubblica", etc.). Differently, the answer-driven queries are not necessarily similar in content and lexicon to the FAQ question; instead, we expected them to have a very high similarity with the answer.</p>
        <p>We guided the development of our system by evaluating it with different versions of this dataset. In particular, version 1 is composed of 200 queries, being 160 paraphrased and 40 answer-driven, and version 2 is composed of 266 queries, 133 paraphrased and 133 answer-driven.</p>
        <p>Merging paraphrased queries and answer-driven queries (in different proportions) allows us to create a very heterogeneous dataset; we expected the test set and, in general, the questions posed by users to be just as varied.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2 Baseline</title>
        <p>Two baseline systems were built using Apache Lucene (https://lucene.apache.org/). FBK-Baseline-sys1 was built by indexing, for each FAQ entry, a Document with two fields (id, FAQ question), while FBK-Baseline-sys2 was built by indexing, for each FAQ entry, a Document with three fields (id, FAQ question, FAQ answer).</p>
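        <p>The baselines rely on Apache Lucene; as a Python illustration of the same per-entry, multi-field indexing scheme, here is a sketch using the Whoosh library instead of Lucene (field names and the example query are ours):</p>
        <preformat>
import os

from whoosh import index
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import MultifieldParser

# FBK-Baseline-sys2: one document per FAQ entry with id, question and
# answer; sys1 would omit the answer field.
schema = Schema(id=ID(stored=True), question=TEXT, answer=TEXT)
os.makedirs('faq_index', exist_ok=True)
ix = index.create_in('faq_index', schema)

writer = ix.writer()
writer.add_document(id='272',
                    question="Cos'è la quota fissa riportata in fattura?",
                    answer="La quota fissa, prevista dal piano tariffario...")
writer.commit()

with ix.searcher() as searcher:
    parser = MultifieldParser(['question', 'answer'], ix.schema)
    hits = searcher.search(parser.parse('quota fissa'), limit=10)
    best_ids = [hit['id'] for hit in hits]
</preformat>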
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Results</title>
      <p>In Table 2 we report the results of the two runs of our system compared with the official baseline provided by the organizers. The only difference between our two runs was that the first one always tried to retrieve an answer, while the second one would abstain from answering when the system confidence was below 0.5.</p>
      <p>The organizers' baseline (qa4faq-baseline, https://github.com/swapUniba/qa4faq) was built using Lucene with a weighted index: for each FAQ entry, a Document with four fields (id, FAQ question with weight 4, FAQ answer with weight 2, tags with weight 1) was indexed.</p>
      <p>We use three different metrics to evaluate the system: Accuracy@1, which is the official score used to rank the systems, MAP and Top10. Accuracy@1 is the precision of the system taking into account only the first answer; it is computed as follows:</p>
      <p>\[ \mathrm{Accuracy@1} = \frac{1}{n} \left( n_c + n_u \frac{n_c}{n} \right) \tag{5} \]</p>
      <p>where nc is the number of correctly answered queries, nu is the number of unanswered queries and n is the total number of queries. MAP is the Mean Average Precision, that is, the mean of the average precision scores for each query, i.e. the inverse of the rank of the correct answer. Top10 is the percentage of queries with the correct answer in the first 10 positions. Both of our runs underperformed the baseline in all three metrics we use to evaluate the systems.</p>
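      <p>A direct transcription of equation (5) in Python:</p>
      <preformat>
def accuracy_at_1(n_correct, n_unanswered, n_total):
    """Equation (5): unanswered queries receive partial credit in
    proportion to the precision achieved on the answered ones."""
    return (n_correct + n_unanswered * n_correct / n_total) / n_total
</preformat>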
      <p>Comparing our runs, it is interesting to notice that run 2 performs better when evaluated with Accuracy@1, but worse in the other two metrics; this suggests that, even in some cases where the system confidence was below the threshold, the correct answer was among the top 10.</p>
    </sec>
    <sec id="sec-6">
      <title>5 Error Analysis</title>
      <p>The results of our system on the development set,
described in Section 3.1, compared with the
official baseline are reported in Table 3.</p>
      <p>As can be seen, both runs outperform the baseline in every metric, especially in Accuracy@1.</p>
      <p>This difference in behavior highlights a significant difference between the development set and the test set. The systems were developed without knowing the target style, and without training data, so it is not surprising that the system is not capable of style adaptation.</p>
      <p>An interesting aspect that describes the difference between the development set and the test set is reported in Table 4: the average and the standard deviation of the number of tokens of every query. From the first line it is possible to notice that not only do our development queries have, on average, more tokens than the test queries, but also that their standard deviation is significantly lower. This distribution of tokens is in line with a qualitative check of the test set. The test set includes incomplete sentences with only keywords, e.g. "costo depurazione", alongside long questions that include a verbose description of the situation, e.g. "Mia figlia acquisterà casa a bari il giorno 22 prossimo. Come procedere per l intestazione dell utenza? Quali documenti occorrono e quali i tempi tecnici necessari?". The development set, instead, is composed of queries that are more similar in structure and well formed.</p>
      <p>All systems perform in almost the same way according to the data sets: in the two versions of the development set, the correctly answered queries are longer, with a higher standard deviation, compared to the wrong ones; on the other hand, in the test set the correctly answered queries are shorter, with a lower standard deviation.</p>
      <p>We did a qualitative analysis of the results of our systems; we limited our observation to the 250 queries of the test set for which the right answer was not in the first ten retrieved by our systems. We considered these cases to be the worst, and wanted to investigate whether they present an issue that cannot be solved using our approach.</p>
      <p>We present in this section some of these cases. In Example 1, the answer of the system is only weakly related to the query: the query is very short and its meaning is contained in both the gold standard and the system answer. In the gold standard, the substitution of the meter ("sostituzione del contatore") is the main focus of the sentence, and the other part is just a specification of some detail ("con saracinesca bloccata").</p>
      <p>In the system answer, the substitution of the meter ("sostituzione del contatore") is the effect of the main focus ("Per la telelettura"), but our approach cannot differentiate between these two types of text that are not directly related to the query.</p>
      <sec id="sec-6-1">
        <title>Example 1</title>
        <p>Query: sostituzione del contatore
Gold standard: Come effettuare il cambio del contatore vecchio con saracinesca bloccata?
System answer: Per la telelettura il contatore sarà sostituito con un nuovo contatore?</p>
        <p>A similar issue is visible in Example 2. In this
case, the first part ("Quali sono i tempi di allaccio di un contatore") of the system answer matches the query almost exactly but, as in Example 1, the second part ("in caso di ripristino in quanto l'abitazione aveva già la fornitura?"), which is not very relevant to the query, was not enough to reduce the overall ranking of this FAQ. We think this issue could be avoided with some more features, but this would require either training data for a machine learning approach, or some knowledge of the domain to craft a rule-based approach.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Example 2</title>
        <p>Query: quali sono i tempi di attivazione di un contatore?
Gold standard: Quali sono i tempi previsti per ottenere un allacciamento?
System answer: Quali sono i tempi di allaccio di un contatore in caso di ripristino in quanto l'abitazione aveva già la fornitura?</p>
        <p>In some cases, like in Example 3, the semantic match (like common or related words in both sentences) is not enough to understand the relationship, or could even be misleading. Some knowledge of the world and some cause-effect reasoning is needed to understand that the gold standard is more related to the query than the system answer. Even if the balance ("conguaglio") and time expressions ("quando", "luglio e agosto e un po di settembre") are present in both the query and the system answer, and not in the gold standard, they are not useful for finding the correct answer.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Example 3</title>
        <p>Query: ho ricevuto una bolletta di conguaglio di € 426.69, ma son mancata da casa a luglio e agosto e un po di settembre, senza consumare, come mai?
Gold standard: Perché ho ricevuto una fattura elevata?
System answer: Il conguaglio quando avviene?</p>
        <p>Alongside this issue, there are some cases
(Example 4) where our system answers correctly, but
due to the semi-automatic nature of the gold
standard it has been considered wrong.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Example 4</title>
        <p>Query: chi paga la portella del contatore?
Gold standard: Come richiedere la sostituzione
dello sportello della nicchia contatore?
System answer: Chi paga la portella del
contatore?</p>
        <p>Example 5 represents one of the cases in which the system's answer has been considered wrong but is more related to the query than the gold standard.</p>
      </sec>
      <sec id="sec-6-5">
        <title>Example 5</title>
        <p>Query: abito in un condominio con 5 famiglie. se alla scadenza di una bolletta uno dei condomini non vuole pagare la sua quota, possono gli altri 4 pagare la loro parte su un altro bollettino postale?
Gold standard: Quali sono le modalità di pagamento delle fatture?
System answer: Contratto condominiale, di cui uno moroso come comportarsi?</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6 Conclusions</title>
      <p>We reported on the system we used in the EVALITA 2016 QA4FAQ shared task, as well as the development set we created to evaluate it and an analysis of our results.</p>
      <p>We found that, while our system performed below the baseline on the official test set, it had superior performance on our in-house development set. This is apparently related to the different styles of the two sets: ours has longer queries, which are more homogeneous with respect to size, while the official one has many very short queries and a few very long ones.</p>
      <p>It could be argued that the official test set represents a more realistic scenario than the development set we created, since it contains actual user queries, thus diminishing the relevance of our results. However, further analysis showed that in a number of cases our system returned a more appropriate FAQ question/answer pair than the one in the gold standard, due to the gold standard's semi-automatic nature.</p>
      <p>We hypothesize that our system performed better than the official results suggest; however, due to the size of the test set, it would be prohibitive to check this manually and arrive at a more precise accuracy figure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Steven Bird, Edward Loper, and Ewan Klein. <year>2009</year>. <source>Natural Language Processing with Python</source>. O'Reilly Media Inc.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Georgiana Dinu and Marco Baroni. <year>2014</year>. <article-title>Improving zero-shot learning by mitigating the hubness problem</article-title>. <source>arXiv preprint arXiv:1412.6568</source>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Lluís Màrquez, James Glass, Walid Magdy, Alessandro Moschitti, Preslav Nakov, and Bilal Randeree. <year>2015</year>. <article-title>SemEval-2015 Task 3: Answer selection in community question answering</article-title>. In <source>Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)</source>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, James Glass, and Bilal Randeree. <year>2016</year>. <article-title>SemEval-2016 Task 3: Community question answering</article-title>. In <source>Proceedings of the 10th International Workshop on Semantic Evaluation</source>, San Diego, California. Association for Computational Linguistics.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>