Question Answering System Using Query Expansion and Heuristic Features

Yolanda Sjafrizal, Indra Budi, Mirna Adriani and Philip Arthur
Information Retrieval Laboratory, Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia
yolanda.sjafrizal@ui.ac.id, indra@cs.ui.ac.id, mirna@cs.ui.ac.id, philip.arthur@ui.ac.id

Abstract: This paper describes the design of a question answering system that participated in the main task of QA4MRE at CLEF 2013. The system first performs a preprocessing stage on each document and its related questions. It then identifies the type of each question in order to search for answers with the most appropriate approach. To find the answers, the system uses eight heuristic features and two query expansion techniques, namely pseudo relevance feedback and WordNet as a lexical database. Nine runs were submitted at CLEF 2013, differing in the combination of query expansion techniques and features used. Evaluation is based on the c@1 measure. The best results are obtained when the system uses both query expansion techniques and all features with certain weights.

Keywords: Question answering system, query expansion, pseudo relevance feedback, WordNet.

1 Introduction

CLEF (Conference and Labs of the Evaluation Forum) is an organization that plays an important role in advancing research and development in the field of information retrieval, by providing the infrastructure for a number of activities, including a series of evaluation labs on information systems and a number of conferences. One evaluation lab conducted by CLEF is QA4MRE (Question Answering for Machine Reading Evaluation), whose main objective is to develop the best methodology for machine reading systems, evaluated through a process of questioning and testing the level of understanding of a reading [1]. Based on these issues, we conducted a number of experiments to develop a question answering system using the QA4MRE datasets of 2011 and 2012. The main task of this system is to build a thorough understanding of a single document provided as reading material, and then answer a number of questions related to that document. Questions are provided in multiple-choice form, each with 5 answer choices.

This paper proposes two query expansion techniques: pseudo relevance feedback [2] and expansion using WordNet as a lexical database. In addition, the system uses eight heuristic features that assign a score to each answer choice. To support the process of finding answers, the system also uses REVERB (https://github.com/knowitall/reverb-core/) to obtain relation phrases from the documents, and snippets from Google as additional knowledge.

Section 2 below describes the QA4MRE at CLEF 2013 dataset used by this question answering system. Section 3 explains the stages of the process in detail, and Section 4 reports the results of the system on the 2013 dataset.

2 Dataset of QA4MRE at CLEF 2013

The dataset provided by QA4MRE at CLEF 2013 consists of four main topics, namely AIDS, climate change, Music & Society, and Alzheimer's disease. Four reading tests are provided for each topic, where a reading test consists of a single document with 15/20 questions and 5 answer choices for each question.
In general, the 2013 dataset therefore contains:

- 16 test documents (4 documents for each topic)
- 240/320 questions (15/20 questions for each document)
- 1200/1600 answer choices (5 answer choices for each question)

The 2013 dataset is provided in a number of languages: Arabic, Bulgarian, English, Romanian, and Spanish. The scope of this experiment was limited to the English dataset only.

3 Steps of the System

3.1 Pre-Processing

We conducted pre-processing on the dataset (the documents and the questions). The process consists of 10 standard natural language processing stages: converting all numbers into a standard format, tokenization, named-entity recognition, filtering out punctuation characters that are considered useless, lowercasing, anaphora resolution, number recognition, lemmatization, stopword removal, and negation processing. In negation processing, every phrase that carries a negated meaning, such as "not present" (containing the word "not"), is changed into the form "not-present". We do this so that the system can distinguish a word from its negation.

3.2 Question Identification

In this step, we identify the type of each question, divided into factoid and non-factoid. The identification process is based on the question words contained in the question, such as "what", "why", "where", "who", "when", "how", and "do", as well as the identifier to the right of the question word (if any), such as countries, time, year, many, etc. In addition, there are questions that do not contain a question word, so they are shaped more like an imperative sentence. Sample questions:

- "In what year did she go to England?" There is a question word "what" and an identifier "year", so the question is identified as type DATE.
- "Who is the president?" There is a question word "who", so the question is identified as type PERSON.

After the type of a question has been identified, the question is given a tag to indicate what kind of answer is expected. We use eleven tags: "DATE", "LOCATION", "PERSON", "OPINION", "NUMBER", "YES_NO", "MONEY", "CAUSAL", "METHOD", "PURPOSE", and "WHICH_IS_TRUE". A minimal sketch of this rule-based identification appears below.
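To illustrate, here is a small sketch of such a rule-based identifier in Python. The identifier lists (DATE_IDS, LOCATION_IDS, NUMBER_IDS) are hypothetical examples in the spirit of the identifiers named above ("countries, time, year, many"), not the system's actual rules.

```python
import re

# Illustrative identifier lists; not the authors' exact rules.
DATE_IDS = {"year", "time", "century", "decade"}
LOCATION_IDS = {"country", "countries", "city", "place"}
NUMBER_IDS = {"many", "much", "percentage"}

def identify_question_type(question: str) -> str:
    """Assign one of the paper's answer tags based on the question word
    and the identifier immediately to its right (if any)."""
    tokens = re.findall(r"[a-z'-]+", question.lower())
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in ("what", "which"):
            if nxt in DATE_IDS:
                return "DATE"
            if nxt in LOCATION_IDS:
                return "LOCATION"
            if nxt in NUMBER_IDS:
                return "NUMBER"
            return "WHICH_IS_TRUE"
        if tok == "who":
            return "PERSON"
        if tok == "when":
            return "DATE"
        if tok == "where":
            return "LOCATION"
        if tok == "why":
            return "CAUSAL"
        if tok == "how":
            return "NUMBER" if nxt in NUMBER_IDS else "METHOD"
        if tok in ("do", "does", "did", "is", "are"):
            return "YES_NO"
    return "WHICH_IS_TRUE"  # fallback for command-like questions

print(identify_question_type("In what year did she go to England?"))  # DATE
print(identify_question_type("Who is the president?"))                # PERSON
```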
3.3 Passage Retrieval

The passaging process forms small units of text called passages, where each passage consists of m sentences. For example, if a text contains n sentences S1, S2, S3, ..., Sn, the passaging process forms passages p1, p2, ..., pk, each consisting of m consecutive sentences (in this system, m = 6). The passaging process thus yields a collection of passages

P = {p1, p2, ..., pk},   k = ⌈n / m⌉,

where k is the number of passages formed from a text, m is the number of sentences in a passage, and n is the number of sentences in the text.

After the passaging process has been carried out, we run an indexing process on the passages, using Lucene (http://lucene.apache.org) [3] to create the inverted index. We then run the retrieval process to obtain the passages that contain the answers. In this system, the top 13 passages with the greatest similarity value (most relevant) to the query are taken. This figure was obtained by measuring the accuracy of the system in experiments using the numbers 1 to 15. The system then looks for the appropriate answer in the relevant passages retrieved. When it finds no answer to the question, the system performs query expansion in order to obtain a new set of passages that is more relevant to the query. A sketch of the passaging and retrieval steps follows; the next subsection describes the query expansion techniques.
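The following is a minimal sketch of the passaging and retrieval steps, assuming the sentences were already split during pre-processing; plain Python and a naive word-overlap score are used here in place of Lucene and its similarity function.

```python
from math import ceil

def make_passages(sentences, m=6):
    """Group n sentences into k = ceil(n/m) passages of m consecutive sentences."""
    return [" ".join(sentences[i:i + m]) for i in range(0, len(sentences), m)]

def retrieve(passages, query, top=13):
    """Rank passages by a naive word-overlap score (a stand-in for Lucene's
    similarity function) and return the top-scoring ones."""
    q = set(query.lower().split())
    ranked = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:top]

sentences = [f"This is sentence {i}." for i in range(1, 21)]  # n = 20
passages = make_passages(sentences)                           # k = ceil(20 / 6) = 4
assert len(passages) == ceil(len(sentences) / 6)
top_passages = retrieve(passages, "which sentence is relevant", top=13)
```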
3.4 Query Expansion

Query expansion is the process of adding a number of relevant words to a query, in order to make the query clearer and less ambiguous [4]. Two query expansion methods are implemented in this system: pseudo relevance feedback and WordNet [2].

Query Expansion with Pseudo Relevance Feedback (PRF). This technique is based on the n top passages obtained using the original query prior to expansion. Each unique word from those passages is ranked based on its TF-IDF value [5]. Every word whose TF-IDF value is higher than a threshold is added to the initial query. Based on experiments on a number of possible threshold values and their influence on the final score, the threshold used in this study was set to 9.5. The TF-IDF method was chosen because it identifies the words that have a strong connection with the initially retrieved passages, so that adding these words to the query is expected to retrieve genuinely relevant passages.

Query Expansion with WordNet. WordNet is an online lexical database in which nouns, verbs, adjectives, and adverbs are grouped into sets of synonyms, often called synsets [6]. This system uses WordNet to find additional query words that have a semantic relationship to the original ones, such as hypernyms, hyponyms, similar adjectives, and troponyms. Sketches of both expansion techniques are given below.
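First, a minimal sketch of the PRF term-selection step, assuming the top passages from the initial retrieval are given as strings. The TF-IDF variant shown (raw term frequency times a smoothed inverse passage frequency) is an assumption, not necessarily the exact weighting the system uses; the 9.5 threshold is the value reported above.

```python
import math
from collections import Counter

def prf_expansion_terms(top_passages, threshold=9.5):
    """Select expansion terms: every unique word in the top retrieved
    passages whose TF-IDF value exceeds the threshold."""
    docs = [p.lower().split() for p in top_passages]
    n = len(docs)
    tf = Counter(w for d in docs for w in d)       # total term frequency
    pf = Counter(w for d in docs for w in set(d))  # passage frequency
    scores = {w: tf[w] * math.log(1 + n / pf[w]) for w in tf}
    return sorted(w for w, s in scores.items() if s > threshold)

# The 9.5 threshold assumes real passage statistics; a tiny example
# needs a lower one just to show the mechanics.
terms = prf_expansion_terms(
    ["the virus spreads through close contact",
     "close contact with an infected person spreads the virus"],
    threshold=1.0)
expanded_query = "how does the virus spread " + " ".join(terms)
```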
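And a sketch of the WordNet-based expansion, using NLTK's WordNet interface (our assumption; the paper does not name a library). In WordNet, the troponym relation for verbs is exposed through the hyponym pointer, and similar adjectives through similar_tos().

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

def wordnet_expansion_terms(word, max_terms=10):
    """Collect related words: synonyms from each synset, plus hypernyms,
    hyponyms (troponyms for verbs), and similar adjectives."""
    related = set()
    for synset in wn.synsets(word):
        neighbors = [synset]
        neighbors += synset.hypernyms()
        neighbors += synset.hyponyms()     # for verbs, the troponym relation
        neighbors += synset.similar_tos()  # similar-adjective relation
        for s in neighbors:
            for lemma in s.lemma_names():
                related.add(lemma.replace("_", " ").lower())
    related.discard(word.lower())
    return sorted(related)[:max_terms]

print(wordnet_expansion_terms("infection"))
```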
3.5 Answer Selection

To determine the answer to a question, the system uses a number of features to assess all the answer choices given (there are 5 possible answers to a question). When more than one feature is used, the score of an answer choice is accumulated from the scores of each feature, and the option with the highest final score is selected as the answer to the question. The contribution of each feature to the score of an answer choice varies depending on the weight of the feature. The score of an option o is assigned by the formula

Score(o) = Σ_{i=1..n} w_i · s_i(o),

where s_i(o) is the score of o based on the i-th feature, w_i is the weight of the i-th feature, and n is the number of features used to assess o.

The system uses a threshold value in selecting the answer: if the difference between the answer with the highest score and the scores of the other four answers falls below the threshold, the system produces no answer for the corresponding question.

Answer Selection for Factoid Questions. Finding the answers to factoid questions relies on a single NE (named entity) analysis feature, organized into four phases of assessment:

1. Assessment using NE
2. Expansion of the candidate answers for questions of type DATE
3. Assessment using snippets
4. Assessment using the average distance weight and cosine similarity features

Initially, the system conducts the first assessment phase. If the answer is found at this stage, the process of finding the answer stops and the three other phases are not executed. However, if no answer is found, the system carries out the next phase, up to the fourth phase. When no answer has been found after all four phases, the system produces no answer for the corresponding question.

Assessment Using NE. The assessment procedure using NE is as follows:

1. Take the set of passages that is relevant to the query.
2. Check each passage for NEs. When an NE is found, check whether its type matches the NE type of the answer choice; if it does, that NE is recorded and given a score.
3. The score of each answer choice is accumulated.
4. The system chooses the answer choice that has the highest score.

Expansion of the Candidate Answers for Questions of Type "DATE". For questions of type "DATE", the procedure is as follows:

1. If a candidate contains the word "later" (e.g. "9 years later"), but all the answer options are in the form of a year (e.g. "1985" or "1997"), the candidate is first processed into the same form.
2. If a candidate is in the form of a year while the answer choices are in the form of a range of years (such as "1950s" or "beginning of 1900"), the candidate is first processed in order to obtain the proper range.

Assessment Using Snippets. The system carries out this phase using the Google search engine to acquire additional knowledge. The steps of this phase are as follows:

1. Each answer choice a is submitted as a query to Google, and the search results associated with that answer choice are retrieved. Five text snippets from the top search results are acquired as additional knowledge, in the form of new documents related to the answer choice. These documents are indexed with Lucene.
2. Step 1 is performed again using the keywords of the question as the query.
3. The score for an answer choice a is obtained by accumulating S1 and S2, where S1 is the similarity value between the question and the snippets belonging to a, and S2 is the similarity value between a and the snippets belonging to the question.
4. The system chooses the answer choice that has the highest score.

Assessment Using the Average Distance Weight and Cosine Similarity Features. This phase uses a combination of the average distance weight (ADW) and cosine similarity features. Each of these features files its own score for each answer choice, so for each answer choice the scores of the ADW and cosine similarity features are added to the original score, forming a new score. The answer choice with the highest total score is selected as the answer to the question.

Answer Selection for Non-Factoid Questions. For non-factoid questions, the answer selection process is based on the following eight heuristic features. Each feature is given a weight based on its ability to determine the correct answer.

Cosine Similarity. This feature calculates the similarity between two documents based on the weight of each word contained in both documents. It implements the concept of the vector space model (VSM), where the two documents are represented as n-dimensional vectors (n is the number of unique words) whose components are the weights of each word [4]. The similarity between the two documents is calculated with the cosine similarity measure [2]. The procedure of this feature is as follows:

1. Take 6 sets of passages (treated as documents): one relevant to the question, namely P, and the others relevant to each answer choice, namely A, B, C, D, and E.
2. Represent each document as a vector whose components are the weights of each word. In this study, the weight of each word is its term frequency, obtained using built-in functions provided by Lucene.
3. Find the cosine of the angle between two document vectors: compute the similarity between the document related to the question (P) and the documents associated with each answer choice (A, B, C, D, and E). The system chooses the choice among A, B, C, D, and E that has the highest similarity score to P.

Average Distance Weight. The ADW feature calculates the smallest distance between each word in an answer choice and each word in the question within a set of passages. The steps of the ADW feature are as follows:

1. Take the set of passages that is relevant to the first answer choice, then combine it into one paragraph.
2. Record all positions in the paragraph of each word of the question and of each word of the answer choice.
3. Pair each word in the answer choice with each word in the question, then compute the smallest distance between the positions of the two paired words.
4. Having obtained the smallest distance for each pair of words, add all the distances and divide by the number of word pairs formed. The result is the score for the first answer choice.
5. Repeat steps 1-4 for the other four answer choices, in order to obtain a score for each answer choice.

Word Matching. This feature calculates the score of each answer choice based on the number of occurrences of its words in the set of relevant passages (examined in each passage). The more of its words are found in the passages, or the more passages contain those words, the higher the score of the answer choice. Likewise, if the words of an answer choice are found in the top-ranked passages, the score of that answer choice will also be higher.

Cluster Matching. In this feature, each answer choice is given a score based on the similarity between the set of passages relevant to the answer choice and the set of passages relevant to the question. This feature is proposed under the assumption that the answer to a question is, of course, located adjacent to the question within the same set of passages, so that the two sets of passages are considered relatively similar.

Question Choice Tilling. This feature uses the maximum number of words of an answer choice that appear in sequence in a sentence of the passages relevant to the question. The order of appearance of each word of an answer choice is considered for each sentence in the top 3 passages. For example, for the question "How are people infected by HIV?", if the answer option is "through sexual intercourse", then the word "through" has the first position, "sexual" the second position, and "intercourse" the third position. When a sentence in a passage contains only the words "sexual" and "intercourse" appearing in sequence, with the word "through" appearing only after those two words, the score for this condition is 2. This is done for each sentence in the passages; the scores are accumulated and the highest value is taken. A sketch of this feature follows.
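Here is a minimal sketch of the tilling score for a single sentence, assuming whitespace tokenization; it finds the longest run of answer-choice words that appear consecutively and in order in the sentence.

```python
def tilling_score(choice: str, sentence: str) -> int:
    """Longest run of answer-choice words appearing consecutively,
    in order, within the sentence."""
    c = choice.lower().split()
    s = sentence.lower().split()
    best = 0
    for i in range(len(c)):
        for j in range(len(s)):
            k = 0
            while i + k < len(c) and j + k < len(s) and c[i + k] == s[j + k]:
                k += 1
            best = max(best, k)
    return best

sentence = "people become infected by sexual intercourse with carriers through contact"
print(tilling_score("through sexual intercourse", sentence))  # 2 ("sexual intercourse")
```

As described above, this score would be computed for each sentence of the top 3 passages, the scores accumulated, and the highest value taken.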
Maximum Occurrence. This feature has a function similar to the word matching feature; both consider the number of occurrences of answer-choice words in the set of relevant passages. The difference lies in how the appearance of the words is checked. In the word matching feature, all words of an answer choice are checked at once against one passage, so for each passage the score can increase at most by the number of words in the answer choice, even if the words appear repeatedly in the passage. In the maximum occurrence feature, the occurrences are checked for each sentence in a passage, so each time the words of an answer choice are found in a sentence, the score increases by 1.

Maximum Similarity. This feature uses the similarity functions in Lucene. First, it takes the sets of passages relevant to each answer choice. Then it computes the similarity score between an answer choice and every relevant passage (scores ranging from 0 to 1) using Lucene. The system chooses the answer choice that has the highest similarity score.

Semantic Relation. This feature works much like the maximum similarity feature; both use Lucene functions to measure the similarity between an answer choice and its relevant passages. The difference is in the index used: the maximum similarity feature indexes the reading test itself, while this feature uses REVERB to obtain a list of relation phrases related to the reading test. Those relation phrases are indexed by Lucene, resulting in a new index for this feature.

3.6 Evaluation

We use the c@1 measure to evaluate the performance of the system [7]. The concept behind this measure is that, instead of giving a wrong answer, a question is better left unanswered, so that the number of wrong answers can be reduced while maintaining the number of correct answers. The c@1 measure is computed as

c@1 = (1/n) · (n_R + n_U · n_R / n),

where n_R is the number of questions answered correctly by the system, n_U is the number of questions left unanswered by the system, and n is the total number of questions.
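A one-function sketch of this measure, assuming the three counts are already known:

```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1: accuracy plus partial credit for unanswered questions,
    proportional to the accuracy achieved on the rest."""
    return (n_correct + n_unanswered * n_correct / n_total) / n_total

# E.g. 40 correct and 20 unanswered out of 160 questions:
print(round(c_at_1(40, 20, 160), 2))  # 0.28
```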
4 Results of Experiments on the 2013 Dataset

The results of the experimental runs are described in Table 1.

Table 1. c@1 per run, for the main questions and for the main plus auxiliary questions, by topic and in total.

Run  Features               ---------- c@1 (Main) ----------   ------ c@1 (Main+auxiliary) -----
                            1     2     3     4     Total      1     2     3     4     Total
2    A-W-P-M-C-Ma (0.75)    0.33  0.27  0.24  0.20  0.26       0.33  0.39  0.22  0.27  0.30
3    A-W-M-C-Ma (0.5)       0.30  0.25  0.22  0.20  0.24       0.30  0.38  0.21  0.25  0.29
4    A-W-M-C-Ma (0.75)      0.33  0.25  0.22  0.21  0.25       0.33  0.38  0.21  0.28  0.30
5    A-P-M-C-Ma (0.75)      0.36  0.25  0.23  0.15  0.25       0.36  0.38  0.22  0.23  0.30
6    A-W                    0.34  0.22  0.25  0.19  0.25       0.34  0.35  0.23  0.25  0.29
7    A-P-W                  0.33  0.23  0.26  0.19  0.26       0.33  0.36  0.25  0.25  0.30
8    A-W-M (0.5)            0.33  0.23  0.25  0.16  0.25       0.33  0.36  0.23  0.21  0.29
9    A-W-Ma (0.75)          0.33  0.25  0.24  0.20  0.25       0.33  0.34  0.23  0.26  0.29
10   A-W-Ma (1)             0.33  0.23  0.23  0.19  0.24       0.33  0.33  0.22  0.25  0.28

Description of features:
A  = Word Matching, Question Choice Tilling, Semantic Relation, Cosine Similarity & Average Distance Weight
W  = WordNet
P  = Pseudo Relevance Feedback
M  = Maximum Similarity
C  = Cluster Matching
Ma = Maximum Occurrence

Description of topics: 1 = Alzheimer, 2 = Music & Society, 3 = Climate Change, 4 = AIDS.

According to Table 1, the system obtains its best results when using the two query expansion techniques together with the 5 heuristic features Word Matching, Question Choice Tilling, Semantic Relation, Cosine Similarity, and ADW. The two proposed query expansion techniques, PRF and WordNet, both make a positive contribution to the system, especially when used together. Meanwhile, among the 8 features, the 5 features just mentioned are those with the greatest influence on the performance of the system. The difference in the performance of the system when the weights of the features are modified also makes it evident that determining appropriate weights for each feature greatly affects the system performance and the final score.

5 Conclusions

Based on the problems posed by QA4MRE, this study has developed a question answering system that is used to understand documents and answer multiple-choice questions related to them. To increase the number of relevant passages obtained, this study proposes two query expansion techniques, namely PRF and WordNet. In the process of finding answers, the system proposes eight heuristic features that assign scores to each answer choice.

According to the analysis of the results obtained on the 2013 dataset, the system gives its best performance when using both query expansion techniques and all heuristic features with predetermined weights. The final score of the overall system is still relatively low. One reason is that the system does not use the background collection as additional knowledge, so knowledge is only obtained from the documents and from Google snippets. In addition, the performance of the system on non-factoid questions is still poor, so the percentage of questions answered incorrectly is still very high.

References

1. Peñas, A., et al. (2011). Overview of QA4MRE at CLEF 2011: Question answering for machine reading evaluation. CLEF.
2. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (1st ed.). Cambridge University Press.
3. Hatcher, E., & Gospodnetić, O. (2005). Lucene in action. Manning Publications Co.
4. Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. England: Addison-Wesley.
5. Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
6. Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11).
7. Peñas, A., & Rodrigo, A. (2011). A simple measure to assess non-response. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1415-1424. Portland, Oregon, USA.