=Paper=
{{Paper
|id=Vol-1179/CLEF2013wn-QA4MRE-deOliveiraDaCostaMarquesEt2013
|storemode=property
|title=Combining Text Mining Techniques for QA4MRE 2013
|pdfUrl=https://ceur-ws.org/Vol-1179/CLEF2013wn-QA4MRE-deOliveiraDaCostaMarquesEt2013.pdf
|volume=Vol-1179
|dblpUrl=https://dblp.org/rec/conf/clef/MarquesV13
}}
==Combining Text Mining Techniques for QA4MRE 2013==
Guilherme de Oliveira da Costa Marques¹ and Mathias Verbeke²

¹ Federal University of São Carlos (UFSCar), campus Sorocaba, Brazil
² Department of Computer Science, KU Leuven, Belgium

Abstract. This paper describes a lexical system developed for the main task of Question Answering for Machine Reading Evaluation 2013 (QA4MRE). The presented system performs preprocessing of the test documents and generates hypotheses consisting of the question text combined with the text of the possible answers to the question. The hypotheses are compared to sentences from the text by means of a set similarity measure. The k best similarity scores obtained by each hypothesis are averaged to produce a ranking score for the hypothesis. Two variations of the developed system were used, one of them employing coreference detection and resolution techniques in order to take advantage of the discourse structure in the question answering process. The results obtained by the systems in the QA4MRE 2013 edition are presented and analyzed. The presented system should serve as a solid base for the development of a semantic approach to the task.

1 Introduction

The QA4MRE competition [1] focuses on the Question Answering field of Machine Reading. Adopting the form of several tests spread over a few themes, it aims at evaluating a system's Natural Language Understanding capabilities by means of multiple-choice questions.

The main task of the competition is currently composed of four topics: “AIDS”, “Climate Change”, “Music and Society” and “Alzheimer's”. A background collection of texts is provided for all topics. This collection attempts to encompass all specific domain knowledge of the topic. The 2013 edition consists of 16 reading tests, 4 on each topic. Each reading test presents a text document followed by 15 to 20 questions about it. These are multiple-choice questions with 5 alternatives, the last one being “None of the above”.

Questions are distributed over different degrees of complexity with respect to the knowledge and inference required to devise the correct answer. The simplest ones have both the question fact and the answer appearing directly in the same sentence of the text. Others have the question fact and the answer appearing in distinct sentences. Some questions require background knowledge or inference, and some may require the use of both.

Question: What caused an improvement in sound quality in 1950?
Alternatives:
1. the introduction of soundtrack recording on 35 mm magnetic tape
2. the use of an optical soundtrack
3. the adoption of a quadraphonic soundtrack
4. the specialisation in silent films
5. none of the above

Table 1: Example question

An example question is presented in Table 1. For this question, alternative 1 should be identified as correct.

The system described in this work is based on a text mining baseline system [2]. It was developed and tested with data from the QA4MRE 2012 edition. Several parameters and system variations were tested, which are described more thoroughly in [3]. We do not employ the background collection in the current system.

2 Methodology

Two different systems were employed for the competition: a main system was elaborated and used both as a standalone system and as the base for a variation including coreference resolution. The structure of both systems is presented in Figure 1.

Fig. 1: System structure. (a) Base System; (b) Coreference System

Section 2.1 presents the text preprocessing applied to the test documents. Section 2.2 explains the procedure responsible for ranking the alternatives, while Section 2.3 discusses the different techniques employed in cases where a tie occurs in the ranking. While all the previously mentioned characteristics are common to the two systems, Section 2.4 explains the additions only present in the coreference variation.

2.1 Preprocessing

Preprocessing of the test documents proceeds according to the following sequence:

1. Unicode decoding: treats special Unicode characters, generating ASCII-safe strings.
2. Text fixing: corrects some of the formatting issues present in the 2012 test set through the use of regular expressions. Although similar issues were not detected in the 2013 test set, the procedure was maintained for safety and compatibility reasons.
3. Sentence tokenization: splits the sentences of the text into separate strings.
4. Word tokenization: segments individual words and punctuation signs from the text strings.
5. Bag-of-words model: sentences are represented as sets of word strings.
6. Stopword and punctuation removal: punctuation signs and words classified as stopwords according to the English stopword corpus from NLTK [4] are removed from the sentences.
7. Stemming or lemmatization: words are converted into word stems (by NLTK's Porter stemmer) or into lemmas (by NLTK's WordNet-based lemmatizer).
8. Joining of adjacent sentences (optional): in an attempt to perform a primitive form of discourse analysis, a procedure where the sets of words from adjacent sentences are joined prior to ranking was tested. Results generated by this strategy on the QA4MRE'12 test set showed an overall improvement in accuracy, so this procedure was maintained for the base system.

With respect to item 7, the choice between stemming and lemmatization took into consideration the accuracy observed on the 2012 test set. For the base system, lemmatization yielded marginally better results, while the coreference system presented improved results when stemming was employed.
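For illustration, the following minimal Python sketch approximates steps 3 to 7 of this pipeline. It assumes the standard NLTK tokenizers, stopword corpus, Porter stemmer and WordNet lemmatizer are used; the helper name preprocess is hypothetical and not taken from the original system.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Hypothetical re-implementation of preprocessing steps 3-7 (sketch only).
STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text, use_stemming=False):
    """Turn a document into a list of bag-of-words sentence sets."""
    sentence_sets = []
    for sentence in nltk.sent_tokenize(text):       # step 3: sentence tokenization
        tokens = nltk.word_tokenize(sentence)       # step 4: word tokenization
        words = {t.lower() for t in tokens          # step 5: bag-of-words set
                 if t not in string.punctuation
                 and t.lower() not in STOPWORDS}    # step 6: stopword/punctuation removal
        if use_stemming:
            words = {stemmer.stem(w) for w in words}        # step 7: stemming
        else:
            words = {lemmatizer.lemmatize(w) for w in words}  # step 7: lemmatization
        sentence_sets.append(words)
    return sentence_sets
```

In this sketch the use_stemming flag mirrors the choice discussed above: lemmatization for the base system, stemming for the coreference variation.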
2.2 Ranking procedure

Ranking is performed by computing the similarity between sentences and hypotheses, as presented in Algorithm 1. The hypotheses are generated by joining the question text and the text of the alternatives:

H_i = Q ∪ A_i

Algorithm 1: Ranking procedure

    ranks ← list()
    for all hypotheses hyp in hypotheses_list do
        similarities ← list()
        for all sentences sent in document do
            sim ← similarity(sent, hyp)
            similarities.append(sim)
        end for
        average ← 0
        for all top-k values sim in similarities do
            average ← average + sim
        end for
        average ← average / k
        ranks.append(average)
    end for
    selected ← indexOf(maximum(ranks))

The similarity metric employed during the ranking procedure is the MASI similarity [5], calculated by the formula below:

masi_sim(set1, set2) = |set1 ∩ set2| / max(|set1|, |set2|)

According to the similarity scores calculated between hypotheses and text sentences, each hypothesis receives an overall ranking score computed as the average of the k best sentence similarities. The candidate answer chosen by the system corresponds to the hypothesis with the highest ranking score. The value of k is a parameter provided to the system, and was set to k = 2, considering the test results obtained on the QA4MRE'12 test set.
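As a concrete rendering of Algorithm 1 and the overlap measure above, the Python sketch below ranks hypothesis word sets against the preprocessed sentence sets and selects the highest-scoring alternative. It is our own illustration, not the authors' code; the names masi_sim and rank_hypotheses are hypothetical.

```python
def masi_sim(set1, set2):
    """Set-overlap similarity as given above: |A ∩ B| / max(|A|, |B|)."""
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / max(len(set1), len(set2))

def rank_hypotheses(sentence_sets, hypothesis_sets, k=2):
    """Algorithm 1: score each hypothesis by the average of its k best
    sentence similarities, then pick the highest-scoring one."""
    ranks = []
    for hyp in hypothesis_sets:
        sims = sorted((masi_sim(sent, hyp) for sent in sentence_sets),
                      reverse=True)
        ranks.append(sum(sims[:k]) / k)
    selected = ranks.index(max(ranks))
    return selected, ranks
```

The hypothesis sets would be built beforehand as the union of the preprocessed question set with each alternative's word set, mirroring H_i = Q ∪ A_i.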
2.3 Handling of ties

In some cases, the ranking procedure results in a tie between two or more hypotheses. To handle those cases, four different strategies were devised:

1. All questions where no candidate answer was found (there was a ranking tie) were answered as “None of the above”.
2. All questions where no candidate answer was found were left unanswered.
3. If the maximum ranking score among the alternatives is below a certain threshold, the question is answered as “None of the above”. Otherwise, it is left unanswered.
4. If the maximum ranking score among the alternatives is below a certain threshold, the question is answered as “None of the above”. Otherwise, one of the tied alternatives is selected at random.

The reasoning behind the threshold value used in strategies 3 and 4 is that for questions where the ranking values were lower, there would be a higher chance that none of the alternatives was correct and the question had no answer. In contrast, for questions where the ranking values were higher, it would be more likely that there was a correct answer, but the system was unable to find it. This threshold was empirically set to 0.1.

2.4 Coreference variation

The system illustrated in Figure 1b includes a coreference detection phase, as well as a coreference resolver. This addition intends to take advantage of discourse analysis, allowing for the resolution of anaphoric pronouns, as well as other types of coreference.

Coreference detection is performed by Stanford's NLP suite [6], which outputs an XML file containing information on the coreferences detected in the processed text. Information extracted from this XML file is employed by the coreference resolver in the following way:

1. The representative noun phrase is located.
2. Words from the representative reference are included in a word set.
3. Sentences where other references to the same entity appear are located.
4. The word set from the representative reference is joined into the sentences that refer to the same entity.

This simple strategy presents good results with regard to the resolution of referential pronouns in the context of a hypothesis ranking computed through set similarity: since the words from the representative reference are included in the word sets of the referencing sentences, this has a positive impact on the similarity between those sentences and hypotheses that mention the same entity.
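The merging step can be sketched as follows. The chain data structure (a list of (sentence index, mention words) pairs with the representative mention first) is our own simplification of the information parsed from the coreference XML output, not the toolkit's actual format, and the function name is hypothetical.

```python
def resolve_coreferences(sentence_sets, chains):
    """For each coreference chain, add the words of the representative
    mention to the bag-of-words set of every sentence containing another
    mention of the same entity (steps 1-4 above)."""
    for chain in chains:
        _rep_sent_idx, rep_words = chain[0]          # representative mention (steps 1-2)
        for sent_idx, _mention_words in chain[1:]:   # remaining mentions (step 3)
            sentence_sets[sent_idx] |= set(rep_words)  # step 4: join representative words
    return sentence_sets
```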
3 Results

Eight distinct runs were submitted to the competition, in which the two presented systems were paired with each of the four tie-handling strategies described in Section 2.3. General results for each run are presented in Table 2. In this table, performance is measured according to two metrics: accuracy and the c@1 measure [7]. C@1 is the main performance measure employed in the competition, and is calculated as follows:

c@1 = (nR + nU · (nR / n)) / n

where
nR = number of correctly answered questions
nU = number of unanswered questions
n = total number of questions

The competition consisted of a total of 240 main questions, of which 44 required inference in the answering process. Those inference-demanding questions had simpler duplicates in which the question was phrased in a way that no longer required the inference. The “c@1 main” measure only takes the 240 main questions into consideration, and “c@1 all” is calculated over all 284 questions.

System        Tie  Run  Accur. (main)  c@1 (main)  Accur. (all)  c@1 (all)
Base + Join   1    02   0.28           0.28        0.34          0.34
Base + Join   2    03   0.23           0.26        0.29          0.33
Base + Join   3    04   0.27           0.28        0.33          0.35
Base + Join   4    05   0.28           0.28        0.35          0.35
Coreference   1    06   0.30           0.30        0.33          0.33
Coreference   2    07   0.22           0.26        0.26          0.30
Coreference   3    08   0.26           0.29        0.29          0.32
Coreference   4    09   0.28           0.28        0.31          0.31

Table 2: General results

Run number 6 presented the best general results, with a c@1 measure of 0.30 on the main questions. The considerable increase in the c@1 metric between the main set of questions and the complete set reinforces the weakness of the employed systems on inference-demanding questions.

System        Tie  Run  Topic 1 (Median/Average)  Topic 2 (Median/Average)  Topic 3 (Median/Average)  Topic 4 (Median/Average)
Base + Join   1    02   0.30 / 0.33               0.21 / 0.18               0.22 / 0.25               0.22 / 0.21
Base + Join   2    03   0.26 / 0.29               0.18 / 0.16               0.21 / 0.23               0.19 / 0.20
Base + Join   3    04   0.31 / 0.33               0.19 / 0.17               0.22 / 0.25               0.23 / 0.22
Base + Join   4    05   0.33 / 0.33               0.18 / 0.17               0.22 / 0.25               0.22 / 0.22
Coreference   1    06   0.37 / 0.37               0.18 / 0.19               0.26 / 0.27               0.17 / 0.21
Coreference   2    07   0.23 / 0.26               0.20 / 0.20               0.23 / 0.23               0.17 / 0.18
Coreference   3    08   0.28 / 0.31               0.19 / 0.21               0.27 / 0.26               0.17 / 0.20
Coreference   4    09   0.27 / 0.28               0.18 / 0.19               0.28 / 0.26               0.19 / 0.21

Table 3: c@1 per topic

Table 3 lists the c@1 results per topic for the main set of questions. From the analysis of this table, a considerable variation in performance between topics can be noticed: the system presented the best results on topic 1, followed by topic 3; topics 2 and 4 show inferior performance. Another remarkable fact is that while the best results from the coreference system are superior to the results from the base system on topic 1, the differences in the results on the other topics are negligible.

4 Conclusions

The presented methodology is entirely based on lexical similarity. Possible directions for improvement include techniques for the proper handling of questions involving negation (added to the 2013 main task, but not present in 2012). The system could also benefit from a weighted similarity measure that would prioritize words according to their importance.

Although there is still room for improvement while maintaining the lexical character of the system, we believe that the ideal focus of future work would be on establishing a system able to deal with semantic relations through the development of strategies aimed at textual and logical inference. We also consider of crucial importance the development of techniques for knowledge base construction, which can extract domain-specific knowledge from the background collection.

5 Acknowledgements

Guilherme de Oliveira da Costa Marques is funded by Brazil's National Council of Technological and Scientific Development (Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq). Mathias Verbeke is funded by the Research Foundation Flanders (FWO project G.0478.10 - Statistical Relational Learning of Natural Language).

References

1. CELCT: Question Answering for Machine Reading Evaluation. http://celct.fbk.eu/QA4MRE/
2. Verbeke, M., Davis, J.: A text mining approach as baseline for QA4MRE'12. In: CLEF (Online Working Notes/Labs/Workshop) (2012). http://www.clef-initiative.eu/documents/71612/234cb84c-03a3-45c3-8844-9b1ca448d976
3. de Oliveira da Costa Marques, G.: Combining text mining techniques for question answering. Master's thesis, KU Leuven (2013)
4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly (2009)
5. Passonneau, R.: Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC) (2006) 831–836
6. Lee, H., Chang, A., Peirsman, Y., Chambers, N., Surdeanu, M., Jurafsky, D.: Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics (2013) 1–54
7. Peñas, A., Forner, P., Sutcliffe, R.F.E., Rodrigo, Á., Forascu, C., Alegria, I., Giampiccolo, D., Moreau, N., Osenova, P.: Overview of ResPubliQA 2009: Question answering evaluation over European legislation. In: CLEF (2009) 174–196. http://dx.doi.org/10.1007/978-3-642-15754-7_21