CSGS: Adapting a short answer scoring system for multiple-choice reading comprehension exercises

Simon Ostermann, Nikolina Koleva, Alexis Palmer, and Andrea Horbach
Department of Computational Linguistics, Saarland University, Saarbrücken, Germany
(simono,nikkol,apalmer,andrea)@coli.uni-saarland.de
home page: http://www.coli.uni-saarland.de

Abstract. This paper describes our system submission to the CLEF Question Answering Track 2014 Entrance Exam shared task competition, where the task is to correctly answer multiple-choice reading comprehension exercises. Our system is a straightforward adaptation of a model originally designed for scoring short answers given by language learners to reading comprehension questions. Our model implements a two-step procedure, where both steps use the same set of metrics for evaluating similarity between pairs of input sentences/questions. In the first step, we automatically select the sentence of the reading text that best matches the question. In the second step, the selected sentence is compared to each of the four answers, and the answer with the highest similarity score is chosen as the correct answer. Although the model has not been tuned to this specific task, we obtain scores that are competitive with other top-performing systems in the challenge. Additionally, we make no use of the training material but rather treat the task as one of general determination of semantic similarity between text sentences and provided answers.

1 Introduction

Reading comprehension exercises are a widely-used and important means of assessing students' ability to understand the material they read. Questions in reading comprehension exercises generally span a range of difficulty levels, from simple extraction of facts contained in reading texts to more sophisticated inference, requiring both information in the text and general (or subject-specific) background knowledge. As such, they provide a challenging context for automated analysis.

The CLEF Question Answering Track 2014 Entrance Exam shared task asks systems to read a given document and answer a set of multiple-choice questions based on the reading text. We approach this task by adopting a model originally designed for a different reading comprehension context: scoring short answers to reading comprehension questions given by language learners. The tasks have in common that they require assessing the suitability of answers to questions based on reading texts, but they differ in their inputs and expected outputs. For short answer scoring, the system is provided with a reading text, a set of questions, a target answer for each question, and a set of learner answers to be scored as correct or incorrect. Most short answer scoring systems work by comparing learner answers to a sample solution (aka target answer), but language learners tend to replicate chunks of the reading text in their answers. Thus, it is often straightforward to match a sentence of the text to an answer written by a learner. In previous work we used this tendency to develop a scoring model that incorporated features based on the relationship between reading text sentences and learner answers [3].

In the Entrance Exam challenge, the system is again provided with a reading text and a set of questions based on the text. Instead of a target answer, though, there is a set of four answers: one best answer and three distractor answers. Often there is high similarity between the best answer and one or more other answers.
To accomplish this task, we again use a model that evaluates similarity between text sentences and both the question and the set of answers. The basic idea of the model (described in more detail in Section 3) is to use a common set of similarity metrics (following [4]) in a two-step procedure. First, we automatically identify the sentence of the reading text that best matches the question, on the assumption that this sentence has a reasonable likelihood of containing the question's answer. We will see (in Section 5) that this assumption does not always hold. Second, we choose as the best answer the one our system evaluates as most similar to the selected sentence from the text.

It should be noted that this approach requires no training material, as it simply relies on evaluating similarity between either the question and a sentence from the reading text, or a reading text sentence and each answer from the set of four in the multiple-choice setting. Although our model has not been tuned to the specific task, it still performs at a level that is comparable to other top-performing systems in the challenge. This suggests that the general approach captures key aspects of semantic similarity.

2 Task and Data

The aim of the given task is to provide a computational solution towards automatic question answering. The data consist of reading comprehension exercises which were part of Japanese university entrance exams. These exams are meant to assess new students' capabilities in various skills by testing them with the help of reading exercises; the material is collected from the Japanese Center Tests of 2013 and 2014 (http://nlp.uned.es/entrance-exams/). The task is to detect the right answer for a given question and reading text. In contrast to earlier shared tasks in this scenario, questions are rather unconstrained here: they range from "simple" comprehension questions, which require the student or computer merely to find a paraphrase, over sentence completion questions, up to complex questions that demand a deeper knowledge of text coherence.

The data consist of 12 test documents with 56 questions and 4 answers for each question. The texts vary in terms of length, complexity and content. Task organizers provided two data sets: one for training and development, with the correct answer for every question indicated, and a second for testing and ranking systems participating in the challenge. Though we used training material to test the general feasibility of using our pre-existing model for this new task, we do not use correct-answer annotations for any actual model training or even parameter setting.

3 Our Model

As described above, our system was originally developed for the related task of short answer scoring. In short answer scoring, a number of different criteria, ranging from token overlap to various syntactic and semantic features, are used to determine whether a given student answer is correct or incorrect. A crucial difference between short answer scoring and answering multiple-choice questions is the absence in the latter of an "ideal" target answer. When we use this model for short answer scoring, we compare each student answer to the target answer, and then do supervised classification to learn how this comparison looks for correct vs. incorrect answers. In this section we describe how we adapt the model for scoring answers when we don't have a target answer to compare to. Figure 1 schematically shows the workflow of the system.

Fig. 1: System architecture (abbreviations: (Q)uestions, (A)nswers (1-4) and (T)exts)

In general, the system consists of two components. In the first step, answers, texts and questions are preprocessed and annotated with linguistic information. The output of this preprocessing then serves as the input for the alignment module. Our adapted alignment model itself again consists of two sub-modules, the sentence selection module and the answer selection module. Both modules rely on alignment between sentences: for sentence selection, we find the best reading text sentence for a given question via alignment; for answer selection, we align each answer with this best sentence. We first describe the alignment model and then the general workflow of our two-step answer evaluation model.
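Before describing the two modules in detail, the following minimal sketch illustrates the overall two-step procedure. It is an illustration rather than our actual implementation: all names are invented for this sketch, alignment_weight stands for the alignment scoring described in Section 3.1, and the handling of ties corresponds to the two runs (csgs-1, csgs-2) discussed in Section 3.2.

    # Illustrative sketch of the two-step workflow (not the actual implementation).
    # `alignment_weight(s1, s2)` stands for the alignment scoring of Section 3.1.

    def select_best_sentence(text_sentences, question, alignment_weight):
        """Step 1: pick the reading-text sentence that aligns best with the question."""
        return max(text_sentences, key=lambda s: alignment_weight(s, question))

    def select_answer(best_sentence, answers, alignment_weight, abstain_on_tie=True):
        """Step 2: pick the answer that aligns best with the selected sentence.

        If several answers tie for the top weight, either leave the question
        unanswered (csgs-2) or take the first tied answer in its original order (csgs-1).
        """
        scores = [alignment_weight(best_sentence, a) for a in answers]
        best = max(scores)
        tied = [a for a, s in zip(answers, scores) if s == best]
        if len(tied) > 1 and abstain_on_tie:
            return None  # question left unanswered
        return tied[0]

    def answer_question(text_sentences, question, answers, alignment_weight):
        best_sentence = select_best_sentence(text_sentences, question, alignment_weight)
        return select_answer(best_sentence, answers, alignment_weight)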
3.1 Alignment model

In our alignment model we follow the methodology proposed by [4] for grading short answer questions. In that task, the content of a learner answer is aligned to that of a target answer, and features measuring the overlap between target and learner answer are extracted in order to approximate a judgment of semantic equivalence between the two. During alignment, the model identifies pairs of similar units between a learner answer and its corresponding target answer on a number of linguistic levels: tokens, chunks, and dependency triples. For the current task, aligning answers and questions to text sentences, we mainly consider alignments between tokens.

We preprocess all material (texts, questions and answers) using standard NLP tools: sentence splitting (OpenNLP, http://opennlp.apache.org/index.html), tokenization (Stanford CoreNLP, http://nlp.stanford.edu/software/corenlp.shtml), POS tagging and stemming (both TreeTagger [6], http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/), NP chunking (TreeTagger) and synonym extraction (WordNet [1], http://wordnet.princeton.edu/). For synonyms we use not only words that occur in the same synset but also words that are in a hypernym relation and have at most one node in between them.

On the token level, we use several different metrics for identity between tokens, with each metric associated with a certain alignment weight. We use the following types of identity (id), weighted in descending order: token id > lemma id > synonym id. After weights have been determined for all possible token pairs, the best applicable weight is used as input for a traditional marriage alignment algorithm [2].

The token alignment is afterwards refined by chunk alignment: here, we use the percentage of aligned tokens between the chunks of the two sentences as input for the alignment process. Two chunks can only be aligned if at least one of their tokens has been aligned. If an aligned token pair from the previous token alignment step ends up in two chunks that are not aligned, that token alignment is split up again.

In short answer scoring, the resulting alignment is used to extract a variety of overlap features, such as how many tokens of the learner answer can be found in the target answer. In the task at hand, where we have to identify the one sentence out of a set of sentences that fits some other sentence best, we instead use the alignment directly to compute one overall alignment weight by summing up the weights of all token-level links of the final alignment between two sentences.
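The following sketch makes the token-level scoring concrete. It is illustrative rather than our actual implementation: the numeric weights are placeholders (only their ordering is fixed above), a simple greedy one-to-one matching stands in for the marriage alignment algorithm of [2], the chunk-based refinement is omitted, and NLTK's WordNet interface is used as a stand-in for the synonym extraction.

    # Requires NLTK and its WordNet data (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    # Placeholder weights: only the ordering token id > lemma id > synonym id
    # is specified above; the values here are invented for illustration.
    WEIGHTS = {"token": 1.0, "lemma": 0.8, "synonym": 0.6}

    def wordnet_neighbours(lemma):
        """Words in the same synset as `lemma`, plus lemmas of hypernyms with
        at most one node in between (i.e. up to two hypernym levels)."""
        related = set()
        for synset in wn.synsets(lemma):
            frontier = [synset]
            for _ in range(3):              # the synset itself + two hypernym levels
                nxt = []
                for s in frontier:
                    related.update(s.lemma_names())
                    nxt.extend(s.hypernyms())
                frontier = nxt
        related.discard(lemma)
        return related

    def link_weight(tok_a, tok_b):
        """Best applicable identity weight for a token pair (0 if none applies).
        Tokens are (surface form, lemma) pairs produced by preprocessing."""
        form_a, lemma_a = tok_a
        form_b, lemma_b = tok_b
        if form_a.lower() == form_b.lower():
            return WEIGHTS["token"]
        if lemma_a == lemma_b:
            return WEIGHTS["lemma"]
        if lemma_b in wordnet_neighbours(lemma_a):
            return WEIGHTS["synonym"]
        return 0.0

    def alignment_weight(sent_a, sent_b):
        """Overall weight: sum of link weights over a greedy one-to-one alignment."""
        pairs = sorted(((link_weight(a, b), i, j)
                        for i, a in enumerate(sent_a)
                        for j, b in enumerate(sent_b)), reverse=True)
        used_a, used_b, total = set(), set(), 0.0
        for w, i, j in pairs:
            if w > 0 and i not in used_a and j not in used_b:
                used_a.add(i)
                used_b.add(j)
                total += w
        return total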
3.2 Answer Evaluation Workflow

Sentence Selection. In this first step of our evaluation, we aim at finding the text sentence which best matches the lexical material in the question. To do so, we align each sentence in the reading text with the question and compute the overall alignment weight for each pairing. The text sentence with the highest weight is then assumed to be the sentence that carries the most crucial information for answering the question.

Answer Selection. After the best-matching sentence has been identified, we align this sentence with each of the four potential answers. Again, the answer with the highest alignment weight is proposed as the correct answer.

Figure 2 schematically illustrates the process of selecting the correct answer. The black arrow indicates the selection of the best sentence by aligning it with the question. The answer that aligns best with this sentence is taken to be correct. Other answers that were not selected might potentially link to other regions of the text (as indicated with red arrows).

Fig. 2: Visualization of the alignment model

One technical problem that may occur is that two answers to the same question can end up with the same alignment weight. In the test data, this happens for 11 out of 56 questions. We investigate two different ways of handling this outcome, submitting them as two runs of our model. In the first run (csgs-1 in Table 1), we simply choose the first (in linear sequence) of the equally-weighted answers. In the second run (csgs-2), we mark such questions as unanswered. Example (1) shows such a set of answers, all of which receive the same weight because none of them has any overlap with the proposed text sentence that could be picked up by our system.

(1) a. Question: At the beginning of the story, what did the boy think George did for a living?
    Text sentence: "Are you a carpenter, sir?" the boy asked, looking up at the old man's face.
    Answers:
    - He thought he made and repaired wooden objects.
    - He thought he painted pictures or houses.
    - He thought he fixed musical instruments.
    - He thought he grew plants and flowers.

4 Results

Table 1 summarizes the evaluation of the systems that participated in the English-language portion of the Question Answering Track for Entrance Exams (cut down to roughly the top 50% of runs). In total, five systems participated in the shared task, each submitting multiple runs. Each run is given a c@1 score according to [5]. This measure is an extension of the accuracy measure that also accounts for deciding not to answer a question if a system is not confident enough about it. Since it is an extension of the accuracy measure, the higher the score, the better the performance. Our two runs (csgs-1 and csgs-2, as described above) appear as the csgs rows in Table 1. The c@1 score gives a slight boost for not answering a question rather than submitting an incorrect answer. Therefore it is not surprising that csgs-2 achieves the better performance of our two runs, since it takes the intuitive and simple strategy of declining to answer when it cannot confidently select one best answer from the set of four possible answers.
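For reference, the following small function spells out the c@1 computation following the definition in [5]; the code and variable names are ours and purely illustrative.

    def c_at_1(n_correct, n_unanswered, n_total):
        """c@1 following [5]: unanswered questions are credited with the
        system's observed accuracy instead of being counted as wrong."""
        accuracy = n_correct / n_total
        return (n_correct + n_unanswered * accuracy) / n_total

    # A run that answers every question reduces to plain accuracy, e.g. a run
    # answering all 56 questions with 20 correct gets 20/56 ≈ 0.357, which
    # matches the csgs-1 row in Table 1.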
Table 1: List of submitted systems that achieved c@1 ≥ 0.25 on the English data set.

c@1    System run
0.446  EE2014-SynapseDeveloppement-1-Output_english.xml
0.375  EE2014-DIPF-7-dipf1407enen.xml
0.375  EE2014-cicnlp-8-cicnlp-8.xml
0.362  EE2014-csgs-2-02_1.xml
0.357  EE2014-csgs-1-01_1.xml
0.357  EE2014-cicnlp-7-cicnlp-7.xml
0.339  EE2014-cicnlp-2-cicnlp-2.xml
0.304  EE2014-cicnlp-4-cicnlp-4.xml
0.304  EE2014-cicnlp-3-cicnlp-3.xml
0.286  EE2014-DIPF-5-dipf1405enen.xml
0.286  EE2014-DIPF-3-dipf1403enen.xml
0.286  EE2014-cicnlp-6-cicnlp-6.xml
0.286  EE2014-cicnlp-1-cicnlp-1.xml
0.25   EE2014-LIMSI-CNRS-4-dude4.xml
0.25   EE2014-DIPF-6-dipf1406enen.xml

5 Discussion and Error Analysis

In this section we provide an error analysis to show and discuss the strengths and weaknesses of our model in the context of this task. In order to analyze the source of the errors, we annotate the English data set with the best sentence that matches the meaning of the correct answer for each question. In this manner, we establish a gold standard (GS) for the first step of our model, the sentence selection described in Section 3.2. We also discuss the performance of the two modules of the system separately, for a more detailed evaluation.

5.1 Additional Annotations

Our answer selection mechanism comprises two steps: the selection of a relevant text sentence that potentially contains the correct answer, and the alignment of the proposed answer alternatives against this passage. For a better understanding of these two components, we assess their contributions separately: we first assess how often our sentence selection module finds the correct sentence in the text; in a second step, we examine how good the alignment of answers to the text would be if we had oracle information about this best sentence. In order to conduct these experiments, we need additional gold standard information: we marked the text sentence that we thought contained the relevant material necessary to answer the question, i.e. the gold standard for the best text sentence.

Each question was annotated by two annotators; in case of disagreement, the annotation was adjudicated by a third annotator. The two annotators agreed in 61% of the cases, which shows that selecting the best sentence is not trivial even for humans. In fact, one often needs more than one sentence to answer a question. Especially for a general question like "What is this story about?", it is necessary to infer the main point of the story, which would require a deeper semantic analysis including processing on the discourse level.

Using the GS for the best matching sentence, we re-ran the classification of the answers and obtained an accuracy of 50%, which is an improvement of 16% compared to the performance with the automatically selected sentences. Even using GS sentences from the text, we are able to detect the correct answer for only half of the questions.

5.2 Error Analysis

In this section we show examples for which our model did and did not work, and discuss the reasons.

Example (2) shows a case where the correct answer was selected. The answer listed first below has the highest weight and is thus classified as correct by our system. The best sentence in this case is also the one we obtained in the gold standard, so the comparison leads us to the correct answer. This example also shows the importance of the synonymy check: although there is a high overlap of identical words, it is particularly important to recognize that the verbs "go away" and "leave", in the answer and the best sentence respectively, have the same meaning in order to score this answer highest.

(2) a. Question: What was the problem the author had with his house?
    b. Best Sentence: My son and I were trying to sell the house we had restored, but in the barn attached to it there were bats and they wouldn't leave.
    c. Answers, ordered by alignment weight, best fitting first:
       - Bats were living in the barn and wouldn't go away.
       - The author and his son might not be able to stay for the season.
       - The author and his son couldn't sleep well because of the muttering sounds.
       - The house was still badly in need of repair.

Another case that is handled well by our model is not a direct question but rather the completion of a sentence, as in example (3). The detection of the best matching sentence works well, and consequently the answer selection based on the comparison to this sentence is also good.

(3) a. Question: Rats that live with their brothers and sisters during their early days
    b. Best Sentence: It has been found that while baby rats kept with their brothers and sisters engage in a lot of rough play, those raised alone with their mothers play just a little.
    c. Answers, ordered by alignment weight, best fitting first:
       - spend a lot of the time playing roughly with them.
       - hurt each other a lot through their rough play.
       - quickly learn to be independent of their mothers.
       - still want to play with their mothers.

If our system fails to detect the best sentence, the selection of the correct answer also fails, because we then compare the answers to a different piece of the given text. The incorrect answers in the multiple-choice set are also related to the text, just not to the part relevant for the question under consideration. This is illustrated in example (4). The best sentence selected by the system yields a higher alignment weight when it is compared to an incorrect answer. If instead we use the best sentence according to the gold standard, our system is able to select the correct answer. It is interesting to observe that the weight of the overlap with the correct answer is the same as before, but it is now the highest one.

(4) a. Question: How did the author obtain Margaret's address?
    b. System Best Sentence: "I don't know how my address got into a magazine in Japan, because I have never asked for a pen pal, but it's so nice hearing from someone in such a fascinating country, and I look forward to corresponding with you."
    c. GS Best Sentence: I was reading a popular youth magazine when I noticed a list of addresses of young people from all over the world who were seeking pen pals in Japan.
    d. Answers:
       - He wrote to a popular magazine for her address. (best for system sentence)
       - He found it in a popular magazine. (best for gold standard sentence)
       - He received it from one of his classmates.
       - He selected it from a list given by his teacher.

5.3 Performance of the Sentence Detection Unit

We compared how many of the sentences in the GS are also selected by the sentence detection component of our system. It turned out that only 11 of 56 sentences matched the GS sentences; in other words, the accuracy on the sentence selection task is 24%. It is particularly challenging for an automatic method to decide, when there is a high overlap with the lexical material in the question, whether a sentence actually contains the information relevant for answering the question or not. In example (5), approximately half of the lexical material that occurs in the question overlaps with the best automatically selected sentence, and this sentence indeed matches the sentence in the GS.
(5) a. Question: Why was the writer in Arizona during World War II?
    b. System Best Sentence: I'd been sent to a special camp in Arizona for Japanese-Americans during World War II, before I joined the army.
    c. GS Best Sentence: I'd been sent to a special camp in Arizona for Japanese-Americans during World War II, before I joined the army.

In contrast, in example (6) the system picks a different sentence, because its overlap with the lexical material of the question is higher than that of the GS sentence corresponding to this question. In this example, the subject of the GS sentence has the surface form "I", which refers to "the author". Our model does not apply co-reference resolution, but doing so might improve the performance of the system: if a model could figure out that the pronoun "I" in the GS best sentence refers to the author, and could measure the overlap with this noun phrase, the GS best sentence would have a better chance of being selected. Another factor that influences the choice of a wrong sentence is that our model generally prefers shorter sentences, as it computes the percentage overlap.

(6) a. Question: Why did the author ask Margaret for her picture?
    b. System Best Sentence: Margaret had asked her friend to send it only in the case of her death.
    c. GS Best Sentence: I knew it would be impolite to ask a girl her age, but thought it would be all right to ask her to send a picture.

6 Conclusions and Future Work

In this paper we have described the adaptation of a model originally developed for short answer scoring to the task of answering multiple-choice questions in an entrance exam scenario. The model has been shown to be suitable for both tasks, since both require evaluating a set of responses according to how well they answer reading comprehension questions. With no task-specific tuning, and without using the training material provided, the system achieves performance comparable to the best-performing runs submitted to the CLEF Question Answering Track 2014 Entrance Exam shared task.

That said, there is ample room for improving the system in order to better handle the task at hand. In our two-step approach, aspects of both modules could be improved. Error analysis shows that the first step of the procedure, sentence selection, performs quite poorly compared to gold standard annotations. High overlap with the question material is simply not enough to choose the sentence from the text that contains the highest proportion of the answer material. For this task we need deeper semantic analysis. In particular, a first step would be to incorporate a co-reference system; this would be beneficial for the many sentences in which pronouns occur instead of full noun phrases. We could further improve this module by taking into account the type of the question and the corresponding expected answer type.

One particular weakness of our approach is that the system, by maximizing percentage overlap, tends to prefer shorter sentences over longer ones. One potential way to address this problem would be to use some metric for identifying the most important words in the reading text and then give these terms more weight when determining the overall alignment score. We would also like to expand our approach to lexical similarity to better identify words that are not covered by WordNet synset relations.

References

1. Christiane Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.
2. D. Gale and L. S. Shapley. College admissions and the stability of marriage.
   The American Mathematical Monthly, 69(1):9–15, 1962.
3. Andrea Horbach, Alexis Palmer, and Manfred Pinkal. Using the text to evaluate short answers for reading comprehension exercises. In Second Joint Conference on Lexical and Computational Semantics (*SEM), pages 286–295, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics.
4. Detmar Meurers, Ramon Ziai, Niels Ott, and Stacey Bailey. Integrating parallel analysis modules to evaluate the meaning of answers to reading comprehension questions. International Journal of Continuing Engineering Education and Life-Long Learning (IJCEELL), Special Issue on Free-text Automatic Evaluation, 21(4):355–369, 2011.
5. Anselmo Peñas and Alvaro Rodrigo. A simple measure to assess non-response. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1415–1424, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
6. Helmut Schmid. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop, pages 47–50, 1995.