Automatic Evaluation of World History Essay Using Chronological and Geographical Measures

Kotaro Sakamoto, Yokohama National University / National Institute of Informatics, sakamoto@forest.eis.ynu.ac.jp
Hideyuki Shibuki, Yokohama National University, shib@forest.eis.ynu.ac.jp
Madoka Ishioroshi, National Institute of Informatics, ishioroshi@nii.ac.jp
Akira Fujita, Yokohama National University, fujita@ynu.ac.jp
Yoshinobu Kano, Shizuoka University, kano@inf.shizuoka.ac.jp
Teruko Mitamura, Carnegie Mellon University, teruko@cs.cmu.edu
Tatsunori Mori, Yokohama National University, mori@forest.eis.ynu.ac.jp
Noriko Kando, National Institute of Informatics / SOKENDAI, kando@nii.ac.jp

ABSTRACT
We propose a method for measuring the chronological and geographical consistency of world history essays in Japanese university entrance exams. The experimental result shows a weak positive correlation between the scores measured by the proposed method and the scores estimated by a human expert in world history.

KEYWORDS
essay QA, automated evaluation, chronological and geographical measures, world history, university entrance exams

Copying permitted for private and academic purposes. EVIA 2017, co-located with NTCIR-13, 5 December 2017, Tokyo, Japan. © 2017 Copyright held by the author.

1 INTRODUCTION
Research on real-world complex question answering (QA) has flourished in recent years [1]. In the QA Lab tasks [10] at the NTCIR workshop,¹ current problems and solutions in QA technologies have been investigated using world history questions from Japanese university entrance exams and their English translations. Japanese university entrance exams include various types of questions, such as multiple-choice, fill-in-the-blank, true-or-false and essay questions. Of these, essay QA is the most challenging and still has many open problems, such as the evaluation of essays generated by QA systems. Although essays can be evaluated by human experts in world history, this takes considerable time and cost. In the case of the QA Lab, the evaluation of 46 essays by an expert who teaches world history took around a month and cost about 500,000 yen (4,500 USD). Therefore, a new method is required.

Because essay generation is regarded as a kind of query-biased summarization, measures for evaluating summaries against gold-standard data can be applied to essay evaluation. In the QA Lab, the ROUGE family [5] and the Pyramid method [7, 9] are used for grading essays alongside a human expert's evaluation. The positive correlation between these grades and those provided by humans was between moderate and weak, and the ranking order given by the measures was not always concordant with the ranking order given by the human marks. Therefore, we investigated more appropriate measures for evaluating world history essays in Japanese university entrance exams.

For evaluating summaries, linguistic well-formedness and relative responsiveness were used in the DUC workshops.² Content, readability/fluency, and overall responsiveness were used in the Guided Summarization tasks³ of the TAC workshops. These measures are important for evaluating world history essays in university entrance exams. However, linguistic well-formedness and readability/fluency were scored subjectively by human assessors, while content was scored methodically by the ROUGE family and the Pyramid method, among others. We would like to score merits other than content in an equally methodical way. For evaluating world history essays, chronological and geographical consistency is important as a kind of semantic consistency. However, how to evaluate it is not obvious. In this paper, we propose a method for measuring the chronological and geographical consistency of world history essays, and we examine the method using essays submitted to the QA Lab.

The main contributions of this paper are as follows: (i) to clarify the features of well-formed world history essays in terms of chronological and geographical information, and (ii) to introduce a new scoring method based on these features to evaluate the well-formedness of world history essays.

¹ http://research.nii.ac.jp/ntcir/index-en.html
² http://duc.nist.gov/duc2007/tasks.html
³ http://www.nist.gov/tac/2011/Summarization/Guided-Summ.2011.guidelines.html

2 RELATED WORK
Linguistic well-formedness in the DUC workshop and readability/fluency in the TAC Guided Summarization tasks were evaluated in terms of grammaticality, non-redundancy, referential clarity, focus, and 'structure and coherence'. Our measures relate to focus and to 'structure and coherence'. Although Barzilay et al. [2] and Okazaki et al. [8] studied chronological ordering, they did not take geographical information into account. Buscaldi et al. [4] found that geography is related to semantic similarity, but they only aimed to measure semantic equivalence between two text snippets. Because Madnani et al. [6] studied only sentence ordering, their work applies only to short, domain-independent summarization. Bauer and Teufel [3] proposed an extended Pyramid method for timeline summarization, but they did not focus on well-formedness. Although Wagner et al. [11] studied well-formedness, they focused only on grammatical errors. Therefore, there is no research on a methodology for measuring the focus and the structure and coherence of world history essays in terms of chronological and geographical information.

3 ESSAY QUESTION OF WORLD HISTORY
Figure 1: An example of essay question of world history

Figure 1 shows an example of an essay question for world history, translated into English from the original Japanese version. The question contains additional text besides the main essay topic. The first paragraph gives background information, and the text below the essay topic gives the constraints for writing the essay. The constraints include a length limitation of "no more than 20 lines," a geographical condition of "West Europe, West Asia and East Asia," a chronological condition of "up to and including the first half of the 18th century," the keywords that must be used in the essay, and other associated conditions. The presence of the chronological and geographical conditions shows the importance of chronological and geographical consistency.

4 WELL-FORMED WORLD HISTORY ESSAY
4.1 Structure
In general, a world history essay is a sequential description of historical events (HEs). An HE has both chronological information and geographical information. Let us consider how such a description is written. While chronological information can easily be put in a linear order from the past to the future, geographical information cannot be put in a unique linear order because of its spatial extent. Based on a study of several model answer essays from collections of past university entrance exams, the general structure of the essays follows one of two approaches: (a) disregarding geographical information, all HEs are described in chronological order, or (b) HEs are grouped by geographical information. In both, information is described in chronological order. If the former is regarded as grouping by the geographical information "the whole world," there is no difference between the two approaches; that is, both describe the HEs of a particular area in chronological order. We define a sequence of HEs with the same geographical information as a geographical section (GS). GSs can be nested hierarchically. For example, a GS of Europe may contain GSs such as England, France, and Germany, and the GS of England may contain GSs such as London, Birmingham and Manchester.

From the above, we built the following hypotheses for the structure of a world history essay.
(H1) An essay is a GS.
(H2) A GS can consist of more than one sub-GS, each of which lies within the parent GS.
(H3) HEs in a GS are put in chronological order.

GSs in a hierarchical structure are classified into terminal and non-terminal sections. A terminal section is an HE sequence without hierarchical structure, whereas a non-terminal section can be divided into several GSs. We define the non-terminal section corresponding to the essay as the root section. A GS $s$ is defined as a pair of an HE sequence $E = (e_1, e_2, \cdots, e_m)$ and a GS sequence $SS = (s_1, s_2, \cdots, s_n)$. If $SS$ is an empty tuple, the GS is a terminal section. The HEs in a sub-GS are shared with the superordinate GS, so the $E$ of a non-terminal section is not empty. For a question, the chronological condition $CC$ is defined as a pair of the beginning time $bt$ and the ending time $et$, and the geographical condition $GC$ is defined as a set of geographical entities $\{g_1, g_2, \cdots, g_k\}$.

4.2 Uniformity
Let us consider the uniformity of GSs in a GS. If GSs of the East Midlands, Paris and Germany are placed on the same level in a GS of Europe, they are incongruous even though they are all parts of Europe. This is because they belong to different levels of geographical category, such as country, region, and city. Therefore, well-formed essays require uniformity of geographical category level. In addition, if England is described with hundreds of words while France and Germany are each described with a dozen words, there is incongruity even though they are on the same geographical category level. This is because their quantities of description are imbalanced. Therefore, well-formed essays also seem to require uniformity of quantity.

We built the following hypotheses for the uniformity of GSs.
(H4) GSs placed on the same level in a GS are on the same level of geographical category.
(H5) GSs placed on the same level in a GS are described in the same quantity.

Although several functions implementing the hypotheses could be devised, we adopted simple functions as a first experiment. The geographical uniformity $sc_{GU}()$ and the quantity uniformity $sc_{QU}()$ are calculated by the following functions:

$$sc_{GU}(SS) = 1 - \frac{sd_{GU}(SS)}{am_{GU}(SS)} \tag{1}$$

$$sd_{GU}(SS) = \sqrt{\frac{1}{|SS|} \sum_{i=1}^{|SS|} \left(depth(s_i) - am_{GU}(SS)\right)^2} \tag{2}$$

$$am_{GU}(SS) = \frac{1}{|SS|} \sum_{i=1}^{|SS|} depth(s_i) \tag{3}$$

$$sc_{QU}(SS) = \frac{-\sum_{i=1}^{|SS|} p(s_i, SS) \log_2 p(s_i, SS)}{\log_2 |SS|} \tag{4}$$

$$p(s, SS) = \frac{length(s)}{\sum_{i=1}^{|SS|} length(s_i)} \tag{5}$$

where $depth(s)$ is a function that returns the distance between the thesaurus root node and the node corresponding to the range of $s$, and $length(s)$ is a function that returns the number of characters in the description of $s$. We designed the scoring functions to be normalized into the range [0, 1].

4.3 Ordering
Let us consider the ordering of HEs in a GS. HEs in well-formed essays are generally described in chronological order. Note that the order in which HEs occurred does not always correspond to the order in which an essay describes them. Since the chronological information of an HE is a range with a beginning and an ending, the occurrence-order relation between two HEs is non-overlapping, partially overlapping, or inclusive. In all three relations, let $e_1$ be the HE whose beginning precedes the beginning of the HE $e_2$. In the inclusion relation, however, $e_1$ may be described after $e_2$, as in "The Treaty of Nanking ended the First Opium War." Therefore, we assume that the describing order of HEs in the inclusion relation is not constrained by the chronological order. Next, let us consider the ordering of GSs in a GS. The describing order of GSs is not constrained by the chronological order. However, the describing order Athens, Rome, Cairo, Baghdad, Beijing and Shanghai, for example, seems better than the order Athens, Baghdad, Beijing, Cairo, Rome and Shanghai. This is because GSs related to each other are placed close together. We assume that relatedness is approximated by geographical distance.

We built the following hypotheses for the ordering in a GS.
(H6) As an exception to hypothesis (H3), an HE can be described either before or after another HE if they are in the inclusion relation.
(H7) GSs in a GS are described in order of short geographical distance.

Hypothesis (H6) complements hypothesis (H3). The chronological ordering $sc_{CO}()$ and the geographical ordering $sc_{GO}()$ are calculated by the following functions:

$$sc_{CO}(E) = \frac{K - L}{K + L} \tag{6}$$

$$sc_{GO}(E) = \frac{1}{geochange(E) + 1} \tag{7}$$

$$geochange(E) = \frac{1}{|E| - 1} \sum_{i=1}^{|E|-1} distance(range(e_i), range(e_{i+1})) \tag{8}$$

where $K$ is the number of concordant pairs of HEs in $E$, and $L$ is the number of discordant pairs. $range(e)$ is a function that returns the thesaurus node that is the nearest common node subsuming all geographical entities included in the HE $e$, and $distance(n_i, n_j)$ is a function that returns the shortest distance between the thesaurus nodes $n_i$ and $n_j$.

4.4 Cooperability
Let us consider the cooperability of a world history essay with the question constraints in terms of chronological and geographical information. As described in Section 3, world history essay questions give chronological and geographical conditions such as "up to and including the first half of the 18th century" and "West Europe, West Asia and East Asia." In this case, an essay that describes only the ancient histories of West Europe, West Asia and East Asia satisfies the conditions logically. However, it does not reflect the intention of the question. A cooperative essay should describe at least one HE of the 18th century. The same holds for geographical information. For example, an essay describing only "West Europe and West Asia" violates the maxim of quantity, and a cooperative essay should describe at least one HE for each area of the geographical condition. We assume that chronological cooperability is observed in all GSs, while geographical cooperability is observed only in the GS corresponding to the essay. For a GS, we define the period from the beginning of the earliest HE to the end of the latest one as the period of the GS. The smallest geographical range that includes the places where the HEs in a GS occurred is defined as the range of the GS. We assume that observance of the maxim of quantity is approximated by how well the period and the range of GSs cover the conditions.

We built the following hypotheses for cooperability with the chronological and geographical conditions in questions.
(H8) The period of a GS covers the period of the chronological condition as exactly as possible.
(H9) The range of the GS corresponding to the essay covers the range of the geographical condition as exactly as possible.

The chronological cooperability $sc_{CC}()$ and the geographical cooperability $sc_{GC}()$ are calculated by the following functions:

$$sc_{CC}(E, CC) = \frac{overlap(period(E), CC)}{extend(period(E), CC)} \tag{9}$$

$$sc_{GC}(E, GC) = \frac{2 P(E, GC) R(E, GC)}{P(E, GC) + R(E, GC)} \tag{10}$$

$$P(E, GC) = \frac{subsumed(geoentities(E), GC)}{|geoentities(E)|} \tag{11}$$

$$R(E, GC) = \frac{subsuming(geoentities(E), GC)}{|GC|} \tag{12}$$

where $period(E)$ is a function that returns a pair of the earliest time and the latest time in $E$, $overlap(P_1, P_2)$ is a function that returns the length of the overlap between the periods $P_1$ and $P_2$, and $extend(P_1, P_2)$ is a function that returns the length of the period between the earliest time and the latest time among $P_1$ and $P_2$. $geoentities(E)$ is a function that returns the set of geographical entities included in $E$, $subsumed(G_1, G_2)$ is a function that returns the number of geographical entities of $G_1$ subsumed by geographical entities of $G_2$, and $subsuming(G_1, G_2)$ is a function that returns the number of geographical entities of $G_2$ subsuming geographical entities of $G_1$.

5 PROPOSED METHOD
Figure 2: The outline of the proposed method

Figure 2 shows the outline of the proposed method. First, the input essay is segmented into HEs at punctuation marks. An HE is represented by the set of named entities extracted from its segment. Some named entities evoke chronological and/or geographical information. Because exam cram books cover such information, we constructed a database of world history terms based on the world history glossary published by Yamakawa Shuppan-sha.⁴ Using the database, the named entities are converted into chronological and geographical information. From these chronological and geographical information sets, the period and the range of the segment are determined in the same way as those of the GS described in Section 4.4. They are regarded as the chronological and geographical information of the HE. Then, all hierarchical structures of GSs that can be obtained from the essay are listed. After scoring the HEs for each hierarchical structure, the maximum score is selected as the final score for the essay, which amounts to selecting the most plausible hierarchical structure.

Based on the hypotheses described in Section 4, the score $sc$ of a GS for a question is calculated recursively by the following functions.

$$sc(E, SS, CC, GC) = \begin{cases} sc_T(E, CC) & \text{if it is a terminal section} \\ sc_N(E, SS, CC)\, sc_{GC}(E, GC) & \text{if it is the root section} \\ sc_N(E, SS, CC) & \text{otherwise} \end{cases} \tag{13}$$

$$sc_T(E, CC) = sc_{CO}(E)\, sc_{GO}(E)\, sc_{CC}(E, CC) \tag{14}$$

$$sc_N(E, SS, CC) = sc_{GU}(SS)\, sc_{QU}(SS)\, \frac{1}{|SS|} \sum_{i=1}^{|SS|} sc(events(s_i), sections(s_i), CC, GC) \tag{15}$$

where $events(s)$ and $sections(s)$ are functions that return, respectively, the HE sequence and the GS sequence included in a GS $s$.

⁴ http://www.yamakawa.co.jp/ (in Japanese)

6 EXPERIMENTAL RESULT
Using essays submitted to the QA Lab-2 and the QA Lab-3, we compared the scores measured by the proposed method with the scores given by a human expert. Although there are only 55 essays, each is annotated with the marks awarded and the marks deducted, in addition to the total score given by the human expert. Basically, the marks awarded reflect the correctness of the content, and the marks deducted reflect ill-formedness. Accordingly, we compared the scores of our method with the deducted marks. Note that marks are deducted not only for chronological and geographical inconsistencies.

Figure 3: The scatter plot between the scores by the method and the scores by a human expert

Figure 3 shows the scatter plot between the scores by our method and the deducted marks. The correlation coefficient was 0.21, which indicates a weak positive correlation. Considering that the deducted marks also reflect causes other than chronological and geographical problems, the value seems fairly good.

7 CONCLUSION
For world history essays in Japanese university entrance exams, we proposed a method for measuring uniformity, ordering and cooperability in terms of chronological and geographical information. The features of well-formedness were found by observing several model answer essays. From the experimental result, we found a weak positive correlation between the scores measured by our method and the scores estimated by a human expert. We will investigate more appropriate functions in the future.

REFERENCES
[1] Eugene Agichtein, David Carmel, Donna Harman, Dan Pelleg, and Yuval Pinter. 2015. Overview of the TREC 2015 LiveQA Track. In Proceedings of The Twenty-Fourth Text REtrieval Conference.
[2] Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. 2002. Inferring Strategies for Sentence Ordering in Multidocument News Summarization. Journal of Artificial Intelligence Research 17, 1 (2002), 35–55.
[3] Sandro Bauer and Simone Teufel. 2015. Improving Chronological Sentence Ordering by Precedence Relation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Vol. 2. 834–839.
[4] Davide Buscaldi, Jorge J. Garcia Flores, Joseph Le Roux, and Nadi Tomeh. 2014. LIPN: Introducing a new Geographical Context Similarity Measure and a Statistical Similarity Measure Based on the Bhattacharyya Coefficient. In Proceedings of the 8th International Workshop on Semantic Evaluation. 400–405.
[5] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of Workshop on Text Summarization Branches Out. 74–81.
[6] Nitin Madnani, Rebecca Passonneau, Necip Fazil Ayan, John M. Conroy, Bonnie J. Dorr, Judith L. Klavans, Dianne P. O'Leary, and Judith D. Schlesinger. 2007. Measuring Variability in Sentence Ordering for News Summarization. In Proceedings of the Eleventh European Workshop on Natural Language Generation. 81–88.
[7] Ani Nenkova and Rebecca J. Passonneau. 2004. Evaluating Content Selection in Summarization: The Pyramid Method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 145–152.
[8] Naoaki Okazaki, Yutaka Matsuo, and Mitsuru Ishizuka. 2004. Improving Chronological Sentence Ordering by Precedence Relation. In Proceedings of the 20th International Conference on Computational Linguistics. 81–88.
[9] Rebecca J. Passonneau, Emily Chen, Weiwei Guo, and Dolores Perin. 2013. Automated Pyramid Scoring of Summaries using Distributional Semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. 143–147.
[10] Hideyuki Shibuki, Kotaro Sakamoto, Madoka Ishioroshi, Akira Fujita, Yoshinobu Kano, Teruko Mitamura, Tatsunori Mori, and Noriko Kando. 2017. Overview of the NTCIR-13 QA Lab-3 Task. In Proceedings of The NTCIR-13 Conference.
[11] Joachim Wagner, Jennifer Foster, and Josef van Genabith. 2007. A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 112–121.
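As an illustration of the uniformity measures of Section 4.2 (Equations 1 to 5), the following is a minimal Python sketch, not the authors' implementation. It assumes that the thesaurus depth and the character length of each sub-GS have already been computed; the function names are hypothetical. The paper does not say what happens when the mean depth is zero or when a GS has a single sub-GS, so returning 1.0 in those cases is our assumption.

```python
import math
from typing import Sequence


def geographical_uniformity(depths: Sequence[float]) -> float:
    """Equation (1): 1 - (population std. dev. / mean) of sub-GS depths."""
    n = len(depths)
    mean = sum(depths) / n                                    # Equation (3)
    if mean == 0:
        return 1.0  # assumption: all sub-GSs sit at the root, treat as uniform
    sd = math.sqrt(sum((d - mean) ** 2 for d in depths) / n)  # Equation (2)
    return 1.0 - sd / mean


def quantity_uniformity(lengths: Sequence[int]) -> float:
    """Equation (4): entropy of the length shares, normalized by log2 |SS|."""
    if len(lengths) < 2:
        return 1.0  # assumption: a single sub-GS is trivially balanced
    total = sum(lengths)
    shares = [length / total for length in lengths]           # Equation (5)
    entropy = -sum(p * math.log2(p) for p in shares if p > 0)
    return entropy / math.log2(len(lengths))


# Three sub-GSs on the same thesaurus level, but with imbalanced descriptions
# (hundreds of characters for one, a dozen for the others).
print(geographical_uniformity([2, 2, 2]))
print(quantity_uniformity([300, 12, 12]))
```

Equal depths give a geographical uniformity of 1, and the quantity uniformity falls toward 0 as one sub-GS dominates the description.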
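The ordering measures of Section 4.3 (Equations 6 to 8) can be sketched in the same hedged way. The paper does not spell out how pairs in the inclusion relation enter K and L; following hypothesis (H6), this sketch simply skips them, which is our assumption, as is the value 1.0 when no pair remains. The toy thesaurus `THESAURUS` and all function names are hypothetical stand-ins for the database described in Section 5.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Tuple

Period = Tuple[int, int]  # (begin year, end year) of one HE

# Hypothetical toy thesaurus, stored as child -> parent links.
THESAURUS: Dict[str, Optional[str]] = {
    "World": None,
    "Europe": "World", "Middle East": "World", "East Asia": "World",
    "Athens": "Europe", "Rome": "Europe",
    "Cairo": "Middle East", "Baghdad": "Middle East",
    "Beijing": "East Asia", "Shanghai": "East Asia",
}


def includes(a: Period, b: Period) -> bool:
    """True if period a contains period b (the inclusion relation)."""
    return a[0] <= b[0] and b[1] <= a[1]


def chronological_ordering(periods: Sequence[Period]) -> float:
    """Equation (6) over HEs listed in the order the essay describes them."""
    k = l = 0
    for a, b in combinations(periods, 2):  # a is described before b
        if includes(a, b) or includes(b, a):
            continue  # assumption from (H6): inclusive pairs are not counted
        if a[0] <= b[0]:
            k += 1  # concordant: described order matches occurrence order
        else:
            l += 1  # discordant
    return (k - l) / (k + l) if k + l else 1.0


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def distance(a: str, b: str, parent: Dict[str, Optional[str]]) -> int:
    """Number of edges on the shortest path between two thesaurus nodes."""
    steps_from_b = {n: i for i, n in enumerate(path_to_root(b, parent))}
    for i, n in enumerate(path_to_root(a, parent)):
        if n in steps_from_b:  # first common ancestor
            return i + steps_from_b[n]
    raise ValueError("nodes do not share a root")


def geographical_ordering(ranges: Sequence[str],
                          parent: Dict[str, Optional[str]]) -> float:
    """Equations (7) and (8): inverse of the mean distance between neighbors."""
    if len(ranges) < 2:
        return 1.0  # assumption: a single HE has nothing to reorder
    steps = [distance(a, b, parent) for a, b in zip(ranges, ranges[1:])]
    return 1.0 / (sum(steps) / len(steps) + 1.0)


good = ["Athens", "Rome", "Cairo", "Baghdad", "Beijing", "Shanghai"]
bad = ["Athens", "Baghdad", "Beijing", "Cairo", "Rome", "Shanghai"]
print(geographical_ordering(good, THESAURUS), geographical_ordering(bad, THESAURUS))
```

With this toy thesaurus, the order that keeps related cities adjacent scores higher than the alphabetical order, in line with the example given for hypothesis (H7).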
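For the cooperability measures of Section 4.4 (Equations 9 to 12), a minimal sketch might look as follows. Periods are assumed to be (begin, end) pairs of years, and subsumption is assumed to mean "is the same node as, or an ancestor of" in a thesaurus; the tiny `REGIONS` thesaurus and the function names are hypothetical. Returning 0.0 for empty inputs and 1.0 for a zero-length extent is our assumption, since the paper leaves those cases open.

```python
from typing import Dict, Optional, Set, Tuple

Period = Tuple[int, int]  # (begin year, end year)

# Hypothetical toy thesaurus, stored as child -> parent links.
REGIONS: Dict[str, Optional[str]] = {
    "World": None,
    "West Europe": "World", "West Asia": "World", "East Asia": "World",
    "France": "West Europe", "Persia": "West Asia", "China": "East Asia",
}


def chronological_cooperability(period: Period, cc: Period) -> float:
    """Equation (9): overlap length divided by the joint extent."""
    overlap = max(0, min(period[1], cc[1]) - max(period[0], cc[0]))
    extend = max(period[1], cc[1]) - min(period[0], cc[0])
    return overlap / extend if extend > 0 else 1.0


def subsumes(ancestor: str, node: Optional[str],
             parent: Dict[str, Optional[str]]) -> bool:
    """True if `ancestor` equals `node` or lies on its path to the root."""
    while node is not None:
        if node == ancestor:
            return True
        node = parent.get(node)
    return False


def geographical_cooperability(entities: Set[str], gc: Set[str],
                               parent: Dict[str, Optional[str]]) -> float:
    """Equations (10) to (12): F-measure of essay entities against GC."""
    if not entities or not gc:
        return 0.0
    covered = sum(1 for e in entities
                  if any(subsumes(g, e, parent) for g in gc))
    covering = sum(1 for g in gc
                   if any(subsumes(g, e, parent) for e in entities))
    precision = covered / len(entities)   # Equation (11)
    recall = covering / len(gc)           # Equation (12)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


condition = {"West Europe", "West Asia", "East Asia"}
print(geographical_cooperability({"France", "Persia"}, condition, REGIONS))
print(chronological_cooperability((1500, 1700), (1600, 1750)))
```

An essay that mentions only France and Persia covers two of the three required areas, so its recall, and hence its geographical cooperability, drops below 1.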
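The recursive score of Section 5 (Equations 13 to 15) can be sketched as a short tree walk. This is an illustrative sketch only: the `Section` class and the three callables are hypothetical, and the conditions CC and GC are assumed to be captured inside those callables rather than passed explicitly. Here `terminal_score(E)` stands for the product in Equation (14), `uniformity(SS)` for the product of the two uniformity scores, and `geo_cooperability(E)` for Equation (10).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Section:
    """A geographical section: an HE sequence E and a sub-GS sequence SS."""
    events: list
    subsections: List["Section"] = field(default_factory=list)


def score(section: Section,
          terminal_score: Callable[[list], float],
          uniformity: Callable[[List[Section]], float],
          geo_cooperability: Callable[[list], float],
          is_root: bool = True) -> float:
    """Equation (13), with the terminal case checked first as in the paper."""
    if not section.subsections:
        return terminal_score(section.events)                # Equation (14)
    children = [score(s, terminal_score, uniformity, geo_cooperability, False)
                for s in section.subsections]
    non_terminal = (uniformity(section.subsections)
                    * sum(children) / len(children))         # Equation (15)
    if is_root:
        return non_terminal * geo_cooperability(section.events)
    return non_terminal


# Toy run with constant stand-in scorers; each event is just a number here.
essay = Section(events=[1.0, 0.5],
                subsections=[Section(events=[1.0]), Section(events=[0.5])])
result = score(essay,
               terminal_score=lambda events: events[0],
               uniformity=lambda sections: 0.5,
               geo_cooperability=lambda events: 0.8)
print(result)
```

Since Section 5 takes the maximum over all candidate hierarchical structures of an essay, the final score under this sketch would be the largest `score(...)` value over those candidates.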