Automatic Evaluation of World History Essay Using Chronological and Geographical Measures

Kotaro Sakamoto (sakamoto@forest.eis.ynu.ac.jp), Yokohama National University; National Institute of Informatics
Hideyuki Shibuki, Yokohama National University
Madoka Ishioroshi (ishioroshi@nii.ac.jp), National Institute of Informatics
Akira Fujita (fujita@ynu.ac.jp), Yokohama National University
Yoshinobu Kano (kano@inf.shizuoka.ac.jp), Shizuoka University
Teruko Mitamura (teruko@cs.cmu.edu), Carnegie Mellon University
Tatsunori Mori (mori@forest.eis.ynu.ac.jp), Yokohama National University
Noriko Kando (kando@nii.ac.jp), National Institute of Informatics; SOKENDAI

Keywords: essay QA, automated evaluation, chronological and geographical measures, world history, university entrance exams

We propose a method for measuring the chronological and geographical consistency of world history essays in Japanese university entrance exams. The experimental result shows a weak positive correlation between the scores measured by the proposed method and the scores estimated by a human expert in world history.

INTRODUCTION

Research on real-world complex question-answering (QA) has flourished in recent years [1]. In the QA Lab tasks [10] at the NTCIR workshop,¹ current problems and solutions in QA technologies have been investigated using world history questions from Japanese university entrance exams and their English translations. These exams include various question types, such as multiple-choice, fill-in-the-blank, true-or-false, and essay questions. Essay QA is the most challenging of these and still has many open problems, such as how to evaluate the essays that QA systems generate. Evaluation by human experts in world history is possible, but it takes considerable time and money: in the QA Lab, having an expert who teaches world history evaluate 46 essays took around a month and cost about 500,000 yen (4,500 USD). Therefore, a new evaluation method is required.

Because essay generation is regarded as a kind of query-biased summarization, the measures for evaluating summaries against gold-standard data can be applied to essay evaluation. In the QA Lab, the ROUGE family [5] and the Pyramid method [7,9] are used for grading essays alongside a human expert's evaluation. The positive correlation between these grades and the human grades ranged from weak to moderate, and the ranking produced by the measures did not always agree with the ranking given by the human marks. Therefore, we investigated more appropriate measures for evaluating world history essays in Japanese university entrance exams.

For evaluating summaries, linguistic well-formedness and relative responsiveness were used in the DUC workshops.² Content, readability/fluency, and overall responsiveness were used in the Guided Summarization tasks³ at the TAC workshops.

These measures are important for evaluating world history essays in university entrance exams. However, linguistic well-formedness and readability/fluency were scored arbitrarily by human assessors, while content was scored methodically by the ROUGE family and the Pyramid method, among others. We would like to give methodical scores for merits other than content as well. For evaluating world history essays, chronological and geographical consistency is important as a kind of semantic consistency. However, how to evaluate it is not obvious. In this paper, we propose a method for measuring the chronological and geographical consistency of world history essays, and we examine the method using essays submitted to the QA Lab.

The main contributions of this paper are as follows: (i) to clarify the features of well-formed world history essays in terms of chronological and geographical information, and (ii) to introduce a new scoring method based on these features to evaluate the well-formedness of world history essays.

RELATED WORK

The linguistic well-formedness in the DUC workshops and the readability/fluency in the TAC Guided Summarization tasks were evaluated in terms of grammaticality, non-redundancy, referential clarity, focus, and 'structure and coherence'. Our measures relate to focus and 'structure and coherence'. Although Barzilay et al. [2] and Okazaki et al. [8] studied chronological ordering, they did not take geographical information into account. Buscaldi et al. [4] found that geography is related to semantic similarity, but they only aimed to measure semantic equivalence between two text snippets. Because Madnani et al. [6] only studied sentence ordering, their research applies only to short, domain-independent summarization. Bauer and Teufel [3] proposed an extended Pyramid method for timeline summarization, but they did not focus on well-formedness. Although Wagner et al. [11] studied well-formedness, they focused only on grammatical errors. Therefore, there is no research on a methodology for measuring the focus and the structure and coherence of world history essays in terms of chronological and geographical information.

ESSAY QUESTION OF WORLD HISTORY

The essay question shown in Figure 1 contains additional text besides the main essay topic. The first paragraph gives background information, and the text below the essay topic gives the constraints for writing the essay. The constraints include a length limitation of "no more than 20 lines," a geographical condition of "West Europe, West Asia and East Asia," a chronological condition of "up to and including the first half of the 18th century," the keywords that must be used in the essay, and other associated conditions. The chronological condition and the geographical condition show the importance of chronological and geographical consistency.

WELL-FORMED WORLD HISTORY ESSAY

Structure

In general, a world history essay is a sequential description of historical events (HEs). An HE has both chronological and geographical information. Let us consider how this information is written. While chronological information can easily be put in a linear order from past to future, geographical information cannot easily be put in a definite linear order because of its spatial extent. Based on a study of several model answer essays from collections of past university entrance exams, the general structure of an essay follows one of two approaches: (a) all HEs are described in chronological order, disregarding geographical information, or (b) HEs are grouped by geographical information. In both, information is described in chronological order. If approach (a) is regarded as grouping by the geographical information "the whole world," there is no difference between the two approaches; that is, both describe the HEs of a particular area in chronological order. We defined a sequence of HEs with the same geographical information as a geographical section (GS). GSs can be nested hierarchically. For example, a GS of Europe may contain GSs such as England, France, and Germany, and the GS of England may contain GSs such as London, Birmingham, and Manchester.

From the above, we built the following hypotheses for the structure of a world history essay.

(H1) An essay is a GS.
(H2) A GS can consist of one or more sub-GSs that lie within the parent GS.
(H3) HEs in a GS are put in chronological order.

GSs in a hierarchical structure are classified into terminal and non-terminal sections. A terminal section is an HE sequence without hierarchical structure, whereas a non-terminal section can be divided into several GSs. We defined the non-terminal section corresponding to the essay as the root section. A GS $s$ is defined as a pair of an HE sequence $E = (e_1, e_2, \dots, e_m)$ and a GS sequence $SS = (s_1, s_2, \dots, s_n)$. If $SS$ is an empty tuple, the GS is a terminal section. HEs in a sub-GS are shared with the superordinate GS, and $E$ of a non-terminal section is not empty. For a question, the chronological condition $CC$ is defined as a pair of the beginning time $bt$ and the ending time $et$, and the geographical condition $GC$ is defined as a set of geographical entities $\{g_1, g_2, \dots, g_k\}$.
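The definitions above can be sketched as plain data structures. This is a minimal illustration, not the authors' implementation: the class names `HistoricalEvent` and `GeoSection`, the use of integer years for times, and the example values are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class HistoricalEvent:
    """An HE: a period (begin, end) plus the geographical entities it mentions.

    Representing times as integer years is an assumption of this sketch.
    """
    begin: int
    end: int
    places: Set[str] = field(default_factory=set)


@dataclass
class GeoSection:
    """A GS: an HE sequence E paired with a sub-GS sequence SS."""
    events: List[HistoricalEvent]
    subsections: List["GeoSection"] = field(default_factory=list)

    def is_terminal(self) -> bool:
        # An empty SS marks a terminal section.
        return not self.subsections


# Question constraints: CC = (bt, et) and GC = {g_1, ..., g_k}.
ChronologicalCondition = Tuple[int, int]
GeographicalCondition = Set[str]

# Hypothetical example: a root section with two terminal sub-sections.
# HEs of a sub-GS are shared with the superordinate GS, so the root lists both.
e1 = HistoricalEvent(1700, 1710, {"England"})
e2 = HistoricalEvent(1720, 1725, {"France"})
england = GeoSection(events=[e1])
france = GeoSection(events=[e2])
root = GeoSection(events=[e1, e2], subsections=[england, france])
```

Keeping the shared events in the parent's `events` list mirrors the statement that $E$ of a non-terminal section is not empty.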

Uniformity

Let us consider the uniformity of GSs in a GS. If GSs of the East Midlands, Paris, and Germany are placed on the same level in a GS of Europe, they are incongruous even though they are all parts of Europe. This is because they belong to different levels of geographical category, such as country, region, and city. Therefore, a well-formed essay requires uniformity of geographical category level. In addition, if England is described with hundreds of words while France and Germany are each described with a dozen words, there is incongruity even though they are at the same geographical category level. This is because the quantities of description are imbalanced. Therefore, a well-formed essay seems to require uniformity of quantity.

We built the following hypotheses for the uniformity of GSs.

(H4) GSs placed on the same level in a GS are at the same level of geographical category.
(H5) GSs placed on the same level in a GS are described in the same quantity.

Although we considered several functions implementing these hypotheses, we adopted simple functions on an experimental basis.

The geographical uniformity $\mathrm{sc}_{GU}()$ and the quantity uniformity $\mathrm{sc}_{QU}()$ are calculated by the following functions:

\[ \mathrm{sc}_{GU}(SS) = 1 - \frac{\mathrm{sd}_{GU}(SS)}{\mathrm{am}_{GU}(SS)} \tag{1} \]

\[ \mathrm{sd}_{GU}(SS) = \sqrt{\frac{1}{|SS|} \sum_{i=1}^{|SS|} \left( \mathrm{depth}(s_i) - \mathrm{am}_{GU}(SS) \right)^2} \tag{2} \]

\[ \mathrm{am}_{GU}(SS) = \frac{1}{|SS|} \sum_{i=1}^{|SS|} \mathrm{depth}(s_i) \tag{3} \]

\[ \mathrm{sc}_{QU}(SS) = \frac{- \sum_{i=1}^{|SS|} p(s_i, SS) \log_2 p(s_i, SS)}{\log_2 |SS|} \tag{4} \]

\[ p(s, SS) = \frac{\mathrm{length}(s)}{\sum_{i=1}^{|SS|} \mathrm{length}(s_i)} \tag{5} \]

where $\mathrm{depth}(s)$ is a function that returns the distance between the thesaurus root node and the node corresponding to the range of $s$, and $\mathrm{length}(s)$ is a function that returns the number of characters in $s$. We designed the scoring functions to be normalized into the range [0, 1].
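As a rough illustration of Equations (1) to (5), the sketch below computes both uniformity scores from precomputed thesaurus depths and character counts. It reads sd_GU as the population standard deviation and am_GU as the arithmetic mean; the function names, the assumption that depths are at least 1, and the treatment of a single section as perfectly uniform are choices of this sketch and are not stated in the paper.

```python
import math
from typing import Sequence


def geo_uniformity(depths: Sequence[float]) -> float:
    """sc_GU: one minus (standard deviation of depths / mean depth).

    Assumes every depth is >= 1 so that the mean is non-zero.
    """
    mean = sum(depths) / len(depths)
    variance = sum((d - mean) ** 2 for d in depths) / len(depths)
    return 1.0 - math.sqrt(variance) / mean


def quantity_uniformity(lengths: Sequence[int]) -> float:
    """sc_QU: entropy of the length shares divided by log2 |SS|."""
    n = len(lengths)
    if n < 2:
        # Assumption: log2(1) = 0 would divide by zero, so a single
        # section is treated as perfectly uniform.
        return 1.0
    total = sum(lengths)
    shares = [length / total for length in lengths if length > 0]
    entropy = -sum(p * math.log2(p) for p in shares)
    return entropy / math.log2(n)


# Hypothetical inputs: three sibling sections at the same thesaurus depth,
# once with balanced and once with imbalanced amounts of text.
same_level = geo_uniformity([2, 2, 2])
mixed_level = geo_uniformity([3, 4, 2])
balanced = quantity_uniformity([100, 100, 100])
imbalanced = quantity_uniformity([300, 12, 12])
```

Sections at one category level score 1 on geographical uniformity, and the quantity score falls as one section dominates the text, which matches the intent of hypotheses (H4) and (H5).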

Ordering

Let us consider the ordering of HEs in a GS. HEs in well-formed essays are generally described in chronological order. Note, however, that the order in which HEs occurred does not always correspond to the order in which an essay describes them. Since the chronological information of an HE is a range with a beginning and an ending, the occurrence-order relation between two HEs is non-overlapping, partially overlapping, or inclusive. In all three relations, the beginning of HE $e_1$ precedes the beginning of HE $e_2$. In the inclusion relation, however, $e_1$ may be described after $e_2$, as in "The Treaty of Nanking ended the First Opium War." Therefore, we assume that the descriptive order of HEs in the inclusion relation is free with respect to chronological order.

Next, let us consider the ordering of GSs in a GS. The descriptive order of GSs is free with respect to chronological order. However, the order Athens, Rome, Cairo, Baghdad, Beijing, Shanghai, for example, seems better than the order Athens, Baghdad, Beijing, Cairo, Rome, Shanghai, because GSs that relate to each other are placed close together. We assume that this relatedness is approximated by geographical distance.

We built the following hypotheses for the ordering in a GS.

(H6) As an exception to hypothesis (H3), an HE can be described either before or after another HE if they are in the inclusion relation.
(H7) GSs in a GS are described in order of short geographical distance.

Hypothesis (H6) complements hypothesis (H3).

The chronological ordering $\mathrm{sc}_{CO}()$ and the geographical ordering $\mathrm{sc}_{GO}()$ are calculated by the following functions:

\[ \mathrm{sc}_{CO}(E) = \frac{K - L}{K + L} \tag{6} \]

\[ \mathrm{sc}_{GO}(E) = \frac{1}{\mathrm{geochange}(E) + 1} \tag{7} \]

\[ \mathrm{geochange}(E) = \frac{1}{|E| - 1} \sum_{i=1}^{|E|-1} \mathrm{distance}(\mathrm{range}(e_i), \mathrm{range}(e_{i+1})) \tag{8} \]

where $K$ is the number of concordant pairs of HEs in $E$, and $L$ is the number of discordant pairs. $\mathrm{range}(e)$ is a function that returns the thesaurus node that is the nearest common node subsuming all geographical entities included in the HE $e$, and $\mathrm{distance}(n_i, n_j)$ is a function that returns the shortest distance between the thesaurus nodes $n_i$ and $n_j$.
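Equations (6) to (8) can be sketched as follows. Several details are assumptions of this sketch and are not specified in the paper: a pair of HEs in the inclusion relation is counted as concordant (one way to honour hypothesis (H6)), a sequence with fewer than two HEs scores 1, and the thesaurus is a toy child-to-parent dictionary named `PARENT` with hypothetical region nodes.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Period = Tuple[int, int]  # (begin, end)


def includes(a: Period, b: Period) -> bool:
    """True if period a contains period b."""
    return a[0] <= b[0] and b[1] <= a[1]


def chronological_ordering(periods: Sequence[Period]) -> float:
    """sc_CO = (K - L) / (K + L) over all pairs, in descriptive order."""
    k = l = 0
    for a, b in combinations(periods, 2):  # a is described before b
        if includes(a, b) or includes(b, a) or a[0] <= b[0]:
            k += 1  # inclusion pairs are treated as concordant (assumption)
        else:
            l += 1
    return (k - l) / (k + l) if k + l else 1.0


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path


def distance(a: str, b: str, parent: Dict[str, str]) -> int:
    """Number of edges between two nodes via their nearest common ancestor."""
    index_in_b = {n: i for i, n in enumerate(path_to_root(b, parent))}
    for i, n in enumerate(path_to_root(a, parent)):
        if n in index_in_b:
            return i + index_in_b[n]
    raise ValueError("nodes are not in the same tree")


def geographical_ordering(ranges: Sequence[str], parent: Dict[str, str]) -> float:
    """sc_GO = 1 / (geochange + 1); geochange is the mean consecutive distance."""
    if len(ranges) < 2:
        return 1.0
    steps = [distance(x, y, parent) for x, y in zip(ranges, ranges[1:])]
    return 1.0 / (sum(steps) / len(steps) + 1.0)


# Toy thesaurus (hypothetical): child -> parent.
PARENT = {
    "Athens": "Europe", "Rome": "Europe",
    "Cairo": "Middle East", "Baghdad": "Middle East",
    "Beijing": "East Asia", "Shanghai": "East Asia",
    "Europe": "World", "Middle East": "World", "East Asia": "World",
}
grouped = geographical_ordering(
    ["Athens", "Rome", "Cairo", "Baghdad", "Beijing", "Shanghai"], PARENT)
scattered = geographical_ordering(
    ["Athens", "Baghdad", "Beijing", "Cairo", "Rome", "Shanghai"], PARENT)
```

With this toy thesaurus the grouped city order from the text scores higher than the alphabetical one, in line with hypothesis (H7).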

Cooperability

Let us consider the cooperability of a world history essay with the question constraints in terms of chronological and geographical information. As described in Section 3, world history essay questions give chronological and geographical conditions such as "up to and including the first half of the 18th century" and "West Europe, West Asia and East Asia." In this case, an essay that describes only the ancient histories of West Europe, West Asia, and East Asia satisfies the conditions logically. However, it does not reflect the intention of the question. A cooperative essay should describe at least one HE of the 18th century. The same holds for geographical information. For example, an essay describing only "West Europe and West Asia" violates the maxim of quantity, and a cooperative essay should describe at least one HE for each area of the geographical condition. We assume that chronological cooperability is observed in all GSs, while geographical cooperability is observed only in the GS corresponding to the essay. For a GS, we defined the period from the beginning of the earliest HE to the end of the latest one as the period of the GS. The smallest geographical range that includes where the HEs in a GS occurred was defined as the range of the GS. We assume that observance of the maxim of quantity is approximated by how well the period and the range of GSs cover the conditions.

We built the following hypotheses for the cooperability on the chronological and the geographical conditions in questions.

(H8) The period of a GS covers the period of the chronological condition as closely as possible.
(H9) The range of the GS corresponding to the essay covers the range of the geographical condition as closely as possible.

The chronological cooperability $\mathrm{sc}_{CC}()$ and the geographical cooperability $\mathrm{sc}_{GC}()$ are calculated by the following functions:

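\[ \mathrm{sc}_{CC}(E, CC) = \frac{\mathrm{overlap}(\mathrm{period}(E), CC)}{\mathrm{extend}(\mathrm{period}(E), CC)} \tag{9} \]

\[ \mathrm{sc}_{GC}(E, GC) = \frac{2 P(E, GC) R(E, GC)}{P(E, GC) + R(E, GC)} \tag{10} \]

\[ P(E, GC) = \frac{\mathrm{subsumed}(\mathrm{geoentities}(E), GC)}{|\mathrm{geoentities}(E)|} \tag{11} \]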
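A small sketch of the two cooperability scores follows. It takes overlap as the length of the intersection of the essay period and the condition period, extend as the length of their combined span, and R(E, GC) as the share of condition entities that subsume at least one essay entity, following the definitions given in the paper. The toy thesaurus `PARENT`, the integer years, and the handling of empty inputs and zero-length spans are assumptions of this sketch.

```python
from typing import Dict, Sequence, Set, Tuple

Period = Tuple[float, float]  # (begin, end)


def period(event_periods: Sequence[Period]) -> Period:
    """Earliest beginning and latest ending among the HEs."""
    begins, ends = zip(*event_periods)
    return (min(begins), max(ends))


def overlap(p1: Period, p2: Period) -> float:
    return max(0.0, min(p1[1], p2[1]) - max(p1[0], p2[0]))


def extend(p1: Period, p2: Period) -> float:
    return max(p1[1], p2[1]) - min(p1[0], p2[0])


def chronological_cooperability(event_periods: Sequence[Period], cc: Period) -> float:
    """sc_CC = overlap / extend."""
    p = period(event_periods)
    span = extend(p, cc)
    return overlap(p, cc) / span if span > 0 else 1.0  # zero span: assumption


def subsumes(ancestor: str, node: str, parent: Dict[str, str]) -> bool:
    """True if ancestor is node itself or lies on node's path to the root."""
    current = node
    while current is not None:
        if current == ancestor:
            return True
        current = parent.get(current)
    return False


def geographical_cooperability(entities: Set[str], gc: Set[str],
                               parent: Dict[str, str]) -> float:
    """sc_GC = harmonic mean of P and R."""
    if not entities or not gc:
        return 0.0  # assumption: undefined cases score zero
    subsumed = sum(1 for e in entities if any(subsumes(g, e, parent) for g in gc))
    subsuming = sum(1 for g in gc if any(subsumes(g, e, parent) for e in entities))
    p, r = subsumed / len(entities), subsuming / len(gc)
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy thesaurus and inputs (hypothetical): child -> parent.
PARENT = {
    "France": "West Europe", "England": "West Europe",
    "Persia": "West Asia", "China": "East Asia",
    "West Europe": "World", "West Asia": "World", "East Asia": "World",
}
GC = {"West Europe", "West Asia", "East Asia"}
sc_gc = geographical_cooperability({"France", "England", "Persia"}, GC, PARENT)
sc_cc = chronological_cooperability([(1500, 1600), (1650, 1700)], (1000, 1750))
```

In the example the essay mentions nothing in East Asia, so precision is 1 but recall is 2/3, and the geographical cooperability is penalized accordingly.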

PROPOSED METHOD

Figure 2 shows the outline of the proposed method. First, the input essay is segmented into HEs at punctuation marks. An HE is represented by the set of named entities extracted from its segment. Some named entities evoke chronological and/or geographical information. Because exam cram books cover such information, we constructed a database of world history terms based on the world history glossary published by Yamakawa Shuppan-sha.⁴ Using the database, the named entities are converted into chronological and geographical information. From these two information sets, the period and the range of the segment are determined in the same way as those of a GS, as described in Section 4.4. They are regarded as the chronological and geographical information of the HE. Then, all hierarchical structures of GSs that can be obtained from the essay are enumerated. After the HEs are scored for each hierarchical structure, the maximum score is selected as the final score of the essay, which amounts to selecting the most plausible hierarchical structure. Based on the hypotheses described in Section 4, the score sc of a GS for a question is calculated recursively by the following functions.

\[ \mathrm{sc}(E, SS, CC, GC) = \begin{cases} \mathrm{sc}_{T}(E, CC) & \text{if it is a terminal section} \\ \mathrm{sc}_{N}(E, SS, CC)\, \mathrm{sc}_{GC}(E, GC) & \text{if it is the root section} \\ \mathrm{sc}_{N}(E, SS, CC) & \text{otherwise} \end{cases} \tag{13} \]

where $\mathrm{events}(s)$ and $\mathrm{sections}(s)$ are functions that return, respectively, the HE sequence and the GS sequence included in a GS $s$.
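The recursion can be sketched as below. The sketch follows the case order of Equation (13), takes the terminal score as the product of the chronological ordering, geographical ordering, and chronological cooperability scores, and takes the non-terminal score as the two uniformity scores times the mean score of the sub-sections, as the paper defines them in Equations (14) and (15). The component scores are passed in as callables so the block stays self-contained; `Section`, `Scorers`, `score`, and `best_score` are hypothetical names, and the constant stub scorers exist only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence


@dataclass
class Section:
    events: list                                   # HE sequence E
    subsections: List["Section"] = field(default_factory=list)  # GS sequence SS


@dataclass
class Scorers:
    co: Callable[[list], float]        # sc_CO(E)
    go: Callable[[list], float]        # sc_GO(E)
    cc: Callable[[list, Any], float]   # sc_CC(E, CC)
    gu: Callable[[list], float]        # sc_GU(SS)
    qu: Callable[[list], float]        # sc_QU(SS)
    gc: Callable[[list, Any], float]   # sc_GC(E, GC)


def score(section: Section, cc: Any, gc: Any, f: Scorers, is_root: bool = True) -> float:
    """Recursive score of a GS, following the case order of Equation (13)."""
    events, subs = section.events, section.subsections
    if not subs:
        # Terminal section: product of ordering and cooperability scores.
        return f.co(events) * f.go(events) * f.cc(events, cc)
    child_total = sum(score(s, cc, gc, f, is_root=False) for s in subs)
    sc_n = f.gu(subs) * f.qu(subs) * child_total / len(subs)
    # Only the root section is multiplied by the geographical cooperability.
    return sc_n * f.gc(events, gc) if is_root else sc_n


def best_score(candidates: Sequence[Section], cc: Any, gc: Any, f: Scorers) -> float:
    """Final essay score: the maximum over candidate hierarchical structures."""
    return max(score(c, cc, gc, f) for c in candidates)


# Stub scorers with constant values, for illustration only.
stub = Scorers(
    co=lambda e: 0.5, go=lambda e: 0.5, cc=lambda e, c: 0.5,
    gu=lambda ss: 1.0, qu=lambda ss: 1.0, gc=lambda e, g: 0.5,
)
leaf_a, leaf_b = Section(events=["a"]), Section(events=["b"])
nested = Section(events=["a", "b"], subsections=[leaf_a, leaf_b])
flat = Section(events=["a", "b"])
final = best_score([flat, nested], cc=None, gc=None, f=stub)
```

Because a terminal section is checked first, an essay whose best structure is a single flat section is scored by the terminal formula alone, which is how Equation (13) reads when taken literally.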

EXPERIMENTAL RESULT

Using essays submitted to QA Lab-2 and QA Lab-3, we compared the scores measured by the proposed method with the scores given by a human expert. Although there are only 55 essays, each is annotated by the human expert with the marks awarded and the marks subtracted in addition to the total score. Basically, the marks awarded reflect the correctness of the content, and the marks subtracted reflect ill-formedness. We therefore compared our scores with the subtracted marks. Note that marks are subtracted not only for chronological and geographical inconsistencies. Figure 3 shows the scatter plot of the scores by our method against the subtracted marks. The correlation coefficient was 0.21, which indicates a weak positive correlation. Considering that the subtracted marks have causes other than chronological and geographical problems, this value seems fairly good.

⁴ http://www.yamakawa.co.jp/ (in Japanese)
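For readers who want to reproduce this kind of comparison, the sketch below computes a correlation coefficient between two score lists. The paper does not say which coefficient was used; Pearson's r is assumed here, and the function name and the input values are purely illustrative, not the experimental data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: method scores vs. expert marks for five essays.
method_scores = [0.10, 0.25, 0.30, 0.45, 0.50]
expert_marks = [1.0, 3.0, 2.0, 5.0, 4.0]
r = pearson_r(method_scores, expert_marks)
```

If the ranking agreement matters more than linearity, a rank correlation such as Kendall's tau would be a reasonable alternative.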

Figure 1: An example of an essay question of world history

Figure 1 shows an example of an essay question for world history, which is an English translation of the original Japanese version.

Figure 2: The outline of the proposed method

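\[ \mathrm{sc}_{T}(E, CC) = \mathrm{sc}_{CO}(E)\, \mathrm{sc}_{GO}(E)\, \mathrm{sc}_{CC}(E, CC) \tag{14} \]

\[ \mathrm{sc}_{N}(E, SS, CC) = \frac{1}{|SS|}\, \mathrm{sc}_{GU}(SS)\, \mathrm{sc}_{QU}(SS) \sum_{i=1}^{|SS|} \mathrm{sc}(\mathrm{events}(s_i), \mathrm{sections}(s_i), CC, GC) \tag{15} \]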

\[ R(E, GC) = \frac{\mathrm{subsuming}(\mathrm{geoentities}(E), GC)}{|GC|} \tag{12} \]

where $\mathrm{period}(E)$ is a function that returns the pair of the earliest time and the latest time in $E$, $\mathrm{overlap}(P_1, P_2)$ is a function that returns the length of the overlap between the periods $P_1$ and $P_2$, and $\mathrm{extend}(P_1, P_2)$ is a function that returns the length of the period between the earliest time and the latest time among $P_1$ and $P_2$. $\mathrm{geoentities}(E)$ is a function that returns the set of geographical entities included in $E$, $\mathrm{subsumed}(G_1, G_2)$ is a function that returns the number of geographical entities of $G_1$ subsumed by geographical entities of $G_2$, and $\mathrm{subsuming}(G_1, G_2)$ is a function that returns the number of geographical entities of $G_2$ subsuming geographical entities of $G_1$.

CONCLUSION

For world history essays in Japanese university entrance exams, we proposed a method for measuring uniformity, ordering, and cooperability in terms of chronological and geographical information. The features of well-formedness were found by observing several model answer essays. From the experimental result, we found a weak positive correlation between the scores measured by our method and the scores estimated by a human expert. We will investigate more appropriate functions in the future.

¹ http://research.nii.ac.jp/ntcir/index-en.html
² http://duc.nist.gov/duc2007/tasks.html
³ http://www.nist.gov/tac/2011/Summarization/Guided-Summ.2011.guidelines.html

Copying permitted for private and academic purposes. EVIA, co-located with NTCIR-13, 5 December 2017, Tokyo, Japan. © 2017 Copyright held by the author.

REFERENCES

[1] Eugene Agichtein, David Carmel, Donna Harman, Dan Pelleg, and Yuval Pinter. 2015. Overview of the TREC 2015 LiveQA Track. In Proceedings of The Twenty-Fourth Text REtrieval Conference.
[2] Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. 2002. Inferring Strategies for Sentence Ordering in Multidocument News Summarization. Journal of Artificial Intelligence Research 17, 1 (2002).
[3] Sandro Bauer and Simone Teufel. 2015. A Methodology for Evaluating Timeline Generation Algorithms based on Deep Semantic Units. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Volume 2.
[4] Davide Buscaldi, Jorge J. Garcia Flores, Joseph Le Roux, and Nadi Tomeh. 2014. LIPN: Introducing a new Geographical Context Similarity Measure and a Statistical Similarity Measure Based on the Bhattacharyya Coefficient. In Proceedings of the 8th International Workshop on Semantic Evaluation.
[5] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out.
[6] Nitin Madnani, Rebecca Passonneau, Necip Fazil Ayan, John M. Conroy, Bonnie J. Dorr, Judith L. Klavans, Dianne P. O'Leary, and Judith D. Schlesinger. 2007. Measuring Variability in Sentence Ordering for News Summarization. In Proceedings of the Eleventh European Workshop on Natural Language Generation.
[7] Ani Nenkova and Rebecca J. Passonneau. 2004. Evaluating Content Selection in Summarization: The Pyramid Method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.
[8] Naoaki Okazaki, Yutaka Matsuo, and Mitsuru Ishizuka. 2004. Improving Chronological Sentence Ordering by Precedence Relation. In Proceedings of the 20th International Conference on Computational Linguistics.
[9] Rebecca J. Passonneau, Emily Chen, Weiwei Guo, and Dolores Perin. 2013. Automated Pyramid Scoring of Summaries using Distributional Semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.
[10] Hideyuki Shibuki, Kotaro Sakamoto, Madoka Ishioroshi, Akira Fujita, Yoshinobu Kano, Teruko Mitamura, Tatsunori Mori, and Noriko Kando. 2017. Overview of the NTCIR-13 QA Lab-3 Task. In Proceedings of The NTCIR-13 Conference.
[11] Joachim Wagner, Jennifer Foster, and Josef van Genabith. 2007. A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.