=Paper=
{{Paper
|id=Vol-2008/paper_10
|storemode=property
|title=Automatic Evaluation of World History Essay Using Chronological and Geographical Measures
|pdfUrl=https://ceur-ws.org/Vol-2008/paper_10.pdf
|volume=Vol-2008
|authors=Kotaro Sakamoto,Hideyuki Shibuki,Madoka Ishioroshi,Akira Fujita,Yoshinobu Kano,Teruko Mitamura,Tatsunori Mori,Noriko Kando
|dblpUrl=https://dblp.org/rec/conf/ntcir/SakamotoSIFKMMK17
}}
==Automatic Evaluation of World History Essay Using Chronological and Geographical Measures==
Kotaro Sakamoto, Yokohama National University and National Institute of Informatics, sakamoto@forest.eis.ynu.ac.jp
Hideyuki Shibuki, Yokohama National University, shib@forest.eis.ynu.ac.jp
Madoka Ishioroshi, National Institute of Informatics, ishioroshi@nii.ac.jp
Akira Fujita, Yokohama National University, fujita@ynu.ac.jp
Yoshinobu Kano, Shizuoka University, kano@inf.shizuoka.ac.jp
Teruko Mitamura, Carnegie Mellon University, teruko@cs.cmu.edu
Tatsunori Mori, Yokohama National University, mori@forest.eis.ynu.ac.jp
Noriko Kando, National Institute of Informatics and SOKENDAI, kando@nii.ac.jp

Copying permitted for private and academic purposes. EVIA 2017, co-located with NTCIR-13, 5 December 2017, Tokyo, Japan. © 2017 Copyright held by the author.

ABSTRACT

We propose a method for measuring chronological and geographical consistency of the world history essays in Japanese university entrance exams. The experimental result shows a weak positive correlation between the scores measured by the proposed method and the scores estimated by a human expert in world history.

KEYWORDS

essay QA, automated evaluation, chronological and geographical measures, world history, university entrance exams

1 INTRODUCTION

Research on real-world complex question-answering (QA) has flourished in recent years [1]. In the QA Lab tasks [10] at the NTCIR workshop (http://research.nii.ac.jp/ntcir/index-en.html), the current problems and solutions in QA technologies have been investigated using the world history questions in Japanese university entrance exams and their English translation. Japanese university entrance exams include various types of questions such as multiple-choice, fill-in-the-blank, true-or-false and essay questions. Above all, essay QA is the most challenging, and still has many open problems, such as the evaluation of essays that QA systems generated. Although essays can be evaluated by human experts in world history, this takes considerable time and cost. In the case of the QA Lab, evaluation of 46 essays by an expert who teaches world history took around a month and about 500,000 yen (4,500 USD). Therefore, a new method is required.

Because essay generation is regarded as a kind of query-biased summarization, the measures for evaluating summaries using gold-standard data can be applied to essay evaluation. In the QA Lab, the ROUGE family [5] and the Pyramid method [7, 9] are used for grading essays besides a human expert’s evaluation. The positive correlation between these grades and those provided by humans was between moderate and weak, and the ranking order by the measures was not always concordant with the ranking order given by the human marks. Therefore, we investigated more appropriate measures for evaluating world history essays in Japanese university entrance exams.

For evaluating summaries, the linguistic well-formedness and the relative responsiveness were used in the DUC workshops (http://duc.nist.gov/duc2007/tasks.html). The content, readability/fluency, and the overall responsiveness were used at the Guided Summarization tasks (http://www.nist.gov/tac/2011/Summarization/Guided-Summ.2011.guidelines.html) in the TAC workshops. These measures are important for evaluating world history essays in university entrance exams. However, the linguistic well-formedness and readability/fluency were scored arbitrarily by human assessors, while the content was methodologically scored by the ROUGE family and the Pyramid method, among others. We would like to methodologically give other scores based on merits other than the content. For evaluating world history essays, chronological and geographical consistency is important as a kind of semantic consistency. However, how to evaluate these is not obvious. In this paper, we propose a method for measuring chronological and geographical consistency of world history essays, and examine the method using essays submitted to the QA Lab.

The main contributions of this paper are as follows: (i) to clarify the features of well-formed world history essays in terms of the chronological information and the geographical information, (ii) to introduce a new scoring method based on the features to evaluate the well-formedness of world history essays.

2 RELATED WORK

The linguistic well-formedness in the DUC workshop and the readability/fluency in the TAC Guided Summarization tasks were evaluated in terms of grammaticality, non-redundancy, referential clarity, focus, and ‘structure and coherence’. Our measures relate to the focus and ‘structure and coherence’. Although Barzilay et al. [2] and Okazaki et al. [8] researched the chronological ordering, they did not take account of geographical information. Buscaldi et al. [4] found that geography is related to semantic similarity, but they only aimed to measure semantic equivalence between two
text snippets. Because Madnani et al. [6] only researched sentence ordering, the research only applied to the context of a short, domain-independent summarization. Bauer and Teufel [3] proposed the extended Pyramid method for timeline summarization, but they did not focus on the well-formedness. Although Wagner et al. [11] researched the well-formedness, they focused only on grammatical errors. Therefore, there is no research on a methodology for measuring the focus and the structure and coherence of world history essays in terms of the chronological and geographical information.

3 ESSAY QUESTION OF WORLD HISTORY

Figure 1 shows an example of an essay question for world history, which is an English translation from the original Japanese version. The question contains additional text besides the main essay topic. The first paragraph gives background information, and the texts below the essay topic are the constraints for writing the essay. The constraints include a length limitation of “no more than 20 lines,” a geographical condition of “West Europe, West Asia and East Asia,” a chronological condition of “up to and including the first half of the 18th century,” the keywords that must be used in the essay, and other associated conditions. The chronological condition and the geographical condition show the importance of chronological and geographical consistency.

Figure 1: An example of essay question of world history

4 WELL-FORMED WORLD HISTORY ESSAY

4.1 Structure

In general, a world history essay is a sequential description of historical events (HEs). An HE has both chronological information and geographical information. Let us consider how this is written. While the chronological information can easily be put in a linear order from the past to the future, the geographical information cannot be put in a definite linear order because of its spatial extent. Based on the study of several model answer essays from past university entrance exam collections, the general structure of the essays follows one of two approaches: (a) disregarding geographical information, all HEs are described in chronological order, and (b) grouping HEs by the geographical information. In both, information is described in chronological order. If the former is regarded as grouping by the geographical information of “the whole world,” there is no difference between the two approaches; that is, both are descriptions in chronological order for HEs in a particular area. We defined a sequence of HEs with the same geographical information as a geographical section (GS). GSs can be nested hierarchically. For example, a GS of Europe may contain GSs such as England, France, and Germany, and the GS of England may contain GSs such as London, Birmingham and Manchester.

From the above, we built the following hypotheses for the structure of a world history essay.

(H1) An essay is a GS.

(H2) A GS can consist of more than one sub-GS, each of which lies within the parent GS.

(H3) HEs in a GS are put in chronological order.

GSs in a hierarchical structure are classified into terminal and non-terminal sections. A terminal section is an HE sequence without hierarchical structure, whereas a non-terminal section can be divided into several GSs. We defined the non-terminal section corresponding to the essay as the root section. A GS s is defined as a pair of an HE sequence E = (e_1, e_2, ..., e_m) and a GS sequence SS = (s_1, s_2, ..., s_n). If SS is an empty tuple, then the GS is a terminal section. HEs in a sub-GS are shared with the superordinate GS, and the E of a non-terminal section is not empty. For a question, the chronological condition CC is defined as a pair of the beginning time bt and the ending time et, and the geographical condition GC is defined as a set of geographical entities {g_1, g_2, ..., g_k}.

4.2 Uniformity

Let us consider the uniformity of GSs in a GS. If GSs of the East Midlands, Paris and Germany are placed on the same level in a GS of Europe, they are incongruous even though they are all parts of Europe. This is because they are in different levels of a geographical category, such as country, region, and city. Therefore, well-formed essays require uniformity of geographical category level. In addition, if England is described with hundreds of words while France and Germany are each described with a dozen words, there is incongruity even though they are in the same geographical category level. This is because their quantities of description are imbalanced. Therefore, well-formed essays seem to require uniformity of quantity.

We built the following hypotheses for the uniformity of GSs.

(H4) GSs placed on the same level in a GS are in the same level of geographical category.

(H5) GSs placed on the same level in a GS are described in the same quantity.

Although several functions could implement these hypotheses, we adopted simple ones as a first experiment. The geographical uniformity sc_GU() and the quantity uniformity sc_QU() are calculated by the following functions:

<math>\mathrm{sc}_{GU}(SS) = 1 - \frac{\mathrm{sd}_{GU}(SS)}{\mathrm{am}_{GU}(SS)}</math> (1)
<math>\mathrm{sd}_{GU}(SS) = \sqrt{\frac{1}{|SS|} \sum_{i=1}^{|SS|} \left(\mathrm{depth}(s_i) - \mathrm{am}_{GU}(SS)\right)^2}</math> (2)

<math>\mathrm{am}_{GU}(SS) = \frac{1}{|SS|} \sum_{i=1}^{|SS|} \mathrm{depth}(s_i)</math> (3)

<math>\mathrm{sc}_{QU}(SS) = \frac{-\sum_{i=1}^{|SS|} p(s_i, SS) \log_2 p(s_i, SS)}{\log_2 |SS|}</math> (4)

<math>p(s, SS) = \frac{\mathrm{length}(s)}{\sum_{i=1}^{|SS|} \mathrm{length}(s_i)}</math> (5)

where depth(s) is a function that returns the distance between the thesaurus root node and the node corresponding to the range of s, and length(s) is a function that returns the number of characters described in s. We designed the scoring functions to be normalized into the range [0, 1].

4.3 Ordering

Let us consider the ordering of HEs in a GS. HEs in well-formed essays are generally described in chronological order. Note that the occurrence order of HEs does not always correspond with the descriptive order of an essay. Since the chronological information of an HE is a range with a beginning and an ending, the occurrence order relation between HEs is either non-overlapping, partially overlapping or inclusive. In all of these relations, the beginning of the HE e_1 precedes the beginning of the HE e_2. However, in the inclusion relation, e_1 may be described after e_2, as in “The Treaty of Nanking ended the First Opium War.” Therefore, we assume that the describing order of HEs in the inclusion relation is independent of the chronological order. Next, let us consider the ordering of GSs in a GS. The describing order of GSs is independent of the chronological order. However, for example, the describing order of Athens, Rome, Cairo, Baghdad, Beijing and Shanghai seems to be better than the order of Athens, Baghdad, Beijing, Cairo, Rome and Shanghai. This is because GSs that relate to each other are placed close together. We assume that this relatedness is approximated by geographical distance.

We built the following hypotheses for the ordering in a GS.

(H6) As an exception to hypothesis (H3), an HE can be described either before or after another HE if they are in the inclusion relation.

(H7) GSs in a GS are described in an order that keeps geographical distances short.

Hypothesis (H6) complements hypothesis (H3). The chronological ordering sc_CO() and the geographical ordering sc_GO() are calculated by the following functions:

<math>\mathrm{sc}_{CO}(E) = \frac{K - L}{K + L}</math> (6)

<math>\mathrm{sc}_{GO}(E) = \frac{1}{\mathrm{geochange}(E) + 1}</math> (7)

<math>\mathrm{geochange}(E) = \frac{1}{|E| - 1} \sum_{i=1}^{|E|-1} \mathrm{distance}(\mathrm{range}(e_i), \mathrm{range}(e_{i+1}))</math> (8)

where K is the number of concordant pairs of HEs in E, and L is the number of discordant pairs. range(e) is a function that returns the thesaurus node that is the nearest common node subsuming all geographical entities included in the HE e, and distance(n_i, n_j) is a function that returns the shortest distance between the thesaurus nodes n_i and n_j.

4.4 Cooperability

Let us consider the cooperability of a world history essay with the question constraints in terms of the chronological and the geographical information. As described in Section 3, world history essay questions give chronological and geographical conditions such as “up to and including the first half of the 18th century” and “West Europe, West Asia and East Asia.” In this case, if an essay describes only the ancient histories of West Europe, West Asia and East Asia, the essay satisfies the conditions logically. However, it does not reflect the intention of the question. A cooperative essay should describe at least one HE of the 18th century. The same holds for the geographical information. For example, an essay describing only “West Europe and West Asia” violates the maxim of quantity, and a cooperative essay should describe at least one HE for each area of the geographical condition. We assume that the chronological cooperability is observed in all GSs while the geographical cooperability is observed only in the GS corresponding to the essay. For a GS, we defined the period from the beginning of the earliest HE to the end of the latest one as the period of the GS. The smallest geographical range that includes where the HEs in a GS occurred was defined as the range of the GS. We assume that the observance of the maxim of quantity is approximated by the coverage of the period and the range of GSs.

We built the following hypotheses for the cooperability on the chronological and the geographical conditions in questions.

(H8) The period of a GS covers the period of the chronological condition as closely as possible.

(H9) The range of the GS corresponding to the essay covers the range of the geographical condition as closely as possible.

The chronological cooperability sc_CC() and the geographical cooperability sc_GC() are calculated by the following functions:

<math>\mathrm{sc}_{CC}(E, CC) = \frac{\mathrm{overlap}(\mathrm{period}(E), CC)}{\mathrm{extend}(\mathrm{period}(E), CC)}</math> (9)

<math>\mathrm{sc}_{GC}(E, GC) = \frac{2\,P(E, GC)\,R(E, GC)}{P(E, GC) + R(E, GC)}</math> (10)

<math>P(E, GC) = \frac{\mathrm{subsumed}(\mathrm{geoentities}(E), GC)}{|\mathrm{geoentities}(E)|}</math> (11)

<math>R(E, GC) = \frac{\mathrm{subsuming}(\mathrm{geoentities}(E), GC)}{|GC|}</math> (12)

where period(E) is a function that returns a pair of the earliest time and the latest time in E, overlap(P_1, P_2) is a function that returns the length of the overlap period between P_1 and P_2, and extend(P_1, P_2) is a function that returns the length of the period between the earliest time and the latest time among P_1 and P_2. geoentities(E) is a function that returns the set of geographical entities included in E, subsumed(G_1, G_2) is a function that returns the number of geographical entities of G_1 subsumed by geographical entities of G_2, and subsuming(G_1, G_2) is a function that returns the number of geographical entities of G_2 subsuming geographical entities of G_1.

5 PROPOSED METHOD

Figure 2 shows the outline of the proposed method. First, the input essay is segmented into HEs by punctuation marks. An HE is represented by a set of named entities extracted from the segment. Some
named entities evoke the chronological and/or the geographical information. Because exam cram books cover such information, we constructed a database of world history terms based on the world history glossary published by Yamakawa Shuppan-sha (http://www.yamakawa.co.jp/, in Japanese). Using the database, the named entities are converted into chronological and geographical information. Using both chronological and geographical information sets, the period and the range of the segment are respectively determined in the same way as those of the GS described in Section 4.4. They are regarded as the chronological and geographical information of the HE. Then, all hierarchical structures of GSs that can be obtained from the essay are listed. After scoring the HEs for each hierarchical structure, the maximum score is selected as the final score for the essay in order to select the most plausible hierarchical structure.

Figure 2: The outline of the proposed method

Based on the hypotheses described in Section 4, the score sc of a GS for a question is recursively calculated by the following functions.

<math>\mathrm{sc}(E, SS, CC, GC) = \begin{cases} \mathrm{sc}_T(E, CC) & \text{if it is a terminal section} \\ \mathrm{sc}_N(E, SS, CC)\,\mathrm{sc}_{GC}(E, GC) & \text{if it is the root section} \\ \mathrm{sc}_N(E, SS, CC) & \text{otherwise} \end{cases}</math> (13)

<math>\mathrm{sc}_T(E, CC) = \mathrm{sc}_{CO}(E)\,\mathrm{sc}_{GO}(E)\,\mathrm{sc}_{CC}(E, CC)</math> (14)

<math>\mathrm{sc}_N(E, SS, CC) = \mathrm{sc}_{GU}(SS)\,\mathrm{sc}_{QU}(SS)\,\frac{1}{|SS|} \sum_{i=1}^{|SS|} \mathrm{sc}(\mathrm{events}(s_i), \mathrm{sections}(s_i), CC, GC)</math> (15)

where events(s) and sections(s) are functions that return, respectively, the HE sequence and the GS sequence included in a GS s.

6 EXPERIMENTAL RESULT

Using essays submitted to the QA Lab-2 and the QA Lab-3, we compared the scores measured by the proposed method and the scores given by a human expert. Although the number of essays is only 55, they are annotated with the marks awarded and the marks deducted in addition to the total score given by the human expert. Basically, the marks awarded take account of the correctness of the content, and the marks deducted account for the ill-formedness. We therefore compared the scores produced by our method with the deducted marks. Note that marks are deducted for reasons other than chronological and geographical inconsistencies as well. Figure 3 shows the scatter plot between the scores by our method and the deducted marks. The correlation coefficient was 0.21, which indicates a weak positive correlation. Taking into account that the deducted marks include causes other than chronological and geographical problems, the value seems to be fairly good.

Figure 3: The scatter plot between the scores by the method and the scores by a human expert

7 CONCLUSION

For world history essays in Japanese university entrance exams, we proposed a method for measuring the uniformity, ordering and cooperability in terms of the chronological and the geographical information. The features of well-formedness were found by observing several model answer essays. From the experimental result, we found a weak positive correlation between the scores measured by our method and the scores estimated by a human expert. We will investigate more appropriate functions in the future.

REFERENCES

[1] Eugene Agichtein, David Carmel, Donna Harman, Dan Pelleg, and Yuval Pinter. 2015. Overview of the TREC 2015 LiveQA Track. In Proceedings of The Twenty-Fourth Text REtrieval Conference.

[2] Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. 2002. Inferring Strategies for Sentence Ordering in Multidocument News Summarization. Journal of Artificial Intelligence Research 17, 1 (2002), 35–55.

[3] Sandro Bauer and Simone Teufel. 2015. Improving Chronological Sentence Ordering by Precedence Relation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Vol. 2. 834–839.

[4] Davide Buscaldi, Jorge J. Garcia Flores, Joseph Le Roux, and Nadi Tomeh. 2014. LIPN: Introducing a new Geographical Context Similarity Measure and a Statistical Similarity Measure Based on the Bhattacharyya Coefficient. In Proceedings of the 8th International Workshop on Semantic Evaluation. 400–405.

[5] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of Workshop on Text Summarization Branches Out. 74–81.

[6] Nitin Madnani, Rebecca Passonneau, Necip Fazil Ayan, John M. Conroy, Bonnie J. Dorr, Judith L. Klavans, Dianne P. O’Leary, and Judith D. Schlesinger. 2007. Measuring Variability in Sentence Ordering for News Summarization. In Proceedings of the Eleventh European Workshop on Natural Language Generation. 81–88.

[7] Ani Nenkova and Rebecca J. Passonneau. 2004. Evaluating Content Selection in Summarization: The Pyramid Method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 145–152.

[8] Naoaki Okazaki, Yutaka Matsuo, and Mitsuru Ishizuka. 2004. Improving Chronological Sentence Ordering by Precedence Relation. In Proceedings of the 20th International Conference on Computational Linguistics. 81–88.

[9] Rebecca J. Passonneau, Emily Chen, Weiwei Guo, and Dolores Perin. 2013. Automated Pyramid Scoring of Summaries using Distributional Semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. 143–147.

[10] Hideyuki Shibuki, Kotaro Sakamoto, Madoka Ishioroshi, Akira Fujita, Yoshinobu Kano, Teruko Mitamura, Tatsunori Mori, and Noriko Kando. 2017. Overview of the NTCIR-13 QA Lab-3 Task. In Proceedings of The NTCIR-13 Conference.

[11] Joachim Wagner, Jennifer Foster, and Josef van Genabith. 2007. A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 112–121.
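As an illustrative aside to the scoring functions of Section 4, the following is a minimal Python sketch of Equations (1) to (6) and (9). It is not the authors' implementation. The input representations are assumptions made for illustration: a GS sequence is passed as a plain list of thesaurus depths or character counts, and an HE is a (begin, end) pair of years. The guards for undefined cases (a mean depth of zero, a single sub-GS, no comparable pairs) and the choice to leave pairs in the inclusion relation out of both K and L are also assumptions, since the text does not specify them.

```python
import math
from itertools import combinations


def sc_gu(depths):
    """Eqs. (1)-(3): 1 - population std. dev. of depths / mean depth."""
    mean = sum(depths) / len(depths)
    if mean == 0:
        return 1.0  # assumption: the source does not define this case
    sd = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    return 1.0 - sd / mean


def sc_qu(lengths):
    """Eqs. (4)-(5): entropy of length shares, normalised by log2 |SS|."""
    if len(lengths) < 2:
        return 1.0  # assumption: log2(1) = 0 leaves Eq. (4) undefined
    total = sum(lengths)
    shares = [n / total for n in lengths if n > 0]
    entropy = -sum(p * math.log2(p) for p in shares)
    return entropy / math.log2(len(lengths))


def sc_co(events):
    """Eq. (6): (K - L) / (K + L) for (begin, end) pairs in described order."""
    k = l = 0
    for a, b in combinations(events, 2):  # a is described before b
        a_in_b = b[0] <= a[0] and a[1] <= b[1]
        b_in_a = a[0] <= b[0] and b[1] <= a[1]
        if a_in_b or b_in_a:
            continue  # assumption: inclusion pairs count for neither K nor L (H6)
        if a[0] < b[0]:
            k += 1  # concordant: described order matches occurrence order
        else:
            l += 1  # discordant
    return (k - l) / (k + l) if k + l else 1.0


def sc_cc(events, cc):
    """Eq. (9): overlap(period(E), CC) / extend(period(E), CC)."""
    begin = min(b for b, _ in events)
    end = max(e for _, e in events)
    overlap = max(0, min(end, cc[1]) - max(begin, cc[0]))
    extent = max(end, cc[1]) - min(begin, cc[0])
    return overlap / extent if extent else 1.0
```

Under these assumptions, sub-GSs that all sit at thesaurus depth 2 give sc_gu([2, 2, 2]) = 1.0, while mixing depths 1 and 3 gives sc_gu([1, 3]) = 0.5, which mirrors the penalty that hypothesis (H4) is meant to express. The geographical measures of Equations (7), (8) and (10) to (12) are omitted because they depend on a thesaurus that the text does not describe in detail.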