Automatic Evaluation of World History Essay Using
                      Chronological and Geographical Measures
                 Kotaro Sakamoto                                       Hideyuki Shibuki                                    Madoka Ishioroshi
        Yokohama National University                             Yokohama National University                       National Institute of Informatics
        National Institute of Informatics                          shib@forest.eis.ynu.ac.jp                              ishioroshi@nii.ac.jp
         sakamoto@forest.eis.ynu.ac.jp

                     Akira Fujita                                       Yoshinobu Kano                                     Teruko Mitamura
         Yokohama National University                                   Shizuoka University                            Carnegie Mellon University
               fujita@ynu.ac.jp                                       kano@inf.shizuoka.ac.jp                             teruko@cs.cmu.edu

                                                  Tatsunori Mori                                 Noriko Kando
                                        Yokohama National University                  National Institute of Informatics
                                          mori@forest.eis.ynu.ac.jp                             SOKENDAI
                                                                                             kando@nii.ac.jp

ABSTRACT                                                                             measures was not always concordant with the ranking order given
We propose a method for measuring chronological and geographi-                       by the human marks. Therefore, we investigated more appropriate
cal consistency of the world history essays in Japanese university                   measures for evaluating world history essays in Japanese university
entrance exams. The experimental result shows a weak positive                        entrance exams.
correlation between the scores measured by the proposed method                          For evaluating summaries, the linguistic well-formedness and
and the scores estimated by a human expert in world history.                         the relative responsiveness were used in the DUC workshops.2 The
                                                                                     content, readability/fluency, and the overall responsiveness were
KEYWORDS                                                                             used at the Guided Summarization tasks3 in the TAC workshops.
                                                                                     These measures are important for evaluating world history essays in
essay QA, automated evaluation, chronological and geographical
                                                                                     university entrance exams. However, the linguistic well-formedness
measures, world history, university entrance exams
                                                                                     and readability/fluency were scored arbitrarily by human assessors,
1     INTRODUCTION                                                                   while the content was methodologically scored by the ROUGE
                                                                                     family and the Pyramid method, among others. We would like
Research on real-world complex question-answering (QA) has flour-
                                                                                     to score merits other than the content in an equally methodical
ished in recent years [1]. In the QA Lab tasks [10] at the NTCIR
                                                                                     way. For evaluating world history essays, chronological
workshop,1 the current problems and solutions in QA technolo-
                                                                                     and geographical consistency is important as a kind of semantic
gies have been investigated using the world history questions in
                                                                                     consistency. However, how to evaluate these is not obvious. In
Japanese university entrance exams and their English translation.
                                                                                     this paper, we propose a method for measuring chronological and
Japanese university entrance exams include various types of ques-
                                                                                     geographical consistency of world history essays, and examine
tions such as multiple-choice, fill-in-the-blank, true-or-false and
                                                                                     the method using essays submitted to the QA Lab.
essay questions. Above all, essay QA is the most challenging, and
                                                                                        The main contributions of this paper are as follows: (i) to clarify
still has many open problems, such as the evaluation of essays that
                                                                                     the features of well-formed world history essays in terms of the
QA systems generate. Although essays can be evaluated by
                                                                                     chronological information and the geographical information, and (ii) to
human experts in world history, this takes considerable time and cost.
                                                                                     introduce a new scoring method based on the features to evaluate
In the case of the QA Lab, evaluation of 46 essays by an expert who
                                                                                     the well-formedness of world history essays.
teaches world history took around a month and cost about 500,000 yen
(4,500 USD). Therefore, an automatic evaluation method is required.                  2    RELATED WORK
    Because essay generation is regarded as a kind of query-biased                   The linguistic well-formedness in the DUC workshop and the read-
summarization, the measures for evaluating summaries using gold-                     ability/fluency in the TAC Guided Summarization tasks were evalu-
standard data can be applied to essay evaluation. In the QA Lab,                     ated in terms of grammaticality, non-redundancy, referential clarity,
the ROUGE family [5] and the Pyramid method [7, 9] are used for                      focus, and ‘structure and coherence’. Our measures are related to
grading essays alongside a human expert’s evaluation. The positive                   the focus and ‘structure and coherence’. Although Barzilay et al.
correlation between these grades and the human marks                                 [2] and Okazaki et al. [8] researched the chronological ordering,
was only weak to moderate, and the ranking order by the                              they did not take account of geographical information. Buscaldi
1 http://research.nii.ac.jp/ntcir/index-en.html                                      et al. [4] found that geography is related to semantic similarity,
                                                                                     but they only aimed to measure semantic equivalence between two
Copying permitted for private and academic purposes.
EVIA 2017, co-located with NTCIR-13, 5 December 2017, Tokyo, Japan.                  2 http://duc.nist.gov/duc2007/tasks.html

© 2017 Copyright held by the author.                                                 3 http://www.nist.gov/tac/2011/Summarization/Guided-Summ.2011.guidelines.html


                                                                             geographical information, all HEs are described in chronological
                                                                             order, and (b) grouping HEs by the geographical information. In
                                                                             both, information is described in chronological order. If the former
                                                                             is regarded as a single group whose geographical information is “the
                                                                             whole world,” the two approaches do not differ; that is, both
                                                                             describe the HEs of a particular area in chronological order.
                                                                             We defined a sequence of HEs with the same geographical
                                                                             information as a geographical section (GS). GSs could be nested hi-
                                                                             erarchically. For example, a GS of Europe may contain GSs such as
                                                                             England, France, and Germany, and the GS of England may contain
                                                                             GSs such as London, Birmingham and Manchester.
                                                                                 From the above, we formed the following hypotheses about the
                                                                             structure of a world history essay.
                                                                               (H1) An essay is a GS.
                                                                               (H2) A GS can consist of more than one sub-GS, each of which
                                                                                    lies within the parent GS.
                                                                               (H3) HEs in a GS are put in chronological order.
                                                                                GSs in a hierarchical structure are classified into terminal and
                                                                             non-terminal sections. A terminal section is an HE sequence
                                                                             without hierarchical structure, whereas a non-terminal section
    Figure 1: An example of an essay question of world history               can be divided into several GSs. We defined a non-terminal section
text snippets. Because Madnani et al. [6] only researched sentence           corresponding to the essay as the root section. A GS s is defined
ordering, their research applies only to short, domain-independent           as a pair of an HE sequence E = (e_1, e_2, ..., e_m) and a GS sequence
summarization. Bauer and Teufel [3] proposed                                 SS = (s_1, s_2, ..., s_n). If SS is an empty tuple, then the GS is a
the extended Pyramid method for timeline summarization, but they             terminal section. HEs in a sub-GS are shared with the superordinate
did not focus on the well-formedness. Although Wagner et al. [11]            GS, and E of a non-terminal section is not empty. For a question,
researched the well-formedness, they focused only on grammatical             the chronological condition CC is defined as a pair of the beginning
errors. Therefore, there is no research on a methodology for                 time bt and the ending time et, and the geographical condition GC
measuring the focus and the structure and coherence of world history         is defined as a set of geographical entities {g_1, g_2, ..., g_k}.
essays in terms of the chronological and geographical information.
                                                                             4.2    Uniformity
3     ESSAY QUESTION OF WORLD HISTORY                                        Let us consider the uniformity of GSs in a GS. If GSs of the East
Figure 1 shows an example of an essay question for world history,            Midlands, Paris and Germany are placed on the same level in a GS
which is an English translation from the original Japanese version.          of Europe, they are incongruous even though they are all parts of
The question contains additional text besides the main essay topic.          Europe. This is because they are at different levels of a geographical
The first paragraph gives background information, and the texts              category, such as country, region, and city. Therefore, a well-formed
below the essay topic are the constraints for writing the essay. The         essay requires uniformity of geographical category level. In
constraints include a length limitation of “no more than 20 lines,” a        addition, if England is described with hundreds of words while
geographical condition of “West Europe, West Asia and East Asia,”            France and Germany are each described with a dozen words,
a chronological condition of “up to and including the first half of          there is incongruity even though they are in the same geographical
the 18th century,” the keywords that must be used in the essay, and          category level. This is because their quantities of description are
other associated conditions. The chronological condition and the             imbalanced. Therefore, a well-formed essay seems to require
geographical condition indicate the importance of chronological and          uniformity of quantity.
geographical consistency.                                                       We built the following hypotheses for the uniformity of GSs.
4 WELL-FORMED WORLD HISTORY ESSAY                                              (H4) GSs placed on the same level in a GS are in the same level
4.1 Structure                                                                       of geographical category.
                                                                               (H5) GSs placed on the same level in a GS are described in the
In general, a world history essay is a sequential description of
                                                                                    same quantity.
historical events (HEs). An HE has both chronological information
and geographical information. Let us consider how this is written.               Although several functions could implement the hypotheses,
While the chronological information can be easily put in a linear            we adopted simple functions as a first experiment.
order from the past to the future, the geographical information              The geographical uniformity scGU () and the quantity uniformity
cannot easily be put in a definite linear order because of its               scQU () are calculated by the following functions:
spatial extent. Based on the study of several model answer essays
from past university entrance exam collections, the general structure
of the essays follows one of two approaches: (a) disregarding
                                                                             $sc_{GU}(SS) = 1 - \frac{sd_{GU}(SS)}{am_{GU}(SS)}$    (1)


$sd_{GU}(SS) = \sqrt{\frac{1}{|SS|} \sum_{i=1}^{|SS|} \left(depth(s_i) - am_{GU}(SS)\right)^2}$    (2)
$am_{GU}(SS) = \frac{1}{|SS|} \sum_{i=1}^{|SS|} depth(s_i)$    (3)
$sc_{QU}(SS) = \frac{-\sum_{i=1}^{|SS|} p(s_i, SS) \log_2 p(s_i, SS)}{\log_2 |SS|}$    (4)
$p(s, SS) = \frac{length(s)}{\sum_{i=1}^{|SS|} length(s_i)}$    (5)
                                                                                 a function to return the shortest distance between the thesaurus
                                                                                 nodes n_i and n_j.
                                                                                 4.4    Cooperability
                                                                                 Let us consider the cooperability of a world history essay to question
                                                                                 constraints in terms of the chronological and the geographical
                                                                                 information. As described in Section 3, world history essay questions
                                                                                 give chronological and geographical conditions such as “up to
                                                                                 and including the first half of the 18th century” and “West Europe,
                                                                                 West Asia and East Asia.” In this case, if an essay describes only the
                                                                                 ancient histories of West Europe, West Asia and East Asia, the essay
                                                                                 satisfies the conditions logically. However, it does not reflect the
where depth(s) is a function to return the distance between the                  question intention. A cooperative essay should describe at least one
thesaurus root node and the node corresponding to the range of s.                HE of the 18th century. The same holds for the geographical information.
length(s) is a function to return the number of characters described             For example, an essay describing only “West Europe and West Asia”
in s. We designed the scoring functions to be normalized into the                violates the maxim of quantity, and a cooperative essay should
range [0, 1].                                                                    describe at least one HE for each area of the geographical
4.3    Ordering                                                                  condition. We assume that the chronological cooperability is observed
                                                                                 in all GSs while the geographical cooperability is observed in only
Let us consider the ordering of HEs in a GS. HEs in well-formed
                                                                                 a GS corresponding to the essay. For a GS, we defined a period
essays are generally described in chronological order. Note that
                                                                                 from the beginning of the earliest HE to the end of the latest one
the occurrence order of HEs does not always correspond with the
                                                                                 as a period of the GS. The smallest geographical range, including
descriptive order of an essay. Since the chronological information
                                                                                 where the HEs in a GS occurred, was defined as the range of the
of an HE has a beginning and ending in a range, the occurrence
                                                                                 GS. We assume that the observance of the maxim of quantity is
order relation between HEs is either non-overlapping, partially
                                                                                 approximated to the coverage of the period and the range of GSs.
overlapping or inclusive. In all relations, the beginning of the HE
                                                                                    We built the following hypotheses for the cooperability on the
e 1 precedes the beginning of the HE e 2 . However, in the inclu-
                                                                                 chronological and the geographical conditions in questions.
sion relation, e 1 may be described after e 2 such as “The Treaty of
Nanking ended the First Opium War.” Therefore, we assume that                       (H8) A period of a GS covers the period of the chronological
the describing order of HEs in the inclusion relation is independent of the                condition as exactly as possible.
chronological order. Next, let us consider the ordering of GSs in a                 (H9) A range of a GS corresponding to the essay covers the
GS. The describing order of GSs is also independent of the chronological                   range of the geographical condition as exactly as possible.
order. However, for example, the describing order of Athens, Rome,                  The chronological cooperability scCC () and the geographical
Cairo, Baghdad, Beijing and Shanghai seems to be better than the                 cooperability scGC () are calculated by the following functions:
order of Athens, Baghdad, Beijing, Cairo, Rome and Shanghai. This
is because GSs relating to each other are placed closely. We assume
that this relatedness is approximated by the geographical distance.
    We built the following hypotheses for the ordering in a GS.
    (H6) As an exception to the hypothesis (H3), an HE can be
         described both before and after another HE if they are in
         the inclusion relation.
    (H7) GSs in a GS are described in the order of short geographical
         distance.
                                                                                 $sc_{CC}(E, CC) = \frac{overlap(period(E), CC)}{extend(period(E), CC)}$    (9)
                                                                                 $sc_{GC}(E, GC) = \frac{2 P(E, GC) R(E, GC)}{P(E, GC) + R(E, GC)}$    (10)
                                                                                 $P(E, GC) = \frac{subsumed(geoentities(E), GC)}{|geoentities(E)|}$    (11)
                                                                                 $R(E, GC) = \frac{subsuming(geoentities(E), GC)}{|GC|}$    (12)
The hypothesis (H6) is the complement of the hypothesis (H3).                    where period(E) is a function to return a pair of the earliest time
    The chronological ordering scCO () and the geographical ordering             and the latest time in E, overlap(P1 , P2 ) is a function to return the
scGO () are calculated by the following functions:                               length of the overlap period between P1 and P2 , and extend(P1 , P2 )
$sc_{CO}(E) = \frac{K - L}{K + L}$    (6)
$sc_{GO}(E) = \frac{1}{geochange(E) + 1}$    (7)
$geochange(E) = \frac{1}{|E| - 1} \sum_{i=1}^{|E|-1} distance(range(e_i), range(e_{i+1}))$    (8)
                                                                                 is a function to return the length of the period between the earliest
                                                                                 time and the latest time among P1 and P2. geoentities(E) is
                                                                                 a function that returns a set of geographical entities included in
                                                                                 E, subsumed(G1, G2) is a function that returns the number of
                                                                                 geographical entities of G1 subsumed by geographical entities of G2,
                                                                                 and subsuming(G1, G2) is a function that returns the number of
                                                                                 geographical entities of G2 subsuming geographical entities of G1.

where K is the number of concordant pairs of HEs in E, and L is                  5     PROPOSED METHOD
the number of discordant pairs. range(e) is a function to return                 Figure 2 shows the outline of the proposed method. First, the input
a thesaurus node that is the nearest common node subsuming all                   essay is segmented into HEs by punctuation marks. An HE is
geographical entities included in the HE e, and distance(n_i, n_j) is            represented by a set of named entities extracted from the segment. Some

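As an illustration of the uniformity scores in Eqs. (1) to (5), the following Python sketch computes scGU from a list of thesaurus depths and scQU from a list of character counts. The function names, the plain-list inputs, and the handling of degenerate cases (a mean depth of zero, or fewer than two sub-GSs) are assumptions made for this sketch; the paper does not specify them.

```python
import math
from typing import Sequence


def geographical_uniformity(depths: Sequence[int]) -> float:
    """Eq. (1): 1 - sd/am over the thesaurus depths of the sub-GSs."""
    mean = sum(depths) / len(depths)                    # Eq. (3)
    if mean == 0:
        # Assumption: sub-GSs that all sit at the thesaurus root are uniform.
        return 1.0
    variance = sum((d - mean) ** 2 for d in depths) / len(depths)
    return 1.0 - math.sqrt(variance) / mean             # Eqs. (1) and (2)


def quantity_uniformity(lengths: Sequence[int]) -> float:
    """Eq. (4): entropy of the length shares, divided by log2 |SS|."""
    n = len(lengths)
    if n < 2:
        # Assumption: a single sub-GS is trivially uniform (log2(1) = 0).
        return 1.0
    total = sum(lengths)
    shares = [length / total for length in lengths if length > 0]  # Eq. (5)
    entropy = -sum(p * math.log2(p) for p in shares)
    return entropy / math.log2(n)
```

For example, three sub-GSs at depths 1, 2 and 3 give a geographical uniformity of about 0.59, and sub-GSs of 300, 12 and 12 characters give a quantity uniformity well below that of three sub-GSs of equal length.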

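The chronological ordering of Eq. (6) and the chronological cooperability of Eq. (9) can be sketched in the same way. The sketch below makes several assumptions that the paper leaves open: periods are encoded as (beginning year, ending year) pairs, a pair of HEs is counted as concordant when the HE described first does not begin later than the other, pairs in the inclusion relation are skipped following (H6), and empty denominators yield a score of 1.

```python
from itertools import combinations
from typing import List, Tuple

Period = Tuple[int, int]  # (beginning year, ending year); hypothetical encoding


def includes(a: Period, b: Period) -> bool:
    """True if period a contains period b."""
    return a[0] <= b[0] and b[1] <= a[1]


def chronological_ordering(periods: List[Period]) -> float:
    """Eq. (6): (K - L) / (K + L) over pairs of HEs in described order."""
    concordant = discordant = 0
    for first, second in combinations(periods, 2):
        if includes(first, second) or includes(second, first):
            continue  # (H6): inclusion pairs may appear in either order
        if first[0] <= second[0]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    # Assumption: with no comparable pairs there is nothing to penalise.
    return 1.0 if pairs == 0 else (concordant - discordant) / pairs


def chronological_cooperability(period: Period, condition: Period) -> float:
    """Eq. (9): length of the overlap divided by the length of the joint extent."""
    overlap = max(0, min(period[1], condition[1]) - max(period[0], condition[0]))
    extend = max(period[1], condition[1]) - min(period[0], condition[0])
    # Assumption: two identical instants count as full coverage.
    return 1.0 if extend == 0 else overlap / extend
```

Under these assumptions, an essay that covers only years 0 to 500 for a condition spanning years 0 to 1750 obtains a chronological cooperability of 500/1750, which reflects the observation in Section 4.4 that describing only ancient history does not meet the question intention.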
                                                      Figure 2: The outline of the proposed method

                                                                                       away besides the total score by a human expert. Basically the marks
                                                                                       awarded take account of the correctness of the content, and the
                                                                                       marks lost account for the ill-formedness. Therefore, we compared
                                                                                       the scores by our method with the subtracted marks. Note that the
                                                                                       lost marks are not caused only by chronological and geographical
                                                                                       inconsistencies. Figure 3 shows the scatter plot between the scores
                                                                                       by our method and the subtracted marks. The correlation coefficient
                                                                                       was 0.21, which indicated a weak positive correlation. Taking into
                                                                                       account that the marks subtracted include other causes than the
                                                                                       chronological and geographical problems, the value seems to be
Figure 3: The scatter plot between the scores by the method                            fairly good.
and the scores by a human expert
                                                                                       7    CONCLUSION
named entities evoke the chronological and/or the geographical                         For world history essays in Japanese university entrance exams,
information. Because exam cram books cover such information, we                        we proposed a method for measuring the uniformity, ordering and
constructed a database of world history terms based on the world                       cooperability in terms of the chronological and the geographical
history glossary published by Yamakawa Shuppan-sha.4 Using the                         information. The features of well-formedness are found by observ-
database, the named entities are converted into chronological and                      ing several model answer essays. From the experimental result, we
geographical information. Using both chronological and geograph-                       found a weak positive correlation between the scores measured by
ical information sets, the period and the range of the segment are                     our method and the scores estimated by a human expert. We will
respectively determined in the same way as that of the GS described                    investigate more appropriate functions in the future.
in 4.4. They are regarded as the chronological and geographical
information of the HE. Then, all hierarchical structures of GSs that                   REFERENCES
can be derived from the essay are enumerated. After scoring the HEs                     [1] Eugene Agichtein, David Carmel, Donna Harman, Dan Pelleg, and Yuval Pinter.
                                                                                            2015. Overview of the TREC 2015 LiveQA Track. In Proceedings of The Twenty-
for each hierarchical structure, the maximum score is selected as                           Fourth Text REtrieval Conference.
the final score for the essay in order to select the most plausible                     [2] Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. 2002. Inferring
                                                                                            Strategies for Sentence Ordering in Multidocument News Summarization. Jour-
hierarchical structure.                                                                     nal of Artificial Intelligence Research 17, 1 (2002), 35–55.
   Based on the hypotheses described in Section 4, the score sc                         [3] Sandro Bauer and Simone Teufel. 2015. A Methodology for Evaluating Timeline
for a GS to a question is recursively calculated by the following                           Generation Algorithms based on Deep Semantic Units. In Proceedings of the 53rd Annual Meeting of the
                                                                                            Association for Computational Linguistics and the 7th International Joint Confer-
functions.                                                                                  ence on Natural Language Processing of the Asian Federation of Natural Language
                                                                                            Processing, Vol. 2. 834–839.
$sc(E, SS, CC, GC) = \begin{cases} sc_T(E, CC) & \text{if it is a terminal section} \\ sc_N(E, SS, CC)\, sc_{GC}(E, GC) & \text{if it is the root section} \\ sc_N(E, SS, CC) & \text{otherwise} \end{cases}$    (13)
$sc_T(E, CC) = sc_{CO}(E)\, sc_{GO}(E)\, sc_{CC}(E, CC)$    (14)
$sc_N(E, SS, CC) = \frac{1}{|SS|}\, sc_{GU}(SS)\, sc_{QU}(SS) \sum_{i=1}^{|SS|} sc(events(s_i), sections(s_i), CC, GC)$    (15)
                                                                                        [4] Davide Buscaldi, Jorge J. Garcia Flores, Joseph Le Roux, and Nadi Tomeh. 2014.
                                                                                            LIPN: Introducing a new Geographical Context Similarity Measure and a Statis-
                                                                                            tical Similarity Measure Based on the Bhattacharyya Coefficient. In Proceedings
                                                                                            of the 8th International Workshop on Semantic Evaluation. 400–405.
                                                                                        [5] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries.
                                                                                            In Proceedings of Workshop on Text Summarization Branches Out. 74–81.
                                                                                        [6] Nitin Madnani, Rebecca Passonneau, Necip Fazil Ayan, John M. Conroy, Bonnie J.
                                                                                            Dorr, Judith L. Klavans, Dianne P. O’Leary, and Judith D. Schlesinger. 2007. Mea-
                                                                                            suring Variability in Sentence Ordering for News Summarization. In Proceedings
                                                                                            of the Eleventh European Workshop on Natural Language Generation. 81–88.
                                                                                        [7] Ani Nenkova and Rebecca J. Passonneau. 2004. Evaluating Content Selection in
                                                                                            Summarization: The Pyramid Method. In Proceedings of the Human Language
                                                                                            Technology Conference of the North American Chapter of the Association for Com-
                                                                                            putational Linguistics. 145–152.
                                                                                        [8] Naoaki Okazaki, Yutaka Matsuo, and Mitsuru Ishizuka. 2004. Improving Chrono-
                                                                                            logical Sentence Ordering by Precedence Relation. In Proceedings of the 20th
                                                                                            International Conference on Computational Linguistics. 81–88.
where events(s) and sections(s) are respectively functions to return                    [9] Rebecca J. Passonneau, Emily Chen, Weiwei Guo, and Dolores Perin. 2013. Auto-
an HE sequence and a GS sequence included in a GS s.                                        mated Pyramid Scoring of Summaries using Distributional Semantics. In Proceed-
                                                                                            ings of the 51st Annual Meeting of the Association for Computational Linguistics.
6    EXPERIMENTAL RESULT                                                                    143–147.
                                                                                       [10] Hideyuki Shibuki, Kotaro Sakamoto, Madoka Ishioroshi, Akira Fujita, Yoshinobu
Using essays submitted to the QA Lab-2 and the QA Lab-3, we                                 Kano, Teruko Mitamura, Tatsunori Mori, and Noriko Kando. 2017. Overview of
compared the scores measured by the proposed method and the                                 the NTCIR-13 QA Lab-3 Task. In Proceedings of The NTCIR-13 Conference.
                                                                                       [11] Joachim Wagner, Jennifer Foster, and Josef van Genabith. 2007. A Comparative
scores given by a human expert. Although there are only 55                                  Evaluation of Deep and Shallow Approaches to the Automatic Detection of
essays, they are annotated with the marks granted and taken                                 Common Grammatical Errors. In Proceedings of the 2007 Joint Conference on
                                                                                            Empirical Methods in Natural Language Processing and Computational Natural
4 http://www.yamakawa.co.jp/ (in Japanese)                                                  Language Learning. 112–121.
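
The recursive scoring of Eqs. (13) to (15) in Section 5, together with the selection of the maximum score over candidate hierarchical structures, can be sketched as follows. The Section class, the function names, and the three callables passed in (standing for scT, the product scGU·scQU, and scGC with the question conditions already bound) are hypothetical and introduced only for this illustration; they are not the interface of the system evaluated in Section 6.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Sequence


@dataclass
class Section:
    """A geographical section (GS): its HEs plus nested sub-GSs."""
    events: Sequence = ()
    sections: List["Section"] = field(default_factory=list)


def score(gs: Section,
          sc_t: Callable[[Sequence], float],
          sc_uniform: Callable[[List[Section]], float],
          sc_gc: Callable[[Sequence], float],
          is_root: bool = True) -> float:
    """Eq. (13): recursive score of a GS."""
    if not gs.sections:
        return sc_t(gs.events)                          # terminal section, Eq. (14)
    total = sum(score(sub, sc_t, sc_uniform, sc_gc, is_root=False)
                for sub in gs.sections)
    sc_n = sc_uniform(gs.sections) * total / len(gs.sections)   # Eq. (15)
    # Only the root section is multiplied by the geographical cooperability.
    return sc_n * sc_gc(gs.events) if is_root else sc_n


def best_score(candidates: Iterable[Section],
               sc_t: Callable[[Sequence], float],
               sc_uniform: Callable[[List[Section]], float],
               sc_gc: Callable[[Sequence], float]) -> float:
    """Final essay score: the maximum over candidate hierarchical structures."""
    return max(score(c, sc_t, sc_uniform, sc_gc) for c in candidates)
```

Passing the component scores as callables keeps the recursion independent of how the thesaurus and the term database are implemented.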

