=Paper= {{Paper |id=Vol-2446/paper2 |storemode=property |title=A Study of Person Entity Extraction and Profiling from Classical Chinese Historiography |pdfUrl=https://ceur-ws.org/Vol-2446/paper2.pdf |volume=Vol-2446 |authors=Yihong Ma,Qingkai Zeng,Tianwen Jiang,Liang Cai,Meng Jiang |dblpUrl=https://dblp.org/rec/conf/eyre/Ma0JC019 }} ==A Study of Person Entity Extraction and Profiling from Classical Chinese Historiography== https://ceur-ws.org/Vol-2446/paper2.pdf
A Study of Person Entity Extraction and Profiling from Classical
                   Chinese Historiography

                          Yihong Ma†,§ , Qingkai Zeng† , Tianwen Jiang† , Liang Cai‡ , Meng Jiang†
           † Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
                              ‡ Department of History, University of Notre Dame, Notre Dame, IN 46556, USA
                           § School of Finance, Shanghai University of Finance and Economics, Shanghai, China

                                          yihongma97@gmail.com,{qzeng,tjiang2,lcai,mjiang2}@nd.edu

ABSTRACT                                                                                                       Meng Xi 孟喜
When historians are interested in demographic and social network                                           Our approach                      Truth
information of historical actors in the early Chinese empires (841                   Courtesy name         长卿                                长卿
BC–1911 AD), very few studies have been done on entity retrieval                     Hometown              東海蘭陵                              東海蘭陵
from classical Chinese historiography. The key challenge lies in the                 Title(s)              郎, 丞相掾,                           郎, 丞相掾,
low resource of the language: deep learning requires large amounts                                         名之                                曲臺署長
of annotated data and becomes impracticable when such data is not                    Father                孟卿                                孟卿
available. In this study, we employ domain experts (history profes-                  Son                   N/A                               N/A
sors) to curate a set of person entities and their profile attributes                Master(s)             田王孫, 同郡碭田王孫                       田王孫
(e.g., courtesy name, place of birth, title) and relations (e.g., father-            Disciple(s)           沛翟牧子兄, 同郡白光少子,                    翟牧, 白光,
son, master-disciple) from two books, Records of the Grand Historian                                       疏廣, 后蒼                            趙賓, 焦延壽
and Book of Han. We develop a pattern-based bootstrapping ap-                       Table 1: The task is to extract person entities and their pro-
proach to extract the information with a very small number (i.e.,                   filing attributes from classical Chinese text. Our approach
1 or 2) of seed patterns. Experimental results show the effective-                  can find most of the attribute values correctly (i.e., marked
ness as well as the limitations of the iterative method. We would                   by underlines), compared with the ground truth annotated
appreciate research of digital humanities to address the challenges                 by history professors.
in entity retrieval from low-resource languages.
                                                                                    databases about ancient China. As we know, NLP is being revolu-
CCS CONCEPTS                                                                        tionized by deep learning with neural networks. However, deep
• Information systems → Data mining; Information extraction.                        learning requires large amounts of annotated data, and its advan-
                                                                                    tage over traditional statistical methods typically diminishes when
KEYWORDS                                                                            such data is not available. How to address the issue of low resource
                                                                                    in the task of entity extraction and profiling from classical Chinese
Information extraction, Entity profiling, Classical Chinese, Textual
                                                                                    text is still an open problem.
pattern, Bootstrapping
                                                                                        In this study, we collect a ground-truth dataset for evaluating on
                                                                                    the task, propose a pattern-based information extraction approach
1    INTRODUCTION                                                                   that requires very limited prior knowledge of classical Chinese,
Historians are interested in the classical Chinese literature as it                 and conduct experiments to show its effectiveness and limitations.
witnessed the rise and fall of the early empires and dynasties [2, 3,                   First, we recruit domain experts (history professors) to annotate
18, 32]. Currently, they have to spend a large amount of time reading               person entities and their profiles from two classical Chinese books,
those books and digging out who came from where, who did what,                      Records of the Grand Historian (authored by Sima Qian, completed
who studied under whom, and so on. Before that, they had to spend                   in c. 86 BC) and Book of Han (authored by Ban Gu, completed in
even longer learning the classical Chinese language, even though                    111 AD). Table 1 gives an example of the annotated profile
some of them are native speakers of modern Chinese [11, 22, 26].                    of Meng Xi 孟喜 (∼90 BC–∼40 BC). We focus on three attributes
Therefore, the idea of utilizing digital technologies to extract per-               (i.e., courtesy name, place of birth, positions/titles) and two relations
son entities and their profiling attributes from classical Chinese text             (e.g., father-son, master-disciple), because (1) these are main con-
becomes promising and exciting in the community, as natural lan-                    tents in the work of Chinese historiography and (2) historians are
guage processing (NLP) and entity retrieval have been developing                    very interested in how the government mechanisms were influ-
and accelerating at an unprecedented speed today.                                   enced by family and master-disciple relationships in the ancient
   However, it is a rather challenging task due to the lack of annotated            time. The domain experts generated fifty handcrafted patterns to
data. Historians write papers, publish books, but rarely build entity               extract the above information. They validated the attribute values
                                                                                    and assessed the reliability of the patterns (see Table 4). Moreover,
Copyright © 2019 for this paper by its authors. Use permitted under Creative        they annotated 15 complete person profiles (with 158 attribute
Commons License Attribution 4.0 International (CC BY 4.0).                          values) that they found most interesting. This dataset can serve as
EYRE’19, November 2019, Beijing, China                                                                                                   Y. Ma et al.


                                  Meng Xi 孟喜            Meng Qing 孟卿            Yan An Le 顏安樂                Zhang Yu 張禹
       Courtesy name                 長卿                      N/A                      公孫                         子文
       Hometown                    東海蘭陵                      東海                      魯國薛                        河內軹
       Title(s)                郎, 曲臺署長, 丞相掾                  N/A               齊郡太守丞, 大司農                 郡文學, 光祿大夫, 東平內史
       Father                        孟卿                      N/A                     眭孟姊                         N/A
       Son(s)                        N/A                     孟喜                       N/A                        N/A
       Master                       田王孫                      蕭奮                       眭孟                         施讎
       Disciple(s)            趙賓, 白光, 翟牧, 焦延壽           后倉, 閭丘卿, 疏廣            泠豐, 任公, 冥都, 筦路                  彭宣, 戴崇
Table 2: Four examples of profiles of the historical actors in the Han Dynasty. We focus on the three attributes and two relations.
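
For readers who want to handle such profiles programmatically, one possible in-memory representation is sketched below in Python. The class name `PersonProfile`, its field names, and the `num_values` helper are hypothetical and not part of the study; the values are copied from the Meng Xi 孟喜 column of Table 2, with "N/A" cells represented as `None` or an empty list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonProfile:
    """Hypothetical container for the three attributes and two relations."""
    name: str
    courtesy_name: Optional[str] = None
    hometown: Optional[str] = None
    titles: List[str] = field(default_factory=list)
    father: Optional[str] = None
    sons: List[str] = field(default_factory=list)
    masters: List[str] = field(default_factory=list)
    disciples: List[str] = field(default_factory=list)

    def num_values(self) -> int:
        """Count the filled-in attribute and relation values (illustrative only)."""
        single = [self.courtesy_name, self.hometown, self.father]
        multi = self.titles + self.sons + self.masters + self.disciples
        return sum(v is not None for v in single) + len(multi)


# Values taken from the Meng Xi column of Table 2.
meng_xi = PersonProfile(
    name="孟喜",
    courtesy_name="長卿",
    hometown="東海蘭陵",
    titles=["郎", "曲臺署長", "丞相掾"],
    father="孟卿",
    masters=["田王孫"],
    disciples=["趙賓", "白光", "翟牧", "焦延壽"],
)
print(meng_xi.num_values())
```

Keeping multi-valued fields as lists makes it straightforward to compare an extracted profile with an annotated one value by value.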

ground truth for evaluating such information extraction methods             Historiography Book                    # Sentences      # Words
on the classical Chinese literature.                                        Records of the Grand Historian               32,564       615,457
    Second, we propose a bootstrapping approach to extract person           Book of Han                                  40,114       874,165
entities and profiles requiring very little prior knowledge of the
                                                                                  Table 3: Statistics of the dataset (two books).
language. The algorithm starts from only one or two simple seed
patterns, finds the attribute values, and then uses them to discover          • trumpdonaldjohn,
more complicated patterns. It has an estimator to assess the reliability      • newyorktrumpdonald,
of patterns during the iterative process. So, the new attribute               • newyorktrumpdonaldjohn.
values extracted from more reliable patterns are more likely to be       Note that there would be no white-space (nor upper-case) to split
trustworthy and can be used to infer patterns in next iterations.        the words. Complicated patterns have to be designed or recognized
    Experiments show that textual patterns achieve an F1 score of        in the extraction methods.
0.851 on 15 ground-truth person profiles. Table 1 shows a comparison
between the generated profile (left) and the ground-truth profile        2.2    Person Entity Profiling
(right) of Meng Xi 孟喜. On the other hand, the bootstrapping method      Given the classical Chinese historiography, the task of person entity
achieves its highest performance after 7 iterations when finding a       profiling aims to extract demographic attributes (e.g., courtesy
set of related patterns and extracting person-title pairs, but it meets  name, place of birth, title) and social relations (e.g., father-son,
barriers in finding more patterns for other attributes and relations.    master-disciple) and to generate a complete profile for the person entities
    We summarize our contributions in this study as follows:             extracted in Section 2.1.
      • New dataset: We recruit history professors to curate a set          Table 2 shows examples of the profiles of Confucians in the Han
        of person profiles from classical Chinese literature.            Dynasty such as Meng Xi 孟喜, Meng Qing 孟卿, Yan An Le 顏安
      • New approach: We develop a bootstrapping method based            樂, and Zhang Yu 張禹. Some of the attribute values are “N/A” if
        on textual patterns to extract the person entities and attrib-   not available in the text corpora.
        uted information, requiring little prior knowledge.                 There are two typical challenges of this task. One is the variety
      • Effectiveness: Experiments show that textual patterns make       of demographic attributes. Each type of attributes needs a set of
        an F1 score of 0.851 on 15 person profiles annotated by the      specific, reliable extractors, which requires prior knowledge of the
        domain experts. Limitations are discussed in Section 4.3.        classical Chinese language. The other challenge is typical of the
                                                                         Chinese historiography: Zero Pronoun (ZP), which stands for pro-
2 ENTITY EXTRACTION AND PROFILING                                        nouns that are omitted when they are pragmatically or grammat-
2.1 Person Entity Extraction                                             ically inferable from the context. Here is an example taken from
                                                                         Records of the Grand Historian, where the ZPs (denoted as φ) all
Person entity extraction sounds like a subtask of the standard named entity recognition
                                                                         refer to “Mr. Chunshen” 春申君:
(NER) task – it narrows down from recognizing multiple types of
entities (i.e., persons, locations, organizations) to only one type.            [春申君] 者,φ 楚人也,φ 名歇,φ 姓黄氏。φ 游
However, it faces a challenge when applied to classical Chinese                 學博聞,φ 事楚頃襄王。
text: in classical Chinese literature, there are many different                 (Translation: [Mr. Chunshen], φ was born in Chu, φ’s
ways of mentioning a specific person. A person has a first name, a              first name is Xie, φ’s family name is Huang. φ travelled
last name (family name), and a courtesy name; he is also recognized             over the country and enriched his knowledge,
by his hometown and title/position in the government. For the sake of           φ served King Qingxiang of Chu.)
readability, let us take the President of the United States, Donald J.       This sentence indicates three attributes (i.e., hometown, first name,
Trump, as an example. “Donald” is his first name, and suppose “John”     last name) and one relation (i.e., master-disciple) about Mr. Chunshen
(J.) can be considered his courtesy name. (Ancient Chinese people        春申君. However, ZP makes it challenging to link the attributes
did not have middle names; they had courtesy names.) He was              and relations in the context with the person entity. Moreover,
born in New York. So, all the following could be used in the classical   we observe that ZPs occur not only in the same sentence with
Chinese literature to mention President Trump:                           the mention of the person entity but also across several sentences
      • donald,                                                          in the same paragraph. In biographical historiography, each chap-
      • trumpdonald,                                                     ter discussed the life story of a certain person. So, we adopt the
      • presidentdonald,                                                 following assumption (learned from history professors) to resolve


[Figure 1 (diagram lost in extraction). Recoverable labels: "Seed Patterns" (of the form ... $TITLE); "Seed Entities"; "Generate and rank textual patterns"; a ranked table with columns Rank, Pattern, and Entities; "Expand seed patterns and seed entities".]
                                 Figure 1: The diagram of the proposed pattern-based bootstrapping method.
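
The pattern-generation stage of Figure 1 can be illustrated with a small Python sketch of one kind of candidate used by the bootstrapping method: a fixed-size window of characters immediately before an already-known value. The function name `prefix_window_candidates`, the set of clause delimiters, and the default window size are assumptions made for illustration; ranking and filtering of the candidates are not shown. The two clauses in the toy input are the example strings of patterns 12 and 9 in Table 4.

```python
import re
from collections import Counter

# Assumed clause punctuation (ASCII and full-width forms).
CLAUSE_DELIMS = r"[,,。;;]"


def prefix_window_candidates(text, known_values, window=2, slot="$Title"):
    """Count candidate patterns of the form '<window chars> $Slot'.

    For every occurrence of a known value inside a clause, the `window`
    characters directly before it are taken as the textual pattern.
    Occurrences with fewer than `window` preceding characters are skipped.
    """
    counts = Counter()
    for clause in re.split(CLAUSE_DELIMS, text):
        for value in known_values:
            start = clause.find(value)
            while start != -1:
                if start >= window:
                    context = clause[start - window:start]
                    counts[f"{context} {slot}"] += 1
                start = clause.find(value, start + 1)
    return counts


toy_text = "起遷為國尉,拜為上卿"
candidates = prefix_window_candidates(toy_text, ["國尉", "上卿"])
for pattern, count in candidates.most_common():
    print(pattern, count)
```

With a window of two characters, the known titles yield the candidates [遷為 $Title] and [拜為 $Title]; a window of one character would merge both into the much less specific [為 $Title].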
the ZP issue: given a paragraph, as long as a person entity was ex-                             • #Values: The number of (person entity, attribute or relation
tracted in the first clause, the ZPs in every clause of the paragraph                             value)-pairs extracted by the pattern.
refer to that person entity. This will help us propose an approach                              • #True Values: The number of true pairs annotated by the do-
to extract person-attribute/relation pairs when the extractors were                               main experts.
only able to find the attributes and relations in local contexts.                               • Reliability: It describes whether a pattern is reliable for ex-
                                                                                                  tracting true values. It is calculated as
3     THE PROPOSED APPROACH
                                                                                                  \[ \text{Reliability} = \frac{\#\text{True Values}}{\#\text{Values}}, \]        (1)
In this section, we first introduce how the dataset was curated
with handcrafted patterns by domain experts. Next, we present a                                   which gives a score between 0 and 1.
pattern-based bootstrapping method to find the entity information
                                                                                             Specifically, for person, pattern [$Person 者] was the first pat-
with a small number of seed patterns.
                                                                                          tern that the domain experts come up with. “者” is a typical symbol
                                                                                          in classical Chinese that indicates the appearance of a person. The
3.1     Data Curation with Handcrafted Patterns                                           person entities extracted by pattern [$Person 者] may be in any
We curate a dataset of two classical Chinese historiography books,                        of the 6 forms of person entity mentions in Section 3.1.1. Another
Records of the Grand Historian (authored by Sima Qian, completed                          frequent pattern is [$Person 字 $CourtesyName]. Unlike pattern
in c. 86 BC) and Book of Han (authored by Ban Gu, completed in                            [$Person 者], person entities extracted by pattern [$Person 字
111 AD). Table 3 lists the statistics of the dataset.                                     $CourtesyName] strictly follow the form of last name + first name.
                                                                                          It is a more reliable pattern. As Table 4 shows, the reliability
3.1.1 Patterns for person entity extraction. The domain experts we                        of pattern [$Person 字 $CourtesyName] is 1, whereas the reliability
recruited to annotate the data contributed the following patterns to                      of pattern [$Person 者] is only 0.6087, although the former extracts
recognize mentions of person entities:                                                    fewer person-value pairs (205 vs. 299).
     • $FirstName,                                                                           Most of the patterns for hometown, father-son, and master-disciple
     • $LastName + $FirstName                                                             are highly reliable (higher-than-0.96 reliability), except patterns [,
     • $Title + $FirstName,                                                               $Father 子] (ID 38) and [,    事 $Master] (ID 46) of reliability 0.8571
     • $LastName + $FirstName + $CourtesyName,                                            and 0.8356, respectively. Among the 28 patterns for attribute title,
     • $Hometown + $LastName + $FirstName,                                                only 4 patterns have reliability of lower than 0.8 and only one has a
     • $Hometown + $LastName + $FirstName + $CourtesyName.                                reliability score of lower than 0.7, i.e., [至 $Title] (ID 36). Among
                                                                                          all the 50 handcrafted patterns, 35 (70%) patterns have reliability
3.1.2 Patterns for entity profiling. Table 4 presents 50 textual pat-                     score of 1; 5 (10%) patterns have reliability score of lower than 0.8.
terns that were used to extract a set of candidates of person’s at-
tribute or relation values. Some attributes such as hometown and                          3.2       Pattern-based Bootstrapping
father-son have a small number of patterns. Some such as title and                        We propose a new approach to extract person entities and profiles
master-disciple have a large number of patterns. The domain ex-                           from classical Chinese historiography requiring very little prior
perts also annotated whether the attribute values and relations are                       knowledge of the language. Generally, it is an iterative method
true or false. For each pattern, we give three numbers associated                         that uses textual patterns to extract attribute or relation values
with its extractions:                                                                     from text data. Figure 1 shows the diagram of one iteration in the


 ID       Attribute                   Pattern                 Example              #Values   #True Values   Reliability
  1         person                  $Person 者                陳丞相平 者                  299         182          0.609
  2         person           $Person 字 $CourtesyName         王莽 字巨君                  205         205          1.000
  3         person           $Person,$Hometown 人也          申屠嘉,梁 人也                  106         103          0.971
  4         person            $Person,$Hometown 人          朝鮮王滿,燕 人                   46          34          0.739
  5       hometown               ,$Hometown 人也               ,陽城 人也                  190         189          0.995
  6       hometown                ,$Hometown 人                ,高陽 人                   11          11          1.000
  7       hometown                 徙 $Hometown              自下邑徙平陵                    16          16          1.000
  8     courtesy name           ,字 $CourtesyName              ,字長卿                    21          21          1.000
  9          title                  拜為 $Title                 拜為上卿                    22          22          1.000
 10          title              拜 $Person 為 $Title          拜仁 為郎中令                    8           8          1.000
 11          title                   遷 $Title                遷東平太傅                    74          64          0.865
 12          title                  遷為 $Title                起遷為國尉                    36          36          1.000
 13          title              遷 $Person 為 $Title         遷廣明 為淮陽太守                   1           1          1.000
 14          title                  遷至 $Title               稍遷至栘中廄監                   19          19          1.000
 15          title                  封為 $Title               綰封為長安侯                    18          18          1.000
 16          title              封 $Person 為 $Title       孝景後三年封蚡 為武安侯                  3           3          1.000
 17          title                  召為 $Title                 復召為郎                     2           2          1.000
 18          title              召 $Person 為 $Title        於是上召寧成 為中尉                   5           5          1.000
 19          title                   補 $Title               以選除補御史掾                   41          40          0.976
 20          title                 察… 為 $Title            以郡吏察廉為樓煩長                    8           8          1.000
 21          title                  舉為 $Title              後以御史舉為鄭令                    8           8          1.000
 22          title                 舉… 為 $Title             復舉賢良為河南令                   11          10          0.909
 23          title                  擢為 $Title               擢為光祿大夫                    10          10          1.000
 24          title              擢 $Person 為 $Title         因擢延壽 為諫大夫                   3           3          1.000
 25          title                  徵為 $Title                 徵為廄丞                    14          11          0.786
 26          title              徵 $Person 為 $Title          徵由 為大鴻臚                    5           5          1.000
 27          title                  徙為 $Title                徙為頻陽令                    11          11          1.000
 28          title              徙 $Person 為 $Title         徙立 為太原太守                    2           2          1.000
 29          title                  復為 $Title               後復為淮陽都尉                   15          14          0.933
 30          title                  以 $Title 察            以郡吏 察廉為樓煩長                   4           4          1.000
 31          title                  薦為 $Title                 薦為議郎                     4           4          1.000
 32          title              薦 $Person 為 $Title          薦宣 為長安令                    3           3          1.000
 33          title                  贖為 $Title                 贖為庶人                     8           8          1.000
 34          title                  立為 $Title                自立為代王                    24          21          0.875
 34          title                   為 $Title               為駙馬都尉侍中                  193         151          0.782
 35          title               $Person 為 $Title            禹 為丞相史                   45          32          0.711
 36          title                   至 $Title                 至中大夫                   115          37          0.322
 37       father-son              ,$Father 子也              ,秦莊襄王 子也                   25          24          0.960
 38       father-son               ,$Father 子                ,文公 少子                   14          12          0.857
 39       father-son              ,其父 $Father               ,其父高祖中子                    3           3          1.000
 40       father-son               ,父 $Father                ,父號孟卿                     6           6          1.000
 41       father-son             $Son 父曰 $Father           悼侯 父曰隱太子友                  18          18          1.000
 42     master-disciple           從 $Master 受…            從太中大夫京房 受易                  12          12          1.000
 43     master-disciple           事 $Master 受…           又事前將軍蕭望之 受論語                  2           2          1.000
 44     master-disciple         $Master 授 $Disciple        常 授梁蕭秉君房                   52          52          1.000
 45     master-disciple            ,授 $Disciple          ,授翼奉、蕭望之、匡衡                  25          25          1.000
 46     master-disciple            ,事 $Master               ,事太傅夏侯勝                   73          61          0.836
 47     master-disciple         事 $Master 為 $Title         事梁孝王 為中大夫                   3           3          1.000
 48     master-disciple          弟子… 者,$Master         弟子遂之者,蘭陵褚大,東平嬴公                 4           4          1.000
 49     master-disciple           受… 於 $Master          嘗受韓子、雜家說於騶田生所                  3           3          1.000
 50     master-disciple       與 $Person 俱事 $Master         與顏安樂 俱事眭孟                   6           6          1.000
Table 4: Patterns manually annotated by domain experts to find person entity profiles, where underlines mark the values
extracted by the patterns.
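
As a quick check of the Reliability column, the following minimal Python sketch recomputes Equation (1) for a few rows of Table 4. The names `PatternStats` and `reliability` are illustrative assumptions rather than code from the study; the counts are copied from the table.

```python
from typing import NamedTuple


class PatternStats(NamedTuple):
    pattern: str
    n_values: int  # "#Values" column of Table 4
    n_true: int    # "#True Values" column of Table 4


def reliability(n_true: int, n_values: int) -> float:
    """Equation (1): share of extracted pairs that the experts marked as true."""
    if n_values <= 0:
        raise ValueError("a pattern must extract at least one value")
    return n_true / n_values


# A few rows copied from Table 4.
rows = [
    PatternStats("$Person 者", 299, 182),
    PatternStats("$Person 字 $CourtesyName", 205, 205),
    PatternStats("遷 $Title", 74, 64),
    PatternStats("至 $Title", 115, 37),
]

for row in rows:
    print(f"{row.pattern}: {reliability(row.n_true, row.n_values):.3f}")
```

Rounded to three decimals, the computed scores agree with the table entries for these rows.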


pattern-based bootstrapping method. It starts with only one or two                        value strings. Hamming distance is defined as the minimum num-
simple seed patterns for each attribute. Because the number of seed                       ber of substitutions required to change one string into the other.
patterns is small, it would not take much effort to find one or two.                         (2) Variety of the pattern’s extracted values: A pattern would be
For example, [遷為 $Title] (i.e., [relegated to $Title]) and [補                             more reliable if it extracted more true values. Besides the frequency,
$Title] (i.e., [filled in the position of $Title]) were the two reli-                     we try another measurement: we assume that if there was a value
able seed patterns for the attribute title. The iterative method runs                     whose frequency dominates the set of values one pattern extracted,
the following steps until convergence.                                                    the pattern would be less reliable. So we use 1 minus the ratio of
                                                                                          the count of the most frequent value over all the value counts.
Step 1: Generating pattern candidates. Candidate patterns are
generated using contextual features of the target value v_i in the clause. We find that target values are more likely to be at the end of the clause because of the linguistic structure. Therefore, the commonly used skip-gram contextual pattern “w_{-1} ____ w_1” [28] would not work for our task. Instead, we explore two different kinds of contextual features described as follows:
     • $Pattern $Value. The textual pattern is a window of a certain size of Chinese characters before a target value. For example, if the target value is $Title, we can find the pattern candidate [遷為 $Title] (i.e., [relegate to $Title]) when the window size is 2.
     • $Pattern $Entity $Pattern $Value. Both a window of one Chinese character before $Entity and all characters between $Entity and $Value are selected as the contextual feature. For example, if $Title is the target value and $Person is the entity that has already been extracted in Section 3.1.1, we can find a new pattern candidate [遷 $Person 為 $Title] (i.e., [relegate $Person to $Title]).

Step 2: Ranking pattern candidates. It is nontrivial to rank the quality of pattern candidates. Treating all the unlabeled entities as false raises two serious issues: (1) reliable patterns that extracted true unlabeled values are penalized, and (2) unreliable patterns that extracted false unlabeled values cannot be penalized. To address these issues, we define the estimation score of pattern reliability as follows:

\[
r(p) = w_1 \cdot \frac{\sum_{v \in V_p} \left(1 - \min_{v^+ \in V^+} d(v, v^+)\right)}{\sum_{v \in V_p} \mathit{freq}(v)} + w_2 \cdot \left(1 - \frac{\max_{v \in V_p} \mathit{freq}(v)}{\sum_{v \in V_p} \mathit{freq}(v)}\right) \in [0, 1],
\]

where p is a textual pattern, v is a value string, v^+ is a true value string, V_p is the set of unique value strings extracted by pattern p, and V^+ is the set of unique true value strings; d(v_1, v_2) is the normalized Hamming distance between the two value strings, and freq(v) is the frequency of the value string v. w_1 and w_2 are weights with w_1 + w_2 = 1.
   The estimator includes two kinds of features:
   (1) The textual similarity between the pattern’s extracted values and true values: If the value a pattern extracted is very similar to one true value, the value is likely to be true and the pattern is likely to be reliable. For example, suppose “Tai Shou 太守”, the name of an official position, is already in the set of true values (as $Title). Then the value “Nan Yang Tai Shou 南陽太守” extracted by a pattern, which means the Tai Shou 太守 ruling a place called Nan Yang 南陽, is likely to be a good value (as $Title). We use Hamming distance as the metric to measure the similarity between two

Step 3: Selecting new patterns and extracting new values for the next iteration. For each pattern, we calculate the reliability score r(p) and the frequency of values that it extracted. For the next iteration, we first filter out the patterns whose frequency is below a threshold and then select the top patterns with the highest r(p). After that, we expand the set of true values V^+ by adding the values extracted by the new patterns.

4 EXPERIMENTS
In this section, we first evaluate the quality of handcrafted patterns given by domain experts. Then we evaluate the effectiveness of the bootstrapping method. Finally, we discuss the limitations.

4.1    Evaluating the Handcrafted Patterns
Here we conduct experiments to answer: do the handcrafted patterns extract correct person entities, attributes, and relations?

Evaluation methods. We use the 15 complete person profiles (with 158 values) as ground truth. We use standard Information Retrieval metrics: Precision, Recall, and F1 score. Precision is the fraction of true attribute or relation values (i.e., values that find a match in the corresponding attribute in ground-truth profiles) among all values extracted by handcrafted patterns. Recall is the fraction of true attribute or relation values among all ground-truth values. F1 score is the harmonic mean of Precision and Recall.

Evaluation results. Regarding the entity profiles, handcrafted textual patterns achieve a Precision of 0.901, a Recall of 0.803, and an F1 score of 0.851. Table 1 shows a comparison between the generated profile (left) and the ground-truth profile (right) of Meng Xi 孟喜, where most of the values extracted are correct. We also find the following limitations of the handcrafted patterns. First, the different forms of person entity mentions make entity linking (i.e., mention alignment) difficult. For example, in the master-disciple relation of Meng Xi 孟喜, the mention “同郡白光少子” (i.e., Bai Guang 白光, whose courtesy name is Shao Zi 少子, from the same (“同”) hometown (“郡”) as Meng Xi) extracted by handcrafted patterns should refer to Bai Guang 白光 in the annotation. Second, the ZP problem was resolved in most of the cases, but attributes or relations may still be assigned to the wrong persons. For example, in the master-disciple relation, Hou Cang 后蒼 and Shu Guang 疏廣 are indeed disciples of Meng Xi 孟喜’s father Meng Qing 孟卿, but are mistakenly regarded as the disciples of Meng Xi 孟喜 due to the assumption.

4.2    Evaluating the Effectiveness of the Bootstrapping Method
We conduct experiments to see if the bootstrapping method can find the set of handcrafted patterns with only one or two seed patterns and see if the attribute values can be accurately extracted.
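As an illustration, the reliability estimate r(p) from Step 2 in Section 3.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, and because the normalized Hamming distance is not specified for strings of unequal length, the sketch assumes the two strings are right-aligned and counts every unmatched leading character as a mismatch, so that “太守” and “南陽太守” come out as partially similar.

```python
from collections import Counter


def normalized_hamming(a: str, b: str) -> float:
    """Normalized Hamming distance in [0, 1].

    Assumption: strings of unequal length are right-aligned, and each
    unmatched leading character of the longer string is a mismatch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    overlap = min(len(a), len(b))
    # Compare characters from the end of both strings.
    mismatches = sum(x != y for x, y in zip(reversed(a), reversed(b)))
    return (mismatches + longest - overlap) / longest


def reliability(extractions, true_values, w1=0.5, w2=0.5):
    """Estimate r(p) from all value occurrences extracted by one pattern.

    extractions: list of value strings (with repeats) extracted by p.
    true_values: the set V+ of known true value strings.
    """
    freq = Counter(extractions)      # freq(v) for each unique v in V_p
    total = sum(freq.values())       # sum of freq(v) over V_p
    if total == 0 or not true_values:
        return 0.0
    # Feature 1: similarity of each unique value to its closest true value.
    similarity = sum(
        1.0 - min(normalized_hamming(v, t) for t in true_values)
        for v in freq
    )
    # Feature 2: one minus the share of the most frequent value.
    spread = 1.0 - max(freq.values()) / total
    return w1 * similarity / total + w2 * spread
```

For instance, with V+ = {太守} and a pattern that extracted 南陽太守 once and 太守 twice, the first term is (0.5 + 1.0) / 3 = 0.5 and the second is 1 - 2/3, so r(p) = 0.5 · 0.5 + 0.5 · (1/3) ≈ 0.417 with w_1 = w_2 = 0.5.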
EYRE’19, November 2019, Beijing, China                                                                                                        Y. Ma et al.


Parameter settings. We set the window size as 2. The frequency threshold of patterns is 10. The number of top patterns selected per iteration is 10. We run until convergence but report only the first 10 iterations for the sake of space. The weights of pattern reliability features are w_1 = w_2 = 0.5.

Evaluation methods. Here are the metrics for the two tasks.

Task of pattern extraction: We evaluate the performance on extracting patterns for the title attribute. We use the metric Precision@K, which is the fraction of top K scored generated patterns that are in the ground-truth pattern set. We also define a new metric Coverage@K for the task, which is the fraction of top K scored ground-truth patterns that are extracted by the bootstrapping method. The generated patterns were scored by the reliability estimation in Step 2 in Section 3.2 and the ground-truth patterns were scored by the reliability in Table 4. Average precision (AP) computes the average precision value for coverage over 0 to 1.

Task of person-title pair extraction: We first assign a confidence score to each person-title pair by weighting the reliability scores of the textual patterns that extract the person and the title, respectively. We evaluate the person-title pairs extracted by the bootstrapping method at different numbers of iterations with Precision-Recall curves. Precision is the fraction of true person-title pairs among all person-title pairs generated by handcrafted patterns. Recall is the fraction of true person-title pairs among all 516 ground-truth person-title pairs. AUC is the area under the curve.

Results on pattern extraction. The bootstrapping method improves the performance of pattern extraction in the early iterations, but after a certain number of iterations the performance deteriorates. From Table 5, running the bootstrapping algorithm for 3 iterations increases AP by 42.55% (from 0.235 to 0.335) compared to running only 1 iteration. After around 5 iterations, AP declines continuously, and iteration 10 gives the lowest AP of 0.131, which is a decrease of 44.26% from iteration 1. Moreover, Coverage@K no longer changes after 7 iterations, which indicates that the bootstrapping method may meet certain barriers in extracting more patterns.
   After observing the resulting patterns, we can infer some limitations on pattern extraction of the pattern-based bootstrapping method:
   First, there exist many patterns with either a relatively low frequency (i.e., less than 20) or a lack of interpretability (i.e., patterns with scarcely any actual semantic meaning that are somehow capable of extracting “good” entities, which are still considered “good” by our method). Such patterns tend not to be found by our domain experts, and we should be reasonably tolerant of them.
   Second, the pattern-based bootstrapping method is not good at abstracting the first type of contextual patterns mentioned in Step 1 in Section 3.2. Human experts can easily generalize patterns with a v. + prep. structure that are composed of different verbs but the same preposition into one super-category: prep. For example, it is reasonable for domain experts to find the common feature of patterns like [拜為 $Title], [擢為 $Title], and [舉為 $Title], all of which mean [promote to $Title], and generalize them into the pattern [為 $Title] (i.e., [to $Title]). However, the bootstrapping method tends not to capture such abstraction of patterns and therefore generates only a subset of certain ground-truth patterns, which pulls down the evaluation metric.

[Figure 2 plot: Precision (y-axis, 0.0 to 1.0) versus Recall (x-axis, 0.0 to 0.5) curves for iterations 1 to 10. AUC per iteration: 1: 0.107, 2: 0.140, 3: 0.159, 4: 0.154, 5: 0.171, 6: 0.184, 7: 0.207, 8: 0.193, 9: 0.122, 10: 0.120. Annotation: best performance at iteration #7.]
Figure 2: The performance of the bootstrapping method on person-title pair extraction gradually improved through iterations. AUC increased from 0.107 to 0.207 (in iteration 7) and decreased after the point.

Results on person-title pair extraction. Running the bootstrapping method for more iterations generally increases the performance of person-title pair extraction, but after a certain number of iterations the performance starts to shrink. From Figure 2, AUC rises over the first 7 iterations (with a slight dip at iteration 4), achieving a maximum of 0.207 in iteration 7, and then decreases in the last 3 iterations.
   Why were the Recall scores consistently low? Pattern ID 11, ID 34 and ID 36 from Table 4 are not found by the bootstrapping method due to the setting of a window size of 2. Therefore, values extracted by those patterns, which account for 45% of the total true values, will never be found.
   Why were many false person-title pairs included after iteration #8? Domain experts have also designed stop words for each handcrafted pattern, which are capable of screening out common noise with certain patterns. But for the bootstrapping method, the noise extracted by the patterns is still regarded as true.

4.3    Discussions
We find that the bootstrapping method can work only on extracting attribute values of $Title. The values of $Title could be shared by multiple patterns’ extractions because multiple people can be assigned to the same position in the government. Only when the values are shared can we find one pattern from another by bootstrapping. However, one person cannot have multiple fathers and rarely has multiple masters. So far, we have only investigated the pattern-based bootstrapping method in Section 3.2 on the task of attribute discovery. The premise of this method is that there should exist some entities that could be extracted by multiple patterns, which makes it possible to find new patterns through pattern generation. However, for the task of relation extraction (e.g., father-son and master-disciple), since each relation pair is unique in the text, no pattern shown in Table 4 shares even a single common instance that could also be extracted by other patterns in the same category, which makes it hard for the instance-level bootstrapping method to work.
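The selection step of the bootstrapping loop (Step 3 in Section 3.2) can be sketched as follows, using the frequency threshold of 10 and the top-10 cutoff from the parameter settings in Section 4.2. This is only a sketch: the function names and the toy candidates are hypothetical, and the frequency of a pattern is assumed to be the number of value occurrences it extracted. It also shows why shared values matter: a pattern contributes new true values only after it has been selected, and it can only score well if some of its values overlap with values already known to be true.

```python
def select_patterns(candidates, min_freq=10, top_k=10):
    """Keep frequent patterns and return the top_k by reliability.

    candidates maps a pattern string to a pair
    (reliability score, list of extracted value occurrences).
    """
    frequent = [
        (pattern, score, values)
        for pattern, (score, values) in candidates.items()
        if len(values) >= min_freq  # drop patterns below the threshold
    ]
    # Highest reliability first.
    frequent.sort(key=lambda item: item[1], reverse=True)
    return frequent[:top_k]


def expand_true_values(true_values, selected):
    """Add every value extracted by the selected patterns to V+."""
    expanded = set(true_values)
    for _pattern, _score, values in selected:
        expanded.update(values)
    return expanded


# Hypothetical toy candidates with made-up scores and values.
toy_candidates = {
    "[拜為 $Title]": (0.8, ["太守", "丞相", "太守"]),
    "[遷為 $Title]": (0.6, ["太守", "郎"]),
    "[曰 $Title]": (0.9, ["曲臺"]),
}
toy_selected = select_patterns(toy_candidates, min_freq=2, top_k=1)
toy_true_values = expand_true_values({"太守"}, toy_selected)
```

In the toy run, the highest-scoring candidate is discarded because it falls below the (lowered) frequency threshold, and the single selected pattern adds one new value to the set of true values.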
    # of iterations        P@3        C@3       P@5        C@5       P@7        C@7       P@10      C@10      P@15      C@15      P@20       C@20        AP
            1              0.667      0.667     0.800      0.400     0.857      0.429     0.700     0.400     0.467     0.400     0.350      0.300      0.235
            2              0.667      0.667     0.800      0.800     0.714      0.714     0.800     0.600     0.667     0.533     0.550      0.500      0.329
            3              1.000      0.667     0.800      0.800     0.714      0.714     0.500     0.600     0.600     0.533     0.550      0.500      0.335
            4              0.667      0.667     0.600      0.800     0.714      0.714     0.500     0.600     0.533     0.533     0.450      0.500      0.279
            5              0.667      1.000     0.600      1.000     0.714      0.857     0.700     0.700     0.533     0.667     0.450      0.600      0.320
            6              0.333      1.000     0.400      1.000     0.429      0.857     0.500     0.700     0.467     0.667     0.400      0.600      0.254
            7              0.333      1.000     0.400      1.000     0.286      0.857     0.500     0.800     0.467     0.733     0.350      0.650      0.214
            8              0.333      1.000     0.400      1.000     0.286      0.857     0.500     0.800     0.467     0.733     0.350      0.650      0.204
            9              0.000      1.000     0.200      1.000     0.143      0.857     0.400     0.800     0.333     0.733     0.350      0.650      0.162
           10              0.000      1.000     0.200      1.000     0.143      0.857     0.200     0.800     0.267     0.733     0.250      0.650      0.131
Table 5: We use Precision@K (P@K), Coverage@K (C@K) and Average Precision (AP) to evaluate the method on pattern extraction. At iteration #3, the method achieved the highest AP of 0.335, a relative improvement of 42.6% over the seed iteration (and patterns). However, the reliability of newly extracted patterns dropped significantly and AP started dropping after iteration #5.
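To make the metrics in Table 5 concrete, the sketch below computes Precision@K and Coverage@K as defined in Section 4.2 and re-derives the relative AP changes quoted in the text from the AP column. The function names and the toy pattern lists are hypothetical; only the AP values 0.235, 0.335 and 0.131 are taken from Table 5, and Coverage@K is assumed to be checked against the full set of generated patterns.

```python
def precision_at_k(generated_ranked, ground_truth, k):
    """Fraction of the top-k generated patterns found in the ground truth."""
    return sum(p in ground_truth for p in generated_ranked[:k]) / k


def coverage_at_k(ground_truth_ranked, generated, k):
    """Fraction of the top-k ground-truth patterns that were generated."""
    return sum(p in generated for p in ground_truth_ranked[:k]) / k


def relative_change(before, after):
    """Relative change from `before` to `after` (0.1 means +10%)."""
    return (after - before) / before


# Hypothetical toy rankings, ordered from most to least reliable.
toy_generated = ["[拜為 $Title]", "[遷為 $Title]", "[曰 $Title]"]
toy_ground_truth = ["[拜為 $Title]", "[擢為 $Title]", "[遷為 $Title]"]

toy_p3 = precision_at_k(toy_generated, set(toy_ground_truth), 3)
toy_c3 = coverage_at_k(toy_ground_truth, set(toy_generated), 3)

# AP values from Table 5: iteration 1, iteration 3, iteration 10.
gain_at_3 = relative_change(0.235, 0.335)
drop_at_10 = relative_change(0.235, 0.131)
```

Both toy scores equal 2/3, and the two relative changes round to +42.55% and -44.26%, which matches the figures reported for iterations 3 and 10.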

5     RELATED WORK
In this section, we survey three main topics related to our work. We point out the uniqueness of our study.

5.1     Chinese NLP Techniques
Though robust NLP techniques are often language independent, most of the NLP techniques for Chinese have their own specific characteristics and thus advantages compared to those for English or other Latin-based languages. Unlike Latin-based languages, Chinese does not use white space as a natural delimiter. Therefore word segmentation is always a key precursor for language processing tasks in Chinese [5, 6, 8, 30, 41, 42]. Moreover, due to the lack of morphological features, Chinese Part-of-Speech (POS) tagging and dependency parsing are harder than for Latin-based languages like English. Li et al. [24] proposed joint models for Chinese POS tagging and dependency parsing tasks. As neural methods have recently achieved significant performance with large amounts of annotated data, many deep neural models for Chinese POS tagging and dependency parsing have been developed [9, 21, 33]. Zero Pronoun (ZP) resolution is also a challenging problem in Chinese. Existing studies utilize heuristic rules to resolve ZP issues in Chinese [10, 36]. Recently, supervised neural approaches have been widely explored on many different tasks [7, 37–39].
   However, all these studies focus on modern Chinese text. Classical Chinese is important but has received little attention, even though the majority of precious historical literature was written in classical Chinese hundreds or even thousands of years ago. Doing NLP tasks on classical Chinese is more difficult than on modern Chinese because of the very different writing style and the very limited annotated data. Our approach was the first to curate a person entity profiling dataset for the studies, and we proposed a pattern-based bootstrapping method to extract the attributes of historical actors in ancient China. The extracted high-quality profile information would facilitate history studies. Digital humanities need more attention from both humanity studies and digital technologies.

5.2     Textual Pattern-based Entity Information Extraction Techniques
Given a text corpus, textual patterns leverage statistics (e.g., high frequency) by replacing words, phrases, or entities with symbols such as part-of-speech tags or entity types in order to extract a large collection of tuple-like information [23, 31, 34, 40]. Hearst patterns like “NP such as NP, NP, and NP” were proposed to automatically acquire hyponymy relations from text data [14]. Later, machine learning experts designed the Snowball systems to propagate in plain text for numerous relational patterns [1, 4, 43]. Google’s Biperpedia [12, 13] generated E-A patterns (e.g., “A of E” and “E’s A”) from users’ fact-seeking queries by replacing the entity with “E” and the noun-phrase attribute with “A”. ReNoun [35] generated S-A-O patterns (e.g., “S’s A is O” and “O, A of S,”) from a human-annotated corpus on a predefined subset of the attribute names. Patty used parsing structures to generate relational patterns with semantic types [29]. The recent MetaPAD generated “meta patterns” based on content quality [16]. However, all the patterns in the above methods can only serve for English. Due to the fundamental grammar difference between classical Chinese and English, the above methods no longer work for our problem. Our work has made the first step in the field of pattern-based entity retrieval that is suitable for classical Chinese text.

5.3     Neural Entity Information Extraction
The task of named entity recognition (NER) is typically cast as a sequence labeling problem and solved by supervised learning models. Different from statistical learning methods like conditional random fields (CRF) [19], end-to-end neural network methods have been proposed to solve the problem [15, 17, 20, 27]. Recent work used a language model as another type of supervision signal [25], which can help models obtain more contextual knowledge from the corpus without extra annotation. Open-source pre-trained models have been widely used in entity information extraction tasks. They improved the performance with models pre-trained on massive corpora. Note that all of these models need large amounts of annotated data, which unfortunately we do not have for classical Chinese.

6 CONCLUSIONS
In this paper, we aimed at extracting and profiling historical actors from classical Chinese literature. We addressed the challenge of low-resource language. In this study, we employed domain experts to curate a ground-truth dataset of person entities and their
profile attributes and relations (e.g., courtesy name, place of birth, title, father-son, master-disciple) with handcrafted patterns from two books, Historical Records and Book of Han. We developed a pattern-based bootstrapping approach to extract the information with a very small number of seed patterns. Experimental results showed the effectiveness and limitations of the iterative method.

ACKNOWLEDGMENTS
The authors would like to thank all the funds for their support. This work was supported in part by Notre Dame Research 2019 Global Gateway Faculty Research Award (RGG) FY19RGG03 373106 and NSF Grant CCF-1901059.

REFERENCES
 [1] Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries. ACM, 85–94.
 [2] Liang Cai. 2014. Witchcraft and the Rise of the First Confucian Empire. Albany, NY: State University of New York Press (2014).
 [3] Liang Cai. 2019. Confucians, Social Networks, and Bureaucracy: Donghai Men and Models for Success in the Western Han China (206 BCE–9 CE). Early China (2019).
 [4] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI.
 [5] Pi-Chuan Chang, Michel Galley, and Christopher D Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the third workshop on statistical machine translation. Association for Computational Linguistics, 224–232.
 [6] Wanxiang Che, Zhenghua Li, and Ting Liu. 2010. Ltp: A chinese language technology platform. In Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, 13–16.
 [7] Chen Chen and Vincent Ng. 2016. Chinese zero pronoun resolution with deep neural networks. In ACL. 778–788.
 [8] Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long short-term memory neural networks for chinese word segmentation. In Proceedings of Empirical Methods on Natural Language Processing. 1197–1206.
 [9] Yufei Chen, Sheng Huang, Fang Wang, Junjie Cao, Weiwei Sun, and Xiaojun Wan. 2018. Neural Maximum Subgraph Parsing for Cross-Domain Semantic Dependency Analysis. In Proceedings of the 22nd Conference on Computational Natural Language Learning. 562–572.
[10] Susan P Converse and Martha Stone Palmer. 2006. Pronominal anaphora resolution in Chinese. Citeseer.
[11] Crespigny R. De. 2007. A Biographical Dictionary of Later Han to the Three Kingdoms (23-220 AD). Leiden: Brill (2007).
     (2001).
[20] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.
[21] Haonan Li, Zhisong Zhang, Yuqi Ju, and Hai Zhao. 2018. Neural character-level dependency parsing for Chinese. In AAAI.
[22] Kaiyuan Li. 2000. The Establishment of Han Dynasty and the Liu Bang Group: A Study of the Meritorious Military Class. Beijing: San lian shu dian (2000).
[23] Qi Li, Meng Jiang, Xikun Zhang, Meng Qu, Timothy P Hanratty, Jing Gao, and Jiawei Han. 2018. Truepie: Discovering reliable patterns in pattern-based information extraction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1675–1684.
[24] Zhenghua Li, Min Zhang, Wanxiang Che, Ting Liu, Wenliang Chen, and Haizhou Li. 2011. Joint models for Chinese POS tagging and dependency parsing. In Proceedings of Empirical Methods on Natural Language Processing. Association for Computational Linguistics, 1180–1191.
[25] Liyuan Liu, Jingbo Shang, Xiang Ren, Frank Fangzheng Xu, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In AAAI.
[26] Michael Loewe. 2000. A Biographical Dictionary of the Qin, Former Han and Xin Periods: 221 BC - AD 24. Leiden: Brill (2000).
[27] Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354 (2016).
[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
[29] Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: a taxonomy of relational patterns with semantic types. In Proceedings of Empirical Methods on Natural Language Processing. Association for Computational Linguistics, 1135–1145.
[30] Fuchun Peng, Fangfang Feng, and Andrew McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th international conference on Computational Linguistics. Association for Computational Linguistics, 562.
[31] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of Empirical Methods on Natural Language Processing. 1499–1509.
[32] Hans Van Ess. 1993. The Meaning of Huang-Lao in Shiji and Hanshu. Études chinoises 12, 2 (1993), 161–177.
[33] Wenhui Wang and Baobao Chang. 2016. Graph-based dependency parsing with bidirectional LSTM. In ACL, Vol. 1. 2306–2315.
[34] Xueying Wang, Haiqiao Zhang, Qi Li, Yiyu Shi, and Meng Jiang. 2019. A Novel Unsupervised Approach for Precise Temporal Slot Filling from Incomplete and Noisy Temporal Contexts. In The World Wide Web Conference. ACM, 3328–3334.
[35] Mohamed Yahya, Steven Whang, Rahul Gupta, and Alon Halevy. 2014. Renoun: Fact extraction for nominal attributes. In Proceedings of Empirical Methods on Natural Language Processing. 325–335.
[36] Ching-Long Yeh and Yi-Chun Chen. 2007. Zero Anaphora Resolution in Chinese with Shallow Parsing. Journal of Chinese Language and Computing 17, 1 (2007), 41–56.
                                                                                           [37] Qingyu Yin, Yu Zhang, Weinan Zhang, and Ting Liu. 2017. Chinese Zero Pro-
[12] Rahul Gupta, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu.
                                                                                                noun Resolution with Deep Memory Network. In EMNLP. Association for Com-
     2014. Biperpedia: An ontology for search applications. VLDB 7, 7 (2014), 505–
                                                                                                putational Linguistics, Copenhagen, Denmark, 1309–1318. https://doi.org/10.
     516.
                                                                                                18653/v1/D17-1135
[13] Alon Halevy, Natalya Noy, Sunita Sarawagi, Steven Euijong Whang, and Xiao
                                                                                           [38] Qingyu Yin, Yu Zhang, Weinan Zhang, Ting Liu, and William Yang Wang. 2018.
     Yu. 2016. Discovering structure in the universe of attribute names. In WWW.
                                                                                                Zero Pronoun Resolution with Attention-based Neural Network. In Proceedings
     International World Wide Web Conferences Steering Committee, 939–949.
                                                                                                of the 27th International Conference on Computational Linguistics. Association for
[14] Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text cor-
                                                                                                Computational Linguistics, Santa Fe, New Mexico, USA, 13–23. https://www.
     pora. In Proceedings of the 14th conference on Computational linguistics-Volume
                                                                                                aclweb.org/anthology/C18-1002
     2. Association for Computational Linguistics, 539–545.
                                                                                           [39] Qingyu Yin, Yu Zhang, Wei-Nan Zhang, Ting Liu, and William Yang Wang.
[15] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for
                                                                                                2018. Deep Reinforcement Learning for Chinese Zero Pronoun Resolution. In
     sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
                                                                                                ACL. Association for Computational Linguistics, Melbourne, Australia, 569–578.
[16] Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M Kaplan, Timo-
                                                                                                https://doi.org/10.18653/v1/P18-1053
     thy P Hanratty, and Jiawei Han. 2017. Metapad: Meta pattern discovery from
                                                                                           [40] Wenhao Yu, Zongze Li, Qingkai Zeng, and Meng Jiang. 2019. Tablepedia: Au-
     massive text corpora. In Proceedings of the ACM SIGKDD International Confer-
                                                                                                tomating PDF Table Reading in an Experimental Evidence Exploration and An-
     ence on Knowledge Discovery & Data Mining. ACM, 877–886.
                                                                                                alytic System. In The World Wide Web Conference. ACM, 3615–3619.
[17] Tianwen Jiang, Tong Zhao, Bing Qin, Ting Liu, Nitesh V Chawla, and Meng
                                                                                           [41] Qi Zhang, Xiaoyu Liu, and Jinlan Fu. 2018. Neural networks incorporating dic-
     Jiang. 2019. The Role of “Condition”: A Novel Scientific Knowledge Graph Rep-
                                                                                                tionaries for chinese word segmentation. In AAAI.
     resentation and Construction Model. In Proceedings of the 25th ACM SIGKDD
                                                                                           [42] Wei Zhou, Aiping Wang, Hua Shu, Reinhold Kliegl, and Ming Yan. 2018. Word
     International Conference on Knowledge Discovery & Data Mining. ACM, 1634–
                                                                                                segmentation by alternating colors facilitates eye guidance in Chinese reading.
     1642.
                                                                                                Memory & cognition 46, 5 (2018), 729–740.
[18] Martin Kern. 2003. The ”biography of Sima Xiangru” and the question of the
                                                                                           [43] Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. Stat-
     Fu in Sima Qian’s Shiji. Journal of the American Oriental Society 123, 2 (2003),
                                                                                                Snowball: a statistical approach to extracting entity relationships. In WWW.
     303–316.
                                                                                                ACM, 101–110.
[19] John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional
     random fields: Probabilistic models for segmenting and labeling sequence data.