
A Study of Person Entity Extraction and Profiling from Classical Chinese Historiography

Yihong Ma

yihongma97@gmail.com

Qingkai Zeng

Tianwen Jiang

Liang Cai

Meng Jiang


ABSTRACT
While historians are interested in demographic and social network information of historical actors in the early Chinese empires (841 BC-1911 AD), very few studies have been done on entity retrieval from classical Chinese historiography. The key challenge lies in the low resource of the language: deep learning requires large amounts of annotated data and becomes impracticable when such data is not available. In this study, we employ domain experts (history professors) to curate a set of person entities and their profile attributes (e.g., courtesy name, place of birth, title) and relations (e.g., father-son, master-disciple) from two books, Records of the Grand Historian and Book of Han. We develop a pattern-based bootstrapping approach to extract the information with a very small number (i.e., 1 or 2) of seed patterns. Experimental results show the effectiveness as well as the limitations of the iterative method. We would appreciate research of digital humanities to address the challenges in entity retrieval from low-resource languages.

CCS CONCEPTS
• Information systems → Data mining; Information extraction.

KEYWORDS
Information extraction, Entity profiling, Classical Chinese, Textual pattern, Bootstrapping

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Attribute | Our approach | Truth
Courtesy name | 长卿 | 长卿
Hometown | 東海蘭陵 | 東海蘭陵
Title(s) | 郎, 丞相掾, 名之曲臺署長 | 郎, 丞相掾
Father | 孟卿 | 孟卿
Son | N/A | N/A
Master(s) | 田王孫, 同郡碭田王孫 | 田王孫
Disciple(s) | 沛翟牧子兄, 同郡白光少子, 疏廣, 后蒼 | 翟牧, 白光, 趙賓, 焦延壽
Table 1: The task is to extract person entities and their profiling attributes from classical Chinese text. Our approach can find most of the attribute values correctly (i.e., marked by underlines), compared with the ground truth annotated by history professors.

1 INTRODUCTION
Historians are interested in the classical Chinese literature as it witnessed the rise and fall of the early empires and dynasties [2, 3, 18, 32]. Currently, they have to spend a large amount of time reading those books and digging out who came from where, who did what, who studied from whom, and so on. Before that, they had to spend an even longer time learning the classical Chinese language, even though some of them are (absolutely modern) Chinese [11, 22, 26]. Therefore, the idea of utilizing digital technologies to extract person entities and their profiling attributes from classical Chinese text becomes promising and exciting in the community, as natural language processing (NLP) and entity retrieval have been developing and accelerating at an unprecedented speed today.

However, it is a rather challenging task due to the lack of annotated data. Historians write papers and publish books, but rarely build entity databases about ancient China. As we know, NLP is being revolutionized by deep learning with neural networks. However, deep learning requires large amounts of annotated data, and its advantage over traditional statistical methods typically diminishes when such data is not available. How to address the issue of low resource in the task of entity extraction and profiling from classical Chinese text is still an open problem.

In this study, we collect a ground-truth dataset for evaluating on the task, propose a pattern-based information extraction approach that requires very limited prior knowledge of classical Chinese, and conduct experiments to show its effectiveness and limitations.

First, we recruit domain experts (history professors) to annotate person entities and their profiles from two classical Chinese books, Records of the Grand Historian (authored by Sima Qian, completed in c. 86 BC) and Book of Han (authored by Ban Gu, completed in 111 AD). Table 1 gives an example of the annotated profile of Meng Xi 孟喜 (90 BC–40 BC). We focus on three attributes (i.e., courtesy name, place of birth, positions/titles) and two relations (i.e., father-son, master-disciple), because (1) these are main contents in the work of Chinese historiography and (2) historians are very interested in how the government mechanisms were influenced by family and master-disciple relationships in the ancient time. The domain experts generated fifty handcrafted patterns to extract the above information. They validated the attribute values and assessed the reliability of the patterns (see Table 4). Moreover, they annotated 15 complete person profiles (with 158 attribute values) that they feel the most interested in. This dataset can serve as

ground truth for evaluating such information extraction methods on the classical Chinese literature.

Person | Courtesy name | Hometown | Title(s) | Father | Son(s) | Master | Disciple(s)
Meng Xi 孟喜 | 長卿 | 東海蘭陵 | 郎, 曲臺署長, 丞相掾 | 孟卿 | N/A | 田王孫 | 趙賓, 白光, 翟牧, 焦延壽
Meng Qing 孟卿 | N/A | 東海 | N/A | N/A | 孟喜 | 蕭奮 | 后倉, 閭丘卿, 疏廣
Yan An Le 顏安樂 | 公孫 | 魯國薛 | 齊郡太守丞, 大司農 | 眭孟姊 | N/A | 眭孟 | 泠豐, 任公, 冥都, 筦路
Zhang Yu 張禹 | 子文 | 河內軹 | 郡文學, 光祿大夫, 東平內史 | N/A | N/A | 施讎 | 彭宣, 戴崇
Table 2

Historiography Book | # Sentences | # Words
Records of the Grand Historian | 32,564 | 615,457
Book of Han | 40,114 | 874,165
Table 3: Statistics of the dataset (two books).

Second, we propose a bootstrapping approach to extract person entities and profiles requiring very little prior knowledge of the language. The algorithm starts from only one or two simple seed patterns, finds the attribute values, and then uses them to discover more complicated patterns. It has an estimator to assess the reliability of patterns during the iterative process. So, the new attribute values extracted from more reliable patterns are more likely to be trustworthy and can be used to infer patterns in next iterations.

Experiments show that textual patterns achieve an F1 score of 0.851 on 15 ground-truth person profiles. Table 1 shows a comparison between the generated profile (left) and the ground-truth profile (right) of Meng Xi 孟喜. On the other side, the bootstrapping method achieves the highest performance after 7 iterations to find a set of related patterns and extract person-title pairs, while meeting barriers to find more patterns for other attributes and relations.

We summarize our contributions in this study as follows:
• New dataset: We recruit history professors to curate a set of person profiles from classical Chinese literature.
• New approach: We develop a bootstrapping method based on textual patterns to extract the person entities and attributed information, requiring little prior knowledge.
• Effectiveness: Experiments show that textual patterns make an F1 score of 0.851 on 15 person profiles annotated by the domain experts. Limitations are discussed in Section 4.3.

2 ENTITY EXTRACTION AND PROFILING
2.1 Person Entity Extraction
It sounds like a subtask of the standard named entity recognition (NER) task: it narrows down from recognizing multiple types of entities (i.e., persons, locations, organizations) to only one type. However, it has to face a challenge when put into the classical Chinese text: in classical Chinese literature, there are many different ways of mentioning a specific person. A person has first name, last name (family name), and courtesy name; and he is also recognized by his hometown and title/position in the government. For the sake of readability, let us take the President of United States Donald J. Trump as an example. "Donald" is his first name, and suppose "John" (J.) can be considered as his courtesy name. (Ancient Chinese people do not have middle name. They have courtesy name.) He was born in New York. So, all the following could be used in the classical Chinese literature to mention President Trump:
• donald,
• trumpdonald,
• presidentdonald,
• trumpdonaldjohn,
• newyorktrumpdonald,
• newyorktrumpdonaldjohn.
Note that there would be no white-space (nor upper-case) to split the words. Complicated patterns have to be designed or recognized in the extraction methods.

[Figure 1: Diagram of one iteration of the pattern-based bootstrapping method (see Section 3.2). Panel labels: Seed Patterns; Seed Entities; Generate and rank textual patterns; Expand seed patterns and seed entities.]

2.2 Person Entity Profiling
Given the classical Chinese historiography, the task of person entity profiling aims to extract demographic attributes (e.g., courtesy name, place of birth, title) and social relations (e.g., father-son, master-disciple) and to generate a complete profile for the person entities extracted in Section 2.1.

Table 2 shows examples of the profiles of Confucians in the Han Dynasty such as Meng Xi 孟喜, Meng Qing 孟卿, Yan An Le 顏安樂, and Zhang Yu 張禹. Some of the attribute values are "N/A" if not available in the text corpora.

There are two typical challenges of this task. One is the variety of demographic attributes. Each type of attributes needs a set of specific, reliable extractors, which requires prior knowledge of the classical Chinese language. The other challenge is typical for Chinese historiography: Zero Pronoun (ZP), which stands for pronouns that are omitted when they are pragmatically or grammatically inferable from the context. Here is an example taken from Records of the Grand Historian, where the ZPs (denoted as φ) all refer to "Mr. Chunshen 春申君":

[春申君] 者，φ 楚人也，φ 名歇，φ 姓黄氏。φ 游學博聞，φ 事楚頃襄王。
(Translation: [Mr. Chunshen], φ was born in Chu, φ's first name is Xie, φ's family name is Huang. φ travelled over the country and enriched his knowledge, φ served King Qingxiang of Chu.)

This sentence indicates three attributes (i.e., hometown, first name, last name) and one relation (i.e., master-disciple) about Mr. Chunshen 春申君. However, ZP makes it challenging to link the attributes and relations in the context with the person entity. Moreover, we observe that ZPs occur not only in the same sentence with the mention of the person entity but also across several sentences in the same paragraph. In biographical historiography, each chapter discussed the life story of a certain person. So, we adopt the following assumption (learned from history professors) to resolve

the ZP issue: given a paragraph, as long as a person entity was extracted in the first clause, the ZPs in every clause of the paragraph refer to that person entity. This will help us propose an approach to extract person-attribute/relation pairs when the extractors were only able to find the attributes and relations in local contexts.

3 THE PROPOSED APPROACH
In this section, we first introduce how the dataset was curated with handcrafted patterns by domain experts. Next, we present a pattern-based bootstrapping method to find the entity information with a small number of seed patterns.

3.1 Data Curation with Handcrafted Patterns
We curate a dataset of two classical Chinese historiography books, Records of the Grand Historian (authored by Sima Qian, completed in c. 86 BC) and Book of Han (authored by Ban Gu, completed in 111 AD). Table 3 lists the statistics of the dataset.

3.1.1 Patterns for person entity extraction. The domain experts we recruited to annotate the data contribute the following patterns to recognize mentions of person entities:
• $FirstName,
• $LastName + $FirstName,
• $Title + $FirstName,
• $LastName + $FirstName + $CourtesyName,
• $Hometown + $LastName + $FirstName,
• $Hometown + $LastName + $FirstName + $CourtesyName.

3.1.2 Patterns for entity profiling. Table 4 presents 50 textual patterns that were used to extract a set of candidates of person's attribute or relation values. Some attributes such as hometown and father-son have a small number of patterns. Some such as title and master-disciple have a large number of patterns. The domain experts also annotated whether the attribute values and relations are true or false. For each pattern, we give three numbers associated with its extractions:
• #Values: The number of (person entity, attribute or relation value)-pairs extracted by the pattern.
• #True Values: The number of true pairs annotated by the domain experts.
• Reliability: It describes whether a pattern is reliable for extracting true values. It is calculated as

$$ \text{Reliability} = \frac{\#\text{True Values}}{\#\text{Values}}, \tag{1} $$

which gives a score between 0 and 1.

Specifically, for person, pattern [$Person 者] was the first pattern that the domain experts came up with. "者" is a typical symbol in classical Chinese that indicates the appearance of a person. The person entities extracted by pattern [$Person 者] may be in any of the 6 forms of person entity mentions in Section 3.1.1. Another frequent pattern is [$Person 字 $CourtesyName]. Unlike pattern [$Person 者], person entities extracted by pattern [$Person 字 $CourtesyName] strictly follow the form of last name + first name. It is a more reliable pattern. As the table shows, the reliability of pattern [$Person 字 $CourtesyName] is 1, although it extracts fewer person-value pairs (205 vs. 299), while the reliability of pattern [$Person 者] is only 0.6087.

Most of the patterns for hometown, father-son, and master-disciple are highly reliable (higher-than-0.96 reliability), except patterns [，$Father 子] (ID 38) and [，事 $Master] (ID 46) of reliability 0.8571 and 0.8356, respectively. Among the 28 patterns for attribute title, only 4 patterns have reliability of lower than 0.8 and only one has a reliability score of lower than 0.7, i.e., [至 $Title] (ID 36). Among all the 50 handcrafted patterns, 35 (70%) patterns have reliability score of 1; 5 (10%) patterns have reliability score of lower than 0.8.

3.2 Pattern-based Bootstrapping
We propose a new approach to extract person entities and profiles from classical Chinese historiography requiring very little prior knowledge of the language. Generally, it is an iterative method that uses textual patterns to extract attribute or relation values from text data. Figure 1 shows the diagram of one iteration in the

pattern-based bootstrapping method. It starts with only one or two simple seed patterns for each attribute. Because the number of seed patterns is small, it would not take much effort to find one or two. For example, [遷為 $Title] (i.e., [relegated to $Title]) and [補 $Title] (i.e., [filled in the position of $Title]) were the two reliable seed patterns for the attribute title. The iterative method runs the following steps until convergence.

Attribute | Pattern | Example
person | $Person 者 | 陳丞相平者
person | $Person 字 $CourtesyName | 王莽字巨君
person | $Person，$Hometown 人也 | 申屠嘉，梁人也
person | $Person，$Hometown 人 | 朝鮮王滿，燕人
hometown | ，$Hometown 人也 | ，陽城人也
hometown | ，$Hometown 人 | ，高陽人
hometown | 徙 $Hometown | 自下邑徙平陵
courtesy name | ，字 $CourtesyName | ，字長卿
title | 拜為 $Title | 拜為上卿
title | 拜 $Person 為 $Title | 拜仁為郎中令
title | 遷 $Title | 遷東平太傅
title | 遷為 $Title | 起遷為國尉
title | 遷 $Person 為 $Title | 遷廣明為淮陽太守
title | 遷至 $Title | 稍遷至栘中廄監
title | 封為 $Title | 綰封為長安侯
title | 封 $Person 為 $Title | 孝景後三年封蚡為武安侯
title | 召為 $Title | 復召為郎
title | 召 $Person 為 $Title | 於是上召寧成為中尉
title | 補 $Title | 以選除補御史掾
title | 察… 為 $Title | 以郡吏察廉為樓煩長
title | 舉為 $Title | 後以御史舉為鄭令
title | 舉… 為 $Title | 復舉賢良為河南令
title | 擢為 $Title | 擢為光祿大夫
title | 擢 $Person 為 $Title | 因擢延壽為諫大夫
title | 徵為 $Title | 徵為廄丞
title | 徵 $Person 為 $Title | 徵由為大鴻臚
title | 徙為 $Title | 徙為頻陽令
title | 徙 $Person 為 $Title | 徙立為太原太守
title | 復為 $Title | 後復為淮陽都尉
title | 以 $Title 察 | 以郡吏察廉為樓煩長
title | 薦為 $Title | 薦為議郎
title | 薦 $Person 為 $Title | 薦宣為長安令
title | 贖為 $Title | 贖為庶人
title | 立為 $Title | 自立為代王
title | 為 $Title | 為駙馬都尉侍中
title | $Person 為 $Title | 禹為丞相史
title | 至 $Title | 至中大夫
father-son | ，$Father 子也 | ，秦莊襄王子也
father-son | ，$Father 子 | ，文公少子
father-son | ，其父 $Father | ，其父高祖中子
father-son | ，父 $Father | ，父號孟卿
father-son | $Son 父曰 $Father | 悼侯父曰隱太子友
master-disciple | 從 $Master 受… | 從太中大夫京房受易
master-disciple | 事 $Master 受… | 又事前將軍蕭望之受論語
master-disciple | $Master 授 $Disciple | 常授梁蕭秉君房
master-disciple | ，授 $Disciple | ，授翼奉、蕭望之、匡衡
master-disciple | ，事 $Master | ，事太傅夏侯勝
master-disciple | 事 $Master 為 $Title | 事梁孝王為中大夫
master-disciple | 弟子… 者，$Master | 弟子遂之者，蘭陵褚大，東平嬴公
master-disciple | 受… 於 $Master | 嘗受韓子、雜家說於騶田生所
master-disciple | 與 $Person 俱事 $Master | 與顏安樂俱事眭孟
Table 4: Handcrafted textual patterns (attribute, pattern, example).

Step 1: Generating pattern candidates. Candidate patterns are generated using contextual features of the target value in the clause. We find that target values are more likely to be at the end of the clause because of the linguistic structure. Therefore, the commonly used skip-gram contextual pattern "$w_{-1}$ ____ $w_1$" [28] would not work for our task. Instead, we explore two different kinds of contextual features described as follows:
• $Pattern $Value. The textual pattern is a window of a certain size of Chinese characters before a target value. For example, if the target value is $Title, we can find the pattern candidate [遷為 $Title] (i.e., [relegate to $Title]), when the window size is 2.
• $Pattern $Entity $Pattern $Value. Both a window of one Chinese character before $Entity and all characters between $Entity and $Value are selected as the contextual feature. For example, if $Title is the target value and $Person is the entity that has already been extracted in Section 3.1.1, we can find a new pattern candidate [遷 $Person 為 $Title] (i.e., [relegate $Person to $Title]).

Step 2: Ranking pattern candidates. It is nontrivial to rank the quality of pattern candidates. Considering all the unlabeled entities as false has two serious issues: (1) it penalizes reliable patterns that extracted true unlabeled values and (2) it cannot penalize unreliable patterns that extracted false unlabeled values. To address these issues, we define the estimation score of pattern reliability as follows:

$$ r(p) = w_1 \cdot \frac{\sum_{v \in V_p} \left(1 - \min_{v^+ \in V^+} d(v, v^+)\right)}{\sum_{v \in V_p} \mathrm{freq}(v)} + w_2 \cdot \left(1 - \frac{\max_{v \in V_p} \mathrm{freq}(v)}{\sum_{v \in V_p} \mathrm{freq}(v)}\right) \in [0, 1], $$

where $p$ is a textual pattern, $v$ is a value string, $v^+$ is a true value string, $V_p$ is the set of unique value strings extracted by pattern $p$, $V^+$ is the set of unique true value strings, $d(v_1, v_2)$ is the normalized Hamming distance between the two value strings, and $\mathrm{freq}(v)$ is the frequency of the value string $v$. $w_1$ and $w_2$ are weights: $w_1 + w_2 = 1$.

The estimator includes two kinds of features:
(1) The textual similarity between the pattern's extracted values and true values: If the value a pattern extracted is very similar to one true value, the value is likely to be true and the pattern is likely to be reliable. For example, suppose "Tai Shou 太守", the name of an official position, has been in the set of true values (as $Title). Then the value "Nan Yang Tai Shou 南陽太守" extracted by a pattern, which means the Tai Shou 太守 ruling a place called Nan Yang 南陽, is likely to be a good value (as $Title). We use Hamming distance as the metric to measure the similarity between two value strings. Hamming distance is defined as the minimum number of substitutions required to change one string into the other.
(2) Variety of the pattern's extracted values: A pattern would be more reliable if it extracted more true values. Besides the frequency, we try another measurement: we assume that if there was a value whose frequency dominates the set of values one pattern extracted, the pattern would be less reliable. So we use 1 minus the ratio of the count of the most frequent value over all the value counts.

Step 3: Selecting new patterns and extracting new values for the next iteration. For each pattern, we calculated the reliability score $r(p)$ and the frequency of values that it extracted. For the next iteration, we first filter out the patterns whose frequency is below a threshold and then select top patterns of the highest $r(p)$. After that, we expand the set of true values $V^+$ by adding the values extracted by the new patterns.

4 EXPERIMENTS
In this section, we first evaluate the quality of handcrafted patterns given by domain experts. Then we evaluate the effectiveness of the bootstrapping method. Finally, we discuss the limitations.

4.1 Evaluating the Handcrafted Patterns
Here we conduct experiments to answer: do the handcrafted patterns extract correct person entities, attributes, and relations?

Evaluation methods. We use the 15 complete person profiles (with 158 values) as ground truth. We use standard Information Retrieval metrics: Precision, Recall, and F1 score. Precision is the fraction of true attribute or relation values (i.e., values that find a match in the corresponding attribute in ground-truth profiles) among all values extracted by handcrafted patterns. Recall is the fraction of true attribute or relation values among all ground-truth values. F1 score is the harmonic mean of Precision and Recall.

Evaluation results. Regarding the entity profiles, handcrafted textual patterns achieve a Precision of 0.901, a Recall of 0.803, and an F1 score of 0.851. Table 1 shows a comparison between the generated profile (left) and the ground-truth profile (right) of Meng Xi 孟喜, where most of the values extracted are correct. We also find the following limitations of the handcrafted patterns. First, the different forms of person entity mentions make entity linking (i.e., mention alignment) difficult. For example, in the master-disciple relation of Meng Xi 孟喜, "同郡白光少子", i.e., Bai Guang 白光 whose courtesy name is Shao Zi 少子 from the same ("同") hometown ("郡") as Meng Xi's, extracted by handcrafted patterns should refer to Bai Guang 白光 in the annotation. Second, the ZP problem was resolved in most of the cases but may still assign attributes or relations to wrong persons. For example, in the master-disciple relation, Hou Cang 后蒼 and Shu Guang 疏廣 are indeed disciples of Meng Xi 孟喜's father Meng Qing 孟卿, but are mistakenly regarded as the disciples of Meng Xi 孟喜 due to the assumption.

4.2 Evaluating the Effectiveness of the Bootstrapping Method
We conduct experiments to see if the bootstrapping method can find the set of handcrafted patterns with only one or two seed patterns and see if the attribute values can be accurately extracted.
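To make the first kind of contextual feature from Step 1 in Section 3.2 concrete, here is a minimal Python sketch of [$Pattern $Value] candidate generation. It is an illustration under stated assumptions, not the authors' implementation: the function name `window_candidates` is hypothetical, text is assumed to be pre-split into clauses, and a known value is only matched when it sits at the very end of a clause (a simplification of the observation that target values tend to appear at the end of the clause).

```python
from collections import Counter


def window_candidates(clauses, known_values, window=2):
    """Count [$Pattern $Value] candidates.

    For every clause that ends with an already-known value, the `window`
    characters immediately before the value are taken as a candidate pattern.
    Assumption (not from the paper): values are matched only at clause end.
    """
    counts = Counter()
    for clause in clauses:
        for value in known_values:
            prefix_len = len(clause) - len(value)
            if clause.endswith(value) and prefix_len >= window:
                start = prefix_len - window
                counts[clause[start:prefix_len]] += 1
    return counts


# Toy clauses taken from the examples in Table 4; the known titles are
# assumed to have been extracted by seed patterns in an earlier iteration.
clauses = ["起遷為國尉", "後復為淮陽都尉", "擢為光祿大夫"]
known_titles = {"國尉", "光祿大夫"}
candidates = window_candidates(clauses, known_titles, window=2)
```

With a window of 2, the clause 起遷為國尉 yields the candidate 遷為, matching the [遷為 $Title] example in Step 1. The clause 後復為淮陽都尉 contributes nothing in this toy run because its title is not yet in the known set.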

Parameter settings. We set the window size as 2. The frequency threshold of patterns is 10. The number of top patterns selected per iteration is 10. We run until convergence but just report the first 10 iterations for the sake of space. The weights of pattern reliability features are $w_1 = w_2 = 0.5$.
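The reliability estimator from Step 2 in Section 3.2 can be sketched in Python with the weights above. This is one possible reading rather than the reference implementation. The paper does not say how the Hamming distance is normalized for strings of unequal length, so the sketch assumes right-alignment with padding (titles such as 南陽太守 and 太守 share a suffix) and division by the longer length. The similarity term is divided by the total value frequency, which is an interpretation of the estimator's definition. The function names and the toy counts are hypothetical.

```python
def normalized_hamming(a, b):
    """Hamming distance divided by the longer length.

    Assumption: the shorter string is left-padded so that the two strings
    are right-aligned; padded positions always count as mismatches.
    """
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a.rjust(n, "\0"), b.rjust(n, "\0")
    return sum(x != y for x, y in zip(a, b)) / n


def reliability(freq, true_values, w1=0.5, w2=0.5):
    """Estimate r(p) for one pattern.

    freq: dict mapping each unique extracted value string to its frequency.
    true_values: set of value strings currently believed to be true.
    """
    if not freq or not true_values:
        return 0.0
    total = sum(freq.values())
    # Feature 1: similarity of each extracted value to its closest true value.
    similarity = sum(
        1.0 - min(normalized_hamming(v, t) for t in true_values) for v in freq
    )
    # Feature 2: variety, i.e. 1 minus the share of the most frequent value.
    variety = 1.0 - max(freq.values()) / total
    return w1 * similarity / total + w2 * variety


# Toy example with made-up counts for a single pattern.
extracted = {"南陽太守": 2, "太守": 1, "國尉": 1}
score = reliability(extracted, {"太守"})
```

In this toy case the similarity contributions are 0.5, 1.0 and 0.0, the total frequency is 4, and the most frequent value accounts for half of the extractions, so the score is 0.5 × 0.375 + 0.5 × 0.5 = 0.4375.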

Evaluation methods. Here are the metrics for the two tasks.

Task of pattern extraction: We evaluate the performance on extracting patterns for the title attribute. We use the metric Precision@K, which is the fraction of top-K scored generated patterns that are in the ground-truth pattern set. We also define a new metric Coverage@K for the task, which is the fraction of top-K scored ground-truth patterns that are extracted by the bootstrapping method. The generated patterns were scored by the reliability estimation in Step 2 in Section 3.2 and the ground-truth patterns were scored by the reliability in Table 4. Average precision (AP) computes the average precision value for coverage over 0 to 1.
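The two ranking metrics can be written down directly from their definitions. The Python sketch below is illustrative only: the function names are hypothetical, both rankings are assumed to be sorted from highest to lowest score, the fraction is taken over K even when fewer than K items are available, and the pattern lists are toy data rather than results from the paper.

```python
def precision_at_k(generated_ranked, truth_set, k):
    """Fraction of the top-k generated patterns found in the ground-truth set."""
    top = generated_ranked[:k]
    return sum(p in truth_set for p in top) / k


def coverage_at_k(truth_ranked, generated_set, k):
    """Fraction of the top-k ground-truth patterns recovered by bootstrapping."""
    top = truth_ranked[:k]
    return sum(p in generated_set for p in top) / k


# Toy rankings (highest score first); "太子" stands in for a spurious pattern.
generated = ["遷為", "擢為", "太子"]
truth = ["拜為", "遷為", "擢為", "至"]

p_at_3 = precision_at_k(generated, set(truth), 3)
c_at_2 = coverage_at_k(truth, set(generated), 2)
```

Here two of the three generated patterns are in the ground truth, so Precision@3 is 2/3, and only one of the two best ground-truth patterns was generated, so Coverage@2 is 0.5.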

Task of person-title pair extraction: We first assign a confidence score to each person-title pair by weighting the reliability scores of the textual patterns that extract the person and the title, respectively. We evaluate the person-title pairs extracted by the bootstrapping method at different numbers of iterations with Precision-Recall curves. Precision is the fraction of true person-title pairs among all person-title pairs generated by handcrafted patterns. Recall is the fraction of true person-title pairs among all 516 ground-truth person-title pairs. AUC is the area under the curve.

Results on pattern extraction. The bootstrapping method had been improving the performance of pattern extraction since it started, while after certain iterations the performance turned to be worse. From Table 5, running the bootstrapping algorithm for 3 iterations can increase AP by 42.55%, compared to running only one iteration. After around 5 iterations, AP displays a continuous trend of declining and iteration 10 gives the lowest AP of 0.131, which is a decrease of 44.26% from iteration 1. What's more, Coverage@K no longer updates after 7 iterations. It indicates that the bootstrapping may meet certain barriers in extracting more patterns.

# of iterations | P@3
1 | 0.667
2 | 0.667
3 | 1.000
4 | 0.667
5 | 0.667
6 | 0.333
7 | 0.333
8 | 0.333
9 | 0.000
10 | 0.000
Table 5: Results on pattern extraction (P@3 by number of iterations).

After observing the result patterns, we can infer some limitations on pattern extraction of the pattern-based bootstrapping method:

First, there exist many patterns with either a relatively low frequency (i.e., less than 20) or lack of interpretability (i.e., patterns with scarcely any actual semantic meaning but somehow capable of extracting "good" entities, which are still considered "good" by our method) that tend not to be found by our domain experts, which we should be reasonably tolerant of.

Second, the pattern-based bootstrapping method is not good at abstracting the first type of contextual patterns mentioned in Step 1 in Section 3.2. Human experts can easily generalize patterns with a v. + prep. structure that are composed of different verbs but the same preposition into one super-category prep. For example, it is reasonable for domain experts to find the common feature of patterns like [拜為 $Title], [擢為 $Title], [舉為 $Title], etc., all of which mean [promote to $Title], and generalize them into pattern [為 $Title] (i.e., [to $Title]). However, the bootstrapping method tends not to capture such abstraction of patterns and therefore generates a subset of certain ground-truth patterns, which pulls down the evaluation metric.

Figure 2: The performance of the bootstrapping method on person-title pair extraction gradually improved through iterations. AUC increased from 0.107 to 0.207 (in iteration 7) and decreased after the point.

Results on person-title pair extraction. Running the bootstrapping method for more iterations generally increases the performance of person-title pair extractions, while after certain iterations the performance starts to shrink. From Figure 2, AUC keeps increasing in the first 7 iterations, achieving a maximum of 0.207 in iteration 7, and then begins to decrease in the last 3 iterations.

Why were the Recall scores consistently low? Pattern ID 11, ID 34 and ID 36 from Table 4 are not found by the bootstrapping method due to the setting of a window size of 2. Therefore, values extracted by those patterns, which occupy 45% of the total true values, will never be found.

Why were many false person-title pairs included after iteration #8? Domain experts have also designed stop words for each handcrafted pattern, which are capable of screening out common noises with certain patterns. But for the bootstrapping method, those noises extracted by the patterns are still regarded as true.

4.3 Discussions
We find that the bootstrapping method can work only on extracting attribute values of $Title. The values of $Title could be shared by multiple patterns' extractions because multiple people can be assigned to the same position in the government. Only when the values are shared can we find one pattern with another by bootstrapping. However, one person cannot have multiple fathers and rarely has multiple masters. By now, we have only investigated the pattern-based bootstrapping method in Section 3.2 on the attribute in the task of attribute discovery. The preliminary of this method lies in the fact that there should exist some entities that could be extracted by multiple patterns, which makes it possible to find new patterns through pattern generation. However, for the task of relation extraction (e.g., father-son and master-disciple), since each relation pair is unique in the text, there is not a pattern shown in Table 4 that shares even a single common instance that could also be extracted by other patterns in the same category, which makes it hard for the instance-level bootstrapping method to work.

5 RELATED WORK
In this section, we survey three main topics related to our work. We point out the uniqueness of our study.

5.1 Chinese NLP Techniques
Though robust NLP techniques are often language independent, most of the NLP techniques for Chinese have their own specific characteristics and thus advantages compared to those for English or other Latin-based languages. Unlike Latin-based languages, Chinese languages do not use white-space as the natural delimiter. Therefore word segmentation is always a key precursor for language processing tasks in Chinese [5, 6, 8, 30, 41, 42]. Moreover, due to lack of morphological features, Chinese Part-of-Speech (POS) tagging and dependency parsing are harder than in Latin-based languages like English. Li et al. [24] proposed joint models for Chinese POS tagging and dependency parsing tasks. As neural methods have recently achieved significant performance with large amounts of annotated data, many deep neural models for Chinese POS tagging and dependency parsing have been developed [9, 21, 33]. Zero Pronoun (ZP) resolution is also a challenging problem in Chinese. Existing studies utilize heuristic rules to resolve ZP issues in Chinese [10, 36]. Recently, supervised neural approaches have been vastly explored on many different tasks [7, 37–39].

However, all these studies focus on modern Chinese text. Classical Chinese is important but was paid little attention, as the majority of precious historical literature was written in classical Chinese hundreds or even thousands of years ago. Doing NLP tasks on classical Chinese would be more difficult than on modern Chinese, because of the very different written style and very limited annotated data. Our approach was the first to curate a person entity profiling dataset for the studies and we proposed a pattern-based bootstrapping method to extract the attributes of historical actors in ancient China. The extracted high quality profile information would facilitate history studies. Digital humanities need more attention from both humanity studies and digital technologies.

5.2 Textual Pattern-based Entity Information Extraction Techniques
Given a text corpus, textual patterns leverage statistics (e.g., high frequency) by replacing words, phrases, or entities with symbols such as part-of-speech tags or entity types in order to extract a large collection of tuple-like information [23, 31, 34, 40]. Hearst patterns like "NP such as NP, NP, and NP" were proposed to automatically acquire hyponymy relations from text data [14]. Later, machine learning experts designed the Snowball systems to propagate in plain text for numerous relational patterns [1, 4, 43]. Google's Biperpedia [12, 13] generated E-A patterns (e.g., "A of E" and "E's A") from users' fact-seeking queries by replacing entity with "E" and noun-phrase attribute with "A". ReNoun [35] generated S-A-O patterns (e.g., "S's A is O" and "O, A of S,") from human-annotated corpus on a predefined subset of the attribute names. Patty used parsing structures to generate relational patterns with semantic types [29]. The recent MetaPAD generated "meta patterns" based on content quality [16]. However, all the patterns in the above methods can only serve for English. Due to the fundamental grammar difference between classical Chinese and English, the above methods no longer work for our problem. Our work has made the first step in the field of pattern-based entity retrieval that is suitable for classical Chinese text.

5.3 Neural Entity Information Extraction
The task of named entity recognition (NER) is typically cast as a sequence labeling problem and solved by supervised learning models. Different from statistical learning methods like conditional random fields (CRF) [19], end-to-end neural network methods have been proposed to solve the problem [15, 17, 20, 27]. Recent work used language models as another type of supervised signals [25], which can help models obtain more contextual knowledge from corpus without extra annotation. Open source pre-trained models have been widely used in the entity information extraction tasks. They improved the performance with models pre-trained on massive corpora. Note that all the models need large amounts of annotated data, which unfortunately we do not have in classical Chinese.

6 CONCLUSIONS
In this paper, we aimed at extracting and profiling historical actors from classical Chinese literature. We addressed the challenge of low-resource language. In this study, we employed domain experts to curate a ground-truth dataset of person entities and their profile attributes and relations (e.g., courtesy name, place of birth, title, father-son, master-disciple) with handcrafted patterns from two books, Records of the Grand Historian and Book of Han. We developed a pattern-based bootstrapping approach to extract the information with a very small number of seed patterns. Experimental results showed the effectiveness and limitations of the iterative method.

ACKNOWLEDGMENTS
The authors would like to thank all the funds for their support. This work was supported in part by Notre Dame Research 2019 Global Gateway Faculty Research Award (RGG) FY19RGG03 373106 and NSF Grant CCF-1901059.

REFERENCES
[1] … large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries. ACM, 85–94.
[2] Liang Cai. 2014. Witchcraft and the Rise of the First Confucian Empire. Albany, NY: State University of New York Press (2014).
[3] Liang Cai. 2019. Confucians, Social Networks, and Bureaucracy: Donghai Men and Models for Success in the Western Han China (206–9 BCE). Early China (2019).
[4] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI.
[5] Pi-Chuan Chang, Michel Galley, and Christopher D Manning. 2008. Optimizing …
[19] … (2001).
[20] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.
[21] Haonan Li, Zhisong Zhang, Yuqi Ju, and Hai Zhao. 2018. Neural character-level dependency parsing for Chinese. In AAAI.
[22] Kaiyuan Li. 2000. The Establishment of Han Dynasty and the Liu Bang Group: A Study of the Meritorious Military Class. Beijing: San lian shu dian (2000).
[23] …
[30] Fuchun Peng, Fangfang Feng, and Andrew McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th international conference on Computational Linguistics. Association for Computational Linguistics, 562.
[31] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of Empirical Methods on Natural Language Processing. 1499–1509.
[…] … End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

Li ,

Meng

Jiang , Xikun Zhang, Meng Qu, Timothy P Hanraty, Jing Gao , and Ji-

awei Han . 2018 . Truepie: Discovering reliable paterns in patern-based informa-

tion extraction . PIrnoceedings of the 24th ACM SIGKDD International Conference

on Knowledge Discovery & Data Minin. gACM , 1675 - 1684 . [24]

Zhenghua

Li , Min Zhang, Wanxiang Che, Ting Liu, Wenliang Chen, and Haizhou

Li . 2011 . Joint models for Chinese POS tagging and dependency parsing . In

for Computational Linguistics , 1180 - 1191 . [25] Liyuan

Liu

, Jingbo Shang, Xiang Ren, Frank Fangzheng Xu, Huan Gui, Jian Peng,

and Jiawei

Han . 2018 . Empower sequence labeling with task-aware neural lan-

neural networks . AInCL . 778 - 788 . [8]

Xinchi

Chen , Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and

Xuanjing

Huang . 2015 .

Proceedings of Empirical Methods on Natural Language Process . in11g97 - 1206 . [9]

Yufei

Chen , Sheng Huang, Fang

Wang

, Junjie Cao, Weiwei Sun, and Xiaojun

Wan . 2018 . Neural Maximum Subgraph Parsing for Cross-Domain Semantic

Dependency

Analysis . InProceedings of the 22nd Conference on Computational

Natural Language

Learni n .g562- 572 . [10] Susan

Converse and Martha Stone

Palmer. 2P0r0o6n.ominal anaphora resolu-

tion in Chines.eCiteseer. [11] Crespigny R. De . 2007 . A Biographical Dictionary of Later Han to the Three

Kingdoms ( 23 -220 Ad). Leiden: Brill ( 2007 ). [12] Rahul

Gupta

, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu. [37] Qingyu

Yin

, Yu Zhang, Weinan Zhang, and Ting Liu. 2017 . Chinese Zero Pro-

2014. Biperpedia: An ontology for search applicatiVoLnDs . B 7 , 7 ( 2014 ), 505 - noun Resolution with Deep Memory NetworkE .MInNLP. Association for Com-

516. putational Linguistics, Copenhagen, Denmark, 1309 - 131h8t .tps://doi.org/10. [13] Alon

Halevy

, Natalya Noy, Sunita Sarawagi, Steven Euijong Whang, and Xiao 18653 /v1/ D17 -1135

Yu . 2016 . Discovering structure in the universe of atribute nameWs . WInW. [38] Qingyu

Yin

, Yu Zhang, Weinan Zhang, Ting Liu, and William Yang Wang. 2018 .

International World Wide Web Conferences Steering Commitee , 939 - 949 . Zero Pronoun Resolution with Atention-based Neural NetwoPrrko .cIenedings [14] Marti

Hearst . 1992 . Automatic acquisition of hyponyms from large text cor- of the 27th International Conference on Computational Ling .uAisstsiocsciation for

Computational

Linguistics , Santa Fe, New Mexico, USA, 13 - h2t3 .tps://www.

pora. InProceedings of the 14th conference on Computational linguistics-Volume

2. Association for Computational Linguistics, 539 - 545 . aclweb.org/anthology/C18-1002 [15] Zhiheng

Huang

, Wei Xu,

and Kai

Yu . 2015 . Bidirectional LSTM-CRF models for [39] Qingyu

Yin

, Yu Zhang, Wei-Nan

Zhang

, Ting Liu, and William Yang Wang.

sequence tagginga . rXiv preprint arXiv:1508 . 01991 ( 2015 ). 2018 . Deep Reinforcement Learning for Chinese Zero Pronoun Resolution . In [16] Meng

Jiang

, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M Kaplan, Timo- ACL. Association for Computational Linguistics , Melbourne, Australia, 569 - 578 .

thy P Hanraty , and Jiawei Han. 2017 . Metapad: Meta patern discovery from https://doi .org/10.18653/v1/ P18 -1053

massive text corpora . IPnroceedings of the ACM SIGKDD International Confer- [40] Wenhao

Zongze

Li ,

Qingkai

Zeng , and

Meng

Jiang . 2019 . Tablepedia: Au-

ence on Knowledge Discovery & Data Minin.AgCM , 877 - 886 . tomating PDF Table Reading in an Experimental Evidence Exploration and An [17] Tianwen

Jiang

Tong

Zhao ,

Bing

Qin , Ting Liu, Nitesh V Chawla, and Meng alytic System. I nThe World Wide Web Conference . ACM , 3615 - 3619 .

Jiang . 2019 . The Role of “Condition”: A Novel Scientific Knowledge Graph Rep- [41] Qi

Zhang

, Xiaoyu Liu, and

Jinlan

Fu . 2018 . Neural networks incorporating dic-

resentation and Construction ModePlr.oIcneedings of the 25th ACM SIGKDD tionaries for chinese word segmentatioAnA . IAnI.

International Conference on Knowledge Discovery & Data Mi.nAinCgM , 1634 - [42] Wei

Zhou

, Aiping

Wang

, Hua Shu, Reinhold Kliegl, and

Ming

Yan . 2018 . Word

1642. segmentation by alternating colors facilitates eye guidance in Chinese reading . [18]

Martin

Kern . 2003 . The ”biography of Sima Xiangru” and the question of the Memory & cognition46, 5 ( 2018 ), 729 - 740 .

Fu in Sima Qian's Shiji . Journal of the American Oriental Socie1t2y3 , 2 ( 2003 ), [43] Jun

Zhu

, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen . 2009 . Stat-

303- 316 . Snowball: a statistical approach to extracting entity relatioWnsWhiWps. . In [19] John Laferty, Andrew McCallum, and Fernando CN Pereira . 2001 . Conditional

ACM

, 101 - 110 .