Detecting Entity Descriptions from Chinese Historical
                         Texts
                         Ye Xia1,† , Bin Wang2,† , Linxuan Yu3,† , Xiaoci Lin1,† and Hui Li2,*,†
                         1
                           College of Artificial Intelligence, Nanjing Agricultural University, 210095, Nanjing ,China
                         2
                           College of Humanities and Social Development, Nanjing Agricultural University, 210095, Nanjing ,China
                         3
                           College of Sciences, Nanjing Agricultural University, 210095, Nanjing ,China


                                     Abstract
                                     Driven by an increasing number of digitized historical documents in machine-readable formats, researchers from
                                     various disciplines actively participate in the information extraction and exploration of historical documents,
                                     especially the recognition and classification of named entities in large-scale texts. Most existing studies focus
                                     on the identification of flat entities, however, the nested structures inside entities are often overlooked. In this
                                     paper, we focus on the extraction of nested entities in Chinese local gazetteers spanning over 8 centuries. We
                                     first propose an annotation guideline for two entity types and five entity categories in local gazetteers, which can
                                     be easily adapted to other domains. Then we utilize three popular span-based NER approaches in the context of
                                     Chinese historical texts, and analyze the corresponding results. Our preliminary study can enhance the existing
                                     geographical resources with entity information and be a reference for similar tasks within the field of digital
                                     humanities.

                                     Keywords
                                     Chinese historical texts, nested entity, span-based NER


                         1. Introduction
                         Named entity recognition (NER), which plays an important role in the area of natural language processing
                         (NLP), identifies entities such as person, organization and location names from texts. Currently NER
                         task has achieved a remarkable performance on texts written in modern languages. However, historical
                         texts are still faced with multiple challenges, such as lack of resources, input noisiness, and domain
                         heterogeneity [1]. Although transformer-based NER techniques have already been used on historical
                         texts [2], fine-grained NER, e.g., nested entity recognition, has not been widely studied, especially for
                         Chinese historical texts. The most widely used NER-flat approach on Chinese historical texts is the
                         sequential labeling method “BERT-BiLSTM-CRF”.
                            Chinese local gazetteers (also known as ”difangzhi” ), are historical records that contain comprehensive
                         information about administrative units in China over time. In this study, we are particularly interested
                         in extracting fine-grained entity information from large-scale Chinese historical texts. We make our
                         efforts to extract the flat and nested entity mentions, e.g., local products, books and locations, from a
                         sizable number of local gazetteers spanning over 8 centuries, using a computational approach.


                         2. Methodology
                         Within the scope of Chinese local gazetteers, our major focus is about two entity types, namely flat
                         and nested entities. Five categories within these two entity types are defined and labeled with different

                         GeoExT 2024: Second International Workshop on Geographic Information Extraction from Texts at ECIR 2024, March 24, 2024,
                         Glasgow, Scotland
                         *
                           Corresponding author.
                         †
                           These authors contributed equally.
                         $ 19220124@stu.njau.edu.cn (Y. Xia); 2022110023@stu.njau.edu.cn (B. Wang); 23121215@stu.njau.edu.cn (L. Yu);
                         19222119@stu.njau.edu.cn (X. Lin); 2021005@njau.edu.cn (H. Li)
                          0009-0000-3244-4069 (Y. Xia); 0009-0000-7685-3251 (B. Wang); 0009-0005-0961-5452 (L. Yu); 0009-0004-2972-2071 (X. Lin);
                         0000-0001-7050-1845 (H. Li)
                                  © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
tags：PRO for local product names, PER for person names, LOC for location names, BOK for book
names, and TIM for temporal expressions.
  Let a sentence 𝑠 ∈ 𝑆 be denoted as a sequence of tokens 𝑠 = {𝑤1 , 𝑤2 , ..., 𝑤𝑛 }, where 𝑤𝑖 denotes
the 𝑖𝑡ℎ token in the sequence. Let C denote a set of pre-defined categories 𝐶 = {𝑐1 , 𝑐2 , ..., 𝑐𝑛 }. The
goal of our task is to predict a list of tuples 𝑇 = {< 𝐼1ℎ𝑒𝑎𝑑 , 𝐼1𝑡𝑎𝑖𝑙 , 𝑐1 , 𝑑1 >, < 𝐼2ℎ𝑒𝑎𝑑 , 𝐼2𝑡𝑎𝑖𝑙 , 𝑐2 , 𝑑2 >, ..., <
 ℎ𝑒𝑎𝑑 , 𝐼 𝑡𝑎𝑖𝑙 , 𝑐 , 𝑑 >}, each of which refers to an entity mentioned in the sentence. 𝐼 ℎ𝑒𝑎𝑑 and 𝐼 𝑡𝑎𝑖𝑙
𝐼𝑚       𝑚        𝑚 𝑚                                                                                    𝑖            𝑖
represent the head index and the tail index of the 𝑖𝑡ℎ entity mention. 𝑐𝑖 is the predicted entity category
and 𝑑𝑖 corresponds to the depth of this mention. 𝑚 is the number of entity mentions detected within
the given sentence. In this study, we consider the flat entity as a special case of the nested entity with
the entity depth of zero (see Figure 1).


Figure 1: An example of the flat and nested entities annotated in Chinese local gazetteers.

    Flat entities are annotated for all entity categories and nested entities are annotated within the
categories of persons, locations and books in order to achieve a fine-grained entity detection. For
instance, the category LOC covers the geographical names of a certain place, such as village, town or
city, ′′ 高郵狀元墩′′ (Gaoyou Zhuangyuandun), the entity mention contains an embedded county name
′′ 高郵′′ (Gaoyou), so we consider this location name as a nested entity mention.

    Our task here is to find all occurrences of the entities that belong to the categories indicated above,
and use pre-defined tags to mark the beginning, the end, and the nested structure of each mention span.
It should be noted that the tag sets not only refer to the full name of an entity, but also to the specific
features embedded in a given entity. For example, the nested entity ′′ 波斯橙′′ (Bosi Cheng, Persian
Orange) , the tag LOC is used inside the mention to capture the latent geographic feature (波斯, Persian)
of this local product. We ask domain experts to manually assigned tags to each entity mention in texts
and consult with each other or refer to external resources in case of entity ambiguities or uncertainties.
    Considering the limited data scale, we prefer to fine-tune the pre-trained BERT-based model for
NER task on classical Chinese texts instead of training our own from scratch. The span-based methods
have advantages in easily identifying nested entities in different sub-sequences, therefore we intend to
tackle the NER-nested task with three popular span-based approaches, namely MRC, Global Pointer
and BERT-span, respectively.
    • MRC. Li et al.[3] formulated the NER task as a Machine Reading Comprehension (MRC) task and
      transformed the tagging-style annotated dataset to a set of tuples {question, answer, context} to
      tackle nested entity problem.
    • Global Pointer. Su et al.[4] leverages the relative positions through a multiplicative attention
      mechanism to identify the nested entities.
    • BERT-Span. This approach leverages the strengths of BERT and span-based strategy to tackle the
      complexity of NER-nested task. This approach can effectively identify and extract the hierarchical
      relationships between nested entities.


3. Experiment and Evaluation
In this study, we use a digital collection of Chinese local gazetteers from the 12th to 20th century [5].
After text correction of misspellings, integration of metadata and manually annotation of entities, our
dataset contains 25,353 items of product descriptions and 940,189 entity labels. Table 1 shows the entity
distribution of our dataset and we divide the dataset into three subsets, i.e., training, development
and test set, with a ratio of 7:2:1. Apart from generic location names, we notice that there is a large
proportion of nested entities containing ′′ LOC′′ labels inside, which correspond to the ′′ Geo-related′′ in
Table 1.

    Table 1
    Entity Statistics for Our Dataset
                                Type     Category     Vocabulary Size      Geo-related
                                           PRO                10,423              ✓
                                           LOC                6,445               ✓
                                flat       BOK                  102               ✓
                                           PER                 1,083              –
                                           TIM                  540               –
                                           PRO                4,783           68.59%
                                nested     LOC                2,349             ✓
                                           BOK                  97            82.35%


   We investigate the state-of-the-art (SOTA) pre-training language models for classical Chinese, and
we find that BERT-ancient-Chinese is the best fit for our task since it outperforms others [6]. We
fine-tune it on our annotated gazetteer dataset for the NER task with three span-based approaches,
respectively. Table 2 illustrates the precision, recall and F1 values of entities in five categories using
different span-based methods, and macro-average is calculated over all categories. We use bold to mark
the highest value of each category. According to Table 2, it seems that Global Pointer outperforms
others on macro-average scores and MRC outperforms others on identification of entities from PRO,
LOC and PER categories.

    Table 2
    P, R, F1-Score, and Macro-average Results of Each Entity Category using Different Span-based Ap-
    proaches
                                  Methods        Category         P (%)   R (%)       F1 (%)
                                                     PRO          81.22   82.34       81.77
                              BERT-Span              LOC          86.93   82.89       84.86
                                                     BOK          83.97   86.29       85.12
                                                     PER          82.17   85.87       83.98
                                                     TIM          85.16   80.5        82.77
                                                    Overall       83.89   83.58        83.7
                                                     PRO          83.16   84.1        83.63
                              Global Pointer         LOC          86.86   81.62       84.16
                                                     BOK          89.74   84.06       86.81
                                                     PER          84.14   81.23       82.66
                                                     TIM          81.51   81.41       81.46
                                                    Overall       85.08   82.48       83.74
                                                     PRO          83.26   86.42       84.81
                              MRC                    LOC          86.53   85.37       85.95
                                                     BOK          83.42   83.22       83.32
                                                     PER          89.25   88.76       89.00
                                                     TIM          81.54   80.43       80.98
                                                    Overall       82.26   79.88       81.03


4. Conclusion
In this study, we focus on the nested entity extraction from large-scale Chinese local gazetteers. We
utilize three popular span-based approaches with fine-tuning BERT-ancient-Chinese on our domain-
specific dataset and the corresponding experimental results show the effectiveness and feasibility of
span-based NER on Chinese historical texts. The extracted entity mentions in this ongoing study can
enrich the existed geographical resources with historical location names and local products. Our further
step will be the entity linking with external resources, which will facilitate domain experts in the
interpretation and understanding of the fine-grained knowledge embedded in the historical texts.


Acknowledgments
This work is supported by the Fundamental Research Funds for the Chinese Central Universities
(SKQY2022003) and the National Key Research Project on Rare-Book Collections (22GJK004). The
authors are especially thankful to Prof. Dr. Ping Bao for the insightful comments, administrative
technical support, and materials used for experiments. We are also grateful to Mr. Shun Zou for
individual effort and encouragement to the successful completion of this paper.


References
[1] Won, M., Murrieta-Flores, P. and Martins, B. (2018). Ensemble named entity recognition (NER):
   evaluating NER tools in the identification of place names in historical corpora. Frontiers in Digital
   Humanities, 5, p.2. doi: 10.3389/fdigh.2018.00002.
[2] Abadie, N., Carlinet, E., Chazalon, J. and Duménieu, B. (2022). A benchmark of named entity
   recognition approaches in historical documents application to 19th century French directories.
   International Workshop on Document Analysis Systems, pp. 445-460. Cham: Springer International
   Publishing. doi: 10.1007/978-3-031-06555-2_30.
[3] Li, X., Feng, J., Meng, Y., Han, Q., Wu, F. and Li, J. (2020). A unified MRC framework for named entity
   recognition. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
   pp. 5849–5859. 10.18653/v1/2020.acl-main.519.
[4] Su, J., Murtadha, A., Pan, S., Hou, J., Sun, J., Huang, W., Wen, B. and Liu, Y., 2022. Global
   pointer: Novel efficient span-based approach for named entity recognition. arXiv preprint.
   https://doi.org/10.48550/arXiv.2208.03054.
[5] Li, Y. and Li, H. (2022). Exploring the Rice Cultivars in Large-Scale Chinese Local Gazetteers: A
   Computational Approach. Plants, 11(23), p.3403. https://doi.org/10.3390/plants112334
[6] X. Hu, H. Zhang and Y. Sun. Chinese medical short text matching model based on fine-tuning
   BERT-Attention-BiLSTM. (2023). 23rd International Conference on Computer and Information Science,
   pp. 91-96. doi: 10.1109/ICIS57766.2023.10210224.