Fine-Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Persian Gulf

Almazhan Kapan 1, Suphan Kirmizialtin 2, Rhythm Kukreja 2 and David Joseph Wrisley 2
1 New York University Shanghai, Pudong New District, Shanghai, China
2 New York University Abu Dhabi, Saadiyat Island, Abu Dhabi, United Arab Emirates

The 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022
aa5456@nyu.edu (A. Kapan); suphan@nyu.edu (S. Kirmizialtin); rk3781@nyu.edu (R. Kukreja); djw12@nyu.edu (D. J. Wrisley)
ORCID: 0000-0002-1064-8199 (A. Kapan); 0000-0001-5020-0578 (S. Kirmizialtin); 0000-0002-4424-1100 (R. Kukreja); 0000-0002-0355-1487 (D. J. Wrisley)

Abstract
Text recognition technologies increase access to global archives and make possible their computational study using techniques such as Named Entity Recognition (NER). In this paper, we present an approach to extracting a variety of named entities (NE) in unstructured historical datasets from open digital collections dealing with a space of informal British empire: the Persian Gulf region. The sources are largely concerned with people, places and tribes as well as economic and diplomatic transactions in the region. Since models in state-of-the-art NER systems function with limited tag sets and are generally trained on English-language media, they struggle to capture entities of interest to the historian and do not perform well with entities transliterated from other languages. We build custom spaCy-based NER models trained on domain-specific annotated datasets. We also extend the set of named entity labels provided by spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test and compare the performance of blank, pre-trained and merged spaCy-based models, suggesting further improvements. Our study makes an intervention into thinking beyond Western notions of the entity in digital historical research by creating more inclusive models using non-metropolitan corpora in English.

Keywords
Named Entity Recognition, Gulf Studies, Colonial Archives, Persian Gulf, spaCy, Transliterated Names.

1. Introduction

With the increase in digitization and transcription of historical archives, Named Entity Recognition (NER) is often regarded as an important step in text processing, ensuring access at scale to layers of information found in text, such as names of people, places or currencies [1]. In addition to the possibility of creating linked data and building gazetteers, identifying relevant entities in unstructured text enables scholarly examination of broader patterns in archival collections. This potential of NER has been demonstrated in the spatial humanities and the study of historical networks, with notable challenges [2, 3]. Cultural heritage collections span long periods of time, and historical texts contain named entities (NE) whose forms have often changed over time. In the case of our sources, the dynamic orthography of colonial English further contributes to the instability of entity names. Furthermore, for regions such as the Middle East, entities are not commonly found in knowledge bases and, therefore, are not easily detected by modern systems [4].
State-of-the-art NER systems, primarily trained on web news data, do not successfully address the relevant issues [5]. Finally, of particular significance to our work, modern English-language NER systems often fail to recognize NE of non-English origin such as Arabic and Farsi names, indigenous tribal designations, and names of cities transliterated from other languages.

In this paper we focus on early-stage research into detecting NE in colonial historical documents, in particular from samples of the ledgers of the British Political Residency (BPR) in the Persian Gulf [6]. We have designed a custom spaCy NER system addressing some of the above-mentioned issues in detecting and classifying transliterated NE of historical and non-Western origin. The system is trained on annotated datasets from the BPR ledgers [5] and Lorimer's Gazetteer of the Persian Gulf, Central Arabia and Oman (Lorimer's Gazetteer) [7]. We provide an easily scalable blueprint for custom NER with spaCy, optimized for detecting transliterated NE in historical documents. Notably, the system can be extended to other historical datasets and custom NE lists which are not defined in state-of-the-art NER systems (e.g. spaCy, NLTK). For this purpose, we make our raw text and annotated datasets openly available (https://github.com/opengulf/Bushire/tree/main/spacy).

2. Related Work

NER with Historical Collections. Despite the marginal position of historical datasets in general NER research, the demand for systems for use in archival scenarios has steadily increased [4]. The literature emphasizes the roles that NE play in the information workflow of historians, e.g., in searching historical newspapers or providing search recommendations for digital collections. NER can also be used for historical data analysis, visualization, event detection, even biography reconstruction [8]. Current research with historical datasets takes three main approaches: i) a focus on a particular data type or textual genre, e.g. administrative documents [9], museum record metadata [5], newspapers [10], or gazetteers; ii) a focus on the kind of writing, whether handwritten [11] or typewritten materials [12]; or iii) a task-specific focus: NE recognition, classification or linking [1]. Most current NER systems for historical data either fine-tune existing NER systems or use NE processing web services [1], the former seeming more scalable and customizable. Indeed, the datasets used in NER research vary in kind and language, meaning that a comparison of their performance can be a challenging task.

Using spaCy for Custom NER with Historical Documents. NE processing efforts are led by supervised machine learning and deep neural networks [13]. Won et al. [4] analyzed the applicability of existing NER systems to historical texts and compared the performance of popular supervised solutions such as Stanford NER, NER-Tagger, spaCy, and Polyglot-NER. In this study, the spaCy-based NER repeatedly achieved the top-1 or top-2 f-score when applied to historical datasets [4]. Importantly, these results were achieved without leveraging the custom training features of these systems.
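For orientation, applying such an off-the-shelf spaCy pipeline takes only a few lines of Python. The sketch below is a constructed illustration (the sample sentence is our own, not taken from the corpora discussed here), and the model must first be downloaded with python -m spacy download en_core_web_sm:

    import spacy

    # Load the small off-the-shelf English pipeline; download it first with:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    # Illustrative sentence containing transliterated names (our own example).
    doc = nlp("The Sheikh of Bahrein despatched a letter to the Residency at Bushire.")
    for ent in doc.ents:
        print(ent.text, ent.label_, sep="\t")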
Compared to similar systems such as Stanford NER or NLTK, spaCy provides extensive documentation on fine-tuning and customization (https://spacy.io/usage/training). Several spaCy-based custom systems have shown optimal performance for NER in a variety of sectors [14].

3. Datasets

We used two large entity-rich text collections from the era of British informal empire in the region: excerpts from the BPR ledgers held in the India Office Records (IOR/R/15) as well as a 45,000-word entry on Iraq from Lorimer's Gazetteer. Since we were keen on testing NER both on different textual genres and formats, we used a combination of unstructured textual data types: corrected OCR of typeset materials, uncorrected OCR of typewritten documents, and ground truth used to train Handwritten Text Recognition (HTR) models.

The Handwritten and Typewritten Bushire Political Residency Ledgers. The BPR ledgers were not all originally written in English; instead, they are a combination of English originals and translations made from incoming Arabic and Farsi documents. From image files available at the Qatar Digital Library (QDL) we created a generic HTR model in Transkribus for early handwritten volumes in the collection. Whereas the official letters exhibit a formulaic consistency of style, evolving capitalization norms and tendencies to Anglicize foreign names over the course of the nineteenth century further complicate the entity space in the corpus. A companion dataset to the handwritten corpus was created by collecting samples of OCR'd typewritten documents from twentieth-century BPR ledgers, when the use of typewriters had become standard (from 1932-1939 and 1940-1948). Documents were selected based on a cursory assessment of the number of entities in the letters and of their page layouts.

Lorimer's Gazetteer. The last document used in our NER experiment is an excerpt of the 45,000-word chapter on Iraq from Lorimer's Gazetteer. The Gazetteer was a compendium resulting from "gathering and processing information" about the Gulf region over the years 1904-1908 [15]. Importantly, its organizational structure tends to group similar entities together in sections, exhibiting greater data density than the BPR ledgers.

The documents described above are especially germane to the study of the modern history of the Gulf states. For pre-oil Gulf studies, NER shows promise in identifying people for building correspondence networks, as well as studying the movement of ideas and goods in the region. These elements can be correlated, in turn, with the growth of colonial authority and the emergence of Orientalist transliteration norms aimed at the "de-Anglicization of indigenous names" [15].

Figure 1: Distribution of Named Entities in the training and gold test datasets

4. Data Annotation

Annotation workflow. For training, development and testing, we selected samples from the BPR ledgers and Lorimer's Gazetteer totaling 6586 words. We allocated 82.5% of the data for training and development and 17.5% for testing.
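One reproducible way to produce such a split over annotated segments is sketched below; the segment identifiers and the random seed are placeholders for illustration, not our actual data or procedure:

    import random

    # Placeholder identifiers for annotated text segments; in practice these
    # would be the BPR and Lorimer's Gazetteer samples described above.
    segments = [f"segment_{i:02d}" for i in range(40)]

    random.seed(42)  # assumed seed, purely for reproducibility
    random.shuffle(segments)

    cut = round(len(segments) * 0.825)  # 82.5% for training/development
    train_dev, test = segments[:cut], segments[cut:]
    print(len(train_dev), "train/dev segments;", len(test), "test segments")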
The training data was further split into 60-line segments for streamlined group annotation. We used an open-source, full-stack web app, NER-Annotator (https://github.com/tecoholic/ner-annotator), with custom entity tags, and exported the annotated data in the requisite training format for spaCy. To ensure that NER-Annotator system requirements were met across machines and multiple annotators, the application was hosted on a virtual machine.

Tag selection and customization. Our system includes pre-trained models for detecting pre-established named entities, such as "GPE" for "countries, cities, states," or "QUANTITY" for "measurements, as of weight or distance". We maintained a selection of spaCy default tags both to track their recognition and to mitigate the effects of the catastrophic forgetting problem [16], in which a pre-trained model 'forgets' previously learned information. We also devised a set of customized tags adapted to the content of our sources. We annotated our training data using DATE, GPE, GPE_ORG, LOC, MONEY, NORP, ORG, PERSON and QUANTITY, which are pre-defined in spaCy. Our custom tags are summarized in Table 1.

Table 1
Custom tags used for our historical corpora

Tag name     Description                                    Example
COMMODITY    Any object traded or transported               woolens
REL          Religious groups, as separate from NORP        Shiites
STRUCTURE    Named elements of the built environment        Caravanserail, Factory, Residence
TITLE        A distinction for a person or family           Sheikh, Imam, Governor of Sheraze
TRIBE        The name of a tribe, as distinct from NORP     Sooedan tribe
VESSEL       The name of a seaborne vessel                  Dolphin Schooner

A minority of the entities (1.4%) were annotated with the tag UNKNOWN to flag ambiguity. In the training data, 1117 NE were annotated with 16 tags, and in the testing gold data, 174 NE were annotated with 13 tags. We also computed inter-annotator agreement (an average f-score of 0.76) by extracting NE and applying fuzzy string matching to address index mismatches while parsing entity labels.

5. Methods

System architecture. Our system is based on spaCy and consists of several deep learning models using the transition-based framework of [13] with CNN and LSTM architectures. In our custom NER pipeline, we apply transfer learning and use spaCy's English pre-trained and blank models to build, fine-tune, and evaluate our custom models. Moreover, we consider the effects of the 'catastrophic interference (forgetting)' problem [16], and build custom double-ner models with multiple 'ner' components. Our system provides a pipeline that computes inter-annotator agreement and analyzes the composition of the training and testing data. Table 2 provides a description of the models.

Table 2
Varieties of models used in our research

SM and LG (pre-trained, non-fine-tuned): en_core_web_sm (SM) and en_core_web_lg (LG) are English models pre-trained on OntoNotes 5, WordNet 3, etc., differing in size, with pre-trained components for tokenization, POS tagging, dependency parsing, NER, etc.
BLK-F (fine-tuned blank): No pre-trained components for tagging, parsing or NER; contains a token-to-vector embedding component ('tok2vec') and an entity-detecting component. Fine-tuned by training its components on the annotated entity data described in section 4.
DEF-F (pre-trained, fine-tuned with default 'ner'): Based on spaCy's SM model. Fine-tuned by re-training its components (except for 'ner') with our custom annotated data.
UPD-F (pre-trained, fine-tuned with updated 'ner'): Based on spaCy's SM model. Fine-tuned by re-training its 'ner' component with custom annotated data.
REP-F (pre-trained, fine-tuned with replaced 'ner'): Based on spaCy's SM model. Fine-tuned by replacing its 'ner' component with a new 'ner' component trained on custom annotated data.
DOB-F (fine-tuned double-ner): Double-ner models: DOB-BLK (BLK-F + SM), DOB-DEF (DEF-F + SM), DOB-REP (REP-F + SM), DOB-UPD (UPD-F + SM).

We apply the non-fine-tuned, pre-trained SM and LG models to the gold test data, discovering that the SM model works better.
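This baseline evaluation can be sketched in spaCy v3 terms as follows; the gold annotation shown is a constructed illustration of the (text, {"entities": [(start, end, label), ...]}) format rather than an item from our actual gold data:

    import spacy
    from spacy.training import Example

    nlp = spacy.load("en_core_web_sm")  # or "en_core_web_lg" for the LG baseline

    # Constructed gold example; character offsets mark entity spans.
    gold_data = [
        ("The Wali of Baghdad wrote to Basrah.",
         {"entities": [(4, 19, "TITLE"), (29, 35, "GPE")]}),
    ]

    examples = []
    for text, annotations in gold_data:
        doc = nlp.make_doc(text)
        examples.append(Example.from_dict(doc, annotations))

    # nlp.evaluate runs the pipeline and scores predictions against the gold spans.
    scores = nlp.evaluate(examples)
    print(scores["ents_p"], scores["ents_r"], scores["ents_f"])
    print(scores["ents_per_type"])  # per-label precision/recall/f-score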
We then use SM as a baseline to build and fine-tune our custom models, named here DEF-F, UPD-F, REP-F and DOB-F. We also build and fine-tune a blank model, BLK-F, using it as another basis for the custom models described above. In the DEF-F model, we train only the non-'ner' components to evaluate the potential of fine-tuning the parsing and tagging components. In the UPD-F model (in contrast to BLK-F), we update the 'ner' component of a pre-trained SM model, taking advantage of the other pre-trained non-'ner' components (tagging, parsing, etc.) as an additional layer of information and a potential source of performance improvement. Moreover, the 'ner' component of a pre-trained model can already recognize some entities, which is beneficial if there are overlaps between its built-in entity types and those in our training data. To mitigate the effects of catastrophic forgetting, we build the REP-F model with a new fine-tuned 'ner' component that overwrites SM's 'ner' component while retaining the other pre-trained features. An alternative approach to this problem involves building a double-ner model with two separate 'ner' components. In this approach, the 'ner' components of both a fine-tuned model (DEF-F, BLK-F, REP-F or UPD-F) and the non-fine-tuned SM co-exist in one pipeline (a construction sketched in code at the end of section 6).

Resampling training data. In our annotations, some entity examples and labels are underrepresented, resulting in a class imbalance problem with potential negative effects on model performance [17]. To address this issue, we resample the training data by adding copies of instances from underrepresented classes and labels. Furthermore, we train and evaluate every model on both resampled and non-resampled datasets. In future work, we plan to adjust the annotation classes and annotate more labels from minority classes.

6. Evaluation and Results

In this section we look at the performance of the different models on our datasets. The performance of NER systems is traditionally evaluated in terms of f1-score, precision and recall [4]. We compute both individual scores (e.g. precision) for each data segment in the gold data and total weighted scores across all segments. The weighted score is a sum of the individual segment scores, each multiplied by the segment's weight, which is proportional to the number of labeled entities in the gold annotations for that segment. Evaluation results are shown in Table 3, with the one-component custom models (BLK-F, DEF-F and REP-F) achieving the highest performance.

Table 3
Evaluation results for NER models in section 5 tested on gold annotated data

                        Non-resampled                 Resampled
Model                   f-score  precision  recall   f-score  precision  recall
SM                      0.194    0.215      0.178    0.363    0.385      0.32
LG                      0.15     0.177      0.136    0.312    0.332      0.29
BLK-F                   0.77     0.797      0.767    0.77     0.8        0.767
DEF-F                   0.775    0.798      0.755    0.79     0.83       0.755
REP-F                   0.78     0.817      0.755    0.77     0.8        0.75
UPD-F                   0.654    0.663      0.65     0.655    0.67       0.65
DOB-BLK (BLK-F + SM)    0.60     0.56       0.66     0.60     0.56       0.66
DOB-DEF (DEF-F + SM)    0.59     0.54       0.65     0.60     0.55       0.65
DOB-REP (REP-F + SM)    0.60     0.56       0.69     0.60     0.56       0.65
DOB-UPD (UPD-F + SM)    0.60     0.55       0.67     0.61     0.57       0.66

We evaluated spaCy's pre-trained SM and LG models without customization. For all datasets, SM outperformed LG, possibly due to the overbearing effects of the latter's embedded feature vectors [18]. SM performed significantly worse than the custom models in detecting and classifying entities of non-Western origin. For example, Figure 2 visualizes the entities recognized by SM in a sample from Lorimer's Gazetteer: although 'Baghdad' is classified correctly, 'Basrah Wilayats' is not recognized as a GPE and 'Walis of Baghdad' is not detected at all.

Figure 2: Named Entity Recognition output from the en_core_web_sm model tested on gold data
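The total weighted scores in Table 3 follow a computation of this shape; the per-segment values below are hypothetical illustrations, not our actual figures:

    def total_weighted_scores(segment_scores):
        """Aggregate per-segment scores, weighting each segment by its share
        of labeled entities in the gold annotations (as defined above)."""
        total = sum(s["n_gold_ents"] for s in segment_scores)
        return {
            metric: sum(s[metric] * s["n_gold_ents"] for s in segment_scores) / total
            for metric in ("ents_p", "ents_r", "ents_f")
        }

    # Hypothetical per-segment results (not the figures reported in Table 3).
    segments = [
        {"n_gold_ents": 120, "ents_p": 0.82, "ents_r": 0.76, "ents_f": 0.79},
        {"n_gold_ents": 54,  "ents_p": 0.70, "ents_r": 0.64, "ents_f": 0.67},
    ]
    print(total_weighted_scores(segments))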
As for BLK-F, it performed best on samples from the Bushire dataset, which contains the majority of the tagged entities, and worst on Lorimer's Gazetteer, a dataset relatively underrepresented in training. This tendency is expected, since the blank model has neither a built-in 'ner' component nor entity classes, and it depends highly on training. Similarly, the blank model has high entity f-scores for frequent tags (e.g. GPE, PERSON, VESSEL) and low scores for less common labels (e.g. QUANTITY, TRIBE). Notably, the model can successfully classify non-Western entities such as "Baghdad" or "Basrah Wilayat" as GPE, but it does not detect entity types from the built-in models, such as CARDINAL or NORP. We conclude that, depending on the end goal, such limitations can sometimes be mitigated by more training, resulting in a comparatively high total weighted f-score for the model.

Generally, for samples from datasets underrepresented in training, such as Lorimer's Gazetteer, custom pre-trained models with both replaced and updated 'ner' components show slightly, but not significantly, better performance. These cases are the 'catastrophic forgetting' problem in action. Although a fine-tuned pre-trained model is originally based on SM with a built-in 'ner' component, after this 'ner' component is re-trained or replaced, entities that were previously recognized (e.g. NORP) are no longer recognized, or 'forgotten' [19]. Our experiments reveal that for detecting entities from datasets underrepresented in training, combining a custom 'ner' component with spaCy's built-in 'ner' component, without directly overwriting it, is a better option. Given the limited training in this scenario, leveraging pre-trained features benefits performance. However, when applied to datasets well-represented in training, double-ner models can cause model underfitting [20], resulting in low f-scores. For example, in a sample text from the BPR dataset, custom pre-trained models perform better than a double-ner model (see Figures 3 and 4), possibly due to frequent inconsistencies between the custom and built-in components' classifications of the same entities.

Figure 3: Named Entity Recognition output from a custom pre-trained model tested on gold data
Figure 4: Named Entity Recognition output from a custom double-ner model tested on gold data

The evaluation scores in Table 3 confirm this observation across all datasets. Notably, for one-component custom models, where the 'ner' component is mostly assembled during training, precision is higher than recall. This implies that one-component custom models are most likely to classify any entities they recognize correctly, which is particularly useful for our task of detecting entities of non-Western and historical origin. These models might not, however, detect all possible true entities. By contrast, double-ner models have higher recall than precision; they detect all possible true entities more successfully, since they leverage the already built-in 'ner' components with a more extensive set of tags. However, double-ner models are less likely to classify detected entities correctly (e.g. labeling a TITLE entity as WORK_OF_ART in Figure 4).
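The double-ner construction can be sketched with spaCy v3's component sourcing as follows; the path to the fine-tuned custom model is a placeholder for a locally trained pipeline, not a fixed part of our setup:

    import spacy

    # Load a fine-tuned custom pipeline (placeholder path) and the pre-trained SM model.
    custom = spacy.load("training/model-best")  # hypothetical fine-tuned custom model
    source = spacy.load("en_core_web_sm")

    # Source SM's pre-trained 'ner' under a distinct name so that both entity
    # recognizers co-exist in one pipeline.
    custom.add_pipe("ner", name="builtin_ner", source=source, last=True)
    print(custom.pipe_names)

    doc = custom("The Wali of Baghdad levied duties at Basrah on 4 June 1873.")
    print([(ent.text, ent.label_) for ent in doc.ents])

Because doc.ents cannot contain overlapping spans, the second recognizer can only add entities over stretches of text the first left unlabeled, which matches the intended division of labor between the custom and built-in components.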
For our specific project needs of detecting and correctly classifying custom entities, custom models generally perform better. However, our experiments indicate that there is potential to create a more generalized NER system that performs well across all entities, not only those built into spaCy's label set but also custom entities. This would require further fine-tuning of the double-ner models and enriching both the training and testing datasets.

7. Conclusion and Future Work

In this paper, we developed an extensive blueprint for a custom NER system that outperforms state-of-the-art built-in spaCy NER models in detecting and classifying NE of non-Western linguistic origin. The system includes custom NER models based on a combination of i) built-in and ii) fine-tuned custom 'ner' components. For our initial goal of detecting non-Western, historical entities, fine-tuned, one-component models perform best. However, it is still possible to create a generalized custom NER system by extending the system to a larger and more diverse set of historical entities and documents. We plan to apply the NER models to a corpus of HTR-generated text, rather than the HTR ground truth used for the current project. We also plan to preprocess the datasets to account for historical norms and HTR artefacts, with a view to identifying diachronic developments in the transliterated entities across the corpus. Moreover, an NER system trained on such a rich dataset of entities has the additional value of being applicable to other historical texts of similar style and origin.

References

[1] M. Ehrmann, G. Colavizza, Y. Rochat, F. Kaplan, Diachronic Evaluation of NER Systems on Old Newspapers, in: Proceedings of the 13th Conference on NLP (KONVENS), 2016.
[2] K. McDonough, L. Moncla, M. van de Camp, Named Entity Recognition Goes to Old Regime France: Geographic Text Analysis for Early Modern French Corpora, International Journal of Geographical Information Science 33 (2019) 2498–2522.
[3] J. Clifford, B. Alex, C. M. Coates, E. Klein, A. Watson, Geoparsing History: Locating Commodities in Ten Million Pages of Nineteenth-Century Sources, Historical Methods: A Journal of Quantitative and Interdisciplinary History 49 (2016) 115–131.
[4] M. Won, P. Murrieta-Flores, B. Martins, Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora, Frontiers in Digital Humanities 5 (2018) 2.
[5] B. Batjargal, G. Khaltarkhuu, F. Kimura, A. Maeda, An Approach to Named Entity Extraction from Historical Documents in Traditional Mongolian Script, in: IEEE/ACM Joint Conference on Digital Libraries, 2014, pp. 489–490.
[6] Qatar Digital Library, The Political Residency, Bushire, 2014. URL: https://www.qdl.qa/en/political-residency-bushire.
[7] J. G. Lorimer, Gazetteer of the Persian Gulf, Central Arabia and Oman, Government Printing House, Calcutta, 1915.
[8] W. M. Duff, C. A. Johnson, Accidentally Found on Purpose: Information-Seeking Behavior of Historians in Archives, The Library Quarterly 72 (2002) 472–496.
[9] N. Nagai, F. Kimura, A. Maeda, R. Akama, Personal Name Extraction from Japanese Historical Documents Using Machine Learning, in: International Conference on Culture and Computing, 2015, pp. 207–208.
[10] C. Neudecker, L. Wilms, W. J. Faber, T. van Veen, Large-Scale Refinement of Digital Historic Newspapers with Named Entity Recognition, in: Proceedings of the IFLA Newspapers/GENLOC Pre-Conference Satellite Meeting, 2014, pp. 232–246.
[11] B. Alex, C. Grover, E. Klein, R. Tobin, Digitised Historical Text: Does it Have to be mediOCRe?, in: KONVENS, 2012, pp. 401–409.
[12] A. Erdmann, C. Brown, B. Joseph, M. Janse, P. Ajaka, M. Elsner, M.-C. de Marneffe, Challenges and Solutions for Latin Named Entity Recognition, in: COLING, ACL, 2016.
[13] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural Architectures for Named Entity Recognition, in: Conference of the North American Chapter of the ACL: Human Language Technologies, ACL, 2016, pp. 260–270.
[14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python Natural Language Processing Toolkit for Many Human Languages, in: Proceedings of the 58th Annual Meeting of the ACL: System Demonstrations, ACL, 2020, pp. 101–108.
[15] N. Fuccaro, Knowledge at the Service of the British Empire: The Gazetteer of the Persian Gulf, Oman and Central Arabia, in: Transactions, volume 22, Swedish Research Institute in Istanbul, 2015, pp. 17–34.
[16] B. Thompson, J. Gwinnup, H. Khayrallah, K. Duh, P. Koehn, Overcoming Catastrophic Forgetting During Domain Adaptation of Neural Machine Translation, in: Conference of the North American Chapter of the ACL: Human Language Technologies, volume 1, ACL, 2019, pp. 2062–2068.
[17] S. Longpre, Z. Tu, C. DuBois, An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering, in: Proceedings of the Second Workshop on Machine Reading for Question Answering, ACL, 2019.
[18] Spacy.io, English spaCy models documentation, 2021. URL: https://spacy.io/models/en.
[19] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, R. Hadsell, Overcoming Catastrophic Forgetting in Neural Networks, Proceedings of the National Academy of Sciences 114 (2017) 3521–3526.
[20] R. Wolfe, A. Caliskan, Low Frequency Names Exhibit Bias and Overfitting in Contextualizing Language Models, 2021.