<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named entity recognition applied on a data base of Medieval Latin charters. The case of chartae burgundiae</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Torres Aguilar</string-name>
          <aff>DYPAC, Université Paris-Saclay</aff>
          <email>regester@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xavier Tannier</string-name>
          <aff>LIMSI, Univ. Paris-Sud, Université Paris-Saclay</aff>
          <email>xavier.tannier@limsi.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Chastang</string-name>
          <aff>DYPAC, Univ. Versailles St-Quentin, Université Paris-Saclay</aff>
          <email>pierre.chastang@uvsq.fr</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Named entity recognition (NER) in databases of historical texts is regarded as one of the most promising ways to build better retrieval and management tools for exploring large volumes of data. In this paper, we describe the application of NER, modelled with conditional random fields (CRF), to an annotated database of charters from Burgundy dating from the tenth to the thirteenth centuries. The aim is to generate a model for the automatic recognition of named entities in historical sources. We discuss the nature of the historical documents in the corpus and the extraction of rules, and we present the adaptation of the processing algorithm and the most common problems encountered in Medieval Latin texts that use diplomatic formularies, an atypical case within NER studies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In this paper we present the creation of a model for the automatic recognition of named entities in
historical sources written in Medieval Latin. The benefits of NER applied to digitized editions of
manuscripts are well known. The main hypothesis is that controlled indexing, through markup languages,
of the names of people, places and institutions within vast databases could
enrich the results of information retrieval engines and bring to light previously unexploited data from edited sources.
However, producing an annotated corpus is time-consuming and labor-intensive: a
research team can take months to annotate a corpus containing thousands of documents.</p>
      <p>Our work uses an annotated database of 5300 documents from editions of cartularies from
Burgundy (10th-13th centuries) compiled by the CBMA group*. On this basis, we intend to generate a
model that can automate or semi-automate the recognition of entities in Medieval Latin, the language of
most of the formal documentation written during the 10th-15th centuries. The model should be adaptable
to the different scribal variants that developed with the historical and institutional evolution of
writing practices. To accomplish this, we propose statistical modeling based on linear-chain CRF
(Conditional Random Fields), a machine-learning method for sequence labeling. This method makes it
possible to use a large number of features, from the words of a text to POS tags, lemmas, suffixes,
punctuation, etc., in order to cover all aspects of a word within its phrase context.</p>
      <p>A model capable of automating entity recognition can deliver, within a reasonable time, a
text enriched with semantic information and structural data. This will support historical studies of
long-term realities that have so far been poorly explored [1]. Moreover, the accelerated large-scale
homogenization of textual data could favor the study of historical sources with renewed quantitative
and statistical methods.</p>
      <p>* This work uses the results of the CBMA project (http://www.cbma-project.eu/), which aims to provide an open-access
database of the diplomatic sources produced during the Middle Ages in Burgundy and promotes studies on the epistemological
transformations of research in the Humanities brought about by digital tools. We thank and credit the network of
researchers involved in the CBMA project.</p>
      <p>
        NER has a long history in other fields of textual analysis, such as journalism and
medicine [2], and most recently in social networks. In recent years, the digital humanities have taken up
the recognition of named entities as one of the most promising ways to explore and approach
the huge masses of digitized documents, and in the past
five years new tools for this purpose have been developed. Literary
databases such as Project Gutenberg [3] and databases of digitized newspapers [4] have developed NER-based tools
for exploring written phenomena. More specific work on named entity detection can also be found
for a corpus of English parliamentary records of modern times [5], for an Anglo-Occitan corpus of medieval
times [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and for bibliographic records, where semantic relationships are also addressed [7].
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 Process</title>
      <sec id="sec-2-1">
        <title>3.1. Corpus description</title>
        <p>We use a database of nearly 19 thousand items obtained from 59 diplomatic editions from 300
cartularies and collections of charters produced in Cluniac and Cistercian abbeys of Burgundy.
The researchers responsible for the CBMA project (Chartae Burgundiae Medii Aevi) isolated a 5300-item
corpus from the tenth to thirteenth centuries, produced in Cluniac abbeys, on which named entities
(personal names and place names) were annotated manually. The data are also
arranged in columns that record the identifying information of each item.</p>
        <p>A cartulary is a volume in which the originals or transcripts of various kinds of documents (royal
charters, privileges, judgments, notarial minutes, etc.) are collected. These are formal documents that
follow a model form, i.e. a stereotyped discursive structure into which names of people, places and
organizations, dates, titles, etc. are inserted. All the documents are written in a medieval variant of Latin.
Since the texts come from diplomatic editions, they include elements that are not usually found in the
originals, such as punctuation, capitalization and the expansion of abbreviations.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Processing corpus</title>
        <p>Processing Latin requires compiling various observations on the morphology and syntax of
the language as they bear on the occurrence of named entities. As Latin is an inflected language, the function
of a word is usually carried by its ending. Personal names usually appear in the nominative
and accusative cases (-o, -us, -um endings). The genitive in Medieval Latin is problematic (-is,
-isis, -orum endings) because in some cases it presents a personal name and a place name without separation, e.g.:
Armundi (name) Vianensis (place) archiepiscopi. There is also a notable shortage of ablative and dative
cases. Inflection also restricts the use of prepositions and, in addition, makes word order in the
sentence less important. Therefore, information about the lemma and the suffix is crucial for
NER. Besides, the weakening and corruption of the rules of classical Latin cause a
series of irregularities, such as the overuse of enclitics, lexical redundancies and significant instability in
the writing and spelling of names.</p>
        <p>However, in syntactic terms, the context in which simple entities appear in Medieval Latin is not
very different from that in Romance languages. Important characteristics include, but are not
limited to:
• Classifying words: villa, terra, locum, mansum, castrum
• Titles and functions: Sanctus, dominus, presbiter, comes, rex, episcopus
• Prepositions and affixes: ad Artulfo, in villa Verziaco, per manum Aymini, pro anima Fromaldi</p>
        <p>Compound entities, by contrast, are highly characteristic of Medieval Latin:
• Hagiotoponyms: ecclesia in onore Sancti Andree; terra Sancti Petri; partes Sancti Petri campi
• Two-element anthroponyms: Matheus de Sosiaco; Hugo Gregarii; Odelinus filius Aginaldi; Petrus dicitur Grossus
• Appositions: Willebertus nomine; Regnante Lothario Rege; Quod Guido cantor</p>
        <p>In many cases, it is the meaning of the word and not its morphology which determines its function
(e.g. In alio [terre Adalrado (person)] place). In other cases, personal
names, places and institutions are confused, for example in the so-called donatio pro anima (e.g. Donamus a sancti
Pauli et Sancti Petri Cloniacense in pago Vianense), where the donation of land or property may be
made under the invocation of a saint (person) and become part of a land dedicated to the saint (place),
while the donation can be administered by a church dedicated to this saint (institution).</p>
        <p>Entities related to the names of saints can also lead to confusion because they appear in
different contexts of use: festivities, biblical references, allegories, invocations or donations. Most of
these entities are of little use to our model and we have not considered them in the annotation.</p>
        <p>A very problematic case is that of overlapping or nested entities: a person’s name that includes or
appears next to a place name, or that sits within an institutional name. There are usually two nesting levels, but
there can be as many as three. For example:</p>
        <sec id="sec-2-2-1">
          <title>Two levels</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>Three levels</title>
          <p>This phenomenon requires adding two or more columns of labels for certain tokens, which
complicates the treatment of the data, because most machine-learning classifiers are not designed to
assign more than one class to each instance. The loss of information at this level can be critical for
further work, because this phenomenon reveals the genesis and evolution of the compound name and
preserves information about family and social relationships [8]. All of these data help to attribute
an individual ID, which is what matters for History, to the named entities, which are what we retrieve.</p>
          <p>On the other hand, the initial corpus only contains annotations for names and places, which
is a serious problem when an entity that clearly should be classified as an institution appears. Using
the classic ENAMEX triad (person, place, institution) would require a long manual extension of
the original corpus. At the same time, for the first experiments, we want a model applied only to individuals
and not to legal and administrative entities. For this reason we kept the original annotation of about
95% of the institutional entities as places.</p>
          <p>In general, the recognition of entities in Medieval Latin concerns not only the lexical and
morphological properties of the word, but also its semantic and historical context. All of these
difficulties eventually led to a long process of manual validation of lists, extracted from the corpus,
containing the problematic entities. Correcting these lists provided an update of the annotation
guidelines and led to our gold-standard corpus.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Method</title>
      <p>The corpus was transformed into a seven-column format with lexical, morphological and semantic
information: TOKEN, POS, LEMMA, CASE, SUFFIX, NAME_Entity, GEO_Entity. The first three
columns were obtained from a variant of TreeTagger for Medieval Latin created by the
OMNIA group in 2013†. The next two columns record the capitalization of the
word (CASE) and its suffix (last three letters). For the columns containing named entities we
used BIO labels, which mark the categories and boundaries of named entities in
a very simple format: B-X, I-X and O represent the beginning, continuation or absence,
respectively, of a named entity.</p>
      <p>† http://www.glossaria.eu/treetagger/</p>
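      <p>As an illustration, the column layout described above could be produced along the following lines. This is a minimal Python sketch under stated assumptions, not the code used in this work: the helper names, the CAP/LOW values of the CASE column and the token-span input format are hypothetical, the POS and LEMMA columns (which come from TreeTagger) are omitted, and treating Matheus de Sosiaco as a person name with a nested place name is only one possible reading of the two entity columns.</p>
      <preformat>
```python
# Illustrative sketch only: helper names, CASE values and the span format
# are assumptions, not the pipeline used in the paper.

def shape_columns(token):
    """Return the CASE and SUFFIX columns for one token."""
    case = "CAP" if token[:1].isupper() else "LOW"
    suffix = token[-3:].lower()  # last three letters, as described in the text
    return case, suffix


def bio_column(n_tokens, spans, label):
    """Build one BIO column from (start, end) token spans, end exclusive."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def to_rows(tokens, pers_spans, loc_spans):
    """Emit rows TOKEN, CASE, SUFFIX, NAME_Entity, GEO_Entity."""
    pers = bio_column(len(tokens), pers_spans, "PERS")
    loc = bio_column(len(tokens), loc_spans, "LOC")
    rows = []
    for i, tok in enumerate(tokens):
        case, suffix = shape_columns(tok)
        rows.append((tok, case, suffix, pers[i], loc[i]))
    return rows


# A two-element anthroponym whose last token is also a place name.
rows = to_rows(["Matheus", "de", "Sosiaco"], pers_spans=[(0, 3)], loc_spans=[(2, 3)])
for row in rows:
    print("\t".join(row))
```
      </preformat>
      <p>Keeping one BIO column per entity type is what allows a nested token such as Sosiaco to carry a person label and a place label at the same time.</p>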
      <p>Once the textual corpus had been converted to the CRF format, we started the machine-learning-based
recognition. The documents were divided semi-randomly into two clearly distinct
sections: a larger section of 4300 documents for the development corpus and a smaller
section of 1000 documents for the test corpus. The development corpus was then split into a training
set of almost 4000 documents (&gt; 1 million words) for training the algorithm, with the rest forming the
development test set, which serves to reduce the degree of extrapolation of the model.</p>
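      <p>The partition can be sketched as follows in Python. The exact semi-random procedure is not specified here, so a seeded shuffle is assumed; the function name and the round figure of 4000 training documents (the text says "almost 4000") are likewise assumptions made for illustration.</p>
      <preformat>
```python
import random


def split_corpus(doc_ids, n_test=1000, n_train=4000, seed=0):
    """Shuffle document ids, then cut them into train / dev-test / test.

    A seeded shuffle stands in for the semi-random division; this is an
    assumption, not the procedure actually used.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    development = ids[n_test:]          # 4300 documents when 5300 are given
    train = development[:n_train]
    dev_test = development[n_train:]    # the remainder of the development corpus
    return train, dev_test, test


train, dev_test, test = split_corpus(range(5300))
print(len(train), len(dev_test), len(test))
```
      </preformat>
      <p>Splitting at the document level, rather than at the token level, keeps every charter entirely inside one subset.</p>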
      <p>In addition, we created a pattern file that specifies, token by token, which observations are drawn from the
document and which combinations of information across lines and columns are relevant, i.e., the lemma, the
entity and the phrase itself. Combining unigrams and bigrams, we wrote a pattern of 26 unigrams with
a window extending two positions ahead of and two positions behind each line in the token column
(see Table 1).</p>
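      <p>A reduced pattern file of this kind could be generated as in the Python sketch below. It is not the 26-pattern set used in this work: the column order, the feature names and the choice of a single bare bigram pattern are assumptions, and the syntax (a leading u or b, with %x[offset,column] referring to the input columns) follows our understanding of the Wapiti pattern format.</p>
      <preformat>
```python
# Hypothetical column order of the observation columns (labels excluded).
OBSERVATION_COLUMNS = ["tok", "pos", "lem", "case", "suf"]


def build_patterns(window=2):
    """Return pattern lines for a small, illustrative feature set."""
    lines = []
    # Token column (index 0): one unigram feature per offset in the window,
    # from two positions behind to two positions ahead.
    for off in range(-window, window + 1):
        lines.append(f"u:tok{off:+d}=%x[{off},0]")
    # Remaining observation columns, taken at the current position only.
    for col, name in enumerate(OBSERVATION_COLUMNS):
        if col == 0:
            continue
        lines.append(f"u:{name}+0=%x[0,{col}]")
    # Bare bigram pattern: depends on the previous and current labels.
    lines.append("b")
    return lines


patterns = build_patterns()
print("\n".join(patterns))
```
      </preformat>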
      <p>
        For this work, we annotated in the same format as that of the database, and we used Wapiti‡
[
        <xref ref-type="bibr" rid="ref8">9</xref>
        ], a toolkit for sequence labeling developed by LIMSI-CNRS, with standard options for
linear-chain CRF: the L-BFGS algorithm, which gave better results than alternatives such as BCD and
RPROP+, and default values for L1 and L2 regularization (0.5 and 0.0001 respectively).
      </p>
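      <p>For orientation, a training call with these settings might be assembled as in the Python sketch below. The file names are hypothetical, and the option names (--algo, --pattern, --rho1, --rho2) reflect our reading of the Wapiti command line; they should be checked against the installed version before use. The sketch only builds and prints the command and does not run it.</p>
      <preformat>
```python
def wapiti_train_command(pattern, train_file, model_file,
                         algo="l-bfgs", rho1=0.5, rho2=0.0001):
    """Build the argument list for a Wapiti training run.

    rho1 and rho2 are the L1 and L2 regularization weights quoted in the
    text; all file names passed in are placeholders.
    """
    return [
        "wapiti", "train",
        "--algo", algo,
        "--pattern", pattern,
        "--rho1", str(rho1),
        "--rho2", str(rho2),
        train_file,
        model_file,
    ]


cmd = wapiti_train_command("patterns.txt", "train.tsv", "model.crf")
print(" ".join(cmd))
# To launch it for real, one could pass cmd to subprocess.run(cmd, check=True).
```
      </preformat>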
    </sec>
    <sec id="sec-4">
      <title>5 Results</title>
      <p>The performance with all these data was remarkable, exceeding 96% F-measure on the test data for the
beginning of person names and 92% F-measure for the beginning of place names. The figures are lower
for the continuation of entities, at about 88% F-measure for person entities and a more modest
80% for locations. In general, precision is very close to recall for both person
names and locations, although for locations I-LOC scores about 10% lower than B-LOC.</p>
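      <table-wrap>
        <table>
          <thead>
            <tr><th>Entity type</th><th>Label</th><th>Precision</th><th>Recall</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>Person name</td><td>B-PERS</td><td>0.95</td><td>0.96</td><td>0.96</td></tr>
            <tr><td>Person name</td><td>I-PERS</td><td>0.88</td><td>0.92</td><td>0.90</td></tr>
            <tr><td>Location name</td><td>B-LOC</td><td>0.91</td><td>0.93</td><td>0.92</td></tr>
            <tr><td>Location name</td><td>I-LOC</td><td>0.81</td><td>0.80</td><td>0.80</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-4-3">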
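        <p>The F1 figures are the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short Python check below recomputes them from the rounded precision and recall values reported for each label; because the inputs are already rounded to two decimals, the recomputed values can differ from the reported F1 in the last digit. The dictionary and function names are illustrative.</p>
        <preformat>
```python
# Reported (precision, recall, F1) per label, copied from the results.
REPORTED = {
    "B-PERS": (0.95, 0.96, 0.96),
    "I-PERS": (0.88, 0.92, 0.90),
    "B-LOC": (0.91, 0.93, 0.92),
    "I-LOC": (0.81, 0.80, 0.80),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for label, (p, r, reported) in REPORTED.items():
    print(label, round(f1_score(p, r), 3), reported)
```
        </preformat>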
        <p>Our model is very good at distinguishing single entities and entities combined with trigger words
that precede the occurrence of an entity, such as villa, terra, ecclesia, mansus for places and ego, filius,
uxor, abbas, frater, etc. for names. Moreover, the model readily recognizes simple combinations
of name entities and handles entities of the form name + de + place, from which the first name plus
last name format originates. The model is less able (but still strong) at capturing the boundaries of entities within
long place names, especially when the composition is a combination (or overlap) of institutional and
geographical names.</p>
        <p>‡ https://wapiti.limsi.fr/</p>
        <p>The error rate in manual annotation is higher for this kind of entity because of the ambiguity,
and in a line-by-line examination of the output we noticed that a significant share of the errors in
our results is linked to errors in the annotation of our gold-standard corpus. The high recognition rate
is surely connected with the specific nature of the documents, which follow formulaic textual patterns.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Future work</title>
      <sec id="sec-5-1">
        <p>Three immediate improvements to the system are being planned:
i. We will run further iterations of the model with progressively smaller training corpora, in order to reach the most balanced outcome between the size of the training corpus and the recognition rates.
ii. We will test the robustness of the model with 400 new documents from different
periods, drawn from the large corpus, that we have annotated by hand, as well as on other documents
beyond our corpus. We will test the model on the same type of documents (charters and cartularies)
and on different document types, such as Latin chronicles or administrative documents.
iii. We will probably review the quality of the gold-standard corpus again in order to correct systematically
some manual errors in the annotation.</p>
        <p>* This work is supported by the "IDI 2016" project funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Moretti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Distant reading</article-title>
          . Verso Books.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bodnari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deleger</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavergne</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neveol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2013</year>
          , September).
          <article-title>A Supervised Named-Entity Extraction System for Medical Text</article-title>
          . In CLEF.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Brooke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammond</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>GutenTag: an NLP-driven tool for digital humanities research in the Project Gutenberg corpus</article-title>
          .
          <source>4th Workshop on Computational Linguistics for Literature</source>
          ,
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Mac Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cassidy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Finding Names in Trove: Named Entity Recognition for Australian Historical Newspapers</article-title>
          .
          <source>In Australasian Language Technology Association Workshop</source>
          ,
          <fpage>57</fpage>
          -
          <lpage>60</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Grover</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Givon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tobin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2008</year>
          , May).
          <article-title>Named Entity Recognition for Digitized Historical Texts</article-title>
          . In LREC.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Scrivner</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kübler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2015</year>
          , June).
          <article-title>Tools for digital humanities: Enabling access to the old occitan romance of flamenca</article-title>
          .
          <source>In Proceedings of the Fourth Workshop on Computational Linguistics for Literature</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Byrne</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2007</year>
          , September).
          <article-title>Nested named entity recognition in historical archive text</article-title>
          .
          <source>In International Conference on Semantic Computing (ICSC 2007)</source>
          . IEEE,
          <fpage>589</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
        <mixed-citation>
          <string-name>
            <surname>Beck</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>Anthroponymie et parenté</article-title>
          . Collection de l'Ecole française de Rome,
          <volume>226</volume>
          ,
          <fpage>495</fpage>
          -
          <lpage>496</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lavergne</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappé</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yvon</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2010</year>
          , July).
          <article-title>Practical very large scale CRFs</article-title>
          .
          <source>In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <fpage>504</fpage>
          -
          <lpage>513</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>