=Paper=
{{Paper
|id=Vol-2253/paper10
|storemode=property
|title=The iDAI.publication: Extracting and Linking Information in the Publications of the German Archaeological Institute (DAI)
|pdfUrl=https://ceur-ws.org/Vol-2253/paper10.pdf
|volume=Vol-2253
|authors=Francesco Mambrini
|dblpUrl=https://dblp.org/rec/conf/clic-it/Mambrini18
}}
==The iDAI.publication: Extracting and Linking Information in the Publications of the German Archaeological Institute (DAI)==
<pdf width="1500px">https://ceur-ws.org/Vol-2253/paper10.pdf</pdf>
<pre>
       The iDAI.publication: extracting and linking information in the
         publications of the German Archaeological Institute (DAI)

                                      Francesco Mambrini
                                Deutsches Archäologisches Institut
                                  Podbielskiallee 69-71, Berlin
                              francesco.mambrini@dainst.de


                     Abstract                       responsibility of the federal Foreign Office; the
                                                    goal of the institute is to promote research in ar-
    English. We present the results of our at-      chaeological sciences and on ancient civilizations
    tempt to use NLP tools in order to iden-        worldwide. Founded in Rome in 1829, the DAI
    tify named entities in the publications of      has developed into a complex institution, with
    the Deutsches Archäologisches Institut         branches and offices located around the world.
    (DAI) and link the identified locations to      The Institute has participated in several projects,
    entries in the iDAI.gazetteer. Our              including missions of paramount importance like
    case study focuses on articles written in       those in Olympia, Pergamon or Elephantine.
    German and published in the journal Ch-
                                                       One of the most visible outputs of this activity
    iron between 1971 and 2014. We describe
                                                    is the amount of scientific publications produced
    the annotation pipeline that starts from the
                                                    by the DAI. The Institute currently publishes 14
    digitized texts published in the new portal
                                                    international journals and 70 book series on dif-
    of the DAI. We evaluate the performance
                                                    ferent topics.1 Since 2018, part of this collection
    of geoparsing and NER and test an ap-
                                                    has been accessible to the public on a new online
    proach to improve the accuracy of the lat-
                                                    portal named idai.publications for books
    ter.
                                                    and journals.2 This ongoing initiative will not only
    Italiano. Il paper descrive i risultati         enable researchers to have easier access to the pub-
    dell’esperimento di applicazione di stru-       lished works; even more importantly, it will allow
    menti di NLP per annotare le Named En-          the Institute to integrate the data contained in ar-
    tities nelle pubblicazioni del Deutsches        ticles and books (such as persons, places and ar-
    Archäologisches Institut (DAI) e colle-        chaeological sites, artifacts and monuments) into
    gare i toponimi identificati alle rispettive    a network of all the other digital resources of the
    voci dell’iDAI.gazetteer. Il nos-               DAI.
    tro studio si concentra sugli articoli in          All the digital collections of the DAI are indeed
    tedesco pubblicati nella rivista Chiron tra     designed to operate within a network known as the
    il 1971 e il 2014. Descriviamo la pipeline      idai.welt (or idai.world).3 This network
    di annotazione impiegata per processare         includes web collections such as “Arachne”,4 the
    gli articoli disponibili nel nuovo portale      database of archaeological monuments and arti-
    per le pubblicazioni del DAI. Discutiamo        facts of the DAI, and “Zenon”,5 the central biblio-
    i risultati della valutazione degli script di   graphic catalogue that serves all the libraries of the
    geoparsing e NER e, infine, proponiamo          DAI offices around the world, but also compiles
    un approccio per migliorare l’accuratezza          1
                                                        A list of journals is provided at: https://www.
    in quest’ultimo task.                           dainst.org/publikationen/zeitschriften/
                                                    alphabetisch; for the list of book series: https:
                                                    //new.dainst.org/publikationen/reihen.
                                                      2
1   The iDAI.publications and the                       See    https://publications.dainst.org/
                                                    journals/ and https://publications.dainst.
    iDAI.world                                      org/books/.
                                                      3
                                                        https://www.dainst.org/de/forschung/
The Deutsches Archäologisches Institut (Ger-       forschung-digital/idai.welt
man Archaeological Institute, henceforth DAI) is      4
                                                        https://arachne.dainst.org/
                                                      5
a German agency operating within the sphere of          https://zenon.dainst.org/
some of the most comprehensive bibliographies in        2.1    Preprocessing and NER
the areas of activity of the different branches.        The pipeline is programmed in Python and takes
   The other cornerstone of the idai.world              advantage of modules of the NLTK platform for
is represented by the layer of web-based ser-           several tasks (Bird et al., 2009), like sentence- and
vices such as thesauri and controlled vocabular-        word-tokenization.
ies. The idai.gazetteer,6 in particular, con-              The input of our annotation pipeline is, in the
nects names of locations with unique identifiers        case of articles and books for which no other ver-
and coordinates; the gazetteer is intended to serve     sions survive, the full text extracted from the PDF
both as a controlled list of toponyms for DAI’s         files of the articles.7 The automatic recognition
services and to link the geographic data with           of the publication’s main language is carried out
other gazetteers. Unique identifiers defined in the     by the Python library langid (Lui and Baldwin,
idai.gazetteer are already used to connect              2011).
places and entries in Zenon and Arachne. In this           NER is performed using the Stanford Named
way, users of these services can already query          Entity Recognizer (Finkel et al., 2005), which im-
monuments and artifacts in Arachne or books in          plements Conditional Random Field (CRF) se-
Zenon that are linked to a specific place.              quence models. For a preliminary evaluation,
                                                        we used pre-trained models for English, Span-
2       A pipeline for textual annotation
                                                        ish,8 German (Faruqui and Padó, 2010), and Ital-
This network of references holds a great poten-         ian (Palmero Aprosio and Moretti, 2016). All
tial for the DAI publications. Places, persons, ar-     these models are trained to recognize compara-
tifacts, monuments, and other entities of interest      ble classes of entities (persons, places, organiza-
mentioned within the publications can be identi-        tions and miscellaneous). We then chunked to-
fied and linked to the concepts in the appropriate      gether the annotated tokens with a simple regular-
knowledge bases of the DAI. The linking of the          expression chunker that takes consecutive, non-
different relevant entities would allow researchers     empty (i.e. non-O) tags together and labels them with the
not just to retrieve the texts that, independently      same label as the first token in the series.
from the language of the publication, make ref-            Part-of-speech (POS) tagging, though not
erence to certain concepts of interest, but also to     strictly necessary for NER and geoparsing, as the
study such epistemologically relevant questions as      out-of-the-box models for Stanford NER do not
the variation in the patterns of locations cited in     require it, is also supported by our pipeline. Tree-
the studies across decades.                             Tagger (Schmid, 1999) was chosen since it offered
    While the linking between entries in Zenon          a vast array of pre-trained models for many lan-
and Arachne and the idai.gazetteer had been             guages.
conducted manually, the volume and nature of the
                                                        2.2    Geoparsing
textual information to be processed in the publica-
tions encouraged us to turn to Natural Language         The task of resolving place names by linking them
Processing (NLP). We set up a pipeline for text         to identifiers from a gazetteer is commonly re-
annotation that aims to process the full texts of the   ferred to as “geoparsing”. The Edinburgh Geop-
publications, perform Named Entity Recognition          arser9 is a suite of tools that is often employed in
(NER) to identify the mentions of the relevant en-      DH (Grover et al., 2010; Alex, 2017) and allows
tities, and finally link them to the appropriate en-    users to preprocess texts, extract toponyms and re-
tries in the idai.world.                                solve them by identifying the possible candidates
    We chose to build the first version of the          in a gazetteer and scoring them. Users have the op-
pipeline around a series of open-source tools           tion to select between 4 gazetteers, and to set some
that offer support for multiple languages and           parameters, like the coordinates of areas that will
are widely used in the Digital Humanities (DH);            7
                                                              All the PDF files of the publications already include
at present, the annotation is limited to persons,       texts, so no Optical Character Recognition (OCR) is needed.
                                                            8
places and organizations, and only the linking of             Models for English and Spanish are available for
                                                        download at https://stanfordnlp.github.io/
place-names to the idai.gazetteer is sup-               CoreNLP/; for English we used the 4 Class model CoNLL
ported.                                                 2003 English training set.
                                                            9
                                                              http://groups.inf.ed.ac.uk/geoparser/
    6
        https://gazetteer.dainst.org/                   documentation/v1.1/html/
be given preference while ranking the candidates.                       Language          Nr. Articles     Auto rec.
The scoring process makes use of some properties
                                                                        German                      645           580
recorded for places in gazetteers (e.g. the type of
                                                                        English                     211           222
location, such as inhabited place or archaeological
                                                                        French                       59            55
site) and especially by comparing locations pair-
                                                                        Italian                      17            15
wise with all other places identified; preference is
                                                                        Spanish                      10            12
thus given to places that cluster together.
                                                                        Luxembourgish                 0            19
   Although Edinburgh works only with English
                                                                        Greek and Lat.                0            39
and the idai.gazetteer is not supported, the
CLI software is built as a suite of scripts, so that
                                                                 Table 1: Chiron: number of articles per language
the input of a process is the output of the preced-
                                                                 (actual count vs automatically recognized)
ing one. By knowing the script that performs a
task and the input it expects, it is therefore possible
to inject a pre-processed text into any given step,              most relevant role by far.11
while most processes (like scoring) are language-
agnostic. We integrated the ranking script of Ed-                3.2     Evaluating the annotation
inburgh within our pipeline to score, for any loca-              In this preliminary stage, we decided to focus on
tion that we extracted with our own NER scripts,                 the 580 automatically identified German articles in
any list of possible candidates matched in the                   order to evaluate the performance of our pipeline
idai.gazetteer.                                                  and to improve its accuracy.
                                                                    We have manually corrected the NER annota-
3        Testing and Improving The Pipeline: a
                                                                 tion and geoparsing of 4 articles (Linke, 2009;
         case study
                                                                 Hammerstaedt, 2009; Sänger, 2010; Haensch and
In this section we discuss the preliminary results               Mackensen, 2011), for a total of 36,159 words.
obtained by running the pipeline described above                 The articles were selected so as to represent a
on the complete series of one journal now avail-                 broad scope of subjects (from papyrology, to so-
able in the idai.publications. The results                       cial and religious history, to military archaeology)
will serve as a baseline for future improvement.                 and geographic areas (North Africa, Asia Minor,
                                                                 Rome and Italy).
3.1 Chiron: the data set                                            For the evaluation of our NER tools we adopted
The first complete publication series that was                   the same metrics (precision, recall and Fβ=1
added to the portal was Chiron, a journal published              score) and methods of the CoNLL-2000 shared
by the DAI’s “Kommission für Alte Geschichte                    task (Tjong Kim Sang and Buchholz, 2000). Note,
und Epigraphik” from 1970. Volumes from 1 to                     in particular, that the scores are calculated at the
44 (2014) are currently available,10 for a total of              level of the phrase, not of the single tag. The
942 articles. The focus of the publication is on                 evaluation of the geoparser is also based on the
Graeco-Roman history and epigraphy; several ar-                  same principles, but instead of evaluating its per-
ticles contain lengthy quotations (or even full edi-             formances on the automatically annotated texts,
tions) of inscriptions in Greek or Latin.                        we re-ran the geoparser on the gold-standard and
   Table 1 reports the total number of articles per              evaluated that output.
language. As can be seen, quotations in Greek and                   The scores reported in Table 2 are considerably
Latin are sufficiently frequent and long to confuse              below the state of the art in NER for German, as
the automatic recognition. In 39 cases, Latin or                 documented e.g. in the CoNLL 2003 shared task
Greek were considered the main language of the                   (Tjong Kim Sang and De Meulder, 2003). These
publication. Luxembourgish (a West Germanic                      results would very likely be considered insuffi-
language) is clearly a mistake for German, also                  cient or too noisy for the needs of researchers in
possibly prompted by lengthy quotations (Nollé                  the (Digital) Humanities.
and Wartner, 1987, for one likely case). The 44                    11
                                                                     A word count on the automatically recognized languages
volumes of the journal show an interesting dis-                  confirms this conclusion: German has 7,394,004 words
tribution of languages, with German playing the                  (60.48% of total), English 2,955,640, and French 899,888.
                                                                 Greek and Latin total 481,596 words; the other languages
    10
         Readers are however requested to register an account.   count between 193k and 148k words.
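
The phrase-level scoring described in Section 3.2 can be sketched as follows. This is
a minimal illustration, not the evaluation script used for the paper: the function
name phrase_scores and the (start, end, label) representation of a phrase are
assumptions made for the example. A predicted phrase counts as correct only if both
its boundaries and its label match a gold phrase.

```python
def phrase_scores(gold, predicted):
    """Return (precision, recall, f1) for two collections of phrases.

    Each phrase is a hypothetical (start, end, label) tuple; a prediction is
    correct only when the whole tuple matches a gold phrase.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: the second prediction has the right span but the wrong label,
# and the fourth prediction has no counterpart in the gold standard.
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
predicted = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG"), (12, 13, "LOC")]
precision, recall, f1 = phrase_scores(gold, predicted)
```

Applied to the counts of the first evaluation round reported in Table 2 (615 correct
phrases out of 1093 found and 1423 in the gold standard), the same formulas give the
TOTAL row: 56.27% precision, 43.22% recall and an F score of 48.89.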
   Entity           Precision      Recall     Fβ=1      3.3       Applying in-domain NER models
   Person            73.21%      47.13%       57.34     We decided to use the manually corrected articles
   Location          67.18%      34.56%       45.64     to see whether we could improve on the baseline
   Organization       9.23%      35.71%       14.66     with the help of in-domain models. We trained a
                                                        CRF model adding a series of linguistic features,
   TOTAL             56.27%      43.22%       48.89     like POS, which may help capture non-German
                                                        expressions, or type-set features such as the use of
Table 2: NER: results of the first evaluation round;
                                                        small- and full-caps.12 As the articles in Chiron
1423 phrases; found: 1093; correct: 615
                                                        focus on the Greco-Roman civilization, we expect
                                                        a lookup in lists of known toponyms of the An-
                                                        cient World to appreciably improve the performance
   Modules for NER trained on general corpora do
                                                        of NER for locations. We chose to add a gazetteer
not seem to be suited to annotate texts that belong
                                                        lookup to the list of features; we preferred to re-
to such a specific domain with acceptable accu-
                                                        sort to a more specific resource like the “Digital
racy. The poor performance with organizations,
                                                        Atlas of the Roman Empire” (DARE)13 instead of
in particular, points to some peculiarities of the ar-
                                                        the general-purpose idai.gazetteer.
chaeological literature in comparison to texts in-
cluded in most general-use corpora: companies,
firms and other institutions, which are frequent in          Entity           Precision      Recall      Fβ=1
the news, are rarely found in scholarly texts of             Person             80.00%      71.41%      75.30
our domain; the organization tag is more often re-           Location           76.26%      58.90%      65.87
served either to ancient institutions (like “the Ro-         Organization       22.02%      23.08%      16.94
man Senate”) or peoples and tribes (“the Aqui-
                                                             TOTAL              79.32%      65.75%      71.75
tani”) which are hardly represented in ordinary
corpora.                                                Table 4: NER: results of the in-domain model; av-
                                                        erage scores of 10-fold cross-validation
     Article      Precision      Recall     Fβ=1
     L09           76.53%       73.53%      75.00          Table 4 reports the results of this second round
     H09           97.87%       95.83%      96.84       of testing, which was conducted using the same
     S10           72.66%       80.17%      76.23       methodology as before and performing a 10-fold
     H&M11         86.67%       74.71%      80.25       cross-validation. As can be seen, the in-domain
     TOTAL         83.49%       79.13%      81.25       model considerably improves over the baseline.
                                                        The performance with organizations is still largely
Table 3: Geoparsing: results per article; 575           insufficient, mainly on account of the scarcity of
phrases; found: 545; correct: 455. Articles: L09        examples (70 phrases, vs 970 persons, 387 loca-
(Linke 2009), H09 (Hammerstaedt 2009), S10              tions). The improvement with locations is signifi-
(Sänger 2010), H&M11 (Haensch and Mackensen             cant, but the overall performance still leaves room
2011)                                                   for substantial improvement.

                                                        4        Conclusions and future work
  The performance of the geoparser, on
the other hand, seems encouraging (Table 3).            The use of in-domain CRF models trained specif-
With gold-standard named entity recognition,            ically for the target journal and adopting a spe-
the Edinburgh Geoparser combined with the               cialized gazetteer for place names improves on the
idai.gazetteer attained scores that closely             baseline of the out-of-the-box NER tools in our
approximate, or even surpass, 80%. The evaluation       initial pipeline. It is likely that the accuracy on
of our annotation was also a valuable occasion          the Chiron data can be further increased with addi-
to assess the accuracy and granularity of the           tional training. Given that an accurate recognition
idai.gazetteer: 38 locations in North                   is a prerequisite for geoparsing, we plan to con-
Africa mentioned in one article (Haensch and                12
                                                             The CRF implementation that we used is provided by the
Mackensen, 2011) did not have any record in             Python library sklearn-crfsuite (0.3.6).
                                                          13
DAI’s gazetteer.                                             http://dare.ht.lu.se/
centrate our effort on the NER components. We                 Natural Language Processing, pages 553–561, Chi-
intend to progress in the direction discussed above,          ang Mai, Thailand, November. Asian Federation of
                                                              Natural Language Processing.
in particular by: a. training and evaluating models
for the other languages (French, English, Italian,          Johannes Nollé and Sylvia Wartner. 1987. Ein
Spanish) b. testing the models on other publica-              tückischer Iotazismus in einer milesischen Inschrift.
tions in the portal.                                          Chiron, 17:361–364.
   In a more distant future, we also intend to in-          A. Palmero Aprosio and G. Moretti. 2016. Italy goes
clude support for the identification (and subsequent          to Stanford: a collection of CoreNLP modules for
linking) of other named entities of interest for ar-          Italian. ArXiv e-prints.
chaeologists, such as artifacts, monuments and              Patrick Sänger. 2010. Kommunikation zwischen
chronological references.                                     Prätorianerpräfekt und Statthalter: Eine Zweitschrift
                                                              von IvE Ia 44. Chiron, 40:89–102.
                                                          Helmut Schmid. 1999. Improvements in Part-of-
References                                                   Speech Tagging with an Application to German.
Beatrice Alex.        2017.      Geoparsing English-         In Susan Armstrong, Kenneth Church, Pierre Is-
   Language Text with the Edinburgh Geoparser.               abelle, Sandra Manzi, Evelyne Tzoukermann, and
   https://programminghistorian.org/en/lessons/geoparsing- David Yarowsky, editors, Natural Language Pro-
   text-with-edinburgh.                                      cessing Using Very Large Corpora, volume 11 of
                                                             Text, Speech and Language Processing, pages 13–
Steven Bird, Ewan Klein, and Edward Loper.                   26. Kluwer Academic Publishers, Dordrecht.
   2009. Natural Language Processing with Python.
   O’Reilly, New York.                                    Erik F. Tjong Kim Sang and Sabine Buchholz. 2000.
                                                             Introduction to the CoNLL-2000 Shared Task:
Manaal Faruqui and Sebastian Padó. 2010. Train-             Chunking. In Proceedings of the 2nd Workshop
   ing and evaluating a German named entity recog-           on Learning Language in Logic and the 4th Confer-
   nizer with semantic generalization. In Proceedings        ence on Computational Natural Language Learning
   of KONVENS 2010, Saarbrücken, Germany.                   - Volume 7, CoNLL ’00, pages 127–132, Strouds-
                                                             burg, PA. Association for Computational Linguis-
Jenny Rose Finkel, Trond Grenager, and Christopher           tics.
   Manning. 2005. Incorporating Non-local Informa-
   tion into Information Extraction Systems by Gibbs      Erik F. Tjong Kim Sang and Fien De Meulder.
   Sampling. In Proceedings of the 43rd Annual Meet-         2003. Introduction to the CoNLL-2003 shared task:
   ing on Association for Computational Linguistics,         Language-independent named entity recognition. In
   ACL ’05, pages 363–370, Stroudsburg, PA, USA.             Walter Daelemans and Miles Osborne, editors, Pro-
   Association for Computational Linguistics.                ceedings of CoNLL-2003, pages 142–147. Edmon-
                                                             ton, Canada.
Claire Grover, Richard Tobin, Kate Byrne, Matthew
   Woollard, James Reid, Stuart Dunn, and Julian Ball.
   2010. Use of the Edinburgh geoparser for georefer-
   encing digitized historical collections. Philosophi-
   cal Transactions of the Royal Society of London A:
   Mathematical, Physical and Engineering Sciences,
   368:3875–3889.

Rudolf Haensch and Michael Mackensen. 2011. Das
  tripolitanische Kastell Gheriat el-Garbia im Licht
  einer neuen spätantiken Inschrift: Am Tag, als der
  Regen kam. Chiron, 41:263–286.

Jürgen Hammerstaedt. 2009. Warum Simonides den
    Artemidorpapyrus nicht hätte fälschen können: Eine
    seltene Schreibung für Tausender in Inschriften und
    Papyri. Chiron, 39:323–338.

Bernhard Linke. 2009. Jupiter und die Republik.
  Die Entstehung des europäischen Republikanismus
  in der Antike. Chiron, 39:339–358.

Marco Lui and Timothy Baldwin. 2011. Cross-domain
 feature selection for language identification. In Pro-
 ceedings of 5th International Joint Conference on

</pre>
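
The chunking step described in Section 2.1 of the text above (consecutive non-O tags are merged into one phrase that takes the label of its first token) can be illustrated with a short sketch. The paper mentions a regular-expression chunker; the version below uses Python's `itertools.groupby` instead, and the function name `chunk_entities`, the tag names and the sample sentence are assumptions made for the example, not the pipeline's actual code.

```python
from itertools import groupby


def chunk_entities(tagged_tokens):
    """Merge consecutive non-O tokens into phrases.

    tagged_tokens is a list of (token, tag) pairs, with "O" marking tokens
    outside any entity. Each run of non-O tokens becomes one phrase labelled
    with the tag of its first token.
    """
    chunks = []
    for is_entity, run in groupby(tagged_tokens, key=lambda pair: pair[1] != "O"):
        if not is_entity:
            continue  # skip runs of O-tagged tokens
        run = list(run)
        text = " ".join(token for token, _ in run)
        chunks.append((text, run[0][1]))
    return chunks


# Hypothetical tagger output for a short German sentence.
tagged = [
    ("Theodor", "PER"), ("Mommsen", "PER"), ("reiste", "O"),
    ("nach", "O"), ("Rom", "LOC"), (".", "O"),
]
chunks = chunk_entities(tagged)
```

One consequence of this rule is that two adjacent entities with different tags are merged into a single phrase carrying the first tag, which may be one source of boundary errors in a phrase-level evaluation.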