=Paper=
{{Paper
|id=Vol-2253/paper10
|storemode=property
|title=The iDAI.publication: Extracting and Linking Information in the Publications of the German Archaeological Institute (DAI)
|pdfUrl=https://ceur-ws.org/Vol-2253/paper10.pdf
|volume=Vol-2253
|authors=Francesco Mambrini
|dblpUrl=https://dblp.org/rec/conf/clic-it/Mambrini18
}}
==The iDAI.publication: Extracting and Linking Information in the Publications of the German Archaeological Institute (DAI)==
Francesco Mambrini, Deutsches Archäologisches Institut, Podbielskiallee 69-71, Berlin, francesco.mambrini@dainst.de

===Abstract===
English. We present the results of our attempt to use NLP tools in order to identify named entities in the publications of the Deutsches Archäologisches Institut (DAI) and link the identified locations to entries in the iDAI.gazetteer. Our case study focuses on articles written in German and published in the journal Chiron between 1971 and 2014. We describe the annotation pipeline that starts from the digitized texts published in the new portal of the DAI. We evaluate the performance of geoparsing and NER and test an approach to improve the accuracy of the latter.

Italian abstract (translated). The paper describes the results of an experiment in applying NLP tools to annotate the named entities in the publications of the Deutsches Archäologisches Institut (DAI) and to link the identified toponyms to the corresponding entries in the iDAI.gazetteer. Our study focuses on the articles in German published in the journal Chiron between 1974 and 2014. We describe the annotation pipeline used to process the articles available in the new portal for the publications of the DAI. We discuss the results of the evaluation of the geoparsing and NER scripts and, finally, propose an approach to improve accuracy on the latter task.

===1 The iDAI.publications and the iDAI.world===
The Deutsches Archäologisches Institut (German Archaeological Institute, henceforth DAI) is a German agency operating within the sphere of responsibility of the federal Foreign Office; the goal of the institute is to promote research in archaeological sciences and on ancient civilizations worldwide. Founded in Rome in 1829, the DAI has developed into a complex institution, with branches and offices located around the world. The Institute has participated in several projects, including missions of paramount importance like those in Olympia, Pergamon or Elephantine.

One of the most visible outputs of this activity is the number of scientific publications produced by the DAI. The Institute currently publishes 14 international journals and 70 book series on different topics.[1] Since 2018, part of this collection has been accessible to the public on a new online portal named idai.publications for books and journals.[2] This ongoing initiative will not only enable researchers to have easier access to the published works; even more importantly, it will allow the Institute to integrate the data contained in articles and books (such as persons, places and archaeological sites, artifacts and monuments) into a network of all the other digital resources of the DAI.

All the digital collections of the DAI are indeed designed to operate within a network known as the idai.welt (or idai.world).[3] This network includes web collections such as "Arachne",[4] the database of archaeological monuments and artifacts of the DAI, and "Zenon",[5] the central bibliographic catalogue that serves all the libraries of the DAI offices around the world, but also compiles some of the most comprehensive bibliographies in the areas of activity of the different branches.

The other cornerstone of the idai.world is represented by the layer of web-based services such as thesauri and controlled vocabularies. The idai.gazetteer,[6] in particular, connects names of locations with unique identifiers and coordinates; the gazetteer is intended both to serve as a controlled list of toponyms for DAI's services and to link the geographic data with other gazetteers. Unique identifiers defined in the idai.gazetteer are already used to connect places and entries in Zenon and Arachne. In this way, users of these services can already query monuments and artifacts in Arachne or books in Zenon that are linked to a specific place.

[1] A list of journals is provided at: https://www.dainst.org/publikationen/zeitschriften/alphabetisch; for the list of book series: https://new.dainst.org/publikationen/reihen.

[2] See https://publications.dainst.org/journals/ and https://publications.dainst.org/books/.

[3] https://www.dainst.org/de/forschung/forschung-digital/idai.welt

[4] https://arachne.dainst.org/

[5] https://zenon.dainst.org/

[6] https://gazetteer.dainst.org/

===2 A pipeline for textual annotation===
This network of references holds a great potential for the DAI publications. Places, persons, artifacts, monuments, and other entities of interest mentioned within the publications can be identified and linked to the concepts in the appropriate knowledge bases of the DAI. The linking of the different relevant entities would allow researchers not just to retrieve the texts that, independently of the language of the publication, make reference to certain concepts of interest, but also to study such epistemologically relevant questions as the variation in the patterns of locations cited in the studies across decades.

While the linking between entries in Zenon and Arachne and the idai.gazetteer had been conducted manually, the volume and nature of the textual information to be processed in the publications encouraged us to turn to Natural Language Processing (NLP). We set up a pipeline for text annotation that aims to process the full texts of the publications, perform Named Entity Recognition (NER) to identify the mentions of the relevant entities, and finally link them to the appropriate entries in the idai.world.

We chose to build the first version of the pipeline around a series of open-source software packages that offer support for multiple languages and are widely used in the Digital Humanities (DH); at present, the annotation is limited to persons, places and organizations, and only the linking of place-names to the idai.gazetteer is supported.

====2.1 Preprocessing and NER====
The pipeline is programmed in Python and takes advantage of modules of the NLTK platform (Bird et al., 2009) for several tasks, like sentence- and word-tokenization.

The input of our annotation pipeline is, in the case of articles and books for which no other versions survive, the full text extracted from the PDF files of the articles.[7] The automatic recognition of the publication's main language is carried out by the Python library langid (Lui and Baldwin, 2011).

NER is performed using the Stanford Named Entity Recognizer (Finkel et al., 2005), which implements Conditional Random Field (CRF) sequence models. For a preliminary evaluation, we used pre-trained models for English, Spanish,[8] German (Faruqui and Padó, 2010), and Italian (Palmero Aprosio and Moretti, 2016). All these models are trained to recognize comparable classes of entities (persons, places, organizations and miscellaneous). We then chunked together the annotated tokens with a simple regular-expression chunker that groups consecutive tokens whose tag is not the empty tag (O) and labels the resulting phrase with the label of the first token in the series.

Part-of-speech (POS) tagging, though not strictly necessary for NER and geoparsing, as the out-of-the-box models for Stanford NER do not require it, is also supported by our pipeline. TreeTagger (Schmid, 1999) was chosen since it offered a vast array of pre-trained models for many languages.

[7] All the PDF files of the publications already include texts, so no Optical Character Recognition (OCR) is needed.

[8] Models for English and Spanish are available for download at https://stanfordnlp.github.io/CoreNLP/; for English we used the 4-class model trained on the CoNLL 2003 English training set.

====2.2 Geoparsing====
The task of resolving place names by linking them to identifiers from a gazetteer is commonly referred to as "geoparsing". The Edinburgh Geoparser[9] is a suite of tools that is often employed in DH (Grover et al., 2010; Alex, 2017) and allows users to preprocess texts, extract toponyms and resolve them by identifying the possible candidates in a gazetteer and scoring them. Users have the option to select between 4 gazetteers, and to set some parameters, like the coordinates of areas that will be given preference while ranking the candidates. The scoring process makes use of some properties recorded for places in gazetteers (e.g. the type of location, such as inhabited place or archaeological site) and, especially, compares locations pairwise with all other places identified; preference is thus given to places that cluster together.

Although Edinburgh works only with English and the idai.gazetteer is not supported, the CLI software is built as a suite of scripts, so that the input of a process is the output of the preceding one. By knowing the script that performs a task and the input it expects, it is therefore possible to inject a pre-processed text into any given step, while most processes (like scoring) are language-agnostic. We integrated the ranking script of Edinburgh within our pipeline to score, for any location that we extracted with our own NER scripts, any list of possible candidates matched in the idai.gazetteer.

[9] http://groups.inf.ed.ac.uk/geoparser/documentation/v1.1/html/

===3 Testing and Improving The Pipeline: a case study===
In this section we discuss the preliminary results obtained by running the pipeline described above on the complete series of one journal now available in the idai.publications. The results will serve as a baseline for future improvement.

====3.1 Chiron: the data set====
The first complete publication series that was added to the portal was Chiron, a journal published by the DAI's "Kommission für Alte Geschichte und Epigraphik" from 1970. Volumes from 1 to 44 (2014) are currently available,[10] for a total of 942 articles. The focus of the publication is on Graeco-Roman history and epigraphy; several articles contain lengthy quotations (or even full editions) of inscriptions in Greek or Latin.

Table 1 reports the total number of articles per language. As can be seen, quotations in Greek and Latin are sufficiently frequent and long to confuse the automatic recognition. In 39 cases, Latin or Greek was considered the main language of the publication. Luxembourgish (a West Germanic language) is also a clear mistake for German, possibly also prompted by lengthy quotations (Nollé and Wartner, 1987, for one likely case). The 44 volumes of the journal show an interesting distribution of languages, with German playing the most relevant role by far.[11]

{| class="wikitable"
|+ Table 1: Chiron: number of articles per language (actual count vs automatically recognized)
! Language !! Nr. Articles !! Auto rec.
|-
| German || 645 || 580
|-
| English || 211 || 222
|-
| French || 59 || 55
|-
| Italian || 17 || 15
|-
| Spanish || 10 || 12
|-
| Luxembourgish || 0 || 19
|-
| Greek and Lat. || 0 || 39
|}

[10] Readers are however requested to register an account.

[11] A word count on the automatically recognized languages confirms this conclusion: German has 7,394,004 words (60.48% of total), English 2,955,640, and French 899,888. Greek and Latin total 481,596 words; the other languages count between 193k and 148k words.

====3.2 Evaluating the annotation====
In this preliminary stage, we decided to focus on the 580 automatically identified German articles in order to evaluate the performance of our pipeline and to improve its accuracy.

We have manually corrected the NER annotation and geoparsing of 4 articles (Linke, 2009; Hammerstaedt, 2009; Sänger, 2010; Haensch and Mackensen, 2011), for a total of 36,159 words. The articles were selected so as to represent a broad scope of subjects (from papyrology, to social and religious history, to military archaeology) and geographic areas (North Africa, Asia Minor, Rome and Italy).

For the evaluation of our NER tools we adopted the same metrics (precision, recall and F<sub>β=1</sub> score) and methods as the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000). Note, in particular, that the scores are calculated at the level of the phrase, not of the single tag. The evaluation of the geoparser is based on the same principles, but instead of evaluating its performance on the automatically annotated texts, we re-ran the geoparser on the gold standard and evaluated that output.

{| class="wikitable"
|+ Table 2: NER: results of the first evaluation round; 1423 phrases; found: 1093; correct: 615
! Entity !! Precision !! Recall !! F<sub>β=1</sub>
|-
| Person || 73.21% || 47.13% || 57.34
|-
| Location || 67.18% || 34.56% || 45.64
|-
| Organization || 9.23% || 35.71% || 14.66
|-
| TOTAL || 56.27% || 43.22% || 48.89
|}

The scores reported in Table 2 are considerably below the state of the art in NER for German, as documented e.g. in the CoNLL 2003 shared task (Tjong Kim Sang and De Meulder, 2003). These results would very likely be considered insufficient or too noisy for the needs of researchers in the (Digital) Humanities.

Modules for NER trained on general corpora do not seem to be suited to annotate texts that belong to such a specific domain with acceptable accuracy. The poor performance with organizations, in particular, points to some peculiarities of the archaeological literature in comparison to texts included in most general-use corpora: companies, firms and other institutions, which are frequent in the news, are rarely found in scholarly texts of our domain; the organization tag is more often reserved either for ancient institutions (like "the Roman Senate") or for peoples and tribes ("the Aquitani"), which are hardly represented in ordinary corpora.

The performance of the geoparser, on the other hand, seems encouraging (Table 3). With gold-standard named entity recognition, the Edinburgh Geoparser combined with the idai.gazetteer attained scores that closely approximate, or even surpass, 80%. The evaluation of our annotation was also a valuable occasion to assess the accuracy and granularity of the idai.gazetteer: 38 locations in North Africa mentioned in one article (Haensch and Mackensen, 2011) did not have any record in DAI's gazetteer.

{| class="wikitable"
|+ Table 3: Geoparsing: results per article; 575 phrases; found: 545; correct: 455. Articles: L09 (Linke 2009), H09 (Hammerstaedt 2009), S10 (Sänger 2010), H&M11 (Haensch and Mackensen 2011)
! Article !! Precision !! Recall !! F<sub>β=1</sub>
|-
| L09 || 76.53% || 73.53% || 75.00
|-
| H09 || 97.87% || 95.83% || 96.84
|-
| S10 || 72.66% || 80.17% || 76.23
|-
| H&M11 || 86.67% || 74.71% || 80.25
|-
| TOTAL || 83.49% || 79.13% || 81.25
|}

====3.3 Applying in-domain NER models====
We decided to use the manually corrected articles to see whether we could improve on the baseline with the help of in-domain models. We trained a CRF model adding a series of linguistic features, like POS, which may help capture non-German expressions, or typesetting features such as the use of small and full caps.[12] As the articles in Chiron focus on the Greco-Roman civilization, we expect a lookup in lists of known toponyms of the Ancient World to noticeably improve the performance of NER for locations. We chose to add a gazetteer lookup to the list of features; we preferred to resort to a more specific resource like the "Digital Atlas of the Roman Empire" (DARE)[13] instead of the general-purpose idai.gazetteer.

{| class="wikitable"
|+ Table 4: NER: results of the in-domain model; average scores of 10-fold cross-validation
! Entity !! Precision !! Recall !! F<sub>β=1</sub>
|-
| Person || 80.00% || 71.41% || 75.30
|-
| Location || 76.26% || 58.90% || 65.87
|-
| Organization || 22.02% || 23.08% || 16.94
|-
| TOTAL || 79.32% || 65.75% || 71.75
|}

Table 4 reports the results of this second round of testing, which was conducted using the same methodology as before and performing a 10-fold cross-validation. As can be seen, the in-domain model considerably improves over the baseline. The performance with organizations is still largely insufficient, mainly on account of the scarcity of examples (70 phrases, vs 970 persons, 387 locations). The improvement with locations is significant, but the overall performance still leaves room for substantial improvement.

[12] The CRF implementation that we used is provided by the Python library sklearn-crfsuite (0.3.6).

[13] http://dare.ht.lu.se/

===4 Conclusions and future work===
The use of in-domain CRF models trained specifically for the target journal and adopting a specialized gazetteer for place names improves on the baseline of the out-of-the-box NER tools in our initial pipeline. It is likely that the accuracy on the Chiron data can be further increased with additional training. Given that an accurate recognition is a prerequisite for geoparsing, we plan to concentrate our effort on the NER components. We intend to progress in the direction discussed above, in particular by: (a) training and evaluating models for the other languages (French, English, Italian, Spanish); (b) testing the models on other publications in the portal.

In a more distant future, we also intend to include support for the identification (and subsequent linking) of other named entities of interest for archaeologists, such as artifacts, monuments and chronological references.

===References===
Beatrice Alex. 2017. Geoparsing English-Language Text with the Edinburgh Geoparser. https://programminghistorian.org/en/lessons/geoparsing-text-with-edinburgh.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly, New York.

Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of KONVENS 2010, Saarbrücken, Germany.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363–370, Stroudsburg, PA, USA. Association for Computational Linguistics.

Claire Grover, Richard Tobin, Kate Byrne, Matthew Woollard, James Reid, Stuart Dunn, and Julian Ball. 2010. Use of the Edinburgh geoparser for georeferencing digitized historical collections. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 368:3875–3889.

Rudolf Haensch and Michael Mackensen. 2011. Das tripolitanische Kastell Gheriat el-Garbia im Licht einer neuen spätantiken Inschrift: Am Tag, als der Regen kam. Chiron, 41:263–286.

Jürgen Hammerstaedt. 2009. Warum Simonides den Artemidorpapyrus nicht hätte fälschen können: Eine seltene Schreibung für Tausender in Inschriften und Papyri. Chiron, 39:323–338.

Bernhard Linke. 2009. Jupiter und die Republik. Die Entstehung des europäischen Republikanismus in der Antike. Chiron, 39:339–358.

Marco Lui and Timothy Baldwin. 2011. Cross-domain feature selection for language identification. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 553–561, Chiang Mai, Thailand, November. Asian Federation of Natural Language Processing.

Johannes Nollé and Sylvia Wartner. 1987. Ein tückischer Iotazismus in einer milesischen Inschrift. Chiron, 17:361–364.

A. Palmero Aprosio and G. Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints.

Patrick Sänger. 2010. Kommunikation zwischen Prätorianerpräfekt und Statthalter: Eine Zweitschrift von IvE Ia 44. Chiron, 40:89–102.

Helmut Schmid. 1999. Improvements in Part-of-Speech Tagging with an Application to German. In Susan Armstrong, Kenneth Church, Pierre Isabelle, Sandra Manzi, Evelyne Tzoukermann, and David Yarowsky, editors, Natural Language Processing Using Very Large Corpora, volume 11 of Text, Speech and Language Processing, pages 13–26. Kluwer Academic Publishers, Dordrecht.

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 Shared Task: Chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, ConLL '00, pages 127–132, Stroudsburg, PA. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of CoNLL-2003, pages 142–147. Edmonton, Canada.
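
A note on the scores: the phrase-level evaluation described in Section 3.2 can be checked against the counts printed with Tables 2 and 3. The Python sketch below is illustrative only and is not the code used in the study. The function names (`chunk_entities`, `prf`, `evaluate`) are hypothetical, and the chunker groups runs of tags with `itertools.groupby` rather than with the regular-expression chunker mentioned in Section 2.1. It merges consecutive non-O tags into phrases labelled after their first token, then computes precision, recall and F from the numbers of gold, found and correct phrases, counting a phrase as correct only when both its span and its label match.

```python
from itertools import groupby


def chunk_entities(tagged_tokens, outside="O"):
    """Group consecutive non-O tokens into phrases.

    `tagged_tokens` is a list of (token, tag) pairs. Each phrase takes the
    label of its first token. Returns (start, end_exclusive, label) triples.
    """
    phrases = []
    position = 0
    # groupby splits the sequence into alternating runs of tagged / untagged tokens
    for inside, run in groupby(tagged_tokens, key=lambda pair: pair[1] != outside):
        run = list(run)
        if inside:
            phrases.append((position, position + len(run), run[0][1]))
        position += len(run)
    return phrases


def prf(n_gold, n_found, n_correct):
    """Precision, recall and F (beta = 1) from phrase counts, as fractions."""
    precision = n_correct / n_found if n_found else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(gold_tags, predicted_tags):
    """Phrase-level scores: a phrase is correct if span and label both match."""
    gold = set(chunk_entities(gold_tags))
    found = set(chunk_entities(predicted_tags))
    return prf(len(gold), len(found), len(gold & found))
```

With the counts from the caption of Table 2 (1423 gold phrases, 1093 found, 615 correct), `prf(1423, 1093, 615)` gives 56.27% precision, 43.22% recall and an F score of 48.89, matching the TOTAL row. The counts given with Table 3 (575, 545, 455) likewise yield 83.49%, 79.13% and 81.25.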