=Paper=
{{Paper
|id=Vol-3220/paper3
|storemode=property
|title=Extracting bibliographic references from footnotes with EXcite-docker
|pdfUrl=https://ceur-ws.org/Vol-3220/paper3.pdf
|volume=Vol-3220
|authors=Christian Boulanger,Anastasiia Iurshina
|dblpUrl=https://dblp.org/rec/conf/jcdl/BoulangerI22
}}
==Extracting bibliographic references from footnotes with EXcite-docker==
Christian Boulanger (Max Planck Institute for Legal History and Legal Theory, Frankfurt a.M.; corresponding author; boulanger@lhlt.mpg.de; https://lhlt.mpg.de/boulanger; ORCID 0000-0001-6928-3246) and Anastasiia Iurshina (University of Stuttgart; anastasiia.iurshina@ipvs.uni-stuttgart.de; ORCID 0000-0002-1231-2314)

Abstract

The paper presents a project that aims at providing a user-friendly way to perform domain-specific extraction and segmentation of references from PDF documents containing scholarship from the humanities and social sciences. The software builds on code developed by the EXcite project, adding a server and an improved web interface for producing a gold standard with which to train the extraction and segmentation models. The paper notes that the model trained with EXcite's gold standard, like comparable software, is optimized for documents in which bibliographic references are given in a bibliography section at the end of the document; the results are much worse for documents in which full or partial references appear in footnotes. Searching for ways to improve performance, we compare the accuracy of a model trained with a small set of annotated documents with references in footnotes with that of the default EXcite model and that of a model trained with a combined dataset. Preliminary results suggest that a specialized footnote model provides better extraction accuracy than a model trained with a combined dataset. We conclude with a roadmap for further improving the accuracy of the model.

Keywords: Digital Humanities, Scholarly literature, Citations, Footnotes, Reference mining, Reference extraction, Reference segmentation

1. Introduction

Current Open Source software for metadata mining, in particular for reference extraction, specializes in documents in which bibliographic references are listed in a bibliography section at the end of the document. In legal scholarship, the humanities and parts of the social sciences, however, the cited literature is referenced in footnotes, which contain the full (or sometimes only fragmentary) information on the source of the citation (see Fig. 1). As a result, the accuracy of reference extraction with the available tools is very low; this can be seen when trying to extract references from a paper with references in footnotes using the web services of GROBID (https://cloud.science-miner.com/grobid/) or EXcite (https://excite.informatik.uni-stuttgart.de/excite).

Figure 1: A typical example of bibliographic references in a footnote

In this paper, we describe ongoing work on a tool ("excite-docker") that builds on existing work in the EXcite project and is being used for reference mining in (socio-)legal studies. We compare the accuracy of a reference extraction model trained with a small set of annotated documents with references in footnotes with that of the default EXcite model and that of a model trained with a combined dataset. Preliminary results suggest that a specialized footnote model provides better extraction accuracy than a model trained with a combined dataset of footnote and bibliography data.
We conclude with a roadmap for further improving the accuracy of the model.

2. Related Work

2.1. Software for extraction of citation data

For a long time, technologies for the extraction of bibliographic metadata from scholarly articles, including citations, have been the domain of commercial services using closed-source technologies. The most important data source, in particular for bibliometric analyses, has been the Web of Science (https://webofscience.com), which is extremely expensive and restrictive in how its data can be used; other examples are Scopus, Dimensions or the discontinued Microsoft Academic Graph. For a couple of years now, Free and Open Source Software (FOSS) projects have emerged that develop these technologies and allow their use unencumbered by license fees or usage restrictions. Projects going back to the late 2000s and early 2010s, such as GROBID [1] or CERMINE [2], have concentrated on English-language papers in the natural sciences, which have - in all their variety - relatively similar document structures and citation patterns.

Since the focus on this kind of literature yielded sub-optimal results for the extraction of citations from German-language social science literature, the EXcite project (https://excite.informatik.uni-stuttgart.de) has developed a set of algorithms for information extraction and matching [3]. The results have been promising for German social science literature [4, 5]. There are, however, two areas in which the code as released in 2019 (see https://github.com/exciteproject/) is in need of improvement from the perspective of this paper. First, the corpus of documents used to train EXcite's default model contains only very few documents with references in footnotes (see https://github.com/exciteproject/EXgoldstandard). Not surprisingly, this results in an extremely low accuracy of reference identification; in fact, the number of false positives far outweighs the number of correctly identified citations. Second, the tools in their published form were not readily usable without intimate knowledge of the code.

2.2. Use case: the Legal Theory Graph Project

These challenges became apparent when one of the authors was looking for reference extraction software to use in a Digital Humanities project at the Max Planck Institute for Legal History and Legal Theory. The "Legal Theory Graph Project" aims at producing machine-readable data on legal theory scholarship since 1945 (with a focus on socio-legal theory), mapping scholars, institutions, publications and citations in a network graph (see https://www.lhlt.mpg.de/2514927/03-boulanger-legal-theory-graph?c=2367449). Data for this graph is collected, among other things, by harvesting metadata and by using text and reference mining techniques.

In accordance with the stated goals of the Max Planck Society (see https://openaccess.mpg.de/Berlin-Declaration), the project adheres to an Open Source and Open Access approach, which means that both the data produced and the technologies used should be freely available. This rules out relying on proprietary data sources and technologies. In any case, the coverage of scholarship in law, the social sciences and the humanities is very limited in both the commercial services and the newly emerging Open Access sources of bibliographic and bibliometric data [6, 7]. This is particularly true for publications which are not journal articles or which are not identified by a Digital Object Identifier (DOI).
In law, the social sciences and the humanities, however, a large part of research is published as books, as chapters in edited volumes, or as articles in journals which do not (yet) assign DOIs. Most importantly, all of these bibliographic and bibliometric data sources mainly contain data on English-language scholarship of the last 20-30 years, whereas the Legal Theory Graph Project focuses on German scholarship and covers a much larger time period. Some of this data is covered by Google Scholar [6, 7], but Google Scholar has no API and actively prevents data mining. For these reasons, the data had to be self-produced using freely available technologies. EXcite seemed to be the best candidate for the reference extraction part of the project's technical workflow (see Fig. 2).

Figure 2: The workflow of the Legal Theory Graph Project

2.3. Collaborative project: excite-docker

The authors collaborated to achieve the following tasks:

• working on EXcite's codebase to make it more readily usable for non-technical users, including for support staff producing annotations that can serve as training data for the reference extraction models;
• improving the algorithm's accuracy for the intended target literature;
• adding an evaluation workflow, in order to document improvements in accuracy in a reproducible way.

The result is the GitHub repository https://github.com/cboulanger/excite-docker. The original code has been forked from https://git.gesis.org/hosseiam/excite-docker but has been rewritten in large parts: the main ML algorithms are largely unchanged and have been written by the EXcite team members; all improvements to the extraction and evaluation algorithms are by A. Iurshina, who has also ported the code to Python 3; C. Boulanger has rewritten the web application for document annotation and has added the CLI. The code in the repository builds a Docker image which can be run on any platform that supports Docker. The Docker container provides a web app (see Figure 3) that can be used to produce annotations for the learning algorithm, and a command line interface to run the commands for training the machine learning models, extracting citation data and evaluating accuracy.

Figure 3: Web application for the annotation of documents

The web application has been used to annotate text extracted from PDFs from the German Journal of Law and Society (1980-; https://zfrsoz.info), which will be part of the Legal Theory Graph. Many papers published in this journal from the 1980s onwards use footnote citations, and the citation style is, in addition, very inconsistent. This data is ideal for stress-testing extraction models; in fact, initial extractions with the default EXcite model resulted in large amounts of unusable results.

3. Data

As training data, we used the EXcite gold dataset (https://github.com/exciteproject/EXgoldstandard) [5] as well as manually annotated papers from the German Journal of Law and Society. EXcite's dataset contains 125 annotated articles in German (2652 reference strings) and 100 articles in English (2838 reference strings in different languages). A small fraction of these references is in footnotes; most, however, are in the reference section. The dataset of German Journal of Law and Society papers contains 20 documents (970 reference strings).

For evaluation, only socio-legal papers with references in footnotes were used, because they represent the kind of data on which we want to see improvements. Since this dataset contains only 20 papers, we used 5 papers as the test set (454 reference strings).

4. Results

EXParser [3] was trained and evaluated in three configurations:

• trained on the EXcite gold data and tested on the test split of the socio-legal papers ("default");
• trained on the training split of the socio-legal papers and tested on the test split of the socio-legal papers ("footnotes only");
• trained on the EXcite gold data combined with the training split of the socio-legal papers and tested on the test split of the socio-legal papers ("combined").

Table 1 presents the evaluation results for the three configurations.

Table 1: Average accuracy for different configurations

Configuration     Extraction Acc   Segmentation Acc
Default           0.24             0.37
Footnotes only    0.26             0.37
Combined          0.22             0.47

Accuracy was calculated as follows: for extraction, we found the longest common sequence between each extracted reference string and the ground truth file and divided the length of this sequence by the length of the extracted line. For segmentation, we evaluated, for each reference string, the proportion of correctly classified tokens (title, author, etc.). We relied on accuracy as the evaluation metric rather than the F1-score, because the F1-score is not very suitable for the evaluation of reference extraction; besides, our goal was not to compare with EXParser but to see whether adding more data brings an improvement.
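A minimal sketch of how these two averaged metrics could be computed is given below, interpreting "longest common sequence" as the longest common subsequence; the function names, the per-line averaging and the token/label input format are illustrative assumptions, not the project's actual evaluation code.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def extraction_accuracy(extracted_lines: list[str], ground_truth_text: str) -> float:
    """Per extracted reference line: LCS(line, ground truth) / len(line), averaged over lines."""
    scores = [lcs_length(line, ground_truth_text) / len(line)
              for line in extracted_lines if line]
    return sum(scores) / len(scores) if scores else 0.0


def segmentation_accuracy(predicted_labels: list[list[str]], gold_labels: list[list[str]]) -> float:
    """Per reference string: share of tokens whose predicted field label (author, title,
    year, ...) matches the gold label, averaged over reference strings.
    Assumes predicted and gold token sequences are already aligned."""
    scores = []
    for pred, gold in zip(predicted_labels, gold_labels):
        if gold:
            scores.append(sum(p == g for p, g in zip(pred, gold)) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0
```

In this reading, spurious extracted lines share little text with the ground truth and therefore pull the average down, which is consistent with the observation above that false positives dominate when a bibliography-trained model is applied to footnote-style documents.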
5. Conclusion and Future Work

As Tkaczyk et al. argue, "tuning [extraction] models to the task-specific data results in the increase in the quality" [8]. In the case of references-in-footnotes vs. references-in-bibliography, this is to be expected, since the structure of the pages and of the text as a whole is very different. The results of this initial test make us skeptical that training the extraction model with a large diversity of documents will improve the accuracy of the extraction. Instead, they suggest that models fine-tuned to the type of document (footnote vs. bibliography) are the way to go, and that an extraction workflow should either try different models and keep the best outcome or - since this is time-intensive - use heuristics to select the model that suits the type of the paper.

The data allows this conclusion only for reference extraction. Concerning reference segmentation, the combined model did better. The difference can be explained by the fact that, even though citation styles vary widely (see, for example, https://citationstyles.org/ for an open source attempt to make this variety manageable for software), the structural differences between styles are much smaller than the difference between references-in-footnotes and references-in-bibliography.

In any case, a lot of work is still ahead. We have far too few of the annotated references-in-footnotes documents needed to improve the accuracy of the specialized footnotes model. Preparing such documents is a time-consuming task and would profit from collaboration with other projects that want to extract citation data from similar types of scholarly literature. Alternatively, synthetic training data could be produced from existing citation data on journals that use the references-in-footnotes style, although such data is yet to be located; a possible approach is sketched below. We are also looking into ways of increasing accuracy using word embeddings, part-of-speech tagging or dictionary-based approaches that would help to identify text that is expected to appear in references in the target literature and to further increase the accuracy of segmentation.
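As a purely illustrative sketch of the synthetic-data idea (the templates, field names and example record below are hypothetical assumptions, not data or code from the project): structured citation records could be rendered into footnote-style reference strings with a set of style templates, so that the field boundaries are known at generation time and the generated lines come with annotations for free.

```python
import random

# Hypothetical footnote-style citation templates; real templates would have to be
# derived from the house styles of the journals in question.
TEMPLATES = [
    "{author}, {title}, {container} {year}, S. {pages}.",
    "{author} ({year}): {title}, in: {container}, {pages}.",
    "Vgl. {author}, {title} ({year}), S. {pages}.",
]


def synthesize_footnote(record: dict) -> str:
    """Render one structured citation record as a footnote-style reference string."""
    return random.choice(TEMPLATES).format(**record)


def build_synthetic_training_lines(records: list[dict]) -> list[str]:
    """Turn citation records into synthetic 'footnote' lines; since each field's position
    in the rendered string is known, segmentation labels can be emitted alongside."""
    return [synthesize_footnote(r) for r in records]


if __name__ == "__main__":
    example = {  # hypothetical record, for illustration only
        "author": "Mustermann, Erika",
        "title": "Recht und Gesellschaft",
        "container": "Zeitschrift für Beispiele 12",
        "year": "1985",
        "pages": "27-35",
    }
    print(synthesize_footnote(example))
```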
Another direction to explore, provided that we have enough data, is replacing the manual feature extraction on which EXParser relies heavily with deep-learning-based models, or with a combination of manually and automatically extracted features.

References

[1] P. Lopez, GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications, in: M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou, G. Tsakonas (Eds.), Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2009, pp. 473–474. doi:10.1007/978-3-642-04346-8_62.

[2] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, L. Bolikowski, CERMINE: automatic extraction of structured metadata from scientific literature, International Journal on Document Analysis and Recognition (IJDAR) 18 (2015) 317–335. URL: https://doi.org/10.1007/s10032-015-0249-8. doi:10.1007/s10032-015-0249-8.

[3] A. Hosseini, B. Ghavimi, Z. Boukhers, P. Mayr, Excite – a toolchain to extract, match and publish open literature references, 2019. doi:10.1109/JCDL.2019.00105.

[4] B. Ghavimi, W. Otto, P. Mayr, An Evaluation of the Effect of Reference Strings and Segmentation on Citation Matching, in: A. Doucet, A. Isaac, K. Golub, T. Aalberg, A. Jatowt (Eds.), Digital Libraries for Open Knowledge, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2019, pp. 365–369. doi:10.1007/978-3-030-30760-8_35.

[5] Z. Boukhers, S. Ambhore, S. Staab, An end-to-end approach for extracting and segmenting high-variance references from PDF documents, 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (2019) 186–195.

[6] A.-W. Harzing, S. Alakangas, Google Scholar, Scopus and the Web of Science: a longitudinal and cross-disciplinary comparison, Scientometrics 106 (2016) 787–804. URL: https://doi.org/10.1007/s11192-015-1798-9. doi:10.1007/s11192-015-1798-9.

[7] A. Martín-Martín, M. Thelwall, E. Orduna-Malea, E. Delgado López-Cózar, Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations' COCI: a multidisciplinary comparison of coverage via citations, Scientometrics 126 (2021) 871–906. URL: https://doi.org/10.1007/s11192-020-03690-4. doi:10.1007/s11192-020-03690-4.

[8] D. Tkaczyk, A. Collins, P. Sheridan, J. Beel, Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers, in: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 99–108. doi:10.1145/3197026.3197048.