=Paper= {{Paper |id=Vol-3220/paper4 |storemode=property |title=Extracting literature references in German Speaking Geography – the GEOcite project |pdfUrl=https://ceur-ws.org/Vol-3220/paper4.pdf |volume=Vol-3220 |authors=Bastian Birkeneder,Philipp Aufenvenne,Christian Haase,Philipp Mayr,Malte Steinbrink |dblpUrl=https://dblp.org/rec/conf/jcdl/BirkenederAH0S22 }} ==Extracting literature references in German Speaking Geography – the GEOcite project== https://ceur-ws.org/Vol-3220/paper4.pdf
Extracting literature references in German Speaking
Geography – the GEOcite project
Bastian Birkeneder¹, Philipp Aufenvenne¹, Christian Haase¹, Philipp Mayr² and Malte Steinbrink¹
¹ Chair of Human Geography, University of Passau, Germany
² GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany


Abstract
The paper outlines the motivation and structure of the DFG-funded GEOcite project at the University of
Passau. The project works on a domain-specific approach to automatically extract, segment, match and
visualize literature references in the German speaking geography domain, with the objective of providing a
novel basis for a scientometric monitoring instrument for the community. In this paper, we describe the
GEOcite corpus and its construction, and elaborate on a preliminary evaluation of different approaches to
extract and segment references from the digitized part of the corpus. We further evaluate the EXCITE
segmentation model [1] on different datasets of German research papers. The results of our evaluation
show small improvements with domain-specific and increased training data.

Keywords
Reference extraction, Geography papers, Network analytics, Scientometric monitoring




1. Introduction
The GEOcite project presented in this paper is a central part of the overarching research project
"The Pillars of Unity and Disciplinary Bridges: Geographical Research between Rhetoric and
Practice", which has been funded by the German Research Foundation (DFG) since 2013. The
main focus of the project is the question of the unity of geography, a question nearly as
old as the discipline itself. While human geography sees itself as a social and cultural science,
physical geography is assigned to the natural sciences. In German speaking geography and beyond,
the relationship between physical and human geography has always been a matter of concern,
deeply interwoven with the discipline's identity [2, 3]. While the idea of bringing together
the natural and the social sciences is claimed as the discipline's unique selling point, there
is a growing awareness of centrifugal tendencies within geography that threaten its integrity
and cohesion as one academic discipline [4]. So far, these discussions about the discipline's
unity lack empirical support. Therefore, the project aims to provide an empirical basis by
using bibliometric and network analytic methods. Based on an analysis of publication and
citation patterns, the disciplinary structure of German speaking geography is investigated. In
the first phase of the project (2014–2018), the citation relationships of all geographers who held
a professorship at a German, Austrian or Swiss university in 2012 were collected. The citation
data was retrieved from journal papers published by these actors in the decade from 2003 to
2012. The data collection was carried out partly automatically using Scopus. Reference data from
geographical journals not listed in Scopus were manually extracted. The results of the first
project phase show that the discipline has clearly split into different subdisciplinary clusters
[5, 6, 2]. However, the subdisciplines are still more or less linked by citations. The
temporal dimension of the structuring process could not be taken into account, so it is not clear
whether the current situation is the result of growing together or drifting apart. Therefore, in
the second project phase (since 2019), the data basis was comprehensively expanded in order to
enable longitudinal analyses focusing on disciplinary dynamics over time. Our aim is to include
all journal publications of geography professors from German speaking countries from 1949
until today. For this purpose the scientometric monitoring tool GEOcite was developed. In the
following, the structure and functionality of GEOcite are explained.

ULITE workshop at JCDL 2022
Email: Bastian.Birkeneder@uni-passau.de (B. Birkeneder); Philipp.Aufenvenne@uni-passau.de (P. Aufenvenne);
Christian.Haase@uni-passau.de (C. Haase); philipp.mayr@gesis.org (P. Mayr); Malte.Steinbrink@uni-passau.de
(M. Steinbrink)
ORCID: 0000-0002-2460-4920 (B. Birkeneder); 0000-0001-7957-5752 (P. Aufenvenne); 0000-0002-6656-1658 (P. Mayr);
0000-0001-7503-2750 (M. Steinbrink)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   This paper matches several focus topics of the ULITE workshop¹. First, GEOcite builds
completely on open source software and has the objective of producing "Open infrastructures and
services for reference mining". Second, GEOcite is an application of EXCITE [7], an established
software framework for reference extraction and matching, in the geography domain. Third,
GEOcite matches the topic "Search, exploration and mining of the reference graph" in
that the retrieved data will ultimately be used for network analysis aiming at a deeper
understanding of historical changes and paradigmatic shifts within geography.


2. GEOcite: technical background
As a software solution, GEOcite aims to locate, collect and process extensive historical and
recent citation data (from 1949 to the present). Digital and analogue archives are used for this.
Extensive digitization work is being carried out, which forms the basis for an automated citation
data extraction. For this task the EXCITE tools are used (see Figure 1). Our aim is to create a
database for the analysis of current structures of disciplinary knowledge networks and their
historical genesis and development. GEOcite enables us to create the conditions for bibliometric
network analysis to better understand the disciplinary dynamics. The data obtained are made
available to the scientific community and are thus permanently accessible for empirical research
and historical observation of the discipline.
   Figure 1 gives an overview of the structure and the data flows of the GEOcite tool. At the
center is the GEOcite database [M1]. This links three datasets [M1a, b, c] necessary for the
planned bibliometric network analyses.
   In GEOcite, as in the first phase of the project, the actors considered are the geography
professors in German speaking countries. The GEOprof database [actor data, M1a] contains a
list of all geography professors since 1949 as well as additional biographical attribute data [8].

   ¹ https://exciteproject.github.io/ULITE-ws/

Figure 1: GEOcite database and data flow

   The second dataset [M1b] is a comprehensive compilation of the bibliographic information of
journal papers published by the geography professors listed in the GEOprof database (citing
articles). To cover both historical and current publication activity, the database draws on two
different sources: In addition to texts listed in Scopus, the data is taken from analog journals
that were digitized by the Göttingen Digitization Center (GDZ). Due to copyright legislation
only the title pages and the bibliographies of a paper were scanned. The copies were provided
as single files in TIFF format and needed to be processed in several steps: First, automated
text recognition (OCR) was performed. Errors detected during text recognition were corrected
manually. The image files were then converted into PDF format. After that the documents were
merged to combine the title pages of an article and the associated bibliography into one file.
The open source programs Cermine [9] and Grobid [10] were used to extract bibliographic data
from the title pages, specifically authors and title of an article.
   The third component [M1c] is a list of all works cited in the articles. In
addition to the complete bibliographic information of the references, the dataset also contains
the link to the actor data (GEOprof database) [M1a] and the source texts [M1b] in which they
were cited. While this bibliographic information can be queried directly in Scopus, extracting
the information of the cited works from the digitized corpus is more challenging. This is
precisely the application field of the EXCITE project, which has been running since 2016 and
is funded by the DFG [7]. EXCITE provides a tool for extracting literature references from
PDF files. For this purpose, the reference strings in PDF documents are automatically detected,
extracted and segmented. EXCITE was developed specifically for the extraction of citation
data from social science texts and has been trained mainly on recent German-language papers.
Relevant extracted bibliographic information is used to find matches between our actor database
and the processed scientific articles. The resulting links between citing actor (matched author of
an article) and cited actor (matched author in a reference) are then used to create our citation
network.
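
The linking step described above can be sketched as follows. This is a minimal illustration, not the GEOcite implementation: the matching key (accent-folded surname plus first initial), the function names and the input layout are assumptions made for this example, and the actual matching may rely on further attributes.

```python
from collections import Counter
import unicodedata


def normalize_name(surname, given_names):
    """Build a crude matching key: accent-folded surname plus first initial.

    This key is an assumption for illustration; it will confuse people who
    share a surname and a first initial.
    """
    def fold(text):
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return stripped.lower().strip()

    return f"{fold(surname)},{fold(given_names)[:1]}"


def build_citation_edges(actors, articles):
    """Count citations between actors.

    actors:   dict mapping a (hypothetical) actor id to (surname, given_names)
    articles: list of dicts with "authors" and "references"; every reference
              is a dict with its own "authors" list of (surname, given_names)
    Returns a Counter mapping (citing_actor_id, cited_actor_id) to a count.
    """
    key_to_actor = {normalize_name(s, g): actor_id for actor_id, (s, g) in actors.items()}
    edges = Counter()
    for article in articles:
        # Authors of the article that match an actor are the citing actors.
        citing = {key_to_actor.get(normalize_name(s, g)) for s, g in article["authors"]}
        citing.discard(None)
        for reference in article["references"]:
            # Authors of the reference that match an actor are the cited actors.
            cited = {key_to_actor.get(normalize_name(s, g)) for s, g in reference["authors"]}
            cited.discard(None)
            for source in citing:
                for target in cited:
                    edges[(source, target)] += 1
    return edges
```

Self-citations appear in this sketch as edges from an actor to itself, so they can be kept or filtered depending on the analysis.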
  GEOcite reuses the following EXCITE tools² [7]:
   1. EXannotator³ to build a dataset for Exparser model training.
   2. Exparser [1] to process and extract the references from the PDF corpus.

2.1. GEOcite Data
In Table 1, we outline the GEOcite corpus, which covers active male and female
professors of geography in Germany and other German speaking countries. In addition, we list
the number of relevant papers in Scopus and in our digitized article corpus. We divide
our data into 20-year bins (1949–1968; 1969–1988; 1989–2008) and a shorter final bin (2009–2022).

                                1949–1968   1969–1988   1989–2008   2009–2022     Total
  Active professors¹                  180         565         759         567     1,180
  Papers in Scopus²                   231       3,149      10,762      15,263    29,484³
  Papers in digitized corpus⁴       3,352       4,119       5,758       2,823    16,052

  ¹ Included are those professors who actively held a professorship in the respective time interval.
  ² The Scopus corpus includes only scientific papers written by relevant actors.
  ³ 23 articles from Scopus do not include a publication date.
  ⁴ The digitized corpus also contains articles by authors who are not part of our research group.
Table 1
Overview of the GEOcite corpus.
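
The binning used in Table 1 can be illustrated with a short sketch. The function names and the overlap rule for professorships are assumptions for this example. The total of 1,180 professors is smaller than the sum of the four bins (2,071), which suggests that a professor is counted in every interval in which the professorship was held.

```python
# Time intervals from Table 1 (inclusive start and end years).
BINS = [(1949, 1968), (1969, 1988), (1989, 2008), (2009, 2022)]


def bin_for_year(year):
    """Return the interval containing a publication year, or None if outside all bins."""
    for start, end in BINS:
        if start <= year <= end:
            return (start, end)
    return None


def active_bins(first_year, last_year):
    """Return every interval that overlaps a professorship lasting first_year..last_year."""
    return [(start, end) for start, end in BINS if first_year <= end and last_year >= start]
```

A paper falls into exactly one bin, whereas a professorship that spans a bin boundary contributes to several bins.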



2.2. GEOcite tools
As the first output of the GEOcite project, a comprehensive list of the geography professors in
Germany, Austria and Switzerland since 1949 is available. The GEOprof dataset is available to
the research community as a static download (CSV format) at the geoscience data publisher [8],
together with further information about the methods used to collect the data on the
professorships. In addition, an interactive map to explore the dataset is available on the project's
website⁴ (see Figure 2).
We further created a geography-specific dataset of annotated and segmented references
extracted from scientific articles. At the end of the project, a web-based platform for bibliometric
network analysis of the collected geographic citation data is planned. All datasets and tools are
or will be available for reuse⁵.
    ² https://github.com/exciteproject/
    ³ https://github.com/exciteproject/EXannotator
    ⁴ https://geographische-netzwerkstatt.uni-passau.de/de/geoprof/
    ⁵ https://github.com/GeoCite
Figure 2: GEOprof demonstrator with active professors of geography from 1949–2021; see the
interactive map⁶


   In the following section, we present and discuss a preliminary evaluation of reference
segmentation in our GEOcite corpus, as well as the next steps to be taken in the
project in order to optimize the segmentation process.


3. Evaluation of the reference segmentation
3.1. Set-up and Training
In order to create a comprehensive citation network between our actors, it is essential to extract
the important bibliographic data with error rates as low as possible. While the default Exparser models
provide sufficient results, the toolchain allows users to train a model on a custom dataset.
To improve the segmentation of our extracted references, we trained three
models on different datasets to examine the effects of more domain-specific articles and of more
articles in general. To increase the training data for the machine learning (ML) models used by
Exparser, we extracted references from 170 German geography research papers. These articles
were randomly chosen from our digital corpus and were published between 1952 and 2019. We
annotated and segmented these references according to the specified EXCITE requirements [11].
For comparison, the EXCITE dataset⁷ contains 125 German articles. As test set, we combined 10%
of our dataset and 10% of the EXCITE German gold standard. Training parameters were set
identical to those reported by Hosseini et al. [7]. We trained one model on our data (GEOcite model),
one on the EXCITE gold standard (EXCITE model), and one on both
training sets combined (Combined model).
    ⁷ https://github.com/exciteproject/EXgoldstandard
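
The experimental set-up can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the project: the text does not say whether the 10% were drawn per article or per reference, whether the draw was random, or whether the held-out part was excluded from training. The sketch assumes a random article-level split with the remaining 90% used for training; all function names are hypothetical.

```python
import random


def split_train_test(documents, test_fraction=0.1, seed=0):
    """Shuffle the documents reproducibly and hold out a fraction for testing."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)  # rounds down (assumption)
    return shuffled[n_test:], shuffled[:n_test]


def build_experiment(geocite_docs, excite_docs, test_fraction=0.1, seed=0):
    """Return the three training sets and the shared test set."""
    geo_train, geo_test = split_train_test(geocite_docs, test_fraction, seed)
    exc_train, exc_test = split_train_test(excite_docs, test_fraction, seed)
    training_sets = {
        "GEOcite model": geo_train,
        "EXCITE model": exc_train,
        "Combined model": geo_train + exc_train,
    }
    # All three models are evaluated on the same combined test set.
    return training_sets, geo_test + exc_test
```

Evaluating all three models on one shared test set keeps their scores directly comparable, at the cost that each single-source model is also tested on data from the other source.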

3.2. Results
The results of all three models are shown in Table 2.
  Label            F1 GEOcite model    F1 EXCITE model    F1 Combined model
  publisher              0.82                0.85                0.87
  last page              0.93                0.94                0.94
  surname                0.75                0.80                0.84
  article-title          0.90                0.91                0.90
  url                    0.89                0.88                0.87
  volume                 0.89                0.82                0.86
  source                 0.83                0.81                0.82
  given-names            0.82                0.85                0.86
  editor                 0.81                0.80                0.81
  first page             0.94                0.95                0.95
  year                   0.86                0.90                0.93
  identifier             0.73                0.75                0.76
  issue                  0.77                0.79                0.80
  other                  0.75                0.78                0.77
Table 2
Results of the three models, trained on different datasets. The F1-score is used as evaluation metric.
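
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it per label from two aligned label sequences. Scoring at token level is an assumption of this example; the evaluation granularity used for Table 2 is not stated in the text.

```python
from collections import Counter


def per_label_f1(gold, predicted):
    """Compute the F1-score of every label from aligned gold and predicted labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        if gold_label == predicted_label:
            true_pos[gold_label] += 1
        else:
            false_pos[predicted_label] += 1  # predicted label was wrong here
            false_neg[gold_label] += 1       # gold label was missed here
    scores = {}
    for label in set(gold) | set(predicted):
        tp, fp, fn = true_pos[label], false_pos[label], false_neg[label]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores
```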


   Our evaluation shows that a domain-specific dataset does not necessarily improve the output
of the Exparser segmentation model. In particular, tags that are important for our
work, such as surname, given-names, and article-title, achieve lower F1-scores with the GEOcite
model than with the EXCITE model. For several tags, we notice slightly improved results with
the Combined model. Our results indicate that combining more general data with training data
from our target domain can improve the Exparser segmentation model. Considering the predominant
use of the English language in the scientific community, it is no surprise that the majority of
datasets in this domain were collected from English publications. This unfortunately limits the
usefulness of large scale datasets like PMC Open Access [12] or DocBank [13] for German-language references.
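
The comparison discussed above can be reproduced from the values in Table 2. The following sketch only restates the reported scores; the helper names are hypothetical, and counting ties as "best" is a choice made for this example.

```python
from collections import Counter

MODELS = ("GEOcite", "EXCITE", "Combined")

# F1-scores copied from Table 2, in the order given by MODELS.
TABLE_2 = {
    "publisher": (0.82, 0.85, 0.87),
    "last page": (0.93, 0.94, 0.94),
    "surname": (0.75, 0.80, 0.84),
    "article-title": (0.90, 0.91, 0.90),
    "url": (0.89, 0.88, 0.87),
    "volume": (0.89, 0.82, 0.86),
    "source": (0.83, 0.81, 0.82),
    "given-names": (0.82, 0.85, 0.86),
    "editor": (0.81, 0.80, 0.81),
    "first page": (0.94, 0.95, 0.95),
    "year": (0.86, 0.90, 0.93),
    "identifier": (0.73, 0.75, 0.76),
    "issue": (0.77, 0.79, 0.80),
    "other": (0.75, 0.78, 0.77),
}


def best_models(scores):
    """Return the models that reach the highest F1-score for one label (ties included)."""
    top = max(scores)
    return [model for model, score in zip(MODELS, scores) if score == top]


def count_best_or_tied(table):
    """Count, per model, the labels on which it is best or tied for best."""
    counts = Counter()
    for scores in table.values():
        counts.update(best_models(scores))
    return counts


def difference(table, label, model_a, model_b):
    """F1 difference (model_a minus model_b) for one label, rounded to two decimals."""
    a = table[label][MODELS.index(model_a)]
    b = table[label][MODELS.index(model_b)]
    return round(a - b, 2)
```

On these values the Combined model is best or tied for best on 9 of the 14 labels, and the EXCITE model is ahead of the GEOcite model on surname, given-names and article-title, which is consistent with the observations in the text.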
One subject of our future research is the use of more sophisticated ML models. In recent
years, a paradigm shift in ML can be observed [14]: models like BERT [15] or GPT-3 [16],
trained on broad data at scale and used as foundation models, show outstanding results in NLP or
computer vision tasks. We experiment with multilingual models (e.g. XLM-R [17]) for text
features and instance segmentation models (e.g. Mask R-CNN [18]) for structural features. The
underlying idea is to use the available large scale datasets for language-independent models
and thereby circumvent the sparsity of data in individual languages.


4. Outlook
With the completion of the project, we will release a dataset of our citation network, as well as
all extracted citations from our corpus. Additionally, we will provide a REST API for all members
of the scientific community to query data from different actors, corresponding citations, and
various other attributes.
Similar to our GEOprof dataset, we will provide an interactive website where our citation
network and research results are visualized. Furthermore, our software platform GEOcite will
be entirely open source.
Initial empirical analyses based on the GEOcite corpus are already planned. For example,
there will be further investigations on the question of the unity of geography (see above). In
addition, work is planned on paradigm genesis and evolution in German speaking geography,
as well as specific bibliometric studies on the disadvantage of female geographers in the sense
of the so-called Matilda effect [19, 20] in the course of the discipline's history. As another
example, the self-citation behavior [21] of the community covered in the GEOcite corpus
can be analysed over the covered period.


Acknowledgments
This work was funded by DFG under grant 249237273, Die Säulen der Einheit und die
Brücken im Fach: Geographische Forschung zwischen Rhetorik und Praxis (GEOcite)
project, https://geographische-netzwerkstatt.uni-passau.de/geocite/.


References
 [1] Z. Boukhers, S. Ambhore, S. Staab, An end-to-end approach for extracting and segmenting
     high-variance references from pdf documents, in: Proceedings of the ACM/IEEE Joint
     Conference on Digital Libraries 2019, 2019, pp. 186–195. doi:10.1109/JCDL.2019.00035.
 [2] P. Aufenvenne, M. Steinbrink, Brüche und Brücken: Netzwerk- und zitationsanalytische
     Beobachtungen zur Einheit der Geographie, Geographie und Landeskunde 88 (2014)
     257–292.
 [3] C. Kesteloot, L. Bagnoli, Human and physical geography: Can we learn something from
     the history of their relations?, BELGEO (2021). doi:10.4000/belgeo.52627.
 [4] D. Demeritt, Dictionaries, disciplines and the future of geography, Geoforum 39 (2008)
     1811–1813. doi:10.1016/j.geoforum.2008.09.008.
 [5] M. Steinbrink, P. Aufenvenne, Integrative Geographiedidaktik? Versuch einer
     Positionsbestimmung der Fachdidaktik innerhalb der deutschsprachigen Geographie 142/143 (2016).
     URL: http://hw.oeaw.ac.at/?arp=7887-0inhalt/gwu142-143_03_Steinbrink-Aufenvenne.pdf.
     doi:10.1553/gw-unterricht142/143s5.
 [6] M. Steinbrink, P. Aufenvenne, On othering and mainstreamisation of new cultural
     geography. some scientometric observations, Mitteilungen der Österreichischen Geographischen
     Gesellschaft 159 (2017) 83–104. doi:10.1553/moegg159s83.
 [7] A. Hosseini, B. Ghavimi, Z. Boukhers, P. Mayr, EXCITE - A toolchain to extract, match
     and publish open literature references, in: Proceedings of the ACM/IEEE Joint Conference
     on Digital Libraries 2019, ACM, 2019, pp. 432–433. doi:10.1109/JCDL.2019.00105.
 [8] M. Steinbrink, P. Aufenvenne, M. Köhler, B. Birkeneder, GEOprof-Database: Datenbank
     der geographischen ProfessorInnenschaft im deutschsprachigen Raum ab 1949, 2021.
     doi:10.5880/FIDGEO.2021.018.
 [9] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, Ł. Bolikowski, CERMINE: automatic
     extraction of structured metadata from scientific literature, Int. J. Doc. Anal. Recognit. 18
     (2015) 317–335.
[10] Grobid, https://github.com/kermitt2/grobid, 2008–2022.
     arXiv:1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c.
[11] Excite documentation, https://exparser.readthedocs.io/en/latest/ReferenceParsing/, 2019.
     [Online; accessed 1-May-2022].
[12] Pmc open access subset [internet]. bethesda (md): National library of medicine,
     https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/, 2003. [Online; accessed 1-May-2022].
[13] M. Li, Y. Xu, L. Cui, S. Huang, F. Wei, Z. Li, M. Zhou, Docbank: A benchmark dataset for
     document layout analysis, 2020. arXiv:2006.01038.
[14] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein,
     J. Bohg, A. Bosselut, E. Brunskill, et al., On the opportunities and risks of foundation
     models, arXiv preprint arXiv:2108.07258 (2021).
[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[16] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv
     preprint arXiv:2005.14165 (2020).
[17] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
     M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
     scale (2019). arXiv:1911.02116.
[18] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE
     international conference on computer vision, 2017, pp. 2961–2969.
[19] M. W. Rossiter, The matthew matilda effect in science, Social Studies of Science 23 (1993)
     325–341. doi:10.1177/030631293023002004.
[20] P. Aufenvenne, C. Haase, F. Meixner, M. Steinbrink, Participation and communication
     behaviour at academic conferences – an empirical gender study at the german congress of
     geography 2019, Geoforum 126 (2021) 192–204.
     URL: https://www.sciencedirect.com/science/article/pii/S0016718521001986.
     doi:10.1016/j.geoforum.2021.07.002.
[21] A. Kacem, J. W. Flatt, P. Mayr, Tracking self-citations in academic publishing,
     Scientometrics 123 (2020) 1157–1165. doi:10.1007/s11192-020-03413-9.



A. Online Resources
The sources for the GEOcite project will be available via

     • GitHub.