Into the bibliography jungle: using random forests to predict dissertations’ reference section

Silvia E. Gutiérrez De la Torre1,∗, Julián Equihua2, Andreas Niekler1 and Manuel Burghardt1
1 Computational Humanities Group (Leipzig University)
2 Helmholtz-Centre for Environmental Research (Leipzig)

Abstract
Cited-works lists in humanities dissertations are typically the result of five years of work. However, despite the long-standing tradition of reference mining, no research has systematically tapped the bibliographic data of existing electronic thesis collections. One of the main reasons for this is the difficulty of creating a tagged gold standard for theses that are around 300 pages long. In this short paper, we propose a page-based random forest (RF) prediction approach that uses a new corpus of literary studies dissertations from Germany. Moreover, we explain the handcrafted but computationally informed feature-selection process. The evaluation demonstrates that this method achieves an F1 score of 0.88 on this new dataset. In addition, the approach has the advantage of being derived from an interpretable model, in which the relevance of each feature for the prediction is clear, and it incorporates a simplified annotation process.

Keywords
electronic theses and dissertations, bibliographic reference parsing, information retrieval, machine learning

1. Introduction
Citation analysis (CA) of dissertations, that is, the examination of cited works in theses, has a long history within bibliometric studies but still no computational operationalization. Most approaches address collection development needs in libraries and thus seek to ascertain what types of documents are the most frequently used in the doctoral research stage [1, 2, 3, 4, 5].
To a lesser degree, some studies have investigated other research behaviors such as interdisciplinarity [6, 7], language use [8, 9, 10], and trends in specific domains such as chemistry [4, 2], library science [8, 6, 11], sociology/anthropology [10], atmospheric science [12], agriculture/biology [13], and mathematics education [14]. Because the manual extraction of references is tedious, only small-scale studies are to be found so far; the big picture of citation strategies in Ph.D. theses is, however, yet to be painted. In this paper, we propose a computational approach to automatically mine references from a large corpus of – mostly – German-language dissertations in the field of literary studies. The ultimate goal of this endeavor is to investigate the epistemological foundations of this field by means of systematic, large-scale citation analysis. While there are a number of examples of citation analyses in the humanities [12] and, even more specifically, in literary studies [15, 16, 17, 18, 19], each of those has only used monographs and journals as their sources, but not dissertations. Only one study from the 1990s has tapped into the richness of citations in literary studies dissertations, manually analyzing 26 theses defended at one university in India [20].

Understanding LIterature references in academic full TExt at JCDL 2022, June 24, 2022, Köln, Germany
∗ Corresponding author.
silviaegt@uni-leipzig.de (S. E. G. D. l. Torre); julian.equihua@ufz.de (J. Equihua); aniekler@informatik.uni-leipzig.de (A. Niekler); burghardt@informatik.uni-leipzig.de (M. Burghardt)
ORCID: 0000-0002-0877-7063 (S. E. G. D. l. Torre); 0000-0002-3036-3318 (A. Niekler); 0000-0003-1354-9089 (M. Burghardt)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
The reasons for the gap in large-scale dissertation CA may be threefold: First, the length of the reference section of dissertations (which in our corpus averages 25 pages) poses a great challenge for in-depth tagging, which is a necessary step for machine learning approaches. Second, the assumed inconsistency of references in the humanities [21] is particularly true for dissertations, which tend to lack conventionalized editorial guidelines. Third, and related to the first two: while reference mining tools for the automatic extraction of citations have become more sophisticated in recent years, they are all trained on gold standards of English papers from the natural and applied sciences [22, 23, 24, 25]. The consequence is that they are hardly applicable to other domains, languages, and document types. To close this gap and address these challenges, we present ongoing work on a page-based random forest prediction approach, which allows us to extract bibliography sections (in various styles and formats) from a corpus of 1,330 literary studies dissertations. This task is an important prerequisite for the later exploration of specific bibliographic information, which will be used to reveal citation strategies and trends in this field.

2. Related work
A study closely related to our approach is the EXCITE project, which aims at extracting citations outside the Anglosphere of the natural and applied sciences. Its purpose is to develop “a set of algorithms for the extraction of citation and reference information” for the social sciences [26]. Yet, the gold standard that the project transparently shares shows a heavy inclination towards journal articles and includes fewer than five dissertations. The only large-scale precedent on the dissertation-parsing front is the “Opening Books and the National Corpus of Graduate Research” project [27]. Its wide scope of tasks includes reference mining.
However, so far, the team has only released its code for DOI metadata retrieval and has expressed its intention to identify “particular pieces of information within the citations such as the author names” [28]. There are no available pipelines yet for the first task in reference mining: detecting the reference section [26]. Another interesting approach is a line-based conditional random field, which was shown to reduce the model complexity of reference string extraction by detecting relevant strings without first identifying the reference section [26]. Yet, this supervised learning algorithm requires line-wise annotations that are hardly scalable to dissertations of roughly 12,000 lines each, especially considering that, unlike certain journals, all dissertations have a bibliography section.

3. Methodology
Corpus: To test our approach, we gathered 1,330 electronic theses with their corresponding metadata from the German National Library. The selection criteria were subject and temporality: literary studies dissertations defended at German universities from reunification (1990) until 2020. After cleaning duplicates and misclassified documents, we were left with 1,116 electronic dissertations and their corresponding PDF files.

Table 1
Example of the tagged dataset used to train the RF model; each line is one page of the same PDF file

dates_sw   pubplaces   page_n   bibtest
0.06211    0.01527     152      0
0.06338    0.00352     153      0
0.06609    0.00575     154      1
0.08397    0           155      1

Table 2
Top 5 features by mean decrease accuracy (MDA)

feature        MDA
pubplaces_sw   32.3184181
dates_sw       24.87243224
dates          22.39062468
position       22.08909718
pubplaces      20.99889236

Model, features, and sample selection: The machine learning method we chose for its interpretable model is random forests (RF). Breiman (2001) developed RF as a machine learning technique for classification and regression [29]. An RF is a collection of “decision trees”.
Each node in a tree (i.e., each decision) is based on a random subset of the available features, and the forest predicts the likelihood that an item belongs to a specific class (in our case: whether a page is part of the bibliography section or not). We used the randomForest package implementation in R [30]. The selected features were the relative frequencies of bibliographic elements that appear in other reference mining projects: 1) dates, both as four-digit numbers and in a special but common bibliographic format, namely between parentheses; 2) publication places occurring with a consistent frequency (we selected fifteen); 3) different bibliography section headers in English, Spanish, French, and German. Common abbreviations for references in these languages, such as “Ed(s)”, “Hrsg.”, “Vol”, etc., were added as suggested by previous work on mining references in German academic texts [26]. We also added typical footnote abbreviations (such as “vgl.”, “ebd.”, “ders.”); this last feature was included to push the label towards the negative class (i.e., “not bibliography section”). Finally, we considered the page number (“n_pag”), the total number of pages (“n_pags”), and the position (i.e., n_pag/n_pags). Furthermore, we applied a sliding-windows approach to these features. Since bibliographies form a section, we wanted a feature measure that overcomes single pages as barriers. Sliding windows compute a running average over adjacent pages; this allows us to capture the presence of features across a range of pages rather than on singleton pages, which may also contain some of the selected features without being part of the section. For instance, the sliding-window mean for dates in the first three pages is 0.042 (that is, the average of 1/37, 1/32, and 2/29), which is compared against the means of the second window (0.054) and the third one (0.044) (see Figure 1). We used this sliding-windows approach for dates, publication places, abbreviations, page numbers, positions, and section headings.
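The sliding-window computation described above can be sketched as follows. This is an illustrative Python version (the authors worked in R); the function name is ours, and the window size of three pages is an assumption inferred from the three-page example:

```python
def sliding_window_means(freqs, window=3):
    """Running average of a per-page feature over adjacent pages.

    freqs: relative frequencies of one feature (e.g. dates) for each page.
    Returns one mean per window of `window` consecutive pages.
    """
    return [
        sum(freqs[i:i + window]) / window
        for i in range(len(freqs) - window + 1)
    ]

# The paper's example: date frequencies of 1/37, 1/32 and 2/29 on the
# first three pages give a first-window mean of roughly 0.042.
first_pages = [1 / 37, 1 / 32, 2 / 29]
print(round(sliding_window_means(first_pages)[0], 3))  # 0.042
```

Applied to a whole document, this turns every per-page feature column into a smoothed `_sw` counterpart, so that a bibliography page surrounded by other bibliography pages scores higher than an isolated, bibliography-like page.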
In order to create a representative sample of the different types of dissertation bibliographies, we selected the documents with the most varied distributions of these features: we computed 11 k-means clusters of differentiated feature distributions and picked 5 documents from each of these clusters. We then proceeded to create a training data frame, in which each row represents one page, while the columns contain the relative frequency of each feature, except for the last column, which holds the class to be predicted (“bibtest”). This last column contains 0s for non-bibliography pages and 1s for positive cases (see Table 1).

Figure 1: Example of sliding windows calculation.

4. Results and future work
We trained the RF with 1,000 trees and followed a bootstrap strategy, in which each sample contains roughly two-thirds of the observations [31]. The confusion matrix shows more errors in classifying content text as part of the reference section (304 false positives (FP), class error 0.016), while only 9 out of 1,145 reference section pages received erroneous predictions (false negatives, FN). In other terms, we obtained a precision of 0.79, a recall of 0.99, and an overall F1 score of 0.88. Moreover, we were able to identify the most relevant features by their mean decrease accuracy and observe the relevance both of our proposed sliding-windows method (“_sw” features in Tables 1 and 2) and of the selected features. Besides these extremely promising results, the problems of our approach are worth mentioning. As previously stated, we get a relevant amount of FPs. Looking at concrete examples, we can identify misclassified pages that are quite similar to true bibliographies. For instance, we found pages containing a single footnote with references, or pages where the author lists works that are not cited in the scholarly sense. Likewise, lists of figures and of abbreviations are sources of errors, as they superficially appear very similar to bibliographies.
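The reported scores follow directly from the confusion-matrix counts; as a minimal sanity check (an illustrative Python sketch, with the per-class counts taken from the text):

```python
# Reported confusion-matrix counts: 9 of the 1,145 reference-section
# pages were missed (false negatives) and 304 content pages were
# wrongly labelled as reference pages (false positives).
fn = 9
reference_pages = 1145
fp = 304
tp = reference_pages - fn  # 1136 correctly identified reference pages

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.79 0.99 0.88
```

Computed from the unrounded precision and recall, the F1 score rounds to 0.88, the value quoted in the abstract.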
Furthermore, as a side effect of our sliding-windows approach, bibliography-like pages that are near the reference section also receive incorrect predictions. On the FN side, the formatting and layout of the affected bibliographies is often very unusual, or they do not begin on a new page. We can experimentally trace both observations back to the feature level and show that the feature structure differs in these examples. The biggest difference is in position, as FPs occur at lower page numbers than true positives. Also, the distribution of dates and publication places is very different, as FPs contain a lower number of both. We thus need to complement the approach with heuristic plausibility checks, which eliminate possible misclassifications based on additional conditions, e.g., checking the position in the text. In addition, we are considering complementary regular expressions that can automate additional plausibility checking of the result set. However, we realize that these additional corrections would have to be adapted for dissertations from other fields.

Acknowledgments
Authors Silvia Gutiérrez and Julián Equihua completed these experiments while receiving a doctoral research grant from the German Academic Exchange Service (DAAD).

References
[1] P. M. Beile, D. N. Boote, E. K. Killingsworth, A Microscope or a Mirror? A Question of Study Validity Regarding the Use of Dissertation Citation Analysis for Evaluating Research Collections, The Journal of Academic Librarianship 30 (2004) 347–353. URL: http://www.sciencedirect.com/science/article/pii/S0099133304001041. doi:10.1016/j.acalib.2004.06.001.
[2] N. Vallmitjana, L. G. Sabaté, Citation Analysis of Ph.D. Dissertation References as a Tool for Collection Management in an Academic Chemistry Library, College & Research Libraries 69 (2008) 72–82. URL: http://crl.acrl.org/index.php/crl/article/view/15913. doi:10.5860/crl.69.1.72.
[3] T. P. Franks, D. S.
Dotson, Book Publishers Cited in Science Dissertations: Are Commercial Publishers Worth the Hype?, Science & Technology Libraries 36 (2017) 63–76. URL: https://doi.org/10.1080/0194262X.2016.1263172. doi:10.1080/0194262X.2016.1263172.
[4] L. Zhang, A Comparison of the Citation Patterns of Doctoral Students in Chemistry versus Chemical Engineering at Mississippi State University, 2002–2011, Science & Technology Libraries 32 (2013) 299–313. URL: http://www.tandfonline.com/doi/abs/10.1080/0194262X.2013.791169. doi:10.1080/0194262X.2013.791169.
[5] P. C. Johnson, Dissertations and discussions: engineering graduate student research resource use at New Mexico State University, Collection Building 33 (2013) 25–30. URL: http://www.emeraldinsight.com/doi/10.1108/CB-09-2013-0037. doi:10.1108/CB-09-2013-0037.
[6] C. R. Sugimoto, Mentoring, collaboration, and interdisciplinarity: An evaluation of the scholarly development of Information and Library Science doctoral students, Ph.D. Thesis, University of North Carolina at Chapel Hill, 2010.
[7] W. R. Fernandes, B. V. Cendón, C. A. A. Araújo, Ciência da informação e áreas correlatas: um estudo de caso na Universidade Federal de Minas Gerais [Information science and related areas: a case study at the Universidade Federal de Minas Gerais], Brazilian Journal of Information Science 5 (2011) 3–35. Publisher: Universidade Estadual Paulista.
[8] T. LaBorie, M. Halperin, Citation Patterns in Library Science Dissertations, Journal of Education for Librarianship 16 (1976) 271–283. URL: https://www.jstor.org/stable/40322465. doi:10.2307/40322465. Publisher: Association for Library and Information Science Education (ALISE).
[9] S.-J. Gao, W.-Z. Yu, F.-P. Luo, Citation analysis of PhD thesis at Wuhan University, China, Library Collections, Acquisitions, and Technical Services 33 (2009) 8–16. URL: http://linkinghub.elsevier.com/retrieve/pii/S1464905509000281. doi:10.1016/j.lcats.2009.03.001.
[10] Z.
Rosenberg, Citation Analysis of M.A. Theses and Ph.D. Dissertations in Sociology and Anthropology: An Assessment of Library Resource Usage, The Journal of Academic Librarianship 41 (2015) 680–688. URL: http://www.sciencedirect.com/science/article/pii/S0099133315001007. doi:10.1016/j.acalib.2015.05.010.
[11] R. Echezona, V. Okafor, S. C. Ukwoma, Information Sources Used by Postgraduate Students in Library and Information Science: A Citation Analysis of Dissertations (2011).
[12] S. Kaczor, A Citation Analysis of Doctoral Dissertations in Atmospheric Science at the University at Albany, Science & Technology Libraries 33 (2014) 89–98. URL: http://www.tandfonline.com/doi/abs/10.1080/0194262X.2013.866067. doi:10.1080/0194262X.2013.866067.
[13] P. U. Kuruppu, D. C. Moore, Information Use by PhD Students in Agriculture and Biology: A Dissertation Citation Analysis, portal: Libraries and the Academy 8 (2008) 387–405. URL: http://muse.jhu.edu/content/crossref/journals/portal_libraries_and_the_academy/v008/8.4.kuruppu.html. doi:10.1353/pla.0.0024.
[14] A. Fernández Cano, M. Torralbo, L. Rico, P. Gutiérrez, A. Maz, Análisis cientimétrico, conceptual y metodológico de las tesis doctorales españolas en Educación Matemática (1976-1998) [Scientometric, conceptual and methodological analysis of Spanish doctoral theses in mathematics education (1976-1998)], Revista Española de Documentación Científica 26 (2003) 162–176. URL: https://redc.revistas.csic.es/index.php/redc/article/view/135/189. Publisher: Universidad de Granada.
[15] C. O. Frost, The Use of Citations in Literary Research: A Preliminary Classification of Citation Functions, The Library Quarterly 49 (1979) 399–414. URL: https://www.journals.uchicago.edu/doi/abs/10.1086/600930. doi:10.1086/600930.
[16] J. Ardanuy, C. Urbano, L. Quintana, The Evolution of Recent Research on Catalan Literature through the Production of PhD Theses: A Bibliometric and Social Network Analysis, Information Research: An International Electronic Journal 14 (2009).
URL: https://eric.ed.gov/?id=EJ851921.
[17] J. W. Thompson, The death of the scholarly monograph in the humanities? Citation patterns in literary scholarship, Libri 52 (2002) 121–136.
[18] R. Heinzkill, Characteristics of References in Selected Scholarly English Literary Journals, The Library Quarterly: Information, Community, Policy 50 (1980) 352–365. URL: https://www.jstor.org/stable/4307248.
[19] D. S. Nolen, H. A. Richardson, The search for landmark works in English literary studies: a citation analysis, The Journal of Academic Librarianship 42 (2016) 453–458.
[20] V. N. Deo, S. M. Mohal, Bibliometric study of doctoral dissertations on English language and literature, Annals of Library and Information Studies 42 (1995) 81–95.
[21] D. Rodrigues Alves, G. Colavizza, F. Kaplan, Deep Reference Mining From Scholarly Literature in the Arts and Humanities, Frontiers in Research Metrics and Analytics 3 (2018). URL: https://www.frontiersin.org/articles/10.3389/frma.2018.00021/full. doi:10.3389/frma.2018.00021.
[22] P. Lopez, Grobid, 2019. URL: https://github.com/kermitt2/grobid.
[23] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, L. Bolikowski, CERMINE: automatic extraction of structured metadata from scientific literature, International Journal on Document Analysis and Recognition (IJDAR) 18 (2015) 317–335. URL: https://doi.org/10.1007/s10032-015-0249-8. doi:10.1007/s10032-015-0249-8.
[24] I. G. Councill, C. L. Giles, M.-Y. Kan, ParsCit: an Open-source CRF Reference String Parsing Package, in: LREC, volume 8, 2008, pp. 661–667.
[25] A. Prasad, M. Kaur, M.-Y. Kan, Neural ParsCit: a deep learning-based reference string parser, International Journal on Digital Libraries 19 (2018) 323–337. URL: https://doi.org/10.1007/s00799-018-0242-1. doi:10.1007/s00799-018-0242-1.
[26] M. Körner, B. Ghavimi, P. Mayr, H. Hartmann, S.
Staab, Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications, in: M. Kirikova, K. Nørvåg, G. A. Papadopoulos, J. Gamper, R. Wrembel, J. Darmont, S. Rizzi (Eds.), New Trends in Databases and Information Systems, Communications in Computer and Information Science, Springer International Publishing, Cham, 2017, pp. 137–145. doi:10.1007/978-3-319-67162-8_15.
[27] W. A. Ingram, E. A. Fox, J. Wu, Opening Books and the National Corpus of Graduate Research, LG-37-19-0078-19, 2019. URL: https://www.imls.gov/grants/awarded/lg-37-19-0078-19.
[28] B. Ingram, B. Banerjee, S. Kahu, Classification and extraction of information from ETD documents, 2019. URL: https://github.com/Opening-ETDs/CS6604-ETD.
[29] L. Breiman, Random Forests, Machine Learning 45 (2001) 5–32. URL: https://doi.org/10.1023/A:1010933404324. doi:10.1023/A:1010933404324.
[30] A. Liaw, M. Wiener, Classification and Regression by randomForest, R News 2 (2002) 18–22. URL: https://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf.
[31] B. Efron, R. Tibshirani, Improvements on Cross-Validation: The .632+ Bootstrap Method, Journal of the American Statistical Association 92 (1997) 548. URL: https://www.jstor.org/stable/2965703?origin=crossref. doi:10.2307/2965703.