Sequence Labeling for Citation Field Extraction
from Cyrillic Script References
Igor Shapiro, Tarek Saier and Michael Färber
Institute AIFB, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Abstract
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While
approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than
English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded
consists of publications written in Cyrillic script languages. To address this problem, we introduce a new multilingual and
multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and
contains over 700 manually labeled references, while the remainder is generated synthetically. With random samples of varying size drawn from this data, we train multiple well-performing sequence labeling BERT models and thus demonstrate the usability of our proposed data set. In particular, we showcase an implementation of a multilingual BERT model trained on the synthetic
data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly
outperforms a state-of-the-art model we retrain and evaluate on our data.
Keywords
reference extraction, reference parsing, sequence labeling, Cyrillic script
1. Introduction

Citations are a crucial part of the scientific discourse and represent a measure of the extent to which authors indirectly communicate with other researchers through publications [1]. Therefore, accurate citation data is important for applications such as academic search engines [2] and academic recommender systems (e.g., for recommending papers [3] or citations [4]). Since the number of scientific publications that is available on the web is growing exponentially [5], it is crucial to automatically extract citation data from them. Many tools and models have been developed for this purpose, such as GROBID [6], Cermine [7], and Neural ParsCit [8]. These tools mostly use supervised deep neural models. Accordingly, a large amount of labeled data is needed for training. However, most reference data sets are restricted in terms of discipline coverage and size, containing only several thousand instances (see Table 1). Furthermore, most models and tools are only trained on English data [9, 8]. Therefore, existing models perform insufficiently on data in languages other than English, especially in languages written in scripts other than the Latin alphabet.

While English is the language with the largest share of scholarly literature, with estimates of over one hundred million documents [5], other languages still make up a significant portion. For Russian alone, for example, there exist over 25 million scholarly publications [10]. Publications written in Cyrillic script languages, accordingly, make up an even larger portion, as they include further languages such as Ukrainian and Belarusian. A lack of methods and tools able to automatically extract information from these Cyrillic script documents naturally results in an underrepresentation of such information in scholarly data.

To pave the way for reducing this imbalance, we focus on the task of extracting structured information from bibliographic references found at the end of scholarly publications, commonly referred to as citation field extraction (CFE), in Cyrillic script languages (see Figure 1). For this task, we introduce a data set of Cyrillic script references for training and evaluating CFE models. As Cyrillic publications usually contain both Cyrillic and English references, the data set contains a small portion (7%) of English references as well. The data set can be used in various scenarios, such as cross-lingual citation recommendation [11] and analyzing the scientific landscape and scientific discourse independent of the used languages [12]. To showcase the utility of our data set, we train several sequence labeling models on our data and evaluate them against a GROBID model retrained on the same data. Throughout the paper we refer to the reference string parsing module of GROBID as just "GROBID". To the best of our knowledge, we are the first to train a CFE model, more specifically BERT, specialized in Cyrillic script references.
Figure 1: A real-world example of a Cyrillic script reference with marked bibliographic labels (top) and the corresponding
labeled reference string (bottom).
Our contributions can be summarized as follows.

1. We introduce a large data set of labeled Cyrillic reference strings, consisting of over 100,000 synthetically generated references and over 700 references that were manually labeled and gathered from multidisciplinary Cyrillic script publications. (In the course of this work, we use the terms "reference string" and "citation string" interchangeably.)

2. We train the very first BERT-based citation field extraction (CFE) model specialized in Cyrillic script references and show the importance of retraining GROBID for Cyrillic script language data. We achieve an acceptably high F1 score of 0.933 with our best BERT model.

The data is available at https://doi.org/10.5281/zenodo.5801914, the code at https://github.com/igor261/Sequence-Labeling-for-Citation-Field-Extraction-from-Cyrillic-Script-References.

Table 1
A selection of existing citation data sets.

Data set     # Instances    Discipline
GROBID       6,835          Multi-discipline (GROBID's data set is a collection of various citation data sets)
CORA         1,877          Computer Science
UMass CFE    1,829          Science, technology, engineering, and mathematics
GIANT        911 million    Multi-discipline

2. Related Work

CFE approaches that currently achieve the best performance are supervised machine learning approaches. Among them, the reference-parsing model of GROBID is typically reported to perform the best. We therefore use GROBID as the baseline in our evaluation.

In recent years, transformer-based models [13] such as BERT [14] have achieved state-of-the-art evaluation results on a wide range of NLP tasks. To the best of our knowledge, there is so far only one paper presenting a BERT-based approach to CFE [15]. The authors achieve state-of-the-art results on the UMass CFE data set [16] by using RoBERTa, a BERT model with a modified training procedure and hyperparameters.

The original BERT model comes in three varieties: one trained on English text only, one on Chinese, and a multilingual model. Furthermore, many offshoots of BERT for different languages can be found in the literature. For Cyrillic languages, for example, RuBERT is a BERT variant trained on Russian text [17], and Slavic BERT is a named entity recognition model that was trained on four Slavic languages (Russian, Bulgarian, Czech, and Polish) [18]. Both of the aforementioned publications report a performance gain over the pretrained multilingual BERT by retraining on task-relevant languages. Because Cyrillic publications typically contain a mix of Cyrillic and English references, we use multilingual BERT in our evaluation.

3. Data Set

3.1. Existing Data Sets

Several publicly available data sets for training and evaluating CFE models exist. In Table 1, we show an overview of these citation data sets, including the number of reference strings contained and the disciplines covered. In the following, we describe each of the data sets in more detail.

The authors of GROBID [6] provide the 6,835 samples their tool's reference parser is trained on. These are gathered from various sources (e.g., CORA, the HAL archive, and arXiv). New data is continuously added to the GROBID data set (see https://github.com/kermitt2/grobid/issues/535).
One of the most widely used data sets for the CFE task is CORA (see https://people.cs.umass.edu/~mccallum/data.html), which comprises 1,877 "coarse-grained" labeled instances from the computer science domain. As pointed out by Prasad et al. [8], a shortcoming of the CFE research field is that models are evaluated mainly on the CORA data set, which lacks diversity in terms of multidisciplinarity and multilinguality.

The UMass CFE data set by Anzaroot and McCallum [16] provides both fine- and coarse-grained labels from across the STEM fields. Fine- and coarse-grained means, for example, that labels are given for a person's full name (coarse-grained), but also for their given and family name separately (fine-grained).

All of the above manually annotated data sets are rather small, and some of them are limited in terms of the scientific disciplines covered. These issues are addressed by Grennan et al. [9] with the data set GIANT, created by synthetically generating reference strings. The data set consists of roughly 1 billion references from multiple disciplines, which were created using 677,000 bibliographic entries from Crossref (https://www.crossref.org) rendered in over 1,500 citation styles.

We see none of the data sets described above as suitable for training a model for extracting citation data from Cyrillic publications' references, because they are based on English language citation strings only, except for GIANT. However, GIANT does not provide consistent language labels, making accurate filtering for Cyrillic script citation strings non-trivial.

To the best of our knowledge, no data set of citation strings in Cyrillic script currently exists. It is therefore necessary to create a data set of labeled citation strings to be able to train models capable of reliably extracting information from Cyrillic script reference strings.

3.2. Data Set Creation

In the following subsections, we describe two approaches for creating an appropriate data set to train and test deep neural networks that extract citation fields, such as author information and paper titles. Grennan et al. [9], Grennan and Beel [19], and Thai et al. [15] found that synthetically generated citation strings are suitable to train machine learning algorithms for CFE, resulting in high-performance models. We use a similar approach to create a synthetic data set of citation strings for model training, described in the next subsection. To evaluate the resulting models on citation strings from real documents, we manually annotate citation strings from several Cyrillic script scientific papers. This is described in the subsection "Manually Annotated References."

Figure 2: Schematic overview of the synthetic data set creation (Web of Science metadata is (1) transformed into BibTeX entries, (2) rendered to PDF with pdflatex and bibliography style files, and (3) converted to text and matched against the metadata to obtain labeled training data for the CFE task).

3.2.1. Synthetic References

Figure 2 shows a schematic overview of our data set creation, which is described in the following.

To create a data set of synthetic citation strings, a suitable source of metadata of Cyrillic script documents is necessary. Crossref, which is used by GIANT, provides metadata for over 120 million records of various content types (e.g., journal-article, book, and chapter) via their REST API (see https://api.crossref.org/works). Unfortunately, most of the data either does not provide a language field or the language tag is English (an illustration of this is sketched below). We also considered CORE [20] as a source of metadata. Although CORE provides at least 23,000 papers with Cyrillic script language labels and corresponding PDF files [21], it comes with insufficient metadata: for the relevant BibTeX fields, CORE only provides title, authors, year, and some publisher entries.
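To illustrate the point about missing language metadata, the following is a minimal sketch (not part of our pipeline) that samples records from the public Crossref works endpoint and counts how often a language tag is present. The endpoint and the rows parameter belong to the public Crossref REST API; the sample size and the treatment of missing fields are illustrative assumptions.

import requests
from collections import Counter

# Sample a page of records from the public Crossref works endpoint
# and count how often a language tag is present (illustrative only).
API_URL = "https://api.crossref.org/works"
resp = requests.get(API_URL, params={"rows": 200}, timeout=30)
resp.raise_for_status()
items = resp.json()["message"]["items"]

languages = Counter(item.get("language", "<missing>") for item in items)
total = sum(languages.values())

print(f"Sampled {total} records")
for lang, count in languages.most_common():
    print(f"  {lang:10s} {count:4d} ({count / total:.1%})")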
We identified Web of Science (WoS, see https://www.webofknowledge.com/) as the most appropriate source of metadata for creating synthetic references, based on the option to gather language-specific metadata. Additionally, WoS provides a filter for the document type, even though it lacks, for example, book types. The final data set should contain multiple document types to cover various citation fields.

Web of Science provides access to the Russian Science Citation Index (RSCI), a bibliographic database of scientific publications in Russian with roughly 750,000 instances. We chose to gather around 27,000 most recent (i.e., from 2020) article type and around 7,000 most recent (i.e., from 2010-2020) conference proceeding type metadata records from the RSCI (the conference proceeding type corresponds to the meeting type in WoS). The selection is motivated by the finding of Grennan and Beel [19] that a model trained with more than 10,000 citations would decrease in performance compared with a smaller training data set. To verify the latter statement in our evaluation, we decide to create a data set consisting of 100,000 citation strings in total. Last but not least, following the GIANT data set, we wanted our data set to consist of around 80% articles and 20% conference proceedings.

Based on the language tags in the metadata provided by WoS, a breakdown of the languages of the data we collected is shown in Table 2. Unfortunately, the RSCI database by WoS does not provide Ukrainian language metadata, but since Russian and Ukrainian are very similar, we expect the model to process Ukrainian language references comparably reliably to Russian language references. In our evaluation, we show that our model achieves similar F1 scores for Russian and Ukrainian language references.

Table 2
Distribution of the reference languages from WoS.

Language    Number of items
Russian     31,977
English     2,241
other       9

After converting the WoS data to the BibTeX format and filtering out corrupted entries, we enrich the data with additional features, such as "pagetotal" (a field specific to the GOST citation styles discussed below) and "address" (publisher city), to get extensive BibTeX entries that are comparable to real references. This process results in a total of 34,228 metadata records in the BibTeX format. To generate bibliographic references, we additionally need to identify a set of suitable citation styles.

Based on a CORE subset of Cyrillic script scientific papers (see the next subsection for details), we identify the GOST and APA citation styles as best suited for generating realistic reference strings. The GOST standards (see https://dic.academic.ru/dic.nsf/ruwiki/79269) were developed by the government of the Soviet Union and are comparable to standards by the American ANSI or the German DIN. They are still widely used in Russia and in many former Soviet republics. To introduce a certain level of variety, we use the GOST2003, GOST2006, and GOST2008 styles for all references. (Because we were not able to find a copy of the GOST2006 BST file, we replicated it ourselves based on the GOST2003 BST file and the description at https://science.kname.edu.ua/images/dok/journal/texnika/2021/2021.pdf.) Since the APA style cannot handle Cyrillic characters, it is used for non-Cyrillic references only.

For each reference, we create a separate PDF rendition. Using various bibliography styles for the same reference can result in reference strings that are completely different in look and structure. For instance, author names can be abbreviated or duplicated at different positions. (An example of a duplicated author name is the following GOST2006 style reference: "Alefirov, A.N. Antitumoral effects of Aconitum soongaricum tincture on Ehrlich carcinoma in mice [Text] / Alefirov, A.N. and Bespalov, V.G. // Obzory po klinicheskoi farmakologii i lekarstvennoi terapii.–St. Petersburg : Limited Liability Company Eco-Vector.–2012.") Metadata labels and their counterparts in the PDF references are then matched by an exact string match or, alternatively, by the Levenshtein distance (a sketch of this matching step is given below). Exact string matches are not always possible because some characters are manipulated by TeX while generating a PDF file, or field values themselves change during the generation process in various ways, such as abbreviations or misinterpreted characters. To store the reference text and reference token labels in one file per reference, we create labeled reference strings as shown in Figure 1.

In rare cases, during the parsing of the PDFs to text strings using PDFMiner, tokens were garbled and files could not be read. Consequently, the corresponding items are removed from the data set, resulting in slightly varying numbers of references for different citation styles. In the end, our approach yields about 100,000 synthetically generated labeled reference strings. A detailed breakdown of the quantity of data for each citation style is shown in Table 3.

Table 3
Number of synthetic labeled reference strings per citation style & reference type.

Citation Style    # Articles    # Conf. Proc.    Total
APA               1,293         833              2,126
GOST2003          26,289        7,061            33,350
GOST2006          26,328        7,078            33,406
GOST2008          26,467        7,113            33,580
Total             80,377        22,085           102,462
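The following is a minimal sketch of the label-matching step described above. It assumes that each rendered reference is available as a list of tokens and each BibTeX field value as a string; the windowed search and the normalized-distance threshold are illustrative choices, not the exact implementation used to build the data set.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ca != cb)))      # substitution
        previous = current
    return previous[-1]


def match_field(tokens: list[str], field_value: str, max_rel_dist: float = 0.25):
    """Find the token span of a rendered reference that best matches a BibTeX field value.

    Tries an exact substring match first; otherwise slides a window of roughly
    the field's token length over the reference and returns the span with the
    smallest normalized Levenshtein distance (or None if even the best span is
    too dissimilar).
    """
    reference = " ".join(tokens)
    if field_value in reference:                      # exact match is the easy case
        start = reference[:reference.index(field_value)].count(" ")
        return start, start + len(field_value.split())

    n = max(len(field_value.split()), 1)
    best = None
    for width in (n - 1, n, n + 1):
        if width < 1:
            continue
        for start in range(0, len(tokens) - width + 1):
            candidate = " ".join(tokens[start:start + width])
            dist = levenshtein(candidate, field_value) / max(len(field_value), 1)
            if best is None or dist < best[0]:
                best = (dist, start, start + width)
    if best and best[0] <= max_rel_dist:
        return best[1], best[2]
    return None


# Illustrative usage with a hypothetical GOST-like rendering:
tokens = "Иванов, И.И. Заголовок / И. Иванов // Журнал. – 2017.".split()
print(match_field(tokens, "Заголовок"))   # -> (2, 3), i.e., the third token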
In Table 4, we additionally show the breakdown of labels covered by our synthetic references.

Table 4
Number of synthetic labeled reference strings having the respective labels, per reference type.

Label        # Articles    # Conf. Proc.    Total
title        80,376        22,085           102,461
author       80,375        22,079           102,454
year         80,305        21,870           102,175
pages        80,419        17,944           97,113
journal      80,376        –                80,376
number       80,214        –                80,214
volume       46,494        11,423           57,917
booktitle    –             22,085           22,085
publisher    –             22,083           22,083
address      –             20,034           20,034
pagetotal    1,208         4,141            5,349

3.2.2. Manually Annotated References

Despite the fact that many large scholarly data sets are publicly available, most lack broad language coverage or do not contain full text documents. Investigating several data sources, we find that, for example, the PubMed Central Open Access Subset (see https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) provides mostly English language publications (see https://www.ncbi.nlm.nih.gov/pmc/about/faq/#q16), just like S2ORC [22]. Further, the Microsoft Academic Graph [23, 24] covers millions of publications, but does not contain full texts and therefore also no reference strings.

We use the data set introduced by Krause et al. [21] as a source of Cyrillic script papers. After a filtering step to remove papers with lacking or unstructured citations, we randomly chose 100 papers to manually annotate.

Analyzing the origin of the selected papers, we note that 80 originate from the "A.N. Beketov KNUME Digital Repository" (https://eprints.kname.edu.ua/) and five from the "Zhytomyr State University Library" (http://eprints.zu.edu.ua/). Origins could not be determined for 15 papers. Figure 3 shows the distribution of papers by publication year. A breakdown of the disciplines covered by the data set revealed that the most strongly represented disciplines are "engineering" with 36 papers and "economics" with 16 papers. The remaining 48 papers are spread across various fields, such as education, zoology, and urban planning/infrastructure.

Figure 3: Distribution of publication years of the selected 100 papers.

Using fastText [25, 26] language detection, we find that our sample consists of 65 Ukrainian language and 35 Russian language papers (a sketch of such a language check is given at the end of this subsection).

Using the annotation tool INCEpTION [27], we label the references in our 100 PDFs. Regarding manual annotation, we note that the real references did not always fit our set of metadata labels. For example, references to patents, legal texts, or web resources might not contain certain elements typical for references to scientific papers. Furthermore, references containing fields outside the scope of our labels, like editor or institution, exist. In the case of booktitle fields of conference proceedings, we used the journal label. Lastly, due to the difference in the use of "№" across citation styles (indicating either an issue or a volume number), in ambiguous cases the number after "№" is labeled volume, following the GOST2006 citation style.

Table 5 shows the summary statistics of the resulting data set. In Table 6, we show the labels used and their number of occurrences counted in segments (a segment is the full text range for a label).

Table 5
Summary of the manually annotated data set.

Parameter                               Counts
Number of annotated papers              100
Number of reference strings             771
Average reference length (in tokens)    28.00
Number of reference related labels      11
Number of labeled reference segments    5,080

Table 6
Segment counts for the labels assigned.

Label        # segments
author       1,560
title        773
year         680
pages        612
address      410
publisher    364
journal      328
volume       256
number       91

Although 65% of the 100 documents are Ukrainian language papers, the references are written in various languages. Nearly 99% are written in Russian, Ukrainian, and English (see Table 7). Other languages contained are Polish, German, Serbian, and French.

Table 7
Distribution of the reference languages in the manually annotated data set.

Language     Number of references
Russian      390
Ukrainian    288
English      82

While the number of manually annotated references is not large enough for training purposes, we argue that the size and language distribution enable us to perform a realistic evaluation of our models.
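As an illustration of the language check mentioned above, the following is a minimal sketch using fastText's off-the-shelf language identification model. The model file lid.176.bin and the idea of classifying a document by a chunk of its text are standard fastText usage; the exact procedure applied to our sample is not reproduced here.

import fasttext

# Off-the-shelf fastText language identification model; download from
# https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> tuple[str, float]:
    """Return the most likely language code (e.g., 'uk', 'ru') and its confidence."""
    # fastText expects single-line input, so newlines are stripped beforehand.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("Проблеми розвитку міського господарства"))     # expected: ('uk', ...)
print(detect_language("Определение параметров городской застройки"))  # expected: ('ru', ...)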
4. Approach

There are various approaches to the CFE task. Most of them use regular expressions, template matching, knowledge bases, or supervised machine learning, whereby machine learning-based approaches achieve the best results [28]. Furthermore, tools differ in terms of the extracted reference fields and their granularity.

GROBID is commonly considered the most effective tool [28] and was created by Lopez. Tkaczyk et al. reported an F1 score of 0.92 for the retrained GROBID CRF model on their data set. Beyond parsing reference strings, GROBID is also able to extract metadata and logical structure from scientific documents in PDF format. Following the existing literature, we decide to use the GROBID CRF model as a baseline. Therefore, we retrain the GROBID CRF model on our synthetic data set following GROBID's documentation (see https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/). The GROBID CRF model is trained from scratch (see https://github.com/kermitt2/grobid/issues/748).

State-of-the-art sequence labeling approaches are often based on BERT. Accordingly, we fine-tune the cased multilingual BERT model, which is pretrained on 104 languages, on our synthetic reference data set (a minimal sketch of this setup is given below). We fine-tune and retrain, respectively, both BERT and GROBID on several subsets of our synthetic data set with differing sizes (between 500 and 100,000 instances) so that we can assess the necessity of a large training set.
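To make the fine-tuning setup concrete, the following is a minimal sketch of token classification with the cased multilingual BERT model using the Hugging Face transformers library. The label set mirrors the fields described in Section 3; the word-to-subtoken label alignment, the toy example, and the hyperparameters are illustrative assumptions, not the exact configuration used in our experiments.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["other", "author", "title", "year", "pages", "journal",
          "number", "volume", "booktitle", "publisher", "address", "pagetotal"]
label2id = {label: i for i, label in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

def encode(words, word_labels):
    """Tokenize one pre-split reference string and align word-level labels to
    subword tokens (special tokens and subtoken continuations get -100)."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    padding="max_length", max_length=128, return_tensors="pt")
    labels, previous = [], None
    for word_id in enc.word_ids(batch_index=0):
        if word_id is None or word_id == previous:
            labels.append(-100)                      # ignored by the loss
        else:
            labels.append(label2id[word_labels[word_id]])
        previous = word_id
    enc["labels"] = torch.tensor([labels])
    return enc

# One toy training step on a single (hypothetical) labeled reference string.
words = ["Иванов", "И.И.", "Заголовок", "статьи", "//", "Журнал.", "2017.", "№", "1."]
tags  = ["author", "author", "title", "title", "other", "journal", "year", "other", "number"]

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
batch = encode(words, tags)
optimizer.zero_grad()
loss = model(**batch).loss
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")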
5. Evaluation

Fine-tuning the BERT model is, compared to pretraining, relatively inexpensive [14]. We observed this as well by comparing the time for fine-tuning with the time needed to train GROBID. For example, fine-tuning BERT with 100,000 training instances takes 125 minutes (on a GeForce RTX 3090 GPU), whereas training the GROBID CRF (on a 16-core Intel Xeon Gold 6226R 2.90GHz CPU) takes 1,233 minutes.

To evaluate our fine-tuned BERT model not only on the manually annotated but also on the synthetic references, we remove a hold-out set of 2,000 synthetic references from the training set, with a fixed distribution of citation styles according to the distribution of the entire data set.

5.1. BERT Evaluation on the Manually Annotated Data Set

We fine-tune the cased multilingual BERT model on 9 training set sizes from our synthetically generated labeled reference data. To ensure robust results, for each of the 9 training set sizes, we sample 5 training sets, train one model per sample, and average the resulting scores (i.e., in total we train 9 × 5 = 45 models).

Averaged scores for recall, precision, and F1 score for all 9 training set sizes are shown in Table 8. We found that models trained on relatively small training data sets (between 1,000 and 10,000 instances) perform best on our manually annotated test set. More precisely, on average, the models trained on 2,000 instances perform best regarding the F1 score. These models achieve an average F1 score of 0.928 (ranging from 0.917 to 0.936). Already with the smallest considered training set of 500 instances, we can fine-tune a powerful BERT model for the Cyrillic CFE task, achieving an F1 score of 0.91 on average.

Table 8
Evaluation on the manually annotated data set for BERT models with differing sizes of training data, averaged over 5 models trained on different random samples.

Train Set Size    Recall    Precision    F1 Score    Standard Deviation
500               0.909     0.916        0.910       0.007
1,000             0.922     0.926        0.923       0.009
2,000             0.928     0.932        0.928       0.007
3,000             0.928     0.931        0.928       0.003
5,000             0.926     0.929        0.927       0.004
10,000            0.920     0.925        0.921       0.005
20,000            0.907     0.913        0.907       0.008
50,000            0.863     0.880        0.864       0.017
100,000           0.847     0.868        0.848       0.012

The highest achieved F1 score of 0.928 (averaged over five models trained on different random samples of 2,000 instances) on our test set is comparable with state-of-the-art models proposed for English CFE [28, 19, 15, 8], especially considering the fact that there are reference types and languages in the test set the model was not trained on. Nevertheless, it is difficult to compare our results with other papers, since we work with Cyrillic script references and evaluate the models on our self-created test set.

We further evaluate a BERT model trained on 2,000 random instances (models trained on 2,000 instances perform best on average), referred to as BERT_Final from here on, regarding individual labels. Since our model is more fine-grained than the test set, i.e., labels in the synthetic data set and the manually annotated data set are not the same, we had to change the pagetotal label to pages and the booktitle label to journal.

Table 9
Detailed evaluation of labels predicted by BERT_Final.

Label               Prec.    Rec.     F1       Supp.
author              0.984    0.994    0.989    7,104
year                0.945    0.962    0.953    680
pages               0.922    0.984    0.952    1,112
address             0.927    0.961    0.944    715
other               0.945    0.926    0.936    10,730
title               0.938    0.931    0.934    7,257
publisher           0.913    0.781    0.842    1,165
journal             0.765    0.861    0.810    1,982
volume              0.836    0.454    0.588    269
number              0.345    0.860    0.492    93
Weighted Average    0.936    0.932    0.933    31,107

As shown in Table 9, our model performs best on identifying author tokens, with an F1 score of 0.989. Overall, we observe an F1 score of more than 0.934 for 6 labels (author, year, pages, address, other, and title).

We see room for improvement in publisher, journal, volume, and number predictions. The poor performance in volume and number predictions can be explained by the ambiguity of "№" in the test set (see the subsection "Manually Annotated References").

We see high recall with low precision in number predictions and low recall with high precision in volume predictions. The same observation can be made for journal and publisher predictions, but to a lesser degree. More than 50% of the actual volume labels are labeled as number, and around 17% of the actual publisher labels are labeled as journal.

Before looking into the evaluation on the synthetic hold-out set, we evaluate the BERT_Final model depending on the languages of the references (see Figure 4).

Figure 4: Evaluation on the manually annotated data set for the BERT_Final model, per label and language.

As mentioned before, our synthetic data set lacks Ukrainian language references. Nevertheless, the F1 score of 0.946 for Russian language references is only 2.5 percentage points higher than the F1 score of 0.921 for Ukrainian language references. This is potentially due to the high similarity between the Russian and Ukrainian languages. Additionally, for English language references, the predictions of volume and number labels are much better than for Cyrillic script references. This is due to the fact that most English language references are formatted in the APA style, where there is no ambiguity in the respective labels. Furthermore, BERT_Final predicts publisher and address labels worse for English language references than for Russian and Ukrainian language references.
Models trained on 2,000 instances perform best on average.
F1 Score Table 10
1.00 Recall
Precision Summary of metrics of the models evaluated on the manually
annotated test set.
0.99
Model Precision Recall F1 Score
0.98 Vanilla GROBID 0.347 0.052 0.090
Scores
GROBID𝐹 𝑖𝑛𝑎𝑙 0.665 0.631 0.647
0.97 BERT𝐹 𝑖𝑛𝑎𝑙 0.936 0.932 0.933
0.96
than 10,000 references. The best performing GROBID
0.95 model was trained with 5,000 instances, achieving a F1
500 1K 2K 3K 5K 10K 20K 50K 100K score of 0.647. We refer to this best performing GROBID
Size of training set
model as GROBID𝐹 𝑖𝑛𝑎𝑙 . Compared to the off-the-shelf
GROBID results, we managed to increase the F1 score by
Figure 5: Evaluation on synthetic hold-out data set for BERT a factor of seven by retraining GROBID.
models with differing size of training data Compared to the off-the-shelf GROBID, we see higher
F1 scores in almost every label, except for year and num-
1.0 ber. The best label performance is measured for paper ti-
F1 Score tle, with an F1 score of 0.817. A comparison of evaluation
Recall
0.9 Precision metrics of GROBID and BERT is shown in Table 10. Our
0.8 BERT𝐹 𝑖𝑛𝑎𝑙 model outperforms the GROBID𝐹 𝑖𝑛𝑎𝑙 model in
every label and, consequently, in overall F1 score as well.
0.7
Scores
0.6 6. Conclusion
0.5
In this paper, we provide a large data set covering over
0.4 100,000 labeled reference strings in various citation styles
0.3 and languages, of which 771 are manually annotated ref-
500 1K 2K 3K 5K 10K 20K 50K 100K erences from 100 Cyrillic script scientific papers. Further-
Size of training set
more, we fine-tune multilingual BERT models on various
training set sizes and achieve the best F1 score of 0.933
Figure 6: Evaluation on real data set for GROBID CRF models with 2,000 training instances. We show the eligibility of
with differing sizes of training data sets. synthetically created data for training CFE models. To
compare our results with existing models, we retrained
a GROBID model serving as a benchmark. Our BERT
F1 score of over 0.1. Most of the non-numeric labels have model significantly outperformed both off-the-shelf and
a F1 score of 0 or close to 0.19 retrained GROBID. In future work, our BERT model could
GROBID was initially trained on English language ref- be compared to other well-performing CFE models, such
erences. Consequently, it is not surprising that it per- as Cermine and Neural ParsCit.
forms poorly regarding Cyrillic reference data. Therefore, Our data sets can be reused by other researchers to
we retrain the GROBID CRF model on our synthetic Cyril- train Cyrillic script CFE models. In particular our man-
lic reference data with differing training data set sizes, ually annotated data set can serve as a benchmark for
as we did for the BERT model. Evaluations of resulting further research in this field, since it provides references
models on our manually annotated test set are shown in from various domains and covers several languages.
Figure 6. Regarding our BERT model, we see two key aspects
We observe poorer performance of the GROBID mod- for future work. First, literature describes benefits of
els compared to our fine-tuned BERT. Similar to evalua- adding a CRF layer at the top of a model’s underlying ar-
tions of the fine-tuned BERT models and Grennan and chitecture [8, 18], which could also be considered for our
Beel [19], we see that the best performing models where approach. Second, our model’s performance could be in-
trained on relatively small data sets consisting of less creased by retraining BERT from scratch on task-specific
languages, e.g. in our case Cyrillic Script languages and
19
Data used for training of the off-the-shelf GROBID has different English, as shown by Kuratov and Arkhipov [17] and
labels than we have in our synthetic data set. Consequently some Arkhipov et al. [18].
labels are condemned to have scores equal zero, e.g. web. Note that
GROBID does not provide evaluation scores for other labels.
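To sketch the first of these future-work ideas, the following shows how a CRF layer could sit on top of a BERT encoder for sequence labeling, using the pytorch-crf package. The module below is an illustrative design that assumes word-level inputs already aligned to labels; it is not an implementation we evaluated.

import torch
from torch import nn
from torchcrf import CRF                      # pip install pytorch-crf
from transformers import AutoModel

class BertCrfTagger(nn.Module):
    """BERT encoder -> linear emission layer -> CRF layer for decoding."""

    def __init__(self, num_labels: int, model_name: str = "bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.emission = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # Negative log-likelihood of the gold label sequences under the CRF.
            return -self.crf(emissions, labels, mask=mask)
        # Viterbi decoding: best label sequence per input.
        return self.crf.decode(emissions, mask=mask)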
References

[1] W. Shaw, Information theory and scientific communication, Scientometrics 3 (1981) 235–249. URL: https://link.springer.com/content/pdf/10.1007/BF02101668.pdf.

[2] J. L. Ortega, Academic search engines: A quantitative outlook, Elsevier, 2014.

[3] J. Beel, B. Gipp, S. Langer, C. Breitinger, Research-paper recommender systems: a literature survey, International Journal on Digital Libraries 17 (2016) 305–338. URL: https://dx.doi.org/10.1007/s00799-015-0156-0. doi:10.1007/s00799-015-0156-0.

[4] M. Färber, A. Jatowt, Citation recommendation: approaches and datasets, Int. J. Digit. Libr. 21 (2020) 375–405. URL: https://doi.org/10.1007/s00799-020-00288-2. doi:10.1007/s00799-020-00288-2.

[5] M. Khabsa, C. L. Giles, The number of scholarly documents on the public web, PLoS ONE 9 (2014) e93949. URL: https://dx.doi.org/10.1371/journal.pone.0093949. doi:10.1371/journal.pone.0093949.

[6] P. Lopez, Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications, in: M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou, G. Tsakonas (Eds.), Research and Advanced Technology for Digital Libraries, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 473–474. URL: https://link.springer.com/chapter/10.1007/978-3-642-04346-8_62.

[7] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, L. Bolikowski, Cermine: automatic extraction of structured metadata from scientific literature, International Journal on Document Analysis and Recognition (IJDAR) 18 (2015) 317–335. URL: https://dx.doi.org/10.1007/s10032-015-0249-8. doi:10.1007/s10032-015-0249-8.

[8] A. Prasad, M. Kaur, M.-Y. Kan, Neural parscit: a deep learning-based reference string parser, International Journal on Digital Libraries 19 (2018) 323–337. URL: https://dx.doi.org/10.1007/s00799-018-0242-1. doi:10.1007/s00799-018-0242-1.

[9] M. Grennan, M. Schibel, A. Collins, J. Beel, Giant: The 1-billion annotated synthetic bibliographic-reference-string dataset for deep citation parsing, in: 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, 2019, pp. 101–112. URL: http://ceur-ws.org/Vol-2563/aics_25.pdf.

[10] O. Moskaleva, V. Pislyakov, I. Sterligov, M. Akoev, S. Shabanova, Russian index of science citation: Overview and review, Scientometrics 116 (2018) 449–462. URL: https://doi.org/10.1007/s11192-018-2758-y. doi:10.1007/s11192-018-2758-y.

[11] Z. Jiang, Y. Yin, L. Gao, Y. Lu, X. Liu, Cross-language Citation Recommendation via Hierarchical Representation Learning on Heterogeneous Graph, in: Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR'18, 2018, pp. 635–644.

[12] A. Martín-Martín, M. Thelwall, E. Orduña-Malea, E. D. López-Cózar, Google scholar, microsoft academic, scopus, dimensions, web of science, and opencitations' COCI: a multidisciplinary comparison of coverage via citations, Scientometrics 126 (2021) 871–906.

[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv preprint arXiv:1706.03762 (2017). URL: https://arxiv.org/abs/1706.03762.

[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2019). URL: https://arxiv.org/abs/1810.04805.

[15] D. Thai, Z. Xu, N. Monath, B. Veytsman, A. McCallum, Using bibtex to automatically generate labeled data for citation field extraction, in: Automated Knowledge Base Construction, 2020. URL: https://openreview.net/forum?id=OnUd3hf3o3.

[16] S. Anzaroot, A. McCallum, A new dataset for fine-grained citation field extraction, ICML Workshop on Peer Reviewed and Publishing Models (2013). URL: https://openreview.net/forum?id=ffO1Piqs1KZo5.

[17] Y. Kuratov, M. Arkhipov, Adaptation of deep bidirectional multilingual transformers for russian language, arXiv preprint arXiv:1905.07213 (2019). URL: https://arxiv.org/abs/1905.07213.

[18] M. Arkhipov, M. Trofimova, Y. Kuratov, A. Sorokin, Tuning multilingual transformers for language-specific named entity recognition, in: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, Association for Computational Linguistics, Florence, Italy, 2019, pp. 89–93. URL: https://www.aclweb.org/anthology/W19-3712. doi:10.18653/v1/W19-3712.

[19] M. Grennan, J. Beel, Synthetic vs. real reference strings for citation parsing, and the importance of re-training and out-of-sample data for meaningful evaluations: Experiments with grobid, giant and cora, ArXiv abs/2004.10410 (2020). URL: https://arxiv.org/abs/2004.10410.

[20] P. Knoth, Z. Zdrahal, CORE: three access levels to underpin open access, D-Lib Magazine 18 (2012). URL: http://oro.open.ac.uk/35755/.

[21] J. Krause, I. Shapiro, T. Saier, M. Färber, Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic, in: Proceedings of the Second Workshop on Scholarly Document Processing, 2021, pp. 66–72. URL: https://aclanthology.org/2021.sdp-1.8.pdf.

[22] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. Weld, S2ORC: The Semantic Scholar Open Research Corpus, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, pp. 4969–4983. doi:10.18653/v1/2020.acl-main.447.

[23] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, K. Wang, An Overview of Microsoft Academic Service (MAS) and Applications, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, ACM, 2015, pp. 243–246. doi:10.1145/2740908.2742839.

[24] K. Wang, Z. Shen, C. Huang, C.-H. Wu, D. Eide, Y. Dong, J. Qian, A. Kanakia, A. Chen, R. Rogahn, A Review of Microsoft Academic Services for Science of Science Studies, Frontiers in Big Data 2 (2019) 45. doi:10.3389/fdata.2019.00045.

[25] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, arXiv preprint arXiv:1607.01759 (2016).

[26] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, Fasttext.zip: Compressing text classification models, arXiv preprint arXiv:1612.03651 (2016).

[27] J.-C. Klie, M. Bugert, B. Boullosa, R. Eckart de Castilho, I. Gurevych, The inception platform: Machine-assisted and knowledge-oriented interactive annotation, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, Association for Computational Linguistics, 2018, pp. 5–9. URL: http://tubiblio.ulb.tu-darmstadt.de/106270/.

[28] D. Tkaczyk, A. Collins, P. Sheridan, J. Beel, Machine learning vs. rules and out-of-the-box vs. retrained: An evaluation of open-source bibliographic reference and citation parsers, arXiv preprint arXiv:1802.01168 (2018). URL: https://arxiv.org/abs/1802.01168.