Sequence Labeling for Citation Field Extraction
from Cyrillic Script References
Igor Shapiro, Tarek Saier and Michael Färber
Institute AIFB, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Abstract
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While
approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than
English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded
consists of publications written in Cyrillic script languages. To address this problem, we introduce a new multilingual and
multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and
contains over 700 manually labeled references, while the remainder is generated synthetically. With random samples of varying size drawn from this data, we train multiple well-performing sequence labeling BERT models and thus demonstrate the usability of our proposed data set. In particular, we showcase an implementation of a multilingual BERT model trained on the synthetic
data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly
outperforms a state-of-the-art model we retrain and evaluate on our data.
Keywords
reference extraction, reference parsing, sequence labeling, Cyrillic script
1. Introduction

Citations are a crucial part of the scientific discourse and represent a measure of the extent to which authors indirectly communicate with other researchers through publications [1]. Therefore, accurate citation data is important for applications such as academic search engines [2] and academic recommender systems (e.g., for recommending papers [3] or citations [4]). Since the number of scientific publications that is available on the web is growing exponentially [5], it is crucial to automatically extract citation data from them. Many tools and models have been developed for this purpose, such as GROBID [6], Cermine [7], and Neural ParsCit [8]. These tools mostly use supervised deep neural models. Accordingly, a large amount of labeled data is needed for training. However, most reference data sets are restricted in terms of discipline coverage and size, containing only several thousand instances (see Table 1). Furthermore, most models and tools are only trained on English data [9, 8]. Therefore, existing models perform insufficiently on data in languages other than English, especially in languages written in scripts other than the Latin alphabet.

While English is the language with the largest share of scholarly literature, with estimates of over one hundred million documents [5], other languages still make up a significant portion. For Russian alone, for example, there exist over 25 million scholarly publications [10]. Publications written in Cyrillic script languages, accordingly, make up an even larger portion, as they include further languages such as Ukrainian and Belarusian. A lack of methods and tools able to automatically extract information from these Cyrillic script documents naturally results in an underrepresentation of such information in scholarly data.

To pave the way for reducing this imbalance, we focus on the task of extracting structured information from bibliographic references found at the end of scholarly publications, commonly referred to as citation field extraction (CFE), in Cyrillic script languages (see Figure 1). For this task, we introduce a data set of Cyrillic script references for training and evaluating CFE models. As Cyrillic publications usually contain both Cyrillic and English references, the data set contains a small portion (7%) of English references as well. The data set can be used in various scenarios, such as cross-lingual citation recommendation [11] and analyzing the scientific landscape and scientific discourse independent of the used languages [12]. To showcase the utility of our data set, we train several sequence labeling models on our data and evaluate them against a GROBID model retrained on the same data. Throughout the paper we refer to the reference string parsing module of GROBID as just "GROBID". To the best of our knowledge, we are the first to train a CFE model, more specifically BERT, specialized in Cyrillic script references.
Figure 1: A real-world example of a Cyrillic script reference with marked bibliographic labels (top) and the corresponding
labeled reference string (bottom).
Our contributions can be summarized as follows.

1. We introduce a large data set of labeled Cyrillic reference strings, consisting of over 100,000 synthetically generated references and over 700 references that were manually labeled and gathered from multidisciplinary Cyrillic script publications. (In the course of this work, we use the terms "reference string" and "citation string" interchangeably.)

2. We train the very first BERT-based citation field extraction (CFE) model specialized in Cyrillic script references and show the importance of retraining GROBID for Cyrillic script language data. We achieve an acceptably high F1 score of 0.933 with our best BERT model.

The data is available at https://doi.org/10.5281/zenodo.5801914, the code at https://github.com/igor261/Sequence-Labeling-for-Citation-Field-Extraction-from-Cyrillic-Script-References.

Table 1
A selection of existing citation data sets.

Data set     # Instances    Discipline
GROBID       6,835          Multi-discipline (GROBID's data set is a collection of various citation data sets)
CORA         1,877          Computer Science
UMass CFE    1,829          Science, technology, engineering, and mathematics
GIANT        911 million    Multi-discipline

2. Related Work

CFE approaches that currently achieve the best performance are supervised machine learning approaches. Among them, the reference-parsing model of GROBID is typically reported to perform the best. We therefore use GROBID as the baseline in our evaluation.

In recent years, transformer-based models [13] such as BERT [14] have achieved state-of-the-art evaluation results on a wide range of NLP tasks. To the best of our knowledge, there is so far only one paper presenting a BERT-based approach to CFE [15]. The authors achieve state-of-the-art results on the UMass CFE data set [16] by using RoBERTa, a BERT model with a modified training procedure and hyperparameters.

The original BERT model comes in three varieties: one trained on English text only, one on Chinese, and a multilingual model. Furthermore, many offshoots of BERT for different languages can be found in the literature. For Cyrillic languages, for example, RuBERT is a BERT variant trained on Russian text [17], and Slavic BERT is a named entity recognition model that was trained on four Slavic languages (Russian, Bulgarian, Czech, and Polish) [18]. Both of the aforementioned publications report a performance gain over the pretrained multilingual BERT by retraining on task-relevant languages. Because Cyrillic publications typically contain a mix of Cyrillic and English references, we use multilingual BERT in our evaluation.

3. Data Set

3.1. Existing Data Sets

Several publicly available data sets for training and evaluating CFE models exist. In Table 1, we show an overview of these citation data sets, including the number of reference strings contained and the disciplines covered. In the following, we describe each of the data sets in more detail.

The authors of GROBID [6] provide the 6,835 samples their tool's reference parser is trained on. These are gathered from various sources (e.g., CORA, the HAL archive, and arXiv). New data is continuously added to the GROBID data set (see https://github.com/kermitt2/grobid/issues/535).
One of the most widely used data sets for the CFE task is CORA (see https://people.cs.umass.edu/~mccallum/data.html), which comprises 1,877 "coarse-grained" labeled instances from the computer science domain. As pointed out by Prasad et al. [8], a shortcoming of the CFE research field is that models are evaluated mainly on the CORA data set, which lacks diversity in terms of multidisciplinarity and multilinguality.

The UMass CFE data set by Anzaroot and McCallum [16] provides both fine- and coarse-grained labels from across the STEM fields. Fine- and coarse-grained means, for example, that labels are given for a person's full name (coarse-grained), but also for their given and family name separately (fine-grained).

All of the above manually annotated data sets are rather small, and some of them are limited in terms of the scientific disciplines covered. These issues are addressed by Grennan et al. [9] with the data set GIANT, created by synthetically generating reference strings. The data set consists of roughly 1 billion references from multiple disciplines, which were created using 677,000 bibliographic entries from Crossref (https://www.crossref.org) rendered in over 1,500 citation styles.

We see none of the data sets described above as suitable for training a model for extracting citation data from Cyrillic publications' references, because they are based on English language citation strings only, except for GIANT. However, GIANT does not provide consistent language labels, making accurate filtering for Cyrillic script citation strings non-trivial.

To the best of our knowledge, no data set of citation strings in Cyrillic script currently exists. It is therefore necessary to create a data set of labeled citation strings to be able to train models capable of reliably extracting information from Cyrillic script reference strings.

3.2. Data Set Creation

In the following subsections, we describe two approaches for creating an appropriate data set to train and test deep neural networks that extract citation fields, such as author information and paper titles. Grennan et al. [9], Grennan and Beel [19], and Thai et al. [15] found that synthetically generated citation strings are suitable to train machine learning algorithms for CFE, resulting in high-performance models. We use a similar approach to create a synthetic data set of citation strings for model training, described in the next subsection. To evaluate the resulting models on citation strings from real documents, we manually annotate citation strings from several Cyrillic script scientific papers. This is described in the subsection "Manually Annotated References."

Figure 2: Schematic overview of the synthetic data set creation (Web of Science metadata is (1) transformed into BibTeX entries, (2) rendered to PDF with pdflatex and bibliography style files, and (3) converted to text and matched against the metadata to obtain labeled training data for the CFE task).

3.2.1. Synthetic References

Figure 2 shows a schematic overview of our data set creation, which is described in the following.

To create a data set of synthetic citation strings, a suitable source of metadata of Cyrillic script documents is necessary. Crossref, which is used by GIANT, provides metadata for over 120 million records of various content types (e.g., journal-article, book, and chapter) via their REST API (see https://api.crossref.org/works). Unfortunately, most of the data either does not provide a language field or the language tag is English (an illustration of this is sketched below). We also considered CORE [20] as a source of metadata. Although CORE provides at least 23,000 papers with Cyrillic script language labels and corresponding PDF files [21], it comes with insufficient metadata: for the relevant BibTeX fields, CORE only provides title, authors, year, and some publisher entries.
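To illustrate the point about missing language metadata, the following is a minimal sketch (not part of our pipeline) that samples records from the public Crossref works endpoint and counts how often a language tag is present. The endpoint and the rows parameter belong to the public Crossref REST API; the sample size and the treatment of missing fields are illustrative assumptions.

import requests
from collections import Counter

# Sample a page of records from the public Crossref works endpoint
# and count how often a language tag is present (illustrative only).
API_URL = "https://api.crossref.org/works"
resp = requests.get(API_URL, params={"rows": 200}, timeout=30)
resp.raise_for_status()
items = resp.json()["message"]["items"]

languages = Counter(item.get("language", "<missing>") for item in items)
total = sum(languages.values())

print(f"Sampled {total} records")
for lang, count in languages.most_common():
    print(f"  {lang:10s} {count:4d} ({count / total:.1%})")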
We identified Web of Science (WoS, see https://www.webofknowledge.com/) as the most appropriate source of metadata for creating synthetic references, based on the option to gather language-specific metadata. Additionally, WoS provides a filter for the document type, even though it lacks, for example, book types. The final data set should contain multiple document types to cover various citation fields.

Web of Science provides access to the Russian Science Citation Index (RSCI), a bibliographic database of scientific publications in Russian with roughly 750,000 instances. We chose to gather around 27,000 most recent (i.e., from 2020) article type and around 7,000 most recent (i.e., from 2010-2020) conference proceeding type metadata records from the RSCI (the conference proceeding type corresponds to the meeting type in WoS). The selection is motivated by the finding of Grennan and Beel [19] that a model trained with more than 10,000 citations would decrease in performance compared with a smaller training data set. To verify the latter statement in our evaluation, we decide to create a data set consisting of 100,000 citation strings in total. Last but not least, following the GIANT data set, we wanted our data set to consist of around 80% articles and 20% conference proceedings.

Based on the language tags in the metadata provided by WoS, a breakdown of the languages of the data we collected is shown in Table 2. Unfortunately, the RSCI database by WoS does not provide Ukrainian language metadata, but since Russian and Ukrainian are very similar, we expect the model to process Ukrainian language references comparably reliably to Russian language references. In our evaluation, we show that our model achieves similar F1 scores for Russian and Ukrainian language references.

Table 2
Distribution of the reference languages from WoS.

Language    Number of items
Russian     31,977
English     2,241
other       9

After converting the WoS data to the BibTeX format and filtering out corrupted entries, we enrich the data with additional features, such as "pagetotal" (a field specific to the GOST citation styles discussed below) and "address" (publisher city), to get extensive BibTeX entries that are comparable to real references. This process results in a total of 34,228 metadata records in the BibTeX format. To generate bibliographic references, we additionally need to identify a set of suitable citation styles.

Based on a CORE subset of Cyrillic script scientific papers (see the next subsection for details), we identify the GOST and APA citation styles as best suited for generating realistic reference strings. The GOST standards (see https://dic.academic.ru/dic.nsf/ruwiki/79269) were developed by the government of the Soviet Union and are comparable to standards by the American ANSI or the German DIN. They are still widely used in Russia and in many former Soviet republics. To introduce a certain level of variety, we use the GOST2003, GOST2006, and GOST2008 styles for all references. (Because we were not able to find a copy of the GOST2006 BST file, we replicated it ourselves based on the GOST2003 BST file and the description at https://science.kname.edu.ua/images/dok/journal/texnika/2021/2021.pdf.) Since the APA style cannot handle Cyrillic characters, it is used for non-Cyrillic references only.

For each reference, we create a separate PDF rendition. Using various bibliography styles for the same reference can result in reference strings that are completely different in look and structure. For instance, author names can be abbreviated or duplicated at different positions. (An example of a duplicated author name is the following GOST2006 style reference: "Alefirov, A.N. Antitumoral effects of Aconitum soongaricum tincture on Ehrlich carcinoma in mice [Text] / Alefirov, A.N. and Bespalov, V.G. // Obzory po klinicheskoi farmakologii i lekarstvennoi terapii.–St. Petersburg : Limited Liability Company Eco-Vector.–2012.") Metadata labels and their counterparts in the PDF references are then matched by an exact string match or, alternatively, by the Levenshtein distance (a sketch of this matching step is given below). Exact string matches are not always possible because some characters are manipulated by TeX while generating a PDF file, or field values themselves change during the generation process in various ways, such as abbreviations or misinterpreted characters. To store the reference text and reference token labels in one file per reference, we create labeled reference strings as shown in Figure 1.

In rare cases, during the parsing of the PDFs to text strings using PDFMiner, tokens were garbled and files could not be read. Consequently, the corresponding items are removed from the data set, resulting in slightly varying numbers of references for different citation styles. In the end, our approach yields about 100,000 synthetically generated labeled reference strings. A detailed breakdown of the quantity of data for each citation style is shown in Table 3.

Table 3
Number of synthetic labeled reference strings per citation style & reference type.

Citation Style    # Articles    # Conf. Proc.    Total
APA               1,293         833              2,126
GOST2003          26,289        7,061            33,350
GOST2006          26,328        7,078            33,406
GOST2008          26,467        7,113            33,580
Total             80,377        22,085           102,462
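The following is a minimal sketch of the label-matching step described above. It assumes that each rendered reference is available as a list of tokens and each BibTeX field value as a string; the windowed search and the normalized-distance threshold are illustrative choices, not the exact implementation used to build the data set.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ca != cb)))      # substitution
        previous = current
    return previous[-1]


def match_field(tokens: list[str], field_value: str, max_rel_dist: float = 0.25):
    """Find the token span of a rendered reference that best matches a BibTeX field value.

    Tries an exact substring match first; otherwise slides a window of roughly
    the field's token length over the reference and returns the span with the
    smallest normalized Levenshtein distance (or None if even the best span is
    too dissimilar).
    """
    reference = " ".join(tokens)
    if field_value in reference:                      # exact match is the easy case
        start = reference[:reference.index(field_value)].count(" ")
        return start, start + len(field_value.split())

    n = max(len(field_value.split()), 1)
    best = None
    for width in (n - 1, n, n + 1):
        if width < 1:
            continue
        for start in range(0, len(tokens) - width + 1):
            candidate = " ".join(tokens[start:start + width])
            dist = levenshtein(candidate, field_value) / max(len(field_value), 1)
            if best is None or dist < best[0]:
                best = (dist, start, start + width)
    if best and best[0] <= max_rel_dist:
        return best[1], best[2]
    return None


# Illustrative usage with a hypothetical GOST-like rendering:
tokens = "Иванов, И.И. Заголовок / И. Иванов // Журнал. – 2017.".split()
print(match_field(tokens, "Заголовок"))   # -> (2, 3), i.e., the third token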
In Table 4, we additionally show the breakdown of labels covered by our synthetic references.

Table 4
Number of synthetic labeled reference strings having the respective labels, per reference type.

Label        # Articles    # Conf. Proc.    Total
title        80,376        22,085           102,461
author       80,375        22,079           102,454
year         80,305        21,870           102,175
pages        80,419        17,944           97,113
journal      80,376        –                80,376
number       80,214        –                80,214
volume       46,494        11,423           57,917
booktitle    –             22,085           22,085
publisher    –             22,083           22,083
address      –             20,034           20,034
pagetotal    1,208         4,141            5,349

3.2.2. Manually Annotated References

Despite the fact that many large scholarly data sets are publicly available, most lack broad language coverage or do not contain full text documents. Investigating several data sources, we find that, for example, the PubMed Central Open Access Subset (see https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) provides mostly English language publications (see https://www.ncbi.nlm.nih.gov/pmc/about/faq/#q16), just like S2ORC [22]. Further, the Microsoft Academic Graph [23, 24] covers millions of publications, but does not contain full texts and therefore also no reference strings.

We use the data set introduced by Krause et al. [21] as a source of Cyrillic script papers. After a filtering step to remove papers with lacking or unstructured citations, we randomly chose 100 papers to manually annotate.

Analyzing the origin of the selected papers, we note that 80 originate from the "A.N. Beketov KNUME Digital Repository" (https://eprints.kname.edu.ua/) and five from the "Zhytomyr State University Library" (http://eprints.zu.edu.ua/). Origins could not be determined for 15 papers. Figure 3 shows the distribution of papers by publication year. A breakdown of the disciplines covered by the data set revealed that the most strongly represented disciplines are "engineering" with 36 papers and "economics" with 16 papers. The remaining 48 papers are spread across various fields, such as education, zoology, and urban planning/infrastructure.

Figure 3: Distribution of publication years of the selected 100 papers.

Using fastText [25, 26] language detection, we find that our sample consists of 65 Ukrainian language and 35 Russian language papers (a sketch of such a language check is given at the end of this subsection).

Using the annotation tool INCEpTION [27], we label the references in our 100 PDFs. Regarding manual annotation, we note that the real references did not always fit our set of metadata labels. For example, references to patents, legal texts, or web resources might not contain certain elements typical for references to scientific papers. Furthermore, references containing fields outside the scope of our labels, like editor or institution, exist. In the case of booktitle fields of conference proceedings, we used the journal label. Lastly, due to the difference in the use of "№" across citation styles (indicating either an issue or a volume number), in ambiguous cases the number after "№" is labeled volume, following the GOST2006 citation style.

Table 5 shows the summary statistics of the resulting data set. In Table 6, we show the labels used and their number of occurrences counted in segments (a segment is the full text range for a label).

Table 5
Summary of the manually annotated data set.

Parameter                               Counts
Number of annotated papers              100
Number of reference strings             771
Average reference length (in tokens)    28.00
Number of reference related labels      11
Number of labeled reference segments    5,080

Table 6
Segment counts for the labels assigned.

Label        # segments
author       1,560
title        773
year         680
pages        612
address      410
publisher    364
journal      328
volume       256
number       91

Although 65% of the 100 documents are Ukrainian language papers, the references are written in various languages. Nearly 99% are written in Russian, Ukrainian, and English (see Table 7). Other languages contained are Polish, German, Serbian, and French.

Table 7
Distribution of the reference languages in the manually annotated data set.

Language     Number of references
Russian      390
Ukrainian    288
English      82

While the number of manually annotated references is not large enough for training purposes, we argue that the size and language distribution enable us to perform a realistic evaluation of our models.
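As an illustration of the language check mentioned above, the following is a minimal sketch using fastText's off-the-shelf language identification model. The model file lid.176.bin and the idea of classifying a document by a chunk of its text are standard fastText usage; the exact procedure applied to our sample is not reproduced here.

import fasttext

# Off-the-shelf fastText language identification model; download from
# https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> tuple[str, float]:
    """Return the most likely language code (e.g., 'uk', 'ru') and its confidence."""
    # fastText expects single-line input, so newlines are stripped beforehand.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("Проблеми розвитку міського господарства"))     # expected: ('uk', ...)
print(detect_language("Определение параметров городской застройки"))  # expected: ('ru', ...)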
4. Approach

There are various approaches to the CFE task. Most of them use regular expressions, template matching, knowledge bases, or supervised machine learning, whereby machine learning-based approaches achieve the best results [28]. Furthermore, tools differ in terms of the extracted reference fields and their granularity.

GROBID is commonly considered the most effective tool [28] and was created by Lopez. Tkaczyk et al. reported an F1 score of 0.92 for the retrained GROBID CRF model on their data set. Beyond parsing reference strings, GROBID is also able to extract metadata and logical structure from scientific documents in PDF format. Following the existing literature, we decide to use the GROBID CRF model as a baseline. Therefore, we retrain the GROBID CRF model on our synthetic data set following GROBID's documentation (see https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/). The GROBID CRF model is trained from scratch (see https://github.com/kermitt2/grobid/issues/748).

State-of-the-art sequence labeling approaches are often based on BERT. Accordingly, we fine-tune the cased multilingual BERT model, which is pretrained on 104 languages, on our synthetic reference data set (a minimal sketch of this setup is given below). We fine-tune and retrain, respectively, both BERT and GROBID on several subsets of our synthetic data set with differing sizes (between 500 and 100,000 instances) so that we can assess the necessity of a large training set.
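To make the fine-tuning setup concrete, the following is a minimal sketch of token classification with the cased multilingual BERT model using the Hugging Face transformers library. The label set mirrors the fields described in Section 3; the word-to-subtoken label alignment, the toy example, and the hyperparameters are illustrative assumptions, not the exact configuration used in our experiments.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["other", "author", "title", "year", "pages", "journal",
          "number", "volume", "booktitle", "publisher", "address", "pagetotal"]
label2id = {label: i for i, label in enumerate(LABELS)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

def encode(words, word_labels):
    """Tokenize one pre-split reference string and align word-level labels to
    subword tokens (special tokens and subtoken continuations get -100)."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    padding="max_length", max_length=128, return_tensors="pt")
    labels, previous = [], None
    for word_id in enc.word_ids(batch_index=0):
        if word_id is None or word_id == previous:
            labels.append(-100)                      # ignored by the loss
        else:
            labels.append(label2id[word_labels[word_id]])
        previous = word_id
    enc["labels"] = torch.tensor([labels])
    return enc

# One toy training step on a single (hypothetical) labeled reference string.
words = ["Иванов", "И.И.", "Заголовок", "статьи", "//", "Журнал.", "2017.", "№", "1."]
tags  = ["author", "author", "title", "title", "other", "journal", "year", "other", "number"]

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
batch = encode(words, tags)
optimizer.zero_grad()
loss = model(**batch).loss
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")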
5. Evaluation

Fine-tuning the BERT model is, compared to pretraining, relatively inexpensive [14]. We observed this as well by comparing the time for fine-tuning with the time needed to train GROBID. For example, fine-tuning BERT with 100,000 training instances takes 125 minutes (on a GeForce RTX 3090 GPU), whereas training the GROBID CRF (on a 16-core Intel Xeon Gold 6226R 2.90GHz CPU) takes 1,233 minutes.

To evaluate our fine-tuned BERT model not only on the manually annotated but also on the synthetic references, we remove a hold-out set of 2,000 synthetic references from the training set, with a fixed distribution of citation styles according to the distribution of the entire data set.

5.1. BERT Evaluation on the Manually Annotated Data Set

We fine-tune the cased multilingual BERT model on 9 training set sizes from our synthetically generated labeled reference data. To ensure robust results, for each of the 9 training set sizes, we sample 5 training sets, train one model per sample, and average the resulting scores (i.e., in total we train 9 × 5 = 45 models).

Averaged scores for recall, precision, and F1 score for all 9 training set sizes are shown in Table 8. We found that models trained on relatively small training data sets (between 1,000 and 10,000 instances) perform best on our manually annotated test set. More precisely, on average, the models trained on 2,000 instances perform best regarding the F1 score. These models achieve an average F1 score of 0.928 (ranging from 0.917 to 0.936). Already with the smallest considered training set of 500 instances, we can fine-tune a powerful BERT model for the Cyrillic CFE task, achieving an F1 score of 0.91 on average.

Table 8
Evaluation on the manually annotated data set for BERT models with differing sizes of training data, averaged over 5 models trained on different random samples.

Train Set Size    Recall    Precision    F1 Score    Standard Deviation
500               0.909     0.916        0.910       0.007
1,000             0.922     0.926        0.923       0.009
2,000             0.928     0.932        0.928       0.007
3,000             0.928     0.931        0.928       0.003
5,000             0.926     0.929        0.927       0.004
10,000            0.920     0.925        0.921       0.005
20,000            0.907     0.913        0.907       0.008
50,000            0.863     0.880        0.864       0.017
100,000           0.847     0.868        0.848       0.012

The highest achieved F1 score of 0.928 (averaged over five models trained on different random samples of 2,000 instances) on our test set is comparable with state-of-the-art models proposed for English CFE [28, 19, 15, 8], especially considering the fact that there are reference types and languages in the test set the model was not trained on. Nevertheless, it is difficult to compare our results with other papers, since we work with Cyrillic script references and evaluate the models on our self-created test set.

We further evaluate a BERT model trained on 2,000 random instances (models trained on 2,000 instances perform best on average), referred to as BERT_Final from here on, regarding individual labels. Since our model is more fine-grained than the test set, i.e., labels in the synthetic data set and the manually annotated data set are not the same, we had to change the pagetotal label to pages and the booktitle label to journal.

Table 9
Detailed evaluation of labels predicted by BERT_Final.

Label               Prec.    Rec.     F1       Supp.
author              0.984    0.994    0.989    7,104
year                0.945    0.962    0.953    680
pages               0.922    0.984    0.952    1,112
address             0.927    0.961    0.944    715
other               0.945    0.926    0.936    10,730
title               0.938    0.931    0.934    7,257
publisher           0.913    0.781    0.842    1,165
journal             0.765    0.861    0.810    1,982
volume              0.836    0.454    0.588    269
number              0.345    0.860    0.492    93
Weighted Average    0.936    0.932    0.933    31,107

As shown in Table 9, our model performs best on identifying author tokens, with an F1 score of 0.989. Overall, we observe an F1 score of more than 0.934 for 6 labels (author, year, pages, address, other, and title).

We see room for improvement in publisher, journal, volume, and number predictions. The poor performance in volume and number predictions can be explained by the ambiguity of "№" in the test set (see the subsection "Manually Annotated References").

We see high recall with low precision in number predictions and low recall with high precision in volume predictions. The same observation can be made for journal and publisher predictions, but to a lesser degree. More than 50% of the actual volume labels are labeled as number, and around 17% of the actual publisher labels are labeled as journal.

Before looking into the evaluation on the synthetic hold-out set, we evaluate the BERT_Final model depending on the languages of the references (see Figure 4).

Figure 4: Evaluation on the manually annotated data set for the BERT_Final model, per label and language.

As mentioned before, our synthetic data set lacks Ukrainian language references. Nevertheless, the F1 score of 0.946 for Russian language references is only 2.5 percentage points higher than the F1 score of 0.921 for Ukrainian language references. This is potentially due to the high similarity between the Russian and Ukrainian languages. Additionally, for English language references, the predictions of volume and number labels are much better than for Cyrillic script references. This is due to the fact that most English language references are formatted in the APA style, where there is no ambiguity in the respective labels. Furthermore, BERT_Final predicts publisher and address labels worse for English language references than for Russian and Ukrainian language references.
Models trained on 2,000 instances perform best on average.
F1 Score Table 10
1.00 Recall
Precision Summary of metrics of the models evaluated on the manually
annotated test set.
0.99
Model Precision Recall F1 Score
0.98 Vanilla GROBID 0.347 0.052 0.090
Scores
GROBID𝐹 𝑖𝑛𝑎𝑙 0.665 0.631 0.647
0.97 BERT𝐹 𝑖𝑛𝑎𝑙 0.936 0.932 0.933
0.96
than 10,000 references. The best performing GROBID
0.95 model was trained with 5,000 instances, achieving a F1
500 1K 2K 3K 5K 10K 20K 50K 100K score of 0.647. We refer to this best performing GROBID
Size of training set
model as GROBID𝐹 𝑖𝑛𝑎𝑙 . Compared to the off-the-shelf
GROBID results, we managed to increase the F1 score by
Figure 5: Evaluation on synthetic hold-out data set for BERT a factor of seven by retraining GROBID.
models with differing size of training data Compared to the off-the-shelf GROBID, we see higher
F1 scores in almost every label, except for year and num-
1.0 ber. The best label performance is measured for paper ti-
F1 Score tle, with an F1 score of 0.817. A comparison of evaluation
Recall
0.9 Precision metrics of GROBID and BERT is shown in Table 10. Our
0.8 BERT𝐹 𝑖𝑛𝑎𝑙 model outperforms the GROBID𝐹 𝑖𝑛𝑎𝑙 model in
every label and, consequently, in overall F1 score as well.
0.7
Scores
0.6 6. Conclusion
0.5
In this paper, we provide a large data set covering over
0.4 100,000 labeled reference strings in various citation styles
0.3 and languages, of which 771 are manually annotated ref-
500 1K 2K 3K 5K 10K 20K 50K 100K erences from 100 Cyrillic script scientific papers. Further-
Size of training set
more, we fine-tune multilingual BERT models on various
training set sizes and achieve the best F1 score of 0.933
Figure 6: Evaluation on real data set for GROBID CRF models with 2,000 training instances. We show the eligibility of
with differing sizes of training data sets. synthetically created data for training CFE models. To
compare our results with existing models, we retrained
a GROBID model serving as a benchmark. Our BERT
F1 score of over 0.1. Most of the non-numeric labels have model significantly outperformed both off-the-shelf and
a F1 score of 0 or close to 0.19 retrained GROBID. In future work, our BERT model could
GROBID was initially trained on English language ref- be compared to other well-performing CFE models, such
erences. Consequently, it is not surprising that it per- as Cermine and Neural ParsCit.
forms poorly regarding Cyrillic reference data. Therefore, Our data sets can be reused by other researchers to
we retrain the GROBID CRF model on our synthetic Cyril- train Cyrillic script CFE models. In particular our man-
lic reference data with differing training data set sizes, ually annotated data set can serve as a benchmark for
as we did for the BERT model. Evaluations of resulting further research in this field, since it provides references
models on our manually annotated test set are shown in from various domains and covers several languages.
Figure 6. Regarding our BERT model, we see two key aspects
We observe poorer performance of the GROBID mod- for future work. First, literature describes benefits of
els compared to our fine-tuned BERT. Similar to evalua- adding a CRF layer at the top of a model’s underlying ar-
tions of the fine-tuned BERT models and Grennan and chitecture [8, 18], which could also be considered for our
Beel [19], we see that the best performing models where approach. Second, our model’s performance could be in-
trained on relatively small data sets consisting of less creased by retraining BERT from scratch on task-specific
languages, e.g. in our case Cyrillic Script languages and
19
Data used for training of the off-the-shelf GROBID has different English, as shown by Kuratov and Arkhipov [17] and
labels than we have in our synthetic data set. Consequently some Arkhipov et al. [18].
labels are condemned to have scores equal zero, e.g. web. Note that
GROBID does not provide evaluation scores for other labels.
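To sketch the first of these future-work ideas, the following shows how a CRF layer could sit on top of a BERT encoder for sequence labeling, using the pytorch-crf package. The module below is an illustrative design that assumes word-level inputs already aligned to labels; it is not an implementation we evaluated.

import torch
from torch import nn
from torchcrf import CRF                      # pip install pytorch-crf
from transformers import AutoModel

class BertCrfTagger(nn.Module):
    """BERT encoder -> linear emission layer -> CRF layer for decoding."""

    def __init__(self, num_labels: int, model_name: str = "bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.emission = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # Negative log-likelihood of the gold label sequences under the CRF.
            return -self.crf(emissions, labels, mask=mask)
        # Viterbi decoding: best label sequence per input.
        return self.crf.decode(emissions, mask=mask)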
References

[1] W. Shaw, Information theory and scientific communication, Scientometrics 3 (1981) 235–249. URL: https://link.springer.com/content/pdf/10.1007/BF02101668.pdf.

[2] J. L. Ortega, Academic search engines: A quantitative outlook, Elsevier, 2014.

[3] J. Beel, B. Gipp, S. Langer, C. Breitinger, Research-paper recommender systems: a literature survey, International Journal on Digital Libraries 17 (2016) 305–338. URL: https://dx.doi.org/10.1007/s00799-015-0156-0. doi:10.1007/s00799-015-0156-0.

[4] M. Färber, A. Jatowt, Citation recommendation: approaches and datasets, Int. J. Digit. Libr. 21 (2020) 375–405. URL: https://doi.org/10.1007/s00799-020-00288-2. doi:10.1007/s00799-020-00288-2.

[5] M. Khabsa, C. L. Giles, The number of scholarly documents on the public web, PLoS ONE 9 (2014) e93949. URL: https://dx.doi.org/10.1371/journal.pone.0093949. doi:10.1371/journal.pone.0093949.

[6] P. Lopez, Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications, in: M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou, G. Tsakonas (Eds.), Research and Advanced Technology for Digital Libraries, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 473–474. URL: https://link.springer.com/chapter/10.1007/978-3-642-04346-8_62.

[7] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, L. Bolikowski, Cermine: automatic extraction of structured metadata from scientific literature, International Journal on Document Analysis and Recognition (IJDAR) 18 (2015) 317–335. URL: https://dx.doi.org/10.1007/s10032-015-0249-8. doi:10.1007/s10032-015-0249-8.

[8] A. Prasad, M. Kaur, M.-Y. Kan, Neural parscit: a deep learning-based reference string parser, International Journal on Digital Libraries 19 (2018) 323–337. URL: https://dx.doi.org/10.1007/s00799-018-0242-1. doi:10.1007/s00799-018-0242-1.

[9] M. Grennan, M. Schibel, A. Collins, J. Beel, Giant: The 1-billion annotated synthetic bibliographic-reference-string dataset for deep citation parsing, in: 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, 2019, pp. 101–112. URL: http://ceur-ws.org/Vol-2563/aics_25.pdf.

[10] O. Moskaleva, V. Pislyakov, I. Sterligov, M. Akoev, S. Shabanova, Russian index of science citation: Overview and review, Scientometrics 116 (2018) 449–462. URL: https://doi.org/10.1007/s11192-018-2758-y. doi:10.1007/s11192-018-2758-y.

[11] Z. Jiang, Y. Yin, L. Gao, Y. Lu, X. Liu, Cross-language Citation Recommendation via Hierarchical Representation Learning on Heterogeneous Graph, in: Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR'18, 2018, pp. 635–644.

[12] A. Martín-Martín, M. Thelwall, E. Orduña-Malea, E. D. López-Cózar, Google scholar, microsoft academic, scopus, dimensions, web of science, and opencitations' COCI: a multidisciplinary comparison of coverage via citations, Scientometrics 126 (2021) 871–906.

[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv preprint arXiv:1706.03762 (2017). URL: https://arxiv.org/abs/1706.03762.

[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2019). URL: https://arxiv.org/abs/1810.04805.

[15] D. Thai, Z. Xu, N. Monath, B. Veytsman, A. McCallum, Using bibtex to automatically generate labeled data for citation field extraction, in: Automated Knowledge Base Construction, 2020. URL: https://openreview.net/forum?id=OnUd3hf3o3.

[16] S. Anzaroot, A. McCallum, A new dataset for fine-grained citation field extraction, ICML Workshop on Peer Reviewed and Publishing Models (2013). URL: https://openreview.net/forum?id=ffO1Piqs1KZo5.

[17] Y. Kuratov, M. Arkhipov, Adaptation of deep bidirectional multilingual transformers for russian language, arXiv preprint arXiv:1905.07213 (2019). URL: https://arxiv.org/abs/1905.07213.

[18] M. Arkhipov, M. Trofimova, Y. Kuratov, A. Sorokin, Tuning multilingual transformers for language-specific named entity recognition, in: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, Association for Computational Linguistics, Florence, Italy, 2019, pp. 89–93. URL: https://www.aclweb.org/anthology/W19-3712. doi:10.18653/v1/W19-3712.

[19] M. Grennan, J. Beel, Synthetic vs. real reference strings for citation parsing, and the importance of re-training and out-of-sample data for meaningful evaluations: Experiments with grobid, giant and cora, ArXiv abs/2004.10410 (2020). URL: https://arxiv.org/abs/2004.10410.

[20] P. Knoth, Z. Zdrahal, CORE: three access levels to underpin open access, D-Lib Magazine 18 (2012). URL: http://oro.open.ac.uk/35755/.

[21] J. Krause, I. Shapiro, T. Saier, M. Färber, Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic, in: Proceedings of the Second Workshop on Scholarly Document Processing, 2021, pp. 66–72. URL: https://aclanthology.org/2021.sdp-1.8.pdf.

[22] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. Weld, S2ORC: The Semantic Scholar Open Research Corpus, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, pp. 4969–4983. doi:10.18653/v1/2020.acl-main.447.

[23] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, K. Wang, An Overview of Microsoft Academic Service (MAS) and Applications, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, ACM, 2015, pp. 243–246. doi:10.1145/2740908.2742839.

[24] K. Wang, Z. Shen, C. Huang, C.-H. Wu, D. Eide, Y. Dong, J. Qian, A. Kanakia, A. Chen, R. Rogahn, A Review of Microsoft Academic Services for Science of Science Studies, Frontiers in Big Data 2 (2019) 45. doi:10.3389/fdata.2019.00045.

[25] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, arXiv preprint arXiv:1607.01759 (2016).

[26] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, Fasttext.zip: Compressing text classification models, arXiv preprint arXiv:1612.03651 (2016).

[27] J.-C. Klie, M. Bugert, B. Boullosa, R. Eckart de Castilho, I. Gurevych, The inception platform: Machine-assisted and knowledge-oriented interactive annotation, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, Association for Computational Linguistics, 2018, pp. 5–9. URL: http://tubiblio.ulb.tu-darmstadt.de/106270/.

[28] D. Tkaczyk, A. Collins, P. Sheridan, J. Beel, Machine learning vs. rules and out-of-the-box vs. retrained: An evaluation of open-source bibliographic reference and citation parsers, arXiv preprint arXiv:1802.01168 (2018). URL: https://arxiv.org/abs/1802.01168.