<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>zbMATH Open: API Solutions and Research Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Petrera</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dennis Trautwein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel Beckenbach</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dariush Ehsani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Müller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olaf Teschke</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bela Gipp</string-name>
          <email>last@gipplab.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moritz Schubotz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bergische Universität Wuppertal</institution>
          ,
          <addr-line>Wuppertal</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>zbMATH / FIZ Karlsruhe</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present zbMATH Open, the most comprehensive collection of reviews and bibliographic metadata of scholarly literature in mathematics. Besides our website zbMATH.org, which has been openly accessible since the beginning of this year, we provide API endpoints to offer our data. APIs improve interoperability with others, e.g., digital libraries, and allow using our data for research purposes. In this article, we (1) give an overview of the current and future services offered by zbMATH; (2) present the initial version of the zbMATH links API; (3) analyze the potential and limitations of the links API based on the example of the NIST Digital Library of Mathematical Functions; and (4) present the zbMATH Open dataset as a research resource and discuss connected open research problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Since the beginning of 2021, zbMATH has been open for
public access. Currently, zbMATH Open<sup>1</sup> contains over 4
million bibliographic entries with reviews contributed
by more than 7,000 active reviewers and abstracts drawn
from more than 3,000 journals and book series, and more
than 190,000 books. For most working mathematicians,
this means that they can access zbMATH from anywhere
in the world without subscription or authentication.
Additionally, we envision benefits to the community from our
efforts to connect zbMATH data with information
systems of research data, collaborative platforms, funding
agencies, and intra-disciplinary efforts, as outlined in [8,
18]. We expect that our commitment to disseminating
mathematics research results will increase the visibility
of mathematics for any scientific audience. We invite the
mathematical community to participate actively in the
further development of the platform.</p>
      <p>Figure 1: Overview of the zbMATH database and its associated data flows. This paper focuses on the “Scholix Links API”. “Future APIs” are under construction.</p>
      <p>Very recently, efforts have been made at zbMATH to
develop Application Programming Interface (API) solutions
to facilitate and optimize open access to mathematical
research data.</p>
      <p>In Figure 1, we sketch a conceptual overview of zbMATH’s services. The boxes “Reviewer Interface”, “Internal Interfaces”, and “zbMATH.org Website” show the well-established components of zbMATH and are outside the scope of this paper. The “OAI-PMH API” was released in April 2021 [18]. The OAI-PMH protocol is widely used for metadata harvesting. Via the OAI-PMH API<sup>2</sup>, researchers can harvest the entire dataset or only specific subsets of our collection. We offer the data in two flavors: the standardized Dublin Core<sup>3</sup> metadata format and a second format that is closer to zbMATH’s internal data model. The content generated by zbMATH Open, such as reviews, classifications, software, or author disambiguation data, is distributed under CC-BY-SA 4.0. This defines the license for the whole dataset, which also contains non-copyrighted bibliographic metadata and reference data derived from I4OSC (CC0). Note that the API provides only a subset of the data in the zbMATH Open Web interface, since in several cases third-party information, such as abstracts, cannot be made available under a suitable license through the API. In those cases, we replaced the data with a placeholder string. We envision that for researchers dealing with different data providers, the Dublin Core format is more suitable. We expect that for people used to our website, our own format is more appealing to use.</p>
      <p>Digital Infrastructures for Scholarly Content Objects (DISCO2021) at JCDL2021. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)</p>
      <p>1 https://zbmath.org/
2 https://oai.zbmath.org
3 https://dublincore.org/</p>
      <sec id="sec-1-1">
        <p>From the API one can fetch the entire dataset or a well-defined
subset using a metadata harvester<sup>4</sup>. One harvest output
will be permanently stored as a research dataset in the
Special Interest Group on Maths Linguistics data
repository. This data repository also contains annual snapshots
of arXiv<sup>5</sup> articles in different formats optimized for
mathematical information retrieval research challenges. As
the zbMATH open data links to many arXiv preprints, we
plan to synchronize the release cycles to create consistent
snapshots of zbMATH data and associated full-text sources.</p>
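        <p>The harvesting workflow described above can be sketched in a few lines of Python. This is a minimal illustration of the generic OAI-PMH request cycle (the ListRecords verb with a metadata prefix, followed by resumption tokens), not the code of any particular harvester. The function names and the injected fetch callable are hypothetical, and the exact endpoint path and set names of the zbMATH service are assumptions that should be checked against the service itself.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if token:
        # In OAI-PMH, a resumption token is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec  # harvest only a subset
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


def harvest(base_url, fetch, **kwargs):
    """Yield record identifiers page by page.

    `fetch` is any callable mapping a URL to the response text, so the
    sketch stays independent of a specific HTTP library (an assumption).
    """
    token = None
    while True:
        url = list_records_url(base_url, token=token, **kwargs)
        ids, token = parse_page(fetch(url))
        yield from ids
        if token is None:
            break
```
]]></preformat>
        <p>In practice, fetch could wrap urllib.request.urlopen, and base_url would be the address given in footnote 2, possibly with an additional path component.</p>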
        <p>In this paper, we describe a new service offered by
zbMATH, namely an API called “zbMATH Links API”,
represented by the box “Scholix Links API” in Figure 1. At
present, this new API is focused on the interconnections
between zbMATH and the Digital Library of
Mathematical Functions (DLMF)<sup>6</sup>, even though more partners are
expected to be hosted soon (e.g., MathOverflow, arXiv,
Online Encyclopedia of Integer Sequences). Search engines or
researchers from mathematics or the field of bibliometric
research might use our zbMATH Links API to present and
use the search results. Furthermore, the source code of our
API has been released in the form of a Python package<sup>7</sup>,
so that any interested user can use it for similar purposes
in any context where the interconnection between
bibliographic data and links has to be studied and documented.
In this way, we hope to serve the needs of a wide range of
potential users.</p>
        <p>The main contributions of this paper are:</p>
        <p>1. We provide an overview of the new API implementation using the example of how DLMF makes use of it. An analysis of the currently available dataset will be outlined.</p>
        <p>2. We present other natural candidates for the API, thus demonstrating the potential coverage of the current mathematical literature.</p>
        <p>3. We highlight implications and new research potentials by showing how existing research can be transferred to make use of zbMATH’s open APIs.</p>
        <p>In Section 2, we motivate the choice of DLMF as the first partner for our new API and describe how it is currently used in their environment. Afterward, in Section 3, we present the implementation details, analyze the DLMF link data, and give some details about other potential partners. In Section 4, we discuss the technical capabilities of the new API and compare the capabilities of the open APIs of zbMATH with their PubMed counterparts. The last section is devoted to some concluding remarks and open problems.</p>
        <p>4 https://www.openarchives.org/pmh/tools/
5 https://arxiv.org/
6 https://dlmf.nist.gov/
7 https://purl.org/zb/13</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. DLMF as a zbMATH partner</title>
      <sec id="sec-2-1">
        <p>Among all possible partners that may interact with zbMATH,
we selected the aforementioned Digital Library of
Mathematical Functions (DLMF) as a first partner. In
addition to being an important reference tool for
mathematicians, DLMF offers a relatively small bibliographic catalog
and is therefore very well suited for testing our API.</p>
        <p>DLMF is a well-established web resource that enlarges
and translates the classical “Handbook of Mathematical
Functions with Formulas, Graphs, and Mathematical
Tables”, edited by M. Abramowitz and I. A. Stegun in 1964,
into a modern and functional digital library. As the
title of the original book suggests, DLMF
is a digital handbook about theoretical and computational
aspects of special functions. Its primary purpose is to
provide a modern reference tool for researchers in
mathematics, physical sciences, and engineering. It contains
hundreds of definitions and theorems, presented with a
standardized notation, together with tables, figures, and
references to peer-reviewed papers and books. It was
published online on May 7, 2010, and has been continuously
maintained, reviewed, and updated ever since. Indeed,
the field of special functions still receives great attention
from the mathematics community, and new contributions
enrich the contents of the library year by year. DLMF
presents its contents in 36 chapters, and the
bibliography currently consists of 2,748 references<sup>8</sup>, of which 2,053
directly link to zbMATH (i.e., about 75%). This is a
valuable service offered independently by DLMF and zbMATH
since each user has the possibility of accessing the
bibliographic data of all selected publications. Let us note that most of the
remaining 25% of publications not linked to zbMATH
do not belong to the zbMATH database.</p>
        <p>Before providing more details about our Links API, let
us mention a few details about the structure of the links we are
interested in. Each reference in the DLMF bibliography
may be cited many times in the DLMF pages. Each of these
instances carries its own link to zbMATH. For example, the
book “Asymptotics and special functions” by F. W. J. Olver
(Reprint, 1997; Zbl 0982.41018)<sup>9</sup> is referenced 332 times.
Each citation uniquely defines a link to zbMATH. An
example of one of these links is: https://dlmf.nist.gov/2.10#iv.p2
(see Figure 2). In this case, Olver’s book is referenced in
Part 2 of Section §2.10(iv) Taylor and Laurent Coefficients:
Darboux’s Method. In Figure 2, we also see that
Section §2.10(iv) is cited 3 times. Each instance corresponds
to a link that points to a different destination site in the
DLMF library. The highlighted §2.10(iv) points to what
we see in the first screenshot of Figure 2.</p>
        <p>8 https://dlmf.nist.gov/bib/</p>
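        <p>To make the link structure concrete, the following minimal sketch splits a DLMF link such as the one above into its section number and in-page anchor, and groups several links by section. The function names are hypothetical, and reading the URL fragment as “subsection and part” follows the example in the text rather than any documented DLMF scheme.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import urlsplit


def parse_dlmf_link(url):
    """Split a DLMF link into (section number, in-page anchor)."""
    parts = urlsplit(url)
    section = parts.path.strip("/")   # e.g. "2.10"
    anchor = parts.fragment or None   # e.g. "iv.p2"
    return section, anchor


def group_by_section(urls):
    """Map each section number to the anchors of the links pointing into it."""
    groups = {}
    for url in urls:
        section, anchor = parse_dlmf_link(url)
        groups.setdefault(section, []).append(anchor)
    return groups
```
]]></preformat>
        <p>Applied to the example link from the text, parse_dlmf_link returns the section “2.10” and the anchor “iv.p2”.</p>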
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. zbMATH Links API</title>
      <p>This section presents the main features of the new “zbMATH Links API” by explaining its structure and various technical capabilities. Then, we give an analysis of the link statistics associated with our DLMF collaboration.</p>
      <sec id="sec-3-1">
        <title>3.1. Structure of the API</title>
        <p>The API itself has been implemented in Python and is described using the OpenAPI Specification<sup>10</sup>, a language-agnostic interface description standard for APIs. At present, it hosts only one partner, DLMF, but it will soon host other partners. The underlying dataset has been generated by scraping the DLMF bibliography. As a result, we obtained 2,053 references (indexed at zbMATH) and 6,526 distinct links.</p>
        <p>[Figure 3: number of links (vertical axis) per year (horizontal axis), 2008 to 2020.]</p>
        <p>In this framework, the links are objects belonging to the source (of a given partner; DLMF in the present case), and zbMATH objects are objects belonging to the target.</p>
        <p>The API offers eight endpoints, more specifically six GET routes, one POST route, and one PUT route. The Swagger UI of the zbMATH Links API is available online<sup>11</sup>. Here is a concise listing of the provided functionalities:</p>
        <p>• GET /link retrieves links for a given zbMATH object. The parameters are: Authors, MSC codes<sup>12</sup>, X-Field<sup>13</sup>.</p>
        <p>• POST /link allows any user of the API to create a new link (for a given partner) related to a zbMATH object. The parameters are: Zbl code, Source identifier, Partner name, Link relation.</p>
        <p>• PUT /partner edits data of a given zbMATH partner.</p>
        <p>• GET /source gives a list of all links of a given zbMATH partner.</p>
        <p>• GET /statistics/msc shows the occurrence of primary MSC codes (2-digit level) in the source.</p>
        <p>• GET /statistics/year shows the occurrence of years of publication of references in the source.</p>
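        <p>As an illustration of how a client might address the routes listed above, the following sketch composes a GET /link request and a POST /link payload without sending anything over the network. All concrete names are assumptions made for illustration: the query parameter names, the payload keys, and the choice to transmit the X-Field mask as an HTTP header are not taken from the API description and should be checked against the Swagger UI.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import urlencode


def build_link_request(base_url, author=None, msc_code=None, x_field=None):
    """Compose URL and headers for GET /link.

    Parameter names ("author", "msc") and the "X-Field" header are
    hypothetical placeholders for the parameters named in the text.
    """
    params = {}
    if author:
        params["author"] = author
    if msc_code:
        params["msc"] = msc_code
    url = base_url.rstrip("/") + "/link"
    if params:
        url += "?" + urlencode(params)
    headers = {"X-Field": x_field} if x_field else {}
    return url, headers


def build_new_link_payload(zbl_code, source_identifier, partner, relation):
    """Collect the four POST /link parameters; the key names are assumptions."""
    return {
        "zbl_code": zbl_code,
        "source_identifier": source_identifier,
        "partner": partner,
        "link_relation": relation,
    }
```
]]></preformat>
        <p>The returned URL and headers can be passed to any HTTP client. Keeping request construction separate from sending makes it easy to inspect what would be requested before contacting the service.</p>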
        <p>9 https://zbmath.org/?q=an%3A0982.41018
10 https://swagger.io/specification/
11 https://purl.org/zb/14
12 Mathematics Subject Classification Scheme 2020, https://msc2020.org/</p>
        <p>13 The X-Field is an optional parameter that can be used when one is running a query that can return a lot of metadata, but only a few fields in the output are of interest. Example: in GET /link, one is interested only in retrieving the identifier (id) of sources where the name of the author is Abramowitz. Then, Author: Abramowitz, X-Field: {Source{Identifier {ID}}}.</p>
        <p>14 http://www.scholix.org/schema/3-0</p>
        <sec id="sec-3-1-1">
          <title>3.2. Analysis of DLMF Data</title>
          <p>Based on our available DLMF dataset, it is possible to draw some conclusions:</p>
          <p>• In the JSON response body of our GET /link methods, one can see that each link is equipped with a publication date. This date refers to the date the link itself was added to the DLMF bibliography. We scraped the historical bibliography between 2008 and 2020 (December is the scraping’s reference month) and found the growth numbers depicted in Figure 3. Clearly, the growth in the number of references changed drastically in 2010, the year when DLMF officially started.</p>
          <p>• The two statistics routes show results concerning the distribution of primary MSC codes (2-digit level) and years of publication of the references in the current dataset. As one may expect, the most frequently cited primary MSC codes are:</p>
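          <table-wrap>
            <table>
              <thead>
                <tr><th>MSC Code</th><th>Occurrences</th><th>Area</th></tr>
              </thead>
              <tbody>
                <tr><td>33</td><td>491</td><td>Special functions</td></tr>
                <tr><td>65</td><td>351</td><td>Numerical analysis</td></tr>
                <tr><td>11</td><td>172</td><td>Number theory</td></tr>
              </tbody>
            </table>
          </table-wrap>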
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>See Figure 4 for more details. On the other hand, the most frequent years of publication of cited references in the current dataset are:</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Year</th><th>Occurrences</th></tr>
            </thead>
            <tbody>
              <tr><td>1998</td><td>67</td></tr>
              <tr><td>1999</td><td>65</td></tr>
              <tr><td>1995</td><td>65</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>See Figure 5 for more details. Looking at both Figures 3 and 5, we could infer that the DLMF bibliography suffers from a delay in updating its references. More precisely, the fact that the maximum peak is centered at the end of the 1990s suggests some difficulty in identifying relevant references from the last twenty years.</p>
        <p>[Figure 5: years of publication of the references in the current dataset; horizontal axis: year, 1940 to 2020.]</p>
        <p>• The references in the current DLMF dataset which have the most citations are:</p>
        <p>– F. W. J. Olver, Asymptotics and special functions. Wellesley, MA: A K Peters (1997; Zbl 0982.41018): 332 citations,</p>
        <p>– M. Abramowitz (ed.) and I. A. Stegun (ed.), Handbook of mathematical functions with formulas, graphs and mathematical tables. Washington: U.S. Department of Commerce. (1964; Zbl 0171.38503): 118 citations,</p>
        <p>– A. Erdélyi et al., Higher transcendental functions. Vol. I. New York: McGraw-Hill Book Co. (1953; Zbl 0051.30303): 110 citations.</p>
        <p>In Figure 6 one can see the references, identified by Zbl code, with more than 50 citations.</p>
        <p>[Figure 6: references with more than 50 citations; horizontal axis: Zbl codes.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Usage</title>
        <p>The motivation behind the recent implementation of APIs at zbMATH is twofold. On the one hand, we want to offer the scientific community efficient and open access to our data. On the other hand, we wish to expose the dynamic interaction between our bibliographic data and those coming from other resources. It is essential to note that both of these goals are made possible by zbMATH becoming an open web service. This provides a boost for disseminating scientific knowledge, and our work may help to understand how it spreads and auto-correlates in a functional way.</p>
        <p>The zbMATH links API with its first partner DLMF represents a tool that can be used in various ways and has many properties that are advantageous for the research process. Here, we want to present concrete usage instances where a user of either DLMF or zbMATH can generally benefit from the service:</p>
        <p>• A DLMF user can access all bibliographic resources indexed at zbMATH relating to a specific topic of interest. This may help to get a consistent overview of the scientific development of the topic itself.</p>
        <p>• A researcher interested in a publication indexed at zbMATH can use our API to verify if and possibly where that publication is cited in DLMF. A search of this type can also be very diversified thanks to the filters that our routes offer. For example, one might be interested in identifying which DLMF links are related to a particular MSC code or a particular author. This means that a targeted use of our API can allow a detailed bibliographic search that otherwise would not be possible.</p>
        <p>• A researcher more interested in the history of mathematics can use our API to trace the bibliography related to a certain topic covered in DLMF and observe the historical development of the topic itself in terms of the literature related to it. Such research can be very rich and diverse. It is sufficient to think that in the field of special functions there are classical topics, such as the “gamma function” or “elliptic integrals”, which have a long history behind them.</p>
        <p>15 https://mathoverflow.net/
16 https://arxiv.org/
17 https://oeis.org/
18 https://stackexchange.com/</p>
        <p>[...] zbMATH. Therefore, a suitable algorithm is needed to find a corresponding preprint for a zbMATH record if one exists. This problem can be seen as an entity matching problem, and there exists software for it, for example, JedAI, see [12]. For our purpose, the existing software was not suitable. Therefore, we implemented our own matching algorithm. Let us provide a few details about such a matching process, although an accurate and critical description is beyond the scope of this article.</p>
        <p>For each search record we generate a small set (default: 3) of possible matching records (called candidates), and compare them with the search record. The candidate records are generated via an Elasticsearch<sup>19</sup> query, where we search for the title and authors of a search record. To decide whether a search record and a candidate record match, a three-dimensional feature vector is computed. We use the similarity of the titles, authors, and abstracts as features. The similarity of two titles is their Levenshtein distance divided by the maximum length of the titles. To compare the similarity of two abstracts, we use the cosine distance of their tf-idf vectors (based on words). For the similarity of the authors of two articles we use a more involved approach, which is based on the Levenshtein distance of the author names, but can also handle changes in the order of the author names and incorporates information on different author spellings. Using these feature vectors, we train a decision tree classifier on our training data and test it on some test data using sklearn<sup>20</sup>. If multiple candidates match according to the trained classifier, we take the one whose feature vector has the smallest Euclidean norm.</p>
        <p>The training and test data are generated as follows. For every arXiv preprint with a DOI in its metadata we search for a zbMATH entry with the same DOI. If we find one, we add this pair to our ground truth file. We also add some arXiv preprints with a DOI for which no zbMATH entry with the same DOI exists. Finally, we split the ground truth into a training set and a test set. We currently obtain a precision of 99.51 % and a recall of 96.89 % on the test set.</p>
        <p>The Online Encyclopedia of Integer Sequences is a renowned online database of sequences of numbers launched in November 2010. It currently contains 342,422 sequences, each of them with its own list of metadata: first terms of the sequence, formulas for generating the sequence, references to books, articles, and scholarly links where the sequences have appeared, and more. At present, we are working on retrieving all references listed under “References” and “Links” for each sequence. Such references will be matched with our internal zbMATH Citation Matcher<sup>21</sup> and then stored in our Links API.</p>
        <p>4. Research Opportunities</p>
        <p>This section presents research opportunities arising from the newly released open data and API solutions at zbMATH in a broader perspective. Moreover, we compare our service with PubMed to put it in a broader context. PubMed, with its underlying MEDLINE dataset and PubMed Central free full-text archive, is another well-known search engine within the biomedical scientific research and digital libraries community [1, 4, 5, 6, 7, 9, 13, 19]. It has been available to the public since 1996, indexes over 32 million bibliographic references of biomedical literature, and is supported by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH)<sup>22</sup>. On the other hand, zbMATH Open has over four million bibliographic entries and was made public on 1 January 2021. Table 1 shows a side-by-side comparison of PubMed and zbMATH Open.</p>
        <table-wrap>
          <caption><p>Table 1: Side-by-side comparison of zbMATH Open and PubMed. These are the numbers from 2020.</p></caption>
          <table>
            <thead>
              <tr><th></th><th>zbMATH Open</th><th>PubMed</th></tr>
            </thead>
            <tbody>
              <tr><td>Open Access since</td><td>2021</td><td>1996</td></tr>
              <tr><td>Annual Bib. Entries</td><td>&gt; 0.13 M</td><td>&gt; 1.5 M</td></tr>
              <tr><td>Bib. Entries Total</td><td>&gt; 4.0 M</td><td>&gt; 31.5 M</td></tr>
              <tr><td>Journal Titles</td><td>&gt; 3.0 K</td><td>&gt; 5.0 K</td></tr>
              <tr><td>Search Queries 2020</td><td>closed access</td><td>&gt; 3300 M</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We work out strengths and weaknesses by presenting selected research publications that leverage PubMed’s APIs and analyze their applicability to the current state of zbMATH. This serves the purpose of uncovering immediate research opportunities in applying existing methods to the new open dataset of zbMATH and highlighting development prospects in areas where existing methods cannot yet readily be applied due to missing interfaces or generally missing capabilities. The following paragraphs are to be understood as an inspiration for projects that can be based on the new open-access zbMATH data. After each paragraph, we propose one or multiple research questions that could follow from the described use case.</p>
        <p>4.1. Immediate Research Opportunities</p>
        <p>In this subsection, we focus on research publications that have leveraged PubMed’s open APIs and on general research opportunities.</p>
        <p>19https://www.elastic.co/elasticsearch/
20https://scikit-learn.org/stable/index.html
21https://zbmath.org/citationmatching/
4.1.1. Tagging of Scientific Publications
Assigning keywords or tags to scientific publications is a
crucial tool to increase discoverability. However,
assign22https://pubmed.ncbi.nlm.nih.gov/about/
ing such tags to scientific literature is an expensive and those tools to the semantic information present in the tex
cumbersome process as human reviewers often assign ifles. zbMATH Open also provides semantic information
them manually. This, in turn, leads to inconsistencies as in the form of the XML format. While the investigated PDF
diferent reviewers may assign diferent tags to the same ifles also contained some mathematics literature, the
idpublications. In [19] Veytsman proposes an automated iosyncrasies of mathematical typesetting may be worth a
approach to measure tag consistency across research pub- reevaluation with the sole focus on mathematics literature.
lications based on a metric that captures how predictive Here especially the link between zbMATH entries and tex
a tag is for a citation. The author conducted experiments sources on arxiv which are provided by the API are helpful.
based on the MeSH23 tags that human reviewers manu- Furthermore, zbMATH Open provides high-resolution
ally attach to documents of the PubMed database corpus. scans of early publications that were not yet typeset in
He concluded that their simple metric, whether a tag is a digital form alongside their corresponding tex source
predictive of citations, indeed can be used to measure tag- files for over 15,000 research article reviews. This
corging consistency. Each indexed publication of zbMATH pus constitutes a huge potential for improving optical
contains one or many MSC codes24 and a set of keywords. character recognition (OCR) techniques in the domain of
The former is a hierarchical, alphanumerical identifier mathematics as outlined in [2].
indicating the area of mathematics a certain research pa- Potential research questions:
per touches and the latter are free-text keywords that the
authors suggest. Both classifiers, i.e., MSC codes and key- 3. How do the state-of-the-art PDF text extraction
words, are eventually adjusted by the editors of zbMATH. tools perform for mathematical literature?</p>
        <p>We can imagine that the same experiments that Veyts- 4. What are the main challenges in optical
characman in [19] carried out can now be done based on the cor- ter recognition of mathematical formulas?
pus of zbMATH Open. There would even be the possibility
to further integrate with MathOverflow and recommend 4.1.3. Training Dataset
citations based on the tags given in their platform when
a post is created.</p>
        <p>Potential research questions:</p>
        <p>The opening up of zbMATH means that new training data
can be used for artificial intelligence applications. The
following listing provides inspiration for new possibilities
1. How to measure tagging consistency across math- that the dataset could be used for:
ematical research publications? Here, one can
investigate how the methods developed in [19] can be
applied to mathematics data. The required data can be
derived via our API.</p>
        <p>Formula Search The search mask of zbMATH Open
already ofers a formula search. However, the new open API
allows building ones own or improving the formula search
2. What can be learned from crowd-sourced tagging functionality by leveraging meta information provided
in MathOverflow compared to curated tagging in alongside with the indexed articles.
zbMATH? Especially interesting is here, if the tags Potential research questions:
from one service can help to search in the other service.</p>
        <p>The diferences in the tagging behavior might also give
insights on the learning curve as only known concepts
will be tagged by individuals.</p>
        <p>5. What influence do diferent search options in
digital libraries have on the scientific discovery
process? It is save to assume, that the discovery options for
scientific literature will have an efect on the outcomes
on ones own research. Here, one could try to
qualitatively or even quantitatively assess this influence.
4.1.2. PDF Text Extraction Benchmark</p>
        <p>23https://www.nlm.nih.gov/bsd/disted/meshtutorial/
introduction/index.html</p>
        <p>
          24Mathe
          <xref ref-type="bibr" rid="ref2">matics Subject Classification 2020</xref>
          ,https://
          <xref ref-type="bibr" rid="ref2">msc2020</xref>
          .org/
As the Portable Document Format (PDF) is the ubiquitous
and standard format for scientific publications, its layout- 6. What are the state-of-the-art approaches to
forbased nature makes it hard to extract semantic meaning mula search, and what are the main challenges
from the content. There exist a variety of tools that apply to overcome?
certain heuristics to identify which parts of a document
represent, e.g., the title or a paragraph of text. Bast et al.
[1] established a benchmark for the text extraction performance
of 14 tools by taking over 12,000 PDF documents from arXiv,
obtaining their semantic information from the associated TeX
files, and then comparing the outputs of the tools against it.
        </p>
        <p>Recommender Systems The provided data allow building a comprehensive recommendation system. Such a system could incorporate not only the meta information of the OAI-PMH API, like MSC tags or keywords, but also leverage information from other platforms on which a certain research article is linked. For example, mentions of related research papers in conversations on MathOverflow may be a good indicator of other relevant literature. As we continue to attract more partners for our Links API, the context from which a potential recommender system can draw meaningful conclusions grows.</p>
        <p>Potential research questions:</p>
        <p>7. Which features are most significant for related literature recommendations in mathematics?</p>
        <p>8. What are the distinguishing challenges in feature extraction from mathematical literature? The challenge of this research question is to identify how state-of-the-art recommender systems of other disciplines need to be tuned to excel at mathematical literature recommendations.</p>
        <p>13. What factors make a formula more readable than a differently typeset formula describing the same concept? Here, one can investigate factors for readability and whether there are objectively better ways to typeset a certain formula.</p>
        <p>Math Spell-Checking Popular tools like Grammarly<sup>25</sup> scan a text for common grammatical mistakes and give the user hints about potential improvements. A similar offering could be developed for typesetting formulas by, for example, giving simple warnings about missing closing parentheses (if applicable) or other common mistakes. Such a spell-checking system could make use of the data of zbMATH Open and linked peripheral services.</p>
        <p>The linking to arXiv could be used to retrieve the full-text
TeX information, and the connection to MathOverflow
could be used to detect common mistakes by taking into
account the edit history of formulas in posts.</p>
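        <p>The simple warnings about missing closing parentheses mentioned above can be sketched as follows. This is only a minimal illustration, assuming that formulas are available as LaTeX-like strings; the function name check_delimiters and the wording of the warnings are hypothetical and not part of any zbMATH Open service. Because notation such as the half-open interval [0, 1) is legitimate, the output should be treated as hints rather than errors.</p>
        <preformat><![CDATA[
```python
"""Delimiter check for LaTeX-like formula strings (illustrative sketch only)."""

# Opening delimiter -> expected closing delimiter.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def check_delimiters(formula):
    """Return a list of human-readable warnings about unbalanced delimiters.

    The warnings are hints only: valid notation such as the half-open
    interval [0, 1) is reported as well.
    """
    warnings = []
    stack = []  # (opening character, position) of delimiters still open
    i = 0
    while i < len(formula):
        ch = formula[i]
        if ch == "\\":
            # Skip the character after a backslash, so escaped delimiters
            # such as \{ or \) are not counted.
            i += 2
            continue
        if ch in PAIRS:
            stack.append((ch, i))
        elif ch in CLOSERS:
            if stack and PAIRS[stack[-1][0]] == ch:
                stack.pop()
            else:
                warnings.append(f"unexpected '{ch}' at position {i}")
        i += 1
    for opener, pos in stack:
        warnings.append(f"'{opener}' opened at position {pos} is never closed")
    return warnings
```
]]></preformat>
        <p>For example, check_delimiters("f(x) = (a + b") reports that the parenthesis opened at position 7 is never closed, whereas a balanced input such as \frac{a}{b} yields no warnings.</p>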
        <p>Potential research questions:</p>
        <p>14. What are common errors in mathematical formula typesetting, and how can they be identified? The main challenge of this research question is to derive a method to identify erroneous formulas and, as a second step, to investigate what the common errors are.</p>
        <p>15. What impact have formulas containing errors had on the mathematics research community? Here, one can research the consequences of errors in formulas and of the research that built on them. This could be extended to the influence of errors in formulas on widely used websites like Wikipedia, up to contemporary incidents.</p>
        <p>Formula Disambiguation I Similar formulas can have vastly different meanings in different contexts [14, 15, 16, 17]. This is especially true for single symbols used in these formulas, as researchers in different fields will certainly have assigned different meanings to the same symbols. A system that tries to understand in which context a formula appears and to draw meaning from that context could especially leverage the MSC classification that is assigned to all articles on zbMATH Open. Most results from the OAI-PMH API contain an abstract in which one can often find typeset formulas that can be used as training data, along with full-text data that can be obtained through arXiv.</p>
        <p>Potential research questions:</p>
        <p>10. How can similarly typeset formulas describing different concepts be disambiguated? The main challenge of this research question is to devise criteria that make a formula ambiguous.</p>
        <p>11. What are the distinguishing factors in formula typesetting to avoid ambiguity? The goal of this research question would be to devise guidelines to avoid typesetting ambiguous formulas in the first place.</p>
        <p>Formula Disambiguation II Following the above disambiguation, it is also possible for a single concept to be expressed in different ways. Imagine the circumference <inline-formula><tex-math>C</tex-math></inline-formula> of a circle being expressed in one paper as <inline-formula><tex-math>C = 2\pi r</tex-math></inline-formula> and in another as <inline-formula><tex-math>C = \pi d</tex-math></inline-formula>, with radius <inline-formula><tex-math>r</tex-math></inline-formula> and diameter <inline-formula><tex-math>d</tex-math></inline-formula>. Indeed, both formulas describe the same concept but are typeset differently. This kind of disambiguation will be of immediate relevance for academic plagiarism detection. State-of-the-art plagiarism detection systems already consider paraphrased text but lack capabilities to effectively detect “paraphrased” formulae [10].</p>
        <p>Potential research questions:</p>
        <p>12. How can differently typeset formulas describing the same concept be disambiguated? The main challenge of this research question is to devise ways to identify such formula combinations.</p>
        <p>Classification and clustering While zbMATH Open provides MSC tags and keywords for the research articles, we can imagine that there are different classification and clustering approaches that are not represented through the meta information of zbMATH. The open access to the APIs allows building use-case-specific search and clustering systems.</p>
        <p>Potential research questions:</p>
        <p>16. Do different logical classification and clustering schemes emerge from the zbMATH Open metadata besides the MSC classification scheme?</p>
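        <p>One way to explore the use-case-specific clustering systems mentioned under Classification and clustering is to group documents by the overlap of their keywords. The following is a minimal sketch, assuming that each record has already been reduced to a set of keywords. The record layout, the function names, and the similarity threshold are hypothetical choices for illustration, not a method used by zbMATH Open.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_keywords(records, threshold=0.5):
    """Single-link clustering on keyword overlap.

    `records` maps a (hypothetical) document identifier to a set of keywords.
    Two documents are linked when the Jaccard similarity of their keyword
    sets is at least `threshold`; linked documents share a cluster.
    Returns a sorted list of clusters, each a sorted list of identifiers.
    """
    parent = {doc_id: doc_id for doc_id in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if jaccard(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for doc_id in records:
        clusters.setdefault(find(doc_id), []).append(doc_id)
    return sorted(sorted(members) for members in clusters.values())
```
]]></preformat>
        <p>Single-link clustering is chosen here only because it needs no predefined number of clusters. Comparing the resulting groups with the MSC codes of the same documents would be one simple way to approach research question 16.</p>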
        <p>Review generation At present, many research papers and books indexed at zbMATH are supplemented with a review written by external experts in the field. Currently, more than 7,000 active experts participate in compiling reviews for research papers and books. They critically analyze the contribution of the publication under consideration, often summarize the content, and judge it in reference to a bigger context. With the advancements of text-generating deep learning models such as language models, it is natural to consider training models on these handwritten reviews in conjunction with their full-text articles and the metadata of zbMATH Open.</p>
        <p>Potential research questions:</p>
        <p>17. What are the significant properties that a mathematical review should include? In this research question, one should distill the essential properties of what makes a “good” mathematical review.</p>
        <p>18. How do mathematical reviews generated by AI language models compare with manually written reviews according to the aforementioned significant properties? Here, it is interesting to understand whether artificial intelligence is capable of meeting the aforementioned properties.</p>
        <p>19. What impact can AI language models have on the mathematical review process? In this research question, one should work out the implications of potentially machine-written reviews.</p>
        <p><sup>25</sup> https://grammarly.com/</p>
        <sec id="sec-3-3-1">
          <title>4.2. Development Prospects</title>
          <p>In this subsection, we focus on research publications that have leveraged PubMed's open APIs and for which there is no counterpart yet in zbMATH Open. The use cases in this section serve as inspiration for development opportunities.</p>
          <p>4.2.1. Retraction Tracking</p>
          <p>There are manifold reasons why a scientific publication could get retracted, ranging from erroneous study design to deliberate misconduct like plagiarism or generating artificial data to support a hypothesis. As the amount of scientific literature grows at an accelerating rate, the number of retracted papers naturally increases as well. Therefore, it is crucial to notify researchers early in the research process about possibly retracted publications. In [4], Dinh et al. present a Zotero<sup>26</sup> plugin called ReTracker that helps to identify retracted papers from PubMed. ReTracker uses the full paper titles as they are present in the Zotero library to query PubMed for their retraction status. This status is persisted in a local cache and displayed to the user. With the opening of zbMATH, this plugin could now not only cover articles of biomedical literature but also inform researchers about retracted publications in the field of mathematics. Currently, zbMATH Open does not provide information about the retraction status, but we can imagine that collecting this information from various trustworthy sources and making it accessible through the API would be a valuable addition to the current service.</p>
          <p>The authors in [4] underline the need for such a tool by stating that the citation rate of retracted publications can even increase after they got their retraction status [4], so literature is still cited even years after retraction.</p>
          <p>Potential research questions:</p>
          <p>20. How does the retraction of mathematical papers influence their citations? This question follows the observation of [4] that the citation count of literature still increases after it got retracted, so the intuitive answer that citations stop after retraction does not hold true. Here, it would be interesting to identify the reasons why literature is still cited.</p>
          <p>21. What are the most common reasons for the retraction of mathematical research papers, and how can publication of such papers be minimized? Here, one can think in the direction of computer-assisted quality assurance on the publisher side and how this could help the publishing process.</p>
          <p><sup>26</sup> https://www.zotero.org/</p>
          <p>4.2.2. Collaboration Identification</p>
          <p>While digital libraries nowadays offer comprehensive and advanced search interfaces to retrieve and explore related scientific literature, they often lack an understanding of how authors have collaborated and to what extent their collaboration was fruitful. The same is true for zbMATH Open. In [3], Cagliero et al. explored ways to identify collaboration patterns of authors and to measure to what extent the collaboration was fruitful. They harvested digital libraries and online databases for research publications and applied a pattern-based approach to identify collaborations among researchers. By making the APIs of zbMATH open access, we believe that Cagliero et al. [3] can serve as inspiration for further insight-generation techniques like author collaboration identification.</p>
          <p>Potential research questions:</p>
          <p>22. How can the open data of zbMATH be used to construct collaboration graphs among mathematics researchers? The main contribution in this research question would be a comprehensive collaboration graph based on the zbMATH Open dataset.</p>
          <p>23. What conclusions can be drawn from an author collaboration graph concerning collaboration effectiveness? Here, one can investigate how the methods developed in [3] can be applied to the data of our APIs.</p>
          <p>5. Conclusions and Future Work</p>
          <p>In this article, we have presented the recent innovations made to zbMATH. We implemented API solutions following the OAI-PMH and Scholix standards. Those solutions allow the scientific community to use our open database in an efficient and reproducible way. We demonstrated the capabilities of the API solutions on the basis of existing links between DLMF and zbMATH. By combining classification information from zbMATH with reference information from DLMF, we could derive new insights on references in the DLMF. In the future, we will incorporate MathOverflow, arXiv, and the On-Line Encyclopedia of Integer Sequences into the new zbMATH Links API. Moreover, we gave inspiration for research opportunities arising from the APIs. In this context, we proposed 23 open research questions that can be immediately approached by leveraging the open access model and new programming interfaces.</p>
          <p>We will optimize our APIs to the needs of the scientific community and zbMATH's data partners in the future. Depending on the needs of the communities, we will evolve and adapt our data formats. Moreover, we are working toward open access publications and permissive licenses for the reuse of scholarly metadata. We aim to convince publishers to distribute abstracts and references under permissive licenses. We will also continue to integrate mathematics-related research software and research data besides traditional publications.</p>
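          <p>To illustrate how an OAI-PMH interface such as the one summarized in these conclusions can be harvested reproducibly, the following minimal sketch builds ListRecords requests and extracts record identifiers and the resumption token from a response. The verb, argument, and element names follow the OAI-PMH 2.0 specification. The base URL is a placeholder and an assumption, and the helper names are hypothetical. The sketch only constructs URLs and parses text, so it performs no network access.</p>
          <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Placeholder, NOT a real endpoint: replace it with the base URL given in
# the documentation of the repository you want to harvest.
BASE_URL = "https://oai.example.org/v1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it must not
        # be combined with metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) of a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(
            ".//oai:record/oai:header/oai:identifier", OAI_NS
        )
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_element is not None and token_element.text
    return identifiers, (token_element.text if has_token else None)
```
]]></preformat>
          <p>A harvester would request list_records_url(BASE_URL), parse the response, and repeat the request with the returned resumption token until the token is empty.</p>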
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Bast</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Korzen</surname>
          </string-name>
          . “
          <article-title>A Benchmark and Evaluation for Text Extraction from PDF”</article-title>
          .
          <source>In: Proc. ACM/IEEE JCDL</source>
          . Toronto, ON, Canada: IEEE,
          <year>June 2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . doi: 10/ghchxm.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Beck</surname>
          </string-name>
          et al. “
          <article-title>Transforming Scanned zbMATH Volumes to LaTeX: Planning the Next Level Digitisation”</article-title>
          .
          <source>In: EMS Newsletter</source>
          2020-9.117 (Sept.
          <year>2020</year>
          ), pp.
          <fpage>49</fpage>
          -
          <lpage>52</lpage>
          . doi: 10.4171/news/117/11.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          et al. “
          <article-title>Identifying Collaborations among Researchers: a pattern-based approach”</article-title>
          .
          <source>In: Proc. BIRNDL</source>
          at ACM SIGIR. Ed. by
          <string-name>
            <given-names>P.</given-names>
            <surname>Mayr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Jaidka</surname>
          </string-name>
          . Vol.
          <volume>1888</volume>
          . CEUR-WS.org,
          <year>2017</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Dinh</surname>
          </string-name>
          , Y.-Y. Cheng, and
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Parulian</surname>
          </string-name>
          . “
          <article-title>ReTracker: an Open-Source Plugin for Automated and Standardized Tracking of Retracted Scholarly Publications”</article-title>
          .
          <source>In: Proc. ACM/IEEE JCDL</source>
          . Ed. by M. Bonn et al. IEEE,
          <year>2019</year>
          , pp.
          <fpage>406</fpage>
          -
          <lpage>407</lpage>
          . doi: 10.1109/JCDL.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Eggers</surname>
          </string-name>
          et al. “
          <article-title>Visualizing aggregated biological pathway relations”</article-title>
          .
          <source>In: Proc. ACM/IEEE JCDL</source>
          .
          <year>2005</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>68</lpage>
          . doi: 10.1145/1065385.1065400.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Erekhinskaya</surname>
          </string-name>
          et al. “
          <article-title>Knowledge Extraction for Literature Review”</article-title>
          .
          <source>In: Proc. ACM/IEEE JCDL</source>
          . Newark, New Jersey, USA: ACM,
          <year>June 2016</year>
          , pp.
          <fpage>221</fpage>
          -
          <lpage>222</lpage>
          . doi: 10.1145/2910896.2925441.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          “
          <article-title>What Drives Research Efforts? Find Scientific Claims that Count!”</article-title>
          <source>In: Proc. ACM/IEEE JCDL</source>
          .
          <year>2019</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>226</lpage>
          . doi: 10.1109/JCDL.2019.00038.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Hulek</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Teschke</surname>
          </string-name>
          . “
          <article-title>The Transition of zbMATH Towards an Open Information Platform for Mathematics”</article-title>
          .
          <source>In: EMS Newsletter</source>
          2020-6.116 (
          <year>June 2020</year>
          ), pp.
          <fpage>44</fpage>
          -
          <lpage>47</lpage>
          . doi: 10.4171/news/116/12.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Jhawar</surname>
          </string-name>
          et al. “
          <article-title>Author Name Disambiguation in PubMed using Ensemble-Based Classification Algorithms”</article-title>
          . In: Aug.
          <year>2020</year>
          , pp.
          <fpage>469</fpage>
          -
          <lpage>470</lpage>
          . doi: 10.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          ACM/IEEE JCDL. Urbana-Champaign, Illinois, USA,
          <year>June 2019</year>
          . doi: 10.1109/JCDL.2019.00026.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schubotz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Teschke</surname>
          </string-name>
          . “
          <article-title>References to Research Literature in QA Forums - A Case Study of zbMATH Links from MathOverflow”</article-title>
          .
          <source>In: EMS Newsletter</source>
          2019-12.114 (Nov.
          <year>2019</year>
          ), pp.
          <fpage>50</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          doi: 10.4171/news/114/15.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          et al. “
          <article-title>The return of jedAI: end-to-end entity resolution for structured and semi-structured data”</article-title>
          .
          <source>In: Proc. VLDB</source>
          11.12 (Aug.
          <year>2018</year>
          ), pp.
          <fpage>1950</fpage>
          -
          <lpage>1953</lpage>
          . doi: 10.14778/3229863.3236232.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          ACM/IEEE JCDL. Toronto, ON, Canada: IEEE,
          <year>June 2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . doi: 10.1109/jcdl.2017.7991622.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Scharpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schubotz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          . “
          <article-title>Fast Linking of Mathematical Wikidata Entities in Wikipedia Articles Using Annotation Recommendation”</article-title>
          .
          <source>In: Proc. WWW</source>
          . ACM, Apr.
          <year>2021</year>
          . doi: 10.1145/3442442.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Scharpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schubotz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          . “
          <article-title>Representing Mathematical Formulae in Content MathML using Wikidata”</article-title>
          .
          <source>In: BIRNDL@SIGIR</source>
          . Vol.
          <volume>2132</volume>
          . CEUR-WS.org,
          <year>2018</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Scharpf</surname>
          </string-name>
          et al. “
          <article-title>AnnoMathTeX - a formula identifier annotation recommender system for STEM documents”</article-title>
          .
          <source>In: RecSys. ACM</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>532</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Scharpf</surname>
          </string-name>
          et al. “
          <article-title>Towards Formula Concept Discovery and Recognition”</article-title>
          .
          <source>In: BIRNDL@SIGIR</source>
          . Vol.
          <volume>2414</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          CEUR-WS.org,
          <year>2019</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Schubotz</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Teschke</surname>
          </string-name>
          . “
          <article-title>zbMATH Open: Towards standardized machine interfaces to expose bibliographic metadata”</article-title>
          .
          <source>In: EMS Newsletter</source>
          2021-4 (
          <year>2021</year>
          ). doi: 10.4171/MAG-12.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          ACM/IEEE JCDL. Champaign, IL, USA: IEEE,
          <year>June 2019</year>
          , pp.
          <fpage>372</fpage>
          -
          <lpage>373</lpage>
          . doi: 10.1109/jcdl.2019.00076.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>