
zbMATH Open: API Solutions and Research Challenges

Matteo Petrera

Dennis Trautwein

Isabel Beckenbach

Dariush Ehsani

Fabian Müller

Olaf Teschke

Bela Gipp

last@gipplab.org 0

Moritz Schubotz

0 Bergische Universität Wuppertal, Wuppertal, Germany
1 zbMATH / FIZ Karlsruhe, Berlin, Germany

We present zbMATH Open, the most comprehensive collection of reviews and bibliographic metadata of scholarly literature in mathematics. Besides our website zbMATH.org, which has been openly accessible since the beginning of this year, we provide API endpoints to offer our data. APIs improve interoperability with others, e.g., digital libraries, and allow using our data for research purposes. In this article, we (1) give an overview of the current and future services offered by zbMATH; (2) present the initial version of the zbMATH Links API; (3) analyze the potential and limitations of the Links API using the example of the NIST Digital Library of Mathematical Functions; and (4) present the zbMATH Open dataset as a research resource and discuss related open research problems.

1. Introduction

Since the beginning of 2021, zbMATH has been open for public access. Currently, zbMATH Open1 contains over 4 million bibliographic entries with reviews contributed by more than 7,000 active reviewers and abstracts drawn from more than 3,000 journals and book series, and more than 190,000 books. For most working mathematicians, this means that they can access zbMATH from anywhere in the world without subscription or authentication. Additionally, we envision benefits to the community from our efforts to connect zbMATH data with information systems of research data, collaborative platforms, funding agencies, and intra-disciplinary efforts, as outlined in [8, 18]. We expect that our commitment to disseminating mathematics research results will increase the visibility of mathematics for any scientific audience. We invite the mathematical community to participate actively in the further development of the platform.

Figure 1: Overview of the zbMATH database and its associated data flows. This paper focuses on the “Scholix Links API”. “Future APIs” are under construction.

Very recently, at zbMATH, efforts have been made to develop Application Programming Interface (API) solutions to facilitate and optimize open access to mathematical research data.

Digital Infrastructures for Scholarly Content Objects (DISCO2021) at JCDL2021
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1https://zbmath.org/
2https://oai.zbmath.org
3https://dublincore.org/

In Figure 1, we sketch a conceptual overview of zbMATH’s services. The boxes “Reviewer Interface”, “Internal Interfaces”, and “zbMATH.org Website” show the well-established components of zbMATH and are outside the scope of this paper. The box “OAI-PMH API” was released in April 2021 [18]. This protocol is widely used for metadata harvesting. Via the OAI-PMH API2, researchers can harvest the entire dataset or only specific subsets of our collection. We offer the data in two flavors, the standardized Dublin Core3 metadata format and a second format that is closer to zbMATH’s internal data model. The content generated by zbMATH Open, such as reviews, classifications, software, or author disambiguation data, is distributed under CC-BY-SA 4.0. This defines the license for the whole dataset, which also contains non-copyrighted bibliographic metadata and reference data derived from I4OSC (CC0). Note that the API provides only a subset of the data in the zbMATH Open web interface, since in several cases third-party information, such as abstracts, cannot be made available under a suitable license through the API. In those cases we replaced the data with a placeholder string. We envision that for researchers dealing with different data providers, the Dublin Core format is more suitable. We expect that for people used to our website, our own format is more appealing to use. From the API one can fetch the entire dataset or a well-defined subset using a metadata harvester4. One harvest output will be permanently stored as a research dataset of the Special Interest Group on Maths Linguistics data repository. This data repository also contains annual snapshots of arXiv5 articles in different formats optimized for mathematical information retrieval research challenges. As the zbMATH open data links to many arXiv preprints, we plan to synchronize the release cycles to create consistent snapshots of zbMATH data and associated full-text sources.
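To illustrate the harvesting workflow described above, the following is a minimal sketch of an OAI-PMH `ListRecords` loop in Python. It relies only on the OAI-PMH 2.0 protocol itself (the `ListRecords` verb, the `oai_dc` Dublin Core prefix, and `resumptionToken` paging). The `/v1/` path in `BASE_URL`, the function names, and the injected `fetch` callable are assumptions made for illustration and are not taken from this paper.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Assumption: the host comes from the footnote above, the "/v1/" path is a guess.
BASE_URL = "https://oai.zbmath.org/v1/"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    Per OAI-PMH, a resumptionToken is an exclusive argument: when it is
    present, no other argument besides the verb may be sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty resumptionToken marks the last page.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


def harvest(fetch, base_url=BASE_URL):
    """Yield record identifiers page by page.

    `fetch` is any callable mapping a URL to the response body as text,
    e.g. a thin wrapper around an HTTP client (hypothetical, supplied by the caller).
    """
    token = None
    while True:
        url = list_records_url(base_url, resumption_token=token)
        identifiers, token = parse_page(fetch(url))
        yield from identifiers
        if token is None:
            break


# Hand-written sample page used for a dry run without network access.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

Passing the HTTP layer in as `fetch` keeps the paging logic testable offline; in practice an existing metadata harvester, as mentioned above, would usually be preferable to hand-written code.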

In this paper, we describe a new service offered by zbMATH, namely an API called “zbMATH Links API”, represented by the box labeled “Scholix Link API” in Figure 1. At present, this new API is focused on the interconnections between zbMATH and the Digital Library of Mathematical Functions (DLMF)6, even though more partners are expected to be hosted soon (e.g., MathOverflow, arXiv, Online Encyclopedia of Integer Sequences). Search engines or researchers from mathematics or the field of bibliometric research might use our zbMATH Links API to present and use the search results. Furthermore, the source code of our API has been released in the form of a Python package7, so that any interested user can use it for similar purposes in any context where the interconnection between bibliographic data and links has to be studied and documented. In this way, we hope to serve the needs of a wide range of potential users.

The main contributions of this paper are:

1. We provide an overview of the new API implementation using the example of how DLMF makes use of it. An analysis of the currently available dataset will be outlined.
2. We present other natural candidates for the API, thus demonstrating the potential coverage of the current mathematical literature.
3. We highlight implications and new research potentials by showing how existing research can be transferred to make use of zbMATH’s open APIs.

In the following section 2, we motivate the choice of DLMF as the first partner for our new API and describe how it is currently used in their environment. Afterward, in section 3, we present the implementation details, analyze the DLMF link data, and give some details about other potential partners. In section 4, we discuss the technical capabilities of the new API and compare the capabilities of the open APIs of zbMATH with their PubMed counterpart. The last section is devoted to some concluding remarks and open problems.

4https://www.openarchives.org/pmh/tools/
5https://arxiv.org/
6https://dlmf.nist.gov/
7https://purl.org/zb/13

2. DLMF as a zbMATH partner

Among all possible partners that may interact with zbMATH, we selected the aforementioned Digital Library of Mathematical Functions (DLMF) as a first partner. In addition to being an important reference tool for mathematicians, DLMF offers a relatively small bibliographic catalog and is therefore very well suited for testing our API.

DLMF is a well-established web resource that enlarges and translates the classical “Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables”, edited by M. Abramowitz and I. A. Stegun in 1964, into a modern and functional digital library. As the title of the original book suggests, DLMF is a digital handbook about theoretical and computational aspects of special functions. Its primary purpose is to provide a modern reference tool for researchers in mathematics, physical sciences, and engineering. It contains hundreds of definitions and theorems, presented with a standardized notation, together with tables, figures, and references to peer-reviewed papers and books. It was published online on May 7, 2010 and has been continuously maintained, reviewed, and updated ever since. Indeed, the field of special functions still receives great attention from the mathematics community, and new contributions enrich the contents of the library year by year. DLMF presents its contents in 36 chapters, and the bibliography currently consists of 2,748 references8, of which 2,053 directly link to zbMATH (i.e., about 75%). This is a valuable service offered independently by DLMF and zbMATH since each user can access the bibliographic data of all selected publications. Let us note that most of the remaining 25% of publications not linked to zbMATH do not belong to the zbMATH database.

Before providing more details about our Links API, let us mention a few details about the structure of the links we are interested in. Each reference in the DLMF bibliography may be cited many times in the DLMF pages. Each of these instances carries its own link to zbMATH. For example, the book “Asymptotics and special functions” by F. W. J. Olver (Reprint, 1997; Zbl 0982.41018)9 is referenced 332 times. Each citation uniquely defines a link to zbMATH. An example of one of these links is: https://dlmf.nist.gov/2.10#iv.p2 (see Figure 2). In this case, Olver’s book is referenced in Part 2 of Section §2.10(iv) Taylor and Laurent Coefficients: Darboux’s Method. In Figure 2, we also see that Section §2.10(iv) is cited 3 times. Each instance corresponds to a link that points to a different destination in the DLMF library. The highlighted §2.10(iv) points to what we see in the first screenshot of Figure 2.

8https://dlmf.nist.gov/bib/
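To make the link structure concrete, here is a small Python sketch that splits the example link above into its parts. The reading of the fragment `iv.p2` as subsection (iv), part 2 follows the example in the text; the function name and the returned keys are hypothetical and only serve as illustration.

```python
from urllib.parse import urlsplit


def split_dlmf_link(url):
    """Split a DLMF citation link into section, subsection, and paragraph.

    Assumes the pattern of the example link, i.e. a path such as "/2.10"
    and a fragment such as "iv.p2"; other DLMF links may look different.
    """
    parts = urlsplit(url)
    section = parts.path.strip("/")
    subsection, _, paragraph = parts.fragment.partition(".")
    return {"section": section, "subsection": subsection, "paragraph": paragraph}


link = split_dlmf_link("https://dlmf.nist.gov/2.10#iv.p2")
print(link)  # {'section': '2.10', 'subsection': 'iv', 'paragraph': 'p2'}
```

Grouping such parsed links by the Zbl code they point to would reproduce counts like the 332 citations of Olver’s book mentioned above.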

3. zbMATH Links API

This section presents the main features of the new “zbMATH Links API” by explaining its structure and various technical capabilities. Then, we give an analysis of the link statistics associated with our DLMF collaboration.

3.1. Structure of the API

The API itself has been implemented in Python and is described using the OpenAPI Specification10, a language-agnostic interface description standard for APIs. At present, it hosts only one partner, DLMF, but it will soon host other partners. The underlying dataset has been generated by scraping the DLMF bibliography. As a result, we got 2,053 references (indexed at zbMATH) and 6,526 distinct links.

In this framework, the links are objects belonging to the source (of a given partner; DLMF in the present case), and zbMATH objects are objects belonging to the target.

The API offers eight endpoints, more specifically six GET routes, one POST route, and one PUT route. The Swagger UI of the zbMATH Links API is available online11. Here is a concise listing of the provided functionalities:

• GET /link retrieves links for a given zbMATH object. The parameters are: Authors, MSC codes12, X-Field13.
• POST /link allows any user of the API to create a new link (for a given partner) related to a zbMATH object. The parameters are: Zbl code, Source identifier, Partner name, Link relation.
• PUT /partner edits data of a given zbMATH partner.
• GET /source gives a list of all links of a given zbMATH partner.
• GET /statistics/msc shows the occurrence of primary MSC codes (2-digit level) in the source.
• GET /statistics/year shows the occurrence of years of publication of references in the source.

[Figure 3 (plot; x-axis: Year, 2008 to 2020)]

9https://zbmath.org/?q=an%3A0982.41018
10https://swagger.io/specification/
11https://purl.org/zb/14
12Mathematics Subject Classification Scheme 2020, https://msc2020.org/
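As an illustration of how a client might address the GET /link route listed above, the following Python sketch assembles a request without sending it. Everything concrete in it is an assumption: `BASE_URL` is a placeholder, the query parameter names `authors` and `msc` are guesses at how the documented parameters “Authors” and “MSC codes” are spelled on the wire, and passing the X-Field mask as an HTTP header is likewise assumed. The Swagger UI referenced above is the authoritative description.

```python
from urllib.parse import urlencode

# Placeholder base URL; the real one is documented in the Swagger UI.
BASE_URL = "https://example.org/links_api"


def build_get_link_request(authors=None, msc_code=None, x_field=None):
    """Return (url, headers) for a GET /link call.

    Parameter names and the header-based field mask are assumptions
    made for this sketch, not the documented interface.
    """
    params = {}
    if authors:
        params["authors"] = authors
    if msc_code:
        params["msc"] = msc_code
    url = f"{BASE_URL}/link"
    if params:
        url += "?" + urlencode(params)
    # Optional field mask that restricts the response to the fields of interest.
    headers = {"X-Field": x_field} if x_field else {}
    return url, headers


# All links for references authored by Abramowitz, returning only source IDs.
url, headers = build_get_link_request(
    authors="Abramowitz", x_field="{Source{Identifier{ID}}}"
)
```

The resulting `url` and `headers` can be handed to any HTTP client; keeping request construction separate from transport makes the call easy to inspect before it is sent.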

13The X-Field is an optional parameter that can be used when one is running a query that can pull back a lot of metadata, but only a few fields in the output are of interest. Example: in the GET /link route, one is interested only in retrieving the id identifier of sources where the name of the author is Abramowitz. Then, Author: Abramowitz, X-Field: {Source{Identifier {ID}}}.
14http://www.scholix.org/schema/3-0

3.2. Analysis of DLMF Data

Based on our available DLMF dataset, it is possible to draw some conclusions:

• In the JSON response body of our GET /link methods, one can see that each link is equipped with a publication date. This date refers to the date the link itself was added to the DLMF bibliography. We scraped the historical bibliography between 2008 and 2020 (December is the scraping’s reference month) and found the growth numbers depicted in Figure 3. Clearly, the growth of the number of references changed drastically in 2010, the year when DLMF officially started.

• The two statistics routes show results concerning the distribution of primary MSC codes (2-digit level) and years of publication of the references in the current dataset. As one may expect, the most frequently cited primary MSC codes are:

MSC Code   Occurrences   Area
33         491           Special functions
65         351           Numerical analysis
11         172           Number theory

See Figure 4 for more details. On the other hand, the most frequent years of publication of cited references in the current dataset are:

Year   Occurrences
1998   67
1999   65
1995   65

See Figure 5 for more details. Looking at both Figures 3 and 5, we could infer that the DLMF bibliography suffers from a delay in updating its references. More precisely, the fact that the maximum peak is centered at the end of the 90s makes us think of some kind of difficulty in identifying relevant references referring to the last twenty years.

[Figure 5 (plot; x-axis: Year, 1940 to 2020)]

• The references in the current DLMF dataset which have the most citations are:
– F. W. J. Olver, Asymptotics and special functions. Wellesley, MA: A K Peters (1997; Zbl 0982.41018): 332 citations,
– M. Abramowitz (ed.) and I. A. Stegun (ed.), Handbook of mathematical functions with formulas, graphs and mathematical tables. Washington: U.S. Department of Commerce. (1964; Zbl 0171.38503): 118 citations,
– A. Erdélyi et al., Higher transcendental functions. Vol. I. New York: McGraw-Hill Book Co. (1953; Zbl 0051.30303): 110 citations.

In Figure 6 one can see the references, identified by Zbl code, with more than 50 citations.

[Figure 6 (plot; x-axis: Zbl codes)]

3.3. Usage

The motivation behind the recent implementation of APIs at zbMATH is twofold. On the one hand, we want to offer the scientific community efficient and open access to our data. On the other hand, we wish to expose the dynamic interaction between our bibliographic data and those coming from other resources. It is essential to note that both of these targets are made possible by zbMATH becoming an open web service. This provides a boost for disseminating scientific knowledge, and our work may help to understand how it spreads and auto-correlates in a functional way.

The zbMATH Links API with its first partner DLMF represents a tool that can be used in various ways and contains many properties that are advantageous for the research process. Here, we want to present concrete usage instances where a user of either DLMF or zbMATH can generally benefit from the service:

• A DLMF user can access all bibliographic resources indexed at zbMATH relating to a specific topic of interest. This may help to get a consistent overview of the scientific development of the topic itself.

• A researcher interested in a publication indexed at zbMATH can use our API to verify if and possibly where that publication is cited in DLMF. A search of this type can also be very diversified thanks to the filters that our routes offer. For example, one might be interested in identifying which DLMF links are related to a particular MSC code or a particular author. This means that a targeted use of our API can allow a detailed bibliographic search that otherwise would not be possible.

• A researcher more interested in the history of mathematics can use our API to trace the bibliography related to a certain topic covered in DLMF and observe the historical development of the topic itself in terms of the literature related to it. Such research can be very rich and diverse. It is sufficient to think that in the field of special functions there are classical topics, such as the “gamma function” or “elliptic integrals”, which have a long history behind them.

15https://mathoverflow.net/
16https://arxiv.org/
17https://oeis.org/
18https://stackexchange.com/

Table 1: Side-by-side comparison of zbMATH Open and PubMed. These are the numbers from 2020.

                      zbMATH Open     PubMed
Open Access since     2021            1996
Annual Bib. Entries   > 0.13 M        > 1.5 M
Bib. Entries Total    > 4.0 M         > 31.5 M
Journal Titles        > 3.0 K         > 5.0 K
Search Queries 2020   closed access   > 3300 M

[...] zbMATH. Therefore, a suitable algorithm is needed to find a corresponding preprint for a zbMATH record if one exists. This problem can be seen as an entity matching problem, and there exists software for it, for example, JedAI, see [12]. For our purpose, the existing software was not suitable. Therefore, we implemented our own matching algorithm. Let us provide a few details about such a matching process, although an accurate and critical description is beyond the scope of this article.

For each search record we generate a small set (default: 3) of possible matching records (called candidates), and compare them with the search record. The candidate records are generated via an Elasticsearch19 query, where we search for the title and authors of a search record. To decide whether a search and a candidate record match, a three-dimensional feature vector is computed. We use the similarity of the titles, authors, and abstracts as features. The similarity of two titles is their Levenshtein distance divided by the maximum length of the titles. To compare the similarity of two abstracts, we use the cosine distance of their tf-idf vectors (based on words). For the similarity of the authors of two articles we use a more involved approach, which is based on the Levenshtein distance of the author names, but also can handle changes in the order of the author names and incorporates information on different author spellings. Using these feature vectors, we train a decision tree classifier on our training data and test it on some test data using sklearn20. If multiple candidates match according to the trained classifier, we take the one whose feature vector has the smallest Euclidean norm.

The training and test data are generated as follows. For every arXiv preprint with a DOI in its metadata we search for a zbMATH entry with the same DOI. If we find one, we add this pair to our ground truth file. We also add some arXiv preprints with a DOI for which no zbMATH entry with the same DOI exists. Finally, we split the ground truth into a training set and a test set. We currently obtain a precision of 99.51 % and a recall of 96.89 % on the test set.

The Online Encyclopedia of Integer Sequences is a renowned online database of sequences of numbers launched in November 2010. It currently contains 342,422 sequences, each of them with its own list of metadata: first terms of the sequence, formulas for generating the sequence, references to books, articles, and scholarly links where the sequences have appeared, and more. At present, we are working on retrieving all references listed under “References” and “Links” for each sequence. Such references will be matched with our internal zbMATH Citation Matcher21 and then stored in our Links API.

4. Research Opportunities

This section presents research opportunities arising from the newly released open data and API solutions at zbMATH in a broader perspective. Moreover, we compare our service with PubMed to put it in a broader context. PubMed, with its underlying MEDLINE dataset and PubMed Central free full-text archive, is another well-known search engine within the biomedical scientific research and digital libraries community [1, 4, 5, 6, 7, 9, 13, 19]. It has been available to the public since 1996, indexes over 32 million bibliographic references of biomedical literature, and is supported by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH)22. On the other hand, zbMATH Open has over four million bibliographic entries and was made public on 1st January 2021. Table 1 shows a side-by-side comparison of PubMed and zbMATH Open.

We work out strengths and weaknesses by presenting selected research publications that leverage PubMed’s APIs and analyze their applicability to the current state of zbMATH. This serves the purpose of uncovering immediate research opportunities in applying existing methods to the new open dataset of zbMATH and highlighting development prospects in areas where existing methods cannot yet readily be applied due to missing interfaces or generally missing capabilities. The following paragraphs are to be understood as an inspiration for projects that can be based on the new open-access zbMATH data. After each paragraph, we propose one or multiple research questions that could follow from the described use case.

4.1. Immediate Research Opportunities

In this subsection, we focus on research publications that have leveraged PubMed’s open APIs and on general research opportunities.

19https://www.elastic.co/elasticsearch/ 20https://scikit-learn.org/stable/index.html 21https://zbmath.org/citationmatching/ 4.1.1. Tagging of Scientific Publications Assigning keywords or tags to scientific publications is a crucial tool to increase discoverability. However, assign22https://pubmed.ncbi.nlm.nih.gov/about/ ing such tags to scientific literature is an expensive and those tools to the semantic information present in the tex cumbersome process as human reviewers often assign ifles. zbMATH Open also provides semantic information them manually. This, in turn, leads to inconsistencies as in the form of the XML format. While the investigated PDF diferent reviewers may assign diferent tags to the same ifles also contained some mathematics literature, the idpublications. In [19] Veytsman proposes an automated iosyncrasies of mathematical typesetting may be worth a approach to measure tag consistency across research pub- reevaluation with the sole focus on mathematics literature. lications based on a metric that captures how predictive Here especially the link between zbMATH entries and tex a tag is for a citation. The author conducted experiments sources on arxiv which are provided by the API are helpful. based on the MeSH23 tags that human reviewers manu- Furthermore, zbMATH Open provides high-resolution ally attach to documents of the PubMed database corpus. scans of early publications that were not yet typeset in He concluded that their simple metric, whether a tag is a digital form alongside their corresponding tex source predictive of citations, indeed can be used to measure tag- files for over 15,000 research article reviews. This corging consistency. Each indexed publication of zbMATH pus constitutes a huge potential for improving optical contains one or many MSC codes24 and a set of keywords. character recognition (OCR) techniques in the domain of The former is a hierarchical, alphanumerical identifier mathematics as outlined in [2]. 
indicating the area of mathematics a certain research pa- Potential research questions: per touches and the latter are free-text keywords that the authors suggest. Both classifiers, i.e., MSC codes and key- 3. How do the state-of-the-art PDF text extraction words, are eventually adjusted by the editors of zbMATH. tools perform for mathematical literature?

We can imagine that the same experiments that Veyts- 4. What are the main challenges in optical characman in [19] carried out can now be done based on the cor- ter recognition of mathematical formulas? pus of zbMATH Open. There would even be the possibility to further integrate with MathOverflow and recommend 4.1.3. Training Dataset citations based on the tags given in their platform when a post is created.

Potential research questions:

The opening up of zbMATH means that new training data can be used for artificial intelligence applications. The following listing provides inspiration for new possibilities 1. How to measure tagging consistency across math- that the dataset could be used for: ematical research publications? Here, one can investigate how the methods developed in [19] can be applied to mathematics data. The required data can be derived via our API.

Formula Search The search mask of zbMATH Open already ofers a formula search. However, the new open API allows building ones own or improving the formula search 2. What can be learned from crowd-sourced tagging functionality by leveraging meta information provided in MathOverflow compared to curated tagging in alongside with the indexed articles. zbMATH? Especially interesting is here, if the tags Potential research questions: from one service can help to search in the other service.

The diferences in the tagging behavior might also give insights on the learning curve as only known concepts will be tagged by individuals.

5. What influence do diferent search options in digital libraries have on the scientific discovery process? It is save to assume, that the discovery options for scientific literature will have an efect on the outcomes on ones own research. Here, one could try to qualitatively or even quantitatively assess this influence. 4.1.2. PDF Text Extraction Benchmark

23https://www.nlm.nih.gov/bsd/disted/meshtutorial/ introduction/index.html

24Mathe matics Subject Classification 2020 ,https:// msc2020 .org/ As the Portable Document Format (PDF) is the ubiquitous and standard format for scientific publications, its layout- 6. What are the state-of-the-art approaches to forbased nature makes it hard to extract semantic meaning mula search, and what are the main challenges from the content. There exist a variety of tools that apply to overcome? certain heuristics to identify which parts of a document represent, e.g., the title or a paragraph of text. Bast et al. Recommender Systems The provided data allow build[1] established a benchmark for text extraction perfor- ing a comprehensive recommendation system. This sysmance of 14 tools by taking over 12,000 PDF documents tem could incorporate not only the meta information of from arXiv and obtaining their semantic information from the OAI-PMH APIs like MSC tags or keywords but also associated tex files and then comparing the outputs of leverage the information on other platforms that a certain research article is linked in. E.g., mentions of related research papers in conversations on MathOverflow may be a good indicator for other relevant literature. As we 7. Which features are most significant for related

literature recommendations in mathematics? continue to attract more and more partners for our Link 13. What factors make a formula more readable than API the context increases from which a potential recom- a diferently typeset formula describing the same mender system can draw meaningful conclusions. concept? Here, one can investigate factors for readPotential research questions: ability and if there are objectively better ways to typeset a certain formula.

Math Spell-Checking Popular tools like Grammarly25 8. What are the distinguishing challenges in feature scan your text for common grammatical mistakes and extraction from mathematical literature? The chal- provide the user hints about potential improvements. A lenge of this research question is to identify how state- similar ofering could be developed for typesetting forof-the-art recommender systems of other disciplines mulas by, for example, giving simple warnings of missing need to be tuned to excel at mathematical literature closed parentheses (if applicable) or other common misrecommendations. takes. Such a spell-checking system could make use of the data of zbMATH Open and linked peripheral services.

The linking to arXiv could be used to retrieve the full-text tex information, and the connection to MathOverflow could be used to detect common mistakes by taking into account the edit history of formulas in posts.

Potential research questions: Formula Disambiguation I Similar formulas can have vastly diferent meanings in diferent contexts [14, 15, 16, 17]. This is especially true for single symbols used in these formulas as researchers in diferent fields will certainly have assigned a diferent meaning to symbols. A system that tries to understand in which context a formula ap- 14. What are common errors in mathematical forpears and draw meaning from that could especially lever- mula typesetting, and how to identify them? The age the MSC classification that is assigned to all articles on main challenge of this research question is to derive a zbMATH Open. Most results from the OAI-PMH API con- method to identify erroneous formulas; and as a second tain an abstract where one can often find typeset formulas step to investigate what common errors are. that can be used as training data along with full-text data that can be obtained through arXiv.

Potential research questions: 12. How can diferently typeset formulas describing the same concept be disambiguated? The main challenge of this research question is to devise ways to identify such formula combinations.

15. What impact had formulas containing errors in the mathematics research community? Here, one can research the consequences that errors in formulas and the research that built on them had. This could be extended to the influence of errors in formulas on widespread websites like Wikipedia to contemporary incidence.

Classification and clustering While zbMATH Open provides MSC tags and keywords for the research articles, we can imagine that there are diferent classification and clustering approaches that are not represented through Following the above dis- the meta information of zbMATH. The open-access to the

APIs allows building use case specific search and clustering systems.

Potential research questions: Formula Disambiguation II ambiguation, it is also possible for a single concept to be expressed in diferent ways. Imagine the circumference of a circle being expressed in one paper as = 2 and in another = with radius and diameter . In- 16. Do diferent logical classification and clustering deed, both formulas describe the same concept but are schemes emerge from the zbMATH Open metatypeset diferently. This kind of disambiguation will be of data besides the MSC classification scheme? immediate relevance for academic plagiarism detection.

State-of-the-art plagiarism detection systems already consider paraphrased text but lack capabilities to efectively detect “paraphrased” formulae [10].

Potential research questions: 10. How can similarly typeset formulas describing diferent concepts be disambiguated? The main challenge of this research question is to devise criteria that make a formula ambiguous. 11. What are the distinguishing factors in formula typesetting to avoid ambiguity? In this research question it would be the goal to devise guidelines to avoid typesetting ambiguous formulas in the first place.

Review generation At present, many research papers and books indexed at zbMATH are supplemented with a review written by external experts in the field. Currently, more than 7,000 active experts participate in compiling reviews for research papers and books. They critically analyze the contribution of the publication under consideration, often summarize the content, and judge it in reference to a bigger context. With the advancements of text-generating deep learning models such as language models, it is a natural step to train models on these hand-written reviews in conjunction with their full-text articles and the metadata of zbMATH Open.

Potential research questions:
17. What are the significant properties that a mathematical review should include? In this research question, one should distill the essential properties of what makes a “good” mathematical review.
18. How do mathematical reviews generated by AI language models compare with manually written reviews according to the aforementioned significant properties? Here, it is interesting to understand if artificial intelligence is capable of meeting the aforementioned properties.
19. What impact can AI language models have on the mathematical review process? In this research question, one should work out the implications of potentially machine-written reviews.

25 https://grammarly.com/

4.2. Development Prospects

In this subsection, we focus on research publications that have leveraged PubMed's open APIs and for which there is no counterpart yet in zbMATH Open. The use cases in this section serve as inspiration for development opportunities.

4.2.1. Retraction Tracking

There are manifold reasons why a scientific publication could get retracted. They range from erroneous study design to deliberate misconduct such as plagiarism or generating artificial data to support a hypothesis. As the amount of scientific literature grows at an accelerating rate, the number of retracted papers naturally increases as well. Therefore, it is crucial to notify researchers early in the research process about possibly retracted publications. In [4], Dinh et al. present a Zotero26 plugin called ReTracker that helps to identify retracted papers from PubMed. ReTracker uses the full paper titles as they are present in the Zotero library to query PubMed for their retraction status. This status is persisted in a local cache and displayed to the user. With the opening of zbMATH, this plugin could now not only cover articles of biomedical literature but also inform researchers about retracted publications in the field of mathematics. Currently, zbMATH Open does not provide information about the retraction status, but we can imagine that collecting this information from various trustworthy sources and making it accessible through the API would be a valuable addition to the current service.

The authors in [4] underline the need for such a tool by stating that the citation rate of retracted publications can even increase after they got their retraction status [4], so literature is still cited even years after retraction.

Potential research questions:
20. How does the retraction of mathematical papers influence their citations? This question follows the observation of [4] that the citation count of literature still increases after it got retracted, so the intuitive answer that citations stop after retraction does not hold true. Here, it would be interesting to identify the reasons why literature is still cited.
21. What are the most common reasons for the retraction of mathematical research papers, and how can publication of such papers be minimized? Here, one can think in the direction of computer-assisted quality assurance on the publisher side and how this could help the publishing process.

26 https://www.zotero.org/

4.2.2. Collaboration Identification

While digital libraries nowadays offer comprehensive and advanced search interfaces to retrieve and explore related scientific literature, they often lack an understanding of how authors have collaborated and to what extent their collaboration was fruitful. The same is true for zbMATH Open. In [3], Cagliero et al. explored ways to identify collaboration patterns of authors and to measure to what extent the collaboration was fruitful. They harvested digital libraries and online databases for research publications and applied a pattern-based approach to identify collaborations among researchers. By making the APIs of zbMATH open-access, we believe that Cagliero et al. [3] can serve as inspiration to motivate further insight-generation techniques such as author collaboration identification.

Potential research questions:
22. How can the open data of zbMATH be used to construct collaboration graphs among mathematics researchers? The main contribution in this research question would be a comprehensive collaboration graph based on the zbMATH Open dataset.
23. What conclusions can be drawn from an author collaboration graph concerning collaboration effectiveness? Here, one can investigate how the methods developed in [3] can be applied to the data of our APIs.

5. Conclusions and Future Work

In this article, we have presented the recent innovations made to zbMATH. We implemented API solutions following the OAI-PMH and Scholix standards. Those solutions allow the scientific community to use our open database in an efficient and reproducible way. We demonstrated the capabilities of API solutions on the basis of existing links between DLMF and zbMATH. By combining classification information from zbMATH with reference information from DLMF, we could derive new insights on references in the DLMF. In the future, we will incorporate MathOverflow, arXiv, and the On-Line Encyclopedia of Integer Sequences into the new zbMATH Links API. Moreover, we gave inspiration for research opportunities arising from the APIs. In this context, we proposed 23 open research questions that can be immediately approached by leveraging the open access model and new programming interfaces.

In the future, we will optimize our API interfaces to the needs of the scientific community and zbMATH's data partners. Depending on the needs of these communities, we will evolve and adapt our data formats. Moreover, we are working towards open-access publications and permissive licenses for the reuse of scholarly metadata. We aim to convince publishers to distribute abstracts and references under permissive licenses. We will also continue to integrate mathematics-related research software and research data besides traditional publications.

References

[1] Bast and Korzen. “Benchmark and Evaluation for Text Extraction from PDF”. In: Proc. ACM/IEEE JCDL. Toronto, ON, Canada: IEEE, June 2017, pp. 1-10. doi: 10/ghchxm.

[2] M. Beck et al. “Transforming Scanned zbMATH Volumes to LaTeX: Planning the Next Level Digitisation”. In: EMS Newsletter 2020-9.117 (Sept. 2020), pp. 49-52. doi: 10.4171/news/117/11.

[3] Cagliero et al. “Identifying Collaborations among Researchers: a pattern-based approach”. In: Proc. BIRNDL at ACM SIGIR. Ed. by Mayr, M. K. Chandrasekaran, and Jaidka. Vol. 1888. CEUR-WS.org, 2017, pp. 56-68.

[4] Dinh, Y.-Y. Cheng, and N. N. Parulian. “ReTracker: an Open-Source Plugin for Automated and Standardized Tracking of Retracted Scholarly Publications”. In: Proc. ACM/IEEE JCDL. Ed. by M. Bonn et al. IEEE, 2019, pp. 406-407. doi: 10.1109/JCDL.

[5] Eggers et al. “Visualizing aggregated biological pathway relations”. In: Proc. ACM/IEEE JCDL. 2005, pp. 67-68. doi: 10.1145/1065385.1065400.

[6] Erekhinskaya et al. “Knowledge Extraction for Literature Review”. In: Proc. ACM/IEEE JCDL. Newark, New Jersey, USA: ACM, June 2016, pp. 221-222. doi: 10.1145/2910896.2925441.

[7] “What Drives Research Efforts? Find Scientific Claims that Count!” In: Proc. ACM/IEEE JCDL. 2019, pp. 217-226. doi: 10.1109/JCDL.2019.00038.

[8] Hulek and O. Teschke. “The Transition of zbMATH Towards an Open Information Platform for Mathematics”. In: EMS Newsletter 2020-6.116 (June 2020), pp. 44-47. doi: 10.4171/news/116/12.

[9] Jhawar et al. “Author Name Disambiguation in PubMed using Ensemble-Based Classification Algorithms”. In: Aug. 2020, pp. 469-470. doi: 10.

[10] ACM/IEEE JCDL. Urbana-Champaign, Illinois, USA, June 2019. doi: 10.1109/JCDL.2019.00026.

[11] Müller, Schubotz, and Teschke. “References to Research Literature in QA Forums - A Case Study of zbMATH Links from MathOverflow”. In: EMS Newsletter 2019-12.114 (Nov. 2019), pp. 50-52. doi: 10.4171/news/114/15.

[12] Papadakis et al. “The return of jedAI: end-to-end entity resolution for structured and semi-structured data”. In: Proc. VLDB 11.12 (Aug. 2018), pp. 1950-1953. doi: 10.14778/3229863.3236232.

[13] ACM/IEEE JCDL. Toronto, ON, Canada: IEEE, June 2017, pp. 1-2. doi: 10.1109/jcdl.2017.7991622.

[14] Scharpf, Schubotz, and Gipp. “Fast Linking of Mathematical Wikidata Entities in Wikipedia Articles Using Annotation Recommendation”. In: Proc. WWW. ACM, Apr. 2021. doi: 10.1145/3442442.

[15] Scharpf, Schubotz, and Gipp. “Representing Mathematical Formulae in Content MathML using Wikidata”. In: BIRNDL@SIGIR. Vol. 2132. CEUR-WS.org, 2018, pp. 46-59.

[16] Scharpf et al. “AnnoMathTeX - a formula identifier annotation recommender system for STEM documents”. In: RecSys. ACM, 2019, pp. 532-533.

[17] Scharpf et al. “Towards Formula Concept Discovery and Recognition”. In: BIRNDL@SIGIR. Vol. 2414. CEUR-WS.org, 2019, pp. 108-115.

[18] Schubotz and Teschke. “zbMATH Open: Towards standardized machine interfaces to expose bibliographic metadata”. In: EMS Newsletter 2021-4 (2021). doi: 10.4171/MAG-12.

[19] ACM/IEEE JCDL. Champaign, IL, USA: IEEE, June 2019, pp. 372-373. doi: 10.1109/jcdl.2019.00076.