<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>MESD: Metadata Extraction from Scholarly Documents - A Shared Task Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zeyd Boukhers</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cong Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Applied Information Technology</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Soochow University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Hospital of Cologne</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents an overview of the Metadata Extraction from Scholarly Documents (MESD) shared task, which was designed to address the challenge of extracting structured metadata (e.g., title, authors, abstract) from scientific publications. The task aimed to promote the development of techniques for making scholarly data more Findable, Accessible, Interoperable, and Reusable (FAIR) by improving metadata extraction from PDF documents. We describe the task design and the creation of two complementary datasets: (1) the S2ORC_Exp500v1 dataset, consisting of 500 training samples, 100 validation samples, and 100 test samples with text-based annotations, and (2) the SSOAR Generated Multidisciplinary Vision Dataset (SSOAR-GMVD), containing more than 8,000 fully annotated documents with bounding box annotations suitable for computer vision approaches. We discuss potential directions for future research in metadata extraction from scholarly documents, highlighting the opportunities presented by these new resources.</p>
      </abstract>
      <kwd-group>
        <kwd>metadata extraction</kwd>
        <kwd>document processing</kwd>
        <kwd>scholarly documents</kwd>
        <kwd>natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Scientific literature continues to grow at an unprecedented rate, with millions of new scholarly
documents published each year. Making this vast corpus of knowledge discoverable and reusable depends
critically on the availability of high-quality metadata. While contemporary publications typically
include structured metadata, a significant proportion of the existing scholarly record, particularly
from smaller publishers and historical archives, lacks accessible metadata [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. This gap represents a
substantial challenge for scientific information retrieval and knowledge management.
      </p>
      <p>
        The scientific community has recognized the importance of making research outputs findable,
accessible, interoperable, and reusable (FAIR) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Metadata plays a crucial role in achieving these FAIR
principles, serving as the foundation for discovery systems, citation networks, and knowledge graphs.
Despite this critical role, many scholarly documents remain difficult to discover and integrate into the
scientific knowledge ecosystem due to metadata deficiencies.
      </p>
      <p>The Metadata Extraction from Scholarly Documents (MESD) shared task was conceived to address
this challenge by encouraging the development of automated techniques for extracting key metadata
elements from scientific publications. The primary goal was to advance the state of the art in metadata
extraction from PDFs, with a focus on practical applications that could help make scholarly literature
more FAIR. The task specifically targeted publications from smaller and mid-sized publishers, which
often exhibit greater variability in formatting and layout compared to publications from major publishers
with standardized templates.</p>
      <p>The MESD shared task focused on extracting key bibliographic metadata elements: title, authors,
abstract, keywords, Digital Object Identifier (DOI), publication venue, publication date, volume/issue
information, and page numbers. These elements form the foundation of discoverability and citation
networks in scholarly communication systems.</p>
      <p>2nd International Workshop on Natural Scientific Language Processing and Research Knowledge Graphs (NSLP 2025), co-located
with ESWC 2025, June 01–02, 2025, Portorož, Slovenia. * Corresponding author:
zeyd.boukhers@fit.fraunhofer.de (Z. Boukhers); cong.yang@suda.edu.cn (C. Yang);
https://zeyd.boukhers.com (Z. Boukhers); ORCID 0000-0001-9778-9164 (Z. Boukhers), 0000-0002-8314-0935 (C. Yang).
© 2025 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>In this paper, we describe the motivation, design, and execution of the MESD shared task, including the
creation of two specialized datasets and our evaluation methodology. The first dataset, S2ORC_Exp500v1,
consists of 500 training samples, 100 validation samples, and 100 test samples with text-based annotations
derived from the Semantic Scholar Open Research Corpus. The second dataset, the SSOAR Generated Multidisciplinary
Vision Dataset (SSOAR-GMVD), contains 50,000 documents with bounding box annotations suitable for
computer vision approaches, specifically targeting documents from disciplines known for challenging
layouts such as Social Sciences, Humanities, Law, and Administration.</p>
      <p>We outline the challenges inherent in metadata extraction from scholarly documents, discuss
evaluation considerations, and propose directions for future research, including the creation of FAIR Digital
Objects based on extracted metadata. By providing these resources and insights, we aim to stimulate
further innovation in making scholarly literature more discoverable and reusable.</p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews related work and the
challenges inherent in metadata extraction. Section 3 describes the MESD task in detail, including the
specific metadata elements targeted. Section 4 outlines the evaluation methodology, Section 5 summarizes
the task timeline and organization, Section 6 proposes future directions, and Section 7 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Metadata Extraction Techniques and Approaches</title>
        <p>
          Metadata extraction from scholarly documents has been an active area of research for over two decades,
with approaches evolving from rule-based systems to sophisticated machine learning techniques [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
This section reviews key developments in the field.
        </p>
        <p>
          Early efforts in metadata extraction relied primarily on handcrafted rules and templates. Systems like
ParsCit [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and GROBID [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] initially employed rule-based strategies to identify metadata from scholarly
documents. These approaches performed reasonably well for documents with consistent formatting but
struggled with the diversity of layouts and styles found across different publishers and disciplines [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ].
Despite these limitations, rule-based components continue to play a role in modern hybrid systems,
particularly for well-structured metadata elements like DOIs and dates [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ].
        </p>
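        <p>As an illustration of such a rule-based component, the sketch below shows how well-structured elements can be pulled out with regular expressions alone. The patterns are our own simplifications, not taken from any of the cited systems:</p>

```python
import re

# Simplified patterns: DOIs and ISO-format dates are regular enough
# that rules still perform well inside modern hybrid pipelines.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+", re.IGNORECASE)
DATE_RE = re.compile(r"\b(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")

def extract_structured_fields(text):
    """Return the first DOI and all ISO-format dates found in raw page text."""
    doi = DOI_RE.search(text)
    return {
        "doi": doi.group(0).rstrip(".,;)") if doi else None,  # trim trailing punctuation
        "dates": [m.group(0) for m in DATE_RE.finditer(text)],
    }
```

<p>A hybrid system would combine such rules with learned models for free-text fields like titles and abstracts, which resist simple pattern matching.</p>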
        <p>
          The limitations of rule-based systems led to the adoption of machine learning techniques for metadata
extraction. Conditional Random Fields (CRFs) became particularly popular for sequence labeling tasks
in document processing [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. CERMINE [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] employed a CRF-based approach for extracting metadata
from scientific literature, while Peng and McCallum [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] used CRFs for bibliographic reference extraction.
These statistical approaches offered better generalization to unseen document formats compared to
rule-based methods [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          With the advancement of deep learning, neural networks have increasingly been applied to metadata
extraction tasks. Bi-directional Long Short-Term Memory (BiLSTM) networks and their variants have
shown promising results [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. For instance, An et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] introduced a DNN-based segment sequence
labeling approach that outperformed traditional methods on standard datasets. Chiu and Nichols [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
applied BiLSTM-CNN models for named entity recognition, a technique that has since been adapted for
metadata extraction tasks. These deep learning approaches benefit from their ability to learn complex
patterns without extensive feature engineering [
          <xref ref-type="bibr" rid="ref14 ref16">14, 16</xref>
          ].
        </p>
        <p>
          Recognizing the importance of document layout and visual cues, some researchers have explored
computer vision techniques for metadata extraction. DeepPDF [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] pioneered the use of neural networks
for PDF document segmentation, treating the task as an image segmentation problem. More recently,
approaches like PubLayNet [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] have utilized mask region-based convolutional neural networks (Mask
R-CNN) to identify different components of scientific documents based on their visual appearance.
        </p>
        <p>
          MexPub [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] specifically addressed the challenges of extracting metadata from German scientific
publications by using Mask R-CNN to analyze document images. Ali et al. [20] explored computer
vision and machine learning approaches for metadata enrichment of historical newspaper collections.
These vision-based approaches have proven particularly effective for documents with complex layouts
or when text extraction is unreliable [21, 22].
        </p>
        <p>
          The most recent trend in metadata extraction involves multimodal approaches that combine textual
and visual features. Liu et al. [23] proposed a deep learning architecture that processes both the textual
content and the visual layout of documents. Similarly, Boukhers and Bouabdallah [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] introduced a
multimodal approach that simultaneously analyzes PDF documents as text and as images.
        </p>
        <p>
          Balasubramanian et al. [24] demonstrated the superiority of multimodal approaches over unimodal
ones in metadata extraction from video lectures, achieving significant improvements in precision
and recall. These multimodal approaches represent the current state of the art, as they leverage
complementary signals from both the textual content and visual layout of documents. However, they
often require larger datasets and more computational resources compared to unimodal approaches
[
          <xref ref-type="bibr" rid="ref2">2, 23</xref>
          ].
        </p>
        <p>Despite the importance of metadata extraction, there have been relatively few shared tasks specifically
focused on this challenge. The 1st Workshop on Scholarly Document Processing included shared tasks
on citation contextualization and style [25], but focused primarily on citation contexts rather than
document metadata extraction. The BiblioDAP workshop (The 1st Workshop on Bibliographic Data
Analysis and Processing) [26] addressed challenges in bibliographic data processing, including metadata
extraction and management, though it has a broader scope that includes citation analysis and bibliometric
studies as well.</p>
        <p>
          Datasets for metadata extraction have also been limited in size and scope. The UMass citation dataset
[27] and Cora reference string dataset [28] have been commonly used for evaluating citation extraction
systems, but they focus on bibliographic references rather than document metadata. The PubLayNet
dataset [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] provides document layout annotations but does not include specific metadata annotations.
The CiteSeerX dataset [29] offers a larger collection but with varying annotation quality [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          The S2ORC corpus [30] represents a significant advancement, providing a large dataset of open
access papers with associated metadata, though it was not specifically designed for training metadata
extraction systems. The GROBID dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] includes annotated documents for training but is primarily
focused on bibliographic references.
        </p>
        <p>The MESD shared task addresses these gaps by providing two complementary
datasets—S2ORC_Exp500v1 with detailed text annotations and SSOAR-MVD with computer
vision-oriented bounding box annotations—specifically designed for metadata extraction from
scholarly documents. By encompassing both textual and visual approaches, these datasets enable more
comprehensive evaluation of metadata extraction techniques.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Challenges in Metadata Extraction</title>
        <p>Metadata extraction from scholarly documents presents several significant challenges that make it an
interesting and complex research problem.</p>
        <p>
          <bold>Task Complexity and Resource Requirements.</bold> Metadata extraction from PDFs is an inherently
complex task that sits at the intersection of document layout analysis, information extraction, and
natural language processing. Developing effective solutions requires expertise in multiple domains and
potentially significant computational resources for training models on document images or structured
representations. This complexity is particularly evident when processing documents with non-standard
layouts [
          <xref ref-type="bibr" rid="ref7">31, 7</xref>
          ] or from multiple publishers with different formatting conventions [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ].
        </p>
        <p>
          <bold>Data Representation Challenges.</bold> The transition from visual PDF representation to structured
metadata involves multiple transformations, each introducing potential errors. Text extraction from
PDFs can suffer from issues like incorrect character recognition, disrupted reading order, and loss of
formatting cues that might indicate metadata boundaries. These challenges are compounded when
dealing with multi-column layouts [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], embedded figures and tables, or documents with complex
mathematical notation.
        </p>
        <p><bold>Evaluation Considerations.</bold> The evaluation based on Levenshtein Similarity with a 95% threshold
represents a stringent standard that acknowledges the importance of accuracy in metadata extraction
while allowing for minor variations in text representation. This approach balances the need for precise
extraction with the practical realities of processing diverse document formats.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. MESD Task</title>
      <p>The MESD shared task focused on the extraction of nine predefined metadata elements from scholarly
documents:
• Title: The main title of the publication, including any subtitle separated by a colon or dash.
• Authors: All named contributors, typically appearing after the title. Multiple authors are
separated by commas or “and”.
• Abstract: The summary paragraph(s) typically appearing at the beginning of the paper, often
preceded by an "Abstract" header. When abstract content overlaps with keywords, we consider
only the text preceding any keyword listing.
• Keywords: Terms indicating the paper’s subject matter, typically appearing as a list after the
abstract, often preceded by “Keywords:” or similar indicators.
• DOI: The persistent identifier in the format “10.xxxx/xxxxx” or “https://doi.org/10.xxxx/xxxxx”.
• Publication venue: The journal name, conference proceedings, or book title where the paper
appeared. We consider the primary publication container (e.g., "Journal of Informatics") as the
venue, while conference proceedings details are included in volume/issue information.
• Publication date: Any date information related to publication, including online availability,
print dates, or submission dates.
• Volume/issue information: Numeric or alphanumeric identifiers for the specific volume or
issue (e.g., "Vol. 42, No. 3").
• Page numbers: The range or single page indicating the paper’s location within a larger
publication (e.g., "pp. 123-145").</p>
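      <p>For concreteness, the nine target elements can be pictured as a simple record type. The field names below are our own illustration and were not prescribed by the task:</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MesdRecord:
    """One extraction result; any element may be absent in a given document."""
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    abstract: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    doi: Optional[str] = None
    venue: Optional[str] = None
    publication_date: Optional[str] = None
    volume_issue: Optional[str] = None
    pages: Optional[str] = None
```

<p>Defaulting every field to absent mirrors the task requirement that systems must explicitly handle missing elements such as a missing DOI or absent keyword list.</p>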
      <p>Participants were challenged to develop systems capable of identifying and extracting these elements
from the PDF documents. A key aspect of the task was handling the variability in document structures
and formats, as well as dealing with potential missing elements (e.g., some documents might not include
a DOI or explicit keywords). The task was designed to simulate real-world scenarios where metadata
extraction systems must process documents from diferent publishers, domains, and time periods. The
evaluation was based on the system’s ability to correctly identify the presence or absence of each
metadata element and accurately extract the text content when present.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Creation</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. S2ORC_Exp500v1 Dataset</title>
          <p>Two datasets were created specifically for the MESD shared task. The source documents of the
S2ORC_Exp500v1 dataset
were selected from the S2ORC (Semantic Scholar Open Research Corpus) [30], which provides a large
collection of open-access scientific papers. The selection process aimed to ensure diversity in research
domains, publication years, and document formats. For each document in the dataset, we prepared:
• The original PDF file
• The extracted text (using a standardized extraction tool)
• A metadata file containing the nine predefined labels and their locations in the extracted text
The metadata annotations were created through a semi-automated process. For each document:
• Using the DOI from S2ORC, additional metadata was retrieved from CrossRef [33] via their API.
• PDFs were downloaded when available.
• Text was extracted from the first page of each PDF and normalized to eliminate irregularities
such as inconsistent spacing and line breaks.
• Both exact and fuzzy matching techniques were employed to extract critical metadata elements
(title, authors, abstract, etc.).
• For each identified metadata element, positions were determined based on their locations within
the extracted text.</p>
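          <p>The matching step described above can be sketched as follows. This is a minimal illustration using Python's difflib, not the annotation pipeline's actual code, and the 0.9 window ratio is an assumed parameter:</p>

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Collapse inconsistent spacing and line breaks into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def locate(needle, haystack, min_ratio=0.9):
    """Find `needle` in `haystack`: exact match first, then a fuzzy
    fixed-width sliding window (O(n*m), acceptable for single first pages)."""
    needle, haystack = normalize(needle), normalize(haystack)
    start = haystack.find(needle)
    if start != -1:
        return start, start + len(needle)
    best, best_ratio, w = None, min_ratio, len(needle)
    for i in range(max(1, len(haystack) - w + 1)):
        ratio = SequenceMatcher(None, needle, haystack[i:i + w]).ratio()
        if ratio > best_ratio:
            best, best_ratio = (i, i + w), ratio
    return best
```

<p>Trying exact matching on normalized text first keeps the common case cheap; the fuzzy fallback absorbs OCR noise and hyphenation differences between the CrossRef record and the extracted page text.</p>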
          <p>The extracted metadata underwent a simple human validation step to verify accuracy and correct
any extraction errors, ensuring high-quality annotations.</p>
          <p><bold>3.1.2. SSOAR Generated Multidisciplinary Vision Dataset (SSOAR-GMVD)</bold></p>
          <p>The SSOAR-GMVD<sup>1</sup> dataset contains approximately 44,000 papers with both German and English
content. We began by randomly selecting 100 German scientific papers from publications available in
the SSOAR repository<sup>2</sup>, representing various layouts and styles. During the manual annotation phase,
we identified the 28 most common layouts across these publications.</p>
          <p>To expand our dataset, we developed an automated approach to generate synthetic papers based on
these identified layouts. We randomly extracted metadata records from SSOAR, DBLP, and a list of
scientific affiliations from Wikipedia. For each of the 28 common layouts, we generated an average of
1,600 synthetic papers by randomly inserting the extracted metadata at their corresponding positions on
the first page of document templates. This approach allowed us to create a diverse and representative
dataset while maintaining layout consistency with real-world scientific publications.</p>
          <p>For the shared task evaluation, we utilized a subset of 8,518 documents from the complete collection,
which were fully annotated with bounding box information. This subset contains over 2 million words,
with approximately 1.6 million labeled words (79.2% of the total content). The average document in this
subset contains 241 words, with 191 labeled for metadata extraction.</p>
          <p>The class distribution in the annotated subset shows a predominance of abstract content (46.87% of
all labeled words), followed by titles (4.97%), author names (3.75%), journal information (1.47%), and
affiliations (0.96%). Structured elements like DOIs (0.08%) and email addresses (0.04%) constitute a
smaller but critical portion of the dataset.</p>
          <p>The SSOAR-GMVD dataset additionally includes affiliations in 8,242 documents (96.8%) and email
addresses in 3,407 documents (40.0%). While affiliations were not included in the primary evaluation
metrics, this additional element provides valuable information for institutional analysis and author
disambiguation.</p>
          <p>The dataset spans a wide temporal range, with publications dating from 1900 to 2020, though most
documents (approximately 54%) were published between 2000 and 2010. The distribution across publishers
shows significant diversity, with content from Nature (1.03%), SAGE (0.20%), and numerous specialized
publishers in the social sciences. For evaluation purposes, the dataset was divided into training (70%),
validation (15%), and testing (15%) splits.</p>
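          <p>A reproducible way to produce splits in these proportions is sketched below; the seed is arbitrary and the task's actual partition is fixed in the released data, so this is only illustrative:</p>

```python
import random

def split_documents(doc_ids, seed=0):
    """Shuffle document ids and split them 70/15/15 into train/val/test."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_train, n_val = int(0.70 * len(ids)), int(0.15 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```
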
        </sec>
        <sec id="sec-3-1-2">
          <p>1: https://github.com/zeyd31/SSOR_GMVD; 2: https://www.gesis.org/en/ssoar/home</p>
          <p>The SSOAR-GMVD dataset is particularly valuable for approaches that leverage computer vision
techniques, as it provides pixel-level bounding box annotations for metadata elements. The dataset’s
focus on disciplines with challenging layout formats (Social Sciences, Humanities, Law, and
Administration) makes it especially useful for testing the robustness of metadata extraction systems across
diverse document templates.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Methodology</title>
      <p>The evaluation of submissions was designed to be comprehensive, incorporating multiple metrics to
assess different aspects of metadata extraction performance:
• Accuracy: The proportion of metadata elements correctly identified as present or absent.
• Precision: The proportion of extracted metadata that correctly matched the gold standard.
• Recall: The proportion of gold standard metadata that was correctly extracted.
• F1 score: The harmonic mean of precision and recall, serving as the primary ranking metric.</p>
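      <p>Under one plausible reading of these definitions (our own operationalization, not the official scorer), per-element outcomes and the four metrics could be computed as:</p>

```python
def score_element(pred, gold, match):
    """Classify one element extraction; `match` compares two strings
    (e.g. the Levenshtein-similarity check described in this section)."""
    if gold is not None and pred is not None:
        return "tp" if match(pred, gold) else "fp"
    if pred is not None:
        return "fp"   # extracted something the document does not have
    if gold is not None:
        return "fn"   # missed an element that is present
    return "tn"       # correctly reported the element as absent

def summarize(outcomes):
    """Aggregate tp/fp/fn/tn outcomes into the four reported metrics."""
    tp, fp = outcomes.count("tp"), outcomes.count("fp")
    fn, tn = outcomes.count("fn"), outcomes.count("tn")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(outcomes) if outcomes else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

<p>Counting a wrong extraction of a present element as a false positive is our assumption; other conventions (e.g. counting it as both a false positive and a false negative) are equally defensible.</p>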
      <p>A key feature of the evaluation was using Levenshtein Similarity to assess the match between extracted
metadata and the gold standard. Rather than requiring exact matches, the evaluation considered an
extraction correct if it achieved at least 95% Levenshtein Similarity with the reference text. This approach
acknowledged the inherent challenges in extracting metadata from PDFs, where minor differences
in whitespace, punctuation, or character encoding might occur without significantly affecting the
semantic content. The evaluation function was provided to participants in the main repository to ensure
transparency and to allow for consistent self-assessment during system development.</p>
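      <p>The 95% criterion can be made concrete with a standard edit-distance implementation. This is our sketch of the idea, not the evaluation function distributed to participants, and it applies no text normalization:</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similar_enough(pred, gold, threshold=0.95):
    """Levenshtein Similarity = 1 - distance / max length; accept at >= threshold."""
    if not gold:
        return not pred
    return 1 - levenshtein(pred, gold) / max(len(pred), len(gold)) >= threshold
```

<p>Under this definition a single stray character is tolerated in a 20-character title but rejects a 7-character one, which matches the intent of allowing only minor whitespace or encoding variations.</p>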
    </sec>
    <sec id="sec-5">
      <title>5. Task Timeline and Organization</title>
      <p>The MESD shared task followed this timeline:
• Release of training datasets: January 27, 2025
• Release of testing datasets: February 15, 2025 (extended from original plan)
• Deadline for system submissions: March 4, 2025 (extended from February 25, 2025)
• Announcement of results: March 6, 2025 (extended from February 27, 2025)
• Paper submission deadline: March 6, 2025
• Notification of acceptance: April 3, 2025
• Camera-ready submission: April 17, 2025
• Workshop: June 1 or 2, 2025</p>
      <p>The task was organized by a team from Fraunhofer FIT, Germany, with support from the Natural
Language Processing community. The datasets, evaluation scripts, and submission instructions were
made available through a dedicated repository, and participants were encouraged to contact the
organizers with questions or clarifications. Participants were allowed to use either or both datasets for their
system development, but the final evaluation was conducted on the test portions of both datasets to
assess performance across different document types and annotation styles.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Directions</title>
      <p>
        While detailed baseline performance on these datasets is not included in this paper, we refer readers to
our companion work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which provides comprehensive evaluation of several approaches including
established tools like GROBID, classical machine learning approaches, deep learning models and large
language models. That analysis quantitatively demonstrates the challenges in metadata extraction
across different document types and layouts, particularly for multilingual content and documents from
disciplines with less standardized formatting.
      </p>
      <p>Based on our experience in designing the MESD shared task and analyzing the challenges in metadata
extraction, we propose several considerations for future initiatives in this area:
• Broader Task Definition: Expanding the scope to include additional metadata elements or
related tasks, such as citation extraction or bibliographic reference parsing, could increase the
appeal and applicability of research in this domain.
• Tiered Evaluation: Implementing a tiered evaluation framework that acknowledges different
levels of extraction difficulty could provide more nuanced assessment and better reflect the
complexity of the task.
• Hybrid Approaches: Encouraging the development of methods that combine rule-based
techniques with machine learning models might better address the variety of document formats and
metadata representations.
• Integration with Existing Workflows: Aligning extraction techniques more closely with
existing scholarly document processing workflows and tools could enhance their practical relevance
and facilitate adoption of the resulting technologies.
• Multimodal Approaches: The SSOAR-GMVD dataset opens opportunities for exploring
computer vision techniques in conjunction with text-based methods, potentially leading to more
robust metadata extraction systems that can leverage both the visual and textual aspects of
documents.
• Cross-lingual Extensions: Expanding the datasets to include non-English documents would
address the important challenge of extracting metadata from multilingual scholarly literature.
• FAIR Digital Objects: Extracted metadata can serve as the foundation for creating FAIR Digital
Objects (FDOs) of scholarly articles. FDOs represent a paradigm for making digital content
machine-actionable through persistent identifiers, type definitions, and rich metadata. By
transforming extracted metadata into standardized FDO representations, scholarly documents can be
seamlessly integrated into knowledge graphs, research infrastructures, and scientific workflows.
This would enhance not only the discoverability of articles but also enable automated reasoning
and integration with computational tools, further advancing the FAIR principles in scholarly
communication.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>The MESD shared task was conceived to address an important challenge in scientific information
management: the extraction of metadata from scholarly documents to enhance their findability and
reusability. The task design, dataset creation, and evaluation methodology reflect the complexity and
importance of this problem. The creation of two complementary datasets—S2ORC_Exp500v1 with detailed
text annotations and SSOAR-GMVD with computer vision-oriented bounding box annotations—provides
valuable resources for researchers approaching the problem from different angles. We believe these
datasets will enable more robust comparisons between different methodological approaches to metadata
extraction. The need for effective metadata extraction from scholarly documents remains pressing in the
scientific community. We hope that the resources developed for the MESD shared task will contribute
to ongoing research in this area. Future initiatives can build on these foundations to advance the state
of the art in scholarly document processing and bring us closer to a fully FAIR scholarly ecosystem.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>While preparing this work, the authors used AI-assisted tools such as ChatGPT and Grammarly to
check grammar and spelling, as well as to paraphrase and reword. After using these tools, the authors
reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
      <sec id="sec-8-1">
        <title>References</title>
        <p>[20] D. Ali, K. Milleville, S. Verstockt, N. Van de Weghe, S. Chambers, J. M. Birkholz, Computer vision
and machine learning approaches for metadata enrichment to improve searchability of historical
newspaper collections, Emerald Publishing Limited, 2023.
[21] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: 2017 IEEE International Conference on
Computer Vision (ICCV), 2017, pp. 2980–2988. doi:10.1109/ICCV.2017.322.
[22] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks
for object detection, in: Proceedings of the IEEE conference on computer vision and pattern
recognition, 2017, pp. 2117–2125.
[23] R. Liu, L. Gao, D. An, Z. Jiang, Z. Tang, Automatic document metadata extraction based on deep
networks, in: X. Huang, J. Jiang, D. Zhao, Y. Feng, Y. Hong (Eds.), Natural Language Processing
and Chinese Computing, Springer International Publishing, Cham, 2018, pp. 305–317.
[24] V. Balasubramanian, S. G. Doraisamy, N. K. Kanakarajan, A multimodal approach for extracting
content descriptive metadata from lecture videos, J. Intell. Inf. Syst. 46 (2016) 121–145. URL:
https://doi.org/10.1007/s10844-015-0356-5. doi:10.1007/s10844-015-0356-5.
[25] M. K. Chandrasekaran, G. Feigenblat, D. Freitag, T. Ghosal, E. Hovy, P. Mayr, M. Shmueli-Scheuer,
A. de Waard, Overview of the first workshop on scholarly document processing (sdp), in:
Proceedings of the first workshop on scholarly document processing, 2020, pp. 1–6.
[26] Z. Boukhers, P. Mayr, S. Peroni, Bibliodap’21: The 1st workshop on bibliographic data analysis
and processing, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery
&amp; Data Mining, 2021, pp. 4110–4111.
[27] S. Anzaroot, A. Mccallum, A new dataset for fine-grained citation field extraction, ICML Workshop
on Peer Reviewing and Publishing Models. (2013).
[28] K. Seymore, A. Mccallum, R. Rosenfeld, Learning hidden markov model structure for information
extraction, in: In AAAI 99 Workshop on Machine Learning for Information Extraction, 1999, pp.
37–42.
[29] H. Li, I. Councill, W.-C. Lee, C. L. Giles, Citeseerx: an architecture and web service design for an
academic document search engine, in: Proceedings of the 15th international conference on World
Wide Web, 2006, pp. 883–884.
[30] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. S. Weld, S2orc: The semantic scholar open research
corpus, arXiv preprint arXiv:1911.02782 (2020).
[31] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE international
conference on computer vision, 2017, pp. 2961–2969.
[32] D. An, L. Gao, Z. Jiang, R. Liu, Z. Tang, Citation metadata extraction via deep neural network-based
segment sequence labeling, in: Proceedings of the 2017 ACM on Conference on Information and
Knowledge Management, 2017, pp. 1967–1970.
[33] G. Hendricks, D. Tkaczyk, J. Lin, P. Feeney, Crossref: The sustainable source of community-owned
scholarly metadata, Quantitative Science Studies 1 (2020) 414–427.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Boukhers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Beili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Zafar</surname>
          </string-name>
          , Mexpub:
          <article-title>Deep transfer learning for metadata extraction from german publications</article-title>
          ,
          <source>in: 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>250</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Boukhers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouabdallah</surname>
          </string-name>
          ,
          <article-title>Vision and natural language for metadata extraction from scientific pdf documents: a multimodal approach</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Aalbersberg</surname>
          </string-name>
          , G. Appleton,
          <string-name>
            <given-names>M.</given-names>
            <surname>Axton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Blomberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Boiten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. B. da Silva</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Bourne</surname>
          </string-name>
          , et al.,
          <article-title>The fair guiding principles for scientific data management and stewardship</article-title>
          ,
          <source>Scientific data 3</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Boukhers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Beyond feature learning: Textmap and its comparative performance against traditional methods and large language models for pdf metadata extraction</article-title>
          ,
          <source>arXiv preprint arXiv:2501.05082</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I. G.</given-names>
            <surname>Councill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          , M.-Y. Kan
          ,
          <article-title>Parscit: an open-source crf reference string parsing package</article-title>
          ,
          <source>LREC</source>
          , Vol.
          <volume>8</volume>
          . (
          <year>2008</year>
          )
          <fpage>661</fpage>
          -
          <lpage>667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] GROBID, 2008–2021. swh:1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] M.-Y. Day, R. T.-H. Tsai, C.-L. Sung, C.-C. Hsieh, C.-W. Lee, S.-H. Wu, K.-P. Wu, C.-S. Ong, W.-L. Hsu,
          <article-title>Reference metadata extraction using a hierarchical knowledge representation framework</article-title>
          ,
          <source>Decision Support Systems 43</source>
          (
          <year>2007</year>
          )
          <fpage>152</fpage>
          -
          <lpage>167</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0167923606001205. doi:10.1016/j.dss.2006.08.006.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kawtrakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yingsaeree</surname>
          </string-name>
          ,
          <article-title>A unified framework for automatic metadata extraction from electronic document</article-title>
          ,
          <source>in: Proceedings of The International Advanced Digital Library Conference. Nagoya, Japan</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tkaczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szostek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fedoryszak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Dendek</surname>
          </string-name>
          , L. Bolikowski, Cermine:
          <article-title>Automatic extraction of structured metadata from scientific literature</article-title>
          ,
          <source>Int. J. Doc. Anal. Recognit</source>
          .
          <volume>18</volume>
          (
          <year>2015</year>
          )
          <fpage>317</fpage>
          -
          <lpage>335</lpage>
          . URL: https://doi.org/10.1007/s10032-015-0249-8. doi:10.1007/s10032-015-0249-8.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rastegar-Mojarad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Afzal</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehrabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sohn</surname>
          </string-name>
          , et al.,
          <article-title>Clinical information extraction applications: a literature review</article-title>
          ,
          <source>Journal of biomedical informatics 77</source>
          (
          <year>2018</year>
          )
          <fpage>34</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Information extraction from research papers using conditional random fields</article-title>
          ,
          <source>Inf. Process. Manage</source>
          .
          <volume>42</volume>
          (
          <year>2006</year>
          )
          <fpage>963</fpage>
          -
          <lpage>979</lpage>
          . URL: https://doi.org/10.1016/j.ipm.2005.09.002. doi:10.1016/j.ipm.2005.09.002.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Moreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heuser</surname>
          </string-name>
          ,
          <article-title>Arctic: metadata extraction from scientific papers in pdf using two-layer crf</article-title>
          (
          <year>2014</year>
          )
          <fpage>121</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Bidirectional lstm-crf models for sequence tagging</article-title>
          ,
          <source>CoRR abs/1508.01991</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Citation metadata extraction via deep neural network-based segment sequence labeling</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management</source>
          , CIKM '17, Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , p.
          <fpage>1967</fpage>
          -
          <lpage>1970</lpage>
          . URL: https://doi.org/10.1145/3132847.3133074. doi:10.1145/3132847.3133074.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. P. C.</given-names>
            <surname>Chiu</surname>
          </string-name>
          , E. Nichols,
          <article-title>Named entity recognition with bidirectional lstm-cnns</article-title>
          ,
          <source>CoRR abs/1511.08308</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1511.08308.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Nayaka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranjan</surname>
          </string-name>
          ,
          <article-title>An efficient framework for metadata extraction over scholarly documents using ensemble cnn and bilstm technique</article-title>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Stahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Herrmannova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Patton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Wells</surname>
          </string-name>
          ,
          <article-title>Deeppdf: A deep learning approach to extracting text from pdfs</article-title>
          (
          <year>2018</year>
          ). URL: https://www.osti.gov/biblio/1460210.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Yepes</surname>
          </string-name>
          ,
          <article-title>Publaynet: largest dataset ever for document layout analysis</article-title>
          (
          <year>2019</year>
          )
          <fpage>1015</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Boukhers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Beili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Zafar</surname>
          </string-name>
          , Mexpub:
          <article-title>Deep transfer learning for metadata extraction from german publications</article-title>
          , in: 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), IEEE, 2021, pp. 250-253.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>