<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NER for Specialized Scientific Domains: Fine-Tuning on Patents for Plasma Technology and Battery Materials</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Farag Saad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hidir Aras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus M. Becker</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carsten Becker-Willinger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz Institute for Information Infrastructure (FIZ Karlsruhe)</institution>
          ,
          <addr-line>Hermann-von-Helmholtz Platz 1, 76344 Eggenstein-Leopoldshafen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leibniz Institute for New Materials (INM)</institution>
          ,
          <addr-line>Campus D2 2, 66123 Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leibniz Institute for Plasma Science and Technology (INP)</institution>
          ,
          <addr-line>Felix-Hausdorff-Str. 2, 17489 Greifswald</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>10</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Domain-specific Named Entity Recognition (NER) enables the identification and extraction of specific types of entities from text. In technical domains such as plasma technology and battery materials in particular, extracting such entities and aligning them with complex (structured) semantic information, such as in Knowledge Graphs (KG), plays a crucial role. In this work, we fine-tuned SciBERT, BERT-for-Patents, and BatteryBERT for domain-specific NER based on systematically constructed annotated datasets specific to the regarded domains. Despite the relatively limited size of the training data, particularly for battery materials, the models achieved strong overall performance. By leveraging the linguistic knowledge encoded in the pretrained models, combined with domain-specific patterns learned from the training datasets, the developed models effectively identified and classified entities based on their contextual usage. Our evaluation demonstrated that fine-tuning domain-adapted pretrained models significantly enhances NER effectiveness in specialized scientific and technological domains.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition (NER)</kwd>
        <kwd>NLP</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Plasma Technology</kwd>
        <kwd>Battery Materials</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP) that involves
identifying and classifying entities within unstructured text. In scientific and technical domains such as
Plasma Technology (PT) and Battery Materials (BM), NER plays a critical role in extracting structured
information from complex texts and aligning it with explicit semantic models such as in Knowledge
Graphs. However, applying NER to these specialized fields presents unique challenges. Patent literature,
in particular, is filled with ambiguous terms, unconventional phrasing, and rapidly evolving
terminology, making entity recognition difficult [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Additionally, there is a limited availability of
annotated corpora, further complicating the task.
      </p>
      <p>Efective NER models must therefore be tailored to handle domain-specific complexities and trained
on high-quality annotated datasets. Accurate identification of entities is essential for enabling advanced
downstream tasks such as knowledge graph construction, literature mining, and patent analysis.
Consequently, improving NER in PT and BM domains supports the broader goal of making scientific
information more accessible and interoperable.</p>
      <p>
        Recent adaptations of BERT (Bidirectional Encoder Representations from Transformers) model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
such as SciBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], BatteryBERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and BERT-for-Patents [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], have shown promise in specialized
scientific and technical domains. However, few studies have focused on the application of these models
to PT and BM, particularly in the context of patent literature. Existing approaches lack the customization
needed to address the specific challenges of these domains, such as the frequent emergence of new
terms and the abstract writing style common in patents.
      </p>
      <p>In this paper, we develop an approach to capture the unique terminology and context of PT and BM
within patent texts by leveraging domain-specific BERT variants, namely SciBERT, BERT-for-Patents,
and BatteryBERT. To achieve this, we have curated high-quality, domain-specific training datasets
(see Section 3) that accurately reflect the specific nature of these technologies. By fine-tuning these
BERT variants, we train several NER models to effectively extract and classify entities relevant to these
specialized fields.</p>
      <p>This methodology allows us to address the unique challenges posed by the specialized language and
rapidly evolving terminologies present in patent documents, ultimately contributing to more accurate
and efficient information extraction in these critical areas of research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Traditional NER approaches, including rule-based, statistical, and hybrid methods, have been used
in technical domains but often struggle with the complexity of domain-specific language. These
methods typically rely on predefined rules or feature engineering, which makes them less adaptable
to the dynamic and evolving vocabulary found in technical domains [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. While dictionary-based
approaches and pattern-matching methods have shown some success in specific domains, they often
lack the generalizability needed for broader applications [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The emergence of deep learning models,
particularly transformer-based architectures such as BERT, has significantly advanced NER performance.
These models are capable of learning contextual dependencies and handling subword-level semantics,
which are critical for understanding the complex material compositions and experimental parameters
found in engineering and scientific texts. However, pretrained models such as BERT often struggle with
domain-specific terminology and specialized language used in patent documents, where terminology
may be newly invented or ambiguous.
      </p>
      <p>
        Deep learning approaches have significantly improved NER performance, as they alleviate the need for
extensive feature engineering by learning contextual dependencies [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, these models still face limitations in capturing long-range
dependencies and incorporating subword-level semantics, both of which are critical for accurately
parsing complex material structures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Recent advances in transformer-based architectures, in particular
the BERT model, have revolutionized NER tasks by offering contextualized embeddings and enabling
fine-tuning on downstream applications. Domain-specific BERT variants such as SciBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], BatteryBERT
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and BERT-for-Patents [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have been developed to capture the specificity of scientific and
materials-specific terminology, yielding significant improvements over general-purpose models.
      </p>
      <p>SciBERT is a pretrained language model specifically designed for scientific texts, demonstrating
superior performance over general-purpose models like BERT when applied to scientific content. Trained
on a vast collection of scientific papers, SciBERT effectively captures domain-specific language,
terminology, and syntactic structures. The model has shown significant improvements in various scientific
tasks, including scientific paper classification, named entity recognition (NER), and relation extraction.
Its ability to understand complex scientific language makes it particularly valuable for domains with
specialized vocabularies, such as biomedical research and chemistry. SciBERT’s architecture allows it to
generalize well across different scientific disciplines while maintaining high accuracy in domain-specific
tasks. As a result, SciBERT is widely adopted in natural language processing pipelines tailored to
scientific literature analysis.</p>
      <p>BatteryBERT, developed by Huang and Cole (2022), is a domain-specific variant of the BERT model that
has been fine-tuned specifically for the field of battery research, capturing the unique terminology and
concepts within this domain. As a result, BatteryBERT significantly outperforms general-purpose models
in extracting crucial information from battery-related texts. Its training on a specialized corpus enables
it to better recognize complex technical entities and fine-grained semantic distinctions commonly found
in battery materials literature. This makes it particularly effective for tasks such as material property
extraction and electrode classification. Moreover, BatteryBERT’s focused pretraining helps reduce
errors caused by ambiguous terms and enhances its ability to interpret context-specific complexities.
This specialization ultimately leads to more reliable and accurate fine-tuning for information extraction
in battery research applications.</p>
      <p>BERT-for-Patents is a specialized variant of the BERT model, designed specifically for patent
documents. By leveraging BERT’s capabilities, this model has been fine-tuned to understand the unique
terminology and structure of patent texts, offering significant improvements over general-purpose
language models. BERT-for-Patents has been demonstrated to enhance tasks such as patent classification
and text mining, which are essential for efficient patent analysis and retrieval. Its training on a large corpus
of patent literature allows it to capture domain-specific language patterns, legal jargon, and complex
sentence structures that are typical in patents. This specialization allows for more accurate fine-tuning
in tasks like entity recognition and relationship extraction, outperforming generic language models in
the patent domain. As a result, the model can be effectively adapted to various technical fields within
patents, making it a powerful and flexible tool for understanding intellectual property.</p>
      <p>
        Building on these specialized BERT variants, recent research has focused on fine-tuning
models such as SciBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], BioBERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and BlueBERT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] on domain-specific corpora. These
fine-tuned models have demonstrated significant performance improvements across various low-resource
scientific domains. For instance, Rostam and Kertész (2024) fine-tuned BERT-based models for scientific
text classification tasks and found that domain-specific models like SciBERT consistently outperformed
general-purpose models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Despite significant advancements in BERT-based NER, its application in
highly specialized domains such as PT and BM remains largely underexplored. Fine-tuning pretrained
scientific language models on domain-specific corpora presents a promising strategy to bridge this gap.
      </p>
      <p>In this work, we investigate the effectiveness of fine-tuning transformer-based models for NER
tasks using patent documents from the PT and BM domains. To support this effort, we introduce a
domain-specific annotated corpus and fine-tune multiple BERT-based model variants to enhance entity
recognition performance. Our approach offers a comprehensive solution for information extraction in
these complex and rapidly advancing scientific and technological fields. To the best of our knowledge,
this is the first study to investigate NER in the domains of PT and BM within the context of
patent literature.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Training Data Construction</title>
      <p>Developing a robust NER model for specialized domains requires high-quality, domain-specific training
data. Given the lack of publicly available annotated corpora for the regarded domains, particularly
within patent literature, we constructed our own annotated dataset by systematically labeling the titles
and abstracts of selected patent documents from each domain.</p>
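      <p>For illustration, converting such document-level annotations into training labels can be sketched as follows, assuming a BIO tagging scheme over character-offset annotations (both are common choices; the helper name and example spans are hypothetical, not taken from our pipeline):</p>

```python
def spans_to_bio(token_offsets, entity_spans):
    """Convert character-level entity annotations into per-token BIO labels.

    token_offsets: (start, end) character offsets, one pair per token.
    entity_spans:  (start, end, entity_type) gold annotations.
    """
    labels = []
    for tok_start, tok_end in token_offsets:
        label = "O"
        for ent_start, ent_end, ent_type in entity_spans:
            if ent_start <= tok_start < ent_end:
                # first token of an entity gets B-, continuation tokens I-
                label = ("B-" if tok_start == ent_start else "I-") + ent_type
                break
        labels.append(label)
    return labels

# "argon plasma jet" -> tokens at character offsets 0-5, 6-12, 13-16
print(spans_to_bio([(0, 5), (6, 12), (13, 16)],
                   [(0, 5, "Plasma Medium"), (6, 16, "Plasma Source")]))
```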
      <p>To initiate the annotation process, we first prepared an initial list of seed entities relevant to PT and BM.
These seed entities served as the basis for pre-annotating the corpus, which was subsequently reviewed
and corrected by domain experts using the Prodigy annotation tool1 (see Figure 1), thereby reducing
the manual workload for human annotators. Candidate entity lists were automatically extracted from
two domain-specific Wikipedia categories: Plasma Physics2 and Battery (Electricity)3. While these
categories provided a broad range of potential entities central to PT and BM, not all extracted entities
were relevant. Therefore, a filtering step was introduced to ensure that only relevant entities were selected.</p>
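      <p>The seed-entity pre-annotation step amounts to dictionary matching. A minimal sketch is given below (the function name and seed terms are illustrative; Prodigy’s actual pattern matching is more elaborate):</p>

```python
import re

def pre_annotate(text, seed_entities):
    """Pre-annotate text with character spans for every seed-entity match.

    seed_entities maps a surface form to its entity type. Longer surface
    forms are matched first so they are not shadowed by nested shorter ones.
    """
    spans = []
    for surface, label in sorted(seed_entities.items(), key=lambda kv: -len(kv[0])):
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text, re.IGNORECASE):
            # keep a candidate only if it does not overlap an accepted span
            if not any(m.start() < end and start < m.end() for start, end, _, _ in spans):
                spans.append((m.start(), m.end(), m.group(0), label))
    return sorted(spans)

seeds = {"plasma jet": "Plasma Source", "argon": "Plasma Medium"}
text = "An atmospheric-pressure plasma jet operated with argon is disclosed."
print(pre_annotate(text, seeds))
```

      <p>Experts then only need to correct or extend these machine-proposed spans rather than annotate from scratch.</p>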
      <p>
        The filtering process involved matching the extracted entities against a corpus of patent texts related
to PT and BM domains. Entities were ranked by their frequency of occurrence to prioritize entities
that were more likely to be relevant to patent literature. Furthermore, domain-specific relevance was
verified by checking that each entity’s context aligned with the scope of PT and BM technologies,
respectively. A shortlist of entities was manually reviewed by domain experts to ensure accuracy.
During this review, irrelevant entities were excluded, and additional entities were incorporated using
expert knowledge and external resources, including relevant ontologies that ofered a comprehensive
view of the domain-specific vocabulary [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
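      <p>The frequency-based ranking used in this filtering step can be illustrated as follows (a simplified sketch; the candidate terms, threshold, and matching strategy are illustrative assumptions):</p>

```python
from collections import Counter
import re

def rank_candidates(candidates, patent_texts, min_freq=2):
    """Rank candidate entities by how often they occur in the patent corpus;
    candidates below min_freq are dropped as likely irrelevant."""
    corpus = " ".join(t.lower() for t in patent_texts)
    freq = Counter()
    for cand in candidates:
        freq[cand] = len(re.findall(r"\b" + re.escape(cand.lower()) + r"\b", corpus))
    return [(c, n) for c, n in freq.most_common() if n >= min_freq]

candidates = ["glow discharge", "magnetohydrodynamics", "electrode"]
texts = [
    "A glow discharge between a first electrode and a second electrode.",
    "The glow discharge treats the electrode surface.",
]
print(rank_candidates(candidates, texts))
```

      <p>Terms that never or rarely occur in the patent corpus (here "magnetohydrodynamics") fall below the threshold and are excluded before expert review.</p>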
      <p>This procedure and the ontology-based selection of core entity types also supported a high inter-annotator
agreement (IAA), which was evaluated by adopting a practical and widely accepted approach based
on a representative sample of the training data. This strategy aligns with best practices used in
large-scale annotation projects, such as the methodology followed by Google Healthcare, where IAA is
typically calculated on a subset (usually 5–20%) to assess annotator consistency without duplicating
annotation efforts across the full dataset4. We began with a preliminary session in which annotators
collaboratively reviewed the annotation guidelines, discussed ambiguous cases, and resolved potential
points of confusion. Since the core entity types were selected based on well-accepted concepts of
the respective research domains, a high IAA of &gt;80% was achieved from the beginning, and this phase
mostly resolved technical questions related to the annotation process. It should be noted that this
resource-efficient process, which provided a common understanding of the task, did not allow us to
measure the IAA for the entire dataset. Nevertheless, it ensured the consistency and quality of the
annotations, similar to the phased training and review approach used in Google Healthcare’s annotation
framework.
1https://prodi.gy/
2https://en.wikipedia.org/wiki/Category:Plasma_physics
3https://en.wikipedia.org/wiki/Category:Electric_battery</p>
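      <p>As a back-of-the-envelope illustration, token-level observed agreement on such a sample can be computed as below (raw percent agreement; chance-corrected measures such as Cohen’s kappa are also common — the label sequences shown are made up):</p>

```python
def percent_agreement(labels_a, labels_b):
    """Observed token-level agreement between two annotators on a shared sample."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# hypothetical double-annotated sample (e.g. 5-20% of the corpus)
annotator_1 = ["O", "B-Plasma Source", "I-Plasma Source", "O", "B-Device"]
annotator_2 = ["O", "B-Plasma Source", "I-Plasma Source", "O", "O"]
print(f"IAA: {percent_agreement(annotator_1, annotator_2):.0%}")
```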
      <p>The annotation process was iterative. In each round, the newly annotated entities from the previous
iteration were incorporated into the pre-annotation pipeline, allowing for continuous refinement. This
iterative approach not only improved the accuracy and coverage of the dataset but also enabled the
corpus to expand with new entities that emerged during the annotation process. Over time, this iterative
refinement helped capture emerging trends and terminology, further enhancing the relevance and
quality of the dataset.</p>
      <p>
        The final set of entity types was defined in collaboration with domain experts. For PT, the core entity
types are: Plasma Application, Plasma Target, Plasma Source, Plasma Medium, Plasma Property, Plasma
Source Property, Device, Diagnostics, Material, Physical Effect, and Physical Quantity. For BM, the core
entity types are: Property, Cathode, Anode, Technology, Additive, Component, and Electrolyte. These entity
types represent concepts critical for understanding the fundamental principles within PT
[
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ] and BM. Table 1 and Table 2 show the core entity types along with their frequencies in the
annotated data for the Plasma Technology and Battery Materials domains, respectively.
      </p>
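      <p>Assuming a standard BIO tagging scheme (the exact labeling scheme is not prescribed here), these entity-type inventories translate into the following tag sets for the sequence-labeling models:</p>

```python
# core entity types as defined with the domain experts (from the text above)
PT_TYPES = ["Plasma Application", "Plasma Target", "Plasma Source", "Plasma Medium",
            "Plasma Property", "Plasma Source Property", "Device", "Diagnostics",
            "Material", "Physical Effect", "Physical Quantity"]
BM_TYPES = ["Property", "Cathode", "Anode", "Technology", "Additive", "Component",
            "Electrolyte"]

def bio_label_set(entity_types):
    """One B-/I- tag pair per entity type, plus the O (outside) tag."""
    return ["O"] + [prefix + t for t in entity_types for prefix in ("B-", "I-")]

print(len(bio_label_set(PT_TYPES)))  # 2 * 11 + 1 = 23 tags
print(len(bio_label_set(BM_TYPES)))  # 2 * 7 + 1 = 15 tags
```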
    </sec>
    <sec id="sec-4">
      <title>4. Approach Based on Fine-Tuning BERT Model Variants</title>
      <p>BERT is an exceptionally powerful, general-purpose language model that can be fine-tuned for a wide
range of text-based machine learning tasks. Instead of training models from scratch, one can leverage
pre-trained BERT variants, such as SciBERT, BERT-for-Patents, and BatteryBERT, which have been
specifically trained on domain-specific corpora. These models can then be fine-tuned for various NLP
tasks, including NER, classification, sentiment analysis, etc.
4https://github.com/google/healthcare-text-annotation/blob/master/methodology/annotation-methodology.md</p>
      <p>Fine-tuning involves adapting a pre-trained model, like BatteryBERT, which has already learned a
rich representation of language from large-scale unsupervised training, to a specific downstream task.
This is achieved by training the model on a smaller, task-specific labeled dataset. During fine-tuning,
the weights of the pre-trained model are adjusted to optimize performance for the target task, such as
NER. By leveraging the general language knowledge embedded in the pre-trained model, fine-tuning
enables the model to quickly adapt to specialized tasks, even with limited task-specific training data.</p>
      <p>Figure 2 illustrates the high-level components involved in fine-tuning selected pre-trained BERT
models for the NER task. We treat the identification of entities in patent text as a sequence labeling
task, where a label is assigned to each word or token within the identified entity or phrase, based on
its context. Each input sequence of tokens t1, t2, ..., tn is tokenized using the appropriate
tokenizer.</p>
      <p>The tokenizer usually splits tokens into sub-tokens (or sub-words), and some special tokens ([CLS]
and [PAD]) are added. The [CLS] token marks the beginning of the sentence or sequence. Since in BERT
the sequence length is fixed (max 512 tokens), the [PAD] token unifies the length of each sentence to
that of the longest one, so that all sentences fed to the BERT model have the same length. As Figure 2
shows, based on the input sequence, three different embeddings are generated for each token
t1, t2, ..., tn; the model then captures each token’s context through its many attention heads.</p>
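      <p>The greedy longest-match-first behaviour of WordPiece-style sub-word splitting, together with [CLS]/[PAD] handling, can be sketched as follows (a toy re-implementation with a made-up vocabulary, not the actual BERT tokenizer):</p>

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match-first sub-word split in the style of WordPiece.
    Continuation pieces carry the '##' prefix; unmatchable words become [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = ("##" if start > 0 else "") + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def pad_batch(sequences, max_len):
    """Prefix each token sequence with [CLS] and pad with [PAD] to max_len."""
    return [["[CLS]"] + seq + ["[PAD]"] * (max_len - len(seq)) for seq in sequences]

toy_vocab = {"plasma", "electro", "##lyte"}
print(wordpiece_split("electrolyte", toy_vocab))  # ['electro', '##lyte']
print(pad_batch([["plasma"], ["electro", "##lyte"]], 2))
```

      <p>Sub-word splitting is what lets the models handle out-of-vocabulary technical terms that are frequent in patent texts.</p>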
      <p>• The token embedding is calculated per token; however, if a token is not present in the
selected BERT model’s vocabulary, BERT generates its embedding at the sub-word
level.
• The segment/sentence embedding is calculated for a single segment or for two segments. When
two segments are present in the same sequence, they are separated by the [SEP] token, and
each segment has its own embedding.
• The position embedding represents the token’s position within a sentence, e.g., identifying which
is the first, second, third, etc., token of the sentence.</p>
      <p>Once the embeddings are generated, they are summed and fed to the output layer, i.e., the
classification layer, which classifies each token of the sequence into its relevant label l1, l2, ..., ln.</p>
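      <p>The sum-and-classify step can be illustrated in plain Python (a deliberately tiny sketch: two-dimensional vectors and a hand-set weight matrix stand in for BERT’s high-dimensional hidden states and a trained classification layer):</p>

```python
import math

def sum_embeddings(token_emb, segment_emb, position_emb):
    """BERT input representation: element-wise sum of the three embeddings."""
    return [t + s + p for t, s, p in zip(token_emb, segment_emb, position_emb)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_token(hidden, weight_rows, labels):
    """Linear classification layer: the label with the highest softmax
    probability over the logits wins."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weight_rows]
    probs = softmax(logits)
    return labels[probs.index(max(probs))]

hidden = sum_embeddings([0.2, 0.1], [0.0, 0.1], [0.1, 0.0])
print(classify_token(hidden, [[1.0, 0.0], [0.0, 1.0]], ["B-Material", "O"]))
```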
      <p>Specifically, the fine-tuning process begins by extending the base BERT architecture with a token-level
classification layer, transforming it into a model suitable for sequence labeling tasks. This additional layer
enables the model to predict entity labels for individual tokens within a given input sequence, leveraging
the rich contextual embeddings produced by the underlying BERT layers. Once the architecture is
adapted, the model is fine-tuned on a curated, manually annotated dataset specific to the target domain.
Each token is paired with its corresponding entity label, enabling the model to learn explicit associations
between contextual usage and semantic categories. The fine-tuning is conducted in a supervised manner
using gradient-based optimization techniques, during which the parameters of the pre-trained model
are updated to reflect domain-specific linguistic and semantic patterns. Importantly, the knowledge
embedded in the pre-trained model provides a strong basis of general language understanding, while
the domain-specific labeled data serves as a corrective signal that guides the model toward recognizing
entities unique to the specialized context. This synergy allows the model to achieve robust performance
even when fine-tuning data is limited, by efficiently integrating prior knowledge with task-specific
supervision.</p>
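      <p>To make the supervised update concrete, here is a toy stand-in for training just the added classification layer by gradient descent (hand-made three-dimensional "embeddings" and labels; real fine-tuning updates the full BERT network, typically with an optimizer such as AdamW):</p>

```python
import math
import random

def train_head(examples, num_labels, dim, epochs=200, lr=0.5):
    """Gradient-descent training of a linear token-classification head on
    (vector, gold_label) pairs with a softmax cross-entropy loss."""
    random.seed(0)
    W = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_labels)]
    for _ in range(epochs):
        for x, y in examples:
            logits = [sum(w * v for w, v in zip(row, x)) for row in W]
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            probs = [e / z for e in exps]
            for k in range(num_labels):
                grad = probs[k] - (1.0 if k == y else 0.0)  # dL/dlogit_k
                for d in range(dim):
                    W[k][d] -= lr * grad * x[d]
    return W

def predict(W, x):
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    return logits.index(max(logits))

# made-up 3-dim "contextual vectors"; labels: 0 = O, 1 = B-Cathode
data = [([1.0, 0.2, 0.0], 0), ([0.1, 0.9, 0.8], 1),
        ([0.9, 0.1, 0.1], 0), ([0.0, 1.0, 0.9], 1)]
W = train_head(data, num_labels=2, dim=3)
print([predict(W, x) for x, _ in data])  # the head separates the toy data
```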
      <p>In Section 5, we present the evaluation methodology used to assess the performance of the developed
NER models for the plasma technology and battery materials domains within patent texts.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>For the training and testing of the developed NER models, the dataset was divided into two parts:
80% for training and 20% for testing. To evaluate the effectiveness of the developed models in
low-resource scientific domains, we conducted experiments on two distinct patent-related use cases: PT
(see Table 3) and BM (see Table 4). The models evaluated include fine-tuned SciBERT and
BERT-for-Patents. Additionally, for the Battery Materials use case, another domain-specific pre-trained model,
BatteryBERT, was fine-tuned and evaluated. Model performance was assessed at the entity level across
relevant scientific core entity types, using the standard metrics of precision, recall, and F1-score.</p>
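      <p>Entity-level scoring under exact span-and-type match reduces to set comparison, as the following sketch shows (the spans are made up for illustration):</p>

```python
def entity_prf(gold_spans, predicted_spans):
    """Entity-level precision, recall, and F1 under exact span-and-type match."""
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# (start, end, type) spans for one hypothetical test abstract:
# one exact match, one type confusion, one missed entity
gold = [(0, 12, "Plasma Source"), (20, 26, "Plasma Medium"), (30, 38, "Device")]
pred = [(0, 12, "Plasma Source"), (20, 26, "Plasma Target")]
p, r, f1 = entity_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

      <p>Note that a span with the right boundaries but the wrong type counts as both a false positive and a false negative, which is why type confusions (e.g. on ambiguous entity types) depress precision and recall simultaneously.</p>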
      <p>In the PT domain, both SciBERT and BERT-for-Patents demonstrated competitive performance,
achieving relatively high F1-scores for most core entity types, for example Physical Quantity (81%),
Plasma Medium (79%), and Plasma Source (78%). These results indicate that both fine-tuned general
scientific models (SciBERT) and patent-specific models (BERT-for-Patents) are capable of extracting
semantically rich domain entities with reasonable accuracy.</p>
      <p>In particular, SciBERT slightly outperformed BERT-for-Patents for core entity types such as Plasma
Medium and Device. This slightly better performance can be attributed to SciBERT’s pretraining corpus,
which consists of large-scale scientific texts spanning diverse disciplines, including physics, engineering,
and biomedical sciences. Scientific publications frequently describe experimental setups, apparatuses,
and material environments in detail, providing SciBERT with richer contextual insight into technical
entities. This broader scientific context enhances the model’s ability to generalize and accurately identify
entities in patent-based evaluation sets, even when certain entity terms are underrepresented in the
corpus. Conversely, BERT-for-Patents, while trained on a large corpus of patent documents, may face
slight challenges in capturing the richer context for certain core entity types. Patent language often
focuses on legal or application aspects, which can make technical descriptions unclear. This likely
explains why BERT-for-Patents underperformed on entity types that require high technical specificity.</p>
      <p>Notably, both models showed reduced performance on underrepresented or ambiguous entity
types such as Diagnostics and Plasma Property. These performance drops are likely due to the inherent
vagueness of these entity types and their relatively low frequency in the annotated training data.</p>
      <p>The BM domain presented greater challenges, primarily because the available annotated dataset
was smaller and had a narrower focus (about 59% smaller than the Plasma Technology training data).
The complexity of battery materials terminology, including detailed chemical and material properties,
made it difficult to achieve high performance across all entity types with the small training dataset.
Additionally, the wide variety of battery applications and evolving terminology caused confusion,
making it harder for the models to adapt to the changing language without fine-tuning on a larger,
balanced, and representative training dataset. This emphasizes the importance of maintaining a balanced
distribution of entity types to ensure consistent generalization across all core entity types.</p>
      <p>Despite these limitations, the BM NER models achieved promising results for many core entity types.
In particular, the BM NER model fine-tuned on BatteryBERT outperformed both SciBERT and
BERT-for-Patents on most core entity types, such as Cathode and Anode. BatteryBERT was specifically pretrained
on a domain-specific corpus of battery-related scientific literature, which likely contributed to its strong
performance. However, for more general entity types like Property, SciBERT and BERT-for-Patents
performed better, achieving slightly higher F1-scores (69% and 70%, respectively). This can be attributed
to the broader training corpora of these models, which include frequent and varied mentions of general
scientific entities across diferent fields. As a result, they are better able to generalize the meaning of
cross-domain concepts like Property.</p>
      <p>The results from the Battery Materials domain showed a significant performance drop for the
Technology core entity type across all models, indicating that semantic ambiguity and contextual
overlap with general scientific or industrial terminology make this entity type particularly challenging
to disambiguate. The Electrolyte core entity type, which involves complex and context-dependent
chemical formulations, exhibited slight drops in recall across all models. This is likely related to insufficient
representation of such technical details in the training data, highlighting the need for further refinement
in domain-specific data annotation. Addressing this gap is planned for the next iteration of data
annotation and model training, as enriching the training corpus with more detailed chemical terminology
will likely improve model performance.</p>
      <p>Moreover, these findings suggest that while transformer-based models, such as SciBERT,
BERT-for-Patents, and BatteryBERT, perform well in the identification of specialized patent-related entities,
performance varies significantly based on the extent of domain-specific training. The evaluation
also highlights the importance of training dataset size and quality in fine-tuning domain-specific
models. The relatively small annotated dataset in the Battery Materials domain highlights a potential
limitation of transformer models in low-resource domains, where models struggle to learn from limited
training examples. The model’s performance on low-frequency entity types indicates that significant
improvements could be achieved with a larger, more comprehensive training dataset.</p>
      <p>Overall, the evaluation demonstrates that even with relatively small, manually curated training
datasets, fine-tuned transformer models can achieve strong performance in identifying technical
concepts within complex patent texts. Domain-specific pretraining plays a critical role: SciBERT provides
robust general scientific grounding across a broad range of entity types, while BatteryBERT offers
fine-grained precision for battery-specific entity types. Meanwhile, patent-specific models like
BERT-for-Patents, although broadly relevant, may not adequately distinguish battery-related entities such as Anode
and Cathode. These observations suggest that further exploration into domain-specific architectures
could enhance the performance of NER models in specialized scientific fields.</p>
      <p>Figure 3 shows these trends, illustrating how pretraining objectives and domain alignment affect
model performance for specific core entity types.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>The future work aims to address the limitations observed during model evaluation and improve the
overall performance of the NER models. Several key areas are currently being developed and refined:
• Improved Data Annotation: We are expanding and refining the annotated datasets for the Plasma
Technology and Battery Materials domains, with a focus on a new domain, Additive Manufacturing
(AM). For example, we are adding more detailed chemical and material property terms to the
Battery Materials dataset to improve model performance in categories like Electrolyte, where
performance gaps were identified. We are also prioritizing a balanced distribution of entity
types to enhance model generalization across all core entity types, such as Plasma Property and
Diagnostics.
• Domain-Specific Fine-Tuning: To further enhance model performance, we are exploring additional
fine-tuning techniques using domain-specific corpora. This involves incorporating specialized
texts that reflect the evolving terminology in the focused domains. For instance, we plan to
annotate patent texts from other sections, such as the Description and Claims, which provide richer
context for domain-specific entity types. The Description section often elaborates on the technical
details of the invention, while the Claims outline the specific legal and functional aspects, both of
which are crucial for improving entity recognition in a more domain-specific context.
• Addressing Semantic Ambiguities: We are developing methods to address semantic ambiguities,
particularly for entity types like Technology, where overlapping terms with general scientific or
industrial language make identification more difficult. Our approach involves strategies for better
disambiguation, using richer contextual information. This includes leveraging advanced language
representations to capture fine differences in meaning and improving the model’s capacity to
differentiate between closely related concepts. Ultimately, these efforts aim to reduce labeling
errors and enhance the precision of entity recognition in complex texts.
• Model Architecture Enhancements: Ongoing experiments are exploring potential improvements in
model architecture and training strategies. We are considering incorporating domain-specific
knowledge, such as integrating domain-specific ontologies into the training pipeline to further
improve model generalization across subdomains by better capturing hierarchical relationships
between entities and contextual dependencies in complex sentence structures.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this work, we presented the development of domain-specific NER models for plasma technology and
battery materials. By fine-tuning SciBERT, BERT-for-Patents, and BatteryBERT on high-quality,
domain-specific datasets derived from patent texts, we demonstrated that domain adaptation significantly
improves NER performance, even when training data is relatively small. In the plasma technology
domain, SciBERT and BERT-for-Patents achieved competitive F1-scores across a wide range of core entity
types. In the battery materials domain, the NER models fine-tuned on BatteryBERT, a model pretrained
on battery-related scientific literature, delivered particularly strong performance. In particular,
fine-tuning BatteryBERT led to superior results compared to general-purpose models, especially for core
entity types such as Cathode and Anode, despite the small size of the available training data. Notably, the
models performed robustly despite the inherent ambiguity and syntactic complexity of patent language,
which often lacks clarity and consistency. This underlines their capacity to adapt to challenging textual
environments. Such adaptability is crucial for extracting structured knowledge from domains where
well-defined language is not always used. These results highlight the effectiveness of domain-specific
fine-tuning in low-resource settings and demonstrate the critical role of specialized language models in
advancing information extraction from highly specialized scientific fields. Future work will focus on
further enriching the training datasets, both in terms of quality and quantity, by curating a more balanced
and comprehensive raw corpus from patent texts. Particular attention will be given to semantically
strengthening the representation of all core entity types, especially those with currently limited training
instances.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was partly funded by the DFG project Patents4Science, Project id: 496963457.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
  </back>
</article>