<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Tunisian-Algerian Conference on Applied Computing, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Two-Stage GAN Oversampling: Integrating GPT-3 and DBpedia for Named Entity Recognition Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adel Belbekri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wissem Bouarroudj</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fouzia Benchikha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lire Laboratory, Abdelhamid Mehri Constantine 2 University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>18</lpage>
      <abstract>
<p>Oversampling is a technique used to adjust the class distribution of a dataset, particularly to address imbalance between different classes or categories. This paper presents a novel oversampling method for named entity recognition (NER) datasets using a generative approach. Our method leverages the power of the GPT-3 large language model in combination with the DBpedia knowledge graph to create high-quality synthetic examples for underrepresented entity classes. The process starts by analyzing the dataset to identify entity types that require balancing. For each such entity type, we explore the DBpedia knowledge graph to find similar or equivalent concepts using ontological relationships. These concepts are then used to create GPT-3 prompts, guiding the model to generate contextually appropriate examples of the target entity type. This process is repeated until a balanced distribution across all entity classes is achieved. Our approach aims to address the common challenge of class imbalance in NER datasets while maintaining semantic coherence and linguistic diversity in the generated examples. We evaluate the effectiveness of this method on several benchmark NER datasets and discuss its potential impact on model performance and generalization.</p>
      </abstract>
      <kwd-group>
<kwd>Oversampling</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Generative Large Language Models</kwd>
        <kwd>Knowledge Graphs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that
involves identifying and classifying named entities such as persons, organizations, locations, and other
predefined categories within unstructured text. NER plays a crucial role in various NLP applications,
including information extraction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], question answering [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and text summarization [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Several approaches to NER exist, including rule-based methods, unsupervised techniques, and
supervised machine learning methods that rely on annotated data. One of the key challenges in supervised
approaches is class imbalance, where certain entity types are significantly underrepresented in datasets,
which can negatively affect model performance.</p>
      <p>
        Traditional oversampling techniques such as Random Oversampling [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and SMOTE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have been
widely used to address class imbalance in various machine learning tasks, including NER. Random
oversampling merely duplicates existing minority class samples, while SMOTE creates synthetic
examples by interpolating between existing instances. However, when applied to NER datasets, these
methods often fail to preserve the complex linguistic and semantic structures essential for accurate
entity recognition. They struggle to maintain the natural flow of language, context-dependent entity
relationships, and the various ways entities can be expressed in text. For instance, duplicating sentences
with rare entity types does not introduce new contextual variations, potentially leading to over-fitting.
SMOTE, designed for numerical data, may generate linguistically implausible sequences when applied
to text. More recently, generative approaches that use large language models have shown promise
in creating diverse and contextually appropriate samples. These models can generate new sentences
containing specific entity types while maintaining better semantic coherence and linguistic diversity.
      </p>
      <p>However, they introduce their own set of challenges. The generated text may contain subtle errors or
inconsistencies not present in real-world data, and ensuring precise entity annotations in the generated
content can be problematic. There is also a risk of introducing noise or creating examples that, while
linguistically plausible, may not accurately reflect the true distribution of entities in the domain of
interest. Balancing the benefits of these generative approaches with the need for accurate and reliable
NER training data remains an ongoing challenge in the field.</p>
      <p>
        This paper presents a novel oversampling method for NER datasets using a Generative Adversarial
Networks (GAN) approach that addresses these limitations. Our method leverages the power of large
language models like GPT-3 in combination with knowledge graphs such as DBpedia to create high-quality
synthetic examples for underrepresented entity classes. The process begins by analyzing the dataset
to identify entity types requiring balance. For each such entity, we explore the DBpedia knowledge
graph [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to find similar or equivalent concepts using ontological relationships. These concepts are
then used to create GPT-3 prompts, guiding the model to generate contextually appropriate examples of the
target entity type. Our approach aims to address the common challenge of class imbalance in NER
datasets while maintaining semantic coherence and linguistic diversity in the generated examples. By
integrating knowledge-graph information, we enhance the precision and relevance of the generated
annotations, mitigating a key limitation of purely generative methods. We evaluate the effectiveness of
this method on several benchmark NER datasets and discuss its potential impact on model performance
and generalization.
      </p>
      <p>The results demonstrate significant improvements in balanced accuracy across entity classes,
suggesting that our method could be a valuable tool for researchers and practitioners working with
imbalanced NER datasets.</p>
      <p>The remainder of this paper is structured as follows. Section 2 provides a comprehensive review of
related work. Section 3 presents our novel oversampling method in detail and describes the dataset
analysis process. Section 4 outlines our experimental setup, including the datasets used, evaluation
metrics, and baseline methods for comparison. In Section 5, we present and analyze our results,
discussing the impact of our method. Finally, Section 6 concludes the paper with a summary of our
findings, limitations of the current approach, and potential directions for future research in this area.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>This section provides a foundational overview of key concepts integral to our approach, aiming to
enhance the reader’s comprehension of the following sections.</p>
      <sec id="sec-2-1">
        <title>2.1. Knowledge Graph</title>
        <p>The concept of a knowledge graph is central to organizing and representing structured information
through a network of entities and their interrelationships, using a graph-based model.</p>
        <p>
          At the core of this representation is the Resource Description Framework (RDF) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a standard model
for web-based data interchange. RDF represents data as triples consisting of a subject, predicate, and
object, to describe relationships between resources. This structure is crucial for representing entities
and their relationships in a standardized format, enabling the integration of diverse datasets, supporting
the creation of structured vocabularies for entity description, and facilitating data interoperability and
knowledge sharing.
        </p>
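        <p>As a minimal illustration of the triple model, the sketch below builds two RDF statements about the entity "Python (programming language)" with the Python rdflib library; the library choice and the specific DBpedia property names are ours for illustration, not prescribed by this paper:</p>
        <p>from rdflib import Graph, Namespace, RDF

# DBpedia resource and ontology namespaces
DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
python = DBR["Python_(programming_language)"]
# Subject-predicate-object: the entity's type and its designer
g.add((python, RDF.type, DBO.ProgrammingLanguage))
g.add((python, DBO.designer, DBR["Guido_van_Rossum"]))

for subj, pred, obj in g:
    print(subj, pred, obj)</p>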
        <p>
          Within this framework, SPARQL (SPARQL Protocol and RDF Query Language) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] serves as a semantic
query language essential for interacting with knowledge graphs. It allows for complex queries that
traverse the graph structure, enabling the retrieval and manipulation of data stored in RDF format. In
Named Entity Recognition, SPARQL is used to retrieve potential entity candidates and their associated
information from the knowledge graph.
        </p>
        <p>Knowledge graphs are particularly valuable in fields such as artificial intelligence, natural language
processing, and data analytics due to their ability to capture semantic relationships for meaningful data
retrieval and reasoning.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Named Entity Recognition</title>
        <p>
          NER is a crucial natural language processing task involving the identification and classification of named
entities in unstructured text into predefined categories such as person names, organizations, locations,
and more [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It is a fundamental step in many NLP applications, including question-answering systems
and information extraction. NER aims to locate and extract specific entities from text, which is critical
to understanding content and context. The precision of NER significantly impacts subsequent tasks
such as Named Entity Disambiguation [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], as correctly identifying entities is a prerequisite to
linking them to corresponding entries in knowledge bases.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Generative Language Model</title>
        <p>
          A Generative Language Model is a type of artificial intelligence model designed to understand and
produce human-like text. These models are trained on vast amounts of textual data to learn patterns,
structures, and relationships in language. They can generate coherent and contextually relevant text
based on given prompts or inputs [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Key characteristics of Generative Language Models include:
• Text generation: They can produce original text in various formats and styles.
• Context understanding: They can comprehend and maintain context over extended passages of
text.
        </p>
        <p>• Language understanding: They can interpret and respond to natural language inputs.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Oversampling</title>
        <p>
          Oversampling is a data augmentation technique used in machine learning to address the challenge of
imbalanced datasets [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. It involves increasing the number of samples in the minority class to achieve a
more balanced distribution of classes. This can be done through various methods, including the random
duplication of existing minority samples, the creation of synthetic data points, or more sophisticated
approaches such as the Synthetic Minority Oversampling Technique (SMOTE). The primary goals of
oversampling are to enhance model performance on imbalanced data, improve the classifier’s ability to
identify minority class instances, and reduce the cost associated with false negatives. By providing a more
balanced dataset, oversampling helps traditional machine learning algorithms, which are often optimized
for balanced metrics, to perform more effectively. This technique is particularly valuable in classification
problems where the minority class is of high importance, such as in fraud detection or medical diagnosis,
where accurately identifying rare but critical instances is crucial. However, the traditional oversampling
methods, like random oversampling and SMOTE, are often limited by their inability to generate diverse
or realistic synthetic data. As the field has progressed, more advanced techniques have emerged to
address these limitations, particularly in highly imbalanced datasets. Generative Adversarial Networks
(GANs) have gained attention as a cutting-edge solution for oversampling, offering more sophisticated
methods for generating synthetic data.
        </p>
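        <p>To make the contrast concrete, the sketch below applies random oversampling and SMOTE to a small invented two-class dataset using the imbalanced-learn library; the data and parameter values are illustrative only:</p>
        <p>import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy imbalanced data: 95 majority and 5 minority samples in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (95, 2)), rng.normal(3.0, 1.0, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# Random oversampling duplicates existing minority samples
X_ro, y_ro = RandomOverSampler(random_state=0).fit_resample(X, y)
# SMOTE interpolates between minority neighbours to create new points;
# k_neighbors must stay below the minority sample count, hence 3 here
X_sm, y_sm = SMOTE(random_state=0, k_neighbors=3).fit_resample(X, y)

print(Counter(y), Counter(y_ro), Counter(y_sm))</p>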
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>Generative Adversarial Networks (GANs) have emerged as powerful tools for oversampling in recent
years, offering the ability to generate high-quality synthetic data. Several GAN-based approaches have
been proposed to address class imbalance issues, each with its own strengths and limitations.</p>
      <p>
        BAGAN (Balancing GAN), proposed by Mariani et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], is specifically designed to handle class
imbalance problems, particularly in the context of image classification. It employs an autoencoder to
initialize the generator and discriminator, providing improved stability when training on imbalanced
datasets. Although BAGAN offers better performance on highly skewed datasets, it may require careful
tuning of the autoencoder component.
      </p>
      <p>
        Mirza and Osindero [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] introduced CGAN (Conditional GAN), which incorporates class information
as an additional input to both the generator and the discriminator. CGANs are widely used in generating
realistic images based on specific conditions, such as generating images of particular objects or scenes.
This approach enables the generation of class-specific samples, making it particularly useful for targeted
oversampling. However, CGANs may struggle with highly imbalanced datasets where certain classes
have very few examples.
      </p>
      <p>
        ACGAN (Auxiliary Classifier GAN), developed by Odena et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], integrates an auxiliary classifier
into the discriminator. This architecture allows for the generation of high-quality samples while ensuring
they belong to the desired class. ACGAN has been used in various state-of-the-art applications across
different fields, such as medical imaging, acoustic scene classification, and portfolio optimization. While
ACGAN can improve overall classification performance, its more complex architecture may be more
challenging to train effectively.
      </p>
      <p>
        Mullick et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] proposed GAMO (Generative Adversarial Minority Oversampling), which utilizes
a three-player adversarial game between a convex generator, a classifier network, and a discriminator.
GAMO aims to generate synthetic samples near decision boundaries to enhance classifier performance.
GAMO has been used for tasks such as classification and image generation. Although effective, this approach
may be computationally intensive.
      </p>
      <p>
        CTGAN (Conditional Tabular GAN), developed by Xu et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], is primarily designed for tabular
data but can be adapted for other domains. It incorporates conditioning techniques and mode-collapse
prevention strategies, and it has been used for privacy-preserving data sharing, customer
churn prediction, and survival analysis. While CTGAN shows promise for diverse data types, it may
require significant adaptation for non-tabular data such as text or images.
      </p>
      <p>These GAN-based oversampling techniques offer several advantages over traditional methods. They
can generate diverse, high-quality synthetic samples, capture complex distributions and subtle
relationships in the data, and allow for targeted and controllable oversampling. However, they also face
challenges, particularly in terms of training stability and the risk of mode collapse.</p>
      <p>In the context of NER, GAN-based oversampling has not yet been applied but could potentially address
class imbalance issues by generating synthetic examples of underrepresented entity classes. However, a
significant challenge lies in ensuring the accuracy of entity-type annotations in the generated samples.
GANs may find it challenging to maintain precise entity boundaries and labeling, especially for complex
or context-dependent entities.</p>
      <p>This limitation highlights the need for further research into GAN architectures that can
incorporate domain-specific knowledge or constraints to ensure generated samples maintain accurate entity
annotations.</p>
      <p>As the field progresses, addressing these challenges will be crucial for the effective application of
GAN-based oversampling techniques in NER and other natural language processing tasks dealing with
imbalanced datasets.</p>
      <p>To address this gap, we propose a new technique for generating synthetic data based on GPT-3
prompts, utilizing knowledge about named entities extracted from the DBpedia knowledge graph. This
method ensures precise entity annotation and generates content that respects the original context
of the data, thereby avoiding ambiguities. By leveraging the structured information from DBpedia,
our approach enhances the quality and relevance of the generated examples, maintaining semantic
coherence and linguistic diversity. This technique provides a robust solution for creating high-quality
synthetic data that accurately reflects the characteristics of underrepresented entity classes in NER
datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>4. The Proposed Approach</title>
      <p>We propose a solution built on a component-based architecture, which enables us to address each
aspect of the oversampling process independently while ensuring the system functions cohesively as
a whole. This approach provides the flexibility to enhance or replace individual components as new
techniques emerge or requirements evolve, allowing for continuous improvement and adaptation of
our system.</p>
      <p>The schema depicted in Figure 1 illustrates the main components identified in our architecture: (1)
the Imbalance Quantifier Module, (2) the Two-Stage GAN, composed of the Knowledge Graph Explorer
Module and the Synthetic Example Generator Module, and (3) the Dataset Integrator Module. Each of
these components plays a crucial role in our novel oversampling method for Named Entity Recognition
datasets. In the following subsections, we will describe the specific role and functionality of each
component, explaining how they work together to create a robust and effective oversampling solution
for addressing class imbalance in NER tasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Imbalance Quantifier Module</title>
        <p>The Imbalance Quantifier Module is designed to address class imbalance in Named Entity Recognition
(NER) datasets. It evaluates the dataset and provides a quantitative analysis of the imbalance among
different entity classes. The module calculates key metrics such as the imbalance ratio, entropy, and
Gini coefficient, and identifies underrepresented classes for prioritization (Algorithm 1).
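        <p>For illustration, the following Python sketch computes the module's metrics from a flat list of gold entity-type labels; the 10% threshold is an assumed value, as the paper does not fix one:</p>
        <p>import math
from collections import Counter

def imbalance_metrics(entity_labels, threshold=0.10):
    """Minimal sketch of the Imbalance Quantifier metrics (Algorithm 1)."""
    counts = Counter(entity_labels)
    total = sum(counts.values())
    proportions = {cls: n / total for cls, n in counts.items()}

    imbalance_ratio = max(counts.values()) / min(counts.values())
    entropy = -sum(p * math.log2(p) for p in proportions.values() if p &gt; 0)
    gini = 1 - sum(p ** 2 for p in proportions.values())

    underrepresented = [cls for cls, p in proportions.items() if p &lt; threshold]
    # Prioritize the rarest classes first
    prioritized = sorted(underrepresented, key=lambda cls: proportions[cls])
    return {"imbalance_ratio": imbalance_ratio, "entropy": entropy,
            "gini": gini, "prioritized": prioritized}

print(imbalance_metrics(["PER"] * 80 + ["LOC"] * 15 + ["EVENT"] * 5))</p>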
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Two-Stage GAN Module</title>
        <p>This module integrates two main actions: exploring the knowledge graph to understand named entities
and identify similar concepts, and using this information to generate correctly labeled synthetic data.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Knowledge Graph Explorer Module</title>
          <p>The Knowledge Graph Explorer Module is crucial for enriching underrepresented named entities
identified by the previous module. It interacts with the DBpedia knowledge base to extract relevant
information about these entities. By using SPARQL, the module formulates detailed queries (Listing 1)
to retrieve specific data, allowing for precise and flexible extraction of attributes and relationships. For
instance, if the target entity is "Python (programming language)," a SPARQL query might retrieve its
description, related technologies, and alternative names like "Python 3."</p>
          <p>Additionally, the module navigates ontological relationships within the knowledge graph to identify
related or equivalent concepts, broadening the context around target entities. For example, it might
explore relationships such as "similarEntity" to find broader programming paradigms or "sameAs" to
link Python with similar languages like Ruby or JavaScript.</p>
          <p>The module also extracts important properties, such as descriptions and alternative names, to provide
a comprehensive context for generating synthetic examples. In the case of Python, this could include
extracting attributes like its creator, Guido van Rossum, and its primary use cases in data science and
web development.</p>
          <p>Algorithm 1 Imbalance Quantifier Component
1: function ImbalanceQuantifier(dataset)
2:     entityClasses ← ExtractEntityClasses(dataset)
3:     classCounts ← CountEntitiesByClass(dataset, entityClasses)
4:     metrics ← CalculateMetrics(entityClasses, classCounts)
5:     underrepresentedClasses ← IdentifyUnderrepresentedClasses(metrics, threshold)
6:     prioritizedClasses ← PrioritizeClasses(underrepresentedClasses, metrics)
7:     return {metrics, prioritizedClasses}
8: end function
9: function CalculateMetrics(entityClasses, classCounts)
10:    metrics ← {}
11:    for each class in entityClasses do
12:        metrics[class].count ← classCounts[class]
13:        metrics[class].proportion ← classCounts[class] / sum(classCounts)
14:    end for
15:    imbalanceRatio ← max(classCounts) / min(classCounts)
16:    entropy ← − Σ p · log2(p), over all class proportions p with p &gt; 0
17:    gini ← 1 − Σ p², over all class proportions p
18:    return {metrics, imbalanceRatio, entropy, gini}
19: end function
20: function IdentifyUnderrepresentedClasses(metrics, threshold)
21:    return {class for class, data in metrics if data.proportion &lt; threshold}
22: end function
23: function PrioritizeClasses(underrepresentedClasses, metrics)
24:    return SortByProportion(underrepresentedClasses, metrics)
25: end function</p>
          <p>Relevant attributes are systematically extracted and organized, ensuring readiness for use by the
synthetic example generator module. This comprehensive approach enables the creation of rich,
contextually relevant prompts, enhancing the representation of underrepresented entities.</p>
          <p>Listing 1: SPARQL query to generate similar entities
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX owl: &lt;http://www.w3.org/2002/07/owl#&gt;
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
SELECT DISTINCT ?similarEntity
WHERE {
  {
    SELECT ?entity ?label ?type
    WHERE {
      ?entity a ?type ;
              rdfs:label ?label .
      FILTER (?type != owl:Thing)
      FILTER (LANG(?label) = "en")
    }
    LIMIT 1
  }
  ?entity owl:sameAs* ?similarEntity .
  ?similarEntity rdfs:label ?similarLabel ;
                 a ?similarType .
  FILTER (?similarEntity != ?entity)
  FILTER (LANG(?similarLabel) = "en")
  FILTER (?similarType != owl:Thing)
}
LIMIT ?calculatedLimit</p>
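          <p>A query of this shape could be issued against the public DBpedia endpoint with the SPARQLWrapper library, as sketched below; the endpoint URL, the seed entity, and the fixed limit standing in for ?calculatedLimit are illustrative assumptions:</p>
          <p>from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia endpoint (assumed; the paper does not name the endpoint URL)
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX owl: &lt;http://www.w3.org/2002/07/owl#&gt;
    SELECT DISTINCT ?similarEntity WHERE {
        &lt;http://dbpedia.org/resource/Python_(programming_language)&gt;
            owl:sameAs ?similarEntity .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["similarEntity"]["value"])</p>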
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Synthetic Example Generator Module</title>
          <p>The Synthetic Example Generator Module is responsible for creating high-quality synthetic examples
for underrepresented entity classes. This module uses the advanced generative language model GPT-3,
leveraging information from the Knowledge Graph Explorer Module.</p>
          <p>Here are the main steps of the generation process, also described in Algorithm 2:
Creation of contextual prompts: The module utilizes data extracted by the Knowledge Graph
Explorer to craft specific prompts for each underrepresented entity class. These prompts are designed
to guide GPT-3 in generating relevant examples, as detailed in Algorithm 3.</p>
          <p>Interaction with GPT-3: The crafted prompts are sent to GPT-3, which generates synthetic examples
containing the target entities. The language model produces sentences or paragraphs that naturally
incorporate entities from the underrepresented class, as outlined in Algorithm 4.</p>
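          <p>A call corresponding to Algorithm 4 might look like the sketch below, written against the GPT-3-era OpenAI completions API; the engine name and API key are placeholder assumptions, while the max_tokens and temperature values follow Algorithm 4:</p>
          <p>import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def gpt3_generate(prompt):
    """Minimal sketch of Algorithm 4: one completion call with the paper's settings."""
    try:
        response = openai.Completion.create(
            engine="text-davinci-003",  # assumed GPT-3 engine name
            prompt=prompt,
            max_tokens=150,
            temperature=0.7,
        )
        return response.choices[0].text.strip()
    except Exception:
        print("Error in GPT-3 generation")
        return ""</p>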
          <p>This process is illustrated in Figure 2.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Dataset Integrator Module</title>
        <p>The Dataset Integrator Module is the final component in the proposed oversampling architecture for
Named Entity Recognition datasets. Its primary role is to incorporate the synthetic examples generated
by the previous modules into the original dataset, ensuring a balanced distribution of entity classes.</p>
        <p>The Dataset Integrator Module performs several key functions crucial to the oversampling process.
It begins by merging the synthetic data, seamlessly integrating the generated synthetic examples with
the original NER dataset.</p>
        <p>This module then focuses on balancing class distribution, carefully adjusting the proportions of entity
classes to achieve the desired balance as determined by the Imbalance Quantifier Module.</p>
        <p>Throughout this process, maintaining data integrity is a priority, ensuring that the integration
preserves the structure and format of the original dataset. The module also conducts thorough validation,
performing final checks to verify the quality and consistency of the augmented dataset.
Algorithm 2 Generate Synthetic Examples
1: function GenSyntExamples(underrepClasses, kGinfo, targetCount)
2:     syntheticExamples ← {}
3:     for each entityClass in underrepClasses do
4:         classInfo ← kGinfo[entityClass]
5:         while length of syntheticExamples[entityClass] &lt; targetCount do
6:             prompt ← CreatePrompt(entityClass, classInfo)
7:             generatedText ← GPT3Generate(prompt)
8:             annotatedText ← AnnotateEntities(generatedText, entityClass)
9:             Append annotatedText to syntheticExamples[entityClass]
10:        end while
11:    end for
12:    return syntheticExamples
13: end function
Algorithm 3 Create Prompt
1: function CreatePrompt(entityClass, classInfo)
2:     definition ← classInfo.get('definition', '')
3:     examples ← Join(classInfo.get('examples', []))
4:     relatedConcepts ← Join(classInfo.get('related_concepts', []))
5:     prompt ← "Generate a short paragraph that mentions a {entityClass}."
6:     Append "Definition: {definition}" to prompt
7:     Append "Examples: {examples}" to prompt
8:     Append "Related concepts: {relatedConcepts}" to prompt
9:     Append "The paragraph should naturally incorporate the {entityClass}'s name and at least one of its common activities." to prompt
10:    Append "Make sure the {entityClass} is clearly identifiable as an entity in the text." to prompt
11:    Append "Generate a new and unique paragraph following these guidelines:" to prompt
12:    return prompt
13: end function
Algorithm 4 GPT-3 Generate
1: function GPT3Generate(prompt)
2:     response ← CallGPT3API(prompt, maxTokens=150, temperature=0.7)
3:     if response is successful then
4:         return Trim(response.text)
5:     else
6:         Print "Error in GPT-3 generation"
7:         return ""
8:     end if
9: end function</p>
        <p>Finally, it handles output generation, producing the final, balanced NER dataset ready for use in
training models. These functions collectively ensure that the resulting dataset effectively addresses the
initial class imbalance while maintaining the overall quality and usability of the data.</p>
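        <p>As an illustration of the merging and balancing steps, the sketch below appends synthetic sentence/label pairs per targeted class and shuffles the result; the data layout and target counts are assumptions for illustration:</p>
        <p>import random

def integrate(original, synthetic_by_class, target_counts, seed=0):
    """Minimal sketch of the Dataset Integrator merging step.

    original: list of (tokens, labels) sentence pairs.
    synthetic_by_class: dict mapping an entity class to generated pairs.
    target_counts: dict mapping an entity class to the number of synthetic
    sentences to add, as decided by the Imbalance Quantifier.
    """
    augmented = list(original)
    for entity_class, examples in synthetic_by_class.items():
        needed = target_counts.get(entity_class, 0)
        augmented.extend(examples[:needed])
    random.Random(seed).shuffle(augmented)  # avoid a block of synthetic data at the end
    return augmented</p>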
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <p>To evaluate the effectiveness of our proposed two-stage GAN oversampling method, we conducted
experiments on three widely recognized NER datasets known for their class imbalance issues.</p>
      <sec id="sec-5-1">
        <title>5.1. Datasets</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. SocialNER2.0 before the Balancing Step</title>
          <p>
            SocialNER2.0 [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] dataset, developed for Named Entity Recognition in short, human-produced texts
like social media posts, exhibits significant class imbalance. Before balancing, common entity types
such as Person and Location appeared much more frequently than rarer types like Event or Product.
          </p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. OntoNotes 5.0</title>
          <p>
            OntoNotes 5.0 [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] is a large-scale, multi-genre corpus widely used for various NLP tasks, including
NER. It contains 18 entity types with notable imbalance. Person, Organization, and Location are the
most common entity types, while types such as Product and Event have significantly fewer examples. Person
entities appear 5-10 times more frequently than Product entities, highlighting the substantial disparity
in entity distribution.
          </p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. CoNLL-2003</title>
          <p>
            CoNLL-2003 [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] is a standard benchmark dataset for NER tasks, focusing on news articles. It contains
4 entity types with a notable imbalance in their distribution. Location and Person entities each account
for approximately 30-35% of all entities, while Organization entities make up about 25-30%. The
Miscellaneous category is significantly underrepresented, comprising only 10-15% of the entities.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Model used in the experimentation</title>
        <p>In this experiment, we chose to use the BERT model (Bidirectional Encoder Representations from
Transformers). This choice is motivated by the following reasons:
• State-of-the-art performance: BERT has shown excellent performance on various NLP tasks,
including Named Entity Recognition (NER). It’s a strong baseline for evaluating the effectiveness
of our oversampling method.
• Contextual understanding: BERT’s bidirectional nature allows it to capture context from both
directions, which is crucial for accurate NER, especially for ambiguous entities.
• Pre-training advantage: BERT is pre-trained on a large corpus, which can be beneficial when
dealing with imbalanced datasets, as it has prior knowledge about language structure and entities.
• Adaptability: BERT can be fine-tuned for specific NER tasks, making it suitable for evaluating
our method across different datasets and entity types.
• Comparison with previous work: Many recent NER studies use BERT as a baseline, allowing for
easier comparison of our results with existing literature.
• Handling imbalanced data: While BERT itself doesn’t solve class imbalance issues, it provides
a strong foundation for evaluating how our oversampling method improves performance on
underrepresented entity classes.
• Compatibility with generated examples: BERT’s ability to handle variable-length input makes
it suitable for processing the synthetic examples generated by our GPT-3 and DBpedia-based
approach.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Experimental Setup</title>
        <p>To evaluate the effectiveness of our proposed two-stage GAN oversampling method, we conducted
experiments on the three datasets mentioned above: SocialNER2.0, OntoNotes 5.0, and CoNLL-2003.</p>
        <p>We used BERT as our base model for all experiments. The experiments were conducted using Google
Colab with GPU acceleration (NVIDIA Tesla T4). We implemented our method using Python 3.7,
PyTorch 1.9, and the Transformers library 4.10.</p>
        <p>For each dataset, we followed this experimental procedure:
1. Analyze the initial class distribution using our Imbalance Quantifier Module.
2. Apply the proposed oversampling method to generate synthetic examples for underrepresented
classes.
3. Integrate the synthetic examples into the original dataset.
4. Split the augmented dataset into training (80%), validation (10%), and test (10%) sets.
5. Fine-tune BERT on the augmented training set.
6. Evaluate the model’s performance on the test set.</p>
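        <p>As a concrete illustration of steps 5 and 6, the sketch below fine-tunes BERT for token classification with the Transformers library on a toy example; the checkpoint name, label set, and training settings are assumptions for illustration, not the paper's exact configuration:</p>
        <p>import torch
from torch.utils.data import Dataset
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          Trainer, TrainingArguments)

# Assumed label set and checkpoint; the paper uses BERT but does not name a variant
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
label2id = {label: i for i, label in enumerate(LABELS)}
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

class NerDataset(Dataset):
    """Toy stand-in for the tokenized augmented training split."""
    def __init__(self, sentences):
        self.sentences = sentences  # list of (tokens, word-level labels) pairs
    def __len__(self):
        return len(self.sentences)
    def __getitem__(self, idx):
        tokens, labels = self.sentences[idx]
        enc = tokenizer(tokens, is_split_into_words=True, truncation=True,
                        padding="max_length", max_length=32)
        # Align word-level labels to wordpieces; -100 masks special tokens from the loss
        enc["labels"] = [-100 if w is None else label2id[labels[w]]
                         for w in enc.word_ids()]
        return {key: torch.tensor(val) for key, val in enc.items()}

toy = [(["Barack", "Obama", "visited", "Paris"], ["B-PER", "I-PER", "O", "B-LOC"])]
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=len(LABELS))
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="ner-out", num_train_epochs=1,
                                         per_device_train_batch_size=2),
                  train_dataset=NerDataset(toy))
trainer.train()</p>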
        <p>
          We compared our method against the following baselines:
• Original imbalanced dataset
• Random oversampling [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
• SMOTE (Synthetic Minority Over-sampling Technique) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Evaluation Metrics</title>
        <p>Precision, recall, and F1 score are fundamental metrics used to evaluate the performance of NER systems.
These entity-level metrics provide a comprehensive view of the model’s accuracy and effectiveness in
identifying named entities.</p>
        <p>• Precision: Precision measures the proportion of correctly identified named entities among all
the entities identified by the NER system. It answers the question: "Of all the entities the model
identified, how many were correct?"</p>
        <p>Precision = True Positives / (True Positives + False Positives)
• Recall: Recall measures the proportion of correctly identified named entities among all the entities
that should have been identified. It answers the question: "Of all the actual entities in the text,
how many did the model correctly identify?"</p>
        <p>Recall = True Positives / (True Positives + False Negatives)
• F1 Score: The F1 score is the harmonic mean of precision and recall, providing a balanced measure
of the model’s performance. It’s particularly useful when you have an uneven class distribution,
as it takes both false positives and false negatives into account.</p>
        <p>F1 Score = 2 * (Precision * Recall) / (Precision + Recall)</p>
        <p>These metrics are calculated for each entity type separately and then averaged to give an overall
performance measure. This allows for a nuanced understanding of the model’s strengths and weaknesses
across different entity types.</p>
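        <p>These entity-level scores can be computed with the seqeval library, as sketched below on invented gold and predicted label sequences:</p>
        <p>from seqeval.metrics import (precision_score, recall_score, f1_score,
                            classification_report)

# Invented gold and predicted BIO label sequences for two sentences
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "O", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"], ["B-ORG", "O", "B-LOC", "O"]]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
# Per-entity-type breakdown, matching the per-class averaging described above
print(classification_report(y_true, y_pred))</p>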
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Results and Analysis</title>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Discussion</title>
        <p>The results demonstrate that our two-stage GAN oversampling method effectively addresses class
imbalance in NER datasets while maintaining semantic coherence and linguistic diversity. The
integration of DBpedia knowledge significantly improved the quality and relevance of generated examples,
particularly for rare entity types. The most substantial improvements were observed in SocialNER2.0,
likely due to its higher initial imbalance and the challenging nature of entity recognition in short,
informal texts. The method’s ability to generate contextually appropriate examples for rare entities in
social media language proved particularly valuable. While improvements were seen across all datasets,
the gains were less pronounced in CoNLL-2003. This is likely due to its lower initial imbalance and
more formal language structure, which presents fewer challenges for traditional NER models.</p>
        <p>Our method’s performance on OntoNotes 5.0 demonstrates its scalability to datasets with a large
number of entity types, effectively improving recognition for all 18 categories.</p>
        <p>These results suggest that our approach is particularly beneficial for datasets with high class imbalance
and those dealing with informal or diverse text genres. The method’s ability to generate high-quality,
diverse examples for rare entity types addresses a critical challenge in NER tasks, potentially improving
model generalization and robustness in real-world applications.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Perspectives</title>
      <p>This paper presented a novel GAN oversampling method that integrates GPT-3 and DBpedia for
addressing class imbalance in Named Entity Recognition datasets. Our approach leverages the power
of large language models and structured knowledge graphs to generate high-quality, contextually
appropriate synthetic examples for underrepresented entity classes. Our experimental results show
significant improvements in balanced accuracy across entity classes, with the most substantial gains
observed in datasets with high initial imbalance. The method’s ability to generate high-quality, diverse
examples for rare entity types addresses a critical challenge in NER tasks, potentially improving model
generalization and robustness in real-world applications. Future research could explore the integration
of other knowledge graphs, such as Wikidata or YAGO, to enhance the diversity and contextual relevance
of the synthetic examples generated.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Perplexity for grammar and spelling
checking. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Information extraction</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>30</volume>
          (
          <issue>5</issue>
          ),
          <fpage>8</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Mai</surname>
            ,
            <given-names>T. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>Q. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>L. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D. T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2024</year>
          , May).
          <source>The First ACM Workshop on AI-Powered Question Answering Systems for Multimedia. In Proceedings of the 2024 International Conference on Multimedia Retrieval</source>
          (pp.
          <fpage>1328</fpage>
          -
          <lpage>1329</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Chowdhary</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chowdhary</surname>
            ,
            <given-names>K. R.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Natural language processing</article-title>
          .
          <source>Fundamentals of artificial intelligence</source>
          ,
          <fpage>603</fpage>
          -
          <lpage>649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Moreo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2016</year>
          ,
          <article-title>July)</article-title>
          .
          <article-title>Distributional random oversampling for imbalanced text classification</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          (pp.
          <fpage>805</fpage>
          -
          <lpage>808</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowyer</surname>
            ,
            <given-names>K. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>L. O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kegelmeyer</surname>
            ,
            <given-names>W. P.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>SMOTE: synthetic minority over-sampling technique</article-title>
          .
          <source>Journal of artificial intelligence research</source>
          ,
          <volume>16</volume>
          ,
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García-Silva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2011</year>
          ,
          <article-title>September)</article-title>
          .
          <article-title>DBpedia spotlight: shedding light on the web of documents</article-title>
          .
          <source>In Proceedings of the 7th international conference on semantic systems</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2006</year>
          ,
          <article-title>February)</article-title>
          .
          <article-title>The index organizations for RDF and RDF schema</article-title>
          .
          <source>In 2006 8th International Conference Advanced Communication Technology (Vol. 3</source>
          , pp.
          <fpage>4</fpage>
          -pp).
          <source>IEEE.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>SPARQL query language</article-title>
          .
          <source>The Web of Data</source>
          ,
          <fpage>323</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>2018</year>
          ,
          <article-title>November)</article-title>
          .
          <article-title>An overview of named entity recognition</article-title>
          .
          <source>In 2018 International Conference on Asian Language Processing (IALP)</source>
          (pp.
          <fpage>273</fpage>
          -
          <lpage>278</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Bouarroudj</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boufaida</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Bellatreche</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Named entity disambiguation in short texts over knowledge graphs</article-title>
          .
          <source>Knowl Inf Syst</source>
          <volume>64</volume>
          ,
          <fpage>325</fpage>
          -
          <lpage>351</lpage>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Bouarroudj</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boufaida</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellatreche</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>WeLink: A Named Entity Disambiguation Approach for a QAS over Knowledge Bases</article-title>
          . In: Cuzzocrea,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Saccà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Andreasen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Christiansen</surname>
          </string-name>
          , H. (
          <article-title>eds) Flexible Query Answering Systems</article-title>
          .
          <source>FQAS 2019. Lecture Notes in Computer Science()</source>
          , vol
          <volume>11529</volume>
          . Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toh</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molina</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kayacik</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donsbach</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Terry</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2022</year>
          , April).
          <article-title>Discovering the syntax and strategies of natural language programming with generative language models</article-title>
          .
          <source>In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Kovács</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets</article-title>
          .
          <source>Applied Soft Computing</source>
          ,
          <volume>83</volume>
          ,
          <fpage>105662</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheidegger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Istrate</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bekas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Malossi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>BAGAN: Data augmentation with balancing GAN</article-title>
          . arXiv preprint arXiv:1803.09655.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Osindero</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Conditional generative adversarial nets</article-title>
          .
          <source>arXiv preprint arXiv:1411.1784</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Odena</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2017</year>
          ,
          <article-title>July)</article-title>
          .
          <article-title>Conditional image synthesis with auxiliary classifier gans</article-title>
          .
          <source>In International conference on machine learning</source>
          (pp.
          <fpage>2642</fpage>
          -
          <lpage>2651</lpage>
          ). PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Mullick</surname>
            ,
            <given-names>S. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Generative adversarial minority oversampling</article-title>
          .
          <source>In Proceedings of the IEEE/CVF international conference on computer vision</source>
          (pp.
          <fpage>1695</fpage>
          -
          <lpage>1704</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skoularidou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuesta-Infante</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Veeramachaneni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Modeling tabular data using conditional gan</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>32</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Belbekri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benchikha</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slimani</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Marir</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>SocialNER2.0: A comprehensive dataset for enhancing named entity recognition in short human-produced text</article-title>
          .
          <source>Intelligent Data Analysis</source>
          <volume>28</volume>
          (
          <issue>3</issue>
          ),
          <fpage>841</fpage>
          -
          <lpage>865</lpage>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Bernier-Colborne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vajjala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Annotation Errors and NER: A Study with OntoNotes 5.0</article-title>
          . arXiv preprint arXiv:2406.19172.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Sang</surname>
            ,
            <given-names>E. F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>De Meulder</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</article-title>
          .
          <source>arXiv preprint cs/0306050.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>