<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hierarchical Opinion Classification using Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuvam Banerji Seal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alok Mishra</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Utkarsha Ghosh</string-name>
          <email>utkarsha.ghosh2023@iem.edu.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Chemical Sciences, Indian Institute of Science Education and Research</institution>
          ,
          <addr-line>Kolkata</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information Technology, Institute of Engineering and Management</institution>
          ,
          <addr-line>Kolkata</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Indian Institute of Science Education and Research</institution>
          ,
          <addr-line>Kolkata</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>We address the task of hierarchical opinion classification with large language models (LLMs). Our first approach employs parameter-efficient fine-tuning of the Gemma model by attaching a lightweight two-layer classification head (LayerNorm → Linear → GELU → Linear) and updating only the final transformer block, normalization, and output layers. The second approach explores instruction fine-tuning, training the model to generate labels in a prompt-response format using a next-token prediction objective with a masked loss function that focuses only on the answer tokens. For both methods, the original three-level hierarchy of opinion labels is reformulated into an 8-class flat scheme, enabling direct optimization, and class-weighted cross-entropy loss is adopted to mitigate data imbalance and improve the treatment of minority categories. Experimental evaluation is conducted on this reformulated dataset, using accuracy metrics sensitive to class imbalance. Our evaluation contrasts the effectiveness of selective fine-tuning with custom heads against the generative alignment of instruction tuning, providing insights into adapting LLMs for hierarchical classification under strict computational and data imbalance constraints.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Hierarchical Labels</kwd>
        <kwd>Fine-Tuning</kwd>
        <kwd>Class Imbalance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>User-generated text on platforms such as Reddit, Twitter, and YouTube is highly diverse, noisy, and often
hierarchical in nature. Traditional sentiment analysis methods typically treat this as a flat classification
task, which fails to capture the layered structure of opinions. For instance, a comment may first be
categorized as subjective, then further divided into positive or negative, and in some cases split into
even finer categories such as questions or advertisements. This complexity, combined with heavy class
imbalance where certain categories dominate the data, poses significant challenges for robust opinion
classification.</p>
      <p>
        To address these issues, we adapt the Gemma-1B[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] large language model for hierarchical opinion
classification. Specifically, we collapse the three-level label scheme into a flat 8-class structure and
replace the model’s original output layer with a custom classification head. To balance efficiency and
performance, we employ parameter-efficient fine-tuning by training only the last transformer block,
final normalization layers, and the classification head, while freezing earlier layers. Furthermore, we
mitigate class imbalance through weighted cross-entropy loss with inverse-frequency class weights.
This approach improves the recognition of minority classes while maintaining strong overall accuracy,
demonstrating the effectiveness of lightweight fine-tuning for real-world opinion classification.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Objective</title>
      <p>The primary goal of this project is to develop an effective solution for an 8-class text classification task
derived from a 3-level hierarchical opinion classification dataset. The approach involves extending a
pretrained large language model (LLM) by attaching a custom classification head and selectively fine-tuning
specific layers. This design enables the model to adapt its generalized language representations to the
downstream classification task. In addition, strategies for addressing class imbalance are incorporated
to enhance predictive performance and ensure robustness across all categories.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Statement</title>
      <p>The task addressed in this work is an opinion classification problem based on a hierarchically structured
dataset. The dataset is annotated across three levels, each refining the granularity of classification:
• Level 1: Coarse-grained Opinion Classes. The first level categorizes each text into three broad
classes: Noise (label 0), Objective (label 1), and Subjective (label 2).
• Level 2: Subjective Subclasses. The second level refines the Subjective category into three
sentiment-based classes: Neutral (label 0), Negative (label 1), and Positive (label 2).
• Level 3: Neutral Subclasses. The third level further decomposes the Neutral class into four
specialized categories: Neutral Sentiment (label 0), Question (label 1), Advertisement (label 2), and
Miscellaneous (label 3).</p>
      <p>This hierarchical structure results in an effective 8-class text classification task at the leaf level
(Noise, Objective, Positive, Negative, Neutral Sentiment, Question, Advertisement, Miscellaneous). The
primary challenge lies in leveraging the hierarchical dependencies while addressing issues such as class
imbalance and semantic overlap among categories.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Data Cleaning and Preparation Workflow</title>
      <p>
        The quality of training data directly impacts the performance of large language models (LLMs). Therefore,
designing a systematic workflow for dataset cleaning and preparation is critical to ensure reliability
and reproducibility. A similar approach to pre-processing such opinionated textual information was
shown in [
        <xref ref-type="bibr" rid="ref2">2</xref>
          ], where the authors created a multi-stage query reformulation pipeline. We drew inspiration from it
and adapted and modified it for our pre-processing. In this section, we outline the comprehensive
workflow used for cleaning and structuring multiple social media datasets (Reddit, Twitter, YouTube,
and QnA-Train). The workflow consists of four major phases, each addressing a distinct aspect of data
quality.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Phase 1: Initial Loading and Structural Integrity Checks</title>
        <p>This phase focused on loading heterogeneous social media datasets (Twitter, YouTube, and Reddit) and
ensuring their structural consistency before semantic processing.</p>
        <p>• Step 1.1: Dataset Ingestion and Schema Alignment.</p>
        <p>Each source dataset exhibited slight variations in field names (e.g., text, content, body). To
ensure a unified schema, these columns were standardized into a single textual field text and
a label field label. Data were ingested using pandas.read_csv() with explicit encoding
(utf-8) to prevent parsing errors and ensure consistent handling of multilingual and special
characters.
• Step 1.2: Detection and Removal of Structural Duplicates and Null Records.</p>
        <p>Rather than relying on generic removal commands such as drop_duplicates() and dropna(),
a more controlled and semantically guided strategy was implemented to maintain dataset integrity.
– Step 1.2.1: Identification and Elimination of Null Entries.</p>
        <p>Instances containing missing textual fields were first identified using a Boolean mask
generated by is_nan(). Rows flagged as null were explicitly filtered out using conditional
selection rather than direct calls to dropna(), ensuring complete visibility into the number
and distribution of removed entries. This preemptive elimination of null samples prevented
downstream tokenization errors caused by empty or undefined strings.
– Step 1.2.2: Token-Based Duplicate Detection.</p>
        <p>To detect semantic duplicates, each text entry was tokenized using the Gemma-3 tokenizer after
null-value removal, and the total number of tokens was computed for every record. The
resulting counts were stored in an auxiliary column num_tokens. The dataset was then
sorted in ascending order based on this column, allowing easier visual and programmatic
inspection of potential duplicates.
– Step 1.2.3: Semantic Verification and Row Pruning.</p>
        <p>Rows exhibiting identical token sequences and matching token lengths were flagged as
structural duplicates. These entries, often representing repeated comments or cross-platform
reposts, were systematically removed to prevent redundant gradient updates during model
fine-tuning. This token-level validation ensured that duplicate detection was performed at
a semantic rather than purely string-based level.
– Step 1.2.4: Corpus Integrity Verification.</p>
        <p>Following duplicate elimination, row indices were reindexed to maintain dataset continuity,
and sample-level statistics (mean and variance of token counts) were recalculated. This
confirmed that the cleaning process preserved the natural distribution of text lengths across
social platforms.
• Step 1.3: Elimination of Non-Informative Columns.</p>
        <p>Non-essential metadata fields such as user identifiers, timestamps, and comment IDs were dropped
to minimize noise and reduce the dataset to its semantically relevant components. This ensured
that the subsequent cleaning and modeling phases operated exclusively on meaningful linguistic
and categorical information.</p>
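        <p>A minimal sketch of Phase 1 is shown below; the file name, column renaming, and checkpoint identifier are illustrative assumptions, and the token-based duplicate check follows Steps 1.2.2–1.2.3 rather than reproducing the exact production code.</p>
        <p>import pandas as pd
from transformers import AutoTokenizer

df = pd.read_csv("reddit.csv", encoding="utf-8")          # Step 1.1: explicit utf-8 ingestion (illustrative file name)
df = df.rename(columns={"body": "text"})                  # source-specific schema alignment
df = df[~df["text"].isna()].copy()                        # Step 1.2.1: Boolean-mask removal of null texts

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")   # illustrative checkpoint id
df["num_tokens"] = df["text"].map(lambda t: len(tokenizer(t)["input_ids"]))
df = df.sort_values("num_tokens")                          # Step 1.2.2: sort by token count for inspection

# Step 1.2.3: rows with identical token sequences are treated as structural duplicates
df["token_key"] = df["text"].map(lambda t: tuple(tokenizer(t)["input_ids"]))
df = df.drop_duplicates(subset="token_key").drop(columns="token_key")

df = df.reset_index(drop=True)                             # Step 1.2.4: reindex and recompute statistics
print(df["num_tokens"].mean(), df["num_tokens"].var())
df = df[["text", "label", "num_tokens"]]                   # Step 1.3: drop non-informative metadata</p>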
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Phase 2: Content-Level Cleaning and Text Refinement</title>
        <p>After achieving structural consistency, the next phase focused on cleaning the textual content itself to
eliminate noise and unify linguistic representation across platforms.</p>
        <p>• Step 2.1: URL and Hyperlink Removal.</p>
        <p>All web links (e.g., https://t.co/..., www.youtube.com/...) were stripped using regular
expressions (re.sub(r'http\S+|www\S+', '', text)). Links often introduce high-variance
tokens that provide no semantic value to opinion or sentiment classification.
• Step 2.2: Mention and Hashtag Filtering.</p>
        <p>Social references (@username) and hashtags (#topic) were removed to prevent token sparsity
and overfitting to platform-specific metadata. In selective cases, hashtags conveying clear
sentiment (e.g., “#happy”) were optionally retained during exploratory analysis but excluded in the
final standardized corpus.
• Step 2.3: Emoji and Symbol Normalization.</p>
        <p>Emojis and pictographic symbols were filtered using a Unicode-based regular expression pattern.
These characters inflate the tokenizer’s vocabulary space without contributing consistent semantic
information across samples.
• Step 2.4: Punctuation and Special Character Handling.</p>
        <p>Non-alphanumeric symbols were removed except for interrogative punctuation (“?”) and
exclamatory marks (“!”). These were retained intentionally, as they serve as discriminative cues for
Level-3 categories such as Questions and Advertisements.
• Step 2.5: Whitespace Normalization and Compacting.</p>
        <p>Extraneous whitespace, tab characters, and newline symbols were consolidated into single
spaces using re.sub(r'\s+', ' ', text). This ensured consistent token spacing prior to
tokenization.
• Step 2.6: Low-Quality and Gibberish Filtering. Posts with fewer than fifty tokens, repetitive
character patterns (e.g., “hahahahahahaha”), or alphabetic ratios below 30% were categorized
as noise. This step was critical for preserving meaningful linguistic structure in downstream
learning.</p>
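        <p>A sketch of the content-level cleaning is given below; the regular expressions and thresholds mirror Steps 2.1–2.6 but are illustrative rather than the exact production patterns.</p>
        <p>import re

def clean_text(text):
    text = re.sub(r"http\S+|www\S+", "", text)            # Step 2.1: strip URLs
    text = re.sub(r"[@#]\w+", "", text)                   # Step 2.2: drop mentions and hashtags
    text = re.sub("[\U0001F300-\U0001FAFF]", "", text)    # Step 2.3: remove emoji / pictographs
    text = re.sub(r"[^\w\s?!]", "", text)                 # Step 2.4: keep word characters, "?" and "!"
    text = re.sub(r"\s+", " ", text).strip()              # Step 2.5: collapse whitespace
    return text

def looks_like_noise(text, num_tokens, min_tokens=50, min_alpha_ratio=0.3):
    # Step 2.6: low-quality / gibberish heuristics following the thresholds described above
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    repeated = re.search(r"(.)\1{6,}", text) is not None
    return num_tokens &lt; min_tokens or alpha_ratio &lt; min_alpha_ratio or repeated</p>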
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Phase 3: Text Normalization for Model Readiness</title>
        <p>The final phase standardized the cleaned text to ensure compatibility with transformer-based
tokenization and to preserve semantic cues necessary for hierarchical classification.</p>
        <p>• Step 3.1: Case Normalization.</p>
        <p>All text was converted to lowercase, ensuring that tokens such as “Great” and “great” are treated
identically by the tokenizer, thereby reducing vocabulary sparsity.
• Step 3.2: Domain-Specific Token Preservation.</p>
        <p>Certain lexical items indicative of advertisement or spam intent (e.g., “offer”, “discount”,
“subscribe”) were deliberately retained, as they provide discriminative signals for the Advertisement
subclass.
• Step 3.3: Final Text Validation.</p>
        <p>Each cleaned entry was verified to contain at least one alphabetic token after normalization. The
finalized corpus was then stored as (text, label) pairs, ready for tokenization and batching
in the fine-tuning pipeline.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Phase 4: Final Formatting</title>
        <p>The final stage focused on ensuring label consistency, interpretability, and compatibility of the dataset
with the model’s classification head. This phase bridged the cleaned textual data and the numerical
representations required for supervised fine-tuning.</p>
        <p>• Step 4.1: Hierarchical Label Consolidation.</p>
        <p>The original annotation schema spanned the three hierarchical levels described in Section 3.
To ensure unified supervision for the classification head, these labels were flattened into a single
categorical space, resulting in an eight-class system encompassing all terminal categories.
• Step 4.2: Numeric Label Encoding.</p>
        <p>Each unique label was assigned a numeric identifier to facilitate model training. The mapping
followed a deterministic scheme, assigning each of the eight terminal categories a fixed integer identifier in the range 0–7; a sketch of one possible mapping is given below.</p>
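        <p>The following snippet shows one possible realization of this flattening and encoding; the concrete integer assignment used in the original pipeline is not reproduced here, so the mapping and label strings are illustrative.</p>
        <p># Hypothetical flat label map over the eight terminal categories (Section 3)
LABEL2ID = {
    "NOISE": 0, "OBJECTIVE": 1, "POSITIVE": 2, "NEGATIVE": 3,
    "NEUTRAL_SENTIMENT": 4, "QUESTION": 5, "ADVERTISEMENT": 6, "MISCELLANEOUS": 7,
}

def flatten_labels(level1, level2=None, level3=None):
    # Collapse the 3-level annotation into one of the eight flat classes
    if level1 == "NOISE":
        return LABEL2ID["NOISE"]
    if level1 == "OBJECTIVE":
        return LABEL2ID["OBJECTIVE"]
    if level2 in ("POSITIVE", "NEGATIVE"):
        return LABEL2ID[level2]
    return LABEL2ID[level3]   # Subjective + Neutral: use the Level-3 subclass</p>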
        <p>Together, these four phases provide a robust framework for preparing noisy, heterogeneous social
media datasets for large-scale machine learning. By systematically addressing structural errors, content
quality, normalization, and label formatting, the workflow ensures high-quality, standardized inputs for
subsequent experiments.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Data Analysis of Datasets</title>
      <p>Prior to model fine-tuning, an exploratory data analysis was performed to understand the distribution
of text lengths across different social media platforms. The number of tokens per entry, generated after
initial cleaning and tokenization, was used as a proxy for content length and complexity. This analysis
provides insight into potential padding/truncation requirements, the prevalence of extremely short
or long texts, and the overall distributional characteristics that may impact model learning. Separate
analyses were conducted for Reddit, Twitter, and YouTube datasets, as summarized below.
Reddit Dataset</p>
      <p>[Table: token-count summary for the Reddit dataset (5000 entries): mean 186.09, standard deviation 384.33, minimum 4, quartiles 26 / 58 / 264, maximum 15535 tokens per entry.]</p>
      <p>[Table: per-label counts for the Reddit dataset.]</p>
      <sec id="sec-5-1">
        <title>5.1. Dataset Token and Class Distribution Analysis</title>
        <p>Following the cleaning phase, we conducted a detailed analysis of token length and class distributions
across all datasets. This step ensured that the textual data exhibited consistent structural properties,
with minimal variance in sequence lengths and well-defined label proportions. The preprocessing
pipeline effectively removed incomplete and redundant entries, standardized class names, and merged
fragmented textual segments. As a result, the final datasets were more coherent and semantically
interpretable, providing a robust foundation for subsequent model fine-tuning and evaluation.</p>
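        <p>As a small sketch, the token-length and class-distribution summaries reported here can be computed directly from the num_tokens and label columns produced in Phase 1 (column names as assumed earlier):</p>
        <p># Token-length summary (count, mean, std, min, quartiles, max) and class proportions
token_summary = df["num_tokens"].describe()
class_proportions = df["label"].value_counts(normalize=True)
print(token_summary)
print(class_proportions)</p>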
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Related Work</title>
      <sec id="sec-6-1">
        <title>6.1. The Transformer Architecture</title>
        <p>
          The Transformer architecture, introduced in the landmark paper “Attention Is All You Need” [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ],
replaced recurrent and convolutional layers with a self-attention mechanism that enables models to
process entire sequences in parallel. This design allows the model to capture long-range dependencies
more effectively while significantly reducing training time compared to recurrent networks. The
encoder–decoder structure of the Transformer has since become the foundation of nearly all modern
large language models (LLMs).
        </p>
        <p>
          Several extensions have been proposed to improve efficiency and scalability. The Reformer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
introduced locality-sensitive hashing (LSH) attention and reversible residual layers, reducing the
memory footprint and enabling the handling of very long sequences. The Longformer [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] proposed a
sparse attention mechanism, combining local sliding window attention with global tokens, making it
suitable for processing long documents with thousands of tokens.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Pre-training and Fine-tuning</title>
        <p>The paradigm of pre-training followed by fine-tuning has revolutionized natural language processing
(NLP). In this approach, large models are first pre-trained on massive text corpora to learn general
language representations, and then fine-tuned on smaller task-specific datasets to adapt to downstream
applications.</p>
        <p>
          BERT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] demonstrated the effectiveness of bidirectional transformers for pre-training, introducing
masked language modeling and next sentence prediction objectives. RoBERTa [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] improved upon BERT
by training on larger datasets, longer sequences, and removing next sentence prediction, achieving
higher accuracy across multiple benchmarks. T5 (Text-to-Text Transfer Transformer) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] unified a
wide range of NLP tasks under a single “text-to-text” framework, showing that pre-training with a
denoising objective can transfer effectively to diverse applications such as translation, summarization,
and classification.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Instruction Tuning</title>
        <p>Instruction tuning refers to fine-tuning language models on datasets where tasks are framed as natural
language instructions paired with the desired outputs. This makes models more adaptable and improves
their performance in zero-shot and few-shot scenarios.</p>
        <p>
          The FLAN work [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] showed that fine-tuning models on a mixture of instruction-following datasets
improves zero-shot generalization across unseen tasks. InstructGPT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] advanced this approach by
incorporating Reinforcement Learning with Human Feedback (RLHF), aligning model behavior more
closely with human intent and safety considerations. Self-Instruct [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] proposed a scalable method
in which the model itself generates synthetic instruction–response pairs, which are then used for
additional fine-tuning. This reduces the reliance on costly human-labeled instruction datasets.
        </p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Efficiency-Oriented Fine-Tuning Techniques</title>
        <p>To enable fine-tuning of the Gemma model on limited GPU resources without compromising model
fidelity, multiple optimization and stabilization techniques were employed. These strategies collectively
enhanced computational efficiency, reduced memory consumption, and stabilized gradient dynamics
throughout training.</p>
        <p>
          • Step 1: 4-bit Quantization [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] for Memory Optimization.
        </p>
        <p>To fit the large-scale model within the available 24 GB GPU memory, quantization was performed
using the bitsandbytes backend in 4-bit precision. This reduced the memory footprint by
nearly 75% compared to full-precision weights while preserving representational fidelity through
adaptive rounding and group-wise scaling (NF4 quantization). The quantized format allowed
efficient gradient updates and larger batch sizes during fine-tuning.
• Step 2: Eager Attention Mechanism.</p>
        <p>The Gemma architecture was configured to use the Eager Attention backend, which eliminates
redundant CUDA graph recompilation and dynamically optimizes attention computation at
runtime. This approach accelerated forward–backward passes and reduced memory fragmentation
in GPU execution, improving throughput stability during training.
• Step 3: Gradient Clipping for Stabilized Training.</p>
        <p>To mitigate exploding gradients and maintain numerical stability during fine-tuning, gradient
norms were clipped. This ensured that parameter updates remained within a bounded range,
preventing destabilizing gradient spikes in low-precision training regimes.
• Step 4: Adaptive Optimization and Scheduling.</p>
        <p>
          The optimizer was configured as AdamW [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] with a learning rate of 5 × 10⁻⁵ and weight decay
of 0.1 to ensure stable convergence.
        </p>
        <p>A cosine learning rate scheduler with warm-up was employed to gradually increase the learning
rate during the initial phase of training, followed by a smooth cosine decay. The 15% warm-up ratio
mitigated optimization shocks during early epochs and helped the model achieve stable convergence
across 3 training epochs.
• Step 5: Mixed-Precision and Efficient Data Handling.</p>
        <p>The training pipeline utilized mixed-precision (bfloat16/float32) computation to balance
speed and numerical precision. Data loading was parallelized via PyTorch’s DataLoader with
pinned memory and dynamic collation, minimizing CPU–GPU transfer overhead.
Collectively, these strategies enabled efficient fine-tuning of the multi-billion-parameter Gemma model on
a single 24 GB GPU, maintaining stability and performance without requiring full-parameter updates.</p>
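        <p>A sketch of how the quantization and attention settings above can be expressed with Hugging Face transformers and bitsandbytes is shown below; the checkpoint identifier is an illustrative assumption, not the exact configuration used.</p>
        <p>import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Step 1: NF4 4-bit quantization with group-wise scaling and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Step 2: eager attention backend
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",          # illustrative checkpoint id
    quantization_config=bnb_config,
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
)</p>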
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Model Architecture</title>
      <p>Large Language Models (LLMs) like Gemma are inherently designed for generative tasks, yet with
minor architectural adaptations, they can be effectively repurposed for discriminative objectives such
as text classification. In this work, we utilize the Gemma-1B model—a compact, one-billion-parameter
LLM that balances computational efficiency with strong pretrained representational capacity, enabling
feasible fine-tuning on limited hardware resources.</p>
      <p>To adapt Gemma for classification, we replace its original output projection layer—which maps hidden
representations onto a vocabulary of approximately 262,144 tokens—with a custom classification head.
This new head maps the final hidden states to eight target classes corresponding to the hierarchical
opinion dataset. Consequently, the model preserves the expressive power of the pretrained transformer
backbone while aligning its output space with the specific requirements of the downstream task. An
important architectural nuance exists between the smaller Gemma-1B and the larger Gemma-4B
variants. In Gemma-1B, internal components such as transformer blocks, the final LayerNorm, LM
head, and classification head are directly accessible through the model object. In contrast, Gemma-4B
encapsulates these components within an additional container module, altering the access patterns
for model submodules. This distinction is crucial when attaching custom heads, unfreezing layers, or
performing parameter-efficient fine-tuning, as overlooking it can lead to attribute access errors.</p>
      <sec id="sec-7-1">
        <title>Algorithm 1 Gemma-1B Architecture for Text Classification</title>
        <p>Step 2: Transformer Backbone Processing
for l ← 1 to L do
    h_l ← TransformerBlock_l(h_{l−1})   ◁ Uses Gemma3RMSNorm and Gemma3RotaryEmbedding
end for</p>
        <p>Algorithm 2 Model Architecture Specifications</p>
        <p>Algorithm 3 Adaptation from Generative to Discriminative Task
1: Input: Pretrained Gemma-1B model
2: Output: Fine-tuned classification model
3: Step 1: Model Selection
4: Select Gemma-1B (1B params) over larger variants for computational efficiency
5: Preserve pretrained representations while enabling fine-tuning on limited hardware
6: Step 2: Output Layer Replacement
7: Remove original: Linear(1152 → 262144)   ◁ Vocabulary projection
8: Add custom: Sequential(1152 → 8)   ◁ 8-class classification
9: Step 3: Architectural Preservation
10: Maintain transformer backbone: Gemma3RMSNorm, Gemma3RotaryEmbedding
11: Keep internal representations: d_model = 1152
12: Utilize final hidden states for classification
13: Step 4: Model-Specific Access Patterns
14: if using Gemma-1B then
15:     Access components directly: model.component
16: else
17:     Access through wrapper: model.model.component
18: end if
19: Result: Adapted model capable of hierarchical opinion classification</p>
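        <p>A minimal PyTorch sketch of the output-layer replacement in Algorithm 3 is shown below; the two-layer head layout follows the abstract (LayerNorm → Linear → GELU → Linear), while the intermediate width of 512 and the attribute paths are illustrative assumptions rather than the exact implementation.</p>
        <p>import torch.nn as nn

NUM_CLASSES = 8
hidden = model.config.hidden_size            # 1152 for the Gemma-1B configuration described above

# Algorithm 3, Step 2: swap the 262,144-way vocabulary projection for an 8-class head
classification_head = nn.Sequential(
    nn.LayerNorm(hidden),
    nn.Linear(hidden, 512),                  # intermediate width is an illustrative choice
    nn.GELU(),
    nn.Linear(512, NUM_CLASSES),
)
# Algorithm 3, Step 4 (schematic): attribute paths differ between Gemma-1B and Gemma-4B,
# e.g. model.lm_head directly vs. model.model.lm_head behind an extra wrapper module.
model.lm_head = classification_head</p>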
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Training Setup</title>
      <sec id="sec-8-1">
        <title>8.1. Dataset Preparation</title>
        <p>The dataset consists of user-generated text entries, each annotated using a three-level hierarchical label
scheme. Level 1 contains the broad categories: NOISE, OBJECTIVE, and SUBJECTIVE. For entries labeled
as SUBJECTIVE, Level 2 further assigns NEUTRAL, POSITIVE, or NEGATIVE. Finally, for NEUTRAL
instances at Level 2, Level 3 provides more fine-grained labels: QUESTIONS, NEUTRAL SENTIMENTS,
ADVERTISEMENTS, and MISCELLANEOUS.</p>
        <p>To simplify the classification process, we collapsed the original 10-class hierarchy into a flat
8-class label space, as shown in Table 2. This enables single-step classification while preserving the
hierarchical semantics of the original taxonomy.</p>
        <p>To mitigate class imbalance, we computed class-specific weights based on inverse class frequency, as
shown below (Algorithm 4). These weights were later used in the weighted cross-entropy loss function
to penalize misclassification of minority classes more heavily.</p>
        <p>Algorithm 4 Class Weight Computation
1: Class penalty values: [225, 175, 80, 130, 175, 900, 70, 30]
2: Normalize to sum 1 and scale by number of classes
3: Implemented in PyTorch as:
import torch
label_penalty = [225, 175, 80, 130, 175, 900, 70, 30]
label_tens = torch.tensor(label_penalty, dtype=torch.float)
weights = 1 / label_tens
weights = (weights / weights.sum()) * len(label_penalty)</p>
      </sec>
      <sec id="sec-8-2">
        <title>8.2. Preprocessing Steps</title>
        <sec id="sec-8-2-1">
          <title>Algorithm 5 Data Preprocessing Pipeline</title>
          <p>We applied a multi-stage preprocessing pipeline to improve data quality (Algorithm 5). The raw
dataset contained substantial noise from social media sources, including duplicated entries, incomplete
texts, and non-linguistic symbols. To address this, we performed systematic cleaning, filtering, and
consistency checks before model ingestion. Specifically, we removed duplicates, empty texts, and entries
containing only special characters (unless labeled as Noise). We then tokenized using the Gemma-3
tokenizer and discarded extremely long (&gt;2000 tokens) or unrealistically short (&lt;40 tokens) entries
with gibberish content (if not Noise). Finally, the cleaned dataset was split into training (70%), validation
(20%), and test (10%) subsets.</p>
        </sec>
      </sec>
      <sec id="sec-8-3">
        <title>8.3. Model Training</title>
        <p>
          All experiments were implemented using the PyTorch deep learning framework [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. We fine-tuned
the pretrained Gemma model using the AdamW optimizer [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] with a fixed learning rate and weight
decay to prevent overfitting. Gradient clipping was applied during backpropagation to ensure stable
convergence.
        </p>
        <p>
          Rather than updating all model parameters, we adopted a parameter-efficient fine-tuning (PEFT)
strategy [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Specifically, we froze the early transformer layers of the pretrained LLM and only trained
the final transformer block, the layer normalization modules, and a custom classification head. This
selective layer unfreezing significantly reduced the number of trainable parameters while retaining the
model’s representational capacity.
        </p>
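        <p>A sketch of this selective unfreezing is given below; the module paths (model.model.layers, model.model.norm, model.lm_head) are schematic assumptions about the loaded checkpoint's attribute layout, not a verified API.</p>
        <p># Freeze the full backbone, then re-enable only the modules that are fine-tuned
for param in model.parameters():
    param.requires_grad = False

trainable_modules = [
    model.model.layers[-1],   # last transformer block (schematic path)
    model.model.norm,         # final normalization layer (schematic path)
    model.lm_head,            # custom classification head attached earlier
]
for module in trainable_modules:
    for param in module.parameters():
        param.requires_grad = True

num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("trainable parameters:", num_trainable)</p>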
        <p>
          To address class imbalance, we used a weighted cross-entropy loss function [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], where class weights
were derived from inverse class frequencies (Algorithm 4). This encouraged the model to pay more
attention to minority classes and reduced model bias toward dominant labels.
        </p>
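        <p>As a minimal sketch, the class weights from Algorithm 4 plug directly into PyTorch's loss; the weights variable follows the earlier snippet, while logits and labels are placeholders for a batch of model outputs and gold class ids.</p>
        <p>import torch.nn as nn

# weights: tensor of length 8 computed as in Algorithm 4 (inverse penalties, normalized, scaled)
criterion = nn.CrossEntropyLoss(weight=weights)
loss = criterion(logits, labels)   # logits: (batch, 8); labels: (batch,) with class ids 0-7</p>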
        <p>Training was performed using mini-batches of fixed size, and the model was evaluated on the
validation set after each epoch. The best-performing checkpoint was selected based on the validation
macro-F1 score to ensure balanced performance across all classes. All experiments were conducted on a
single GPU, and identical hyperparameters were maintained across runs for fairness and reproducibility.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>9. Data Processing</title>
      <p>Following structural and semantic cleaning, the datasets were transformed into model-compatible
formats for fine-tuning the Gemma architecture. This phase focused on tokenization, sequence length
standardization, label encoding, and dataset splitting for training and evaluation.</p>
      <sec id="sec-9-1">
        <title>9.1. Phase 1: Tokenization and Sequence Preparation</title>
        <p>The cleaned textual corpus was tokenized using Gemma’s native tokenizer to ensure compatibility with
the pretrained embedding space.</p>
        <p>• Step 1.1: Model-Compatible Tokenization.</p>
        <p>Each sentence was tokenized using the AutoTokenizer.from_pretrained utility.
• Step 1.2: Maximum-Length Padding.</p>
        <p>Instead of truncating sequences, the maximum token length across the dataset was computed,
and shorter sequences were padded to this length. This approach ensures that all tokens are
retained while maintaining a uniform input size for the model.</p>
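        <p>A sketch of this padding strategy is shown below, assuming the Hugging Face tokenizer loaded earlier and a list of cleaned texts (variable names are illustrative):</p>
        <p>texts = df["text"].tolist()

# Find the longest tokenized sequence in the corpus and pad every sequence to that length
lengths = [len(tokenizer(t)["input_ids"]) for t in texts]
max_len = max(lengths)

encodings = tokenizer(
    texts,
    padding="max_length",
    max_length=max_len,
    truncation=False,
    return_tensors="pt",
)
# encodings["input_ids"] and encodings["attention_mask"] are uniform (N, max_len) tensors</p>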
      </sec>
      <sec id="sec-9-2">
        <title>9.2. Phase 2: Label Encoding and Dataset Structuring</title>
        <p>Categorical labels were mapped to numerical identifiers and the dataset was organized for supervised
learning.</p>
        <p>• Step 2.1: Label Indexing.</p>
        <p>Hierarchical labels were mapped to integer identifiers (0–7), corresponding to the eight distinct
opinion categories defined in the dataset.
• Step 2.2: Dataset Partitioning.</p>
        <p>The dataset was split into training (70%), validation (20%), and test (10%) subsets using a stratified
sampling approach to preserve class distribution. Prior to training, the data were shuffled
randomly to avoid ordering biases.</p>
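        <p>One possible realization of the 70/20/10 stratified split uses scikit-learn's train_test_split, as sketched below; the original pipeline may have used a different utility, and the random seed is illustrative.</p>
        <p>from sklearn.model_selection import train_test_split

# 70% train; the remaining 30% is split 2:1 into validation (20%) and test (10%),
# stratifying on the flat label to preserve class proportions
train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["label"], shuffle=True, random_state=42
)
val_df, test_df = train_test_split(
    rest_df, test_size=1 / 3, stratify=rest_df["label"], shuffle=True, random_state=42
)</p>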
      </sec>
      <sec id="sec-9-3">
        <title>9.3. Phase 3: Data Loading for Training</title>
        <p>The processed datasets were wrapped into PyTorch DataLoader objects for efficient access during
fine-tuning.</p>
        <p>• Step 3.1: DataLoader Configuration.</p>
        <p>Training, validation, and test sets were loaded with an appropriate batch size. Data shuffling
occurred only once before training, ensuring reproducibility and balanced exposure of samples
without dynamic shuffling at each epoch.</p>
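        <p>A sketch of the loader configuration described above is given below; the batch size of 4 follows the memory fix discussed later (Section 12.2), while the dataset wrapper and worker count are illustrative.</p>
        <p>from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(train_ids, train_mask, train_labels)  # tensors from the tokenization step
train_loader = DataLoader(
    train_dataset,
    batch_size=4,        # reduced from 8 to avoid out-of-memory errors
    shuffle=False,       # data already shuffled once before training, not re-shuffled per epoch
    num_workers=2,
    pin_memory=True,     # faster CPU-GPU transfers
)</p>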
      </sec>
      <sec id="sec-9-4">
        <title>9.4. Data Processing Instruction Fine-Tuning</title>
        <p>For instruction fine-tuning, the model is trained to generate a target output conditioned on a given
prompt and contextual input. This requires precise alignment between the input token sequence and
the corresponding target labels. The methodology implemented in this work is as follows:
• Step 4.1: Tokenization of Prompt-Response Pairs.</p>
        <p>Each sample, consisting of a concatenated prompt and input text (e.g., Text + Comment),
was tokenized using Gemma’s tokenizer. This produced the input token IDs (input_ids) and
attention masks (input_mask) required for model consumption.
• Step 4.2: Sequence Shifting for Autoregressive Learning.</p>
        <p>To enable autoregressive training, the tokenized sequence was shifted by one position to produce
the target tensor (target_ids). Formally, if the original token sequence is
x = [x_0, x_1, . . . , x_{n−1}],
the shifted target sequence is defined as
y = [x_1, x_2, . . . , x_n].</p>
        <p>This shifting ensures that the model predicts the next token at each time step, thereby learning to
generate the label conditioned on the preceding tokens, including both prompt and input text.
• Step 4.3: Masking Non-Label Tokens.</p>
        <p>Since the objective is to compute the loss only on the label portion (e.g., the relevance or answer
tokens), a masking tensor was constructed. For each sequence:
1. The position of the label tokens within the shifted sequence was identified by comparing
the target tensor with the tokenized label.
2. All positions outside the label were set to -100, which is the standard ignore index in</p>
        <p>PyTorch’s CrossEntropyLoss.
3. Positions corresponding to the label remained unmasked, ensuring that the loss is computed
exclusively on the answer tokens.</p>
        <p>This selective masking prevents the model from backpropagating errors over the prompt or
context tokens, focusing learning exclusively on the label generation.
• Step 4.4: Verification of Label Alignment.</p>
        <p>To ensure correctness, each label token sequence was compared with the corresponding segment
in the shifted target tensor. Only sequences with an exact match were retained, and the mask
was applied accordingly. Any mismatches were flagged for inspection to guarantee precise
supervision.</p>
        <p>This approach enables instruction fine-tuning in a controlled manner, ensuring that the model:
1. Learns to predict the target label conditioned on the full prompt and input text.
2. Receives gradient updates only for the label tokens, avoiding spurious updates on non-informative
parts of the input.
3. Maintains the autoregressive property of the LLM, making it compatible with standard causal
language modeling objectives.</p>
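        <p>A sketch of the shifting and masking in Steps 4.2 and 4.3 is given below; it assumes input_ids holds the concatenated prompt and answer tokens with the answer at the end of each (unpadded) sequence, and label_len gives the number of answer tokens per example (both names are hypothetical).</p>
        <p>import torch

IGNORE_INDEX = -100  # positions set to this value are skipped by CrossEntropyLoss

def build_shifted_targets(input_ids, label_len):
    # Step 4.2: shift by one position so the model predicts the next token
    inputs = input_ids[:, :-1]
    targets = input_ids[:, 1:].clone()

    # Step 4.3: mask everything except the trailing answer tokens
    masked_targets = torch.full_like(targets, IGNORE_INDEX)
    for i in range(targets.size(0)):
        n = int(label_len[i])
        masked_targets[i, -n:] = targets[i, -n:]  # loss computed only on the answer tokens
    return inputs, masked_targets</p>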
      </sec>
      <sec id="sec-9-5">
        <title>9.5. Loss Function</title>
        <p>Given the nature of the task—single-label, multi-class classification with eight classes and significant
class imbalance—we employed the Cross-Entropy Loss. This choice was motivated by:
1. Its suitability for multi-class classification tasks.
2. Its support for class weighting, which is essential for imbalanced datasets.</p>
        <p>Handling Class Imbalance: To mitigate imbalance, we computed inverse-frequency class
weights based on the number of samples per class. These weights were incorporated into the
Cross-Entropy Loss to penalize misclassifications of minority classes more heavily.</p>
        <p>
          Algorithm 6 Class Weight Computation
1: Class penalty values based on inverse population frequency:
2: weights ← [225, 175, 80, 130, 175, 900, 70, 30]
3: weights ← weights / sum(weights)   ◁ Normalize to sum 1
4: weights ← weights × 8   ◁ Scale by number of classes
10. Optimizer and Scheduler Configuration
To ensure stable convergence during fine-tuning, optimization and learning rate scheduling were
carefully configured. The AdamW optimizer [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] was employed, which decouples weight decay from
the gradient-based parameter updates, providing better generalization compared to conventional Adam.
The learning rate was set to 5 × 10⁻⁵ with a weight decay of 0.1 to prevent overfitting and stabilize
training dynamics.
        </p>
        <p>A cosine learning rate schedule with warmup was adopted to further enhance training stability.
Specifically, the learning rate was gradually increased during the initial 15% of training steps (warmup
phase), followed by a smooth cosine decay over the remaining steps. This schedule helps the model
transition from the pretrained weights to the downstream classification objective without abrupt
parameter shifts.</p>
        <p>To prevent exploding gradients, gradient norms were clipped to a maximum value of 1.0 using
torch.nn.utils.clip_grad_norm_(). The total training was conducted for three epochs, with
a 70–20–10 train–validation–test split. Data shuffling was applied once prior to training to ensure
randomized sample distribution, while maintaining deterministic ordering during the training iterations.</p>
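        <p>A sketch of this optimizer, scheduler, and clipping setup is given below, using the cosine-with-warmup helper from transformers; the batch unpacking and the use of the final-position logits are schematic assumptions rather than the exact training loop.</p>
        <p>import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=5e-5,
    weight_decay=0.1,
)
num_training_steps = len(train_loader) * 3               # three epochs
num_warmup_steps = int(0.15 * num_training_steps)        # 15% warm-up
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)

for input_ids, attention_mask, labels in train_loader:
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    loss = criterion(logits[:, -1, :], labels)            # schematic: classify from the final position
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()</p>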
        <p>
This configuration ensured smooth optimization, effective regularization, and consistent convergence
across fine-tuning runs, particularly under hardware constraints associated with quantized and partially
unfrozen models.
11. Trainable Parameters
To balance efficiency with performance, we adopted parameter-efficient fine-tuning [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], training
only a subset of the model’s parameters while keeping the rest frozen. This reduces overfitting risks
and lowers computational cost. The configuration is summarized below. The fine-tuning configuration
was designed to enable targeted learning on the downstream classification task while maintaining the
stability of the pretrained layers.
        </p>
        <p>• The last transformer block was unfrozen, allowing adaptation of the model’s highest-level
contextual representations to the task-specific semantic distribution.
• The final normalization layer (LayerNorm) was made trainable to recalibrate hidden state
activations after fine-tuning adjustments.
• The language modeling head (lm_head) was retained as trainable, ensuring that adapted
internal embeddings align effectively with the output representation space.
• A custom classification head was attached to the final hidden representation of the model,
responsible for mapping the 2056-dimensional feature vector to the eight output categories. The
architecture of this head is defined as:</p>
        <p>
          LayerNorm(2056) → Linear(2056, 1024) → GELU → Linear(1024, 512) → GELU → Linear(512, 8)
All remaining transformer blocks were frozen, preserving the pretrained linguistic priors and semantic
knowledge embedded within the model. This selective fine-tuning approach substantially reduced the
number of trainable parameters while maintaining sufficient representational flexibility for effective
adaptation to the hierarchical opinion classification task. The frozen and unfrozen parameters are given in
Table 11.
12. Problems and Fixes
During the course of training and experimentation, several challenges were encountered which
significantly impacted stability, memory usage, and generalization of the model. Below, we describe each
issue in detail along with the remedies applied.
12.1. Gradient Explosion During Early Training
One of the recurring problems was gradient explosion [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], where the magnitude of the gradients grew
uncontrollably during backpropagation. This led to unstable weight updates, often producing NaN
values in the output tensors. Gradient explosion is a well-known issue in deep learning, particularly in
transformer-based models where the depth of the network and large learning rates can amplify unstable
updates.
        </p>
        <p>Fixes:
• Gradient Clipping: We applied gradient clipping with a maximum norm of 1.0, which constrains
the gradients from exceeding a predefined threshold, ensuring stable weight updates.
• Precision Adjustment: Instead of using float32 or float16, we shifted to torch.bfloat16.</p>
        <p>The bfloat16 format ofers a wider dynamic range compared to float16, improving numerical
stability while still reducing memory consumption relative to float32.
12.2. Memory Overflow During Training
When training with float32 precision, the model consistently exceeded the available GPU memory.
This problem was aggravated by the relatively large sequence lengths in our dataset and the quadratic
memory requirement of self-attention.</p>
        <p>Fixes:
• Batch Size Reduction: We reduced the training batch size from 8 to 4. This change effectively
lowered peak GPU memory usage per step, enabling stable training without out-of-memory
(OOM) errors.
12.3. Overfitting to Majority Classes
Another critical issue was the model’s tendency to overfit to high-frequency classes such as Objective
and Noise. While accuracy appeared high, minority classes (Miscellaneous, Advertisements, Questions)
received near-zero F1 scores. This imbalance reflects the skewed label distribution in the dataset, where
rare categories are underrepresented.</p>
        <p>Fixes:
• Weighted Cross-Entropy Loss: We computed class weights inversely proportional to class
frequencies. These weights were integrated into the cross-entropy loss, penalizing
misclassification of rare categories more heavily and forcing the model to learn discriminative features for
minority classes.
12.4. Stagnant Accuracy During Training
At certain stages of training, model accuracy plateaued around 50%, showing no significant improvement
across epochs. This stagnation suggested that the optimization landscape was poorly conditioned, and
the model was unable to escape local minima or saddle points.</p>
        <p>Fixes:
• Layer Normalization: Added layer normalization to stabilize the distribution of activations,
improving convergence.
• Dropout: Introduced dropout layers to reduce overfitting by preventing co-adaptation of neurons.
• Activation Functions: Adjusted activation functions to ensure smoother gradients and mitigate
vanishing/exploding gradient issues.
13. Results and Analysis
We evaluate two distinct experimental settings, each designed to explore different aspects of adapting
Large Language Models (LLMs) for downstream tasks.</p>
        <p>Run 1: Classification Fine-tuning. We directly fine-tune the pretrained Gemma-1B model for
hierarchical text classification. A custom classification head attached to the final transformer block
maps hidden representations to label probabilities. This setup evaluates the model’s ability to perform
end-to-end supervised classification without any task reformulation.</p>
        <p>Run 2: Instruction Fine-tuning. This setup is independent of classification fine-tuning. The model
is trained in an instruction-following format, where each example is a prompt–response pair. The loss
is computed via next-token prediction by shifting target tensors by one position and applying masking
to focus only on answer tokens. This setting measures the model’s alignment with instruction-based
reasoning.</p>
        <p>[Table: per-dataset results at Level 1 and Level 2 (Subjective) for Reddit, Twitter, and YouTube.]</p>
        <p>We evaluate both approaches across three social media platforms (Reddit, Twitter, and YouTube) and
the QnA dataset, each containing 500 randomly sampled posts. The hierarchical label taxonomy consists of the
three levels described in Section 3.
13.1. Comparison and Insights
Compared to Run 1, instruction fine-tuning (Run 2) increased coverage of minority categories such as
Questions and Advertisements, while reducing the dominance of Noise. In QnA, the minority Class 1
proportion rose from 0.25% to 8.89%, highlighting the benefit of instruction tuning for balancing skewed
datasets.
14. Conclusion
In this work, we explored the task of hierarchical opinion classification by adapting Large Language
Models (LLMs) to an 8-class text classification problem derived from a 3-level dataset. We designed and
implemented a parameter-efficient fine-tuning approach using the Gemma model, attaching a custom
classification head and selectively training higher layers to handle limited compute resources.</p>
        <p>Our experiments highlighted both the challenges and opportunities of fine-tuning LLMs for
imbalanced datasets. Issues such as gradient explosion, memory overflow, and overfitting to majority classes
were systematically identified and addressed through techniques like gradient clipping, precision
adjustments, weighted loss functions, and dropout regularization. We further compared direct classification
fine-tuning with instruction tuning, demonstrating that instruction-based training improved coverage
of minority classes and enhanced generalization across datasets.</p>
        <p>Overall, the study underscores the importance of carefully designed preprocessing, balanced training
strategies, and efficient fine-tuning in achieving reliable performance for domain-specific classification
tasks. Future work may focus on extending the approach to larger Gemma variants, experimenting
with more advanced imbalance-handling methods, and evaluating real-world deployment scenarios.
15. Future Works
This study introduces a hierarchical text classification framework that effectively sorts text into multiple
levels of labels. Based on this solid groundwork, there are several promising avenues for future research
to explore in the coming stages:
• Cross-Domain and Cross-Lingual Generalization: The proposed framework can be further
extended to test its effectiveness across different domains, such as news, product reviews, and
social media, or even across languages. These broader studies would provide deeper insights into
how hierarchical sentiment and intent categories transfer across diverse contexts, uncovering
potential challenges and encouraging the design of models that can handle domain shifts or subtle
cross-lingual differences effectively.
• Hierarchical Label Dependency Modeling: Currently, hierarchical levels are modeled in
sequence, but without explicitly maintaining consistency or relationships between levels. Future
research could look into structured prediction methods, probabilistic graphical models, or neural
architectures specifically aimed at capturing these label dependencies, ensuring that Level 2 and
Level 3 predictions stay logically consistent with Level 1 outcomes. These enhancements could
significantly improve both the accuracy and reliability of hierarchical classifications.
• Robustness to Noisy and Imbalanced Data: Since social media and real-world text often
contain a lot of noise, lack context, and exhibit strong class imbalance, future work could explore
self-supervised pretraining, data augmentation, semi-supervised learning, or techniques that
account for uncertainty to increase robustness. Additionally, assessing model performance under
controlled changes could help pinpoint key weaknesses and encourage specific improvements in
data management and model design.
• Temporal and Emergent Patterns in Text: Hierarchical labels can also expose interesting
time-related patterns in sentiment, questions, and advertisements over long periods. Future
research could use this classification framework to investigate emerging trends in public opinion,
the spread of misinformation, or the dynamics of online conversations. These long-term analyses
could yield valuable insights that go beyond traditional static classification metrics.
• Refinement of Evaluation Metrics: Lastly, standard evaluation metrics such as accuracy or
F1-score may not fully reflect the hierarchical or semantic structure of model predictions. Future
work could focus on creating or adjusting metrics that consider hierarchical consistency, partial
correctness, or semantic similarity, offering a more detailed and realistic assessment of overall
model performance.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>The authors acknowledge and thank Dr. Dwaipayan Roy from the Department of Computational and
Data Sciences (CDS), Indian Institute of Science Education and Research - Kolkata, for providing the
computing resources for our fine-tuning runs.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used OpenAI’s ChatGPT-5 and Google’s Gemini 2.5
Pro in order to perform grammar and spelling checks. After using these tool(s)/service(s), the author(s) reviewed
and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          , Gemma
          <volume>3</volume>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2503.19786.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Adhikary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Banerji</given-names>
            <surname>Seal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          , IISERK@ToT_2024:
          <article-title>Query reformulation and layered retrieval for tip-of-tongue items</article-title>
          ,
          <source>in: Proceedings of the Thirty-Third Text REtrieval Conference (TREC</source>
          <year>2024</year>
          ),
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <year>2024</year>
          . URL: https://trec.nist.gov/pubs/trec33/papers/IISER-K.
          <article-title>tot</article-title>
          .pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kitaev</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levskaya</surname>
          </string-name>
          ,
          <article-title>Reformer: The efficient transformer</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>Longformer: The long-document transformer</article-title>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>05150</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT</article-title>
          ),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          ,
          <source>arXiv preprint arXiv:1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Finetuned language models are zero-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2109.01652</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kelton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kordi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Self-instruct: Aligning language models with self-generated instructions</article-title>
          ,
          <source>arXiv preprint arXiv:2212.10560</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belkada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>LLM.int8(): 8-bit matrix multiplication for transformers at scale</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <source>arXiv preprint arXiv:1711.05101</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          , et al.,
          <article-title>Pytorch: An imperative style, high-performance deep learning library</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>8026</fpage>
          -
          <lpage>8037</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mangrulkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belkada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bossan</surname>
          </string-name>
          , et al.,
          <article-title>PEFT: State-of-the-art parameter-efficient fine-tuning methods</article-title>
          , https://github.com/huggingface/peft,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <source>Deep Learning</source>
          , MIT Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>