<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Modern Data Science Technologies Doctoral Consortium, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Syntax-aware tokenizer for Go code style analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrii Berko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladyslav Alieksieiev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrii Holovko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 S. Bandera str., Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>15</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>We developed a syntax-aware tokenizer based on Go abstract syntax trees (ASTs) for a Llama 3.2 model and compared its performance against the model's standard sub-word tokenization. Our experiments show that this syntax-aware approach substantially improves model performance in detecting style violations when fine-tuning is restricted to the embedding layer and classification head. However, this advantage diminishes as additional transformer layers are fine-tuned, with standard tokenization eventually outperforming the syntax-aware variant due to its better alignment with pretrained knowledge. These findings highlight an important trade-off: syntax-aware tokenization works best in scenarios requiring minimal adaptation, whereas standard tokenization provides better performance when deep fine-tuning is feasible. Future research should focus on optimizing syntax-aware methods by improving AST-to-token mapping, fine-tuning embeddings only for newly introduced tokens, and increasing their robustness for practical applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Syntax-aware tokenization</kwd>
        <kwd>code style analysis</kwd>
        <kwd>Go programming language</kwd>
        <kwd>multi-label classification</kwd>
        <kwd>large language models</kwd>
        <kwd>tokenizer design</kwd>
        <kwd>abstract syntax trees</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The importance of maintaining a consistent code style across a codebase is widely recognized. This
consistency is especially critical when multiple developers are working on the same project. A clear
and uniform style helps developers more easily understand and review code, improves teamwork,
and reduces misunderstandings. When code follows established formatting and naming conventions,
identifying the program structure and logic becomes simpler, allowing developers to concentrate on
functionality rather than syntax or layout. Reducing cognitive overhead not only speeds up
comprehension [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], but can also minimize the likelihood of errors during development. Recent studies confirm
that maintaining consistent naming, spacing, and formatting significantly enhances team collaboration
and improves overall code quality [2, 3]. By adopting uniform coding standards, software teams can
greatly increase the quality of their code and the efficiency of their development processes.
      </p>
      <p>A consistent programming style may also serve as a behavioral fingerprint, notably improving the
accuracy of plagiarism detection [4], authorship attribution [5], and even firmware analysis [6].</p>
      <p>Many companies and open-source projects enforce specific coding standards, such as Google’s style
guides for Java and Python, to ensure uniformity across their codebases. The Go programming language
goes a step further by including a built-in formatter, gofmt, which automatically formats code into
a canonical style, removing debates over minor style choices. Similarly, widely adopted tools like
Prettier for JavaScript and Ruff for Python automatically enforce a consistent coding style across various
projects.</p>
      <p>Code-focused large language models (LLMs), such as CodeLlama [7], StarCoder [8], and
DeepSeek-Coder [9], have shown great effectiveness in software engineering tasks, including code generation
and analysis. However, even in the era of LLMs, readability remains essential to software development.
Atlassian, known for popular development tools like Jira, Confluence, and Bitbucket, demonstrated that
LLMs could be effectively integrated into software workflows without compromising code quality [10].</p>
      <p>Although the long-term influence of LLMs on software maintainability is still being explored, initial
studies are encouraging. In a recent randomized controlled trial (RCT) with 151 professional developers,
Borg et al. showed that AI-assisted coding tools such as GitHub Copilot enabled developers to complete
programming tasks roughly 40% faster without degrading code quality or maintainability. In fact, the
resulting code appeared slightly more maintainable and stylistically consistent, making it easier to hand
over, extend, and support [11].</p>
      <p>With the increasing integration of AI into software development, our research seeks to explore
potential improvements in the training and adaptation pipeline for LLMs. The goal is to enable these
models to assist developers not only in writing syntactically correct code but also stylistically consistent
and personalized code. In earlier work, we examined the impact of dataset size on fine-tuning LLMs to
classify Python code as compliant or non-compliant with a specific PEP-8 rule [12]. In the current
study, we focus on another critical aspect of the LLM ecosystem: the tokenizer.</p>
      <p>Tokenization is an essential initial step in preparing text, including the source code, for analysis by
language models. It involves breaking raw text into smaller, manageable units called tokens. While
simple tokenization methods, such as splitting text by whitespace or punctuation, can be sufficient for
basic tasks, they often rely on regular expressions and lack the flexibility needed for complex applications.
More advanced systems, including LLMs, typically employ sub-word tokenization algorithms like
Byte-Pair Encoding (BPE) [13]. BPE constructs a vocabulary by iteratively merging frequently occurring
character sequences, effectively balancing vocabulary size and handling out-of-vocabulary tokens.</p>
      <p>However, an alternative approach is to recognize that code typically follows formal rules, so capturing
its syntactic structure—rather than treating it as a flat sequence of tokens—might be beneficial. Dagan et
al. demonstrated that adopting domain-specific tokenizers can yield significant efficiency improvements
without sacrificing performance or quality [14]. One effective method involves using abstract syntax
trees (ASTs), hierarchical representations that encode the structural syntax of programming languages,
thus providing a richer basis for model training or fine-tuning.</p>
      <p>In this study, we specifically look into whether adding syntax awareness through AST-based
tokenization makes it easier for a pretrained model to identify multiple style violations and whether
these benefits persist when fine-tuning deeper model layers. We expand the study’s scope to the Go
programming language and reformulate the task as a multi-label classification problem. By utilizing
ASTs to preserve the semantic and formal properties of Go code, we want to enhance the model’s
performance in identifying style violations.</p>
      <p>The ultimate aim of our recent research is to better understand strategies for adapting LLMs to
the task of code style analysis, contributing to the broader goal of improving software quality and
reliability.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Research on using LLMs to support code linting and assist developers in detecting potential issues, such
as memory leaks, was conducted by Holden and Kahani [15]. They trained two LLM-based classifiers
on a dataset of code snippets: one classifier identified the presence of issues, while the other classified
the type of issues. Their experiments demonstrated high accuracy (84.9% for issue detection and 83.6%
for issue classification) and showed that the approach was significantly faster than traditional tools,
highlighting the efficiency and versatility of LLMs for code linting tasks.</p>
      <p>Han et al. [16] investigated various program representation models, which convert code snippets
into numerical embeddings capturing their semantic meaning. They compared six models based on
ASTs with two simpler text-based models. Evaluation across code classification, clone detection, and
code search tasks revealed no single AST-based model consistently outperformed text-based models.
However, specific AST-based approaches showed superior performance on particular tasks, emphasizing
the importance of both textual and structural code representations.</p>
      <p>Practical applications of AST-based methods were explored in works on AstBERT [17] and
AST-T5 [18]. AstBERT integrates ASTs to effectively capture code structure and semantics. Trained on
a large corpus of Java and Python code, AstBERT demonstrated strong performance in tasks like
code question answering, clone detection, and refinement, underscoring the value of structural data
in improving code comprehension. AST-T5 employs structure-aware pretraining strategies, such as
AST-aware Segmentation and AST-aware Span Corruption. This model achieved improved structural
coherence and integrity in code generation, significantly outperforming some larger models without
complex architectural modifications or expensive analyses.</p>
      <p>Further developments in structure-aware and grammar-augmented code generation include
StructCoder [19] and SynCode [20]. StructCoder incorporates both ASTs and Data Flow Graphs (DFGs) into
its encoder and introduces new auxiliary decoding tasks. This method produces syntactically correct
and semantically accurate code, achieving state-of-the-art results on various benchmarks. SynCode
implements a grammar-guided decoding strategy that greatly reduces syntax errors in the generated
code, particularly for languages with limited training data. These results highlight the benefits of
grammar-aware techniques in code generation.</p>
      <p>The value of grammar-based code representations for LLMs was explored by Liang et al. [21].
The authors examined whether incorporating grammar rules into LLMs remains beneficial, despite
billion-scale models typically generating syntactically correct code. Their GrammarCoder models use
grammar-based representations to significantly improve accuracy in code generation tasks. The study
concluded that grammar-based representations enable LLMs to detect subtle semantic differences more
effectively, confirming their continued usefulness beyond basic syntax error prevention.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section describes our methodology, covering dataset construction, tokenizer modifications, model
architecture, and the experimental setup. Our primary goal is to evaluate the impact of syntax-aware
tokenization on detecting style violations in Go code.</p>
      <p>We selected the pretrained meta-llama/Llama-3.2-1B model as our baseline. This model was
then fine-tuned on a curated dataset of Go code snippets using two distinct tokenization strategies: a
standard approach and our proposed syntax-aware method.</p>
      <p>All code for data preprocessing, model fine-tuning, and analysis is publicly available in the project’s
GitHub repository1.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>Our study uses a dataset derived from the Go subset of “The Stack v2” [8], a comprehensive code
corpus from the BigCode Project2. From over 120,000 Go code samples, we extracted 300 representative
snippets for each of eight style rules defined by the go-critic linter’s “style” group3, resulting in a
multi-label dataset. We made this dataset available on Hugging Face [22].</p>
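        <p>For reference, the published dataset can be loaded directly with the Hugging Face datasets library, as sketched below. The split names are an assumption about the published configuration and follow the partitioning described later in this section.</p>
        <p># Minimal sketch: loading the published go-critic-style dataset.
# Split names ("train"/"validation"/"test") are assumed to match the partitioning described in Section 3.1.
from datasets import load_dataset

ds = load_dataset("aholovko/go-critic-style")
print(ds)              # available splits and features
print(ds["train"][0])  # one code snippet with its multi-label target</p>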
        <p>Table 1 shows eight selected style rules, their descriptions, and label indices. They generally advocate
for replacing certain code constructs with alternatives that are more idiomatic or simpler. The linter's
authors enable most of these rules by default and consider them non-opinionated, meaning the majority
of Go developers would agree they improve code style.</p>
        <p>This focus on idiomatic and simple constructs aligns with broader research into code legibility. For
instance, one systematic literature review identified thirteen formatting factors (including indentation,
spacing, block delimiters, line length, and identifier naming conventions) whose impacts on code
comprehension have been empirically studied [23]. Notably, while this review highlighted statistically
significant legibility improvements for some factors, such as proper indentation, its findings for others,
particularly concerning formatting layouts or identifier styles, were often divergent or inconclusive.</p>
        <sec id="sec-3-1-1">
          <title>1https://github.com/aholovko/go-ast-tokenizer 2https://www.bigcode-project.org 3https://go-critic.com/overview.html#checkers-from-the-style-group</title>
          <p>Table 1: Selected style rules from the go-critic "style" group
assignOp: Simplifiable assignments via operators
builtinShadow: Shadowing predeclared identifiers
captLocal: Capitalized local variable names
commentFormatting: Non-idiomatic comments
elseif: Nested if statements replaceable by else-if
ifElseChain: Repeated if-else statements replaceable by switch
paramTypeCombine: Function parameters combinable by type
singleCaseSwitch: Switch statements replaceable by if</p>
          <p>Although label sampling was uniform, most code snippets analyzed contained only one or two distinct
violations, resulting in high label sparsity within our dataset. Figure 1 presents the label cardinality
distribution and the co-occurrence matrix, which illustrate these inter-label relationships.
Figure 1: (a) label cardinality; (b) co-occurrence matrix.</p>
          <p>Label cardinality statistics confirm that the dataset has significant sparsity. Out of 2,206 code snippets,
2,038 contain exactly one style violation, 144 have two, and only 24 contain three or more (Figure 1a).
This means that only 1.1% of samples contain more than two violations, leading to highly sparse
multi-label target vectors where most entries are zero. Such sparsity complicates classifier learning and
increases the risk of overfitting—especially when fine-tuning models with a large number of parameters.
These observations motivated the use of stratified sampling and informed our selection of evaluation
metrics, ensuring robustness under imbalanced label combinations.</p>
          <p>A closer examination of label co-occurrence patterns reveals non-trivial relationships between
certain style rules (Figure 1b). For instance, paramTypeCombine frequently appeared alongside
commentFormatting (40 cases), ifElseChain (22), and assignOp (16). This suggests that
combinable parameter declarations often co-occur with broader structural or stylistic issues. In contrast,
rules like elseif and singleCaseSwitch were largely isolated, with minimal overlap, likely due to
their narrower syntactic scope.</p>
          <p>We partitioned the dataset into training, validation, and test subsets using a 70%-10%-20% split,
respectively. This was achieved using the multi-label stratified shuffle method [24] to preserve the label
distribution across all sets.</p>
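          <p>A minimal sketch of this splitting step is shown below. It assumes the iterative-stratification package as the implementation of [24]; the helper is illustrative rather than the exact pipeline code.</p>
          <p># Sketch: multi-label stratified 70%/10%/20% split (assumes the iterative-stratification package).
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

def stratified_split(X, y, seed=42):
    # X: array of code snippets, y: binary label matrix of shape (n_samples, 8).
    X, y = np.asarray(X), np.asarray(y)
    outer = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
    train_val_idx, test_idx = next(outer.split(X, y))
    # 12.5% of the remaining 80% equals 10% of the full dataset.
    inner = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.125, random_state=seed)
    train_idx, val_idx = next(inner.split(X[train_val_idx], y[train_val_idx]))
    return train_val_idx[train_idx], train_val_idx[val_idx], test_idx</p>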
          <p>The data preparation pipeline integrated Go and Python components. A Go-based checker, leveraging
the go-critic linter, was responsible for identifying and recording style rule violations within each code
snippet. A Python wrapper orchestrated data ingestion, filtering, label assignment, stratified splitting,
and the dataset’s publication to the Hugging Face hub.</p>
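          <p>The Python side of this pipeline can be sketched as follows. The checker binary name and its input/output format are assumptions made for illustration; Dataset.from_list and push_to_hub are the datasets APIs used for publication.</p>
          <p># Hypothetical orchestration sketch: the Go checker is assumed to read a snippet on stdin
# and print a JSON list of violated label indices.
import json
import subprocess
from datasets import Dataset

def label_snippet(code):
    out = subprocess.run(["go-critic-checker"], input=code.encode(), capture_output=True, check=True)
    return json.loads(out.stdout)

def build_dataset(snippets):
    records = [{"code": c, "labels": label_snippet(c)} for c in snippets]
    return Dataset.from_list(records)

# build_dataset(snippets).push_to_hub("aholovko/go-critic-style")  # publication step</p>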
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Syntax-aware tokenizer</title>
        <p>The default Llama tokenizer is based on tiktoken and implements Byte-Pair Encoding (BPE) [25, 13].
As illustrated in Figure 2, the following sample Go code snippet is tokenized into a sequence of 19 sub-word tokens,
excluding the special &lt;|begin_of_text|&gt; marker.
package sample

func inc(i int) int {
    i += 1
    return i
}</p>
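        <p>The sub-word tokenization of this snippet can be reproduced with a short script such as the one below; it assumes access to the meta-llama/Llama-3.2-1B checkpoint, and the exact token count depends on the whitespace in the snippet.</p>
        <p># Sketch: counting sub-word tokens for the sample snippet with the standard Llama tokenizer.
from transformers import AutoTokenizer

code = """package sample

func inc(i int) int {
    i += 1
    return i
}"""

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # gated checkpoint; access assumed
ids = tok(code, add_special_tokens=False)["input_ids"]
print(len(ids))
print(tok.convert_ids_to_tokens(ids))</p>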
        <p>To better align tokenization with Go’s syntactic structure, we extended the tokenizer’s vocabulary
by introducing domain-specific tokens representing core language constructs. These include generic
tokens for identifiers, literals, and comments, as well as specific tokens for operators, control flow
keywords, and structural delimiters, as summarized in Table 2.
Table 2: Syntax-aware tokens and their descriptions
&lt;IDENT&gt; : Identifier (variable, function)
&lt;LIT_INT&gt;, &lt;LIT_FLOAT&gt;, &lt;LIT_CHAR&gt;, &lt;LIT_STRING&gt; : Literals
&lt;COMMENT&gt; : Code comments
&lt;ASSIGN_OP&gt;, &lt;BINARY_OP&gt; : Operators
&lt;IF&gt;, &lt;ELSE&gt;, &lt;SWITCH&gt;, &lt;CASE&gt;, &lt;FUNC&gt;, etc. : Specific Go keywords
&lt;LBRACE&gt;, &lt;RBRACE&gt;, &lt;LPAREN&gt;, &lt;RPAREN&gt;, &lt;COLON&gt; : Delimiters</p>
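        <p>On the Python side, registering these tokens amounts to extending the tokenizer vocabulary and resizing the model's embedding matrix, as sketched below with the standard transformers APIs; the token list is abbreviated and access to the base checkpoint is assumed.</p>
        <p># Sketch: adding syntax-aware tokens and resizing the embedding matrix (token list abbreviated).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

SYNTAX_TOKENS = [
    "&lt;IDENT&gt;", "&lt;LIT_INT&gt;", "&lt;LIT_STRING&gt;", "&lt;COMMENT&gt;",
    "&lt;ASSIGN_OP&gt;", "&lt;BINARY_OP&gt;", "&lt;IF&gt;", "&lt;ELSE&gt;",
    "&lt;LBRACE&gt;", "&lt;RBRACE&gt;", "&lt;LPAREN&gt;", "&lt;RPAREN&gt;", "&lt;COLON&gt;",
]

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.2-1B", num_labels=8, problem_type="multi_label_classification")

num_added = tok.add_tokens(SYNTAX_TOKENS)        # new entries appended to the vocabulary
model.resize_token_embeddings(len(tok))          # adds matching rows to embed_tokens</p>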
        <p>We implemented this syntax-aware tokenization in Go, utilizing the standard go/ast package.
Source code snippets were first parsed into abstract syntax trees, from which tokens were then extracted
based on their syntactic roles. The resulting token streams were passed to the Python-based training
pipeline via a C-shared library interface. Figure 3 shows an example of this tokenization applied to the
sample code.</p>
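        <p>The bridge between the Go tokenizer and the Python pipeline can be sketched as below. The shared-library name, the exported function, and the space-separated output format are assumptions; the Go side is assumed to be built with go build -buildmode=c-shared.</p>
        <p># Hypothetical sketch of calling the Go tokenizer through a C-shared library via ctypes.
import ctypes

lib = ctypes.CDLL("./libgotokenizer.so")      # assumed artifact of: go build -buildmode=c-shared
lib.TokenizeGo.argtypes = [ctypes.c_char_p]
lib.TokenizeGo.restype = ctypes.c_char_p

def syntax_tokens(source):
    raw = lib.TokenizeGo(source.encode("utf-8"))
    return raw.decode("utf-8").split()        # e.g. ["&lt;FUNC&gt;", "&lt;IDENT&gt;", "&lt;LPAREN&gt;", ...]</p>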
        <p>While our approach involves directly adding new tokens to the vocabulary and fine-tuning their
embeddings, it relates to broader challenges in adapting pretrained language models to new tokenization
schemes. The Zero-Shot Tokenizer Transfer (ZeTT) method by Minixhofer et al. [26] addresses such
problems by introducing a hypernetwork that predicts embeddings for new tokenizers without requiring
additional pretraining. Although ZeTT primarily targets natural language tasks, its core concept—
decoupling the tokenizer from the pretrained model via flexible embedding projection—resonates with
our objective of enhancing syntactic awareness. Our method differs by explicitly learning embeddings
for new syntactic tokens during fine-tuning. However, future work could explore ZeTT-inspired
strategies to potentially reduce retraining overhead or better align syntax-aware embeddings with the
pretrained model’s representational space.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experiments</title>
        <p>We evaluated the impact of syntax-aware versus standard tokenization through three fine-tuning
configurations:
• Experiment 1: Fine-tuning the classification head only. For the syntax-aware tokenizer, token
embeddings were also fine-tuned.
• Experiment 2: Fine-tuning the classification head and the final transformer layer (layer 15).
• Experiment 3: Fine-tuning the classification head and the last four transformer layers (layers 12
through 15).
Listing 1 illustrates the Llama-based architecture implemented for sequence classification, highlighting
the added classification head (score); a sketch of the corresponding parameter-freezing scheme follows the listing.</p>
        <p>LlamaForSequenceClassification(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (score): Linear(in_features=2048, out_features=8, bias=False)
)
Listing 1: Llama-based architecture with a classification head for multi-label style violation detection</p>
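        <p>A sketch of the parameter-freezing scheme behind the three configurations is given below; module names follow Listing 1, and the helper is illustrative rather than the exact training code.</p>
        <p># Illustrative freezing scheme for the three experiments (module names as in Listing 1).
def configure_trainable(model, experiment, syntax_aware):
    for p in model.parameters():
        p.requires_grad = False                          # freeze everything first
    for p in model.score.parameters():                   # classification head is always trained
        p.requires_grad = True
    if syntax_aware and experiment == 1:
        for p in model.model.embed_tokens.parameters():  # token embeddings (Experiment 1, syntax-aware)
            p.requires_grad = True
    if experiment == 2:
        unfrozen = model.model.layers[15:]               # final transformer layer
    elif experiment == 3:
        unfrozen = model.model.layers[12:]               # layers 12 through 15
    else:
        unfrozen = []
    for layer in unfrozen:
        for p in layer.parameters():
            p.requires_grad = True</p>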
        <p>Model selection for each experiment was based on the highest macro-averaged F1 score achieved
on the validation set during training [27]. Final performance evaluations were then conducted on the
held-out test dataset.</p>
        <p>Training and evaluation were performed on a system equipped with an NVIDIA L40S Tensor Core
GPU (48 GB VRAM), 16 virtual CPUs, and 128 GB of RAM, optimized for deep learning tasks [28]. We
utilized PyTorch Lightning as the training framework [29].</p>
        <p>The loss function used was BCEWithLogitsLoss, which is appropriate for multi-label classification
tasks. We did not apply class weights, as our data collection strategy ensured individual labels have
a balanced representation across the dataset. However, the instance-level sparsity requires a careful
approach to performance evaluation. In Section 3.1, we noted that most code snippets contain only one
or two style violations, and none exceed four. As a result, our multi-label classification task produces
very sparse output vectors, with the vast majority of labels set to zero. This sparsity can mislead
standard evaluation metrics. For instance, when we measured overall accuracy, the fine-tuned model
appeared to jump from a 48% baseline to 85%, suggesting excellent performance. In reality, the model
had simply learned to predict all-zero labels, and the way multi-label accuracy is computed meant
that this trivial “all negatives” strategy scored deceptively high. By contrast, the F1 score remained
below 0.1 throughout training, revealing that the model had not learned to identify any real violations
but had instead defaulted to predicting none at all. To account for this sparsity, we therefore used
macro-averaged Precision, Recall, and F1 score metrics.</p>
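        <p>The sketch below illustrates both points: the training objective is BCEWithLogitsLoss, and on targets of comparable sparsity a trivial all-negative prediction scores high element-wise accuracy while its macro-averaged F1 score is zero (sklearn is used here for the metric; the simulated label density is illustrative).</p>
        <p># Sketch: BCE loss for multi-label training, and why accuracy is misleading on sparse targets.
import numpy as np
import torch
from sklearn.metrics import f1_score

loss_fn = torch.nn.BCEWithLogitsLoss()                  # multi-label training objective

rng = np.random.default_rng(0)
y_true = (rng.random((1000, 8)) &lt; 0.15).astype(int)     # roughly one violation per snippet
y_pred = np.zeros_like(y_true)                          # trivial "all negatives" classifier

print((y_true == y_pred).mean())                        # element-wise accuracy: deceptively high
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # macro F1: 0.0</p>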
        <p>Each training session ran for 10 epochs with a batch size of 4 (Table 3), a configuration chosen to
operate within the memory capacity of a single 48 GB NVIDIA L40S GPU without requiring gradient
accumulation.</p>
        <p>Embeddings for the newly introduced syntax-aware tokens were initialized following the default
strategy of Hugging Face’s transformers library. This strategy, based on work by Hewitt [30],
involves sampling new embeddings from a multivariate normal distribution parameterized by the mean
and covariance of the existing pretrained embedding matrix. Such an approach aims to align new
embeddings with the established representational space, potentially improving stability and accelerating
convergence. While we did not explore semantically targeted averaging for initialization (e.g., combining
embeddings of if and else to initialize a generic &lt;IF&gt; token), such task-specific strategies are reserved
for future work.</p>
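        <p>A manual version of this initialization, in the spirit of Hewitt [30], is sketched below; it assumes the new rows have just been appended by resize_token_embeddings and fits a normal distribution to the pretrained embedding matrix.</p>
        <p># Sketch: initializing newly appended embedding rows from the mean and covariance
# of the pretrained embeddings (a small diagonal term keeps the covariance well-conditioned).
import torch

@torch.no_grad()
def init_new_embeddings(model, num_new_tokens):
    emb = model.get_input_embeddings().weight            # (vocab_size, hidden_dim)
    old = emb[:-num_new_tokens].double()
    mean = old.mean(dim=0)
    cov = torch.cov(old.T) + 1e-5 * torch.eye(old.shape[1])
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    emb[-num_new_tokens:] = dist.sample((num_new_tokens,)).to(emb.dtype)</p>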
        <p>To manage GPU memory constraints during fine-tuning, we employed several optimization
techniques:
• Mixed precision training using “bf16-mixed” to balance computational precision with memory
footprint [31];
• Gradient checkpointing to reduce memory by recomputing forward activations during
backpropagation instead of storing them;
• KV-cache disabling to avoid storing large key/value tensors during the forward pass.</p>
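        <p>These settings map onto the training setup roughly as sketched below; only the memory-related options are shown, and lit_module/dm stand for the task-specific LightningModule and DataModule, which are not reproduced here.</p>
        <p># Sketch: memory-saving configuration (transformers model + PyTorch Lightning trainer).
import lightning as L

model.config.use_cache = False            # disable the KV cache during the forward pass
model.gradient_checkpointing_enable()     # recompute activations during backpropagation

trainer = L.Trainer(
    precision="bf16-mixed",               # mixed precision training
    max_epochs=10,
    accelerator="gpu",
    devices=1,
)
# trainer.fit(lit_module, datamodule=dm)</p>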
        <p>Finally, we deliberately decided to use full fine-tuning rather than parameter-efficient fine-tuning (PEFT) methods
(e.g., LoRA, QLoRA). This choice ensures that all model weights, including those potentially affected
by the introduction of syntax-aware tokenization, were fully updated and adapted during the training
process [32].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we present the outcomes of our experiments, comparing the performance of standard
and syntax-aware tokenization across different fine-tuning configurations. All reported F1, precision,
and recall scores are macro-averaged.</p>
      <sec id="sec-4-1">
        <title>4.1. Experiment 1: Fine-tuning classification head</title>
        <p>In the first experiment, where only the classification head (and embeddings for the syntax-aware
tokenizer) were fine-tuned, the syntax-aware tokenizer achieved a considerably higher validation F1
score than the standard tokenizer. As shown by the training dynamics in Figures 4 and 5, the validation
F1 score for the syntax-aware tokenizer reached 0.282 (precision: 0.535, recall: 0.208) by epoch 5,
compared to the best value of 0.056 for the standard tokenizer in epoch 10.</p>
        <p>The model using the standard tokenizer showed minimal reduction in training loss and did not
generalize well to the validation set (validation F1 score ≤ 0.056).</p>
        <p>For the syntax-aware model, overfitting became apparent after epoch 5, with training loss decreasing
towards zero while validation loss began to increase (Figure 5a). This suggests that further improvements
could be achieved through early stopping, regularization techniques (e.g., dropout, weight decay), or
dataset augmentation. We did not apply such optimizations, as the primary objective was to compare
tokenizer performance under equivalent fine-tuning setups. Nevertheless, these techniques, as well as
advanced loss functions tailored for multi-label scenarios [33], could be explored in future work.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment 2: Fine-tuning with transformer layer 15 unfrozen</title>
        <p>Unfreezing the final transformer layer (layer 15) alongside the classification head led to diferent
outcomes for the two tokenizers, as shown in Figure 6. The standard tokenizer’s performance improved
substantially, achieving a validation F1 score of 0.515. The syntax-aware model also improved but
achieved a comparatively lower validation F1 score of 0.318.</p>
        <p>Figure 6: (a) standard tokenizer; (b) syntax-aware tokenizer.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experiment 3: Fine-tuning with layers 12–15 unfrozen</title>
        <p>With deeper fine-tuning involving layers 12 through 15 (Figure 7), the standard tokenizer showed
further improvement, reaching a validation F1 score of 0.728. The syntax-aware model’s validation F1
score was 0.320 in this configuration, indicating no significant improvement over Experiment 2 for this
tokenizer.</p>
        <p>Figure 7: (a) standard tokenizer; (b) syntax-aware tokenizer.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation and resource comparison</title>
        <p>On the held-out test set, the syntax-aware tokenizer achieved an F1 score of 0.290 versus 0.066 for the
standard tokenizer in Experiment 1, whereas deeper fine-tuning raised the standard tokenizer to 0.718 while
bringing no comparable gains for the syntax-aware model. These trends were generally consistent across precision
and recall metrics on the test set. The syntax-aware model consistently required more trainable parameters and GPU
memory across all experimental setups.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Ablation study</title>
      <p>We performed an ablation study to separately assess the impact of syntax-aware tokenization and
embedding adaptability on model performance. All experiments used the Llama 3.2-1B architecture,
with fine-tuning limited to the classification head and input embeddings. This setup allowed us to
clearly evaluate the individual and combined effects of tokenization strategy and embedding training.
In the configuration names below, A denotes the standard tokenizer and B the syntax-aware tokenizer, while the suffix 1 indicates frozen and 2 trainable input embeddings. The main insights from this ablation study are:
• Impact of training embeddings (A1 vs. A2): Making the embeddings trainable significantly
boosted performance for the standard tokenizer. The F1 score increased nearly fourfold (from
0.056 to 0.215), precision more than tripled (from 0.167 to 0.547), and recall improved over fourfold
(from 0.033 to 0.142). This highlights the importance of fine-tuning embeddings for effective
learning, especially in sparse multi-label classification tasks.
• Syntax-aware tokenization (A1 vs. B1): Introducing syntax-aware tokens while keeping
embeddings frozen resulted in a moderate F1 score improvement (from 0.056 to 0.070). Recall
increased substantially from 0.033 to 0.456—the highest recall value observed in this ablation
study. Precision also improved, rising from 0.167 to 0.231. These results suggest that syntax-aware
tokens, even without fine-tuning embeddings, enhance the model’s ability to identify violations.
• Combined strategy (A2 vs. B2): The combination of syntax-aware tokenization and trainable
embeddings (B2) yielded the highest overall performance in this study, achieving the best F1 score
(0.282). Compared to the standard tokenizer with trainable embeddings (A2), this configuration
showed comparable precision (0.535 vs. 0.547 for A2) and notably improved recall (0.208 vs.
0.142 for A2). This suggests that syntax-aware tokenization and embeddings fine-tuning are
complementary.
• Impact of training syntax-aware embeddings (B1 vs. B2): Making the embeddings trainable
for the syntax-aware tokenizer significantly improved performance. Precision increased from
0.231 to 0.535. While recall decreased (from 0.456 to 0.208), the substantial gain in precision led
to a much higher F1 score (0.282 for B2 vs. 0.070 for B1). This shift indicates that training the
embeddings helped the model become more discerning, reducing the tendency to over-predict
violations that was apparent with frozen syntax-aware embeddings.</p>
      <sec id="sec-5-1">
        <title>In summary, key takeaways from this ablation study include:</title>
        <p>• Syntax-aware tokenization notably boosts recall due to its targeted representation;
• Embedding adaptation significantly improves precision by refining token semantics for the task;
• Integrating both approaches (B2) provides the most balanced and efective model.</p>
        <p>These findings validate our strategic choice of combining syntax-aware tokenization with embeddings
adaptation, demonstrating its efectiveness when fine-tuning for style violation detection is restricted
to the classification head and input embeddings.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>This study aimed to investigate the impact of syntax-aware tokenization on detecting style violations
in Go code, comparing it against standard sub-word tokenization across various fine-tuning depths.
Our experiments revealed a nuanced relationship: the syntax-aware tokenizer notably outperformed
the standard tokenizer when fine-tuning was limited to the classification head and the embedding
layer (Experiment 1). In this scenario, the syntax-aware tokenizer achieved a test set F1 score of 0.290,
significantly exceeding the standard tokenizer’s score of 0.066. However, it is important to recognize
that the syntax-aware approach fine-tuned the entire embedding layer, updating many more parameters
than the standard tokenizer, which fine-tuned only the classification head. This larger parameter
space, combined with explicitly encoded syntactic information, likely contributed to its initial strong
performance.</p>
      <p>However, this initial advantage diminished as more transformer layers were unfrozen during
fine-tuning (Experiments 2 and 3). The standard tokenizer demonstrated continuous performance
improvements with deeper fine-tuning, ultimately achieving an F1 score of 0.718. In contrast, the syntax-aware
tokenizer’s performance plateaued at approximately 0.31–0.34. This shift suggests that the standard
tokenizer is more effective at leveraging the increased capacity gained from additional unfrozen layers
by adapting its well-established pretrained parameters. Conversely, the syntax-aware tokenizer faced
several challenges. Its extensive set of syntax-based embeddings required more training data or better
optimization strategies. Additionally, the altered sequence structures resulting from syntax-aware
tokenization might have limited the model’s ability to utilize its pretrained knowledge effectively. The
available dataset may also have been insufficient to fully train the new embeddings and adapt the model
appropriately.</p>
      <p>These contrasting outcomes present practical implications for choosing tokenization strategies in
code analysis tasks, such as code linting. Syntax-aware tokenization is particularly beneficial when
fine-tuning is focused primarily on the embedding and classification layers, with minimal adjustments
to deeper transformer layers, as demonstrated in Experiment 1. In this context, syntactic information
provides a clear and valuable signal to the model. However, when resources allow deeper fine-tuning of
additional layers, standard tokenization generally offers greater performance by effectively leveraging
extensive pretrained knowledge. Thus, the key challenge for syntax-aware methods in deeper
fine-tuning scenarios involves efficiently training syntactic token embeddings and effectively adapting the
model to the new input structures.</p>
      <p>The current study has several limitations, which also point to opportunities for future research.
Firstly, the design of our syntax-aware tokenizer—specifically the mapping from abstract syntax tree to
tokens—was relatively basic and could benefit from more advanced or customized AST-based features.
Secondly, our AST-based approach requires fully syntactically correct code, limiting its applicability to
incomplete or erroneous code snippets, which are common in real-world scenarios. Thirdly, certain
experimental choices, made to ensure a fair comparison between tokenizers, may have constrained the
syntax-aware tokenizer’s potential. The observed overfitting in Experiment 1 indicates that standard
regularization methods, such as early stopping, dropout, or weight decay, might enhance performance
if applied. Moreover, specialized loss functions designed for sparse multi-label classification, like those
proposed by Zhang et al. [33], could offer additional improvements.</p>
      <p>Future research should therefore address these limitations and pursue further improvements. One key
direction involves refining the AST-to-token mapping process to generate more efective syntax-aware
tokens. Applying regularization techniques, implementing early stopping, and utilizing specialized
loss functions represent essential next steps for optimizing the fine-tuning process. Integrating new
syntactic embeddings with the model’s pretrained knowledge is another important area. Future studies
could explore more parameter-efficient strategies, such as selectively training embeddings for newly
introduced syntactic tokens instead of retraining the entire embedding layer. Techniques such as
embedding projection or alignment, inspired by advances like Zero-Shot Tokenizer Transfer [26],
could help bridge this gap and reduce retraining efforts. Finally, improving the robustness of
syntax-aware tokenization for real-world scenarios involving syntactically incorrect code is essential. This
could include developing error-tolerant parsing techniques, leveraging partial ASTs, or creating hybrid
approaches that dynamically combine AST-based tokens with traditional sub-word units, greatly
enhancing the applicability and effectiveness of syntax-aware tokenization in diverse programming
environments.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This study demonstrates that syntax-aware tokenization can substantially enhance language model
performance in detecting coding style violations when fine-tuning is limited to the classification
head and input embeddings. Under such conditions, syntax-enriched tokens offer valuable structural
information that accelerates learning by making Go-specific syntactic patterns more explicit to the
model.</p>
      <p>However, this initial advantage diminishes as more layers of the pretrained model are unfrozen for
fine-tuning. Standard tokenization, by better leveraging extensive pretrained knowledge across a
larger number of adaptable layers, tends to achieve better performance in deeper fine-tuning
configurations. This reveals a crucial trade-off: the immediate effectiveness of explicit syntactic information,
which particularly benefits lightweight adaptation, versus the broader adaptive capacity of established
pretrained models when more extensive training is available.</p>
      <p>Therefore, the choice of tokenization for code analysis depends on available resources and desired
fine-tuning depth. Syntax-aware methods are a good starting point for efficient model adaptation
in specific cases. However, future work needs to improve their scalability and how they work with
more broadly trained models. Key areas for development include better embedding alignment and
stronger tokenization models, aiming for code language systems that are both syntax-aware and highly
adaptable.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o to paraphrase and reword text, improve
the writing style, and check grammar and spelling. After using this tool, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>
[2] T. Kanoutas, T. Karanikiotis, A. L. Symeonidis, Enhancing Code Readability through Automated
Consistent Formatting, Electronics 13 (2024). URL: https://www.mdpi.com/2079-9292/13/11/2073.
doi:10.3390/electronics13112073.
[3] W. Zou, J. Xuan, X. Xie, Z. Chen, B. Xu, How does code style inconsistency affect pull
request integration? An exploratory study on 117 GitHub projects, Empirical Software
Engineering 24 (2019) 3871–3903. URL: https://doi.org/10.1007/s10664-019-09720-x. doi:10.1007/
s10664-019-09720-x.
[4] O. Karnalim, G. Kurniawati, Programming Style On Source Code Plagiarism And Collusion
Detection, International Journal of Computing 19 (2020) 27–38. URL: https://computingonline.net/
computing/article/view/1690. doi:10.47839/ijc.19.1.1690.
[5] I. Khomytska, V. Teslyuk, I. Bazylevych, I. Shylinska, Approach For Minimization Of Phoneme
Groups In Authorship Attribution, International Journal of Computing 19 (2020) 55–62. URL:
https://computingonline.net/computing/article/view/1693. doi:10.47839/ijc.19.1.1693.
[6] V. Yatskiv, N. Yatskiv, J. Su, A. Sachenko, Z. Hu, The Use of Modified Correction Code Based on
Residue Number System in WSN, in: Proceedings of the 2013 IEEE 7th International Conference on
Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), Institute of Electrical
and Electronics Engineers, Berlin, Germany, 2013, pp. 513–516. URL: http://ieeexplore.ieee.org/
document/6662738/. doi:10.1109/IDAACS.2013.6662738.
[7] B. Rozière, J. Gehring, F. Gloeckle, et al., Code Llama: Open Foundation Models for Code, 2023. URL: https://arxiv.org/abs/2308.12950. doi:10.48550/ARXIV.2308.12950.
[8] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei,
T. Liu, M. Tian, D. Kocetkov, A. Zucker, Y. Belkada, Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li,
W.-D. Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozhskii, N. O. O. Dade, W. Yu, L. Krauß, N. Jain,
Y. Su, X. He, M. Dey, E. Abati, Y. Chai, N. Muennighof, X. Tang, M. Oblokulov, C. Akiki, M. Marone,
C. Mou, M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, O. Dehaene, N. Patry, C. Xu, J. McAuley,
H. Hu, T. Scholak, S. Paquet, J. Robinson, C. J. Anderson, N. Chapados, M. Patwary, N. Tajbakhsh,
Y. Jernite, C. M. Ferrandis, L. Zhang, S. Hughes, T. Wolf, A. Guha, L. von Werra, H. de Vries,
StarCoder 2 and The Stack v2: The Next Generation, 2024. URL: https://arxiv.org/abs/2402.19173.
doi:10.48550/ARXIV.2402.19173.
[9] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong,
W. Liang, DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise
of Code Intelligence, 2024. URL: http://arxiv.org/abs/2401.14196. doi:10.48550/ARXIV.2401.
14196.
[10] W. Takerngsaksiri, M. Fu, C. Tantithamthavorn, J. Pasuksmit, K. Chen, M. Wu, Code Readability
in the Age of Large Language Models: An Industrial Case Study from Atlassian, 2025. URL:
http://arxiv.org/abs/2501.11264. doi:10.48550/arXiv.2501.11264.
[11] M. Borg, D. Hewett, D. Graham, N. Couderc, E. Söderberg, L. Church, D. Farley, Does
Co-Development with AI Assistants Lead to More Maintainable Code? A Registered Report, 2024. URL: https://arxiv.org/abs/2408.10758. doi:10.48550/ARXIV.2408.10758.
[12] A. Holovko, V. Alieksieiev, Fine-Tuning Large Language Models for Code-Style Analysis: The
Significance of Dataset Size, IJC (2025) 141–147. URL: https://computingonline.net/computing/
article/view/3885. doi:10.47839/ijc.24.1.3885.
[13] R. Sennrich, B. Haddow, A. Birch, Neural Machine Translation of Rare Words with Subword Units,
in: K. Erk, N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics,
Berlin, Germany, 2016, pp. 1715–1725. URL: https://aclanthology.org/P16-1162/. doi:10.18653/
v1/P16-1162.
[14] G. Dagan, G. Synnaeve, B. Rozière, Getting the most out of your tokenizer for pre-training and
domain adaptation, 2024. URL: https://arxiv.org/abs/2402.01035. doi:10.48550/ARXIV.2402.
01035.
[15] D. Holden, N. Kahani, Code Linting using Language Models, 2024. URL: https://arxiv.org/abs/2406.19508. doi:10.48550/ARXIV.2406.19508.
[16] S. Han, D. Wang, W. Li, X. Lu, A Comparison of Code Embeddings and Beyond, 2021. URL:
https://arxiv.org/abs/2109.07173. doi:10.48550/ARXIV.2109.07173.
[17] R. Liang, T. Zhang, Y. Lu, Y. Liu, Z. Huang, X. Chen, AstBERT: Enabling Language Model
for Financial Code Understanding with Abstract Syntax Trees, in: Proceedings of the Fourth
Workshop on Financial Technology and Natural Language Processing (FinNLP), Association for
Computational Linguistics, Abu Dhabi, UAE (Hybrid), 2022, pp. 10–17. URL: https://aclanthology.
org/2022.finnlp-1.2. doi: 10.18653/v1/2022.finnlp-1.2.
[18] L. Gong, M. Elhoushi, A. Cheung, AST-T5: Structure-Aware Pretraining for Code Generation and Understanding, 2024. URL: https://arxiv.org/abs/2401.03003. doi:10.48550/ARXIV.2401.03003.
[19] S. Tipirneni, M. Zhu, C. K. Reddy, StructCoder: Structure-Aware Transformer for Code Generation,
ACM Transactions on Knowledge Discovery from Data 18 (2024) 1–20. URL: https://dl.acm.org/
doi/10.1145/3636430. doi:10.1145/3636430.
[20] S. Ugare, T. Suresh, H. Kang, S. Misailovic, G. Singh, SynCode: LLM Generation with Grammar Augmentation, 2024. URL: https://arxiv.org/abs/2403.01632. doi:10.48550/ARXIV.2403.01632.
[21] Q. Liang, Z. Zhang, Z. Sun, Z. Lin, Q. Luo, Y. Xiao, Y. Chen, Y. Zhang, H. Zhang, L. Zhang, B. Chen,
Y. Xiong, Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?, 2025. URL:
https://arxiv.org/abs/2503.05507. doi:10.48550/ARXIV.2503.05507.
[22] A. Holovko, go-critic-style, 2025. URL: https://huggingface.co/datasets/aholovko/go-critic-style. doi:10.57967/HF/5304.
[23] D. Oliveira, R. Santos, F. Madeiral, H. Masuhara, F. Castor, A systematic literature review on the
impact of formatting elements on code legibility, Journal of Systems and Software 203 (2023)
111728. URL: https://linkinghub.elsevier.com/retrieve/pii/S0164121223001231. doi:10.1016/j.
jss.2023.111728.
[24] K. Sechidis, G. Tsoumakas, I. Vlahavas, On the Stratification of Multi-label Data, in: D. Gunopulos,
T. Hofmann, D. Malerba, M. Vazirgiannis (Eds.), Machine Learning and Knowledge Discovery in
Databases, volume 6913, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 145–158. URL:
http://link.springer.com/10.1007/978-3-642-23808-6_10. doi:10.1007/978-3-642-23808-6_
10.
[25] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten,
A. Yang, A. Fan, A. Goyal, et al., The Llama 3 Herd of Models, CoRR abs/2407.21783 (2024). URL:
https://doi.org/10.48550/arXiv.2407.21783. doi:10.48550/ARXIV.2407.21783.
[26] B. Minixhofer, E. M. Ponti, I. Vulić, Zero-Shot Tokenizer Transfer, 2024. URL: https://arxiv.org/abs/
2405.07883. doi:10.48550/ARXIV.2405.07883.
[27] S. Raschka, Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning,
2020. URL: http://arxiv.org/abs/1811.12808. doi:10.48550/arXiv.1811.12808.
[28] NVIDIA, NVIDIA L40 GPU Accelerator, 2023. URL: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/datasheets/L-40/product-brief-L40.pdf.
[29] W. Falcon, The PyTorch Lightning team, PyTorch Lightning, 2025. URL: https://zenodo.org/doi/10.5281/zenodo.3530844. doi:10.5281/ZENODO.3530844.
[30] J. Hewitt, Initializing New Word Embeddings for Pretrained Language Models, 2021. URL: https:
//nlp.stanford.edu/~johnhew/vocab-expansion.html.
[31] NVIDIA, Train with Mixed Precision, 2023. URL: https://docs.nvidia.com/deeplearning/
performance/mixed-precision-training/index.html.
[32] R. Shuttleworth, J. Andreas, A. Torralba, P. Sharma, LoRA vs Full Fine-tuning: An Illusion of Equivalence, 2024. URL: http://arxiv.org/abs/2410.21228. doi:10.48550/arXiv.2410.21228.
[33] Y. Zhang, Y. Cheng, X. Huang, F. Wen, R. Feng, Y. Li, Y. Guo, Simple and Robust Loss Design
for Multi-Label Learning with Missing Labels, 2021. URL: http://arxiv.org/abs/2112.07368. doi:10.
48550/arXiv.2112.07368.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gheyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          , Assessing Python Style Guides:
          <article-title>An Eye-Tracking Study with Novice Developers</article-title>
          , in: Anais do XXXVIII Simpósio Brasileiro de Engenharia de Software, SBC, Porto Alegre, RS, Brasil, 2024, pp. 136–146. URL: https://sol.sbc.org.br/index.php/sbes/article/view/30356. doi:10.5753/sbes.2024.3325.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>