<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>May</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Nauky ave 14, Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>Nowadays, deep learning systems are replacing traditional algorithmic approaches in the grammatical error correction (GEC) task. Still, these systems have problems such as requiring much more computational resources, insufficient robustness, and unpredictable behavior. The traditional sequence-to-sequence approach has a serious limitation in the GEC task due to the overgeneration problem. To overcome this problem, a sequence-to-edits approach is widely used, which is described in this research. To make the sequence-to-edits approach more controllable and predictable, this research describes an architecture that allows combining traditional algorithmic approaches, such as dictionary-based spell-check systems, with a sequence-to-edits model based on the transformer architecture. The result of such a combination is a higher-quality system whose behavior can be controlled.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>GEC</kwd>
        <kwd>Transformers Architecture</kwd>
        <kwd>ModernBERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Text communication plays a large part in the world of information. However, errors in writing may
disrupt the meaning of a text, which is why there is still demand for grammatical error correction
systems. Over the past years such systems have improved significantly. The first solutions were based
on rules and dictionaries; later, neural network approaches significantly improved the performance of
such systems. RNN and LSTM architectures greatly increased the quality of these solutions, and
breakthroughs like the transformer architecture allowed even better results to be achieved [1, 2].</p>
      <p>Although traditional approaches like sequence-to-sequence models [3] can handle the task well,
they have certain limitations. One of them is performance. To correct an input text, the model needs to
regenerate the whole sequence. However, unlike in language translation, in this domain many inputs
should not be changed at all, and others need only minor changes, which results in the overgeneration
problem. To resolve this issue, sequence-to-edits approaches are used: instead of regenerating the whole
sequence from scratch, these models generate only the edits that need to be applied to the input
sequence. Another issue with neural network approaches is that the same problem can be fixed in one
context but skipped in another.</p>
      <p>Positions of certain errors can be obtained using algorithmic approaches. For example,
misspellings can be detected using a dictionary-based approach: if a word is not present in the
dictionary, it is considered misspelled and should be corrected. This research is aimed at combining
algorithmic approaches with a sequence-to-edits model. The assumption is that such a hybrid approach
can strengthen the model's robustness and make it more predictable.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related papers</title>
      <p>One of the main drawbacks of sequence-to-sequence solutions is the inference time and the
computational resources spent on regenerating the entire input sequence. Unlike in other NLP tasks
such as language translation, in GEC most of the input should not be changed at all. To overcome this
problem, Aggressive Decoding [4] was introduced, which aims to reduce the number of autoregressive
steps while preserving the strengths of the sequence-to-sequence approach. However, this approach
requires the encoder and decoder parts of the model to have a shared tokenizer vocabulary. A shared
tokenizer vocabulary does not allow obtaining the benefits of reducing the decoder's tokenizer
vocabulary, which is believed to be able to be much smaller than the encoder's vocabulary [5]. A smaller
decoder vocabulary can lead to faster inference because it reduces the complexity of the softmax layer,
and it also helps the model generalize better on smaller amounts of data.</p>
      <p>Another common approach to the GEC task is sequence-to-edits models. The main goal of these
approaches is to use fewer computational resources while preserving the quality of error correction.
There is a wide variety of architectures following this approach. One of the solutions is GECToR [6],
which has a pretrained encoder model at its core, such as BERT, RoBERTa, XLNet or others. This
approach uses token-level corrections, which are obtained without a traditional transformer decoder
layer. To obtain the transformations, a transformation vocabulary is first created from the most common
transformations used to correct input sequences: DELETE, APPEND and REPLACE, combined with the text
a token should be replaced with or which should be appended; a token can also be kept without any
transformation. A projection layer is used to convert the encoder's hidden state into transformations.
Since not all corrections can be made in one pass, the process is repeated after applying the corrections
until no new corrections are produced.</p>
      <p>Another approach to obtaining corrections for an input sequence is using pointer networks, which
is used in the RedPenNet [5] model. Here, a transformer decoder is used to generate corrections. Each
correction consists of a span where the correction should be made and tokens from the decoder's
tokenizer vocabulary. On each autoregressive step, the decoder generates a token, obtained by applying
a projection layer to the decoder outputs, and a span, obtained by applying a projection layer to the
decoder attention. The assumption is that the attention mechanism points to the most valuable
information in the input sequence at the time of generating the current token, so the position of the
correction can be obtained. Each correction consists of the start and end positions of the correction and
at least two tokens, one of which is a separator. If any tokens should be deleted, a special deletion token
is generated. This approach provides the ability to have a separate vocabulary for the decoder, and it
allows the decoder to generate new corrections based on already generated ones, so all corrections can
be made in one model execution. However, the limitation of this approach is that the model cannot be
forced to correct known mistakes whose positions can be determined by another method, which leads to
behavior where the model corrects a certain mistake in one sentence but skips it in another.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and materials</title>
      <p>In our research, we introduce a model architecture that allows us to combine algorithmic
approaches, like dictionary-based misspelling detection, with a sequence-to-edits model. We assume that
by doing this we can make our solution more predictable and robust.</p>
      <sec id="sec-3-1">
        <title>3.1. Detecting errors in the input sequence</title>
        <p>There are three main operations that are done during text error correction: inserting missing
tokens, deleting redundant tokens, and replacing incorrect tokens.</p>
        <p>These operations are done at the token level to transform the input sequence into the corrected one.
To detect such tokens, we use a method similar to the one described in the GECToR [6] paper. For this,
a pretrained transformer encoder is used, such as BERT, RoBERTa, XLNet, etc. The input sequence is
tokenized using the pretrained tokenizer of the specific model and then forwarded to the encoder, and a
projection layer is applied to the encoder's last hidden state. The projection layer is used for binary
classification of each input token, where 0 means the token is correct and 1 means it is incorrect and
needs to be corrected. An overview of how incorrect tokens in the input sequence are detected is shown
in Figure 1.</p>
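        <p>To make the detection step concrete, a minimal sketch of such a token-level classifier in PyTorch
with the Hugging Face transformers library is given below. It is illustrative only: the class name, the
rounding threshold and the checkpoint identifier are illustrative choices rather than the exact
implementation used in this work.</p>
        <preformat>
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class ErrorTokenDetector(nn.Module):
    """Token-level binary classifier: 0 - token is correct, 1 - token needs correction."""
    def __init__(self, encoder_name="answerdotai/ModernBERT-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # Projection layer applied to the encoder's last hidden state
        self.projection = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.projection(hidden).squeeze(-1)      # shape: (batch, seq_len)
        return torch.round(torch.sigmoid(logits)).long()  # 0/1 correctness vector

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
detector = ErrorTokenDetector()
batch = tokenizer(["She go to school yesterday ."], return_tensors="pt")
print(detector(batch["input_ids"], batch["attention_mask"]))
</preformat>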
        <p>After obtaining a vector of the same size as the input sequence, which consists of 0s and 1s, it
should be used to rewrite those tokens that were classified as incorrect. A consecutive run of ones in
such a vector forms a chunk that should be rewritten entirely. So, the next step is to convert this vector
into a matrix in which each row contains a single chunk, so that chunks can be processed independently.
For a given correctness vector v ∈ {0, 1}^n we compute the difference vector d:</p>
        <p>d_i = v_i − v_(i−1), d_i ∈ {−1, 0, 1}, where v_0 = 0 and v_(n+1) = 0, so that d_1 = v_1 and
d_(n+1) = −v_n.</p>
        <p>Using the difference vector, the start and end indices of the chunks can be found: s_j is the j-th
position where d_i = 1 and e_j is the j-th position where d_i = −1. For each start and end pair we can
create a vector containing a single chunk:</p>
        <p>c_j(i) = 1 if s_j ≤ i &lt; e_j, and c_j(i) = 0 otherwise.</p>
        <p>After this, all chunks can be stacked into a matrix C = [c_1; c_2; …; c_m] ∈ {0, 1}^(m×n), where m
is the number of chunks.</p>
        <p>An example of dividing a correctness vector into chunks is shown in Figure 2.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Rewriting selected parts of input</title>
        <p>After obtaining the positions in the input sequence that need to be corrected, replacements should
be generated. In this research it is done using the highlight-and-decode technique with a transformer
decoder [7]. Highlight-and-decode is a method of pointing the decoder to the exact tokens in the input
sequence that require changes. It is done by adding a trainable embedding vector to those tokens. For
each chunk, the trainable embedding is added to the encoder's last hidden state at the positions where
the chunk vector value is one, according to the formula:</p>
        <p>Ĥ_j = H + c_j ⊗ E, j = 1, …, m,
where H – encoder last hidden state, E – trainable highlight embedding, C – chunk matrix whose j-th row
is c_j, n – input sequence length, m – number of chunks.</p>
        <p>For each input sequence, a batch of size m is created for the decoder. This batch of highlighted
encoder hidden states is fed into the decoder, which generates an output sequence for each chunk. After
detokenizing the sequences generated by the decoder, the corresponding chunks in the input sequence are
replaced by them.</p>
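        <p>A possible realization of this highlighting step in PyTorch is sketched below; tensor shapes follow
the notation of Section 3.1, and the function name is illustrative:</p>
        <preformat>
import torch

def highlight_hidden_states(H, C, E):
    """
    H: encoder last hidden state, shape (n, d)
    C: chunk matrix, shape (m, n), rows are 0/1 chunk indicators
    E: trainable highlight embedding, shape (d,)
    Returns a batch of m highlighted copies of H, shape (m, n, d).
    """
    m = C.shape[0]
    H_batch = H.unsqueeze(0).expand(m, -1, -1)   # replicate H once per chunk
    return H_batch + C.unsqueeze(-1) * E         # add E only at positions where C is 1

# Example: n = 12 tokens, d = 768, two chunks
H = torch.randn(12, 768)
C = torch.zeros(2, 12)
C[0, 3:5] = 1
C[1, 9] = 1
E = torch.nn.Parameter(torch.randn(768))
print(highlight_hidden_states(H, C, E).shape)    # torch.Size([2, 12, 768])
</preformat>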
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Combining with algorithmic methods</title>
        <p>As was stated earlier, some of the tokens can be considered incorrect without using the model at
all, for example by a dictionary-based spell-check system. To use this information, the highlight vector
produced by the model is merged with the highlight vector produced by the spell-checking system by an
element-wise OR:</p>
        <p>h_i = h_i^model ∨ h_i^spell, i = 1, 2, …, n,
where h^model – highlight vector produced by the model, h^spell – highlight vector produced by the
spell-checking system, n – input sequence length.</p>
        <p>This combination increases the robustness of the classifier, which may fail to detect some errors in
certain contexts. In this research, combination with a spell-checking system is examined, but the model
can also be combined with other grammar-checking systems that can obtain the positions of errors in the
input sequence but struggle to generate good suggestions. The whole system, including merging with the
results of the algorithmic system, is shown in Figure 3.</p>
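        <p>The merging itself is a simple element-wise OR. The sketch below pairs it with a toy dictionary-based
spell check; the word list and function names are illustrative, and any spell-check engine that returns
token positions could be substituted:</p>
        <preformat>
def spellcheck_highlight(tokens, dictionary):
    """1 for alphabetic tokens absent from the dictionary, 0 otherwise."""
    return [0 if (not t.isalpha() or t.lower() in dictionary) else 1 for t in tokens]

def merge_highlights(model_highlight, spell_highlight):
    """Element-wise OR of the model highlight vector and the spell-check highlight vector."""
    return [a | b for a, b in zip(model_highlight, spell_highlight)]

dictionary = {"this", "sentence", "has", "a", "mistake"}    # toy word list
tokens = ["This", "sentense", "has", "a", "mistake", "."]
model_highlight = [0, 0, 0, 0, 0, 0]                        # classifier missed the misspelling
print(merge_highlights(model_highlight, spellcheck_highlight(tokens, dictionary)))
# [0, 1, 0, 0, 0, 0] - the misspelled token is highlighted anyway
</preformat>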
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Selecting pretrained encoder</title>
        <p>As stated above, a pretrained encoder is the core of the solution, and the quality of the entire
model depends on the chosen encoder. To compare pretrained models, an experiment was conducted: the
solution described in Section 3 was implemented using different pretrained models, and every variant was
fine-tuned on the W&amp;I+LOCNESS [8] train set and evaluated on its validation set. Overall, three models
were tested: BERT, RoBERTa and ModernBERT [9]. The resulting scores are shown in Table 1.</p>
        <p>As a result, ModernBERT was picked as a pretrained encoder, because it has shown the best
preliminary results.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training data</title>
        <p>To train the model, the W&amp;I+LOCNESS and CoEdIT [10] datasets were combined. CoEdIT was cleared of
sentences identical to those in the W&amp;I+LOCNESS dataset. The numbers of train and test sentences are
shown in Table 2. As this amount of data is too small for the model to generalize on the GEC task,
synthetic data is often used. In this research the c4_200m [11] synthetic dataset was used; it contains
millions of incorrect/correct sentence pairs.</p>
        <p>Most datasets for the grammatical error correction task come in the format of sentence pairs,
where one sentence is the source and the other is the target. First, the data was cleaned and normalized,
which includes Unicode NFKC normalization; single and double quotes, hyphens, dashes and semicolons
were normalized to their base variants. Extra spaces were removed, so there is no more than one space
between words. Text was tokenized using the spaCy Python library to separate words from punctuation.
As the model architecture described in this work is of the sequence-to-edits type, each pair of sentences
needs to be annotated to find the target edits that must be made to convert the input sequence into the
target sequence. For this, the ERRANT [12] library was used. It provides a tool to annotate pairs of
sentences and generate a list of edits stored in the M2 format, which is the standard format for the GEC
task.</p>
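        <p>One way to implement this preprocessing and annotation pipeline with standard tooling is sketched
below. The normalization rules shown are a simplified subset of those described above, and the ERRANT
calls follow its public Python API:</p>
        <preformat>
import re
import unicodedata
import errant   # pip install errant (requires an English spaCy model)

def normalize(text):
    """Basic cleaning: NFKC normalization, base quotes and dashes, single spaces."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'",
                                         "–": "-", "—": "-"}))
    return re.sub(r"\s+", " ", text).strip()

annotator = errant.load("en")
orig = annotator.parse(normalize("She go to school yesterday ."))
cor = annotator.parse(normalize("She went to school yesterday ."))
for edit in annotator.annotate(orig, cor):
    # source span, source text, corrected text and ERRANT error type
    print(edit.o_start, edit.o_end, edit.o_str, edit.c_str, edit.type)
</preformat>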
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training the model</title>
        <p>During training, in one forward pass the model produces a correctness vector using the encoder's
last hidden state and a classifier. BCE loss is used for the classification part. In the decoder, the target
correctness vector is used instead of the produced one, to allow the decoder to generalize better and to
prevent error accumulation. The encoder's last hidden state is highlighted and fed into the decoder,
which learns to produce the target correction. For the decoder, the cross-entropy loss function is used.
The losses of the decoder and the classifier are combined using the formula below.</p>
        <p>L = L_cls + L_dec, (8)
where L_cls – classifier loss, L_dec – decoder loss.</p>
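        <p>In code, the combined objective can be computed as sketched below (PyTorch; the variable names and
the padding index are illustrative):</p>
        <preformat>
import torch
from torch import nn

pad_id = 0                                     # padding token id of the decoder tokenizer (assumed)
bce = nn.BCEWithLogitsLoss()                   # classifier: token-level correct/incorrect
ce = nn.CrossEntropyLoss(ignore_index=pad_id)  # decoder: next-token prediction

def combined_loss(cls_logits, target_correctness, dec_logits, target_tokens):
    """L = L_cls + L_dec as in formula (8)."""
    cls_loss = bce(cls_logits, target_correctness.float())
    dec_loss = ce(dec_logits.reshape(-1, dec_logits.size(-1)), target_tokens.reshape(-1))
    return cls_loss + dec_loss
</preformat>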
        <p>Model training was done in two steps: pretraining and fine-tuning. Pretraining was done on
5 million random sentences from the synthetic c4_200m dataset. The first half of the encoder layers was
frozen for the first hundred thousand steps, after which all model weights were unfrozen. All
hyperparameters used during the pretraining step can be found in Table 3. Separate losses of the
classifier and the decoder during training are shown in Figure 4.</p>
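        <p>The partial freezing used in the first pretraining steps can be expressed as follows (a generic
sketch; the attribute path to the encoder layers depends on the chosen pretrained model):</p>
        <preformat>
def set_layer_freezing(encoder_layers, frozen_fraction=0.5, freeze=True):
    """Freeze (or unfreeze) the first fraction of the encoder's transformer layers."""
    n_frozen = int(len(encoder_layers) * frozen_fraction)
    for layer in encoder_layers[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = not freeze

# During the first 100k pretraining steps (layer attribute path is model-specific):
# set_layer_freezing(model.encoder.layers, frozen_fraction=0.5, freeze=True)
# Afterwards, unfreeze the whole model:
# set_layer_freezing(model.encoder.layers, frozen_fraction=0.5, freeze=False)
</preformat>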
        <sec id="sec-4-3-1">
          <title>Parameter</title>
          <p>Batch size
Learning rate
Dropout rate</p>
          <p>Optimizer</p>
        <p>The fine-tuning step was done on the combination of the W&amp;I+LOCNESS and CoEdIT datasets. At this
step, all encoder layers were unfrozen, but to prevent overfitting of the classifier, it was frozen for
the first epoch, so during the first epoch only the decoder part was trained. Fine-tuning was done in
3 epochs; all training hyperparameters are shown in Table 4.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Evaluation of the model was done using the test splits of the W&amp;I+LOCNESS and CoEdIT datasets
separately, with and without the positions of misspellings found by the spell-check engine. There were
two types of tests for each dataset: in the first, only the model is used, without misspelling positions;
in the second, the model is combined with the misspelling positions from the spell-check engine.</p>
      <p>Precision, recall and F0.5 score were calculated using the ERRANT tool. The results are shown
in Table 5.</p>
      <p>As these datasets are mainly focused on grammatical errors, testing the combination of the
spell-checking system with the model does not affect the results much. To test this combination, the
BookCorpus [13] dataset was used. Ten thousand text samples were taken and considered grammatically
correct. Spelling errors were then introduced at random using the following operations, sketched in code
after the list:</p>
      <p>Delete a random character.</p>
      <p>Replace a random character with another character.</p>
      <p>Swap two random adjacent characters.</p>
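      <p>A minimal sketch of this corruption step in Python is given below; the alphabet and the choice of
which word to corrupt are illustrative assumptions:</p>
      <preformat>
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def corrupt_word(word):
    """Apply one random character-level operation: delete, replace or swap adjacent characters."""
    if len(word) == 1:
        return word
    op = random.choice(["delete", "replace", "swap"])
    if op == "delete":
        i = random.randrange(len(word))
        return word[:i] + word[i + 1:]
    if op == "replace":
        i = random.randrange(len(word))
        return word[:i] + random.choice(ALPHABET) + word[i + 1:]
    i = random.randrange(len(word) - 1)          # swap two adjacent characters
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def corrupt_sentence(tokens):
    """Introduce 1 to 3 spelling errors in one randomly chosen word."""
    idx = random.choice([i for i, t in enumerate(tokens) if t.isalpha()])
    corrupted = list(tokens)
    for _ in range(random.randint(1, 3)):
        corrupted[idx] = corrupt_word(corrupted[idx])
    return corrupted

print(corrupt_sentence(["The", "weather", "is", "nice", "today", "."]))
</preformat>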
      <sec id="sec-5-1">
        <title>Without merging with spell-check engine results. With merging with a spell-check engine found errors positions. Making from 1 to 3 errors in a random word, ten thousand sentences with synthetically generated spelling errors were obtained. This dataset was used to evaluate the model in two scenarios:</title>
        <p>As with other evaluation datasets, precision, recall and F0.5 score were calculated for these two
scenarios. Results are shown in Table 6.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussions</title>
      <sec id="sec-6-1">
        <title>6.1. Discussions of the results</title>
        <sec id="sec-6-1-1">
          <title>Precision</title>
          <p>0.4745
As shown in Table 5, the usage of spell-check engine to find coordinates of misspellings and use it
inside the model isn’t giving much impact on the results. The reason might be that W&amp;I+LOCNESS
and CoEdIT datasets are aimed mostly at grammatical problems rather than spellings. Table 6 is
showing results on a synthetically created dataset which consists only of spellings. As it is shown,
precision is not affected much by using spell-check engine, but recall is growing. It can be explained
that positions from the spell-check engine can fix potential misclassifications of the classifier, but
it’s not giving decoder any advantage. So, by using a spell-check engine we achieve that the network
isn’t missing spelling errors. Combining the results of algorithmic approach and neural network
allows us to control the behavior of the system, because we can control spell-check system by
maintaining the dictionaries, allowing users to add their own words etc.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Limitations</title>
        <p>Even though we can control the positions in the input sequence that will be changed by the model,
the model's decoder can generate the same text as the highlighted part of the input sequence, i.e., leave
that span effectively unchanged. Another issue is that all corrections are generated in one batch, which
means that the decoder generates new corrections without information about previous corrections. As a
result, the system is unable to generate two dependent corrections in different parts of the text, for
example deleting a word at the beginning and inserting it at the end of the text.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Further development</title>
        <p>To improve the quality of the system, additional training steps can be taken. In this research, only
two training steps were done: pretraining on synthetic data and fine-tuning. Additional datasets can be
used as additional training stages, and additional high-quality data can help the model generalize better
on the task.</p>
        <p>Another technique that was not used in this research but could potentially improve the results is
training the model to reproduce highlighted text. This can help the trainable highlight embedding to be
learned and the decoder to rewrite specific parts of the input sequence of different lengths. Another
potential benefit is that it can allow the decoder to regenerate the same text that was highlighted if the
spell-check engine or the classifier highlighted a part of the text that does not need to be corrected,
which can turn a model limitation into a potential advantage.</p>
        <p>Another possible use case of the system, not connected with the GEC task but worth considering, is
a rewriting system. By removing the correctness classifier from the model and replacing it with user
input, a rewriting system can be created in which the user controls which part of the sentence needs to
be rewritten. Currently, large language models are primarily used for text rephrasing tasks, and they use
a lot of computational resources. Such a system can offer rewriting of only a part of the text, or even
one word, without spending resources on regenerating the whole input sequence.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this paper a grammatical error correction (GEC) model was presented. The model uses a
sequence-to-edits approach to reduce the use of computational resources, which is crucial for systems
where the user's text is checked on the fly, resulting in a large number of requests. The presented model
architecture allows combining it with algorithmic approaches, which are good at finding the positions of
errors in the text but struggle to propose suitable corrections. In this research, a combination with a
spell-check engine was shown. The result of combining with the algorithmic approach is higher-quality
proofreading and more predictable system behavior. Although the system demonstrated in this paper has
certain limitations that were discussed, it shows good evaluation metrics and can be further improved.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[1] A. Koyama, K. Hotate, M. Kaneko, M. Komachi, Comparison of grammatical error correction
using back-translation models, Assoc. Comput. Linguistics (2021) 126–135.
doi:10.18653/v1/2021.naacl-srw.16.
[2] A. Bout, A. Podolskiy, S. Nikolenko, I. Piontkovskaya, Efficient grammatical error correction via
multi-task training and optimized training schedule, Assoc. Comput. Linguistics (2023) 5800–5816.
doi:10.18653/v1/2023.emnlp-main.355.
[3] H. Zhou, Y. Liu, Z. Li, M. Zhang, B. Zhang, C. Li, J. Zhang, F. Huang, Improving seq2seq
grammatical error correction via decoding interventions, Assoc. Comput. Linguistics (2023)
7393–7405. doi:10.18653/v1/2023.findings-emnlp.495.
[4] X. Sun, T. Ge, F. Wei, H. Wang, Instantaneous grammatical error correction with shallow
aggressive decoding, Assoc. Comput. Linguistics (2021) 5937–5947. doi:10.18653/v1/2021.acl-long.462.
[5] B. Didenko, A. Sameliuk, RedPenNet for grammatical error correction: outputs to tokens,
attentions to spans, Assoc. Comput. Linguistics (2023) 121–131. doi:10.18653/v1/2023.unlp-1.15.
[6] K. Omelianchuk, V. Atrasevych, A. Chernodub, O. Skurzhanskyi, GECToR – grammatical error
correction: tag, not rewrite, Assoc. Comput. Linguistics (2020) 163–170. doi:10.18653/v1/2020.bea-1.16.
[7] B. Didenko, J. Shaptala, Multi-headed architecture based on BERT for grammatical errors
correction, Assoc. Comput. Linguistics (2019) 246–251. doi:10.18653/v1/W19-4426.
[8] C. Bryant, M. Felice, Ø. E. Andersen, T. Briscoe, The BEA-2019 shared task on grammatical
error correction, Assoc. Comput. Linguistics (2019) 52–75. doi:10.18653/v1/W19-4406.
[9] B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R.
Biswas, F. Ladhak, T. Aarsen et al., Smarter, better, faster, longer: A modern bidirectional
encoder for fast, memory efficient, and long context finetuning and inference, 2024.
doi:10.48550/arXiv.2412.13663.
[10] V. Raheja, D. Kumar, R. Koo, D. Kang, CoEdIT: Text editing by task-specific instruction tuning,
Assoc. Comput. Linguistics (2023) 5274–5291. doi:10.18653/v1/2023.findings-emnlp.350.
[11] F. Stahlberg, S. Kumar, Synthetic data generation for grammatical error correction with tagged
corruption models, Assoc. Comput. Linguistics (2021) 37–47. URL:
https://aclanthology.org/2021.bea-1.4/.
[12] C. Bryant, M. Felice, T. Briscoe, Automatic annotation and evaluation of error types for
grammatical error correction, Assoc. Comput. Linguistics (2017) 793–805. doi:10.18653/v1/P17-1074.
[13] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning books
and movies: towards story-like visual explanations by watching movies and reading books, IEEE
Int. Conf. Comput. Vis. (ICCV) (2015). doi:10.1109/ICCV.2015.11.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>