BUG-T5: A Transformer-based Automatic Title Generation
Method for Bug Reports
Xinyi Tian1*, Jingkun Wu2, Guang Yang3
1 Shanghai University
2 Beijing Technology and Business University
3 Nanjing University of Aeronautics and Astronautics

                 Abstract
                 On GitHub, developers may fail to clarify and summarize the critical problems in bug report
                 titles due to a lack of domain knowledge or poor writing skills. Therefore, it is essential to help
                 practitioners draft high-quality titles. In this study, we propose BUG-T5, a method that
                 automatically generates titles by fine-tuning the T5 model. In our empirical analysis, we
                 use a publicly available corpus from GitHub. After comparing BUG-T5 with four
                 state-of-the-art baselines (i.e., TextRank, NMT, Transformer, and iTAPE) on ROUGE metrics, we
                 demonstrate the competitiveness of our proposed method.

                 Keywords
                 Bug report, title generation, deep learning

1. Introduction

    Bug reports, which are usually stored in bug repositories, are essential artifacts for software
development, testing, and maintenance. In the repository, bug report titles help project practitioners
efficiently understand the core ideas of bug reports. However, practitioners often fail to convey
these core ideas concisely and accurately in the title due to a lack of ability, time, or
attention, which brings difficulties in understanding, copying, tracking, classifying, and fixing [1].
Therefore, it is essential to help report authors draft high-quality titles effectively.
    In previous studies, researchers have used Structure-Based, Semantic-Based, and Learning-Based
methods for bug report title generation. Due to the strong performance of the pre-trained model T5
proposed by Google on generic knowledge acquisition and NLP problem solving, we present the BUG-
T5 method based on T5 [2] for bug report title generation.
    To verify the effectiveness of our proposed BUG-T5 method, we use the dataset shared by Chen
et al. [1]. We first filter the corpus according to heuristic rules and then select 100,000 samples from
this corpus for model training, 2,000 for model validation, and 2,000 for model testing. We use
SentencePiece [3] to tokenize the corpus and then use the tokenized corpus to fine-tune the T5 model. We
compare BUG-T5 with four state-of-the-art baselines (i.e., iTAPE [1], NMT [4], Transformer [5], and
TextRank [6]) and find that BUG-T5 outperforms these baselines on ROUGE [7] metrics.
    The main contributions of our study can be summarized as follows:
    • By fine-tuning the pre-trained model T5 [2], we propose a new method BUG-T5, which can
         automatically generate the titles of bug reports.
    • We conduct experiments on the dataset shared by Chen et al. [1] and use iTAPE [1],
         NMT [4], Transformer [5], and TextRank [6] as experimental baselines, demonstrating that
         BUG-T5 improves performance over these baselines.

ICBASE2022@3rd International Conference on Big Data & Artificial Intelligence & Software Engineering,
October 21-23, 2022, Guangzhou, China
* Xinyi Tian is the corresponding author.
txy567@163.com (Xinyi Tian), wujingkun1207@163.com (Jingkun Wu), novelyg@outlook.com (Guang Yang)
© 2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)


2. Related Work

   In previous work on automatic issue title generation, He et al. [8] first proposed an unsupervised
summary generation technique based on PageRank that considers additional information in relevant
duplicate bug reports to enhance the quality of the generated summaries. Gupta and Gupta [9] proposed a
two-level approach that uses features, PageRank, and natural language generation techniques to
synthesize the information in titles, descriptions, and comments. With the development of neural
network models and the dramatic increase in open-source data, deep learning has emerged as a
promising approach to title generation. Chen et al. [1] proposed the iTAPE method, which uses a
Seq2Seq model to solve this problem. They collected a high-quality dataset from open-source projects
using three heuristic rules, and used a copy mechanism and human-named token tagging to handle
low-frequency tokens. Lin et al. [10] proposed a Quality Prediction-based Filter on top of the iTAPE
method to identify bug reports for which high-quality titles can be generated.
   In addition, among other title generation tasks in software engineering, Liu et al. [11] proposed a
novel Seq2Seq model that automatically generates pull request (PR) descriptions. They used
reinforcement learning to optimize ROUGE and a pointer network to handle OOV problems. Zhang et al.
[12] used CodeBERT as an encoder, together with a stacked Transformer decoder and a copy attention
layer, to generate Stack Overflow question titles. Liu et al. [13] proposed SOTitle, which uses the
pre-trained T5 model to generate Stack Overflow post titles.

3. Method

    BUG-T5 consists of three phases: Corpus Construction (Section 4.1), T5 Model Fine-tuning, and
Model Application. This section presents the implementation details of the BUG-T5 model (Figure
1).

3.1. Model Architecture

    For a given bug report, we first use the SentencePiece method to tokenize the bug report and obtain
the subword sequence x = (x1, ..., xn), where n is the length of the sequence. This method helps
to alleviate the OOV (out-of-vocabulary) problem. Next, BUG-T5 uses the embedding layer to map
the sequence of subwords into a high-dimensional semantic representation X ∈ ℝ^{n×D}, where D is the
embedding dimension.
    Next, so that the model can handle the sequential order information of x, BUG-T5 uses a simplified
relative position encoding. The results of the embedding encoding and the position encoding are then
summed to obtain the final representation X, where X = X + PositionEncoding(x).
    The encoder in BUG-T5 consists of "blocks" repeated several times, each containing two sub-layers:
a multi-head self-attention sub-layer and a position-wise fully connected feed-forward network. Each
sub-layer is surrounded by a residual connection and layer normalization so that its output becomes
LayerNorm(X + sub-layer(X)). The self-attention layer maps a set of queries (Q) and a set of key-value
(K, V) pairs to the output. Assuming that the dimension of each query is d_k, the mapping is done
by first computing the dot product of queries and keys and passing it through the softmax function to
obtain the weights of the corresponding values. The formulas are as follows.


Figure 1. Framework of BUG-T5 method.

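    $$Q, K, V = XW_Q + \mathrm{bias}_Q,\; XW_K + \mathrm{bias}_K,\; XW_V + \mathrm{bias}_V \tag{1}$$

    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{2}$$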
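
The attention computation described above can be illustrated with a minimal pure-Python sketch. It assumes that Q, K, and V have already been produced by the linear projections, and that the dot products are scaled by the square root of d_k as in the standard Transformer [5]. The function names and the optional `causal` flag (which mimics the decoder-side mask discussed later in this section) are illustrative and are not taken from the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention on lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    With causal=True, query i only attends to key positions j <= i
    (this assumes n_q == n_k).
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


if __name__ == "__main__":
    Q = K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(Q, K, V))
    print(attention(Q, K, V, causal=True))
```

With `causal=True`, the first output row equals the first value vector, because the first position cannot attend to any later position.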

   The multi-head self-attention layer linearly projects the queries, keys, and values h times to obtain
h sets of Q, K, and V, and applies the self-attention calculation above to each set. The h outputs
are concatenated and projected again to obtain the final values. The position-wise fully connected
feed-forward network consists of two linear transformations and a ReLU activation to obtain the output FFN(X).

    $$\mathrm{FFN}(X) = \max(0, XW_1 + b_1)W_2 + b_2 \tag{3}$$
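
As a small illustration of the position-wise feed-forward network, the sketch below applies two linear transformations with a ReLU in between to every row of X independently. The helper names and the tiny weight matrices are hypothetical and serve only to show the shape of the computation.

```python
def linear(X, W, b):
    """Compute XW + b for X (n x d_in), W (d_in x d_out), and b (d_out)."""
    return [[sum(x[i] * W[i][j] for i in range(len(W))) + b[j]
             for j in range(len(b))]
            for x in X]


def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, X W1 + b1) W2 + b2."""
    hidden = [[max(0.0, h) for h in row] for row in linear(X, W1, b1)]
    return linear(hidden, W2, b2)


if __name__ == "__main__":
    X = [[1.0, -2.0]]
    W1 = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the ReLU zeroes the -2.0
    b1 = [0.0, 0.0]
    W2 = [[1.0], [1.0]]
    b2 = [0.5]
    print(ffn(X, W1, b1, W2, b2))
```

Because each row is processed on its own, the same weights are shared across all positions of the sequence.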

   The decoder in BUG-T5 also consists of "blocks" repeated multiple times, each containing three
sub-layers: a multi-head self-attention sub-layer, an encoder-decoder attention sub-layer, and a
position-wise fully connected feed-forward network, again using residual connections and layer
normalization around each sub-layer. The multi-head self-attention sub-layer of the decoder uses a causal
mask to ensure that the prediction at each position of the output sequence is based only on the preceding
positions of the output sequence. The extra encoder-decoder multi-head attention sub-layer takes the
output of the encoder as K and V, and the output of the first sub-layer of the decoder as Q, so that the
output of the decoder takes into account the output of the encoder. We transform the output of the
decoder h_t into the probability of the next token by a linear transformation and the softmax function.
Additionally, we use beam search [16] to improve the accuracy of the prediction.

    $$P(y_{t+1} \mid y_1, y_2, \cdots, y_t) = \mathrm{softmax}(h_t W + b) \tag{4}$$
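
To make the role of beam search concrete, the following sketch decodes from a toy next-token distribution. It is a minimal illustration under stated assumptions: the paper does not report the beam size or any length normalization, so `beam_size`, `max_len`, the `</s>` end marker, and the hand-written `TOY` distribution are all hypothetical.

```python
import math


def beam_search(next_log_probs, beam_size=3, max_len=10, eos="</s>"):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the list of tokens in prefix.
    """
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the beam_size best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that were cut off at max_len
    return max(finished, key=lambda c: c[1])[0]


# Toy distribution (an assumption for illustration): the greedy first
# choice "a" leads to a less probable sequence overall than "b".
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "</s>": 0.1},
}


def toy_model(prefix):
    probs = TOY.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


if __name__ == "__main__":
    print(beam_search(toy_model, beam_size=1))  # greedy decoding
    print(beam_search(toy_model, beam_size=2))
```

With a beam size of 1 the search commits to "a" and ends with a sequence of probability 0.30, whereas a beam size of 2 keeps "b" alive and finds the sequence "b z" with probability 0.36.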

3.2. Model Fine-tuning

    We use the AdamW optimizer to fine-tune the model parameters. In the training process,
the input bug report sequence x is first mapped by the encoder to a sequence z = (z1, ..., zn). When the
token yj is generated, the decoder first performs self-attention over the previously generated tokens
(y1, ..., yj-1) and then computes the cross-attention with the output z of the encoder to finally obtain
the probability distribution of yj. The optimization objective for the model parameters 𝜃 is to minimize
the negative log-likelihood of the target text sequence t, as formulated below.

    $$L(\theta) = -\sum_{j=1}^{|t|} \log P_{\theta}(y_j \mid y_{<j}, x) \tag{5}$$
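
The negative log-likelihood objective can be checked by hand on a tiny example. The sketch below assumes that the model has already produced, for each target position j, a probability distribution over candidate tokens given the previous target tokens and the input x. The two-step distributions are made up for illustration.

```python
import math


def sequence_nll(step_probs, target):
    """Negative log-likelihood of a target token sequence.

    step_probs[j] is a dict mapping tokens to P(token | y_<j, x);
    target is the list of reference tokens.
    """
    return -sum(math.log(step_probs[j][tok]) for j, tok in enumerate(target))


if __name__ == "__main__":
    # Hypothetical per-step distributions for a two-token title.
    step_probs = [
        {"fix": 0.5, "add": 0.5},
        {"bug": 0.25, "typo": 0.75},
    ]
    # -(log 0.5 + log 0.25) = log 8
    print(sequence_nll(step_probs, ["fix", "bug"]))
```

The loss shrinks toward zero as the model assigns a probability closer to 1 to every reference token.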


4. Experiment

   In our empirical study, we are interested in the following research questions.
   RQ1 Can our proposed method BUG-T5 outperform state-of-the-art baselines in terms of
quantitative study?
   RQ2 Can our proposed method BUG-T5 outperform state-of-the-art baselines in terms of qualitative
study?

4.1. Experimental Subject

   In our empirical study, we conduct experiments on the publicly available bug title generation dataset
shared by Chen et al. [1]. Chen et al. collected 922,730 issue samples from GitHub issues. After
preprocessing the sample set, removing the samples that are difficult to segment, and clipping the
miscellaneous content in the data, they applied three heuristic rules to delete samples with low title
quality and build a high-quality dataset.
   Following Iyer et al. [14], we further removed samples with a problem description length of more
than 150, and then randomly selected 100,000 samples for model training, 2,000 for model validation,
and 2,000 for model testing. Table 1 shows the statistics of the resulting dataset.
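
The length filter and random split described above could be sketched as follows. This is an assumption-laden illustration: the paper does not state the unit of the length threshold (tokens are assumed here) or the random seed, and the function name and sample layout are hypothetical.

```python
import random


def filter_and_split(samples, max_len=150, sizes=(100_000, 2_000, 2_000), seed=0):
    """Drop overly long bug reports, then draw disjoint train/valid/test sets.

    samples is a list of (body_tokens, title_tokens) pairs.
    """
    kept = [s for s in samples if len(s[0]) <= max_len]
    if len(kept) < sum(sizes):
        raise ValueError("not enough samples left after filtering")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    n_train, n_valid, n_test = sizes
    train = kept[:n_train]
    valid = kept[n_train:n_train + n_valid]
    test = kept[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test


if __name__ == "__main__":
    # Synthetic corpus: bodies of 1..200 tokens, each with a one-token title.
    corpus = [(["w"] * n, ["t"]) for n in range(1, 201)]
    train, valid, test = filter_and_split(corpus, sizes=(100, 20, 20))
    print(len(train), len(valid), len(test))
```

Shuffling once and slicing guarantees that the three subsets do not overlap.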

Table 1. Length statistics of the dataset.
        Bug Report Length                                       Title Length
Type    Avg      Mode       Median   <100     <115     <130     Avg       Mode   Median  <10      <15      <20
Train   74.27    51         72       77.67%   89.34%   97.36%   8.60      6      8       67.36%   96.88%   99.95%
Test    75.02    47         73       77.25%   89.35%   97.45%   8.58      7      8       67.70%   97.00%   99.95%
Valid   73.91    54         70       76.75%   88.65%   97.05%   8.53      6      8       68.50%   97.15%   99.85%
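
Statistics of the kind reported in Table 1 can be computed with the Python standard library. The sketch below is only an illustration on made-up lengths; the function name and the example values are assumptions, and the actual numbers in the table come from the authors' corpus.

```python
from statistics import mean, median, mode


def length_stats(lengths, thresholds):
    """Average, mode, median, and the share of lengths below each threshold."""
    stats = {
        "avg": mean(lengths),
        "mode": mode(lengths),
        "median": median(lengths),
    }
    for t in thresholds:
        stats[f"<{t}"] = sum(1 for n in lengths if n < t) / len(lengths)
    return stats


if __name__ == "__main__":
    title_lengths = [6, 6, 8, 9, 12]  # hypothetical title lengths in tokens
    print(length_stats(title_lengths, thresholds=(10, 15, 20)))
```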


4.2. Performance Measures

    In our study, we use ROUGE [7] as the performance metric, which originates from the text
summarization domain. In a nutshell, ROUGE calculates the lexical overlap between model-generated titles
and reference titles. Specifically, we use ROUGE-N (N = 1, 2) and ROUGE-L to evaluate the quality of
generated titles.
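
The overlap computation behind ROUGE-N and ROUGE-L can be sketched in a few lines. This is a simplified illustration on whitespace-tokenized titles, not the official ROUGE package [7], which offers further options such as stemming. The function names are illustrative, and the example titles are the iTAPE output and the ground truth from the case study in this paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, cand_total, ref_total):
    """Precision, recall, and F1 from an overlap count."""
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def rouge_n(candidate, reference, n):
    """ROUGE-N based on clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L based on the longest common subsequence."""
    return prf(lcs_length(candidate, reference), len(candidate), len(reference))


if __name__ == "__main__":
    reference = "basegradientboosting should use decisiontreeregressor instead of tree".split()
    candidate = "use decisiontreeregressor instead of tree".split()
    print("ROUGE-1:", rouge_n(candidate, reference, 1))
    print("ROUGE-2:", rouge_n(candidate, reference, 2))
    print("ROUGE-L:", rouge_l(candidate, reference))
```

For this pair, all five candidate unigrams appear in the seven-token reference, so ROUGE-1 precision is 1 and recall is 5/7. The shorter title is fully correct but incomplete, which is exactly what the recall term penalizes.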

4.3. Baselines

    In our study, we first compare our proposed method with iTAPE [1], the state-of-the-art method for
bug report title generation. We also select NMT [4], Transformer [5], and the unsupervised
TextRank [6] as baselines.

4.4. Implementation Details

   We use PyTorch 1.8.0 to implement our proposed method. For the baseline methods, we either run the
code shared by the original authors on the processed corpus or use the OpenNMT library [15] to
re-implement the method according to the authors' description.
   We ran all experiments on a computer with a GeForce RTX 3090 GPU and 24 GB of memory. The
operating system is Linux.


4.5. Result Analysis

   RQ1 Can our proposed method BUG-T5 outperform state-of-the-art baselines in terms of
quantitative study?
   Table 2 shows the results of the comparison between BUG-T5 and the baselines. We use ROUGE-1,
ROUGE-2, and ROUGE-L as the performance metrics, and report precision (P), recall (R), and F1-score
for each metric. We highlight the best value in bold in each column.
   According to the experimental results, BUG-T5 achieves the best recall and F1-score on all three
metrics, while iTAPE attains slightly higher precision on ROUGE-1 and ROUGE-L. Compared with the
iTAPE method, BUG-T5 achieves relative improvements of 19%, 28%, and 17% in ROUGE-1, ROUGE-2, and
ROUGE-L F1-scores, respectively. The results suggest that BUG-T5 can learn the deep semantics of bug
reports more effectively and performs better than the baselines in the quantitative study.
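
The reported improvements over iTAPE can be reproduced from the F1-scores in Table 2 if they are read as relative gains, which is an interpretation rather than something the text states explicitly. The short sketch below performs this arithmetic; the dictionary layout and function name are illustrative.

```python
# F1-scores (in percent) copied from Table 2.
F1_SCORES = {
    "iTAPE": {"ROUGE-1": 28.78, "ROUGE-2": 11.68, "ROUGE-L": 26.97},
    "BUG-T5": {"ROUGE-1": 34.17, "ROUGE-2": 14.98, "ROUGE-L": 31.51},
}


def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


gains = {
    metric: relative_gain(F1_SCORES["BUG-T5"][metric], F1_SCORES["iTAPE"][metric])
    for metric in F1_SCORES["iTAPE"]
}

if __name__ == "__main__":
    for metric, gain in gains.items():
        print(f"{metric}: {gain:.1f}%")
```

The gains come out at roughly 18.7%, 28.3%, and 16.8%, which round to the 19%, 28%, and 17% quoted in the text.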

Table 2. Comparison results between BUG-T5 and the baselines.
                     ROUGE-1                          ROUGE-2                     ROUGE-L
    Method           P          R          F1         P         R        F1       P         R        F1
    TextRank         13.15%     32.61%     17.51%     3.23%     9.51%    4.46%    11.26%    27.51%   14.90%
    Transformer      5.62%      5.06%      5.20%      0.54%     0.65%    0.57%    5.55%     5.01%    5.15%
    NMT              24.38%     17.75%     19.88%     7.33%     5.13%    5.80%    22.94%    16.70%   18.71%
    iTAPE            35.37%     25.54%     28.78%     14.65%    10.30%   11.68%   33.16%    23.93%   26.97%
    BUG-T5           34.09%     36.86%     34.17%     15.07%    16.26%   14.98%   31.45%    33.93%   31.51%

   RQ2 Can our proposed method BUG-T5 outperform state-of-the-art baselines in terms of qualitative
study?

Table 3. Titles generated by BUG-T5 and the baselines.
     Bug Report:
         BaseGradientBoosting should use DecisionTreeRegressor instead of Tree in order to stay
         consistent with other ensemble classes. This will lead to some redundant input checks so
         before making any changes we should run some benchmarks.
         The issue came up in #1046.
     Titles:
         Ground Truth: basegradientboosting should use decisiontreeregressor instead of tree
         Ours: basegradientboosting should use decisiontreeregressor instead of tree
         iTAPE: use decisiontreeregressor instead of tree
         NMT: make sure that the tree is used
         Transformer: how to add a way to create a way to use a file
         TextRank: phofnewline phofnewline the issue came up in # 1046

    Table 3 shows the titles generated by BUG-T5 and the baselines for a bug report collected from a
real-world project (see footnote 1). In this case, the title generated by Transformer is not
related to the original report. The titles generated by NMT and TextRank fail to express the core content
of the original report. The title generated by iTAPE is missing important information from the original
report. In contrast, the title generated by our method accurately, fluently, and comprehensively
expresses the essential information of the original report. Therefore, our BUG-T5 model outperforms
the baselines in the qualitative study.
1 https://github.com/scikit-learn/scikit-learn/issues/1047


5. Conclusion

    In this paper, we present BUG-T5 to help practitioners generate high-quality titles. The method uses
a fine-tuned T5 model [2] for automatic issue title generation. Experimental results on ROUGE metrics
show that BUG-T5 achieves the best F1-scores among the compared methods and generates fluent,
precise, and comprehensive titles.

6. References

[1] Chen, S., Xie, X., Yin, B., Ji, Y., Chen, L., & Xu, B., 2020. Stay professional and efficient:
     Automatically generate titles for your bug reports. In: 2020 35th IEEE/ACM International
     Conference on Automated Software Engineering (ASE) (pp. 385-397). IEEE.
[2] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P.
     J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
     Journal of Machine Learning Research, 21, 1-67.
[3] Kudo, T., & Richardson, J., 2018. SentencePiece: A simple and language independent subword
     tokenizer and detokenizer for Neural Text Processing. In: Proceedings of the 2018 Conference on
     Empirical Methods in Natural Language Processing: System Demonstrations (pp. 66-71).
[4] Bahdanau, D., Cho, K. H., & Bengio, Y., 2015. Neural machine translation by jointly learning to
     align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015.
[5] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &
     Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
     systems, 30.
[6] Mihalcea, R., & Tarau, P., 2004. Textrank: Bringing order into text. In: Proceedings of the 2004
     conference on empirical methods in natural language processing (pp. 404-411).
[7] Lin, C. Y., 2004. Rouge: A package for automatic evaluation of summaries. In: Text
     summarization branches out (pp. 74-81).
[8] He, J., Nazar, N., Zhang, J., Zhang, T., & Ren, Z. (2017). PRST: A PageRank-based summarization
     technique for summarizing bug reports with duplicates. International Journal of Software
     Engineering and Knowledge Engineering, 27, 869-896.
[9] Gupta, S., & Gupta, S. K. (2021). An approach to generate the bug report summaries using two-
     level feature extraction. Expert Systems with Applications, 176, 114816.
[10] Lin, H., Chen, X., Chen, X., Cui, Z., Miao, Y., & Su, Z. gen-Fl: Quality Prediction-Based Filter
     for Automated Issue Title Generation. Available at SSRN 4104452.
[11] Liu, Z., Xia, X., Treude, C., Lo, D., & Li, S., 2019. Automatic generation of pull request
     descriptions. In: 2019 34th IEEE/ACM International Conference on Automated Software
     Engineering (ASE) (pp. 176-188). IEEE.
[12] Zhang, F., Yu, X., Keung, J., Li, F., Xie, Z., Yang, Z., Ma, C., & Zhang, Z. (2022). Improving
     Stack Overflow question title generation with copying enhanced CodeBERT model and bi-modal
     information. Information and Software Technology, 148, 106922.
[13] Liu, K., Yang, G., Chen, X., & Yu, C., 2022. SOTitle: A Transformer-based Post Title Generation
     Approach for Stack Overflow. In: 2022 IEEE International Conference on Software Analysis,
     Evolution and Reengineering (SANER) (pp. 577-588). IEEE.
[14] Iyer, S., Konstas, I., Cheung, A., & Zettlemoyer, L., 2018. Mapping Language to Code in
     Programmatic Context. In: Proceedings of the 2018 Conference on Empirical Methods in Natural
     Language Processing (pp. 1643-1652).
[15] Klein, G. (2017). OpenNMT: Open-Source Toolkit for Neural Machine Translation. Proceedings
     of ACL 2017, System Demonstrations, 67-72.
[16] Freitag, M., & Al-Onaizan, Y. (2017). Beam Search Strategies for Neural Machine
     Translation. ACL 2017, 56.

