BUG-T5: A Transformer-based Automatic Title Generation
Method for Bug Reports
Xinyi Tian1*, Jingkun Wu2, Guang Yang3
1 Shanghai University
2 Beijing Technology and Business University
3 Nanjing University of Aeronautics and Astronautics

                 Abstract
On GitHub, developers may fail to clarify and summarize the critical problems in bug report titles due to a lack of domain knowledge or poor writing skills. Therefore, it is essential to help practitioners draft high-quality titles. In this study, we propose BUG-T5, a method that automatically generates titles by fine-tuning the T5 model. In our empirical analysis, we choose a publicly available corpus collected from GitHub. After comparing BUG-T5 with four state-of-the-art baselines (i.e., TextRank, NMT, Transformer, and iTAPE) on ROUGE metrics, we demonstrate the competitiveness of our proposed method.

                 Keywords
                 Bug report, title generation, deep learning

1. Introduction

    Bug reports are usually stored in bug repositories and are essential artifacts that help with software development, testing, and maintenance. In a repository, bug report titles help project practitioners efficiently understand the core ideas of bug reports. However, report authors often fail to convey the core ideas of bug reports concisely and accurately in the title due to a lack of ability, time, or attention, which brings difficulties in understanding, duplicating, tracking, classifying, and fixing bugs [1]. Therefore, it is essential to help report authors draft high-quality titles effectively.
    In previous studies, researchers have used Structure-Based, Semantic-Based, and Learning-Based methods for bug report title generation. Motivated by the strong performance of the pre-trained T5 model proposed by Google in generic knowledge acquisition and NLP problem solving, we present the BUG-T5 method based on T5 [2] for bug report title generation.
    To verify the effectiveness of our proposed BUG-T5 method, we chose the dataset shared by Chen et al. [1]. We first filter the corpus according to heuristic rules and then select 100,000 samples from this corpus for model training, 2,000 samples for model validation, and 2,000 samples for model testing. We use SentencePiece [3] to tokenize the corpus and then use this corpus to fine-tune the T5 model. We compare BUG-T5 with four state-of-the-art baselines (i.e., iTAPE [1], NMT [4], Transformer [5], and TextRank [6]) and find that BUG-T5 outperforms these baselines on ROUGE [7] metrics.
    The main contributions of our study can be summarized as follows:
    • By fine-tuning the pre-trained T5 model [2], we propose a new method, BUG-T5, which can automatically generate titles for bug reports.
    • We conduct experiments on the dataset shared by Chen et al. [1] and adopt iTAPE [1], NMT [4], Transformer [5], and TextRank [6] as experimental baselines, demonstrating that BUG-T5 can significantly improve performance.

ICBASE2022@3rd International Conference on Big Data & Artificial Intelligence & Software Engineering, October 21-23, 2022, Guangzhou, China
* Xinyi Tian is the corresponding author.
txy567@163.com (Xinyi Tian), wujingkun1207@163.com (Jingkun Wu), novelyg@outlook.com (Guang Yang)
              © 2022 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)




2. Related Work

   In previous studies on automatic issue title generation, He et al. [8] first proposed an unsupervised summary generation technique based on PageRank that considers additional information in relevant duplicate bug reports to enhance the quality of the generated summaries. Gupta and Gupta [9] proposed a two-level approach that uses features, PageRank, and natural language generation techniques to synthesize the information in titles, descriptions, and comments. With the development of neural network models and the dramatic increase in open-source data, deep learning has become an emerging approach for generating issue titles. Chen et al. [1] proposed the iTAPE method, which uses a Seq2Seq model to solve this problem. They collected a high-quality dataset from open-source projects using three heuristic rules and used a copy mechanism and human-named token tagging to handle low-frequency tokens. Lin et al. [10] proposed a quality prediction-based filter built on the iTAPE method to select bug reports for which high-quality titles can be generated.
   In addition, among other title generation tasks in software engineering, Liu et al. [11] proposed a novel Seq2Seq model that automatically generates pull request (PR) descriptions. They used reinforcement learning to optimize ROUGE and a pointer network to solve out-of-vocabulary (OOV) problems. Zhang et al. [12] used CodeBERT as an encoder, a stacked Transformer decoder, and a copy attention layer to generate Stack Overflow question titles. Liu et al. [13] proposed SOTitle, which uses the pre-trained T5 model to generate Stack Overflow post titles.

3. Method

    BUG-T5 consists of three phases: Corpus Construction (Section 4.1), Fine-tuning the T5 Model, and Model Application. This section presents the implementation details of the BUG-T5 model (Figure 1).

3.1. Model Architecture

    For a given bug report, we first use the SentencePiece method to tokenize the bug report and obtain the subword sequence x = (x1, …, xn), where n is the length of the sequence. This method helps alleviate the out-of-vocabulary (OOV) problem. Next, BUG-T5 uses an embedding layer to map the sequence of subwords into a high-dimensional semantic matrix X ∈ ℝ^(n×D), where D is the embedding dimension.
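    As a concrete illustration, the sketch below tokenizes a bug report with the SentencePiece-based T5 tokenizer and looks up the corresponding embeddings. It is a minimal sketch assuming the Hugging Face transformers library and the t5-small checkpoint, which are illustrative choices rather than the exact setup used in our experiments.

# A minimal sketch, assuming the Hugging Face "transformers" library and the
# t5-small checkpoint (illustrative choices, not the exact experimental setup).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

report = ("BaseGradientBoosting should use DecisionTreeRegressor instead of "
          "Tree in order to stay consistent with other ensemble classes.")

# SentencePiece splits rare identifiers into subwords x = (x1, ..., xn),
# which alleviates the out-of-vocabulary problem.
encoding = tokenizer(report, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))

# The embedding layer maps the n subword ids to an n x D semantic matrix X.
X = model.get_input_embeddings()(encoding["input_ids"])
print(X.shape)  # (1, n, D), where D is the embedding dimension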
    Next, so that the model can handle the sequential order information of x, BUG-T5 uses a simplified relative position encoding. The results of the embedding encoding and the position encoding are then summed to obtain the final vector X, where X = X + PositionEncoding(x).
    The encoder in BUG-T5 consists of "blocks" repeated several times, each containing two sub-layers: a multi-head self-attention sub-layer and a position-wise fully connected feed-forward network. Each sub-layer is surrounded by a residual connection and layer normalization, so that its output becomes LayerNorm(X + SubLayer(X)). The self-attention layer maps a set of queries (Q) and a set of key (K), value (V) pairs to an output. Assuming that the dimension of each query is d_k, the mapping is done by first computing the dot product of queries and keys and passing it through the softmax function to obtain the weights of the corresponding values. The formulas are as follows.




Figure 1. Framework of BUG-T5 method.

                              Q, K, V = XW_Q + bias_Q, XW_K + bias_K, XW_V + bias_V                    (1)

                              Attention(Q, K, V) = softmax(QK^T / √d_k) V                              (2)

   The multi-head self-attention layer projects the queries, keys, and values linearly h times to obtain Q, K, and V. The above self-attention calculation is performed on each projection, and the outputs are concatenated and projected again to obtain the final values. The position-wise fully connected feed-forward network consists of two linear transformations with a ReLU activation in between, producing the output FFN(X).

                              FFN(X) = max(0, XW_1 + b_1) W_2 + b_2                                    (3)
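    To make equations (1)-(3) concrete, the following minimal sketch computes single-head attention and the position-wise feed-forward network in PyTorch. The dimensions and the bias terms are illustrative assumptions (T5 itself, for example, omits biases in its projection layers).

# Minimal single-head sketch of Eq. (1)-(3); dimensions and biases are
# illustrative assumptions, not the exact T5 configuration.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

n, D, d_k, d_ff = 32, 512, 64, 2048          # assumed sizes
X = torch.randn(n, D)                        # embedded subword sequence

W_Q, W_K, W_V = nn.Linear(D, d_k), nn.Linear(D, d_k), nn.Linear(D, d_k)
Q, K, V = W_Q(X), W_K(X), W_V(X)             # Eq. (1): Q, K, V = XW + bias

# Eq. (2): Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
weights = F.softmax(Q @ K.transpose(0, 1) / math.sqrt(d_k), dim=-1)
attention_output = weights @ V               # shape: (n, d_k)

# Eq. (3): two linear transforms with a ReLU activation in between
W_1, W_2 = nn.Linear(D, d_ff), nn.Linear(d_ff, D)
ffn_output = W_2(F.relu(W_1(X)))             # shape: (n, D)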

   The decoder in BUG-T5 also consists of "blocks" repeated multiple times, each containing three sub-layers: a multi-head self-attention sub-layer, an encoder-decoder attention sub-layer, and a position-wise fully connected feed-forward network, again with residual connections and layer normalization around each sub-layer. The multi-head self-attention sub-layer of the decoder uses a causal mask to ensure that the prediction at each position of the output sequence depends only on the preceding tokens of the output sequence. The additional encoder-decoder multi-head attention sub-layer takes the output of the encoder as K and V and the output of the first decoder sub-layer as Q, so that the decoder output takes the encoder output into account. We transform the decoder output h_t into the probability of the next token by a linear transformation and the softmax function. Additionally, we use beam search [16] to improve the accuracy of the prediction.

                              P(y_t | y_1, y_2, ..., y_{t-1}) = softmax(h_t W + b)                     (4)
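    As an illustration of this decoding step, the sketch below generates a title with beam search through the Hugging Face generate API. The checkpoint path, beam width, and length limit are assumptions made for illustration; in practice a fine-tuned BUG-T5 checkpoint would be loaded.

# Minimal sketch of beam-search decoding; checkpoint path, beam width, and
# length limit are illustrative assumptions.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")          # assumed checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

report = "BaseGradientBoosting should use DecisionTreeRegressor instead of Tree ..."
inputs = tokenizer(report, return_tensors="pt", truncation=True, max_length=150)

# Beam search keeps the most probable partial titles at every decoding step.
title_ids = model.generate(**inputs, num_beams=4, max_length=20, early_stopping=True)
print(tokenizer.decode(title_ids[0], skip_special_tokens=True))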

3.2. Model Fine-tuning

    We use the AdamW optimizer to fine-tune the model parameters. In the training process, the input bug report sequence x is first mapped by the encoder to a sequence z = (z1, ..., zn). When the token y_j is generated, the decoder first performs self-attention over the previously generated tokens (y1, ..., y_{j-1}) and then computes cross-attention with the encoder output z to obtain the probability distribution of y_j. The optimization objective of the model parameters θ is to minimize the negative log-likelihood of the target text sequence t, as formulated below.

                              L(θ) = − Σ_{j=1}^{|t|} log P(y_j | y_{<j}, x)                            (5)
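    A minimal fine-tuning sketch is given below, assuming the Hugging Face transformers implementation of T5; when labels are supplied, the model returns the negative log-likelihood of equation (5), which is then minimized with AdamW. The checkpoint, learning rate, and single-example batch are illustrative assumptions.

# Minimal fine-tuning sketch; checkpoint, learning rate, and the single
# (description, title) pair are illustrative assumptions.
from torch.optim import AdamW
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = AdamW(model.parameters(), lr=3e-4)               # assumed learning rate

report = "BaseGradientBoosting should use DecisionTreeRegressor instead of Tree ..."
title = "basegradientboosting should use decisiontreeregressor instead of tree"

inputs = tokenizer(report, return_tensors="pt", truncation=True, max_length=150)
labels = tokenizer(title, return_tensors="pt", truncation=True, max_length=20).input_ids

# With labels supplied, the model returns the negative log-likelihood of Eq. (5).
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()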



4. Experiment

   In our empirical study, we are interested in the following research questions.
   RQ1: Can our proposed method BUG-T5 outperform state-of-the-art baselines in terms of a quantitative study?
   RQ2: Can our proposed method BUG-T5 outperform state-of-the-art baselines in terms of a qualitative study?

4.1. Experimental Subject

   In our empirical study, we conduct experiments on the publicly available bug report title generation dataset shared by Chen et al. [1]. Chen et al. collected 922,730 issue samples from GitHub. After preprocessing the sample set by removing samples that are difficult to segment and clipping miscellaneous content in the data, they applied three heuristic rules to delete samples with low title quality and build a high-quality dataset.
   We further followed Iyer et al. [14] to remove samples whose problem description length exceeds 150, and randomly selected 100,000 samples for model training, 2,000 samples for model validation, and 2,000 samples for model testing (a minimal sketch of this filtering and splitting step is shown after Table 1). Table 1 shows the statistical information of the dataset.

Table 1. Length statistics of the dataset.

         |              Bug Report Length               |                Title Length
 Type    | Avg    Mode  Median  <100    <115    <130    | Avg   Mode  Median  <10     <15     <20
 Train   | 74.27  51    72      77.67%  89.34%  97.36%  | 8.60  6     8       67.36%  96.88%  99.95%
 Test    | 75.02  47    73      77.25%  89.35%  97.45%  | 8.58  7     8       67.70%  97.00%  99.95%
 Valid   | 73.91  54    70      76.75%  88.65%  97.05%  | 8.53  6     8       68.50%  97.15%  99.85%
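    The sketch below illustrates the length filter and the 100,000/2,000/2,000 random split described above; the record fields, the whitespace-based token count, and the random seed are assumptions made for illustration.

# Illustrative sketch of the length filter and random split; field names,
# the token-counting rule, and the seed are assumptions.
import random

# Hypothetical corpus: a list of {"description": str, "title": str} records.
corpus = [{"description": "example bug report description ...", "title": "example title"}]

# Keep samples whose description length does not exceed 150.
filtered = [s for s in corpus if len(s["description"].split()) <= 150]

random.seed(42)                      # assumed seed for reproducibility
random.shuffle(filtered)
train = filtered[:100_000]
valid = filtered[100_000:102_000]
test = filtered[102_000:104_000]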



4.2. Performance Measures

    In our study, we use ROUGE [7] as the performance metric, which originates from the text summarization domain. In a nutshell, ROUGE calculates the lexical overlap between model-generated titles and reference titles. Specifically, we use ROUGE-N (N = 1, 2) and ROUGE-L to evaluate the quality of the generated titles.
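    As an illustration, ROUGE-1, ROUGE-2, and ROUGE-L precision, recall, and F1-score can be computed with the rouge-score package, as sketched below; the package choice is an assumption rather than the exact implementation used in our experiments.

# Illustrative ROUGE computation using the rouge-score package (an assumed
# implementation choice, not necessarily the one used in the experiments).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "basegradientboosting should use decisiontreeregressor instead of tree"
generated = "use decisiontreeregressor instead of tree"

scores = scorer.score(reference, generated)
for name, score in scores.items():
    print(f"{name}: P={score.precision:.2%} R={score.recall:.2%} F1={score.fmeasure:.2%}")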

4.3. Baselines

    In our study, we first compare our proposed method with iTAPE [1], the state-of-the-art method for bug report title generation. We also select NMT [4], Transformer [5], and the unsupervised TextRank [6] as baselines.

4.4. Implementation Details

   We use PyTorch 1.8.0 to implement our proposed method. For the baseline methods, we run the code shared by the corresponding authors on the processed corpus, or adopt the OpenNMT library [15] to re-implement the method according to the authors' description.
   We ran all experiments on a computer with a GeForce RTX 3090 GPU and 24 GB of memory, running the Linux operating system.


4.5. Result Analysis

   RQ1: Can our proposed method BUG-T5 outperform state-of-the-art baselines in terms of a quantitative study?
   Table 2 shows the results of the comparison between BUG-T5 and the baselines. We use ROUGE-1, ROUGE-2, and ROUGE-L as the performance metrics, and for each metric we report precision, recall, and F1-score. We highlight the best value in each column in bold.
   According to the experimental results, our method outperforms the baselines. Compared with the iTAPE method, BUG-T5 achieves 19%, 28%, and 17% performance improvements in ROUGE-1, ROUGE-2, and ROUGE-L F1-scores, respectively. The results show that BUG-T5 can learn the deep semantics of bug reports more effectively and performs better than the baselines in terms of the quantitative study.

Table 2. Comparison results between BUG-T5 and baselines.

                  |        ROUGE-1          |        ROUGE-2          |        ROUGE-L
     Method       | P       R       F1      | P       R       F1      | P       R       F1
     TextRank     | 13.15%  32.61%  17.51%  | 3.23%   9.51%   4.46%   | 11.26%  27.51%  14.90%
     Transformer  | 5.62%   5.06%   5.20%   | 0.54%   0.65%   0.57%   | 5.55%   5.01%   5.15%
     NMT          | 24.38%  17.75%  19.88%  | 7.33%   5.13%   5.80%   | 22.94%  16.70%  18.71%
     iTAPE        | 35.37%  25.54%  28.78%  | 14.65%  10.30%  11.68%  | 33.16%  23.93%  26.97%
     BUG-T5       | 34.09%  36.86%  34.17%  | 15.07%  16.26%  14.98%  | 31.45%  33.93%  31.51%

   RQ2: Can our proposed method BUG-T5 outperform state-of-the-art baselines in terms of a qualitative study?

Table 3. The titles generated by BUG-T5 and baselines.

Bug Report:
BaseGradientBoosting should use DecisionTreeRegressor instead of Tree in order to stay consistent with other ensemble classes. This will lead to some redundant input checks so before making any changes we should run some benchmarks. The issue came up in #1046.

Titles:
Ground Truth: basegradientboosting should use decisiontreeregressor instead of tree
Ours:         basegradientboosting should use decisiontreeregressor instead of tree
iTAPE:        use decisiontreeregressor instead of tree
NMT:          make sure that the tree is used
Transformer:  how to add a way to create a way to use a file
TextRank:     phofnewline phofnewline the issue came up in # 1046

    Table 3 shows the titles generated by BUG-T5 and the baselines for a bug report collected from the real world1. Through this case, we find that the title generated by Transformer is not related to the original report, and the titles generated by NMT and TextRank fail to express the core content of the original report. The title generated by iTAPE misses important information from the original report. In contrast, the title generated by our method expresses the essential information of the original report accurately, fluently, and comprehensively. Therefore, our BUG-T5 model outperforms the baselines in terms of the qualitative study.
1
  https://github.com/scikit-learn/scikit-learn/issues/1047




5. Conclusion

    In this paper, we present BUG-T5 to help practitioners generate high-quality titles. The method uses a fine-tuned T5 model [2] for automatic issue title generation. Experimental results on ROUGE metrics show that BUG-T5 provides the best performance, generating fluent, precise, and comprehensive titles.

6. References

[1] Chen, S., Xie, X., Yin, B., Ji, Y., Chen, L., & Xu, B., 2020. Stay professional and efficient:
     Automatically generate titles for your bug reports. In: 2020 35th IEEE/ACM International
     Conference on Automated Software Engineering (ASE) (pp. 385-397). IEEE.
[2] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P.
     J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
     Journal of Machine Learning Research, 21, 1-67.
[3] Kudo, T., & Richardson, J., 2018. SentencePiece: A simple and language independent subword
     tokenizer and detokenizer for Neural Text Processing. In: Proceedings of the 2018 Conference on
     Empirical Methods in Natural Language Processing: System Demonstrations (pp. 66-71).
[4] Bahdanau, D., Cho, K. H., & Bengio, Y., 2015. Neural machine translation by jointly learning to
     align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015.
[5] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &
     Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing
     Systems, 30.
[6] Mihalcea, R., & Tarau, P., 2004. Textrank: Bringing order into text. In: Proceedings of the 2004
     conference on empirical methods in natural language processing (pp. 404-411).
[7] Lin, C. Y., 2004. Rouge: A package for automatic evaluation of summaries. In: Text
     summarization branches out (pp. 74-81).
[8] He, J., Nazar, N., Zhang, J., Zhang, T., & Ren, Z. (2017). PRST: A PageRank-based summarization
     technique for summarizing bug reports with duplicates. International Journal of Software
     Engineering and Knowledge Engineering, 27, 869-96.
[9] Gupta, S., & Gupta, S. K. (2021). An approach to generate the bug report summaries using two-
     level feature extraction. Expert Systems with Applications, 176, 114816.
[10] Lin, H., Chen, X., Chen, X., Cui, Z., Miao, Y., & Su, Z. gen-Fl: Quality Prediction-Based Filter
     for Automated Issue Title Generation. Available at SSRN 4104452.
[11] Liu, Z., Xia, X., Treude, C., Lo, D., & Li, S., 2019. Automatic generation of pull request
     descriptions. In: 2019 34th IEEE/ACM International Conference on Automated Software
     Engineering (ASE) (pp. 176-188). IEEE.
[12] Zhang, F., Yu, X., Keung, J., Li, F., Xie, Z., Yang, Z., Ma, C., & Zhang, Z. (2022). Improving
     Stack Overflow question title generation with copying enhanced CodeBERT model and bi-modal
     information. Information and Software Technology, 148, 106922.
[13] Liu, K., Yang, G., Chen, X., & Yu, C., 2022. SOTitle: A Transformer-based Post Title Generation
     Approach for Stack Overflow. In: 2022 IEEE International Conference on Software Analysis,
     Evolution and Reengineering (SANER) (pp. 577-588). IEEE.
[14] Iyer, S., Konstas, I., Cheung, A., & Zettlemoyer, L., 2018. Mapping Language to Code in
     Programmatic Context. In: Proceedings of the 2018 Conference on Empirical Methods in Natural
     Language Processing (pp. 1643-1652).
[15] Klein, G. (2017). OpenNMT: Open-Source Toolkit for Neural Machine Translation. Proceedings
     of ACL 2017, System Demonstrations, 67-72.
[16] Freitag, M., & Al-Onaizan, Y. (2017). Beam Search Strategies for Neural Machine
     Translation. ACL 2017, 56.



