Automatic Generation of Russian News Headlines

Ekaterina Tretiak
Saint Petersburg State University, 7-9 Universitetskaya emb., St Petersburg, 199034, Russia

Abstract
Text summarization is one of the key Natural Language Processing tasks. Automated text summarization has the potential to save time when creating reviews, abstracts, etc. for texts across multiple domains. Automatic headline generation is a challenging kind of text summarization. A basic distinction is drawn between extractive and abstractive summarization methods. Applying extractive summarization techniques results in the extraction of relevant words or sentences from the original text. Abstractive summarization models synthesize a summary in which some of the material is not present in the input document. This paper deals with fine-tuning a pretrained Transformer-based model for the task of generating Russian news headlines. The experiments discussed were carried out on a new dataset of Russian news which was automatically compiled from the “Bumaga” website. The paper reports quantitative evaluation results using the BLEU and ROUGE metrics as well as human evaluation results. Finally, the paper presents an error analysis and a discussion of particular contexts.

Keywords
Headline generation, text summarization, abstractive summarization, Russian language, RuBERT

IMS 2021 - International Conference "Internet and Modern Society", June 24-26, 2021, St. Petersburg, Russia
EMAIL: evtretyak1999@gmail.com
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

In modern computational linguistics, text summarization holds a special place among the tasks of natural language processing (NLP). The aim of summarization is to produce a shorter version of the text that expresses the main idea of the source document. That is, given an input text x, a model writes a summary y which is shorter than x and contains the vital information from x. Text summarization makes it possible to access and process large amounts of textual data and extract the necessary information from a huge corpus of texts.

The automatic summarization problem can be addressed with two types of techniques, extractive and abstractive ones [1]. In extractive summarization, the most significant chunks of the source text are detected and extracted without any changes. That means that all words in the summary come from the input data. In contrast, abstractive summarization systems attempt to generate new sentences, which may include words that do not occur in the original. Although an abstractive model is much more complex than an extractive one, it produces detailed, human-like summaries. It is this advantage that makes abstractive approaches increasingly popular today and, for this reason, we focus on them.

In this paper, we are concerned with the task of headline generation, which tends to be considered a special type of text summarization [2]. This is accounted for by the fact that the headline is a key component of the news text, since it conveys its main ideas. On the one hand, it should be quite informative, and on the other, it should encourage readers to spend their time on reading the full text. For digital media it is especially essential to provide clear and informative headlines, since the user does not have time to guess what hidden meaning was intended. In addition, the headline, like any other text, should exhibit grammatical and lexical cohesion and be meaningful.
The paper is organized as follows. The Related Work section reviews current studies in the field of automatic summarization. The Methods section describes the corpus of news messages that was processed as well as the model used. The Experiment section describes the proposed method for generating Russian news headlines. The Results section presents examples of headlines predicted by our model, automatic and human evaluation results, and an error analysis. The Conclusion section summarizes the conclusions drawn from the presented results.

2. Related work

Analysis of current research shows that the automatic summarization problem can be approached in different ways and is covered by a large number of papers. Many studies are devoted to extractive methods of text summarization [3][4]. [5] was one of the first to address this issue by detecting the most informative words based on word frequency: the idea was to count word frequencies in order to obtain a list of the most meaningful words. However, the main disadvantage of applying extractive methods to headline generation is that the summaries they produce are hardly headlines: they cannot be shorter than the minimum text blocks used to compose them (a sentence or a paragraph). This is how neural models for abstractive summarization and text generation came into being.

The sequence-to-sequence (seq2seq) model is one of the most important recent concepts used in current state-of-the-art applications in natural language processing. It is a type of encoder-decoder model based on recurrent neural networks (RNNs) that generates one sequence from another after being trained on a large number of sequence pairs. Developers from Google [6] demonstrated that translation models based on seq2seq outperform a standard statistical machine translation (SMT) system. Not only machine translation benefits from seq2seq models; they do well on many other sequence learning problems, including text summarization and headline generation.

In 2015, [7] proposed an approach called Attention-Based Summarization (ABS). It is a local attention-based model that generates the next word of the summary given the input sentence, combining a neural language model with a contextual input encoder. Following them, [8] extended the ABS model by using semantic and syntactic information about the source text in a standard neural attention model. Later, a copying mechanism was introduced [9] to improve the RNN encoder-decoder model; it is designed to copy tokens from the source text. This model was taken as a basis in another study [10] and trained on a dataset of Russian news.

The Transformer architecture, originally developed for machine translation [11], is now applied to all the main tasks of natural language processing. There are many modified versions of the Transformer. Thus, [2] adapted the Universal Transformer architecture [12], a modification of the Transformer, to the task of headline generation. Further advances in abstractive text summarization have been made using pretrained language models based on the Transformer architecture.
In 2019, the Bidirectional Encoder Representations from Transformers (BERT) architecture [13] was adapted specifically to text summarization, yielding BertSumExt and BertSumAbs for extractive and abstractive summarization, respectively [14]. The BertSumAbs model is a standard encoder-decoder framework for abstractive summarization [15], where the encoder is the pretrained BertSum and the decoder is a 6-layered Transformer initialized randomly. In [16], RuBERT [17] was used as the pretrained BERT in a BertSumAbs model fine-tuned on Russian texts. Application of this approach to the generation of Russian news headlines yielded state-of-the-art results on the RIA [2] and Lenta (https://github.com/yutkin/Lenta.Ru-News-Dataset) datasets.

3. Methods

3.1. Data

We conduct our experiments on a new corpus of news messages in Russian. We developed a program for automatically building the corpus from the website of the Russian online newspaper “Bumaga” (https://paperpaper.ru/). The corpus contains news messages from June 2013 to April 2021. In total, there are 38,499 news articles, each supplied with additional meta information: title, date and link. The dataset is available at https://github.com/ekaterinatretyak/PreSumm. For the experiment, we split the Bumaga corpus into train, validation, and test parts in a proportion of 90:5:5.

3.2. Model description

We examine the BertSumAbs model, which utilizes RuBERT as the pretrained BERT [16]. The original BertSumAbs model is a standard encoder-decoder framework fine-tuned for the abstractive summarization task. The encoder is 6 stacked layers of BERT, while the decoder is a 6-layered Transformer that is initialized randomly. Thus, the encoder is pretrained while the decoder must be trained from the ground up. The model has more than 317M parameters. We fine-tune the 40K-step checkpoint saved by the authors of [16], since its validation loss was the best. That is, a checkpoint trained on the RIA dataset is further fine-tuned on the Bumaga dataset.

4. Experiment

4.1. Baseline model: First Sentence

This model uses the first sentence of a news message as its hypothesis for the news headline. It is the most naïve approach to headline generation. Its application is justified by the fact that the structure of news articles follows the inverted pyramid principle: the most valuable information can be found in the first sentence through the answers to the key questions Who? When? Where? Why? What? How?

4.2. Training

As mentioned above, in the BertSumAbs model the encoder is pretrained while the decoder is trained from scratch. This mismatch between the two parts of the Transformer can make fine-tuning unstable, as noted in [14]. To overcome this difficulty, a new fine-tuning schedule was designed in [14] and then adopted in [16]. This approach is characterized by the use of different optimizers for the encoder and the decoder. Following [16], we separate the optimizers of the encoder and the decoder when training the model on our dataset. We use two Adam optimizers [18] with β₁ = 0.9 and β₂ = 0.999 and learning rates of 0.002 and 0.2 for the encoder and the decoder, respectively. When setting the training parameters, we rely on the observation of [14] that the pretrained encoder must be fine-tuned with a smaller learning rate. The model is fine-tuned with a batch size of 128 and gradient accumulation every 95 steps. The model was trained for 4,700 steps on a Tesla V100 GPU provided by Google's Colaboratory service (https://colab.research.google.com/). The training of the model took about 24 hours.
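The separate-optimizer schedule can be illustrated with the short PyTorch sketch below. It is a minimal sketch rather than the actual training code (which follows the PreSumm implementation of [14] and [16]): the `encoder`/`decoder` attribute names and the `compute_loss` placeholder are assumptions, and the learning-rate warmup used in [14] is omitted.

```python
import torch

def build_optimizers(model, lr_encoder=0.002, lr_decoder=0.2):
    """Two Adam optimizers, one per sub-module: a small learning rate for
    the pretrained encoder and a larger one for the randomly initialized
    decoder. Attribute names are illustrative, not the PreSumm code."""
    opt_enc = torch.optim.Adam(model.encoder.parameters(),
                               lr=lr_encoder, betas=(0.9, 0.999))
    opt_dec = torch.optim.Adam(model.decoder.parameters(),
                               lr=lr_decoder, betas=(0.9, 0.999))
    return opt_enc, opt_dec

def fine_tune(model, batches, compute_loss, accum_steps=95):
    """Fine-tuning loop with gradient accumulation every `accum_steps`
    batches; `compute_loss` stands in for the negative log-likelihood
    of the gold headline given the news text."""
    opt_enc, opt_dec = build_optimizers(model)
    for step, batch in enumerate(batches, start=1):
        loss = compute_loss(model, batch) / accum_steps
        loss.backward()
        if step % accum_steps == 0:
            opt_enc.step()
            opt_dec.step()
            opt_enc.zero_grad()
            opt_dec.zero_grad()
```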
5. Results

In Table 1 we present results of headline generation based on the Bumaga corpus for Russian. Despite the fact that our fine-tuned model makes mistakes, which are discussed in Section 5.3, relevant headlines still prevail.

Table 1
Samples of headlines generated after fine-tuning BertSumAbs (the texts of the news articles are given in an abbreviated form)

Example 1 (ru)
  Original text: В Эрмитаже появились коты с именами Трамп и Хиллари…
  Original headline: В Эрмитаже появились коты Трамп и Хиллари
  Generated headline: В Эрмитаже появились коты с именами Трампа и Клинтон
Example 1 (en)
  Original text: Cats with the names Trump and Hillary appeared in the Hermitage…
  Original headline: Cats Trump and Hillary appeared in the Hermitage
  Generated headline: Cats with the names of Trump and Clinton appeared in the Hermitage

Example 2 (ru)
  Original text: Совет Федерации назначил дату проведения президентских выборов в 2018 году — 18 марта…
  Original headline: Совет Федерации объявил дату проведения выборов президента в 2018 году
  Generated headline: Совет Федерации назвал дату проведения президентских выборов в 2018 году
Example 2 (en)
  Original text: The Federation Council set the date for the presidential elections in 2018 — March 18…
  Original headline: The Federation Council announced the date of the presidential election in 2018
  Generated headline: The Federation Council named the date of the presidential elections in 2018

Example 3 (ru)
  Original text: Пожар на Васильевском острове затруднил дорожную обстановку в центре Петербурга, поскольку сотрудники ДПС перекрывали участок дороги… в коммунальной квартире в доме 31/22 по Кадетской линии горела одна из комнат…
  Original headline: На Васильевском острове скопились пробки из-за пожара на Кадетской линии
  Generated headline: На Васильевском острове горела коммуналка, движение перекрыто
Example 3 (en)
  Original text: The fire on Vasilyevsky Island complicated the traffic situation in the center of St. Petersburg, since traffic officers blocked traffic on a section of road… in a communal apartment in the house 31/22 on the Kadetskaya line, one of the rooms was burning…
  Original headline: Traffic jams have accumulated on Vasilyevsky Island due to a fire on the Kadetskaya line
  Generated headline: On Vasilyevsky Island, a communal apartment burned, traffic was blocked

Example 4 (ru)
  Original text: В Петербурге 25 июля произошли прорывы труб на севере и юго-западе Петербурга… Водой залило перекресток улицы Симонова и проспекта Просвещения…
  Original headline: Улицы на севере и юге Петербурга затопило из-за прорывов труб
  Generated headline: В Петербурге прорвало трубу. Машины оказались наполовину в воде
Example 4 (en)
  Original text: In St. Petersburg, on July 25, there were bursts of pipes in the north and south-west of St. Petersburg… Water flooded the crossroads of Simonov Street and Prosveshcheniya Avenue…
  Original headline: Streets in the north and south of St. Petersburg were flooded due to bursts of pipes
  Generated headline: A pipe burst in St. Petersburg. The cars were half in the water

The generated headlines seem to have quite a high grammatical and semantic coherence. It should be noted that predicted headlines may contain words that are not present in the text of the article. Moreover, the model effectively uses techniques from the theory of paraphrasing, e.g., the use of converses, synonyms, etc. Among the generated news headlines, single-sentence headlines predominate over headlines with two or more clauses. It was found that when the model produces two simple sentences, the text quality decreases due to repetition of an already generated word or phrase.
These problems seem to be related to the fact that the checkpoint used was trained on the RIA corpus, which includes more than 1 million news headlines consisting mainly of a single sentence. Thus, increasing the number of training examples in which the headline consists of two sentences is expected to contribute to better results for generating headlines with a more complex structure. Nevertheless, the model is able to generate relevant headlines that consist of two sentences:

• Сайт об архитектуре Петербурга Citywalls снова не работает. Петербуржцы встревожены
  (en) The website about the architecture of St. Petersburg, Citywalls, is not working again. Petersburgers are alarmed
• Финляндия заняла первое место в рейтинге самых счастливых стран. Россия заняла 59-е место
  (en) Finland placed first in the list of the happiest countries. Russia took the 59th place

The model also generates complex sentences with subordinate clauses:

• На Гороховой улице открылся ресторан «Мука и вода», где можно попробовать пасту
  (en) The restaurant "Flour and Water" has opened on Gorokhovaya Street, where you can taste pasta
• Здание клуба «Камчатку», где работал Цой, расселят
  (en) The residents of the building of the club "Kamchatka", where Viktor Tsoi worked, are going to be rehoused

An analysis of the headlines produced indicates that the model performs best when generating information-rich headlines consisting of a single sentence that informs readers about the main facts of a news article. In addition, the model is able to produce headlines that contain quotes. However, among the generated headlines, it is quite difficult to find headlines that would contain irony, wordplay or a hidden author's opinion. Some examples can be seen below:

• В РПЦ назвали позицию Эрмитажа по Исаакиевскому собору «провокацией»
  (en) The Russian Orthodox Church called the Hermitage's position on St. Isaac's Cathedral a "provocation"
• «Путин, спаси нас»: жильцы дома на Ремесленной
  (en) "Putin, save us": residents of the house on Remeslennaya Street
• На Гороховой улице восстановили под гостиницу дом Крутикова. Посмотрите, как выглядит
  (en) On Gorokhovaya Street, the Krutikov house was restored as a hotel. See what it looks like

5.1. Automatic Evaluation

For automatic quality evaluation we use the BLEU score [19] and the ROUGE scores [20]. Since there are no previous results on the Bumaga corpus, in Table 2 we present results for the baseline and for the fine-tuned model. In addition, we report the scores obtained on the Bumaga dataset by the model trained only on the RIA dataset, in order to evaluate how well the model generates headlines for news articles with a different structure. The results demonstrate that the BertSumAbs model fine-tuned on the Bumaga dataset performs best on all metrics, while the model trained only on the RIA dataset gives the worst results on the Bumaga dataset. This may indicate that the format and style of news texts and headlines differ from one news agency to another. One of the most noticeable differences is that the headlines from the Bumaga corpus often consist of two sentences, while those from the RIA corpus mostly consist of a single sentence.

Table 2
Bumaga dataset evaluation

Model                                          BLEU    R1     R2     RL     R-mean
First Sentence                                 41.06   38.9   22.8   36.8   32.8
BertSumAbs trained on the RIA dataset          25.84   21.8   9.0    20.5   17.1
BertSumAbs fine-tuned on the Bumaga dataset    48.51   44.1   28.4   42.4   38.3
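For reference, comparable scores can be computed with standard tools. The sketch below only illustrates the metrics and is not the exact evaluation setup of this paper: the BLEU implementation (NLTK), the smoothing method, whitespace tokenization, and the n-gram ROUGE formulation are assumptions, and ROUGE-L (based on the longest common subsequence) is omitted for brevity.

```python
# pip install nltk
from collections import Counter
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(ref_tokens, hyp_tokens, n):
    """ROUGE-N F1: clipped n-gram overlap between reference and hypothesis."""
    ref, hyp = ngrams(ref_tokens, n), ngrams(hyp_tokens, n)
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def evaluate(references, hypotheses):
    """Corpus-level BLEU and averaged ROUGE-1/2 F1 over whitespace tokens."""
    refs = [r.split() for r in references]
    hyps = [h.split() for h in hypotheses]
    bleu = corpus_bleu([[r] for r in refs], hyps,
                       smoothing_function=SmoothingFunction().method1)
    rouge1 = sum(rouge_n_f1(r, h, 1) for r, h in zip(refs, hyps)) / len(refs)
    rouge2 = sum(rouge_n_f1(r, h, 2) for r, h in zip(refs, hyps)) / len(refs)
    return bleu, rouge1, rouge2
```

For comparability with published numbers, the evaluation scripts accompanying [14] and [16] should be preferred over ad hoc reimplementations such as this one.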
5.2. Human evaluation

As noted in Section 1, headlines in digital media should be especially informative and should exhibit grammatical and lexical cohesion. Since automatic quality evaluation methods measure the formal match of tokens rather than the semantic one, it is hardly possible to use them to understand how well the headlines meet these requirements. Commonly, the way native speakers perceive a text is the main criterion when analyzing the results of text generation experiments. For this purpose, we performed a qualitative analysis by randomly sampling 190 examples, each consisting of the news text, the original headline and the headline generated by our fine-tuned model, for human evaluation. We asked 5 annotators who are native speakers of Russian to choose the preferred headline for a news article between the original headline (Reference) and the generated one (Hypothesis). If there was no preference, the annotators chose the third option (Tie). The annotators did not know any details of the experiment, including which of the headlines was the reference. The results can be seen in Table 3.

Table 3
Human evaluation of generated headlines

Reference   Tie     Hypothesis
36%         48.6%   15.4%

From the results obtained it might be inferred that in almost every second case (48.6%) our model reaches human parity. This means that the headlines generated by the fine-tuned BertSumAbs model are perceived on a par with the ones written by journalists. Based on the criteria for choosing the preferred headline, it might be concluded that such headlines are informative and relevant and are perceived as a single grammatical text. Figure 1 presents examples of headlines for which the annotators chose the Tie option.

Figure 1: Examples of the headlines for which the Tie option was selected

Analysing the aggregate statistics, we found that in 15.4% of cases the annotators preferred the generated headlines. This means that for some examples such a headline was perceived more easily and naturally than the reference one. Nevertheless, the human-written headlines were chosen in 36% of cases. Although we cannot yet claim that our model is completely equivalent to a human producing headlines for news messages, this result is already quite promising.
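The preference shares in Table 3 can be obtained by pooling annotator judgments. Since the paper does not state how the five judgments per item were aggregated, the pooling below is only one plausible reading, shown for illustration.

```python
from collections import Counter

def preference_shares(judgments):
    """judgments: one label per (annotator, news item) pair, each label
    being 'reference', 'tie' or 'hypothesis'. Returns percentage shares,
    pooling all judgments rather than taking a per-item majority vote."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return {label: round(100 * counts[label] / total, 1)
            for label in ("reference", "tie", "hypothesis")}
```

With 5 annotators and 190 items, this pools 950 individual judgments.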
5.3. Error Analysis

The neural network makes several types of errors. In Table 4 we provide some examples of generated headlines. The most common mistakes are incomplete phrases, as in examples a, b and c. In example d, there is a factual error, which leads to an erroneous understanding of the news. Another type of error is grammatical mistakes: in example e, the model produces a sentence with incorrect verbal government, and example f shows the use of an erroneous noun case form.

Table 4
Examples with errors

a  Увольняемые сотрудники Ford в Ленобласти провели пикет против «суверенного
   (en) Dismissed Ford employees in Leningrad Oblast held a picket against the “sovereign
b  Активиста «Весны» арестовали на 10 суток за акцию с манекенами на Марс
   (en) The activist of “Spring” was arrested for 10 days for the action with mannequins on the Mars
c  Россия с 1 апреля возобновляет регулярное авиасообщение с Германией, Шри-Ланкой и еще четырьмя
   (en) Russia resumes regular flights with Germany, Sri Lanka and four other from April 1
d  Новостное сообщение: На улице Тамбасова, 5 в Красносельском районе Петербурга в ночь с 31 января на 1 февраля произошел сильный пожар в павильоне киностудии «Ленфильм»…
   (en) News text: On Tambasova Street, 5, in the Krasnoselsky district of St. Petersburg, there was a strong fire in the pavilion of the “Lenfilm” film studio on the night of January 31 to February 1…
   Сгенерированный заголовок: В Приморском районе Петербурга произошел сильный пожар в павильоне «Ленфильма»
   (en) Generated headline: In the Primorsky district of St. Petersburg there was a strong fire in the pavilion of “Lenfilm”
e  Минобороны официально подтвердило об уничтожении военного штаба в Сирии
   (en) The Ministry of Defence officially confirmed [incorrect verbal government in the Russian] the destruction of a military headquarters in Syria
f  На мосту Александра Невского с грузовика упал мешка с песком и цементом
   (en) A bag [erroneous case form in the Russian] of sand and cement fell from a truck on the Alexander Nevsky Bridge

6. Conclusion

In this paper, we explored the effectiveness of fine-tuning a pretrained Transformer-based model that uses RuBERT as the pretrained BERT for the task of neural generation of Russian news headlines. We showed that the predicted headlines are highly grammatically and semantically coherent and resemble original news headlines. We also presented the newly gathered Bumaga corpus and provided results achieved by the BertSumAbs model applied to the generation of headlines for news articles from this dataset.

7. Acknowledgements

I would like to thank Associate Professor O.A. Mitrofanova, PhD (Saint Petersburg State University), for useful discussions and for comments that greatly improved this paper.

8. References

[1] H. Saggion, T. Poibeau, Automatic text summarization: Past, present and future, 2013. URL: https://hal.archives-ouvertes.fr/hal-00782442/document.
[2] D. Gavrilov, P. Kalaidin, V. Malykh, Self-Attentive Model for Headline Generation, 2019. URL: https://arxiv.org/abs/1901.07786.
[3] E. Alsentzer, A. Kim, Extractive Summarization of EHR Discharge Notes, 2018. URL: https://arxiv.org/abs/1810.12085.
[4] S. Xu, S. Yang, F.C.M. Lau, Keyword extraction and headline generation using novel word features, in: AAAI, 2010, pp. 1461–1466.
[5] H.P. Luhn, The automatic creation of literature abstracts, IBM Journal of Research and Development 2 (2), 1958, pp. 159–165.
[6] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, 2014. URL: https://arxiv.org/abs/1409.3215.
[7] A.M. Rush, S. Chopra, J. Weston, A Neural Attention Model for Abstractive Sentence Summarization, 2015. URL: https://arxiv.org/abs/1509.00685.
[8] S. Takase, J. Suzuki, N. Okazaki, T. Hirao, M. Nagata, Neural headline generation on abstract meaning representation, 2016. URL: https://www.aclweb.org/anthology/D16-1112/.
[9] J. Gu, Z. Lu, H. Li, V.O. Li, Incorporating copying mechanism in sequence-to-sequence learning, 2016. URL: https://arxiv.org/abs/1603.06393.
[10] I.O. Gusev, Importance of copying mechanism for news headline generation, 2019. URL: http://www.dialog-21.ru/media/4599/gusevio-152.pdf.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. URL: https://arxiv.org/abs/1706.03762.
[12] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, L. Kaiser, Universal transformers, 2018. URL: https://arxiv.org/abs/1807.03819.
[13] J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL: https://arxiv.org/abs/1810.04805.
[14] Y. Liu, M. Lapata, Text Summarization with Pretrained Encoders, 2019. URL: https://arxiv.org/abs/1908.08345.
[15] A. See, P.J. Liu, C.D. Manning, Get to the point: Summarization with Pointer-Generator networks, 2017. URL: https://www.aclweb.org/anthology/P17-1099/.
[16] A. Bukhtiyarov, I. Gusev, Advances of Transformer-Based Models for News Headline Generation, 2020. URL: https://arxiv.org/abs/2007.05044.
[17] Y. Kuratov, M. Arkhipov, Adaptation of deep bidirectional multilingual transformers for Russian language, 2019. URL: https://arxiv.org/abs/1905.07213.
[18] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2015. URL: https://arxiv.org/abs/1412.6980.
[19] K. Papineni, S. Roukos, T. Ward, W.J. Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation, 2002. URL: https://www.aclweb.org/anthology/P02-1040/.
[20] C.Y. Lin, Looking for a few good metrics: ROUGE and its evaluation, 2004. URL: https://research.nii.ac.jp/ntcir/ntcir-ws4/NTCIR4-WN/OPEN/OPENSUB_Chin-Yew_Lin.pdf.