Exploring Text Summarization Models for Indian Languages

Shayak Chakraborty†, Darsh Kaushik†, Sahinur Rahman Laskar† and Partha Pakray†
Department of Computer Science and Engineering, National Institute of Technology, Silchar

Abstract
For the Indian Language Summarization task presented at FIRE 2022, various methods of text summarization were studied and applied by team TextSumEval. Summarization of mixed-language corpora is important because most articles and documents in India contain excerpts from English or other languages. A range of summarization techniques was examined for this setting. Finally, an LSTM-based sequence-to-sequence model, the BART model, the GPT model, and the T5 model were experimented with, and the results are reported.

Keywords
Text Summarization, Indian Language Summarization, Abstractive Text Summarization, Deep Learning

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
† These authors contributed equally.
shayak_pg_21@cse.nits.ac.in (S. Chakraborty); darsh_ug@cse.nits.ac.in (D. Kaushik); sahinurlaskar.nits@gmail.com (S. R. Laskar); partha@cse.nits.ac.in (P. Pakray)
https://github.com/ShayakC98 (S. Chakraborty)

1. Introduction

Over time, a great deal of textual information has come into circulation. Articles, magazines, and other documents contain much insignificant text that can be difficult to read through under time constraints. Summarization, which shortens the original text without losing information, is hugely beneficial in these scenarios. Creating precise summaries from long documents has always been an important task, and it has been greatly simplified by the advent of automatic text summarization tools. Automatic text summarization condenses large volumes of text into summaries, a job that would be very laborious if done manually.

Automatic text summarization is mainly classified into two types. The first is extractive text summarization. In this class of models and algorithms, the summary is created by extracting words and sentences from the original document, usually those with higher frequency or greater importance within their sentences. The generated summary consists almost entirely of words taken from the original document. However, because of the extraction process, the generated summary can contain many erroneous sentences.

The other type of automatic text summarization is abstractive text summarization. In this approach, the models rewrite the sentences into shorter ones while retaining the complete context of the original document. This method uses deep learning architectures that learn the summarization task from document-summary pairs. Deep learning models have significantly improved the quality of text summarization over traditional methods such as frequency-based summarization, lexical ranking-based summarization, and Latent Semantic Analysis-based summarization.
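As a minimal illustration of the frequency-based extractive approach described above (a sketch for intuition only, not one of the systems submitted to the task), sentences can be ranked by the average corpus frequency of their words:

import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 3) -> str:
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Word frequencies over the whole document (lowercased tokens).
    freq = Counter(re.findall(r"\w+", text.lower()))

    # Score each sentence by the average frequency of its words.
    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Keep the top-scoring sentences, preserving their original order.
    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    return " ".join(s for s in sentences if s in top)

Such a baseline copies sentences verbatim, which is exactly why the extracted output can read as disjointed or erroneous, as noted above.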
When it comes to using deep learning models for text generation, the most commonly used architectures are sequence-to-sequence models. Among these architectures are the following:

• Encoder-Decoder architecture - This type of model typically uses LSTM cells, sometimes with added attention units. It is not always effective at processing inputs with long sentences.
• Transformer-based summarization model - This incorporates positional encodings along with attention heads, which improves over LSTM-based models on long inputs.
• BERT model trained for summarization - BERT introduced a subword (WordPiece) tokenizer that reduces the size of the vocabulary. The model has an encoder stack without a decoder, which makes it useful for language modeling tasks but not well suited to language generation.
• GPT model trained for summarization - This model has a decoder stack without a proper encoder. Although it is mainly used for text generation, it requires very specific fine-tuning for multilingual corpora.

In this paper, as per the Indian Language Summarization task at FIRE 2022 [1] [2], the multilingual T5 model was fine-tuned for the summarization of the documents. This model uses an encoder-decoder architecture along with a SentencePiece tokenizer, which is suitable for multilingual corpora, and because it is pre-trained on huge corpora, fine-tuning becomes easier. Section 2 reviews how summarization methods have developed, Section 3 discusses the experiments that were conducted, and Section 4 concludes with the results.

2. Background

Automatic text summarization was initially performed by extracting sentences scored with Bayesian models [3] and term frequency-inverse document frequency models [4]. These methods were good at extracting entire sentences from a document. However, because they relied on word frequencies within sentences, a word such as a proper noun or a stop word had a very low chance of being reflected in the summary.

With the advent of machine learning algorithms, the sentence summarization problem changed. Deep learning models were built for language modeling tasks, and with the advent of Recurrent Neural Networks, sequence-to-sequence models came into existence. Recurrent neural networks were used for sequence generation [5], and these sequence-to-sequence models were then applied to abstractive text summarization [6]. The performance of RNN-based models could be improved by using Long Short-Term Memory cells [7], as suggested by [8]. However, these architectures tended to lose context on long sequences; even LSTM cells are not very capable of retaining context over long inputs. This problem was addressed by [9] in 2017: Transformer architectures handle long sequences by using positional encodings and multi-headed attention to keep information from different parts of the input. The architecture was soon extended in various other models to improve language modeling. Bidirectional Encoder Representations from Transformers (BERT) [10] pre-trained the encoder to improve performance on many NLP tasks such as question answering. BERT also used a WordPiece tokenizer to keep the vocabulary manageable, and the model was later used for text summarization by [11].
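Both BERT's WordPiece tokenizer and the SentencePiece tokenizer used by T5-family models handle rare or foreign words by splitting them into subword units. As a rough illustration (assuming the Hugging Face transformers library with sentencepiece installed and the publicly available google/mt5-small checkpoint, which are not necessarily the exact tools used for the submission), the following sketch shows how such a tokenizer breaks mixed English-Hindi text into subword pieces:

# Sketch: subword tokenization of mixed-language text.
# Assumes: pip install transformers sentencepiece
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

mixed_text = "The मुख्यमंत्री inaugurated a new hospital in Ahmedabad."
print(tokenizer.tokenize(mixed_text))          # subword pieces covering both scripts
print(tokenizer(mixed_text)["input_ids"])      # token ids fed to the encoder

Because every word is covered by subword pieces, proper nouns and words from other scripts never fall entirely out of vocabulary, which matters for the mixed corpora considered in this work.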
Soon after BERT, the BART model [12] was developed. BART improves sequence-processing capability through denoising pre-training: noise is added to the input text and the model is trained to reconstruct the original. This makes the model more robust than plain BERT for specific language modeling tasks. After the foundation of pre-training was laid, models like GPT [13] and Pegasus [14] were introduced. Although these models have been applied to summarization tasks, they did not produce good results on multilingual corpora. The GPT model introduced a strong decoder that enabled it to generate fluent output sentences; however, without a strong encoder the model lacked the ability to handle multiple language modeling tasks. The T5 model [15], when further trained on the XL-Sum dataset [16], was best suited to producing summaries for multilingual corpora. T5 is trained with a systematic transfer learning methodology: by training across multiple tasks, it produces strong solutions for various language modeling tasks and does not require task-specific retraining in most scenarios.

The dataset provided by the Indian Language Summarization task has three main subtasks, one for each of three languages: English, Hindi, and Gujarati. The datasets contain mixed corpora, meaning that the English dataset file contains Hindi or Gujarati words that must be handled during summarization, and the other datasets similarly contain material from other languages. Each dataset file has two main columns, the articles and their respective summaries, and these two columns were used for training the summarization models. The experiments and results are discussed in detail in the following section.

3. Experiments and Results

For each dataset, the two columns, Articles and Summaries, were extracted. For preprocessing, the articles and summaries were stripped of HTML tags, and repeated punctuation and emoticons were removed. Since the text could contain proper nouns and characters from different languages, it was not entirely converted to lowercase, and non-English words were not removed.

For the English dataset, summarization was tested with four different models. The first was an LSTM-with-attention sequence-to-sequence model. It was quickly eliminated because it produced essentially arbitrary output; its ROUGE score was about 0.01%. This behavior is explained by the way the vocabulary for this model was built using a count vectorizer: because words occurring only once or twice were dropped, the model's predictions deteriorated. This was followed by three stronger models: mBART, GPT, and T5. Table 1 shows their validation scores, and Table 2 compares the submitted T5 system with the top-scoring team on the English test set.

Table 1
Validation scores of the models evaluated on the English subtask

Model    ROUGE-1    ROUGE-2    ROUGE-3    ROUGE-4
mBART    0.38       0.23       0.19       0.17
GPT      0.46       0.38       0.35       0.34
T5       0.48       0.35       0.33       0.32

Table 2
Test scores for the English subtask (submission id: TextSumEval - t5 small)

Model/Team Name               ROUGE-1    ROUGE-2    ROUGE-3    ROUGE-4
MT-NLP IIIT-H (top scorer)    0.56       0.44       0.43       0.42
Ours (T5)                     0.48       0.35       0.33       0.32
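The scores reported in Tables 1 and 2 are ROUGE measures. As a minimal sketch of how such scores can be computed for a generated summary against a reference (assuming the rouge-score package; the shared task's official evaluation setup may differ), consider:

# Sketch: computing ROUGE-1 to ROUGE-4 F-scores for one prediction.
# Assumes: pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rouge3", "rouge4"], use_stemmer=False
)

reference = "the minister inaugurated a new hospital in the city"
prediction = "a new hospital was inaugurated by the minister"

scores = scorer.score(reference, prediction)
for name, value in scores.items():
    print(name, round(value.fmeasure, 2))  # overlap of unigrams up to 4-grams

In practice such scores are averaged over the whole validation or test set, which is how the figures in the tables should be read.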
The mBART model clearly underperforms on the task. This might be a result of improper tokenization across the multiple languages, and of the decoder not being strong enough to generate proper summaries despite proper training, even though the model is based on the transformer architecture.

The GPT model used here is primarily a text generation model. It outperforms the mBART model by a considerable margin. GPT was pre-trained on a huge corpus, but when it was fine-tuned on the other-language data, that is, the Hindi dataset, it produced ambiguous output. This was probably a result of not having been trained on the specific task of summarization beforehand. In addition, GPT's embeddings for multilingual corpora are not well formed, as the model does not have a powerful encoder, which reduces its language modeling capacity.

The T5 model, which outperformed both the mBART and the GPT models in Table 1, was pre-trained on a huge dataset and on a variety of tasks, including summarization. It was fine-tuned for 20 epochs with a batch size of 4, truncating the input articles to a length of 250 tokens. The SentencePiece tokenizer used by T5 produces subword encodings for words that are not present in the vocabulary; in this way, T5 is able to reproduce proper nouns and other words from other languages that do not appear in the original training vocabulary. The T5 model provided the best scores in the test run for the English subtask, under the submission name TextSumEval - t5 small, as shown in Table 2.

Since the T5 model worked best for the English dataset, a multilingual variant of T5 was used for the Gujarati and Hindi subtasks. The mT5 model [16] used for fine-tuning had been trained on the XL-Sum dataset after being pre-trained on the mC4 corpus [17]. This model showed the best performance for Indian-language summarization; during training, it achieved a ROUGE-1 score of 32 on the Hindi subtask. Overall, using the T5 model and its variants, the Indian Language Summarization task showed marginal improvements over pre-existing models that are already in use. The following section concludes the paper.

4. Conclusion

For the FIRE 2022 Indian Language Summarization task, a set of text summarization methods was first surveyed, and the most suitable methods for abstractive text summarization were selected. The experiments were performed and their results recorded. The BART model, studied first, gave the lowest scores; the GPT model gave slightly better results; and finally the T5 model, which gave the highest scores, was used for summarizing English documents containing non-English words. This led to using a trained mT5 model for the other Indian-language subtasks, namely Hindi and Gujarati, which produced considerable scores during training.

References

[1] S. Satapara, B. Modha, S. Modha, P. Mehta, FIRE 2022 ILSUM track: Indian language summarization, in: Proceedings of the 14th Forum for Information Retrieval Evaluation, ACM, 2022.
[2] S. Satapara, B. Modha, S. Modha, P. Mehta, Findings of the first shared task on Indian language summarization (ILSUM): Approaches, challenges and the path ahead, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, Kolkata, India, December 9-13, 2022, CEUR Workshop Proceedings, CEUR-WS.org, 2022.
[3] T. Nomoto, Bayesian learning in text summarization, in: HLT/EMNLP 2005, Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 6-8 October 2005, Vancouver, British Columbia, Canada, The Association for Computational Linguistics, 2005, pp. 249–256.
[4] H. Christian, M. P. Agus, D. Suhartono, Single document automatic text summarization using term frequency-inverse document frequency (TF-IDF), ComTech: Computer, Mathematics and Engineering Applications 7 (2016) 285–294.
[5] A. Graves, Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850 (2013).
[6] R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre, B. Xiang, Abstractive text summarization using sequence-to-sequence RNNs and beyond, in: Y. Goldberg, S. Riezler (Eds.), Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, ACL, 2016, pp. 280–290.
[7] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[8] T. Shi, Y. Keneshloo, N. Ramakrishnan, C. K. Reddy, Neural abstractive text summarization with sequence-to-sequence models, Trans. Data Sci. 2 (2021) 1:1–1:37.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
[10] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.
[11] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, L. Zhang, A text abstraction summary model based on BERT word embedding and reinforcement learning, Applied Sciences 9 (2019) 4701.
[12] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 7871–7880.
[13] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018).
[14] J. Zhang, Y. Zhao, M. Saleh, P. J. Liu, PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, in: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 11328–11339.
[15] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. 21 (2020) 140:1–140:67.
[16] T. Hasan, A. Bhattacharjee, M. S. Islam, K. S. Mubasshir, Y. Li, Y. Kang, M. S. Rahman, R. Shahriyar, XL-Sum: Large-scale multilingual abstractive summarization for 44 languages, in: Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, Association for Computational Linguistics, 2021, pp. 4693–4703.
[17] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 483–498.