Text summarization for Indian languages using
pre-trained models
Aishwarya Krishnakumar, Fathima Naushin A R, Mrithula K L and B Bharathi
Department of CSE, Sri Siva Subramaniya Nadar College of Engineering, Tamil Nadu, India


Abstract
The notion of summarising is as old as the ancient Greek and Roman plays. Information in various settings, from plays to meetings to non-fiction works, has been summarised for as long as we can remember; this abstract itself is an example of summarization in use. In today's digital era, systems are built to perform activities that only humans could do before, and summarizing text documents is one such example. Various NLP and AI models have been developed for text summarization, but while efficient models exist for English, little attention has been given to Indian languages. This paper describes the work done by team SSNCSENLP for the ILSUM (Indian Language Summarization) shared task on multilingual code-mixed text at FIRE 2022, and compares the performance of a few existing models. With our best-evaluated model, we ranked among the top ten on the validation sets for all three task languages: English, Gujarati, and Hindi. To summarize these languages we used the mT5_m2m_CrossSum, XL-Sum, BERT, and mT5-small models, of which mT5_m2m_CrossSum generated the most precise summaries of the given text.

                                      Keywords
                                      generated summary, Indian languages, pre-trained model, mT5_m2m_CrossSum




1. Introduction
Natural Language Processing (NLP), a branch of computer science and artificial intelligence (AI),
combines computational linguistics (rule-based modeling of human language) with statistical,
machine learning, and deep learning models. This gives computers the ability to process human
language in the form of voice data or text, enabling them to read text, hear speech, and interpret
both. NLP breaks language down into tokens and tries to understand the relationships between
those tokens. NLP covers several tasks, including sentiment analysis, word sense disambiguation,
grammatical tagging, content categorization, text summarization, topic discovery and modeling,
speech-to-text and text-to-speech, and many more. Making these tasks accurate is challenging
because human language is filled with ambiguities, not to mention the many languages in use
and the usage of metaphors, sarcasm, idioms, and other grammatical exceptions. There are several NLP


Forum for Information Retrieval Evaluation, December 9-13, 2022, India
$ aishwarya2010328@ssn.edu.in (A. Krishnakumar); fathima2010192@ssn.edu.in (F. N. A. R);
mrithula2010075@ssn.edu.in (M. K. L); bharathib@ssn.edu.in (B. B. )
€ https://www.ssn.edu.in/staff-members/dr-b-bharathi/ (B. B. )
 0000-0001-7279-5357 (B. B. )
                                    © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
tools and approaches that address these challenges as best they can, but such tools remain very
limited for low-resource languages.
   Text summarization uses NLP techniques to digest huge volumes of digital text in any form,
such as articles, magazines, or social media posts, and to create summaries and synopses for
indexes, research databases, or busy readers who don't have time to read the full text. It is the
process of identifying the most important, meaningful information in a text document and
compressing it into a shorter version. When a computer performs this task using algorithms
and programs, it is called Automatic Text Summarization.




Figure 1: Basic flow of text summarization


  There are two approaches to text summarization based on the generated summary (output):
    • Extractive summarization.
    • Abstractive summarization.
Extractive summarization uses simple, traditional algorithms to extract keyphrases from the
text and uses them to generate summaries. The keyphrases are usually chosen based on their
frequency in the text, so the generated summaries can be grammatically strange. Abstractive
summarization uses advanced algorithms to create new phrases and sentences that convey
the most useful information from the text. While the abstractive approach overcomes the
grammatical inconsistencies of the extractive approach and performs better, its algorithms are
also more difficult to develop. The content of the generated summary can be either indicative
or informative. An indicative summary represents only the main idea of the text document,
whereas an informative summary gives a brief description of it; indicative summaries are
shorter than informative summaries. There are several approaches within each of these
summarization families [1]. There are two types of summarization based on the input text
document:
    • Single document summarization.
    • Multi-document summarization.
Single-document summarization generates a summary from a single document, while multi-
document summarization generates a summary from more than one document. Multi-document
summarization is harder than single-document summarization because of redundancy across
documents, the need to compress text from multiple sources, the collection of significant
information, etc. MEAD is a multi-document summarizer proposed in [2]. There are broadly
three categories of summarization based on the purpose of the generated summary:
    • Generic summarization.
    • Domain-specific summarization.
    • Query-based summarization.

Generic summarization condenses the overall information content of the source text. Domain-
specific summarization generates summaries from documents or text related to a given domain.
Query-based summarization generates summaries that answer a search query; the query is
given as input along with the text document. Monolingual summarization generates summaries
for a single language, whereas multilingual summarization generates summaries for more than
one language. While large-scale datasets exist for a number of languages like English, Chinese,
French, German, and Spanish, no such datasets exist for Indian languages. In this paper, we use
a multilingual pre-trained model to summarise text in Indian English and two other Indian
languages, Hindi and Gujarati.




Figure 2: Categories in text summarization


   The paper is organised as follows: work related to text summarization is reviewed in Section 2.
The dataset analysis and task description are given in Section 3. Section 4 details the different
models experimented with during the course of the shared task. Section 5 describes the internal
architecture and working of the pre-trained model mT5_m2m_CrossSum introduced in Section 4.1.
Section 6 presents the performance metrics used to evaluate a summarization task. Section 7
reports the performance results of the shared task for the submitted model. Finally, Section 8
concludes the paper.


2. Related works
The majority of existing works address extractive summarization because it is simple to
implement; the complexity of natural language processing makes abstractive summarization
a challenging task. Research on extractive summarization has plateaued after reaching its
peak performance, so the focus of researchers has shifted to abstractive summarization and
to the fusion of extractive and abstractive techniques. Abstractive summarization helps resolve
the dangling-anaphora problem and thus helps generate readable, concise, and cohesive summaries.
   The author of [3] focuses on approaches to text summarization that first apply extractive
summarization techniques, followed by abstractive ones. The paper briefly explains how
combinations of various extractive summarization techniques work and demonstrates that
fusion systems can help improve the consistency of the meta-system. The author also discusses
an alternative technique for abstractive summarization, known as the generative approach,
and experiments with changing the informativeness criterion used in abstractive summarization
from TextRank scores of words to log-likelihood ratios of the words. The paper further proposes
an approach that uses statistical machine translation for document summarization.

   Jagadish S. Kallimani et al., in their paper A Comprehensive Analysis of Guided Abstractive
Text Summarization, propose a solution for abstractive summarization of native Indian languages
[4]. In this method, the abstract data is extracted and processed to gather the key concepts from
the original text using extractive summarization techniques. Earlier research on summarizing
documents in Indian languages adopted paradigms for extracting salient sentences from text
using features like word frequency, phrase frequency, position in the text, and key phrases.
Such extractive summaries tend to have long sentences, and the desired information is scattered
across the document.


3. Dataset analysis and task description
The primary goal of this shared task is to generate a meaningful fixed-length summary, either
extractive or abstractive, for the dataset's code-mixed and script-mixed articles in Indian
languages (Hindi-English and Gujarati-English) and English. The news articles contain more
than one sentence, and the heading and link of each article are given. An article may contain
English phrases even if it is written in an Indian language. The task is to summarise each
news article at an appropriate length. The detailed description of the task, Findings of the
First Shared Task on Indian Language Summarization (ILSUM): Approaches, Challenges and
the Path Ahead, is presented in [5] [6].


4. Methodologies
This paper reflects on a specific approach to abstractive text summarization applied to English
and to Indian languages, namely Hindi and Gujarati. In terms of model architecture, we focus
on approaches based on now-ubiquitous large-scale pre-trained language models (LMs), such
as XL-Sum, CrossSum, BERT (Devlin et al., 2019), and BART (Lewis et al., 2020), which have
obtained new state-of-the-art results on diverse natural language processing tasks, including
text summarization [7] [8] [9].

4.1. mT5_m2m_CrossSum
We used mT5_m2m_CrossSum, a large-scale cross-lingual abstractive summarization model
that combines the properties of the base mT5 model with those of the fine-tuned m2m
(many-to-many) variant. In addition to the usual measures such as ROUGE and F1 scores,
LaSE, a new metric for automatically evaluating model-generated summaries that shows a
strong correlation with ROUGE, is used to analyse the model's performance. Performance on
ROUGE and LaSE indicates that pre-trained models fine-tuned on CrossSum consistently
outperform baseline models, even when the source and target language pairs are linguistically
distant. CrossSum is the largest cross-lingual summarization dataset and the first that does not
rely solely on English as the pivot language. This model was the best at summarizing the Hindi
and Gujarati datasets [10]. Sample summaries generated by this model for the English, Hindi,
and Gujarati datasets are presented in Figure 3.
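
To make the setup concrete, below is a minimal sketch of driving this checkpoint through the Hugging Face transformers library, assuming the publicly released csebuetnlp/mT5_m2m_crossSum model; the langid_map lookup and generation settings follow our reading of that model's public card, not necessarily the exact configuration behind our submitted runs.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "csebuetnlp/mT5_m2m_crossSum"  # public checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def get_lang_id(lang: str) -> int:
    # The m2m checkpoint stores one decoder start token per language;
    # forcing it selects the language of the generated summary.
    return tokenizer._convert_token_to_id(
        model.config.task_specific_params["langid_map"][lang][1]
    )

article = "..."  # a news article in English, Hindi, or Gujarati
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)

output_ids = model.generate(
    input_ids=inputs["input_ids"],
    decoder_start_token_id=get_lang_id("hindi"),  # summarize into Hindi
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```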




Figure 3: A sample of summaries generated by the mT5_m2m_CrossSum model in all three languages



4.2. XL-Sum
Contemporary work on abstractive text summarization has focused primarily on high-resource
languages like English, mostly due to the limited availability of datasets for low- and mid-resource
ones. Here we used a model pre-trained on XL-Sum, whose summaries are highly abstractive,
concise, and of high quality, as indicated by human and intrinsic evaluation. XL-Sum yields
competitive results compared to those obtained with similar models that work only on
monolingual datasets, and it achieved a relatively high ROUGE score compared to the other
models described below (a usage sketch follows Table 1). XL-Sum is the largest abstractive
summarization dataset in terms of the number of samples collected from a single source and the
number of languages covered. XL-Sum provided the highest performance scores for the English
dataset [11].

                                      Rouge-1     Rouge-2     Rouge-3
                         Precision    0.1892      0.0637      0.0357
                         Recall       0.0919      0.0317      0.0178
                         F1-Score     0.1185      0.0407      0.0223
Table 1
Scores for generating summaries using XL-Sum for the English validation set.
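
For comparison, generating with an XL-Sum fine-tuned checkpoint needs no language-id trick; this is a hedged sketch assuming the public csebuetnlp/mT5_multilingual_XLSum checkpoint and the whitespace normalization its card suggests.

```python
import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Collapse newlines and extra whitespace before tokenization (assumption
# based on the public model card's recommended preprocessing).
normalize = lambda text: re.sub(r"\s+", " ", text.strip())

model_name = "csebuetnlp/mT5_multilingual_XLSum"  # public checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "..."  # a news article in any of the 44 XL-Sum languages
input_ids = tokenizer(normalize(article), return_tensors="pt",
                      truncation=True, max_length=512)["input_ids"]
output_ids = model.generate(input_ids, max_length=84,
                            no_repeat_ngram_size=2, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```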
4.3. BERT
We also used BERT, a language representation model whose name stands for Bidirectional
Encoder Representations from Transformers. Unlike earlier language representation models,
BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly
conditioning on both left and right context in all layers. As a result, the pre-trained BERT model
can be fine-tuned with just one additional output layer to create state-of-the-art models for a
wide range of tasks, such as question answering and language inference, without substantial
task-specific architecture modifications. BERT is conceptually simple and empirically powerful:
it obtained new state-of-the-art results on eleven natural language processing tasks, including
pushing the GLUE score to 80.5% (a 7.7-point absolute improvement), MultiNLI accuracy to
86.7% (4.6% absolute improvement), SQuAD v1.1 question-answering test F1 to 93.2 (1.5-point
absolute improvement), and SQuAD v2.0 test F1 to 83.1 (5.1-point absolute improvement).
Unlike the multilingual models above, this BERT model supports only English [12] (a sketch
of a comparable extractive baseline follows Table 2).

                                       Rouge-1     Rouge-2     Rouge-3
                          Precision    0.0917      0.0781      0.0236
                          Recall       0.0829      0.0213      0.0379
                          F1-Score     0.0861      0.3347      0.2098
Table 2
Scores for generating summaries using BERT for the English validation set.
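
Since [12] fine-tunes BERT for extractive summarization, a quick way to reproduce a comparable English-only extractive baseline is the third-party bert-extractive-summarizer package; the sketch below assumes that package's Summarizer API and is not necessarily the exact tooling behind Table 2.

```python
# pip install bert-extractive-summarizer
# Hedged sketch of a BERT-based extractive baseline: sentences are embedded
# with BERT and clustered, and the sentences closest to the cluster centroids
# are returned verbatim as the summary.
from summarizer import Summarizer

model = Summarizer()  # defaults to a pre-trained English BERT
article = "..."       # an English news article from the ILSUM dataset
summary = model(article, num_sentences=3)  # keep the 3 most central sentences
print(summary)
```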




4.4. mT5-small
mT5-small is a small variant of mT5, the multilingual version of T5, which reframes all NLP
tasks into a unified text-to-text format in which the input and output are always text strings,
in contrast to BERT-style models that can only output a class label or a span of the input. The
text-to-text framework uses the same model, loss function, and hyperparameters for any NLP
task, including machine translation, document summarization, question answering, and
classification tasks (e.g., sentiment analysis). The mT5-small model can be used to train and
validate on a small dataset, but is relatively slow when training on huge datasets compared to
the other models [13] (a sketch of one fine-tuning step is given after Table 3).

                                       Rouge-1     Rouge-2     Rouge-3
                          Precision    0.0707      0.0413      0.0119
                          Recall       0.0917      0.0109      0.0246
                          F1-Score     0.0798      0.0172      0.0163
Table 3
Scores for generating summaries using mT5-small for the Gujarati validation set.
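
Because mT5 is released as a pre-trained (not summarization-fine-tuned) model, it must first be trained on article/summary pairs; the following sketch shows one illustrative teacher-forced training step against the google/mt5-small checkpoint, with the length limits chosen here purely for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

article, reference = "...", "..."  # one article/summary pair from the training set
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(text_target=reference, return_tensors="pt",
                   truncation=True, max_length=84)

# Teacher-forced seq2seq loss for a single example; a real run would batch
# the data and iterate over the whole training set for several epochs.
loss = model(**inputs, labels=labels["input_ids"]).loss
loss.backward()
optimizer.step()
```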



  All the models described above were used to generate the summaries for the given validation
and test datasets.
5. Architecture
This section describes the internal architecture and working of an abstractive summary
generation system of the kind underlying our pre-trained mT5_m2m_CrossSum runs. It explains
a method that combines information extraction (IE) with summarization to produce a guided
summary of domain-specific documents. The method takes a narrow view: it identifies instances
of a particular class of events and extracts arguments relevant to that class. A fully abstractive
approach, with separate processes for text analysis, content selection, and summary generation,
has the potential to generate summaries at a level comparable to humans. The presented method
uses a rule-based, custom-designed IE module, along with categorization, content-selection, and
sentence-generation systems, to fulfil the needs of abstractive summarization. The system uses
repositories such as rules and gazetteers to capture the language's syntax and semantics. This
IE rule-based approach extracts relevant information using lexical-analysis tools like part-of-
speech (POS) tagging and named entity recognition (NER), ensuring an information-rich
summary that reduces redundancy not just in the sentences produced but also in the information
conveyed. Figure 4 shows the architecture of the abstractive summary generation system
[14] [15] [16].
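
As a small illustration of the lexical-analysis step mentioned above, the sketch below runs POS tagging and named entity recognition over one sentence with spaCy; spaCy is our stand-in here, not the rule-based toolchain used in the cited systems.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (assumed installed)
doc = nlp("Reliance Industries reported record quarterly profits in Mumbai on Friday.")

# POS tags let content-selection rules locate the verbs and arguments of an event.
print([(tok.text, tok.pos_) for tok in doc])

# Named entities become candidate arguments for the event class being summarized.
print([(ent.text, ent.label_) for ent in doc.ents])
```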




Figure 4: Proposed Architecture for Abstractive Summary Generation System




6. Performance analysis
The generated summaries for each language are evaluated using standard ROUGE metrics:
Rouge-1, Rouge-2, and Rouge-4 F-scores. ROUGE stands for Recall-Oriented Understudy for
Gisting Evaluation, and four different ROUGE measures are described in [17]. The scores are
computed by comparing the set of generated summaries against a set of reference summaries
(typically human-produced). ROUGE-1 measures the overlap of unigrams between the generated
and reference summaries; ROUGE-2 measures the overlap of bigrams; and so on. For each
ROUGE variant, precision, recall, and F1-score are reported.
   Recall (in the context of ROUGE) measures how much of the reference summary the generated
summary captures, i.e., it is the fraction of the content chosen by the human that was also
produced by the system.
\[
\mathrm{Recall} = \frac{|\text{overlap of generated and reference summary}|}{|\text{reference summary}|} \tag{1}
\]
    Precision refers to how much of the generated summary was in fact relevant or needed, i.e.,
it is the fraction of system sentences that were correct.
\[
\mathrm{Precision} = \frac{|\text{overlap of generated and reference summary}|}{|\text{generated summary}|} \tag{2}
\]
  If we consider individual words, |generated summary| and |reference summary| refer to the
numbers of words in the generated summary and reference summary respectively, whereas
|overlap of generated and reference summary| refers to the number of words shared between
the reference summary and the generated summary.
  F1-score, also known as the F-measure or the F-score, conveys the balance between precision
and recall.
\[
\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}
\]
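
As a concrete check of the formulas above, the sketch below scores a toy generated summary against a reference using the rouge-score package; this package is an assumption on our part, and the shared task's official scorer may differ in details such as stemming and tokenization.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "the court upheld the verdict on friday"
generated = "the verdict was upheld on friday"

# rouge1/rouge2 report precision, recall, and F1 exactly per Eqs. (1)-(3),
# with unigram and bigram overlaps as the counted units.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```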
   These performance metrics increase with the number of overlapping words between the
generated summary and the reference summary; they do not take into account the type of
summary generated. As we saw in Section 1, the generated summary can be extractive or
abstractive. Because extractive summarization extracts key words from the given text, it
produces a higher number of overlapping words even if the output lacks meaning. Abstractive
summarization, by contrast, does not extract keyphrases but generates a meaningful summary
that conveys the text document at its best; this may yield fewer overlapping words and
therefore lower final scores [18].

   Summaries generated by abstraction are more meaningful and better capture the gist of the
content being summarized. Of all the models used, the mT5_m2m_CrossSum model provided
the most accurate summaries of the given news-article dataset across the Indian languages.


7. Result
The mT5_m2m_CrossSum model described in Section 4.1 was used for our submissions on the
validation datasets for all three languages: English, Gujarati, and Hindi. Our submission secured
the 7th rank in the task on the Hindi dataset. The final performance results for the validation
dataset in Hindi are recorded in Table 4.
                                 Rouge-1     Rouge-2    Rouge-3     Rouge-4
                    F1-Score      0.371       0.223      0.168       0.132
                    Precision     0.535       0.321      0.241       0.191
                    Recall        0.299       0.181      0.137       0.108
Table 4
Performance of the mT5_m2m_CrossSum model with a validation dataset for Hindi



Our submission secured the 6th rank in the task on the Gujarati dataset. The final performance
results for the validation dataset in Gujarati are recorded in Table 5.

                                 Rouge-1     Rouge-2    Rouge-3     Rouge-4
                    F1-Score      0.119       0.040      0.022       0.014
                    Precision     0.189       0.063      0.035       0.021
                    Recall        0.091       0.031      0.018       0.011
Table 5
Performance of the mT5_m2m_CrossSum model with a validation dataset for Gujarati



Our submission secured the 9th rank in the task on the English dataset. The final performance
results for the validation dataset in English are recorded in Table 6.

                                 Rouge-1     Rouge-2    Rouge-3     Rouge-4
                    F1-Score      0.274       0.089      0.044       0.025
                    Precision     0.428       0.144      0.073       0.043
                    Recall        0.210       0.067      0.033       0.019
Table 6
Performance of the mT5_m2m_CrossSum model with a validation dataset for English




8. Conclusion
In this paper, we have briefly described existing text summarization methods for Indian
languages and shown the results and performance analysis of a few techniques. We worked
with the mT5_m2m_CrossSum, XL-Sum, BERT, and mT5-small models, of which
mT5_m2m_CrossSum gave us the best results. Although models like BERT could be used only
for the English dataset, the multilingual cross-lingual models outperformed the pre-trained
monolingual models. Table 7 reports the performance of our submitted model on the Hindi
test set. We hope that this paper gives a fair idea of the different models that can be used to
effectively summarize Indian English and other Indian languages on a given dataset of news
articles collected over time.
                                Rouge-1    Rouge-2    Rouge-3     Rouge-4
                   F1-Score       0.37       0.23       0.17        0.14
                   Precision      0.55       0.33       0.25        0.20
                   Recall         0.30       0.18       0.14        0.11
Table 7
Performance of the mT5_m2m_CrossSum model with test dataset for Hindi



References
 [1] N. Andhale, L. Bewoor, An overview of text summarization techniques, in: 2016 Interna-
     tional Conference on Computing Communication Control and automation (ICCUBEA),
     2016, pp. 1–7. doi:10.1109/ICCUBEA.2016.7860024.
 [2] D. R. Radev, H. Jing, M. Styś, D. Tam, Centroid-based summarization of multiple
     documents, Information Processing & Management 40 (2004) 919–938. URL: https://
     www.sciencedirect.com/science/article/pii/S0306457303000955. doi:10.1016/j.ipm.
     2003.10.006.
 [3] P. Mehta, From extractive to abstractive summarization: A journey, 2016, pp. 100–106.
     doi:10.18653/v1/P16-3015.
 [4] J. S. Kallimani, K. Srinivasa, B. E. Reddy, A comprehensive analysis of guided abstractive
     text summarization, International Journal of Computer Science Issues (IJCSI) 11 (2014)
     115.
 [5] S. Satapara, B. Modha, S. Modha, P. Mehta, Findings of the first shared task on indian
     language summarization (ilsum): Approaches, challenges and the path ahead, in: Working
     Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, Kolkata, India, December
     9-13, 2022, CEUR Workshop Proceedings, CEUR-WS.org, 2022.
 [6] S. Satapara, B. Modha, S. Modha, P. Mehta, Fire 2022 ilsum track: Indian language
     summarization, in: Proceedings of the 14th Forum for Information Retrieval Evaluation,
     ACM, 2022.
 [7] D. Radev, S. Teufel, H. Saggion, W. Lam, J. Blitzer, A. Celebi, H. Qi, E. Drabek, D. Liu,
     Evaluation of text summarization in a cross-lingual information retrieval framework,
     Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD,
     Tech. Rep 6 (2002).
 [8] Y. Liu, M. Lapata, Text summarization with pretrained encoders, arXiv preprint
     arXiv:1908.08345 (2019).
 [9] K. Hong, A. Nenkova, Improving the estimation of word importance for news multi-
     document summarization, in: Proceedings of the 14th Conference of the European Chapter
     of the Association for Computational Linguistics, 2014, pp. 712–721.
[10] A. Bhattacharjee, T. Hasan, W. U. Ahmad, Y.-F. Li, Y.-B. Kang, R. Shahriyar, Crosssum:
     Beyond english-centric cross-lingual abstractive text summarization for 1500+ language
     pairs, 2021. URL: https://arxiv.org/abs/2112.08804. doi:10.48550/ARXIV.2112.08804.
[11] T. Hasan, A. Bhattacharjee, M. S. Islam, K. Samin, Y.-F. Li, Y.-B. Kang, M. S. Rahman,
     R. Shahriyar, Xl-sum: Large-scale multilingual abstractive summarization for 44 languages,
     arXiv preprint arXiv:2106.13822 (2021).
[12] Y. Liu, Fine-tune bert for extractive summarization, arXiv preprint arXiv:1903.10318
     (2019).
[13] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raf-
     fel, mt5: A massively multilingual pre-trained text-to-text transformer, arXiv preprint
     arXiv:2010.11934 (2020).
[14] G. Shilpa, D. Shashi Kumar, Abs-Sum-Kan: an abstractive text summarization technique
     for an Indian regional language by induction of tagging rules, Int J Recent Technol Eng
     (IJRTE), ISSN (2019) 2277–3878.
[15] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio,
     Learning phrase representations using RNN encoder–decoder for statistical machine trans-
     lation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language
     Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp.
     1724–1734. URL: https://aclanthology.org/D14-1179. doi:10.3115/v1/D14-1179.
[16] T. Cohn, M. Lapata, Sentence compression beyond word deletion, in: Proceedings of the
     22nd International Conference on Computational Linguistics (Coling 2008), Coling 2008
     Organizing Committee, Manchester, UK, 2008, pp. 137–144. URL: https://aclanthology.org/
     C08-1018.
[17] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summariza-
     tion Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp.
     74–81. URL: https://aclanthology.org/W04-1013.
[18] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, in:
     Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
     (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia,
     2018, pp. 328–339. URL: https://aclanthology.org/P18-1031. doi:10.18653/v1/P18-1031.