<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Extractive Summarization for Low Resource Indian Languages using TF-IDF and SVD</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sangita Singh</string-name>
          <email>sangitas.ph21.cs@nitp.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jyoti Prakash Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akshay Deepak</string-name>
          <email>akshayd@nitp.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supriya</string-name>
          <email>supriya.phd20.cs@nitp.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Technology Patna</institution>
          ,
          <addr-line>Patna, 800005, Bihar</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text summarization has been a well-known problem in natural language processing (NLP) in recent years. A combination of term frequency-inverse document frequency (TF-IDF) with a dimension-reduction technique named singular value decomposition (SVD) has shown promising results for extractive text summarization across different Indian languages. Our main goal is to produce an extractive summary of a text document that is succinct, fluent, and stable. To this end, we used the Indian Language Summarization (ILSUM)-2024 datasets, released as the third edition of the shared task organized by the Forum for Information Retrieval Evaluation (FIRE-2024). Our team, Sangita_NIT_Patna, achieved third place for the Bengali and Gujarati languages in Task 1. For the Hindi and Telugu languages, we secured fourth place, and for Tamil, we ranked fifth. We used article descriptions as our input data and generated a simple summary of each article description as output.</p>
      </abstract>
      <kwd-group>
        <kwd>TF-IDF</kwd>
        <kwd>SVD</kwd>
        <kwd>Extractive Text Summarization</kwd>
        <kwd>ILSUM-2024 Datasets</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As part of the ILSUM-2024 shared task, we developed a single-document extractive text summarization framework using TF-IDF and SVD
techniques for various Indian languages.</p>
      <p>The remainder of the paper is organized as follows. Section 2 provides a synopsis of related
work. Section 3 presents our proposed framework for ILSUM-2024. Section 4 presents the evaluation
of the proposed system and an analysis of the results. Finally, Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Text summarization is a vibrant research area in Natural Language Processing (NLP), with a focus
on automatically condensing text into concise summaries. This section provides an overview of the
numerous studies conducted in the area of text summarization. Kumar et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] designed an
extractive text summarization framework that involves multiple text features, including position,
length, similarity, frequent words, and sentence numbers, in the ILSUM task at FIRE 2022. These features are
then combined with optimized weights, determined using a Genetic Algorithm (GA), to rank sentences.
They achieved an F-score of 0.3843 for ROUGE-1, 0.2584 for ROUGE-2, 0.1997 for ROUGE-3, and 0.2190
for ROUGE-4 in the best run, submitted along with two other runs. Singh et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used a particle swarm optimization (PSO)-based
technique with a ROUGE-1 recall cost function in a supervised manner for the single-document extractive
text summarization task. They also introduced a new feature, “incorrect word”, in this work. Agarwal et
al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] employed the IndicBART model to generate text summaries on the provided Hindi dataset for
ILSUM-2022. IndicBART is a multilingual sequence-to-sequence pre-trained model that supports 11
Indian languages. By leveraging the IndicBART model for training, they achieved a ROUGE-1 F-score
of 0.544 on the testing dataset, demonstrating the model’s effectiveness in generating high-quality
summaries. Singh et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] employed a sequence-to-sequence attention model based on recurrent
neural networks (RNNs) for English in ILSUM-2022, which showed promising results for abstractive
text summarization. Specifically, they used article text descriptions as input data in Bidirectional Long
Short-Term Memory (Bi-LSTM) networks in the encoding layer and generated a simplified summary
of the article description as output using LSTMs in the decoding layer. Singh et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] extracted
features from each sentence in the document using ten statistical features, and then summed these
features to score sentences for both Hindi and English languages to generate the summary. Kumari
et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduced an extractive text summarization method employing K-means clustering for the
ILSUM-2022 dataset. The technique comprises text tokenization, Word2Vec-based word and sentence
vectorization, and dimensionality reduction using autoencoders and K-means clustering. This process
facilitates the identification and extraction of key sentences and phrases, generating a coherent and
informative summary. Chakraborty et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] experimented with pre-trained BART, GPT, and
T5 models for the English language in ILSUM-2022 at FIRE-2022. Team MT-NLP-IIITH [
        <xref ref-type="bibr" rid="ref11 ref2">11, 2</xref>
        ] achieved the
best performance across all three summarization tasks. The authors fine-tuned various transformer
models, treating text summarization as a bottleneck task. Specifically, for Hindi and Gujarati, they
fine-tuned MT5, MBart, and IndicBART for five epochs with a learning rate of 5e-5 and a maximum
input length of 512; MT5 emerged as the best-performing model for Hindi, while MBart performed
best for Gujarati. For English, they fine-tuned PEGASUS, BART, T5, and ProphetNet using similar
hyperparameters, and PEGASUS outperformed the other models on text data. Satapara et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] offer a comprehensive overview of the first edition of the ILSUM shared task, organized as part of
the 14th FIRE-2022 conference. They covered the task’s goals, approach, participant submissions,
and evaluation outcomes, providing a valuable snapshot of the current research landscape in Indian
language summarization. Team BITS Pilani [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] fine-tuned the mT5 (mT5-multilingual-XLSum) model on
the ILSUM dataset for all four languages. Team NITK-AI [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] outperformed the other teams by
fine-tuning T5-base on the ILSUM English dataset. Team Irlab-IITBHU [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] utilized named entity-aware
text summarization, in which NER serves as an important factor for extracting in-depth information and
prioritizing key entities for the summary, using a pre-trained MuRIL-based Hindi NER model and fine-tuning
MBART-50 for the Hindi language. The following literature is primarily based on the FIRE-2023 (ILSUM) shared
task. The NITK-AI (SCALAR) team [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] utilized the T5-Base model for Indian English, achieving scores
of 0.3321, 0.1731, 0.121, and 0.282 for ROUGE-1 F1, ROUGE-2 F1, ROUGE-4 F1, and ROUGE-L F1,
respectively. The authors [14] employed mT5-base along with a fine-tuned T5-base to generate more
accurate summaries, resulting in scores of 0.3022, 0.1111, 0.2504, and 0.8616 for English, and 0.2701,
0.1214, 0.2237, and 0.6782 for Hindi across the same metrics. The Irlab-IITBHU team [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] fine-tuned
the MBART-50 pre-trained model, achieving ROUGE scores of 0.5625, 0.471, 0.4032, and 0.5373 for
Hindi. Meanwhile, the BITS Pilani [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] team fine-tuned the mT5 (mT5-multilingual-XLSum) model,
with results of 0.174, 0.0747, 0.0333, and 0.1655 for Gujarati, and 0.12, 0.0567, 0.0254, and 0.1087 for
Bengali [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], respectively. Among all the teams, NITK-AI [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] performed the best, fine-tuning the
T5-base model on the ILSUM English dataset and achieving the scores mentioned. Gupta et al. [15]
employed an approach called Named Entity-Aware Abstractive Text Summarization (NEA-ATS) for
the Hindi language. Their method distinctively combines Named Entity Recognition with advanced
pretrained language models, emphasizing key entities like people, places, and organizations.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Model</title>
      <p>In this section, we present the methodology and datasets. We propose an extractive text
summarization framework for various Indian languages. We explain each step in detail in the
following subsections, and the overall architecture is shown in Figure 1. The proposed model generates
multi-sentence summaries.</p>
      <p>Figure 1 depicts the pipeline of the proposed model: datasets, TF-IDF with n-grams, singular value decomposition (SVD), feature-based sentence ranking, and generation of the summary from the top n% of ranked sentences.</p>
        <sec id="sec-3-2-1">
          <title>3.1. Data Collection</title>
          <p>
            To evaluate our model, we utilized the ILSUM-2024 datasets provided by FIRE-2024 [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. By
developing reusable corpora for summarization in different Indian languages, the organizers hope to fill the current
gap through this joint effort. The third edition of ILSUM adds three Dravidian languages (Kannada,
Tamil, and Telugu) to the Hindi, Gujarati, Bengali, and Indian English covered in the previous
edition [16]. The dataset for this task is built using article and headline pairs from several leading
newspapers of the country. They provide over 15,000 news articles for each language (except Tamil).
The dataset description is shown in Table 1. The objective is to generate a concise, fixed-length
summary for each article, which can be either extractive or abstractive in nature.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2. Data Preprocessing</title>
          <p>We preprocessed the dataset by removing rows with missing or duplicate values in the “Article” or
“Summary” columns. We then tokenized each article into sentences and removed punctuation and empty
strings to prepare the text for further processing.</p>
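          <p>A minimal sketch of this preprocessing step is shown below. It assumes a pandas DataFrame with “Article” and “Summary” columns and uses a simple regular-expression sentence splitter on the danda (।) and other terminators; a production system might substitute an Indic-aware sentence tokenizer.</p>
          <preformat>
# Minimal preprocessing sketch (the column names and the regex-based
# sentence splitter are assumptions, not the exact pipeline).
import re
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with missing values, then duplicates, in the text columns.
    df = df.dropna(subset=["Article", "Summary"])
    df = df.drop_duplicates(subset=["Article", "Summary"])
    return df

def split_sentences(article: str) -> list:
    # Split on the danda (।), full stop, question mark, or exclamation mark.
    parts = re.split(r"[।.?!]+\s*", article)
    # Strip punctuation and drop empty strings.
    cleaned = [re.sub(r"[^\w\s]", "", s).strip() for s in parts]
    return [s for s in cleaned if s]
          </preformat>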
        </sec>
        <sec id="sec-3-2-3">
          <title>3.3. TF-IDF with n-gram</title>
          <p>In this section, we use the TF-IDF technique to represent each sentence of the article in vector form. The TF-IDF vectorizer is configured to extract features from the text data using a range of n-grams, including single words (unigrams), two-word combinations (bigrams), three-word combinations (trigrams), and four-word combinations (4-grams). By selecting the top 2000 most frequent terms across all documents, the TF-IDF matrix is truncated, reducing its dimensionality and focusing on the most important terms. This helps conserve memory and boost computational efficiency. TF-IDF combines two measures:</p>
          <p>1. Term Frequency (TF): a numerical measure that represents how frequently a term appears in a given document. Here, g represents an n-gram (a single word or a sequence of n words) and d represents a specific document. This score increases with the frequency of a term in a document but does not consider whether the term is common across other documents in the corpus. The TF score indicates the relative importance of a term within a document by measuring its occurrence. For a specific n-gram g in a document d, the TF is calculated as:</p>
          <p>TF(g, d) = f(g, d) / ∑_{g′ ∈ d} f(g′, d) (1)</p>
          <p>where f(g, d) is the frequency of n-gram g in the document d, and the denominator ∑_{g′ ∈ d} f(g′, d) is the total count of all n-grams in the document.</p>
          <p>2. Inverse Document Frequency (IDF): measures how common or rare an n-gram is across all documents in the corpus. An n-gram that appears in many documents will have a lower IDF score; if it appears in almost every document, its IDF score will be close to zero, meaning it has less unique significance. The inverse document frequency of n-gram g is:</p>
          <p>IDF(g) = log( N / (1 + n_g) ) (2)</p>
          <p>where N is the total number of documents in the corpus and n_g is the number of documents containing the n-gram g.</p>
          <p>3. TF-IDF Calculation: the TF-IDF score for an n-gram in a document is the product of its TF and IDF scores:</p>
          <p>TF-IDF(g, d) = TF(g, d) × IDF(g) (3)</p>
          <p>n-grams with high TF-IDF scores are considered important or unique to that document compared to other documents in the corpus.</p>
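          <p>A sketch of this vectorization step using scikit-learn’s TfidfVectorizer is shown below; note that scikit-learn’s IDF smoothing differs slightly from Eq. (2), so this is an approximation of the configuration described above rather than an exact reproduction.</p>
          <preformat>
# TF-IDF over the article's sentences with unigrams through 4-grams,
# truncated to the 2000 most frequent terms (sketch via scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_matrix(sentences):
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 4),   # unigrams, bigrams, trigrams, 4-grams
        max_features=2000,    # keep only the top 2000 terms by frequency
    )
    # Rows correspond to sentences, columns to the selected n-grams.
    return vectorizer.fit_transform(sentences)
          </preformat>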
        </sec>
        <sec id="sec-3-2-4">
          <title>3.4. Singular Value Decomposition (SVD)</title>
          <p>SVD is a mathematical technique used in linear algebra for decomposing a matrix into three other
matrices. SVD is used in Latent Semantic Analysis (LSA) to uncover relationships between terms and
documents by reducing dimensionality in text data. SVD can reduce the dimensionality of feature
vectors for sentences in the article, produced by the TF-IDF technique, while preserving the essential
features and relationships in the article.</p>
          <p>A = U Σ Vᵀ (4)</p>
          <p>where:
• A is a matrix of dimension M × N.
• U: M × M matrix of the orthonormal eigenvectors of A Aᵀ.
• Σ: diagonal matrix with r elements equal to the square roots of the positive eigenvalues of A Aᵀ or Aᵀ A.
• Vᵀ: transpose of an N × N matrix containing the orthonormal eigenvectors of Aᵀ A.</p>
          <p>The model sums the rows of Vᵀ, giving a score for each sentence based on its importance.</p>
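          <p>A small sketch of this step is given below, assuming the TF-IDF matrix is arranged as terms × sentences so that the columns of Vᵀ align with sentences; taking absolute values before summing is an implementation choice not specified above.</p>
          <preformat>
# Score sentences by decomposing the TF-IDF matrix with SVD (Eq. 4).
import numpy as np

def svd_sentence_scores(X):
    # Arrange A as terms x sentences so columns of V^T map to sentences.
    A = np.asarray(X.todense()).T
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # Sum the rows of V^T (in absolute value): one score per sentence.
    return np.abs(Vt).sum(axis=0)
          </preformat>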
        </sec>
        <sec id="sec-3-2-5">
          <title>3.5. Ranking</title>
        </sec>
        <sec id="sec-3-2-6">
          <title>3.6. Generation</title>
          <p>In this step, we prioritize the sentences in the article, ranking them in descending order of importance.
This produces a list of ranked sentences, with the most important sentences at the top and the less
significant ones at the bottom.</p>
          <p>To generate the summary, we first calculate the sentence count by taking =15% of the total sentences.
We then combine the top-ranked sentences up to this count to create the summary.</p>
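          <p>A minimal sketch combining the ranking and generation steps is shown below; re-sorting the selected sentences back into their original article order is an assumption, since the text does not specify the output ordering.</p>
          <preformat>
# Rank sentences by score and keep the top n = 15% as the summary.
import numpy as np

def generate_summary(sentences, scores):
    n = max(1, int(0.15 * len(sentences)))   # n = 15% of total sentences
    top = np.argsort(scores)[::-1][:n]       # indices of highest scores
    top = sorted(top)                        # restore article order (assumed)
    return " ".join(sentences[i] for i in top)
          </preformat>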
        </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Metric and Results</title>
      <p>In this section, we present the evaluation metrics used to assess the performance of our proposed
approach, followed by a detailed discussion of the results obtained.</p>
      <sec id="sec-4-1">
        <title>4.1. Evaluation Metric</title>
        <p>In this study, we utilized the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [17] and
BERTScore (B) metrics to assess the performance of our model. ROUGE (R) measures the quality of
the generated summaries by counting the number of overlapping lexical units between the generated
and reference summaries. Unlike R, which relies on exact word or phrase matches, B leverages
pretrained contextual embeddings from BERT (Bidirectional Encoder Representations from Transformers)
to calculate the semantic similarity between the generated and reference summaries. ROUGE and
BERTScore include precision (Pre), recall (Rec), and F1 score as part of their evaluation metrics. Here,
we employed R-N (with N=1,2,4) and R-L (Longest Common Subsequence) to compute the R-1, R-2,
and R-L scores based on Pre, Rec and F1 for both the training and validation datasets. For the testing
dataset, R-N (with N=1, 2, and 4) and R-L were evaluated based on the F1-score, while B was evaluated
using Pre, Rec, and F1-score.</p>
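        <p>A sketch of how these metrics can be computed per generated/reference pair is shown below, assuming the rouge-score and bert-score Python packages; the exact evaluation tooling used by the organizers may differ.</p>
        <preformat>
# Compute ROUGE (R-1, R-2, R-4, R-L) and BERTScore for one summary pair.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(generated, reference, lang="hi"):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rouge4", "rougeL"])
    rouge = scorer.score(reference, generated)   # precision, recall, F1 each
    # BERTScore returns tensors of precision, recall, and F1 values.
    P, R, F1 = bert_score([generated], [reference], lang=lang)
    return rouge, (P.item(), R.item(), F1.item())
        </preformat>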
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>
          In this section, we discuss the results obtained on the training, validation, and test datasets provided
by ILSUM-2024 [18] [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Table 2 shows the results on the training dataset for all different languages for
Task 1. Table 3 shows the results on the validation dataset for five different languages for Task 1. In the
Bengali and Gujarati language categories, we secured the 3rd position and present the corresponding
results in Table 4 and Table 6, respectively, for Task 1. In the Hindi and Telugu language categories,
we secured the 4th position and present the corresponding results in Table 5 and Table 8, respectively,
for Task 1. Similarly, for the Tamil language, we achieved the 5th position and show the results in
Table 7. In this study, TF-IDF was employed to represent text data by assigning weights to terms based
on their importance within the corpus. This method proved effective in reducing the influence of
high-frequency, low-relevance terms, resulting in a more meaningful feature space. Singular value
decomposition further enhanced the feature set by capturing latent semantic relationships and reducing
dimensionality, thereby improving computational efficiency and model generalization.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future work</title>
      <p>In this work, we applied TF-IDF and SVD techniques for extractive text summarization in various Indian
languages. These findings offer valuable insights into producing higher-quality summaries with less redundancy
and more coherence: used together, TF-IDF and SVD create summaries that capture
both the key terms (from TF-IDF) and the latent concepts (from SVD), producing summaries that are both
relevant and coherent. While these traditional methods lack the contextual embeddings provided by
deep learning techniques, their simplicity and interpretability make them valuable tools, particularly
for resource-constrained applications. Singular value decomposition helps reduce redundancy by
identifying overlapping information within sentences, and it also captures semantic relationships between
words and sentences. However, TF-IDF does not consider sentence structure or semantics, so the resulting
summary may lack coherence, as the method focuses purely on the frequency and rarity of terms. Singular
value decomposition can be computationally expensive, especially for large documents, and it may
struggle with small texts where latent structures are harder to detect.</p>
      <p>Several areas can be explored to further enhance the extractive text summarization process for Indian
languages. Future work could explore hybrid approaches combining TF-IDF and SVD with contextual
word embeddings for improved performance. One promising direction is the integration of more
advanced techniques, such as transformer-based architectures (e.g., BERT or GPT), which can capture
deeper semantic understanding and sentence structure beyond what TF-IDF and SVD offer.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The first author would like to acknowledge the Ministry of Education (MOE), Government of India for
financial support during the research work through the Rajiv Gandhi Fellowship Ph.D. scheme (UGC)
for computer science &amp; engineering.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors confirm that no generative AI tools were used in the writing, editing, or analysis processes
of this manuscript. All content was created and reviewed by the authors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. S.</given-names>
            <surname>El-Kassas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Salama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rafea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <article-title>Automatic text summarization: A comprehensive survey</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>165</volume>
          (
          <year>2021</year>
          )
          <fpage>113679</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          , P. Mehta,
          <article-title>FIRE 2022 ILSUM track: Indian language summarization</article-title>
          , in: D.
          <string-name>
            <surname>Ganguly</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gangopadhyay</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Mitra</surname>
          </string-name>
          , P. Majumder (Eds.),
          <source>Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <string-name>
            <surname>FIRE</surname>
          </string-name>
          <year>2022</year>
          , Kolkata, India, December 9-
          <issue>13</issue>
          ,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>8</fpage>
          -
          <lpage>11</lpage>
          . URL: https://doi.org/10.1145/3574318.3574328. doi:10.1145/3574318.3574328.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. HL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Overview of the third shared task on indian language summarization</article-title>
          (ilsum
          <year>2024</year>
          ), in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , D. Ganguly (Eds.), Working Notes of FIRE 2024 -
          <article-title>Forum for Information Retrieval Evaluation, Gandhinagar, India</article-title>
          .
          <source>December 12-15</source>
          ,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. V. P.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Extractive text summarization using meta-heuristic approach</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>464</fpage>
          -
          <lpage>474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deepak</surname>
          </string-name>
          ,
          <article-title>Supervised weight learning-based pso framework for single document extractive summarization</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>161</volume>
          (
          <year>2024</year>
          )
          <fpage>111678</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sonawane</surname>
          </string-name>
          ,
          <article-title>Abstractive text summarization for hindi language using indicbart</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>409</fpage>
          -
          <lpage>417</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deepak</surname>
          </string-name>
          ,
          <article-title>Deep learning based abstractive summarization for english language</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>383</fpage>
          -
          <lpage>392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deepak</surname>
          </string-name>
          ,
          <article-title>Statistical and linguistic features based extractive text summarization for english and hindi languages</article-title>
          ,
          <source>in: 2024 First International Conference on Pioneering Developments in Computer Science &amp; Digital Technologies (IC2SDT)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>222</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <article-title>An extractive approach for automated summarization of indian languages using clustering techniques</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>418</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Laskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pakray</surname>
          </string-name>
          ,
          <article-title>Exploring text summarization models for indian languages</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <article-title>Findings of the first shared task on indian language summarization (ILSUM): approaches challenges and the path ahead</article-title>
          , in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2022 -
          <article-title>Forum for Information Retrieval Evaluation, Kolkata</article-title>
          , India, December 9-
          <issue>13</issue>
          ,
          <year>2022</year>
          , volume
          <volume>3395</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>382</lpage>
          . URL: https://ceur-ws.org/Vol-3395/T6-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Key takeaways from the second shared task on indian language summarization (ILSUM 2023)</article-title>
          , in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2023 -
          <article-title>Forum for Information Retrieval Evaluation (FIRE-WN</article-title>
          <year>2023</year>
          ), Goa, India,
          <source>December 15-18</source>
          ,
          <year>2023</year>
          , volume
          <volume>3681</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>724</fpage>
          -
          <lpage>733</lpage>
          . URL: https://ceur-ws.org/Vol-3681/T8-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gowhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Madasamy</surname>
          </string-name>
          ,
          <article-title>Advancing human-like summarization: Approaches to text summarization</article-title>
          .,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>747</fpage>
          -
          <lpage>754</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] V. Ilanchezhiyan, R. Darshan, E. M. Dhitshithaa, B. Bharathi, Text summarization for indian languages: Finetuned transformer model application., in: FIRE (Working Notes), 2023, pp. 766-774.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Gupta, S. Pal, Named entity-aware abstractive text summarization for hindi language., in: FIRE (Working Notes), 2023, pp. 755-765.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] S. Satapara, P. Mehta, S. Modha, D. Ganguly, Indian language summarization at FIRE 2023, in: D. Ganguly, S. Majumdar, B. Mitra, P. Gupta, S. Gangopadhyay, P. Majumder (Eds.), Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Panjim, India, December 15-18, 2023, ACM, 2023, pp. 27-29. URL: https://doi.org/10.1145/3632754.3634662. doi:10.1145/3632754.3634662.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] F. Liu, Y. Liu, Exploring correlation between rouge and human evaluation on meeting summaries, IEEE Transactions on Audio, Speech, and Language Processing 18 (2009) 187-196.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] S. Satapara, P. Mehta, S. Modha, A. Hegde, S. HL, D. Ganguly, Key insights from the third ILSUM track at FIRE 2024, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2024, Gandhinagar, India, December 12-15, 2024, ACM, 2024.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>