Extractive Text Summarization using Meta-heuristic Approach

Doppalapudi Venkata Pavan Kumar1,*,†, Srigadha Shreyas Raj1,†, Pradeepika Verma2,† and Sukomal Pal1,†
1 Indian Institute of Technology (BHU), Varanasi, INDIA
2 TIH, Indian Institute of Technology, Patna, INDIA

Abstract
This paper describes a system for Indian Language Extractive Summarization developed by team SUMIL22 for the ILSUM shared task at FIRE 2022 [1][2]. The system forms a summary by selecting sentences directly from the input text using a ranking function. For each sentence, we first compute several text features: sentence position, sentence length, sentence similarity, presence of frequent words, and presence of numerical data. These features, together with their optimized weights, are used to rank the sentences, and the summary is generated by selecting the top-ranked ones. The feature weights are optimized with a population-based meta-heuristic approach, the Genetic Algorithm (GA). We submitted three runs; our best run obtained F-scores of 0.3843 for ROUGE-1, 0.2584 for ROUGE-2, 0.2313 for ROUGE-3 and 0.2190 for ROUGE-4.

Keywords
Extractive Text Summarization, Genetic Algorithm, Text Features, ROUGE

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
* Corresponding author.
† These authors contributed equally.
$ dvenkata.pavankumar.cse19@itbhu.ac.in (D. V. P. Kumar); srigadha.shreyasraj.cse19@itbhu.ac.in (S. S. Raj); pradeepikav.verma093@gmail.com (P. Verma); spal.cse@itbhu.ac.in (S. Pal)
€ https://ShreyasRaj4.github.io/ (S. S. Raj); https://cse-iitbhu.github.io/irlab/spal.html (S. Pal)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Information has become a central part of our lives, and people draw on a large number of sources, from news articles to social media posts, to stay informed. Automatically summarizing long texts compresses this information into shorter form and saves readers considerable time. Our work focuses on automatic text summarization of news data. Text summarization is the creation of a shorter, precise, and more relevant version of a longer text. Automatic summarization methods can handle the ever-increasing amount of text on the internet, allowing users to find and consume important information more quickly.

Abstractive and extractive summarization are the two broad types of automatic text summarization. Abstractive methods attempt to interpret the text using advanced natural language processing techniques and generate a new, more concise text whose phrases may not appear in the original article. They capture the most critical information by paraphrasing sentences and combining knowledge from the whole article, much as a human-written abstract does. An adequate abstractive summary is linguistically fluent and encapsulates the core information of the input. Extractive summarization methods, on the other hand, create the summary by picking sentences directly from the text, using a ranking function to select a relevant subset.
This method identifies essential sections of the text, cropping out and joining together parts of the content to produce an abridged version. In our approach, we use extractive text summarization for the English language.

The rest of the article is organized as follows. We review state-of-the-art approaches to extractive and abstractive document summarization in Section 2. We then present our proposed approach in Section 3. Section 4 discusses the results and analysis, followed by the conclusion and future scope in Section 5.

2. Literature Survey

A brief literature survey [3] of extractive and abstractive summarization is given below. TextRank [4] is an unsupervised, graph-based method for summarizing text documents. The sentences of a document are treated as the nodes of a graph, and the similarities between sentences as its edges. Sentences are transformed into vectors using techniques such as bag-of-words, Tf-Idf, and word2vec; they are then ranked using cosine similarity, and the top sentences are extracted to form the summary. Next, Erkan and Radev [5] proposed a graph-based summarization system known as LexRank. This system identifies the most central sentences in a cluster of sentences, which carry sufficient information about the main topic of the cluster. The sentences of the articles are converted into vectors with the Tf-Idf technique, and the sentence vectors are then clustered to extract the summary of the article. Further, a heuristic approach for Telugu text summarization with improved sentence ranking was proposed by Mamidala et al. [6]. In this work, extractive summarization for the Telugu language was performed using optimized sentence ranking, with a scoring mechanism based on event and named-entity scores computed by applying statistical measures to these features. Next, Verma and Om [7] proposed a multi-document summarization system based on maximum coverage and relevancy with minimal redundancy. Here, a single document is generated from all the documents, and each sentence is scored using different text features whose optimal weights are assigned by Shark Smell Optimization (SSO). Moreover, Verma et al. [8][9] proposed other effective methods for document summarization.

3. Text Summarization using Proposed approach

In the proposed work, summarization is carried out in four stages: preprocessing of the input text, text feature analysis, optimal weight assignment to the text features, and summary generation (Figure 1). These steps are described as follows:

Figure 1: Flowchart of our approach

3.1. Preprocessing

Preprocessing is the first step in our method. It covers removal of stop words, punctuation, and unwanted data from the articles, as well as sentence and word tokenization. Stop words are words that occur frequently but carry no specific significance. Sentence tokenization splits the data into sentences and word tokenization splits it into words; stop-word removal then removes all stop words from the word-tokenized data. We collected stop words for the English language from the Natural Language Toolkit (NLTK) library. We used NLTK for word tokenization, and for sentence tokenization we used NLTK's RegexpTokenizer with our own regular expression.
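To make this step concrete, the following is a minimal sketch of the preprocessing in Python. The NLTK interfaces (word_tokenize, the stopwords corpus, RegexpTokenizer) are standard library calls; the sentence-splitting regular expression and the preprocess helper shown here are illustrative assumptions, not the exact expression used in our runs.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, word_tokenize

# One-time downloads of the required NLTK resources.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
# Illustrative pattern: a sentence is a run of characters ending in '.', '!' or '?'.
SENT_TOKENIZER = RegexpTokenizer(r"[^.!?]+[.!?]?")


def preprocess(article: str):
    """Return (sentences, cleaned word lists) for one news article."""
    sentences = [s.strip() for s in SENT_TOKENIZER.tokenize(article) if s.strip()]
    cleaned = []
    for sentence in sentences:
        # Keep alphanumeric tokens only (drops punctuation), lower-cased.
        words = [w.lower() for w in word_tokenize(sentence) if w.isalnum()]
        cleaned.append([w for w in words if w not in STOP_WORDS])
    return sentences, cleaned
```

The cleaned word lists feed the feature computation described next, while the raw sentences are kept so that the selected ones can be copied verbatim into the summary.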
3.2. Text feature analysis

We use several text features to score the sentences of each news article. Let $s_{i,j}$ denote the $j^{th}$ sentence of the $i^{th}$ news article and $t_x$ the $x^{th}$ text feature. Then $S_{t_x}(s_{i,j})$ denotes the score of the $x^{th}$ text feature for sentence $s_{i,j}$. The text features used in our approach are as follows [10].

3.2.1. Sentence Position ($t_1$):
In an article, the sentences at the beginning and the end are more relevant than those in the middle [11], since the primary cause and the crucial details usually appear at the start and the conclusions at the end. We therefore score a sentence by its distance from the middle of the article:

$S_{t_1}(s_{i,j}) = \dfrac{\left| \frac{L}{2} - j \right|}{\frac{L}{2}}$    (1)

where $L$ is the number of sentences in the $i^{th}$ article.

3.2.2. Sentence Length ($t_2$):
Sentences with more words tend to be more informative than sentences with fewer words, so shorter sentences receive less weight. The score of each sentence is computed from its word count relative to the average and the maximum sentence length of the article:

$S_{t_2}(s_{i,j}) = 1 - \dfrac{\left| AL(s_i) - |s_{i,j}| \right|}{\max_{j}(|s_{i,j}|)}$    (2)

where $AL(s_i)$ is the average sentence length of the $i^{th}$ article and $|s_{i,j}|$ is the number of words in sentence $s_{i,j}$.

3.2.3. Sentence Similarity ($t_4$):
Sentences that are similar to other sentences and share more common information are considered more important. For a pair of sentences, the similarity is the ratio of the number of words they have in common to the length of the compared sentence [12]. The feature score of a sentence is the average of its similarities to all other sentences of the article:

$S_{t_4}(s_{i,j}) = \dfrac{\sum_{j'=1,\, j' \neq j}^{|s_i|} sim(s_{i,j}, s_{i,j'})}{|s_i|}$    (3)

3.2.4. Sentences with Frequent Words ($t_5$):
Words that occur more frequently in the text are more important, and sentences containing these frequent words are therefore given more weight than other sentences [7]. In our approach, the top 20% of words by frequency are taken as the frequent words ($fw$):

$S_{t_5}(s_{i,j}) = \dfrac{\sum_{fw \in s_{i,j}} count(fw)}{|fw|}$    (4)

3.2.5. Sentences with Numerical Data ($t_6$):
Sentences containing numerical data ($nd$) are regarded as more informative, as numbers convey more analytical information:

$S_{t_6}(s_{i,j}) = \dfrac{\sum_{nd \in s_{i,j}} count(nd)}{|nd|}$    (5)
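As a rough illustration of how these feature scores can be computed together, the sketch below follows Equations (1)-(5) under simplifying assumptions: the helper name feature_scores, the word-overlap similarity, and the normalizations used for the frequent-word and numerical-data scores are illustrative choices, not the exact implementation.

```python
from collections import Counter
import re


def feature_scores(sentences, words_per_sentence):
    """Per-sentence scores [t1, t2, t4, t5, t6] for one article (Eqs. 1-5)."""
    L = len(sentences)
    lengths = [len(ws) for ws in words_per_sentence]
    avg_len = sum(lengths) / L
    max_len = max(lengths) or 1

    # Article-level word frequencies; the top 20% of distinct words are the "frequent words".
    freq = Counter(w for ws in words_per_sentence for w in ws)
    n_frequent = max(1, int(0.2 * len(freq)))
    frequent_words = {w for w, _ in freq.most_common(n_frequent)}

    # Total count of numerical tokens in the article, used here to normalise t6.
    total_numbers = max(1, sum(len(re.findall(r"\d+", s)) for s in sentences))

    scores = []
    for j, words in enumerate(words_per_sentence, start=1):
        word_set = set(words)
        # t1 (Eq. 1): distance of the sentence position from the middle of the article.
        t1 = abs(L / 2 - j) / (L / 2)
        # t2 (Eq. 2): deviation of the sentence length from the average length.
        t2 = 1 - abs(avg_len - len(words)) / max_len
        # t4 (Eq. 3): average word-overlap similarity with every other sentence.
        sims = [len(word_set & set(other)) / max(len(other), 1)
                for k, other in enumerate(words_per_sentence, start=1) if k != j]
        t4 = sum(sims) / L
        # t5 (Eq. 4): counts of the frequent words present in the sentence,
        # normalised by the number of frequent words.
        t5 = sum(freq[w] for w in word_set & frequent_words) / len(frequent_words)
        # t6 (Eq. 5): numerical tokens in the sentence, normalised by the article total.
        t6 = len(re.findall(r"\d+", sentences[j - 1])) / total_numbers
        scores.append([t1, t2, t4, t5, t6])
    return scores
```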
3.3. Optimal weights assignment to text features

After computing the score of each sentence for every text feature, a population-based meta-heuristic approach, the Genetic Algorithm (GA), assigns the weights to these text features.

3.3.1. Overview of Genetic Algorithm

The Genetic Algorithm is a model of evolution through natural selection, based on Charles Darwin's theory, in which evolution is simulated on a computer [13]. It was proposed by John Holland and his collaborators in the 1960s and 1970s. Genetic Algorithms are adaptive and can be used to solve both constrained and unconstrained search and optimization problems [14]. The algorithm is based on the genetic structure and behaviour of the chromosomes in a population, and it is helpful when working with large and complex datasets. Genetic Algorithms are built on the following principles:

• Each chromosome represents a potential solution; the population is therefore a collection of chromosomes.
• Each chromosome in the population is assigned a fitness value; the higher the fitness, the better the solution.
• The best chromosomes in the current population are used to reproduce the offspring of the following generation.
• Because of crossover, an offspring inherits characteristics from both parents.
• A minor change in the structure of a chromosome is called a mutation.

The main advantages of GA over general optimization algorithms are that it supports multi-objective optimization, it works well in noisy environments, and it can deal with complex problems and exploit parallelism. The basic steps of GA are initialization, fitness evaluation, selection, crossover, mutation, and termination; the fitness function varies with the problem. The steps to obtain the optimized weights of the text features are as follows (we treat our method as a maximization problem).

3.3.2. Initialization of population

First, we generate random weights in (0, 1) as the initial weights of the text features. These are generated in the form of solutions $A = \{a_1, a_2, a_3, \ldots, a_n\}$ [7], where $n$ is the number of solutions in each generation. Each solution $a_q$ can be further represented as $a_q = \{a_{q,1}, a_{q,2}, \ldots, a_{q,N}\}$ [7], where $N$ is the number of dimensions of a solution. An element of a solution is written $a_{q,x}$, where $q = \{1, 2, \ldots, n\}$ and $x = \{1, 2, \ldots, N\}$ [7]. We used the Python library numpy.random to generate this initial population.

Figure 2: Flowchart for assigning optimal weights to text features

3.3.3. Fitness evaluation

Next, we evaluate the fitness of each solution. To calculate the fitness of a solution, we rank the sentences of each article using the elements of the solution as the weights of the corresponding text features [7], select the top-ranked sentences as the generated summary, and compare the generated summary with the reference summary of each article using a fitness function; the sum of the results over all articles is the fitness value of the solution. We also speed up fitness evaluation by passing the fitness values of the previous generation to the fitness function, so that solutions carried over unchanged from the previous generation are not re-evaluated.

Sentence Ranking: For scoring the sentences of the $i^{th}$ article with a solution $a_q$, given the feature scores $S_{t_x}$, we use the following equation:

$score(s_{i,j}) = \sum_{x=1}^{N} a_{q,x} \times S_{t_x}(s_{i,j})$    (6)

where $a_{q,x}$ is the $x^{th}$ dimension of the $q^{th}$ solution and $S_{t_x}(s_{i,j})$ is the $x^{th}$ feature score of the $j^{th}$ sentence of the $i^{th}$ article. After ranking the sentences, the top twenty percent of sentences in each article are selected as the generated summary. This is done for every article and for every solution, and the extracted sentences are then compared with the reference summaries using the fitness function.

Fitness Function: We use the ROUGE metrics ROUGE-1 and ROUGE-2 to calculate the similarity between the generated summary and the reference summary. The fitness of a solution is the sum, over all articles, of the average of these two scores:

$fitness = \sum_{i=1}^{e} \dfrac{ROUGE1_i['f'] + ROUGE2_i['f']}{2}$    (7)

where $e$ is the total number of articles and $ROUGE1_i['f']$ is the F1-score of ROUGE-1 for the $i^{th}$ article.
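A minimal sketch of this fitness evaluation for one candidate solution is given below. It assumes the per-sentence feature scores from the earlier sketch and the third-party rouge package for ROUGE-1/ROUGE-2 F1 (the exact ROUGE implementation used in our runs is not specified here); the fitness helper and its argument layout are illustrative.

```python
from rouge import Rouge  # pip install rouge

rouge = Rouge()


def fitness(solution, articles, reference_summaries, top_ratio=0.2):
    """Fitness of one candidate weight vector (Eq. 7).

    solution: list of N feature weights a_{q,x}.
    articles: list of (sentences, per-sentence feature score lists), one per article.
    reference_summaries: list of reference summary strings, one per article.
    """
    total = 0.0
    for (sentences, scores), reference in zip(articles, reference_summaries):
        # Eq. 6: weighted sum of feature scores for every sentence.
        ranks = [sum(w * f for w, f in zip(solution, feats)) for feats in scores]
        # Keep the top 20% ranked sentences, restored to their original order.
        k = max(1, int(top_ratio * len(sentences)))
        top = sorted(sorted(range(len(sentences)),
                            key=lambda j: ranks[j], reverse=True)[:k])
        generated = " ".join(sentences[j] for j in top)
        # Average of ROUGE-1 and ROUGE-2 F1 between generated and reference summary.
        r = rouge.get_scores(generated, reference)[0]
        total += (r["rouge-1"]["f"] + r["rouge-2"]["f"]) / 2
    return total
```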
3.3.4. Selection

After calculating the fitness value of each solution, we select the parents from the set of solutions to produce offspring. Half of the solutions are selected for the mating pool using elitist-based roulette wheel selection: first, the top 25 percent of the solutions are selected as parents based on their fitness values; then, another 25 percent are selected from the remaining 75 percent using roulette wheel selection.

3.3.5. Crossover

Crossover, also known as recombination, combines the information of two solutions to produce a new offspring. Many types of crossover are available; we use simple single-point crossover. Two parents are picked from the mating pool and a crossover point is selected. A new offspring is produced by taking the data before the crossover point from the first parent and combining it with the data after the crossover point from the second parent [15].

Figure 3: Crossover

3.3.6. Mutation

Mutation is performed on the newly created offspring to introduce small changes, bringing diversity from one generation of solutions to the next. We apply mutation at two points in each offspring: a point is selected in the first half of the offspring and a random value in (-0.25, 0.25) is added to it, and the same is done at a point selected in the second half. After mutation, the new offspring and the parents in the mating pool are combined to form the population of the next generation.

Figure 4: Mutation

3.3.7. Termination

Once the predefined number of iterations is completed, the best solution, i.e., the one with the highest fitness value, is taken as the final weights of the text features.
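Putting Sections 3.3.2-3.3.7 together, a compact sketch of one GA generation is shown below, assuming the population is held in a NumPy array; the function name, the random pairing of parents for crossover, and the use of sampling without replacement in the roulette wheel are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def next_generation(population, fitness_values, rng=rng):
    """Produce the next generation from the current one (Sections 3.3.4-3.3.6)."""
    n, N = population.shape
    order = np.argsort(fitness_values)[::-1]          # best solutions first

    # Selection: elitist top 25%, plus 25% drawn by roulette wheel from the rest.
    elite = list(order[: n // 4])
    rest = order[n // 4:]
    probs = fitness_values[rest] / fitness_values[rest].sum()
    wheel = list(rng.choice(rest, size=n // 4, replace=False, p=probs))
    mating_pool = population[elite + wheel]

    # Crossover: single-point crossover between random pairs of parents.
    offspring = []
    while len(offspring) < n - len(mating_pool):
        p1, p2 = mating_pool[rng.choice(len(mating_pool), size=2, replace=False)]
        point = rng.integers(1, N)
        offspring.append(np.concatenate([p1[:point], p2[point:]]))
    offspring = np.array(offspring)

    # Mutation: add a random value in (-0.25, 0.25) at one point in each half.
    for child in offspring:
        child[rng.integers(0, N // 2)] += rng.uniform(-0.25, 0.25)
        child[rng.integers(N // 2, N)] += rng.uniform(-0.25, 0.25)

    # Parents in the mating pool and the mutated offspring form the next generation.
    return np.vstack([mating_pool, offspring])
```

With the settings reported in Section 4 (a population of 20 five-dimensional solutions run for 200 generations), such a step would be called once per generation while keeping track of the fittest solution seen so far.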
3.4. Summary generation

Once the optimal weights of the text features have been obtained with the genetic algorithm, we generate the final summary. The sentences of each article are scored with the optimized feature weights: if the GA returns the optimal weights at solution $q = O$, then the score of $s_{i,j}$ is

$score(s_{i,j}) = \sum_{x=1}^{N} a_{O,x} \times S_{t_x}(s_{i,j})$    (8)

After ranking the sentences, the top twenty percent of sentences in each article are extracted as our summary.

4. Results and Analysis

For the English dataset, we calculate the values of all the text features for each sentence of each article. We then initialize a random population of 20 individuals as the weights of the text features and run the GA for 200 generations. The optimized weights obtained using GA for the text features described in Section 3.2 are 0.86276778, 0.514368, 0.3003737, 0.49962974 and 0.01303291, respectively, as shown in Figure 5.

Figure 5: Optimal weights of text features using GA

The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scoring algorithm calculates the similarity between a text and a collection of reference texts and thereby measures the quality of the generated text; the ROUGE score is the official metric of the ILSUM track. Using the optimized weights, we calculated the weighted average of the text feature scores of each sentence and took the top twenty percent of ranked sentences in each article as the generated summary. The F-measure of ROUGE-1, ROUGE-2 and ROUGE-4 for the generated summaries, together with our rank on the English test dataset, is shown in Table 1. The detailed evaluation results on the English test dataset, containing F1-score, precision and recall for ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-4, are shown in Table 2.

Table 1
Evaluation results on test data and rank list for English Text Summarization

Team Name    ROUGE-1    ROUGE-2    ROUGE-4    Rank
IIIT-H       0.5583     0.4458     0.4180     1/10
SUMIL22      0.3844     0.2584     0.2190     6/10

Table 2
Detailed evaluation results of test data for English Text Summarization using our approach

Evaluation Metric    F1-Score     Precision    Recall
ROUGE-1              0.3843673    0.3317808    0.5327141
ROUGE-2              0.2584236    0.2238869    0.359915
ROUGE-3              0.2313431    0.1997904    0.3252433
ROUGE-4              0.2190241    0.1887224    0.3111874

From the obtained evaluation results, it can be seen that our technique does not reach the expected performance of at least 50% F1-score. This is mainly because we always produce 20% of the sentences of an article as the summary, whereas the reference summary can consist of any number of sentences, and because the regular expression we used for sentence tokenization produced shorter sentences than expected.

5. Conclusion

This work reports our participation in the shared task on Automatic Text Summarization for English Language Text at FIRE 2022. We experimented with several types of algorithms for the shared task, including abstractive text summarization models such as BERT and BART. For extractive text summarization, we performed sentence ranking using text features. The text features are weighted using a genetic algorithm, which gave us the better results. However, the scores can be improved further by slightly tweaking the genetic algorithm, training for more generations, and adding more text features. In future, this approach can be used to summarize documents written in other Indian languages such as Hindi, Bengali, Tamil and Telugu.

Acknowledgement

This work is supported by TIH, Indian Institute of Technology, Patna, and is in line with its theme.

References

[1] S. Satapara, B. Modha, S. Modha, P. Mehta, Findings of the first shared task on indian language summarization (ilsum): Approaches, challenges and the path ahead, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, Kolkata, India, December 9-13, 2022, CEUR Workshop Proceedings, CEUR-WS.org, 2022.
[2] S. Satapara, B. Modha, S. Modha, P. Mehta, Fire 2022 ilsum track: Indian language summarization, in: Proceedings of the 14th Forum for Information Retrieval Evaluation, ACM, 2022.
[3] P. Verma, S. Pal, H. Om, A comparative analysis on hindi and english extractive text summarization, ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 18 (2019) 1–39.
[4] R. Mihalcea, P. Tarau, Textrank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004, pp. 404–411.
[5] G. Erkan, D. R. Radev, LexRank: Graph-based lexical centrality as salience in text summarization, Journal of Artificial Intelligence Research 22 (2004) 457–479. URL: https://doi.org/10.1613%2Fjair.1523. doi:10.1613/jair.1523.
[6] K. K. Mamidala, et al., A heuristic approach for telugu text summarization with improved sentence ranking, Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12 (2021) 4238–4243.
[7] P. Verma, H. Om, Mcrmr: Maximum coverage and relevancy with minimal redundancy based multi-document summarization, Expert Systems with Applications 120 (2019) 43–56.
[8] P. Verma, A. Verma, S. Pal, An approach for extractive text summarization using fuzzy evolutionary and clustering algorithms, Applied Soft Computing 120 (2022) 108670.
[9] P. Verma, A. Verma, S. Pal, A fusion of variants of sentence scoring methods and collaborative word rankings for document summarization, Expert Systems (2022) e12960.
[10] P. Verma, H. Om, A novel approach for text summarization using optimal combination of sentence scoring methods, Sādhanā 44 (2019) 1–15.
[11] M. A. Fattah, F. Ren, Automatic text summarization, World Academy of Science, Engineering and Technology 37 (2008) 192.
[12] P. Achananuparp, X. Hu, X. Shen, The evaluation of sentence similarity measures, in: International Conference on Data Warehousing and Knowledge Discovery, Springer, 2008, pp. 305–316.
[13] S. Forrest, Genetic algorithms: principles of natural selection applied to computation, Science 261 (1993) 872–878.
[14] D. Beasley, D. R. Bull, R. R. Martin, An overview of genetic algorithms: Part 1, fundamentals, University Computing 15 (1993) 56–69.
[15] A. J. Umbarkar, P. D. Sheth, Crossover operators in genetic algorithms: a review, ICTACT Journal on Soft Computing 6 (2015).