Extractive Text Summarization Using Word Frequency Algorithm for English Text

Abinaya N, Anbukkarasi S and Varadhaganapathy S
Kongu Engineering College, Erode, Tamilnadu

Abstract
Text summarization is the process of condensing an article while retaining its key information. Summarization is essential in every domain because a large volume of data is generated every day, and the sheer amount of available data makes it difficult to extract the exact information needed. Readers often lack the patience to read an entire article to understand its content. Summarization therefore plays a major role in delivering vital information quickly and effectively. It can be done in two ways: extractive summarization and abstractive summarization. Extractive summarization is simpler than abstractive summarization: an abstractive summary creates new phrases, whereas an extractive summary selects highly ranked sentences from the given text. Various techniques, including sentence ranking, graph-based modeling, RBF models, and sentence similarity measures, can be used for extractive summarization. This paper presents extractive text summarization for the code-mixed English text provided by the ILSUM track of FIRE 2022. In this work, the Word Frequency algorithm is used for summarization, and the ILSUM team measured the performance of the system with standard ROUGE metrics.

Keywords
Automatic Text Summarization (ATS), Natural Language Processing (NLP), Word Frequency Algorithm

1. Introduction
In this modern era, a huge volume of text data is available on the internet in the form of documents, e-books, news, movie reviews, articles, etc. People find it very difficult to obtain the significant information from lengthy texts. A mechanism is needed to identify the key information in a text quickly and effectively, reducing reading time. The fundamental problem in this digital world is how quickly information can be located in a text and compressed.
Automatic Text Summarization (ATS) helps to overcome this problem effectively [1]. Various approaches have been developed to generate two kinds of summary, extractive and abstractive. The former is built from the original text of the article, whereas the latter generates its own text that conveys the information of the original document. Applications such as search engines and news websites need summarizers: search engines provide snippets, and news websites generate headings based on the content [2]. Summarizers are also needed in areas such as libraries, to summarize the content of magazines, e-books, journals, etc. Various machine learning algorithms, both supervised and unsupervised, are used to generate a good summary of a given text. The issues that arise during summarization include redundancy, ambiguity, keyword identification and similarity. Approaches such as word frequency, sentence scoring and sentence ranking are not very challenging for a summarizer because they are statistical. The biggest challenge faced by a summarizer is to identify new features that help to generate the summary while retaining the semantics of the content [3]. Statistical approaches nevertheless aim to provide a good summary relative to the other approaches that have been implemented.

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
EMAIL: abi9106@gmail.com (A. 1); anbu.1318@gmail.com (A. 2); varadhaganapathy@gmail.com (A. 3)
ORCID: 0000-0002-8419-6201 (A. 1); 0000-0003-0226-8150 (A. 2)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Figure 1: Sample Text from Dataset

The goal of this work is to summarize English news articles with a simple statistical approach, the Word Frequency algorithm, and to measure the performance of the system with ROUGE metrics.
Figure 1 provides a sample text from the training dataset.

2. Related Works
Summarization based on hypergraph transversals was proposed in [4]. The sentences of the corpus are treated as nodes, and groups of sentences sharing the same theme are mapped to hyperedges. The approach aims for a summary of minimal length and maximal content coverage without exceeding the target length. This model outperforms other approaches by 6% in ROUGE-SU4 score. The approach in [2] clusters sentences based on semantic and lexical features, with Doc2Vec and LDA used to obtain the semantic features. It achieves better performance on the CNN/Daily Mail dataset, with a ROUGE-1 of 41.4. An unsupervised approach combining clustering with topic modeling was presented in [5]: topic modeling uses Latent Dirichlet Allocation, while K-Medoids clustering is used for summary generation. The system was evaluated on three datasets: the DUC2002 corpus, CNN/DailyMail and WikiHow.

The work in [6] integrates word embeddings into a deep neural network to enhance the quality of the generated summary. Ensemble techniques were implemented in three ways: BOW and Word2vec using majority voting, BOW combined with unsupervised neural networks, and Word2vec combined with unsupervised neural networks. Summarization has also been posed as a binary optimization problem in which the quality of the summary is based on sentence length, sentence position and relevance to the title. Genetic operators and guided local search are used, which improves summary quality over other optimization techniques [7]. A model based on rank fusion was implemented with four multidimensional sentence features: topic information, significant keywords, semantic content and sentence position [8]. It uses an unsupervised model to generate the scores, and the weights are learned from labeled documents.
An approach to summarization that combines a fuzzy inference system with evolutionary and clustering algorithms was proposed in [9]. The summaries generated by this system were analyzed by experts to assess its performance. Summarization is most commonly done using sentence ranking: each sentence in the text is assigned a weight, the sentences are ranked by weight, and the highest-ranked sentences are used in the summary [10]. Similarly, summarization has been performed with approaches that consider features at the word level and the sentence level. The word-level features include content, cue phrase, case of the word, bias word and title of the word. The sentence-level features include location, length, paragraph location and cohesion with other sentences [11].

3. Methodology
The methodology used in this work is the Word Frequency algorithm, implemented using the Natural Language Toolkit (NLTK) library. Figure 2 shows the text summarization process implemented in this paper. The steps involved in the proposed work are given below.

3.1. Preprocessing
The entire dataset provided by the ILSUM track of FIRE 2022 was imported into a Python dataframe, and the text from the dataframe is processed for summarization. The text contains various symbols and special characters, which are removed during preprocessing. This step also removes the stopwords from the content, using the stopword list included in the NLTK library.

3.2. Sentence Score
The preprocessed sentences are tokenized to obtain the list of all words used in the article. The weighted frequency of each word is calculated from its number of occurrences. Equation (1) gives the weighted frequency of each tokenized word.
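As a rough illustration of the preprocessing and word-weighting steps described above, the following Python sketch cleans a text, drops stopwords and computes each word's weighted frequency as its count divided by the count of the most frequent word. It is not the authors' implementation: the small STOPWORDS set is a stand-in for the NLTK stopword list, and the lowercasing, the letters-only regular expression, the function names preprocess and weighted_frequencies, and the sample article are all assumptions made for the example.

```python
import re
from collections import Counter

# Stand-in stopword list (assumption). The paper uses the NLTK list, which
# needs a separate data download, so a tiny hand-written set is used here.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase the text, keep only alphabetic tokens, drop stopwords."""
    # Lowercasing and the letters-only pattern are assumptions; the paper
    # only states that symbols and special characters are removed.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def weighted_frequencies(words):
    """Divide each word count by the count of the most frequent word."""
    counts = Counter(words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


# Hypothetical sample text, not taken from the ILSUM dataset.
article = "Floods hit the city. The city council met; floods are the priority!"
wf = weighted_frequencies(preprocess(article))

# "floods" and "city" occur twice (the maximum), so their weight is 1.0;
# every other remaining word occurs once and gets 0.5.
print(wf)
```

With NLTK installed and its stopword data downloaded, the stand-in set could be replaced by set(nltk.corpus.stopwords.words("english")).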
\[ WF = \frac{\mathit{Freq}_{\text{word}}}{\mathit{Freq}_{\text{most occurred word}}} \tag{1} \]

where
WF refers to the weighted frequency,
Freq_{word} refers to the frequency of the current word for which WF is calculated, and
Freq_{most occurred word} refers to the frequency of the word that occurs most often in the text.

The score of each sentence is calculated by replacing its words with their weighted frequencies and summing the WF values over the sentence, as given in Equation (2).

\[ \text{Sentence Score} = \sum_{i=1}^{n} WF_i \tag{2} \]

where
n refers to the number of words in the sentence, and
WF_i refers to the weighted frequency of the i-th word in the sentence.

3.3. Generating the Summary
The average of all computed sentence scores is determined and used as the threshold value; Equation (3) gives this average.

\[ \text{Average} = \frac{\sum \text{Sentence Scores}}{\text{Total no. of Sentences}} \tag{3} \]

If a sentence's score is higher than the average score, the sentence is retained for the summary. This is an extractive summarization technique: it keeps the highest-scoring sentences and includes them in the summary exactly as they appear in the original text. The threshold value can be modified to obtain different summaries; only sentences whose score is above the threshold are retained.

Figure 2: Proposed methodology of text summarization

4. Results
The size of the dataset used in this work is shown in Table 1. The performance of the system was evaluated by the organizing team using ROUGE metrics. ROUGE-1, ROUGE-2 and ROUGE-4 are used to measure summary quality. Table 2 gives the precision, recall and F1-score of our system. Figure 3 provides a graphical representation of the results achieved.

Table 1
Size of the Dataset
Dataset       No. of Articles
Train         12565
Validation    899
Test          4487

Table 2
Performance Measure of Test Data
ROUGE      F1-Score    Precision   Recall
ROUGE-1    0.34013     0.272376    0.5193
ROUGE-2    0.208011    0.164724    0.323263
ROUGE-4    0.170998    0.133656    0.272965

Figure 3: Performance measure for ROUGE metrics (bar chart of F1-Score, Precision and Recall for ROUGE-1, ROUGE-2 and ROUGE-4; vertical axis from 0 to 0.6)

5. Conclusion
The proposed system summarizes the given information while retaining the vital information of the original text. In the proposed work, the Word Frequency algorithm is used to obtain the summary of a text by computing the weighted frequency of each word used in the content. Using the weighted frequencies, each sentence is assigned a score and a threshold value is computed. By changing the threshold value of the sentence score, different summaries can be obtained. The results show that the proposed methodology provides an acceptable summary, and it can be further improved by including lexicon information of the given text.

6. References
[1] Mengli Zhang, Gang Zhou, Wanting Yu, Ningbo Huang, and Wenfen Liu, "A Comprehensive Survey of Abstractive Text Summarization Based on Deep Learning", Computational Intelligence and Neuroscience, (2022), doi: 10.1155/2020/9365340.
[2] Ángel Hernández-Castañeda, René Arnulfo García-Hernández, Yulia Ledeneva, Christian Eduardo Millán-Hernández, "Language-independent extractive automatic text summarization based on automatic keyword extraction", Computer Speech & Language, (2022), doi: 10.1016/j.csl.2021.101267.
[3] Adhika Pramita Widyassari, Supriadi Rustad, Guruh Fajar Shidik, Edi Noersasongko, Abdul Syukur, Affandy Affandy, De Rosal Ignatius Moses Setiadi, "Review of automatic text summarization techniques & methods", Journal of King Saud University - Computer and Information Sciences, (2022), Volume 34, Issue 4, 1029-1046, doi: 10.1016/j.jksuci.2020.05.006.
[4] H. Van Lierde, Tommy W.S.
Chow, "Query-oriented text summarization based on hypergraph transversals", Information Processing & Management, (2019), Volume 56, Issue 4, 1317-1338, doi: 10.1016/j.ipm.2019.03.003.
[5] Ridam Srivastava, Prabhav Singh, K.P.S. Rana, Vineet Kumar, "A topic modeled unsupervised approach to single document extractive text summarization", Knowledge-Based Systems, (2022), Volume 246, doi: 10.1016/j.knosys.2022.108636.
[6] Nabil Alami, Mohammed Meknassi, Noureddine En-nahnahi, "Enhancing unsupervised neural networks based text summarization with word embedding and ensemble learning", Expert Systems with Applications, (2019), Volume 123, 195-211, doi: 10.1016/j.eswa.2019.01.037.
[7] Martha Mendoza, Susana Bonilla, Clara Noguera, Carlos Cobos, Elizabeth León, "Extractive single-document summarization based on genetic operators and guided local search", Expert Systems with Applications, (2014), Volume 41, Issue 9, 4158-4169.
[8] Akanksha Joshi, Eduardo Fidalgo, Enrique Alegre, Rocio Alaiz-Rodriguez, "RankSum - An unsupervised extractive text summarization based on rank fusion", Expert Systems with Applications, (2022), doi: 10.1016/j.eswa.2022.116846.
[9] Pradeepika Verma, Anshul Verma, Sukomal Pal, "An approach for extractive text summarization using fuzzy evolutionary and clustering algorithms", Applied Soft Computing, (2022), doi: 10.1016/j.asoc.2022.108670.
[10] J. N. Madhuri and R. Ganesh Kumar, "Extractive Text Summarization Using Sentence Ranking", in: International Conference on Data Science and Communication, (2019), pp. 1-3, doi: 10.1109/IconDSC.2019.8817040.
[11] Abinaya N, Anand R and Arunkumar T, "An Exhaustive Survey on Automatic Text Summarization Using Machine Learning Approches", Webology, (2021), pp. 1184-1190.
[12] Akash Panchal, url: https://github.com/akashp1712/summarize-webpage
[13] S. Satapara, B. Modha, S. Modha, P. Mehta, Findings of the First Shared Task on Indian Language Summarization (ILSUM): Approaches, Challenges and the Path Ahead. In Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, Kolkata, India, December 9-13, 2022.
[14] S. Satapara, B. Modha, S. Modha, P. Mehta, FIRE 2022 ILSUM track: Indian Language Summarization. In Proc. of the 14th Forum for Information Retrieval Evaluation, Kolkata, India, December 9-13, 2022.