<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.eswa.2019.01.037</article-id>
      <title-group>
        <article-title>Extractive Text Summarization Algorithm for English Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Abinaya N</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Anbukkarasi S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Varadhaganapathy S</string-name>
          <email>varadhaganapathy@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kongu Engineering College</institution>
          ,
          <addr-line>Erode, Tamilnadu</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>123</volume>
      <issue>9</issue>
      <fpage>4158</fpage>
      <lpage>4169</lpage>
      <abstract>
<p>While an abstractive summary creates new phrases, an extractive summary entails locating highly ranked sentences in the given text. Various techniques, including sentence ranking, graph-based modeling, RBF models, and sentence similarity measures, can be used for extractive summarization. This paper presents extractive text summarization for the code-mixed English text provided by the ILSUM track of FIRE 2022. In this work, the Word Frequency Algorithm is used for summarization, and the ILSUM team measured the performance of the system with standard ROUGE metrics.</p>
      </abstract>
      <kwd-group>
<kwd>Automatic Text Summarization (ATS)</kwd>
        <kwd>Natural Language Processing (NLP)</kwd>
<kwd>Word Frequency Algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this modern era, a huge volume of text data is available on the internet in the form of
documents, e-books, news, movie reviews, articles, etc. People find it very difficult to obtain the significant
information from such lengthy texts, so a mechanism is needed to identify the key information in a
text quickly and effectively, reducing reading time. A fundamental problem in this digital
world is how quickly information can be condensed and located in a text. Automatic Text
Summarization (ATS) helps to overcome this problem effectively [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Various approaches have been
developed to generate two different kinds of summaries, namely extractive and abstractive. The former is
built from sentences of the original article, whereas the latter generates its own text that conveys
the information of the original document. Moreover, applications such as search engines and news sites need
summarizers, as search engines try to provide snippets and news websites generate headlines
based on the content [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Summarizers are also needed in many other areas, such as libraries, to summarize the
content of magazines, e-books, journals, etc.
      </p>
      <p>
        Various machine learning algorithms, both supervised and unsupervised, are used
to generate a good summary from a given text. The issues that arise during summarization
include redundancy, ambiguity, keyword identification, and similarity. Approaches such as word
frequency, sentence scoring, and sentence ranking are not very challenging for a summarizer because of
their statistical nature. The biggest challenge a summarizer faces is identifying new features
that help generate the summary while retaining the semantics of the content [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Statistical
approaches tend to provide a good summary compared to the other approaches implemented.
      </p>
      <p>The goal of this work is to summarize English news articles with a simple statistical approach
called the Word Frequency algorithm and to measure the performance of the system with ROUGE metrics.
Figure 1 provides a sample text from the training dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Summarization based on hypergraph transversals was done in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The sentences of the corpus are
treated as nodes, and groups of sentences sharing the same theme are mapped to hyperedges.
The approach concentrates on achieving a summary with minimal length and maximal content
coverage without exceeding a target length, and the model outperforms other approaches by 6% in ROUGE-SU4 score. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
clusters sentences based on semantic and lexical features, using Doc2Vec and LDA
to obtain the semantic features; it performs well on the CNN/Daily Mail dataset,
with a ROUGE-1 score of 41.4.
      </p>
      <p>
        An unsupervised approach has been developed that combines clustering with topic modeling:
Latent Dirichlet Allocation is used for topic modeling, while K-Medoids clustering is used for summary
generation. The system was evaluated on three different datasets: the DUC2002 corpus,
CNN/DailyMail, and WikiHow [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>[6] integrates word embeddings into a deep neural network to enhance the quality of the generated
summary. Ensemble techniques are implemented in three ways: BOW and Word2vec with
majority voting, BOW combined with unsupervised neural networks, and Word2vec combined with
unsupervised neural networks. Summarization has also been formulated as a binary optimization problem
in which the quality of a summary is based on sentence length, sentence position, and relevance to the title;
genetic operators and guided local search improve the quality of the summary over
other optimization techniques [7].</p>
      <p>A model based on rank fusion has been implemented with four multidimensional sentence features:
topic information, significant keywords, semantic content, and position of the sentence [8]. It
follows an unsupervised model for generating scores, and the weights are learned from labeled
documents. [9] proposed summarization based on combining a fuzzy inference system with
evolutionary and clustering algorithms; the summaries generated by this system were analyzed by
experts to assess its performance.</p>
      <p>Summarization is most often done using sentence ranking. Each sentence in the text is assigned
a weight, and the sentences are ranked by these weights. The highest-ranked sentences are used to
produce a good summary [10]. Similarly, summarization has been performed with various
approaches considering features at the word level and the sentence level. Word-level features include content,
cue phrases, word case, bias words, and title words; sentence-level features include
location, length, paragraph location, and cohesion with other sentences [11].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology 3.1.</title>
    </sec>
    <sec id="sec-4">
      <title>Preprocessing</title>
      <p>The methodology used in this work is the Word Frequency algorithm, implemented using the
Natural Language ToolKit (NLTK) library. Figure 2 shows the process of text summarization
implemented in this paper. The steps involved in the proposed work are given below.</p>
      <p>The entire dataset provided by the ILSUM track of FIRE 2022 is imported into a Python
dataframe, and the text from the dataframe is processed for summarization. Symbols and special
characters in the text are removed during preprocessing. This step also involves
removing stopwords from the content; the stopword list included in the NLTK library is used to
eliminate them, as sketched below.</p>
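      <p>As an illustration, a minimal preprocessing sketch in Python with NLTK is given below; the helper name, the exact cleaning rule, and the tokenizer choice are our own assumptions rather than details prescribed by the track.</p>
      <preformat>
# Minimal preprocessing sketch (assumed details): strip symbols/special
# characters and remove NLTK stopwords before scoring.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)   # stopword list used for filtering
nltk.download("punkt", quiet=True)       # tokenizer models

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    """Return the content words of `text` with symbols and stopwords removed."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)      # drop special characters
    words = nltk.word_tokenize(cleaned.lower())      # tokenize into words
    return [w for w in words if w not in STOPWORDS]  # drop stopwords
      </preformat>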
    </sec>
    <sec id="sec-5">
      <title>Sentence Score</title>
      <p>The preprocessed sentences are tokenized to get the list of entire words used in the article. The
weighted frequency for each word is calculated based on their occurrence. Equation (1) helps in
calculating the weighted frequency for words that are tokenized.</p>
      <p>WF = Freq word/ Freqmost occurred word
3.3.</p>
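      <p>A minimal Python sketch of Equations (1) and (2) is given below; the function names and the use of a plain frequency dictionary are our own assumptions, not code from the paper.</p>
      <preformat>
# Sketch of Equations (1) and (2) (assumed implementation details).
from collections import Counter
import nltk

def weighted_frequencies(content_words):
    """Equation (1): WF = Freq_word / Freq_most_occurred_word."""
    freq = Counter(content_words)
    max_freq = max(freq.values())
    return {word: count / max_freq for word, count in freq.items()}

def sentence_scores(text, wf):
    """Equation (2): a sentence's score is the sum of the WF of its words."""
    scores = {}
    for sentence in nltk.sent_tokenize(text):
        words = nltk.word_tokenize(sentence.lower())
        scores[sentence] = sum(wf.get(w, 0.0) for w in words)
    return scores
      </preformat>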
    </sec>
    <sec id="sec-6">
      <title>Generating the Summary</title>
      <p>The average of all computed sentence scores is determined, and this average is used as a threshold
value. Equation (3) gives the average of sentence scores. If the sentence score is more than the
average score, it will be retained for the summary. This methodology is an extractive summarization
technique which tries to retain the sentences of the text which has highest score and include the
original sentences from the test into the summary. A threshold value can be modified to get different
summaries. The sentences score that is above the threshold will be hold-on to generate summary.

=

∑</p>
      <p>where
where</p>
      <p>WF refers Weighted Frequency
Freqwordrefers the frequency of the current word for which WF is calculated</p>
      <p>Freqmost occurred word refers the frequency of the word that is most occurred in the text
Each sentence score is calculated based on replacing the words with their weighted frequency and
summing up all the WF for each sentence. Sentence Score for each sentence is calculated based on
Equation (2).</p>
      <p>Sentence Score = ∑1 
n refers the number of words in a sentence
WF refers Weighted Frequency
(1)
(2)
(3)</p>
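      <p>A minimal sketch of this thresholding step follows; the factor parameter is our own illustrative addition, reflecting the note above that the threshold can be modified to obtain different summaries.</p>
      <preformat>
# Sketch of Equation (3) and summary generation (assumed details).
def generate_summary(scores, factor=1.0):
    """Keep sentences scoring above `factor` times the average score.

    factor=1.0 uses the plain average as the threshold; raising or
    lowering it yields shorter or longer summaries.
    """
    threshold = sum(scores.values()) / len(scores)  # Equation (3)
    # dicts preserve insertion order, so sentences stay in document order
    kept = [s for s, score in scores.items() if score > threshold * factor]
    return " ".join(kept)
      </preformat>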
    </sec>
    <sec id="sec-7">
      <title>4. Results</title>
      <p>The size of the dataset used in this work is shown in Table 1. The performance of the system was
evaluated by the organizing team using ROUGE metrics; ROUGE-1, ROUGE-2, and ROUGE-4 are
used to measure summary quality. Table 2 provides the precision, recall, and
F1-score of our system, and Figure 3 gives a graphical representation of the results achieved.</p>
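      <p>For reference, ROUGE-1, ROUGE-2, and ROUGE-4 can be reproduced locally with the open-source rouge-score package, as sketched below; this is our own illustration with placeholder strings, not the track's official evaluation script.</p>
      <preformat>
# Hedged sketch: ROUGE scoring with the `rouge-score` package
# (pip install rouge-score); the official ILSUM scorer may differ.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rouge4"],
                                  use_stemmer=True)
result = scorer.score("reference summary text here",    # gold summary
                      "system generated summary here")  # our output
for metric, s in result.items():
    print(metric, round(s.precision, 3), round(s.recall, 3),
          round(s.fmeasure, 3))
      </preformat>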
    </sec>
    <sec id="sec-8">
      <title>5. Conclusion</title>
      <p>The proposed system summarizes the given information while retaining the vital
information of the original text. In the proposed work, the Word Frequency Algorithm is used to obtain the
summary of a text by computing the weighted frequency of each word in the content. With
the help of these weighted frequencies, each sentence is assigned a score, and a threshold value is
computed; by changing the threshold on the sentence scores, different summaries can be obtained.
The results show that the proposed methodology provides an acceptable summary, and it can
be further improved by including lexical information from the given text.</p>
    </sec>
    <sec id="sec-9">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Mengli Zhang, Gang Zhou, Wanting Yu, Ningbo Huang, and Wenfen Liu, “
          <article-title>A Comprehensive Survey of Abstractive Text Summarization Based on Deep Learning</article-title>
          ”,
          <source>Computational Intelligence and Neuroscience</source>
          , (
          <year>2022</year>
          ), doi:10.1155/2020/9365340.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Ángel Hernández-Castañeda, René Arnulfo García-Hernández, Yulia Ledeneva, and Christian Eduardo Millán-Hernández, “
          <article-title>Language-independent extractive automatic text summarization based on automatic keyword extraction</article-title>
          ”,
          <source>Computer Speech &amp; Language</source>
          , (
          <year>2022</year>
          ), doi:10.1016/j.csl.2021.101267.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Adhika Pramita Widyassari, Supriadi Rustad, Guruh Fajar Shidik, Edi Noersasongko, Abdul Syukur, Affandy Affandy, and De Rosal Ignatius Moses Setiadi, “
          <article-title>Review of automatic text summarization techniques &amp; methods</article-title>
          ”,
          <source>Journal of King Saud University - Computer and Information Sciences</source>
          , (
          <year>2022</year>
          ), Volume 34, Issue 4, 1029-1046, doi:10.1016/j.jksuci.2020.05.006.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] H. Van Lierde and Tommy W.S. Chow, “
          <article-title>Query-oriented text summarization based on hypergraph transversals</article-title>
          ”,
          <source>Information Processing &amp; Management</source>
          , (
          <year>2019</year>
          ), Volume 56, Issue 4, 1317-1338, doi:10.1016/j.ipm.2019.03.003.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Ridam Srivastava, Prabhav Singh, K.P.S. Rana, and Vineet Kumar, “
          <article-title>A topic modeled unsupervised approach to single document extractive text summarization</article-title>
          ”,
          <source>Knowledge-Based Systems</source>
          , (
          <year>2022</year>
          ), Volume 246, Article 108636, doi:10.1016/j.knosys.2022.108636.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>