<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.eswa.2019.01.037</article-id>
      <title-group>
        <article-title>Extractive Text Summarization Algorithm for English Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Abinaya N</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Anbukkarasi S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Varadhaganapathy S</string-name>
          <email>varadhaganapathy@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kongu Engineering College</institution>
          ,
          <addr-line>Erode, Tamilnadu</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>123</volume>
      <issue>9</issue>
      <fpage>4158</fpage>
      <lpage>4169</lpage>
      <abstract>
<p>While an abstractive summary creates new phrases, an extractive summary entails locating highly ranked sentences in the given text. Various techniques, including sentence ranking, graph-based modeling, RBF models, and sentence similarity measures, can be used for extractive summarization. This paper presents extractive text summarization for the code-mixed English text provided by the ILSUM track of FIRE 2022. In this work, the Word Frequency Algorithm is used for summarization, and the ILSUM team measured the performance of the system with standard ROUGE metrics.</p>
      </abstract>
      <kwd-group>
<kwd>Automatic Text Summarization (ATS)</kwd>
        <kwd>Natural Language Processing (NLP)</kwd>
<kwd>Word Frequency Algorithm</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this modern era, a huge volume of text data is available on the internet in the form of
documents, e-books, news, movie reviews, articles, etc. People find it very difficult to obtain the significant
information from such lengthy texts, so a mechanism is needed to identify the key information in a
text quickly and effectively, reducing reading time. A fundamental problem in this digital
world is how quickly information can be condensed and located in a text. Automatic Text
Summarization (ATS) helps to overcome this problem effectively [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Various approaches have been
developed to generate two different kinds of summaries, namely extractive and abstractive. The former is
built from sentences of the original article, whereas the latter generates its own text that conveys
the information of the original document. Moreover, applications such as search engines and news sites need
summarizers, as search engines try to provide snippets and news websites generate headlines
based on the content [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Summarizers are also needed in many other areas, such as libraries, to summarize the
content of magazines, e-books, journals, etc.
      </p>
      <p>
        Various machine learning algorithms, both supervised and unsupervised, are used
to generate a good summary from a given text. The issues that arise during summarization
include redundancy, ambiguity, keyword identification, and similarity. Approaches such as word
frequency, sentence scoring, and sentence ranking are not very challenging for a summarizer because of
their statistical nature. The biggest challenge a summarizer faces is identifying new features
that help generate the summary while retaining the semantics of the content [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Statistical
approaches tend to provide a good summary compared to the other approaches implemented.
      </p>
      <p>The goal of this work is to summarize English news articles with a simple statistical approach
called the Word Frequency algorithm and to measure the performance of the system with ROUGE metrics.
Figure 1 provides a sample text from the training dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Summarization based on hypergraph transversals was done in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The sentences of the corpus are
treated as nodes, and groups of sentences sharing the same theme are mapped to hyperedges.
The approach concentrates on achieving a summary with minimal length and maximal content
coverage without exceeding a target length, and the model outperforms other approaches by 6% in ROUGE-SU4 score. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
clusters sentences based on semantic and lexical features, using Doc2Vec and LDA
to obtain the semantic features; it performs well on the CNN/Daily Mail dataset,
with a ROUGE-1 score of 41.4.
      </p>
      <p>
        An unsupervised approach has been developed that combines clustering with topic modeling:
Latent Dirichlet Allocation is used for topic modeling, while K-Medoids clustering is used for summary
generation. The system was evaluated on three different datasets: the DUC2002 corpus,
CNN/DailyMail, and WikiHow [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>[6] integrates word embeddings into a deep neural network to enhance the quality of the generated
summary. Ensemble techniques are implemented in three ways: BOW and Word2vec with
majority voting, BOW combined with unsupervised neural networks, and Word2vec combined with
unsupervised neural networks. Summarization has also been formulated as a binary optimization problem
in which the quality of a summary is based on sentence length, sentence position, and relevance to the title;
genetic operators and guided local search improve the quality of the summary over
other optimization techniques [7].</p>
      <p>A model based on rank fusion has been implemented with four multidimensional sentence features:
topic information, significant keywords, semantic content, and position of the sentence [8]. It
follows an unsupervised model for generating scores, and the weights are learned from labeled
documents. [9] proposed summarization based on combining a fuzzy inference system with
evolutionary and clustering algorithms; the summaries generated by this system were analyzed by
experts to assess its performance.</p>
      <p>Summarization is most often done using sentence ranking. Each sentence in the text is assigned
a weight, and the sentences are ranked by these weights. The highest-ranked sentences are used to
produce a good summary [10]. Similarly, summarization has been performed with various
approaches considering features at the word level and the sentence level. Word-level features include content,
cue phrases, word case, bias words, and title words; sentence-level features include
location, length, paragraph location, and cohesion with other sentences [11].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology 3.1.</title>
    </sec>
    <sec id="sec-4">
      <title>Preprocessing</title>
      <p>The methodology used in this work is the Word Frequency algorithm, implemented using the
Natural Language ToolKit (NLTK) library. Figure 2 shows the process of text summarization
implemented in this paper. The steps involved in the proposed work are given below.</p>
      <p>The entire dataset provided by the ILSUM track of FIRE 2022 is imported into a Python
dataframe, and the text from the dataframe is processed for summarization. Symbols and special
characters in the text are removed during preprocessing. This step also involves
removing stopwords from the content; the stopword list included in the NLTK library is used to
eliminate them, as sketched below.</p>
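      <p>As an illustration, a minimal preprocessing sketch in Python with NLTK is given below; the helper name, the exact cleaning rule, and the tokenizer choice are our own assumptions rather than details prescribed by the track.</p>
      <preformat>
# Minimal preprocessing sketch (assumed details): strip symbols/special
# characters and remove NLTK stopwords before scoring.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)   # stopword list used for filtering
nltk.download("punkt", quiet=True)       # tokenizer models

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    """Return the content words of `text` with symbols and stopwords removed."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)      # drop special characters
    words = nltk.word_tokenize(cleaned.lower())      # tokenize into words
    return [w for w in words if w not in STOPWORDS]  # drop stopwords
      </preformat>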
    </sec>
    <sec id="sec-5">
      <title>Sentence Score</title>
      <p>The preprocessed sentences are tokenized to get the list of entire words used in the article. The
weighted frequency for each word is calculated based on their occurrence. Equation (1) helps in
calculating the weighted frequency for words that are tokenized.</p>
      <p>WF = Freq word/ Freqmost occurred word
3.3.</p>
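      <p>A minimal Python sketch of Equations (1) and (2) is given below; the function names and the use of a plain frequency dictionary are our own assumptions, not code from the paper.</p>
      <preformat>
# Sketch of Equations (1) and (2) (assumed implementation details).
from collections import Counter
import nltk

def weighted_frequencies(content_words):
    """Equation (1): WF = Freq_word / Freq_most_occurred_word."""
    freq = Counter(content_words)
    max_freq = max(freq.values())
    return {word: count / max_freq for word, count in freq.items()}

def sentence_scores(text, wf):
    """Equation (2): a sentence's score is the sum of the WF of its words."""
    scores = {}
    for sentence in nltk.sent_tokenize(text):
        words = nltk.word_tokenize(sentence.lower())
        scores[sentence] = sum(wf.get(w, 0.0) for w in words)
    return scores
      </preformat>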
    </sec>
    <sec id="sec-6">
      <title>Generating the Summary</title>
      <p>The average of all computed sentence scores is determined, and this average is used as a threshold
value. Equation (3) gives the average of sentence scores. If the sentence score is more than the
average score, it will be retained for the summary. This methodology is an extractive summarization
technique which tries to retain the sentences of the text which has highest score and include the
original sentences from the test into the summary. A threshold value can be modified to get different
summaries. The sentences score that is above the threshold will be hold-on to generate summary.

=

∑</p>
      <p>where
where</p>
      <p>WF refers Weighted Frequency
Freqwordrefers the frequency of the current word for which WF is calculated</p>
      <p>Freqmost occurred word refers the frequency of the word that is most occurred in the text
Each sentence score is calculated based on replacing the words with their weighted frequency and
summing up all the WF for each sentence. Sentence Score for each sentence is calculated based on
Equation (2).</p>
      <p>Sentence Score = ∑1 
n refers the number of words in a sentence
WF refers Weighted Frequency
(1)
(2)
(3)</p>
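      <p>A minimal sketch of this thresholding step follows; the factor parameter is our own illustrative addition, reflecting the note above that the threshold can be modified to obtain different summaries.</p>
      <preformat>
# Sketch of Equation (3) and summary generation (assumed details).
def generate_summary(scores, factor=1.0):
    """Keep sentences scoring above `factor` times the average score.

    factor=1.0 uses the plain average as the threshold; raising or
    lowering it yields shorter or longer summaries.
    """
    threshold = sum(scores.values()) / len(scores)  # Equation (3)
    # dicts preserve insertion order, so sentences stay in document order
    kept = [s for s, score in scores.items() if score > threshold * factor]
    return " ".join(kept)
      </preformat>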
    </sec>
    <sec id="sec-7">
      <title>4. Results</title>
      <p>The size of the dataset used in this work is shown in Table 1. The performance of the system was
evaluated by the organizing team using ROUGE metrics; ROUGE-1, ROUGE-2, and ROUGE-4 are
used to measure summary quality. Table 2 provides the precision, recall, and
F1-score of our system, and Figure 3 gives a graphical representation of the results achieved.</p>
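      <p>For reference, ROUGE-1, ROUGE-2, and ROUGE-4 can be reproduced locally with the open-source rouge-score package, as sketched below; this is our own illustration with placeholder strings, not the track's official evaluation script.</p>
      <preformat>
# Hedged sketch: ROUGE scoring with the `rouge-score` package
# (pip install rouge-score); the official ILSUM scorer may differ.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rouge4"],
                                  use_stemmer=True)
result = scorer.score("reference summary text here",    # gold summary
                      "system generated summary here")  # our output
for metric, s in result.items():
    print(metric, round(s.precision, 3), round(s.recall, 3),
          round(s.fmeasure, 3))
      </preformat>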
    </sec>
    <sec id="sec-8">
      <title>5. Conclusion</title>
      <p>The proposed system summarizes the given information while retaining the vital
information of the original text. In the proposed work, the Word Frequency Algorithm is used to obtain the
summary of a text by computing the weighted frequency of each word in the content. With
the help of these weighted frequencies, each sentence is assigned a score, and a threshold value is
computed; by changing the threshold on the sentence scores, different summaries can be obtained.
The results show that the proposed methodology provides an acceptable summary, and it can
be further improved by including lexical information from the given text.</p>
    </sec>
    <sec id="sec-9">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Mengli Zhang, Gang Zhou, Wanting Yu, Ningbo Huang, and Wenfen Liu, “
          <article-title>A Comprehensive Survey of Abstractive Text Summarization Based on Deep Learning</article-title>
          ”,
          <source>Computational Intelligence and Neuroscience</source>
          , (
          <year>2022</year>
          ), doi:10.1155/2020/9365340.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Ángel Hernández-Castañeda, René Arnulfo García-Hernández, Yulia Ledeneva, and Christian Eduardo Millán-Hernández, “
          <article-title>Language-independent extractive automatic text summarization based on automatic keyword extraction</article-title>
          ”,
          <source>Computer Speech &amp; Language</source>
          , (
          <year>2022</year>
          ), doi:10.1016/j.csl.2021.101267.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Adhika Pramita Widyassari, Supriadi Rustad, Guruh Fajar Shidik, Edi Noersasongko, Abdul Syukur, Affandy Affandy, and De Rosal Ignatius Moses Setiadi, “
          <article-title>Review of automatic text summarization techniques &amp; methods</article-title>
          ”,
          <source>Journal of King Saud University - Computer and Information Sciences</source>
          , (
          <year>2022</year>
          ), Volume 34, Issue 4, 1029-1046, doi:10.1016/j.jksuci.2020.05.006.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] H. Van Lierde and Tommy W.S. Chow, “
          <article-title>Query-oriented text summarization based on hypergraph transversals</article-title>
          ”,
          <source>Information Processing &amp; Management</source>
          , (
          <year>2019</year>
          ), Volume 56, Issue 4, 1317-1338, doi:10.1016/j.ipm.2019.03.003.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Ridam Srivastava, Prabhav Singh, K.P.S. Rana, and Vineet Kumar, “
          <article-title>A topic modeled unsupervised approach to single document extractive text summarization</article-title>
          ”,
          <source>Knowledge-Based Systems</source>
          , (
          <year>2022</year>
          ), Volume 246, Article 108636, doi:10.1016/j.knosys.2022.108636.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>