      Extractive summarization methods – subtitles and method combinations

                                             Nikitas N. Karanikolas
                                   Technological Educational Institute of Athens
                                   Ag. Spyridonos street, Aigaleo 12243, Greece
                                                  nnk@teiath.gr
                       Abstract

In some previous work, we have presented a software tool for experimenting with well known methods for text summarization. The methods offered belong to the extractive summarization direction. These methods do not understand the meaning in order to condense the text; they simply extract the subset of the original sentences that is most promising for expressing the text meaning in short. However, in order to pay attention to the whole idea (a workbench for testing the available extractive summarization methods), we avoided concentrating on some potential improvements, or made simplifying assumptions about the existing extractive summarization methods. Here, we remove the simplifications and also examine some improvements to the existing methods, in order to achieve better summarizations.

1. Introduction

Summarization is a technology for reducing a text's length so that it can be understood easily and quickly. The reduction can be based either on shallow processing methods or on semantically oriented ones. The semantically oriented methods understand – somehow – the text and try to combine the meanings of similar sentences and generate generalizations. Shallow processing methods do not actually take into account the meaning of the text but statistically select the most promising (as being relevant) sentences for quick understanding. Such an extraction-based summary is not necessarily coherent. In some previous work, we have presented a software tool for experimenting with well known shallow processing (extraction-based) methods for text summarization. One of these methods is the Title Method proposed by Edmundson [Edm69]. In our consideration of the method we made the simplifying assumption that documents have only a title (something that is in general correct) but don't have other titles (like chapter, section and subsection titles; in the following, medially titles). Here, we are going to remove this simplification and consider how the existence of words from the medially titles in some sentence can adapt the likelihood of the sentence being relevant for expressing the meaning of the document. Moreover, we propose and consider using a non-linear function for measuring the likelihood of a sentence that contains more than one of the (front and medially) title words. Some other issues regarding the uniformity of the Title Method, and the competition as well as the combination of the Title Method with other extraction-based summarization methods, are also examined.

In the following, we first present some extraction-based summarization methods. We provide a simple, user configurable, combination schema. Next we introduce and consider using a non-linear function for measuring the likelihood of sentences having more than one of the title words. The proposed function also ensures the uniformity of the Title Method. Next we consider how the existence of words from the medially titles in some sentence can adapt the likelihood of the sentence being included in the extraction-based summary. An evaluation of the adapted Title Method is conducted. Conclusions and future work form the last section.

2. Extraction-based summarization methods

The extraction-based summarization methods follow the idea that some sentences are more important than others for expressing the meaning of the document. Consequently, the summarization can be based on some weighting function that assigns weights to sentences and extracts the sentences having the greatest weights. We can mention three main sentence weighting ideas: based on term importance, based on sentence location, and based on the inclusion of title terms.

Sentence weighting based on term importance has to combine two factors: the importance of a term inside a document, and the ability of the term to discriminate among documents in the collection. There are three schemas that combine these two factors: sentence weighting based on TF*IDF, sentence weighting based on TF*ISF, and sentence weighting based on TF*RIDF. TF (Term Frequency) and IDF (Inverse Document Frequency) are basic ideas coming from the past and from the Information Retrieval discipline [Kar07]. ISF (Inverse Sentence Frequency) [Cho09] and RIDF (Residual IDF) [Mur07] are newer ideas.
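To make the term-importance idea concrete, the following minimal sketch weights each sentence by TF*ISF. It is only an illustration, not the implementation of our workbench: the naive tokenization, the absence of stemming and stop-word removal, and the summation of term weights per sentence are simplifying assumptions.

  import math
  import re
  from collections import Counter

  def tfisf_sentence_weights(sentences):
      # Weight each sentence by the sum of TF * ISF over its terms.
      # ISF (Inverse Sentence Frequency) treats sentences the way
      # IDF treats documents.
      tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
      n = len(tokenized)
      # In how many sentences does each term occur?
      sf = Counter(t for tokens in tokenized for t in set(tokens))
      weights = []
      for tokens in tokenized:
          tf = Counter(tokens)
          weights.append(sum(tf[t] * math.log(n / sf[t]) for t in tf))
      return weights

  # Example: terms shared by all sentences contribute nothing.
  print(tfisf_sentence_weights([
      "The title method weights sentences by title words.",
      "Location methods weight sentences by their position.",
      "Both methods extract the heaviest sentences."]))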
Baxendale [Bax58] examined the position of sentences as a feature for selecting sentences for summarization. He concluded that in 85% of the paragraphs the topic sentence came first and in 7% of the paragraphs the last sentence was the topic sentence. Thus, a naive but fairly accurate way to select a topic sentence would be to choose one of these two [Das07]. Another, more sophisticated, sentence weighting based on sentence location is the "News Articles" algorithm [Har10]. It uses a simple equation to assign a different weight to each sentence in a text, based on the position of the sentence inside the document as a whole and inside the host paragraph.
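Since the exact "News Articles" equation is not reproduced above, the following sketch only illustrates the general idea of location-based weighting; the particular formula is a hypothetical stand-in, not the equation of [Har10].

  def location_weight(par_idx, n_pars, sent_idx, n_sents_in_par):
      # Hypothetical location score in [0.0, 1.0]: sentences in earlier
      # paragraphs, and earlier within their host paragraph, score
      # higher. This is NOT the equation of [Har10], only a stand-in.
      doc_part = 1.0 - par_idx / max(n_pars, 1)
      par_part = 1.0 - sent_idx / max(n_sents_in_par, 1)
      return 0.5 * doc_part + 0.5 * par_part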
Edmundson [Edm69] proposed the "Title Method", which supposes that an author conceives the title as circumscribing the subject matter of the document. According to this method, sentences that include words from the document's title are more relevant for expressing the meaning of the document. The suggested "final Title weight" for each sentence is the sum of the "Title weights" of its constituent words. Edmundson also defined the "Title glossary", which is the set of words existing in the title and subheadings, with different weights for title and subheading words.

In our previous work [Kar12] we made the simplifying assumption that documents have only a title (something that is in general correct) but don't have other medially titles (chapter, section and subsection titles/subheadings). This assumption was made because our system was designed to work with articles available through the internet, blog posts, and other similar sources. Under this assumption, our previous system assigns a predefined constant to each title word. Thus, in our previous system, the "final Title weight" for each sentence is the product of the predefined constant multiplied by the number of title words occurring in the examined sentence. In the above, we talk about words but we actually mean valid word stems.

3. Combination of methods

During the design phase of our summarization methods benchmarking system (our previous work [Kar12]), we decided to provide all the sentence weighting approaches discussed above. Both sentence location (Baxendale's and News Articles) approaches and Edmundson's Title Method, together with the alternative sentence weightings based on term importance, are provided to the user. Regarding the contribution of these three categories of factors, we decided to use a simple linear relation, but leave the user to decide on the weight of each factor. The following equation is implemented in our system:

  w1 * ST + w2 * SL + w3 * TT                         (1)

where ST is the sentence weighting based on terms, SL is the sentence location factor, and TT is the title terms factor.
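In code form, equation (1) is a one-liner; the default weights below are merely placeholders for the user's choices.

  def combined_weight(st, sl, tt, w1=1.0, w2=1.0, w3=1.0):
      # Equation (1): user-configurable linear combination of the
      # term-based (ST), location (SL) and title-terms (TT) factors.
      return w1 * st + w2 * sl + w3 * tt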
4. Non-linear combination of title words

As already stated, our previous system assigns a predefined constant to each title word that exists in a sentence. Thus, the "final Title weight" for each sentence is the product of the predefined constant multiplied by the number of title words occurring in the examined sentence. In other words, we have a linear function for sentence weighting according to the inclusion of title terms. However, another view says that even for a single title word existing in some sentence, the plausibility of the sentence expressing the meaning of the document is very high. Two title words existing in some sentence increase this plausibility but do not double it. Thus, a non-linear function should be devised. In table 1 we present two such non-linear functions. We assume a title having sixteen words. The third and fifth (last) columns of table 1 represent these functions and contain the result (the sentence weight) for a sentence containing x (out of 16) title words. Selecting one of the two functions is a matter of experimentation.

5. Ensuring uniformity of the Title Method

Our previous linear approach for assigning weights to sentences according to their title words also had a negative consequence. The proportion of the contribution of each factor (ST, SL and TT) to the overall sentence weight (see equation 1) varied. In documents with a long title, the TT factor had a greater contribution than the contribution of the TT factor in a document with a short title.

To explain, assume that the values of SL range from 0.0 to 1.0 (this is the actual range of values in the "News Articles" algorithm). We also assume that the constant weight of a title term is C.
Thus, a sentence having x title terms gets a TT factor as defined in the next equation:

  TT = x * C                                          (2)

Because of this, documents with different title lengths have different ranges for their TT factor, while their SL factor remains in the same range of values. For example, any sentence from a document with an 8-word title gets a TT factor value in the range 0.0 to 8*C, while any sentence from a document with a 4-word title gets a TT factor value in the range 0.0 to 4*C. In both cases (both title lengths) the range of SL remains from 0.0 to 1.0.

This problem is resolved with our non-linear (logarithmic) functions: after normalization, the range of TT is always from 0.0 to 1.0.
Table 1. Sentence weight for a sentence having x (out of 16) title terms

   x   Log2(x+1)   Log2(x+1)/max(Log2(x+1))   Log3(x+2)   Log3(x+2)/max(Log3(x+2))
   1     1.00               0.24                1.00               0.38
   2     1.58               0.39                1.26               0.48
   3     2.00               0.49                1.46               0.56
   4     2.32               0.57                1.63               0.62
   5     2.58               0.63                1.77               0.67
   6     2.81               0.69                1.89               0.72
   7     3.00               0.73                2.00               0.76
   8     3.17               0.78                2.10               0.80
   9     3.32               0.81                2.18               0.83
  10     3.46               0.85                2.26               0.86
  11     3.58               0.88                2.33               0.89
  12     3.70               0.91                2.40               0.91
  13     3.81               0.93                2.46               0.94
  14     3.91               0.96                2.52               0.96
  15     4.00               0.98                2.58               0.98
  16     4.09               1.00                2.63               1.00

Table 2. Sentence weight for a sentence having x (out of 8) title terms

   x   Log2(x+1)/max(Log2(x+1))   Log3(x+2)/max(Log3(x+2))
   1            0.32                       0.48
   2            0.50                       0.60
   3            0.63                       0.70
   4            0.73                       0.78
   5            0.82                       0.85
   6            0.89                       0.90
   7            0.95                       0.95
   8            1.00                       1.00
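The normalized functions of tables 1 and 2 are straightforward to compute. A minimal sketch, where x is the number of title terms found in the sentence and n is the title length in words:

  import math

  def tt_log2(x, n):
      # Normalized weight log2(x+1) / log2(n+1), always in 0.0 .. 1.0,
      # for a sentence containing x of the n title terms.
      return math.log2(x + 1) / math.log2(n + 1)

  def tt_log3(x, n):
      # The alternative function: log3(x+2) / log3(n+2).
      return math.log(x + 2, 3) / math.log(n + 2, 3)

  # Reproducing two entries of table 1 (16-word title):
  print(round(tt_log2(1, 16), 2))  # 0.24
  print(round(tt_log3(3, 16), 2))  # 0.56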

6. Exploit words from the medially titles

In our present approach we are not aiming to create a method for automatic document structure detection. Something like this demands identifying the different parts of the document (such as chapters, sections, subsections, articles and paragraphs), identifying how each of these (narrower structures) nests inside another (broader structure) and then adding markup for these parts. A parser for automatic markup of such a document structure is a very demanding undertaking. However, it is enough to create a parser that simply identifies titles in between paragraphs. In other words, we expect our parser to return a list of items where the first item is the front title, while the rest of the items can be either paragraphs or medially titles.

Having identified a front title and medially titles, we can apply the previous non-linear function and assign a sentence weight against the title words and a sentence weight against the words of the medially title coming before the sentence. In a simpler approach, we can assume that the words from all medially titles constitute a second glossary, the "Global medially title glossary". In the latter case, we can apply the previous non-linear function and assign a sentence weight against the title words ("front Title Terms", shortly fTT) and a sentence weight against the "Global medially title glossary" ("medially Title Terms", shortly mTT). In our evaluation we adopt the second (Global medially title glossary) approach. The final weight for a sentence based on the inclusion of title terms can be:

  TT = α * fTT + β * mTT                              (3)

where α = 0.6 and β = 0.4 (in general, α is set in the range 0.1 .. 0.9 and β = 1 - α), or

  TT = max(fTT, mTT)                                  (4)

Since the "Global medially title glossary" consists of words from many subtitles/subheadings, we suggest that mTT should be computed with the Log3(x+2) based function and fTT with the Log2(x+1) based function.
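A sketch of the resulting TT computation, assuming (as suggested above) the Log2-based function for fTT and the Log3-based one for mTT; the parameter names are ours, not part of the system:

  import math

  def title_terms_factor(front_hits, front_len, med_hits, med_len,
                         alpha=0.6, use_max=True):
      # fTT: sentence weight against the front title (Log2-based).
      # mTT: weight against the global medially title glossary
      # (Log3-based). use_max=True applies equation (4); otherwise
      # equation (3) with beta = 1 - alpha.
      ftt = math.log2(front_hits + 1) / math.log2(front_len + 1)
      mtt = math.log(med_hits + 2, 3) / math.log(med_len + 2, 3)
      return max(ftt, mtt) if use_max else alpha * ftt + (1 - alpha) * mtt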
7. Evaluation

In order to evaluate our approach, we selected a small subset of documents from Greek language corpora. All the selected documents have a front title and a few (usually 2 to 5) medially titles. One such document is presented in figure 1.

For each document, we asked text retrieval experts to extract the most promising subset (20%) of sentences for expressing the document meaning in short. These extractions are the manually selected summaries. Then the same documents were given to our system to mechanically extract summaries. For this purpose we excluded the ST factor and gave equal weights to the SL and TT factors (w1=0, w2=1 and w3=1 in equation 1). For the computation of the TT factor, we used equation 4. The number of sentences for the mechanical summarization was set to the same percentage (20%). Next, for each document, we measured the percentage of sentences in the mechanically extracted summary that exist in the manually extracted summary. The average percentage is 54%, which is a very promising result, since in the automatic summarization we excluded the ST factor (terms-based sentence weighting). In order to evaluate whether the medially titles have an influence on the result, we conducted the experiment again, but now treating the medially titles as simple single-sentence paragraphs. In this experiment the average percentage of matching sentences (between manual and mechanical summary) decreased to 46%. A third experiment was conducted, now using our previous system. We remind the reader that in our previous system the "final Title weight" (TT factor) for each sentence is the product of the predefined constant (C) multiplied by the number of title words occurring in the examined sentence. Again we set w1=0, and moreover we set C=0.5. Now, the average percentage of matching sentences decreased further, to 41%.
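The per-document score itself is simple; a sketch, assuming summaries are represented as sets of sentence indices:

  def summary_overlap(mechanical, manual):
      # Percentage of sentences of the mechanically extracted summary
      # that also appear in the manually extracted summary.
      mechanical, manual = set(mechanical), set(manual)
      return 100.0 * len(mechanical & manual) / len(mechanical)

  # e.g. summary_overlap({0, 4, 7, 12}, {0, 2, 7, 12}) returns 75.0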




                  Figure 1. Example document (#3644) taken from http://www.greek-language.gr/
8. Conclusions and Future Work

The results of our experiments suggest that medially titles should be considered in order to get better mechanically extracted summaries. Also, the TT factor contributes in a better way to the summarization when equation 4 is used (versus equation 2). In our plans, we have to repeat our experiments with a larger document set (the current one consists of only 21 documents) and also to consider all factors together (enabling the ST factor). Moreover, alternative approaches for the TT factor (e.g. equation 3) should be evaluated.

References

[Bax58] P. B. Baxendale. Machine-Made Index for Technical Literature—An Experiment. IBM Journal of Research and Development, 2: 354-361, 1958.

[Cho09] L. H. Chong and Y. Y. Chen. Text Summarization for Oil and Gas News Article. World Academy of Science, Engineering and Technology, 53, 2009.

[Das07] D. Das and A. F. T. Martins. A Survey on Automatic Text Summarization. Carnegie Mellon University, 2007.

[Edm69] H. P. Edmundson. New Methods in Automatic Extracting. Journal of the ACM, 16(2): 264-285, 1969.

[Har10] S. Hariharan. Multi Document Summarization by Combinational Approach. International Journal of Computational Cognition, 8(4), December 2010.

[Kar07] N. N. Karanikolas. The measurement of similarity in stock data documents collections. eRA-2: 2nd Conference for the contribution of Information Technology to Science, Economy, Society and Education, September 22-23, 2007, Athens, Greece.

[Kar12] N. N. Karanikolas, E. Galiotou and C. Tsoulloftas. A workbench for extractive summarizing methods. PCI'2012: 16th Panhellenic Conference on Informatics, October 5-7, 2012, Piraeus, Greece. IEEE CPS.

[Mur07] G. Murray and S. Renals. Term-Weighting for Summarization of Multi-Party Spoken Dialogues. In A. Popescu-Belis, S. Renals, and H. Bourlard (eds), Machine Learning for Multimodal Interaction IV. Lecture Notes in Computer Science, 4892: 155-166. Springer, 2007.