<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extractive summarization methods - subtitles and method combinations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>
            <given-names>Nikitas N.</given-names>
            <surname>Karanikolas</surname>
          </string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technological Educational Institute of Athens</institution>
          ,
          <addr-line>Ag. Spyridonos street, Aigaleo 12243</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In previous work, we presented a software tool for experimenting with well-known methods for text summarization. The methods offered belong to the extractive summarization direction. These methods do not understand the meaning of the text in order to condense it; they simply extract the subset of the original sentences that is most promising for expressing the text's meaning in short form. However, in order to concentrate on the overall idea (a workbench for testing available extractive summarization methods), we avoided pursuing some potential improvements and made some simplifying assumptions about the existing extractive summarization methods. Here, we remove those simplifications and also examine some improvements to the existing methods, in order to achieve better summarizations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Summarization is a technology for reducing a
text’s length so that the text can be understood easily
and quickly. The reduction can be based either on
shallow processing methods or on semantically oriented
ones. The semantically oriented methods understand –
to some extent – the text, try to combine the meanings of
similar sentences, and generate generalizations. Shallow
processing methods do not actually take into account
the meaning of the text; instead, they statistically select the
sentences that are most promising (as being relevant) for quick
understanding. Such an extraction-based summary is
not necessarily coherent. In previous work, we
presented a software tool for experimenting with
well-known shallow processing (extraction-based)
methods for text summarization. One of these methods
is the Title Method proposed by Edmundson [Edm69].
In our treatment of that method we made the
simplifying assumption that documents have only a
front title (which is in general correct) but
no other titles (such as chapter, section, and subsection
titles; in the following, medially titles). Here, we
resolve this simplification and consider how
the presence of words from the medially titles in a
sentence can adjust the likelihood that the sentence is
relevant for expressing the meaning of the document.
Moreover, we propose and consider using a non-linear
function for measuring the likelihood of a sentence
that contains more than one of the (front and
medially) title words. We also examine some other issues
regarding the uniformity of the Title Method and the
competition, as well as the combination, of the Title
Method with other extraction-based summarization
methods.</p>
      <p>In the following, we present some extraction-based
summarization methods and provide a simple,
user-configurable combination schema. Next, we introduce and
consider using a non-linear function for measuring the
likelihood of sentences containing more than one of the
title words; the proposed function also ensures the
uniformity of the Title Method. We then consider how
the presence of words from the medially titles in a
sentence can adjust the likelihood that the sentence is
included in the extraction-based summary. An
evaluation of the adapted Title Method is conducted.
Conclusions and future work form the last section.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Extraction-based summarization methods</title>
      <p>The extraction-based summarization methods follow
the idea that some sentences are more important than
others for expressing the meaning of the document.
Consequently, summarization can be based on a
weighting function that assigns weights to sentences,
extracting the sentences with the greatest weights.
We can identify three main sentence-weighting ideas:
based on term importance, based on sentence location,
and based on the inclusion of title terms.</p>
      <p>Sentence weighting based on term importance has
to combine two factors: the importance of a term inside
a document and the ability of the term to discriminate
among the documents of the collection. There are three
schemas that combine these two factors: sentence
weighting based on TF*IDF, sentence weighting based
on TF*ISF, and sentence weighting based on TF*RIDF.
TF (Term Frequency) and IDF (Inverse Document
Frequency) are basic, long-established ideas coming
from the Information Retrieval discipline [Kar07]. ISF
(Inverse Sentence Frequency) [Cho09] and RIDF
(Residual IDF) [Mur07] are newer ideas.</p>
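      <p>As an illustration, the TF*ISF schema can be sketched as follows. This is a minimal sketch under our own assumptions (whitespace tokenization, natural logarithm, no stemming), not the exact implementation of [Cho09].</p>

```python
import math
from collections import Counter

def tf_isf_weights(sentences):
    """Score each sentence by summing TF * ISF over its terms.

    TF is the term's frequency inside the sentence; ISF (Inverse
    Sentence Frequency) is log(N / n_t), where N is the number of
    sentences and n_t the number of sentences containing the term.
    Tokenization and the log base are simplifying assumptions.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    sent_freq = Counter()          # in how many sentences each term occurs
    for toks in tokenized:
        sent_freq.update(set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append(sum(tf[t] * math.log(n / sent_freq[t]) for t in tf))
    return weights
```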
      <p>Baxendale [Bax58] examined the position of
sentences as a feature for selecting sentences for
summarization. He concluded that in 85% of the
paragraphs the topic sentence came first, and
in 7% of the paragraphs the last sentence was the topic
sentence. Thus, a naive but fairly accurate way to
select a topic sentence would be to choose one of these
two [Das07]. A more sophisticated sentence
weighting based on sentence location is the “News
Articles” algorithm [Har10]. It utilizes a simple
equation to assign a different weight to each
sentence of a text, based on the position of the sentence
inside the document as a whole and inside the host
paragraph.</p>
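      <p>Baxendale's observation suggests a very simple location-based weighting: favor the first sentence of each paragraph strongly and the last one weakly. The sketch below uses the reported frequencies (0.85 and 0.07) as weights, which is our own illustrative choice; the “News Articles” equation itself is not reproduced here.</p>

```python
def position_weights(paragraphs, first_w=0.85, last_w=0.07):
    """Location-based sentence weights, in document order.

    paragraphs: list of paragraphs, each a list of sentences.
    The first sentence of a paragraph gets first_w, the last gets
    last_w, all others 0.0 (weight values are illustrative).
    """
    weights = []
    for sentences in paragraphs:
        for i in range(len(sentences)):
            if i == 0:
                weights.append(first_w)   # topic sentence in ~85% of cases
            elif i == len(sentences) - 1:
                weights.append(last_w)    # topic sentence in ~7% of cases
            else:
                weights.append(0.0)
    return weights
```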
      <p>Edmundson [Edm69] proposed the “Title
Method”, which supposes that an author conceives the
title as circumscribing the subject matter of the
document. According to this method, sentences that
include words from the document’s title are more
relevant for expressing the meaning of the document.
The suggested “final Title weight” for each sentence is
the sum of the “Title weights” of its constituent words.
Edmundson also defined the “Title glossary”, which is
the set of words occurring in the title and the subheadings,
with different weights for title and subheading words.</p>
      <p>In our previous work [Kar12] we made the
simplifying assumption that documents have only a
front title (which is in general correct) and no
other, medially titles (such as chapter, section, and
subsection titles/subheadings). This assumption was made
because our system was designed to work with
articles available on the internet, blog posts, and
other similar sources. Under this assumption,
our previous system assigns a predefined constant to
each title word. Thus, in our previous system, the
“final Title weight” for each sentence is the product of
the predefined constant and the number of title words
occurring in the examined sentence. In the above, we
speak of words, but we actually mean valid word
stems.</p>
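      <p>The simplified (linear) Title weighting just described can be sketched as follows; stemming is assumed to have been applied beforehand.</p>

```python
def linear_title_weight(sentence_stems, title_stems, c=0.5):
    """Final Title weight of a sentence in our previous system:
    the predefined constant c multiplied by the number of title
    stems occurring in the examined sentence."""
    title_set = set(title_stems)
    return c * sum(1 for stem in sentence_stems if stem in title_set)
```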
    </sec>
    <sec id="sec-3">
      <title>3. Combination of methods</title>
      <p>During the design phase of our summarization-methods
benchmarking system (our previous work [Kar12]), we
decided to provide all the sentence-weighting approaches
discussed above. Both sentence location approaches
(Baxendale’s and News Articles), Edmundson’s Title
Method, and the alternative sentence weightings based
on term importance are offered to the user. Regarding the
contribution of these three categories of factors, we
decided to use a simple linear relation, but to let the
user decide on the weight of each factor. The
following equation is implemented in our system:
w1 * ST + w2 * SL + w3 * TT    (1)
where ST is the sentence weighting based on terms, SL
is the sentence location factor, and TT is the title terms
factor.</p>
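      <p>Equation 1 amounts to a one-line function; the default coefficients below are placeholders for the user's choices.</p>

```python
def combined_weight(st, sl, tt, w1=1.0, w2=1.0, w3=1.0):
    """Equation (1): linear, user-configurable combination of the
    term-based factor ST, the location factor SL and the
    title-terms factor TT."""
    return w1 * st + w2 * sl + w3 * tt
```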
    </sec>
    <sec id="sec-4">
      <title>4. Non-linear combination of title words</title>
      <p>As already stated, our previous system assigns a
predefined constant for each title word that exists in a
sentence. Thus, the “final Title weight” for each
sentence is the product of the predefined constant
and the number of title words occurring in the examined
sentence. In other words, we have a linear function for
sentence weighting according to the inclusion of title
terms. However, another view holds that even a single
title word occurring in a sentence makes the plausibility
that the sentence expresses the meaning of the document
very high. Two title words occurring in a sentence
increase this plausibility, but they do not double it.
Thus, a non-linear function is needed. In table 1 we
present two such non-linear functions, assuming a title
of sixteen words. The third and fifth (last) columns of
table 1 represent these functions and contain the result
(the sentence weight) for a sentence containing x (out
of 16) title words. Selecting one of the functions is a
matter of experimentation.</p>
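      <p>Although table 1 is not reproduced here, the two candidate functions can be sketched as follows. The normalization by the title length n is our assumption, consistent with the requirement (section 5) that the resulting weight ranges from 0.0 to 1.0.</p>

```python
import math

def title_weight_log2(x, n):
    """Log2(x+1)-based non-linear weight for a sentence containing
    x of the n title words, normalized (an assumption) so that
    x = n yields 1.0 and x = 0 yields 0.0."""
    return math.log2(x + 1) / math.log2(n + 1)

def title_weight_log3(x, n):
    """Log3(x+2)-based variant, under the same normalization
    assumption."""
    return math.log(x + 2, 3) / math.log(n + 2, 3)
```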
    </sec>
    <sec id="sec-5">
      <title>5. Ensuring uniformity of the Title Method</title>
      <p>Our previous, linear approach for assigning weights to
sentences according to their title words also had a
negative consequence: the proportion of the contribution
of each factor (ST, SL and TT) to the overall sentence
weight (see equation 1) varied. In documents with a long
title, the TT factor had a greater contribution than the
TT factor in a document with a short title.</p>
      <p>To explain, we assume that the values of SL
range from 0.0 to 1.0 (this is the actual range of values
in the “News Articles” algorithm). We also assume that
the constant weight of a title term is C. Thus a sentence
containing x title words gets a TT value of x*C.</p>
      <p>Because of this, documents with titles of different
lengths have a different range for their TT factor, while
their SL factor remains in the same range of values. For
example, any sentence of a document with an 8-word title
gets a TT factor value in the range 0.0 to 8*C, while
any sentence of a document with a 4-word title gets a TT
factor value in the range 0.0 to 4*C. In both cases
(both title lengths) the range of SL remains from 0.0 to
1.0.</p>
      <p>This problem is resolved by our non-linear
(logarithmic) function: the range of TT is always from
0.0 to 1.0.</p>
      <p>In our present approach we are not aiming to create a
method for automatic document structure detection.
Such a method would demand identifying the different
parts of the document (such as chapters, sections,
subsections, articles and paragraphs), identifying how
each of these (narrower) structures nests inside another
(broader) structure, and then adding markup for
these parts. A parser for the automatic mark-up of such a
document structure is a very demanding undertaking.
However, it is enough to create a parser that
identifies titles in between paragraphs. In other words,
we expect our parser to return a list of items where
the first item is the front title, while the remaining
items can be either paragraphs or medially titles.</p>
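      <p>Such a parser can be approximated with a simple heuristic; treating a short block without sentence-final punctuation as a title is our assumption, not the actual parser of the system.</p>

```python
def split_titles_and_paragraphs(blocks):
    """Return the document as a list of tagged items: the first
    item is the front title; every other text block is tagged
    either as a medially title or as a paragraph.

    Heuristic (assumed): a block of at most 8 words with no
    sentence-final punctuation is taken to be a medially title.
    """
    items = [("front-title", blocks[0])]
    for block in blocks[1:]:
        short = len(block.split()) <= 8
        unpunctuated = not block.rstrip().endswith((".", "!", "?", ";"))
        if short and unpunctuated:
            items.append(("medially-title", block))
        else:
            items.append(("paragraph", block))
    return items
```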
      <p>Having identified a front title and medially titles, we
can apply the previous non-linear function and assign to
each sentence a weight against the front-title words and
a weight against the words of the medially title coming
before the sentence. In a simpler approach, we can
assume that the words from all medially titles constitute a
second glossary, the “Global medially title glossary”.
In the latter case we apply the previous non-linear
function and assign to each sentence a weight against
the front-title words (“front Title Terms”, shortly fTT)
and a weight against the “Global medially title glossary”
(“medially Title Terms”, shortly mTT). In our evaluation
we adopt the second (Global medially title glossary)
approach. The final weight for a sentence based on the
inclusion of title terms can be:
ΤΤ = α * fTT + β * mTT, where α=0.6 and β=0.4    (3)
(in general, α is set in the range 0.1 .. 0.9 and β=1-α)
or
ΤΤ = max (fTT, mTT)    (4)</p>
      <p>Since the “Global medially title glossary” consists of
words from many subtitles/subheadings, we suggest that
mTT be computed with the Log3(x+2)-based function
and fTT with the Log2(x+1)-based function.</p>
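      <p>Equations 3 and 4 translate directly into code; α=0.6 is the default value stated above.</p>

```python
def tt_weighted(ftt, mtt, alpha=0.6):
    """Equation (3): convex combination of the front-title score
    fTT and the medially-title score mTT, with beta = 1 - alpha."""
    return alpha * ftt + (1 - alpha) * mtt

def tt_max(ftt, mtt):
    """Equation (4): keep the stronger of the two evidences."""
    return max(ftt, mtt)
```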
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <p>In order to evaluate our approach, we have selected a
small subset of documents from Greek language
corpora. All the selected documents have a front title
and a few (usually 2 to 5) medially titles. One such
document is presented in figure 1.</p>
      <p>For each document, we asked text retrieval
experts to extract the most promising subset (20%) of
sentences for expressing the document's meaning in
short form. These extractions are the manually selected
summaries. Then the same documents were given to our
system to mechanically extract summaries. For this
purpose we excluded the ST factor and gave equal
weights to the SL and TT factors (w1=0, w2=1
and w3=1 in equation 1). For the computation of the
TT factor, we used equation 4. The number of sentences
for the mechanical summarization was set to the same
percentage (20%). Next, for each document, we measured
the percentage of sentences in the mechanically extracted
summary that also exist in the manually extracted summary.
The average percentage is 54%, which is a very promising
result, since in the automatic summarization we
excluded the ST factor (terms-based sentence
weighting). In order to evaluate whether the medially
titles influence the result, we conducted the
experiment again, but now treating the medially
titles as simple single-sentence paragraphs. In this
experiment the average percentage of matching sentences
(between manual and mechanical summaries) decreased
to 46%. A third experiment was conducted using our
previous system. We remind the reader that in our
previous system the “final Title weight” (TT factor) for
each sentence is the product of the predefined constant
(C) and the number of title words occurring in the
examined sentence. Again we set w1=0, and moreover
we set C=0.5. Now the average percentage of matching
sentences decreased further, to 41%.</p>
      <p>The results of our experiments suggest that medially
titles should be taken into account in order to obtain
better mechanically extracted summaries. Also, the TT
factor contributes to the summarization in a better way
when equation 4 is used (versus equation 2). In our plans,
we have to repeat the experiments with a larger document
set (the current one consists of only 21 documents)
and also to consider all factors together (enabling
the ST factor). Moreover, alternative approaches for the
TT factor (e.g. equation 3) should be evaluated.</p>
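      <p>The evaluation measure used above (the percentage of mechanically extracted sentences that also appear in the manual summary) can be computed as follows; sentences are assumed to be identified by their position in the document.</p>

```python
def summary_overlap(manual, mechanical):
    """Percentage of the mechanically extracted sentences that
    also occur in the manually extracted summary."""
    manual_set = set(manual)
    hits = sum(1 for s in mechanical if s in manual_set)
    return 100.0 * hits / len(mechanical)
```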
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Cho09]
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Chong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Text Summarization for Oil and Gas News Article</article-title>
          .
          <source>World Academy of Science, Engineering and Technology</source>
          ,
          <volume>53</volume>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Mur07]
          <string-name>
            <given-names>G.</given-names>
            <surname>Murray</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          .
          <article-title>Term-Weighting for Summarization of Multi-Party Spoken Dialogues</article-title>
          . In A.
          <string-name>
            <surname>Popescu-Belis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Renals</surname>
          </string-name>
          , and H. Bourlard (eds),
          <source>Machine Learning for Multimodal Interaction IV. Lecture Notes in Computer Science</source>
          ,
          <volume>4892</volume>
          :
          <fpage>155</fpage>
          -
          <lpage>166</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Kar07]
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Karanikolas</surname>
          </string-name>
          .
          <article-title>The measurement of similarity in stock data documents collections</article-title>
          .
          <source>eRA-2: 2nd Conference for the contribution of Information Technology to Science, Economy, Society and Education, September 22-23</source>
          ,
          <year>2007</year>
          , Athens, Greece.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Edm69]
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Edmundson</surname>
          </string-name>
          .
          <article-title>New Methods in Automatic Extracting</article-title>
          .
          <source>Journal of the ACM</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ):
          <fpage>264</fpage>
          -
          <lpage>285</lpage>
          ,
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Das07]
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          .
          <article-title>A Survey on Automatic Text Summarization</article-title>
          . Carnegie Mellon University,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Har10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          .
          <article-title>Multi Document Summarization by Combinational Approach</article-title>
          .
          <source>International Journal of Computational Cognition</source>
          ,
          <volume>8</volume>
          (
          <issue>4</issue>
          ),
          <year>December 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Bax58]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Baxendale</surname>
          </string-name>
          .
          <article-title>Machine-Made Index for Technical Literature-An Experiment</article-title>
          .
          <source>IBM Journal of Research and Development</source>
          ,
          <volume>2</volume>
          :
          <fpage>354</fpage>
          -
          <lpage>361</lpage>
          ,
          <year>1958</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Kar12]
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Karanikolas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Galiotou</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Tsoulloftas</surname>
          </string-name>
          .
          <article-title>A workbench for extractive summarizing methods</article-title>
          .
          <source>PCI'2012: 16th Panhellenic Conference on Informatics, October 5-7</source>
          ,
          <year>2012</year>
          , Piraeus, Greece. IEEE CPS.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>