=Paper=
{{Paper
|id=Vol-1746/paper-06
|storemode=property
|title=Extractive Summarization Methods – Subtitles and Method Combinations
|pdfUrl=https://ceur-ws.org/Vol-1746/paper-06.pdf
|volume=Vol-1746
|authors=Nikitas N. Karanikolas
|dblpUrl=https://dblp.org/rec/conf/rtacsit/Karanikolas16
}}
==Extractive Summarization Methods – Subtitles and Method Combinations==
Nikitas N. Karanikolas
Technological Educational Institute of Athens
Ag. Spyridonos Street, Aigaleo 12243, Greece
nnk@teiath.gr

Abstract

In previous work, we presented a software tool for experimenting with well-known methods for text summarization. The methods offered belong to the extractive summarization direction. These methods do not understand the meaning of the text in order to condense it; they simply extract the subset of the original sentences that is most promising for expressing the meaning of the text in short. However, in order to concentrate on the whole idea (a workbench for testing the available extractive summarization methods), we avoided some potential improvements and made simplifying assumptions about the existing extractive summarization methods. Here, we remove the simplifications and also examine some improvements to the existing methods, in order to achieve better summarizations.

1. Introduction

Summarization is a technology for reducing the length of a text so that it can be understood easily and quickly. The reduction can be based either on shallow processing methods or on semantics-oriented ones. The semantics-oriented methods understand – somehow – the text and try to combine the meanings of similar sentences and generate generalizations. Shallow processing methods do not actually take the meaning of the text into account; they statistically select the most promising (as being relevant) sentences for quick understanding. Such an extraction-based summary is not necessarily coherent.

In previous work, we presented a software tool for experimenting with well-known shallow processing (extraction-based) methods for text summarization. One of these methods is the Title Method proposed by Edmundson [Edm69]. In our treatment of the method, we made the simplifying assumption that documents have only a title (which is in general correct) but no other titles (such as chapter, section and subsection titles; in the following, "medially titles"). Here, we resolve this simplification and consider how the presence of words from the medially titles in a sentence can adapt the likelihood of the sentence being relevant for expressing the meaning of the document. Moreover, we propose and consider a non-linear function for measuring the likelihood of a sentence that contains more than one of the (front and medially) title words. Some other issues are also examined, regarding the uniformity of the Title Method and the competition and combination of the Title Method with other extraction-based summarization methods.

In the following, we first present some extraction-based summarization methods and provide a simple, user-configurable combination schema. Next, we introduce a non-linear function for measuring the likelihood of sentences that contain more than one of the title words; the proposed function also ensures the uniformity of the Title Method. We then consider how the presence of words from the medially titles in a sentence can adapt the likelihood of the sentence being included in the extraction-based summary, and we conduct an evaluation of the adapted Title Method. Conclusions and future work form the last section.

2. Extraction-based summarization methods

The extraction-based summarization methods follow the idea that some sentences are more important than others for expressing the meaning of the document. Consequently, summarization can be based on a weighting function that assigns weights to sentences, extracting the sentences with the greatest weights.
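As a rough illustration of this idea, consider the following minimal Python sketch (our own illustration, not the paper's system; the names and the extraction ratio are assumptions, although 20% is also the ratio used in the evaluation of section 7). It scores every sentence with a supplied weighting function and keeps the top fraction, in original text order:

 # Minimal sketch of extraction-based summarization (illustrative only):
 # weight every sentence, keep the top fraction, preserve text order.
 def extractive_summary(sentences, weight_fn, ratio=0.2):
     ranked = sorted(range(len(sentences)),
                     key=lambda i: weight_fn(sentences[i]),
                     reverse=True)
     keep = max(1, round(len(sentences) * ratio))
     return [sentences[i] for i in sorted(ranked[:keep])]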
We can mention three main sentence weighting ideas: weighting based on term importance, weighting based on sentence location, and weighting based on the inclusion of title terms.

Sentence weighting based on term importance has to combine two factors: the importance of a term inside a document, and the ability of the term to discriminate among the documents in the collection. There are three schemas that combine these two factors: sentence weighting based on TF*IDF, sentence weighting based on TF*ISF, and sentence weighting based on TF*RIDF. TF (Term Frequency) and IDF (Inverse Document Frequency) are basic ideas coming from the past and from the Information Retrieval discipline [Kar07]. ISF (Inverse Sentence Frequency) [Cho09] and RIDF (Residual IDF) [Mur07] are newer ideas.

Baxendale [Bax58] examined the position of sentences as a feature for selecting sentences for summarization. He concluded that in 85% of paragraphs the topic sentence came first, and in 7% of paragraphs the last sentence was the topic sentence. Thus, a naive but fairly accurate way to select a topic sentence would be to choose one of these two [Das07]. Another, more sophisticated sentence weighting based on sentence location is the "News Articles" algorithm [Har10]. It uses a simple equation to assign a different weight to each sentence in a text, based on the position of the sentence inside the document as a whole and inside the host paragraph.

Edmundson [Edm69] proposed the "Title Method", which supposes that an author conceives the title as circumscribing the subject matter of the document. According to this method, sentences that include words from the document's title are more relevant for expressing the meaning of the document. The suggested "final Title weight" for each sentence is the sum of the "Title weights" of its constituent words. Edmundson also defined the "Title glossary", the set of words occurring in the title and the subheadings, with different weights for title and subheading words.

In our previous work [Kar12] we made the simplifying assumption that documents have only a title (which is in general correct) but no other medially titles (such as chapter, section and subsection titles/subheadings). We adopted this assumption because our system was designed to work with articles available through the internet, blog posts, and other similar sources. Accordingly, our previous system assigns a predefined constant weight to each title word, and the "final Title weight" for each sentence is the product of the predefined constant multiplied by the number of title words occurring in the examined sentence. In the above we talk about words, but we actually mean valid word stems.

3. Combination of methods

During the design phase of our summarization methods benchmarking system (our previous work [Kar12]), we decided to provide all the sentence weighting approaches discussed above. Both sentence location approaches (Baxendale's and News Articles), Edmundson's Title Method, and the alternative sentence weightings based on term importance are offered to the user. Regarding the contribution of these three categories of factors, we decided to use a simple linear relation, but let the user decide on the weight of each factor. The following equation is implemented in our system:

w1 * ST + w2 * SL + w3 * TT    (1)

where ST is the sentence weighting based on terms, SL is the sentence location factor, and TT is the title terms factor.
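Equation (1) is a plain weighted sum, so a sketch of it is short (argument and parameter names are ours; the ST, SL and TT factors are assumed to be computed elsewhere):

 def sentence_weight(st, sl, tt, w1=1.0, w2=1.0, w3=1.0):
     # Equation (1): user-weighted linear combination of the term factor
     # (ST), the sentence location factor (SL) and the title terms
     # factor (TT).
     return w1 * st + w2 * sl + w3 * tt

The user supplies w1, w2 and w3; for instance, the evaluation in section 7 uses w1 = 0, w2 = 1 and w3 = 1.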
4. Non-linear combination of title words

As already stated, our previous system assigns a predefined constant weight to each title word that occurs in a sentence, so the "final Title weight" of each sentence is the product of the predefined constant multiplied by the number of title words occurring in the examined sentence. In other words, we have a linear function for sentence weighting according to the inclusion of title terms. However, another idea says that even a single title word occurring in a sentence already makes the plausibility of the sentence expressing the meaning of the document very high. Two title words occurring in a sentence increase this plausibility, but they do not double it. Thus, a non-linear function should be devised. In table 1 we present two such non-linear functions, assuming a title of sixteen words. The third and fifth (last) columns of table 1 represent these functions and contain the result (the sentence weight) for a sentence containing x (out of 16) title words. Selecting one of the two functions is a matter of experimentation.

Table 1. Sentence weight for a sentence having x (out of 16) title terms

  x   Log2(x+1)   Log2(x+1)/max(Log2(x+1))   Log3(x+2)   Log3(x+2)/max(Log3(x+2))
  1     1.00              0.24                 1.00              0.38
  2     1.58              0.39                 1.26              0.48
  3     2.00              0.49                 1.46              0.56
  4     2.32              0.57                 1.63              0.62
  5     2.58              0.63                 1.77              0.67
  6     2.81              0.69                 1.89              0.72
  7     3.00              0.73                 2.00              0.76
  8     3.17              0.78                 2.10              0.80
  9     3.32              0.81                 2.18              0.83
 10     3.46              0.85                 2.26              0.86
 11     3.58              0.88                 2.33              0.89
 12     3.70              0.91                 2.40              0.91
 13     3.81              0.93                 2.46              0.94
 14     3.91              0.96                 2.52              0.96
 15     4.00              0.98                 2.58              0.98
 16     4.09              1.00                 2.63              1.00

5. Ensuring uniformity of the Title Method

Our previous linear approach for assigning weights to sentences according to their title words had a further negative consequence: the proportion of the contribution of each factor (ST, SL and TT) to the overall sentence weight (see equation 1) varied. In documents with a long title, the TT factor had a greater contribution than the TT factor has in a document with a short title. To explain, assume that the values of SL range from 0.0 to 1.0 (this is the actual range of values in the "News Articles" algorithm) and that the constant weight of a title term is C. A sentence having x title terms then gets a TT factor as defined in the next equation:

TT = x * C    (2)

Consequently, documents with titles of different lengths have different ranges for their TT factor, while their SL factor stays in the same range of values. For example, any sentence from a document with an 8-word title gets a TT factor value in the range 0.0 to 8*C, while any sentence from a document with a 4-word title gets a TT factor value in the range 0.0 to 4*C. In both cases (both title lengths) the range of SL remains 0.0 to 1.0.

This problem is resolved with our non-linear (logarithmic) function: the range of TT is always 0.0 to 1.0. Table 2 shows the two normalized functions for a document with an 8-word title.

Table 2. Sentence weight for a sentence having x (out of 8) title terms

  x   Log2(x+1)/max(Log2(x+1))   Log3(x+2)/max(Log3(x+2))
  1           0.32                       0.48
  2           0.50                       0.60
  3           0.63                       0.70
  4           0.73                       0.78
  5           0.82                       0.85
  6           0.89                       0.90
  7           0.95                       0.95
  8           1.00                       1.00
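Both normalized columns of tables 1 and 2 follow one pattern: log_b(x + b - 1) divided by its maximum value log_b(n + b - 1), for a title of n words and base b equal to 2 or 3. A sketch of this computation (our own formulation of the tables' columns; the function name is ours):

 import math

 def title_factor(x, title_len, base=2):
     # Normalized non-linear title weight. base=2 gives
     # Log2(x+1)/max(Log2(x+1)); base=3 gives Log3(x+2)/max(Log3(x+2)).
     # The maximum occurs at x = title_len, so the result always lies
     # in (0, 1].
     shift = base - 1
     return math.log(x + shift, base) / math.log(title_len + shift, base)

 # Reproduces the tables, e.g.:
 # title_factor(8, 16, base=2)  -> 0.78  (table 1, third column)
 # title_factor(4, 8, base=3)   -> 0.78  (table 2, last column)

Whatever the title length, the factor reaches 1.0 when all title words occur in the sentence, which is exactly the uniformity property discussed above.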
6. Exploiting words from the medially titles

In our present approach we are not aiming to create a method for automatic detection of the document structure. Such a method would need to identify the different parts of the document (such as chapters, sections, subsections, articles and paragraphs), identify how each of these (narrower) structures nests inside another (broader) structure, and then add markup for these parts. A parser for the automatic markup of such a document structure is a very demanding undertaking. However, it is enough to create a parser that simply identifies titles in between paragraphs. In other words, we expect our parser to return a list of items where the first item is the front title, while the rest of the items can be either paragraphs or medially titles.

Having identified a front title and the medially titles, we can apply the previous non-linear function and assign to each sentence a weight against the front title words and a weight against the words of the medially title that precedes the sentence. In a simpler approach, we can assume that the words from all the medially titles constitute a second glossary, the "Global medially title glossary". In the latter case, we can apply the previous non-linear function and assign a sentence weight against the front title words ("front Title Terms", shortly fTT) and a sentence weight against the "Global medially title glossary" ("medially Title Terms", shortly mTT). In our evaluation we adopt the second (Global medially title glossary) approach. The final weight for a sentence based on the inclusion of title terms can then be:

TT = α * fTT + β * mTT    (3)

where α = 0.6 and β = 0.4 (in general, α is set in the range 0.1 .. 0.9 and β = 1 - α), or:

TT = max(fTT, mTT)    (4)

Since the "Global medially title glossary" consists of words from many subtitles/subheadings, we suppose that mTT should be computed with the Log3(x+2)-based function, while fTT should be computed with the Log2(x+1)-based function.
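Equations (3) and (4) can then be sketched as follows (function, parameter and mode names are ours; title_factor refers to the sketch at the end of section 5):

 def combined_title_factor(ftt, mtt, alpha=0.6, mode="max"):
     # mode="linear" is equation (3): alpha*fTT + (1 - alpha)*mTT,
     # with alpha normally chosen in 0.1 .. 0.9 and beta = 1 - alpha.
     # mode="max" is equation (4), the variant used in the evaluation.
     if mode == "linear":
         return alpha * ftt + (1.0 - alpha) * mtt
     return max(ftt, mtt)

 # As suggested above, fTT uses the Log2-based function and mTT the
 # Log3-based one; for a sentence with x_f front-title words (title of
 # n_f words) and x_m words from a medially glossary of n_m words:
 # tt = combined_title_factor(title_factor(x_f, n_f, base=2),
 #                            title_factor(x_m, n_m, base=3))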
7. Evaluation

In order to evaluate our approach, we selected a small subset of documents from the Greek language corpora. All the selected documents have a front title and a few (usually 2 to 5) medially titles. One such document is presented in figure 1.

For each document, we asked text retrieval experts to extract the most promising subset (20%) of sentences for shortly expressing the meaning of the document. These extractions are the manually selected summaries. Then the same documents were given to our system to mechanically extract summaries. For this purpose we excluded the ST factor and gave equal weights to the SL and TT factors (w1 = 0, w2 = 1 and w3 = 1 in equation 1). For the computation of the TT factor, we used equation 4. The number of sentences for the mechanical summarization is set to the same percentage (20%). Next, for each document, we measured the percentage of sentences in the mechanically extracted summary that also exist in the manually extracted summary. The average percentage is 54%, which is a very promising result, given that the automatic summarization excluded the ST factor (terms-based sentence weighting).

In order to evaluate whether the medially titles influence the result, we conducted the experiment again, but now treating the medially titles as simple single-sentence paragraphs. In this experiment, the average percentage of matching sentences (between manual and mechanical summary) decreased to 46%. A third experiment was conducted using our previous system; recall that in our previous system the "final Title weight" (TT factor) for each sentence is the product of the predefined constant (C) multiplied by the number of title words occurring in the examined sentence. Again we set w1 = 0, and moreover we set C = 0.5. Now the average percentage of matching sentences decreased further, to 41%.

Figure 1. Example document (#3644) taken from http://www.greek-language.gr/

8. Conclusions and Future Work

The results of our experiments suggest that medially titles should be considered in order to get better mechanically extracted summaries. The TT factor also contributes to the summarization in a better way when equation 4 is used (versus equation 2). In our plans, we have to repeat our experiments with a larger document set (the current one consists of only 21 documents) and to consider all factors together (enabling the ST factor). Moreover, alternative approaches for the TT factor (e.g. equation 3) should be evaluated.

References

[Bax58] P. B. Baxendale. Machine-Made Index for Technical Literature—An Experiment. IBM Journal of Research and Development, 2: 354-361, 1958.

[Cho09] L. H. Chong and Y. Y. Chen. Text Summarization for Oil and Gas News Article. World Academy of Science, Engineering and Technology, 53, 2009.

[Das07] D. Das and A.F.T. Martins. A Survey on Automatic Text Summarization. Carnegie Mellon University, 2007.

[Edm69] H. P. Edmundson. New Methods in Automatic Extracting. Journal of the ACM, 16(2): 264–285, 1969.

[Har10] S. Hariharan. Multi Document Summarization by Combinational Approach. International Journal of Computational Cognition, 8(4), December 2010.

[Kar07] N. N. Karanikolas. The measurement of similarity in stock data documents collections. eRA-2: 2nd Conference for the contribution of Information Technology to Science, Economy, Society and Education, September 22-23, 2007, Athens, Greece.

[Kar12] N. N. Karanikolas, E. Galiotou and C. Tsoulloftas. A workbench for extractive summarizing methods. PCI 2012: 16th Panhellenic Conference on Informatics, October 5-7, 2012, Piraeus, Greece. IEEE CPS.

[Mur07] G. Murray and S. Renals. Term-Weighting for Summarization of Multi-Party Spoken Dialogues. In A. Popescu-Belis, S. Renals, and H. Bourlard (eds.), Machine Learning for Multimodal Interaction IV. Lecture Notes in Computer Science, 4892: 155-166. Springer, 2007.