=Paper= {{Paper |id=None |storemode=property |title=Using Snippets in Text Summarization: a Comparative Study and an Application |pdfUrl=https://ceur-ws.org/Vol-835/paper15.pdf |volume=Vol-835 |dblpUrl=https://dblp.org/rec/conf/iir/ArmanoGV12 }} ==Using Snippets in Text Summarization: a Comparative Study and an Application== https://ceur-ws.org/Vol-835/paper15.pdf
Using Snippets in Text Summarization: a
Comparative Study and an Application

Giuliano Armano, Alessandro Giuliani, and Eloisa Vargiu




Abstract Automatic text summarization consists of automatically creating a summary
of one or more texts. Unfortunately, classical techniques cannot be applied to Web
pages with dynamic content. In this paper, we propose the adoption of snippets,
i.e., page excerpts provided together with user query results by search engines,
as a text summarization technique. The study is conducted along two directions:
(i) comparing the proposed approach with a classical text summarization technique
and (ii) assessing whether snippet summarization can be successfully applied to
contextual advertising. On the one hand, comparative experiments show that the
proposed approach performs similarly to the selected classical technique. On the
other hand, adopting snippets as a text summarization technique in contextual
advertising yields quite satisfactory performance.



1 Introduction

During the 1960s, a large number of scientific papers and books were digitally
stored and made searchable. Due to limited storage capacity, documents were
stored, indexed, and made searchable only through their summaries [29]. For this
reason, automatically creating summaries became a primary task, and several
techniques were defined and developed [18, 12, 25].
   More recently, there has been a renewed interest in automatic summarization
techniques. The problem now is no longer due to limited storage capacity, but to
retrieval and filtering needs. Since digitally stored information is more and more
available, users need suitable tools able to select, filter, and extract only relevant
information. Therefore, text summarization techniques are currently adopted in several
fields of information retrieval and filtering [7], such as information extraction
[21], text mining [31], document classification [27], recommender systems [23], and
contextual advertising [1].

G. Armano, A. Giuliani, and E. Vargiu
University of Cagliari, Dept. of Electrical and Electronic Engineering, Piazza d’Armi, I09123
Cagliari (Italy) e-mail: {armano, alessandro.giuliani, vargiu}@diee.unica.it

   Unfortunately, classical techniques are not easily applicable to dynamic Web
pages, which often rely on Microsoft Silverlight¹, Adobe Flash², or Adobe
Shockwave³, or contain applets written in Java. Conventional parsing methods often
cannot extract the text of such pages. Therefore, we claim that snippets, which
are provided together with user query results by search engines, might be adopted
to perform text summarization on Web pages.
   In this paper, we study the impact of using snippets to perform text
summarization. In particular, we conduct the study along two directions: (i) comparing
performances obtained by using snippets with those obtained by adopting one of
the classical text summarization techniques proposed in [3] and (ii) adopting
snippets as a text summarization technique in a selected application field, i.e.,
contextual advertising.
   The rest of the paper is organized as follows. Section 2 recalls the main work on
text summarization and introduces snippets and their use in search engines. Section
3 presents comparative experiments obtained by adopting snippets with respect to a
classical text summarization technique. In Section 4, an application of snippet text
summarization in the field of contextual advertising is proposed. Section 5 ends the
paper with conclusions and future work.



2 Background

2.1 Text Summarization

Automatic text summarization is a technique in which a text is summarized by a
computer program. Given a text, its summary (i.e., a non redundant extract from the
original text) is returned.
   Mani [19] made a distinction among different kinds of summaries: an extract
consists entirely of material copied from the input; an abstract contains material that
is not present in the input or, at least, expresses it in a different way; an indicative
abstract is aimed at providing a basis for selecting documents for closer study of
the full text; an informative abstract covers the salient information in the source at
some level of detail; and a critical abstract evaluates the subject matter of the source
document, expressing the abstractor’s views on the quality of the author’s work.
   According to [15], summarization techniques can be divided into two groups: those
that extract information from the source documents (extraction-based approaches)
and those that abstract from the source documents (abstraction-based approaches).

1 http://www.microsoft.com/silverlight/
2 http://www.adobe.com/products/flashplayer.html
3 http://get.adobe.com/it/shockwave/
The former impose the constraint that a summary uses only components extracted
from the source document. These approaches are mainly concerned with what the
summary content should be, usually relying solely on extraction of sentences. The
latter relax the constraints on how the summary is created. These approaches put
strong emphasis on the form, aiming to produce a grammatical summary, which
usually requires advanced language generation techniques.
    Although potentially more powerful, abstraction-based approaches have been far
less popular than their extraction-based counterparts, mainly because generating the
latter is easier. An extraction-based summary consists of a subset of words from
the original document, and its bag of words (BoW) representation can be created by
selectively removing a number of features from the original term set. Typically, an
extraction-based summary whose length is only 10-15% of the original is likely to
lead to a significant feature reduction as well. Many studies suggest that even
simple summaries are quite effective in carrying over the relevant information about
a document. Straightforward but effective extraction-based text summarization
techniques have been proposed and compared in [15]. In a subsequent work, Armano
et al. [3] proposed some enriched techniques. In particular, they showed that the
technique with the best performance in terms of precision, recall, and F-measure was
the so-called TFLP, i.e., the technique that considers the title of the document and
its first and last paragraphs.
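
    The TFLP idea can be illustrated with a minimal sketch. It is not the implementation used in [3]: it assumes that the page text has already been extracted as plain text and that paragraphs are separated by blank lines, and the function name `tflp_summary` is hypothetical.

```python
import re


def tflp_summary(title, body):
    """Return the title plus the first and last paragraphs of `body`.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    parts = [title.strip()] if title.strip() else []
    if paragraphs:
        parts.append(paragraphs[0])
        # Avoid duplicating the paragraph when the page has only one.
        if len(paragraphs) > 1:
            parts.append(paragraphs[-1])
    return "\n".join(parts)


if __name__ == "__main__":
    page = "Opening paragraph.\n\nDetails in the middle.\n\nClosing paragraph."
    print(tflp_summary("A page title", page))
```

    A real Web page would first need HTML parsing to obtain the title and the paragraph boundaries, which is exactly the step that becomes difficult for dynamic pages.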
    One may argue that extraction-based approaches are too simple. However, as
shown in [9], extraction-based summaries of news articles can be more informative
than those resulting from more complex approaches. Also, headline-based article
descriptors proved to be effective in determining users’ interests [14]. Moreover,
these approaches have been successfully applied in the contextual advertising field
[5] and in a multimodal scenario [2].




Fig. 1 An example of results given by Yahoo! search engine for the query “Information retrieval”.




2.2 Snippets in Search Engines

A general definition of snippet is “a small piece of something”. In programming, it
refers to a small region of reusable source code, machine code, or text. Snippets are
often used to clarify the meaning of an otherwise cluttered function, or to minimize
the use of repeated code that is common to other functions.
    Snippets are also used by search engines to provide a textual excerpt of the
corresponding Web page according to the keywords used in the query. A snippet can be
considered a topic-driven summary, since its content depends on the preferences of
the user, which are expressed via a query, making the final summary focused on a
particular topic. In a preliminary work, Boydell and Smyth used snippets as
summary fragments in the field of social Web [8].
    While replying to a user’s query, search engines provide a ranked list of related
Web pages, each described by a title, a set of snippets, and its URL (see Figure 1).
The title is directly taken from the title tag of the page, whereas the URL is the http
address of the page.
    For a search engine, the choice of a snippet is an important task. If a snippet
shown to the user is not very informative, the user may click on search results that
do not contain the information s/he is looking for, or s/he may not click on helpful
pages. Moreover, poorly chosen snippets can lead to bad searching experiences.
Snippets are usually directly taken from the description meta tag, if available. If
the description meta tag is not provided, the search engine may use the description
for the site supplied by the Open Directory Project (also known as DMoz)⁴ or a summary
extracted from the main content of the page.
    Snippet extraction depends on the adopted search engine. Google⁵ does not
always use the meta description of the page. In fact, if the content provided by the
Web developer in the description meta tag is not helpful, or is of less than
reasonable quality, Google replaces it with its own description of the site. As a
consequence, Google snippets may differ depending on the user’s search query. Yahoo!⁶
has filed a patent application that describes how to better decide which snippet to
show to users. The gist of the Yahoo! patent application is based on three main issues⁷:
(i) a query-independent relevance for each line of text, i.e., a degree to which the
line of text of the document summarizes the document; (ii) a query-dependent
relevance of each of the lines of text, i.e., a relevance of the line of text to the
query; and (iii) the intent behind a query. To the best of our knowledge, Bing⁸
developers do not give information on how snippets are extracted. In the literature,
there are several studies focused on snippet extraction techniques, usually relying
on natural language processing algorithms, e.g., as proposed by Li and Chen [17].




4 http://dmoz.org
5 http://www.google.com
6 http://www.yahoo.com
7 http://www.seobythesea.com/2009/12/how-a-search-engine-may-choose-search-snippets/
8 http://www.bing.com
3 Comparative Study and Results

The first goal of this paper is to compare performances obtained by using snippets
with those obtained by adopting a classical text summarization technique. Compar-
ative experiments and the corresponding results are presented in this Section.




Fig. 2 The system adopted to perform comparative experiments on text summarization.


   To perform comparative experiments, we devised a suitable system, depicted in
Figure 2, in which the Text Summarizer module performs text summarization and
the Classifier module is a centroid-based classifier that classifies each page
in order to calculate precision, recall, and F-measure of the adopted text
summarization techniques. In other words, to assess the text summarization techniques,
we used a Rocchio classifier [24] with only positive examples and no relevance
feedback, preliminarily trained with about 100 Web pages per class. Pages are
classified by considering the highest score(s) obtained by the cosine similarity
method. To evaluate the effectiveness of the classifier, we also performed a
preliminary experiment in which pages are classified without relying on text
summarization. In this setting, the classifier showed a precision of 0.862 and a
recall of 0.858.
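
   The classification step described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the system of Figure 2: term vectors are raw term frequencies over whitespace tokens (no stemming or TFIDF weighting), each centroid is the mean of the positive examples of its class, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def to_vector(text):
    """Simplified term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(labeled_texts):
    """Build one centroid per class as the mean of its positive examples."""
    sums = defaultdict(Counter)
    counts = Counter()
    for category, text in labeled_texts:
        sums[category].update(to_vector(text))
        counts[category] += 1
    return {c: {t: w / counts[c] for t, w in vec.items()} for c, vec in sums.items()}


def classify(text, centroids):
    """Return the class with the highest cosine score, plus all scores."""
    vec = to_vector(text)
    scores = {c: cosine(vec, centroid) for c, centroid in centroids.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    training = [("sport", "football match goal"), ("finance", "bank loan interest")]
    label, scores = classify("football goal", train_centroids(training))
    print(label, scores)
```

   In the experiments, the text passed to such a classifier would be the summary produced by the Text Summarizer rather than the full page.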



3.1 Setting Up the Experiments

Experiments have been performed on two datasets extracted from the Open Directory
Project and Yahoo! Categories. The former, called BankSearch [28], consists of
about 11000 Web pages classified by hand in 11 categories (see Figure 3)⁹. The
latter, called Recreation, consists of about 5000 Web pages classified by hand in 18
categories (see Figure 4).


9 The 11 selected classes are the leaves of the taxonomy, together with the class Sport, which
contains Web documents from all the sites that were classified as Sport, except for the sites that
were classified as Soccer or Motor Sport.
Fig. 3 The taxonomy of BankSearch Dataset.




Fig. 4 The taxonomy of Recreation Dataset.



   As a baseline for our comparative experiments, we adopted the text summarization
technique called TFLP (Title, First and Last Paragraph summarization), which
considers the title and the first and last paragraphs of the given Web page. This
technique, proposed in [3], showed the best results compared with the state-of-the-art
techniques proposed in [15]. As for snippets, we queried Yahoo! with the URL of each
Web page of the dataset and used the returned snippets. We performed experiments by
considering the snippets by themselves (S) and in conjunction with the title of the
corresponding Web page (ST). It is worth noting that, to perform a fair comparison,
we disregarded dynamic pages from both datasets, so that the same number of pages is
processed independently of the adopted text summarization technique.
3.2 Results

Table 1 reports our experimental results in terms of precision (π), recall (ρ), and
F-measure (F1). The table also gives the average number of extracted terms (T).
   The results obtained on BankSearch are better than those obtained on Recreation.
Moreover, they point out that, in both datasets, results obtained by relying on
snippets together with the title (ST) are comparable with those obtained by adopting
TFLP. In particular, TFLP performs slightly better on BankSearch, whereas ST
performs slightly better on Recreation. This suggests that snippets can be adopted as
a text summarization technique, especially when classical techniques cannot be
applied, as in the case of dynamic Web pages.
   Let us note that, for each dataset, the average number of terms for the TFLP
technique is about twice the number of terms for the methods that use snippets.
This is due to the fact that a snippet is built as a very short text, not less than two
rows, whereas a TFLP summary is usually longer (two complete paragraphs).


Table 1 Results of text summarization techniques comparison.
                 BankSearch               Recreation
            TFLP     S      ST       TFLP     S      ST
      π     0.849  0.734  0.806      0.575  0.544  0.595
      ρ     0.845  0.730  0.804      0.556  0.506  0.554
      F1    0.847  0.732  0.805      0.565  0.524  0.574
      T     26     12     14         26     11     13
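
As a quick consistency check, the F1 rows of Table 1 can be recomputed from the precision and recall rows. The sketch below assumes the standard harmonic-mean definition, F1 = 2πρ / (π + ρ), which the paper does not state explicitly. The value printed as "0,734" in the original table is read here as 0.734, and the function name `f1` is hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for each dataset and technique in Table 1.
TABLE_1 = {
    ("BankSearch", "TFLP"): (0.849, 0.845, 0.847),
    ("BankSearch", "S"): (0.734, 0.730, 0.732),
    ("BankSearch", "ST"): (0.806, 0.804, 0.805),
    ("Recreation", "TFLP"): (0.575, 0.556, 0.565),
    ("Recreation", "S"): (0.544, 0.506, 0.524),
    ("Recreation", "ST"): (0.595, 0.554, 0.574),
}

if __name__ == "__main__":
    for (dataset, technique), (p, r, reported) in TABLE_1.items():
        print(dataset, technique, round(f1(p, r), 3), reported)
```

Under this assumption, the recomputed values agree with the reported ones to three decimal places.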




4 Using Snippets as Text Summarization Technique in
  Contextual Advertising

The second goal of this paper is to study the impact of snippet text summarization in
a selected application field. Among other relevant information retrieval and filtering
fields in which snippet text summarization could be adopted, we concentrate on
contextual advertising.



4.1 Contextual Advertising

Web advertising is one of the major sources of income for a large number of web-
sites. Its main goal is to suggest products and services to the ever growing popu-
lation of Internet users. There are two primary channels for distributing ads: Spon-
sored Search (or Paid Search Advertising) and Contextual Advertising (or Content
Match). Sponsored Search displays ads on the page returned by a search engine
following a query [13], whereas Contextual Advertising (CA) displays ads within
the content of a generic, third-party Web page.
   Ribeiro-Neto et al. [22] examined a number of strategies to match pages and ads
based on extracted keywords. In a subsequent work, Lacerda et al. [16] proposed
a method to learn the impact of individual features using genetic programming.
Broder et al. [10] classified both pages and ads into a given taxonomy and matched
ads to the page falling into the same node of the taxonomy. Starting from that work,
Armano et al. [4] proposed a semantic enrichment by adopting concepts. Further-
more, modern contextual advertising systems use text summarization techniques in
conjunction with the model developed in [10] (see, for instance, [1, 5]). Since bid
phrases are basically search queries, another relevant approach is to view contextual
advertising as a problem of query expansion and rewriting [20, 11]. Another
perspective consists in addressing a contextual advertising problem as a
recommendation task [6]. In this view, the task of suggesting an ad to a Web page
is seen as the task of recommending an item (the ad) to a user (the Web page).




Fig. 5 The implemented contextual advertising system.




4.2 The Implemented System

Being interested in studying the impact of snippets as a text summarization technique
in contextual advertising, we devised a suitable system (see Figure 5). The system
takes a Web page as input. The BoW builder first retrieves the snippets of the
page by querying the Yahoo! search engine, and then removes stop-words and performs
stemming. This module outputs a vector representation of the original text as a BoW,
each word being represented by its TFIDF [26]. Starting from the BoW provided by
the BoW builder, the Classifier classifies the page according to the given taxonomy
by adopting a centroid-based approach. This module outputs a vector representation
in terms of Classification Features (CF), each feature corresponding to the score
given by the classifier to a category. Finally, the Matcher ranks the categories
according to the scores given by the classifier (i.e., the CF of the target page) and,
for each category, randomly extracts a corresponding ad from the Ads repository.
   Let us note that the proposed system, except for the adopted text summarization
technique, is compliant with the system proposed in [1] in which only CF are
considered in the matching phase.
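
   The BoW builder and the Matcher can be sketched as follows. This is a hedged illustration rather than the implemented system: snippet retrieval and stemming are omitted, the stop-word list is a toy one, the TFIDF variant (tf · log(N / df), with unseen terms given df = 1) is an assumption, and the names `tfidf_bow` and `match_ads` are hypothetical.

```python
import math
import random
from collections import Counter

# Toy stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "for"}


def tfidf_bow(text, doc_freq, n_docs):
    """Weight each non-stop-word term by tf * log(n_docs / df).

    `doc_freq` maps a term to the number of background documents containing
    it; terms missing from it are treated as having df = 1 (an assumption).
    """
    tf = Counter(t for t in text.lower().split() if t not in STOP_WORDS)
    return {t: f * math.log(n_docs / doc_freq.get(t, 1)) for t, f in tf.items()}


def match_ads(cf_scores, ads_by_category, k, rng=random):
    """Rank categories by classifier score and draw one random ad from each
    of the k best-scoring categories that have at least one ad."""
    ranked = sorted(cf_scores, key=cf_scores.get, reverse=True)
    chosen = []
    for category in ranked:
        if len(chosen) >= k:
            break
        ads = ads_by_category.get(category)
        if ads:
            chosen.append((category, rng.choice(ads)))
    return chosen


if __name__ == "__main__":
    cf = {"camping": 0.7, "fishing": 0.2, "travel": 0.5}  # hypothetical CF scores
    repository = {"camping": ["ad-c1"], "travel": ["ad-t1", "ad-t2"], "fishing": ["ad-f1"]}
    print(match_ads(cf, repository, k=2))
```

   The classifier that produces the CF scores sits between these two functions; any centroid-based scorer returning one score per category would fit.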



4.3 System Performances

To assess the effectiveness of the proposed approach, experiments have been per-
formed on the Recreation dataset described in Section 3.1. As for the ads to be
suggested, we built a suitable repository in which ads are classified according to
the given taxonomy. In this repository, each ad is represented by the Web page of a
product or service company.
   Performances have been calculated in terms of precision at k with k ∈ [1, 5], i.e.,
the precision in suggesting k ads. Given a page p and an ad a, the ⟨p, a⟩ pair has
been scored on a 1 to 3 scale defined as follows:
1 - Relevant: a is semantically directly related to the main subject of p, i.e., a and p
    belong to the same category;
2 - Somewhat relevant: (i) a is related to a similar subject of p (sibling), i.e., a and
    p belong to sibling categories; (ii) a is related to the main topic of p in a more
    general way (generalization), i.e., a belongs to the parent node of the category
    of p; or (iii) a is related to the main topic of p in a too specific way
    (specification), i.e., a belongs to a child of the category of p;
3 - Irrelevant: a is unrelated to p, i.e., the category to which a belongs is in a
    different branch with respect to the category to which p belongs.
In line with state-of-the-art contextual advertising systems (e.g., [10]), we
considered as True Positives (TP) ads scored as 1 or 2, and as False Positives (FP)
ads scored as 3.
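
The scoring scale and precision at k can be sketched as follows. The taxonomy is represented as a child-to-parent mapping, and precision at k is computed as TP / (TP + FP) over the first k suggested ads. The category names, the mapping, and the function names are hypothetical, and two top-level categories without a recorded parent are not treated as siblings (an assumption of this sketch).

```python
def relevance(page_cat, ad_cat, parent):
    """Score a (page, ad) pair: 1 = relevant, 2 = somewhat relevant, 3 = irrelevant."""
    if ad_cat == page_cat:
        return 1
    page_parent = parent.get(page_cat)
    ad_parent = parent.get(ad_cat)
    is_sibling = page_parent is not None and page_parent == ad_parent
    is_generalization = page_parent == ad_cat  # the ad sits in the parent node
    is_specification = ad_parent == page_cat   # the ad sits in a child node
    if is_sibling or is_generalization or is_specification:
        return 2
    return 3


def precision_at_k(page_cat, suggested_ad_cats, k, parent):
    """Fraction of the first k suggested ads scored 1 or 2 (true positives)."""
    top = suggested_ad_cats[:k]
    if not top:
        return 0.0
    true_positives = sum(1 for a in top if relevance(page_cat, a, parent) in (1, 2))
    return true_positives / len(top)


if __name__ == "__main__":
    # Hypothetical fragment of a taxonomy: child -> parent.
    taxonomy = {
        "camping": "outdoors",
        "fishing": "outdoors",
        "outdoors": "recreation",
        "autos": "recreation",
        "rv": "autos",
    }
    suggestions = ["camping", "rv", "fishing", "autos"]
    print(precision_at_k("camping", suggestions, 4, taxonomy))
```

In the experiments, such a value would be averaged over all test pages for each k.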

Table 2 Precision at k of the proposed contextual advertising system by adopting: TFLP
(CA_TFLP); the sole snippets (CA_S); and the snippets together with the page title (CA_ST).
      k    CA_TFLP   CA_S     CA_ST
      1    0.868     0.837    0.866
      2    0.835     0.801    0.836
      3    0.770     0.746    0.775
      4    0.722     0.701    0.729
      5    0.674     0.657    0.681



   In performing experiments, we compared the performances obtained by using
as text summarization technique: TFLP, the resulting system being CA_TFLP; the
sole snippets, the resulting system being CA_S; and the snippets together with the
page title, the resulting system being CA_ST. Let us note that, as the focus of this
paper is on text summarization, comparative experiments between the implemented
contextual advertising system and selected state-of-the-art systems are out of the
scope of this work. Nevertheless, let us stress that CA_TFLP coincides with the system
proposed in [5] in which the α parameter is set to 0 (i.e., only CF are considered in
the matching phase).
    Table 2 shows that, for all the compared systems, results are quite satisfactory,
especially in suggesting 1 or 2 ads. It also shows that, except for k = 1, CA_ST
is the system that performs best. This supports the effectiveness of adopting snippets
as a text summarization technique in the field of contextual advertising.



5 Conclusions and Future Work

Since classical text summarization techniques are not applicable to dynamic Web
pages, in this paper we proposed to use snippets. The aim of the paper was twofold:
(i) to compare performances obtained by using snippets with those obtained by
adopting a classical text summarization technique and (ii) to study the impact of
snippets in a selected application field, i.e., contextual advertising. The comparisons
showed that the proposed snippet text summarization technique has performances
(in terms of precision, recall, and F1) similar to those obtained by using a classical
technique (i.e., TFLP). The adoption of snippets as a text summarization technique in
contextual advertising showed that performances, calculated in terms of precision at
k, are quite good, especially in suggesting 1 or 2 ads, and that the system that uses
both snippets and title is the one with the best performances for k ≥ 2.
    As for future work, we are planning to perform further comparative experiments
with the methods described in [18, 30, 12].



Acknowledgment

This work has been partially supported by Hoplo srl. We wish to thank, in particular,
Ferdinando Licheri and Roberto Murgia for their help and useful suggestions.



References

 1. Anagnostopoulos, A., Broder, A.Z., Gabrilovich, E., Josifovski, V., Riedel, L.: Just-in-time
    contextual advertising. In: CIKM ’07: Proceedings of the sixteenth ACM conference on
    information and knowledge management, pp. 331–340. ACM, New York, NY, USA
    (2007). DOI http://doi.acm.org/10.1145/1321440.1321488
 2. Armano, G., Giuliani, A., Messina, A., Montagnuolo, M., Vargiu, E.: Experimenting text sum-
    marization on multimodal aggregation. In: 5th International Workshop DART 2011, New
    Challenges on Information Retrieval and Filtering, CEUR Workshop Proceedings, Vol. 771.
    C. Lai and G. Semeraro and E. Vargiu (2011)
 3. Armano, G., Giuliani, A., Vargiu, E.: Experimenting text summarization techniques for con-
    textual advertising. In: IIR’11: Proceedings of the 2nd Italian Information Retrieval (IIR)
    Workshop (2011)
 4. Armano, G., Giuliani, A., Vargiu, E.: Semantic enrichment of contextual advertising by using
    concepts. In: International Conference on Knowledge Discovery and Information Retrieval
    (2011)
 5. Armano, G., Giuliani, A., Vargiu, E.: Studying the impact of text summarization on contextual
    advertising. In: 8th International Workshop on Text-based Information Retrieval (2011)
 6. Armano, G., Vargiu, E.: A unifying view of contextual advertising and recommender sys-
    tems. In: Proceedings of International Conference on Knowledge Discovery and Information
    Retrieval (KDIR 2010), pp. 463–466 (2010)
 7. Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Long-
    man Publishing Co., Inc., Boston, MA, USA (1999)
 8. Boydell, O., Smyth, B.: From social bookmarking to social summarization: an experiment in
    community-based summary generation. In: Proceedings of the 12th international conference
    on Intelligent user interfaces, IUI ’07, pp. 42–51. ACM, New York, NY, USA (2007)
 9. Brandow, R., Mitze, K., Rau, L.F.: Automatic condensation of electronic publications by sen-
    tence selection. Inf. Process. Manage. 31, 675–685 (1995)
10. Broder, A., Fontoura, M., Josifovski, V., Riedel, L.: A semantic approach to contextual adver-
    tising. In: SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference
    on Research and development in information retrieval, pp. 559–566. ACM, New York, NY,
    USA (2007). DOI http://doi.acm.org/10.1145/1277741.1277837
11. Ciaramita, M., Murdock, V., Plachouras, V.: Online learning from click data for sponsored
    search. In: Proceeding of the 17th international conference on World Wide Web, WWW ’08,
    pp. 227–236. ACM, New York, NY, USA (2008)
12. Edmundson, H.P.: New methods in automatic extracting. J. ACM 16, 264–285 (1969)
13. Feldman, J., Muthukrishnan, S.: Algorithmic methods for sponsored search advertising. CoRR
    abs/0805.1759 (2008)
14. Kołcz, A., Alspector, J.: Asymmetric missing-data problems: Overcoming the lack of negative
    data in preference ranking. Inf. Retr. 5, 5–40 (2002)
15. Kolcz, A., Prabakarmurthi, V., Kalita, J.: Summarization as feature selection for text catego-
    rization. In: CIKM ’01: Proceedings of the tenth international conference on Information and
    knowledge management, pp. 365–370. ACM, New York, NY, USA (2001)
16. Lacerda, A., Cristo, M., Gonçalves, M.A., Fan, W., Ziviani, N., Ribeiro-Neto, B.: Learning to
    advertise. In: SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference
    on Research and development in information retrieval, pp. 549–556. ACM, New York, NY,
    USA (2006). DOI http://doi.acm.org/10.1145/1148170.1148265
17. Li, Q., Chen, Y.P.: Personalized text snippet extraction using statistical language models. Pat-
    tern Recogn. 43, 378–386 (2010)
18. Luhn, H.P.: The automatic creation of literature abstracts. IBM Journal of Research and De-
    velopment 2(2), 159–165 (1958)
19. Mani, I.: Automatic summarization. John Benjamins, Amsterdam (2001)
20. Murdock, V., Ciaramita, M., Plachouras, V.: A noisy-channel approach to contextual advertis-
    ing. In: Proceedings of the 1st international workshop on Data mining and audience intelli-
    gence for advertising, ADKDD ’07, pp. 21–27. ACM, New York, NY, USA (2007)
21. Rau, L.F., Jacobs, P.S., Zernik, U.: Information extraction and text summarization using lin-
    guistic knowledge acquisition. Inf. Process. Manage. 25, 419–428 (1989)
22. Ribeiro-Neto, B., Cristo, M., Golgher, P.B., Silva de Moura, E.: Impedance coupling in
    content-targeted advertising. In: SIGIR ’05: Proceedings of the 28th annual international
    ACM SIGIR conference on Research and development in information retrieval, pp. 496–503.
    ACM, New York, NY, USA (2005). DOI http://doi.acm.org/10.1145/1076034.1076119
23. Ricci, F., Rokach, L., Shapira, B., Kantor, P.: Recommender Systems Handbook. Springer,
    US (2010)
24. Rocchio, J.: The SMART Retrieval System: Experiments in Automatic Document Processing,
    chap. Relevance feedback in information retrieval, pp. 313–323. PrenticeHall (1971)
25. Salton, G., Buckley, C.: On the use of spreading activation methods in automatic information.
    In: Proceedings of the 11th annual international ACM SIGIR conference on Research and
    development in information retrieval, SIGIR ’88, pp. 147–160. ACM, New York, NY, USA
    (1988)
26. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill Book
    Company (1984)
27. Shen, D., Chen, Z., Yang, Q., Zeng, H.J., Zhang, B., Lu, Y., Ma, W.Y.: Web-page classification
    through summarization. In: Proceedings of the 27th annual international ACM SIGIR confer-
    ence on Research and development in information retrieval, SIGIR ’04, pp. 242–249. ACM,
    New York, NY, USA (2004)
28. Sinka, M., Corne, D.: A large benchmark dataset for web document clustering. In: Soft Com-
    puting Systems: Design, Management and Applications, Volume 87 of Frontiers in Artificial
    Intelligence and Applications, pp. 881–890. Press (2002)
29. de Smedt, K., Liseth, A., Hassel, M., Dalianis, H.: How short is good? An evaluation of
    automatic summarization, pp. 267–287. Museum Tusculanums Forlag, København (2005)
30. Tsegay, Y., Puglisi, S.J., Turpin, A., Zobel, J.: Document compaction for efficient query biased
    snippet generation. In: Proceedings of the 31st European Conference on IR Research on
    Advances in Information Retrieval, ECIR ’09, pp. 509–520. Springer-Verlag, Berlin, Heidelberg
    (2009)
31. Witten, I.H., Bray, Z., Mahoui, M., Teahan, B.: Text mining: A new frontier for lossless com-
    pression. In: Proceedings of the Conference on Data Compression, DCC ’99, pp. 198–. IEEE
    Computer Society, Washington, DC, USA (1999)