<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experimenting Text Summarization Techniques for Contextual Advertising</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuliano Armano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Giuliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eloisa Vargiu</string-name>
          <email>vargiug@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Cagliari Department of Electrical and Electronic Engineering</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Contextual advertising systems suggest suitable advertisements to users while surfing the Web. Focusing on text summarization, we propose novel techniques for contextual advertising. Comparative experiments between these techniques and existing ones have been performed.</p>
      </abstract>
      <kwd-group>
        <kwd>contextual advertising</kwd>
        <kwd>information retrieval and filtering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Most of the advertisements on the Web are short textual messages, usually
marked as "sponsored links". Two main kinds of textual advertising approaches
are used on the Web today [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: sponsored search and contextual advertising. The
former puts advertisements (ads) on the pages returned from a Web search
engine following a query. All major current Web search engines support this kind
of ads, acting simultaneously as search engine and advertisement agency. The
latter puts ads within the content of a generic, third party, Web page. A
commercial intermediary, namely an ad-network, is usually in charge of optimizing
the selection of ads. In other words, contextual advertising (CA hereinafter) is
a form of targeted advertising for ads appearing on websites or other media,
such as contents displayed in mobile browsers. Ads are selected and served by
automated systems based on the content displayed to the user.
      </p>
      <p>We consider an online advertising scenario in which an intermediary
commercial network (ad-network) is responsible for optimizing the selection of ads. The
goal is twofold: (i) increasing commercial company revenues and (ii) improving
user experience. Let us point out in advance that, in information retrieval, the
term "context" may have different interpretations depending on the research
field. For instance, in the field of recommender systems it denotes the events that
modify the user behavior, whereas for CA it denotes the keywords used in search engines.</p>
      <p>A CA system typically involves four main tasks: (i) pre-processing, (ii) text
summarization, (iii) classification, and (iv) matching. In this paper, we are
mainly interested in text summarization, which is aimed at generating a short
representation of a textual document (e.g., a Web page) with negligible loss of
information.</p>
      <p>Starting from state-of-the-art text-summarization techniques, we propose
new and more effective techniques. Then, we perform comparative experiments
to assess the effectiveness of the proposed techniques. Preliminary results show
that the proposed techniques perform better than existing ones.</p>
      <p>The paper is organized as follows. First, the main work on CA is briefly
recalled. Subsequently, text summarization is illustrated both from a generic
perspective and in the context of CA. After illustrating an implementation of a
CA system, preliminary experimental results are reported and discussed.
Conclusions and future directions end the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Contextual Advertising</title>
      <p>
        As discussed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], CA is an interplay of four players:
- The advertiser provides the supply of ads. Usually the activity of the
advertisers is organized around campaigns, which are defined by a set of ads with
a particular temporal and thematic goal (e.g., sale of digital cameras during
the holiday season). As in traditional advertising, the goal of the advertisers
can be broadly defined as the promotion of products or services.
- The publisher is the owner of the Web pages on which the advertising is
displayed. The publisher typically aims to maximize advertising revenue while
providing a good user experience.
- The ad network is a mediator between the advertiser and the publisher;
it selects the ads to display on the Web pages. The ad network shares the
advertisement revenue with the publisher.
- The users visit the Web pages of the publisher and interact with the ads.
      </p>
      <p>
        Ribeiro-Neto et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] examine a number of strategies to match pages and
ads based on extracted keywords. Ads and pages are represented as vectors in
a vector space. To deal with semantic problems that may arise from a pure
keyword-based approach, the authors expand the page vocabulary with terms
from similar pages weighted according to their similarity to the matched page.
In a subsequent work, the authors propose a method to learn the impact of
individual features using genetic programming [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Another approach to CA is to reduce it to the problem of sponsored search
by extracting phrases from a Web page and matching them with the bid phrases
of each ad. In [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], a system for phrase extraction is proposed, which uses a
variety of features to determine the importance of page phrases for advertising
purposes. The system is trained with pages that have been annotated by hand
with important phrases. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the same approach is used, with a phrase extractor
based on the work reported in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Text Summarization</title>
      <p>
        Radev et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] define a summary as "a text that is produced from one or
more texts, that conveys important information in the original text(s), and that
is no longer than half of the original text(s) and usually significantly less than
that". This simple definition highlights three important aspects that
characterize research on automatic summarization: (i) summaries may be produced from
a single document or multiple documents; (ii) summaries should preserve
important information; and (iii) summaries should be short. Unfortunately, attempts
to provide a more elaborate definition for this task resulted in disagreement
within the community [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Summarization techniques can be divided into two groups [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: (i) those that
extract information from the source documents (extraction-based approaches)
and (ii) those that abstract from the source documents (abstraction-based
approaches). The former impose the constraint that a summary uses only
components extracted from the source document, whereas the latter relax the
constraints on how the summary is created. Extraction-based approaches are mainly
concerned with what the summary content should be, usually relying solely on
extraction of sentences. On the other hand, abstraction-based approaches put
strong emphasis on the form, aiming to produce a grammatical summary, which
usually requires advanced language generation techniques. Although potentially
more powerful, abstraction-based approaches have been far less popular than
their extraction-based counterparts, mainly because generating the latter is
easier. In a paradigm more tuned to information retrieval, one can also consider
topic-driven summarization, which assumes that the summary content depends
on the preferences of the user and can be assessed via a query, making the final
summary focused on a particular topic. In this paper, we exclusively focus on
extraction-based methods.
      </p>
      <p>An extraction-based summary consists of a subset of words from the original
document, and its bag-of-words representation can be created by selectively
removing a number of features from the original term set. In text categorization,
this process is known as feature selection and is guided by the "usefulness" of
individual features as far as classification accuracy is concerned. However, in
the context of text summarization, feature selection is only a secondary aspect.
It might be argued that in some cases a summary may contain the same set
of features as the original, for example when it is created by removing
redundant/repetitive words or phrases. Typically, however, an extraction-based summary
whose length is only 10-15% of the original is likely to lead to a significant feature
reduction as well.</p>
      <p>Many studies suggest that even simple summaries are quite effective in
carrying over the relevant information about a document. From the text
categorization perspective, their advantage over specialized feature selection methods lies
in their reliance on a single document only (the one being summarized),
without computing statistics for all documents sharing the same category
label, or even for all documents in a collection. Moreover, various forms of
summaries have become ubiquitous on the Web, and in certain cases their accessibility
may grow faster than that of full documents.</p>
      <p>
        The earliest research on summarizing scientific documents proposed
paradigms for extracting salient sentences from text using features like word and
phrase frequency [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], position in the text [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and key phrases [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Various works
published since then have concentrated on other domains, mostly on newswire
data. Many approaches addressed the problem by building systems tailored
to the type of the required summary.
      </p>
      <p>
        Simple summarization-like techniques have been long applied to enrich the
set of features used in text categorization. For example, a common strategy is to
give extra weight to words appearing in the title of a story [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] or to treat the
title-words as separate features, even if the same words were present elsewhere in
the text body [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It has also been noticed that many documents contain useful
formatting information, loosely defined as context, that can be utilized when
selecting the salient words, phrases, or sentences. For example, Web search
engines select terms differently according to their HTML markup [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Summaries,
rather than full documents, have been successfully applied to document
clustering [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Ker and Chen [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] evaluated the performance of a categorization system
using title-based summaries as document descriptors. In their experiments with
a probabilistic TF-IDF based classifier, they showed that title-based document
descriptors positively affected the performance of categorization.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Text Summarization in Contextual Advertising</title>
      <p>As the input of a contextual advertiser is an HTML document, contextual
advertising systems typically rely on extraction-based approaches, which are applied
to the relevant blocks of a Web page (e.g., the title of the Web page, its first
paragraph, and the paragraph with the highest title-word count).</p>
      <p>
        In the work of Kolcz et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] seven straightforward (but effective) extraction-based
text summarization techniques have been proposed and compared. In all
cases, a word occurring at least three times in the body of a document is a
keyword, while a word occurring at least once in the title of a document is a
title-word. For the sake of completeness, let us recall the proposed techniques:
- Title (T), the title of a document;
- First Paragraph (FP), the first paragraph of a document;
- First Two Paragraphs (F2P), the first two paragraphs of a document;
- First and Last Paragraphs (FLP), the first and the last paragraphs of a
document;
- Paragraph with most keywords (MK), the paragraph that has the highest
number of keywords;
- Paragraph with most title-words (MT), the paragraph that has the highest
number of title-words;
- Best Sentence (BS), the sentences in the document that contain at least 3
title-words and at least 4 keywords.
      </p>
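<p>The techniques above are simple enough to sketch directly. What follows is a minimal Python sketch, not the authors' Java implementation: splitting paragraphs on blank lines and the lowercase word tokenizer are simplifying assumptions, and BS is omitted for brevity.</p>

```python
import re

def paragraphs(text):
    """Split plain text into non-empty paragraphs (blank-line separated)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def words(text):
    """Lowercase word tokens; a simplified tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def summarize(title, body, method):
    """Sketch of the Kolcz et al. techniques: a keyword is a word occurring
    at least 3 times in the body; a title-word occurs in the title."""
    paras = paragraphs(body)
    counts = {}
    for w in words(body):
        counts[w] = counts.get(w, 0) + 1
    keywords = {w for w, c in counts.items() if c >= 3}
    title_words = set(words(title))

    if method == "T":    # Title
        return title
    if method == "FP":   # First Paragraph
        return paras[0]
    if method == "F2P":  # First Two Paragraphs
        return " ".join(paras[:2])
    if method == "FLP":  # First and Last Paragraphs
        return " ".join([paras[0], paras[-1]])
    if method == "MK":   # Paragraph with most keyword occurrences
        return max(paras, key=lambda p: sum(w in keywords for w in words(p)))
    if method == "MT":   # Paragraph with most title-word occurrences
        return max(paras, key=lambda p: sum(w in title_words for w in words(p)))
    raise ValueError(method)
```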
      <p>
        One may argue that the above methods are too simple. However, as shown
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], extraction-based summaries of news articles can be more informative
than those resulting from more complex approaches. Also, headline-based article
descriptors proved to be effective in determining users' interests [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>Our proposal consists of enriching some of the techniques introduced by Kolcz
et al. with information extracted from the title, as follows:
- Title and First Paragraph (TFP), the title of a document and its first
paragraph;
- Title and First Two Paragraphs (TF2P), the title of a document and its first
two paragraphs;
- Title, First and Last Paragraphs (TFLP), the title of a document and its
first and last paragraphs;
- Most Title-words and Keywords (MTK), the paragraph with the highest
number of title-words and that with the highest number of keywords.</p>
      <p>We also defined a further technique, called NKeywords (NK), that selects the
N most frequent keywords, where N is a global parameter that can be set starting
from some relevant characteristics of the input (e.g., from the average document
length).</p>
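<p>The proposed variants can be sketched along the same lines. This is again an illustrative Python sketch, under the same simplifying assumptions (blank-line paragraph splitting, lowercase word tokens); the tie handling in MTK and NK is our own choice, not specified above.</p>

```python
import re
from collections import Counter

def _paras(text):
    """Blank-line separated paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def _words(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def summarize_enriched(title, body, method, n=10):
    """Title-enriched variants (TFP, TF2P, TFLP, MTK) and NKeywords (NK)."""
    paras = _paras(body)
    counts = Counter(_words(body))
    keywords = {w for w, c in counts.items() if c >= 3}
    title_words = set(_words(title))

    if method == "TFP":   # title + first paragraph
        return " ".join([title, paras[0]])
    if method == "TF2P":  # title + first two paragraphs
        return " ".join([title] + paras[:2])
    if method == "TFLP":  # title + first and last paragraphs
        return " ".join([title, paras[0], paras[-1]])
    if method == "MTK":   # paragraph with most title-words + paragraph with most keywords
        mt = max(paras, key=lambda p: sum(w in title_words for w in _words(p)))
        mk = max(paras, key=lambda p: sum(w in keywords for w in _words(p)))
        return mt if mt == mk else " ".join([mt, mk])
    if method == "NK":    # the N most frequent keywords
        ranked = [w for w, _ in counts.most_common() if w in keywords]
        return " ".join(ranked[:n])
    raise ValueError(method)
```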
    </sec>
    <sec id="sec-5">
      <title>The Implemented System</title>
      <p>Our view of CA is sketched in Figure 1, which illustrates a generic architecture
that can give rise to specific systems depending on the choices made for each
involved module. Notably, most of the state-of-the-art solutions are compliant
with this view. So far, we have implemented in Java the sub-system depicted in Figure
1.a, which encompasses (i) a pre-processor, (ii) a text summarizer, and (iii) a
classifier.</p>
      <p>
        Pre-processor. Its main purpose is to transform an HTML document (a Web
page or an ad) into an easy-to-process document in plain-text format, while
maintaining important information. This is obtained by preserving the blocks
of the original HTML document, while removing HTML tags and stop-words.2
First, any given HTML page is parsed to identify and remove noisy elements,
such as tags, comments, and other non-textual items. Then, stop-words are
removed from each textual excerpt. Finally, the document is tokenized and each
term is stemmed using the well-known Porter's algorithm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
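<p>These steps can be illustrated with a simplified stand-in: Python's standard-library HTML parser replaces the Jericho API, the stop-word list is an illustrative subset, and a crude plural-stripping rule stands in for Porter's algorithm.</p>

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word subset; a real system would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>/<style> blocks;
    a simplified stand-in for the Jericho-based parsing."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def preprocess(html):
    """Strip tags, drop stop-words, and stem each token.
    The suffix rule below only stands in for Porter's algorithm."""
    parser = TextExtractor()
    parser.feed(html)
    tokens = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return [t[:-1] if t.endswith("s") else t
            for t in tokens if t not in STOP_WORDS]
```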
      <p>
        Text summarizer. The text summarizer outputs a vector representation of the
original HTML document as a bag of words (BoW), each word being weighted by
TF-IDF [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. So far, we have implemented the methods of Kolcz et al. (see Section
4), except "Title" and "Best Sentence". These two methods were defined to
extract summaries from textual documents such as articles, scientific papers,
and books. By contrast, we are interested in summarizing HTML documents, in which
the title is often not representative. Moreover, such documents are often too short to
contain meaningful sentences with at least 3 title-words and 4 keywords in the
same sentence.
      </p>
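<p>The summarizer's output representation can be sketched as follows. The exact TF-IDF variant used by the system is not specified here, so the standard term frequency times log inverse document frequency is assumed.</p>

```python
import math
from collections import Counter

def tfidf_bow(summary_tokens, corpus):
    """BoW vector for a summary, each word weighted by TF-IDF.
    `corpus` is a list of tokenized documents used for document frequencies."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))           # document frequency: one count per doc
    tf = Counter(summary_tokens)
    return {w: (c / len(summary_tokens)) * math.log(n / df[w])
            for w, c in tf.items() if df[w]}   # skip words unseen in the corpus
```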
      <p>
        Classifier. Text summarization is a purely syntactic analysis, and the
corresponding Web-page classification is usually inaccurate. To alleviate the possible harmful
effects of summarization, both page excerpts and advertisements are classified
according to a given set of categories [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The corresponding classification-based
features (CF) are then used in conjunction with the original BoW. In the current
implementation, we adopt a centroid-based classification technique [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which
represents each class with its centroid, calculated from the training set.
A page is classified by measuring the distance between its vector and the centroid
vector of each class, adopting the cosine similarity.
      </p>
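<p>Centroid-based classification can be sketched as follows: a minimal sketch, assuming that the centroid of a class is the per-word mean of its training vectors and that ties are broken arbitrarily.</p>

```python
import math
from collections import defaultdict

def _cos(u, v):
    """Cosine similarity between two sparse dict vectors."""
    num = sum(x * v.get(w, 0.0) for w, x in u.items())
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def centroids(training):
    """training: list of (class_label, bow_vector); the centroid of a class
    is the per-word mean of its training vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for label, vec in training:
        counts[label] += 1
        for w, x in vec.items():
            sums[label][w] += x
    return {label: {w: x / counts[label] for w, x in vec.items()}
            for label, vec in sums.items()}

def classify(page_vec, cents):
    """Assign the class whose centroid is closest by cosine similarity."""
    return max(cents, key=lambda label: _cos(page_vec, cents[label]))
```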
      <p>
        Matcher. It is devoted to suggesting ads (a) for a given Web page (p) according to
a similarity score based on both BoW and CF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In the following formula, α is a global
parameter that controls the emphasis of the syntactic component with
respect to the semantic one:
score(p, a) = α · simBoW(p, a) + (1 − α) · simCF(p, a)   (1)
where simBoW(p, a) and simCF(p, a) are the cosine similarity scores between p and
a using BoW and CF, respectively. This module has not been implemented yet.
However, it is worth recalling that in this paper we are interested in making
comparisons among text summarization techniques.
2 To this end, the Jericho API for Java has been adopted, described at the Web page:
http://jericho.htmlparser.net/docs/index.html
      </p>
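<p>Formula (1) can be sketched directly. The "bow" and "cf" field names below are illustrative, not taken from the system; cosine similarity is computed over sparse dict vectors.</p>

```python
import math

def _cos(u, v):
    """Cosine similarity between two sparse dict vectors."""
    num = sum(x * v.get(w, 0.0) for w, x in u.items())
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def score(page, ad, alpha):
    """Eq. (1): alpha * simBoW(p, a) + (1 - alpha) * simCF(p, a).
    `page` and `ad` are dicts with 'bow' and 'cf' weight vectors
    (hypothetical field names)."""
    return (alpha * _cos(page["bow"], ad["bow"])
            + (1 - alpha) * _cos(page["cf"], ad["cf"]))
```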
    </sec>
    <sec id="sec-6">
      <title>Preliminary Results</title>
      <p>
        We performed experiments aimed at comparing the techniques described in
Section 4. To assess them we used the BankSearch Dataset [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], built using the
Open Directory Project and Yahoo! Categories3, consisting of about 11000 Web
pages classified by hand into 11 different classes.
3 http://www.dmoz.org and http://www.yahoo.com, respectively
For each summarization technique, the number of extracted terms (T) is reported.
For NKeywords summarization, we performed experiments with N=10.
      </p>
      <p>As a final remark, let us note that just adding information about the title
improves the performance of summarization. Another interesting result is that,
as expected, TFLP summarization provides the best performance, as FLP
summarization does among the classic techniques.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Directions</title>
      <p>In this paper, we presented a preliminary study on text summarization
techniques applied to CA. In particular, we proposed some straightforward
extraction-based techniques that improve on those proposed in the literature. Experimental
results confirm the hypothesis that adding information about titles to well-known
techniques improves performance.</p>
      <p>
        As for future directions, we are currently studying a novel semantic technique.
The main idea is to improve syntactic techniques by exploiting semantic
information (such as synonyms and hypernyms) extracted from a lexical database (e.g.,
WordNet [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]) in conjunction with POS tagging and word sense
disambiguation. Further experiments are also under way. In particular, we are setting up the
system to evaluate its performance on a larger dataset extracted from DMOZ,
in which documents are categorized according to a given taxonomy of classes.
Moreover, as we deem that bringing ideas from recommender systems will help
in devising CA systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we are also studying a collaborative approach to CA.
Acknowledgments. This work has been partially supported by Hoplo srl. We
wish to thank, in particular, Ferdinando Licheri and Roberto Murgia for their
help and useful suggestions.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Addis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Armano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giuliani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Vargiu</surname>
          </string-name>
          .
          <article-title>A recommender system based on a generic contextual advertising approach</article-title>
          .
          <source>In Proceedings of ISCC'10: IEEE Symposium on Computers and Communications</source>
          , pages
          <volume>859</volume>
          -
          <fpage>861</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Josifovski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Riedel</surname>
          </string-name>
          .
          <article-title>Just-in-time contextual advertising</article-title>
          .
          <source>In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management</source>
          , pages
          <volume>331</volume>
          -
          <fpage>340</fpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Baxendale</surname>
          </string-name>
          .
          <article-title>Machine-made index for technical literature - an experiment</article-title>
          .
          <source>IBM Journal of Research and Development</source>
          ,
          <volume>2</volume>
          :
          <fpage>354</fpage>
          -
          <fpage>361</fpage>
          ,
          <year>1958</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Belew</surname>
          </string-name>
          .
          <article-title>Finding out about: A Cognitive Perspective on Search Engine Technology and the WWW</article-title>
          . Cambridge University Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Brandow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mitze</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. F.</given-names>
            <surname>Rau</surname>
          </string-name>
          .
          <article-title>Automatic condensation of electronic publications by sentence selection</article-title>
          .
          <source>Inf. Process. Manage.</source>
          ,
          <volume>31</volume>
          :
          <fpage>675</fpage>
          -
          <fpage>685</fpage>
          ,
          <year>September 1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fontoura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Josifovski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Riedel</surname>
          </string-name>
          .
          <article-title>A semantic approach to contextual advertising</article-title>
          .
          <source>In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>559</volume>
          -
          <fpage>566</fpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Martins</surname>
          </string-name>
          .
          <article-title>A survey on automatic text summarization</article-title>
          .
          <source>Technical Report Literature Survey for the Language and Statistics II course at CMU</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Josifovski</surname>
          </string-name>
          .
          <article-title>Contextual advertising by combining relevance with click feedback</article-title>
          .
          <source>In WWW '08: Proceeding of the 17th international conference on World Wide Web</source>
          , pages
          <volume>417</volume>
          -
          <fpage>426</fpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Heckerman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahami</surname>
          </string-name>
          .
          <article-title>Inductive learning algorithms and representations for text categorization</article-title>
          .
          <source>In Proceedings of the seventh international conference on Information and knowledge management</source>
          ,
          <source>CIKM '98</source>
          , pages
          <fpage>148</fpage>
          -
          <fpage>155</fpage>
          , New York, NY, USA,
          <year>1998</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Edmundson</surname>
          </string-name>
          .
          <article-title>New methods in automatic extracting</article-title>
          .
          <source>Journal of ACM</source>
          ,
          <volume>16</volume>
          :
          <fpage>264</fpage>
          -
          <fpage>285</fpage>
          ,
          April
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          .
          <article-title>Cactus: clustering categorical data using summaries</article-title>
          .
          <source>In Proceedings of the fth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <source>KDD '99</source>
          , pages
          <fpage>73</fpage>
          -
          <fpage>83</fpage>
          , New York, NY, USA,
          <year>1999</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>E.-H.</given-names>
            <surname>Han</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <article-title>Centroid-based document classi cation: Analysis and experimental results</article-title>
          .
          <source>In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery</source>
          ,
          <source>PKDD '00</source>
          , pages
          <fpage>424</fpage>
          -
          <fpage>431</fpage>
          , London, UK,
          <year>2000</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Ker</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-N.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>A text categorization based on summarization technique</article-title>
          .
          <source>In Proceedings of the ACL-2000 workshop on Recent advances in natural language processing</source>
          and
          <article-title>information retrieval: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics</article-title>
          - Volume
          <volume>11</volume>
, pages
<fpage>79</fpage>
-
<lpage>83</lpage>
, Morristown, NJ, USA,
<year>2000</year>
. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolcz</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Alspector</surname>
          </string-name>
          .
          <article-title>Asymmetric missing-data problems: Overcoming the lack of negative data in preference ranking</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>5</volume>
:
<fpage>5</fpage>
-
<lpage>40</lpage>
          ,
          <year>January 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolcz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabakarmurthi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalita</surname>
          </string-name>
          .
          <article-title>Summarization as feature selection for text categorization</article-title>
          .
          <source>In CIKM '01: Proceedings of the tenth international conference on Information and knowledge management</source>
, pages
<fpage>365</fpage>
-
<lpage>370</lpage>
          , New York, NY, USA,
          <year>2001</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>A.</given-names>
            <surname>Lacerda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cristo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ziviani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          .
          <article-title>Learning to advertise</article-title>
          .
          <source>In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</source>
, pages
<fpage>549</fpage>
-
<lpage>556</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>H.</given-names>
            <surname>Luhn</surname>
          </string-name>
          .
          <article-title>The automatic creation of literature abstracts</article-title>
          .
          <source>IBM Journal of Research and Development</source>
          ,
          <volume>2</volume>
:
<fpage>159</fpage>
-
<lpage>165</lpage>
          ,
          <year>1958</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
<article-title>WordNet: A lexical database for English</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ):
          <volume>39</volume>
          {
          <fpage>41</fpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>D.</given-names>
            <surname>Mladenic</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Grobelnik</surname>
          </string-name>
          .
<article-title>Feature selection for classification based on text hierarchy</article-title>
          .
          <source>In Text and the Web, Conference on Automated Learning and Discovery CONALD-98</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>M.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
<article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <volume>130</volume>
          {
          <fpage>137</fpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
21.
<string-name>
  <given-names>D. R.</given-names>
  <surname>Radev</surname>
</string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Hovy</surname>
</string-name>
, and
<string-name>
  <given-names>K.</given-names>
  <surname>McKeown</surname>
</string-name>
          .
          <article-title>Introduction to the special issue on summarization</article-title>
          .
<source>Computational Linguistics</source>
          ,
          <volume>28</volume>
:
<fpage>399</fpage>
-
<lpage>408</lpage>
          ,
          <year>December 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
22.
<string-name>
  <given-names>B.</given-names>
  <surname>Ribeiro-Neto</surname>
</string-name>
,
<string-name>
  <given-names>M.</given-names>
  <surname>Cristo</surname>
</string-name>
,
<string-name>
  <given-names>P. B.</given-names>
  <surname>Golgher</surname>
</string-name>
, and
<string-name>
  <given-names>E.</given-names>
  <surname>Silva de Moura</surname>
</string-name>
.
          <article-title>Impedance coupling in content-targeted advertising</article-title>
          .
          <source>In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval</source>
, pages
<fpage>496</fpage>
-
<lpage>503</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
<string-name>
  <given-names>G.</given-names>
  <surname>Salton</surname>
</string-name>
and
<string-name>
  <given-names>M.</given-names>
  <surname>McGill</surname>
</string-name>
          .
          <article-title>Introduction to Modern Information Retrieval</article-title>
          .
<source>McGraw-Hill Book Company</source>
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>M.</given-names>
            <surname>Sinka</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Corne</surname>
          </string-name>
          .
          <article-title>A large benchmark dataset for web document clustering</article-title>
          .
          <source>In Soft Computing Systems: Design, Management and Applications</source>
          , Volume
          <volume>87</volume>
          of Frontiers in
<source>Artificial Intelligence and Applications</source>
, pages
<fpage>881</fpage>
-
<lpage>890</lpage>
. IOS Press,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>R.</given-names>
            <surname>Stata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bharat</surname>
          </string-name>
          , and
<string-name>
  <given-names>F.</given-names>
  <surname>Maghoul</surname>
</string-name>
.
          <article-title>The term vector database: fast access to indexing terms for web pages</article-title>
          .
          <source>Comput. Netw.</source>
          ,
          <volume>33</volume>
          (
          <issue>1-6</issue>
          ):
          <volume>247</volume>
          {
          <fpage>255</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
26.
<string-name>
  <given-names>W.-t.</given-names>
  <surname>Yih</surname>
</string-name>
,
<string-name>
  <given-names>J.</given-names>
  <surname>Goodman</surname>
</string-name>
          , and
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          .
          <article-title>Finding advertising keywords on web pages</article-title>
          .
          <source>In WWW '06: Proceedings of the 15th international conference on World Wide Web</source>
, pages
<fpage>213</fpage>
-
<lpage>222</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>