<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Experimenting Text Summarization Techniques for Contextual Advertising</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Giuliano</forename><surname>Armano</surname></persName>
							<email>armano@diee.unica.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Electrical and Electronic Engineering</orgName>
								<orgName type="institution">University of Cagliari</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Giuliani</surname></persName>
							<email>alessandro.giuliani@diee.unica.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Electrical and Electronic Engineering</orgName>
								<orgName type="institution">University of Cagliari</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Eloisa</forename><surname>Vargiu</surname></persName>
							<email>vargiu@diee.unica.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Electrical and Electronic Engineering</orgName>
								<orgName type="institution">University of Cagliari</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Experimenting Text Summarization Techniques for Contextual Advertising</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9B74D760D12BA5D36EEFF156AE5AF7E7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>contextual advertising, information retrieval and filtering</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Contextual advertising systems suggest suitable advertisements to users while they surf the Web. Focusing on text summarization, we propose novel techniques for contextual advertising. Comparative experiments between these techniques and existing ones have been performed.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Most of the advertisements on the Web are short textual messages, usually marked as "sponsored links". Two main kinds of textual advertising approaches are used on the Web today <ref type="bibr" target="#b7">[8]</ref>: sponsored search and contextual advertising. The former puts advertisements (ads) on the pages returned from a Web search engine following a query. All major current Web search engines support this kind of ads, acting simultaneously as search engine and advertisement agency. The latter puts ads within the content of a generic, third party, Web page. A commercial intermediary, namely an ad-network, is usually in charge of optimizing the selection of ads. In other words, contextual advertising (CA hereinafter) is a form of targeted advertising for ads appearing on websites or other media, such as contents displayed in mobile browsers. Ads are selected and served by automated systems based on the content displayed to the user.</p><p>We consider a scenario of online advertising, in which an intermediating commercial net (ad-network) is responsible for optimizing the selection of ads. The goal is twofold: (i) increasing commercial company revenues and (ii) improving user experience. Let us point out in advance that, in information retrieval, the term "context" may have different interpretations depending on the research field. For instance, it denotes "event which modify the user behavior in the field of recommender systems". For CA it denotes "keywords used in search engines".</p><p>A CA system typically involves four main tasks: (i) pre-processing, (ii) text summarization, (iii) classification, and (iv) matching. 
In this paper, we are mainly interested in text summarization, which is aimed at generating a short representation of a textual document (e.g., a Web page) with negligible loss of information.</p><p>Starting from state-of-the-art text-summarization techniques, we propose new and more effective techniques. Then, we perform comparative experiments to assess the effectiveness of the proposed techniques. Preliminary results show that the proposed techniques perform better than existing ones.</p><p>The paper is organized as follows. First, the main work on CA is briefly recalled. Subsequently, text summarization is illustrated from both a generic perspective and in the context of CA. After illustrating an implementation of a CA system, preliminary experimental results are then reported and discussed. Conclusions and future directions end the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Contextual Advertising</head><p>As discussed in <ref type="bibr" target="#b5">[6]</ref>, CA is an interplay of four players:</p><p>- The advertiser provides the supply of ads. Usually the activity of the advertisers is organized around campaigns which are defined by a set of ads with a particular temporal and thematic goal (e.g., sale of digital cameras during the holiday season). As in traditional advertising, the goal of the advertisers can be broadly defined as the promotion of products or services.</p><p>- The publisher is the owner of the Web pages on which the advertising is displayed. The publisher typically aims to maximize advertising revenue while providing a good user experience.</p><p>- The ad-network is a mediator between the advertiser and the publisher; it selects the ads to display on the Web pages. The ad-network shares the advertisement revenue with the publisher.</p><p>- The users visit the Web pages of the publisher and interact with the ads.</p><p>Ribeiro-Neto et al. <ref type="bibr" target="#b21">[22]</ref> examine a number of strategies to match pages and ads based on extracted keywords. Ads and pages are represented as vectors in a vector space. To deal with semantic problems that may arise from a pure keyword-based approach, the authors expand the page vocabulary with terms from similar pages weighted according to their similarity to the matched page. In a subsequent work, the authors propose a method to learn the impact of individual features using genetic programming <ref type="bibr" target="#b15">[16]</ref>.</p><p>Another approach to CA is to reduce it to the problem of sponsored search by extracting phrases from a Web page and matching them with the bid phrases of each ad. In <ref type="bibr" target="#b25">[26]</ref>, a system for phrase extraction is proposed, which uses a variety of features to determine the importance of page phrases for advertising purposes. 
The system is trained with pages that have been annotated by hand with important phrases. In <ref type="bibr" target="#b5">[6]</ref>, the same approach is used, with a phrase extractor based on the work reported in <ref type="bibr" target="#b24">[25]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Text Summarization</head><p>Radev et al. <ref type="bibr" target="#b20">[21]</ref> define a summary as "a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that". This simple definition highlights three important aspects that characterize research on automatic summarization: (i) summaries may be produced from a single document or multiple documents; (ii) summaries should preserve important information; and (iii) summaries should be short. Unfortunately, attempts to provide a more elaborate definition for this task resulted in disagreement within the community <ref type="bibr" target="#b6">[7]</ref>.</p><p>Summarization techniques can be divided into two groups <ref type="bibr" target="#b14">[15]</ref>: (i) those that extract information from the source documents (extraction-based approaches) and (ii) those that abstract from the source documents (abstraction-based approaches). The former impose the constraint that a summary uses only components extracted from the source document, whereas the latter relax the constraints on how the summary is created. Extraction-based approaches are mainly concerned with what the summary content should be, usually relying solely on extraction of sentences. On the other hand, abstraction-based approaches put strong emphasis on the form, aiming to produce a grammatical summary, which usually requires advanced language generation techniques. Although potentially more powerful, abstraction-based approaches have been far less popular than their extraction-based counterparts, mainly because generating the latter is easier. 
In a paradigm more tuned to information retrieval, one can also consider topic-driven summarization, which assumes that the summary content depends on the preference of the user and can be assessed via a query, making the final summary focused on a particular topic. In this paper, we exclusively focus on extraction-based methods.</p><p>An extraction-based summary consists of a subset of words from the original document, and its bag-of-words representation can be created by selectively removing a number of features from the original term set. In text categorization, this process is known as feature selection and is guided by the "usefulness" of individual features as far as the classification accuracy is concerned. However, in the context of text summarization, feature selection is only a secondary aspect. It might be argued that in some cases a summary may contain the same set of features as the original; for example, when it is created by removing the redundant/repetitive words or phrases. Typically, an extraction-based summary whose length is only 10-15% of the original is likely to lead to a significant feature reduction as well.</p><p>Many studies suggest that even simple summaries are quite effective in carrying over the relevant information about a document. From the text categorization perspective, their advantage over specialized feature selection methods lies in their reliance on a single document only (the one that is being summarized) without computing the statistics for all documents sharing the same category label, or even for all documents in a collection. 
Moreover, various forms of summaries have become ubiquitous on the Web and in certain cases their accessibility may grow faster than that of full documents.</p><p>The earliest research on summarizing scientific documents proposed paradigms for extracting salient sentences from text using features like word and phrase frequency <ref type="bibr" target="#b16">[17]</ref>, position in the text <ref type="bibr" target="#b2">[3]</ref>, and key phrases <ref type="bibr" target="#b9">[10]</ref>. Various works published since then have concentrated on other domains, mostly on newswire data. Many approaches addressed the problem by building systems dependent on the type of the required summary.</p><p>Simple summarization-like techniques have long been applied to enrich the set of features used in text categorization. For example, a common strategy is to give extra weight to words appearing in the title of a story <ref type="bibr" target="#b18">[19]</ref> or to treat the title-words as separate features, even if the same words are present elsewhere in the text body <ref type="bibr" target="#b8">[9]</ref>. It has also been noticed that many documents contain useful formatting information, loosely defined as context, that can be utilized when selecting the salient words, phrases or sentences. For example, Web search engines select terms differently according to their HTML markup <ref type="bibr" target="#b3">[4]</ref>. Summaries, rather than full documents, have been successfully applied to document clustering <ref type="bibr" target="#b10">[11]</ref>. Ker and Chen <ref type="bibr" target="#b12">[13]</ref> evaluated the performance of a categorization system using title-based summaries as document descriptors. In their experiments with a probabilistic TF-IDF based classifier, they showed that title-based document descriptors positively affected the performance of categorization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Text Summarization in Contextual Advertising</head><p>As the input of a contextual advertiser is an HTML document, contextual advertising systems typically rely on extraction-based approaches, which are applied to the relevant blocks of a Web page (e.g., the title of the Web page, its first paragraph, and the paragraph which has the highest title-word count).</p><p>Kolcz et al. <ref type="bibr" target="#b14">[15]</ref> proposed and compared seven straightforward (but effective) extraction-based text summarization techniques. In all cases, a word occurring at least three times in the body of a document is a keyword, while a word occurring at least once in the title of a document is a title-word. For the sake of completeness, let us recall the proposed techniques:</p><p>- Title (T), the title of a document; - First Paragraph (FP), the first paragraph of a document; - First Two Paragraphs (F2P), the first two paragraphs of a document; - First and Last Paragraphs (FLP), the first and the last paragraphs of a document; - Paragraph with most keywords (MK), the paragraph that has the highest number of keywords; - Paragraph with most title-words (MT), the paragraph that has the highest number of title-words; - Best Sentence (BS), sentences in the document that contain at least 3 title-words and at least 4 keywords.</p><p>One may argue that the above methods are too simple. However, as shown in <ref type="bibr" target="#b4">[5]</ref>, extraction-based summaries of news articles can be more informative than those resulting from more complex approaches. Also, headline-based article descriptors proved to be effective in determining users' interests <ref type="bibr" target="#b13">[14]</ref>.</p><p>Our proposal consists of enriching some of the techniques introduced by Kolcz et al. 
with information extracted from the title, as follows:</p><p>- Title and First Paragraph (TFP), the title of a document and its first paragraph; - Title and First Two Paragraphs (TF2P), the title of a document and its first two paragraphs; - Title, First and Last Paragraphs (TFLP), the title of a document and its first and last paragraphs; - Most Title-words and Keywords (MTK), the paragraph with the highest number of title-words and that with the highest number of keywords.</p><p>We also defined a further technique, called NKeywords (NK), that selects the N most frequent keywords, where N is a global parameter that can be set starting from some relevant characteristics of the input (e.g., from the average document length).</p></div>
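<div xmlns="http://www.tei-c.org/ns/1.0"><p>The techniques listed in this section can be illustrated with a minimal Python sketch (the system described in this paper is written in Java; this is not that code). It assumes a document is given as a title string plus a non-empty list of paragraph strings. The helper names tokenize, keywords, best_paragraph and summarize are hypothetical, the tokenizer is a plain stand-in for the pre-processing step, and "highest number of keywords" is read here as the count of keyword occurrences, which is only one possible interpretation.</p>
<p><code lang="python"><![CDATA[
```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a simple stand-in for real pre-processing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keywords(paragraphs, min_count=3):
    """Words occurring at least min_count times in the body (3 in the paper)."""
    counts = Counter(w for p in paragraphs for w in tokenize(p))
    return {w for w, c in counts.items() if c >= min_count}


def best_paragraph(paragraphs, vocabulary):
    """Paragraph with the most token occurrences drawn from vocabulary.

    Ties are resolved in favour of the earliest paragraph (an assumption).
    """
    return max(paragraphs, key=lambda p: sum(w in vocabulary for w in tokenize(p)))


def summarize(title, paragraphs, technique, n=10):
    """Return the text blocks (or, for NK, the words) selected by a technique."""
    title_words = set(tokenize(title))
    kw = keywords(paragraphs)
    first, last = paragraphs[0], paragraphs[-1]
    mk = best_paragraph(paragraphs, kw)
    mt = best_paragraph(paragraphs, title_words)
    if technique == "NK":
        # N most frequent keywords, ranked by their frequency in the body.
        counts = Counter(w for p in paragraphs for w in tokenize(p) if w in kw)
        return [w for w, _ in counts.most_common(n)]
    table = {
        "FP": [first],
        "F2P": paragraphs[:2],
        "FLP": [first, last],
        "MK": [mk],
        "MT": [mt],
        "TFP": [title, first],
        "TF2P": [title] + paragraphs[:2],
        "TFLP": [title, first, last],
        "MTK": [mt, mk],
    }
    return table[technique]
```
]]></code></p>
<p>For example, summarize(title, paragraphs, "TFLP") returns the title followed by the first and last paragraphs; the selected blocks would then be turned into a weighted bag of words by the summarizer.</p></div>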
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">The Implemented System</head><p>Our view of CA is sketched in Figure <ref type="figure" target="#fig_0">1</ref>, which illustrates a generic architecture that can give rise to specific systems depending on the choices made on each involved module. Notably, most of the state-of-the-art solutions are compliant with this view. So far, we have implemented in Java the sub-system depicted in Figure <ref type="figure" target="#fig_0">1</ref>.a, which encompasses (i) a pre-processor, (ii) a text summarizer, and (iii) a classifier.</p><p>Pre-processor. Its main purpose is to transform an HTML document (a Web page or an ad) into an easy-to-process document in plain-text format, while maintaining important information. This is obtained by preserving the blocks of the original HTML document, while removing HTML tags and stop-words.<ref type="foot" target="#foot_0">2</ref> First, any given HTML page is parsed to identify and remove noisy elements, such as tags, comments and other non-textual items. Then, stop-words are removed from each textual excerpt. Finally, the document is tokenized and each term is stemmed using the well-known Porter's algorithm <ref type="bibr" target="#b19">[20]</ref>.</p><p>Text summarizer. The text summarizer outputs a vector representation of the original HTML document as a bag of words (BoW), each word being weighted by TF-IDF <ref type="bibr" target="#b22">[23]</ref>. So far, we have implemented the methods of Kolcz et al. (see Section 4), except "Title" and "Best Sentence". These two methods were defined to extract summaries from textual documents such as articles, scientific papers and books. We, however, are interested in summarizing HTML documents, in which the title is often not representative. Moreover, HTML documents are often too short to contain meaningful sentences with at least 3 title-words and 4 keywords.</p><p>Classifier. 
Text summarization is a purely syntactic analysis and the corresponding Web-page classification is usually inaccurate. To alleviate possible harmful effects of summarization, both page excerpts and ads are classified according to a given set of categories <ref type="bibr" target="#b1">[2]</ref>. The corresponding classification-based features (CF) are then used in conjunction with the original BoW. In the current implementation, we adopt a centroid-based classification technique <ref type="bibr" target="#b11">[12]</ref>, which represents each class with its centroid calculated starting from the training set.</p><p>A page is classified by measuring the cosine similarity between its vector and the centroid vector of each class.</p><p>Matcher. It suggests ads (a) for a Web page (p) according to a similarity score based on both BoW and CF <ref type="bibr" target="#b1">[2]</ref>. In formula (α is a global parameter that controls the emphasis of the syntactic component with respect to the semantic one):</p><formula xml:id="formula_0">score(p, a) = \alpha \cdot sim_{BoW}(p, a) + (1 - \alpha) \cdot sim_{CF}(p, a)<label>(1)</label></formula><p>where sim_{BoW}(p, a) and sim_{CF}(p, a) are cosine similarity scores between p and a using BoW and CF, respectively. This module has not been implemented yet. However, it is worth recalling that in this paper we are interested in making comparisons among text summarization techniques.</p></div>
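<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the classifier and of Equation (1), the following Python sketch represents vectors as dictionaries mapping terms to weights. It is not the Java implementation described above. The function names are hypothetical, the default alpha of 0.5 is an arbitrary placeholder, and modelling the CF of a document as its vector of similarities to the class centroids is an assumption made here for illustration; the paper does not specify how CF are encoded.</p>
<p><code lang="python"><![CDATA[
```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroids(training_set):
    """Average the vectors of each class; training_set is [(vector, label), ...]."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vector, label in training_set:
        counts[label] += 1
        for term, weight in vector.items():
            sums[label][term] += weight
    return {label: {t: w / counts[label] for t, w in terms.items()}
            for label, terms in sums.items()}


def class_features(vector, class_centroids):
    """Assumed CF encoding: similarity of the vector to each class centroid."""
    return {label: cosine(vector, c) for label, c in class_centroids.items()}


def classify(vector, class_centroids):
    """Assign the class whose centroid is most similar to the vector."""
    cf = class_features(vector, class_centroids)
    return max(cf, key=cf.get)


def score(page_bow, ad_bow, page_cf, ad_cf, alpha=0.5):
    """Equation (1): alpha * sim_BoW(p, a) + (1 - alpha) * sim_CF(p, a)."""
    return alpha * cosine(page_bow, ad_bow) + (1 - alpha) * cosine(page_cf, ad_cf)
```
]]></code></p>
<p>With alpha close to 1 the ranking of ads is driven almost entirely by term overlap, whereas with alpha close to 0 it is driven by agreement on the assigned categories.</p></div>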
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Preliminary Results</head><p>We performed experiments aimed at comparing the techniques described in Section 4. To assess them we used the BankSearch Dataset <ref type="bibr" target="#b23">[24]</ref>, built using the Open Directory Project and Yahoo! Categories<ref type="foot" target="#foot_1">3</ref>, consisting of about 11,000 Web pages classified by hand in 11 different classes. Figure <ref type="figure" target="#fig_2">2</ref> shows the overall hierarchy. The 11 selected classes are the leaves of the taxonomy, together with the class Sport, which contains Web documents from all the sites that were classified as Sport, except for the sites that were classified as Soccer or Motor Sport. In <ref type="bibr" target="#b23">[24]</ref>, the authors show that this structure provides a good test not only for generic classification/clustering methods, but also for hierarchical techniques.</p><p>Table <ref type="table" target="#tab_0">1</ref> shows the performance in terms of accuracy (A), macro-precision (P), and macro-recall (R). For each technique, the average number of unique extracted terms (T) is shown. For NKeywords summarization, we performed experiments with N=10. As a final remark, let us note that just adding information about the title improves the performance of summarization: for instance, accuracy rises from 0.598 for FP to 0.802 for TFP. Another interesting result is that, as expected, the TFLP summarization provides the best performance (accuracy 0.833), as FLP summarization does for the classic techniques (accuracy 0.743).</p></div>
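<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paper does not spell out how the three measures are computed; a minimal Python sketch under the usual definitions is given below. Accuracy is the fraction of correctly classified pages, while macro-precision and macro-recall are unweighted means of the per-class precision and recall. The function name evaluate and the convention of scoring an undefined per-class ratio as 0 are assumptions.</p>
<p><code lang="python"><![CDATA[
```python
def evaluate(true_labels, predicted_labels):
    """Return (accuracy, macro-precision, macro-recall) for single-label output."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for c in classes:
        true_positives = sum(t == c and p == c for t, p in pairs)
        predicted = sum(p == c for _, p in pairs)  # pages assigned to class c
        actual = sum(t == c for t, _ in pairs)     # pages that belong to class c
        # Assumed convention: an undefined ratio counts as 0.
        precisions.append(true_positives / predicted if predicted else 0.0)
        recalls.append(true_positives / actual if actual else 0.0)
    macro_precision = sum(precisions) / len(classes)
    macro_recall = sum(recalls) / len(classes)
    return accuracy, macro_precision, macro_recall
```
]]></code></p>
<p>Because every class contributes equally to the macro averages, a technique that performs poorly on a small class is penalized more than accuracy alone would suggest.</p></div>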
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions and Future Directions</head><p>In this paper, we presented a preliminary study on text summarization techniques applied to CA. In particular, we proposed some straightforward extraction-based techniques that improve on those proposed in the literature. Experimental results confirm the hypothesis that adding information about titles to well-known techniques improves performance.</p><p>As for future directions, we are currently studying a novel semantic technique. The main idea is to improve syntactic techniques by exploiting semantic information (such as synonyms and hypernyms) extracted from a lexical database (e.g., WordNet <ref type="bibr" target="#b17">[18]</ref>) in conjunction with POS tagging and word sense disambiguation. Further experiments are also under way. In particular, we are setting up the system to measure its performance on a larger dataset extracted from DMOZ, in which documents are categorized according to a given taxonomy of classes. Moreover, as we deem that bringing ideas from recommender systems will help in devising CA systems <ref type="bibr" target="#b0">[1]</ref>, we are also studying a collaborative approach to CA.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>1</head><label>1</label><figDesc></figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. A generic CA architecture at a glance.</figDesc><graphic coords="5,192.76,459.54,226.78,113.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Class hierarchy of BankSearch Dataset.</figDesc><graphic coords="7,160.70,286.76,293.95,214.42" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Results of text summarization techniques comparison.</figDesc><table><row><cell></cell><cell>FP</cell><cell>F2P</cell><cell>FLP</cell><cell>MK</cell><cell>MT</cell><cell>TFP</cell><cell>TF2P</cell><cell>TFLP</cell><cell>MTK</cell><cell>NK</cell></row><row><cell>A</cell><cell>0.598</cell><cell>0.694</cell><cell>0.743</cell><cell>0.608</cell><cell>0.581</cell><cell>0.802</cell><cell>0.821</cell><cell>0.833</cell><cell>0.721</cell><cell>0.715</cell></row><row><cell>P</cell><cell>0.606</cell><cell>0.699</cell><cell>0.745</cell><cell>0.702</cell><cell>0.717</cell><cell>0.802</cell><cell>0.822</cell><cell>0.832</cell><cell>0.766</cell><cell>0.722</cell></row><row><cell>R</cell><cell>0.581</cell><cell>0.673</cell><cell>0.719</cell><cell>0.587</cell><cell>0.568</cell><cell>0.772</cell><cell>0.789</cell><cell>0.801</cell><cell>0.699</cell><cell>0.693</cell></row><row><cell>T</cell><cell>13</cell><cell>24</cell><cell>24</cell><cell>25</cell><cell>15</cell><cell>16</cell><cell>27</cell><cell>26</cell><cell>34</cell><cell>10</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">To this end, the Jericho API for Java has been adopted, described at the Web page: http://jericho.htmlparser.net/docs/index.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">http://www.dmoz.org and http://www.yahoo.com, respectively</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. This work has been partially supported by Hoplo srl. We wish to thank, in particular, Ferdinando Licheri and Roberto Murgia for their help and useful suggestions.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A recommender system based on a generic contextual advertising approach</title>
		<author>
			<persName><forename type="first">A</forename><surname>Addis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Armano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Giuliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Vargiu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ISCC&apos;10: IEEE Symposium on Computers and Communications</title>
				<meeting>ISCC&apos;10: IEEE Symposium on Computers and Communications</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="859" to="861" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Just-in-time contextual advertising</title>
		<author>
			<persName><forename type="first">A</forename><surname>Anagnostopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Z</forename><surname>Broder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gabrilovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Josifovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Riedel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIKM &apos;07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="331" to="340" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Machine-made index for technical literature -an experiment</title>
		<author>
			<persName><forename type="first">P</forename><surname>Baxendale</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IBM Journal of Research and Development</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="354" to="361" />
			<date type="published" when="1958">1958</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Finding out about: A Cognitive Perspective on Search Engine Technology and the WWW</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K</forename><surname>Belew</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automatic condensation of electronic publications by sentence selection</title>
		<author>
			<persName><forename type="first">R</forename><surname>Brandow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mitze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Rau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Process. Manage</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="675" to="685" />
			<date type="published" when="1995-09">September 1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A semantic approach to contextual advertising</title>
		<author>
			<persName><forename type="first">A</forename><surname>Broder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fontoura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Josifovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Riedel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="559" to="566" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">A survey on automatic text summarization</title>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Martins</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
		<respStmt>
			<orgName>CMU</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
	<note>Literature Survey for the Language and Statistics II course at CMU</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Contextual advertising by combining relevance with click feedback</title>
		<author>
			<persName><forename type="first">D</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Josifovski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WWW &apos;08: Proceeding of the 17th international conference on World Wide Web</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="417" to="426" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Inductive learning algorithms and representations for text categorization</title>
		<author>
			<persName><forename type="first">S</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Platt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Heckerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sahami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the seventh international conference on Information and knowledge management, CIKM &apos;98</title>
				<meeting>the seventh international conference on Information and knowledge management, CIKM &apos;98<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="148" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">New methods in automatic extracting</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Edmundson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the ACM</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="264" to="285" />
			<date type="published" when="1969-04">April 1969</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Cactus: clustering categorical data using summaries</title>
		<author>
			<persName><forename type="first">V</forename><surname>Ganti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gehrke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ramakrishnan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD &apos;99</title>
				<meeting>the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD &apos;99<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="73" to="83" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Centroid-based document classification: Analysis and experimental results</title>
		<author>
			<persName><forename type="first">E.-H</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Karypis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD &apos;00</title>
				<meeting>the 4th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD &apos;00<address><addrLine>London, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="424" to="431" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A text categorization based on summarization technique</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Ker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-N</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL-2000 workshop on Recent advances in natural language processing and information retrieval: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 11</title>
				<meeting>the ACL-2000 workshop on Recent advances in natural language processing and information retrieval: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 11<address><addrLine>Morristown, NJ, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="79" to="83" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Asymmetric missing-data problems: Overcoming the lack of negative data in preference ranking</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Alspector</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Retr.</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="5" to="40" />
			<date type="published" when="2002-01">January 2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Summarization as feature selection for text categorization</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kolcz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Prabakarmurthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kalita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIKM &apos;01: Proceedings of the tenth international conference on Information and knowledge management</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="365" to="370" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Learning to advertise</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lacerda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cristo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Gonçalves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ziviani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ribeiro-Neto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="549" to="556" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">The automatic creation of literature abstracts</title>
		<author>
			<persName><forename type="first">H</forename><surname>Luhn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IBM Journal of Research and Development</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="159" to="165" />
			<date type="published" when="1958">1958</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">WordNet: A lexical database for English</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="39" to="41" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Feature selection for classification based on text hierarchy</title>
		<author>
			<persName><forename type="first">D</forename><surname>Mladenić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grobelnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text and the Web, Conference on Automated Learning and Discovery CONALD-98</title>
				<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">An algorithm for suffix stripping</title>
		<author>
			<persName><forename type="first">M</forename><surname>Porter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Program</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="130" to="137" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Introduction to the special issue on summarization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Radev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>McKeown</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="399" to="408" />
			<date type="published" when="2002-12">December 2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Impedance coupling in content-targeted advertising</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ribeiro-Neto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cristo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">B</forename><surname>Golgher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Silva de Moura</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR &apos;05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="496" to="503" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Introduction to Modern Information Retrieval</title>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>McGill</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1984">1984</date>
			<publisher>McGraw-Hill Book Company</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A large benchmark dataset for web document clustering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sinka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Corne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Soft Computing Systems: Design, Management and Applications</title>
				<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="volume">87</biblScope>
			<biblScope unit="page" from="881" to="890" />
		</imprint>
	</monogr>
	<note>Frontiers in Artificial Intelligence and Applications</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">The term vector database: fast access to indexing terms for web pages</title>
		<author>
			<persName><forename type="first">R</forename><surname>Stata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bharat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Maghoul</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Comput. Netw.</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">1-6</biblScope>
			<biblScope unit="page" from="247" to="255" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Finding advertising keywords on web pages</title>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Goodman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">R</forename><surname>Carvalho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WWW &apos;06: Proceedings of the 15th international conference on World Wide Web</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="213" to="222" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
