<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An Investigation into Information Navigation via Diverse Keyword-based Facets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">M</forename><forename type="middle">Atif</forename><surname>Qureshi</surname></persName>
							<email>muhammad.qureshi@ucd.ie</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Derek</forename><surname>Greene</surname></persName>
							<email>derek.greene@ucd.ie</email>
							<affiliation key="aff0">
								<orgName type="department">Insight Centre for Data Analytics</orgName>
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">An Investigation into Information Navigation via Diverse Keyword-based Facets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">79859CC52399E41F9E14677A1E1ABD02</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T14:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the age of information overload, it is necessary to provide effective information navigation tools that operate over unstructured textual data. Current state-of-the-art methods are limited in terms of providing effective exploration capabilities for various information seeking tasks, especially those arising in domains such as online journalism. Here we argue for improvements in faceted search systems, via new strategies for identifying keyword-based facets. Our proposed technique utilises a PageRank model operating over the graph of terms appearing in documents, while also employing novel methods for biasing significant terms and named entities. In addition, we consider the notion of diversity within extracted keywords in an effort to maximize coverage over a range of topics. We perform experimental evaluations over issues relevant to the Irish General Elections 2016, demonstrating the superior performance of our proposed technique.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Web 2.0 technologies have enabled online information to grow at an exponential rate, with a sizable portion of this information being textual in nature. To a large extent, this textual information is unstructured, leading to what is commonly known as the "information overload" problem for lengthy text documents <ref type="bibr" target="#b4">[5]</ref>. Generating effective summaries of these documents can help to minimize the impact of information overload. Recently, the text mining research community has looked into various methods for dealing with this problem <ref type="bibr" target="#b0">[1]</ref>. Keyword extraction is one strategy that summarizes important aspects of a document, while enabling the reader to quickly contrast documents using the selected keywords <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b23">24]</ref>.</p><p>Another emerging information-seeking paradigm is one where a user's information need is vague and exploratory in nature <ref type="bibr" target="#b21">[22]</ref>. In these cases, the classification of information nuggets into various facets, commonly known as "faceted search" <ref type="bibr" target="#b19">[20]</ref>, helps users in this knowledge discovery task. Similar to keyword extraction, faceted search attempts to organize unstructured information into a structure, leading to a paradigm of exploratory information seeking <ref type="bibr" target="#b19">[20]</ref>. Traditionally, faceted search systems organize document collections according to various attributes, and allow the user to navigate along these attributes.</p><p>Despite sharing a common goal of organizing unstructured text into structured information, to the best of our knowledge keyword extraction and faceted search have not been explored in combination with each other. 
Previous approaches to faceted search have attempted to extract various facets from within textual metadata <ref type="bibr" target="#b3">[4]</ref> <ref type="foot" target="#foot_0">1</ref>, and therefore tend to limit the exploration of useful knowledge hidden within the textual content. Such exploration can be particularly helpful for textual data containing a variety of useful topics from which meaningful inferences can be made. Consider, for example, the case of a journalist wishing to examine the various ways in which news sources are reporting on different issues that are relevant to an on-going election campaign. In such a scenario, faceted search can offer a way to support the extensive navigation required by the journalist, and more so if the facets are extracted from within the text appearing in news articles rather than from metadata alone. This preliminary study proposes to utilize keywords for the refinement of faceted search, thereby supporting the exploration of novel information, and presents the extraction of keyword-based facets from news articles as a case study. Our approach to keyword extraction models a document as a graph of terms, to which we apply biased PageRank, followed by maximization of topical coverage through the extraction of diverse aspects of a given topic. We demonstrate the effectiveness of keyword-based facets through a system-centric evaluation on a dataset of news articles collected from eight different Irish and international news sources.</p><p>The remainder of this paper is organized as follows. In Section 2, we provide an overview of related work. In Section 3, we present a detailed explanation of the manner in which we extract diverse keyword-based facets. In Section 4, we present experimental evaluations on various queries applied to the news article dataset to demonstrate the usefulness of our approach. In Section 5, we conclude the paper with a discussion of possible future extensions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Our work here touches on a number of different fields. In the following we review related work in keyword extraction together with faceted search. We also highlight how our work differs from existing approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Keyword Extraction</head><p>Recent years have seen keyword extraction emerge as a dominant technique for summarizing the contents of a text corpus, with numerous applications in information access tasks such as query expansion, document classification, and document clustering, to name but a few.</p><p>Due to differences in the nature of textual documents, four document-specific factors have generally influenced keyword extraction techniques: document length, structural consistency of the document, possibility of topic change within the document, and possibility of correlation among topics within the document <ref type="bibr" target="#b6">[7]</ref>. The longer the document, the more candidate keywords are available (e.g. scientific articles and technical reports compared to news articles and emails). A well-structured document contains certain sections (fields) and formatting that can be exploited for keyword extraction, such as a scientific paper's title and abstract <ref type="bibr" target="#b11">[12]</ref>, and the metadata of webpages <ref type="bibr" target="#b24">[25]</ref>. Documents such as conversational texts and logs of open-ended meetings generally contain several topics in sequence (as in talking points), and in such documents a topical change happens as the discussion moves on <ref type="bibr" target="#b10">[11]</ref>, e.g., the first topic can be about cleaning, the second can be related to cooking, etc. Documents such as news articles and scientific articles can possess a topical correlation (i.e., interconnected topics) throughout the entire flow of the article, unlike informal chat. Therefore, in these types of documents the keywords are usually related to one another <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b20">21]</ref>.</p><p>Several approaches have been proposed in the literature to address the problem of keyword extraction from a piece of text. 
In general, however, keyword extraction is performed in two steps. First, a list of candidate keywords is extracted using some heuristics, and then each candidate is scored using either a supervised or an unsupervised strategy. A candidate keyword is typically extracted on the basis of n-grams <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b16">17]</ref>, words with specific part-of-speech tags (i.e. nouns, verbs, adjectives) <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b17">18]</ref>, noun phrases <ref type="bibr" target="#b22">[23]</ref>, words other than stopwords <ref type="bibr" target="#b14">[15]</ref>, and n-grams appearing as Wikipedia article titles <ref type="bibr" target="#b5">[6]</ref>. Scoring each candidate keyword in a supervised strategy is influenced by the selection of different features and by the process of task re-definition. Scoring in an unsupervised strategy is often addressed using graph-based approaches or topic modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Faceted Search</head><p>Faceted search has been dominant in commercial applications, with prime examples being e-commerce sites such as those of IBM Websphere and Amazon. Within the academic setting, Flamenco by Hearst <ref type="bibr" target="#b8">[9]</ref> is one of the earliest faceted search systems, which uses rich metadata in a flexible manner to guide the user's navigational behavior. The traditional definition states that facets consist of "a set of meaningful labels organized in such a way as to reflect the concepts relevant to a domain" <ref type="bibr" target="#b7">[8]</ref>. Earlier works, however, are limited in the sense that properties and dimensions of document collections are used to extract facets, and textual data, with its inherently unstructured nature, cannot be organized properly into facets. This led the research community towards methods for automatic facet extraction through lexical subsumption <ref type="bibr" target="#b2">[3]</ref>, synsets and hypernym relations from WordNet <ref type="bibr" target="#b18">[19]</ref>, and personalized PageRank on ODP categories <ref type="bibr" target="#b12">[13]</ref>. These efforts remain limited to defining concept hierarchies for document collections, thereby limiting information discovery to a few broad concepts.</p><p>We propose to advance faceted search through the utilization of keyword-based facets, and present a prototype of such a system applied to news articles. The system aims to address limitations of current faceted search systems by facilitating navigation at a finer granularity than current systems allow. To this end, we extract keywords from within the news articles retrieved in response to a certain query, while maximizing the topical coverage, and hence the diversity, of the extracted keywords. 
Note that this differs from traditional keyword extraction, where the document corpus is static in nature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head><p>In this section, we present an overview of our methodology for the extraction of diverse keywords. We first explain how we utilize biased PageRank for keyword extraction, and then explain our technique for maximizing diversity within the extracted keywords.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Keyword Extraction</head><p>As mentioned in Section 1, we apply PageRank to the graph of terms appearing in a given document. The effectiveness of graph-based ranking algorithms in Web search applications has encouraged researchers to apply similar models to textual data for natural language processing. TextRank by Mihalcea and Tarau <ref type="bibr" target="#b17">[18]</ref> is an example of such a model, and we follow a similar intuition by treating terms within a document as nodes, with edges between terms that co-occur. It is important to note that existing keyword extraction techniques utilise a static corpus, while our computation is of a dynamic, real-time nature over a set of documents generated in response to a query<ref type="foot" target="#foot_1">2</ref>. We list below a number of ways in which our proposed technique differs from TextRank:</p><p>- We apply the keyword extraction model over a collection of documents instead of using a single document.</p><p>- We define edge weights between terms with respect to the word distance between them (as an exponential decay factor), instead of uniform edge weights within a fixed window length.</p><p>- We use the relevance score of each document in relation to the query to further bias the PageRank node vector (i.e., terms extracted from each document are biased towards the relevance score).</p><p>- Lastly, we identify significant terms from the corpus through a chi-square test of independence, and further bias the PageRank node vector for these significant terms and for named entities in the document collection.</p><p>We now explain the various steps of our keyword extraction process:</p><p>1. We apply cost-effective, time-series-based clustering to the ranked list of documents<ref type="foot" target="#foot_2">3</ref>. We cluster the retrieved documents using a single feature, which is the creation time-stamp of each document. 
We argue that sub-topics of similar interest are usually clustered around a specific time window. For example, the pre-election period, election day, and the post-election period can form three different time windows, each clustering various sub-topics around the main topic "elections".</p><p>2. We pick the top retrieved articles, in proportion to the size of each cluster. This reduces the full set of documents to a representative sub-sample of documents prior to the application of PageRank. For example, the volume of documents may be higher around the election dates as opposed to a few months before the elections, hence resulting in a larger number of articles for the "election" cluster.</p><p>3. We apply biased PageRank to the reduced set of documents from the previous step and extract single terms representative of the document collection. To ensure further computational efficiency, we compute the PageRank scores only for terms appearing in the titles and sub-titles of documents, which helps reduce the size of the term graph for real-time operation.</p><p>Finally, we merge single terms identified by biased PageRank to extract bigram keywords. To achieve this, we add the individual PageRank scores of the co-occurring terms according to their probability of co-occurrence, implying that n-grams that co-occur frequently within the retrieved documents are highly likely to constitute a keyword. This step utilises the terms in the entire document instead of titles and sub-titles alone.</p></div>
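<div xmlns="http://www.tei-c.org/ns/1.0"><p>The term-graph construction and biased ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the function names, the decay constant, the maximum word distance, and the form of the bias vector are all hypothetical choices, since the text above only specifies that edge weights decay exponentially with word distance and that the PageRank node vector is biased.</p>

```python
import math
from collections import defaultdict


def build_term_graph(docs, decay=0.5, max_dist=4):
    """Build an undirected term graph from tokenised documents.

    Each co-occurrence at word distance d adds exp(-decay * d) to the edge
    weight. Both `decay` and `max_dist` are illustrative assumptions.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in docs:
        for i, u in enumerate(tokens):
            for d in range(1, max_dist + 1):
                j = i + d
                if j >= len(tokens):
                    break
                v = tokens[j]
                if u == v:
                    continue  # no self-loops
                weight = math.exp(-decay * d)
                graph[u][v] += weight
                graph[v][u] += weight
    return graph


def biased_pagerank(graph, bias, damping=0.85, iterations=50):
    """Weighted PageRank whose teleport vector is proportional to `bias`.

    `bias` maps a term to a non-negative weight (default 1.0). In the setting
    above it would combine document relevance, chi-square significance and
    named-entity information; how these are combined is left open here.
    """
    nodes = sorted(graph)
    total_bias = sum(bias.get(n, 1.0) for n in nodes)
    prior = {n: bias.get(n, 1.0) / total_bias for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    scores = dict(prior)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # The graph is symmetric, so the neighbours of n are its in-links.
            incoming = sum(scores[m] * graph[m][n] / out_weight[m] for m in graph[n])
            updated[n] = (1.0 - damping) * prior[n] + damping * incoming
        scores = updated
    return scores
```

<p>A caller would tokenise the titles of the sub-sampled documents, pass them to build_term_graph, and then rank terms with biased_pagerank, giving larger bias values to terms that should be favoured.</p></div>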
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Maximizing Topical Coverage of Extracted Keywords</head><p>One potential limitation of utilizing biased PageRank for keyword extraction is the selection of keywords that are redundant in terms of coverage over a range of different sub-topics. This limits the coverage of diverse topics in response to a given query. We illustrate with an example from the news domain: in response to the query "syria", our algorithm returns the keywords "syria state", "syria russian" and "syria government", which do not cover a wide variety of topics. It is therefore essential to ensure coverage of a maximum range of diverse topics through the extracted keywords (or facets). We use the documents retrieved with a given keyword to measure the "freshness" of that keyword. Here, freshness is a probability estimated from the number of unique documents that a keyword is able to retrieve compared to the other keywords (or facets); this value is then multiplied by the PageRank score computed in Section 3.1. Finally, the keywords with the highest scores after the application of the proposed freshness strategy are extracted as facets. To return to our example query "syria", our algorithm is now able to return the keywords "syria talks", "syria un" and "syria vote", which are more diverse and maximize the coverage over a range of different sub-topics.</p><figure type="table"><head>Table 3.</head><label>3</label><figDesc>Top-3 facets by various algorithms for the query "education".</figDesc><table><row><cell>Algorithm</cell><cell>Facets</cell></row><row><cell>Proposed algorithm</cell><cell>education secretary, education system, education scheme</cell></row><row><cell>TextRank</cell><cell>system, next, years</cell></row><row><cell>TFxIDF</cell><cell>school education, new school, education schools</cell></row></table></figure><p>Fig. <ref type="figure">1</ref>. Experimental results for "cumulative return", "average novelty", and "f-measure" corresponding to the query "Irish water".</p>
<figure type="table"><head>Table 4.</head><label>4</label><figDesc>Top-3 facets by various algorithms for the query "housing".</figDesc><table><row><cell>Algorithm</cell><cell>Facets</cell></row><row><cell>Proposed algorithm</cell><cell>housing minister, housing top, cork council</cell></row><row><cell>TextRank</cell><cell>market, house, year</cell></row><row><cell>TFxIDF</cell><cell>housing dublin, dublin housing, housing crisis</cell></row></table></figure><p>Fig. <ref type="figure">2</ref>. Experimental results for "cumulative return", "average novelty", and "f-measure" corresponding to the query "education".</p></div>
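<div xmlns="http://www.tei-c.org/ns/1.0"><p>One possible reading of the freshness strategy is a greedy selection in which each candidate's PageRank score is multiplied by the fraction of its retrieved documents that are not yet covered by previously selected facets. The sketch below illustrates that reading only; the function name, the greedy formulation, and the toy scores and document identifiers in the usage note are assumptions made for illustration, and the exact formulation of freshness may differ.</p>

```python
def select_diverse_facets(pagerank, retrieved, k=3):
    """Greedily pick k keywords by PageRank score times freshness.

    pagerank:  dict mapping keyword to its (biased) PageRank score.
    retrieved: dict mapping keyword to the set of document ids it retrieves.
    Freshness is taken here as the share of a keyword's documents that no
    previously selected keyword has retrieved (an assumed interpretation).
    """
    selected = []
    covered = set()
    candidates = set(pagerank)

    def score(keyword):
        docs = retrieved.get(keyword, set())
        if not docs:
            return 0.0
        freshness = len(docs - covered) / len(docs)
        return pagerank[keyword] * freshness

    while candidates and len(selected) != k:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        covered |= retrieved.get(best, set())
        candidates.remove(best)
    return selected
```

<p>With made-up inputs in which "syria state" and "syria government" retrieve the same documents while "syria talks" and "syria vote" retrieve different ones, the redundant keyword receives a freshness of zero once its twin has been selected, so the more diverse keywords are preferred.</p></div>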
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Evaluations</head><p>In this section, we present experimental evaluations over two system-centric measures which demonstrate the strength of our keyword-based facets in a real-time information retrieval setting. Our dataset comprises news articles extracted from eight different news sources, <ref type="foot" target="#foot_3">4</ref> and we show results for three different information access needs. Table <ref type="table" target="#tab_0">1</ref> shows detailed statistics about the number of articles from each news source; the articles cover the period from 8th July, 2015 to 7th April, 2016. We utilize system-centric evaluation measures, namely, cumulative return and average novelty. Cumulative return is defined as the coverage of the total number of documents retrieved by all facets relative to the total number of documents returned without using faceted search. Average novelty is the average ratio between the unique documents retrieved at each pair of successive facets. Both measures help us to quantify the various ways in which facets aid the process of information navigation: cumulative return indicates the potential of facets to return as many documents as possible, while average novelty indicates the potential of facets to return as many undiscovered documents as possible. We use these measures to compare our proposed approach to two existing competing approaches, TextRank and TFxIDF.</p><p>Fig. <ref type="figure">3</ref>. Experimental results for "cumulative return", "average novelty", and "f-measure" corresponding to the query "housing".</p><p>In what follows we present experimental results for three important issues of concern during the Irish General Election 2016, namely "irish water", "education", and "housing". A journalist monitoring various issues surrounding the Irish Elections 2016 needs extensive information navigation capabilities for queries issued in response to the above issues. 
Tables 2, 3 and 4 show the facets returned by our algorithm and the two competitors. From these results, it is evident that our algorithm extracts informative facets. As further proof of concept, we utilize the facets returned by the three algorithms to retrieve more documents and plot "cumulative return" together with "average novelty" for each of the three queries. Figures <ref type="figure">1, 2 and 3</ref> show the experimental results for each of these queries; we also give an overall picture through the f-measure <ref type="foot" target="#foot_4">5</ref> between "cumulative return" and "average novelty".</p><p>In the case of the query "Irish Water", as shown in Figure <ref type="figure">1</ref>, TFxIDF demonstrates higher "cumulative return", but our algorithm fetches new documents at a higher rate, as shown by "average novelty". This implies that TFxIDF is able to retrieve more documents overall through its extracted facets, but these documents are retrieved under general facet terms which are generally not informative or representative of the original query "Irish Water" (refer to the list of top terms in Table <ref type="table" target="#tab_1">2</ref>, and observe the facets for TFxIDF, which are non-informative in terms of human intuition). In the case of the queries "education" and "housing", our algorithm predominantly outperforms TFxIDF and TextRank.</p></div>
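<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation measures defined in this section can be sketched as below. The function names are hypothetical, and the reading of average novelty (the share of documents retrieved by a facet that were not retrieved by the facet immediately before it) is one plausible interpretation of the definition above rather than a confirmed formula. The f-measure is the harmonic mean of the two measures, as stated in the footnote.</p>

```python
def cumulative_return(facet_docs, baseline_docs):
    """Distinct documents retrieved by all facets, relative to the number of
    documents returned by the plain query without faceted search."""
    if not baseline_docs:
        return 0.0
    found = set().union(*facet_docs)
    return len(found) / len(baseline_docs)


def average_novelty(facet_docs):
    """Mean, over successive facet pairs, of the fraction of the current
    facet's documents that the previous facet did not retrieve."""
    ratios = []
    for previous, current in zip(facet_docs, facet_docs[1:]):
        if current:
            ratios.append(len(current - previous) / len(current))
    return sum(ratios) / len(ratios) if ratios else 0.0


def f_measure(a, b):
    """Harmonic mean of two measures; zero when both are zero."""
    return 2 * a * b / (a + b) if a + b else 0.0
```

<p>Here facet_docs is a list of sets of document identifiers, one set per facet in ranked order, and baseline_docs is the set of documents returned for the original query.</p></div>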
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion and Future Work</head><p>In this paper we have proposed an approach to utilize keywords for effective information navigation in faceted search systems. Current approaches to faceted search operate over metadata or hierarchical concepts, and we proposed an improvement over this in the form of keyword-based facets. Our technique made use of graph-based modeling over the terms of documents retrieved in response to a query. We used significant terms and named entities to bias PageRank over the graph of terms, followed by the application of a diversity-aware strategy called "freshness". Experimental evaluations over issues pertaining to the Irish General Elections 2016 showed the superiority of our technique. As future work, we aim to present more extensive experimental evaluations, by means of both system-centric and user-centric evaluation measures as applied to feedback coming from a user study.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="6,134.77,310.12,384.00,320.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="7,134.77,116.83,384.00,320.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="8,134.77,116.83,384.00,320.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Information about news sources and articles in each news source.</figDesc><table><row><cell cols="2">News Source No. of Articles</cell></row><row><cell cols="2">Independent 50,309</cell></row><row><cell>BBC</cell><cell>35,179</cell></row><row><cell>Irish Times</cell><cell>30,919</cell></row><row><cell cols="2">Irish Examiner 19,258</cell></row><row><cell>RTE</cell><cell>18,299</cell></row><row><cell>Reuters</cell><cell>17,899</cell></row><row><cell>The Journal</cell><cell>12,020</cell></row><row><cell>Al Jazeera</cell><cell>2,786</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Top-3 facets by various algorithms for the query "Irish water".</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The most popular faceted search interfaces have been deployed on e-commerce sites where the data contains pre-defined attributes (price, genre etc.) from which facets are extracted.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Recall from Sections 1 and 2 that our main goal is to propose keywords as facets for information navigation in a faceted search system.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">This is the ranked list of documents retrieved in response to a query.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The Irish Independent, The Irish Times, The Irish Examiner, RTÉ, The Journal, BBC, Reuters, Al Jazeera.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">It is simply the harmonic mean between the two measures.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments: This publication has emanated from research conducted with the support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Mining text data</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Springer Science &amp; Business Media</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Linking educational materials to encyclopedic knowledge</title>
		<author>
			<persName><forename type="first">A</forename><surname>Csomai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2007 Conference on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work</title>
				<meeting>the 2007 Conference on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work<address><addrLine>Amsterdam, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="557" to="559" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Automatic construction of multifaceted browsing interfaces</title>
		<author>
			<persName><forename type="first">W</forename><surname>Dakka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">G</forename><surname>Ipeirotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Wood</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM &apos;05</title>
				<meeting>the 14th ACM International Conference on Information and Knowledge Management, CIKM &apos;05<address><addrLine>Bremen, Germany</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="768" to="775" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Dynamic faceted search for discovery-driven analysis</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Megiddo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ailamaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lohman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM &apos;08</title>
				<meeting>the 17th ACM Conference on Information and Knowledge Management, CIKM &apos;08<address><addrLine>Napa Valley, California, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="3" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">The text mining handbook: advanced approaches in analyzing unstructured data</title>
		<author>
			<persName><forename type="first">R</forename><surname>Feldman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sanger</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Extracting key terms from noisy and multitheme documents</title>
		<author>
			<persName><forename type="first">M</forename><surname>Grineva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grinev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lizorkin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th international conference on World wide web, WWW &apos;09</title>
				<meeting>the 18th international conference on World wide web, WWW &apos;09<address><addrLine>Madrid, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="661" to="670" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Automatic keyphrase extraction: A survey</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Association for Computational Linguistics (ACL), ACL 2014</title>
				<meeting>the Association for Computational Linguistics (ACL), ACL 2014<address><addrLine>Baltimore, Maryland, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1262" to="1273" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Design recommendations for hierarchical faceted search interfaces</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hearst</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM SIGIR workshop on faceted search</title>
				<meeting><address><addrLine>Seattle, WA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Next generation web search: Setting our sites</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Data Eng. Bull.</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="38" to="48" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Improved automatic keyword extraction given more linguistic knowledge</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hulth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2003 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Sapporo, Japan</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="216" to="223" />
		</imprint>
	</monogr>
	<note>EMNLP &apos;03</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Extracting keywords from multi-party live chats</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation</title>
				<meeting>the 26th Pacific Asia Conference on Language, Information, and Computation<address><addrLine>Bali, Indonesia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="199" to="208" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Automatic keyphrase extraction from scientific articles</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Medelyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language resources and evaluation</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="723" to="742" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Using link analysis to identify aspects in faceted web search</title>
		<author>
			<persName><forename type="first">C</forename><surname>Kohlschütter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-A</forename><surname>Chirita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Nejdl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR&apos;2006 Faceted Search Workshop</title>
				<meeting><address><addrLine>Seattle, WA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="55" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Unsupervised approaches for automatic keyword extraction using meeting transcripts</title>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pennell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL &apos;09</title>
				<meeting>Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL &apos;09<address><addrLine>Boulder, Colorado</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="620" to="628" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Clustering to find exemplar terms for keyphrase extraction</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2009 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="257" to="266" />
		</imprint>
	</monogr>
	<note>EMNLP &apos;09</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Keyword extraction from a single document using word cooccurrence statistical information</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ishizuka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Artificial Intelligence Tools</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">01</biblScope>
			<biblScope unit="page" from="157" to="169" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Human-competitive tagging using automatic keyphrase extraction</title>
		<author>
			<persName><forename type="first">O</forename><surname>Medelyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2009 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1318" to="1327" />
		</imprint>
	</monogr>
	<note>EMNLP &apos;09</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Textrank: Bringing order into text</title>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tarau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004</title>
				<meeting>the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="404" to="411" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Automating creation of hierarchical faceted metadata structures</title>
		<author>
			<persName><forename type="first">E</forename><surname>Stoica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Richardson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">HLT-NAACL</title>
				<meeting><address><addrLine>Rochester, New York, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="244" to="251" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Faceted search</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tunkelang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthesis Lectures on Information Concepts, Retrieval, and Services</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="80" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Coherent keyphrase extraction via web mining</title>
		<author>
			<persName><forename type="first">P</forename><surname>Turney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th International Joint Conference on Artificial Intelligence</title>
				<meeting>the 18th International Joint Conference on Artificial Intelligence<address><addrLine>Acapulco, Mexico</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="434" to="439" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Exploratory search: beyond the query-response paradigm (synthesis lectures on information concepts, retrieval &amp; services)</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">W</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Roth</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Morgan and Claypool Publishers</publisher>
			<biblScope unit="volume">3</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Domain-specific keyphrase extraction</title>
		<author>
			<persName><forename type="first">Y.-F</forename><forename type="middle">B</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Bot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th ACM international conference on Information and knowledge management</title>
				<meeting>the 14th ACM international conference on Information and knowledge management<address><addrLine>Bremen, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="283" to="284" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Document concept lattice for text understanding and summarization</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-S</forename><surname>Chua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qiu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1643" to="1662" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Finding advertising keywords on web pages</title>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Goodman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">R</forename><surname>Carvalho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th international conference on World Wide Web</title>
				<meeting>the 15th international conference on World Wide Web<address><addrLine>Edinburgh, Scotland, UK</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="213" to="222" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
