<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting Wikipedia to Identify Domain-Specific Key Terms/Phrases from a Short-Text Collection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M. Atif Qureshi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Colm O'Riordan</string-name>
          <email>colm.oriordang@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <email>pasig@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Intelligence Research Group, Information Technology, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Retrieval Lab</institution>
          ,
          <addr-line>Informatics, Systems and Communication</addr-line>
          ,
          <institution>University of Milan Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>63</fpage>
      <lpage>74</lpage>
      <abstract>
        <p>Extracting from a given document collection what we call "domain-specific" key terms/phrases is a challenging task. By "domain-specific" key terms/phrases we mean words/expressions representative of the topical areas specific to the focus of a document collection. For example, when a collection is related to academic research (i.e., its focus is related to topics dealing with academic research), the domain-specific key terms/phrases could be `Information Retrieval', `Marine Biology', `Science', etc. In this contribution a technique for identifying domain-specific key terms/phrases from a collection of documents is proposed. The proposed technique works on short textual descriptions, and it makes use of the titles of Wikipedia articles and of the Wikipedia category graph. We performed experiments over the document collection (HTML title text only) of eight post-graduate school Web sites from five different countries. The evaluations show promising results for the identification of domain-specific key terms/phrases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In domain-specific search applications, documents in the indexed collection cover topics related to the focus of the document collection. Finding domain-specific key terms/phrases (from now on referred to as "keywords") in a given collection is a significant research challenge. Although several approaches have been proposed in the literature to extract knowledge from text, the goal of current knowledge extraction approaches remains limited to the identification of keywords that describe a document's content independently of the focus of the collection from which it was drawn. Among these approaches, the one based on classical tf-idf [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] has been shown in a recent study to form a strong baseline across different datasets [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
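      <p>For reference, the tf-idf baseline scores a term highly when it is frequent within a document but rare across the collection. The following is a minimal sketch of that scoring (function and variable names are ours, not taken from [18]):</p>
```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each term of each document by tf-idf; docs is a list of token lists."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

# toy titles: a term shared by two documents gets a lower idf
docs = [["information", "retrieval"], ["marine", "biology"], ["information", "systems"]]
s = tfidf_scores(docs)
```
      <p>In the toy example, `retrieval' outscores `information' in the first document because `information' occurs in two of the three documents.</p>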
      <p>
        The approaches in the literature fall into four categories: statistical learning
techniques [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], techniques based on latent variable topic models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
techniques utilizing open-domain knowledge resources [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and techniques based
on word graphs [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>However, current approaches do not focus on extracting domain-specific keywords from a given document collection. Moreover, these approaches operate on full-text documents and are consequently computationally expensive. We address the problem of extracting domain-specific keywords from a given collection using short-text snippets (i.e., the text of the title of a Web page) of each document. We present a novel domain-specific keyword extraction method built upon the n-gram overlap between the titles of Wikipedia articles and documents (the text of the titles of Web pages); the proposed method is aimed at discovering keywords related to domains from a text collection. Furthermore, our method applies a community detection algorithm (making use of the Wikipedia category graph) with the aim of reducing the candidates to domain-specific keywords.</p>
      <p>
        In order to prove the effectiveness of the proposed method we have performed experiments on a collection of academic Web sites; we show that the proposed method outperforms existing baseline algorithms, i.e., classical tf-idf [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and BM25 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>The rest of the paper is organized as follows. In Section 2, we describe the research background by giving an overview of the related literature. In Section 3, we discuss the underlying methodology for the extraction of domain-specific keywords using Wikipedia articles and the Wikipedia category graph. In Section 4, we present some experimental evaluations along with their results. Section 5 concludes the paper with possible future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Several approaches have been proposed in the literature to address the problem of keyword extraction from text. Approaches based on the tf-idf model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] are the oldest ones that incorporate the influence of the collection while estimating the relevance of keywords for each document. The research community has investigated supervised learning methods [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] where training data is used to provide syntactic and lexical features for keyword extraction. A more recent line of research utilizes features extracted from open-domain knowledge resources such as Wikipedia to improve the accuracy of supervised keyword extraction systems [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, supervised learning is a laborious task and is not desirable for Web-scale data.
      </p>
      <p>
        On the other hand, an alternative class of approaches applies graph-based semantic relatedness measures for extracting keywords [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Some variants of these algorithms use the graph generated from a Wikipedia ontology [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, these techniques operate at the document level instead of identifying domain-specific keywords, which is the focus of our work. Similarly to these techniques, we propose to use graph-based semantic relatedness methods in combination with the Wikipedia category graph [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] to achieve better precision and recall.
      </p>
      <p>
        A recently proposed word-graph technique called ExpandRank [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] makes use of similar documents (called the neighbourhood) for the extraction of keywords from a document. This technique requires as an input parameter a small number of neighbouring documents. However, for finding similar documents the technique uses cosine similarity, which is computationally very expensive and practically inapplicable to large datasets (including the one in our study). Furthermore, the exploitation of a neighbourhood of documents may cause topic drift, leading to the extraction of noisy terms for a document [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        The Information Retrieval research community is increasingly making use of the information richness of open-domain knowledge sources to improve the effectiveness of Web search applications. The use of open-domain knowledge resources has been investigated for query intent identification, document analysis and understanding, and query expansion [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To the best of our knowledge, this paper is the first attempt to address the problem of extracting domain-specific keywords from short text (where standard NLP techniques may fail). Furthermore, we differ from previous approaches [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] in that we use the relationship between Wikipedia articles and Wikipedia categories for semantic relatedness, and we apply the Infomap [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] algorithm for community detection.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section we discuss the method we propose to extract domain-specific keywords from a set of Web pages (in our case the Web pages crawled from university Web sites). To this aim we first create an index of the titles of the crawled Web pages in order to build the collection from which the keywords have to be extracted. Furthermore, we also index the titles of Wikipedia articles. Then, we compute the intersection between the two indexes (considering only 2-5 grams). We then generate a subset of the intersection by applying a community detection algorithm. In the last phase, we also add to the selected 2-5 grams the significant 1-grams (i.e., single terms such as `Science'). As an optional step we show how to reduce all domain-specific keywords (i.e., n-grams) to single terms; this may be useful for single-term tag cloud applications. In the following subsections the phases applied by the proposed extraction process are explained.</p>
      <sec id="sec-3-1">
        <title>Web Pages and Wikipedia Indexes</title>
        <p>The aim of the first phase of the proposed extraction process is to generate the Web pages and Wikipedia indexes. In Procedure 1 the meta-algorithm implementing this phase is shown. The inputs are the set of Web pages and the Wikipedia data-set. The output consists of two indexes: the index of the titles of Web pages and the index of the titles of Wikipedia articles. In Step 1, empty indexes are initialized. Steps 2-8 are aimed at creating an index of the titles of Web pages after stopword removal, followed by the generation of all possible n-grams (up to 5-grams) for each title. While generating this index, we maintain the frequency count of each n-gram. We call this index Indexweb. Similarly, Steps 9-15 create an index of the titles of the Wikipedia articles after stopword removal. We call this index Indexwiki.</p>
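        <p>The indexing phase described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the stopword list is a hypothetical subset, and the function name is ours.</p>
```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "in", "at"}  # illustrative subset, not the real list

def ngram_index(titles, min_n=1, max_n=5):
    """Build an n-gram -> frequency index from a list of title strings,
    removing stopwords first (as in Steps 2-8 of Procedure 1)."""
    index = Counter()
    for title in titles:
        tokens = [w for w in title.lower().split() if w not in STOPWORDS]
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index[" ".join(tokens[i:i + n])] += 1
    return index

# Indexweb would use min_n=1; Indexwiki would use min_n=2 (Steps 12-14)
index_web = ngram_index(["Department of Computer Science",
                         "School of Marine Biology"])
```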
      </sec>
      <sec id="sec-3-2">
        <title>Intersection between Indexes</title>
        <p>The aim of the second phase of the proposed extraction process is to calculate the intersection of the indexes obtained from Procedure 1, followed by the removal of some non-topical noise. The meta-algorithm is shown in Procedure 2; the inputs to this module are Indexweb, Indexwiki, and a map we generate to highlight associations between Wikipedia articles and categories (WikipediaartMapCat). The output consists of a refined index called Indexinter. In Step 1, the intersection between the indexes Indexweb and Indexwiki is found while preserving the frequency count observed during the generation of Indexweb. This step helps to discover meaningful key phrases (i.e., phrases known to Wikipedia). In Step 2, key phrases containing numeric information (such as `2 May') are filtered out (they are not representative of a topic). In Step 3, we remove key phrases that appear under non-topical (i.e., not domain-specific) categories of Wikipedia (such as `people', `country', `sports'). This is based on the rationale that Wikipedia categories related to people are mainly about people and not about topics. We call the index produced by this module Indexinter.</p>
        <p>Procedure 1 Index generation module: indexGeneration()
Require: Set of crawled Web pages (WebPages), data-set of Wikipedia articles (WikipediaArticles)
1: create empty Indexweb, Indexwiki
2: for all page in WebPages do
3:   text := ExtractTitleText(page)
4:   text := RemoveStopwords(text)
5:   for i = 1 to 5 do
6:     Indexweb.push(extract grams of length i from text)
7:   end for
8: end for
9: for all article in WikipediaArticles do
10:   text := ExtractTitleText(article)
11:   text := RemoveStopwords(text)
12:   for i = 2 to 5 do
13:     Indexwiki.push(extract grams of length i from text)
14:   end for
15: end for
16: return Indexweb, Indexwiki</p>
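        <p>The intersection-and-filtering pass can be sketched as follows. This is our own illustration of the described steps, under the assumption that indexes are plain dictionaries; the names and the toy category map are hypothetical.</p>
```python
def intersect_and_filter(index_web, index_wiki, art_to_cats, non_topical):
    """Keep Web-title n-grams that are also Wikipedia article titles (Step 1),
    dropping numeric phrases (Step 2) and phrases filed only under
    non-topical Wikipedia categories (Step 3)."""
    inter = {}
    for phrase, freq in index_web.items():
        if phrase not in index_wiki:
            continue                                  # not a known Wikipedia phrase
        if any(tok.isdigit() for tok in phrase.split()):
            continue                                  # e.g. '2 may' is not topical
        cats = art_to_cats.get(phrase, set())
        if cats and cats <= non_topical:
            continue                                  # only non-topical categories
        inter[phrase] = freq                          # frequency from Indexweb kept
    return inter

web = {"computer science": 3, "2 may": 1, "marco polo": 2}
wiki = {"computer science", "2 may", "marco polo"}
cats = {"computer science": {"computer science"}, "marco polo": {"people"}}
out = intersect_and_filter(web, wiki, cats, {"people", "country", "sports"})
```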
      </sec>
      <sec id="sec-3-3">
        <title>Domain restriction via community detection</title>
        <p>
          Procedure 3 is the core of our approach, i.e., it discovers domain-specific key phrases by applying a community detection algorithm. The rationale behind the application of a community detection algorithm at this stage is that it contributes to selecting high-quality domain-specific keywords by exploiting the semantic relatedness within communities. The inputs to this module are the Wikipedia category graph and Indexinter. (In Wikipedia, an article falls into one or more categories, and the map returns the Wikipedia categories corresponding to a Wikipedia article.) The output consists of an index of domain-specific key phrases and a list of top communities. In Step 1, the index of key phrases is initialized. In Step 2, an undirected graph of the Wikipedia categories is generated, where a node represents a Wikipedia category, an edge represents a relationship between categories (a category may or may not have a super-category, i.e., a kind of hierarchical category structure, though not a strict one [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], as rare cycles may occur), and the weight of an edge is defined as the sum of the number of articles belonging to the first category node and those belonging to the second category node (as shown in Steps 3-7). In Step 8, the undirected multi-level Infomap algorithm [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] (whose predecessor was found to be the best-known algorithm for the community detection problem [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) is applied for community detection among the Wikipedia categories. The application of the Infomap algorithm yields an assignment of each category to exactly one community. Some communities may contain many categories while others may contain just one. In Steps 9-10, the top-k communities are found on the basis of the number of unique articles that they contain. A community contains several Wikipedia categories and each category contains articles. To understand what we mean by unique articles, consider a community that contains `k' categories (discovered by the Infomap algorithm) and `t' total articles. The `t-x' articles that are not contained in any other community (i.e., specific only to the considered community) are called unique articles. As a simple example, an article on chemistry may be unique to the community that contains the categories related to the chemical sciences, and it would not be mentioned in communities such as the one that contains categories related to the political sciences. In this way, if a community contains several unique articles then it becomes a strong representative of its domain of interest. Similarly, the fewer the unique articles in a community, the lower its chance of being representative of the domain of interest (of the collection). As an extreme case, a community may contain zero or one unique article, which means it may be considered an outlier (having little or no association with the considered domain). For example, a community containing only one unique article in the Wikipedia category `1979 births' is likely a random outlier (if academic documents are considered). We consider the articles contained in the top-k ranked communities as being relevant to the domain of interest. Therefore, we declare the top-k communities as being the most representative of the domain. Based on this we further reduce Indexinter, and we call this subset IndexdomainPhrases (as shown in Steps 11-18).
        </p>
        <p>Procedure 2 First pass of identifying key phrases: indexesIntersection()
Require: Indexes of the titles of Web pages (Indexweb) &amp; Wikipedia articles (Indexwiki), Wikipedia article-to-category map (WikipediaartMapCat)
1: IndexsimpleInter := (Indexweb ∩ Indexwiki)
2: IndexnoNumInter := IndexsimpleInter - all key phrases with numeric values
3: Indexinter := IndexnoNumInter - all key phrases mentioned under non-topical categories of Wikipedia (discovered using WikipediaartMapCat)
4: return Indexinter</p>
        <p>Procedure 3 Application of community detection for identifying key phrases: communityDetection()
Require: Wikipedia category graph (Wikipediacatgraph), index after the application of the first pass (Indexinter)
1: create empty IndexdomainPhrases
2: graph := make Wikipedia categories into an undirected graph using Wikipediacatgraph
3: for all edge in graph.edges do
4:   superCategorynumArticles := (Indexinter ∩ edge.superCategory.articles).length
5:   subCategorynumArticles := (Indexinter ∩ edge.subCategory.articles).length
6:   edge[weight] := superCategorynumArticles + subCategorynumArticles
7: end for
8: Listcommunities := infomap_undirected(graph)
9: OrderedListcommunities := Listcommunities in descending order of the number of unique articles per community
10: Topcommunities := topk(OrderedListcommunities, 10)
11: for all comm in Topcommunities do
12:   for all category in comm.categories do
13:     Articlesinter := Indexinter ∩ category.articles
14:     for all article in Articlesinter do
15:       IndexdomainPhrases.push(article.title)
16:     end for
17:   end for
18: end for
19: return IndexdomainPhrases, Topcommunities</p>
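        <p>Once Infomap has assigned each category to a community, the ranking by unique articles (Steps 9-10) reduces to simple set arithmetic. The sketch below assumes the community assignment is already available (we do not re-implement Infomap here); all names are ours.</p>
```python
from collections import defaultdict

def top_communities(cat_to_comm, cat_to_articles, k=10):
    """Rank communities by the number of articles unique to them, i.e.
    articles appearing in no other community (Steps 9-10 of Procedure 3)."""
    comm_articles = defaultdict(set)
    for cat, comm in cat_to_comm.items():
        comm_articles[comm] |= set(cat_to_articles.get(cat, ()))

    def unique_count(comm):
        others = set().union(*[a for c, a in comm_articles.items() if c != comm])
        return len(comm_articles[comm] - others)

    return sorted(comm_articles, key=unique_count, reverse=True)[:k]

# toy data: community 0 has two unique articles, community 1 has one
cat_to_comm = {"Chemistry": 0, "Chemical engineering": 0, "Political science": 1}
cat_to_articles = {"Chemistry": {"benzene", "science"},
                   "Chemical engineering": {"benzene", "acid"},
                   "Political science": {"science", "vote"}}
top = top_communities(cat_to_comm, cat_to_articles, k=2)
```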
      </sec>
      <sec id="sec-3-4">
        <title>Incorporating Single Terms</title>
        <p>The aim of this phase is to expand IndexdomainPhrases by including single terms, as shown in Procedure 4. The inputs to this module are IndexdomainPhrases and Topcommunities. The output consists of the index of domain-specific key terms and phrases. In Step 1, an index of single terms is initialized. In Steps 2-12, each category in Topcommunities is visited, and each category name composed of two or more words is considered as a candidate for extracting single terms (as shown in Steps 5-9); e.g., `Cell biology' is composed of the two terms `cell' and `biology'. With the aim of selecting meaningful single terms (in the academic sites application described in Section 4), we apply a simple rule in the considered context: a single (topical) key term usually ends with a suffix such as `logy', `ics', or `ulus'. In Step 13, we merge the extracted key terms with IndexdomainPhrases to produce the final output IndexdomainTermsandPhrases.</p>
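        <p>The suffix heuristic of this phase can be sketched as follows; the suffix tuple comes from the rule stated above, while the function name is a hypothetical illustration of Steps 2-12 of Procedure 4.</p>
```python
TOPICAL_SUFFIXES = ("logy", "ics", "ulus")  # heuristic for the academic domain

def single_terms(category_names):
    """Extract single topical terms from multi-word category names
    using the suffix rule (a sketch of Procedure 4, Steps 2-12)."""
    terms = set()
    for name in category_names:
        words = name.lower().split()
        if len(words) >= 2:  # only multi-word names are candidates
            terms |= {w for w in words if w.endswith(TOPICAL_SUFFIXES)}
    return terms
```
      <p>For instance, `Cell biology' yields `biology' and `Applied mathematics' yields `mathematics', while `Marco Polo' yields nothing.</p>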
      </sec>
      <sec id="sec-3-5">
        <title>Complete Algorithm</title>
        <p>Procedure 5 shows the complete algorithm, the phases of which have been described in Sections 3.1-3.4. The inputs to this module are the set of Web pages and the Wikipedia data-set. The output consists of the index of domain-specific key terms and phrases. In Step 1, the indexes for the Web pages and the Wikipedia data-set are generated by calling Procedure 1. In Step 2, the index is refined by computing the intersection of the indexes through a call to Procedure 2. In Step 3, the index is further refined through community detection using Procedure 3. Finally, the index is expanded to incorporate domain-specific key terms by making a call to Procedure 4. (Heuristic rules similar to the suffix rule of Section 3.4 can be crafted or learned for other domains.)</p>
        <p>Procedure 4 Final pass for identifying single key terms: expandtoSingleTerms()</p>
      </sec>
      <sec id="sec-3-7">
        <title>Application of the Proposed Algorithm to the Extraction of Single Key Terms</title>
        <p>In this section, we show the application of the proposed algorithm to extracting important single terms instead of key phrases. This application can be useful for generating tag clouds of single key terms. To this aim, the index of domain-specific key terms and phrases is reduced to a list of single terms while preserving the frequency count of each term of the key term/phrase in such a way that none of the terms is over-counted. For example, suppose there were only two n-grams in the index: `a b' with frequency `n' and `b c' with frequency `n'. Upon reduction to single terms we can say that `a' and `c' occur with frequency `n'. However, we cannot say with certainty that `b' occurs with frequency `2n', as this may not be correct when the n-grams are produced by a stream of data such as `a b c'. Therefore, in order to overcome the problem of over-counting, we maintain positional indexes of words within the stream (i.e., the position of the n-gram per Web page title). For the stream `a b c' the positional index of `a' is one, of `b' is two, and of `c' is three; for the discovered n-grams `a b' and `b c' both occurrences of `b' carry the same positional index, so its frequency is counted just once (avoiding the over-counting problem).</p>
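        <p>The positional de-duplication above can be sketched by identifying each word occurrence with a (title, position) pair, so that a word shared by overlapping n-grams is counted once. The data layout and names are our own illustration.</p>
```python
def single_term_frequencies(ngram_positions):
    """ngram_positions maps an n-gram to a set of (title_id, start_pos) pairs.
    Each word occurrence is keyed by (title_id, absolute position), so a word
    shared by overlapping n-grams ('a b' and 'b c' from 'a b c') counts once."""
    seen = {}
    for ngram, occurrences in ngram_positions.items():
        words = ngram.split()
        for title_id, start in occurrences:
            for offset, word in enumerate(words):
                seen.setdefault(word, set()).add((title_id, start + offset))
    return {w: len(positions) for w, positions in seen.items()}

# 'a b' starts at position 0 and 'b c' at position 1 of the same title:
# both contain the 'b' at position 1, which is counted only once
freqs = single_term_frequencies({"a b": {(0, 0)}, "b c": {(0, 1)}})
```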
        <p>As a final step, we lemmatize all the obtained terms in order to use a conceptual representation of each term (e.g., `sciences' becomes `science'). Finally, all terms over all n-grams are ranked by frequency count, and the term with the highest frequency represents the most important key term.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <p>In this section we present the employed dataset, the evaluation measures, and the experiments. We also present a discussion of the obtained results.</p>
      <sec id="sec-4-1">
        <title>Dataset and Evaluation Measures</title>
        <p>
          To evaluate the proposed approach we focus on academic Web sites for the identification of domain-specific key terms/phrases. To this aim, we crawled the English Web pages of eight post-graduate school Web sites from five different countries, as shown in Table 1. For each Web site, we crawled up to a depth of five from the root page in order to cover at least 80%-95% of the important Web pages [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. In addition, to avoid crawler traps, i.e., infinite dynamic Web pages such as calendars, we adopted the policy of crawling at most the first 500 instances of each dynamic Web page.
        </p>
        <p>For the evaluations we use the metric of Precision at k (P@k). P@k is defined as the ratio of correctly matching results among the top-k results.</p>
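        <p>As a concrete illustration, P@k amounts to the following computation (names are ours):</p>
```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k ranked results that are judged relevant."""
    top = results[:k]
    return sum(1 for r in top if r in relevant) / k

# three of the top four results are relevant -> P@4 = 0.75
p = precision_at_k(["science", "biology", "marco polo", "physics"],
                   {"science", "biology", "physics"}, 4)
```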
      </sec>
      <sec id="sec-4-2">
        <title>Evaluations</title>
        <p>We conducted two experiments; in the first experiment we evaluated the quality of the methodology proposed in Section 3.5, and in the second experiment we evaluated the quality of the extension proposed in Section 3.6. For both experiments, 13 human annotators (all but one of whom have completed at least their post-graduate studies) made the relevance judgements for the top-20 results by associating a label of relevant, irrelevant, or uncertain with each keyword. For each keyword, the 13 judgements are aggregated to produce a single label: a keyword is labelled as relevant (or irrelevant) if the majority of the annotators labelled it as relevant (or irrelevant); in case of a tie we assign a random label. The keywords labelled as uncertain play no role in the aggregation process. The aggregated judgement is used in the graphical representations of the experiments.</p>
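        <p>The aggregation rule just described (uncertain votes discarded, majority wins, ties broken at random) can be sketched as:</p>
```python
import random
from collections import Counter

def aggregate(judgements):
    """Collapse per-annotator labels into a single label: 'uncertain' votes
    are ignored, the majority wins, and ties are broken at random."""
    votes = Counter(j for j in judgements if j != "uncertain")
    if votes["relevant"] > votes["irrelevant"]:
        return "relevant"
    if votes["irrelevant"] > votes["relevant"]:
        return "irrelevant"
    return random.choice(["relevant", "irrelevant"])
```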
        <p>Before conducting the experiments, we produced four variants of the proposed methodology for the identification of domain-specific key terms/phrases, explained below:</p>
        <p>n-grams: the basic algorithm (baseline) that uses Indexweb of Section 3.1, ordered by (descending) frequency of each n-gram.</p>
        <p>simple inter: the algorithm that uses IndexsimpleInter of Section 3.2, ordered by (descending) frequency of each n-gram.</p>
        <p>intersect noNum: the algorithm that uses IndexnoNumInter of Section 3.2, ordered by (descending) frequency of each n-gram.</p>
        <p>
          complete: the algorithm that uses IndexdomainTermsandPhrases of Section 3.5, ordered by (descending) frequency of each n-gram. (Table 1, note a: the URL has now changed to www.unimib.it/go/102/Home/English.)
          Experiment 1. In this experiment, we compared the four variations of the proposed methodology to evaluate their capability to generate high-quality domain-specific key terms/phrases. We asked annotators to label a key term/phrase as relevant when it correctly represents a complete name of a topical domain or sub-domain (academic topical area of interest). For instance, `Information Retrieval', `Marine Biology', and `Science' are relevant examples, but `Marine' is an irrelevant key term because it does not represent the name of a topical domain or sub-domain. In order to evaluate the agreement among the annotators we calculated the value of Fleiss's Kappa [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which showed high agreement (a value of 0.81).
        </p>
        <p>Fig. 1(a) shows the quality of identifying domain-specific key terms/phrases at P@20. This figure shows that the complete algorithm outperforms the other algorithms. Fig. 1(b) shows the quality of noise elimination at the steps simple inter, intersect noNum, and complete at P@20. Generally the graph shows very competitive precision in eliminating noisy information; however, for IBA-KHI the precision of complete dropped to 0.8, as some key terms/phrases were identified as noisy due to problems of disambiguation; e.g., `computer studies' was disambiguated by the Wikipedia data set as `computer' (instead of `computer science'), which then led to non-topical categories (hence it was recognized as noisy information).</p>
        <p>
          In our previous work [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], we conducted a similar experiment without incorporating single terms (Section 3.4) into the complete algorithm (Section 3.5) and found a similar result, i.e., it outperformed the rest of the algorithms; however, incorporating single terms makes it even better.
          Experiment 2. In this experiment, we evaluated the capability to generate high-quality domain-specific key terms (i.e., single terms only). We asked annotators to label a key term as relevant when it correctly represents a complete or partial name of a topical domain or sub-domain (academic topical area of interest). For instance, `Science' and `Biology' are relevant examples, and so is `Marine' if it is a partial representation of `Marine Biology'. As with the previous experiment, we calculated Fleiss's Kappa and found a value of 0.77, showing high agreement among the annotators.
        </p>
        <p>
          In this experiment, we compared three well-known algorithms, i.e., BM25 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], TF-IDF, and TF-Norm (normalized term frequency), against the three variations of our methodology: n-grams, simple inter, and complete. The indexes generated by the three variations were processed to generate single terms, as discussed in Section 3.6. Fig. 2 shows that complete outperforms the other algorithms. However, there is still room for improvement in the best case.
        </p>
        <p>To provide an illustration of typical results, Table 2 shows the data from the Milano-Bicocca Web site. In this table, we show the top-15 domain-specific single key terms detected, the domain-specific key terms/phrases detected, and the noisy information that was eliminated by complete.</p>
        <p>Single Key Terms Only: science, statistic, mathematics, computer, sociology, business, biotechnology, technology, developmental, psychology, service, material, social, political, communication.</p>
        <p>Key Terms/Phrases: science, economics, technology, psychology, mathematics, statistics, physics, sociology, medicine, law, biotechnology, developmental psychology, computer science, materials science, surgery.</p>
        <p>Noisy Information: research university, masters university, summer schools, doctoral degree, drop out, degree courses, campus university, student unions, ranking university, laboratory techniques, department statistics, union university, marco polo, department biotechnology, erasmus university.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this contribution we presented an approach for identifying domain-specific key terms/phrases using Wikipedia. Furthermore, we also presented an extension of our approach for identifying domain-specific single key terms only, which could be useful for applications such as single-word tag cloud definition. The evaluations have shown promising results overall. In future work, we would like to address the problem of disambiguation, and we would like to apply our methodology to the full text of Web pages.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Wikipedia in Action: Ontological Knowledge in Text Categorization,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. R.
          <article-title>Baeza-Yates and</article-title>
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Castillo. Crawling the infinite web: five levels are enough</article-title>
          .
          <source>In Proceedings of the third Workshop on Web Graphs (WAW)</source>
          , pages
          <volume>156</volume>
          {
          <fpage>167</fpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. <string-name><given-names>D. M.</given-names> <surname>Blei</surname></string-name>, <string-name><given-names>A. Y.</given-names> <surname>Ng</surname></string-name>, and <string-name><given-names>M. I.</given-names> <surname>Jordan</surname></string-name>. <article-title>Latent Dirichlet allocation</article-title>. <source>J. Mach. Learn. Res.</source>, <volume>3</volume>:<fpage>993</fpage>–<lpage>1022</lpage>, Mar. <year>2003</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. <string-name><given-names>J.</given-names> <surname>Fleiss</surname></string-name> et al. <article-title>Measuring nominal scale agreement among many raters</article-title>. <source>Psychological Bulletin</source>, <volume>76</volume>(<issue>5</issue>):<fpage>378</fpage>–<lpage>382</lpage>, <year>1971</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. <string-name><given-names>S.</given-names> <surname>Fortunato</surname></string-name>. <article-title>Community detection in graphs</article-title>. <source>Physics Reports</source>, <volume>486</volume>(<issue>3-5</issue>):<fpage>75</fpage>–<lpage>174</lpage>, <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. <string-name><given-names>M.</given-names> <surname>Grineva</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Grinev</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Lizorkin</surname></string-name>. <article-title>Extracting key terms from noisy and multitheme documents</article-title>. In <source>Proceedings of the 18th International Conference on World Wide Web, WWW '09</source>, pages <fpage>661</fpage>–<lpage>670</lpage>, New York, NY, USA, <year>2009</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. <string-name><given-names>K. S.</given-names> <surname>Hasan</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Ng</surname></string-name>. <article-title>Conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art</article-title>. In <source>Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10</source>, pages <fpage>365</fpage>–<lpage>373</lpage>, Stroudsburg, PA, USA, <year>2010</year>. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. <string-name><given-names>J.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Lochovsky</surname></string-name>, <string-name><given-names>J.-T.</given-names> <surname>Sun</surname></string-name>, and <string-name><given-names>Z.</given-names> <surname>Chen</surname></string-name>. <article-title>Understanding user's query intent with Wikipedia</article-title>. In <source>WWW '09: Proceedings of the 18th International Conference on World Wide Web</source>, pages <fpage>471</fpage>–<lpage>480</lpage>, New York, NY, USA, <year>2009</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. <string-name><given-names>A.</given-names> <surname>Jain</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Pennacchiotti</surname></string-name>. <article-title>Open entity extraction from web search query logs</article-title>. In <source>Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10</source>, pages <fpage>510</fpage>–<lpage>518</lpage>, Stroudsburg, PA, USA, <year>2010</year>. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. <string-name><given-names>A.</given-names> <surname>Kotov</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Zhai</surname></string-name>. <article-title>Tapping into knowledge base for concept feedback: leveraging ConceptNet to improve search results for difficult queries</article-title>. In <source>Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12</source>, pages <fpage>403</fpage>–<lpage>412</lpage>, New York, NY, USA, <year>2012</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zheng</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>. <article-title>Automatic keyphrase extraction by bridging vocabulary gap</article-title>. In <source>Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL '11</source>, pages <fpage>135</fpage>–<lpage>144</lpage>, Stroudsburg, PA, USA, <year>2011</year>. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zheng</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>. <article-title>Clustering to find exemplar terms for keyphrase extraction</article-title>. In <source>Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09</source>, pages <fpage>257</fpage>–<lpage>266</lpage>, Stroudsburg, PA, USA, <year>2009</year>. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. <string-name><given-names>R.</given-names> <surname>Mihalcea</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Csomai</surname></string-name>. <article-title>Wikify!: linking documents to encyclopedic knowledge</article-title>. In <source>Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07</source>, pages <fpage>233</fpage>–<lpage>242</lpage>, New York, NY, USA, <year>2007</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. <string-name><given-names>D.</given-names> <surname>Milne</surname></string-name> and <string-name><given-names>I. H.</given-names> <surname>Witten</surname></string-name>. <article-title>Learning to link with Wikipedia</article-title>. In <source>Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08</source>, pages <fpage>509</fpage>–<lpage>518</lpage>, New York, NY, USA, <year>2008</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. <string-name><given-names>M. A.</given-names> <surname>Qureshi</surname></string-name>, <string-name><given-names>C.</given-names> <surname>O'Riordan</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Pasi</surname></string-name>. <article-title>Short-text domain specific key terms/phrases extraction using an n-gram model with Wikipedia</article-title>. In <source>CIKM</source>, pages <fpage>2515</fpage>–<lpage>2518</lpage>, <year>2012</year>.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. <string-name><given-names>S. E.</given-names> <surname>Robertson</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Walker</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hancock-Beaulieu</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Gatford</surname></string-name>. <article-title>Okapi at TREC-3</article-title>. In <source>The Third Text REtrieval Conference (TREC-3)</source>, <year>1995</year>.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. <string-name><given-names>M.</given-names> <surname>Rosvall</surname></string-name> and <string-name><given-names>C. T.</given-names> <surname>Bergstrom</surname></string-name>. <article-title>Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems</article-title>. <source>PLoS ONE</source>, <volume>6</volume>(<issue>4</issue>):e18209, Apr. <year>2011</year>.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. <string-name><given-names>G.</given-names> <surname>Salton</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Buckley</surname></string-name>. <article-title>Term-weighting approaches in automatic text retrieval</article-title>. <source>Information Processing and Management</source>, pages <fpage>513</fpage>–<lpage>523</lpage>, <year>1988</year>.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. <string-name><given-names>P. P.</given-names> <surname>Talukdar</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Pereira</surname></string-name>. <article-title>Experiments in graph-based semi-supervised learning methods for class-instance acquisition</article-title>. In <source>Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10</source>, pages <fpage>1473</fpage>–<lpage>1481</lpage>, Stroudsburg, PA, USA, <year>2010</year>. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. <string-name><given-names>P. D.</given-names> <surname>Turney</surname></string-name>. <article-title>Learning algorithms for keyphrase extraction</article-title>. <source>Inf. Retr.</source>, <volume>2</volume>:<fpage>303</fpage>–<lpage>336</lpage>, May <year>2000</year>.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. <string-name><given-names>X.</given-names> <surname>Wan</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Xiao</surname></string-name>. <article-title>Single document keyphrase extraction using neighborhood knowledge</article-title>. In <source>Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI'08</source>, pages <fpage>855</fpage>–<lpage>860</lpage>. AAAI Press, <year>2008</year>.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. <string-name><given-names>I. H.</given-names> <surname>Witten</surname></string-name>, <string-name><given-names>G. W.</given-names> <surname>Paynter</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Frank</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Gutwin</surname></string-name>, and <string-name><given-names>C. G.</given-names> <surname>Nevill-Manning</surname></string-name>. <article-title>KEA: practical automatic keyphrase extraction</article-title>. In <source>Proceedings of the Fourth ACM Conference on Digital Libraries, DL '99</source>, pages <fpage>254</fpage>–<lpage>255</lpage>, New York, NY, USA, <year>1999</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>23. <string-name><given-names>T.</given-names> <surname>Zesch</surname></string-name> and <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>. <article-title>Analysis of the Wikipedia Category Graph for NLP Applications</article-title>. In <source>Proceedings of the TextGraphs-2 Workshop (NAACL-HLT)</source>, <year>2007</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>