A Keyphrase Generation Technique Based upon
Keyphrase Extraction and Reasoning on Loosely
           Structured Ontologies

                           Dario De Nart, Carlo Tasso

                           Artificial Intelligence Lab
                Department of Mathematics and Computer Science
                           University of Udine, Italy
                    {dario.denart,carlo.tasso}@uniud.it



      Abstract. Associating meaningful keyphrases to documents and web
      pages is an activity that can greatly increase the accuracy of
      Information Retrieval and Personalization systems, but the growing
      amount of available text data is far too large for extensive manual
      annotation. On the other hand, automatic keyphrase generation, a
      complex task involving Natural Language Processing and Knowledge
      Engineering, can significantly support this activity. Several
      strategies have been proposed over the years, but most of them
      require extensive training data (which are not always available),
      suffer from ambiguity and differences in writing style, are highly
      domain-specific, and often rely on well-structured knowledge that is
      very hard to acquire and encode. In order to overcome these
      limitations, we propose in this paper an innovative unsupervised and
      domain-independent approach that combines keyphrase extraction and
      keyphrase inference based on loosely structured, collaborative
      knowledge sources such as Wikipedia, Wordnik, and Urban Dictionary.
      This choice introduces a higher level of abstraction in the generated
      KPs, allowing us to determine that two texts deal with similar topics
      even if they do not share a single word.


1   Introduction
Due to the constant growth of the amount of text data available on the Web
and in digital libraries, the demand for automatic summarization and real-time
information filtering has rapidly increased. However, such systems need meta-
data that can precisely and compactly represent the content of the document.
As broadly discussed in the literature and proven by web usage analysis [16], it
is particularly convenient for such metadata to come in the form of KeyPhrases
(KPs), since they are very expressive (much more than single keywords), fairly
straightforward in their meaning, and have a high cognitive plausibility, because
humans tend to think in terms of KPs rather than single keywords. In the rest
of this paper we will refer to KP generation as the process of associating a
meaningful set of KPs to a given text, regardless of their origin; we will call
KP extraction the act of selecting a set of KPs from the text itself, and KP
inference the act of associating to the text a set of KPs that may not be found
inside it. KP generation is a trivial and intuitive task for humans, since anyone
can tell at least the main topics of a given text, or decide whether or not it
belongs to a certain domain (news items, scientific literature, narrative, etc.),
but it can be extremely hard for a machine, since most of the available
documents lack any kind of semantic hint.
    Over the years, several authors have addressed this issue, proposing different
approaches to both KP extraction and inference, but, in our opinion, each of
them has severe practical limitations that prevent the large-scale employment of
automatic KP generation in Information Retrieval, Social Tagging, and Adaptive
Personalization. Such limitations are the need for training data, the impossibility
of associating to a given text keyphrases which are not already included in
that text, high domain specificity, and the need for structured, detailed, and
expensive domain knowledge coded in the form of a thesaurus or an ontology.
    In this paper we propose an unsupervised KP generation method that com-
bines KP extraction and KP inference based on ontological reasoning over
knowledge sources which, though not formal ontologies, can be seen as loosely
structured ones, in order to associate a meaningful and detailed set of keyphrases
to any given text.
    The rest of the paper is organized as follows: in Section 2 we briefly review
related work; in Section 3 we give an overview of the system; in Section 4 we
present our keyphrase extraction technique; in Section 5 we illustrate our
keyphrase inference technique; in Section 6 we discuss some experimental results
and, finally, in Section 7 we conclude the paper.


2   Related Work
Many works over the past few years have discussed different solutions to the
problem of automatically tagging documents and Web pages, as well as the
possible applications of such technologies in the fields of Personalization and
Information Retrieval, in order to significantly reduce information overload and
increase accuracy. Both keyphrase extraction and inference have been widely
discussed in the literature. Several different keyphrase extraction techniques
have been proposed, which are usually structured into two phases:
 – a candidate phrase identification phase, in which all the possible phrases are
   detected in the text;
 – a selection phase, in which only the most significant of these phrases are
   chosen as keyphrases (a minimal sketch of this two-phase scheme is given
   below).
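
To make this two-phase structure concrete, the following deliberately naive
sketch (ours, not taken from any of the cited systems) identifies candidates as
runs of non-stopword tokens and selects keyphrases by raw frequency, one of the
heuristics mentioned below:

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is"}

    def candidate_phrases(text, max_len=3):
        """Phase 1: every contiguous run of non-stopword tokens,
        up to max_len words, is a candidate phrase."""
        tokens = re.findall(r"[a-z]+", text.lower())
        candidates, run = [], []
        for tok in tokens:
            if tok in STOPWORDS:
                run = []
                continue
            run.append(tok)
            for n in range(1, min(len(run), max_len) + 1):
                candidates.append(" ".join(run[-n:]))
        return candidates

    def select_keyphrases(text, k=5):
        """Phase 2: keep the k most frequent candidates."""
        return [p for p, _ in Counter(candidate_phrases(text)).most_common(k)]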
The wide span of proposed methods can be roughly divided into two distinct
categories:
 – Supervised approaches: the underlying idea of these methods is that KP
   extraction can be seen as a classification problem and therefore solved with
   a sufficient amount of (manually annotated) training data and machine
   learning algorithms [19]. Several authors have addressed the problem in this
   direction [18] and many systems implementing supervised approaches are
   available, such as KEA [20], Extractor, and LAKE [4]. All the above systems
   can be extremely effective and, as long as reliable data sets are available,
   can be flawlessly applied to any given domain [10]. However, requiring
   training data in order to work properly implies two major drawbacks: (i) the
   quality of the extraction process relies on the quality of the training data,
   and (ii) a model trained on a specific domain will not fit another application
   domain unless it is retrained.
 – Unsupervised approaches: this second class of methods eliminates the need
   for training data by selecting candidate KPs according to some ranking
   strategy. Most of the proposed systems rely on the identification of noun
   phrases (i.e. phrases made only of nouns) and then proceed with a further
   selection based on heuristics such as phrase frequency [1] or phrase
   clustering [2]. A third approach, proposed by [12] and [9], exploits a
   graph-based ranking algorithm, bearing much similarity to the well-known
   PageRank algorithm, in order to select significant KPs and identify related
   terms that can be summarized by a single phrase. All the above techniques
   share the same advantage over the supervised strategies, namely being truly
   domain-independent: they rely on general principles and heuristics, so no
   training data are needed. However, such generalist approaches may not
   always lead to excellent results, especially when dealing with peculiar
   documents whose structure does not satisfy the assumptions that drive the
   KP extraction process.


     Hybrid approaches have been proposed as well, incorporating semi-supervised
domain knowledge in an otherwise unsupervised extraction strategy [15], but
they still remain highly domain-specific. Keyphrase extraction, however, is
severely limited by the fact that it can ultimately return only words contained
in the input document, which are highly prone to ambiguity and subject to the
nuances of different writing styles (e.g., one author may write “mining frequent
patterns” where another would write “frequent pattern mining”). Keyphrase
inference can overcome these limitations and has been widely explored in the
literature as well, spanning from systems that simply combine words appearing
in the text in order to construct rather than extract phrases [3], to systems that
assign keyphrases that may be built with terms that never appear in the
document. In the latter case, KPs come from a controlled dictionary, possibly
an ontology, and a classifier is trained in order to find which entries of the
exploited dictionary may fit the text [6]. If the dictionary of possible KPs is an
ontology, its structure can be exploited in order to provide additional evidence
for inference [13] and, by means of ontological reasoning, to evaluate the
relatedness between terms [11]. In [14] a KP inference technique is discussed
which is based on a very specific domain OWL ontology and which combines
both KP extraction and inference, in the context of a vast framework for
personalized document annotation. KP inference based on dictionaries, however,
is strongly limited by the size, the domain coverage, and the specificity level of
the considered dictionary.
3   System Overview

In order to test our approach and to support our claims, we developed a new
version of the system presented in [14], which introduces an original innovation,
i.e. the exploitation of a number of generalist online External Knowledge Sources,
rather than a formal ontology, in order to improve extraction quality and infer
meaningful KPs not included in the input text, while preserving domain
independence.




                       Figure 1. Architecture of the System



    In Figure 1 the overall organization of the proposed system is presented. It
is constituted by the following main components:

 – A KP Extraction Module (KPEM ), devoted to analysing the text and
   extracting meaningful KPs from it. It is supported by some linguistic
   resources, such as a POS tagger (for the English language) and a stopwords
   database, and it accesses some online External Knowledge Sources (EKS ),
   mainly exploited in order to provide support to the candidate KPs identified
   in the text (as explained in the following section). The KPEM receives as
   input an unstructured text and produces as output a ranked list of KPs,
   which is stored in an Extracted Keyphrases Database (EKPDB ).
 – A KP Inference Module (KPIM ), which works on the KP list produced
   by the KPEM and is devoted to inferring new KPs, (possibly) not already
   included in the input text. It relies on ontological reasoning based on access
   to the External Knowledge Sources, exploited in order to identify concepts
   related to those referred to by the KPs previously extracted by the KPEM.
   Inferred KPs are stored in the Inferred KP Database (IKPDB ).
    The access to the online External Knowledge Sources is provided by a Gen-
eralized Knowledge Gateway (GKG). Both the EKPDB and the IKPDB can be
accessed through Web Services by external applications, providing in this way
an advanced KP generation service to interested Web users, who can exploit
this capability in other target applications.


4   Phrase Extraction

KPEM is an enhanced version of DIKPE, the unsupervised, domain-independent
KP extraction approach described in [14] and [8]. In a nutshell, DIKPE generates
a large set of candidate KPs and then merges different types of knowledge in
order to identify meaningful concepts in a text, also trying to model a human-like
KP assignment process. In particular we use:

 – linguistic knowledge (POS tagging, sentence structure, punctuation);
 – statistical knowledge (frequency, tf-idf, ...);
 – knowledge about the structure of the document (position of the candidate
   KP in the text, title, subtitles, ...);
 – meta-knowledge provided by the author (HTML tags, ...);
 – knowledge coming from online external knowledge sources, useful for
   validating candidate keyphrases which have been socially recognized, for
   example, in collaborative wikis (e.g. Wikipedia, Wordnik, and other online
   resources).

By means of the above knowledge sources, each candidate phrase is characterized
by a set of features, such as, for example:

 – Frequency: the frequency of the phrase in the text;
 – Phrase Depth: the point in the text where the phrase occurs for the first
   time; the earlier it appears, the higher the value;
 – Phrase Last Occurrence: the point in the text where the phrase occurs for
   the last time; the later it appears, the higher the value;
 – Life Span: the fraction of the text between the first and the last occurrence
   of the phrase;
 – POS value: a parameter taking into account the grammatical composition
   of the phrase, excluding some patterns and assigning higher priority to
   others (typically, for example but not exclusively, it can be relevant to
   consider the ratio between the number of nouns and the number of words
   in the phrase);
 – WikiFlag: a parameter taking into account whether or not the phrase is an
   entry of the collaborative external knowledge sources (EKS).

A weighted mean of the above features, called keyphraseness, is then computed,
and the KPs are sorted in descending keyphraseness order. The weight of each
feature can be tuned in order to fit particular kinds of text, but usually a
generalist preset can be used with good results. The topmost n KPs are finally
suggested.
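
To make the ranking step concrete, the following minimal sketch computes
keyphraseness as a weighted mean over a candidate's feature vector and sorts the
candidates accordingly; the feature values are assumed to be normalized to
[0, 1], and the weights shown are illustrative placeholders, not the actual DIKPE
preset:

    # Placeholder weights: the actual generalist preset used by DIKPE
    # is not reported in this paper.
    WEIGHTS = {"frequency": 0.20, "phrase_depth": 0.20, "last_occurrence": 0.10,
               "life_span": 0.20, "pos_value": 0.15, "wikiflag": 0.15}

    def keyphraseness(features):
        """Weighted mean of the feature values (all assumed in [0, 1])."""
        return sum(w * features[name] for name, w in WEIGHTS.items()) / sum(WEIGHTS.values())

    def rank(candidates):
        """candidates: phrase -> feature dict.
        Returns the phrases sorted by descending keyphraseness."""
        return sorted(candidates, key=lambda p: keyphraseness(candidates[p]), reverse=True)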
In this work, we extended the DIKPE system with the GKG, allowing access to
multiple knowledge sources at the same time, and we added a more general
version of the WikiFlag feature. This feature is computed as follows: if the
phrase matches an entry in at least one of the considered knowledge sources, its
value is set to 1; otherwise the phrase is split into single terms and the WikiFlag
value is the fraction of terms that have a match in at least one of the considered
knowledge sources. By doing so, a KP that does not match as a whole phrase,
but is constituted by terms that match as single words, still gets a high score,
although lower than that of a KP featuring a perfect match. The WikiFlag
feature is processed like all the other features, concurring to the computation of
the keyphraseness and, therefore, influencing the ranking of the extracted KPs.
The rationale of this choice is that a KP is important insofar as it represents a
meaningful concept or entity, rather than a random combination of words, and
matching a whole phrase against collaborative, human-made knowledge sources
(as the EKS are) guarantees that it makes sense, providing a strong form of
human/social validation. This also reduces the tendency of the system to return
typos, document parsing errors, and other meaningless strings as false positives.
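
A minimal sketch of this computation is given below; is_entry stands in for a
query through the GKG to the considered knowledge sources (an assumed
wrapper, not a real API):

    def wikiflag(phrase, is_entry):
        """Return 1.0 on a full-phrase match against at least one EKS;
        otherwise the fraction of single terms that match."""
        if is_entry(phrase):
            return 1.0
        terms = phrase.split()
        return sum(1 for t in terms if is_entry(t)) / len(terms)

    # Toy lexicon standing in for the union of the EKS entries:
    lexicon = {"pattern mining", "pattern", "mining"}
    print(wikiflag("pattern mining", lexicon.__contains__))           # 1.0, full match
    print(wikiflag("pattern mining approach", lexicon.__contains__))  # ~0.67, 2 of 3 terms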

    Another improvement over the original DIKPE approach is that, instead of
suggesting the top n extracted KPs, the new system evaluates the decreasing
trend of keyphraseness among the ordered KPs, detects the first significant drop
in the keyphraseness value, and suggests all the KPs occurring before that
(dynamic) threshold. By doing so, the system suggests a variable number of
high-scored KPs, whereas the previous version suggested a fixed number of KPs,
which could be either too small or too large for the given text.
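
The paper leaves the notion of “significant drop” informal; one plausible
realization, assuming a fixed relative-drop threshold, is the following sketch:

    def dynamic_cutoff(ranked_kps, scores, min_drop=0.25):
        """ranked_kps/scores: phrases and keyphraseness values, both
        sorted by descending score. Suggest every KP before the first
        consecutive pair whose relative drop exceeds min_drop (0.25 is
        an assumed value, not taken from the paper)."""
        for i in range(1, len(scores)):
            if scores[i] < (1.0 - min_drop) * scores[i - 1]:
                return ranked_kps[:i]
        return ranked_kps  # no significant drop found: suggest all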


5   Phrase Inference

The KP Inference Module (KPIM), as well as the knowledge-based WikiFlag
feature described in the previous section, relies on a set of external knowledge
sources that are accessed via the Web. We assume that (i) there is a way to
match extracted KPs with entities described in the EKSs (e.g., by querying the
exploited service using the KP as search key) and (ii) each of the considered
EKSs is organized according to some kind of hierarchy, as shown in Figure 2,
even if very weak and loosely structured, in which it is possible to associate to
any entity a set of parent entities and a set of related entities. Such sets may be
empty, since we do not assume that each entity is linked to at least one other
entity, nor the existence of a root entity that is an ancestor of all the other
entities in the ontology.
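
These two assumptions can be captured by a minimal data model such as the
following sketch, where the names are ours and lookup stands in for an actual
query to an online EKS:

    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        """A node of the loosely structured hierarchy of Figure 2; both
        link sets may be empty, and no common root entity is assumed."""
        name: str
        parents: set = field(default_factory=set)   # e.g. Wikipedia categories
        related: set = field(default_factory=set)   # e.g. template links, synonyms

    def lookup(eks, phrase):
        """Assumption (i): match a KP against an EKS. A KP may map to
        several entities (polysemy) or to none; here eks is a plain
        name -> [Entity] index standing in for the online service."""
        return eks.get(phrase.lower(), [])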
    Even if such a structure is loose, assuming its existence is not trivial at all;
however, an increasing number of collaborative resources allow users to classify
and link together knowledge items, generating a pseudo-ontology. Clear examples
of this tendency are Wikipedia, where almost any article contains links to other
articles and many articles are grouped into categories, and Wordnik, an online
collaborative dictionary where any word has associated sets of hypernyms,
synonyms, hyponyms,
          Figure 2. Example of the assumed Knowledge Source structure.


and related words. Recently, several entertainment sites, like Urban Dictionary,
have also begun to provide these possibilities, making them eligible knowledge
sources for our approach. Knowledge sources may be either generalist (like
Wikipedia) or specific (like the many domain-specific wikis hosted on wikia.com),
and several different EKSs can be exploited at the same time in order to provide
better results.
    In the case of Wikipedia, parent entities are given by the categories, which
are thematic groups of articles (e.g., “Software Engineering” belongs to the
“Engineering Disciplines” category). An entry may belong to several categories:
for example, the entry on “The Who” belongs to the “musical quartets” category,
as well as to the “English hard rock musical groups” and the “Musical groups
established in 1964” ones. Related entities, instead, can be deduced from the
links contained in the entry associated with the given entity: such links can be
very numerous and heterogeneous, but the most closely related ones are often
grouped into one or more templates, i.e. thematic collections of internal
Wikipedia links usually displayed at the bottom of the page, as shown in
Figure 3. For instance, on a page dedicated to a film director, it is very likely to
find a template containing links to all the movies he directed or the actors he
worked with.




Figure 3. The lowest section of a Wikipedia page, containing templates (the “Engi-
neering” template has been expanded) and categories (bottom line).
    Wordnik, instead, provides hierarchical information explicitly by associating
to any entity lists of hypernyms (parent entities) and synonyms (related entities).
    The inference algorithm considers the topmost half of the extracted KPs,
which is typically still a significantly larger set than the suggested one, and, for
each KP that can be associated with an entity, retrieves from each EKS a set of
parent entities and a set of related entities. If a KP corresponds to more than
one entity in one or more EKSs, all the retrieved entities are taken into account.
The sets associated with the single KPs are then merged into a table of related
entities and a table of parent entities for the whole text. Each retrieved entity
is scored according to the sum of the keyphraseness values of the KPs from
which it has been derived, and each table is sorted by descending score. The top
entries of these tables are suggested as meaningful KPs for the input document.
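
Reusing the Entity and lookup sketch introduced in the previous section, the
whole inference step can be condensed as follows (our reading of the algorithm,
not the authors' code):

    from collections import defaultdict

    def infer_kps(extracted, eks_list, top_n=10):
        """extracted: list of (phrase, keyphraseness) pairs, sorted by
        descending keyphraseness. Only the topmost half is considered;
        every entity a KP maps to contributes its parents and related
        entities, each scored by the summed keyphraseness of the KPs
        it was derived from."""
        parents, related = defaultdict(float), defaultdict(float)
        for phrase, score in extracted[: len(extracted) // 2]:
            for eks in eks_list:
                for entity in lookup(eks, phrase):   # all senses are kept
                    for p in entity.parents:
                        parents[p] += score
                    for r in entity.related:
                        related[r] += score
        rank = lambda t: sorted(t, key=t.get, reverse=True)[:top_n]
        return rank(parents), rank(related)

On the example discussed below, the shared parent “musical quartets”
accumulates the keyphraseness of both “Queen” and “Joy Division”, and
therefore outranks parent entities supported by only one of the two phrases.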




                 Figure 4. Inference and scoring of parent entities.


    By doing so, we select only entities which are parents of, or related to, a
significant number of high-scored KPs, addressing the problem of polysemy
among the extracted KPs. For instance, suppose we extracted “Queen” and “Joy
Division” from the same text (Figure 4): both are polysemous phrases, since the
former may refer to the English band as well as to a regent, and the latter to
the English band or to Nazi concentration camps. However, since they appear
together, and they both belong to the “musical quartets” category in Wikipedia,
it can be deduced that the text is about music rather than politics or World
War II.


6   Evaluation

Formative tests were performed in order to assess the accuracy of the inferred
KPs and their ability to add meaningful information to the set of extracted
KPs, regardless of the domain covered by the input text. Three data sets, dealing
with different topics, were processed, article by article, with the same feature
weights and exploiting Wikipedia and Wordnik as External Knowledge Sources.
For each article, a list of extracted KPs and a list of inferred KPs were generated;
then the occurrences of each KP were counted, in order to evaluate which portion
of the data set is covered by each KP. We call set coverage the fraction of the
data set labelled with a given KP. Since the topics covered by the texts included
in each data set are known a priori, we expect the system to generate KPs that
associate the majority of the texts in each data set to its specific domain topic.
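In other words, given a data set D and a keyphrase k, coverage(k) =
|{d ∈ D : k is assigned to d}| / |D|; for example, a coverage of 0.13 over a data
set of 113 documents means that about 15 documents were labelled with that KP.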
  Extracted Keyphrase    Set Coverage   Inferred Keyphrase            Set Coverage
  program                    0.13       Mathematics                       0.47
  use                        0.12       Programming language              0.26
  function                   0.12       move                              0.25
  type                       0.10       Computer science                  0.22
  programming language       0.10       Set (mathematics)                 0.17
  programming                0.08       Aristotle                         0.16
  functions                  0.07       Data types                        0.15
  class                      0.07       Function (mathematics)            0.14
  code                       0.06       C (programming language)          0.14
  COBOL                      0.06       Botanical nomenclature            0.12
  chapter                    0.05       C++                               0.11
  variables                  0.05       Information                       0.08
  number                     0.05       Java (programming language)       0.08

Table 1. The most frequently extracted and inferred KPs from the “programming
tutorials” data set.




    The first data set contained 113 programming tutorials, ranging from brief
introductions published on blogs and forums to extensive articles taken from
books and journals, covering both practical and theoretical aspects of program-
ming. A total of 776 KPs were extracted and 297 were inferred. Table 1 reports
the most frequently extracted and inferred KPs. As expected, extracted KPs
are highly specific and tend to characterize only a few documents in the set
(the most frequent KP covers just 13% of the data set), while inferred ones
provide a higher level of abstraction, resulting in a higher coverage of the
considered data set. However, some inferred KPs are not accurate, such as
“Botanical nomenclature”, which clearly derives from the presence of terms
such as “tree”, “branch”, “leaf”, and “forest” that are frequently used in
Computer Science, and “Aristotle”, which comes from the frequent references
to Logic, which Wikipedia frequently associates with the Greek philosopher.
The second data set contained 159 car reviews taken from American and British
magazines, written by professional journalists. Unlike the previous data set, in
which all the texts share a very specific language and provide technical informa-
tion, in this set different writing styles and different kinds of target audience are
present. Some of the reviews are very specific, focusing on technical details,
while others are aimed more at entertaining than at informing. Most of the
considered texts, however, stand somewhere between these two ends, providing a
good deal of technical information together with an accessible and entertaining
style.
   Table 2 reports the most frequently extracted and inferred KPs. While the
extracted KPs clearly identify the automotive domain, the inferred ones do not:
only 44% of the considered texts are covered by the “Automobile” KP, while
64% are labelled with “English-language films”.
     Extracted Keyphrase   Set Coverage   Inferred Keyphrase        Set Coverage
     car                       0.16       United States                 0.66
     sports car                0.08       English-language films        0.64
     SUV                       0.06       Automobile                    0.44
     fuel economy              0.05       United Kingdom                0.33
     ride                      0.05       American films                0.16
     looks                     0.04       Internet Movie Database       0.16
     Lotus                     0.04       Japan                         0.14
     GT                        0.04       2000s automobiles             0.11
     top speed                 0.04       Physical quantities           0.09
     gas mileage               0.04       2010s automobiles             0.09
     look                      0.04       Germany                       0.09
     hot hatch                 0.03       Sports cars                   0.08

Table 2. The most frequently extracted and inferred KPs from the “car reviews” data
set.




    However, this is mostly due to the fact that several reviews tend to stress a
car’s presence in popular movies (e.g., Aston Martin in the 007 franchise or any
given Japanese car in the Fast and Furious franchise), and only 18 out of 327
(5.5%) distinct inferred KPs deal with cinema and television. KPs such as
“United States” and “United Kingdom” are also frequently inferred, due to the
fact that the reviewed cars are mostly designed for the USA and UK markets,
have been tested in those countries, and several manufacturers are based there.
As a side note, 98% of the considered texts are correctly associated with the
manufacturer of the reviewed car.
    The third data set contained reviews of 211 heavy metal albums published
in 2013. The reviews were written by various authors, both professionals and
non-professionals, and span a wide spectrum of writing styles, from utterly
specific, almost scientific, to highly sarcastic, with many puns and popular
culture references.


     Extracted Keyphrase   Set Coverage   Inferred Keyphrase        Set Coverage
     metal                     0.23       Music genre                   1.00
     album                     0.21       Record label                  0.97
     death metal               0.17       Record producer               0.54
     black metal               0.17       United States                 0.48
     band                      0.16       Studio album                  0.16
     bands                     0.08       United Kingdom                0.11
     death                     0.08       Bass guitar                   0.09
     old school                0.07       Single (music)                0.08
     sound                     0.06       Internet Movie Database       0.07
     albums                    0.05       Heavy metal music             0.07
     power metal               0.05       Allmusic                      0.06

Table 3. The most frequently extracted and inferred KPs from the “album reviews”
data set.
   Table 3 reports the most frequently extracted and inferred KPs. All the
documents in the set were associated with the inferred KP “Music genre”, and
97% of them with “Record label”, which clearly associates the texts with the
music domain. Evaluation and development, however, are still ongoing, and new
knowledge sources, such as domain-specific wikis and Urban Dictionary, are
being considered.

7   Conclusions
In this paper we proposed a truly domain-independent approach to both KP
extraction and inference, able to generate significant semantic metadata with
different layers of abstraction for any given text, without any need for training.
The KP extraction part of the system provides a very fine granularity, producing
KPs that may not be found in a controlled dictionary (such as Wikipedia) but
characterize the text well. Such KPs are extremely valuable for summarization
purposes and provide great accuracy when used as search keys. However, they
are not widely shared, which implies, from an information retrieval point of
view, a very low recall. On the other hand, the KP inference part generates only
KPs taken from a controlled dictionary (the union of the considered EKSs),
which are more likely to be general and, therefore, shared among a significant
number of texts.
    As shown in the previous section, our approach can annotate a set of doc-
uments with good precision; however, a few unrelated KPs may be inferred,
mostly due to ambiguities in the text and to the generalist nature of the ex-
ploited Knowledge Sources. These unrelated terms, fortunately, tend to appear
in a limited number of cases and to be clearly unrelated not only to the ma-
jority of the generated KPs, but also to each other. In fact, our next step in
this research will be precisely to identify such false positives by means of an
estimate of the Semantic Relatedness [17, 7] between terms, in order to identify,
for each generated KP, a list of related concepts and detect concept clusters in
the document.
    The proposed KP generation technique can be applied both in the Informa-
tion Retrieval domain and in the Adaptive Personalization one. The previous
version of the DIKPE system has already been integrated with good results in
RES [5], a personalized content-based recommender system for scientific papers
that suggests papers according to their similarity with one or more documents
marked as interesting by the user, and in the PIRATES framework [14] for tag
recommendation and automatic document annotation. We expect this extended
version of the system to provide an even more accurate and complete KP gen-
eration and, therefore, to improve the performance of these existing systems, in
this way supporting the creation of new Semantic Web Intelligence tools.

References
 1. Barker, K., Cornacchia, N.: Using noun phrase heads to extract document
    keyphrases. In: Advances in Artificial Intelligence, pp. 40–52. Springer (2000)
 2. Bracewell, D.B., Ren, F., Kuroiwa, S.: Multilingual single document keyword ex-
    traction for information retrieval. In: Proceedings of the 2005 IEEE International
    Conference on Natural Language Processing and Knowledge Engineering (NLP-
    KE’05). pp. 517–522. IEEE (2005)
 3. Danilevsky, M., Wang, C., Desai, N., Guo, J., Han, J.: KERT: Automatic extraction
    and ranking of topical keyphrases from content-representative document titles.
    arXiv preprint arXiv:1306.0271 (2013)
 4. D’Avanzo, E., Magnini, B., Vallin, A.: Keyphrase extraction for summarization
    purposes: the LAKE system at DUC-2004. In: Proceedings of the 2004 Document
    Understanding Conference (2004)
 5. De Nart, D., Ferrara, F., Tasso, C.: Personalized access to scientific publications:
    from recommendation to explanation. In: User Modeling, Adaptation, and Person-
    alization, pp. 296–301. Springer (2013)
 6. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms
    and representations for text categorization. In: Proceedings of the seventh interna-
    tional conference on Information and knowledge management. pp. 148–155. ACM
    (1998)
 7. Ferrara, F., Tasso, C.: Integrating semantic relatedness in a collaborative filtering
    system. In: Mensch & Computer Workshopband. pp. 75–82 (2012)
 8. Ferrara, F., Tasso, C.: Extracting keyphrases from web pages. In: Digital Libraries
    and Archives, pp. 93–104. Springer (2013)
 9. Litvak, M., Last, M.: Graph-based keyword extraction for single-document summa-
    rization. In: Proceedings of the workshop on multi-source multilingual information
    extraction and summarization. pp. 17–24. Association for Computational Linguis-
    tics (2008)
10. Marujo, L., Gershman, A., Carbonell, J., Frederking, R., Neto, J.P.: Supervised
    topical key phrase extraction of news stories using crowdsourcing, light filtering
    and co-reference normalization. arXiv preprint arXiv:1306.4886 (2013)
11. Medelyan, O., Witten, I.H.: Thesaurus based automatic keyphrase indexing. In:
    Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries. pp.
    296–297. ACM (2006)
12. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proceedings of
    EMNLP. vol. 4. Barcelona, Spain (2004)
13. Pouliquen, B., Steinberger, R., Ignat, C.: Automatic annotation of multilingual
    text collections with a conceptual thesaurus. arXiv preprint cs/0609059 (2006)
14. Pudota, N., Dattolo, A., Baruzzo, A., Ferrara, F., Tasso, C.: Automatic keyphrase
    extraction and ontology mining for content-based tag recommendation. Interna-
    tional Journal of Intelligent Systems 25(12), 1158–1186 (2010)
15. Sarkar, K.: A hybrid approach to extract keyphrases from medical documents.
    arXiv preprint arXiv:1303.1441 (2013)
16. Silverstein, C., Marais, H., Henzinger, M., Moricz, M.: Analysis of a very large web
    search engine query log. In: ACM SIGIR Forum. vol. 33, pp. 6–12. ACM (1999)
17. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using
    Wikipedia. In: AAAI. vol. 6, pp. 1419–1424 (2006)
18. Turney, P.D.: Learning to extract keyphrases from text. National Research Council,
    Institute for Information Technology, Technical Report ERB-1057 (1999)
19. Turney, P.D.: Learning algorithms for keyphrase extraction. Information Retrieval
    2(4), 303–336 (2000)
20. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: KEA:
    Practical automatic keyphrase extraction. In: Proceedings of the fourth ACM con-
    ference on Digital libraries. pp. 254–255. ACM (1999)