
A Keyphrase Generation Technique Based upon Keyphrase Extraction and Reasoning on Loosely Structured Ontologies

Dario De Nart

Carlo Tasso

carlo.tassog@uniud.it Artificial Intelligence Lab, Department of Mathematics and Computer Science, University of Udine, Italy

Associating meaningful keyphrases to documents and web pages is an activity that can greatly increase the accuracy of Information Retrieval and Personalization systems, but the growing amount of text data available is too large for extensive manual annotation. On the other hand, automatic keyphrase generation, a complex task involving Natural Language Processing and Knowledge Engineering, can significantly support this activity. Several different strategies have been proposed over the years, but most of them require extensive training data, which are not always available, suffer from high ambiguity and differences in writing style, are highly domain-specific, and often rely on well-structured knowledge that is very hard to acquire and encode. In order to overcome these limitations, we propose in this paper an innovative unsupervised and domain-independent approach that combines keyphrase extraction and keyphrase inference based on loosely structured, collaborative knowledge such as Wikipedia, Wordnik, and Urban Dictionary. This choice introduces a higher level of abstraction in the generated keyphrases (KPs) that allows us to determine whether two texts deal with similar topics even if they do not share a word.

Due to the constant growth of the amount of text data available on the Web and in digital libraries, the demand for automatic summarization and real-time information filtering has rapidly increased. However, such systems need metadata that can precisely and compactly represent the content of the document. As broadly discussed in the literature and proven by web usage analysis [16], it is particularly convenient for such metadata to come in the form of KeyPhrases (KPs), since they can be very expressive (much more than single keywords), fairly straightforward in their meaning, and have a high cognitive plausibility, because humans tend to think in terms of KPs rather than single keywords. In the rest of this paper we will refer to KP generation as the process of associating a meaningful set of KPs to a given text, regardless of their origin, while we will call KP extraction the act of selecting a set of KPs from the text and KP inference the act of associating to the text a set of KPs that may not be found inside it. KP generation is a trivial and intuitive task for humans, since anyone can tell at least the main topics of a given text, or decide whether it belongs to a certain domain (news item, scientific literature, narrative, etc.) or not, but it can be extremely hard for a machine, since most of the documents available lack any kind of semantic hint.

Over the years several authors have addressed this issue, proposing different approaches to both KP extraction and inference, but, in our opinion, each one of them has severe practical limitations that prevent massive employment of automatic KP generation in Information Retrieval, Social Tagging, and Adaptive Personalization. Such limitations are the need for training data, the impossibility of associating to a given text keyphrases which are not already included in that text, high domain specificity, and the need for structured, detailed, and extensive domain knowledge coded in the form of a thesaurus or an ontology.

In this paper we propose an unsupervised KP generation method that combines KP extraction and KP inference based on ontology reasoning over knowledge sources that, though not formal ontologies, can be seen as loosely structured ones, in order to associate a meaningful and detailed set of keyphrases with any given text.

The rest of the paper is organized as follows: in Section 2 we briefly introduce some related works; in Section 3 we give an overview of the proposed system; in Section 4 we present our keyphrase extraction technique; in Section 5 we illustrate our keyphrase inference technique; in Section 6 we discuss some experimental results and, finally, in Section 7 we conclude the paper.

2 Related Work

Many works over the past few years have discussed different solutions for the problem of automatically tagging documents and Web pages as well as the possible applications of such technologies in the fields of Personalization and Information Retrieval in order to significantly reduce information overload and increase accuracy. Both keyphrase extraction and inference have been widely discussed in the literature. Several different keyphrase extraction techniques have been proposed, which usually are structured into two phases:
- a candidate phrase identification phase, in which all the possible phrases are detected in the text;
- a selection phase, in which only the most significant of the above phrases are chosen as keyphrases.

The wide span of proposed methods can be roughly divided into two distinct categories:
- Supervised approaches: the underlying idea of these methods is that KP extraction can be seen as a classification problem and therefore solved with a sufficient amount of (manually annotated) training data and machine learning algorithms [19]. Several authors addressed the problem in this direction [18] and many systems that implement supervised approaches are available, such as KEA [20], Extractor, and LAKE [4]. All the above systems can be extremely effective and, as long as reliable data sets are available, can be flawlessly applied to any given domain [10]. However, requiring training data in order to work properly implies two major drawbacks: (i) the quality of the extraction process relies on the quality of the training data and (ii) a model trained on a specific domain will not fit another application domain unless it is trained again.
- Unsupervised approaches: this second class of methods eliminates the need for training data by selecting candidate KPs according to some ranking strategy. Most of the proposed systems rely on the identification of noun phrases (i.e. phrases made of just nouns) and then proceed with a further selection based on heuristics such as the frequency of the phrase [1] or upon phrase clustering [2]. A third approach, proposed by [12] and [9], exploits a graph-based ranking model, bearing much similarity to the well-known PageRank algorithm, in order to select significant KPs and identify related terms that can be summarized by a single phrase. All the above techniques share the same advantage over the supervised strategies, that is, being truly domain independent, since they rely on general principles and heuristics and therefore need no training data. However, such generalist approaches may not always lead to excellent results, especially when dealing with peculiar documents whose structure does not satisfy the assumptions that drive the KP extraction process.

Hybrid approaches have been proposed as well, incorporating semi-supervised domain knowledge in an otherwise unsupervised extraction strategy [15], but they still remain highly domain-specific. Keyphrase extraction, however, is severely limited by the fact that it can ultimately return only words contained in the input document, which are highly prone to ambiguity and subject to the nuances of different writing styles (e.g. an author can write "mining frequent patterns" where another one would write "frequent pattern mining"). Keyphrase inference can overcome these limitations and has been widely explored in the literature as well, spanning from systems that simply combine words appearing in the text in order to construct rather than extract phrases [3] to systems that assign keyphrases that may be built with terms that never appear in the document. In the latter case, KPs come from a controlled dictionary, possibly an ontology; in such a case, a classifier is trained in order to find which entries of the exploited dictionary may fit the text [6]. If the dictionary of possible KPs is an ontology, its structure can be exploited in order to provide additional evidence for inference [13] and, by means of ontological reasoning, to evaluate relatedness between terms [11]. In [14] a KP inference technique is discussed, which is based on a very specific domain OWL ontology and which combines both KP extraction and inference, in the context of a vast framework for personalized document annotation. KP inference based on dictionaries, however, is strongly limited by the size, the domain coverage, and the specificity level of the considered dictionary.

3 System Overview

In order to test our approach and to support our claims, we developed a new version of the system presented in [14]. Its original contribution is the exploitation of a number of generalist online External Knowledge Sources, rather than a formal ontology, in order to improve extraction quality and to infer meaningful KPs not included in the input text while preserving domain independence.

Figure 1 presents the overall organization of the proposed system. It consists of the following main components:
- A KP Extraction Module (KPEM), devoted to analysing the text and extracting meaningful KPs from it. It is supported by some linguistic resources, such as a POS tagger (for the English language) and a stopwords database, and it accesses some online External Knowledge Sources (EKS), mainly exploited in order to provide support to the candidate KPs identified in the text (as explained in the following section). The KPEM receives an unstructured text as input and produces a ranked list of KPs as output, which is stored in an Extracted Keyphrases Data Base (EKPDB).
- A KP Inference Module (KPIM), which works on the KP list produced by the KPEM and is devoted to inferring new KPs, (possibly) not already included in the input text. It relies on some ontological reasoning based on access to the External Knowledge Sources, exploited in order to identify concepts which are related to the concepts referred to by the KPs previously extracted by the KPEM. Inferred KPs are stored in the Inferred KP Data Base (IKPDB).

Access to the online External Knowledge Sources is provided by a Generalized Knowledge Gateway (GKG). Both the EKPDB and the IKPDB can be accessed through Web Services by external applications, providing in this way an advanced KP generation service to interested Web users, who can exploit this capability in other target applications.

4 Phrase Extraction

KPEM is an enhanced version of DIKPE, the unsupervised, domain-independent KP extraction approach described in [14] and [8]. In a nutshell, DIKPE generates a large set of candidate KPs; the exploited approach then merges different types of knowledge in order to identify meaningful concepts in a text, also trying to model a human-like KP assignment process. In particular we use: linguistic knowledge (POS tagging, sentence structure, punctuation); statistical knowledge (frequency, tf/idf, ...); knowledge about the structure of a document (position of the candidate KP in the text, title, subtitles, ...); meta-knowledge provided by the author (HTML tags, ...); and knowledge coming from online external knowledge sources, useful for validating candidate keyphrases which have been socially recognized, for example, in collaborative wikis (e.g. Wikipedia, Wordnik, and other online resources).

By means of the above knowledge sources, each candidate phrase is characterized by a set of features, such as, for example:
- Frequency: the frequency of the phrase in the text;
- Phrase Depth: at which point of the text the phrase occurs for the first time; the sooner it appears, the higher the value;
- Phrase Last Occurrence: at which point of the text the phrase occurs for the last time; the later it appears, the higher the value;
- Life Span: the fraction of text between the first and the last occurrence of the phrase;
- POS value: a parameter taking into account the grammatical composition of the phrase, excluding some patterns and assigning higher priority to other patterns (typically, for example but not exclusively, it can be relevant to consider the number of nouns in the phrase over the number of words in the phrase);
- WikiFlag: a parameter taking into account whether or not the phrase is an entry of the collaborative external knowledge sources (EKS).

A weighted mean of the above features, called Keyphraseness, is then computed and the KPs are sorted in descending Keyphraseness order. The weight of each feature can be tuned in order to fit particular kinds of text, but, usually, a generalist preset can be used with good results. The topmost n KPs are finally suggested.
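The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the feature keys, the normalisation of the positional features to [0, 1], and the weights used in the example are all hypothetical, since the paper gives neither the exact formulas nor the generalist preset.

```python
def positional_features(first, last, length):
    """Position features from the token offsets of the first and last occurrence.

    `first` and `last` are 0-based offsets into a text of `length` tokens.
    The normalisation is an assumption; the text only states the direction
    in which each value grows.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return {
        "depth": 1.0 - first / length,         # earlier first occurrence -> higher
        "last_occurrence": last / length,      # later last occurrence -> higher
        "life_span": (last - first) / length,  # fraction of text spanned
    }


def keyphraseness(features, weights):
    """Weighted mean of the feature values; missing features count as 0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * features.get(name, 0.0) for name, w in weights.items()) / total


def rank(candidates, weights):
    """Map {phrase: features} to a list of (phrase, score), best first."""
    scored = [(p, keyphraseness(f, weights)) for p, f in candidates.items()]
    return sorted(scored, key=lambda ps: ps[1], reverse=True)
```

Calling `rank(candidates, weights)[:n]` would then reproduce the "topmost n" selection; tuning the system for a particular kind of text amounts to passing a different `weights` dictionary.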

In this work, we extended the DIKPE system with the GKG to access EKS, allowing access to multiple knowledge sources at the same time. We also added a more general version of the WikiFlag feature. This feature is computed as follows: if the phrase matches an entry in at least one of the considered knowledge sources, then its value is set to 1; otherwise the phrase is split into single terms and the WikiFlag value is the fraction of terms that have a match in at least one of the considered knowledge sources. By doing so, a KP that does not match as a phrase, but is made of terms that match as single words, still gets a high score, but lower than a KP that features a perfect match. The WikiFlag feature is processed like all the other features, contributing to the computation of the Keyphraseness and, therefore, influencing the ranking of the extracted KPs. The rationale of this choice is that a KP is important insofar as it represents a meaningful concept or entity, rather than a random combination of words, and matching a whole phrase against collaborative human-made knowledge sources (as the EKS are) makes it much more likely that the phrase is meaningful, providing a strong form of human/social validation. This also reduces the tendency of the system to return typos, document parsing errors, and other meaningless strings as false positives.
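A minimal sketch of the generalized WikiFlag rule follows. Each knowledge source is modelled here as a plain callable that says whether a string is an entry; this interface, the lower-casing, and the whitespace tokenisation are assumptions made for illustration and are not the GKG interface.

```python
def wiki_flag(phrase, sources):
    """WikiFlag of `phrase` against a list of lookup callables (str -> bool).

    Returns 1.0 on a whole-phrase match in any source, otherwise the
    fraction of single terms that match in any source.
    """
    def known(text):
        return any(lookup(text) for lookup in sources)

    phrase = phrase.lower().strip()
    if phrase and known(phrase):
        return 1.0
    terms = phrase.split()
    if not terms:
        return 0.0
    return sum(1 for t in terms if known(t)) / len(terms)
```

With an in-memory set standing in for a source, `wiki_flag("pattern xyzzy", [{"pattern"}.__contains__])` evaluates to 0.5. Read literally, the rule also gives 1.0 to a phrase whose terms all match individually; how the system keeps such a phrase below a whole-phrase match is not specified in the text, so this sketch does not attempt it.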

Another improvement over the original DIKPE approach is that, instead of suggesting the top n extracted KPs, the new system evaluates the decreasing trend of Keyphraseness among the ordered KPs, detects the first significant drop in the Keyphraseness value, and suggests all the KPs occurring before that (dynamic) threshold. By doing so, the system suggests a variable number of high-scored KPs, while the previous version suggested a fixed number of KPs, which could have been either too small or too large for the given text.
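The dynamic threshold can be illustrated as follows. The text does not define what makes a drop "significant", so this sketch assumes a relative drop between consecutive scores of at least `min_drop` (a hypothetical parameter with an arbitrary default).

```python
def cut_at_first_drop(scored, min_drop=0.25):
    """Keep the KPs ranked before the first significant score drop.

    `scored` is a list of (phrase, keyphraseness) pairs sorted in
    descending score order. A drop is treated as significant when the
    score falls by at least `min_drop` relative to the previous score.
    """
    for i in range(1, len(scored)):
        prev, cur = scored[i - 1][1], scored[i][1]
        if prev > 0 and (prev - cur) / prev >= min_drop:
            return scored[:i]
    return list(scored)  # no significant drop: suggest everything
```

For scores 0.90, 0.85, 0.80, 0.40, 0.39 the first relative drop above 25% occurs between the third and fourth KP, so three KPs are suggested; a flatter score curve would yield a longer list.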

5 Phrase Inference

The KP Inference Module (KPIM), as well as the knowledge-based WikiFlag feature described in the previous section, relies on a set of external knowledge sources that are accessed via the web. We assume that (i) there is a way to match extracted KPs with entities described in the EKSs (e.g. querying the exploited service using the KP as search key) and (ii) each one of the EKSs considered is organized according to some kind of hierarchy, as shown in Figure 2, even if very weak and loosely structured, in which it is possible to associate to any entity a set of parent entities and another set made of related entities. Such sets may be empty, since we assume neither that each entity is linked to at least one other entity, nor that there exists a root entity that is an ancestor of all the other entities in the ontology.

Even if such a structure is loose, assuming its existence is not trivial at all, but an increasing number of collaborative resources allow users to classify and link together knowledge items, generating a pseudo-ontology. Clear examples of this tendency are Wikipedia, where almost any article contains links to other articles and many articles are grouped into categories, and Wordnik, an online collaborative dictionary where any word has associated sets of hypernyms, synonyms, hyponyms, and related words. Recently several entertainment sites, like Urban Dictionary, have also begun to provide these possibilities, making them eligible knowledge sources for our approach. Knowledge sources may be either generalist (like Wikipedia) or specific (like the many domain-specific wikis hosted on wikia.com), and several different EKSs can be exploited at the same time in order to provide better results.

In the case of Wikipedia, parent entities are given by the categories, which are thematic groups of articles (e.g. "Software Engineering" belongs to the "Engineering Disciplines" category). An entry may belong to several categories; for example, the entry on "The Who" belongs to the "musical quartets" category as well as to the "English hard rock musical groups" one and the "Musical groups established in 1964" one. Related entities, instead, can be deduced from the links contained in the entry associated with the given entity: such links can be very numerous and heterogeneous, but the most closely related ones are often grouped into one or more templates, which are the thematic collections of internal Wikipedia links usually displayed at the bottom of the page, as shown in Figure 3. For instance, in a page dedicated to a film director, it is very likely to find a template containing links to all the movies he directed or the actors he worked with.

Wordnik, instead, provides hierarchical information explicitly by associating to any entity lists of hypernyms (parent entities) and synonyms (related entities).

The inference algorithm considers the topmost half of the extracted KPs, which typically is still a significantly larger set than the one suggested, and, for each KP that can be associated with an entity, retrieves from each EKS a set of parent entities and a set of related entities. If a KP corresponds to more than one entity in one or more EKSs, all of the retrieved entities are taken into account. The sets associated with single KPs are then merged into a table of related entities and a table of parent entities for the whole text. Each retrieved entity is scored according to the sum of the Keyphraseness values of the KPs from which it has been derived, and the entities are then sorted by descending score. The top entries of these tables are suggested as meaningful KPs for the input document.
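The merging and scoring step can be sketched as below. The lookup interface is hypothetical: each EKS is represented by a callable that, given a KP, returns one `(parents, related)` pair per matching entity. Rounding the "topmost half" up for odd-sized lists, counting each KP at most once per inferred entity, and the `top_n` cut-off are further assumptions of this sketch.

```python
from collections import defaultdict


def infer_keyphrases(extracted, sources, top_n=5):
    """Infer parent and related entities from the best-ranked extracted KPs.

    `extracted`: list of (kp, keyphraseness), sorted by descending score.
    `sources`:   list of callables kp -> [(parents, related), ...], one
                 pair per entity the KP matches in that source.
    Returns (top_parents, top_related), each a list of (entity, score).
    """
    half = extracted[: (len(extracted) + 1) // 2]
    parent_scores = defaultdict(float)
    related_scores = defaultdict(float)

    for kp, score in half:
        parents, related = set(), set()
        for lookup in sources:
            # every entity the KP may refer to is taken into account
            for ent_parents, ent_related in lookup(kp):
                parents.update(ent_parents)
                related.update(ent_related)
        for entity in parents:
            parent_scores[entity] += score
        for entity in related:
            related_scores[entity] += score

    def top(table):
        # highest score first; ties broken alphabetically for determinism
        return sorted(table.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

    return top(parent_scores), top(related_scores)
```

Because an entity accumulates the score of every KP that leads to it, a parent shared by several high-scored KPs rises to the top of the table, while parents reached through only one reading of an ambiguous KP stay lower.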

By doing so, we select only entities which are related or parent to a significant number of high-scored KPs, addressing the problem of polysemy among the extracted KPs. For instance, suppose we extracted "Queen" and "Joy Division" from the same text (Figure 4): they both are polysemic phrases, since the first may refer to the English band as well as to a monarch, and the latter to the English band or to Nazi concentration camps. However, since they appear together, and they are both part of the "musical quartets" category in Wikipedia, it can be deduced that the text is about music rather than politics or World War II.

6 Evaluation

Formative tests were performed in order to test the accuracy of the inferred KPs and their ability to add meaningful information to the set of extracted KPs, regardless of the domain covered by the input text. Three data sets, dealing with different topics, were processed, article by article, with the same feature weights and exploiting Wikipedia and Wordnik as External Knowledge Sources. For each article a list of extracted KPs and one of inferred KPs were generated, then the occurrences of each KP were counted, in order to evaluate which portion of the data set is covered by each KP. We call set coverage the fraction of the data set labelled with a single KP. Since the topics covered in the texts included in each data set are known a priori, we expect the system to generate KPs that associate the majority of the texts in the data set with their specific domain topic.
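Set coverage as defined above is straightforward to compute; the following sketch assumes the generated KPs are available as one collection per article (the function name and the input layout are illustrative).

```python
from collections import Counter


def set_coverage(kps_per_document):
    """Fraction of documents labelled with each KP.

    `kps_per_document` holds one iterable of KPs per article; a KP that
    is repeated within the same article is counted once for that article.
    """
    n_docs = len(kps_per_document)
    if n_docs == 0:
        return {}
    counts = Counter()
    for kps in kps_per_document:
        counts.update(set(kps))
    return {kp: count / n_docs for kp, count in counts.items()}
```

A KP with coverage 1.0 labels every article in the data set, which is the behaviour expected of an inferred KP that names the domain the whole data set belongs to.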

The first data set contained 113 programming tutorials, spanning from brief introductions published on blogs and forums to extensive articles taken from books and journals, covering both practical and theoretical aspects of programming. A total of 776 KPs were extracted and 297 were inferred. Table 1 reports the most frequently extracted and inferred KPs. As expected, extracted KPs are highly specific and tend to characterize a few documents in the set (the most frequent KP covers just 13% of the data set), while inferred ones provide a higher level of abstraction, resulting in a higher coverage over the considered data set. However, some inferred KPs are not accurate, such as "Botanical nomenclature", which clearly derives from the presence of terms such as "tree", "branch", "leaf", and "forest" that are frequently used in Computer Science, and "Aristotle", which comes from the frequent references to Logic, which Wikipedia frequently associates with the Greek philosopher.

The second data set contained 159 car reviews taken from American and British magazines, written by professional journalists. Unlike the previous data set, in which all the texts share a very specific language and provide technical information, in this set different writing styles and different kinds of target audiences are present. Some of the reviews are very specific, focusing on technical details, while others aim at entertaining rather than informing. Most of the considered texts, however, stand at some point between these two ends, providing a good deal of technical information together with an accessible and entertaining style.

Table 2 reports the most frequently extracted and inferred KPs. While extracted KPs clearly identify the automotive domain, inferred ones do not, with only 44% of the considered texts being covered by the "Automobile" KP and 64% being labelled with "English-language films". However, this is mostly due to the fact that several reviews tend to stress a car's presence in popular movies (e.g. Aston Martin in the 007 franchise or any given Japanese car in the Fast and Furious franchise), and only 18 out of 327 (5.5%) different inferred KPs deal with cinema and television. KPs such as "United States" and "United Kingdom" are also frequently inferred, due to the fact that the reviewed cars are mostly designed for the USA and UK markets, have been tested in such countries, and several manufacturers are based in those countries. As a side note, 98% of the considered texts are correctly associated with the manufacturer of the reviewed car.

The third data set contained reviews of 211 heavy metal albums published in 2013. Reviews were written by various authors, both professionals and non-professionals, and combine a wide spectrum of writing styles, from utterly specific, almost scientific, to highly sarcastic, with many puns and popular culture references.

Table 3. Most frequently extracted and inferred KPs in the album review data set.

Extracted Keyphrase   Set coverage   Inferred Keyphrase
metal                 0.23           Music genre
album                 0.21           Record label
death metal           0.17           Record producer
black metal           0.17           United States
band                  0.16           Studio album
bands                 0.08           United Kingdom
death                 0.08           Bass guitar
old school            0.07           Single (music)
sound                 0.06           Internet Movie Database
albums                0.05           Heavy metal music
power metal           0.05           Allmusic

Table 3 reports the most frequently extracted and inferred KPs. All the documents in the set were associated with the inferred KP "Music Genre" and 97% of them with "Record Label", which clearly associates the texts with the music domain. Evaluation and development, however, are still ongoing, and new knowledge sources, such as domain-specific wikis and Urban Dictionary, are being considered.

7 Conclusions

In this paper we proposed a truly domain-independent approach to both KP extraction and inference, able to generate significant semantic metadata with different layers of abstraction for any given text without the need for training. The KP extraction part of the system provides a very fine granularity, producing KPs that may not be found in a controlled dictionary (such as Wikipedia), but characterize the text. Such KPs are extremely valuable for the purpose of summarization and provide great accuracy when used as search keys. However, they are not widely shared, which means, from an information retrieval point of view, a very low recall. On the other hand, the KP inference part generates only KPs taken from a controlled dictionary (the union of the considered EKSs) that are more likely to be general and, therefore, shared among a significant number of texts.

As shown in the previous section, our approach can annotate a set of documents with good precision; however, a few unrelated KPs may be inferred, mostly due to ambiguities in the text and to the generalist nature of the exploited knowledge sources. These unrelated terms, fortunately, tend to appear in a limited number of cases and to be clearly unrelated not only to the majority of the generated KPs, but also to each other. In fact, our next step in this research will be precisely to identify such false positives by means of an estimate of the Semantic Relatedness [17], [7] between terms, in order to identify, for each generated KP, a list of related concepts and detect concept clusters in the document.

The proposed KP generation technique can be applied both in the Information Retrieval domain and in the Adaptive Personalization one. The previous version of the DIKPE system has already been integrated with good results in RES [5], a personalized content-based recommender system for scientific papers that suggests papers according to their similarity with one or more documents marked as interesting by the user, and in the PIRATES framework [14] for tag recommendation and automatic document annotation. We expect this extended version of the system to provide an even more accurate and complete KP generation and, therefore, to improve the performance of these existing systems, in this way supporting the creation of new Semantic Web Intelligence tools.

References

1. Barker, K., Cornacchia, N.: Using noun phrase heads to extract document keyphrases. In: Advances in Artificial Intelligence, pp. 40–52. Springer (2000)

2. Bracewell, D.B., Ren, F., Kuriowa, S.: Multilingual single document keyword extraction for information retrieval. In: Natural Language Processing and Knowledge Engineering, 2005. IEEE NLP-KE'05. Proceedings of 2005 IEEE International Conference on. pp. 517–522. IEEE (2005)

3. Danilevsky, M., Wang, C., Desai, N., Guo, J., Han, J.: Kert: Automatic extraction and ranking of topical keyphrases from content-representative document titles. arXiv preprint arXiv:1306.0271 (2013)

4. D'Avanzo, E., Magnini, B., Vallin, A.: Keyphrase extraction for summarization purposes: The lake system at duc-2004. In: Proceedings of the 2004 document understanding conference (2004)

5. De Nart, D., Ferrara, F., Tasso, C.: Personalized access to scientific publications: from recommendation to explanation. In: User Modeling, Adaptation, and Personalization, pp. 296–301. Springer (2013)

6. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. In: Proceedings of the seventh international conference on Information and knowledge management. pp. 148–155. ACM (1998)

7. Ferrara, F., Tasso, C.: Integrating semantic relatedness in a collaborative filtering system. In: Mensch & Computer Workshopband. pp. 75–82 (2012)

8. Ferrara, F., Tasso, C.: Extracting keyphrases from web pages. In: Digital Libraries and Archives, pp. 93–104. Springer (2013)

9. Litvak, M., Last, M.: Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization. pp. 17–24. Association for Computational Linguistics (2008)

10. Marujo, L., Gershman, A., Carbonell, J., Frederking, R., Neto, J.P.: Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. arXiv preprint arXiv:1306.4886 (2013)

11. Medelyan, O., Witten, I.H.: Thesaurus based automatic keyphrase indexing. In: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries. pp. 296–297. ACM (2006)

12. Mihalcea, R., Tarau, P.: Textrank: Bringing order into texts. In: Proceedings of EMNLP. vol. 4. Barcelona, Spain (2004)

13. Pouliquen, B., Steinberger, R., Ignat, C.: Automatic annotation of multilingual text collections with a conceptual thesaurus. arXiv preprint cs/0609059 (2006)

14. Pudota, N., Dattolo, A., Baruzzo, A., Ferrara, F., Tasso, C.: Automatic keyphrase extraction and ontology mining for content-based tag recommendation. International Journal of Intelligent Systems 25(12), 1158–1186 (2010)

15. Sarkar, K.: A hybrid approach to extract keyphrases from medical documents. arXiv preprint arXiv:1303.1441 (2013)

16. Silverstein, C., Marais, H., Henzinger, M., Moricz, M.: Analysis of a very large web search engine query log. In: ACM SIGIR Forum. vol. 33, pp. 6–12. ACM (1999)

17. Strube, M., Ponzetto, S.P.: Wikirelate! computing semantic relatedness using wikipedia. In: AAAI. vol. 6, pp. 1419–1424 (2006)

18. Turney, P.D.: Learning to extract keyphrases from text. National Research Council, Institute for Information Technology, Technical Report ERB-1057 (1999)

19. Turney, P.D.: Learning algorithms for keyphrase extraction. Information Retrieval 2(4), 303–336 (2000)

20. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: Kea: Practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on Digital libraries. pp. 254–255. ACM (1999)