=Paper=
{{Paper
|id=Vol-2038/paper4
|storemode=property
|title=Compiling Keyphrase Candidates for Scientific Literature Based on Wikipedia
|pdfUrl=https://ceur-ws.org/Vol-2038/paper4.pdf
|volume=Vol-2038
|authors=Hung-Hsuan Chen,Jian Wu,C. Lee Giles
|dblpUrl=https://dblp.org/rec/conf/ercimdl/Chen0G17
}}
==Compiling Keyphrase Candidates for Scientific Literature Based on Wikipedia==
Hung-Hsuan Chen (Computer Science and Information Engineering, National Central University; hhchen@ncu.edu.tw), Jian Wu, and C. Lee Giles (Information Sciences and Technology, Pennsylvania State University; {jxw394,giles}@ist.psu.edu)

===Abstract===
Keyphrase candidate compilation is a crucial step for both supervised and unsupervised keyphrase extractors. Traditional methods usually rely on the lexical or frequency properties of phrases to compile the candidate list. However, terms collected based on these properties are not always semantically meaningful. We show that Wikipedia can be a valuable auxiliary resource for compiling meaningful keyphrase candidates for scientific literature. We conducted empirical experiments on digital libraries of two disciplines, namely Computer Science and Chemistry. The results suggest that Wikipedia has good coverage of the two disciplines and has the potential to be applied to other scientific disciplines.

Keywords: Keyphrase extraction, keyphrase candidate compilation, Wikipedia

===1 Introduction===
Extracting keyphrases from articles is essential for natural language processing and digital libraries. The extracted keyphrases can also be the foundation of other services, such as expert search [3], collaborator search [1], venue search, and algorithm search [10]. Although the problem has been investigated for decades, recent research suggests that automatic keyphrase identification is still challenging [4, 5].

Keyphrase extraction can be supervised or unsupervised. Supervised keyphrase extraction typically formulates the task as a binary classification problem in which a model is trained to determine whether a phrase is a keyphrase. Such a method is highly dependent on the training data. As a result, the model could be biased toward a certain domain and less effective in others.
In addition, it is not easy to obtain numerous articles with high-quality keyphrases for training. Unsupervised keyphrase extractors, on the other hand, rely on the characteristics of the words or phrases to infer their likelihood of being keyphrases. Common techniques include TF-IDF and its variations, graph-based ranking, cluster-based ranking, etc. [4]

Both supervised and unsupervised keyphrase extractors usually require generating a list of potential keyphrases, called keyphrase candidates, before performing keyphrase extraction. Since the final set of extracted keyphrases is a subset of the keyphrase candidates, the candidate list should include as many potential keyphrases as possible to achieve a higher recall. However, naïvely adding terms to the list may hurt the analysis efficiency and lower the precision. Several heuristics are commonly applied to compile the list. We list three possible methods below. First, only terms of certain parts of speech (POS), such as nouns or noun phrases, are allowed into the list [7]. Second, only ''n''-grams conforming to certain conditions are collected [9]. Third, the stop words are removed and the remaining single-word terms are treated as the candidates [6]. Although these approaches are widely used, they analyze only the lexical properties, not the semantic properties, of the terms in the article. As a result, they are very likely to include trivial terms, such as "experimental results" and "difficult problem", in the candidate list.

We propose to utilize Wikipedia as an auxiliary resource to compile the list of keyphrase candidates for scientific literature. Since Wikipedia is manually edited, the titles, the links, and the category structure typically contain non-trivial terms. Experiments were performed on two scientific domains, namely Computer Science and Chemistry.
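To make the second and third heuristics concrete, here is a minimal sketch of ''n''-gram candidate compilation with stop-word filtering. The stop-word list, the tokenizer regex, and the boundary condition are illustrative assumptions, not the exact conditions used in [9] or [6]:

```python
import re

# A tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "we", "is", "are", "to"}

def ngram_candidates(text, max_n=3):
    """Collect n-grams (n <= max_n) that neither start nor end with a
    stop word -- one common lexical heuristic for candidate compilation."""
    tokens = re.findall(r"[a-z0-9&-]+", text.lower())
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            candidates.add(" ".join(gram))
    return candidates

print(sorted(ngram_candidates("the support vector machine is a model")))
```

Note that such a purely lexical filter happily keeps trivial phrases like "difficult problem", which is exactly the weakness the Wikipedia-based approach addresses.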
The results suggest that Wikipedia is a promising resource for keyphrase candidate compilation and has good coverage of the two disciplines.

===2 Methodology===
We collected the titles and the anchor texts (i.e., the visible and clickable text in a hyperlink) of Wikipedia pages to compile keyphrase candidates. Compared to the POS-tagger and ''n''-gram based approaches, using Wikipedia has three advantages, described below.

First, the title or the anchor text of a Wikipedia page typically represents one concept, such as a person, an algorithm, a molecule, etc. Thus, it is usually appropriate to treat the entire title or the entire anchor text as exactly one keyphrase candidate, no matter how long or how short the phrase is. On the other hand, when using only lexical properties, it is sometimes challenging to automatically decide which terms should be joined together to represent one concept. For example, the term "Barnes & Noble" should be one phrase representing the giant book corporation, but it is very likely to be treated as two separate terms "Barnes" and "Noble" by a lexical-based analyzer; the term "Markov chain Monte Carlo" should be one term, although both "Markov chain" and "Monte Carlo" are valid concepts by themselves. Several languages, such as Thai, Chinese, and Japanese, can be even more challenging when deciding whether a sequence of characters forms a meaningful concept, because these languages place no space boundaries between words and are therefore difficult to tokenize.

Second, the title or the anchor text of a Wikipedia page is usually written in its commonly used form. Therefore, we do not need to worry about converting a term into its normally used form, such as converting a plural noun into a singular noun. Traditionally, such conversion is accomplished by stemming. However, not every term should be expressed in its stemmed form.
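As a rough illustration of how titles and anchor texts can be harvested, the sketch below pulls both link targets and visible anchor texts out of raw wikitext markup. The regex and the lowercasing normalization are assumptions for illustration, not the crawler actually used in this work:

```python
import re

# Matches [[Target]], [[Target|anchor text]], and [[Target#Section|anchor]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")

def wiki_phrases(wikitext):
    """Collect candidate phrases from wikitext: every link target (page
    title) and every visible anchor text becomes one candidate."""
    phrases = set()
    for target, anchor in LINK_RE.findall(wikitext):
        phrases.add(target.strip().lower())
        if anchor:
            phrases.add(anchor.strip().lower())
    return phrases

sample = "See [[Markov chain Monte Carlo]] and [[Support vector machine|SVMs]]."
print(sorted(wiki_phrases(sample)))
```

Because each link is taken whole, "Markov chain Monte Carlo" arrives as a single candidate rather than two fragments.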
Fig. 1. Empirical probability mass function of the number of keyphrases found in the title and the abstract of a document in CiteSeerX: (a) 10,000 randomly selected documents; (b) 10,000 randomly selected documents with at least 4 words in the title and 20 words in the abstract.

For example, we mostly say "social media" rather than "social medium", and we use "data analysis" instead of "datum analysis". In addition, a stemmer may make mistakes, such as over-stemming or under-stemming, because natural languages are not always regularly constructed. The stemming problem can be more severe in other languages, such as Hebrew and Arabic, which have much more complex morphological rules than English.

Third, Wikipedia can be helpful in identifying ambiguous terms or acronyms with many possible expansions. Given that the targeted documents are within a certain domain, say Computer Science, we can crawl only the pages related to that topic. In practice, we utilize the category structure of Wikipedia to perform focused crawling. An ambiguous term, such as SVM, may refer to the Saskatchewan Volunteer Medal, a civil decoration for volunteers in Canada; the Schuylkill Valley Metro, a proposed railway system linking Philadelphia and Reading in Pennsylvania; or Support Vector Machine, a powerful machine learning technique. When crawling Wikipedia pages of the Computer Science domain, SVM would naturally be resolved to Support Vector Machine, since the other alternatives do not fall in the Computer Science category.

To identify the keyphrases of a document, we compared its text with the candidate list and claimed a phrase to be a keyphrase if it appears in the candidate list.
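The category-based disambiguation described above amounts to keeping only the senses whose categories intersect the crawled domain. A minimal sketch, where the sense-to-category mapping is hypothetical sample data rather than real Wikipedia category assignments:

```python
def resolve(senses, domain_categories):
    """Keep only the senses of an ambiguous term whose Wikipedia
    categories overlap the categories of the crawled domain."""
    return [title for title, cats in senses.items()
            if cats & domain_categories]

# Hypothetical candidate senses of the ambiguous acronym "SVM".
svm_senses = {
    "Support vector machine": {"Machine learning", "Classification algorithms"},
    "Saskatchewan Volunteer Medal": {"Canadian awards"},
    "Schuylkill Valley Metro": {"Proposed railways"},
}
print(resolve(svm_senses, {"Machine learning", "Computer science"}))
# -> ['Support vector machine']
```

A focused crawl restricted to the Computer Science category never visits the other senses in the first place, so the filter falls out of the crawl for free.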
To efficiently search the candidate list and perform longest-prefix-matching lookup, we created a trie (a prefix tree) of the keyphrase candidates, as suggested in [9].

Fig. 2. Empirical probability mass function of the number of keyphrases found in the title and the abstract of a document in RSC: (a) 10,000 randomly selected documents; (b) 10,000 randomly selected documents with at least 4 words in the title and 20 words in the abstract.

Table 1. Statistics of the number of keyphrases found per document in CiteSeerX and RSC.
{| class="wikitable"
! Set ID !! Description !! Min !! Q1 !! Q2 !! Mean !! Q3 !! Max
|-
| A || 10,000 randomly selected CiteSeerX documents || 0 || 4 || 7 || 7.409 || 10 || 28
|-
| B || 10,000 randomly selected RSC documents || 0 || 8 || 13 || 15.413 || 22 || 66
|-
| C || 10,000 CiteSeerX documents whose titles have at least 4 words and abstracts have at least 20 words || 0 || 5 || 8 || 8.313 || 11 || 31
|-
| D || 10,000 RSC documents whose titles have at least 4 words and abstracts have at least 20 words || 2 || 11 || 16 || 17.741 || 24 || 67
|}

===3 Experiments===

====3.1 Experimental Data====
Wikipedia is edited manually, and therefore a title or an anchor text typically represents a meaningful topic. However, the coverage of Wikipedia in scientific domains, such as Computer Science or Chemistry, is unknown. To answer this question, we conducted an empirical study on two digital libraries of different disciplines: (1) CiteSeerX, a digital library currently focused on Computer Science and several related fields, and (2) the publicly available metadata of documents from the Royal Society of Chemistry (RSC), a professional chemistry society in the UK. We randomly selected 10,000 documents from CiteSeerX as Set A and 10,000 documents from RSC as Set B. Using the title and the abstract, we counted the number of terms that appeared in the keyphrase candidate list.
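The trie-based longest-prefix matching described above can be sketched with nested dictionaries over word-level tokens; this is an illustrative implementation under our own design choices, not the code of [9]:

```python
def build_trie(phrases):
    """Build a word-level trie; "$" marks the end of a complete phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node["$"] = True
    return root

def match_keyphrases(tokens, trie):
    """Scan the token list; at each position, greedily take the longest
    candidate phrase that starts there, then jump past it."""
    found, i = [], 0
    while i < len(tokens):
        node, longest = trie, 0
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if "$" in node:           # a complete candidate ends at j
                longest = j - i + 1
        if longest:
            found.append(" ".join(tokens[i:i + longest]))
            i += longest
        else:
            i += 1
    return found

trie = build_trie(["markov chain", "markov chain monte carlo"])
tokens = "we apply markov chain monte carlo sampling".split()
print(match_keyphrases(tokens, trie))
# -> ['markov chain monte carlo']
```

Longest-prefix matching is what keeps "Markov chain Monte Carlo" from being reported as the shorter candidate "Markov chain" plus leftovers.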
====3.2 Results====
Figures 1(a) and 2(a) show the empirical probability mass functions of the number of matched terms per document in CiteSeerX and RSC, respectively. As shown, only less than 4% of the documents in CiteSeerX and 1% of the documents in RSC have no keyphrase match. To further study the documents with no matched keyphrases, we scrutinized 100 of them and found that the titles and the abstracts of most of them are extremely short, mainly due to parsing errors. To eliminate this confounding parsing factor, we randomly selected 10,000 documents whose titles have at least 4 words and whose abstracts have at least 20 words from CiteSeerX as Set C and from RSC as Set D. The empirical probability mass functions for the new samples are shown in Figures 1(b) and 2(b). Less than 0.5% of the sampled papers in CiteSeerX, and none of them in RSC, have no keyphrase match. Statistical summaries of the number of matched keyphrases per sampled document are shown in Table 1. The results demonstrate that Wikipedia has good coverage of the two disciplines and is very likely to be a helpful resource for compiling keyphrase candidates for documents of other scientific disciplines as well.

===4 Deployment===
We have utilized the discovered keyphrase candidates to support several systems. Here we introduce some of them.

====4.1 CSSeer and CollabSeer====
CSSeer (http://csseer.ist.psu.edu/) is an expert recommender system built on top of four million academic documents in fields related to Computer Science and Information Science [2,3]. To efficiently return a list of experts of a specified sub-domain (e.g., information retrieval), CSSeer preprocesses the text in the title and the abstract of each document to extract the keyphrase candidates as the input for more complex algorithms. Since most interesting keyphrases are preprocessed and indexed, CSSeer can effectively return a list of experts within seconds.
On the other hand, if a user submits a query term that is not included in the preprocessed keyphrase list, calculating the expert score of a user for the query term in real time is impractical [2]. Alternatively, we would need to approximate the expert score by considering only the top related important documents (instead of the full four million documents). However, such an approximation considers at most hundreds of documents, which inevitably ignores most of the available information. As a result, the keyphrase candidate extraction method forms an essential component of the CSSeer recommendation service.

Figure 3 shows two screenshots of the CSSeer system. On the left (i.e., Figure 3(a)), the expertise list of Dr. W. Bruce Croft is compiled based on the keyphrase candidates extracted from his publications. On the right (i.e., Figure 3(b)), the phrases that are most relevant to the query phrase "information retrieval" are also generated based on the keyphrase candidates compiled by the introduced method.

Fig. 3. Screenshots of CSSeer: (a) W. Bruce Croft's expertise list and publication list; (b) related keyphrases and experts of "information retrieval".

CollabSeer is another system that leverages the keyphrase candidates compiled by the introduced method. Essentially, CollabSeer recommends potential collaborators in a researcher's area of interest within her academic social circle. As in CSSeer, we identify each user's research interests and expertise based on the keyphrase candidates discovered from her previous publications. Figure 4 shows a screenshot of the expertise list of an author.

Fig. 4. A screenshot of the expertise list.

Table 2. Statistics of the increase ratio of the keyphrase candidates of the 1,000 sampled CiteSeerX documents.
{| class="wikitable"
! Min !! Q1 !! Q2 !! Mean !! Q3 !! Max
|-
| 0% || 42.86% || 56.52% || 60.73% || 72.73% || 600%
|}

Table 3. A comparison of the average recalls based on the 100 sampled CiteSeerX documents.
{| class="wikitable"
! Method !! POS tagging !! Wikipedia matching !! Combination of both
|-
| Avg. num. of keyphrase candidates || 15.95 || 11.39 || 24.96
|-
| Average recall || 73.06% || 48.00% || 91.67%
|}

====4.2 CiteSeerX====
CiteSeerX (http://citeseerx.ist.psu.edu/) is an autonomous digital library for scientific literature. For each document, CiteSeerX provides a summary tab that shows the abstract and the keyphrases extracted from the abstract, as shown in Figure 5. The current online version of the keyphrase list is compiled by an unsupervised method that tags the nouns and the noun phrases with the Stanford POS Tagger and noun phrase rules [8,11,12] and naïvely treats these noun phrases as the keyphrase candidates. However, we found that the recall of such a method is only about 70%. Since the final extracted keyphrases are only a subset of the keyphrase candidates, we would like the keyphrase candidates to include many potential keyphrases to achieve a higher recall. We plan to update this keyphrase candidate generating process to a mixture of the original method (POS-tagging-based) and the method introduced in this paper (Wikipedia-based) to increase the recall.

Fig. 5. A screenshot of the CiteSeerX summary tab.

As an initial study, we randomly selected 1,000 papers whose abstracts contain at least 20 words, and compiled the keyphrase candidates by a mixture of the original and the new method (i.e., we merged the keyphrase candidates returned by the two methods). We found that, on average, the mixture approach increases the number of keyphrase candidates per document from 14.49 to 23.29. The increase ratio is (23.29 - 14.49)/14.49 = 60.73% on average. Table 2 shows the summary of the increase ratio of the 1,000 sampled documents, and Figure 6 displays the empirical cumulative distribution function (ECDF) of the increase ratio of these documents.
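The merge-and-measure step above can be sketched as follows; the candidate sets are hypothetical stand-ins for the POS-based and Wikipedia-based outputs:

```python
def increase_ratio(pos_candidates, wiki_candidates):
    """Relative growth of the candidate list when the Wikipedia-based
    candidates are merged into the POS-based ones."""
    merged = pos_candidates | wiki_candidates
    return (len(merged) - len(pos_candidates)) / len(pos_candidates)

# Hypothetical per-document candidate sets (note the overlap on "recall").
pos = {"candidate list", "recall", "noun phrase"}
wiki = {"recall", "support vector machine", "trie"}
print(round(increase_ratio(pos, wiki), 4))
# -> 0.6667
```

Because the two lists overlap, the merged size grows by less than the size of the Wikipedia list, which is why the paper-level average rises from 14.49 to 23.29 rather than by the full Wikipedia count.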
Meanwhile, we manually labeled the keyphrases of 100 of these documents. We computed the recall of the keyphrase candidates generated by the following methods: (1) generating keyphrase candidates based on POS tagging; (2) generating keyphrase candidates based on the Wikipedia terms; (3) a combination of (1) and (2). The average recall on this test dataset is shown in Table 3. By combining the two methods, we achieve an average recall of 91.67%, while increasing the number of keyphrase candidates by only 9.01 on average.

Fig. 6. The empirical cumulative distribution function of the increase ratio.

===5 Discussion===
In this paper, we empirically validated that Wikipedia titles and anchor texts are valuable resources for generating keyphrase candidates for scientific articles. We found that, based only on the abstract texts of the scientific documents, such a simple method can generate 8.3 keyphrase candidates for a typical paper in the field of Computer Science and Information Systems and 17.7 keyphrase candidates for a typical Chemistry paper. If we combine the Wikipedia resource and a simple POS-tagging technique, the generated keyphrase candidates yield a very high recall (over 90% on average).

We built several systems partially based on this concept. Specifically, we generated each author's research expertise based on the keyphrase candidates of her previous publications and integrated the function into CSSeer (an expert recommender system for computer scientists) and CollabSeer (a collaborator recommender system for computer scientists). We also generated the keyphrases for the documents collected by CiteSeerX and plan to update the current keyphrase list shown online. For future work, we plan to apply a similar concept to different domains.
Finally, we are also in the process of releasing the titles, abstracts, and extracted keyphrases of the 10 million academic documents collected by CiteSeerX. We hope that such a large dataset can benefit the research communities in digital libraries and information retrieval.

===References===
1. Chen, H.H., Gou, L., Zhang, X., Giles, C.L.: CollabSeer: a search engine for collaboration discovery. In: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, pp. 231–240. ACM (2011)
2. Chen, H.H., Ororbia II, A.G., Giles, C.L.: ExpertSeer: a keyphrase-based expert recommender for digital libraries. arXiv preprint arXiv:1511.02058 (2015)
3. Chen, H.H., Treeratpituk, P., Mitra, P., Giles, C.L.: CSSeer: an expert recommendation system based on CiteSeerX. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 381–382. ACM (2013)
4. Hasan, K.S., Ng, V.: Conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 365–373. Association for Computational Linguistics (2010)
5. Hasan, K.S., Ng, V.: Automatic keyphrase extraction: a survey of the state of the art. In: Proceedings of ACL, pp. 1262–1273 (2014)
6. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 257–266. Association for Computational Linguistics (2009)
7. Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. In: Proceedings of EMNLP. Barcelona, Spain (2004)
8. Nguyen, T.D., Kan, M.Y.: Keyphrase extraction in scientific publications. In: Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, pp. 317–326. Springer (2007)
9. Treeratpituk, P., Teregowda, P., Huang, J., Giles, C.: SEERLAB: a system for extracting keyphrases from scholarly documents. In: Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics (2010)
10. Tuarob, S., Mitra, P., Giles, C.: Building a search engine for algorithms. ACM SIGWEB Newsletter, p. 5 (2014)
11. Williams, K., Chen, H.H., Choudhury, S.R., Giles, C.L.: Unsupervised ranking for plagiarism source retrieval. Notebook for PAN at CLEF (2013)
12. Williams, K., Chen, H.H., Giles, C.L.: Classifying and ranking search engine results as potential sources of plagiarism. In: Proceedings of the 2014 ACM Symposium on Document Engineering, pp. 97–106. ACM (2014)