Conceptual Impact-Based Recommender System for CiteSeerx

Kevin Labille, Susan Gauch, Ann Smittu Joseph
Department of Computer Science and Computer Engineering
University of Arkansas
Fayetteville, AR 72701, USA
kclabill@uark.edu, sgauch@uark.edu, ann@email.uark.edu

ABSTRACT
CiteSeerx is a digital library for scientific publications written by Computer Science researchers. Users are able to retrieve relevant documents from the database by searching by author name and/or keyword queries. Users may also receive recommendations of papers they might want to read, provided by an existing conceptual recommender system. This system recommends documents based on an automatically constructed user profile. Unlike traditional content-based recommender systems, the documents and the user profile are represented as concept vectors rather than keyword vectors, and papers are recommended based on conceptual matches rather than keyword matches between the profile and the documents. Although the current system provides recommendations that are on-topic, they are not necessarily high-quality papers. In this work, we introduce the Conceptual Impact-Based Recommender (CIBR), a hybrid recommender system that extends the existing conceptual recommender system in CiteSeerx by including an explicit quality factor as part of the recommendation criteria. To measure quality, our system considers the impact factor of each paper's authors as measured by the authors' h-index. Experiments to evaluate the effectiveness of our hybrid system show that the CIBR system recommends more relevant papers than the conceptual recommender system.

Categories and Subject Descriptors
Information Systems [Information retrieval]: Retrieval tasks and goals: Recommender systems

General Terms
Performance, Reliability, Design, Experimentation

Keywords
Recommender System, h-index, Content-based Recommender System, CiteSeerx, Information Retrieval

CBRecSys 2015, September 20, 2015, Vienna, Austria. Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION
In recent years, recommender systems have become ubiquitous, recommending movies, restaurants, books, and more. The recommendations ease information overload for users by proactively suggesting relevant items, moving the burden of discovery from the user to the system. The number and type of applications that use recommender systems keeps growing [1]; one practical application that is of interest to researchers in any domain is the ability of recommender systems to suggest relevant scientific literature. These systems can expedite scientific innovation by helping researchers keep abreast of new publications in their fields and also help new researchers learn about the most important literature in an area new to them.

Digital libraries can employ recommender systems that suggest papers to their users based on each user's research interests. However, an effective recommender system should not only consider the subject of a paper, it should also take into account the paper's quality when making recommendations. To this end, we present a recommender system that recommends scientific papers based on user preferences as well as paper quality, as measured by the authors' impact factors, to provide recommendations of high-quality papers that are relevant to the user's research area.

To help CiteSeerx users locate scientific papers related to their work, a citation-based recommender system was developed by Chandrasekaran et al. in 2008 [4]. Although citations are effective at identifying papers that have relevant content and are also high quality, this approach is only effective in recommending papers with many citations. These unfortunately tend to be older papers that were published long enough ago to accumulate many citations. Especially in a fast-moving domain like computer science, researchers need to know about recent contributions to their field, yet recent papers have few citations.

To solve this problem, a content-based recommender system for CiteSeerx was developed by Pudhiyaveetil et al. [8]. This conceptual recommender system automatically builds conceptual profiles for users based on their interactions with the system. It also builds conceptual profiles for each document and recommends papers based on conceptual matches between document and user profiles. Even though the recommendations were shown to be more relevant than those produced by a keyword-based recommender system, they are not always high-quality papers that the researcher wanted to read.

Our objective is to improve upon the conceptual recommender system by providing better quality recommendations to the users. To do so, we developed a recommender system that recommends papers based on the paper authors' impact factors. We combined the impact-factor-based recommendations with the concept-based recommendations in varying proportions to create a hybrid recommender system. We evaluated the effectiveness of the conceptual recommender system, the impact-factor recommender system, and the hybrid recommender system and found that the hybrid recommender system provides the most accurate recommendations.

The rest of this paper is organized as follows: In Section 2 we review related work. Section 3 describes the Conceptual Impact-Based Recommender (CIBR) system in detail. In Section 4, we present our experimental evaluation to analyze the effectiveness of our recommender system. Finally, we present our conclusions and discuss future work in Section 5.

2. RELATED WORK
The design of a recommender system can vary based on the nature of user feedback or the availability of data. There are three main approaches: collaborative filtering, content-based recommender systems, and recommender systems that are a hybrid of the two [1]. The first approach generates recommendations based on similarities between the users' behavior and/or preferences. In contrast, content-based approaches recommend items to the users based on similarities between the attributes of the items themselves [10]. Collaborative approaches are typically used when semantic features cannot easily be extracted from the items, so indirect evidence based on users' likes or ratings must be compared. To be effective, collaborative filtering requires a large active user community to avoid the well-known "cold-start" problem in which there are many more items to be recommended than there are users with likes or ratings upon which recommendations can be based. On the other hand, pure content-based recommender systems do not consider external information that might be available from the users, e.g., popularity. For these reasons, many recommender systems employ a hybrid approach that combines both of the previously described approaches.

Content-based recommender systems match the users' preferences to each item's features to recommend new objects [10]. Many share the approach of building a user profile from a set of features extracted from previously liked items. This user profile is then compared to the features of all items in the collection and the most similar items are recommended to the user [12]. This type of recommender system can be used in any domain for which semantically relevant features can be extracted, and it is particularly well-suited for domains that include textual items such as scientific literature or domains with annotations such as movies or music [12]. Kompan et al. used this approach to recommend news articles on a web site [9]. In this domain, the volume of articles and the dynamic nature of news make collaborative filtering infeasible, so they implemented a content-based recommender system based on cosine similarity that suggested articles that best matched an implicitly constructed user model [9].

Our work is a hybrid approach that enhances a content-based recommender system with a quality measure to recommend scientific literature. According to Beel et al., recommender systems for research papers are flourishing, with more than 80 approaches existing today that have been discussed in over 170 articles and patents [2]. Such recommender systems help researchers stay up to date in their research area. Many content-based recommender systems represent the user interests and the documents as weighted keyword vectors. One example is [13], in which tf*idf weights are calculated for keywords and the cosine similarity measure is used to determine the relevance of a paper to a user's profile. An approach similar to ours is used in [5]. In their work, each paper's features are represented as concepts created by automatically extracting keyphrases. User profiles are constructed from the concepts in previously viewed papers, and the recommender system matches the user profile concepts to each paper's concepts to suggest new papers in a scientific library. In [8], a conceptual recommender system was presented that recommends research papers for CiteSeerx users. Unlike the previous work, the concepts for each paper are assigned by automatically classifying papers into a set of concepts defined by a pre-existing ontology. A conceptual user profile is implicitly built as users view papers in the collection, and this user profile is used to recommend conceptually similar papers.

Content-based recommender systems can recommend literature that is similar in topic to the user's profile, but they do not necessarily recommend high-quality papers. Although there is no perfect way to measure the quality of articles, the Impact Factor (IF), introduced in 1955, is still considered the best way to evaluate a paper's scientific merit [6]. There are several types of IFs, including the widely used h-index that evaluates a researcher's impact [7]. It has recently been used in several fields such as health services research [3], business and management [11], and even academic psychiatry [14]. Although the work in [5], [8], and [13] is similar to ours, our recommender system expands upon it by incorporating a quality factor as measured by the authors' h-indexes.

3. APPROACH

Figure 1: Architecture of the CIBR

The architecture of the Conceptual Impact-Based Recommender System (CIBR) is shown in Figure 1. The Profile Subsystem classifies all documents in the CiteSeerx database into the 369 predefined categories in the ACM Computing Classification System (CCS). Documents manually tagged with ACM categories by their authors are used as the training set for a k-nearest neighbor classifier. As users interact with the system, the documents that they examine are input to the Profile Subsystem. The categories associated with each examined document are combined to create a weighted conceptual user profile. This user profile is used by both the Conceptual Recommender and the Impact-Based Recommender described in the following sections. The outputs of these two Recommenders are combined to produce the recommendations from the CIBR.

3.1 Concept-Based Recommender System

Figure 2: Conceptual Recommender System Architecture

As a user views documents in CiteSeerx, the Profile Subsystem builds a conceptual user profile for them by accumulating the concept weights associated with the documents that the user examines. The Conceptual Recommender System then recommends documents to the user based on the similarity between each document's conceptual profile and the user's conceptual profile [8]. The weight of the conceptual match between document i and user j is calculated using the cosine similarity function over all M = 369 concepts in the ACM taxonomy:

$$\mathrm{ConceptualWeight}_{ij} = \sum_{k=1}^{M} \left( cwt_{ik} \cdot cwt_{jk} \right)$$

Where
$cwt_{ik}$ = weight of concept k in document profile i, and
$cwt_{jk}$ = weight of concept k in user profile j,
as explained and detailed in [8].

3.2 Impact-Based Recommender System

Figure 3: Impact-based Recommender System Architecture

The Impact Factor Generator precalculates an impact factor for each document in the collection as measured by its authors' h-indices. As described by Hirsch, an author has an h-index of m based on his/her N published articles if m articles have at least m citations each, and the other N-m articles have no more than m citations each [7]. The impact factor for a document is calculated by finding the h-index value of each of the authors of the document and then selecting the highest h-index value. Thus, document i's impact weight is equal to the h-index of its most impactful author:

$$\mathrm{ImpactWeight}_i = \max_{l \in A_{il}} \left( \mathrm{hindex}_{il} \right) \qquad (1)$$

Where
$A_{il}$ = list of the authors l of document i.

Since the impact factor is independent of users, the Impact-Based recommendations would be the same for all users, i.e., the most impactful documents in the entire collection. We do, however, use the user profile to filter out documents from categories in which the user has shown no previous interest. Thus, the Impact-Based Recommender returns high-impact documents from categories of some interest to the user. We also tried other ways of calculating the impact factor, including the sum of the authors' h-indices. That method is limited because the highest-weighted papers would usually be the ones with many authors.

3.3 Conceptual Impact-Based Recommender System
The Conceptual Impact-Based Recommender System (CIBR) combines the Conceptual Weights and the Impact Weights to produce its recommendations. The two sub-component weights are normalized to fall between 0 and 1 using linear scaling and then combined based on a tunable parameter, α. The weight of the conceptual impact match between document i and user j, $\gamma_{ij}$, is calculated using:

$$\gamma_{ij} = \alpha \cdot C'_{ij} + (1 - \alpha) \cdot I'_{i} \qquad (2)$$

Where

$$C'_{ij} = \text{normalized } \mathrm{ConceptualWeight}_{ij} = \frac{\mathrm{ConceptualWeight}_{ij} - \min_j(\mathrm{ConceptualWeight})}{\max_j(\mathrm{ConceptualWeight}) - \min_j(\mathrm{ConceptualWeight})}$$

$$I'_{i} = \text{normalized } \mathrm{ImpactWeight}_{i} = \frac{\mathrm{ImpactWeight}_{i} - \min_j(\mathrm{ImpactWeight})}{\max_j(\mathrm{ImpactWeight}) - \min_j(\mathrm{ImpactWeight})}$$

α = controls the relative contributions of the two sub-weights

By varying α from 0 to 1, we can adjust the relative contributions of the two underlying recommender systems. When α = 0, the CIBR is a pure impact-based recommender system, whilst when α = 1, the CIBR is a purely conceptual recommender system.

4. EXPERIMENTAL EVALUATION

4.1 Subjects and Dataset
We conducted several experiments to measure the effectiveness of our hybrid recommender system. Experiments were done with 30 subjects, undergraduate and graduate computer science and computer engineering students from the University of Arkansas. We use the 2,190,179 documents in our snapshot of CiteSeerx, a digital library and search engine for computer and information sciences literature. Because previous experiments have shown that profiles become stable after viewing 20 papers, users were asked to search for and view at least that many papers related to their own research area. Based on those documents, user profiles were automatically constructed for each user.

4.2 Evaluation Method
The goal of this experiment was first to determine what combination of the conceptual match and the paper quality is most effective in our hybrid recommender system. The relative contribution of the two is given by Equation (2) in Section 3.3. By changing the value of α we are able to control the relative contributions of the two recommender systems, with α = 0.0 being a pure impact-based recommender system, α = 1.0 being a pure conceptual recommender system, and α = 0.5 using even contributions from both. We varied the value of α from 0.0 to 1.0 in increments of 0.1 for each of the subjects in the experiment, and for each value of α we collected the top ten recommended documents. For each user, we presented them with the set of all documents recommended by any of the versions of the system (removing duplicates) in random order. They provided explicit relevance feedback by rating the papers as very relevant (2), relevant (1), or irrelevant (0). We then used the Mean Average Weighted Precision (MAWP) of each user for each α as a metric. The MAWP is essentially the Mean Average Precision modified to handle graded judgments from 0 to 2 rather than just Boolean relevance judgments. The mean of every MAWP for each α is calculated and summarized in Figure 4.

Figure 4: Mean Average Weighted Precision for every α

As shown in Figure 4, an α of 0.9 gives the best result, a MAWP of 0.6355, meaning that a 90% contribution from the conceptual recommender system and a 10% contribution from the impact-based recommender performed the best. For the second part of our analysis, we compared the effectiveness of the three recommender systems head-to-head. The hybrid recommender system with α = 0.9 outperformed the conceptual recommender system's MAWP of 0.6083 (α = 1.0) by 4.5% relative (or 2.72% absolute) and the impact-based recommender system's MAWP of 0.2867 (α = 0.0) by 121.67% relative (or 34.88% absolute). Both of these results are statistically significant (p < 0.05), based on the paired two-tailed Student's t-test.

5. CONCLUSION AND FUTURE WORK
In this paper, a hybrid recommender system was introduced that recommends high-quality papers to CiteSeerx users. The new recommender combines a conceptual recommender system with an impact-factor-based recommender system. The former incorporates the user's preferences represented as a concept vector, whilst the latter incorporates paper quality using the authors' impact factors as measured by their h-indexes. User experiments were conducted to compare the concept-based recommender system and the impact-based recommender system with our hybrid system. The results confirm that our hybrid recommender generates more relevant documents than either the conceptual or the impact-factor-based recommender. Future work could consider using social networks of co-authors or differential weighting of the papers. Another direction would be to investigate the effectiveness of our hybrid recommender system using the g-index, which gives a stronger weight to highly cited papers than the h-index does. Alternatively, we could use the e-index, which complements the h-index by distinguishing authors having the same h-index but different numbers of citations.

6. ACKNOWLEDGMENTS
This research was supported in part by the National Science Foundation grant number 0958123: Collaborative Research: CI-ADDO-EN: Semantic CiteSeerx.

7. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734–749, 2005.
[2] J. Beel, S. Langer, M. Genzmehr, B. Gipp, C. Breitinger, and A. Nürnberger. Research paper recommender system evaluation: A quantitative literature survey. In Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation, pages 15–22. ACM, 2013.
[3] Y. Birks, C. Fairhurst, K. Bloor, M. Campbell, W. Baird, and D. Torgerson. Use of the h-index to measure the quality of the output of health services researchers. Journal of Health Services Research & Policy, 19(2):102–109, 2014.
[4] K. Chandrasekaran, S. Gauch, P. Lakkaraju, and H. P. Luong. Concept-based document recommendations for CiteSeer authors. In Adaptive Hypermedia and Adaptive Web-Based Systems, pages 83–92. Springer, 2008.
[5] D. De Nart and C. Tasso. A personalized concept-driven recommender system for scientific libraries. Procedia Computer Science, 38:84–91, 2014.
[6] E. Garfield. Journal impact factor: a brief review. Canadian Medical Association Journal, 161(8):979–980, 1999.
[7] J. E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46):16569–16572, 2005.
[8] A. Kodakateri Pudhiyaveetil, S. Gauch, H. Luong, and J. Eno. Conceptual recommender system for CiteSeerx. In Proceedings of the Third ACM Conference on Recommender Systems, pages 241–244. ACM, 2009.
[9] M. Kompan and M. Bieliková. Content-based news recommendation. In E-Commerce and Web Technologies, pages 61–72. Springer, 2010.
[10] P. Lops, M. De Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook, pages 73–105. Springer, 2011.
[11] J. Mingers, F. Macri, and D. Petrovici. Using the h-index to measure the quality of journals in the field of business and management. Information Processing & Management, 48(2):234–241, 2012.
[12] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In The Adaptive Web, pages 325–341. Springer, 2007.
[13] S. Philip and A. O. John. Application of content-based approach in research paper recommendation system for a digital library. International Journal of Advanced Computer Science & Applications, 5(10), 2014.
[14] S. Selek and A. Saleh. Use of h index and g index for American academic psychiatry. Scientometrics, 99(2):541–548, 2014.
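
As an illustration of the conceptual match in Section 3.1, the sketch below computes the sum of products of concept weights for one document profile and one user profile. Section 3.1 calls this a cosine similarity, and a dot product equals the cosine only when both vectors have unit length, so the sketch assumes the profiles are L2-normalized first; that normalization step, the function names, and the concept labels are assumptions made for illustration and are not taken from the paper or from [8].

```python
import math
from typing import Dict


def l2_normalize(profile: Dict[str, float]) -> Dict[str, float]:
    """Scale a concept-weight profile to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(w * w for w in profile.values()))
    if norm == 0.0:
        return dict(profile)  # empty or all-zero profile: leave unchanged
    return {concept: w / norm for concept, w in profile.items()}


def conceptual_weight(doc_profile: Dict[str, float],
                      user_profile: Dict[str, float]) -> float:
    """Sum over concepts k of cwt_ik * cwt_jk; concepts absent from a profile weigh 0."""
    return sum(w * user_profile.get(concept, 0.0)
               for concept, w in doc_profile.items())


# Hypothetical profiles; the concept labels are placeholders, not real ACM categories.
doc = l2_normalize({"concept_a": 3.0, "concept_b": 4.0})
user = l2_normalize({"concept_a": 1.0})
match = conceptual_weight(doc, user)
```

Storing profiles as sparse dictionaries avoids iterating over all 369 concepts when most weights are zero.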
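
The impact weight of Section 3.2 can be sketched in a few lines: compute each author's h-index from a list of per-paper citation counts, then take the maximum over the document's authors as in Equation (1). The data layout (a dictionary from author identifier to citation counts) and all names and numbers here are hypothetical; the paper does not describe how the Impact Factor Generator stores or obtains citation data.

```python
from typing import Dict, List


def h_index(citations: List[int]) -> int:
    """Largest m such that at least m papers have m or more citations each [7]."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def impact_weight(author_ids: List[str],
                  citations_by_author: Dict[str, List[int]]) -> int:
    """Equation (1): the highest h-index among the document's authors.

    Raises ValueError for a document with no authors.
    """
    return max(h_index(citations_by_author[a]) for a in author_ids)


# Hypothetical citation counts, invented for illustration only.
citations_by_author = {
    "author_a": [10, 8, 5, 4, 3],
    "author_b": [25, 2, 1],
}
doc_weight = impact_weight(["author_a", "author_b"], citations_by_author)
```

Taking the maximum rather than the sum keeps a paper with many low-impact authors from outranking a paper with a single high-impact author, which is the limitation Section 3.2 notes for the summed variant.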
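
The hybrid score of Section 3.3 can be sketched as min-max scaling of each sub-weight followed by the α-weighted combination of Equation (2). This is a minimal sketch under stated assumptions: the scaling is taken over the candidate documents for one user, a constant set of weights is mapped to 0.0 (the paper does not say how that case is handled), and the function names and sample values are hypothetical.

```python
from typing import Dict, List


def min_max(values: Dict[str, float]) -> Dict[str, float]:
    """Linearly scale values to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption: when all weights are equal, map every document to 0.0.
        return {doc: 0.0 for doc in values}
    return {doc: (v - lo) / (hi - lo) for doc, v in values.items()}


def hybrid_scores(conceptual: Dict[str, float],
                  impact: Dict[str, float],
                  alpha: float) -> Dict[str, float]:
    """Equation (2): gamma = alpha * C' + (1 - alpha) * I' for each document."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    c_norm, i_norm = min_max(conceptual), min_max(impact)
    return {doc: alpha * c_norm[doc] + (1.0 - alpha) * i_norm[doc]
            for doc in conceptual}


def top_n(scores: Dict[str, float], n: int = 10) -> List[str]:
    """Document identifiers sorted by descending score, truncated to n."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical weights for three candidate documents and one user.
conceptual = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
impact = {"d1": 5.0, "d2": 40.0, "d3": 20.0}
ranking = top_n(hybrid_scores(conceptual, impact, alpha=0.9))
```

With α = 1.0 the ranking follows the conceptual weights alone, and with α = 0.0 it follows the impact weights alone, matching the two pure systems compared in Section 4.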
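
The relative and absolute improvements reported in Section 4.2 can be checked directly from the three MAWP values given there (0.6355 for α = 0.9, 0.6083 for α = 1.0, and 0.2867 for α = 0.0). The worked arithmetic below uses only those rounded values, so small differences in the last digit from the reported percentages are presumably due to rounding.

```latex
\begin{align*}
\text{vs. conceptual:} \quad
  & 0.6355 - 0.6083 = 0.0272, &
  & \frac{0.0272}{0.6083} \approx 0.0447 \;(\approx 4.5\%) \\
\text{vs. impact-based:} \quad
  & 0.6355 - 0.2867 = 0.3488, &
  & \frac{0.3488}{0.2867} \approx 1.2166 \;(\approx 121.7\%)
\end{align*}
```

The absolute differences, 0.0272 and 0.3488, correspond to the 2.72% and 34.88% figures quoted as absolute improvements.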