Random Walk and Feedback on Scholarly Network Yingying Yu Zhuoren Jiang Xiaozhong Liu College of Transportation College of Transportation School of Informatics and Management Management Computing Dalian Maritime University Dalian Maritime University Indiana University Dalian, China, 116026 Dalian, China, 116026 Bloomington uee870927@126.com jzr1986@gmail.com Bloomington, IN, USA, 47405 liu237@indiana.edu ABSTRACT could significantly improve the scholarly recommendation perfor- The approach of random walk on heterogeneous bibliographic graph mance [3,7,9,12]. For instance, Liu et al., [2,3] constructed the het- has been proven effective in the previous studies. In this study, by erogeneous scholarly graph and proposed a novel ranking method using various kinds of positive and negative feedbacks, we propose based on pseudo relevance feedback (PRF), which can effectively the novel method to enhance the performance of meta-path-based recommend candidate citation papers via different kinds of meta- random walk for scholarly recommendation. We hypothesize that paths on the graph. the nodes on the heterogeneous graph should play different roles In this paper, we intend to further investigate feedback informa- in terms of different queries or various kinds implicit/explicit feed- tion and enhance the meta-path-based random walk performance. backs. Meanwhile, we prove that the node usefulness probabil- Intuitively, for different information needs, when user feedbacks ity has significant impact for the path importance. When positive are available, the nodes on the graph should play different roles and negative feedback information is available, we can calculate in the final measure. For example, given two different queries each node’s proximity to the feedback nodes, and use the prox- "Content-based Citation Recommendation" and "Heterogeneous In- imity to infer the usefulness probability of each node via the sig- formation Network", the same paper "ClusCite: effective citation moid function. By combining the transition probability and the use- recommendation by information network-based clustering" may be fulness probability of nodes on the path instance, we propose the retrieved by scholarly search engines, e.g., Google Scholar. But the new random walk function to compute the importance of each path target paper can be more useful (positive) for the second query than instance. Experimental results with ACM full-text corpus show the first one. As another example, for user X, if she prefers to cite that the proposed method (considering the node usefulness) sig- influential scholars’ work, the highly cited authors will be useful for nificantly outperforms the previous approaches. her. While for user Y, if she tends to cite the frontiers, she will mark the newest publications and the newly topics as the useful feedback information. Therefore, the same node may perform significantly Categories and Subject Descriptors different based on different information needs and feedback infor- H.3.3 [Information Storage and Retrieval]: Information Search mation. Furthermore, by using (implicit/explicit positive/negative) and Retrieval feedbacks, it is possible to infer the usefulness probability of other nodes on the graph. So that, the importance of path instance will vary in terms of the probability of node usefulness. Keywords The main contribution of this paper is threefold. First, in Meta-path-based Random Walk, Feedback, Heterogeneous Graph this paper, the feedback is not limited to documents. In scholarly network, user could provide feedback judgments for authors, key- words and venues, either useful or not useful. If the explicit user 1. INTRODUCTION feedback is unavailable, we propose an approach to automatically The volume of scientific publications has increased dramatically generate the feedback nodes based on user queries and the relation- in the past couple of decades, which challenges existing systems ships among the entities on the heterogeneous graph. Second, we and methods to retrieve and access scientific resources. Classi- infer the usefulness of the nodes in terms of feedback information. cal text-based information retrieval algorithms can recommend the For instance, a node is less useful when it is close to the negative candidate publications for scholars. However, most of them ig- node(s). We make a conjecture that the usefulness probability of nored the complex and heterogeneous relations among the schol- each node depends on its average proximity to the feedback set and arly objects. Not until recently, some studies proved that adopt- can be estimated via sigmoid function. Third, we emphasize the ing the mining approaches on heterogeneous information networks node usefulness has a great impact on the path importance. Our ap- proach about computing the random walk probability differs from the previous study in that, not only the transition probability, but also the usefulness probability of the node should be taken into ac- count for random walk. To verify these hypotheses, we adopt a number of meta-paths on the graph (Figure 1) and make a com- parison between the classical random walk function and the novel method. Experimental results on ACM corpus show that the pro- Copyright c 2015 for the individual papers by the papers’ authors. Copy- ing permitted for private and academic purposes. This volume is published posed method significantly outperforms the original one. and copyrighted by its editors. The remainder of this paper is structured as follows. We 1) re- Published on CEUR-WS: http://ceur-ws.org/Vol-1393/. view relevant methodologies for pseudo relevance feedback, 2) in- Given a specific scholarly network, there can be many kinds of w w troduce the preliminaries, 3) propose the improved methods, 4) de- meta-paths. For example, P ∗ → A ← P ? is a simple meta-path scribe the experiment setting and evaluation results, and 5) con- on the scholarly network, denoting all the papers published by the clude with a discussion and outlook. seed paper’ author. P ∗ is the starting paper node (seed node) in this path. P ? denotes the candidate publication node. More examples 2. RELATED WORK can be found in Table 1. Pseudo relevance feedback, also known as blind relevance feed- back, provides a way for automatic local analysis. When the user 4. RESEARCH METHODS judgments or interactions are not available, it turns out to be an ef- fective method to improve the retrieval performance. Traditional 4.1 Generate the Feedback Nodes pseudo relevance feedback tends to treat the top ranked documents Generally, given user initial queries, a list of ranking publications as relevant feedback, and then expand the initial queries. How- would be found via text retrieval. Based on the top ranked docu- ever, some of the top retrieved documents may be irrelevant, which ments, user would probably give explicit judgments on whether the could result in noisy feedback into the process. So that, there are related keywords, authors or venues are useful or not. However, various efforts to improve the traditional pseudo feedback. [11] ex- explicit feedback is not easy to get. In this study, we propose meth- ploited the possible utility of Wikipedia for query dependent ex- ods to infer the implicit feedback nodes on the heterogeneous graph pansion. From the perspective of each query and each set of feed- according to the given information. back documents, [4] proposed how to dynamically predict an op- The feedback is a collection of multiple nodes marked with use- timal balance coefficient query expansion rather than using a fixed ful (positive) or unuseful (negative) on the heterogeneous graph. value. [1] suggested to use evolutionary techniques along with se- We represent this collection as N F . N FP and N FN denote the mantic similarity notion for query expansion. [6] introduced an ap- positive and negative nodes set respectively. The kinds of feedback proach to expand the queries for passage retrieval, not based on nodes in discussion include keyword (K), author (A) and venue (V). the top ranked documents, but via a new term weighting function, which gives a score to terms of corpus according to their related- 4.1.1 Generate the Positive Feedback Nodes ness to the query, and identify the most relevant ones. Instead of Since we know the initial queries (i.e., author provided paper using term expansion, graph-based feedback provides a new rank- keywords) that the users should be most concerned with, it is rea- ing assumption based on topology expansion. [2] used the pseudo sonable to take the explicit keywords KP as the positive feedback relevant papers as the seed nodes, and then explored the potential nodes. Next, we will infer the positive authors and venues based on relevant nodes via specific restricted/combined meta-paths on the KP . We deem that the authors or venues that are highly likely re- heterogeneous graph. Our study is motivated by this approach and lated to KP are positive as well. So we rank authors via meta-paths con r w mainly focused on updating the random walk algorithm by inves- KP −−→ A? and KP ← P − → A? , and take the top ranked Kpos tigating both the positive and negative feedbacks. In fact, posi- authors as the pseudo positive authors AP . Similarly, we locate the con r p tive and negative feedback approach has been studied in image re- positive venues via KP −−→ V ? and KP ← P − → V ? , and select trieval [5]. With several steps of positive and negative feedback, the top ranked Kpos venues as the positive nodes VP . the retrieval performance could be increasingly enhanced. From the view of negative feedback, [10] studied and compared different 4.1.2 Generate the Negative Feedback Nodes kinds of methods, it addressed that negative feedback is important Intuitively, to generate the negative feedbacks, our basic assump- especially when the target topic is difficult and initial results are tion is that the negative nodes should be directly related to the poor. Besides, using multiple negative feedback methods could be searched results, but least relevant to the explicit positive keywords. more effective. First, based on text retrieval results, we define the top ranked topK papers as Pr , and then we locate the keywords, authors and venues r 3. PRELIMINARIES that are directly connected to Pr via different meta-paths, Pr → w p Following the work [2,8], an information network can be defined Kr , Pr → Ar and Pr → Vr . as follows. Next, we filter collections of Kr , Ar and Vr . 1. Rank the key- con r words Kr via the transition probability of meta-path KP → P → D EFINITION 1. (Information network) An information network Kr . Use the last ranked Kneg keywords as the pseudo negative is defined as a directed graph G = (V, E) with an object type nodes KN . 2. Similar to keywords, rank the authors Ar via the mapping function τ : V → A and a link type mapping function con w transition probability of meta-path KP → P → Ar , and use the φ : E → R, where each object v ∈ V belongs to one particular last ranked Kneg authors as the pseudo negative nodes AN . 3. object type τ (v) ∈ A, each link e ∈ E belongs to a particular con p relation φ(e) ∈ R, and if two links belong to the same relation Rank the venues Vr via KP → P → Vr , and use the last ranked con type, the two links share the same starting object type as well as Kneg venues as the negative nodes VN . Here we use KP → P in- r the ending object type. stead of KP ← P because the "contribution" characterizes the im- portance of each paper, given a topic. It does not necessarily means When there are more than one type of node or link in the infor- paper is relevant to topic [2]. Even if one paper is not explicit rel- mation network, it is called heterogeneous information network. evant to some topic, it might also be important. The "contribute" In [8], Sun further defined meta-path as follows. conveys more information. Thus, we obtain all the positive and negative feedback nodes. D EFINITION 2. (Meta-path) A meta-path P is a path defined N FP includes KP , AP and VP . N FN contains KN , AN and VN . on the graph of network schema TG = (A, R), and is denoted in the form of Ȧ1 −→ R1 R 2 Ȧ2 −→ R l . . . −→ Ȧl+1 , which defines a 4.2 Infer the Usefulness Probability of Node composite relation R = R1 ◦ R2 ◦ . . . ◦ Rl between types Ȧ1 and Unlike previous studies, in this paper, the importance of nodes Ȧl+1 , where ◦ denotes the composition operator on relations. on scholarly network is not even. The usefulness probability of node Ni is determined by the feedback nodes. Intuitively, if node path to optimize the weight of each sub-meta-path. For this study, Ni is more closely related to the positive nodes, it could be more we set β = 0.6. useful. Conversely, if Ni is much closer to the negative nodes, Then, the random walk probability will be decided by the tran- and further away from the positive nodes, it indicates that Ni may sition probability and the usefulness probability of the node on the be not very useful. Therefore, the proximity between given node path instance. In this paper, we use eight meta-paths to investigate and feedback node set is very crucial. We should note that the the novel random walk method with node feedback information for usefulness probability of each node varies from different feedback citation recommendation. All the meta-paths are listed in Table 1. node sets. To infer the usefulness probability of node Ni , we adopt the sig- 1 5. EXPERIMENT moid function Pu (Ni ) = 1+e−αD(N i) to convert the proximity into probability, where α controls the convergent rate (default is 5.1 Data Preprocessing 1). In our assumption, if Nj is positive node, Pu (Nj ) = 1, other- We used 41,370 publications (as candidate citation collection), wise P (Nj ) = 0. D(Ni ) denotes the proximity between Ni and published between 1951 and 2011, on computer science for the ex- the feedback node set N F . It can be derived from the following periment (mainly from the ACM digital library). As [2] introduced, formula. P we constructed the heterogeneous graph shown in Figure 1 and Ta- Nj ∈N FP d(Ni ,Nj )) P Nk ∈N FN d(Ni ,Nk ) D(Ni ) = |N FN | − |N FP | , where ble 2. |N FN | and |N FP | represents the size of collection N FN and N FP For the evaluation part, we used a test collection with 274 papers. respectively. d(Ni , Nj ) indicates the proximity between node Ni The selected papers have more than 15 citations from the candidate and node Nj . In this paper, we will estimate the proximity d(Ni , Nj ) citation collection. based on the paths Ni Nj on the graph. There could be lots of 5.2 Generate Feedback Nodes path instances connected node Ni and Nj . If the length of path is too long, the influence would be too small to be considered. We as- Attaining different types of feedback information is the most im- sume the maximum of path length is 10. Then we select the shortest portant part in this research. Since it is not available to get the user path and define its length as the proximity d(Ni , Nj ). judgments right away. We used the method introduced in section If D(Ni ) is negative, it reflects node Ni is closer to negative 4.1 to create positive and negative feedback nodes. As aforemen- nodes than positive ones, which means node Ni could be less im- tioned, the collection KP is the set of user given keywords. It is ex- portant, and vice versa. Particularly, if D(Nj ) → +∞, it indi- plicit positive feedbacks. While AP and VP can be derived by their cates that Nj is far away from negative feedback nodes, so the im- connectivity to set KP based on the heterogeneous graph. Here we portance of this node approach to 1; If D(Nj ) = 0, it indicates set Kpos = 10, and take the top 10 ranked authors/ venues as the that Nj has the same distance to negative and positive nodes, then implicit positive feedbacks. Pu (Nj ) = 0.5 ; If D(Nj ) → −∞, it indicates that Nj is closest Next, we produced the implicit negative feedback nodes. Through to negative feedback node, then Pu (Nj ) → 0. the text retrieved results, we grabbed the top ranked papers as Pr (topK = 20). Then we located the list of keywords/ authors/ venues which have direct correlations to Pr , but the least relevance 4.3 Compute the Random Walk Probability to KP . Find the last ranked Kneg = 10 and used them as KN , AN Based on Meta-path and VN respectively. Meta-path illustrates how the nodes are connected in the hetero- geneous graph. Once a meta-path is specified, a meta-path-based 5.3 Experiment Result ranking function is defined, so that relevant papers determined by In the evaluation part, we experimented with 8 different meta- the ranking function can be recommended [3]. It turns out that paths. For each meta-path, two sets of results were shown on row meta-path based feedback on heterogeneous graph performs better ‘N’ and ‘Y’ in Table 3. The ‘N/Y’ column in Table 3 indicates than other methods (PageRank) based PRF [2]. Random walk on whether we use the positive and negative feedback nodes or not for heterogenous network can explore more global information, com- computing the path importance. ‘N’ indicates that the result was bining multiple feedback nodes, which might be very important for from the baseline in [2], while ‘Y’ means multiple feedback nodes the recommendation tasks. were employed and the node influence was appended into the final In order to quantify the ranking score of candidates relevant to random walk function. MAP and NDCG are used as the ranking the seeds following one given meta-path, a random walk based ap- function training and evaluation metrics. For MAP, binary judg- proach was proposed in [2]. The relevance between P ∗ and P ? ment is provided for each candidate cited paper (cited or not cited). (1) (l+1) P can be estimated via s(ai , aj ) = (1) (l+1) RW (t), NDCG estimates the cumulative relevance gain a user receives by t=a i a j (1) (l+1) examining recommendation results up to a given rank on the list. where t is a path instance from node ai to aj following the We used an importance score, 0-4, as the candidate cited paper im- specified meta-path, and RW (t) is the random walk probability of portance to calculate NDCG scores. Apparently, in most cases, the instance t. row ‘Y’ significantly outperforms row ‘N’ , which shows that the (1) (2) (l+1) Suppose t = (ai1 , ai2 , . . . , ail+1 ), the random walk proba- positive/negative feedbacks enhance the random walk performance Q (j) (j+1) bility can be computed via RW (t) = j w(aij , ai,j+1 ). While quite well. We also used t-test to verify this improvement and most this formula only considers the weight of link on the path instance. meta-paths are significantly refined. Based on our hypothesis, the node usefulness probability has a great effect on the path importance. So in this study, we propose a 6. CONCLUSION AND LIMITIONS novel randomQ walk function as follows. In this study we use multiple kinds of feedback nodes and pro- (j) (j+1) (j+1) RW (t) = j (β ·w(aij , ai,j+1 )+(1−β)·Pu (ai,j+1 )), where pose a new method to enhance the meta-path-based random walk (j+1) (j+1) Pu (ai,j+1 ) is the usefulness probability of the node ai,j+1 on the performance. The new random walk function considers both tran- path (derived from section 4.2), and β determines which factor is sition probability and node usefulness probability on the path in- more important. Theoretically, we need to tune β for each meta- stance. We find that the node influence varies from the set of feedback nodes, which could be inferred based on the explicit user queries via a series of steps. Experimental results with ACM data Table 2: Graph statistics Node/Edge Number Description illustrate that the new approach with positive/negative feedback in- P 41,370 Paper formation helps to improve the performance of meta-path-based A 63,323 Author recommendation. V 369 Venue For further study, we will continue this approach based on real K 3,911 Keyword user explicit feedbacks and design the personalized recommenda- c P →P 168,554 Paper cites another paper tion model to improve user experience. Not only the node useful- w P →A 105,992 Paper is written by an author ness is related to the feedback nodes, but also the weight of each re- p P →V 41,013 Paper is published at venue lation type may be affected by the feedback nodes or retrieval task. co A→A 239,744 Co-author relationship If the retrieval task is to search the relevant papers based on given r P →K 587,252 Paper is relevant to keyword(topic) authors, the author feedback nodes will be more useful for "writ- con tenby" relation, "writtenby" and "co-author" relation might be more K → P 3,577,111 Keyword (topic) is contributed by paper con important. This hypothesis will be discussed in the next step. Be- K → A 2,397,205 Keyword (topic) is contributed by author con sides, more sophisticated inference models will be adopted which K → V 18,450 Keyword (topic) is contributed by venu may enhance the ranking performance. 7. FIGURES AND TABLES Table 3: Meta-path Based Random Walk Performance Comparison(|P ∗ | = 10) NO. N/Y MAP MAP@5 MAP@10 NDCG NDCG@5 NDCG@10 N 0.0277 0.0085 0.0129 0.1035 0.0306 0.0394 1 Y 0.0365 0.015 0.0211 0.1149 0.0459 0.0565 *** *** *** *** ** *** N 0.1315 0.0552 0.0773 0.2193 0.1427 0.1548 2 Y 0.1459 0.0678 0.0904 0.2307 0.1656 0.1705 ** *** *** *** ** *** N 0.0744 0.0306 0.0404 0.1539 0.0689 0.0766 3 Y 0.0948 0.0441 0.0582 * 0.1707 0.0945 * 0.1002 ** *** *** *** N 0.027 0.0042 0.0076 0.1378 0.0146 0.025 4 Y 0.038 0.0109 0.0153 0.1521 0.0318 0.0387 *** *** *** *** *** *** N 0.0436 0.0121 0.0187 0.1672 0.0476 0.0585 5 Y 0.0561 0.0257 0.0328 0.1854 0.0867 0.0885 *** *** *** *** *** *** N 0.0327 0.0234 0.03 0.0734 0.0693 0.0748 6 Y 0.0872 0.0359 0.0471 0.1962 0.0805 * 0.09 * *** *** *** *** Figure 1: Heterogeneous Bibliographic Graph N 0.0238 0.0083 0.0097 0.1529 0.0216 0.0224 7 Y 0.0373 0.0133 0.0163 0.1718 0.0317 0.0344 ** *** *** *** *** ** N 0.0092 0.0005 0.0007 0.1397 0.0011 0.0013 8 Y 0.012 0.0011 0.0017 0.1476 0.0027 0.0045 Table 1: All the meta-paths used in this study *** *** *** *** *** *** NO. Meta-path Feedback ranking hypothesis p < 0.05: *, p < 0.01: **, p < 0.001: *** w w 1 P ∗ −→ A ←− P ? Relevant paper’s author’s other papers can be relevant c 2 P∗ − → P? Relevant paper’s cited papers can be rel- recommendation. In Proceedings of the 23rd ACM International evant c c Conference on Conference on Information and Knowledge 3 P∗ − → P? →P − Relevant paper’s cited paper’s cited pa- Management, pages 121–130. ACM, 2014. per can be relevant c w w [3] X. Liu, Y. Yu, C. Guo, Y. Sun, and L. Gao. Full-text based 4 P∗ − → P −→ A ←− P ? Relevant paper’s cited papers’ authors’ context-rich heterogeneous network mining approach for citation papers can be relevant w co w recommendation. In ACM/IEEE Joint Conference on Digital 5 P∗ → A → A ← P? Relevant paper’s author’s co-author’s pa- Libraries, 2014. pers can be relevant w w c [4] Y. Lv and C. Zhai. Adaptive relevance feedback in information 6 P ∗ −→ A ←− P − → P? Relevant paper’s author’s cited papers can be relevant retrieval. In Proceedings of the 18th ACM conference on Information p p c and knowledge management, pages 255–264. ACM, 2009. 7 P∗ → V ← P → P? Paper can be relevant if it is cited by the ones published at the same venue as the [5] H. Muller, W. Muller, S. Marchand-Maillet, T. Pun, and D. M. relevant paper Squire. Strategies for positive and negative relevance feedback in 8 p p w P ∗ → V ← P −→ A ←− P ? w Paper can be relevant if its authors’ pa- image retrieval. In Pattern Recognition, 2000. Proceedings. 15th pers are published at the same venue as International Conference on, volume 1, pages 1043–1046. IEEE, the relevant paper 2000. [6] H. Saneifar, S. Bonniol, P. Poncelet, and M. Roche. Enhancing passage retrieval in log files by query expansion based on explicit and pseudo relevance feedback. Computers in Industry, 8. REFERENCES 65(6):937–951, 2014. [1] P. Bhatnagar and N. Pareek. Improving pseudo relevance feedback [7] Y. Sun and J. Han. Meta-path-based search and mining in based query expansion using genetic fuzzy approach and semantic heterogeneous information networks. Tsinghua Science and similarity notion. Journal of Information Science, page Technology, 18(4), 2013. 0165551514533771, 2014. [8] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. PathSim: Meta [2] X. Liu, Y. Yu, C. Guo, and Y. Sun. Meta-path-based ranking with path-based top-k similarity search in heterogeneous information pseudo relevance feedback on heterogeneous graph for citation networks. In Proc. 2011 Int. Conf. Very Large Data Bases (VLDB’11), Seattle, WA, 2011. [9] Y. Sun, B. Norick, J. Han, X. Yan, P. S. Yu, and X. Yu. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In Proc. of 2012 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’12), Beijing, China, 2012. [10] X. Wang, H. Fang, and C. Zhai. A study of methods for negative relevance feedback. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 219–226. ACM, 2008. [11] Y. Xu, G. J. Jones, and B. Wang. Query dependent pseudo-relevance feedback based on wikipedia. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 59–66. ACM, 2009. [12] X. Yu, X. Ren, Y. Sun, B. Sturt, U. Khandelwal, Q. Gu, B. Norick, and J. Han. Recommendation in heterogeneous information networks with implicit user feedback. In Proc. of 2013 ACM Int. Conf. Series on Recommendation Systems (RecSys’13), pages 347–350, Hong Kong, 2013.