=Paper= {{Paper |id=Vol-1393/paper-12 |storemode=property |title=Random Walk and Feedback on Scholarly Network |pdfUrl=https://ceur-ws.org/Vol-1393/paper-12.pdf |volume=Vol-1393 |dblpUrl=https://dblp.org/rec/conf/sigir/YuJL15 }} ==Random Walk and Feedback on Scholarly Network== https://ceur-ws.org/Vol-1393/paper-12.pdf

Random Walk and Feedback on Scholarly Network

Yingying Yu
College of Transportation Management
Dalian Maritime University
Dalian, China, 116026
uee870927@126.com

Zhuoren Jiang
College of Transportation Management
Dalian Maritime University
Dalian, China, 116026
jzr1986@gmail.com

Xiaozhong Liu
School of Informatics and Computing
Indiana University Bloomington
Bloomington, IN, USA, 47405
liu237@indiana.edu

Copyright © 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. Published on CEUR-WS: http://ceur-ws.org/Vol-1393/.

ABSTRACT
Random walk on a heterogeneous bibliographic graph has proven effective in previous studies. In this study, we use various kinds of positive and negative feedback to propose a novel method that enhances the performance of meta-path-based random walk for scholarly recommendation. We hypothesize that the nodes on the heterogeneous graph should play different roles for different queries and for various kinds of implicit/explicit feedback. Meanwhile, we show that the node usefulness probability has a significant impact on the path importance. When positive and negative feedback information is available, we can calculate each node's proximity to the feedback nodes and use the proximity to infer the usefulness probability of each node via the sigmoid function. By combining the transition probability and the usefulness probability of the nodes on a path instance, we propose a new random walk function to compute the importance of each path instance. Experimental results on the ACM full-text corpus show that the proposed method (considering the node usefulness) significantly outperforms the previous approaches.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Meta-path-based Random Walk, Feedback, Heterogeneous Graph

1. INTRODUCTION
The volume of scientific publications has increased dramatically in the past couple of decades, which challenges existing systems and methods for retrieving and accessing scientific resources. Classical text-based information retrieval algorithms can recommend candidate publications to scholars. However, most of them ignore the complex and heterogeneous relations among scholarly objects. Only recently have some studies shown that adopting mining approaches on heterogeneous information networks can significantly improve scholarly recommendation performance [3,7,9,12]. For instance, Liu et al. [2,3] constructed a heterogeneous scholarly graph and proposed a novel ranking method based on pseudo relevance feedback (PRF), which can effectively recommend candidate citation papers via different kinds of meta-paths on the graph.

In this paper, we further investigate feedback information and enhance the meta-path-based random walk performance. Intuitively, when user feedback is available, the nodes on the graph should play different roles in the final measure for different information needs. For example, given the two queries "Content-based Citation Recommendation" and "Heterogeneous Information Network", the same paper, "ClusCite: effective citation recommendation by information network-based clustering", may be retrieved by scholarly search engines such as Google Scholar. However, the target paper can be more useful (positive) for the second query than for the first one. As another example, if user X prefers to cite influential scholars' work, the highly cited authors will be useful for her, whereas if user Y tends to cite frontier work, she will mark the newest publications and new topics as useful feedback. Therefore, the same node may perform significantly differently depending on the information need and the feedback. Furthermore, by using (implicit/explicit, positive/negative) feedback, it is possible to infer the usefulness probability of the other nodes on the graph. Consequently, the importance of a path instance varies with the usefulness probability of its nodes.

The main contribution of this paper is threefold. First, the feedback is not limited to documents. On a scholarly network, users can judge authors, keywords and venues as either useful or not useful. If explicit user feedback is unavailable, we propose an approach that automatically generates the feedback nodes from the user queries and the relationships among the entities on the heterogeneous graph. Second, we infer the usefulness of the nodes from the feedback information. For instance, a node is less useful when it is close to the negative node(s). We conjecture that the usefulness probability of each node depends on its average proximity to the feedback set and can be estimated via the sigmoid function. Third, we emphasize that the node usefulness has a great impact on the path importance. Our approach to computing the random walk probability differs from the previous study in that it takes into account not only the transition probability but also the usefulness probability of the nodes. To verify these hypotheses, we adopt a number of meta-paths on the graph (Figure 1) and compare the classical random walk function with the novel method. Experimental results on the ACM corpus show that the proposed method significantly outperforms the original one.

The remainder of this paper is structured as follows. We 1) review relevant methodologies for pseudo relevance feedback, 2) introduce the preliminaries, 3) propose the improved methods, 4) describe the experiment setting and evaluation results, and 5) conclude with a discussion and outlook.

2. RELATED WORK
Pseudo relevance feedback, also known as blind relevance feedback, provides a way to perform automatic local analysis. When user judgments or interactions are not available, it turns out to be an effective method for improving retrieval performance. Traditional pseudo relevance feedback treats the top ranked documents as relevant feedback and then expands the initial queries. However, some of the top retrieved documents may be irrelevant, which can introduce noisy feedback into the process. Consequently, there have been various efforts to improve traditional pseudo feedback. [11] exploited the possible utility of Wikipedia for query dependent expansion. From the perspective of each query and each set of feedback documents, [4] proposed to dynamically predict an optimal balance coefficient for query expansion rather than using a fixed value. [1] suggested using evolutionary techniques along with a semantic similarity notion for query expansion. [6] introduced an approach that expands the queries for passage retrieval not from the top ranked documents but via a new term weighting function, which scores the terms of the corpus according to their relatedness to the query and identifies the most relevant ones. Instead of term expansion, graph-based feedback provides a new ranking assumption based on topology expansion. [2] used the pseudo relevant papers as the seed nodes and then explored potentially relevant nodes via specific restricted/combined meta-paths on the heterogeneous graph. Our study is motivated by this approach and mainly focuses on updating the random walk algorithm by investigating both positive and negative feedback. Positive and negative feedback has also been studied in image retrieval [5]: with several steps of positive and negative feedback, the retrieval performance can be increasingly enhanced. From the view of negative feedback, [10] studied and compared different kinds of methods and found that negative feedback is important especially when the target topic is difficult and the initial results are poor. Moreover, using multiple negative feedback methods can be more effective.

3. PRELIMINARIES
Following [2,8], an information network can be defined as follows.

DEFINITION 1. (Information network) An information network is defined as a directed graph $G = (V, E)$ with an object type mapping function $\tau : V \to \mathcal{A}$ and a link type mapping function $\phi : E \to \mathcal{R}$, where each object $v \in V$ belongs to one particular object type $\tau(v) \in \mathcal{A}$, each link $e \in E$ belongs to a particular relation $\phi(e) \in \mathcal{R}$, and if two links belong to the same relation type, the two links share the same starting object type as well as the same ending object type.

When there is more than one type of node or link in the information network, it is called a heterogeneous information network. In [8], Sun et al. further defined the meta-path as follows.

DEFINITION 2. (Meta-path) A meta-path $P$ is a path defined on the graph of the network schema $T_G = (\mathcal{A}, \mathcal{R})$, and is denoted in the form $\dot{A}_1 \xrightarrow{R_1} \dot{A}_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} \dot{A}_{l+1}$, which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between types $\dot{A}_1$ and $\dot{A}_{l+1}$, where $\circ$ denotes the composition operator on relations.

Given a specific scholarly network, there can be many kinds of meta-paths. For example, $P^{*} \xrightarrow{w} A \xleftarrow{w} P^{?}$ is a simple meta-path on the scholarly network, denoting all the papers published by the seed paper's author. $P^{*}$ is the starting paper node (seed node) in this path, and $P^{?}$ denotes the candidate publication node. More examples can be found in Table 1.

4. RESEARCH METHODS

4.1 Generate the Feedback Nodes
Generally, given the user's initial queries, a ranked list of publications is found via text retrieval. Based on the top ranked documents, the user could give explicit judgments on whether the related keywords, authors or venues are useful or not. However, explicit feedback is not easy to obtain. In this study, we propose methods to infer the implicit feedback nodes on the heterogeneous graph from the given information.

The feedback is a collection of nodes on the heterogeneous graph, each marked as useful (positive) or not useful (negative). We represent this collection as $NF$; $NF_P$ and $NF_N$ denote the positive and negative node sets respectively. The kinds of feedback nodes under discussion are keyword (K), author (A) and venue (V).

4.1.1 Generate the Positive Feedback Nodes
Since we know the initial queries (i.e., the author-provided paper keywords) that the users should be most concerned with, it is reasonable to take the explicit keywords $K_P$ as positive feedback nodes. Next, we infer the positive authors and venues from $K_P$. We deem the authors or venues that are highly likely to be related to $K_P$ to be positive as well. So we rank authors via the meta-paths $K_P \xrightarrow{con} A^{?}$ and $K_P \xleftarrow{r} P \xrightarrow{w} A^{?}$, and take the top ranked $K_{pos}$ authors as the pseudo positive authors $A_P$. Similarly, we locate the positive venues via $K_P \xrightarrow{con} V^{?}$ and $K_P \xleftarrow{r} P \xrightarrow{p} V^{?}$, and select the top ranked $K_{pos}$ venues as the positive nodes $V_P$.

4.1.2 Generate the Negative Feedback Nodes
To generate the negative feedback, our basic assumption is that the negative nodes should be directly related to the search results but least relevant to the explicit positive keywords. First, based on the text retrieval results, we define the top ranked $topK$ papers as $P_r$, and then we locate the keywords, authors and venues that are directly connected to $P_r$ via the meta-paths $P_r \xrightarrow{r} K_r$, $P_r \xrightarrow{w} A_r$ and $P_r \xrightarrow{p} V_r$.

Next, we filter the collections $K_r$, $A_r$ and $V_r$:
1. Rank the keywords $K_r$ via the transition probability of the meta-path $K_P \xrightarrow{con} P \xrightarrow{r} K_r$, and use the last ranked $K_{neg}$ keywords as the pseudo negative nodes $K_N$.
2. Similarly, rank the authors $A_r$ via the transition probability of the meta-path $K_P \xrightarrow{con} P \xrightarrow{w} A_r$, and use the last ranked $K_{neg}$ authors as the pseudo negative nodes $A_N$.
3. Rank the venues $V_r$ via $K_P \xrightarrow{con} P \xrightarrow{p} V_r$, and use the last ranked $K_{neg}$ venues as the negative nodes $V_N$.

Here we use $K_P \xrightarrow{con} P$ instead of $K_P \xleftarrow{r} P$ because the "contribution" relation characterizes the importance of each paper given a topic; it does not necessarily mean that the paper is relevant to the topic [2]. Even if a paper is not explicitly relevant to some topic, it might still be important. The "contribution" relation therefore conveys more information.

Thus, we obtain all the positive and negative feedback nodes: $NF_P$ includes $K_P$, $A_P$ and $V_P$, and $NF_N$ contains $K_N$, $A_N$ and $V_N$.

4.2 Infer the Usefulness Probability of Node
Unlike in previous studies, the nodes on the scholarly network are not treated as equally important in this paper. The usefulness probability of
node $N_i$ is determined by the feedback nodes. Intuitively, if node $N_i$ is more closely related to the positive nodes, it could be more useful. Conversely, if $N_i$ is much closer to the negative nodes and further away from the positive nodes, $N_i$ may not be very useful. Therefore, the proximity between a given node and the feedback node set is crucial. Note that the usefulness probability of each node varies with the feedback node set.

To infer the usefulness probability of node $N_i$, we adopt the sigmoid function

$$P_u(N_i) = \frac{1}{1 + e^{-\alpha D(N_i)}}$$

to convert the proximity into a probability, where $\alpha$ controls the convergence rate (default is 1). In our assumption, if a feedback node $N_j$ is positive, $P_u(N_j) = 1$; otherwise $P_u(N_j) = 0$. $D(N_i)$ denotes the proximity between $N_i$ and the feedback node set $NF$. It is derived from the following formula:

$$D(N_i) = \frac{\sum_{N_k \in NF_N} d(N_i, N_k)}{|NF_N|} - \frac{\sum_{N_j \in NF_P} d(N_i, N_j)}{|NF_P|}$$

where $|NF_N|$ and $|NF_P|$ are the sizes of the collections $NF_N$ and $NF_P$ respectively, and $d(N_i, N_j)$ indicates the proximity between node $N_i$ and node $N_j$. In this paper, we estimate the proximity $d(N_i, N_j)$ from the paths between $N_i$ and $N_j$ on the graph. There can be many path instances connecting $N_i$ and $N_j$. If a path is too long, its influence is too small to be considered, so we assume a maximum path length of 10. We then select the shortest path and define its length as the proximity $d(N_i, N_j)$.

If $D(N_i)$ is negative, node $N_i$ is closer to the negative nodes than to the positive ones, which means that $N_i$ could be less important, and vice versa. In particular, if $D(N_j) \to +\infty$, $N_j$ is far away from the negative feedback nodes, so $P_u(N_j) \to 1$; if $D(N_j) = 0$, $N_j$ has the same distance to the negative and positive nodes, so $P_u(N_j) = 0.5$; if $D(N_j) \to -\infty$, $N_j$ is closest to the negative feedback nodes, so $P_u(N_j) \to 0$.

4.3 Compute the Random Walk Probability Based on Meta-path
A meta-path illustrates how the nodes are connected in the heterogeneous graph. Once a meta-path is specified, a meta-path-based ranking function is defined, so that the relevant papers determined by the ranking function can be recommended [3]. Meta-path-based feedback on the heterogeneous graph has been reported to perform better than PRF based on other methods such as PageRank [2]. Random walk on a heterogeneous network can explore more global information by combining multiple feedback nodes, which might be very important for recommendation tasks.

To quantify the ranking score of the candidates relevant to the seeds following one given meta-path, a random-walk-based approach was proposed in [2]. The relevance between $P^{*}$ and $P^{?}$ can be estimated via

$$s(a_i^{(1)}, a_j^{(l+1)}) = \sum_{t = a_i^{(1)} \rightsquigarrow a_j^{(l+1)}} RW(t)$$

where $t$ is a path instance from node $a_i^{(1)}$ to node $a_j^{(l+1)}$ following the specified meta-path, and $RW(t)$ is the random walk probability of the instance $t$.

Suppose $t = (a_{i1}^{(1)}, a_{i2}^{(2)}, \ldots, a_{i,l+1}^{(l+1)})$. The random walk probability can be computed via

$$RW(t) = \prod_j w(a_{ij}^{(j)}, a_{i,j+1}^{(j+1)})$$

However, this formula only considers the weights of the links on the path instance. Based on our hypothesis, the node usefulness probability has a great effect on the path importance. So in this study, we propose a novel random walk function:

$$RW(t) = \prod_j \left( \beta \cdot w(a_{ij}^{(j)}, a_{i,j+1}^{(j+1)}) + (1 - \beta) \cdot P_u(a_{i,j+1}^{(j+1)}) \right)$$

where $P_u(a_{i,j+1}^{(j+1)})$ is the usefulness probability of the node $a_{i,j+1}^{(j+1)}$ on the path (derived in Section 4.2), and $\beta$ determines which factor is more important. Theoretically, $\beta$ needs to be tuned for each meta-path to optimize the weight of each sub-meta-path. For this study, we set $\beta = 0.6$.

The random walk probability is thus decided by the transition probability and the usefulness probability of the nodes on the path instance. In this paper, we use eight meta-paths to investigate the novel random walk method with node feedback information for citation recommendation. All the meta-paths are listed in Table 1.

5. EXPERIMENT

5.1 Data Preprocessing
We used 41,370 computer science publications (as the candidate citation collection), published between 1951 and 2011, for the experiment (mainly from the ACM digital library). Following [2], we constructed the heterogeneous graph shown in Figure 1 and Table 2.

For the evaluation, we used a test collection of 274 papers. Each selected paper has more than 15 citations from the candidate citation collection.

Table 2: Graph statistics
Node/Edge                  Number      Description
P                          41,370      Paper
A                          63,323      Author
V                          369         Venue
K                          3,911       Keyword
$P \xrightarrow{c} P$      168,554     Paper cites another paper
$P \xrightarrow{w} A$      105,992     Paper is written by an author
$P \xrightarrow{p} V$      41,013      Paper is published at venue
$A \xrightarrow{co} A$     239,744     Co-author relationship
$P \xrightarrow{r} K$      587,252     Paper is relevant to keyword (topic)
$K \xrightarrow{con} P$    3,577,111   Keyword (topic) is contributed by paper
$K \xrightarrow{con} A$    2,397,205   Keyword (topic) is contributed by author
$K \xrightarrow{con} V$    18,450      Keyword (topic) is contributed by venue

5.2 Generate Feedback Nodes
Obtaining different types of feedback information is the most important part of this research. Since user judgments were not immediately available, we used the method introduced in Section 4.1 to create the positive and negative feedback nodes. As mentioned above, the collection $K_P$ is the set of user-given keywords and serves as explicit positive feedback, while $A_P$ and $V_P$ are derived from their connectivity to $K_P$ on the heterogeneous graph. Here we set $K_{pos} = 10$ and take the top 10 ranked authors/venues as the implicit positive feedback.

Next, we produced the implicit negative feedback nodes. From the text retrieval results, we took the top ranked papers as $P_r$ ($topK = 20$). Then we located the keywords/authors/venues that are directly connected to $P_r$ but least relevant to $K_P$. We took the last ranked $K_{neg} = 10$ of each type as $K_N$, $A_N$ and $V_N$ respectively.

5.3 Experiment Result
In the evaluation, we experimented with 8 different meta-paths. For each meta-path, two sets of results are shown in rows 'N' and 'Y' of Table 3. The 'N/Y' column in Table 3 indicates whether the positive and negative feedback nodes were used for computing the path importance. 'N' indicates that the result comes from the baseline in [2], while 'Y' means that multiple feedback nodes were employed and the node influence was incorporated into the final random walk function. MAP and NDCG are used as the ranking function training and evaluation metrics. For MAP, a binary judgment is provided for each candidate cited paper (cited or not cited). NDCG estimates the cumulative relevance gain a user receives by examining the recommendation results up to a given rank on the list. We used an importance score from 0 to 4 for each candidate cited paper to calculate the NDCG scores. In most cases, row 'Y' significantly outperforms row 'N', which shows that the positive/negative feedback enhances the random walk performance. We also used a t-test to verify this improvement, and most meta-paths are significantly improved.

6. CONCLUSION AND LIMITATIONS
In this study, we use multiple kinds of feedback nodes and propose a new method to enhance the meta-path-based random walk performance. The new random walk function considers both the transition probability and the node usefulness probability on the path instance. We find that the node influence varies with the set of feedback nodes, which can be inferred from the explicit user queries via a series of steps. Experimental results on ACM data illustrate that the new approach with positive/negative feedback information helps to improve the performance of meta-path-based recommendation.

For further study, we will continue this approach with real explicit user feedback and design a personalized recommendation model to improve the user experience. Not only is the node usefulness related to the feedback nodes, but the weight of each relation type may also be affected by the feedback nodes or the retrieval task. If the retrieval task is to search for relevant papers based on given authors, the author feedback nodes will be more useful for the "writtenby" relation, and the "writtenby" and "co-author" relations might be more important. This hypothesis will be examined in the next step. In addition, more sophisticated inference models will be adopted, which may enhance the ranking performance.
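
The usefulness probability of Section 4.2 and the random walk function of Section 4.3 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. The toy graph, the node names and the transition weights are hypothetical, edges are treated as undirected when measuring proximity, and a node pair with no connecting path within 10 hops is assigned distance 10, a case the text above does not specify.

```python
import math
from collections import deque

MAX_LEN = 10  # maximum path length considered in Section 4.2


def shortest_path_length(adj, src, dst, max_len=MAX_LEN):
    """Hop count of the shortest path from src to dst, found by BFS.

    Assumption of this sketch: if dst is not reachable within max_len
    hops, max_len itself is returned as the distance.
    """
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_len:
            continue  # do not expand beyond the maximum path length
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return max_len


def proximity(adj, node, positives, negatives):
    """D(N_i): mean distance to negative nodes minus mean distance to positive nodes."""
    d_neg = sum(shortest_path_length(adj, node, n) for n in negatives) / len(negatives)
    d_pos = sum(shortest_path_length(adj, node, p) for p in positives) / len(positives)
    return d_neg - d_pos


def usefulness(adj, node, positives, negatives, alpha=1.0):
    """P_u(N_i): 1 for positive feedback nodes, 0 for negative ones, sigmoid of D otherwise."""
    if node in positives:
        return 1.0
    if node in negatives:
        return 0.0
    return 1.0 / (1.0 + math.exp(-alpha * proximity(adj, node, positives, negatives)))


def random_walk_score(path, weight, p_u, beta=0.6):
    """RW(t): product over the steps of beta * w(a, b) + (1 - beta) * P_u(b)."""
    score = 1.0
    for src, dst in zip(path, path[1:]):
        score *= beta * weight[(src, dst)] + (1.0 - beta) * p_u[dst]
    return score


# Hypothetical toy graph: seed paper, its author, a candidate paper,
# one positive keyword and one negative keyword.
TOY_ADJ = {
    "P_seed": ["A1", "K_neg"],
    "A1": ["P_seed", "P_cand"],
    "P_cand": ["A1", "K_pos"],
    "K_pos": ["P_cand"],
    "K_neg": ["P_seed"],
}
POSITIVES = {"K_pos"}
NEGATIVES = {"K_neg"}
# Hypothetical transition probabilities along the path instance.
WEIGHTS = {("P_seed", "A1"): 0.5, ("A1", "P_cand"): 0.5}

if __name__ == "__main__":
    path = ["P_seed", "A1", "P_cand"]
    p_u = {n: usefulness(TOY_ADJ, n, POSITIVES, NEGATIVES) for n in TOY_ADJ}
    print("baseline RW(t):", random_walk_score(path, WEIGHTS, p_u, beta=1.0))
    print("feedback RW(t):", random_walk_score(path, WEIGHTS, p_u, beta=0.6))
```

Setting `beta=1.0` reduces the score to the product of the transition weights, i.e. the original random walk function, so the two printed values show how much the usefulness of the visited nodes shifts the importance of this single path instance. In the toy graph the candidate paper is closer to the positive keyword than to the negative one, so its usefulness is above 0.5 and the feedback-aware score is higher than the baseline.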

7. FIGURES AND TABLES

Figure 1: Heterogeneous Bibliographic Graph

Table 3: Meta-path Based Random Walk Performance Comparison ($|P^{*}| = 10$)
NO.  N/Y  MAP         MAP@5       MAP@10      NDCG        NDCG@5      NDCG@10
1    N    0.0277      0.0085      0.0129      0.1035      0.0306      0.0394
1    Y    0.0365 ***  0.015 ***   0.0211 ***  0.1149 ***  0.0459 **   0.0565 ***
2    N    0.1315      0.0552      0.0773      0.2193      0.1427      0.1548
2    Y    0.1459 ***  0.0678 ***  0.0904 ***  0.2307 **   0.1656 ***  0.1705 **
3    N    0.0744      0.0306      0.0404      0.1539      0.0689      0.0766
3    Y    0.0948 ***  0.0441 ***  0.0582 *    0.1707 ***  0.0945 *    0.1002 **
4    N    0.027       0.0042      0.0076      0.1378      0.0146      0.025
4    Y    0.038 ***   0.0109 ***  0.0153 ***  0.1521 ***  0.0318 ***  0.0387 ***
5    N    0.0436      0.0121      0.0187      0.1672      0.0476      0.0585
5    Y    0.0561 ***  0.0257 ***  0.0328 ***  0.1854 ***  0.0867 ***  0.0885 ***
6    N    0.0327      0.0234      0.03        0.0734      0.0693      0.0748
6    Y    0.0872 ***  0.0359 ***  0.0471 ***  0.1962 ***  0.0805 *    0.09 *
7    N    0.0238      0.0083      0.0097      0.1529      0.0216      0.0224
7    Y    0.0373 ***  0.0133 ***  0.0163 ***  0.1718 ***  0.0317 **   0.0344 **
8    N    0.0092      0.0005      0.0007      0.1397      0.0011      0.0013
8    Y    0.012 ***   0.0011 ***  0.0017 ***  0.1476 ***  0.0027 ***  0.0045 ***
p < 0.05: *, p < 0.01: **, p < 0.001: ***

Table 1: All the meta-paths used in this study
NO.  Meta-path                                                                          Feedback ranking hypothesis
1    $P^{*} \xrightarrow{w} A \xleftarrow{w} P^{?}$                                     Relevant paper's author's other papers can be relevant
2    $P^{*} \xrightarrow{c} P^{?}$                                                      Relevant paper's cited papers can be relevant
3    $P^{*} \xrightarrow{c} P \xrightarrow{c} P^{?}$                                    Relevant paper's cited paper's cited paper can be relevant
4    $P^{*} \xrightarrow{c} P \xrightarrow{w} A \xleftarrow{w} P^{?}$                   Relevant paper's cited papers' authors' papers can be relevant
5    $P^{*} \xrightarrow{w} A \xrightarrow{co} A \xleftarrow{w} P^{?}$                  Relevant paper's author's co-author's papers can be relevant
6    $P^{*} \xrightarrow{w} A \xleftarrow{w} P \xrightarrow{c} P^{?}$                   Relevant paper's author's cited papers can be relevant
7    $P^{*} \xrightarrow{p} V \xleftarrow{p} P \xrightarrow{c} P^{?}$                   Paper can be relevant if it is cited by the ones published at the same venue as the relevant paper
8    $P^{*} \xrightarrow{p} V \xleftarrow{p} P \xrightarrow{w} A \xleftarrow{w} P^{?}$  Paper can be relevant if its authors' papers are published at the same venue as the relevant paper

8. REFERENCES
[1] P. Bhatnagar and N. Pareek. Improving pseudo relevance feedback
based query expansion using genetic fuzzy approach and semantic
similarity notion. Journal of Information Science, page
0165551514533771, 2014.
[2] X. Liu, Y. Yu, C. Guo, and Y. Sun. Meta-path-based ranking with
pseudo relevance feedback on heterogeneous graph for citation
recommendation. In Proceedings of the 23rd ACM International
Conference on Conference on Information and Knowledge
Management, pages 121–130. ACM, 2014.
[3] X. Liu, Y. Yu, C. Guo, Y. Sun, and L. Gao. Full-text based
context-rich heterogeneous network mining approach for citation
recommendation. In ACM/IEEE Joint Conference on Digital
Libraries, 2014.
[4] Y. Lv and C. Zhai. Adaptive relevance feedback in information
retrieval. In Proceedings of the 18th ACM conference on Information
and knowledge management, pages 255–264. ACM, 2009.
[5] H. Muller, W. Muller, S. Marchand-Maillet, T. Pun, and D. M.
Squire. Strategies for positive and negative relevance feedback in
image retrieval. In Pattern Recognition, 2000. Proceedings. 15th
International Conference on, volume 1, pages 1043–1046. IEEE,
2000.
[6] H. Saneifar, S. Bonniol, P. Poncelet, and M. Roche. Enhancing
passage retrieval in log files by query expansion based on explicit
and pseudo relevance feedback. Computers in Industry,
65(6):937–951, 2014.
[7] Y. Sun and J. Han. Meta-path-based search and mining in
heterogeneous information networks. Tsinghua Science and
Technology, 18(4), 2013.
[8] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. PathSim: Meta
path-based top-k similarity search in heterogeneous information
networks. In Proc. 2011 Int. Conf. Very Large Data Bases
(VLDB’11), Seattle, WA, 2011.
[9] Y. Sun, B. Norick, J. Han, X. Yan, P. S. Yu, and X. Yu. Integrating
meta-path selection with user-guided object clustering in
heterogeneous information networks. In Proc. of 2012 ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining
(KDD’12), Beijing, China, 2012.
[10] X. Wang, H. Fang, and C. Zhai. A study of methods for negative
relevance feedback. In Proceedings of the 31st annual international
ACM SIGIR conference on Research and development in information
retrieval, pages 219–226. ACM, 2008.
[11] Y. Xu, G. J. Jones, and B. Wang. Query dependent pseudo-relevance
feedback based on wikipedia. In Proceedings of the 32nd
international ACM SIGIR conference on Research and development
in information retrieval, pages 59–66. ACM, 2009.
[12] X. Yu, X. Ren, Y. Sun, B. Sturt, U. Khandelwal, Q. Gu, B. Norick,
and J. Han. Recommendation in heterogeneous information networks
with implicit user feedback. In Proc. of 2013 ACM Int. Conf. Series
on Recommendation Systems (RecSys’13), pages 347–350, Hong
Kong, 2013.