<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Transition probability
(vj ) from A to its neighbors vj
p =</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>PageRank Revisited: On the Relationship between Node Degrees and Node Significances in Different Applications ∗</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jung Hyun Kim</string-name>
          <email>jkim294@asu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>K. Selçuk Candan</string-name>
          <email>candan@asu.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Luisa Sapino</string-name>
          <email>marialuisa.sapino@unito.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Arizona State University</institution>
          ,
          <addr-line>Tempe, AZ 85287</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Arizona State University</institution>
          ,
          <addr-line>Tempe, AZ 85287-8809</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Torino</institution>
          ,
          <addr-line>I-10149 Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <volume>0</volume>
      <issue>2</issue>
      <abstract>
        <p>Random-walk based techniques, such as PageRank, encode the structure of the graph in the form of a transition matrix of a stochastic process from which significances of the graph nodes can be inferred. Recommendation systems leverage such node significance measures to rank the objects in the database. Context-aware recommendation techniques complement the data graph with additional data that provide the recommendation context. However, despite their wide-spread use in many graph-based knowledge discovery and recommendation applications, conventional PageRank-based measures have various shortcomings. As we experimentally show in this paper, one such shortcoming is that PageRank scores are tightly coupled with the degrees of the graph nodes, whereas in many applications the relationship between the significance of the node and its degree in the underlying network may not be as implied by PageRank-based measures. In fact, as we also show in the paper, in certain applications, the significance of the node may be negatively correlated with the node degree and in such applications a naive application of PageRank may return poor results. To address these challenges, in this paper, we propose degree decoupled PageRank (D2PR) techniques to improve the effectiveness of PageRank based knowledge discovery and recommendation systems. These suitably penalize or (if needed) boost the transition strength based on the degree of a given node to adapt the node significances based on the network and application characteristics. ∗This work is supported by NSF Grants 1339835 ”E-SDMS: Energy Simulation Data Management System Software”, 1318788 ”Data Management for Real-Time Data Driven Epidemic Spread Simulations”, 1518939 ”RAPID: Understanding the Evolution Patterns of the Ebola Outbreak in West-Africa and Supporting Real-Time Decision Making and Hypothesis Testing through Large Scale Simulations”, and 1430144 ”Fraud Detection via Visual Analytics: An Infrastructure to Support Complex Financial Patterns (CFP) based Real-Time Services Delivery”.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In recent years, there has been considerable interest in measuring
the significance of a node in a graph and relatedness between two
nodes in the graph, as if measured accurately, these can be used
for supporting many knowledge discovery, search, and
recommen1.1</p>
      <p>
        PageRank as a Measure of Significance
Since enumerating all paths among the graph nodes would
require time exponential in the size of the graph, random-walk based
techniques encode the structure of the network in the form of a
transition matrix of a stochastic process from which the node
significance can be inferred.PageRank [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is one of the most widely-used
random-walk based methods for measuring node significance and
has been used in a variety of application domains, including web
search, biology, and social networks. The basic thesis of
PageRank is that a node is important if it is pointed to by other important
nodes – it takes into account the connectivity of nodes in the graph
by defining the score of the node vi ∈ V as the amount of time
spent on vi in a sufficiently long random walk on the graph. More
specifically, given a graph G(V, E), the PageRank scores are
represented as ~r, where
      </p>
      <p>~r = αTG~r + (1 − α)~t
where TG is a transition matrix corresponding to the graph G, ~t is
a teleportation vector (such that ~t[i] = kV1 k ), and α is the residual
probability (or equivalently, (1 − α) is the so-called teleportation
probability). Unless the graph is weighted, the transition matrix,
TG, is constructed such that for a node v with k (outgoing)
neighbors, the transition probability from v to each of its (outgoing)
neighbors will be 1/k. If the graph is weighted, then the
transition probabilities are adjusted in a way to account for the relative
weights of the (outgoing) edges.
1.2</p>
      <p>Tight Coupling of PageRank Scores of
Nodes and their Degrees</p>
      <p>Let us consider an undirected graph G(V, E). There are two
factors that contribute to the PageRank of a given node, v ∈ V :
• Factor 1: Significance of Neighbors: The more significant
the neighbors of a node are, the higher its likelihood to be
also significant.
• Factor 2: Number of Neighbors (Degree of the Node) : Even
if the neighbors are not all significant, a large number of
Correlation between
PageRank and Degree
neighbors would imply that the node, v, is well-connected
and, thus, likely to be structurally important.</p>
      <p>In theory, these two factors should complement each other. In
practice, however, the PageRank formulation described above implies
that there is a very tight coupling between the degrees of the nodes
in the graph and their PageRank scores (see Table 1).
1.2.1</p>
      <p>Problem I: When a Large Node Degree Does
Not Indicate High Node Significance</p>
      <p>In this paper, we highlight (and experimentally show) that,
in many applications, node degree and node significance are in fact
inversely related and that the tight-coupling between node degrees
and PageRank scores might be counter-productive in generating
accurate recommendations.</p>
      <p>EXAMPLE 1. Consider, for example, a recommendation
application where a movie graph, consisting of movie and actor
nodes, is used for generating movie recommendations. In this
application, the first factor (significance of neighbors) clearly has a
positive contribution: a movie with good actors is likely to be a
good movie and an actress playing in good movies is likely to be
a good actress. On the other hand, the second factor (number of
neighbors) may in fact be a negative contributor to node
significance: the fact that an actor has played in a large number of
movies may be a sign that he is a non-discriminating (’B movie’)
actor, whereas an actress with relatively fewer movies may be a
more discriminating (’A movie’) actress.</p>
      <p>As we see in Section 4, this observation turns out to be true in many
applications, where (a) acquiring additional edges has a cost that is
correlated with the significance of the neighbor (e.g. the effort one
needs to invest to a high quality movie) and (b) each node has a
limited budget (e.g. total effort an actor/actress can invest in his/her
work).
1.2.2</p>
      <p>Problem II: When PageRank Does Not
Sufficiently Account for Contributions of Degrees
The mismatch between PageRank and node significance is not
limited to the cases where node degrees are inversely related to the
node significance. As we see in Section 4, there are other scenarios
where PageRank may, in fact, fail to sufficiently account for the
contribution of the node degrees to their significances.
1.3</p>
      <p>PageRank Revisited: De-coupling Node
Significance from Node Degrees</p>
      <p>As we discussed above, one key shortcoming of the conventional
PageRank scores is that they are often tightly coupled with the
degrees of the graph nodes and in many applications the relationship
between the significance of the node and its degree in the
underlying network may not be as implied by PageRank-based measure: in
certain applications, the significance of the node may be negatively
correlated with the node degree, whereas in others PageRank may
not be sufficient in accounting for degree contributions. Naturally,
in such applications a naive application of PageRank in generating
recommendations may return poor results.</p>
      <p>To address these challenges, in this paper, we propose degree
decoupled PageRank (D2PR) techniques to improve the effectiveness
of PageRank based knowledge discovery and recommendation
systems. These techniques suitably penalize or (if needed) boost1 the
transition strength based on the degree of a given node to adapt the
node significances based on the network and application
characteristics. This paper is organized as follows: Next, we discuss the
related literature. In Sections 3, we introduce the proposed
degreedecoupled PageRank techniques. We evaluate the proposed
techniques in Section 4 and conclude in Section 5.
2.
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORKS</title>
    </sec>
    <sec id="sec-3">
      <title>Context-Sensitive PageRank</title>
      <p>
        Path-length based definitions of node relatedness, such as those
proposed by [
        <xref ref-type="bibr" rid="ref24 ref4">4, 24</xref>
        ] help capture the relatedness of a pair of nodes
solely based on the properties of the nodes and edges on the shortest
path between the pair. Random-walk based definitions, such as
hitting distance [
        <xref ref-type="bibr" rid="ref10 ref21">10, 21</xref>
        ] and personalized page rank (PPR) score [
        <xref ref-type="bibr" rid="ref1 ref16 ref9">1,
9, 16</xref>
        ], of node relatedness further take into account the density of
the edges: as in path-length based definitions, random-walk based
definitions also recognize that a node is more related to another
node if there are short paths between them; however, random
walkbased definitions of relatedness also consider how well the given
pair of nodes are connected.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], authors construct a transition matrix, TS , where edges
leading away from the seed nodes are weighted less than those
edges leading towards the seed nodes. An alternative approach for
contextualizing PageRank scores is to use the PPR techniques [
        <xref ref-type="bibr" rid="ref1 ref9">1,9</xref>
        ]
discussed in the introduction. One key advantage of this
teleportation vector modification based approach over modifying the
transition matrix, as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], is that the term α can be used to
directly control the degree of seeding (or personalization) of the PPR
score. [
        <xref ref-type="bibr" rid="ref10 ref21">10, 21</xref>
        ] rely on a random walk hitting time based approach,
where the hitting time is defined as the expected number of steps a
random walk from the source vertex to the destination vertex will
take. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] leveraged these properties of PPR to develop
localitysensitive algorithms to rank nodes of graphs which are relative to a
given set of seed nodes efficiently.
2.2
      </p>
      <p>
        Improvements to the PageRank Function
Due to the obvious relationship between ranking and monetary
rewards (e.g. through selling of advertisements on web search
applications), there has been considerable effort in engineering (or
manipulating) graphs in a way to maximize ranking scores of
particular nodes. This is commonly referred to as PageRank
optimization. One way to achieve this goal is carefully adding or removing
certain links: If, for example, one or more colluding webmasters
can add or remove edges, PageRank scores of target web pages
or domains can be increased [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] established several bounds
indicating to what extent the rank of the pages of a website can
be changed and the authors derived an optimal referencing
strategy to boost PageRank scores. A related, but opposite, problem is
to protect the PageRank scores against negative links (which may
indicate, for example, negative influence or distrust in a social
network), artificial manipulation, and spam. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], for example, focused
on identifying spam pages and link farms and showed that better
PageRank scores can be obtained after filtering spam pages and
links. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], authors show that PPR algorithms that do not
differentiate among the seed nodes may not properly rank nodes and
present robust personalized PageRank (RPR) strategies, which are
insensitive to noise in the set of seed nodes.
1In this context, de-coupled does not necessarily imply
decorrelated. In fact, D2PR can boost correlation between node
degree and PageRank if that is required by the application.
      </p>
      <p>
        There are some efforts to change the impact of degrees on the
PageRank computation. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a way to boost the power of
low-degree nodes in a network. The impact from nodes which are
important but are not hubs is relatively small compared to other
nodes which are less important with high degrees. To boost the
low-degree important nodes for equal opportunity, the teleportation
vector is modified with being proportional to the degrees of nodes.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] boosted the degrees of nodes to reduce the expected cover time
of the entire graph by the biassed random-walk.
      </p>
      <p>DEGREE DE-COUPLED PAGERANK
The key difficulty of de-coupling node degrees from the
PageRank scores is that the definition of the PageRank, based on random
walk transitions, is inherently dependent on the number of
transitions available from one node to the other. As we mentioned above,
the more ways there are to reach into a node, the higher will be its
PageRank score.
3.1</p>
    </sec>
    <sec id="sec-4">
      <title>Desideratum</title>
      <p>Therefore, to de-couple the PageRank score from node degrees,
we need to modify the transition matrix. In particular, for each node
vi in the graph, we would like to be able to control the transition
process with a single parameter (p), such that
• if p ≪ −1, transitions from node vi are ∼ 100% towards
the neighbor with the highest degree,
• if p = −1, transition probabilities from node vi are
proportional to the degrees of its neighbors,
• if p = 0, the transition probabilities mirror the standard
PageRank probabilities (assuming undifferentiated
neighbors),
• if p = 1, transition probabilities from node vi are inversely
proportional to the degrees of its neighbors,
• if p ≫ 1, transitions from node vi are ∼ 100% towards the
neighbor with the lowest degree.</p>
      <p>In other words, the transition function should de-couple the
transition process from node-degrees and penalize or boost the
contributions of node degrees in the transition process, as needed.
3.2</p>
      <p>Degree De-coupling Transition Matrix
In this subsection, we will consider degree de-coupling of the
transition matrix as implied by the above desideratum.
3.2.1</p>
      <sec id="sec-4-1">
        <title>Undirected Unweighted Graphs</title>
        <p>Let G = (V, E) be an undirected and unweighted graph. Let
α also be a given residual probability parameter, and deg(v) be a
function which returns the number of edges on the node v. We
represent degree de-coupled PageRank (D2PR) scores in the form
of a vector</p>
        <p>d~ = αTDd~ + (1 − α)~t,
where ~t is the teleportation vector, such that ~t[i] = kV k for all i
1
and TD is a degree de-coupled transition matrix,</p>
        <p>TD(j, i) =</p>
        <p>deg(vj )−p
Pvk∈neighbor(vi) deg(vk)−p ,
(1)
where
• TD(j, i) denotes the degree de-coupled transition
probability from node vi to node vj over an edge eij = [vi → vj ]
when there exists at least one edge between two nodes,
• neighbor(vi) is the set of all neighbors of the source node,
vi, and
A
C
D</p>
        <p>F
(a) A sample graph</p>
        <p>Intuitively, the numerator term, deg(vj )−p, ensures that the edge
incoming to vj is weighted by its degree: if p &gt; 0, then its
degree negatively impacts (reduces) transition probabilities into vj , if
p &lt; 0 then its degree positively impacts (boosts2) transition
probabilities into vj , and if p = 0, we obtain the standard PageRank
formulation without degree de-coupling. In other words, the
transition function satisfies our desideratum of de-coupling the
transition process from node-degrees and penalizing or boosting the
contributions of node degrees on-demand. Note that, since all
transitions from the node vi are degree de-coupled individually
based on the degrees of their destinations, the denominator term,
Pvk∈neighbor(vi) deg(vk)−p, ensures that the transition
probabilities from node vi add up to 1.0. Note also that when there is no
edge between node vi and vj , TD(j, i) = 0 and, consequently, the
term TD(j, i) is not affected by the degree de-coupling process.</p>
        <p>EXAMPLE 2. Figure 1 shows how the random walk
probabilities are differentiated in a degree de-coupled transition matrix on
a sample graph where a node A has three neighbors, B (with
degree 2), C (with degree 3), and D (with degree 1). In
conventional PageRank, the transition probabilities from node A to all its
neighbor nodes are equal to 0.33. In degree de-coupled PageRank
(D2PR), however, the value of p is used for explicitly accounting
for the impact of node degree on the transition probabilities: When
p = 2, the transition probabilities from A to its neighbors are 0.18,
0.08, and 0.74, which penalizes nodes which have larger degrees,
whereas when p = −2, D2PR boosts the transition probabilities
to large degree nodes leading to transition probabilities 0.29, 0.64,
and 0.07, respectively. ⋄</p>
        <p>
          This example shows that, in degree de-coupled PageRank
(D2PR), as we also see in Table 2, the value of p can be used to
penalize (p &gt; 0) or boost (p &lt; 0) transition probabilities based on
the degree of the destination, vj .
2In fact, a similar function was used in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to quickly locate nodes
with higher degrees in a given graph.
        </p>
        <p>The semantics of degree de-coupling is slightly different in
directed graphs. In particular, edges incoming to vi often do not
require a particular effort from vi to establish and hence are often out
of the control of vi, but indicate a certain degree of interestingness,
usefulness, or authority as perceived by others. The same is not
true for edges outgoing from vi; in particular, a vertex with a large
number of outgoing edges may either indicate a potential hub or
simply indicate a non-discerning connection maker. The distinction
between these two situations gains importance especially in
applications where establishing a new connection has a non-negligible
cost to the source node and, thus, a large number of outgoing edges
may indicate either (a) a very strong participant to the network or
(b) a very poor participant with a large number of weak linkages.</p>
        <p>Let G = (V, E) be a directed graph and for the simplicity of the
discussion, without any loss of generality, let us assume that G is
unweighted. Let us also be given a residual probability parameter,
α and let outdeg(v) be a function which returns the number of
outgoing edges from the node v. The degree de-coupled PageRank
(D2PR) scores can be represented in the form of a vector d~, d~ =
αTD d~ + (1 − α)~t, where ~t is the teleportation vector, such that
~t[i] = kV k for all i and</p>
        <p>1
where TD(j, i) denotes the degree de-coupled transition
probability from node vi to node vj over an edge eij = [vi → vj ],
out_edges(vi) is the set of out-going edges from the source node,
vi, and p ∈ R is a degree de-coupling weight.</p>
        <p>EXAMPLE 3. Figure 2 (a) in Section 4 provides an example
illustrating the correlations between the degree de-coupled
PageRank (D2PR) scores and external evidence for different values of p
for some application: here, the higher the correlation, the better
resulting ranking reflects the application semantics. As we see in this
example, which we will investigate in greater detail in Section 4,
the optimal de-coupling weight is not always p = 0 as implied by
the conventional PageRank measure. In this particular case, for
example, the correlation between D2PR and external evidence of
significance is maximized when the de-coupling weight, p, is equal
to 0.5, implying that in this application a moderate degree of
penalization based on the node degrees is needed to align PageRank
scores and application semantics. ⋄
3.2.3</p>
      </sec>
      <sec id="sec-4-2">
        <title>Weighted Graphs</title>
        <p>Once again, the semantics of degree de-coupling need to be
reconsidered for weighted graphs. Let G = (V, E, w) be a directed,
weighted graph, where w(e) is a function which returns the weight
of the edge associated with edge e. It is important to note that, in
such a graph, the weight of an edge can 1) indicate the strength
of the connection between two nodes (thus positively contributing
to the significance of the destination node); and at the same time
and 2) contribute to the degree of a node as a multiplier (thus
positively or negatively contributing to the node significance depending
on the degree-sensitivity of the application). In other words, given
an edge eij = [vi → vj ], from node vi to node vj , the transition
probability from vi to vj can be written as</p>
        <p>T(j, i) = βTconn_strength(j, i) + (1 − β)TD(j, i),
where</p>
        <p>Tconn_strength(j, i) =
w(vi → vj )
P[vi→vh]∈out_edges(vi) w(vi → vh)
accounts for the connection strength (as in the conventional
PageRank) whereas TD is a degree de-coupled transition matrix,
TD(j, i) =</p>
        <p>Θ(vj )−p</p>
        <p>P[vi→vk]∈out_edges(vi) Θ(vk)−p ,
such that, TD(j, i) denotes the degree de-coupled transition
probability from node vi to node vj over an edge eij = [vi → vj ],
p ∈ R is a degree de-coupling weight, and
Θ(v) =</p>
        <p>X
[v→vh]∈out_edges(v)
w(v → vh).</p>
        <p>Note that, above, β controls whether accounting for the
connection strength or degree de-coupling is more critical in a given
application. In Section 4, we will study the impact of degree de-coupling
in weighted graphs for different scenarios.
4.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>CASE STUDIES</title>
      <p>In this section, we present case studies assessing the
effectiveness of the degree de-coupling process and the relationship between
the degree de-coupling weight p and recommendation accuracy for
different data graphs.</p>
    </sec>
    <sec id="sec-6">
      <title>4.1 Setup</title>
      <p>For all experiments, the degree de-coupling weight, p, is varied
between -4 and 4 with increments of 0.5. The residual probability,
α, is varied between 0.5 and 0.9, with default value chosen as 0.85.
We also varied the β parameter, which controls whether accounting
for the connection strength or degree de-coupling is more critical
in a given application, between 0.0 and 1.0, with the default value
set to 0 (indicating full decoupling).
4.1.1</p>
      <sec id="sec-6-1">
        <title>Datasets</title>
        <p>
          Four real data sets are used for the experiments. Each data set
is used to create two distinct data graphs and corresponding ratings
data. Table 3 provides further details about the various graphs
created using these four data sets. These recommendation tasks based
on these data graphs are detailed below:
• For the IMDB [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] data set, we created (a) a movie-movie graph,
where movie nodes are connected by an edge if they share common
contributors, such as actors, directors, writers, composers, editors,
cosmetic designers, and producers and (b) an actor-actor graph
based on whether two actors played in the same movie.
Applications: For this data set, we consider applications where movies
are rated by the users: thus, we merged the IMDB data with the
MovieLens 10M [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] data (based on movie names) to identify user
ratings (between 1 and 5) for the movies in the graph. We
consider the (a) average user rating as the significance of the movies
in the movie-movie graph and (b) average user rating of the movies
played in as the significance of the actors in the actor-actor graph.
• For the DBLP [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] data set, we constructed (a) an article-article
graph where scientific articles were connected to each other if they
shared a co-author and (b) an author-author graph based on
coauthorship. Applications: (a) In the article-article graph, the
number of citations to an article is used to indicate its significance.
Similarly, (b) in the author-author graph, average number of citations
to an author’s papers is used as his/her significance.
• For the Last.fm [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], we constructed (a) a listener-listener graph,
where the nodes are Last.FM listeners and undirected edges reflect
friendship information among these listeners. We also constructed
(b) an artist-artist graph based on shared listeners. Applications:
(a) In the listener-listener graph, we considered the total listening
activity of a given listener as his/her significance. (b) In the
artistartist graph, the number of times an artist has been listened is
considered as his/her significance.
• For the Epinions [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]: We constructed (a) a
commentercommenter graph based on the products on which two individuals
both commented and (b) a product-product graph based on shared
commenters. Applications: (a) For the nodes on the
commentercommenter graph, the number of trusts the commenter received
from others is used as his/her commenter significance. (b) For
each product in the product-product graph, its average rating by
the commenters is used as its node significance.
4.2
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Measures</title>
      <p>In this section, our goal is to observe the impact of
different D2PR degree de-coupling weights on the relationship between
D2PR rankings and application specific significance measures for
the above data sets3. We also aim to verify whether de-coupling
weights can also be used to improve recommendation accuracies.</p>
      <p>In order to measure the relationship between the degree
decoupled PageRank (D2PR) scores and the application-specific node
significance, we used Spearman’s rank correlation,</p>
      <p>Pi(xi − x¯)(yi − y¯)
pPi(xi − x¯)2 Pi(yi − y¯)2
,
which measures the agreement between the D2PR ranks of the
nodes in the graph and their application-specific significances.
Here, x are rankings by D2PR and y are significances for an
application and x¯ and y¯ are averages of two values.
4.3</p>
      <p>Impact of De-Coupling in Different
Applications (Unweighted Graphs)</p>
      <p>In this subsection, we present results that aim to assess D2PR
under the settings described above. For these experiments, the
residual probability, α, and the parameter, β, are set to the default
values, 0.85 and 0, respectively. In these experiments, we consider
only unweighted graphs (we will study the weighted graphs and
the impact of parameter β later in Section 4.5).</p>
      <p>Figures 2 through 4 include charts showing the Spearman’s
correlations between the D2PR ranks and application specific node
significances for different values of p and for different data graphs.
These figures clearly illustrate that different data graphs require
different degrees of de-coupling4 to best match the application specific
node significance criterion.
4.3.1</p>
      <p>Application Group A: When Degree
Penalization Helps</p>
      <p>The actor-actor (based on common movies) and
commentercommenter (based on common products) graphs have highest
correlation at p = 0.5, with the correlations dropping significantly
3In this paper, we are not proposing a new PageRank computation
mechanism. Because of this (and since the focus is not improving
scalability of PR), we do not report execution times and compare
our results with other PageRank computation mechanisms.
4Degree penalization or degree-based boosting
when the degrees are over-penalized (i.e., when p ≫ 0.5). The
Epinions product-product graph (based on common commenters,
Figure 2(c)) also provides the highest correlations with p &gt; 0, but
behaves somewhat differently from the other two cases: the
correlations stabilize and do not deteriorate significantly when degrees
are over-penalized, indicating that the need for degree penalization
is especially critical in this case: this is due to the fact that, the
larger the number of comments a product has, the more likely it
is that the comments are negative (Figure 5). In fact, we see that,
among the three graphs, this is the only graph where the traditional
PageRank (with p = 0) leads to negative correlations between
node ranks and node significances.</p>
      <p>These results indicate that actors who have had many co-actors,
commenters who commented on products also commented by
many others, or products which received comments from
individuals who also commented on many other products are not good
candidates for transition during random walk. This aligns with
our expectation that, in applications where each new movie role or
comment requires additional effort, high degree may indicate lower
per-movie or per-comment effort and, hence, lower significance.</p>
      <sec id="sec-7-1">
        <title>When Conventional 4.3.2</title>
      </sec>
      <sec id="sec-7-2">
        <title>Application Group B: PageRank is Ideal</title>
        <p>Figure 3 shows that, for movie-movie (based on common actors)
and author-author (based on common articles) graphs, the peak
correlation is at p = 0 indicating that the conventional PageRank
which gives positive weight to node degree, is appropriate.</p>
        <p>This perhaps indicates that movies with a lot of actors tend to
be big-budget products and that authors with a large number of
coauthors tend to be experts with whom others want to collaborate.
Note that, in these applications, additional boosting, with p &lt; 0,
negatively affects the correlation, indicating that the relationship
between node degree and significance is not very strong (Figure 5).
The quick change when p &lt; 0 is because, as we see in Table 3,
median standard deviations of neighbors’ degrees are low; i.e.,
degrees of neighbors of a node are comparable: there is no dominant
contributor to TD(j, i) in Equation 1 (Section 3) and, thus, the
transition probabilities are sensitive to changes in p, when p &lt; 0.
4.3.3</p>
        <p>Application Group C: When Degree Boosting
Helps</p>
        <p>Figure 4 shows that there are scenarios where additional
boosting based node degrees provides some benefits. The article-article
(based on common authors), listener-listener (based on common
artists), and artist-artist (based on common listeners) graphs reach
their peaks around p ∼ −1, indicating that these also benefit from
large node degrees though improvements over p = 0 are slight.</p>
        <p>A significant difference between applications in Group B and
Group C is that, for p &lt; 0, the correlation curve is more or less
stable. This is because, as we see in Table 3, in these graphs median
standard deviations of neighbors’ degrees are high: in other words,
for each node, there is a dominant neighbor with a high degree and
this neighbor has the highest contribution to TD(j, i); thus, the
rankings are not very sensitive to p, when p &lt; 0.
**(-%%%)!$$$##+'"',,.&amp;!&amp;% !'#(% !'% !)#(% !)% !*#(% /!#*0%%##!"*/!#"""(####%"""12&amp;"&amp;,%%%3""%-4'0"*#5(%#406*7%*8"9**#(% )% )#(% '% '#(% &amp;%
4.3.4 Summary: Correlations between Node
Degrees and Application Specific Significances
The experiments reported above show that degree de-coupling
is important as different applications, even on the same data set,
may associate different semantics to node degrees and the
conventional PageRank scores are too tightly coupled with node degrees
to be effective in all scenarios. Figure 5, which plots correlations
between node degrees and application specific significances for
different data graphs, re-confirms that the ideal value of the p is related
to the usefulness of the node degree in capturing the application
specific definition of node significance.
4.4 Relationship between α and p</p>
        <p>In Figures 6 through 8, we investigate the relationship between
the value α and the degree de-coupling parameter p for different
application types. Here we use the default value, 0, for the
parameter β and present the results for unweighted graphs (the results for
the weighted graphs are similar).</p>
        <p>First thing to notice in these figures is that the grouping of the
applications (into those where, respectively, p &gt; 0, p = 0, or p &lt; 0
is useful) is preserved when different values of α are considered.</p>
        <p>Figure 6 studies the impact of the value of α in application group
A, where degree penalization helps (p &gt; 0). As we see here,
for the IMDB actor-actor (Figure 6(a)) and Epinions
commentercommenter (Figure 6(b)) graphs, having a lower value of α (i.e.,
lower probability of forward movement during the random walk)
provides the highest possible correlations between D2PR ranks and
node significance (with the optimal value of p being ∼ 0.5
independent of the value of α). This indicates that in these graphs,
it is not necessary to traverse far during the random walk.
Interestingly, though, when degrees are over-penalized (i.e., p ≫ 0),
smaller values of α start leading to worse correlations, indicating
that (while not being optimal) severe penalization of node degrees
helps make random traversals more useful than random jumps. As
we have already observed in Figure 2(c), the Epinions
productproduct graph (Figure 6(c)) behaves somewhat differently from the
other two cases where degree penalization (p &gt; 0) leads to larger
correlations: in this case, unlike the other two graphs, the highest
possible correlations between D2PR ranks and node significance
are obtained for large values of α, indicating that this application
benefits from longer random walks (though the differences among
the correlations for different α values are very small).</p>
        <p>Figure 7 shows that the pattern is different for application group
B, where conventional PageRank is ideal (p = 0): in this case,
having a larger value of α (i.e., larger probability of forward movement
during the random walk) provides the highest correlations between
ranks and significance. Interestingly, in these applications, when
p ≪ 0 or p ≫ 0, higher probabilities of random walk traversal
(i.e., larger α) stop being beneficial and lower values of α lead to
larger correlations. This re-confirms that, for these applications,
p ∼ 0 leverages the random walk traversal the best.</p>
        <p>As we see in Figure 8, in application group C, where degree
boosting helps (p &lt; 0), it is also the case that larger values of α
(i.e., larger probabilities of forward transitions during the random
walk) provides the highest correlations between node ranks and
significance. On the other hand, in these applications, p ∼ 0.5 serves
as a balance point where the value of α stops being relevant; in
fact, for p &gt; 0.5 the higher values of α stops being beneficial and
lower values of α lead to larger correlations. This re-confirms that
smaller values of p (which provides degree boosting) help leverage
the random walk traversal the best.
4.5 Relationship between β and p in</p>
        <p>Weighted Graphs</p>
        <p>Finally, in Figures 9 through 11, we investigate the
relationship between the value β (which controls whether accounting for
the connection strength or degree de-coupling is more critical in a
given application) and the degree de-coupling parameter p for
different application types. Here we use the default value, 0.85, for
the parameter α and present the results for weighted graphs:</p>
        <p>Figure 9 depicts the impact of the value of the parameter β in
application group A, where degree penalization helps (p &gt; 0). As we
see here, for all three weighted graphs, performing degree
penalization (i.e., β &lt; 1.0) provides better rank-significance correlation
than relying solely on the connection strength (i.e., β = 1.0). Note
that the value of β impacts the optimal value of degree penalization
parameter p: the more weight is given to connection strength (i.e.,
the greater β is), the larger is the optimal value of p.</p>
        <p>Figure 10 shows that, for applications in group B, where p ∼ 0
is ideal, when the connection strength is given significantly more
weight than degree de-coupling (i.e., β ∼ 0), we observe high
ranksignificance correlations. Interestingly however, for the
moviemovie graph (where the edge weights denote common actors) the
highest correlations are obtained not with p = 0, but with p = 0.5
and β = 0.75, indicating that degree penalization is actually
beneficial in this case: movies that share large numbers of actors with
other movies are likely to be B-movies, which are not good
candidates for transitions during the random walk.</p>
        <p>Figure 11 shows that in application group C, where degree
boosting (p &lt; 0) helps, giving more weight to connection strength (i.e.,
β ∼ 1.0) is a good, but not necessarily the best strategy. In fact,
in these graphs, the highest overall correlations are obtained with
β = 0 or β = 0.25, indicating that degree de-coupling is
beneficial also in these cases. Interestingly, (unlike the case with the
unweighted listener-listener graph, where the best correlation was
obtained when p &lt; 0) for the weighted version of the
listenerlistener graph (where edge weights denote the number of shared
friends), when β = 0 through 0.5, p = 0 provides the highest
correlation and when β = 0.75, p = 0.5 provides the highest
correlation – these indicate that listeners who have large numbers of
shared friends with others are good candidates for random walk.</p>
        <p>Note that a key observation from the above results is that the
conventional PageRank, based on connection strength (i.e., β =
1.0), is not always the best strategy for the applications considered.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSIONS</title>
      <p>In this paper, we noted that in many applications the
relation**(-%%)$$#+'',,. """###""$&amp;""%%%
%&amp;!'% !(#&amp;% !(% !)#&amp;% !)% !$#&amp;% !$% !"#&amp;% "% "#&amp;% $% $#&amp;% )% )#&amp;% (% (#&amp;% '%
!$#" !"#"&amp;%
/#0%##*/#12,3"-4'0*5#4067*8"9*</p>
      <p>!"#$"%
BJK* BJKL&lt;M* BJKLM* BJKLNM* BJO*
**(-%%%)!$$$##+''",,.&amp;!'% !(#&amp;% !(% !)#&amp;% !)% !$#/&amp;%#0%#!$#%*/#1!2"!,#""""&amp;3####%""$""&amp;""&amp;-%%%%4'"0%*5#"40#&amp;6%7*8"9$*% $#&amp;% )% )#&amp;% (% (#&amp;% '%
+,%%#-$.,'*,:*;&lt;=&gt;*&gt;$'?)*$'/*@,/#*!40'4A2$'2#*,'*/4B#%#'7*C*D$-3#)*
8E$)7F:&amp;G*-4)7#'#%1-4)7#'#%*5#4067#/*0%$"69*H#/0#15#4067I*J*,:*)6$%#/*:%4#'/)K*</p>
      <p>"#'%
*
*(-%%%)!$$$##+''",,.&amp;!(% !'#)% !'% !$#)% !$% !&amp;#)% /#!0&amp;%%##*/!"##)1!""""2%,####&amp;&amp;$"3%%%%""-%4'0*5"##)%4067*&amp;8%"9* &amp;#)% $% $#)% '% '#)% (%</p>
      <p>!"#$%
CLM* CLMF&lt;N*</p>
      <p>CLMFN*</p>
      <p>CLMFON* CLP*
ship between the significance of the node and its degree in the
underlying network may not be as strong (or as weak) as implied
by PageRank-based measures. We proposed degree de-coupled
PageRank (D2PR) to improve the effectiveness of PageRank based
knowledge discovery and recommendation tasks. Evaluations on
different data graphs and recommendation tasks have confirmed
that degree de-coupling would be an effective way to match
application specific node significances and improve recommendation
accuracies using PageRank based approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Balmin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hristidis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          . ObjectRank:
          <article-title>Authority-based keyword search in databases</article-title>
          .
          <source>In VLDB'04.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Banky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gromusz</surname>
          </string-name>
          .
          <article-title>Equal Opportunity for Low-Degree Network Nodes: A PageRank-Based Method for Protein Target Identification in Metabolic Graphs</article-title>
          .
          <source>PLoS One</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ):
          <fpage>e542</fpage>
          -
          <lpage>4</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Becchetti</surname>
          </string-name>
          et al.
          <article-title>Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection</article-title>
          .
          <source>WebKDD</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boldi</surname>
          </string-name>
          et al.
          <article-title>HyperANF: Approximating the neighbourhood function of very large graphs on a budget</article-title>
          .
          <source>WWW</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.G.</given-names>
            <surname>Borgatti</surname>
          </string-name>
          , et al.
          <article-title>Network measures of social capital</article-title>
          .
          <source>Connections</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ):
          <fpage>27</fpage>
          -
          <lpage>36</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          and
          <string-name>
            <surname>L. Page.</surname>
          </string-name>
          <article-title>The anatomy of a large-scale hypertextual web search engine</article-title>
          .
          <source>Computer Networks and ISDN Systems</source>
          ,
          <volume>30</volume>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.S.</given-names>
            <surname>Candan</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.D.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Using random walks for mining web document associations</article-title>
          .
          <source>PAKDD'00</source>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>305</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen.</surname>
          </string-name>
          , et al.
          <article-title>Clustering via random walk hitting time on directed graphs</article-title>
          .
          <source>AAAI'08.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          .
          <article-title>Dynamic personalized pagerank in entity-relation graphs</article-title>
          .
          <source>In WWW'07</source>
          , pp
          <fpage>571</fpage>
          -
          <lpage>580</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          , J. Liu, and
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>Clustering via random walk hitting time on directed graphs</article-title>
          .
          <source>AAAI'08</source>
          , pp.
          <fpage>616</fpage>
          -
          <lpage>621</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cooper</surname>
          </string-name>
          , et al.
          <article-title>A fast algorithm to find all high degree vertices in graphs with a power law degree sequence</article-title>
          .
          <source>In WAW'12</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>O. Fercoq.</surname>
          </string-name>
          <article-title>PageRank optimization applied to spam detection</article-title>
          .
          <source>arXiv:1203.1457</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Haveliwala</surname>
          </string-name>
          .
          <article-title>Topic-sensitive PageRank</article-title>
          . WWW,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.S</given-names>
            <surname>Candan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.L</given-names>
            <surname>Sapino</surname>
          </string-name>
          .
          <article-title>”Can you really trust that seed?”: Reducing the Impact of Seed Noise in Personalized PageRank</article-title>
          .
          <source>ASONAM'14</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] IMDB website: http://www.imdb.com/</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jeh</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Widom</surname>
          </string-name>
          .
          <article-title>Scaling personalized web search</article-title>
          .
          <source>Stanford Univ. Tech. Report</source>
          .
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>J.H Kim</surname>
            ,
            <given-names>K.S</given-names>
          </string-name>
          <string-name>
            <surname>Candan</surname>
            ,
            <given-names>M.L</given-names>
          </string-name>
          <string-name>
            <surname>Sapino</surname>
          </string-name>
          .
          <article-title>Locality-sensitive and Re-use Promoting Personalized PageRank Computations</article-title>
          .
          <source>Knowledge and Information Systems. 10.1007/s10115-015-0843-6</source>
          ,
          <fpage>2015</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] http://ir.ii.uam.es/hetrec2011/datasets.html.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.N.</given-names>
            <surname>Nikolakopoulos</surname>
          </string-name>
          and
          <string-name>
            <surname>J. Garofalakis,</surname>
          </string-name>
          <article-title>NCDawareRank: A Novel Ranking Method that Exploits the Decomposable Structure of the Web</article-title>
          .
          <source>WSDM</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Viennot</surname>
          </string-name>
          ,
          <article-title>Local aspects of the global ranking of web pages</article-title>
          .
          <source>In I2CS'06</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Church</surname>
          </string-name>
          .
          <article-title>Query suggestion using hitting time</article-title>
          .
          <source>CIKM'08</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] http://grouplens.org/datasets/movielens</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Olsen</surname>
          </string-name>
          .
          <article-title>Maximizing PageRank with New Backlinks</article-title>
          . CIAC,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gibbons</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          .
          <article-title>Anf: a fast and scalable tool for data mining in massive graphs</article-title>
          .
          <source>KDD</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          . mTrust:
          <article-title>Discerning multi-faceted trust in a connected world</article-title>
          .
          <source>WSDM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , et al.
          <article-title>ArnetMiner: Extraction and Mining of Academic Social Networks</article-title>
          .
          <source>In SIGKDD'08</source>
          , pp
          <fpage>990</fpage>
          -
          <lpage>998</lpage>
          ,
          <year>2008</year>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D.R.</given-names>
            <surname>White</surname>
          </string-name>
          , et al.
          <article-title>Betweenness centrality measures for directed graphs</article-title>
          .
          <source>Social Networks</source>
          ,
          <volume>16</volume>
          ,
          <fpage>335</fpage>
          -
          <lpage>346</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>