=Paper=
{{Paper
|id=Vol-190/paper-3
|storemode=property
|title=Propagating Trust and Distrust to Demote Web Spam
|pdfUrl=https://ceur-ws.org/Vol-190/paper03.pdf
|volume=Vol-190
|authors=Baoning Wu,Vinay Goel and Brian Davison
|dblpUrl=https://dblp.org/rec/conf/mtw/WuGD06
}}
==Propagating Trust and Distrust to Demote Web Spam==
Baoning Wu Vinay Goel Brian D. Davison
Department of Computer Science & Engineering
Lehigh University
Bethlehem, PA 18015 USA
{baw4,vig204,davison}@cse.lehigh.edu
ABSTRACT

Web spamming describes behavior that attempts to deceive search engines' ranking algorithms. TrustRank is a recent algorithm that can combat web spam by propagating trust among web pages. However, TrustRank propagates trust among web pages based on the number of outgoing links, which is also how PageRank propagates authority scores among Web pages. This type of propagation may be suited for propagating authority, but it is not optimal for calculating trust scores for demoting spam sites.

In this paper, we propose several alternative methods to propagate trust on the web. With experiments on a real web data set, we show that these methods can greatly decrease the number of web spam sites within the top portion of the trust ranking. In addition, we investigate the possibility of propagating distrust among web pages. Experiments show that combining trust and distrust values can demote more spam sites than the sole use of trust values.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms, Performance

Keywords

Web spam, Trust, Distrust, PageRank, TrustRank

1. INTRODUCTION

In today's Web, a link between two pages can be considered to be an implicit conveyance of trust from the source page to the target page. In this case, trust implies that the author of the source page believes that the target page provides some content value.

With the increasing commercial interest in being ranked high in search engine results, content providers resort to techniques that manipulate these results. This behavior is usually termed Web spam, or search engine spam. Many kinds of spam have been discovered [24, 12, 5]. Henzinger et al. [15] mention that Web spam is one of the major challenges faced by search engines. There is no universal method that can detect all kinds of spam at the same time.

Trust can be used to combat Web spam. Gyöngyi et al. [13] present the TrustRank algorithm based on this idea. This technique assumes that a link between two pages on the Web signifies trust between them; i.e., a link from page A to page B is a conveyance of trust from page A to page B. In this technique, human experts initially select a list of seed sites that are well-known and trustworthy on the Web. Each of these seed sites is assigned an initial trust score. A biased PageRank [23] algorithm is then used to propagate these trust scores to the descendants of these sites. The authors observed that on applying this technique, good sites had relatively high trust scores, while spam sites had low trust scores.

TrustRank shows that the idea of propagating trust from a set of highly trusted seed sites helps a great deal in the demotion of Web spam. But TrustRank is just one implementation of this idea. This approach makes certain assumptions with regard to how trust is propagated from a parent page to a child page. For example, the authors claim that the probability of a page pointing to a spam page increases with the number of links the pointing page has. Because of this, they proposed the idea that the trust score of a parent page be equally split amongst its children pages.

This assumption is open to argument. Why should two equally trusted pages propagate different trust scores to their children just because one made more recommendations than the other? Also, with respect to the accumulation of trust scores from multiple parents, TrustRank puts forth just one solution, that of simple summation. Clearly, there are other alternatives.

A natural extension of the idea of the conveyance of trust between links is that of the conveyance of distrust. Here, distrust has a different meaning from that in the context of social networks. In social networks, distrust between two nodes A and B usually means that A shows distrust explicitly to B. In contrast, in our system, distrust is a penalty assigned to the source page for linking to an untrustworthy page. Hence, this distrust is an indication that we don't trust some web pages, not an indication that one page doesn't trust another page on the web. Actually, the trust score of a page can also be interpreted as how much we trust this page.

In general, spam pages can be considered to be one type of untrustworthy page. To elaborate on this idea, consider that a page links to another page and hence, according to the above definition of trust, this page expresses trust towards the target page. But if this target page is known to be a spam page, then clearly the trust judgment of the source page is not valid. The source page needs to be penalized for trusting an untrustworthy page. It is likely that the source page itself is a spam page, or is a page that we believe should not be ranked highly for its negligence in linking to an untrustworthy page.
In this paper, we explore the different issues present in the problem of propagating trust on the Web. We also study the application of propagating distrust on the Web. Additionally, we present techniques to combine trust and distrust scores to improve the overall performance in demoting Web spam.

The rest of this paper is organized as follows: the background and related work are introduced in Sections 2 and 3 respectively. The motivation of this work is presented in Section 4. The details of our technique are given in Section 5. The data set is described in Section 6, and the experiments and results are shown in Section 7. We finish with discussion and conclusion in Sections 8 and 9.

2. BACKGROUND

2.1 Matrix Definition

The web can be represented by a directed graph, with web pages as the nodes and hyperlinks among web pages as the directed links among the nodes. The adjacency matrix M of the web graph is defined by: M[i, j] equals 1 if there is a hyperlink from page i to page j, or 0 otherwise. If we use I(i) to represent the in-degree of node i and O(i) the out-degree of node i, the definition of the transition matrix T is:

    T[i, j] = M[j, i] / O(j)    (1)

and the definition of the reverse transition matrix R is:

    R[i, j] = M[i, j] / I(j)    (2)
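For concreteness, the following is a minimal sketch (ours, not from the paper) of how M, T and R could be built; the link list and graph size are purely illustrative, not part of the paper's data set.

```python
# Illustrative only: builds the matrices of Section 2.1 for a hypothetical 3-page graph.
links = [(0, 1), (0, 2), (1, 2), (2, 0)]   # (source page i, target page j)
n = 3

M = [[0.0] * n for _ in range(n)]
for i, j in links:
    M[i][j] = 1.0                           # M[i, j] = 1 iff page i links to page j

O = [sum(row) for row in M]                                   # out-degree O(i)
I = [sum(M[i][j] for i in range(n)) for j in range(n)]        # in-degree I(j)

# Transition matrix (Equation 1): T[i, j] = M[j, i] / O(j)
T = [[M[j][i] / O[j] if O[j] else 0.0 for j in range(n)] for i in range(n)]

# Reverse transition matrix (Equation 2): R[i, j] = M[i, j] / I(j)
R = [[M[i][j] / I[j] if I[j] else 0.0 for j in range(n)] for i in range(n)]
```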
2.2 TrustRank and BadRank

Gyöngyi et al. [13] introduce TrustRank. It is based on the idea that good sites seldom point to spam sites, and that people trust these good sites. This trust can be propagated through the link structure on the Web. So, a list of highly trustworthy sites is selected to form the seed set and each of these sites is assigned a non-zero initial trust score, while all the other sites on the Web have initial values of 0. Then a biased PageRank algorithm is used to propagate these initial trust scores to their outgoing sites. After convergence, good sites will get a decent trust score, while spam sites are likely to get lower trust scores. The formula of TrustRank is:

    t = (1 − α) × T × t + α × s    (3)

where t is the TrustRank score vector, α is the jump probability, T is the transition matrix and s is the normalized trust score vector for the seed set. Before calculation, t is initialized with the value of s. Gyöngyi et al. iterated the above equation 20 times with α set to 0.15.

In many SEO discussion boards, participants discuss the latest ranking and spam-finding techniques employed by commercial search engines. One approach, called BadRank¹, is believed by some to be used by a commercial engine to combat link farms.² BadRank is based on propagating negative value among pages. The idea of BadRank is that a page will get a high BadRank value if it points to some pages with high BadRank values. This idea is similar in spirit to our mechanism of propagating distrust in this paper.

¹ One description of BadRank can be found at [1].
² See, for example, http://www.webmasterworld.com/forum3/20281-22-15.htm.
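As a rough illustration of Equation 3 (our own sketch, not the authors' implementation), the TrustRank iteration can be written as below; T and n carry over from the sketch in Section 2.1, and the seed set chosen here is hypothetical.

```python
# Sketch of the TrustRank iteration t = (1 - alpha) * T * t + alpha * s (Equation 3).
alpha = 0.15                                   # jump probability reported by Gyongyi et al.
seeds = {0}                                    # hypothetical trusted seed set
s = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]   # normalized seed vector
t = s[:]                                       # t is initialized with s

for _ in range(20):                            # TrustRank iterates 20 times
    t = [(1 - alpha) * sum(T[i][j] * t[j] for j in range(n)) + alpha * s[i]
         for i in range(n)]
```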
3. RELATED WORK

While the idea of a focused or custom PageRank vector has existed from the beginning [23], Haveliwala [14] was the first to propose the idea of bringing topical information into the PageRank calculation. In his technique, pages listed in DMOZ [22] are used as the seed set to calculate the biased PageRank values for each of the top categories. Then a similarity value of a query to each of these categories is calculated. A unified score is then calculated for each page containing the given query term(s). Finally, pages are ranked by this unified score. Experiments show that Topic-sensitive PageRank performs better than PageRank in generating response lists to a given query.

Jeh and Widom [17] specialize the global notion of importance that PageRank provides to create personalized views of importance by introducing the idea of preference sets. The rankings of results can then be biased according to this personalized notion. For this, they used the biased PageRank formula.

Several researchers have done work to combat different kinds of Web spam. Fetterly et al. propose using statistical analysis to detect spam [7]. Acharya et al. [2] first publicly proposed using historical data to identify link spam pages. Wu and Davison [26] proposed using the intersection of the incoming and outgoing link sets plus a propagation step to detect link farms. Mishne et al. [20] used a language model to detect comment spam. Drost and Scheffer [6] proposed using a machine learning method to detect link spam. Recently, Fetterly et al. [8] describe methods to detect a special kind of spam that provides pages by stitching together sentences from a repository.

Benczur et al. proposed SpamRank in [4]. For each page, they check the PageRank distribution of all its incoming links. If the distribution doesn't follow a normal pattern, the page will be penalized and used as a seed page. They also adopt the idea that spam values are propagated backward, so that spam pages will eventually have high SpamRank values. Compared to SpamRank, we use labeled spam pages as our seed set.

In prior work, we [27] pointed out that TrustRank has a bias towards better represented communities in the seed set. In order to neutralize this bias, we proposed "Topical TrustRank", which uses topics to partition the seed set and different mechanisms to combine trust scores from each partition. We showed that this algorithm can perform better than TrustRank in reducing the number of highly ranked spam sites. Compared with that paper, we do not consider partitions of the seed set here. Instead, we show that different mechanisms for propagating trust can also help to demote more top ranked spam sites. The methods proposed in this paper can generate better performance than Topical TrustRank.

Guha et al. [11] study how to propagate trust scores among a connected network of people. Different propagation schemes for both trust scores and distrust scores are studied based on a network from a real social community website. Compared with their ideas, our definition of distrust is not exactly the same. Their goal is to predict whether two people will show trust (or distrust) to each other, but our goal is to use trust and distrust to demote Web spam, especially top ranked spam pages or sites.

Massa and Hayes [19] review several current proposals for extending the link mechanism to incorporate extra semantic information, primarily those that allow the authors of a web page to describe their opinion on pages they link to. They argue that any change to the hyperlink facility must be easily understood by the ordinary users of the Web, but that a more expressive linking structure would produce a richer semantic network from which more precise information can be mined. They used a real world data set from Epinions.com as a proxy for the Web, with the analogy that web pages are Epinions users and links are trust and distrust statements. They show that this additional link information would allow the PageRank algorithm to identify highly trusted web sites.

Ziegler and Lausen [28] introduce the Appleseed algorithm, a proposal for local group trust computation. The basic intuition of the approach is motivated by spreading activation strategies. The idea of spreading activation is the propagation of energy in a network. Also, the edges between the nodes are weighted based on the type of the edges. This idea of energy flow is tailored for trust propagation. In contrast, our algorithm doesn't consider a weighted graph.

Gray et al. [9] proposed a trust-based security framework for ad hoc networks. The trust value between two nodes connected by a path is the average of the weighted sum of trust values of all nodes in the path. No experimental results are shown.
4. MOTIVATION

The original TrustRank paper proposed that trust should be reduced as we move further and further away from the seed set of trusted pages. To achieve this attenuation of trust, the authors propose two techniques, trust dampening and trust splitting. With trust dampening, a page gets the trust score of its parent page dampened by a factor less than 1. With trust splitting, a parent's trust score is equally divided amongst its children. A child's overall trust score is given by the sum of the shares of the trust scores obtained from its parents.

In the case of trust splitting, we raise a question: Given two equally trusted friends, why should the recommendations made by one friend be weighted less than the other, simply because the first made more recommendations? A similar argument has been made by Guha [10].

It is observed that a spam page often points to other spam pages for the purpose of boosting their PageRank value and manipulating search engine results [26]. Motivated by the idea of trust propagation, we believe that propagating distrust from a labeled spam seed set will help to penalize other spam pages.

Hence, given a labeled spam seed set, we can propagate distrust from this set to the pages that point to members of this set. The idea is that a page pointing to a spam page is likely to be spam itself. But sometimes, good pages may unintentionally point to spam pages. In this case, these pages are penalized for not being careful with regard to creating or maintaining links (as suggested by [3]).

In doing so, each page on the Web is assigned two scores, a trust score and a distrust score. In the combined model, a link on the Web can then propagate these two scores. As shown in Figure 1, suppose there is a link from Page A to Page B; then trust is propagated from Page A to Page B, while distrust is propagated from Page B to Page A.

Figure 1: A link on the Web can propagate both trust and distrust.

We explore different techniques for the handling of propagation of trust and distrust from the respective seed sets to other pages on the Web.

5. ALGORITHM DETAILS

In this section, we present details of our ideas on propagating trust and distrust among web pages.

5.1 Propagating Trust

TrustRank propagates trust among web pages in the same manner as the PageRank algorithm propagates authority among web pages. The basic idea is that during each iteration, a parent's trust score is divided by the number of its outgoing links and each of its children gets an equal share. Then a child's overall trust score is the sum of the shares from all its parents.

Two key steps in the technique described above may be explored. One is, for each parent, how to divide its score amongst its children; we name this the "splitting" step. The other is, for each child, how to calculate the overall score given the shares from all its parents; we name this the "accumulation" step.

For the splitting step, we study three choices (a short illustrative sketch follows below):

• Equal Splitting: a node i with O(i) outgoing links and trust score TR(i) will give d × TR(i) / O(i) to each child; d is a constant with 0 < d < 1.

• Constant Splitting: a node i with trust score TR(i) will give d × TR(i) to each child.

• Logarithm Splitting: a node i with O(i) outgoing links and trust score TR(i) will give d × TR(i) / log(1 + O(i)) to each child.

We term d the decay factor, which determines how much of the parent's score is propagated to its children. In fact, if d equals 1, then the above "Equal Splitting" is the same as the method used in TrustRank. As discussed in Section 4, why should equally trusted pages propagate different trust scores just because they have different numbers of children? With "Constant Splitting", each parent will give a constant portion of its trust value to all of its children irrespective of the number of its children. Thus for a child, if two of its parents have identical trust values but different numbers of children, then the child will get the same value from both of these parents. The third choice, "Logarithm Splitting", does not eliminate the effect of the number of children that a page has, but can decrease it.
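To make the three splitting choices concrete, here is a small sketch (an illustration under our own naming, not code from the paper); TR is a hypothetical list of current trust scores and out_deg holds the out-degrees O(i).

```python
import math

# Trust share that node i gives to EACH of its children under the three
# splitting choices of Section 5.1; d is the decay factor (0 < d < 1).
def equal_split(TR, out_deg, i, d):
    return d * TR[i] / out_deg[i]                  # d * TR(i) / O(i)

def constant_split(TR, out_deg, i, d):
    return d * TR[i]                               # d * TR(i), independent of O(i)

def logarithm_split(TR, out_deg, i, d):
    return d * TR[i] / math.log(1 + out_deg[i])    # d * TR(i) / log(1 + O(i))
```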
Since "Equal Splitting" is the choice already being employed in TrustRank, we will focus on "Constant Splitting" and "Logarithm Splitting" in our experiments.

For the accumulation step, we study three choices:

• Simple Summation: Sum the trust values from each parent.

• Maximum Share: Use the maximum of the trust values sent by the parents.

• Maximum Parent: Sum the trust values in such a way as to never exceed the trust score of the most-trusted parent.

The first choice is the same as in PageRank and TrustRank: the sum of the trust scores from all parents is used as the child's trust score. For "Maximum Share", the maximum value among the trust values inherited from all the parents is used as the child's trust score. For "Maximum Parent", first the sum of trust values from all parents is calculated and this sum is compared with the largest trust score among the parents; the smaller of these two values is used as the child's trust score.

By using the above choices, the equation for calculating the trust score differs from Equation 3. For example, if using "Constant Splitting" and "Simple Summation", the equation becomes:

    t = (1 − α) × d × M^T × t + α × s    (4)

where t is the trust score vector, α is the jump probability, d is the constant discussed in the above splitting choices, M is the web adjacency matrix shown in Section 2.1 (M^T is its transpose) and s is the normalized trust score vector for the seed set.
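The following sketch shows one possible reading of a full propagation pass with a splitting choice and an accumulation choice plugged in; it is our illustration, not the authors' code. `children[i]` lists the pages that page i links to, `s` is the normalized seed vector, and `split` is a function of the current score vector and a node index (for example, a wrapper around one of the splitting functions sketched above).

```python
def propagate_trust(children, s, split, accumulate, alpha=0.15, iterations=20):
    """Iterate the splitting, accumulation and jump steps of Section 5.1."""
    n = len(s)
    t = s[:]                                       # initialize scores with the seed vector
    for _ in range(iterations):
        shares = [[] for _ in range(n)]            # (share, parent's score) received by each child
        for i in range(n):
            if not children[i]:
                continue                           # nothing to split
            give = split(t, i)                     # share given to EACH child of i
            for c in children[i]:
                shares[c].append((give, t[i]))
        new_t = []
        for i in range(n):
            vals = [v for v, _ in shares[i]]
            parents = [p for _, p in shares[i]]
            if accumulate == "simple_summation":
                acc = sum(vals)
            elif accumulate == "maximum_share":
                acc = max(vals) if vals else 0.0
            else:                                  # "maximum_parent": capped summation
                acc = min(sum(vals), max(parents)) if vals else 0.0
            new_t.append((1 - alpha) * acc + alpha * s[i])
        t = new_t
    return t
```

For instance, Constant Splitting with Maximum Share accumulation might be invoked as `propagate_trust(children, s, split=lambda t, i: 0.3 * t[i], accumulate="maximum_share")`, with d = 0.3 folded into the lambda.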
5.2 Propagating Distrust

The trust score of a page is an indication of how trustworthy the page is on the Web. In the case of web spam, the trust score can be seen as a measure of the likelihood that a page is not a spam page.

Similarly, we introduce the concept of distrust to penalize the pages that point to untrustworthy pages. Now, it is possible that pages unintentionally point to spam pages. In these cases, we argue that the (otherwise good) page should be penalized to some extent for not being careful in its linking behavior.

Distrust propagation makes sense when spam sites are used as the distrusted seed set and distrust is propagated from a child to its parents. So, based on this idea, one link can represent two propagation processes: the trust score is propagated from the parent to the children, while the distrust score is propagated from the children to the parent.

In this technique, some known spam pages are selected as the distrusted seeds and assigned initial distrust scores. During each iteration, the distrust score is propagated from children pages to parent pages. After convergence, a higher distrust score indicates that a page is more likely to be a spam page.

A direct method of calculating the distrust score for each page is to follow the same idea as TrustRank. The calculation can be represented by Equation 5:

    n = (1 − α) × R × n + α × r    (5)

where n is the distrust score vector, α is the jump probability, R is the reverse transition matrix shown in Equation 2 and r is the normalized distrust score vector for the distrusted seed set. Before calculation, n is initialized with the value of r.

However, as discussed in Section 5.1, the propagation mechanism of TrustRank may not be optimal for propagating trust or distrust for the purpose of demoting spam pages. We propose that the same choices used to propagate trust, discussed in Section 5.1, can be taken to propagate distrust. Suppose we use DIS_TR(i) to represent the distrust score for node i. For the splitting step, we have three choices:

• Equal Splitting: a node i with I(i) incoming links and distrust score DIS_TR(i) will give dD × DIS_TR(i) / I(i) to each parent, where 0 < dD < 1;

• Constant Splitting: a node i with distrust score DIS_TR(i) will give dD × DIS_TR(i) to each parent;

• Logarithm Splitting: a node i with I(i) incoming links and distrust score DIS_TR(i) will give dD × DIS_TR(i) / log(1 + I(i)) to each parent.

The "Equal Splitting" choice is quite similar to that in the case of trust propagation in TrustRank. Intuitively, this kind of splitting may raise problems when the purpose of propagating distrust is to demote spam. For a simple example, with "Equal Splitting", a spam site with more parents will propagate smaller distrust to its parents, while a spam site with fewer parents will propagate bigger distrust to its parents. Obviously, this policy favors popular spam sites, and this is clearly not desirable for the purpose of demoting spam. In comparison, "Constant Splitting" and "Logarithm Splitting" present better choices.

For the accumulation step, we also have three choices:

• Simple Summation: Sum the distrust values from each child.

• Maximum Share: Use the maximum of the distrust values sent by the children.

• Maximum Parent: Sum the distrust values in such a way as to never exceed the distrust score of the most-distrusted child.

Different choices will employ different equations during the calculation. For example, if using "Constant Splitting" and "Simple Summation", the equation for calculating the distrust score is:

    n = (1 − α) × dD × M × n + α × r    (6)

where n is the distrust score vector, α is the jump probability, dD is the constant discussed in the above splitting choices, M is the web adjacency matrix shown in Section 2.1 and r is the normalized distrust score vector for the distrusted seed set.
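Under the same assumptions, distrust propagation can be expressed by reusing the sketch above on the reversed graph; `parents[i]` (the pages linking to page i) and the distrust seed vector r are again hypothetical names of our own.

```python
def propagate_distrust(parents, r, split, accumulate, alpha=0.15, iterations=20):
    # Distrust flows against the link direction (Section 5.2): each node's distrust
    # share goes to the pages that point to it, so propagating "forward" over the
    # reversed graph (parents in place of children) matches the spirit of Equations 5 and 6.
    return propagate_trust(parents, r, split, accumulate, alpha, iterations)
```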
5.3 Combining Trust and Distrust

On propagating trust and distrust to the pages on the web, each page will be assigned two scores, a trust score and a distrust score. Then comes the question of combining them to generate a unified ranking of pages that is indicative of their trustworthiness.

Our goal in propagating trust and distrust is to demote spam sites in the ranking. Since the trust score is an indication of how unlikely it is that the page is a spam page, while the distrust score is an indication of how likely it is that the page is a spam page, a direct solution is to simply calculate the difference of these two scores and use this value to represent the overall trustworthiness of the Web page.

Additionally, we may apply several methods for the combination. For example, we may give different weights when combining the two scores. Suppose we use Total(i) to represent the difference of the trust and distrust scores for page i. Then we can apply the following formula:

    Total(i) = η × TR(i) − β × DIS_TR(i)    (7)

where η and β (0 < η < 1, 0 < β < 1) are two coefficients that give different weights to the trust and distrust scores in this formula.
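A one-line sketch of the combination in Equation 7 (the score lists and coefficients are placeholders):

```python
def combine(trust, distrust, eta, beta):
    # Total(i) = eta * TR(i) - beta * DIS_TR(i)   (Equation 7)
    return [eta * tr - beta * dt for tr, dt in zip(trust, distrust)]
```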
6. DATA SET

The data set used in our experiments is courtesy of the search.ch search engine [25]. It is a 2003 crawl of pages that are mostly from the Switzerland domain. There are about 20M pages within this data set and around 350K sites within the ".ch" domain. Since we were also provided with 3,589 labeled sites and domains applying different spam techniques, we used the site graph for testing the ideas we propose in this paper.

In order to generate a trusted seed set, we extract all the URLs listed within the search.ch topic directory [25] of 20 different topics, which is similar to the DMOZ directory but lists pages primarily within the Switzerland domain. Since we use the site graph in our calculation and the topic directory lists only pages, we used a simple transfer policy: if a site had a page listed in a certain topic directory, we put the site into the trusted seed set. In doing so, we marked 20,005 unique sites to form the seed set.

For the generation of a distrusted seed set, we use the labeled spam list, which contains 3,589 sites or domains. In our experiments, we use only a portion of this list as the distrusted seed set, with the rest being used to evaluate the performance.

7. EXPERIMENTS

We test all the ideas we propose in Section 5 by using the search.ch data set. Since the goal of this paper is to investigate how different mechanisms of propagating trust and distrust can help to demote top ranking spam sites, we focus on the ranking positions of the 3,589 labeled spam sites.

We first calculate the PageRank value for each site based on the search.ch site graph. These sites are then ranked in descending order of their PageRank values. Based on this ranking, we divide these sites among 20 buckets, with each bucket containing sites whose PageRank values sum to 1/20th of the sum of the PageRank values of all sites.

We then calculate the TrustRank score for each site based on the site graph, to generate a ranking of sites sorted in descending order of these scores. As in the case of the TrustRank paper [13], we iterated 20 times during this calculation. We then divide these sites among 20 buckets such that each TrustRank bucket has an identical number of sites to the corresponding PageRank bucket. The distribution of the 3,589 spam sites in the 20 buckets by PageRank and TrustRank is shown in Figure 2. It is clear that TrustRank is good at demoting spam sites compared to PageRank.

Figure 2: Number of spam sites within each bucket by PageRank and TrustRank.

In this paper, we use the number of spam sites within the top 10 buckets as the metric for measuring the performance of algorithms. The choice of the top 10 buckets was arbitrary, as in the case of [27]. The smaller the number of spam sites in the top 10 buckets, the better the performance of the algorithm in demoting spam sites from the top ranking positions.

The results of this metric for the PageRank and TrustRank algorithms are shown in Table 1. These results will be used as the baseline results. We can see that PageRank ranks 90 spam sites within the top ten buckets, while TrustRank ranks only 58 spam sites.

    Algorithm     No. of spam sites in top 10 buckets
    PageRank      90
    TrustRank     58

Table 1: Baseline results for the search.ch data set.
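The bucketed evaluation described above can be sketched as follows (our own reading of the procedure; `scores` maps each site to its PageRank value and `spam` is the set of labeled spam sites, both hypothetical names):

```python
def pagerank_buckets(scores, num_buckets=20):
    """Split sites, ranked by descending PageRank, into buckets of equal PageRank mass."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    target = sum(scores.values()) / num_buckets
    buckets, current, mass = [], [], 0.0
    for site in ranked:
        current.append(site)
        mass += scores[site]
        if mass >= target and len(buckets) < num_buckets - 1:
            buckets.append(current)
            current, mass = [], 0.0
    buckets.append(current)                       # remaining sites form the last bucket
    return buckets

def spam_in_top_buckets(bucket_sizes, ranking, spam, top=10):
    """Count labeled spam sites among the first `top` buckets of a ranking; the bucket
    boundaries reuse the PageRank bucket sizes, as described above."""
    cutoff = sum(bucket_sizes[:top])
    return sum(1 for site in ranking[:cutoff] if site in spam)
```

Here `bucket_sizes` would be `[len(b) for b in pagerank_buckets(pagerank_scores)]`, so that buckets for other rankings contain the same numbers of sites as the corresponding PageRank buckets.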
7.1 Different Jump Probabilities

In TrustRank, the jump probability α in Equation 3 is usually assigned a value of 0.15. We measure the performance of TrustRank with different values of this jump factor.

Since we use all the URLs listed in dir.search.ch as the trusted seed set, it is quite possible that some spam sites get included in this set too. On checking, we find that 35 labeled spam sites are within the trusted seed set. It is worthwhile to drop these spam sites from the seed set. We run TrustRank again with different jump probabilities after dropping these 35 labeled spam sites from the seed set.

The results with both the original seed set and the cleaned seed set are shown in Figure 3. We observe that larger jump probabilities decrease the number of spam sites in top ranking positions. Since a larger jump probability means that smaller trust values are propagated from a parent to its children, the results show that for the purpose of demoting spam sites in TrustRank, a better approach is relatively little trust propagation. We also observe that dropping the spam sites from the seed set results in fewer spam sites within the top ten buckets.

Figure 3: Number of top ranked spam sites with different jump probabilities for TrustRank.
                        Constant Splitting                  Logarithm Splitting
                     d=0.1  d=0.3  d=0.7  d=0.9         d=0.1  d=0.3  d=0.7  d=0.9
    Simple Summation   364    364    364    364           364    364    364    364
    Maximum Share       34     34     34     34            13     12     20     18
    Maximum Parent      27     32     33     33           372     27     29     32

Table 2: Results for the combination of different methods of propagating trust. Experiments are done with different values for d. Only the trust score is used in this table.
7.2 Different Trust Propagation Methods

As introduced in Section 5, we explore two choices in the splitting step, "Constant Splitting" (d × TR(i)) and "Logarithm Splitting" (d × TR(i) / log(1 + O(i))), while we have three choices in the accumulation step: "Simple Summation", "Maximum Share" and "Maximum Parent".

The number of different combinations of the above choices is six. For each combination we try different values of d ranging from 0.1 to 0.9. The results of these six combinations with different values of d are shown in Table 2.

From the results in Table 2, we can tell that "Simple Summation" always generates the worst performance, which is worse than TrustRank and even PageRank. A lot of spam sites are raised in the ranking. Intuitively, this "Simple Summation" will boost the rankings of sites with multiple parents. In general, it is likely that a spam site with a large number of incoming links will be able to accumulate a fairly large value of trust. Hence, spam sites may benefit from this "Simple Summation" method.

We also observe that, in most cases, both the "Maximum Share" and "Maximum Parent" methods generate much better performance than TrustRank and the "Simple Summation" method. With regard to the splitting methods, we observe that in most cases, "Logarithm Splitting" performs better than "Constant Splitting".

The results clearly demonstrate that for the purpose of demoting web spam, propagating trust based on the idea of "Equal Splitting" and "Simple Summation", which is used by TrustRank, is not the optimal solution.

Gyöngyi et al. [13] mentioned that there are different possibilities for splitting trust scores; the reason that they chose the method similar to PageRank is that only minor changes are needed for calculating TrustRank using existing efficient methods for computing PageRank [18]. We argue that if different choices of splitting and accumulating trust can greatly demote spam sites, it is worthwhile to implement these choices. In Table 2, our best result is 12 spam sites in the top ten buckets, which is a much greater improvement when compared to the baseline result of 58 spam sites in Table 1.

It is worth mentioning that by introducing the above ideas of splitting and accumulating trust, we notice, in some cases, long ties in the trust scores. For example, the top several thousand sites may have identical trust scores. This is different from the values produced by PageRank or TrustRank. We think this tie is still reasonable as long as few spam sites are in the ties close to the top. Since there are 3,823 sites in the top ten buckets by PageRank, we consider ties that have rankings around this position to be still within the top ten buckets; thus all the spam sites before or within such a tie are still counted within the top ten buckets.

Actually, we find that in most cases these ties can help to demote more spam sites. But some small d values may cause a long tie with more than 10,000 sites and thus raise the number of spam sites within the top ten buckets. One example is that there are 372 spam sites within the top ten buckets when combining "Maximum Parent" and "Logarithm Splitting" with d set to 0.1.

7.3 Introducing Distrust

Trust can be propagated from the trusted seed set to children pages iteratively. Similarly, distrust can be propagated from a distrusted seed set to parent pages iteratively. While our distrusted seed set was provided to us, in general a search engine will maintain a spamming blacklist, using both manual and automatic methods (perhaps, e.g., [7, 8, 26, 21]).

In order to investigate whether introducing distrust can help to improve the performance in demoting spam sites, we randomly select a portion of the labeled spam sites as the distrusted seed set and calculate distrust values for each site. The ranking positions of the remaining spam sites are used to evaluate the performance.

7.3.1 Basic Propagation of Distrust

As described, there are several different choices for propagating distrust among web pages; we first use the method shown in Equation 5.

We randomly select 200 spam sites from the 3,589 labeled spam sites as the distrusted seed set to calculate distrust scores. Then we calculate the sum of this distrust score and the trust score generated by TrustRank. By using the sum for ranking, we count the number of spam sites (m) in the top ten buckets as in the case of previous experiments.
                        Constant Splitting                        Logarithm Splitting
                     dD=0.1  dD=0.3  dD=0.7  dD=0.9         dD=0.1  dD=0.3  dD=0.7  dD=0.9
    Simple Summation    53      53      55      55             57      53      53      53
    Maximum Share       53      53      53      53             59      53      52      52
    Maximum Parent      53      53      53      53             57      53      53      53

Table 3: Results for different methods of propagating distrust. The ranking is determined by the combination of the distrust score and TrustRank.
But we cannot compare the above number m directly with the results shown in Table 1. The reason is that some top ranked spam sites may have been selected in the distrusted seed set, and they will get demoted as an effect of their selection, not as an effect of our algorithm. Thus, in order to be fair, we need to count the number of spam sites (n) that are in the top ten buckets by TrustRank and are also in the distrusted seed set. Only when the sum of m and n is smaller than 58, the value listed in Table 1, can we claim that the performance is better than that of TrustRank.

Also, the random selection of distrusted seeds may not be representative of the 3,589 spam sites. In order to neutralize this bias, we repeated the above seed selection five times for calculating distrust scores. Then we use the average results as the final results for the distrusted seed set with 200 seeds. On average, there are 54 spam sites still in the top ten buckets and 4 spam sites in the distrusted seed set. The sum of 54 and 4 equals 58, the number of spam sites in the top ten TrustRank buckets; this shows that using TrustRank's mechanism (Equation 5) to propagate distrust is not helpful in demoting top ranked spam sites.

In order to verify whether introducing more distrusted seeds with this basic distrust propagation is useful, we generated distrusted seed sets of sizes ranging from 200 to 1,000. Similarly, for each seed set size, we repeated the generation five times. The average results are shown in Table 4. The results show that no matter how many seeds are selected for the distrusted seed set, the sum of the second and third columns in Table 4 is always around 60. Since this sum is quite close to the 58 spam sites in Table 1, we believe that using the same mechanism as TrustRank to propagate distrust cannot help to demote top ranked spam sites.

    Number of seeds   No. of spam sites in top 10 buckets   No. of spam sites in seed set
    200               54                                     4
    400               55                                     5
    600               49                                    12
    800               48                                    13
    1000              45                                    16

Table 4: Results when using the same mechanism as the propagation of trust in TrustRank to propagate distrust.

7.3.2 Different Choices of Propagating Distrust

Since we have shown that propagating distrust by using the TrustRank mechanism may not be helpful, the next obvious step is to investigate whether the choices for propagating trust can also be applied to propagating distrust in order to demote top ranked spam sites.

Similar to the methods used for generating the results in Table 2, we applied the six combinations of different choices for the splitting and accumulation steps to the propagation of distrust. In order to evaluate the performance, for each combination we calculate the sum of the distrust value and TrustRank value for each site. Then this sum is used for ranking. Since the TrustRank value is unchanged for each different combination, we can see how different choices of propagating distrust affect the overall performance, and thus we can tell which choice is better for propagating distrust. For simplicity, we only choose 200 spam sites to generate the distrusted seed set once. Results of the six different combinations with different dD values are shown in Table 3.

From the results in Table 3, we can see that some choices can help to demote more spam sites than others, for example, the combination of "Logarithm Splitting" and "Maximum Share" with dD set to 0.7 or 0.9.

7.4 Combining Trust and Distrust Values

In the above experiments, we use the sum of the trust and distrust values as the final value for ranking. As discussed in Section 5, we may use different weights to combine trust and distrust values.

In practice, we did the following experiment to show how the combination of trust and distrust values can affect performance (a sketch of this sweep follows below):

• To calculate the trust score, we select the choice that generates the best performance in Table 2, i.e., using "Maximum Share" for accumulation and "Logarithm Splitting" for splitting, with d set to 0.3.

• To calculate the distrust score, we select the choice that generates the best performance in Table 3, i.e., using "Maximum Share" for accumulation and "Logarithm Splitting" for splitting, with dD set to 0.9.

• For combining trust and distrust values, we follow Equation 7, with β equal to 1 − η, and test with different values of η.

• We test with different numbers of distrusted seeds.

The results of these experiments are shown in Figure 4. There are three lines in the figure; each represents the results of using 200, 400, or 600 spam sites as distrusted seeds, respectively. From these results, we can tell that an increase in the size of the distrusted seed set results in an increase in performance.

Figure 4: Number of top ranked spam sites when ranking by the combination of trust score and distrust score. Different η and different numbers of seeds (200, 400, 600) are used.

Compared with the baseline results in Table 1, more than 80% of the spam sites disappear from the top ten buckets. This verifies our hypothesis that using different trust propagation methods together with distrust propagation can help to demote spam sites effectively.

Actually, the results in Figure 4 are not our best results. During our experiments, we found that by using "Constant Splitting" and "Maximum Parent" for trust propagation, and "Logarithm Splitting" and "Maximum Share" for distrust propagation, with d, dD and η all set to 0.1, we can remove all the spam sites from the top ten buckets. We believe that there may be several other combinations that generate optimal results. However, due to resource constraints, we have not enumerated every such combination.
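As a sketch of the η sweep described in the list above (illustrative only; the helper functions and score lists are the hypothetical ones from earlier sketches, and `sites` is an index-aligned list of site identifiers):

```python
# Sweep eta with beta = 1 - eta (Section 7.4) and report, for each setting, the
# number of labeled spam sites left in the top ten buckets.
def sweep_eta(trust, distrust, sites, bucket_sizes, spam, etas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    results = {}
    for eta in etas:
        total = combine(trust, distrust, eta, 1 - eta)        # Equation 7 with beta = 1 - eta
        ranking = [site for _, site in sorted(zip(total, sites), reverse=True)]
        results[eta] = spam_in_top_buckets(bucket_sizes, ranking, spam)
    return results
```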
                      Constant Splitting                  Logarithm Splitting
                   d=0.1  d=0.3  d=0.7  d=0.9         d=0.1  d=0.3  d=0.7  d=0.9
    Maximum Share  77.71  77.73  77.74  77.74         77.19  77.72  77.73  77.73
    Maximum Parent 77.52  77.71  77.73  77.74         76.93  77.60  77.71  77.72

Table 5: Percentage of sites affected by combining different ideas to propagate trust.
7.5 Impact of Trust Propagation

Since the trust or distrust scores are propagated from a limited number of seed pages, it is quite possible that only a part of the whole web graph is touched by this propagation. In other words, some pages will have zero values after the algorithm is employed. We are not in a position to make trust judgments with regard to these pages. It is highly desirable to have a well performing algorithm that, with a limited seed set, enables us to make trust judgments about a large fraction of web pages.

Intuitively, different values for α in Equation 3, or for d in "Constant Splitting" and "Logarithm Splitting", determine how far trust and distrust are propagated. In TrustRank, a smaller α means that more trust is propagated to children pages in each iteration; thus more pages may have nonzero values after 20 iterations. In order to show this, for the same experiment shown in Figure 3, we check what percentage of sites have nonzero values for different values of α. The results are shown in Table 6.

    Jump probability   Percentage of sites with nonzero values
    0.9                59.28
    0.8                66.72
    0.7                70.52
    0.6                72.79
    0.5                74.07
    0.4                74.99
    0.3                75.56
    0.2                75.91
    0.1                76.13

Table 6: Percentage of sites affected when using different jump probabilities.

If more sites have nonzero values when using different choices, then we can claim that trust scores are propagated further by these choices. Since the results obtained by using "Maximum Share" and "Maximum Parent" in Table 2 are better than TrustRank, we check the percentage of pages with nonzero values for these choices. The results are shown in Table 5.

The results in Table 5 show larger numbers when compared to the results in Table 6. This demonstrates that our choices can affect more pages as well as generate better performance in the demotion of top ranking spam sites.

8. DISCUSSION

In this paper, we investigate the possibility of using different choices to propagate trust and distrust for ranking Web pages or sites. We only focus on the demotion of spam sites. In the future, we intend to study how the propagation of trust or distrust can help raise high quality sites in the ranking positions.

We show that mechanisms such as "Logarithm Splitting" or "Maximum Share" for propagating trust and distrust can do better than TrustRank in demoting top ranked spam sites. We intend to explore other choices that can help improve the performance.

In our paper, we combine trust and distrust scores only at the final step. It is possible that this combination can be done during the calculation of trust and distrust scores. We aim to study the different choices that may be taken in this combination.

Ranking algorithms such as PageRank are used by several popular search engines for ranking Web pages for given queries. The concepts of authority and trustworthiness are not identical: PageRank gives an authority value for each page, while propagating trust from seed sets tells how trustworthy a page on the web is as a source of ranking information. In this paper we have only explored the value of trust propagation for spam demotion; ultimately the goal, however, is to improve the quality of search results. We plan to investigate combinations of trust and distrust with authority to measure the effect on search result rankings (the quality of results).

All of our experiments are based on the search.ch data set. This data set may have special characteristics different from the whole web. We need to test the ideas presented here on a larger data set, such as the WebBase [16] data set, in the future.

9. CONCLUSION

In this paper, we show that propagating trust based on the number of outgoing links is not optimal in demoting top ranked spam sites. Instead, we demonstrate that using different choices such as "Constant Splitting" or "Logarithm Splitting" in the splitting step and "Maximum Share" or "Maximum Parent" in the accumulation step for propagating trust can help to demote top ranked spam sites as well as increase the range of trust propagation.
Additionally, by introducing the concept of propagating distrust among Web pages or sites, we show that the performance of demoting top ranked spam sites can be further improved.

Acknowledgments

This work was supported in part by the National Science Foundation under award IIS-0328825. We are grateful to Urban Müller for helpful discussions and for providing access to the search.ch dataset.

10. REFERENCES

[1] Pr0 - Google's PageRank 0, 2002. http://pr.efactory.de/e-pr0.shtml.

[2] A. Acharya, M. Cutts, J. Dean, P. Haahr, M. Henzinger, U. Hoelzle, S. Lawrence, K. Pfleger, O. Sercinoglu, and S. Tong. Information retrieval based on historical data, Mar. 31 2005. US Patent Application number 20050071741.

[3] Z. Bar-Yossef, A. Z. Broder, R. Kumar, and A. Tomkins. Sic transit gloria telae: Towards an understanding of the web's decay. In Proceedings of the Thirteenth International World Wide Web Conference, New York, May 2004.

[4] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank - fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.

[5] G. Collins. Latest search engine spam techniques, Aug. 2004. Online at http://www.sitepoint.com/article/search-engine-spam-techniques.

[6] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proceedings of the European Conference on Machine Learning, pages 96–107, Oct. 2005.

[7] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of WebDB, pages 1–6, June 2004.

[8] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 170–177, Salvador, Brazil, Aug. 2005.

[9] E. Gray, J. Seigneur, Y. Chen, and C. Jensen. Trust propagation in small worlds. In Proceedings of the First International Conference on Trust Management, 2003.

[10] R. Guha. Open rating systems. Technical report, Stanford University, 2003.

[11] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust and distrust. In Proceedings of the 13th International World Wide Web Conference, pages 403–412, New York City, May 2004.

[12] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.

[13] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 271–279, Toronto, Canada, Sept. 2004.

[14] T. Haveliwala. Topic-sensitive PageRank. In Proceedings of the Eleventh International World Wide Web Conference, pages 517–526, Honolulu, Hawaii, May 2002.

[15] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11–22, Fall 2002.

[16] J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. WebBase: a repository of Web pages. Computer Networks, 33(1–6):277–293, 2000.

[17] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the Twelfth International World Wide Web Conference, pages 271–279, Budapest, Hungary, May 2003.

[18] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Extrapolation methods for accelerating PageRank computations. In Proceedings of the Twelfth International World Wide Web Conference, 2003.

[19] P. Massa and C. Hayes. Page-rerank: using trusted links to re-rank authority. In Proceedings of the Web Intelligence Conference, France, Sept. 2005.

[20] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.

[21] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on the World Wide Web, Edinburgh, Scotland, May 2006.

[22] Open Directory Project, 2005. http://dmoz.org/.

[23] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.

[24] A. Perkins. White paper: The classification of search engine spam, Sept. 2001. Online at http://www.silverdisc.co.uk/articles/spam-classification/.

[25] Räber Information Management GmbH. The Swiss search engine, 2006. http://www.search.ch/.

[26] B. Wu and B. D. Davison. Identifying link farm spam pages. In Proceedings of the 14th International World Wide Web Conference, pages 820–829, Chiba, Japan, May 2005.

[27] B. Wu, V. Goel, and B. D. Davison. Topical TrustRank: Using topicality to combat web spam. In Proceedings of the 15th International World Wide Web Conference, Edinburgh, Scotland, May 2006.

[28] C.-N. Ziegler and G. Lausen. Spreading activation models for trust propagation. In Proceedings of the IEEE International Conference on e-Technology, e-Commerce, and e-Service, Taipei, Taiwan, Mar. 2004. IEEE Computer Society Press.